A disambiguator is useful for a language when the tagger produces many interpretations for a token and rules become very complex because the same set of exceptions is repeated everywhere to disambiguate part-of-speech tags.
The disambiguator can be rule-based, as it is for French or English, or it can implement a completely different (for example, statistical) scheme. Note that you cannot simply adapt existing disambiguators, even rule-based ones, because they are designed to make taggers robust. Robustness means that a good tagger ignores small grammatical problems when tagging. We, however, want to recognize such problems rather than hide them from linguistic processing. Still, I found that even automatically created rules (such as those generated by training a Brill tagger for English) can be a source of inspiration.
Note that, unlike XML grammar rules, disambiguation rules are order-sensitive (like Brill tagger rules, they are cascaded). They are applied in the order in which they appear in the file, so you can use a step-by-step strategy and rely on the results of earlier rules in later ones.
The rule-based disambiguator can also be used to add extra markup and so simplify error-matching rules. For example, you can conditionally mark up some punctuation or phrases. It is also useful for marking up tokens that you would otherwise match with lengthy regular-expression disjunctions (word1|word2…|wordn) if these disjunctions appear in multiple rules. This is more efficient in terms of processing speed and makes the rules a bit easier for a human to understand.
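As a minimal sketch of the markup idea, the rule below tags a fixed word list once, so that later rules can match the tag instead of repeating the disjunction. The rule id, the word list, and the WEEKDAY tag are invented for illustration; the add action and the wd element it relies on are described later on this page.

```xml
<!-- Hypothetical rule: the id, the word list and the WEEKDAY tag are made up. -->
<rule name="mark weekday names" id="MARK_WEEKDAY">
  <pattern mark="0">
    <token regexp="yes">Monday|Tuesday|Wednesday</token>
  </pattern>
  <!-- Adds a WEEKDAY reading to the matched token. -->
  <disambig action="add"><wd pos="WEEKDAY"/></disambig>
</rule>
```

A later pattern could then use <token postag="WEEKDAY" /> wherever the disjunction would otherwise be repeated. Because rules are applied in file order, the tagging rule has to come before any disambiguation rule that depends on the new tag.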
XML syntax
The rule-based XML disambiguator uses a syntax very similar to that of XML grammar rules. For example:
<rule name="determiner + verb/NN -> NN" id="DT_VB_NN">
  <pattern mark="2">
    <token postag="DT"><exception postag="PDT" /></token>
    <and>
      <token postag="VB" />
      <token postag="NN"><exception negate_pos="yes" postag="VB|NN" postag_regexp="yes" /></token>
    </and>
  </pattern>
  <disambig postag="NN" />
</rule>
The only new element here is disambig. It simply assigns a new POS tag to the word being disambiguated. Note the trick I am using: the rule applies only to words that have both the NN and VB tags (and, because of the negated exception, no other tags). In English, there are many far more ambiguous words, which require much more complex rules. Without the trick, the disambiguation rule could do more harm than good by garbling the tagger output. This is a constant danger when writing disambiguator rules.
Note that by default disambig is applied to a single token, which is selected with the mark attribute of the pattern element. However, you can use the action attribute to select more tokens for unification or for adding new interpretations.
The possible values of the action attribute are:
- replace - the default one, assumed in the above example
- filter - used for filtering single tokens
- unify - used for unification of groups of tokens
- remove - used for removing single tokens
- add - used for adding interpretations.
Filtering tags
Instead of assigning a single tag, as above, you can select an already existing tag. This also retains the old lemma, which a simple assignment like the one above overwrites:
<rule name="his (noun/prep) + noun -> his (prep)" id="HIS_NN_PRP">
  <pattern mark="0">
    <token>his</token>
    <token postag="NN.*" postag_regexp="yes" />
  </pattern>
  <disambig><match no="1" postag="PRP\$" postag_regexp="yes" /></disambig>
</rule>
In this case, we select an existing interpretation (and only that interpretation) from the set of previous interpretations.
You can also assign a lemma if there are multiple interpretations and you don't want to pick just the first one supplied by the tagger (which is the default behavior):
<rule name="Don't|do|don/vb ->don't/vb" id="DONT_VB">
  <pattern mark="0">
    <token>don</token>
    <token>'</token>
    <token>t</token>
  </pattern>
  <disambig><match no="1" postag="VBP">do</match></disambig>
</rule>
In this case, a contracted form of "do" is assigned the proper lemma and tag. All other interpretations are discarded.
There is another, shorter syntax that you might use for simple forms of filtering:
<rule name="his (noun/prep) + noun -> his (prep)" id="HIS_NN_PRP">
  <pattern mark="0">
    <token>his</token>
    <token postag="NN.*" postag_regexp="yes" />
  </pattern>
  <disambig action="filter" postag="PRP\$" />
</rule>
It is exactly equivalent to the first filtering example above. Note that you cannot specify a lemma this way; for that you need the full syntax.
Using unification
Before using unification, you need to define features and equivalences of features, as described in Using unification. In the disambiguator file, you add the same unification block as in the rules file (the syntax is the same). Then, in the rule, you can keep only unified tokens, that is, tokens that share the same features. For example, take a simple agreement rule from the Polish disambiguator:
<rule name="unifikacja przymiotnika z rzeczownikiem" id="unify_adj_subst">
  <pattern mark_from="0" mark_to="0">
    <unify feature="number,gender,case">
      <token postag="adj.*" postag_regexp="yes"><exception negate_pos="yes" postag_regexp="yes" postag="adj.*"/></token>
      <token postag="subst.*" postag_regexp="yes"><exception negate_pos="yes" postag_regexp="yes" postag="subst.*"/></token>
    </unify>
  </pattern>
  <disambig action="unify"/>
</rule>
It uses unification on three features (defined earlier in the file): number, gender, and case. Note that I am using a uniqueness trick to make sure that only words that are marked only as adjectives or substantives are unified (otherwise the rule is too greedy).
There are several important restrictions. You cannot use two unify blocks in one disambiguator rule; only one unify sequence per pattern is allowed. Moreover, the number of matched tokens (selected with mark_from and mark_to) must equal the length of the unified sequence. Of course, there might be more tokens in the rule, but they cannot be selected with mark_from and mark_to if the disambiguator is supposed to unify the sequence of tokens.
Removing only some interpretations
Sometimes, instead of filtering, you might want to remove only one interpretation from the token. You can do this in the following way:
<rule name="mają to nie maić" id="MAJA_MAIC">
  <pattern mark="0">
    <token>mają</token>
  </pattern>
  <disambig action="remove"><wd lemma="maić" pos="verb:fin:pl:ter:imperf">mają</wd></disambig>
</rule>
The above code removes one interpretation of the word "mają": the one with the POS equal to "verb:fin…", the token equal to "mają", and the lemma "maić". Note that you must supply all three parameters to remove an interpretation. If one of them is unknown, use filtering instead.
Adding completely new readings
Adding new readings can be useful to mark up groups, such as noun groups or multi-word expressions. You can add a single reading to one token or a reading to each token of a whole sequence (for example, a start mark, an "inside" mark, and an end mark).
For example:
<rule name="ciemku" id="ciemku">
  <pattern mark="0">
    <token>ciemku</token>
  </pattern>
  <disambig action="add"><wd lemma="po ciemku" pos="adjp">ciemku</wd></disambig>
</rule>
The number of wd elements must match the number of tokens selected with mark or mark_from and mark_to.
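A sketch of the start/inside/end idea for a three-token expression might look as follows. The rule id and the MWE_START, MWE_IN, and MWE_END tags are hypothetical. The sketch also assumes that mark_from="0" mark_to="0" selects every token of the pattern, as it does in the unification example above, which is why three wd elements are supplied.

```xml
<!-- Hypothetical rule: the id and the MWE_* tags are invented for illustration. -->
<rule name="mark New York City as a multi-word expression" id="MWE_NEW_YORK_CITY">
  <!-- Assumption: mark_from="0" mark_to="0" selects all three tokens. -->
  <pattern mark_from="0" mark_to="0">
    <token>New</token>
    <token>York</token>
    <token>City</token>
  </pattern>
  <!-- One wd element per selected token, in the same order. -->
  <disambig action="add">
    <wd pos="MWE_START"/>
    <wd pos="MWE_IN"/>
    <wd pos="MWE_END"/>
  </disambig>
</rule>
```

Each wd here carries only a POS tag, a form that is explained in the next section.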
Adding only POS tags or tokens
You can also add just POS tags without having to specify the lemmas or tokens added. This is especially useful if you're tagging tokens that are matched by regular expressions or POS tags, so you don't actually know which one you will find. You can add a POS tag just by supplying the wd element without the lemma attribute or without textual content:
<rule name="uppercase tag" id="UPTAG">
  <pattern mark="0" case_sensitive="yes">
    <token regexp="yes">\p{Lu}+</token>
  </pattern>
  <disambig action="add"><wd pos="UP"/></disambig>
</rule>
In the above example, I only added the UP tag to uppercase words; the lemma is assumed to be equal to the token content, and the content of the token is not changed. So if the word was "Smiths", it would be tagged as "UP", and the lemma would be "Smiths" (although in other readings it could be "Smith").
If you omit only the token (the textual content of wd), it will be equal to the token matched by the current rule (rather than empty).
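For instance, the "ciemku" rule from the previous section could presumably be written without the textual content of wd. This is a sketch of that variant, with a made-up id and name so that it does not clash with the original rule:

```xml
<!-- Variant of the "ciemku" rule; the id and name are hypothetical. -->
<rule name="ciemku (token omitted)" id="ciemku_no_token">
  <pattern mark="0">
    <token>ciemku</token>
  </pattern>
  <!-- No text inside wd: the token of the new reading is taken from the matched token. -->
  <disambig action="add"><wd lemma="po ciemku" pos="adjp"/></disambig>
</rule>
```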
Possible strategies of disambiguation
- Remove very rare but possible POS tag interpretations, possibly ignoring the context (greedy strategy which must be evaluated on a corpus)
- Remove only these ambiguities which create false alarms by looking at the number
- Remove some ambiguities one at a time, starting from very general and safe rules and ending with some very specific ones
Testing disambiguation rules
The best way to test disambiguation rules is to run LanguageTool on a medium-sized corpus (comparable to the Brown corpus for English) and check that the previous false alarms are now fixed and that no new false alarms are created. Otherwise, it's very hard to predict the impact of disambiguation rules.
Starting from version 0.9.8, it will be possible to test disambiguation rules in much the same way as grammar rules are tested. Consider this example:
<example type="ambiguous" inputform="What[what/WDT,what/WP,what/UH]" outputform="What[what/WDT]"><marker>What</marker> kind of bread is this?</example>
<example type="untouched">What are you doing?</example>
In the above snippet, we declare that the sentence "What are you doing?" should be left untouched, that is, unchanged by the disambiguation rule, in contrast to the ambiguous sentence, which will be processed. Using the marker element, we select the token that will be changed. The inputform attribute specifies the input forms of the token, in a word[lemma/POS] format. The outputform attribute specifies what the disambiguation rule should produce.
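To show how these pieces might fit together, here is a sketch of a complete rule carrying the two examples. The rule id, name, and pattern are invented for illustration, and the placement of the example elements inside the rule, after disambig, is an assumption modeled on how examples are attached to grammar rules.

```xml
<!-- Hypothetical rule: the id, name and pattern are invented for illustration. -->
<rule name="what + kind -> what (WDT)" id="WHAT_KIND_WDT">
  <pattern mark="0">
    <token>what</token>
    <token>kind</token>
  </pattern>
  <!-- Keep only the existing WDT reading, so the lemma "what" is retained. -->
  <disambig action="filter" postag="WDT" />
  <!-- Assumption: examples are placed inside the rule, after disambig. -->
  <example type="ambiguous" inputform="What[what/WDT,what/WP,what/UH]" outputform="What[what/WDT]"><marker>What</marker> kind of bread is this?</example>
  <example type="untouched">What are you doing?</example>
</rule>
```

The second sentence does not contain "kind" after "What", so the pattern does not match and the sentence is expected to stay untouched.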





