Using Unification

Unification is used to match sequences of tokens that meet the same criteria or share some features. In formal grammar, a unification grammar stipulates that linguistic tokens share certain formally defined features.

In LanguageTool, unification can be used to match several tokens that share a certain set of features, even when the exact values of those features are unknown. This way certain rules of agreement can be defined. For example, if the feature to be matched is letter case (all uppercase, all lowercase, an initial uppercase letter followed by lowercase, etc.), then you simply specify that all tokens must share this feature and only such tokens will be matched.

Unification is not limited to matching tokens in XML rules, though: the support in LanguageTool is general enough to be used in Java rules as well (look at the JUnit tests for some inspiration).

To make it work, you first need to define the feature by giving it a name:

<unification feature="case_sensitivity">
 
...  
 
</unification>

Now you need to add some possible values, or types, for this feature. To do that, you specify criteria of equivalence between tokens the same way as in rules, that is, via the token element:

<unification feature="case_sensitivity">
    <equivalence type="startupper">
      <token regexp="yes">\p{Lu}\p{Ll}+</token>
    </equivalence>
    <equivalence type="lowercase">
      <token regexp="yes">\p{Ll}+</token>
    </equivalence>
</unification>

Here you can see two possible types of instances of the feature case_sensitivity: startupper and lowercase, both defined with a regular-expression token. The unification block must appear in the XML file before any rules and phrases, immediately after the root element rules.
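
As a rough sketch of the placement described above, a rule file might be laid out as follows. The lang attribute value and the category name are assumptions made for illustration; only the position of the unification block relative to the rules is the point here.

```xml
<rules lang="en">
  <!-- Feature definitions come first, directly after the root element -->
  <unification feature="case_sensitivity">
    <equivalence type="startupper">
      <token regexp="yes">\p{Lu}\p{Ll}+</token>
    </equivalence>
    <equivalence type="lowercase">
      <token regexp="yes">\p{Ll}+</token>
    </equivalence>
  </unification>

  <!-- Phrases and rules follow; the category name is hypothetical -->
  <category name="Hypothetical category">
    <!-- rules that unify over case_sensitivity go here -->
  </category>
</rules>
```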

To match tokens that share some feature, you simply write inside the pattern element:

<unify feature="case_sensitivity" type="startupper">
    <token/>
    <token>York</token>
</unify>

The pattern will match any word that starts with an uppercase letter and is followed by the word "York" (New York, Old York, Pork York…).
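
To show where the unify element sits, here is a minimal sketch of a complete rule. The rule id, name, and message text are invented for illustration, and the surrounding layout is assumed to follow the usual pattern/message structure of LanguageTool XML rules.

```xml
<!-- Hypothetical rule: id, name and message are placeholders -->
<rule id="HYPOTHETICAL_CAPITALIZED_YORK" name="Capitalized word before York">
  <pattern>
    <!-- Both tokens must be of type "startupper" -->
    <unify feature="case_sensitivity" type="startupper">
      <token/>
      <token>York</token>
    </unify>
  </pattern>
  <message>Found a capitalized word followed by "York".</message>
</rule>
```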

A slightly less trivial example is unification over several features with many values. Take features such as grammatical number and gender: they have different values in different languages (like singular / plural / dual; feminine / masculine / neuter…). Inflected languages usually have tagsets that specify such features in POS tags. You can match those features using the token element, and stipulate that the following tokens must have the same features as the first one:

<unify feature="gender,number" type="masc,singular">
   <token/>
   <token>foo</token>
</unify>

This pattern will match two tokens only if both have the same gender and number (here, masculine and singular). Multiple features and types are simply separated with a comma (,). You can also skip specifying the types; in this case, LanguageTool will try to match all possible values defined as equivalences for the features. Note: you cannot skip features.

<unify feature="gender,number">
   <token/>
   <token>foo</token>
</unify>
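
For the gender and number examples above to work, both features need their own equivalence definitions. The following is only a sketch: the type names masc and singular come from the example above, while fem, plural, and all POS tag patterns are assumptions that depend entirely on the tagset of your language.

```xml
<!-- POS tag patterns below are hypothetical; adapt them to your tagset -->
<unification feature="gender">
  <equivalence type="masc">
    <token postag=".*:m(:.*)?" postag_regexp="yes"/>
  </equivalence>
  <equivalence type="fem">
    <token postag=".*:f(:.*)?" postag_regexp="yes"/>
  </equivalence>
</unification>
<unification feature="number">
  <equivalence type="singular">
    <token postag=".*:sg(:.*)?" postag_regexp="yes"/>
  </equivalence>
  <equivalence type="plural">
    <token postag=".*:pl(:.*)?" postag_regexp="yes"/>
  </equivalence>
</unification>
```

With definitions along these lines, a unify element without a type attribute would try each defined type of each listed feature in turn.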

You can also match all tokens but the ones that share a certain set of features. Simply use negate="yes" on the unify element:

<unify negate="yes" feature="gender,number">
   <token/>
   <token>foo</token>
</unify>
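
One plausible use of negation is flagging agreement errors, where two adjacent words should share gender and number but do not. The sketch below assumes a tagset in which adjective tags start with adj and noun tags start with subst; these tags, the rule id, the name, and the message are all hypothetical.

```xml
<!-- Hypothetical agreement rule; tags, id and message are placeholders -->
<rule id="HYPOTHETICAL_ADJ_NOUN_AGREEMENT" name="Adjective and noun disagree">
  <pattern>
    <!-- Matches only when the two tokens do NOT share gender and number -->
    <unify negate="yes" feature="gender,number">
      <token postag="adj.*" postag_regexp="yes"/>
      <token postag="subst.*" postag_regexp="yes"/>
    </unify>
  </pattern>
  <message>The adjective and the noun do not agree in gender and number.</message>
</rule>
```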

Unification might also be used for disambiguation. I'll integrate it with the XML-based disambiguator soon (the Java code for selecting only those POS interpretations which are shared in the whole sequence of tokens is already written).
