There are many simple but non-obvious ways to build error-checking rules. Here are some tips.
Stay updated on this Language Tool site
Track changes with RSS by adding this URL to your RSS feeds: http://languagetool.wikidot.com/feed/site-changes.xml
Tagging a corpus using LanguageTool
LanguageTool has a POS tagger, and for some languages also a disambiguator or a chunker, so you can use it to tag a big corpus. I added a special command-line switch, --taggeronly (or -t for short), to disable rule checking and use only the tagger. It works fine.
I didn't go beyond 3 GB of pure text, but you should experience no problems even with huge corpora :)
java -jar LanguageTool.jar -l <language> -c <encoding> -t <corpus_file> > <tagged_corpus_file>
With some smart tooling, you can use such a corpus to create grammar-checking rules semi-automatically (but this goes beyond a simple trick).
Checking text from standard input
Starting from version 0.9.8, LanguageTool accepts input text from standard input, so you can pipe the output of other tools to LT. To use STDIN, supply - as the filename, or simply specify no filename at all.
Automatic application of suggestions
A new feature in 0.9.8 is the automatic application of suggestions generated by the rules. You can switch this mode on by using --apply (or -a for short) on the command line. Note: when a rule has more than one suggestion, only the first one is applied. Make sure that you select only rules that are reliable enough to allow automatic correction (use the switch -e to select the rules you want to use, or -d to disable some of the rules). The suggestions are applied in the order of the starting positions of the errors in the text. Run some tests to see whether you have clashing or conflicting suggestions. Before applying a suggestion, LanguageTool checks that the error found by the rule is still there in the text, but that's the only quality check. The text is checked only once, so if the applied suggestions create an error that another rule could correct, you can simply run LT again on the output from the previous check.
The output from LanguageTool will be the corrected text (without any other messages). You can still get verbose output on STDERR (by specifying -v).
This mode of operation can be used to clean a corpus before running some information extraction on it, or to post-edit some automatically created text (for example, from machine translation).
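The behaviour described above (first suggestion only, applied in order of start position, with a check that the flagged text is still present) can be pictured with a small Python model. This is only a sketch of the description, not LanguageTool's own code: the Match class, its fields, and the function name are hypothetical.

```python
from typing import List, NamedTuple


class Match(NamedTuple):
    """Hypothetical stand-in for one rule match (not LanguageTool's own class)."""
    start: int              # offset of the error in the original text
    end: int                # end offset (exclusive) in the original text
    suggestions: List[str]  # suggestions in the order the rule gives them


def apply_first_suggestions(text: str, matches: List[Match]) -> str:
    result = text
    shift = 0  # length change caused by replacements applied so far
    for m in sorted(matches, key=lambda match: match.start):
        if not m.suggestions:
            continue
        flagged = text[m.start:m.end]
        start, end = m.start + shift, m.end + shift
        # The only quality check: the flagged text must still be in place.
        if result[start:end] != flagged:
            continue
        replacement = m.suggestions[0]  # only the first suggestion is used
        result = result[:start] + replacement + result[end:]
        shift += len(replacement) - len(flagged)
    return result


text = "I have a apple and and a pear."
matches = [
    Match(text.index("and and"), text.index("and and") + 7, ["and"]),
    Match(text.index("a apple"), text.index("a apple") + 1, ["an", "one"]),
]
print(apply_first_suggestions(text, matches))  # I have an apple and a pear.
```

When two matches overlap, the second one no longer finds its flagged text and is skipped in this model, which is why a second run over the output can still be worthwhile.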
Getting tags and rule matches
If you want to get all the matches for a given file, just run
java -jar LanguageTool.jar -l <language> -c <encoding> <file>
You can redirect this to a file (add > file at the end of the line). To get part-of-speech tags as assigned by LanguageTool, simply write:
java -jar LanguageTool.jar -l <language> -c <encoding> <file> >results.txt 2>tags.txt
You will get two files: results.txt with rule matches (you can sort it using the stats.awk script mentioned below) and tags.txt, which will contain all tags. Note: some operating systems allow pointing both std-err (2>) and std-out (>) to the same file.
Skipping and regular expressions over tokens
In some corpus query languages (for example, used in Poliqarp), you can have regular expression queries over tokens, like this:
[word="the"] [pos="jj"]{1,2} [pos="nn"]
Such a query matches the word "the" followed by one or two words with the POS tag "jj" and then a word tagged "nn". There is an equivalent notation in LanguageTool:
<token>the</token> <token postag="jj" skip="1"><exception scope="next" negate_pos="yes" postag="jj"/></token> <token postag="nn"/>
Note that the exception is itself negated with negate_pos. An exception already implies negation, so negating it again gives a positive condition (double-negation elimination): any skipped token must be tagged "jj".
This regular expression:
[pos="jj"]+
is equivalent to
<token postag="jj" skip="-1"><exception negate_pos="yes" scope="next" postag="jj"/></token>
The only missing operator is "*": you cannot have a token that is matched zero times. In such a case, you have to write two versions of the rule: one without the asterisked token, and another with the notation equivalent to "+".
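The reason the two-rule workaround loses nothing is that "x*" is the same as "nothing, or x+". A minimal Python check of that equivalence, using ordinary regular expressions over a space-joined tag sequence rather than LanguageTool itself ("jj" and "nn" come from the example above, "dt" and the function names are made up for illustration):

```python
import re
from typing import List


def matches_star(tags: List[str]) -> bool:
    """Reference pattern written with the '*' operator: dt jj* nn."""
    return re.fullmatch(r"dt( jj)* nn", " ".join(tags)) is not None


def matches_two_rules(tags: List[str]) -> bool:
    """The same pattern split in two: one rule without jj, one with jj+."""
    sequence = " ".join(tags)
    without_token = re.fullmatch(r"dt nn", sequence) is not None
    with_plus = re.fullmatch(r"dt( jj)+ nn", sequence) is not None
    return without_token or with_plus
```

Both functions accept and reject exactly the same tag sequences, so splitting the rule only costs some duplication in the rule file.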
Test a corpus and sort rule matches
It's useful to run LanguageTool rules on a larger text and see the number of matches generated by the rules. To do it, run it from the command line (or straight from Eclipse by adding the following parameters)
java -jar LanguageTool.jar -l <language> -c <encoding> <filename.txt> > <output_file.txt>
However, in such a case it's even more convenient to have the rule matches sorted by their hit frequency to see which ones should be corrected first. There is a simple AWK script you can use:
gawk -f stats.awk <output_file.txt> > <sorted_file.txt>
You will get sorted matches, like these:
Rule ID: HE_VERB_AGR[7], matches: 8
Message: The proper name in singular (Government) must be used with a third-person verb: 'inventories'.
Suggestion: inventories
...o the Federal Government inventory on and after July 1, 196...
^^^^^^^^^
Message: The proper name in singular (Man) must be used with a third-person verb: 'hoaxes'.
Suggestion: hoaxes
...ator of the Piltdown Man hoax of 1912, creating the co...
^^^^
Message: The proper name in singular (Pope) must be used with a third-person verb: 'is'.
Suggestion: is
...s Taggart and Betty Pope are from wealthy families. C...
^^^
Message: The proper name in singular (Earth) must be used with a third-person verb: 'orbits'.
Suggestion: orbits
...M in an elliptical Earth orbit with an apogee of 4600 m...
^^^^^
Message: The proper name in singular (River) must be used with a third-person verb: 'sections'.
Suggestion: sections
...t where the Cunene River section of the border with Namib...
^^^^^^^
Message: The proper name in singular (Wright) must be used with a third-person verb: 'designs'.
Suggestion: designs
... that Frank Lloyd Wright design the architectural models...
^^^^^^
Message: The proper name in singular (Korea) must be used with a third-person verb: 'continues'.
Suggestion: continues
...e. Japan and South Korea continue to dominate in the area ...
^^^^^^^^
Message: The proper name in singular (Set) must be used with a third-person verb: 'fires'.
Suggestion: fires
...iots in Los Angeles. Set fire to the camp, and kill th...
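If gawk is not at hand, the counting part of this workflow can be approximated in a few lines of Python. This is not the stats.awk script, only a rough sketch of the same idea, and it assumes that every match in the unsorted LanguageTool output carries a "Rule ID: <id>" label on its header line; the function name is hypothetical.

```python
import re
from collections import Counter
from typing import List, Tuple

# Assumption: each match in the raw output is labelled "Rule ID: <id>".
RULE_ID = re.compile(r"Rule ID: ([^\s,]+)")


def count_rule_hits(report: str) -> List[Tuple[str, int]]:
    """Return (rule id, number of matches) pairs, most frequent rule first."""
    counts = Counter(RULE_ID.findall(report))
    return counts.most_common()
```

Reading the output file and printing the pairs gives a quick overview of which rules fire most often, without grouping the individual messages the way the listing above does.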
Interactive testing of rules using a corpus
It is not too difficult to build an interactive rule development environment using an AMP stack, either XAMPP or LAMP.
Run LT as a service and write a PHP program that reads the corpus line by line and feeds every line to the server process. When a rule hits, interpret the XML and store the data in the database. Use Apache and PHP to generate an overview of rules and matched areas. This way, you can edit the results, have results re-processed on a rule update, and see hits coming in while editing the ruleset.
There is a working prototype for Dutch. Interested? Ask for it on the mailing list.
Testing if the POS tag is the only one
For some purposes, it's useful to have rules that react to words that can take only one part-of-speech tag. This is how you do it:
<token postag="postag1"> <exception negate_pos="yes" postag="postag1"/> </token>
Why does this work? The answer is simple: the token first matches everything that has the POS tag postag1, and the exception then rejects tokens that have any tag other than postag1. The same method works for more POS tags: you can check if the token has only one of two POS tags (and not any other):
<token postag="tag1|tag2" postag_regexp="yes"> <exception negate_pos="yes" postag="tag1|tag2" postag_regexp="yes"/> </token>
To check if the token has exactly two tags, and only those, you can use the following notation using and:
<and> <token postag="tag1"> <exception negate_pos="yes" postag="tag1|tag2" postag_regexp="yes"/> </token> <token postag="tag2"/> </and>
It's enough to set the exception on any of the tokens but they must be linked with and if both are supposed to appear. This can be useful for writing disambiguation rules, for example.
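The three patterns in this section boil down to simple set conditions on the readings of a token. A small Python model, assuming a token's readings are given as a set of POS tag strings (the function names and this representation are illustrative, not part of LanguageTool):

```python
from typing import Set


def has_only(readings: Set[str], tag: str) -> bool:
    """Token has the tag, and the exception finds no reading with another tag."""
    return tag in readings and not (readings - {tag})


def has_only_from(readings: Set[str], allowed: Set[str]) -> bool:
    """Token has at least one allowed tag and no reading outside the allowed set."""
    return bool(readings & allowed) and readings <= allowed


def has_exactly_both(readings: Set[str], tag1: str, tag2: str) -> bool:
    """The 'and' variant: both tags are present and nothing else is."""
    return tag1 in readings and tag2 in readings and readings <= {tag1, tag2}
```

For example, a token with the readings {"tag1", "tag2"} passes has_only_from and has_exactly_both for that pair, but fails has_only for either tag alone.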
Various forms of negation
There are three ways to negate the meaning of the token in XML rules:
- by negating the token
- by negating the part of speech
- by using exceptions
Exceptions are the easiest way, and you can add many of them, but adding a single negation via an exception can be wordy. Remember, however, that if you negate a token that has a part of speech specified, you will still be matching tokens with this part of speech.
For example, this is correct:
<token postag="SENT_START" negate_pos="yes"/>
This will match all tokens that appear in the middle or at the end of the sentence. It is equivalent to:
<token><exception postag="SENT_START"/></token>
This, however, will match any end-of-sentence token (which is probably undesired):
<token postag="SENT_END" negate="yes"/>
It is logically equivalent to:
<token postag="SENT_END"><exception/></token>
Special POS tags
For all languages, there are special part-of-speech tags defined beside the tag set for the language. These are:
- SENT_START that matches the sentence start
- SENT_END that matches the sentence end
- UNKNOWN that matches a token that has no part-of-speech tag assigned by the tagger
Suggesting the word with the same POS tag
For some languages - currently English, Polish, and Dutch - it's possible to inflect the words being suggested. An especially useful ability is to inflect the suggested word the same way the matched token is inflected. Here's how it's done:
<pattern> <token>word</token> </pattern> <message>Here's another word: <suggestion> <match no="1" postag="POS.*" postag_regexp="yes" >Wort</match> </suggestion> </message>
Note the postag attribute - it makes it possible to inflect Wort the same way "word" is inflected; without it, it's impossible to choose the right inflection pattern for tokens with multiple readings. But even if there is only one reading, the default logic "replace the word while keeping its grammatical form" works only when the postag is selected as above (not necessarily as a regular expression, but that makes things easier). Wort must be the lemma (the base form). Note that POS.* above is just a placeholder for a real POS tag.
However, if the POS tag specified in the postag attribute does not match the original POS tag, the same rule will generate all forms that match just this POS.*, disregarding the original POS tag. So the filtering action happens only when there is a match between the postag attribute value and the POS tag of the word matched.
Remember to test if the proposed corrections are what you meant by using the correction attribute in the example marked as incorrect.
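The two cases described above (the postag attribute matches a reading of the original word, or it does not) can be modelled in Python. This is a toy sketch of the description only: the lexicon, its tags, and the function name are invented for illustration, and the real synthesizer works on the language's own dictionary.

```python
import re
from typing import Dict, List

# Hypothetical toy lexicon: lemma -> {POS tag: inflected form}.
LEXICON: Dict[str, Dict[str, str]] = {
    "walk": {"VB": "walk", "VBZ": "walks", "VBD": "walked", "VBG": "walking"},
}


def inflect_like(lemma: str, postag_regex: str, original_tags: List[str]) -> List[str]:
    """Suggest forms of lemma, guided by the tags of the matched token."""
    forms = LEXICON[lemma]
    pattern = re.compile(postag_regex)
    # Readings of the matched token that the postag attribute selects.
    selected = [tag for tag in original_tags if pattern.fullmatch(tag)]
    if not selected:
        # No overlap with the original tags: every form fitting the pattern is offered.
        selected = [tag for tag in forms if pattern.fullmatch(tag)]
    return [forms[tag] for tag in selected if tag in forms]
```

With this model, a matched token tagged "VBD" yields only "walked" for the pattern "VB.*", while a token whose tags do not fit the pattern yields every verb form of the lemma, which is the unfiltered case the paragraph above warns about.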
Removing words
One of the most important forms of correction is a suggestion to remove a word. In LanguageTool, this is done implicitly in two ways:
- Replace two words with a single one, i.e., simply match any token that follows or precedes the token to be deleted.
- Replace the word with an empty string as a suggestion. Note that this will work flawlessly only for tokens that were not preceded with whitespace; otherwise you will get a duplicated whitespace. So this is a good way to remove, for example, a closing quotation mark that is exactly at the end of the sentence and can be preceded with any word (but not with whitespace, at least in many languages).
In other words, you cannot correctly remove a single word preceded by a whitespace if you cannot specify a preceding or following token. This happens only if the word to be deleted appears at the end of the skipped block and at the end of the sentence at the same time. We haven't found any evidence that a special feature for deleting the last token in the sentence would be important yet (nobody complained) but if you need the feature, it will be implemented.
Testing if the word is preceded with a whitespace
There is a new feature in the development version (0.9.6-dev) that allows this in XML rules:
<token spacebefore="yes" regexp="yes">"|'</token> <token spacebefore="no">a</token> <token spacebefore="no" regexp="yes">"|'</token>
The above tokens match only "A", 'A' or "a' (but not " A "). Remember that Java rules might be faster when checking for whitespace.
Changing the case of matched word
When you want to change the case of a match, use
<match no="1" case_conversion="startupper"/>
or
<match no="1" case_conversion="startlower"/>
allupper and alllower are supported, too.
Note: LT automatically adjusts the case of suggestions if they are added as plain text. If you want to suppress the default behavior, you need to use the match element. If you're not just changing the case but also changing the word, you can use the following workaround: make match point at any token (it doesn't matter which, as long as it's in the pattern), use the regular expression .* to replace it with the word you actually want, and add case_conversion:
<pattern case_sensitive="yes" mark_from="1" mark_to="-1"> <token negate="yes">foo<exception postag="SENT_START"/></token> <token>Bar</token> <token/> </pattern> <message>Change to <suggestion><match no="1" regexp_match=".*" regexp_replace="bar" case_conversion="alllower"/></suggestion></message>
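What the four case_conversion values do to a word can be summed up in a few lines of Python. The value names are the ones given above; the function is only an illustrative model of their effect as the names suggest it, not LanguageTool code.

```python
def convert_case(word: str, mode: str) -> str:
    """Model of the case_conversion values named in this section."""
    if mode == "startupper":
        return word[:1].upper() + word[1:]
    if mode == "startlower":
        return word[:1].lower() + word[1:]
    if mode == "allupper":
        return word.upper()
    if mode == "alllower":
        return word.lower()
    raise ValueError("unknown case_conversion value: " + mode)
```

In the workaround above, the replacement word "bar" would pass through the "alllower" branch, so the suggestion stays lowercase whatever the case of the matched token.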
Changing the matched word
When dealing with (partial) barbarisms, for example, you can replace strings inside words. An example from the Dutch rule set:
<match no="1" regexp_match="multiplechoice(.*)" regexp_replace="meerkeuze$1"/>
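The effect of regexp_match and regexp_replace is that of an ordinary regular-expression substitution with a captured group. A Python illustration of the same substitution (Python writes the group reference as \1 where the rule uses $1; the sample word and the function name are made up for the example):

```python
import re


def replace_barbarism(word: str) -> str:
    """Swap the prefix and keep whatever follows it (the captured group)."""
    return re.sub(r"multiplechoice(.*)", r"meerkeuze\1", word)


print(replace_barbarism("multiplechoicevraag"))  # meerkeuzevraag
```

Words that do not contain the prefix are returned unchanged.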
Selective case sensitivity for a rule
Generally, a rule is either case-sensitive or not. When needed, however, that can be changed at the token level. In the example below, case sensitivity has been switched off for one token by making it a regular expression and adding (?iu), which instructs Java to ignore case.
<pattern case_sensitive="yes"> <token negate="yes" skip="1" regexp="yes">(?iu)ten</token> <token regexp="yes">one|two</token> </pattern>
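Inline flags such as (?iu) are a feature of the regular expression itself, so their effect is easy to try out. Python's re module happens to accept the same prefix, which allows a quick sketch of how the two tokens above differ; the helper names are hypothetical and this only mimics the per-token matching, not the whole rule.

```python
import re

TEN = re.compile(r"(?iu)ten")         # case-insensitive because of the inline flag
ONE_OR_TWO = re.compile(r"one|two")   # stays case-sensitive


def is_ten(token: str) -> bool:
    return TEN.fullmatch(token) is not None


def is_one_or_two(token: str) -> bool:
    return ONE_OR_TWO.fullmatch(token) is not None
```

Here "Ten" and "TEN" both count as "ten", while "One" does not count as "one", which mirrors a case-sensitive pattern with a single case-insensitive token.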
Testing suggestions
Run tests on your grammar rules. There are three ways:
- In Eclipse, by using the JUnit test targets
- On the command-line (without Eclipse): simply run testrules.bat (in Windows) or testrules.sh
- Using ant (see build.xml for junit test targets).
For beginners, we recommend the testrules.sh/bat method. It tests rules against their examples and checks that they are valid XML. While developing rules, you will be surprised how many silly mistakes you would not notice without tests. So please test them, and then test them again.
Testing corrections
If your suggestions are not explicit strings, but special codes or synthesized words, it's recommended to test them. The easiest way is to use the correction attribute of the incorrect example (in correct examples, correction is silently ignored, as it makes no sense there).
For example:
<example correction="back and forth" type="incorrect">How to move <marker>back and fourth</marker> from linux to xmb?</example>
If the rule doesn't supply "back and forth" as a suggestion for this sentence, you will get an error during the standard rule test. Note: you don't supply \1 or anything like that in the correction attribute: you supply the string that would be generated based on the suggestion element you supplied in the rule.
Using xxe for editing rules
It's possible to edit rules using XMLmind XML Editor (xxe) in a WYSIWYM (what you see is what you mean) mode. The program is available in a free version for many platforms (as it's developed in Java), though it's not open source. It offers spell-checking and validation, and we have developed special CSS files to make editing files easier. Put these in the rules directory.
Simply open the file in xxe, and it should display in a form input mode. It is recommended that you uncheck Options > Preferences > Save > Add open lines. Set a high value in Max line length.





