Introduction
Most taggers in LanguageTool are dictionary-based because statistical or context-oriented taggers are trained to ignore occasional grammar errors. For this reason, their output will be correct even if the input was in fact incorrect. While this is a desired behavior for most natural language processing applications, in grammar checking it is simply wrong.
We have not, however, tried training statistical taggers on incorrect input to correct their output; this remains to be tested by someone who has enough time. We did test lexicon-based taggers/lemmatisers. For most languages, they are encoded as finite-state automata. This means that the files are prepared for the fsa_build program from the fsa package. The resulting files are then read in Java code by the morfologik-stemming library, which is now bundled with LanguageTool.
Preparing files for fsa_build is tricky at first, but it's worth the effort: the input text file of the Polish dictionary is about 190 MB, and fsa_build squeezes it to less than 3 MB. The fsa tagger is also very fast. I cover the usual steps below.
Building fsa
Programs in the fsa package must be built with the following switches in the Makefile:
CPPFLAGS=-O2 --pedantic -Wall \
-DFLEXIBLE \
-DA_TERGO \
-DGENERALIZE \
-DSHOW_FILLERS \
-DSTOPBIT \
-DNEXTBIT \
-DMORPH_INFIX \
-DPOOR_MORPH \
-DGUESS_LEXEMES -DGUESS_PREFIX \
-DGUESS_MMORPH \
-DDUMP_ALL \
-DPROGRESS
Some of the above options (like A_TERGO, GENERALIZE, GUESS_LEXEMES) are not strictly required, but we're planning to add support for them in the Java library as well, so please use them to get a fully functional fsa build. Then run make and make install. The package is known to work correctly on Linux and on Cygwin under Windows.
Preparing the lexicon
The input file for fsa is a text file with three columns, usually tab-separated. The first field is the inflected word form, the second is the lemma, and the third is the POS (part-of-speech) tag. Don't use whitespace inside complex expressions: the tagger ignores it anyway, so it only wastes space in your dictionary.
boyar boyar NN
boyard boyard NN
boyardism boyardism NN:UN
boyards boyard NNS
boyarism boyarism NN:UN
boyarisms boyarism NNS
boyars boyar NNS
Before running any command, you need to change the locale (the environment variables LOCALE and LANG) for the fsa scripts to run correctly, no matter what encoding you use for your input dictionary. Set them both to POSIX. Note: in Cygwin, you cannot change the locale. It's always POSIX, which for once makes your life easier than on a real Unix ;)
Note that the fsa scripts need a file with UNIX line endings, so if your lexicon file comes from Windows, run dos2unix on it before you proceed.
The following command is usually used to get the resulting dictionary file:
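The two preparation steps above can be scripted. What follows is a minimal sketch, not part of the fsa package: the function name normalize_lexicon is made up for illustration, and tr -d '\r' stands in for dos2unix. Keep in mind that, on POSIX systems, a value of LC_ALL that is already set in your environment takes precedence over LANG, so you may need to unset it for the settings below to have any effect on tools such as sort.

```shell
#!/bin/sh
# Locale settings as described above: LOCALE and LANG both set to POSIX.
LOCALE=POSIX
LANG=POSIX
export LOCALE LANG

# normalize_lexicon: hypothetical helper (not part of fsa). Reads a
# lexicon on stdin and writes it to stdout with the carriage returns of
# Windows line endings removed.
normalize_lexicon() {
    tr -d '\r'
}
```

It could be used as: normalize_lexicon <windows_input.txt >input.txt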
#gawk -f morph_data.awk <input.txt | sort -u |fsa_build -O -o output.dict
Note: The morph_data.awk script is found in the fsa directory; adjust the path as needed. For the file to work with the standard morfologik-stemming tagger in LanguageTool, you also need an .info file:
#
# Dictionary properties.
#
fsa.dict.separator=+
fsa.dict.encoding=iso-8859-1
fsa.dict.uses-prefixes=false
fsa.dict.uses-infixes=false
Please note that you can use UTF-8 as the encoding as well, but remember that the encoding must match that of the input.txt file and that it must be an 8-bit encoding.
Making the tagger file smaller
If the lexicon includes many words with prefixes or infixes, you can try to make the dictionary file smaller and faster to read from disk.
Use the following command to get the resulting file:
#gawk -f morph_prefix.awk <input.txt | sort -u |fsa_build -O -o output.dict
In this case, change the following line in the .info file:
fsa.dict.uses-prefixes=false
to
fsa.dict.uses-prefixes=true
However, in many cases you get even better results with infixes:
#gawk -f morph_infix.awk <input.txt | sort -u |fsa_build -O -o output.dict
You need to change properties in the .info file like this:
fsa.dict.uses-prefixes=true
fsa.dict.uses-infixes=true
Note: with infix encoding, both prefixes and infixes are used, so both must be set to "true".
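The three variants of the .info file differ only in the two boolean properties, so they can be generated rather than edited by hand. The helper below is a sketch under that assumption; write_info and its mode names (plain, prefix, infix) are hypothetical and do not come from fsa or morfologik-stemming.

```shell
# write_info: hypothetical helper. Prints an .info file for the given
# mode (plain, prefix or infix) and character encoding.
write_info() {
    mode=$1
    encoding=$2
    case $mode in
        plain)  prefixes=false; infixes=false ;;
        prefix) prefixes=true;  infixes=false ;;
        infix)  prefixes=true;  infixes=true  ;;  # infix encoding also uses prefixes
        *) echo "unknown mode: $mode" >&2; return 1 ;;
    esac
    cat <<EOF
#
# Dictionary properties.
#
fsa.dict.separator=+
fsa.dict.encoding=$encoding
fsa.dict.uses-prefixes=$prefixes
fsa.dict.uses-infixes=$infixes
EOF
}
```

For example, write_info infix iso-8859-1 >output.info writes the properties for a dictionary built with morph_infix.awk (the file name here is only an example).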
Testing the file from the command line
Simply run:
#fsa_morph -d output.dict
If the dictionary was built with prefixes, add -P:
#fsa_morph -P -d output.dict
If the dictionary was built with infixes, add -I:
#fsa_morph -I -d output.dict
Type a word from the dictionary and press Enter, then check that the correct result is printed to standard output.
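To check more than a handful of words, you can feed fsa_morph a sample taken from the lexicon itself. This is only a sketch: it assumes that fsa_morph reads the words to analyse from standard input, one per line, as the interactive session above suggests. sample_forms is a hypothetical helper, not part of fsa.

```shell
# sample_forms: hypothetical helper. Reads a tab-separated lexicon on
# stdin and prints the inflected form (first column) of the first line
# and of every STEP-th line after it. STEP is the first argument.
sample_forms() {
    awk -F '\t' -v step="$1" '(NR - 1) % step == 0 { print $1 }'
}
```

A possible use is sample_forms 1000 <input.txt | fsa_morph -d output.dict, adding -P or -I as described above for dictionaries built with prefixes or infixes.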
Dumping the dictionary
If you want to see the contents of the binary file from our CVS repository, use the following command:
fsa_prefix -a -d dictionary.dict >dump
You will get the internal representation (the "compression") of the data, with plus signs (+) as separators. To get back the tab-separated input file, use de_morph_data.awk from fsa like this:
gawk -f /path/to/fsa/de_morph_data.awk dump
You can then edit the file and send us the patches.
Before you do so, you might want to contact us on the mailing lists: some input files are generated automatically from many other source files, so errors in them may result from bugs in our scripts.
Troubleshooting
- Remember to set the LOCALE and LANG variables.
- If the file builds very slowly and grows huge, check whether you have many ambiguous mappings between POS tags and word endings. If so, you might try the trick used in the Czech and Polish dictionaries: simply join the POS tags with "+" and reuse the Java code from the Czech tagger. This should help make your file smaller.
- It's wise to check that every line of the input file has exactly three non-empty fields. This is what the following gawk script does:
# Fields are tab-separated.
BEGIN {FS="\t"}
{
  # Report lines with too few or too many fields.
  if (NF!=3) print "Wrong number of fields in the line: " $0
  # Report any of the first three fields that is empty.
  for (i=1;i<=3;i++)
    if ($i=="") print "Empty field no. " i " on the line: " $0
}
- And remember to set the LOCALE and LANG variables.
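The tag-joining trick mentioned in the list above can be sketched as follows. This is an illustration only: join_tags is a hypothetical helper, and the exact layout expected by the Java code of the Czech tagger should be checked against that code before you rely on it. The sketch assumes sorted, tab-separated input, so that all readings of the same form and lemma sit on adjacent lines.

```shell
# join_tags: hypothetical helper. Reads a sorted three-column lexicon on
# stdin and merges adjacent lines that share form and lemma into one
# line, with their POS tags joined by "+".
join_tags() {
    awk -F '\t' '
        {
            key = $1 "\t" $2
            if (NR > 1 && key == prev) {
                tags = tags "+" $3      # same form and lemma: append the tag
            } else {
                if (NR > 1) print prev "\t" tags
                prev = key
                tags = $3
            }
        }
        END { if (NR > 0) print prev "\t" tags }
    '
}
```

It could be used as: sort -u input.txt | join_tags >input_joined.txt (the output file name is just an example).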
Building a synthesizer dictionary
It's only for the brave ones ;)
The synthesizer dictionary generates an inflected form if you feed it with a lemma and a POS tag. It works with our Synthesizer class.
You need a very fancy script in AWK to build it. Let's call it synthesis.awk:
BEGIN {FS="\t"} {print $2"|"$3"\t"$1}
What it basically does is reverse the order of the fields and join the lemma and the POS tag with the "|" sign. The order is very important: otherwise the file will grow very fast and the dictionary will be useless. The command to get a synthesizer dictionary is the following:
gawk -f synthesis.awk input.txt |gawk -f morph_data.awk | sort -u |fsa_build -O -o dictionary_synth.dict
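To see what the dictionary actually stores, it helps to run the first stage of this pipeline on its own. The sketch below wraps the one-line synthesis.awk program in a shell function so that it can be tried without fsa installed; synth_keys is a made-up name, and plain awk is used here on the assumption that it behaves like gawk for this script.

```shell
# synth_keys: hypothetical wrapper around the synthesis.awk one-liner.
# Turns "form<TAB>lemma<TAB>tag" into "lemma|tag<TAB>form".
synth_keys() {
    awk 'BEGIN { FS = "\t" } { print $2 "|" $3 "\t" $1 }'
}
```

Fed the lexicon line for "boyars" (lemma "boyar", tag NNS), it prints "boyar|NNS", a tab, and "boyars", so the lemma plus the tag become the key under which the inflected form is stored.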
You also need a list of all POS tags in a text file. Save this as tags.awk:
BEGIN {FS="\t"} {print $3}
And run:
gawk -f tags.awk input.txt | sort -u > demo_tags.txt
You also need the properties .info file:
#
# Dictionary properties.
#
fsa.dict.separator=+
fsa.dict.encoding=iso-8859-2
fsa.dict.uses-prefixes=false
fsa.dict.uses-infixes=false
The only thing you can change in it is the encoding.
The synthesizer dictionary is used to generate inflected suggestions in heavily inflected languages. Note: it might be helpful to remove from the synthesizer dictionary all forms whose POS tags indicate "unknown form", "foreign word" etc., as they only take up space and probably nobody will ever use them. It is also advisable to remove all archaic forms of main verbs (see the English "resource/en/filter-archaic.txt" for an example of what you might want to exclude).
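Such filtering can be done on the tab-separated lexicon before the synthesizer dictionary is built. The following is a sketch, assuming a plain-text file with one unwanted POS tag per line. The file name and the tags in it are placeholders, since the tags that mean "unknown form" or "foreign word" differ between tagsets, and drop_tags is a hypothetical helper.

```shell
# drop_tags: hypothetical helper. $1 names a file listing one unwanted
# POS tag per line. Lexicon lines read from stdin are printed unless
# their third (tag) column appears in that list.
drop_tags() {
    awk -F '\t' -v list="$1" '
        BEGIN {
            # Load the unwanted tags before any lexicon line is read.
            while ((getline tag < list) > 0) skip[tag] = 1
            close(list)
        }
        !($3 in skip)
    '
}
```

For example: drop_tags unwanted_tags.txt <input.txt >input_synth.txt, after which input_synth.txt replaces input.txt in the synthesizer command above.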





