How to Use Indexer and Searcher for Fast Rule Evaluation

Large corpora such as Wikipedia, or other big text collections (e.g. Dutch documents), are useful for checking new rules. This page describes a developer tool that checks rules against such corpora quickly. It uses Lucene to index the corpus together with the output of the POS taggers, and Lucene queries to search for sentences that match a rule. This makes checking a new rule much faster and so speeds up rule creation, which mainly benefits language pack developers and maintainers. Note that only pattern rules are considered here.

Where is the tool

You can find the source code in this package:
de.danielnaber.languagetool.dev.index

How to use it

Fast rule evaluation has two phases: (1) indexing and (2) searching. The corpus must be indexed first, and searches then run against the index. Indexing only needs to run once; the index can be reused for all later searches.
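
The two phases can be pictured with a toy in-memory index. This is only a sketch of the idea in plain Java, not the tool's Lucene-based code: the class TwoPhaseSketch and its methods buildIndex and search are hypothetical names, and the matching ignores token order and POS tags.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy stand-in for the two phases; NOT the Lucene-based Indexer/Searcher. */
class TwoPhaseSketch {

    /** Phase 1: map each lower-cased token to the ids of the sentences containing it. */
    static Map<String, Set<Integer>> buildIndex(List<String> sentences) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int id = 0; id < sentences.size(); id++) {
            for (String token : sentences.get(id).toLowerCase().split("\\W+")) {
                if (!token.isEmpty()) {
                    index.computeIfAbsent(token, k -> new TreeSet<>()).add(id);
                }
            }
        }
        return index;
    }

    /** Phase 2: return the ids of sentences that contain every given token (order ignored). */
    static Set<Integer> search(Map<String, Set<Integer>> index, String... tokens) {
        Set<Integer> result = null;
        for (String token : tokens) {
            Set<Integer> hits = index.getOrDefault(token.toLowerCase(), new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(hits);   // copy, so the index itself is never modified
            } else {
                result.retainAll(hits);         // keep only sentences containing all tokens so far
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList(
            "He is both clever as well as kind.",
            "She sings as well as he does.",
            "This sentence is unrelated.");
        Map<String, Set<Integer>> index = buildIndex(corpus);    // built once
        System.out.println(search(index, "as", "well"));         // prints [0, 1]
        System.out.println(search(index, "both", "as", "well")); // prints [0]
    }
}
```

The index is built once and then queried twice, which mirrors why the indexing cost is paid only one time.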

Indexing

We can index a single text document or Wikipedia XML dumps.

Indexing a single text document

See Indexer.java for the source code. Usage:

java Indexer <textFile> <indexDir>
  • textFile: path to a text file to be indexed
  • indexDir: path to a directory storing the index

Indexing Wikipedia XML dumps

Please use WikipediaIndexHandler.java to index Wikipedia dumps. Sample usage:

java -cp LanguageTool.jar:lucene-core-3.1.0.jar:bliki-3.0.3.jar:commons-lang-2.4.jar de.danielnaber.languagetool.dev.wikipedia.WikipediaIndexHandler dewiki-20100815-pages-articles.xml myindexdir de 100

The parameters are:

  • path to the Wikipedia XML dump
  • the directory to write the index to
  • the dump's language as a two-character code (en, de, …)
  • maximum number of documents to be indexed

Searching

The created index can be searched with Searcher.java. Sample usage:

java -cp LanguageTool.jar:lucene-core-3.1.0.jar:lucene-queries-3.1.0.jar de.danielnaber.languagetool.dev.index.Searcher RULE_ID rules/de/grammar.xml myindexdir

The parameters are:

  • ID of the rule to search
  • path to a grammar.xml rule file, e.g. rules/en/grammar.xml
  • path to a directory containing the index (see above for how to create this index)

The result is a list of sentences that match the specified rule (or candidate sentences, see "Unsupported Rule" below).
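
When several rules are checked against the same index, the Searcher call can be scripted. The sketch below only assembles the command line from the sample usage; the class SearcherCommand and its method build are hypothetical helpers, the jar names are copied from the sample, and File.pathSeparator is used because the ":" classpath separator in the sample applies to Unix-like systems.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that assembles the Searcher command line; it does not run it. */
class SearcherCommand {

    // Jar names as given in the sample usage; adjust them to your local paths.
    static final List<String> JARS = Arrays.asList(
        "LanguageTool.jar", "lucene-core-3.1.0.jar", "lucene-queries-3.1.0.jar");

    /** Build the argument list for one Searcher run: rule ID, rule file, index directory. */
    static List<String> build(String ruleId, String ruleFile, String indexDir) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-cp");
        cmd.add(String.join(File.pathSeparator, JARS)); // ":" on Unix, ";" on Windows
        cmd.add("de.danielnaber.languagetool.dev.index.Searcher");
        cmd.add(ruleId);
        cmd.add(ruleFile);
        cmd.add(indexDir);
        return cmd;
    }

    public static void main(String[] args) {
        // RULE_ID is a placeholder, as in the sample usage.
        List<String> cmd = build("RULE_ID", "rules/de/grammar.xml", "myindexdir");
        System.out.println(String.join(" ", cmd));
        // To actually run it: new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

Calling build once per rule ID and passing the same index directory each time reuses the index across all searches.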

Performance

Indexing Performance

We tested indexing on the English dump enwiki-20110405-pages-articles1.xml.
Indexing 5356 Wikipedia pages (392 MB in total) took around 12 minutes. The resulting index is 136 MB in size.

Searching Performance

Searching for a rule normally takes less than 1 second.

However, searching is much slower for rules that contain tokens with either of the following attribute combinations:

  • "negate='yes'" and "regexp='yes'"
  • "negate='yes'" and "postag_regexp='yes'"

In these two cases a search may take tens of seconds or even several minutes, depending on the index size.

Unsupported Rule

Not all features of pattern rules are supported, because of limits on what a Lucene index can express. The following types of pattern rules are unsupported:

  • Pattern rules with token exceptions are not supported.
  • Pattern rules with tokens testing "Whitespace before" are not supported.
  • Pattern rules with tokens in "And Group" are not supported.
  • Pattern rules with phrases are not supported.
  • Pattern rules with unified tokens are not supported.
  • Pattern rules with inflected tokens are not supported.

Although these rules are not fully supported, Searcher.java still finds candidate matches: it searches for a version of the rule in which each unsupported token is replaced with an empty token (<token/>). For example, the following rule

<token skip="-1">both<exception scope="next">and</exception></token>
<token>as</token>
<token>well</token>

is replaced by

<token/>
<token>as</token>
<token>well</token>

Note that this can return more candidate matches than there are real matches, but no real match is missed.
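
The replacement step can be sketched in a few lines of Java. This is an illustration under simplifying assumptions, not the code in Searcher.java: the Token class, relax and matches are hypothetical names, a token is reduced to its text plus an "unsupported" flag, and the skip attribute and POS tags are not modelled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of replacing unsupported tokens with empty ones; not the tool's own code. */
class RelaxRuleSketch {

    /** Hypothetical token model: text == null stands for the empty token <token/>. */
    static final class Token {
        final String text;
        final boolean unsupported; // e.g. the token carries an exception or is inflected

        Token(String text, boolean unsupported) {
            this.text = text;
            this.unsupported = unsupported;
        }
    }

    /** Replace every unsupported token with an empty token that matches any word. */
    static List<Token> relax(List<Token> pattern) {
        List<Token> relaxed = new ArrayList<>();
        for (Token t : pattern) {
            relaxed.add(t.unsupported ? new Token(null, false) : t);
        }
        return relaxed;
    }

    /** True if the pattern matches some run of consecutive words in the sentence. */
    static boolean matches(List<Token> pattern, String sentence) {
        String[] words = sentence.toLowerCase().split("\\W+");
        for (int start = 0; start + pattern.size() <= words.length; start++) {
            boolean ok = true;
            for (int i = 0; i < pattern.size() && ok; i++) {
                String expected = pattern.get(i).text;
                ok = expected == null || expected.equals(words[start + i]);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Token> rule = Arrays.asList(
            new Token("both", true),   // unsupported: carries an exception
            new Token("as", false),
            new Token("well", false));
        List<Token> relaxed = relax(rule);
        System.out.println(matches(relaxed, "He is both as well known as her."));   // true
        System.out.println(matches(relaxed, "She sings as well as he does."));      // true
        System.out.println(matches(relaxed, "As well, this starts the sentence.")); // false
    }
}
```

The second sentence contains no "both", so it is only a candidate: the relaxed rule accepts it because the empty token matches "sings". Candidates like this still need to be checked against the full rule.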