Missing Features
Here are some ideas for possible features of LanguageTool.
| Feature | Pros | Cons | Current state |
|---|---|---|---|
| Unification of grammatical features | Simplifies XML rules | Requires intensive testing | Implemented |
| Introduce a way to check if the token matched in an XML rule was preceded by whitespace | Needed to produce meaningful suggestions in some languages for words starting with apostrophes or for some punctuation checks | Could impact current implementation or be kludgy and is not essential | Implemented |
| Automatic correction mode | Useful for using LT in complex workflows, like post-editing machine-translated text or cleaning corpora - instead of displaying suggestions, suggestions would be automatically applied to the text. | Probably not practical in OOo | Implemented in 0.9.8. |
| Use SRX for tokenizing sentences | LT could reuse the same segmentation as in other tools. Simplifies writing segmentation rules by people that don't code in Java | - | Implemented in 0.9.9. (some languages need new rules, not ready yet) |
| Conditional matching for some forms | Match some patterns only if the planned suggestion is a proper dictionary word. Useful for some separately written words - simply checking if the suggestion is in the dictionary could filter out wrong suggestions and simplify rules. | This could slow down LT: the whole match would be generated before it is discarded. | Planned. |
| Simplify configuration dialog | It's too complex for an average user, too much clicking | With Swing, it requires much work | Planned: (a) add a collapsible tree of checkboxes; (b) add a search box. Maybe JavaFX could simplify UI. |
| Enable red underlines for context-related spelling mistakes in OOo. | Spelling mistakes would be more visible. | Not yet possible in OOo API (feature not yet implemented in OOo). | Planned |
| Enable display of whole sentences matched in LT output | Would make it easier to use corpora-processing scripts | No impact. Can be implemented in a similar way to automatic correction: only matched sentences would be displayed with error markers, and possibly also rule names. In this way, an annotated error corpus would be created. BTW: The corrected text could also be saved in TEI-compliant format using mostly corr, choice and sic elements for easy quantitative analysis using TEI tools. | Planned |
| Intra-word tokenization | Some languages require tokenizing words internally for consistent processing (for example, Polish). | Requires some changes in morfologik library and some changes to fsa dictionaries (but not in fsa_build) | Planned. Maybe 0.9.9 or later. |
| User-level rules | It would enable having rules compatible with different style guides (good for academic and technical writing) | Requires some change in the current rule-loading procedure and writing at least some sets of user rules | Planned |
| Rule exclusion | Would enable writing mutually exclusive rules, like "Make US English" or "Make UK English" | Not clear if it should also involve non-XML rules | Planned |
| Bitext (parallel text) support | LT could be used for translation quality assurance; with minimal changes it could even be used for automated correction of machine-translated texts. | Requires implementing a reader of some standard translation format such as XLIFF and classes for bitext rules | Planned |
| Rule priority/severity and register | Could simplify configuration | Configuration dialog is hard to change | |
| Add rule / false friends online editor | Would simplify adding rules | Requires a lot of new code | Planned; maybe XForms-based |
| Simplify adding a new language | Needed to bootstrap development | Partially implemented | In progress |
| Add XML output for tagger-only mode | Useful for people that would like to create a corpus using LanguageTool. Easy to implement for easy XML formats, harder for TEI-compliant format. | Not essential for development of LT. | - |
| Language Guesser | Removal of false alarms for sentences or fragments in other languages | Possible false positives (it's always heuristic) | - |
| Alphabetic indexer | Use Named-Entity Recognition rules to automatically generate an alphabetic index in OpenOffice.org | Requires separate rules for every language (and possibly gazetteer files). Not critical. | It's a fairly well-known task in IE but it could demonstrate that LT is not just a proof-reader but a shallow-parsing engine. The easiest way to implement it would be to use user-level rules and some Java code in OOo to try to match entries (in case of proper names given names might be missing) and then search text in the opened document. |
| Grammaticality evaluator | Evaluate if the sequence of POS tags seems to be grammatical or not using a statistical trigram language model containing POS tags | Hard to estimate how relevant such a test could be | Planned. |
| Subject domain classifier | Classify the text subject domain to detect words that might be misspellings. | Non-trivial task without a Wordnet or something equivalent to General Inquirer | ? |
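
The automatic correction mode listed in the table replaces each matched span with a suggestion instead of displaying it. The sketch below illustrates one way this could work. It does not use the LanguageTool API: the `Match` class, its fields and the `applyAll` method are hypothetical names invented for this example, and each match is assumed to carry a single suggestion. Replacements are applied from the end of the text towards the start so that earlier offsets stay valid.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class AutoCorrectSketch {

    /** Hypothetical stand-in for a rule match: a character span plus one suggestion. */
    static final class Match {
        final int from;          // inclusive start offset
        final int to;            // exclusive end offset
        final String suggestion; // replacement text (assumed: one per match)

        Match(int from, int to, String suggestion) {
            this.from = from;
            this.to = to;
            this.suggestion = suggestion;
        }
    }

    /** Applies the suggestions, skipping any match that overlaps a span already rewritten. */
    static String applyAll(String text, List<Match> matches) {
        List<Match> sorted = new ArrayList<>(matches);
        // Process matches from right to left so that offsets to the left remain correct.
        sorted.sort(Comparator.comparingInt((Match m) -> m.from).reversed());
        StringBuilder sb = new StringBuilder(text);
        int lastStart = text.length();
        for (Match m : sorted) {
            if (m.to > lastStart) {
                continue; // overlaps a span that was already replaced
            }
            sb.replace(m.from, m.to, m.suggestion);
            lastStart = m.from;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "This is an test of of the tool.";
        List<Match> matches = Arrays.asList(
                new Match(8, 10, "a"),    // "an" -> "a"
                new Match(16, 21, "of")); // "of of" -> "of"
        System.out.println(applyAll(text, matches));
    }
}
```

A real implementation would also have to decide which suggestion to pick when a rule offers several. This sketch sidesteps that by assuming one suggestion per match.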





