Missing Features
Here are some ideas for possible LanguageTool features. If you have questions or comments, or would like to suggest another task, please send a message to our mailing list. Note that you need to be subscribed first for your messages to get through.
Very simple tasks to help you get started
- Write a JUnit test case for SlovakVesRule
- Extend the pattern rule test to make sure that the short message (<short> in the grammar XML file), if any, is actually shorter than the other error message (<message>).
- Add an option to support WordFast TM files from the command-line (in Main.java, only tabbed text files are currently supported).
- Add a real clickable link to LanguageTool's (stand-alone) About dialog. Requires knowledge of Swing programming to be easy.
- Write a new XML rule for your native language. See http://www.languagetool.org/development/#xmlrules. Requires a bit of knowledge of XML (which is easy to learn); no Java knowledge required.
- Update the Ukrainian and Belarusian SRX sentence tokenizers. Requires knowledge of the given language. Contact: Yakov Reztsov
Short Tasks
These are tasks that can probably be finished in a week or less.
- Create a prototype of your UI
- This may be a mock-up or an implementation, either for a new web-based rule-editor, or for a web-based user interface for LT.
- WikiCheck: use the Wikipedia API to make fixes
- As described at http://languagetool-user-forum.2306527.n4.nabble.com/WikiCheck-edit-functionality-td4381789.html, it must be ensured that the edit is made by the user, not by some kind of "LanguageTool bot". Knowledge of Grails is helpful here. Contact: Daniel Naber
- Integrate Turkish version
- There is already an old version of LanguageTool with support for Turkish at http://code.google.com/p/garniturk/. Try it out and port the changes to the latest version of LanguageTool. Requires good knowledge of Turkish.
- Get the AutoCorrect exceptions via API from LibreOffice and use them during sentence tokenizing
- LibreOffice/OOo knows about abbreviations like "approx.". These contain a dot which we should not interpret as a sentence boundary. LT has its own list but it should, at runtime, be extended with those from LibreOffice. Contact: Daniel Naber
- Support URL with additional information
- LibreOffice can now take a URL from the grammar checker. See http://www.libreoffice.org/download/3-5-new-features-and-fixes/ (look for "FullCommentURL") and the comment in openoffice/Main.java, around line 461.
- Add UGTag to support tagging in Ukrainian
- Add support for the Ukrainian tagger UGtag. It is already written in Java, so this should be quite easy, if not trivial.
- XML parsing code cleanup
- The current RuleLoader classes are SAX-based and messy. It might be cleaner to rewrite the code using JAXB; that way the XML could be unmarshalled easily for any rule.
- Make more sentence tokenizers use SRX
- just like other languages already do
- affected languages that could use an update to SRX: Belarusian, Ukrainian, Esperanto, French, Galician, Italian, Lithuanian, Malayalam, Swedish
- Difficulty: some work, but quite easy. Requires knowledge of the given language.
- Modularize LanguageTool for LibreOffice/OpenOffice.org
- The current LT distribution is getting too large. The idea would be to create a basic (framework) LT extension, possibly containing English only, and individual language extensions. The only things needed to make it work are to develop a naming scheme for the extension identifiers in OOo, to use the OOo API in the framework extension to check whether other extensions (= individual language modules) are installed, and to prepare a few XML (.xcu) files.
- As all resources are loaded via the classpath, the extension would have to add individual modules to the classpath. It is yet to be confirmed that this is actually possible under OOo (it may disallow such a thing). If it is not, all individual modules would have to contain the whole Java matching engine, and there would be no central extension.
- Difficulty: some work with the OOo API, so it requires some knowledge of Java and Ant, but in general quite easy.
- An additional task would be to enable similar modularization for our Java Web Start GUI code: the main extension would start with English, and additional languages would be downloaded on demand. This should allow the app to start much quicker.
- Contact: Marcin Miłkowski
- Create a general word decomposition module (interface)
- Make an interface class in LT that would enable using different word decomposition modules (for Dutch, Polish etc.). This would make the code more elegant, and it should be quite easy.
- One class that implements the interface could use hunspell dictionaries for languages that have complex decompounding in hunspell dictionaries (this is a mid-term task).
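The shape of such an interface could look roughly like the sketch below. Everything in it is an assumption made for illustration: `WordDecomposer` and `TwoPartDecomposer` are hypothetical names, not existing LT classes, and the toy implementation only splits a word into two parts found in a word list, which is far simpler than real decompounding.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical interface; name and signature are assumptions for illustration.
interface WordDecomposer {
    // Returns the parts of a compound, or a single-element list if no split is found.
    List<String> decompose(String word);
}

// Toy implementation: splits a word into exactly two parts that are both in a word list.
class TwoPartDecomposer implements WordDecomposer {
    private final Set<String> knownWords;

    TwoPartDecomposer(Set<String> knownWords) {
        this.knownWords = new HashSet<>(knownWords);
    }

    @Override
    public List<String> decompose(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        // Try every split point and accept the first one where both halves are known.
        for (int i = 1; i < lower.length(); i++) {
            String head = lower.substring(0, i);
            String tail = lower.substring(i);
            if (knownWords.contains(head) && knownWords.contains(tail)) {
                return Arrays.asList(head, tail);
            }
        }
        return Collections.singletonList(lower);
    }
}
```

A hunspell-backed class, as suggested above, would implement the same interface, so callers would not need to know which module is in use.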
- Add TMX and XLIFF readers for bitext checking
- New classes for reading and writing TMX (possibly based on JAXB, and using XSLT to convert TMX to the current format) are needed to add real-world support for bitext checking.
- Difficulty: moderately easy, requires just a bit of tweaking. The only difficulty is to support internal tags in XLIFF. Probably two kinds of XLIFF output would be needed: with corrections applied directly to the target text, and corrections as comments. Another way would be to add support of bitext to LT Server, and use CheckMate for checking (see below).
- Port usable English rules from XML copy editor
- Rules from XML Copy Editor for English could be interesting for LT (see its source in the xmlcopyeditor-1.0.9.5/src/rulesets directory).
- Add ability to skip tokens of a particular kind only in XML rules
- There is a workaround, but it does not work well (it skips too much). The code would require a new notation in the XML schema and a few lines of code in the Pattern matcher, as well as lots of testing :)
- Conversion of LightProof rules
- The rules for the LightProof checker are available for some languages (French, Hungarian). They are a subset of what LT can do, so automatic conversion to XML rules should be easy.
- Enable using multiple rule sets
- Enable using multiple rule sets (different xml files) to implement custom sets that implement different style guides. For example, one could implement Chicago Manual of Style checker that is run only for scientific papers: the user would activate the standard English rules along with the custom sets.
- It seems that it would need some changes in the config dialog. Otherwise, the code for doing this should be fairly simple, as it would just need multiple calls to the method that loads the rules, passing a new filename every time.
- A more advanced version would enable loading the rules from a web-based repository of custom rules.
- Contact: Marcin Miłkowski
- Add more redundancy rules for English
- Add more rules to detect redundant phrases like these.
- Make LT compatible with GNU Java
- Resource loaders do not work with GNU Java. Find out why and fix it.
- Make IKVM-based OOo extension
- IKVM could be used to provide a native version for people who seem to hate Java. An alternative .NET version of the extension for OOo could then be built.
- TeX processing mode
- Add a command-line switch to remove TeX formatting from the input text when checking (sf bug #2880449).
- One difficulty is that the text positions have to be retained correctly also when applying corrections.
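The position-retention difficulty can be illustrated with a minimal sketch. It assumes a deliberately simplified notion of TeX formatting (backslash commands and grouping braces only), and `TexStripper` with its methods is a hypothetical helper, not an existing LT class. The idea is to record, for every character kept in the plain text, the offset it had in the original input.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of LanguageTool: strips a simplified subset of
// TeX markup and remembers where every kept character came from.
class TexStripper {
    private final StringBuilder plain = new StringBuilder();
    private final List<Integer> originalOffsets = new ArrayList<>();

    TexStripper(String tex) {
        int i = 0;
        while (i < tex.length()) {
            char c = tex.charAt(i);
            if (c == '\\') {
                // Skip the backslash and the command name (letters only).
                i++;
                while (i < tex.length() && Character.isLetter(tex.charAt(i))) {
                    i++;
                }
            } else if (c == '{' || c == '}') {
                i++; // drop grouping braces
            } else {
                plain.append(c);
                originalOffsets.add(i); // plain-text index -> original index
                i++;
            }
        }
    }

    String plainText() {
        return plain.toString();
    }

    // Maps an offset in the stripped text back to the original TeX input.
    int toOriginalOffset(int plainOffset) {
        return originalOffsets.get(plainOffset);
    }
}
```

A checker would run on `plainText()` and translate each match position with `toOriginalOffset` before reporting or applying a correction.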
- Check for rule duplication
- Over time, it gets harder and harder to see whether a particular mistake is already detected. Develop a new JUnit test for rules (especially XML rules) that would take the incorrect examples and test whether some other rules get activated: (1) at all in the sentence, which could mean it's wrong in multiple places; (2) in the same place (raise a serious warning). Check additionally if the corrected sentence (after applying the suggestion) is the same - in such a case, the test should fail.
- Easy with XML rules; getting it to work with Java rules is just a bit harder.
- Code for checking rule duplication already exists in dev/de/danielnaber/languagetool/dev/conversion/RuleCoverage.java. You just need to create a series of JUnit tests on top of that.
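The "same place" check from case (2) above could be sketched as follows. This is an illustration under stated assumptions only: `Match` is a hypothetical stand-in for whatever a rule match looks like (a rule id plus a character span), and the sketch does not use RuleCoverage.java or any other existing LT code.

```java
import java.util.ArrayList;
import java.util.List;

class DuplicateRuleCheck {
    // Hypothetical stand-in for a rule match: rule id plus character span [from, to).
    static final class Match {
        final String ruleId;
        final int from;
        final int to;

        Match(String ruleId, int from, int to) {
            this.ruleId = ruleId;
            this.from = from;
            this.to = to;
        }
    }

    // Returns the ids of rules other than expectedRuleId whose match overlaps
    // a match of expectedRuleId in the same example sentence.
    static List<String> overlappingRules(String expectedRuleId, List<Match> matches) {
        List<String> result = new ArrayList<>();
        for (Match expected : matches) {
            if (!expected.ruleId.equals(expectedRuleId)) {
                continue;
            }
            for (Match other : matches) {
                boolean overlaps = other.from < expected.to && expected.from < other.to;
                if (!other.ruleId.equals(expectedRuleId) && overlaps
                        && !result.contains(other.ruleId)) {
                    result.add(other.ruleId);
                }
            }
        }
        return result;
    }
}
```

A JUnit test would run all rules on each incorrect example, collect the matches, and raise a serious warning whenever this list is non-empty.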
- Conditional matching for some forms in XML rules
- Match some patterns only if the planned suggestion is a proper dictionary word. Useful for some separately written words - simply checking if the suggestion is in the dictionary could filter out wrong suggestions and simplify rules. This could slow down LT: the whole match would be generated before it is discarded.
- Easy. Requires two steps: 1. adding an XML construct to check the suggestion (as an attribute of the suggestion element); 2. building a method in PatternRule that calls the tagger on the suggestion text to check if it matches the constant UNKNOWN. About 25 lines of code, including XML Schema and Java support for it.
- Contact: Marcin Miłkowski
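The filtering step could be sketched like this. The dictionary lookup is abstracted as a predicate, standing in for the tagger call described above, and `SuggestionFilter` and its methods are hypothetical names rather than PatternRule code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical helper; the predicate stands in for a tagger/dictionary lookup.
class SuggestionFilter {
    // Keeps only the suggestions that the lookup accepts as known words.
    static List<String> keepKnown(List<String> suggestions, Predicate<String> isKnownWord) {
        List<String> kept = new ArrayList<>();
        for (String suggestion : suggestions) {
            if (isKnownWord.test(suggestion)) {
                kept.add(suggestion);
            }
        }
        return kept;
    }

    // A match is reported only if at least one suggestion survives the lookup.
    static boolean shouldReport(List<String> suggestions, Predicate<String> isKnownWord) {
        return !keepKnown(suggestions, isKnownWord).isEmpty();
    }
}
```

As noted above, the match and its suggestions are fully built before they can be discarded, which is where the possible slowdown comes from.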
- More powerful unification
- Unification is somewhat limited, as it cannot unify a sequence of tokens of unknown length (no regular expressions over tokens, so to say). What we are missing is an equivalent of the * and + operators for tokens.
- This might be easily implemented in Unifier.java, and requires some extensions of the XML Schema (to accommodate equivalents of * and +).
- For the explanation of our unification, see Using Unification.
- Add a layer to XML so that LT could use An Gramadoir rules
- Add direct support or automatic conversion of An Gramadoir rules.
- Create a general mechanism to store rule parameters
- Some rules could take some user-set parameters but we have no general way to store these in configuration files (such as sensitivity level).
- Devise a way to store and retrieve rule parameters from configuration files.
- Contact: Marcin Miłkowski
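One possible storage scheme, sketched with the standard `java.util.Properties` class. The key format `<ruleId>.<parameterName>`, the class name `RuleParameters` and the rule id used in the usage note are all assumptions made for illustration; the existing configuration code may call for a different layout.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, not an existing LanguageTool class.
class RuleParameters {
    private final Properties props = new Properties();

    // Keys take the form "<ruleId>.<parameterName>"; this naming scheme is an assumption.
    void set(String ruleId, String name, String value) {
        props.setProperty(ruleId + "." + name, value);
    }

    String get(String ruleId, String name, String defaultValue) {
        return props.getProperty(ruleId + "." + name, defaultValue);
    }

    // Serializes all parameters in the standard .properties text format.
    String toText() {
        StringWriter out = new StringWriter();
        try {
            props.store(out, "rule parameters");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    // Reads parameters back from .properties text.
    static RuleParameters fromText(String text) {
        RuleParameters result = new RuleParameters();
        try {
            result.props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }
}
```

For example, `set("EXAMPLE_RULE", "sensitivity", "3")` followed by `toText()` yields text that could be appended to a configuration file, and `get` falls back to the supplied default for rules that have no stored value.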
- Add documentation / help part to XML rules
- XML rules could be self-documenting by describing the source, the reason why some expressions are incorrect etc.
- A general way to store documentation per rule in a language is needed.
- Convert glossaries into terminology checks
- Build a packager that takes a glossary in CSV or tabbed format and outputs a bitext XML rule
- A more advanced version could also support TBX (an XML terminology format), but that is not required because the format is quite rare.
- Two interfaces: a UI that asks questions (drop a file, answer several questions, and get the resulting file), and the command line
- This feature would be nicer if accompanied by an ability to load multiple rule sets
- Very easy. Contact: Marcin Miłkowski
- Add country-variant rules
- Add a possibility to have rules for a given country variant of a language. Requires separate rule files (for example, rules-[country-variant].xml, where [country-variant] is a two-letter country code) or a country attribute for the rule and rulegroup elements. Contact: Juan Martorell
- Needs some extensions to GUI (configuration dialog, language selector) and in OOo binding code.
- See also https://sourceforge.net/tracker/?func=detail&atid=655720&aid=3287388&group_id=110216
Medium-term ideas
- Add spell checking
- Add ability to check for spelling errors via hunspell JNI or natively via our tagger dictionaries. The latter requires some knowledge of finite state machines and porting fsa_spell to Java. Yet it should be moderately difficult for developers with knowledge of C++ and Java. Daniel: we really should use hunspell, compound words are difficult to check with our tagger dictionaries
- Create a web-based user-interface
- We only have a web interface for demonstration purposes. What we need is a fast, good-looking interface for common users that shows errors and suggested corrections directly in the text. http://orangoo.com/labs/GoogieSpell/ could probably be extended for this. Contact: Daniel Naber
- Add rule / false friends online editor
- This would simplify adding rules. It would probably run in a web browser. Options:
- XML database: use some native XML database system and create a web interface to query it and modify it. This way searching for rules would be easier.
- Use a web-based WYSIWYM (what you see is what you mean) editor to create a new editing interface.
- The features could be:
- Search for a rule (rulegroup) by id and example.
- Display a rule (and its rulegroup in a collapsed way).
- Modify all elements of the rule.
- Check validity.
- Run tests on the file (via JUnit in LanguageTool).
- Create a wizard in a web-browser that would enable writing a rule in several steps, possibly using also the fast evaluation on a corpus.
- The created rule should be evaluated: first on the examples given during the construction of the rule to make sure there are no errors in the rule itself, and then on the corpus. Ideally, in this step, fast rule evaluation should be used.
- The smartest wizard could also use collocation information from the corpus that could help to add exceptions for correct usage (partial evaluation during writing a rule).
- The wizard could be a part of the new rule editor (a separate tool in the editor).
- This should probably be integrated into http://community.languagetool.org. Knowledge of Grails helps with this.
- Contact: Marcin Miłkowski, Daniel Naber
- Chrome/Firefox Extension
- LT should have a better interface that integrates with the browser, like After The Deadline for Chrome and AtD for Firefox. AtD internally uses XML similar to LT, so it would be quite easy to create an XSLT to convert the XML coming from the extension and return the results in the same fashion as AtD (this way, all future bug fixing etc. in these extensions could be integrated in our code as well). LT should be run locally (and configured via a web-based configuration dialog, see below) or externally.
- Note that the AtD extension has some bugs, notably it does not seem to check the text language as set in the browser. We should not rely completely on that, though: in the future, we should also use our language detection routines to make sure (see other simple tasks).
- AtD also does not work well with some JavaScript-based sites, such as Facebook. This would also need fixing.
- Contact: Marcin Miłkowski
- Increase German part-of-speech coverage
- Extend the German Morphy dictionary to contain more words with their POS tags (especially words after the spelling reform) to get better agreement checking. Write a web-application that lets users easily add new words and their forms, without the need to type every word form. This could be thought of as a web-based version of Morphy. Requires very good knowledge of German and programming knowledge (Grails or Java preferred). Contact: Daniel Naber
- Simplify configuration dialog
- It's too complex for an average user, too much clicking.
- With Swing, it requires much work. We would need to: (a) add a collapsible tree of checkboxes; (b) add a search box.
- Another idea: use a default web browser and handle the configuration via HTTP request (formatted as XML created in the browser?). Use JavaScript for the browser, if needed. This could be reused for configuring LT HTTP Server interface.
- Contact: Marcin Miłkowski
- Improve German decomposition module (jwordsplitter)
- jwordsplitter is used to split compound words (e.g. Haustür -> Haus + Tür). It could need a major code cleanup and lexicon extension. The lexicon should be extended by looking for common compound words that are not correctly split yet. As hunspell also supports compounds, using hunspell's German dictionary should be evaluated. Requires very good knowledge of German. Contact: Daniel Naber
- Enable display of whole sentences matched in LT output
- Would make it easier to use corpora-processing scripts. No impact. Can be implemented in a similar way to automatic correction: only matched sentences would be displayed with error markers, and possibly also rule names. In this way, an annotated error corpus would be created. BTW: the corrected text could also be saved in a TEI-compliant format, using mostly corr, choice and sic elements, for easy quantitative analysis using TEI tools.
- Contact: Marcin Miłkowski
- User-level rules
- It would enable having rules compatible with different style-guides (good for academic and technical writing). Requires some change in the current rule-loading procedure and writing at least some sets of user rules.
- Contact: Marcin Miłkowski
- Rule exclusion
- Would enable writing mutually exclusive rules, like "Make US English" or "Make UK English". Not clear if it should also involve non-XML rules.
- Contact: Marcin Miłkowski
- Bitext check for placeables / numbers
- In translated text, formatting elements or numbers should be left alone or converted to other units. Create a rule that (a) aligns the formatting elements / numbers on a token level; (b) marks up the elements that were not successfully aligned. Use Numbertext to align figures translated into text (i.e., 1 translated into "one").
- There is similar Java code in the translation QA tool CheckMate. It is also available under the LGPL, so one could reuse the code (or call the Okapi library).
- Contact: Marcin Miłkowski
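A much-reduced sketch of step (b) for numbers only: it reports numbers that occur in the source segment but cannot be paired with an identical number in the target. It does not handle unit conversion, formatting elements, or figures spelled out as words (the Numbertext case), and `NumberAlignment` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the alignment idea; not CheckMate or LT code.
class NumberAlignment {
    // Digits with an optional decimal part; the separator may be '.' or ','.
    private static final Pattern NUMBER = Pattern.compile("\\d+(?:[.,]\\d+)?");

    static List<String> numbers(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = NUMBER.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    // Pairs each source number with one identical target number;
    // returns the source numbers left without a partner.
    static List<String> missingInTarget(String source, String target) {
        List<String> remaining = new ArrayList<>(numbers(target));
        List<String> missing = new ArrayList<>();
        for (String number : numbers(source)) {
            if (!remaining.remove(number)) {
                missing.add(number);
            }
        }
        return missing;
    }
}
```

Each target number is consumed at most once, so a number repeated twice in the source must also appear twice in the target.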
- Rule priority/severity and register
- Could simplify configuration, but configuration dialog is hard to change. Requires that the config dialog is changed first.
- Contact: Marcin Miłkowski
- Multi-threading
- Checking long files with LT can be slow. Different rules can be checked in parallel, or different portions of the text can be checked in parallel. The code in the OOo extension would need to be changed appropriately as well (currently it is only single-threaded). Daniel: not sure if the increase in complexity is really worth it
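The second option, checking different portions of the text in parallel, could be sketched with a standard thread pool. The checker is passed in as a plain function from text to a list of messages; that signature and the name `ParallelChecker` are assumptions for illustration, and the sketch presumes the checker is safe to call from several threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper; the checker function stands in for a thread-safe check call.
class ParallelChecker {
    static List<String> checkAll(List<String> paragraphs,
                                 Function<String, List<String>> checker,
                                 int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per paragraph.
            List<Future<List<String>>> futures = new ArrayList<>();
            for (String paragraph : paragraphs) {
                Callable<List<String>> task = () -> checker.apply(paragraph);
                futures.add(pool.submit(task));
            }
            // Collect results in submission order so output follows the text order.
            List<String> all = new ArrayList<>();
            for (Future<List<String>> future : futures) {
                all.addAll(future.get());
            }
            return all;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while checking", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("check failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Rules that need context across paragraph boundaries would not work with this kind of split, which is part of the added complexity mentioned above.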
- Rule profiler (for Java & XML rules) run from the command-line on a large corpus
- Could eliminate resource hogs, especially bad regular expressions. Not clear if it should go to /dev or to normal command line.
- Simplifying adding new language
- Needed to bootstrap development. Partially implemented. What remains to be done is:
- store and load configuration for a new language;
- add the same feature to OOo (that would mainly mean the command in the menu)
- One could also create a wizard that would create an empty (or nearly empty - there could be dummy patterns as examples left) XML for a new language and generate standard Java bindings for it so that the classes could be simply copied to the source tree.
- Create an automatic extractor of rules based on transformation-based learning algorithm
- There is existing prototype scripting code that takes a dump of Wikipedia history and converts it into a corpus of errors. The corpus of errors may then be processed automatically with a TBL rule-learning algorithm to prepare rules similar to the ones used by LT. There is a Java module for TBL learning here and here. See also here for details of how it would work.
- The code would go to /dev. Contact: Marcin Miłkowski
- Add interfaces for valency checking etc.
- Add an abstract interface that would allow using our finite-state dictionaries for classifying words for purposes other than POS tagging (for example, a valency lexicon, or a lexicon of proper names etc.). For example, for French, there is the Lefff dictionary.
- Prepare a rule that uses valency checks.
- Add support for a new language
- Add support for a new language, including a POS tagger, an initial set of rules (at least 50), and a translation of the UI. Contact: Juan Martorell
- Native speakers are encouraged to work on their own language.
- An ambitious project would be to support a language that needs some new features or is harder to support (written from right to left, like Hebrew, or depending on a lexicon for word segmentation, like Thai).
- Create a POS lexicon extractor for hunspell dictionaries
- Hunspell dictionaries are usually manually built, and contain implicit grammatical information. The task is to create an extractor of regularities from hunspell affix flags (by mapping word forms to lemmas via "flag signatures", that is affix flag sets that create all forms of the word) and then to use the regularities to create a POS tagger lexicon. This could be possible if a tagged corpus exists: the words from the corpus should be matched to their signatures by using maximally informative sets.
- Requires knowledge of data extraction or statistics, and some knowledge of NLP tasks as well.
- Add support for the Ukrainian language
- Create a tagger dictionary for Ukrainian
- Create POS tagger and create rules that use it
- Translation of the UI
- Contact: Andriy Rysin
Long-term ideas
- Microsoft Office integration
- LT could be used to proofread MS Word documents. This requires using CGAPI, which is not publicly available, yet the license is said to be given out freely (though you have to sign an NDA, which is an agreement that you won't disclose the information). It is unclear whether the NDA is compatible with making an open-source grammar checker.
- That being said, there is a complete open-source add-in for Microsoft Word written in C#, Virastyar, which seems to work fine. It does spell checking and punctuation checking.
- LiveWriter plugin
- There is an open source AtD plugin in C#, which uses AtD. It would be quite easy to adapt it.
- JEdit plugin
- Create a plugin for JEdit, similar to spell-checking plugins available for it already.
- Scribus plugin
- Abiword plugin
- Just get the original LinkGrammar plugin code and replace calls to Link Grammar with calls to HTTP Server of LT.
- Fairly easy for people that speak C++.
- Google Docs integration
- QuarkXpress, Adobe Pagemaker integration
- Intra-word tokenization
- Some languages require tokenizing words internally for consistent processing (for example, Polish), so that one input word comes out as two tokens (or as a graph composed of one token, and two tokens, in case the word is ambiguous). Requires some changes in morfologik library and some changes to fsa dictionaries (but not in fsa_build)
- Add XML output for tagger-only mode
- Useful for people that would like to create a corpus using LanguageTool. Easy to implement for easy XML formats, harder for TEI-compliant format. Not essential for development of LT.
- Alphabetic indexer
- Use Named-Entity Recognition rules to automatically generate an alphabetic index in OpenOffice.org. Requires separate rules for every language (and possibly gazetteer files). Not critical. It's a fairly well-known task in IE (information extraction), but it could demonstrate that LT is not just a proofreader but a shallow-parsing engine. The easiest way to implement it would be to use user-level rules and some Java code in OOo to try to match entries (in the case of proper names, given names might be missing) and then search the text in the opened document.
- Grammaticality evaluator
- Evaluate if the sequence of POS tags seems to be grammatical or not using a statistical trigram language model containing POS tags. Hard to estimate how relevant such a test could be.
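To make the idea concrete, here is a minimal sketch of a count-based POS trigram model. It uses plain relative frequencies with no smoothing, so it is a simplification of what a real statistical language model would do; the class name `PosTrigramModel` and the Penn-style tags in the usage note are illustrative assumptions.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of an unsmoothed POS trigram model.
class PosTrigramModel {
    private final Map<String, Integer> trigramCounts = new HashMap<>();
    private final Map<String, Integer> contextCounts = new HashMap<>();

    // Counts every tag trigram, and its two-tag context, in one tagged sentence.
    void train(List<String> tags) {
        for (int i = 0; i + 2 < tags.size(); i++) {
            String context = tags.get(i) + " " + tags.get(i + 1);
            contextCounts.merge(context, 1, Integer::sum);
            trigramCounts.merge(context + " " + tags.get(i + 2), 1, Integer::sum);
        }
    }

    // Relative-frequency estimate of P(t3 | t1, t2); 0.0 if the context was never seen.
    double probability(String t1, String t2, String t3) {
        String context = t1 + " " + t2;
        int contextCount = contextCounts.getOrDefault(context, 0);
        if (contextCount == 0) {
            return 0.0;
        }
        return trigramCounts.getOrDefault(context + " " + t3, 0) / (double) contextCount;
    }

    // Counts the trigrams of a tag sequence whose estimated probability is zero.
    int unseenTrigrams(List<String> tags) {
        int unseen = 0;
        for (int i = 0; i + 2 < tags.size(); i++) {
            if (probability(tags.get(i), tags.get(i + 1), tags.get(i + 2)) == 0.0) {
                unseen++;
            }
        }
        return unseen;
    }
}
```

After training on a tagged corpus, a sentence with many unseen trigrams would be a candidate for being ungrammatical. Without smoothing, rare but correct tag sequences are flagged too, which relates to the doubt expressed above about how relevant such a test would be.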
- Subject domain classifier
- Classify the text subject domain to detect words that might be misspellings. Non-trivial task without a Wordnet or something equivalent to General Inquirer
- Enable red underlines for context-related spelling mistakes in OOo
- Spelling mistakes would be more visible. Not yet possible in OOo API (feature not yet implemented in OOo).
Unsorted Ideas
- see http://papyr.com/hypertextbooks/grammar/gramchek.htm and An Evaluation of Microsoft Word 97’s Grammar Checker by Caroline Haist
- see Checks for English by Proofread Bot - we could reimplement them
- update languagetool.xml.update automatically (i.e. replace @version@)
- make the dist-src work (= compile out of the box)
- stand-alone GUI: mark errors in upper part of window, see http://stackoverflow.com/questions/10144815/how-to-make-a-red-zig-zag-under-word-in-jeditorpane for the component we could use for that
- enable style registers and/or rule classes
- clean up rule descriptions so that they coherently contain the error or the rule (e.g., "did + baseform" vs. "did + non-baseform")
- new German rule: Vergleichs vs Vergleiches etc -> only one variant per document should be used
- create abstract SentenceRule and TextRule classes to get rid of reset() method?
- check if there's a nice design that lets us extend PatternRule and PatternRuleLoader to make them more powerful, but without having all features in these classes
- create a general mechanism for setting and storing rule parameters (including Java rules and XML rules) like sensitivity level
- new German rule:
- "*Ich kaufe den Hund einen Knochen" (den -> dem), but:
- "*Ich kaufe dem Hund." (dem -> den)
- see "TODO" / "FIXME" in the source: find . -iname "*.java" -exec egrep -H "TODO|FIXME" {} \;
- see if java.text.RuleBasedBreakIterator would be better for word tokenization than the current scheme (especially check performance)
page revision: 115, last edited: 13 May 2012 09:50





