Customizing Sentence Segmentation in SRX Rules

Starting with version 0.9.9, LanguageTool will support specifying the sentence segmentation rules in SRX format, thanks to the segment Java library. The rules for all languages are contained in the /resource/segment.srx file, which can also be downloaded directly from the CVS here. The rules are cascading, i.e., there are a few universal rules, alternating rules for paragraph breaking (you don't need to edit them), and rules for specific languages.
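To make the cascading idea concrete, here is a minimal sketch in Python (used only for illustration; segment itself is a Java library). The SRX fragment, the rule set names "Universal" and "English", and the helper rules_for are hypothetical and are not taken from segment.srx. The sketch assumes that with cascading enabled, every language map whose pattern matches the language code contributes its rules, in the order the maps are listed.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical SRX fragment for illustration; not the contents of segment.srx.
SRX = """<srx xmlns="http://www.lisa.org/srx20" version="2.0">
  <header segmentsubflows="yes" cascade="yes"/>
  <body>
    <languagerules>
      <languagerule languagerulename="Universal">
        <rule break="yes"><beforebreak>[.?!]\\s</beforebreak><afterbreak></afterbreak></rule>
      </languagerule>
      <languagerule languagerulename="English">
        <rule break="no"><beforebreak>\\bMr\\.\\s</beforebreak><afterbreak></afterbreak></rule>
      </languagerule>
    </languagerules>
    <maprules>
      <languagemap languagepattern="en.*" languagerulename="English"/>
      <languagemap languagepattern=".*" languagerulename="Universal"/>
    </maprules>
  </body>
</srx>"""

NS = {"s": "http://www.lisa.org/srx20"}

def rules_for(language, srx_text=SRX):
    """Collect (is_break, beforebreak, afterbreak) from every matching language map, in order."""
    root = ET.fromstring(srx_text)
    by_name = {lr.get("languagerulename"): lr
               for lr in root.iterfind(".//s:languagerule", NS)}
    collected = []
    for lm in root.iterfind(".//s:languagemap", NS):
        # Assumption: the pattern must match the whole language code.
        if re.fullmatch(lm.get("languagepattern"), language):
            for rule in by_name[lm.get("languagerulename")].iterfind("s:rule", NS):
                collected.append((
                    rule.get("break", "yes") == "yes",
                    rule.findtext("s:beforebreak", "", NS),
                    rule.findtext("s:afterbreak", "", NS),
                ))
    return collected
```

Under these assumptions, rules_for("en") yields the English no-break rule followed by the universal break rule, while rules_for("pl") yields only the universal rule.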

The file can be edited using one of the available SRX editors: Ratel (open source, actively developed and used in the project), and SRXEditor (proprietary but free; contains a very helpful example file, which is copyrighted so it can only be consulted for inspiration). You can also use Pangolin, which is a web-based editor using the same code as Ratel.

Please use Ratel to maintain the same file formatting for easy version control. Basically, there are two kinds of rules:

  • specifying the sentence break
  • disallowing the sentence break (specifying the exceptions to breaking rules).

No-break rules should precede the break rules in the file. All rules have two parts:

  • before the break
  • and after the break
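The interplay of the two parts and the rule order can be sketched as follows. This is a simplified model in Python, not the algorithm used by segment: it assumes that at each position the first rule whose before and after parts both match decides whether to break. The rules and the function split_sentences are hypothetical examples.

```python
import re

# Hypothetical rules, not taken from segment.srx: (is_break, beforebreak, afterbreak).
RULES = [
    (False, r"\b(?:Mr|Dr|etc)\.\s", r""),  # no-break: known abbreviations
    (True, r"[.?!]\s", r""),               # break: end punctuation plus one whitespace
]

def split_sentences(text, rules=RULES):
    """Split text at every position where the first matching rule is a break rule."""
    sentences, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in rules:
            # beforebreak must match text ending at pos, afterbreak text starting at pos.
            ends_here = re.search("(?:" + before + r")\Z", text[:pos])
            starts_here = re.match(after, text[pos:])
            if ends_here and starts_here:
                if is_break:
                    sentences.append(text[start:pos])
                    start = pos
                break  # first matching rule wins, so no-break rules must come first
    sentences.append(text[start:])
    return sentences
```

With these rules, "Dr. Smith arrived. He was late." is split only after "arrived. ", because the no-break rule for "Dr." is tried before the general break rule. Swapping the two rules would also produce a break after "Dr. ".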

Both parts have to be specified using regular expressions. The library we're using, segment, uses standard Java regular expressions, which are slightly more expressive than what is described in the SRX specification. To maintain portability, avoid very advanced features such as lookahead, which are missing from the spec.
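A condition that might be written with lookahead can usually be moved into the after-break part instead. The sketch below uses Python's re module purely for illustration (its syntax is close to, but not identical with, Java regular expressions); the patterns and sample text are made up.

```python
import re

text = "It works. see below. Next one."

# Variant using lookahead: concise, but outside what the SRX spec describes.
with_lookahead = [m.end() for m in re.finditer(r"\.\s(?=[A-Z])", text)]

# Portable variant: the condition on the following text moves into afterbreak.
before, after = r"\.\s", r"[A-Z]"
portable = [
    m.end() for m in re.finditer(before, text)
    if re.match(after, text[m.end():])
]
```

Both variants find the same single break position, before "Next one.", but only the second can be expressed as a plain before/after pair.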

Note: in the SRX specification and most available rules, there is an obvious mistake: sentences that are not the first in the paragraph start with leading whitespace, which is wrong and unacceptable for our project. Take time to check that the afterbreak section of the rule doesn't contain any whitespace (\s). In most cases, afterbreak should stay empty.
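The effect of whitespace in afterbreak can be seen in a small sketch. It is written in Python for illustration only, with a hypothetical helper split_at_breaks that cuts the text wherever the before part matches up to a position and the after part matches from it.

```python
import re

def split_at_breaks(text, before, after):
    """Cut text wherever `before` matches up to a position and `after` matches from it."""
    parts, start = [], 0
    for m in re.finditer(before, text):
        pos = m.end()
        if 0 < pos < len(text) and re.match(after, text[pos:]):
            parts.append(text[start:pos])
            start = pos
    parts.append(text[start:])
    return parts

text = "First one. Second one."

# Whitespace in afterbreak: the break falls before the space,
# so the next sentence starts with it.
bad = split_at_breaks(text, r"[.?!]", r"\s")

# Whitespace in beforebreak, empty afterbreak: the space stays
# with the preceding sentence.
good = split_at_breaks(text, r"[.?!]\s+", r"")
```

Here bad is ["First one.", " Second one."], with the unwanted leading space, while good is ["First one. ", "Second one."].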
