tokenizer, putzer, htmlEnt2Char -- three tools for corpus processing
===================================================================
tokenizer -- a tokenizer with end-of-sentence detection (see "tokenizer -h")
putzer -- remove unnecessary blanks etc. (``putzer'' is German and means ``cleaner'')
(see "putzer -h")
htmlEnt2Char -- converts HTML-entities into characters (see "htmlEnt2Char -h")
Compile and install (see INSTALL):
./configure
make && make install
I tried to write a fast, rule-based, and also to some extent robust
tokenizer and sentence segmenter.
Currently supported languages are:
* German (see also file LIESMICH)
* English (thanks also to Michaela Geierhos)
* Russian
For each language, the corresponding ISO and MS-Windows codepages are
supported, and UTF-8 in part.
Features:
1. customizable through options
- language and codepage
- try to undo hyphenation
- semantics of line breaks (paragraph separator or not)
- etc.
2. problems and strategies for tokenization
- hyphenated words are considered as one token
- option -c concatenates words with hyphen at end-of-line.
This may cause errors, although a small exception list is defined
3. end-of-sentence detection:
- positive:
* end-of-sentence marker followed by blank and uppercase letter
- negative:
* abbreviations (except for, e.g., "etc." which often occurs at EOS)
* dates
- positive:
* negative case followed by a word usually used only at the beginning of a sentence (BOS)
* capitalized determiners, conjunctions, etc.
- try to handle additional punctuation symbols following the full-stop
correctly (brackets, apostrophes etc.)
- tests on the Brown corpus indicate an error rate of about 3%
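The hyphenation and end-of-sentence rules listed above can be illustrated
with a small Python sketch. This is not the code of the tokenizer itself,
only a simplified model of the rules as described here. The word lists
(KEEP_HYPHEN, ABBREVIATIONS, BOS_WORDS) and the function names are
hypothetical and far smaller than any real list; dates and trailing
brackets or apostrophes after the full stop are not handled.

```python
import re

# Hypothetical exception list: compounds that keep their hyphen even when
# the hyphen falls at the end of a line.
KEEP_HYPHEN = {"well-known", "so-called"}

# Hypothetical lists for the end-of-sentence rules.
ABBREVIATIONS = {"st.", "dr.", "mr.", "e.g.", "i.e."}  # negative rule
BOS_WORDS = {"A", "An", "The", "But", "And", "He", "She", "It"}  # positive override


def join_hyphenated(text):
    """Join 'exam-\\nple' into 'example' unless the compound is an exception."""
    def repl(match):
        left, right = match.group(1), match.group(2)
        compound = f"{left}-{right}"
        if compound.lower() in KEEP_HYPHEN:
            return compound
        return left + right
    return re.sub(r"(\w+)-\n\s*(\w+)", repl, text)


def split_sentences(text):
    """Split whitespace-separated text into sentences with three simple rules."""
    tokens = text.split()
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        if i + 1 == len(tokens):
            break  # last token: the remainder is flushed below
        nxt = tokens[i + 1]
        # Basic positive rule: EOS marker followed by an uppercase letter.
        if not tok.endswith((".", "!", "?")) or not nxt[0].isupper():
            continue
        # Negative rule: no EOS after an abbreviation, unless the next word
        # is one that usually starts a sentence (positive override).
        if tok.lower() in ABBREVIATIONS and nxt not in BOS_WORDS:
            continue
        sentences.append(" ".join(current))
        current = []
    if current:
        sentences.append(" ".join(current))
    return sentences
```

With these lists, split_sentences("He met Dr. Smith today. They talked.")
keeps "Dr. Smith" together, while an abbreviation followed by a capitalized
determiner, as in "... at 55 Water St. A spokesman ...", is treated as a
sentence boundary.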
Version history:
0.1 -- package with tokenizer, putzer, htmlEnt2Char
0.2 -- bug reported by js: when input contains long words or many consecutive newlines,
tokenizer stops with "input buffer overflow". To avoid this, use putzer as a
filter with the newly introduced option -m!
0.3 -- optimization (inlines & macros): now about 10% faster
0.4 -- corrected some details in German EOS-detection, changed behaviour with option -sx:
When a newline is recognized a space is printed on a separate line,
instead of an empty line.
0.5 -- ':' is no longer considered an EOS mark. Additions to German abbreviation list.
0.6 -- Added more German abbreviations, Roman numerals with point.
Added rudimentary support for utf-8 in German.
0.7 -- Better EOS for English, thanks to Michaela Geierhos;
rudimentary support for utf-8 in English
0.8 -- fixed a bug in the Russian part, which made the tokenizer hang
0.9 -- changes to German abbreviations
rudimentary support for utf-8 in Russian
0.10 -- fixed bug raising a segfault for German language option
short sequences in parentheses are excluded from containing an end-of-sentence
additions to German abbreviations
0.11 -- fixed a bug with options -C and -c.
Introduced positive rules for German EOS: i.e. if a capitalized article, conjunction
or preposition follows an abbreviation or date, there should be an EOS.
0.12 -- better documentation (in English)
Positive rules also for English: The text «The firm said it
plans to sublease its current headquarters at 55 Water St. A
spokesman declined to elaborate.» (Wall Street Journal) is now
correctly split into two sentences
1.0 -- (almost) no changes
GPL licensed now