deler is a tool for simple and accountable segmentation of marked-up text. It
provides a wrapper around any existing tool for text segmentation, provided
the tool's output is mapped so that segments are delimited by new lines. deler
works by converting the marked-up input to plain text and tracking the
alignment between the two.

For full details, and to cite deler, please see:

@InProceedings{Rea:Dri:Oep:13,
  author    = {Read, Jonathon and Dridan, Rebecca and Oepen, Stephan},
  title     = {Simple and Accountable Segmentation of Marked-up Text},
  booktitle = {Proceedings of the 19th Nordic Conference on Computational
               Linguistics},
  month     = {May},
  year      = {2013},
  address   = {Oslo, Norway},
  url       = {http://www.delph-in.net/deler/ReaDriOep13.pdf}
}


SETTING UP ====================================================================

deler has a number of prerequisites:

1) python (version 2.7 recommended)

2) some external tool for segmentation that outputs each segment on a new
   line. tokenizer (http://www.cis.uni-muenchen.de/~wastl/misc/) is
   recommended (see examples/tokenizer for our invocation)

3) a configuration file that specifies how to handle elements (see
   examples/html-wdc.xml for an example, and more information)


RUNNING DELER =================================================================

usage: deler.py [-h] [--config CONFIG] [--gml-mode] [--paragraph-mode]
                [--post_start POST_START] [--segmenter SEGMENTER] [--validate]
                [files [files ...]]

positional arguments:
  files                 a list of files to segment

optional arguments:
  -h, --help            show this help message and exit
  --config CONFIG       configuration xml
  --gml-mode            output gml instead of the input markup
  --paragraph-mode      force segmentation at double newlines
  --post_start POST_START
                        regex to extract the tag that indicates the start of
                        the post
  --segmenter SEGMENTER
                        path to segmenter executable

For example, try the following command:

  ./deler.py --config examples/html-wdc.xml --gml-mode --paragraph-mode \
             --segmenter examples/tokenizer examples/test.html


OUTPUT ========================================================================

For each file the tool will produce an output file with a .deler extension.
This is the output for examples/test.html:

@examples/test.html
0 104 @0-"
" @12+"⌊∗" @12-"" @24+"∗⌋" @24-"" @49+"⌊>" @49-"" @63+">⌋" @63-"" @100-"
" The name ⌊∗Clanfield∗⌋ is derived from the ⌊>Old English>⌋and means “field clean of weeds”. 105 172 @0-"" @38+"⌊>" @38-"" @48+">⌋" @48-"" @63-"
" Clanfield was historically a small ⌊>farming>⌋ community. Inside this file, the first line (appended with an @) indicates the name of the input file. Then, there are a pair of lines for each segment. The first is the account of the segment, with fields delimited by tabs. The first and second fields are start and end character offsets in the input file that correspond to the produced segment. Each subsequent field is an account of some modification made to the original segment. The account matches the regular expression: /@(\d+)([+-])"(.+)"/ where the capturing groups correspond to: (1) the character offset of this action (relative to the segment start) (2) + (indicating insertion) or - (indication removal) (3) a unicode string (where whitespace is escaped) indicating what was inserted or removed ACKNOWLEDGEMENTS ============================================================== Work on deler was carried out in the Language Technology Group at the University of Oslo as part of the WeSearch project, funded by the Norwegian Research Council through its VerdIKT programme. CONTACT ======================================================================= Jonathon Read, j.read@tees.ac.uk Last updated 21 April 2013