deler is a tool for simple and accountable segmentation of marked-up text. It
wraps any existing text segmentation tool, provided that the tool's output is
mapped so that segments are delimited by new lines. deler works by converting
the marked-up input to plain text and tracking alignment. For full details,
and to reference deler, please see:

@InProceedings{Rea:Dri:Oep:13,
  author    = {Read, Jonathon and Dridan, Rebecca and Oepen, Stephan},
  title     = {Simple and Accountable Segmentation of Marked-up Text},
  booktitle = {Proceedings of the 19th Nordic Conference
               on Computational Linguistics},
  month     = {May},
  year      = {2013},
  address   = {Oslo, Norway},
  url       = {http://www.delph-in.net/deler/ReaDriOep13.pdf}
}

SETTING UP ====================================================================

deler has a number of prerequisites:
  1) Python (version 2.7 recommended)
  2) some external tool for segmentation that outputs each segment on a new
     line. tokenizer (http://www.cis.uni-muenchen.de/~wastl/misc/) is
     recommended --- see examples/tokenizer for our invocation
  3) a configuration file that specifies how to handle elements --- see
     examples/html-wdc.xml for an example, and more information.


RUNNING DELER =================================================================

usage: deler.py [-h] [--config CONFIG] [--gml-mode] [--paragraph-mode]
                [--post_start POST_START] [--segmenter SEGMENTER] [--validate]
                [files [files ...]]

positional arguments:
  files                 a list of files to segment

optional arguments:
  -h, --help            show this help message and exit
  --config CONFIG       configuration xml
  --gml-mode            output gml instead of the input markup
  --paragraph-mode      force segmentation at double newlines
  --post_start POST_START
                        regex to extract the tag that indicates the start of
                        the post
  --segmenter SEGMENTER
                        path to segmenter executable

For example, try the following command:

  ./deler.py --config examples/html-wdc.xml --gml-mode --paragraph-mode \
  --segmenter examples/tokenizer examples/test.html
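
The same invocation can be scripted when many files need segmenting. The
following is a minimal sketch, not part of deler: the helper names
build_command and run_deler are hypothetical, and it assumes that ./deler.py
is executable from the current directory. Only flags listed in the usage
message above are used.

```python
import subprocess
from typing import List, Sequence


def build_command(files: Sequence[str],
                  config: str = "examples/html-wdc.xml",
                  segmenter: str = "examples/tokenizer",
                  gml_mode: bool = True,
                  paragraph_mode: bool = True,
                  deler: str = "./deler.py") -> List[str]:
    """Assemble the deler command line as a list of arguments.

    The default values mirror the example command in this README; the
    helper itself is hypothetical and not shipped with deler.
    """
    command = [deler, "--config", config, "--segmenter", segmenter]
    if gml_mode:
        command.append("--gml-mode")
    if paragraph_mode:
        command.append("--paragraph-mode")
    command.extend(files)
    return command


def run_deler(files: Sequence[str], **options) -> None:
    """Run deler on the given files; raises CalledProcessError on failure."""
    subprocess.run(build_command(files, **options), check=True)
```

Calling run_deler(["examples/test.html"]) would then be equivalent to the
command shown above.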


OUTPUT ========================================================================

For each input file the tool will produce an output file with a .deler
extension. This is the output for examples/test.html:

@examples/test.html
0 104 @0-"<p>"  @12+"⌊∗"  @12-"<b>" @24+"∗⌋"  @24-"</b>"  @49+"⌊>"  @49-"<a>" @63+">⌋"  @63-"</a>"  @100-"</p>"
The name ⌊∗Clanfield∗⌋ is derived from the ⌊>Old English>⌋and means “field clean of weeds”.
105 172 @0-"<p>"  @38+"⌊>"  @38-"<a>" @48+">⌋"  @48-"</a>"  @63-"</p>"
Clanfield was historically a small ⌊>farming>⌋ community.

Inside this file, the first line (prefixed with an @) indicates the name of
the input file.

Then, there is a pair of lines for each segment.  The first is the account of
the segment, with fields delimited by tabs; the second is the text of the
produced segment.  The first and second fields of the account are the start
and end character offsets in the input file that correspond to the produced
segment. Each subsequent field is an account of some modification made to the
original segment. Each such field matches the regular expression:

  /@(\d+)([+-])"(.+)"/

where the capturing groups correspond to:
  (1) the character offset of this action (relative to the segment start)
  (2) + (indicating insertion) or - (indicating removal)
  (3) a unicode string (where whitespace is escaped) indicating what was
      inserted or removed
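
To illustrate the format, the following sketch parses an account line and
uses it to restore the original markup of a segment. It is not part of deler:
the names Action, parse_account and restore_markup are hypothetical. It also
assumes, based only on the example output above, that offsets count
characters of the original input from the segment start and that the fields
are listed in order of increasing offset. The escaping of whitespace is not
undone here.

```python
import re
from typing import List, NamedTuple, Tuple

# The regular expression given in this README for one account field.
ACTION_RE = re.compile(r'@(\d+)([+-])"(.+)"')


class Action(NamedTuple):
    offset: int  # character offset relative to the segment start
    kind: str    # "+" for insertion, "-" for removal
    text: str    # the inserted or removed string (still escaped)


def parse_account(line: str) -> Tuple[int, int, List[Action]]:
    """Split a tab-delimited account line into offsets and actions."""
    fields = line.rstrip("\n").split("\t")
    start, end = int(fields[0]), int(fields[1])
    actions = []
    for field in fields[2:]:
        field = field.strip()
        if not field:
            continue  # tolerate empty fields
        match = ACTION_RE.fullmatch(field)
        if match is None:
            raise ValueError("malformed account field: %r" % field)
        actions.append(Action(int(match.group(1)), match.group(2), match.group(3)))
    return start, end, actions


def restore_markup(segment: str, actions: List[Action]) -> str:
    """Undo the recorded actions to recover the original marked-up text.

    Assumes offsets refer to the original text and actions are in order.
    """
    original = []  # pieces of the restored text
    length = 0     # number of original characters restored so far
    pos = 0        # read position in the produced segment
    for action in actions:
        gap = action.offset - length
        if gap < 0:
            raise ValueError("actions are not in offset order")
        # Copy the unchanged text that precedes this action.
        original.append(segment[pos:pos + gap])
        pos += gap
        length += gap
        if action.kind == "+":
            # Inserted text exists only in the output, so skip over it.
            if not segment.startswith(action.text, pos):
                raise ValueError("inserted text not found at offset %d" % action.offset)
            pos += len(action.text)
        else:
            # Removed text exists only in the original, so put it back.
            original.append(action.text)
            length += len(action.text)
    original.append(segment[pos:])
    return "".join(original)


# The second segment from the example output above.
ACCOUNT = "\t".join(['105', '172', '@0-"<p>"', '@38+"⌊>"', '@38-"<a>"',
                     '@48+">⌋"', '@48-"</a>"', '@63-"</p>"'])
SEGMENT = "Clanfield was historically a small ⌊>farming>⌋ community."

start, end, actions = parse_account(ACCOUNT)
restored = restore_markup(SEGMENT, actions)
print(restored)
# → <p>Clanfield was historically a small <a>farming</a> community.</p>
```

For this segment the restored text is 67 characters long, which agrees with
the difference between the end and start offsets (172 - 105).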


ACKNOWLEDGEMENTS ==============================================================

Work on deler was carried out in the Language Technology Group at the
University of Oslo as part of the WeSearch project, funded by the Norwegian
Research Council through its VerdIKT programme.


CONTACT =======================================================================

Jonathon Read, j.read@tees.ac.uk

Last updated 21 April 2013