;;; -*- mode: fundamental; coding: utf-8; indent-tabs-mode: t; -*- ;;; ;;; Copyright (c) 2009 -- 2018 Stephan Oepen (oe@ifi.uio.no); ;;; see `LICENSE' for conditions. ;;; ;;; ;;; another shot at a finite-state language for preprocessing, normalization, ;;; and tokenization in LKB grammars. requires LKB version of 1-feb-09 or ;;; newer. note that the syntax is rigid: everything starting in column 2 ;;; (i.e. right after the rule type marker) is used as the match pattern until ;;; the first `\t' (tabulator sign); one or more tabulators are considered the ;;; separator between the matching pattern and the replacement, but other ;;; whitespace will be considered part of the patterns. empty lines or lines ;;; with a semicolon in column 1 (i.e. in place of the rule type marker, this ;;; is not Lisp) will be ignored. ;;; ;;; this is a fresh attempt (as of September 2008) at input tokenization. for ;;; increased compatibility with existing tools (specifically taggers trained ;;; on the PTB), we now assume a PTB-like tokenization in pre-processing. the ;;; grammar includes token mapping rules (using the new chart mapping machinery ;;; in PET) to eventually adjust (i.e. correct, in some cases) tokenization to ;;; its needs. specifically, many punctuation marks will be re-combined with ;;; preceding or following tokens, reflecting standard orthographic convention, ;;; and are then analyzed as pseudo-affixes. ;;; ;;; this file is inspired by the PTB `tokenizer.sed' script, and by and large ;;; should yield very similar results. with the addition of token mapping as ;;; a separate step inside the parser, we want to restrict RE-based processing ;;; to pure string-level phenomena. however, to actually tokenize (following ;;; some set of principles), we need to do more than just break at whitespace. ;;; some punctuation marks give rise to token boundaries, but not all. also, ;;; inputs (in the 21st century) may contain some amount of mark-up, where XML ;;; character references have become relatively common. full UniCode support ;;; now makes it possible to represent a much larger range of characters, e.g. ;;; various types of quotes and dashes. we aim to map mark-up to corresponding ;;; UniCode characters, and preserve those in parsing, as much as possible. ;;; ;;; the original `tokenizer.sed' script actually cannot always yield the exact ;;; tokenization found in the PTB. the script unconditionally separates a set ;;; of punctuation or other non-alphanumeric characters (e.g. |&| and |!|) that ;;; may be part of a single token (say in |AT&T| or URLs). we aim to do better ;;; than the original script, here, conditioning on adjacent whitespace. ;;; ;; ;; preprocessor rules versioning; auto-maintained upon CVS (or SVN) check-in. ;; @$Date$ ;; ;; tokenization pattern: after normalization, the string will be broken up at ;; each occurrence of this pattern; the pattern match itself is deleted. ;; :[ \t]+ ;;; ;;; string rewrite rules: all matches, over the entire string, are replaced by ;;; the right-hand side; grouping (using `(' and `)') in the pattern) and group ;;; references (`\1' for the first group, et al.) carry over part of the match. ;;; ;; ;; pad the full string with trailing and leading whitespace; makes matches for ;; word boundaries a little easier down the road; also, squash multiple spaces ;; and replace tabulators and ‘weird’ spaces with an ASCII space. ;; ;; _fix_me_ ;; in the 21st century, there are of course various whitespace code points, so ;; maybe we should look for a UniCode property to use here. however, it used ;; to be the case, at least, that the LKB implementation of REPP did not have ;; the full UniCode CL-PPCRE extensions (and even if it did, one might first ;; have to read up on interactions of locales and UniCode properties, before ;; relying on such fancy whitespace detection. (29-nove-12; oe) ;; !^(.+)$ \1 ![\t ] ! + ;; a set of `mark-up modules', often replacing mark-up character entities with ;; actual UniCode characters (e.g. |—| or |---| as |—|), or just ditching ;; mark-up that has no bearing on parsing for now (e.g. most wiki mark-up). ;; these modules can be activated selectively by name in the REPP environment ;; or the top-level call into REPP. >xml >latex >ascii >html >wiki >lgt >gml ;; ;; we hope to keep it to a minimum and only invoke it in select configurations, ;; but at times we need to heuristically `fix up' deficiencies in the input; ;; for example spurious duplicate punctuation, and such. ;; >robustness ;; ;; our treatment of quote marks continues to evolve. as of february 2010, it ;; seems we need to support two distinct scenarios: (a) inputs using standard ;; typography (quotes either are directional already or can be diambiguated by ;; adjacency to preceding or following whitespace) vs. (b) inputs that already ;; have been pre-tokenized or otherwise damaged (as in the current CoNLL 2010 ;; training data), where we stand no chance of disambiguating quote marks. in ;; downstream processing, we thus need to handle six types of quotes: genuine ;; directional quotes (|“|, |”|, |‘|, and |’|) and ambiguous, straight ones ;; (|"| and |'|). the former group can be restricted to opening and closing ;; functions, and in the case of closing quotes also to apostrophes and units ;; of measure (feet and inches, or seconds and minutes; for which in `proper' ;; typography the double prime and prime symbols could be used); the straight ;; quotes, on the other hand, need to be considered ambiguous between opening ;; and closing, apostrophe, or prime usages. (9-feb-10; oe) ;; >quotes ;; ;; two special cases involving periods: map ASCII ellipsis (|...|) to a single ;; UniCode character (|…|), and convert |..| between numbers into an n-dash, ;; i.e. a numeric range (typically tokenized off, i.e. |42| |–| |43|). maybe ;; the latter can also occur between non-numbers? we could also just preserve ;; it, but always make it a token in its own right? ;; ;; _fix_me_ ;; what about a sentence-final period following the ellipsis (as in cb/7060)? ;; (24-sep-08; oe) ;; _fix_me_ ;; several years later, i believe the ellipsis (|…|) should just be considered ;; a suffix punctuation mark (possibly contributing a piece of semantics) that ;; contributes the same punctuation quality as token-final periods. however, ;; so far we have kept periods following ellipsis as separate tokens, and this ;; also appears to be the analysis in the venerable PTB (and hence we cannot ;; revise it while finalizing DeepBank 1.0). at the same time, the PTB has a ;; number of sentences ending in just ellipsis, and (for maximum compliance) ;; we want to normalize these to the UniCode glyph too. (10-feb-13; oe) ;; !([0-9]) *\.\. *([0-9]) \1 – \2 !(?:(?:\. ){3,}|(?:\.){3,})(?![])}”’ ]*$) … !(?:\. ){3}|(?:\.){3} … ;; ;; some UniCode characters force token boundaries: m-dash, ellipsis. we used ;; to include n-dashes in this list, but some authors (e.g. ESR) use n-dashes ;; much like hyphens, attempting to be clever about bracketing in expressions ;; like |non–source-aware| [users]. ;; !([—…]) \1 ;; ;; remove space after initial |O'| and |L'|, i.e. irish and romance names, to ;; avoid stripping off their apostrophes. ;; ! ([OL])['’] \1’ ;; ;; a new REPP facility: named groups and iterative group calls. there are a ;; number of characters that PTB tokenizes off (unconditionally, it seems, in ;; the original `tokenizer.sed'), though not when they are parts of names or ;; NE patterns, e.g. |AT&T| or |http://www.emmtee.net/?foo.php&bar=42|. thus, ;; we only want these as separate tokens when they are preceded or followed by ;; whitespace; this leaves a problem with, say, |http://www.emmtee.net/|, where ;; one would have to apply NE recognition (what used to be `ersatzing') _prior_ ;; to tokenization. ;; ;; either way, because characters we want to tokenize off might be `clustered' ;; with each other, e.g. |(42%), |, the notion of adjacent whitespace needs to ;; apply transitively through such clusters. it seems an iterative group is ;; the most straightforward way of getting that effect. the rules from the ;; group will be applied repeatedly (in order) at the time the group is called ;; (by means of the `>' operator), until there are no further matches. we need ;; to be careful to avoid indefinite recursion within the group, i.e. not add ;; duplicate spaces. thus, once again, make sure the full string is bracketed ;; by single spaces, and ditch multiple spaces initially; then we can require a ;; non-whitespace preceding or following context, besides the actual spaces. ;; ;; at this point, we exclude a few punctuation characters from this policy, in ;; part because that is the PTB approach (|-| and |/|), in part because they ;; can be prefixes or suffixes of one-token named entities, i.e. |<| and |>| in ;; URLs and email addresses. to work around these, we may need a string-level ;; `ersatzing' facility, associating a sub-string (that can be unambiguously ;; identified by surface properties, e.g. a URL) with an identifier of a token ;; class. ;; ;; like in the original PTB script, periods are only tokenized off in sentence- ;; final position, maybe followed only by closing quote marks or parentheses. ;; DPF 2016-03-03 - Added the GML characters ⌊ and ⌋ so we still pull off the ;; period when there is a strong bracket following the period, as in ;; |we ⌊(⌋ arrived. ⌊)⌋| ;; ;; the story for parentheses and square brackets is somehwat involved, to avoid ;; breaking up inputs like |factor(s)| or |Ca[2+]|. ;; ;; _fix_me_ ;; our current solution unconditionally strips off leading parentheses, which ;; would break an input like |(mis-)read|. talk to dan about how best to deal ;; with these (maybe see whether we find actual examples). (26-feb-10; oe) ;; ;; _fix_me_ ;; there is an issue with some of the characters that are asserted (at least in ;; whitespace adjacency) to constitute separate tokens, specifically the dollar ;; sign. inputs like |HK$ 7.8| will end up with a bogus token boundary (which ;; is the case too in the original PTB sed(1) script). (15-jul-09; oe) ;; DPF 2020-04-24 - Added line for |10%-plus| which needs to be turned into ;; four tokens. (This block applies before the hyphen-splitting rule below.) ;; !^(.+)$ \1 ! + #1 ! ([^ (]+)(\)) ([^ ]|$) \1 \2 \3 ! ([^ \[]+)(]) ([^ ]|$) \1 \2 \3 !([^ ])([\]}?!,,、;:@#$€¢£¥%&”"’']) ([^ ]|$) \1 \2 \3 !([^ ])([\]}?!,,、;:@#$€¢£¥%&”"’'])([–-])([^ ]|$) \1 \2 \3 \4 !([^ ])(\.) ([\])}”"’'… ]*)(⌊.⌋)*$ \1 \2 \3\4 !(^|[^ ]) ([\[({:@#$€¢£¥%&\*“‘])([^ ]) \1 \2 \3 # >1 ;; ;; _fix_me_ ;; to address the above issue about over-tokenizing, for example, |HK$ 7.8|, ;; it might be convenient to introduce non-breaking spaces in the above set of ;; rules, so that we could now put back together |HK $| with peace of mind? ;; (2-dec-12; oe) ;; !((?:AU?|CA?|HK|NZ|US)?\$)([0-9]+) \1 \2 ;; ;; any word-final apostrophe, by now, should be separated (e.g. |abrams'| --> ;; |abrams '|). which only leaves contracted forms, including the undesirable ;; PTB ones, e.g. |don't| --> |do n't|. but not |cannot| --> |can not| and the ;; more obscure ones: |gimme|, |lemme|, |'tis|, |wanna|, et al. ;; ;; _fix_me_ ;; the |cannot| case, especially without characterization information, is a bit ;; challenging: it presumably is frequent enough so that for PTB compliance we ;; should pull it apart, but that would seem to introduce unwanted ambiguity. ;; i doubt that |she cannot participate on monday| has the reading of her being ;; able to `not participate' (stay out of the way) on monday. or does it? ;; (19-sep-08; oe) ;; ;; _fix_me_ ;; starting in mid-2012, normalize towards UniCode apostrophes (where we used ;; to normalize towards ASCII straight quotes). as a consequence, we no longer ;; expect straight quotes in the `standard' setup, i.e. lexical entries need to ;; be available with UniCode apostrophes. for robustness to non-standard modes ;; (e.g. running without REPP), however, we should either keep duplicates of ;; all lexical entries containing apostrophes, or normalize in token mapping? ;; (26-aug-12; oe) ;; DPF 2019-07-12 - For robustness, include opening quote |‘| ;; !([^ ])[’'‘]([dDmMsS]) \1 ’\2 !([^ ])[’'‘](ll|LL|re|RE|ve|VE) \1 ’\2 !([^ ])([nN])[’'‘]([tT]) \1 \2’\3 ;; ;; as part of the 2019 makeover, also split at hyphens and dashes. ;; _fix_me_ ;; think more about modularity: possibly this part should be segregated into ;; a separate module, so that old-style PTB compliance can still be configured? ;; (23-nov-19; oe) #2 !([+-]?[0-9]+(?:\.[0-9]*)?)[–-]([0-9]+(?:\.[0-9]*)?) \1 – \2 # >2 ;; Exclude GML ⌊/ and /⌋ from second pattern ;; DPF 2020-04-13 - Separate surviving suffixed hyphen on |to-| in |$8-to-$9| ;; after |$| has been made a separate token above. #3 !(.+)[–-]([a-zA-Z0-9]+) \1 - \2 !([a-zA-Z]+)[–-] ([#$€¢£¥]) \1 - \2 !([a-zA-Z0-9]+)/([a-zA-Z0-9]+) \1 / \2 # >3 ;; ;; _fix_me_ ;; in principle, we still need to re-attach ERG derivational prefixes. ;; ;; ;; _fix_me_ ;; this is arguably a PTB idiosyncrasy, but (at least while finalizing DeepBank ;; 1.0) we want it in the ‘standard’ parsing configuration. some lines start ;; with lower-case, one-letter identifiers, possibly remnants of footnotes or ;; something similar. (10-feb-13; oe) ;; !^( *[a-z])(-)([\[({“‘0-9A-Z]) \1 \2 \3 >ptb ;; ;; to allow parsing (of inputs involving basic punctuation) in the LKB, there ;; is a REPP module to undo PTB-style separation of tokens. this module will ;; only be activated for use within the LKB, not by preprocess-for-pet(). ;; >lkb