<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:ns2="http://www.tei-c.org/ns/Examples">
    <teiHeader>
        <fileDesc>
            <titleStmt>
                <title>A Description Language for Syntactically Annotated Corpora Esther König and Wolfgang Lezius</title>
            </titleStmt>
        </fileDesc>
    </teiHeader>
    <text>
        <front>
            <div type="abs">
                <head>Abstract</head>
                <p>This paper introduces a description language for syntactically annotated corpora which allows for encoding both the syntactic annotation of a corpus and the queries to a syntactically annotated corpus. In terms of descriptive adequacy and computational efficiency, the description language is a compromise between script-like corpus query languages and high-level, typed unification-based grammar formalisms.</p>
            </div>
        </front>
        <body>
            <div>
                <head>Introduction</head>
                <p>Syntactically annotated corpora like the Penn Treebank (Marcus et al., 1993), the NeGra corpus (Skut et al., 1998) or the statistically disambiguated parses in (Beil et al., 1999) provide a wealth of information, which can only be exploited with an adequate query language. For example, one might want to retrieve verbs with their sentential complements, or specific fronting or extraposition phenomena. So far, queries to a treebank have been formulated in scripting languages like tgrep, Perl or others. Recently, some powerful query languages have been developed: an example of a high-level, constraint-based language is described in (Duchier and Niehren, 1999). (Bird et al., 2000) propose a query language for the general concept of annotation graphs. A graphical query notation for trees is under development in the ICE project (UCL, 2000).</p>
                <p>In the current paper, we present a proposal for a graph description language which is meant to fulfill two conflicting requirements. On the one hand, the language should be close to traditional linguistic description languages, i.e. to grammar formalisms, as a basis for modular, understandable code, even for complex corpus queries. On the other hand, the language should not preclude efficient query evaluation. Our answer is to profit from the research on typed, feature-based/constraint-based grammar formalisms (e.g. (Carpenter, 1992), (Copestake, 1999), (Dörre and Dorna, 1993), (Dörre et al., 1996), (Emele and Zajac, 1990), (Höhfeld and Smolka, 1988)), and to pick those ingredients which are known to be computationally 'tractable' in some sense.</p>
            </div>
            <div1>
                <head xml:id="sec2">The Query Language</head>
            </div1>
            <div1>
                <head xml:id="sec2.1">The right kind of graphs</head>
                <p>If syntactic analysis is meant to provide a basis for semantic interpretation, the predicate-argument structure of a sentence must be recoverable from its syntactic analysis. Nonlocal dependencies like topicalization and right extraposition tell us that trees are not expressive enough. We need a way to connect an extraposed constituent with its syntactic or semantic head. This can be done either by introducing empty leaf nodes plus a means for node coreference (like in the Penn Treebank) or by admitting crossing edges. In our project, the latter solution has been chosen (Skut et al., 1997), partly for the reason that it is simpler to annotate (no decision on the right place of a trace has to be taken). We call this extension of trees with crossing edges syntax graphs. An example is shown in Fig. 1.</p>
                <p>In order to discuss the details of the language, we will make reference to the simpler syntax graph in Fig. 2.</p>
                <p>[Figure 1: A syntax graph with crossing edges for the sentence &quot;Die Tagung hat mehr Teilnehmer als je zuvor&quot; (&quot;the conference has more participants than ever before&quot;)]</p>
                <p>[Figure 2: A simple syntax graph for the sentence &quot;ein Mann läuft&quot; (&quot;a man runs&quot;)]</p>
                <p>Syntactic phrases and lexical entries usually come with a bundle of morphosyntactic information like part-of-speech, case, gender, and number. In computational linguistics, feature structures are used for that purpose. Since we need only a way to represent morphosyntactic information (not syntactic or semantic structures themselves), we restrict ourselves to feature records, i.e. flat feature structures whose feature values are constants. We admit Boolean formulas for the feature values, as well as for the feature-value pairs themselves.</p>
                <div2>
                    <head xml:id="sec1057"></head>
                    <p>For example, all proper nouns (&quot;NE&quot;) and nouns (&quot;NN&quot;) can be retrieved by [pos=&quot;NE&quot; | &quot;NN&quot;].</p>
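                    <p>The following minimal Python sketch illustrates how a query with a disjunction of feature values could be matched against flat feature records. The function name matches, the dictionary encoding of records, and the use of a set for the disjunction are assumptions made for illustration only; they are not part of the description language or of the implementation reported in this paper.</p>
                    <p>
```python
# Illustrative sketch; names and data layout are assumptions, not the paper's API.

def matches(record, constraint):
    """Return True if a flat feature record satisfies a constraint.

    `constraint` maps a feature name to the set of admissible constants
    (a disjunction of values); several features are read conjunctively.
    """
    return all(record.get(feature) in allowed
               for feature, allowed in constraint.items())


# Terminal nodes of the sentence "ein Mann läuft" as flat feature records.
tokens = [
    {"word": "ein", "pos": "ART"},
    {"word": "Mann", "pos": "NN"},
    {"word": "läuft", "pos": "VVFIN"},
]

# Encoding of the query [pos="NE" | "NN"]: proper nouns or nouns.
noun_like = {"pos": {"NE", "NN"}}

hits = [t["word"] for t in tokens if matches(t, noun_like)]
print(hits)
```
                    </p>
                    <p>Representing a disjunction of constants as a set keeps the test for a single feature a simple membership check, which is one way to read the restriction to constant feature values.</p>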
                    <p>As usual, structural identity can be expressed by the use of logical variables. However, variables must not occur in the scope of negation, since this would introduce the computational overhead of inequality constraints.</p>
                    <p>The values of a feature with 'infinite' range like word or lemma can be referred to by regular expressions, e.g. the nouns (&quot;NN&quot;) with initial M can be retrieved by [word = /^M.*/ &amp; pos=&quot;NN&quot;]. The /-symbols mark a regular expression.</p>
                    <p>2.3 Node relations</p>
                    <p>Since graphs are two-dimensional objects, we need one basic node relation for each dimension: direct precedence . for the horizontal dimension and direct dominance &gt; for the vertical dimension (the precedence of two inner nodes is defined as the precedence of their leftmost terminal successors (Lezius and König, 2000a)). Some convenient derived node relations are the following:</p>
                    <p>&gt;* dominance (minimum path length 1)</p>
                    <p>&gt;n dominance in n steps (n &gt; 0)</p>
                    <p>&gt;m,n dominance between m and n steps (0 &lt; m &lt; n)</p>
                    <p>&gt;@l leftmost terminal successor ('left corner')</p>
                    <p>&gt;@r rightmost terminal successor ('right corner')</p>
                    <p>.* precedence (minimum number of intervals: 1)</p>
                    <p>.n precedence with n intervals (n &gt; 0)</p>
                    <p>.m,n precedence between m and n intervals (0 &lt; m &lt; n)</p>
                    <p>$ siblings</p>
                    <p>$.* siblings with precedence</p>
                    <p>2.4 Graph descriptions</p>
                    <p>We admit restricted Boolean expressions over node relations, i.e. conjunction and disjunction, but no negation. For example, the queries</p>
                    <p>#n1:[word=&quot;ein&quot; &amp; pos=&quot;ART&quot;] &amp; #n2:[word=&quot;Mann&quot; &amp; pos=&quot;NN&quot;] &amp; #n1 . #n2</p>
                    <p>and</p>
                    <p>#n1:[cat=&quot;NP&quot;] &gt;&quot;NK&quot; [pos=&quot;ART&quot;] &amp; #n1 &gt;&quot;NK&quot; [word=&quot;Mann&quot;]</p>
                    <p>are both satisfied by the NP-constituent in Fig. 2. #n1, #n2 are variables. The symbol &quot;NK&quot; is an edge label. Edges can be labelled in order to indicate the syntactic relation between two nodes.</p>
                    <p>2.5 Types</p>
                    <p>For the purpose of conceptual clarity, the user can define type hierarchies. 'Subtypes' may also be constants, e.g. in the case of part-of-speech symbols. Here is an excerpt from the type hierarchy for the STTS tagset:</p>
                    <p>nominal := noun, properNoun, pronoun.</p>
                    <p>noun := &quot;NN&quot;.</p>
                    <p>properNoun := &quot;NE&quot;.</p>
                    <p>pronoun := &quot;PPER&quot;, &quot;PPOS&quot;, &quot;PRELS&quot;, ... .</p>
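                    <p>As a rough illustration of why such a type hierarchy can be compiled off-line, the Python sketch below expands a type into the set of constants it subsumes. The dictionary encoding and the function name expand are hypothetical; the list of pronoun tags is elided in the excerpt above, so only tags that appear there (read as the STTS-style tags PPER, PPOS, PRELS) are included, and the hierarchy is assumed to be free of cycles.</p>
                    <p>
```python
# Illustrative sketch; the encoding is an assumption, not the paper's format.
# Each type maps to its subtypes; names that are not keys count as constants.
# The pronoun entry is truncated in the source ("..."), so it is partial here.
HIERARCHY = {
    "nominal": ["noun", "properNoun", "pronoun"],
    "noun": ["NN"],
    "properNoun": ["NE"],
    "pronoun": ["PPER", "PPOS", "PRELS"],
}


def expand(symbol, hierarchy):
    """Return the set of constants subsumed by a type (or the constant itself)."""
    if symbol not in hierarchy:
        return {symbol}
    constants = set()
    for sub in hierarchy[symbol]:
        constants |= expand(sub, hierarchy)
    return constants


nominal_tags = expand("nominal", HIERARCHY)
print(sorted(nominal_tags))
```
                    </p>
                    <p>After such an expansion, a constraint on a type reduces to a disjunction of constants, so no type reasoning is needed at query time.</p>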
                    <p>This hierarchy can be used to formulate queries in a more concise manner: [pos=nominal] .* [pos=&quot;VVFIN&quot;]</p>
                    <p>2.6 Templates</p>
                    <p>For a concrete lexicon acquisition task, for example, one might have to define a collection of interdependent, complex queries. In order to keep the resulting code tractable and reusable, queries can be organised into templates (or macros). Templates can take logical variables as arguments and may refer to other templates, as long as there is no (embedded) self-reference. Logically, templates are offline-compilable Horn formulas.</p>
                    <p>Here are some examples for template definitions. A simple notion of VerbPhrase is defined with reference to a notion of PrepPhrase.</p>
                    <p>PrepPhrase(#n0) &lt;-</p>
                    <p>#n0:[cat=&quot;PP&quot;] &gt; #n1:[pos=&quot;APPR&quot;] &amp;</p>
                    <p>#n0 &gt; #n2:[pos=&quot;NE&quot;] &amp;</p>
                    <p>#n1.#n2 ;</p>
                    <p>VerbPhrase(#n0) &lt;-</p>
                    <p>#n0:[cat=&quot;VP&quot;] &gt; #n1:[pos=&quot;VVFIN&quot;] &amp;</p>
                    <p>#n0 &gt; #n2 &amp;</p>
                    <p>#n1.#n2 &amp;</p>
                    <p>PrepPhrase(#n2) ;</p>
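                    <p>To make the node relations and graph descriptions of this section more concrete, the sketch below encodes the simple syntax graph of Fig. 2 in Python and evaluates the two example queries of Sect. 2.4 against it. Node identifiers and edge labels follow the annotation of Fig. 2 given in Section 3; all function names and the data layout are assumptions for illustration and do not reflect the actual query processor. Precedence of inner nodes is computed from their leftmost terminal successors, as defined in Sect. 2.3.</p>
                    <p>
```python
# Illustrative sketch; helper names and data layout are assumptions.

# Nodes of Fig. 2 as flat feature records (1-3 terminals, 4-5 inner nodes).
NODES = {
    1: {"word": "ein", "pos": "ART"},
    2: {"word": "Mann", "pos": "NN"},
    3: {"word": "läuft", "pos": "VVFIN"},
    4: {"cat": "NP"},
    5: {"cat": "S"},
}
TERMINALS = [1, 2, 3]  # surface order, left to right
# Labelled edges as (parent, label, child).
EDGES = [(5, "SB", 4), (5, "HD", 3), (4, "NK", 1), (4, "NK", 2)]


def children(node):
    return [c for (p, _, c) in EDGES if p == node]


def directly_dominates(parent, child, label=None):
    """Basic relation '>' with an optional edge label."""
    return any(p == parent and c == child and label in (None, lab)
               for (p, lab, c) in EDGES)


def dominates(ancestor, node):
    """Derived relation '>*': a path of length at least 1."""
    return any(c == node or dominates(c, node) for c in children(ancestor))


def left_corner(node):
    """Leftmost terminal successor; a terminal is its own left corner."""
    if node in TERMINALS:
        return node
    return min((left_corner(c) for c in children(node)), key=TERMINALS.index)


def directly_precedes(a, b):
    """Basic relation '.', computed on leftmost terminal successors."""
    return TERMINALS.index(left_corner(b)) - TERMINALS.index(left_corner(a)) == 1


def satisfies(node, constraint):
    return all(NODES[node].get(f) == v for f, v in constraint.items())


def find(constraint):
    return [n for n in NODES if satisfies(n, constraint)]


# First query: an article "ein" directly preceding a noun "Mann".
q1 = [(a, b)
      for a in find({"word": "ein", "pos": "ART"})
      for b in find({"word": "Mann", "pos": "NN"})
      if directly_precedes(a, b)]

# Second query: an NP with an NK-child tagged ART and an NK-child "Mann".
q2 = [n for n in find({"cat": "NP"})
      if any(directly_dominates(n, c, "NK") and satisfies(c, {"pos": "ART"})
             for c in NODES)
      and any(directly_dominates(n, c, "NK") and satisfies(c, {"word": "Mann"})
              for c in NODES)]

print(q1, q2)
```
                    </p>
                    <p>Both queries select material inside the NP node (node 4), which matches the statement in Sect. 2.4 that they are satisfied by the NP-constituent of Fig. 2. A realistic implementation would index the corpus rather than scan all nodes.</p>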
                </div2>
            </div1>
            <note n="2.2" place="below">Nodes: feature records</note>
            <div1>
                <head xml:id="sec3">The Corpus Annotation Language</head>
                <p>3.1 Corpus annotation vs. queries</p>
                <p>Actually, the query language is rather a description language which can also be used for encoding the syntactic annotation of a corpus. In the current project, a syntactically disambiguated corpus is being produced. This means that, for corpus annotation, only a sublanguage of the proposed language is admissible, with the following restrictions:</p>
                <p>• The graph constraints may only include the basic node relations (&gt;, .).</p>
                <p>• The only logical connective on all structural levels is the conjunction operator &amp;.</p>
                <p>• Regular expressions are not admitted.</p>
                <p>• Types and templates are not admitted.</p>
                <p>The automatically generated corpus annotation code (generated from the output of the graphical annotation interface) for Fig. 2 looks as follows, with some additional markup for ease of processing.</p>
                <p>&lt;sentence id=&quot;1&quot; root=&quot;5&quot;&gt;</p>
                <p>&quot;1&quot;: [word=&quot;ein&quot; &amp; pos=&quot;ART&quot;] &amp;</p>
                <p>&quot;2&quot;: [word=&quot;Mann&quot; &amp; pos=&quot;NN&quot;] &amp;</p>
                <p>&quot;3&quot;: [word=&quot;läuft&quot; &amp; pos=&quot;VVFIN&quot;] &amp;</p>
                <p>&quot;4&quot;: [cat=&quot;NP&quot;] &amp;</p>
                <p>&quot;5&quot;: [cat=&quot;S&quot;] &amp;</p>
                <p>(&quot;1&quot; . &quot;2&quot;) &amp; (&quot;2&quot; . &quot;3&quot;) &amp;</p>
                <p>(&quot;5&quot; &gt;&quot;SB&quot; &quot;4&quot;) &amp; (&quot;5&quot; &gt;&quot;HD&quot; &quot;3&quot;) &amp;</p>
                <p>(&quot;4&quot; &gt;&quot;NK&quot; &quot;1&quot;) &amp; (&quot;4&quot; &gt;&quot;NK&quot; &quot;2&quot;)</p>
                <p>3.2 An XML representation</p>
                <p>When designing the architecture of our system, we had to deal with the problem of various different formats for the representation of syntactically annotated corpora: Penn Treebank, NeGra (Skut et al., 1997), Tipster, Susanne, several formats for chunked texts and the proposed description language. Thus, we have developed an XML-based format which guarantees maximum portability (Mengel and Lezius, 2000). An online conversion tool (NeGra, Penn Treebank -&gt; XML) is available on our project homepage.</p>
            </div1>
            <note n="1059" place="below"></note>
            <div1>
                <head xml:id="sec4:">Formal Semantics</head>
                <p>Compared to most other corpus description and corpus query languages, our graph description language comes with a formal and clear-cut operational semantics, which has been described in a technical report (Lezius and König, 2000a). The semantics has been compiled from the corresponding parts of the formal semantics of the typed, unification-based grammar formalisms and constraint-based logic programming languages which have been cited above. Due to the fact that the corpus and the query are represented in the same description language, one can define a consequence relation between the corpus and the query. Essentially, the annotated corpus corresponds to a Prolog database, and the corpus query to a Prolog query. A query result is a syntax graph from the corpus.</p>
            </div1>
            <div1>
                <head xml:id="sec5">Implementation</head>
                <p>One might argue that commercial and research implementations for structurally annotated texts are already available, i.e. XML-retrieval systems, cf. (LTG, 1999). However, we intend to solve problems which are specific to natural language descriptions: non-embedding (non-tree-like) structural annotations (crossing edges) and, in the long term, retrieval of coindexed substructures (co-reference phenomena). A domain-specific implementation of the search engine gives the basis for optimizations with respect to linguistic applications (Lezius and König, 2000b).</p>
                <p>Before queries can be evaluated on a new corpus (encoded in the NeGra, Penn Treebank or XML format), a preprocessing tool has to convert it into the format of the description language. Subsequently, the corpus is indexed in order to guarantee efficient lookups during the query evaluation. The query processor to date is capable of evaluating basic queries (cf. Sect. 2.2-2.4). To support all popular platforms, the tool is implemented in Java. There is a servlet available on the project web page which illustrates the current stage of the implementation.</p>
            </div1>
            <div1>
                <head>Conclusion</head>
                <p>Syntactic corpus annotations, complex corpus queries and computational grammars have one common point: they are descriptions of natural language grammars. Our claim is that corpus query languages should be close to traditional grammar formalisms in order to make complicated information extraction tasks easier to encode. The level of processing efficiency of scripting languages can still be reached if one restricts oneself to 'off-line' compilable language elements only.</p>
            </div1>
            <div1>
                <head>References</head>
                <p>Franz Beil, Glenn Carroll, Detlef Prescher, Stefan Riezler, and Mats Rooth. 1999. Inside-outside estimation of a lexicalized PCFG for German. In Proceedings of the 37th Annual Meeting of the ACL, Maryland.</p>
                <p>Steven Bird, Peter Buneman, and Tan Wang-Chiew. 2000. Towards a query language for annotation graphs. In Proceedings of the LREC 2000, Athens, Greece.</p>
                <p>Bob Carpenter. 1992. The Logic of Typed Feature Structures. Tracts in Theoretical Computer Science. Cambridge University Press, Cambridge.</p>
                <p>Ann Copestake. 1999. The (new) LKB system. www-csli.stanford.edu/~aac/doc5-2.pdf</p>
                <p>Jochen Dörre and Michael Dorna. 1993. CUF - a formalism for linguistic knowledge representation. Deliverable R.1.2A, DYANA 2, August.</p>
                <p>Jochen Dörre, Dov M. Gabbay, and Esther König. 1996. Fibred semantics for feature-based grammar logic. Journal of Logic, Language, and Information. Special Issue on Language and Proof Theory, 5:387-422.</p>
                <p>Denys Duchier and Joachim Niehren. 1999. Solving dominance constraints with finite set constraint programming. Technical report, Universität des Saarlandes, Programming Systems Lab.</p>
                <p>Martin Emele and Rémi Zajac. 1990. A fixed-point semantics for feature type systems. In Proceedings of the 2nd International Workshop on Conditional and Typed Rewriting Systems, Montreal, Canada.</p>
                <p>Markus Höhfeld and Gert Smolka. 1988. Definite relations over constraint languages. LILOG-Report 53, IBM Deutschland, Stuttgart, Baden-Württemberg, October.</p>
                <p>Wolfgang Lezius and Esther König. 2000a. The TIGER language - a description language for syntax graphs. Internal report, IMS, University of Stuttgart.</p>
                <p>Wolfgang Lezius and Esther König. 2000b. Towards a search engine for syntactically annotated corpora. In Proceedings of the KONVENS 2000, Ilmenau, Germany.</p>
                <p>LTG Language Technology Group, Edinburgh. 1999. LT XML version 1.1. User documentation and reference guide. www.ltg.ed.ac.uk/software/xml</p>
                <p>Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics.</p>
                <p>Andreas Mengel and Wolfgang Lezius. 2000. An XML-based representation format for syntactically annotated corpora. In Proceedings of the LREC 2000, Athens, Greece.</p>
                <p>Wojciech Skut, Brigitte Krenn, Thorsten Brants, and Hans Uszkoreit. 1997. An annotation scheme for free word order languages. In Proceedings of the 5th Conference on Applied Natural Language Processing (ANLP), Washington, D.C., March.</p>
                <p>Wojciech Skut, Thorsten Brants, Brigitte Krenn, and Hans Uszkoreit. 1998. A linguistically interpreted corpus of German newspaper text. In ESSLLI 1998, Workshop on Recent Advances in Corpus Annotation.</p>
                <p>UCL University College London. 2000. ICE (International Corpus of English).</p>
            </div1>
        </body>
        <back/>
    </text>
</TEI>
