A Description Language for Syntactically Annotated Corpora

Esther König and Wolfgang Lezius


Abstract

This paper introduces a description language for syntactically annotated corpora which allows for encoding both the syntactic annotation of a corpus and the queries to a syntactically annotated corpus. In terms of descriptive adequacy and computational efficiency, the description language is a compromise between script-like corpus query languages and high-level, typed unification-based grammar formalisms.

1 Introduction

Syntactically annotated corpora like the Penn Treebank (Marcus et al., 1993), the NeGra corpus (Skut et al., 1998) or the statistically disambiguated parses in (Beil et al., 1999) provide a wealth of information, which can only be exploited with an adequate query language. For example, one might want to retrieve verbs with their sentential complements, or specific fronting or extraposition phenomena. So far, queries to a treebank have been formulated in scripting languages like tgrep, Perl or others. Recently, some powerful query languages have been developed: an example of a high-level, constraint-based language is described in (Duchier and Niehren, 1999). (Bird et al., 2000) propose a query language for the general concept of annotation graphs. A graphical query notation for trees is under development in the ICE project (UCL, 2000).

In the current paper, we present a proposal for a graph description language which is meant to fulfill two conflicting requirements: On the one hand, the language should be close to traditional linguistic description languages, i.e. to grammar formalisms, as a basis for modular, understandable code, even for complex corpus queries. On the other hand, the language should not preclude efficient query evaluation. Our answer is to profit from the research on typed, feature-based/constraint-based grammar formalisms (e.g. (Carpenter, 1992), (Copestake, 1999), (Dörre and Dorna, 1993), (Dörre et al., 1996), (Emele and Zajac, 1990), (Höhfeld and Smolka, 1988)), and to pick those ingredients which are known to be computationally 'tractable' in some sense.

2 The Query Language

2.1 The right kind of graphs

If syntactic analysis is meant to provide a basis for semantic interpretation, the predicate-argument structure of a sentence must be recoverable from its syntactic analysis. Nonlocal dependencies like topicalization or right extraposition tell us that trees are not expressive enough. We need a way to connect an extraposed constituent with its syntactic or semantic head. This can be done either by introducing empty leaf nodes plus a means for node coreference (as in the Penn Treebank) or by admitting crossing edges. In our project, the latter solution has been chosen (Skut et al., 1997), partly for the reason that it is simpler to annotate (no decision on the right place of a trace has to be taken). We call this extension of trees with crossing edges syntax graphs. An example is shown in Fig. 1.
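One way to see why crossing edges go beyond trees is to look at the terminal yield of a node: in an ordinary tree every yield is a contiguous stretch of the sentence, whereas an extraposed daughter leaves a gap. The following Python sketch illustrates this on a toy graph. The graph, the node names and the helper functions are assumptions made for the illustration; it is not the graph of Fig. 1 and not part of the proposed language.

```python
# Illustrative sketch only: a toy graph, not the graph of Fig. 1.
# `children` maps a hypothetical inner node to its daughters; terminals are
# represented by their left-to-right position in the sentence.

children = {
    "S": ["NP", 2],     # S dominates NP and the terminal at position 2
    "NP": [0, 1, 3],    # position 3 is extraposed to the right of position 2
}


def terminal_yield(node):
    """All terminal positions dominated by `node`, in left-to-right order."""
    if isinstance(node, int):
        return [node]
    return sorted(p for child in children[node] for p in terminal_yield(child))


def is_discontinuous(node):
    """True if the yield of `node` has a gap, which signals a crossing edge."""
    positions = terminal_yield(node)
    return positions != list(range(positions[0], positions[-1] + 1))


print(is_discontinuous("NP"))  # the NP skips position 2
print(is_discontinuous("S"))   # the S node covers every position
```

A trace-based encoding would instead keep every yield contiguous and add an empty leaf plus a coreference link, which is the alternative the text attributes to the Penn Treebank.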

In order to discuss the details of the language, we will make reference to the simpler syntax graph in Fig. 2.

[Figure 1: A syntax graph with crossing edges for the sentence "Die Tagung hat mehr Teilnehmer als je zuvor" ("the conference has more participants than ever before")]

[Figure 2: A simple syntax graph for "ein Mann läuft" ("a man runs"), with the part-of-speech tags ART, NN and VVFIN]

2.2 Nodes: feature records

Syntactic phrases and lexical entries usually come with a bundle of morphosyntactic information like part-of-speech, case, gender, and number. In computational linguistics, feature structures are used for that purpose. Since we need only a way to represent morphosyntactic information (not syntactic or semantic structures themselves), we restrict ourselves to feature records, i.e. flat feature structures whose feature values are constants. We admit Boolean formulas for the feature values, as well as for the feature-value pairs themselves.

For example, all proper nouns ("NE") and nouns ("NN") can be retrieved by

    [pos="NE" | "NN"]
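The effect of a Boolean formula over feature values can be sketched in Python as follows. This is only an illustration: representing a feature record as a dict, and the helper names `any_of` and `matches`, are assumptions made for the example and not part of the description language or its implementation.

```python
# Illustrative sketch only: feature records as flat dicts with constant values.
# `any_of` and `matches` are hypothetical helper names, not part of the language.


def any_of(*constants):
    """Value formula for a disjunction of constants, e.g. "NE" | "NN"."""
    allowed = set(constants)
    return lambda value: value in allowed


def matches(record, description):
    """A record satisfies a description if every listed feature is present
    and its value satisfies the corresponding value formula."""
    return all(
        feature in record and test(record[feature])
        for feature, test in description.items()
    )


# Terminal nodes of the graph in Fig. 2.
terminals = [
    {"word": "ein", "pos": "ART"},
    {"word": "Mann", "pos": "NN"},
    {"word": "läuft", "pos": "VVFIN"},
]

# Counterpart of the query [pos="NE" | "NN"].
description = {"pos": any_of("NE", "NN")}
hits = [t["word"] for t in terminals if matches(t, description)]
print(hits)
```

Because the records are flat and the values are constants, checking a node against such a description needs no unification, only a lookup per feature.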

As usual, structural identity can be expressed by the use of logical variables. However, variables must not occur in the scope of negation, since this would introduce the computational overhead of inequality constraints.

The values of a feature with 'infinite' range like word or lemma can be referred to by regular expressions, e.g. the nouns ("NN") with initial M can be retrieved by

    [word = /^M.*/ & pos="NN"]

The /-symbols mark a regular expression.

2.3 Node relations

Since graphs are two-dimensional objects, we need one basic node relation for each dimension: direct precedence . for the horizontal dimension and direct dominance > for the vertical dimension (the precedence of two inner nodes is defined as the precedence of their leftmost terminal successors (Lezius and König, 2000a)). Some convenient derived node relations are the following:

    >*      dominance (minimum path length 1)
    >n      dominance in n steps (n > 0)
    >m,n    dominance between m and n steps (0 < m < n)
    >@l     leftmost terminal successor ('left corner')
    >@r     rightmost terminal successor ('right corner')
    .*      precedence (minimum number of intervals: 1)
    .n      precedence with n intervals (n > 0)
    .m,n    precedence between m and n intervals (0 < m < n)
    $       siblings
    $.*     siblings with precedence

2.4 Graph descriptions

We admit restricted Boolean expressions over node relations, i.e. conjunction and disjunction, but no negation. For example, the queries

    #n1:[word="ein" & pos="ART"] &
    #n2:[word="Mann" & pos="NN"] &
    #n1 . #n2

and

    #n1:[cat="NP"] >"NK" [pos="ART"] &
    #n1 >"NK" [word="Mann"]

are both satisfied by the NP-constituent in Fig. 2. #n1 and #n2 are variables. The symbol "NK" is an edge label. Edges can be labelled in order to indicate the syntactic relation between two nodes.

2.5 Types

For the purpose of conceptual clarity, the user can define type hierarchies. 'Subtypes' may also be constants, e.g. as in the case of part-of-speech symbols. Here is an excerpt from the type hierarchy for the STTS tagset:

    nominal := noun, properNoun, pronoun.
    noun := "NN".
    properNoun := "NE".
    pronoun := "PPER", "PPOS", "PRELS", ... .

This hierarchy can be used to formulate queries in a more concise manner:

    [pos=nominal] .* [pos="VVFIN"]

2.6 Templates

For a concrete lexicon acquisition task, for example, one might have to define a collection of interdependent, complex queries. In order to keep the resulting code tractable and reusable, queries can be organised into templates (or macros). Templates can take logical variables as arguments and may refer to other templates, as long as there is no (embedded) self reference. Logically, templates are offline-compilable Horn formulas.

Here are some examples of template definitions. A simple notion of VerbPhrase is defined with reference to a notion of PrepPhrase.

    PrepPhrase( #n0:[cat="PP"]
                  > #n1:[pos="APPR"] &
                #n0 > #n2:[pos="NE"] &
                #n1.#n2 ) ;

    VerbPhrase( #n0:[cat="VP"]
                  > #n1:[pos="VVFIN"] &
                #n0 > #n2 &
                #n1.#n2 ) <-
      PrepPhrase(#n2) ;
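Since templates may not refer to themselves, every template call can be unfolded into a flat conjunction of constraints before the corpus is consulted. The following Python sketch shows one possible way to do this. The data layout, the function name `expand`, and the assumption that #n0 acts as the parameter of each template are made up for the illustration and are not taken from the actual implementation.

```python
# Illustrative sketch only: templates as non-recursive macros that are
# unfolded before any corpus lookup. The layout of TEMPLATES and the
# function `expand` are assumptions made for this example.
import itertools
import re

TEMPLATES = {
    # name: (formal parameters, constraints, calls to other templates)
    "PrepPhrase": (
        ["#n0"],
        ['#n0:[cat="PP"] > #n1:[pos="APPR"]', '#n0 > #n2:[pos="NE"]', "#n1.#n2"],
        [],
    ),
    "VerbPhrase": (
        ["#n0"],
        ['#n0:[cat="VP"] > #n1:[pos="VVFIN"]', "#n0 > #n2", "#n1.#n2"],
        [("PrepPhrase", ["#n2"])],
    ),
}


def expand(name, args, active=(), counter=None):
    """Unfold a template call into a flat list of constraints."""
    if counter is None:
        counter = itertools.count(1)
    if name in active:
        raise ValueError(f"self reference in template {name}")
    params, constraints, calls = TEMPLATES[name]
    binding = dict(zip(params, args))
    suffix = next(counter)

    def rename(match):
        # Parameters take the caller's variables; local variables get a
        # suffix so that different expansions cannot clash.
        var = match.group(0)
        return binding.get(var, f"{var}_{suffix}")

    def substitute(text):
        return re.sub(r"#\w+", rename, text)

    result = [substitute(c) for c in constraints]
    for callee, call_args in calls:
        actual = [substitute(a) for a in call_args]
        result += expand(callee, actual, active + (name,), counter)
    return result


for constraint in expand("VerbPhrase", ["#vp"]):
    print(constraint)
```

The `active` tuple records the templates currently being unfolded, so a direct or embedded self reference is rejected instead of looping forever.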

3 The Corpus Annotation Language

3.1 Corpus annotation vs. queries

Actually, the query language is rather a description language which can also be used for encoding the syntactic annotation of a corpus. In the current project, a syntactically disambiguated corpus is being produced. This means that, for corpus annotation, only a sublanguage of the proposed language is admissible, with the following restrictions:

• The graph constraints may only include the basic node relations (>, .).
• The only logical connective on all structural levels is the conjunction operator &.
• Regular expressions are not admitted.
• Types and templates are not admitted.
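The four restrictions above amount to a simple syntactic check. The Python sketch below performs that check on descriptions that are modelled as nested tuples. This small abstract syntax and the function name `admissible_for_annotation` are assumptions made for the illustration; the paper does not specify how descriptions are represented internally.

```python
# Illustrative sketch only. Descriptions are modelled as nested tuples; this
# tiny abstract syntax and the function name are assumptions for the example.

BASIC_RELATIONS = {">", "."}  # direct dominance and direct precedence


def admissible_for_annotation(expr):
    """True if `expr` uses only constructs allowed in corpus annotation."""
    kind = expr[0]
    if kind == "and":    # conjunction is the only admitted connective
        return all(admissible_for_annotation(e) for e in expr[1:])
    if kind == "rel":    # ("rel", operator, left node, right node)
        return expr[1] in BASIC_RELATIONS
    if kind == "node":   # ("node", id, ((feature, value), ...))
        return all(admissible_for_annotation(value) for _, value in expr[2])
    if kind == "const":  # quoted constant such as "NN"
        return True
    # Disjunction, regular expressions, types and templates are query-only.
    return False


annotation = (
    "and",
    ("node", "1", (("word", ("const", "ein")), ("pos", ("const", "ART")))),
    ("node", "2", (("word", ("const", "Mann")), ("pos", ("const", "NN")))),
    ("rel", ".", "1", "2"),
)
query = (
    "and",
    ("node", "#n1", (("pos", ("type", "nominal")),)),
    ("rel", ".*", "#n1", "#n2"),
)

print(admissible_for_annotation(annotation))
print(admissible_for_annotation(query))
```

The second description is rejected twice over: it uses a type as a feature value and a derived precedence relation.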

The automatically generated corpus annotation code (generated from the output of the graphical annotation interface) for Fig. 2 looks as follows, with some additional markup for ease of processing.

    <sentence id="1" root="5">
    "1": [word="ein" & pos="ART"] &
    "2": [word="Mann" & pos="NN"] &
    "3": [word="läuft" & pos="VVFIN"] &
    "4": [cat="NP"] &
    "5": [cat="S"] &
    ("1" . "2") & ("2" . "3") &
    ("5" >"SB" "4") & ("5" >"HD" "3") &
    ("4" >"NK" "1") & ("4" >"NK" "2")
    </sentence>

3.2 An XML representation

When designing the architecture of our system, we had to deal with the problem of various different formats for the representation of syntactically annotated corpora: Penn Treebank, NeGra (Skut et al., 1997), Tipster, Susanne, several formats for chunked texts and the proposed description language. Thus, we have developed an XML based format which guarantees maximum portability (Mengel and Lezius, 2000). An online conversion tool (NeGra, Penn Treebank → XML) is available on our project homepage.

4 Formal Semantics

Compared to most other corpus description and corpus query languages, our graph description language comes with a formal and a clear-cut operational semantics, which has been described in a technical report (Lezius and König, 2000a). The semantics has been compiled from the corresponding parts of the formal semantics of the typed, unification-based grammar formalisms and constraint-based logic programming languages which have been cited above. Due to the fact that the corpus and the query are represented in the same description language, one can define a consequence relation between the corpus and the query. Essentially, the annotated corpus corresponds to a Prolog database, and the corpus query to a Prolog query. A query result is a syntax graph from the corpus.
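The database analogy can be made concrete with a small Python sketch. The graph of Fig. 2 is stored as a set of facts, and a query is a conjunction of goals for which all satisfying variable bindings are enumerated. The fact layout, the helper names `has` and `solve`, and the variable names `a` and `m` for the two anonymous nodes are assumptions made for the illustration. The exhaustive enumeration only stands in for proper resolution and says nothing about how the actual query processor works.

```python
# Illustrative sketch only: the corpus as a small fact base and a query as a
# conjunction of goals, in the spirit of the Prolog analogy. All names are
# assumptions made for this example.
from itertools import product

# Facts for the graph in Fig. 2 (node ids as in the annotation in Sect. 3.1).
NODES = {
    "1": {"word": "ein", "pos": "ART"},
    "2": {"word": "Mann", "pos": "NN"},
    "3": {"word": "läuft", "pos": "VVFIN"},
    "4": {"cat": "NP"},
    "5": {"cat": "S"},
}
DOMINATES = {("5", "SB", "4"), ("5", "HD", "3"), ("4", "NK", "1"), ("4", "NK", "2")}


def has(node, **features):
    """Goal: the feature record of `node` contains the given pairs."""
    return all(NODES[node].get(f) == v for f, v in features.items())


def solve(variables, goal):
    """Enumerate all bindings of `variables` to nodes that satisfy `goal`."""
    for values in product(sorted(NODES), repeat=len(variables)):
        binding = dict(zip(variables, values))
        if goal(binding):
            yield binding


def query(b):
    """#n1:[cat="NP"] >"NK" [pos="ART"] & #n1 >"NK" [word="Mann"]"""
    return (
        has(b["n1"], cat="NP")
        and (b["n1"], "NK", b["a"]) in DOMINATES
        and has(b["a"], pos="ART")
        and (b["n1"], "NK", b["m"]) in DOMINATES
        and has(b["m"], word="Mann")
    )


answers = list(solve(["n1", "a", "m"], query))
print(answers)
```

Each binding identifies the nodes of the matching syntax graph, which corresponds to the statement that a query result is a syntax graph from the corpus.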

5 Implementation

One might argue that commercial and research implementations for structurally annotated texts are already available, i.e. XML-retrieval systems, e.g. (LTG, 1999). However, we intend to solve problems which are specific to natural language descriptions: non-embedding (non-tree-like) structural annotations (crossing edges) and, in the long term, retrieval of coindexed substructures (co-reference phenomena). A domain-specific implementation of the search engine gives the basis for optimizations wrt. linguistic applications (Lezius and König, 2000b).

Before queries can be evaluated on a new corpus (encoded in the NeGra, Penn Treebank or XML format), a preprocessing tool has to convert it into the format of the description language. Subsequently, the corpus is indexed in order to guarantee efficient lookups during the query evaluation. The query processor to date is capable of evaluating basic queries (cf. Sect. 2.2-2.4). To support all popular platforms, the tool is implemented in Java. There is a servlet available on the project web page which illustrates the current stage of the implementation.

6 Conclusion

Syntactic corpus annotations, complex corpus queries and computational grammars have one common point: they are descriptions of natural language grammars. Our claim is that corpus query languages should be close to traditional grammar formalisms in order to make complicated information extraction tasks easier to encode. The level of processing efficiency of scripting languages can still be reached if one restricts oneself to 'off-line' compilable language elements only.

References

Franz Beil, Glenn Carroll, Detlef Prescher, Stefan Riezler, and Mats Rooth. 1999. Inside-outside estimation of a lexicalized PCFG for German. In Proceedings of the 37th Annual Meeting of the ACL, Maryland.

Steven Bird, Peter Buneman, and Tan Wang-Chiew. 2000. Towards a query language for annotation graphs. In Proceedings of the LREC 2000, Athens, Greece.

Bob Carpenter. 1992. The Logic of Typed Feature Structures. Tracts in Theoretical Computer Science. Cambridge University Press, Cambridge.

Ann Copestake. 1999. The (new) LKB system. www-csli.stanford.edu/~aac/doc5-2.pdf

Jochen Dörre and Michael Dorna. 1993. CUF - a formalism for linguistic knowledge representation. Deliverable R.1.2A, DYANA 2, August.

Jochen Dörre, Dov M. Gabbay, and Esther König. 1996. Fibred semantics for feature-based grammar logic. Journal of Logic, Language, and Information. Special Issue on Language and Proof Theory, 5:387-422.

Denys Duchier and Joachim Niehren. 1999. Solving dominance constraints with finite set constraint programming. Technical report, Universität des Saarlandes, Programming Systems Lab.

Martin Emele and Rémi Zajac. 1990. A fixed-point semantics for feature type systems. In Proceedings of the 2nd International Workshop on Conditional and Typed Rewriting Systems, Montreal, Canada.

Markus Höhfeld and Gert Smolka. 1988. Definite relations over constraint languages. LILOG-Report 53, IBM Deutschland, Stuttgart, Baden-Württemberg, October.

Wolfgang Lezius and Esther König. 2000a. The TIGER language - a description language for syntax graphs. Internal report, IMS, University of Stuttgart.

Wolfgang Lezius and Esther König. 2000b. Towards a search engine for syntactically annotated corpora. In Proceedings of the KONVENS 2000, Ilmenau, Germany.

LTG Language Technology Group, Edinburgh. 1999. LT XML version 1.1. User documentation and reference guide. www.ltg.ed.ac.uk/software/xml

Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics.

Andreas Mengel and Wolfgang Lezius. 2000. An XML-based representation format for syntactically annotated corpora. In Proceedings of the LREC 2000, Athens, Greece.

Wojciech Skut, Brigitte Krenn, Thorsten Brants, and Hans Uszkoreit. 1997. An annotation scheme for free word order languages. In Proceedings of the 5th Conference on Applied Natural Language Processing (ANLP), Washington, D.C., March.

Wojciech Skut, Thorsten Brants, Brigitte Krenn, and Hans Uszkoreit. 1998. A linguistically interpreted corpus of German newspaper text. In ESSLLI 1998, Workshop on Recent Advances in Corpus Annotation.

UCL University College London. 2000. ICE (International Corpus of English).