Dependency Treebank for Russian: Concept, Tools, Types of Information

{lenya, nadya}@iitp.ru

• lemmatized texts: for every word, its normal form (lemma) and part of speech are indicated;
• morphologically tagged texts: for every word, a full set of morphological attributes is specified along with the lemma and the part of speech;
• syntactically tagged texts: apart from the full morphological markup at the word level, every sentence has a syntactic structure.

We annotate Russian texts with dependency structures, a formalism that is more suitable for Slavonic languages with their relatively free word order. The structure not only contains information on which words of the sentence are syntactically linked, but also assigns each link to one of several dozen syntactic types (at present, we use 78 syntactic relations). This formalism ensures a more complete and informative representation than that of any other syntactically annotated corpus. This is a major innovation, since the majority of syntactically annotated corpora, both those already available and those under construction, represent the syntactic structure by means of constituents. The closest analogue to our work is the Czech annotated corpus collected at Charles University in Prague; see Hajičová, Panevová, Sgall (1998). In this corpus, the syntactic data are also expressed in a dependency formalism, although the set of syntactic functional relations is much smaller: it has only 23 relations.

In what follows, we describe the types of texts used to create the corpus (Section 2), the markup format (Section 3), annotation tools and procedures (Section 4), and the types of linguistic data included in the markup (Section 5).

2. Source text selection

The well-known Uppsala University Corpus of contemporary Russian prose, totalling ca. 1,000,000 words, has been chosen as the primary source for our work. The Uppsala Corpus is well balanced between fiction and journalistic genres, with a smaller percentage of scientific and popular science texts.
The Corpus includes samples of contemporary Russian prose, as well as excerpts from newspapers and magazines of recent decades, and gives a representative coverage of written Russian in modern use. Conversational examples are scarce and appear as dialogues inside fiction texts.

The design principles were formulated as follows:

• "layered" markup: several annotation levels coexist and can be extracted or processed independently;
• incrementality: it should be easy to add higher annotation levels;
• convenient parsing of the annotated text by means of standard software packages.

The most natural solution that meets these criteria is an XML-based markup language. We have tried to make our format compatible with TEI (Text Encoding Initiative, see TEI Guidelines (1994)), introducing new elements or attributes only in situations where TEI markup does not provide adequate means to describe the text structure in the dependency grammar framework.

Listed below are the types of information about text structure that must be encoded in the markup, and the respective tags and attributes used to carry them.

a) Splitting of text into sentences. A special container element <S> (available in TEI) is used to delimit sentence boundaries. The element may have an (optional) ID attribute that supplies a unique identifier for the sentence within the text; this identifier may be used to store information about extra-sentential relations in the text. It may also have a COMMENT attribute, used by linguists to store observations about particular syntactic phenomena encountered in the sentence;

b) Splitting of sentences into lexical items. The words are delimited by a container element <W>. Like sentences, words may have a unique ID attribute that is used to reference the word within the sentence;

c) Ascribing morphological features to words. Morphological information is ascribed to the word by means of two attributes borne by the <W> tag: LEMMA - a normalized word form; FEAT - morphological features.

d) Storing information about the syntactic structure. To annotate the information about syntactic dependencies, we use two other attributes in the <W> element: DOM - the ID of the master word; LINK - syntactic function label.

There are also special provisions in the formalism to store auxiliary information, e.g. multiple morphological analyses and syntax trees.
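The element and attribute names listed above (<S>, <W>, ID, LEMMA, FEAT, DOM, LINK) are enough to read an annotated sentence with a standard XML library. The following is a minimal sketch in Python, not the authors' own tooling. The sample sentence, its feature strings, the English link labels, the placement of the word form as element text, and the `_root` value used for the top word are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative sample only: attribute VALUES below are invented for this
# sketch; only the element and attribute NAMES come from the markup format.
SAMPLE = """<S ID="1">
<W ID="1" LEMMA="PIS'MO" FEAT="S" DOM="3" LINK="subject">pis'mo</W>
<W ID="2" LEMMA="NE" FEAT="PART" DOM="3" LINK="negation">ne</W>
<W ID="3" LEMMA="PRIJTI" FEAT="V" DOM="_root">prishlo</W>
</S>"""

ROOT = "_root"  # assumed marker for the word that has no master


def read_sentence(xml_text):
    """Return (sentence_id, list of word records) for one <S> element."""
    sentence = ET.fromstring(xml_text)
    words = []
    for w in sentence.iter("W"):
        words.append({
            "id": w.get("ID"),
            "form": (w.text or "").strip(),
            "lemma": w.get("LEMMA"),
            "feat": w.get("FEAT"),
            "master": w.get("DOM"),   # ID of the master word
            "link": w.get("LINK"),    # syntactic function label, may be absent
        })
    return sentence.get("ID"), words


def dependents_of(words, master_id):
    """IDs of all words whose DOM attribute points at master_id."""
    return [w["id"] for w in words if w["master"] == master_id]


sent_id, words = read_sentence(SAMPLE)
for w in words:
    print(w["id"], w["form"], w["lemma"], w["feat"], w["master"], w["link"])
```

Because each annotation level lives in its own attribute, a consumer interested only in lemmas or only in dependencies can ignore the rest, which is one way to read the "layered" design principle.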
Such auxiliary data are expected to disappear from the final version of the corpus.

The procedure of corpus data acquisition is semi-automatic. An initial version of the markup is generated by a computer using a general-purpose morphological analyzer and syntactic parser engine; after that, the results of the automatic processing are submitted to human post-editing. The analysis engine (morphology and parsing) is based upon the ETAP-3 machine translation engine; see Apresjan et al. (1992, 1993).

To support the creation of annotated data, a set of tools was designed and implemented. All tools are Win32 applications written in C++. The tools available are:

• a program for sentence boundary markup, called Chopper;
• a post-editor for building, editing and managing syntactically annotated texts, called Structure Editor (or StrEd).

The amount of manual work required to build annotations depends on the complexity of the input data. StrEd offers different options for building structures. Most sentences can be reliably processed without any human intervention; in this case, a linguist should look through the processing result and confirm it. If the structure contains errors, the linguist can edit it using a user-friendly graphical interface (see screenshots below). If there are too many errors or no structure could be produced, the linguist may use a special split-and-run mode. This mode includes manual pre-chunking of the input phrase into pieces with a more transparent structure and applying the analyzer/parser to every chunk. Then the linguist must manually link the subtrees produced for every chunk into a single structure. If the linguist has encountered a very peculiar syntactic construction and is uncertain about the correct structure, he/she may mark as "doubtful" the whole sentence or single words whose functions are not completely clear.
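After split-and-run processing, the chunk subtrees must end up as one structure. A check of this kind can be sketched as follows; it is an illustration under stated assumptions, not a description of how StrEd works internally. The function name `tree_problems`, the dictionary representation (word ID mapped to master ID) and the `_root` marker are hypothetical.

```python
ROOT = "_root"  # assumed marker for the top word of a sentence


def tree_problems(masters):
    """Check that a word -> master mapping forms a single dependency tree.

    masters maps each word ID to the ID of its master, or to ROOT.
    Returns a list of human-readable problems (empty if the tree is fine).
    """
    problems = []

    # A complete structure has exactly one word without a master.
    roots = [w for w, m in masters.items() if m == ROOT]
    if len(roots) != 1:
        problems.append(f"expected exactly one root, found {len(roots)}")

    # Every master must be a word of the same sentence.
    for w, m in masters.items():
        if m != ROOT and m not in masters:
            problems.append(f"word {w} points to unknown master {m}")

    # Following masters upwards must never revisit a word.
    for start in masters:
        seen = set()
        node = start
        while node in masters and node not in seen:
            seen.add(node)
            node = masters[node]
        if node in seen:
            problems.append(f"word {start} lies on or leads into a cycle")

    return problems


# Two chunks parsed separately: each has its own root.
unlinked = {"1": "2", "2": ROOT, "3": "4", "4": ROOT}
# The same words after the second subtree is attached to word 2.
linked = {"1": "2", "2": ROOT, "3": "4", "4": "2"}

print(tree_problems(unlinked))
print(tree_problems(linked))
```

The unlinked example reports two roots, while the linked one reports no problems; a mapping in which two words name each other as masters would be reported as a cycle.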
Doubt marks of this kind are stored in the markup, and StrEd visualizes the respective sentence as one in need of further editing.

Fig. 1 presents the main dialog window for editing sentence properties. An operator can edit the markup directly, or edit single properties using a graphical interface. The source text under analysis is written in an edit window at the top: Xotja pis'mo ne bylo podpisano, ja mgnovenno dogadalsja, kto ego napisal [Although the letter was not signed, I instantly guessed who had written it]. The information about single words is written into a list: e.g. the first word xotja [although] has an identifier ID="1"; the lemmatized form is XOTJA; its feature list consists of a single feature, a part-of-speech characteristic (it is a conjunction); the word depends on a word with ID="8" by an adverbial relation (link type is "adverb").

Figure 1. Sentence Properties dialog in StrEd.

By double-clicking an item in the word list or pressing the button, a linguist can invoke dialog windows for editing properties of single words. However, the most convenient way of editing the structure consists in invoking a Tree Editor window, shown in Fig. 2 with the same sentence as in the previous picture.

The Tree Editor interface is simple and natural. Words of the source sentence are written on the left, their lemmas are put into gray rectangles, and their morphological features are written on the right. The syntactic relations are shown as arrows directed from the master to the slave; the link types are indicated in rounded rectangles on the arcs. All text fields except for the source sentence are editable in-place. Moreover, one can drag the rounded rectangles: dropping a rectangle on a word means that this word is declared the new master for the word from which the rectangle was dragged. A single right-button click on the lemma rectangle