<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:ns2="http://www.tei-c.org/ns/Examples">
    <teiHeader>
        <fileDesc>
            <titleStmt>
                <title>Dependency Treebank for Russian: Concept, Tools, Types of Information</title>
            </titleStmt>
        </fileDesc>
    </teiHeader>
    <text>
        <front>
            <div type="abs">
                <head>Abstract</head>
                <p>The paper describes a tagging scheme designed for the Russian Treebank, and presents tools used for corpus creation.</p>
                <p>1. Introductory Remarks</p>
                <p>The present paper describes a project aimed at developing the first annotated corpus of Russian texts. Large text corpora have long been used in the computational linguistics community: at present, over 20 large corpora for the main European languages are available, the largest of them containing hundreds of millions of words (Language Resources (1997); Marcus, Santorini, and Marcinkiewicz (1993); Kurohashi, Nagao (1998)). So far, however, no annotated corpora for Russian have been developed. To the best of our knowledge, the present project is the first attempt to fill the gap. Different tasks require different annotation levels that entail different amounts of additional information about text structure. The corpus that is being created in the framework of the present project consists of several subcorpora that differ by the level of annotation. The following three levels are envisaged:</p>
            </div>
        </front>
        <body>
            <div>
                <p>• lemmatized texts: for every word, its normal form (lemma) and part of speech are indicated;</p>
                <p>• morphologically tagged texts: for every word, a full set of morphological attributes is specified along with the lemma and the part of speech;</p>
                <p>• syntactically tagged texts: apart from the full morphological markup at the word level, every sentence has a syntax structure.</p>
                <p>We annotate Russian texts with dependency structures - a formalism that is more suitable for Slavonic languages with their relatively free word order. The structure not only contains information on which words of the sentence are syntactically linked, but also assigns each link to one of several dozen syntactic types (at present, we use 78 syntactic relations). This formalism ensures a more complete and informative representation than that of any other syntactically annotated corpus. This is a major innovation, since the majority of syntactically annotated corpora, both those already available and those under construction, represent the syntactic structure by means of constituents. The closest analogue to our work is the Czech annotated corpus collected at Charles University in Prague - see Hajicova, Panevova, Sgall (1998). In this corpus, the syntactic data are also expressed in a dependency formalism, although the set of syntactic functional relations is much smaller, as it only has 23 relations.</p>
                <p>In what follows, we describe the types of texts used to create the corpus (Section 2), markup format (Section 3), annotation tools and procedures (Section 4), and types of linguistic data included in the markup (Section 5).</p>
                <p>2. Source text selection</p>
                <p>The well-known Uppsala University Corpus of contemporary Russian prose, totalling ca. 1,000,000 words, has been chosen as the primary source for our work. The Uppsala Corpus is well balanced between fiction and journalistic genre, with a smaller percentage of scientific and popular science texts. 
The Corpus includes samples of contemporary Russian prose, as well as excerpts from newspapers and magazines of recent decades, and gives a representative coverage of written Russian in modern use. Conversational examples are scarce and appear as dialogues inside fiction texts.</p>
                <p>The design principles were formulated as follows:</p>
                <p>• &quot;layered&quot; markup - several annotation levels coexist and can be extracted or processed independently;</p>
                <p>• incrementality - it should be easy to add higher annotation levels;</p>
                <p>• convenient parsing of the annotated text by means of standard software packages.</p>
                <p>The most natural solution to meet these criteria is an XML-based markup language. We have tried to make our format compatible with TEI (Text Encoding Initiative, see TEI Guidelines (1994)), introducing new elements or attributes only in situations where TEI markup does not provide adequate means to describe the text structure in the dependency grammar framework.</p>
                <p>Listed below are the types of information about text structure that must be encoded in the markup, and the tags and attributes used to encode them.</p>
                <p>a) Splitting of text into sentences. A special container element &lt;S&gt; (available in TEI) is used to delimit sentence boundaries. The element may have an (optional) ID attribute that supplies a unique identifier for the sentence within the text; this identifier may be used to store information about extra-sentential relations in the text. It may also have a COMMENT attribute, used by linguists to store observations about particular syntactic phenomena encountered in the sentence;</p>
                <p>b) Splitting of sentences into lexical items. The words are delimited by a container element &lt;W&gt;. Like sentences, words may have a unique ID attribute that is used to reference the word within the sentence;</p>
                <p>c) Ascribing morphological features to words. Morphological information is ascribed to the word by means of two attributes borne by the &lt;W&gt; tag: LEMMA - a normalized word form; FEAT - morphological features.</p>
                <p>d) Storing information about the syntax structure. To annotate the information about syntactic dependencies, we use two other attributes in the &lt;W&gt; element: DOM - the ID of the master word; LINK - syntactic function label.</p>
                <p>There are also special provisions in the formalism to store auxiliary information, e.g. multiple morphological analyses and syntax trees. 
They are expected to disappear from the final version of the corpus.</p>
                <p>The procedure of corpus data acquisition is semiautomatic. An initial version of markup is generated by a computer using a general-purpose morphological analyzer and syntax parser engine; after that, the results of the automatic processing are submitted to human post-editing. The analysis engine (morphology and parsing) is based upon the ETAP-3 machine translation engine - see Apresjan et al. (1992, 1993). To support the creation of annotated data, a set of tools was designed and implemented. All tools are Win32 applications written in C++. The tools available are: • a program for sentence boundaries markup, called Chopper; • a post-editor for building, editing and managing syntactically annotated texts - Structure Editor (or StrEd).</p>
                <p>The amount of manual work required to build annotations depends on the complexity of the input data. StrEd offers different options for building structures. Most sentences can be reliably processed without any human intervention; in this case, a linguist should look through the processing result and confirm it. If the structure contains errors, the linguist can edit it using a user-friendly graphical interface (see screenshots below). If there are too many errors or no structure could be produced, the linguist may use a special split-and-run mode. This mode includes manual pre-chunking of the input phrase into pieces with a more transparent structure and applying the analyzer/parser to every chunk. Then the linguist must manually link the subtrees produced for every chunk into a single structure. If the linguist has encountered a very peculiar syntactic construction so that he/she is uncertain about the correct structure, he/she may mark as &quot;doubtful&quot; the whole sentence or single words whose functions are not completely clear. 
The information will be stored in the markup, and StrEd will visualize the respective sentence as one in need of further editing.</p>
                <p>Fig. 1 presents the main dialog window for editing sentence properties. An operator can edit the markup directly, or edit single properties using a graphical interface. The source text under analysis is written in an edit window at the top: Xotja pis'mo ne bylo podpisano, ja mgnovenno dogadalsja, kto ego napisal [Although the letter was not signed, I instantly guessed who had written it]. The information about single words is written into a list: e.g. the first word xotja [although] has an identifier ID=&quot;1&quot;; the lemmatized form is XOTJA; its feature list consists of a single feature - a part-of-speech characteristic (it is a conjunction); the word depends on a word with ID=&quot;8&quot; by an adverbial relation (link type is &quot;adverb&quot;). By double-clicking an item in the word list or pressing the button, a linguist can invoke dialog windows for editing properties of single words.</p>
                <p>Figure 1. Sentence Properties dialog in StrEd.</p>
                <p>However, the most convenient way of editing the structure consists in invoking a Tree Editor window, shown in Fig. 2 with the same sentence as in the previous picture. The Tree Editor interface is simple and natural. Words of the source sentence are written on the left, their lemmas are put into gray rectangles, and their morphological features are written on the right. The syntactic relations are shown as arrows directed from the master to the slave; the link types are indicated in rounded rectangles on the arcs. All text fields except for the source sentence are editable in-place. Moreover, one can drag the rounded rectangles: dropping one on a word means that this word is declared a new master for the word from which the rectangle was dragged. A single right-button click on the lemma rectangle</p>
            </div>
            <note n="3." place="below">Markup format</note>
            <note n="4." place="below">Annotation tools and procedures</note>
            <div1>
                <head xml:id="sec1"></head>
                <p>brings out the word properties dialog. All colors, sizes and fonts are customizable.</p>
                <p>Figure 2. Tree Editor dialog in StrEd.</p>
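                <p>As a rough illustration of the markup format of Section 3, the sketch below reads one sentence encoded with the S and W elements and the ID, LEMMA, FEAT, DOM and LINK attributes, and lists its dependency links. It is a minimal sketch in Python and not part of the corpus tools: the sample sentence encoding, the feature values, the link labels, the use of DOM=&quot;root&quot; for the top node and the helper name read_dependencies are all illustrative assumptions.</p>
                <p><code lang="python"><![CDATA[
```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the spirit of the format described in Section 3.
# Feature values, link labels and DOM="root" are assumptions for illustration.
SAMPLE = """
<S ID="1">
  <W ID="1" LEMMA="JA" FEAT="S" DOM="2" LINK="predicative">Ja</W>
  <W ID="2" LEMMA="POKUPAT'" FEAT="V" DOM="root">kupil</W>
  <W ID="3" LEMMA="RUBASHKA" FEAT="S" DOM="2" LINK="1-compl">rubashku</W>
</S>
"""


def read_dependencies(sentence_xml):
    """Return (root_id, links); links holds (master ID, link label, slave ID)."""
    sentence = ET.fromstring(sentence_xml.strip())
    root_id = None
    links = []
    for word in sentence.iter("W"):
        word_id = word.get("ID")
        master_id = word.get("DOM")
        if master_id == "root":
            # The top node of the tree has no master word.
            root_id = word_id
        else:
            # Every other word stores exactly one incoming link.
            links.append((master_id, word.get("LINK"), word_id))
    return root_id, links


root_id, links = read_dependencies(SAMPLE)
print("root:", root_id)
for master_id, label, slave_id in links:
    print(master_id, label, slave_id)
```
]]></code></p>
                <p>Because the master ID and the link label are stored on the dependent word, each W element describes exactly one incoming arc, so the whole tree can be recovered in a single pass over the sentence.</p>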
                <div2>
                    <head xml:id="sec5.">Types of linguistic information by level</head>
                    <p>Morphological information. The morphological analyzer ascribes features to every word. The feature set for Russian includes: part of speech, animacy, gender, number, case, degree of comparison, short form (of adjectives and participles), representation (of verbs), aspect, tense, person, voice.</p>
                    <p>Syntax information. As we have already mentioned, the result of the parsing is a tree composed of links. Links are binary and oriented; they link single words rather than syntactic groups. For every syntactic group, one word (head) is chosen to represent it as a slave in larger syntactic units; all other members of the group become slaves of the head. In a typical case, the number of nodes in the syntactic tree corresponds to the number of word tokens. However, several exceptional situations occur in which the number of nodes may be less or even greater than the number of word tokens. The latter case is especially interesting. We postulate such a description in the following cases:</p>
                    <p>a) Copulative sentences in the present tense where the auxiliary verb can be omitted. This is treated as a special &quot;zero-form&quot; of the copula, e.g. On - uchitel' [He is a teacher, lit. He - teacher]. The copula should be introduced in the syntactic representation.</p>
                    <p>b) Elliptical constructs (omitted members of contrasted coordinative expressions), as in Ja kupil rubashku, a on galstuk [I bought a shirt, and he bought a necktie, lit. I bought a shirt, and he a necktie].</p>
                    <p>The latter type of sentences should be discussed in more detail. Elliptical constructions are known to be one of the toughest problems in the formalization of natural language syntax. In our corpus, we decided to reconstruct the omitted elements in the syntactic trees, marking them with a special &quot;phantom&quot; feature. In the above example, a phantom node is inserted into the sentence between the words on 'he' and galstuk 'necktie'. This new node will have a lemma POKUPAT' [BUY] and will bear exactly the same morphological features as the wordform kupil [bought] physically present in the sentence, plus a special &quot;phantom&quot; marker. In certain cases, the feature set for the phantom may differ from that of the prototype, e.g. in a slightly modified phrase Ja kupil rubashku, a ona galstuk [I bought a shirt, and she (bought) a necktie] the phantom node will have the feminine gender, as required by the agreement with the subject of the second clause. Most real-life elliptical constructs can be represented in this way.</p>
                    <p>The inventory of syntactic relationship types generated by the ETAP-3 system is quite large: at present, we count 78 different syntactic function types. All relationships are divided into 6 major groups: actant, attributive, quantitative, adverbial, coordinative, auxiliary. For readers' convenience, we will give equivalent English examples.</p>
                    <p>Actant relationships link the predicate word to its arguments. Some examples ([X] - master, [Y] - slave):</p>
                    <p>predicative - Pete [Y] reads [X];</p>
                    <p>completive (1, 2, 3) - translate [X] the book [Y, 1-compl] from [Y1, 2-compl] English into [Y2, 3-compl] Russian.</p>
                    <p>Attributive relationships often link a noun to a modifier expressed by an adjective, another noun, a participle clause, etc.:</p>
                    <p>relative - The house [X] we live [Y] in.</p>
                    <p>Quantitative relationships link a noun to a word with quantity semantics, or two such words one to another:</p>
                    <p>quantitative - five [Y] pages [X];</p>
                    <p>auxiliary-quantitative - thirty [Y] five [X];</p>
                    <p>Adverbial relationships link the predicate word to various adverbial modifiers:</p>
                    <p>adverbial - come [X] in the evening [Y];</p>
                    <p>parenthetic - In my opinion [Y], that's [X] right.</p>
                    <p>Coordinative relationships serve for clauses coordinated by conjunctions:</p>
                    <p>coordinative - buy apples [X] and pears [Y];</p>
                    <p>coordinative-conjunctive - buy apples and [X] pears [Y].</p>
                    <p>Auxiliary relationships typically link two elements that form a single syntactic unit:</p>
                    <p>analytical - will [X] buy [Y];</p>
                    <p>The list of syntactic relations is not closed. The process of data acquisition brings up a variety of rare syntactic constructions, hardly covered by traditional grammars. In some cases, this has led to the introduction of new syntactic link types in order to reflect the semantic relation between single words and make the syntactic structure unambiguous.</p>
                    <p>Conclusion</p>
                    <p>Corpus creation is not yet completed: at present, the full syntactic markup has been generated for 4,000 sentences (55,000 words), which constitutes 30% of the total amount planned. Our approach makes it possible to include all information expressed by morphological and syntactic means in contemporary Russian. We expect that the new corpus will stimulate a broad range of further investigations, both theoretical and applied. We plan to make the corpus available via the ELRA framework after completion. Samples of tagged text, documentation and structure editing tools will be available for download from our site: http://proling.iitp.ru/Corpus/preview.zip.</p>
                    <p>Acknowledgements</p>
                    <p>This work is supported by the Russian Foundation of Fundamental Research, grant No. 98-07-90072.</p>
                    <p>References</p>
                    <p>Apresjan Ju.D., Boguslavskij I.M., Iomdin L.L., Lazurskij A.V., Sannikov V.Z. and Tsinman L.L. (1992). The Linguistics of a Machine Translation System. Meta, 37 (1), pp. 97-112.</p>
                    <p>Apresjan Ju.D., Boguslavskij I.M., Iomdin L.L., Lazurskij A.V., Sannikov V.Z. and Tsinman L.L. (1993). Système de traduction automatique ETAP. In: La Traductique. P. Bouillon and A. Clas (eds). Les Presses de l'Université de Montréal, Montréal.</p>
                    <p>Hajicova E., Panevova J., Sgall P. (1998). Language Resources Need Annotations To Make Them Really Reusable: The Prague Dependency Treebank. In: Proceedings of the First International Conference on Language Resources &amp; Evaluation, pp. 713-718.</p>
                    <p>Kurohashi S., Nagao M. (1998). Building a Japanese Parsed Corpus while Improving the Parsing System. In: Proceedings of the First International Conference on Language Resources &amp; Evaluation, pp. 719-724.</p>
                    <p>Language Resources (1997). In: Survey of the State of the Art in Human Language Technology. Eds. G. B. Varile, A. Zampolli, Linguistica Computazionale, vol. XII-XIII, pp. 381-408.</p>
                    <p>Marcus M. P., Santorini B., and Marcinkiewicz M.-A. (1993). Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, Vol. 19, No. 2.</p>
                    <p>TEI Guidelines (1994). TEI Guidelines for Electronic Text Encoding and Interchange (P3). URL: http://etext.lib.virginia.edu/TEI.html</p>
                </div2>
            </div1>
        </body>
        <back/>
    </text>
</TEI>
