Tagging and Chunking with Bigrams

Abstract

In this paper we present an integrated system for tagging and chunking texts from a certain language. The approach is based on stochastic finite-state models that are learnt automatically. This includes bigram models or finite-state automata learnt using grammatical inference techniques. As the models involved in our system are learnt automatically, this is a very flexible and portable system. In order to show the viability of our approach we present results for tagging and chunking using bigram models on the Wall Street Journal corpus. We have achieved an accuracy rate for tagging of 96.8%, and a precision rate for NP chunks of 94.6% with a recall rate of 93.6%.

Introduction

Part of Speech Tagging and Shallow Parsing are two well-known problems in Natural Language Processing. A Tagger can be considered as a translator that reads sentences from a certain language and outputs the corresponding sequences of part of speech (POS) tags, taking into account the context in which each word of the sentence appears. Shallow Parsing involves dividing sentences into non-overlapping segments on the basis of very superficial analysis. It includes discovering the main constituents of the sentences (NPs, VPs, PPs, ...) and their heads. Shallow Parsing usually identifies non-recursive constituents, also called chunks (Abney, 1991) (such as non-recursive Noun Phrases or base NP, base VP, and so on). It can include determining syntactical relationships such as subject-verb, verb-object, etc. Shallow parsing, which always follows the tagging process, is used as a fast and reliable pre-processing phase for full or partial parsing. It can be used for Information Retrieval Systems, Information Extraction, Text Summarization and Bilingual Alignment. In addition, it is also used to solve computational linguistics tasks such as disambiguation problems.

POS Tagging Approaches

The different approaches for solving this problem can be classified into two main classes depending on the tendencies followed for establishing the Language Model (LM): the linguistic approach, based on hand-coded linguistic rules, and the learning approach, derived from corpora (labelled or non-labelled). Other approximations that use hybrid methods have also been proposed (Voutilainen and Padró, 1997).

In the linguistic approach, an expert linguist is needed to formalise the restrictions of the language. This implies a very high cost and it is very dependent on each particular language. We can find an important contribution (Voutilainen, 1995) that uses the Constraint Grammar formalism. Supervised learning methods were proposed in (Brill, 1995) to learn a set of transformation rules that repair the errors committed by a probabilistic tagger. The main advantage of the linguistic approach is that the model is constructed from a linguistic point of view and contains many and complex kinds of knowledge.

In the learning approach, the most extended formalism is based on n-grams or HMM. In this case, the language model can be estimated from a labelled corpus (supervised methods) (Church, 1988)(Weischedel et al., 1993) or from a non-labelled corpus (unsupervised methods) (Cutting et al., 1992). In the first case, the model is trained from the relative observed frequencies. In the second one, the model is learned using the Baum-Welch algorithm from an initial model which is estimated using labelled corpora (Merialdo, 1994). The advantages of the unsupervised approach are the facility to build language models, the flexibility of choice of categories and the ease of application to other languages. We can find some other machine-learning approaches that use more sophisticated LMs, such as Decision Trees (Màrquez and Rodríguez, 1998)(Magerman, 1996), memory-based approaches to learn special decision trees (Daelemans et al., 1996), maximum entropy approaches that combine statistical information from different sources (Ratnaparkhi, 1996), finite state automata inferred using Grammatical Inference (Pla and Prieto, 1998), etc.

The comparison among different approaches is difficult due to the multiple factors that can be considered: the language, the number and type of the tags, the size of the vocabulary, the ambiguity, the difficulty of the test set, etc. The best results reported on the Wall Street Journal (WSJ) Treebank (Marcus et al., 1993), using statistical language models, have an accuracy rate between 95% and 97% (depending on the different factors mentioned above). For the linguistic approach the results are better. For example, in (Voutilainen, 1995) an accuracy of 99.7% is reported, but certain ambiguities in the output remain unsolved. Some works have recently been published (Brill and Wu, 1998) in which a set of taggers are combined in order to improve their performance. In some cases, these methods achieve an accuracy of 97.9% (Halteren et al., 1998).

Shallow Parsing Approaches

Since the early 90's, several techniques for carrying out shallow parsing have been developed. These techniques can also be classified into two main groups: based on hand-coded linguistic rules and based on learning algorithms. These approaches have a common characteristic: they take the sequence of lexical tags proposed by a POS tagger as input, for both the learning and the chunking processes.

Techniques based on hand-coded linguistic rules

These methods use a hand-written set of rules that are defined using POS as terminals of the grammar. Most of these works use finite state methods for detecting chunks or for accomplishing other linguistic tasks (Ejerhed, 1988), (Abney, 1996), (Aït-Mokhtar and Chanod, 1997). Other works use different grammatical formalisms, such as constraint grammars (Voutilainen, 1993), or combine the grammar rules with a set of heuristics (Bourigault, 1992). These works usually use a small test set that is manually evaluated, so the achieved results are not significant.

The regular expressions defined in (Ejerhed, 1988) identified both non-recursive clauses and non-recursive NPs in English text. The experimentation on the Brown corpus achieved a precision rate of 87% (for clauses) and 97.8% (for NPs).

Abney introduced the concept of chunk (Abney, 1991) and presented an incremental partial parser (Abney, 1996). This parser identifies chunks based on the parts of speech, and it then chooses how to combine them for higher level analysis using lexical information. The average precision and recall rates for chunks were 87.9% and 87.1%, respectively, on a test set of 1000 sentences.

An incremental architecture of finite-state transducers for French is presented in (Aït-Mokhtar and Chanod, 1997). Each transducer performs a linguistic task such as identifying segments or syntactic structures and detecting subjects and objects. The system was evaluated on various corpora for subject and object detection. The precision rate varied between 99.2% and 92.6%. The recall rate varied between 97.8% and 82.6%.

The NPtool parser described in (Voutilainen, 1993) identified maximal-length noun phrases. NPtool gave a precision rate of 95-98% and a recall rate of 98.5-100%. These results were criticised in (Ramshaw and Marcus, 1995) due to some inconsistencies and apparent mistakes which appeared on the sample given in (Voutilainen, 1993). Bourigault developed the LECTER parser for French using grammatical rules and some heuristics (Bourigault, 1992). It achieved a recall rate of 95% identifying maximal length terminological noun phrases, but he did not give a precision rate, so it is difficult to evaluate the actual performance of the parser.

Learning Techniques

These approaches automatically construct a language model from a labelled and bracketed corpus. The first probabilistic approach was proposed in (Church, 1988). This method learnt a bigram model for detecting simple noun phrases on the Brown corpus. Given a sequence of parts of speech as input, the Church program inserts the most probable openings and endings of NPs, using a Viterbi-like dynamic programming algorithm. Church did not give precision and recall rates. He showed that 5 out of 243 NPs were omitted, but in a very small test with a POS tagging accuracy of 99.5%.

Transformation-based learning (TBL) was used in (Ramshaw and Marcus, 1995) to detect base NP. In this work chunking was considered as a tagging technique, so that each POS could be tagged with I (inside a baseNP), O (outside a baseNP) or B (inside a baseNP, but the preceding word was in another baseNP). This approach resulted in a precision rate of 91.8% and a recall rate of 92.3%. This result was automatically evaluated on a test set (200,000 words) extracted from the WSJ Treebank. The main drawback of this approach is the high requirement for time and space needed to train the system; it needs to train 100 templates of combinations of words.
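The I/O/B encoding described above can be illustrated with a minimal sketch. The function name `chunks_to_iob` and the use of "[" and "]" as bracket tokens are assumptions made for this example only; they are not taken from (Ramshaw and Marcus, 1995).

```python
def chunks_to_iob(tokens):
    """Convert a bracketed token sequence to I/O/B chunk tags.

    `tokens` is a list in which "[" opens a base NP, "]" closes it,
    and every other item is a word (or a POS tag).
    """
    tagged = []
    inside = False          # are we currently inside a base NP?
    prev_was_np = False     # did the previous word belong to a base NP?
    first_in_chunk = False  # is the next word the first one of its chunk?
    for tok in tokens:
        if tok == "[":
            inside, first_in_chunk = True, True
        elif tok == "]":
            inside = False
        else:
            if not inside:
                tagged.append((tok, "O"))
                prev_was_np = False
            else:
                # B is used only when a base NP starts right after another one.
                tag = "B" if (first_in_chunk and prev_was_np) else "I"
                tagged.append((tok, tag))
                prev_was_np = True
            first_in_chunk = False
    return tagged


example = ["[", "The", "company", "]", "gave",
           "[", "investors", "]", "[", "a", "report", "]"]
for word, tag in chunks_to_iob(example):
    print(word, tag)
```

In this example "a" receives B because its base NP immediately follows the base NP "investors", whereas "investors" receives I because the preceding word "gave" is outside any chunk.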

There are several works that use a memory-based learning algorithm. These approaches construct a classifier for a task by storing a set of examples in memory. Each example is defined by a set of features that have to be learnt from a bracketed corpus. The Memory-Based Learning (MBL) algorithm (Daelemans et al., 1999) takes into account lexical and POS information. It stores the following features: the word form and POS tag of the two words to the left, the focus word and one word to the right. This system achieved a precision rate of 93.7% and a recall rate of 94.0% on the WSJ Treebank. However, when only POS information was used the performance decreased, achieving a precision rate of 90.3% and a recall rate of 90.1%. The Memory-Based Sequence Learning (MBSL) algorithm (Argamon et al., 1998) learns substrings or sequences of POS and brackets. Precision and recall rates were both 92.4% on the same data used in (Ramshaw and Marcus, 1995).

A simple approach, called the Treebank Approach (TA), is presented in (Cardie and Pierce, 1998). This technique matches POS sequences from an initial noun phrase grammar which was extracted from an annotated corpus. The precision achieved for each rule is used to rank and prune the rules, discarding those rules whose score is lower than a predefined threshold. It uses a longest match heuristic to determine base NP. Precision and recall on the WSJ Treebank were 89.4% and 90.0%, respectively.

It is difficult to compare the different approaches for various reasons. Each one uses a different definition of base NP. Each one is evaluated on a different corpus or on different parts of the same corpus. Some systems have even been evaluated by hand on a very small test set. Table 1 summarizes the precision and recall rates for learning approaches that use data extracted from the WSJ Treebank.

Method           NP-Precision   NP-Recall
TBL              91.8           92.3
MBSL             92.4           92.4
TA               89.4           90.9
MBL              93.7           94.0
MBL (only POS)   90.3           90.1

Table 1: Precision and recall rates for different NP parsers.

General Description of our Integrated approach to Tagging and Chunking

We propose an integrated system (Figure 1) that combines different knowledge sources (lexical probabilities, LM for chunks and Contextual LM for the sentences) in order to obtain the corresponding sequence of POS tags and the shallow parsing ([SU W1/C1 W2/C2 SU] W3/C3 ... [SU Wn/Cn SU]) from a certain input string (W1, W2, ..., Wn). Our system is a transducer composed of two levels: the upper one represents the Contextual LM for the sentences, and the lower one models the chunks considered. The formalism that we have used in all levels is finite-state automata. To be exact, we have used bigram models which are smoothed using the backoff technique (Katz, 1987) in order to achieve full coverage of the language. The bigram LMs (bigram probabilities) were obtained by means of the SLM TOOLKIT (Clarkson and Rosenfeld, 1997) from the sequences of categories in the training set. Then, they were represented as finite-state automata.

Figure 1: Overview of the System.
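As a rough illustration of how bigram probabilities can be estimated from sequences of categories, the following sketch computes unsmoothed relative frequencies. It is a simplification: it does not implement the Katz backoff smoothing used in the system, and it does not use the SLM toolkit. The function name `bigram_probabilities` and the boundary markers `<s>` and `</s>` are assumptions made for this example.

```python
from collections import Counter


def bigram_probabilities(sequences, start="<s>", end="</s>"):
    """Maximum-likelihood bigram estimates P(cur | prev), without smoothing."""
    context_counts = Counter()  # how often each symbol occurs as a left context
    bigram_counts = Counter()   # how often each (prev, cur) pair occurs
    for seq in sequences:
        padded = [start] + list(seq) + [end]
        for prev, cur in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return {(prev, cur): n / context_counts[prev]
            for (prev, cur), n in bigram_counts.items()}


# Two toy sequences of POS tags (hypothetical data).
probs = bigram_probabilities([["DT", "NN", "VBD"], ["DT", "NN", "NN"]])
print(probs[("DT", "NN")])   # every DT in the toy data is followed by NN
print(probs[("NN", "VBD")])  # one of the three NN contexts is followed by VBD
```

Unseen bigrams simply have no entry in the returned dictionary, which is exactly the coverage problem that a backoff scheme addresses.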

The learning phase.

The models have been estimated from labelled and bracketed corpora. The training set is composed of sentences like:

[SU W1/C1 W2/C2 SU] W3/C3 ... [SU Wn/Cn SU] ./.

where Wi are the words, Ci are part-of-speech tags and SU are the chunks considered. The models learnt are:

• Contextual LM: it is a smoothed bigram model learnt from the sequences of part-of-speech tags (Ci) and chunk descriptors (SU) present in the training corpus (see Figure 2a).

• Models for the chunks: they are smoothed bigram models learnt from the sequences of part-of-speech tags corresponding to each chunk of the training corpus (see Figure 2b).

• Lexical Probabilities: they are estimated from the word frequencies, the tag frequencies and the word per tag frequencies. A tag dictionary built from the full corpus gives us the possible lexical categories (POS tags) for each word; this is equivalent to having an ideal morphological analyzer. The probabilities for each possible tag are assigned from this information, taking into account the obtained statistics. Because a word may not have been seen in training, or may have been seen only in some of its possible categories, it is necessary to apply a smoothing mechanism. In our case, if the word has not previously been seen, the same probability is assigned to all the categories given by the dictionary; if it has been seen, but not in all the categories, the smoothing called "add one" is applied. Afterwards, a renormalization process is carried out.

Figure 2: Integrated Language Model for Tagging and Chunking. (a) Contextual LM. (b) LM for Chunks. (c) Integrated LM.
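The smoothing of the lexical probabilities described above can be sketched as follows. This is one possible reading of the description, not the authors' implementation: the function name `smoothed_tag_distribution`, the data structures, and the choice to normalise over the tags of each word are assumptions made for this example.

```python
def smoothed_tag_distribution(word, tag_dictionary, counts):
    """Distribution over the dictionary tags of `word`.

    tag_dictionary: dict mapping each word to its list of possible tags.
    counts: dict mapping (word, tag) to its frequency in the training set.
    """
    tags = tag_dictionary[word]
    freqs = {t: counts.get((word, t), 0) for t in tags}
    if sum(freqs.values()) == 0:
        # Word never seen in training: same probability for every tag.
        return {t: 1.0 / len(tags) for t in tags}
    if any(f == 0 for f in freqs.values()):
        # Seen, but not with all of its tags: "add one" smoothing.
        freqs = {t: f + 1 for t, f in freqs.items()}
    total = sum(freqs.values())  # renormalisation
    return {t: f / total for t, f in freqs.items()}


# Hypothetical dictionary and training counts.
dictionary = {"run": ["NN", "VB"], "the": ["DT"], "blorp": ["NN", "JJ"]}
counts = {("run", "VB"): 3, ("the", "DT"): 10}
print(smoothed_tag_distribution("run", dictionary, counts))
print(smoothed_tag_distribution("blorp", dictionary, counts))
```

Here "run" was seen only as VB, so add-one smoothing reserves some probability for NN, while the unseen word "blorp" receives a uniform distribution over its two dictionary tags.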

Once the LMs have been learnt, a regular substitution of the lower model(s) into the upper one is made. In this way, we get a single Integrated LM which shows the possible concatenations of lexical tags and syntactical units, with their own transition probabilities, which also include the lexical probabilities (see Figure 2c). Note that the models in Figure 2 are not smoothed.

The Decoding Process: Tagging and Parsing

The tagging and shallow parsing process consists of finding the sequence of states of maximum probability on the Integrated LM for an input sentence. Therefore, this sequence must be compatible with the contextual, syntactical and lexical constraints. This process can be carried out by Dynamic Programming using the Viterbi algorithm, which is conveniently modified to allow for transitions between certain states of the automata without consuming any symbols (epsilon transitions). A portion of the Dynamic Programming trellis for a generic sentence using the Integrated LM shown in Figure 2c can be seen in Figure 3. The states of the automata that can be reached and that are compatible with the lexical constraints are marked with a black circle (i.e., from the state Ck it is possible to reach the state Ci if the transition is in the automata and the lexical probability P(Wi|Ci) is not null). Also, the transitions to initial and final states of the models for chunks (i.e., from Ci to <SU>) are allowed; these states are marked in Figure 3 with a white circle and in this case no symbol is consumed. In all these cases, the transitions to initial and final states produce transitions to their successors (the dotted lines in Figure 3), where symbols must now be consumed.

Once the Dynamic Programming trellis is built, we can obtain the maximum probability path for the input sentence, and thus the best sequence of lexical tags and the best segmentation in chunks.

Figure 3: Partial Trellis for Programming Decoding based on the Integrated LM (input: ... Wn-2 Wn-1 Wn </s>; output: ... Wn-2/Ci [SU Wn-1/Cn SU] Wn/Ck </s>).
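The search for the maximum probability path can be illustrated with a simplified Viterbi sketch over a plain bigram tag model. It deliberately omits the chunk models and the epsilon transitions of the Integrated LM, so it corresponds to tagging only, not to the full decoder described here. The function name `viterbi`, the dictionary-based probability tables and the toy values are assumptions made for this example; the input sentence is assumed to be non-empty.

```python
import math


def viterbi(words, tags, trans, emit, start="<s>", end="</s>"):
    """Most probable tag sequence under a bigram tag model.

    trans[(prev_tag, tag)] and emit[(word, tag)] hold probabilities;
    missing entries count as zero (this sketch applies no smoothing).
    """
    def logp(table, key):
        p = table.get(key, 0.0)
        return math.log(p) if p > 0 else float("-inf")

    # best[i][t] = (score of the best path ending in tag t at word i, backpointer)
    best = [{t: (logp(trans, (start, t)) + logp(emit, (words[0], t)), None)
             for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            e = logp(emit, (words[i], t))
            prev, score = max(
                ((p, best[i - 1][p][0] + logp(trans, (p, t)) + e) for p in tags),
                key=lambda item: item[1])
            column[t] = (score, prev)
        best.append(column)
    # Close the path with the transition into the end-of-sentence marker.
    last = max(tags, key=lambda t: best[-1][t][0] + logp(trans, (t, end)))
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))


# Toy model (hypothetical probabilities).
tags = ["DT", "NN", "VB"]
trans = {("<s>", "DT"): 0.8, ("<s>", "NN"): 0.2, ("DT", "NN"): 0.9,
         ("DT", "VB"): 0.1, ("NN", "VB"): 0.5, ("NN", "NN"): 0.3,
         ("NN", "</s>"): 0.2, ("VB", "</s>"): 0.6, ("VB", "NN"): 0.4}
emit = {("the", "DT"): 1.0, ("dog", "NN"): 0.9, ("dog", "VB"): 0.1,
        ("barks", "VB"): 0.7, ("barks", "NN"): 0.3}
print(viterbi(["the", "dog", "barks"], tags, trans, emit))
```

Extending this sketch to the integrated model would require adding the chunk-boundary states and propagating scores through them without consuming a word, as the text describes.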

Experimental Work

In this section we will describe a set of experiments that we carried out in order to demonstrate the capabilities of the proposed approach for tagging and shallow parsing. The experiments were carried out on the WSJ corpus, using the POS tag set defined in (Marcus et al., 1993), considering only the NP chunks defined by (Church, 1988) and using the models that we have presented above. Nevertheless, the use of this approach on other corpora (changing the reference language), other lexical tag sets or other kinds of chunks can be done in a direct way.

Corpus Description.

We used a portion of the WSJ corpus (900,000 words), which was tagged according to the Penn Treebank tag set and bracketed with NP markers, to train and test the system.

The tag set contained 45 different tags. About 36.5% of the words in the corpus were ambiguous, with an ambiguity ratio of 2.44 tags/word over the ambiguous words, 1.52 overall.

Experimental Results.

In order to train the models and to test the system, we randomly divided the corpus into two parts: approximately 800,000 words for training and 100,000 words for testing.

Both the bigram models for representing contextual information and the syntactic description of the NP chunk, and the lexical probabilities, were estimated from training sets of different sizes. Because we did not use a morphological analyser for English, we constructed a tag dictionary with the lexicon of the training set and the test set used. This dictionary gave us the possible lexical tags for each word from the corpus. In no case was the test set used to estimate the lexical probabilities.

In Figure 4, we show the results of tagging on the test set in terms of the training set size using three approaches. The simplest (LEX) is a tagging process which does not take contextual information into account, so the lexical tag associated with a word is the one that has appeared most often in the training set. The second method corresponds to a tagger based on a bigram model (BIG). The third one uses the Integrated LM described in this paper (BIG-BIG). The tagging accuracy for BIG and BIG-BIG was close, 96.9% and 96.8% respectively, whereas without the use of the language model (LEX), the tagging accuracy was 2.5 points lower. The trend in all the cases was that an increment in the size of the training set resulted in an increase in the tagging accuracy. After 300,000 training words, the result became stabilized.

Figure 4: Accuracy Rate of Tagging on WSJ for incremental training sets (x-axis: #Words x 1000).

Figure 5: NP-chunking results (precision and recall) on WSJ for incremental training sets (x-axis: #Words x 1000).

Tagger     Tagging Accuracy   NP-Chunking Precision   NP-Chunking Recall
BIG-BIG    96.8               94.6                    93.6
LEX        94.3               90.8                    91.3
BIG        96.9               94.9                    94.1
IDEAL      100 (assumed)      95.5                    94.7

Table 2: Tagging and NP-Chunking results for different taggers (training set of 800,000 words).

In Figure 5, we show the precision (#correct proposed NP/#proposed NP) and recall (#correct proposed NP/#NP in the reference) rates for NP chunking. The results obtained using the Integrated LM were very satisfactory, achieving a precision rate of 94.6% and a recall rate of 93.6%. The performance of the NP chunker improves as the training set size increases. This is obviously due to the fact that the model is better learnt when the size of the training set increases, and the tagging error decreases as we have seen above.
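The two measures defined above can be computed as in the following minimal sketch. Representing each NP as a (sentence index, start, end) triple and requiring an exact span match for a proposed NP to count as correct are assumptions made for this example, as is the function name `chunk_precision_recall`.

```python
def chunk_precision_recall(proposed, reference):
    """Precision and recall over NP spans.

    proposed, reference: collections of (sentence_id, start, end) triples.
    A proposed NP counts as correct only if the same triple is in the reference.
    """
    proposed, reference = set(proposed), set(reference)
    correct = len(proposed & reference)
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical spans: two of the three proposed NPs match the reference.
proposed = [(0, 0, 2), (0, 3, 4), (1, 0, 1)]
reference = [(0, 0, 2), (0, 3, 5), (1, 0, 1), (1, 2, 4)]
precision, recall = chunk_precision_recall(proposed, reference)
print(f"precision={precision:.3f} recall={recall:.3f}")
```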

The usual sequential process for chunking a sentence can also be used. That is, first we tag the sentence and then we use the Integrated LM to carry out the chunking. In this case, only the contextual probabilities are taken into account in the decoding process. In Table 2, we show the most relevant results that we obtained for tagging and for NP chunking. The first row shows the result when the tagging and the chunking are done in an integrated way. The following rows show the performance of the sequential process using different taggers:

• LEX: it takes into account only lexical probabilities. In this case, the tagging accuracy was 94.3%.

• BIG: it is based on a bigram model that achieved an accuracy of 96.9%.

• IDEAL: it simulates a tagger with an accuracy rate of 100%. To do this, we used the tagged sentences of the WSJ corpus directly.

These results confirm that precision and recall rates increase when the accuracy of the tagger is better. The performance of the sequential process (using the BIG tagger) is slightly better than the performance of the integrated process (BIG-BIG). We think that this is probably because of the way we combined the probabilities of the different models.

Conclusions and Future Work

In this paper, we have presented a system for Tagging and Chunking based on an Integrated Language Model that uses a homogeneous formalism (finite-state machine) to combine different knowledge sources: lexical, syntactical and contextual models. It is feasible both in terms of performance and also in terms of computational efficiency.

All the models involved are learnt automatically from data, so the system is very flexible and portable, and changes in the reference language, lexical tags or other kinds of chunks can be made in a direct way.

The tagging accuracy (96.9% using BIG and 96.8% using BIG-BIG) is higher than that of other similar approaches. This is because we have used the tag dictionary (including the test set in it) to restrict the possible tags for unknown words; this assumption obviously increases the tagging rates (we have not done a quantitative study of this factor).

As we have mentioned above, the comparison with other approaches is difficult due, among other reasons, to the following ones: the definitions of base NP are not always the same, the sizes of the training and test sets are different, and the knowledge sources used in the learning process are also different. The precision for NP-chunking is similar to other statistical approaches presented in section 1, for both the integrated process (94.6%) and the sequential process using a tagger based on bigrams (94.9%). The recall rate is slightly lower than for some approaches using the integrated system (93.6%) and is similar for the sequential process (94.1%). When we used the sequential system taking an error-free input (IDEAL), the performance of the system obviously increased (95.5% precision and 94.7% recall). These results show the influence of tagging errors on the process. Nevertheless, we are studying why the results of the integrated process and the sequential process are different. We are testing how the introduction of some adjustment factors among the models for weighting the different probability distributions can improve the results.

The models that we have used in this work are bigrams, but trigrams or any stochastic regular model can be used. In this respect, we have worked on more complex LMs, formalized as finite-state automata which are learnt using Grammatical Inference techniques. Also, our approach would benefit from the inclusion of lexical-contextual information into the LM.

Acknowledgments

This work has been partially supported by the Spanish Research Project CICYT (TIC97-0671-C02-01/02).

References

S. Abney. 1991. Parsing by Chunks. In R. Berwick, S. Abney and C. Tenny (eds.), Principle-based Parsing. Kluwer Academic Publishers, Dordrecht.

S. Abney. 1996. Partial Parsing via Finite-State Cascades. In Proceedings of the ESSLLI'96 Robust Parsing Workshop, Prague, Czech Republic.

S. Argamon, I. Dagan, and Y. Krymolowski. 1998. A Memory-based Approach to Learning Shallow Natural Language Patterns. In Proceedings of the joint 17th International Conference on Computational Linguistics and 36th Annual Meeting of the Association for Computational Linguistics, COLING-ACL, pages 67-73, Montréal, Canada.

S. Aït-Mokhtar and J.P. Chanod. 1997. Incremental Finite-State Parsing. In Proceedings of the 5th Conference on Applied Natural Language Processing, Washington D.C., USA.

D. Bourigault. 1992. Surface Grammatical Analysis for the Extraction of Terminological Noun Phrases. In Proceedings of the 15th International Conference on Computational Linguistics, pages 977-981.

Eric Brill and Jun Wu. 1998. Classifier Combination for Improved Lexical Disambiguation. In Proceedings of the joint 17th International Conference on Computational Linguistics and 36th Annual Meeting of the Association for Computational Linguistics, COLING-ACL, pages 191-195, Montréal, Canada.

E. Brill. 1995. Transformation-based Error-driven Learning and Natural Language Processing: A Case Study in Part-of-speech Tagging. Computational Linguistics, 21(4):543-565.

C. Cardie and D. Pierce. 1998. Error-Driven Pruning of Treebank Grammars for Base Noun Phrase Identification. In Proceedings of the joint 17th International Conference on Computational Linguistics and 36th Annual Meeting of the Association for Computational Linguistics, COLING-ACL, pages 218-224, Montréal, Canada, August.

K. W. Church. 1988. A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text. In Proceedings of the 1st Conference on Applied Natural Language Processing, ANLP, pages 136-143. ACL.

P. Clarkson and R. Rosenfeld. 1997. Statistical Language Modeling using the CMU-Cambridge Toolkit. In Proceedings of Eurospeech, Rhodes, Greece.

D. Cutting, J. Kupiec, J. Pedersen, and P. Sibun. 1992. A Practical Part-of-speech Tagger. In Proceedings of the 3rd Conference on Applied Natural Language Processing, ANLP, pages 133-140. ACL.

W. Daelemans, J. Zavrel, P. Berck, and S. Gillis. 1996. MBT: A Memory-Based Part of Speech Tagger Generator. In Proceedings of the 4th Workshop on Very Large Corpora, pages 14-27, Copenhagen, Denmark.

W. Daelemans, S. Buchholz, and J. Veenstra. 1999. Memory-Based Shallow Parsing. In Proceedings of EMNLP/VLC-99, pages 239-246, University of Maryland, USA, June.

E. Ejerhed. 1988. Finding Clauses in Unrestricted Text by Finitary and Stochastic Methods. In Proceedings of Second Conference on Applied Natural Language Processing, pages 219-227. ACL.

H. van Halteren, J. Zavrel, and W. Daelemans. 1998. Improving Data Driven Wordclass Tagging by System Combination. In Proceedings of the joint 17th International Conference on Computational Linguistics and 36th Annual Meeting of the Association for Computational Linguistics, COLING-ACL, pages 491-497, Montréal, Canada, August.

S. M. Katz. 1987. Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, 35.

D. M. Magerman. 1996. Learning Grammatical Structure Using Statistical Decision-Trees. In Proceedings of the 3rd International Colloquium on Grammatical Inference, ICGI, pages 1-21. Springer-Verlag Lecture Notes Series in Artificial Intelligence 1147.

M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini. 1993. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19(2).

Lluís Màrquez and Horacio Rodríguez. 1998. Part-of-Speech Tagging Using Decision Trees. In C. Nédellec and C. Rouveirol, editors, LNAI 1398: Proceedings of the 10th European Conference on Machine Learning, ECML'98, pages 25-36, Chemnitz, Germany. Springer.

B. Merialdo. 1994. Tagging English Text with a Probabilistic Model. Computational Linguistics, 20(2):155-171.

F. Pla and N. Prieto. 1998. Using Grammatical Inference Methods for Automatic Part of Speech Tagging. In Proceedings of 1st International Conference on Language Resources and Evaluation, LREC, Granada, Spain.

L. Ramshaw and M. Marcus. 1995. Text Chunking Using Transformation-Based Learning. In Proceedings of third Workshop on Very Large Corpora, pages 82-94, June.

A. Ratnaparkhi. 1996. A Maximum Entropy Part-of-speech Tagger. In Proceedings of the 1st Conference on Empirical Methods in Natural Language Processing, EMNLP.

Atro Voutilainen and Lluís Padró. 1997. Developing a Hybrid NP Parser. In Proceedings of the 5th Conference on Applied Natural Language Processing, ANLP, pages 80-87, Washington DC. ACL.

Atro Voutilainen. 1993. NPTool, a Detector of English Noun Phrases. In Proceedings of the Workshop on Very Large Corpora. ACL, June.

Atro Voutilainen. 1995. A Syntax-Based Part of Speech Analyzer. In Proceedings of the 7th Conference of the European Chapter of the Association for Computational Linguistics, EACL, Dublin, Ireland.

R. Weischedel, R. Schwartz, J. Palmucci, M. Meteer, and L. Ramshaw. 1993. Coping with Ambiguity and Unknown Words through Probabilistic Models. Computational Linguistics, 19(2):260-269.