Lexicalized Hidden Markov Models for Part-of-Speech Tagging

Abstract

Since most previous work on HMM-based tagging considers only part-of-speech information in contexts, these models cannot utilize lexical information, which is crucial for resolving some morphological ambiguity. In this paper we introduce uniformly lexicalized HMMs for part-of-speech tagging in both English and Korean. The lexicalized models use a simplified back-off smoothing technique to overcome data sparseness. In experiments, lexicalized models achieve higher accuracy than non-lexicalized models, and the back-off smoothing method mitigates data sparseness better than simple smoothing methods.

1 Introduction

Part-of-speech (POS) tagging is a process in which a proper POS tag is assigned to each word in raw texts. Even though morphologically ambiguous words have more than one POS tag, they belong to just one tag in a context. To resolve such ambiguity, taggers have to consult various sources of information such as lexical preferences (e.g. without consulting context, table is more probably a noun than a verb or an adjective), tag n-gram contexts (e.g. after a non-possessive pronoun, table is more probably a verb than a noun or an adjective, as in they table an amendment), word n-gram contexts (e.g. before lamp, table is more probably an adjective than a noun or a verb, as in I need a table lamp), and so on (Lee et al., 1999).

However, most previous HMM-based taggers consider only POS information in contexts, and so they cannot capture lexical information which is necessary for resolving some morphological ambiguity. Some recent works have reported that tagging accuracy could be improved by using lexical information in their models, such as the transformation-based patch rules (Brill, 1994), the maximum entropy model (Ratnaparkhi, 1996), the statistical lexical rules (Lee et al., 1999), the HMM considering multi-words (Kim, 1996), the selectively lexicalized HMM (Kim et al., 1999), and so on. In the previous works (Kim, 1996; Kim et al., 1999), however, the HMMs were lexicalized selectively and restrictively.

In this paper we propose a method of uniformly lexicalizing the standard HMM for part-of-speech tagging in both English and Korean. Because the sparse-data problem is more serious in lexicalized models than in the standard model, a simplified version of the well-known back-off smoothing method is used to overcome the problem. For experiments, the Brown corpus (Francis, 1982) is used for English tagging and the KUNLP corpus (Lee et al., 1999) is used for Korean tagging. The experimental results show that lexicalized models perform better than non-lexicalized models and that the simplified back-off smoothing technique can mitigate data sparseness better than simple smoothing techniques.

2 The "standard" HMM

We basically follow the notation of (Charniak et al., 1993) to describe Bayesian models. In this paper, we assume that $\{w^1, w^2, \ldots, w^{\omega}\}$ is a set of words, $\{t^1, t^2, \ldots, t^{\tau}\}$ is a set of POS tags, a sequence of random variables $W_{1,n} = W_1 W_2 \ldots W_n$ is a sentence of $n$ words, and a sequence of random variables $T_{1,n} = T_1 T_2 \ldots T_n$ is a sequence of $n$ POS tags. Because each of the random variables $W$ can take as its value any of the words in the vocabulary, we denote the value of $W_i$ by $w_i$ and a particular sequence of values for $W_{i,j}$ ($i \le j$) by $w_{i,j}$. In a similar way, we denote the value of $T_i$ by $t_i$ and a particular sequence of values for $T_{i,j}$ ($i \le j$) by $t_{i,j}$. For generality, terms $w_{i,j}$ and $t_{i,j}$ ($i > j$) are defined as being empty.

The purpose of Bayesian models for POS tagging is to find the most likely sequence of POS tags for a given sequence of words, as follows:

$$T(w_{1,n}) = \arg\max_{t_{1,n}} \Pr(T_{1,n} = t_{1,n} \mid W_{1,n} = w_{1,n})$$

Because reference to the random variables themselves can be omitted, the above equation becomes:

$$T(w_{1,n}) = \arg\max_{t_{1,n}} \Pr(t_{1,n} \mid w_{1,n}) \tag{1}$$

Now, Eqn. 1 is transformed into Eqn. 2 since $\Pr(w_{1,n})$ is constant for all $t_{1,n}$,

$$T(w_{1,n}) = \arg\max_{t_{1,n}} \frac{\Pr(t_{1,n}, w_{1,n})}{\Pr(w_{1,n})} = \arg\max_{t_{1,n}} \Pr(t_{1,n}, w_{1,n}) \tag{2}$$

Then, the probability $\Pr(t_{1,n}, w_{1,n})$ is broken down into Eqn. 3 by using the chain rule.

$$\Pr(t_{1,n}, w_{1,n}) = \prod_{i=1}^{n} \Big( \Pr(t_i \mid t_{1,i-1}, w_{1,i-1}) \times \Pr(w_i \mid t_{1,i}, w_{1,i-1}) \Big) \tag{3}$$

Because it is difficult to compute Eqn. 3, the standard HMM simplifies it by making a strict Markov assumption to get a more tractable form.

$$\Pr(t_{1,n}, w_{1,n}) \approx \prod_{i=1}^{n} \Big( \Pr(t_i \mid t_{i-K,i-1}) \times \Pr(w_i \mid t_i) \Big) \tag{4}$$

In the standard HMM, the probability of the current tag $t_i$ depends on only the previous $K$ tags $t_{i-K,i-1}$ and the probability of the current word $w_i$ depends on only the current tag.¹ Therefore, this model cannot consider lexical information in contexts.

3 Lexicalized HMMs

In English POS tagging, the tagging unit is a word. On the contrary, Korean POS tagging prefers a morpheme.²

¹ Usually, $K$ is determined as 1 (bigram as in (Charniak et al., 1993)) or 2 (trigram as in (Merialdo, 1991)).

² The main reason is that the number of word-unit tags is not finite because Korean words can be freely and newly formed by agglutinating morphemes (Lee et al., 1999).

Lattice nodes, by position: $/$; Flies/NNS, Flies/VBZ; like/CS, like/IN, like/JJ, like/VB; a/AT, a/IN, a/NN; flower/NN, flower/VB; ./.; $/$

Figure 1: A word-unit lattice of "Flies like a flower."

Figure 1 shows a word-unit lattice of an English sentence, "Flies like a flower.", where each node has a word and its word-unit tag. Figure 2 shows a morpheme-unit lattice of a Korean sentence, "NeoNeun Hal Su issDa.", where each node has a morpheme and its morpheme-unit tag. In the case of Korean, transitions across a word boundary, which are depicted by a solid line, are distinguished from transitions within a word, which are depicted by a dotted line. In both cases, sequences connected by bold lines indicate the most likely sequences.

3.1 Word-unit models

Lexicalized HMMs for word-unit tagging are defined by making a less strict Markov assumption, as follows:

$$\Lambda(T_{(K,J)}, W_{(L,I)}) \triangleq \Pr(t_{1,n}, w_{1,n}) \approx \prod_{i=1}^{n} \Big( \Pr(t_i \mid t_{i-K,i-1}, w_{i-J,i-1}) \times \Pr(w_i \mid t_{i-L,i}, w_{i-I,i-1}) \Big) \tag{5}$$

In models $\Lambda(T_{(K,J)}, W_{(L,I)})$, the probability of the current tag $t_i$ depends on both the previous $K$ tags $t_{i-K,i-1}$ and the previous $J$ words $w_{i-J,i-1}$, and the probability of the current word $w_i$ depends on the current tag and the previous $L$ tags $t_{i-L,i}$ and the previous $I$ words $w_{i-I,i-1}$. So, they can consider lexical information. In experiments, we set $K$ as 1 or 2, $J$ as 0 or $K$, $L$ as 1 or 2, and $I$ as 0 or $L$. If $J$ and $I$ are zero, the above models are non-lexicalized models. Otherwise, they are lexicalized models.
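To make the conditioning contexts of the word-unit models concrete, the following is a minimal Python sketch that extracts, for position i, the context of the tag probability (previous K tags and previous J words) and of the word probability (tags from i-L to i and previous I words). The function names `tag_context` and `word_context` are hypothetical, and truncating the context at the sentence start (instead of padding with the $/$ node of the lattice) is an assumption made only to keep the sketch short.

```python
def tag_context(tags, words, i, K, J):
    """Context of Pr(t_i | t_{i-K,i-1}, w_{i-J,i-1}): previous K tags, previous J words.

    Assumption: positions before the sentence start are simply dropped.
    """
    return tuple(tags[max(0, i - K):i]) + tuple(words[max(0, i - J):i])


def word_context(tags, words, i, L, I):
    """Context of Pr(w_i | t_{i-L,i}, w_{i-I,i-1}): tags i-L..i, previous I words."""
    return tuple(tags[max(0, i - L):i + 1]) + tuple(words[max(0, i - I):i])


if __name__ == "__main__":
    words = ["Flies", "like", "a", "flower", "."]
    tags = ["NNS", "VB", "AT", "NN", "."]
    i = 2  # the node "a/AT"
    # Lexicalized model with K = J = L = I = 2
    print(tag_context(tags, words, i, K=2, J=2))
    print(word_context(tags, words, i, L=2, I=2))
    # Non-lexicalized bigram model: J = 0 and I = 0 leave no words in the context
    print(tag_context(tags, words, i, K=1, J=0))
    print(word_context(tags, words, i, L=0, I=0))
```

With J = 0 and I = 0 the word slices are empty, which corresponds to the non-lexicalized case described in the text.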

Figure 2: A morpheme-unit lattice of "NeoNeun Hal Su issDa." (= You can do it.)

In a lexicalized model $\Lambda(T_{(2,2)}, W_{(2,2)})$, for example, the probability of a node "a/AT" of the most likely sequence in Figure 1 is calculated as follows:

$$\Pr(\mathrm{AT} \mid \mathrm{NNS}, \mathrm{VB}, \textit{Flies}, \textit{like}) \times \Pr(\textit{a} \mid \mathrm{AT}, \mathrm{NNS}, \mathrm{VB}, \textit{Flies}, \textit{like})$$

3.2 Morpheme-unit models

Bayesian models for morpheme-unit tagging find the most likely sequence of morphemes and corresponding tags for a given sequence of words, as follows:

$$T(w_{1,n}) = \arg\max_{c_{1,u},\, m_{1,u}} \Pr(c_{1,u}, m_{1,u} \mid w_{1,n}) \tag{6}$$

$$\approx \arg\max_{c_{1,u},\, m_{1,u}} \Pr(c_{1,u}, p_{2,u}, m_{1,u}) \tag{7}$$

In the above equations, $u$ ($\ge n$) denotes the number of morphemes in a sequence corresponding to the given word sequence, $c$ denotes a morpheme-unit tag, $m$ denotes a morpheme, and $p$ denotes a type of transition from the previous tag to the current tag. $p$ can have one of two values, "#" denoting a transition across a word boundary and "+" denoting a transition within a word. Because it is difficult to calculate Eqn. 6, the word sequence term $w_{1,n}$ is usually ignored as in Eqn. 7. Instead, we introduce $p$ in Eqn. 7 to consider word-spacing.³

The probability $\Pr(c_{1,u}, p_{2,u}, m_{1,u})$ is also broken down into Eqn. 8 by using the chain rule.

$$\Pr(c_{1,u}, p_{2,u}, m_{1,u}) = \prod_{i=1}^{u} \Big( \Pr(c_i, p_i \mid c_{1,i-1}, p_{2,i-1}, m_{1,i-1}) \times \Pr(m_i \mid c_{1,i}, p_{2,i}, m_{1,i-1}) \Big) \tag{8}$$

Because Eqn. 8 is not easy to compute, it is simplified by making a Markov assumption to get a more tractable form.

In a similar way to the case of word-unit tagging, lexicalized HMMs for morpheme-unit tagging are defined by making a less strict Markov assumption, as follows:

$$\Lambda(C_{[s](K,J)}, M_{[s](L,I)}) \triangleq \Pr(c_{1,u}, p_{2,u}, m_{1,u}) \approx \prod_{i=1}^{u} \Big( \Pr(c_i [, p_i] \mid c_{i-K,i-1} [, p_{i-K+1,i-1}], m_{i-J,i-1}) \times \Pr(m_i \mid c_{i-L,i} [, p_{i-L+1,i}], m_{i-I,i-1}) \Big) \tag{9}$$

In models $\Lambda(C_{[s](K,J)}, M_{[s](L,I)})$, the probability of the current morpheme tag $c_i$ depends on both the previous $K$ tags $c_{i-K,i-1}$ (optionally, the types of their transition $p_{i-K+1,i-1}$) and the previous $J$ morphemes $m_{i-J,i-1}$, and the probability of the current morpheme $m_i$ depends on the current tag and the previous $L$ tags $c_{i-L,i}$ (optionally, the types of their transition $p_{i-L+1,i}$) and the previous $I$ morphemes $m_{i-I,i-1}$. So, they can also consider lexical information.

In a lexicalized model $\Lambda(C_{s(2,2)}, M_{(2,2)})$ where word-spacing is considered only in the tag probabilities, for example, the probability of a node "Su/NNBG" of the most likely sequence in Figure 2 is calculated as follows:

$$\Pr(\mathrm{NNBG}, \# \mid \mathrm{VV}, \mathrm{EFD}, +, \textit{Ha}, \textit{l}) \times \Pr(\textit{Su} \mid \mathrm{VV}, \mathrm{EFD}, \mathrm{NNBG}, \textit{Ha}, \textit{l})$$

3.3 Parameter estimation

In supervised learning, the simplest parameter estimation is the maximum likelihood (ML) estimation (Duda et al., 1973), which maximizes the probability of a training set. The ML estimate of the tag (K+1)-gram probability, $\Pr_{ML}(t_i \mid t_{i-K,i-1})$, is calculated as follows:

$$\Pr_{ML}(t_i \mid t_{i-K,i-1}) = \frac{Fq(t_{i-K,i})}{Fq(t_{i-K,i-1})} \tag{10}$$

where the function $Fq(x)$ returns the frequency of $x$ in the training set. When using the maximum likelihood estimation, data sparseness is more serious in lexicalized models than in non-lexicalized models because the former have many more parameters than the latter.

³ Most previous HMM-based Korean taggers except (Kim et al., 1998) did not consider word-spacing.
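The relative-frequency estimate of the tag (K+1)-gram probability can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: the helper names `ngram_counts` and `pr_ml` are hypothetical, sentence starts are padded with a "$" symbol, and a history is counted only where a tag follows it, so that the estimates for a given history sum to one.

```python
from collections import Counter


def ngram_counts(tag_sequences, K):
    """Collect Fq(t_{i-K,i}) and Fq(t_{i-K,i-1}) from tagged training sentences."""
    num = Counter()  # frequency of (history + current tag)
    den = Counter()  # frequency of the history alone
    for tags in tag_sequences:
        padded = ["$"] * K + list(tags)  # assumption: "$" pads the sentence start
        for i in range(K, len(padded)):
            history = tuple(padded[i - K:i])
            num[history + (padded[i],)] += 1
            den[history] += 1
    return num, den


def pr_ml(tag, history, num, den):
    """ML estimate Fq(t_{i-K,i}) / Fq(t_{i-K,i-1}); 0.0 for an unseen history."""
    history = tuple(history)
    if den[history] == 0:
        return 0.0
    return num[history + (tag,)] / den[history]


if __name__ == "__main__":
    corpus = [["NNS", "VB", "AT", "NN"], ["NNS", "IN", "AT", "NN"]]
    num, den = ngram_counts(corpus, K=1)
    print(pr_ml("VB", ("NNS",), num, den))  # seen once out of two NNS histories
    print(pr_ml("VB", ("AT",), num, den))   # unseen event: zero without smoothing
```

The zero returned for unseen events is exactly the sparseness problem that the smoothing methods in this section address.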

In (Chen, 1996), where various smoothing techniques were tested for a language model by using the perplexity measure, back-off smoothing (Katz, 1987) is said to perform better on a small training set than other methods. In the back-off smoothing, the smoothed probability of the tag (K+1)-gram $\Pr_{SBO}(t_i \mid t_{i-K,i-1})$ is calculated as follows:

$$\Pr_{SBO}(t_i \mid t_{i-K,i-1}) = \begin{cases} d_r \Pr_{ML}(t_i \mid t_{i-K,i-1}) & \text{if } r > 0 \\ \alpha(t_{i-K,i-1}) \Pr_{SBO}(t_i \mid t_{i-K+1,i-1}) & \text{if } r = 0 \end{cases} \tag{11}$$

where

$$r = Fq(t_{i-K,i}), \qquad r^{*} = (r+1)\frac{n_{r+1}}{n_r}, \qquad d_r = \frac{\dfrac{r^{*}}{r} - \dfrac{(r+1) \times n_{r+1}}{n_1}}{1 - \dfrac{(r+1) \times n_{r+1}}{n_1}}$$

$n_r$ denotes the number of (K+1)-grams whose frequency is $r$, and the coefficient $d_r$ is called the discount ratio, which reflects the Good-Turing estimate (Good, 1953).⁴ Eqn. 11 means that $\Pr_{SBO}(t_i \mid t_{i-K,i-1})$ is discounted by the ratio $d_r$ relative to its maximum likelihood estimate if $r > 0$, or is backed off to its smoothing term $\Pr_{SBO}(t_i \mid t_{i-K+1,i-1})$ in proportion to the value of the function $\alpha(t_{i-K,i-1})$ of its conditional term $t_{i-K,i-1}$ if $r = 0$.

However, because Eqn. 11 requires complicated computation in $\alpha(t_{i-K,i-1})$, we simplify it to get a function of the frequency of a conditional term, as follows:

$$\alpha(Fq(t_{i-K,i-1}) = f) = A \times \frac{E[Fq(t_{i-K,i-1}) = f]}{\sum_{f'=0}^{6} E[Fq(t_{i-K,i-1}) = f']} \tag{12}$$

where

$$A = 1 - \sum_{t_{i-K,i},\; r > 0} \Pr_{SBO}(t_i \mid t_{i-K,i-1}), \qquad E[Fq(t_{i-K,i-1}) = f] = \sum_{t_{i-K+1,i},\; r = 0,\; Fq(t_{i-K,i-1}) = f} \Pr_{SBO}(t_i \mid t_{i-K+1,i-1})$$

In Eqn. 12, the range of $f$ is bucketed into 7 regions, namely $f = 0, 1, 2, 3, 4, 5$ and $f \ge 6$, since it is also difficult to compute this equation for all possible values of $f$.
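The control flow of the simplified back-off can be sketched as follows. This is only an illustration under assumptions: the back-off weights `alpha` (one per frequency bucket), the discount function `discount`, and the lower-order estimator `lower` are taken as given inputs with hypothetical names, and the numbers in the usage example are invented for demonstration, not taken from the experiments.

```python
def bucket(f):
    """Map a history frequency to one of the 7 regions: 0, 1, ..., 5, or 6 for f >= 6."""
    return min(f, 6)


def pr_sbo(tag, history, num, den, alpha, discount, lower):
    """Back-off lookup: discounted ML estimate if seen, else weighted lower-order term.

    num / den : dicts of n-gram and history frequencies (Fq)
    alpha     : list of 7 back-off weights indexed by bucket(Fq(history)) (assumed given)
    discount  : function r -> d_r (assumed given)
    lower     : function (tag, shorter_history) -> lower-order smoothed probability
    """
    history = tuple(history)
    r = num.get(history + (tag,), 0)
    if r > 0:
        return discount(r) * r / den[history]
    f = den.get(history, 0)
    return alpha[bucket(f)] * lower(tag, history[1:])


if __name__ == "__main__":
    num = {("AT", "NN"): 3}
    den = {("AT",): 4}
    alpha = [1.0, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]  # hypothetical weights
    d = lambda r: 0.9                              # hypothetical discount ratio
    unigram = lambda tag, hist: 0.25               # hypothetical lower-order estimate
    print(pr_sbo("NN", ("AT",), num, den, alpha, d, unigram))  # seen bigram
    print(pr_sbo("VB", ("AT",), num, den, alpha, d, unigram))  # backed off
```

Indexing the weight by the bucketed history frequency, rather than by the history itself, is what keeps the table of weights down to seven entries per back-off level.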

Using the formalism of our simplified back-off smoothing, each probability whose ML estimate is zero is backed off to its corresponding smoothing term. In experiments, the smoothing terms of $\Pr_{SBO}(t_i \mid t_{i-K,i-1}, w_{i-J,i-1})$ are determined as follows:

$$\begin{cases} \Pr_{SBO}(t_i \mid t_{i-K+1,i-1}, w_{i-J+1,i-1}) & \text{if } K \ge 1,\ J > 1 \\ \Pr_{SBO}(t_i \mid t_{i-K,i-1}) & \text{if } K \ge 1,\ J = 1 \\ \Pr_{SBO}(t_i \mid t_{i-K+1,i-1}) & \text{if } K \ge 1,\ J = 0 \\ \Pr_{AD}(t_i) & \text{if } K = 0,\ J = 0 \end{cases} \tag{13}$$

Also, the smoothing terms of $\Pr_{SBO}(w_i \mid t_{i-L,i}, w_{i-I,i-1})$ are determined as follows:

$$\begin{cases} \Pr_{SBO}(w_i \mid t_{i-L+1,i}, w_{i-I+1,i-1}) & \text{if } L \ge 1,\ I > 1 \\ \Pr_{SBO}(w_i \mid t_{i-L,i}) & \text{if } L \ge 1,\ I = 1 \\ \Pr_{SBO}(w_i \mid t_{i-L+1,i}) & \text{if } L \ge 1,\ I = 0 \\ \Pr_{SBO}(w_i) & \text{if } L = 0,\ I = 0 \\ \Pr_{AD}(w_i) & \text{if } L = -1,\ I = 0 \end{cases} \tag{14}$$

In Eqn. 13 and 14, the smoothing term of a unigram probability is calculated by using an additive smoothing with $\delta = 10^{-2}$, which is chosen through experiments. The equation for the additive smoothing (Chen, 1996) is as follows:

$$\Pr_{AD}(t_i \mid t_{i-K,i-1}) = \frac{Fq(t_{i-K,i}) + \delta}{\sum_{t_i} \big( Fq(t_{i-K,i}) + \delta \big)}$$

In a similar way, the smoothing terms of parameters in Eqn. 9 are determined.

3.4 Model decoding

From the viewpoint of the lattice structure, the problem of POS tagging can be regarded as the problem of finding the most likely path from the start node ($/$) to the end node ($/$). The Viterbi search algorithm (Forney, 1973), which has been used for HMM decoding, can be effectively applied to this task with just a slight modification.⁵

4 Experiments

4.1 Environment

In experiments, the Brown corpus is used for English POS tagging and the KUNLP corpus for Korean POS tagging.

⁵ Such modification is explained in detail in (Lee, 1999).

       Brown       KUNLP
NW     1,113,189   167,115
NS     53,885      15,211
NT     82          65
DA     1.64        3.41
RUA    61.54%      26.72%

NW: Number of words. NS: Number of sentences. NT: Number of tags (morpheme-unit tag for KUNLP). DA: Degree of ambiguity (i.e. the number of tags per word). RUA: Ratio of unambiguous words.

Table 1: Information about the Brown corpus and the KUNLP corpus

Table 2: Tagging accuracy of $\Lambda(C_{(1,0)}, M_{(0,0)})$ for Korean POS tagging. Columns: inside-test and outside-test; rows: ML, AD for several values of $\delta$, and SBO. ML: Maximum likelihood estimate (with simple smoothing). AD: Additive smoothing. SBO: Simplified back-off smoothing. Inside-test cell values surviving at this point of the extraction (row assignment lost): 95.57, 95.57, 95.55.

Table 1 shows some information about both corpora.⁶ Each of them was segmented into two parts, a training set of 90% and a test set of 10%, in such a way that one sentence out of every 10 sentences was extracted into the test set. According to Table 1, Korean is said to be more difficult to disambiguate than English.

We assume a "closed" vocabulary for English and an "open" vocabulary for Korean since we do not have any English morphological analyzer consistent with the Brown corpus. Therefore, for morphological analysis of English, we just looked up the dictionary tailored to the Brown corpus. In the case of Korean, we have used a Korean morphological analyzer (Lee, 1999) which is consistent with the KUNLP corpus.

⁶ Note that some sentences, which have composite tags (such as "HV+TO" in "hafta"), the "ILLEGAL" tag, or the "NIL" tag, were removed from the Brown corpus, and tags with "*" (not) such as "BEZ*" were replaced by the corresponding tags without "*" such as "BEZ".

Table 2, outside-test cell values surviving extraction (row order as extracted, row labels lost): 94.97, 93.02, 94.79, 95.08, 95.05, 94.98.

Figure 3: Results of English tagging. Panels: (a) # of parameters, (b) Inside-test, (c) Outside-test, (d) Inside vs. outside-test in SBO. Curves: ML, AD, SBO.

4.2 Results and evaluation

Table 2 shows the tagging accuracy of the simplest HMM, $\Lambda(C_{(1,0)}, M_{(0,0)})$, for Korean tagging, according to various smoothing methods.⁷ Note that ML denotes a simple smoothing method where ML estimates with probability less than $10^{-9}$ are smoothed and replaced by $10^{-9}$. Because, in the outside-test, AD($\delta = 10^{-2}$) performs better than ML and AD($\delta \ne 10^{-2}$), we use $\delta = 10^{-2}$ in our additive smoothing. According to Table 2, SBO performs well even in the simplest HMM.
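The two simple smoothing methods compared in Table 2 can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the additive estimate assumes that the normalizing sum runs over a closed set of `num_outcomes` possible tags, so that the denominator becomes the history frequency plus delta times the number of tags.

```python
def pr_additive(count, total, num_outcomes, delta=1e-2):
    """Additive smoothing: (Fq + delta) / sum over all outcomes of (Fq + delta)."""
    return (count + delta) / (total + delta * num_outcomes)


def pr_ml_floored(count, total, floor=1e-9):
    """'ML' in the tables: ML estimates below 1e-9 are replaced by 1e-9."""
    p = count / total if total > 0 else 0.0
    return max(p, floor)


if __name__ == "__main__":
    # Toy history seen 10 times, followed by two of 82 possible tags.
    counts = [7, 3] + [0] * 80
    total = sum(counts)
    probs = [pr_additive(c, total, len(counts)) for c in counts]
    print(sum(probs))            # the additive estimates form a distribution
    print(probs[2])              # unseen tags get a small non-zero probability
    print(pr_ml_floored(0, 10))  # floored ML estimate for an unseen event
```

Note that the floored ML estimates no longer sum to one, whereas the additive estimates do; this is one way to read why the text calls ML here a "simple smoothing method".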

Figure 3 illustrates 4 graphs about the results of English tagging: (a) the number of parameters in each model, (b) the accuracy of each model for the training set, (c) the accuracy of each model for the test set, and (d) the accuracy of each model with SBO for both the training and the test set. Here, labels on the x-axis specify models in such a way that $\frac{K,J}{L,I}$ denotes $\Lambda(T_{(K,J)}, W_{(L,I)})$. Therefore, the first 6 models are non-lexicalized models and the others are lexicalized models.

Actually, SBO uses more parameters than the others. The three smoothing methods, ML, AD, and SBO, perform well for the training set since the inside-tests usually have little data sparseness. On the other hand, for the unseen test set, the simple methods, ML and AD, cannot mitigate the data sparseness problem, especially in sophisticated models. However, our method SBO can overcome the problem, as shown in Figure 3(c). Also, we can see in Figure 3(d) that some lexicalized models achieve higher accuracy than non-lexicalized models. We can say that the best lexicalized model, $\Lambda(T_{(1,?)}, W_{(1,1)})$ using SBO (the second index of $T$ is illegible in the source), improved the simple bigram model, $\Lambda(T_{(1,0)}, W_{(0,0)})$ using SBO, from 97.19% to 97.87% (an error reduction ratio of 24.20%). Interestingly, some lexicalized models (such as $\Lambda(T_{(1,1)}, W_{(0,0)})$ and $\Lambda(T_{(1,1)}, W_{(1,0)})$), which have a relatively small number of parameters, perform better than non-lexicalized models in the case of outside-tests using SBO. Unfortunately, we cannot predict the result of outside-tests from that of inside-tests because there is no direct relation between them.

⁷ Inside-test means an experiment on the training set itself and outside-test an experiment on the test set.

Figure 4: Results of Korean tagging. Panels: (a) Outside-test, (b) Considering word-spacing. Curves in (a): ML, AD, SBO.
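The error reduction ratio quoted for English tagging follows from the two accuracies by comparing error rates rather than accuracies. A minimal Python sketch of this arithmetic is given below; the function name `error_reduction_ratio` is a hypothetical label, and the formula (relative decrease of the error rate) is the usual reading of the term, assumed here rather than stated in the paper.

```python
def error_reduction_ratio(baseline_acc, improved_acc):
    """Relative decrease of the error rate, in percent; accuracies are in percent."""
    baseline_err = 100.0 - baseline_acc
    improved_err = 100.0 - improved_acc
    return 100.0 * (baseline_err - improved_err) / baseline_err


if __name__ == "__main__":
    # English: simple bigram model vs. best lexicalized model, both using SBO.
    # Errors drop from 2.81% to 2.13%, i.e. by 0.68 / 2.81 of the baseline error.
    print(round(error_reduction_ratio(97.19, 97.87), 2))
```

Because the baseline error is already small, a gain of 0.68 points of accuracy corresponds to removing roughly a quarter of the remaining errors.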

Figure 4 includes 2 graphs about the results of Korean tagging: (a) the outside accuracy of each model $\Lambda(C_{(K,J)}, M_{(L,I)})$ and (b) the outside accuracy of each model $\Lambda(C_{[s](K,J)}, M_{[s](L,I)})$ with/without considering word-spacing when using SBO. Here, labels on the x-axis specify models in such a way that $\frac{K,J}{L,I}$ denotes $\Lambda(C_{[s](K,J)}, M_{[s](L,I)})$ and, for example, $C_s, M$ in (b) denotes $\Lambda(C_{s(K,J)}, M_{(L,I)})$.

As shown in Figure 4, the simple methods, ML and AD, cannot mitigate the sparse-data problem, but our method SBO can overcome it. Also, some lexicalized models perform better than non-lexicalized models. On the other hand, considering word-spacing sometimes gives good clues to the models, but we cannot yet say what the best way is. From the experimental results, we can say that the best model, $\Lambda(C_{(2,2)}, M_{(2,2)})$ using SBO, improved the previous models, $\Lambda(C_{(1,0)}, M_{(0,0)})$ using ML (Lee, 1995) and $\Lambda(C_{s(1,0)}, M_{(0,0)})$ using ML (Kim et al., 1998), from 94.97% and 95.05% to 96.98% (error reduction ratios of 39.95% and 38.99%), respectively.

5 Conclusion

We have presented uniformly lexicalized HMMs for POS tagging of English and Korean. In the models, data sparseness was effectively mitigated by using our simplified back-off smoothing. From the experiments, we have observed that lexical information is useful for POS tagging in HMMs, as it is in other models, and our lexicalized models improved non-lexicalized models by an error reduction ratio of 24.20% (in English tagging) and 39.95% (in Korean tagging).

Generally, the uniform extension of models requires a rapid increase of parameters, and hence suffers from large storage and sparse data. Recently, in many areas where HMMs are used, many efforts to extend models non-uniformly have been made, sometimes resulting in noticeable improvement. For this reason, we are trying to transform our uniform models into non-uniform models, which may be more effective in terms of both space complexity and reliable estimation of parameters, without loss of accuracy.

References

E. Brill. 1994. Some Advances in Transformation-Based Part of Speech Tagging. In Proc. of the 12th Nat'l Conf. on Artificial Intelligence (AAAI-94), 722-727.

E. Charniak, C. Hendrickson, N. Jacobson, and M. Perkowitz. 1993. Equations for Part-of-Speech Tagging. In Proc. of the 11th Nat'l Conf. on Artificial Intelligence (AAAI-93), 784-789.

S. F. Chen. 1996. Building Probabilistic Models for Natural Language. Doctoral Dissertation, Harvard University, USA.

R. O. Duda and P. E. Hart. 1973. Pattern Classification and Scene Analysis. John Wiley.

G. D. Forney. 1973. The Viterbi Algorithm. In Proc. of the IEEE, 61:268-278.

W. N. Francis and H. Kučera. 1982. Frequency Analysis of English Usage: Lexicon and Grammar. Houghton Mifflin Company, Boston, Massachusetts.

I. J. Good. 1953. "The Population Frequencies of Species and the Estimation of Population Parameters," In Biometrika, 40(3-4):237-264.

S. M. Katz. 1987. Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer. In IEEE Transactions on Acoustics, Speech, and Signal Processing (ASSP), 35(3):400-401.

J.-D. Kim, S.-Z. Lee, and H.-C. Rim. 1998. A Morpheme-Unit POS Tagging Model Considering Word-Spacing. In Proc. of the 10th National Conference on Korean Information Processing, 3-8.

J.-D. Kim, S.-Z. Lee, and H.-C. Rim. 1999. HMM Specialization with Selective Lexicalization. In Proc. of the joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP-VLC-99), 144-148.

J.-H. Kim. 1996. Lexical Disambiguation with Error-Driven Learning. Doctoral Dissertation, Korea Advanced Institute of Science and Technology (KAIST), Korea.

S.-H. Lee. 1995. Korean POS Tagging System Considering Unknown Words. Master Thesis, Korea Advanced Institute of Science and Technology (KAIST), Korea.

S.-Z. Lee, J.-D. Kim, W.-H. Ryu, and H.-C. Rim. 1999. A Part-of-Speech Tagging Model Using Lexical Rules Based on Corpus Statistics. In Proc. of the International Conference on Computer Processing of Oriental Languages (ICCPOL-99), 385-390.

S.-Z. Lee. 1999. New Statistical Models for Automatic POS Tagging. Doctoral Dissertation, Korea University, Korea.

B. Merialdo. 1991. Tagging Text with a Probabilistic Model. In Proc. of the International Conference on Acoustic, Speech and Signal Processing (ICASSP-91), 809-812.

A. Ratnaparkhi. 1996. A Maximum Entropy Model for Part-of-Speech Tagging. In Proc. of the Empirical Methods in Natural Language Processing Conference (EMNLP-96), 133-142.

⁴ Katz said that $d_r = 1$ if $r > 5$.

Table 2, further cell values surviving extraction (row and column assignment lost): 93.92, 95.02, 95.42, 95.55, 95.57, 95.57, 95.57, 94.89, 94.87, 95.25.