Learning a Robust Word Sense Disambiguation Model using Hypernyms in Definition Sentences

Abstract

This paper proposes a method to improve the robustness of a word sense disambiguation (WSD) system for Japanese. Two WSD classifiers are trained from a word sense-tagged corpus: one is a classifier obtained by supervised learning, the other is a classifier using hypernyms extracted from definition sentences in a dictionary. The former will be suitable for the disambiguation of high frequency words, while the latter will be appropriate for low frequency words. A robust WSD system will be constructed by combining these two classifiers. In our experiments, the F-measure and applicability of our proposed method were 3.4% and 10% greater, respectively, compared with a single classifier obtained by supervised learning.

Introduction

Word sense disambiguation (WSD) is the process of selecting the appropriate meaning or sense for a given word in a document. Obviously, WSD is one of the fundamental and important processes needed for many natural language processing (NLP) applications. Over the past decade, many studies have been made on WSD of Japanese. Most current research uses machine learning techniques (Li and Takeuchi, 1997; Murata et al., 2001; Takamura et al., 2001), and has achieved good performance. However, as supervised learning methods require word sense-tagged corpora, they often suffer from data sparseness, i.e., words which do not occur frequently in a training corpus cannot be disambiguated. Therefore, we cannot use supervised learning algorithms alone in practical NLP applications, especially when it is necessary to disambiguate both high frequency and low frequency words.

To tackle this problem, this paper proposes a method to combine two WSD classifiers. One is a classifier obtained by supervised learning. The learning algorithm used for this classifier is the Support Vector Machine (SVM); this classifier will work well for the disambiguation of high frequency words. The second classifier is the Naive Bayes model, which will work well for the disambiguation of low frequency words. In this model, hypernyms extracted from definition sentences in a dictionary are considered in order to overcome data sparseness.

The details of the SVM classifier are described in Section 2, and the Naive Bayes model in Section 3. The combination of these two classifiers is described in Section 4. The experimental evaluation of the proposed method is reported in Section 5. We mention some related works in Section 6, and conclude the paper in Section 7.

SVM Classifier

The first classifier is the SVM classifier. Since SVM is a supervised learning algorithm, a word sense-tagged corpus is required as training data, and the classifier cannot be used to disambiguate words which do not occur frequently in the data. However, as the effectiveness of SVM has been widely reported for a variety of NLP tasks including WSD (Murata et al., 2001; Takamura et al., 2001), we know that it will work well for disambiguation of high frequency words.

When training the SVM classifier, each training instance should be represented by a feature vector. We used the following features, which are typical for WSD.

• S(0), S(−1), S(−2), S(+1), S(+2): Surface forms of a target word and words just before or after a target word. A number in parentheses indicates the position of a word from a target word.
• P(−1), P(−2), P(+1), P(+2): Parts-of-speech (POSs) of words just before or after a target word.
• S(−2)·S(−1), S(+1)·S(+2), S(−1)·S(+1): Pairs of surface forms of words surrounding a target word.
• P(−2)·P(−1), P(+1)·P(+2), P(−1)·P(+1): Pairs of POSs of words surrounding a target word.
• B_sent: Base forms of content words in a sentence (footnote 1).
• C_sent: Semantic classes of content words in a sentence. Semantic classes used here are derived from the Japanese thesaurus "Nihongo-Goi-Taikei" (Ikehara et al., 1997).
• B_head, B_mod: Base forms of the head (B_head) or modifiers (B_mod) of a target word.
• (B_case; B_noun): A pair of the base forms of a case marker (B_case) and a case filler noun (B_noun) when the target word is a verb.
• (B_case; C_noun): A pair of the base form of a case marker (B_case) and the semantic class of a case filler noun (C_noun) when the target word is a verb.
• (B_case; B_verb): A pair of the base forms of a case marker (B_case) and a head verb (B_verb) when the target word is a case filler noun of a certain verb.

We used the LIBSVM package (footnote 2) for training the SVM classifier. The SVM model is ν-SVM (Schölkopf, 2000) with a linear kernel, where the parameter ν = 0.0001. The pairwise method is used to apply SVM to multi-class classification.
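As an illustration of the positional features listed above, the following Python sketch builds the S(i), P(i) and pair features from a tokenized sentence. It is only a minimal sketch: the function name `context_features`, the `<PAD>` symbol for positions outside the sentence, and the string encoding of each feature are assumptions made here, not details given in the paper. The remaining features (B_sent, C_sent, head, modifier and case features) require a parser and a thesaurus and are left out.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (surface form, POS tag)


def context_features(tokens: List[Token], target: int) -> List[str]:
    """Return S(i), P(i) and pair features around tokens[target]."""

    def surface(offset: int) -> str:
        i = target + offset
        return tokens[i][0] if 0 <= i < len(tokens) else "<PAD>"

    def pos(offset: int) -> str:
        i = target + offset
        return tokens[i][1] if 0 <= i < len(tokens) else "<PAD>"

    feats = []
    # Surface forms of the target word and of the two words on each side.
    for k in (0, -1, -2, 1, 2):
        feats.append(f"S({k})={surface(k)}")
    # POS tags of the two words on each side (the target itself is excluded).
    for k in (-1, -2, 1, 2):
        feats.append(f"P({k})={pos(k)}")
    # Pairs of surface forms and pairs of POS tags around the target.
    for a, b in ((-2, -1), (1, 2), (-1, 1)):
        feats.append(f"S({a})·S({b})={surface(a)}·{surface(b)}")
        feats.append(f"P({a})·P({b})={pos(a)}·{pos(b)}")
    return feats
```

Each returned string can then be mapped to one dimension of a sparse binary feature vector before it is passed to an SVM trainer.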

Naive Bayes Classifier using Hypernyms in Definition Sentences

In this section, we will describe the details of the WSD classifier using hypernyms of words extracted from definition sentences in a dictionary.

Figure 1: Sense Set of "漫談"

Overview

Let us explain the basic idea of the model by considering the case in which the word "漫談" (mandan; comic chat) in an example sentence (A) should be disambiguated. In this paper, word senses are defined according to the EDR concept dictionary (EDR, 1995). Figure 1 illustrates two meanings for "漫談" (comic chat) in the EDR concept dictionary. "CID" indicates a concept ID, an identification number of a sense.

One of the ways to disambiguate the senses of "漫談" (comic chat) is to train the WSD classifier from the sense-tagged corpus, as with the SVM classifier. However, when "漫談" (comic chat) occurs infrequently or not at all in the training corpus, we cannot train any reliable classifiers.

To train the WSD classifier for low frequency words, we looked at hypernyms of senses in definition sentences. In Japanese, in most cases the last word in a definition sentence is a hypernym. For example, in Figure 1 the hypernym of sense 3c5631 is the last underlined word "演芸" (engei; entertainment), while the hypernym of 1f66e3 is "話" (hanashi; story). In the EDR concept dictionary, there are also senses whose hypernyms are "演芸" (entertainment) or "話" (story). For example, as shown in Figure 2, 10d9a4, 3c3fbb and 3c5ab3 are senses whose hypernyms are "演芸" (entertainment), while the hypernym of 3cf737, 0f73c1 and 3c3071 is "話" (story). If these senses occur in the training corpus, we can train a classifier that determines whether the hypernym of "漫談" (comic chat) is "演芸" (entertainment) or "話" (story). If we can determine the correct hypernym, we can also determine which is the correct sense, 3c5631 or 1f66e3. Notice that we can train such a model even when "漫談" (comic chat) itself does not occur in the training corpus.

Figure 2: Examples of Senses whose Hypernyms are "演芸" (entertainment) or "話" (story)

Figure 3: Definition Sentence of the Word Sense 0efb60 of "ゲーム"

As described later, we train the probabilistic model that predicts a hypernym of a given word, instead of a word sense. Much more training data will be available to train the model predicting hypernyms than the model predicting senses, because there are fewer types of hypernyms than of senses. Figure 2 illustrates this fact clearly: all words labeled with 10d9a4, 3c3fbb and 3c5ab3 in the training data can be used as the data labeled with the hypernym "演芸" (entertainment). In this way, we can train a reliable WSD classifier for low frequency words. Furthermore, hypernyms will be automatically extracted from definition sentences, as described in Subsection 3.2, so that the model can be automatically trained without human intervention.
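The pooling effect described above can be sketched in a few lines of Python. The sense-to-hypernym table is taken from the discussion of Figures 1 and 2; the function name `relabel_with_hypernyms` and the toy instances used with it are hypothetical and serve only as illustration.

```python
from collections import Counter

# Sense ID -> hypernym, as stated in the discussion of Figures 1 and 2.
HYPERNYM_OF = {
    "3c5631": "演芸", "10d9a4": "演芸", "3c3fbb": "演芸", "3c5ab3": "演芸",
    "1f66e3": "話", "3cf737": "話", "0f73c1": "話", "3c3071": "話",
}


def relabel_with_hypernyms(instances):
    """Turn (features, sense) pairs into (features, hypernym) pairs."""
    return [(feats, HYPERNYM_OF[sense])
            for feats, sense in instances
            if sense in HYPERNYM_OF]


def label_counts(instances):
    """Count how many training instances carry each label."""
    return Counter(label for _, label in instances)
```

With sense labels, each of the senses 10d9a4, 3c3fbb and 3c5ab3 only has its own instances; after relabeling, all of them count as training data for the single label "演芸", which is also the hypernym of sense 3c5631 of "漫談".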

Extraction of Hypernyms

In this subsection, we will describe how to extract hypernyms from definition sentences in a dictionary. In principle, we assume that the hypernym of a sense is the last word in its definition sentence. For example, in the definition sentence of sense 3c5631 of "漫談" (comic chat), the last word "演芸" (entertainment) is the hypernym, as shown in Figure 1. However, we cannot always regard the last word as a hypernym. Let us consider the definition of sense 0efb60 of the word "ゲーム" (gêmu; game). In the EDR concept dictionary, the expression "A/を/表わす/語" (a word representing A) often appears in definition sentences. In this case, the hypernym of the sense is not the last word but A. Thus the hypernym of 0efb60 is not the last word "語" (go; word) but "回数" (kaisuu; number) in Figure 3.

When we extract a hypernym from a definition sentence, the definition sentence is first morphologically analyzed (word segmentation and POS tagging) by ChaSen (footnote 3). Then a hypernym in the definition sentence is identified by pattern matching. An example of the patterns used here is the rule extracting A when the expression "A/を/表わす/語" is found in the definition sentence. We made 64 similar patterns manually in order to extract hypernyms appropriately.

Out of the 194,303 senses of content words in the EDR concept dictionary, hypernyms were extracted for 191,742 senses (98.7%) by our pattern matching algorithm. Furthermore, we chose 100 hypernyms randomly and checked their validity, and found that 96% of the hypernyms were appropriate. Therefore, our method for extracting hypernyms worked well. The major reasons why acquisition of hypernyms failed were lack of patterns and faults in the morphological analysis of definition sentences.
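The extraction rule can be illustrated with a small Python sketch that implements only two cases: the single pattern quoted in the text and the default "last word" rule. The function name `extract_hypernym`, the assumed segmentation of the pattern into the three morphemes を / 表わす / 語, and the example definition sentences are assumptions for illustration; the 64 hand-written patterns themselves are not given in the paper.

```python
from typing import List, Optional

# Tail of the pattern "A/を/表わす/語" (a word representing A).
# The exact morpheme boundaries are an assumption of this sketch.
REPRESENTING_WORD = ["を", "表わす", "語"]


def extract_hypernym(morphemes: List[str]) -> Optional[str]:
    """Pick a hypernym from an already segmented definition sentence."""
    if not morphemes:
        return None
    n = len(REPRESENTING_WORD)
    # Pattern case: "... A を 表わす 語" -> the hypernym is A, not "語".
    if len(morphemes) > n and morphemes[-n:] == REPRESENTING_WORD:
        return morphemes[-n - 1]
    # Default case: the last word of the definition sentence.
    return morphemes[-1]
```

In a fuller version the input would come from a morphological analyzer and the rule list would be tried in order, falling back to the last word only when no pattern matches.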

Naive Bayes Model

We will describe the details of our probabilistic model that considers hypernyms in definition sentences. First of all, let us consider the following probability:

$$P(s, c \mid F) \tag{1}$$

In (1), s is a sense of a target word, c is a hypernym extracted from the definition sentence of s, and F is the set of features representing an input sentence including a target word.

Next, we approximate Equation (1) as (2):

$$P(s, c \mid F) = P(s \mid c, F)\, P(c \mid F) \approx P(s \mid c)\, P(c \mid F) \tag{2}$$

The first term, P(s|c,F), is the probabilistic model that predicts s given a feature set F (and c). It is similar to the ordinary Naive Bayes model for WSD (Pedersen, 2000). However, we assume that this model cannot be trained for low frequency words due to a lack of training data. Therefore, we approximate P(s|c,F) to P(s|c).

Using Bayes' rule, Equation (2) can be computed as follows:

$$P(s \mid c)\, P(c \mid F) = \frac{P(s)\, P(c \mid s)}{P(c)} \cdot \frac{P(c)\, P(F \mid c)}{P(F)} \tag{3}$$

$$= \frac{P(s)\, P(F \mid c)}{P(F)} \tag{4}$$

Notice that P(c|s) in (3) is equal to 1, because a hypernym c of a sense s is uniquely extracted by pattern matching (Subsection 3.2). As all we want to do is to choose an s′ which maximizes (4), P(F) can be eliminated:

$$s' = \operatorname*{argmax}_{s} \frac{P(s)\, P(F \mid c)}{P(F)} \tag{5}$$

$$= \operatorname*{argmax}_{s} P(s)\, P(F \mid c) \tag{6}$$

Finally, by the Naive Bayes assumption, that is, all features in F are conditionally independent, Equation (6) can be approximated as follows:

$$s' = \operatorname*{argmax}_{s} P(s) \prod_{f_i \in F} P(f_i \mid c) \tag{7}$$

In (7), P(s) is the prior probability of a sense s, which reflects statistics of the appearance of senses, while P(f_i|c) is the posterior probability which reflects collocation statistics between an individual feature f_i and a hypernym c. The parameters of these probabilistic models can be estimated from the word sense-tagged corpus. We estimated P(s) by Expected Likelihood Estimation and P(f_i|c) by linear interpolation.

Footnote 3: http://chasen.aist-nara.ac.jp/hiki/ChaSen/

Feature Set

The features used in the Naive Bayes model are almost the same as the ones used in the SVM classifier, except for the following features:

[Features not used in the Naive Bayes model]
• S(−2), S(+2), P(−2), P(+2)
• S(−2)·S(−1), S(+1)·S(+2), S(−1)·S(+1)
• P(−2)·P(−1), P(+1)·P(+2), P(−1)·P(+1)
• C_sent, (B_case; C_noun)

According to the preliminary experiment, the accuracy of the Naive Bayes model slightly decreased when all features in the SVM classifier were used. This was the reason why we did not use the above features.
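To make Equation (7) concrete, here is a minimal Python sketch of a classifier that scores each candidate sense by log P(s) plus the sum of log P(f_i|c), where c is the hypernym of s. Several details are assumptions of this sketch rather than the paper's settings: the class name `HypernymNaiveBayes`, the choice of the number of senses in the table as the denominator term of the Expected Likelihood Estimate, the interpolation of P(f_i|c) with an add-one smoothed background distribution over features, and the weight `lam=0.9`. The paper states only that linear interpolation was used.

```python
import math
from collections import Counter, defaultdict


class HypernymNaiveBayes:
    """Toy version of Equation (7): argmax_s P(s) * prod_i P(f_i | c(s))."""

    def __init__(self, hypernym_of, lam=0.9):
        self.hypernym_of = hypernym_of          # sense ID -> hypernym
        self.lam = lam                          # interpolation weight (assumed)
        self.sense_count = Counter()            # sense -> training instances
        self.feat_count = defaultdict(Counter)  # hypernym -> feature -> count
        self.feat_total = Counter()             # hypernym -> total feature count
        self.global_feat = Counter()            # feature -> count over all data
        self.n_instances = 0
        self.n_feats = 0

    def train(self, instances):
        """instances: iterable of (feature list, sense ID) pairs."""
        for feats, sense in instances:
            c = self.hypernym_of[sense]
            self.sense_count[sense] += 1
            self.n_instances += 1
            for f in feats:
                self.feat_count[c][f] += 1
                self.feat_total[c] += 1
                self.global_feat[f] += 1
                self.n_feats += 1

    def log_prior(self, sense):
        # Expected Likelihood Estimation: add 0.5 to every sense count.
        n_senses = len(self.hypernym_of)
        return math.log((self.sense_count[sense] + 0.5)
                        / (self.n_instances + 0.5 * n_senses))

    def log_likelihood(self, feat, c):
        # Linear interpolation of the hypernym-specific estimate with a
        # smoothed background estimate (the background model is assumed).
        total_c = self.feat_total[c]
        p_c = self.feat_count[c][feat] / total_c if total_c else 0.0
        p_bg = (self.global_feat[feat] + 1) / (self.n_feats + len(self.global_feat) + 1)
        return math.log(self.lam * p_c + (1 - self.lam) * p_bg)

    def predict(self, feats, candidate_senses):
        def score(s):
            c = self.hypernym_of[s]
            return self.log_prior(s) + sum(self.log_likelihood(f, c) for f in feats)
        return max(candidate_senses, key=score)
```

Because the likelihood is conditioned on the hypernym and not on the sense, the model can rank the senses of a word that never occurs in the training data, as long as other senses sharing its hypernyms do occur.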

Discussion

The following discussion examines our method for extracting hypernyms from definition sentences.

Multiple Hypernyms

In general, two or more hypernyms can be extracted from a definition sentence, when the definition of a sense consists of several sentences or a definition sentence contains a coordinate structure. However, for this work we extracted only one hypernym for a sense, because definitions of all senses in the EDR concept dictionary are described by a single sentence, and most of them contain no coordinate structure. In order to apply our model for multiple hypernyms, we must consider the probabilistic model P(s,C|F) instead of Equation (1), where C is a set of hypernyms. Unfortunately, the estimation of P(s,C|F) is not obvious, so the investigation of this will be done in future.

Ambiguity of hypernyms

The fact that hypernyms may have several meanings does not appear to be a major problem, because most hypernyms in definition sentences of a certain dictionary have a single meaning according to our rough observation. So for this work we ignored the possible ambiguity of hypernyms.

Using other dictionaries

As described in Subsection 3.2, hypernyms are extracted by pattern matching. We would have to rebuild these patterns when we use other dictionaries, but we do not expect it to require too much labor. Generally, in Japanese the last word in a definition sentence can be regarded as a hypernym. Furthermore, many patterns for the EDR concept dictionary may also be applicable for extraction from other dictionaries. We are already building patterns to extract hypernyms from the other major Japanese dictionary, the Iwanami Kokugo Jiten, and developing the WSD system that will use them.

Footnote 1: We tried using the special symbol "NUM" as a feature for any numbers in a sentence, but the performance was slightly worse in our experiment. We thank the anonymous reviewer who gave us the comment about this.

Footnote 2: http://www.csie.ntu.edu.tw/%7Ecjlin/libsvm/

Footnote 3: ChaSen is the Japanese morphological analyzer.

Combined Model

The details of the two WSD classifiers are described in the previous two sections: one is the SVM classifier for high frequency words, and the other is the Naive Bayes classifier for low frequency words. These two classifiers are combined to construct the robust WSD system. We developed two kinds of combined models, described below in Subsections 4.1 and 4.2.

Simple Ensemble

In this model, the process of combining the two classifiers is quite simple. When only one of the classifiers, SVM or Naive Bayes, outputs senses for a given word, the combined model outputs the senses provided by that classifier. When both classifiers output senses, the ones provided by the SVM classifier are always chosen for the final output.

In the experiment in Section 5, SVM classifiers were trained for words which occur more than 20 times in the training corpus. Therefore, the simple ensemble described here is summarized as follows: we use the SVM classifier for high frequency words, those which occur more than 20 times, and the Naive Bayes classifier for the low frequency words.
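The decision rule of the simple ensemble fits in a few lines of Python. This is a sketch under one assumption: a classifier that cannot handle a word is represented by an empty list of senses. The function name `simple_ensemble` is hypothetical.

```python
from typing import List


def simple_ensemble(svm_senses: List[str], nb_senses: List[str]) -> List[str]:
    """Prefer the SVM output; fall back to Naive Bayes when SVM is silent."""
    if svm_senses:
        return svm_senses
    # Either Naive Bayes answers alone, or neither classifier answers
    # and the (empty) Naive Bayes output is returned.
    return nb_senses
```

Since the SVM classifiers only exist for words seen more than 20 times in training, this rule effectively routes high frequency words to SVM and the rest to Naive Bayes.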

Ensemble using Validation Data

First, we prepare validation data, which is a sense-tagged corpus, as common test data for the classifiers. The performance of the classifiers for a word w is evaluated by correctness C_w, defined by (8).

$$C_w = \frac{\#\text{ of words in which one of the senses selected by a classifier is correct}}{\#\text{ of words for which a classifier selects one or more senses}} \tag{8}$$

The main reason for combining two classifiers is to improve the recall and applicability of the WSD system. Note that a classifier which often outputs a correct sense would achieve high correctness C_w, even though it also outputs wrong senses. Thus, the higher the C_w of a classifier, the more it improves the recall of the combined model.

Next, the correctness C_w of each classifier for each word w is measured on the validation data. When two classifiers output senses for a given word, their C_w scores are compared. Then, the word senses provided by the better classifier are selected as the final outputs. When the number of words in the validation data is small, comparison of the classifiers' C_w is unreliable. For that reason, when the number of words in the validation data is less than a certain threshold O_h, the sense output by the SVM classifier is chosen for the final output. This is because the correctness for all words in the validation data is higher for the SVM classifier than for the Naive Bayes classifier. In the experiment in Section 5, we set O_h to 10.
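The validation-based combination can be sketched as follows in Python. The function names, the representation of a classifier output as a set or list of senses (empty when the classifier gives no answer), and the tie-breaking in favour of SVM when both C_w values are equal are assumptions of this sketch; the default threshold of 10 is the value of O_h reported in the text.

```python
from typing import List, Sequence, Set


def correctness(outputs: Sequence[Set[str]], gold: Sequence[str]) -> float:
    """C_w of Equation (8) over the validation instances of one word w."""
    answered = [(out, g) for out, g in zip(outputs, gold) if out]
    if not answered:
        return 0.0
    hits = sum(1 for out, g in answered if g in out)
    return hits / len(answered)


def choose_output(svm_senses: List[str], nb_senses: List[str],
                  c_svm: float, c_nb: float,
                  n_validation: int, threshold: int = 10) -> List[str]:
    """Select the final senses for one instance of word w."""
    if not svm_senses:
        return nb_senses
    if not nb_senses:
        return svm_senses
    # Too few validation instances of w: C_w is unreliable, trust SVM.
    if n_validation < threshold:
        return svm_senses
    # Ties go to SVM (an assumption; the text does not discuss ties).
    return svm_senses if c_svm >= c_nb else nb_senses
```

Here `c_svm` and `c_nb` would be computed once per word with `correctness` on the validation data and then reused for every test instance of that word.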

Experiment

In this section, we will describe the experiment to evaluate our proposed method. We used the EDR corpus (EDR, 1995) in the experiment. It is made up of about 200,000 Japanese sentences extracted from newspaper articles and magazines. In the EDR corpus, each word was annotated with a sense ID (CID). We used 20,000 sentences in the EDR corpus as the test data, 20,000 sentences as the validation data, and the remaining 161,332 sentences as the training data. The training data was used to train the SVM classifier and the Naive Bayes classifier, while the validation data was used for the combined model described in Subsection 4.2. The target instances used for evaluation were all ambiguous content words in the test data; the number of target instances was 91,986.

We evaluated three single WSD classifiers and two combined models:

• BL

The baseline model. This is the WSD classifier which always selects the most frequently used sense. When there is more than one sense with equally high frequency, the classifier chooses all those senses.

• NB

The Naive Bayes classifier (Section 3).

• SVM

The SVM classifier (Section 2).

• SVM+NB(simple)

The combined model by simple ensemble (Subsection 4.1).

• SVM+NB(valid)

The combined model using the validation data (Subsection 4.2).

Table 1 reveals the precision (P), recall (R), F-measure (F), applicability (A) and number of word types (T) of these five classifiers on the test data. A (applicability) indicates the ratio of the number of instances disambiguated by a classifier to the total number of target instances; T indicates the number of word types which could be disambiguated by a classifier.

Table 1: Results of WSD Classifiers

The two combined models outperformed the SVM classifier for all criteria except precision. The gains in recall and applicability were especially remarkable. Notice the figures in column "T" in Table 1: the SVM classifiers could be applied only to 4,575 words, while the Naive Bayes classifiers were applicable to 10,501 words, including the low frequency words. Thus, the ensemble of these two classifiers would significantly improve applicability and recall with little loss of precision.

Comparing the performance of the two combined models, "SVM+NB(valid)" slightly outperformed "SVM+NB(simple)", but there was no significant difference between them. The correctness, C_w, of the SVM classifier on the validation data was usually greater than that of the Naive Bayes classifier, so the SVM classifier was preferred when both were applicable. This was almost the same strategy as the simple ensemble, and we think this was the reason why the performance of the two combined models was almost the same. In the rest of this section, we will show the results for the combined model using the validation data only.

Our goal was to improve the robustness of the WSD system. The naive way to construct a robust WSD system is to create an ensemble of a supervised learned classifier and a baseline classifier. So, we compared our proposed method (SVM+NB) with the combined model of the SVM and baseline classifiers (SVM+BL). The results are shown in Table 2 and Figure 4. The criteria in Table 2 are the same as in Table 1, indicating that "SVM+NB" outperformed "SVM+BL" for all criteria.

Table 2: Results of the Combined Models (1)

Figure 4: Results of the Combined Models (2) (vertical axes: F(o) and N(o))

Figure 4 shows the relation between the F-measure of classifiers and word frequency in the training data. The horizontal axis indicates the occurrence of words in the training data (o) in log scale. Squares and triangles with lines indicate the F(o) of "SVM+NB" and "SVM+BL", respectively, where F(o) is the macro average of F-measures for words which occur o times in the training data. The broken line indicates N(o), the number of word types which occur o times (footnote 5). For convenience, when o = 0, we plot F(0) and N(0) at x = −0.5 instead of −∞ (= log10 0). As shown in Figure 4, "SVM+NB" significantly outperformed "SVM+BL" for low frequency words, and the number of word types (N(o)) obviously became greater when o was small. In other words, the proposed Naive Bayes classifier could handle many more low frequency words than the baseline classifier. Therefore, it was probably more effective to combine the Naive Bayes classifier with the SVM classifier rather than the baseline classifier in order to improve the robustness of the overall WSD system.

Finally, we constructed a combined model of all three classifiers, the SVM, Naive Bayes and baseline classifiers. As shown in Table 3, this model slightly outperformed the two-classifier combined models shown in Table 2.

Table 3: Results of SVM+NB+BL
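The criteria reported in Tables 1 to 3 can be computed as in the Python sketch below. Applicability follows the definition given in the text. The definitions of precision (correct answers over disambiguated instances) and recall (correct answers over all target instances) are the usual ones and are assumed here, as is the simplification that a classifier outputs at most one sense per instance (`None` when it gives no answer); the paper does not spell these out.

```python
from typing import Dict, Optional, Sequence


def evaluate(predictions: Sequence[Optional[str]],
             gold: Sequence[str]) -> Dict[str, float]:
    """Compute P, R, F and A for one classifier on the target instances."""
    total = len(gold)
    answered = sum(1 for p in predictions if p is not None)
    correct = sum(1 for p, g in zip(predictions, gold)
                  if p is not None and p == g)
    precision = correct / answered if answered else 0.0
    recall = correct / total if total else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    applicability = answered / total if total else 0.0
    return {"P": precision, "R": recall, "F": f_measure, "A": applicability}
```

Under these definitions a classifier that answers more instances raises A and, if its extra answers are mostly right, raises R with only a small drop in P, which is the trade-off discussed for the combined models.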

Footnote 5: To be more accurate, F(o) and N(o) are the figures for words which occurred o times or more and fewer than o + t times, where o + t is the next point on the horizontal axis. t was chosen as the smallest integer so that N(o) would be more than 100.

Related Work

As described in Section 1, the goal of this project was to improve the robustness of the WSD system. One of the promising ways to construct a robust WSD system is unsupervised learning with the EM algorithm (Manning and Schütze, 1999), i.e. training a WSD classifier from an unlabeled data set. On the other hand, our approach is to use a machine readable dictionary in addition to a corpus as knowledge resources for WSD. Notice that we used hypernyms of definition sentences in a dictionary to train the Naive Bayes classifier, and this process worked well for words which did not occur frequently in the corpus. However, we did not compare our method and the unsupervised learning method empirically. This will be one of our future projects.

Using hypernyms of definition sentences is similar to using semantic classes derived from a thesaurus. One of the advantages of our method is that a thesaurus is not obligatory when word senses are defined according to a machine readable dictionary. Furthermore, our method is to train the probabilistic model that predicts a hypernym of a word, while most previous approaches use semantic classes as features (i.e., the condition of the posterior probability in the case of the Naive Bayes model). In fact, we also use features associated with semantic classes derived from the thesaurus, C_sent and (B_case; C_noun), as described in Section 2.

Several previous studies have used both a corpus and a machine readable dictionary for WSD (Litkowski, 2002; Rigau et al., 1997; Stevenson and Wilks, 2001). The difference between those methods and ours is the way we use information derived from the dictionary for WSD. Training the probabilistic model that predicts a hypernym in the dictionary is our own approach. However, these various methods are not in competition with our method. In fact, the robustness of the WSD system would be even more improved by combining these methods with that described in this paper.

Conclusion

This paper has proposed a method to develop a robust WSD system. We combined a WSD classifier obtained by supervised learning for high frequency words and a classifier using hypernyms in definition sentences in a dictionary for low frequency words. Experimental results showed that both recall and applicability were remarkably improved with our method. In future, we plan to investigate the optimum way to combine these two classifiers, or to train a single probabilistic model using hypernyms in definition sentences, which is suitable for both high and low frequency words.

References

EDR. 1995. EDR electronic dictionary technical guide (second edition). Technical Report TR–045, Japan Electronic Dictionary Research Institute.

Satoshi Ikehara et al. 1997. Nihongo Goi Taikei (in Japanese). Iwanami Shoten, Publishers.

Hang Li and Jun-ichi Takeuchi. 1997. Using evidence that is both strong and reliable in Japanese homograph disambiguation. In Information Processing Society of Japan SIG-NL, pages 53–59.

Kenneth C. Litkowski. 2002. Sense information for disambiguation: Confluence of supervised and unsupervised methods. In Proceedings of the SIGLEX/SENSEVAL Workshop on Word Sense Disambiguation, pages 47–53.

Christopher D. Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing, chapter 7. MIT Press.

Masaki Murata, Masao Utiyama, Kiyotaka Uchimoto, Qing Ma, and Hitoshi Isahara. 2001. Japanese word sense disambiguation using the simple bayes and support vector machine methods. In Proceedings of the SENSEVAL-2, pages 135–138.

Ted Pedersen. 2000. A simple approach to building ensembles of naive Bayesian classifiers for word sense disambiguation. In Proceedings of the NAACL, pages 63–69.

German Rigau, Jordi Atserias, and Eneko Agirre. 1997. Combining unsupervised lexical knowledge methods for word sense disambiguation. In Proceedings of the ACL, pages 48–55.

Bernhard Schölkopf. 2000. New support vector algorithms. Neural Computation, 12:1083–1121.

Mark Stevenson and Yorick Wilks. 2001. The interaction of knowledge sources in word sense disambiguation. Computational Linguistics, 27(3):321–349.

Hiroya Takamura, Hiroyasu Yamada, Taku Kudoh, Kaoru Yamamoto, and Yuji Matsumoto. 2001. Ensembling based on feature space restructuring with application to WSD. In Proceedings of the NLPRS, pages 41–48.