Accessor Variety Criteria for Chinese Word Extraction

Haodi Feng ∗
Shandong University, City University of Hong Kong

Kang Chen †
Tsinghua University

Xiaotie Deng ‡
City University of Hong Kong

Weimin Zheng †
Tsinghua University

We are interested in the problem of word extraction from Chinese text collections. We define a ... and have specific meanings. We start with the viewpoint that a word is a distinguished linguistic entity that can be used in many different language environments. We consider the characters that are directly before a string (predecessors) and the characters that are directly after a string (successors) as important factors for determining the independence of the string. We call such characters accessors of the string, consider the number of distinct predecessors and successors of a string in a large corpus (TREC 5 and TREC 6 documents), and use them as the measurement of the context independency of a string from the rest of the sentences in the document. Our experiments confirm our hypothesis and show that this simple rule gives quite good results for Chinese word extraction and is comparable to, and for long words outperforms, other iterative methods.

Introduction

Words are the basic linguistic units of natural language processing. The importance of word extraction is stressed in many papers. According to Huang, Chen, and Tsou (1996), the word is the basic unit in natural language processing (NLP), as it is at the lexical level where all the modules interface. Possible modules involved are the lexicon, speech recognition, syntactic parsing, speech synthesis, semantic interpretation, and so on. Thus, the identification of lexical words and/or the delimitation of words in running texts is a prerequisite of NLP. Teahan et al. (2000) state that interpreting a text as a sequence of words is beneficial for some information retrieval and storage tasks: for example, full-text searches, word-based compression, and key-phrase extraction. According to Guo (1997), words and tokens are the primary building blocks in almost all linguistic theories and language-processing systems, including Japanese (Kobayasi, Tokumaga, and Tanaka 1994), Korean (Yun, Lee, and Rim 1995), German (Pachunke et al. 1992), and English (Garside, Leech, and Sampson 1987), in various media, such as continuous speech and cursive handwriting, and in numerous applications, such as translation, recognition, indexing, and proofreading. The identification of words in natural language is nontrivial since, as observed by Chao (1968), linguistic words often represent a different set than do sociological words.

∗ School of Computer Science and Technology, Jinan, PRC; Department of Computer Science, Tat Chee Avenue, Kowloon, Hong Kong. E-mail: fenghd@cs.cityu.edu.hk or fenghaodi@hotmail.com.
† Department of Computer Science and Technology, Peking, PR China. E-mail: {ck99,zwm-dcs}@mails.tsinghua.edu.cn.
‡ Department of Computer Science, Tat Chee Avenue, Kowloon, Hong Kong. E-mail: csdeng@cityu.edu.hk.
© 2004 Association for Computational Linguistics. Computational Linguistics Volume 30, Number 1

Chinese texts are character based, not word based. Each Chinese character stands for one phonological syllable and in most cases represents a morpheme. This presents a problem, as less than 10% of the word types (and less than 50% of the tokens) in a Chinese text are composed of a single character (Chen et al. 1993). However, Chinese texts, and texts in some other Oriental languages such as Japanese, do not have delimiters such as spaces to mark the boundaries of meaningful words. Even in English text, some phrases consist of several words. However, the problem in English is not as dominant a factor as in Chinese. How to extract words from Chinese texts is still an interesting problem.

Note that word extraction is different from the very closely related problem of sentence segmentation. Word extraction aims to collect all of the meaningful strings in a text. Sentence segmentation partitions a sentence into several consecutive meaningful segments. Word extraction should be easier than sentence segmentation, and the problems involved in it can be solved using simpler methods.

Some Chinese information-retrieval systems operate at the character level instead of the word level, for example, the Csmart system (Chien 1995). However, to further improve the efficiency of Chinese processing, it is commonly thought to be important to apply studies from linguistics (Kwok 1997). Lexicon construction is considered to be one of the most important tasks. Single Chinese characters can quite often carry different meanings. This ambiguity can be resolved when the characters are combined with other characters to form a word. Chinese words can be unigrams, bigrams, trigrams, or n-grams, where n > 3. According to the Frequency Dictionary of Modern Chinese (Beijing Language Institute 1986), among the 9,000 most frequent Chinese words, 26.7% are unigrams, 69.8% are bigrams, 2.7% are trigrams, 0.007% are four-grams, and 0.002% are five-grams.

Lexicons are probably effective for identifying the most frequent words. However, there are some (and sometimes more) less-frequent words. Weeber, Vos, and Baayen (2000) recently extracted side-effect-related terms in a medical-information extraction system and found that many of the terms had a frequency of less than five. This indicates that low-frequency words may carry very important information. Our experiments show that we can also extract low-frequency words using a simple method without overly degrading the precision.

There are generally two directions in which words can be formed (Huang, Chen, and Tsou 1996). One is the deductive strategy, whereby words are identified through the segmentation of running texts. The other is the inductive strategy, which identifies words through the compositional process of morpho-lexical rules. This strategy represents words with common characteristics (e.g., numeric compounds) by rules. In Chinese text segmentation there are three basic approaches (Sproat et al. 1996): pure heuristic, pure statistical, and a hybrid of the two. The heuristic approach identifies words by applying prior knowledge or morpho-lexical rules governing the derivation of new words. The statistical approach identifies words based on the distribution of their components in a large corpus.

Sproat and Shih (1990) develop a purely statistical method that utilizes the mutual information between two characters: $I(x,y) = \log \frac{p(x,y)}{p(x)\,p(y)}$; the limitation of the method is that it can deal only with words of length two characters. Ge, Pratt, and Smyth (1999) introduce a simple probabilistic model that is based on the occurrence probability of the words that constitute a set of predefined assumptions. Chien (1997) develops a PAT-tree-based method that extracts significant words by observing mutual information of two overlapped patterns with the significance function $SE_c = \frac{\Pr(c)}{\Pr(a) + \Pr(b) - \Pr(c)}$, where a and b are the two biggest substrings of string c. Zhang, Gao, and Zhou (2000) propose the application of a statistical method that is based on context dependence and mutual information. Yamamoto and Church (2001) experiment with both mutual information and residual inverse document frequency (RIDF) as criteria for deciding Japanese words, and their main contribution is in affording a reduced method for computing term frequency and document frequency. In almost all of the work cited to this point, the dimension that is used to compute mutual information is term frequency.

Chen and Bai (1998) propose a corpus-based learning approach that learns grammatical rules and automatically evaluates them. Chang and Su (1997) use an unsupervised Viterbi training process to select potential unknown words and iteratively truncate unlikely unknown words in the augmented dictionary. Teahan et al. (2000) propose a compression-based algorithm for Chinese text segmentation. Paola and Stevenson (2001) demonstrate an effective combination of deeper linguistic knowledge with the robustness and scalability of a statistical technique to derive knowledge about thematic relations for verb classification. Mo et al. (1996) deal with the identification of the determinative-measure compounds in parsing Mandarin Chinese by developing grammatical rules to combine determinators and measures.

We introduce another concept, accessor variety (AV) (for a detailed definition, refer to subsection 3.1), to describe the extent to which a string is likely to be a meaningful word. Actually, Harris (1970) uses similar criteria to determine English morpheme boundaries, and our work is partially motivated by his success. We first discard those strings with accessor varieties that are smaller than a certain number (called the threshold; see subsequent discussion). The remaining strings are considered to be potentially meaningful words. In addition, we apply rules to remove strings that consist of a word and adhesive characters (clarified in subsection 3.2). Our experiment shows that even for small thresholds, quite good results can be obtained.

In Section 2, we introduce examples of unknown words, the identification of which is the task of our work. In Section 3, we discuss our method. In Section 4, we present experimental results. We conclude our work with a discussion and a comparison to previous results in Section 5. In Section 6, we list some future work that can be pursued following the concept of AV. We note that although our method is quite simple, it is marginally better than previous comparable results. This method distinguishes itself from statistically based approaches and grammatical rules. Moreover, because of its simplicity, our method can easily be used in computer-based applications. Variations of it and its combination with innovative statistical and grammatical methods are worthy of further exploration.
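As a rough illustration of two of the statistical criteria cited in this section, the sketch below computes the mutual information between adjacent characters and the significance function $SE_c$. The function names, the toy input, and the use of relative frequencies as probability estimates are assumptions made for illustration only; they are not taken from the cited papers.

```python
import math
from collections import Counter


def mutual_information(text):
    """Return I(x, y) = log(p(x, y) / (p(x) * p(y))) for every adjacent pair.

    Assumption: probabilities are estimated by relative frequencies
    of single characters and of adjacent character pairs in `text`.
    """
    unigrams = Counter(text)
    bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for pair, count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[pair[0]] / n_uni
        p_y = unigrams[pair[1]] / n_uni
        scores[pair] = math.log(p_xy / (p_x * p_y))
    return scores


def significance(pr_c, pr_a, pr_b):
    """Return SE_c = Pr(c) / (Pr(a) + Pr(b) - Pr(c)).

    `pr_a` and `pr_b` are the probabilities of the two biggest
    substrings a and b of the string c.
    """
    return pr_c / (pr_a + pr_b - pr_c)


if __name__ == "__main__":
    for pair, score in sorted(mutual_information("ababab").items()):
        print(pair, round(score, 3))
    print(significance(0.1, 0.2, 0.3))
```

A pair whose characters co-occur more often than chance receives a positive score, and $SE_c$ approaches one when c occurs almost as often as its two substrings. Both criteria depend on frequency, which is the dimension that accessor variety is meant to replace.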

Unknown Words


Proposed Approach

One of the important parameters that is employed in statistical methods for automatic Chinese word extraction is word frequency. Equivalent frequencies, such as character frequency, term frequency, and document frequency, are used analogously. Algorithms based on these frequencies are used to measure how likely it is that a particular string of characters is a meaningful word, according to the belief that “when a string is repeated many times, it must carry a meaning.” However, in this article, we use not frequency, but accessor variety. This can be explained as “when a string appears under different linguistic environments, it may carry a meaning.” We introduce the new concept accessor variety as a criterion for identifying meaningful Chinese words.
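The idea of counting linguistic environments can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes that the accessor variety of a string is the smaller of its number of distinct predecessors and its number of distinct successors, that all sentence starts and all sentence ends are each treated as a single marker (`<S>` and `<E>`, both hypothetical names), and that candidates are kept when their value reaches a threshold. The rules for adhesive characters are not modelled.

```python
from collections import defaultdict


def accessor_variety(sentences, max_len=4):
    """Map every substring (up to max_len characters) to its accessor variety.

    Assumption: AV(s) = min(#distinct predecessors, #distinct successors).
    """
    predecessors = defaultdict(set)
    successors = defaultdict(set)
    for sentence in sentences:
        n = len(sentence)
        for start in range(n):
            for end in range(start + 1, min(start + max_len, n) + 1):
                s = sentence[start:end]
                # Character directly before the string, or a start marker.
                predecessors[s].add(sentence[start - 1] if start > 0 else "<S>")
                # Character directly after the string, or an end marker.
                successors[s].add(sentence[end] if end < n else "<E>")
    return {s: min(len(predecessors[s]), len(successors[s])) for s in predecessors}


def extract_candidates(sentences, threshold=3, min_len=2, max_len=4):
    """Keep strings whose accessor variety is at least `threshold`."""
    av = accessor_variety(sentences, max_len)
    return {s for s, value in av.items() if len(s) >= min_len and value >= threshold}


if __name__ == "__main__":
    toy = ["abcd", "xbcy", "zbcw"]
    print(accessor_variety(toy)["bc"])
    print(extract_candidates(toy, threshold=3))
```

In the toy input, the string `bc` is preceded by three different characters and followed by three different characters, so it survives a threshold of three, whereas `ab` appears in only one environment and is discarded however often it might be repeated there.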

Numeric-Type Compounds

Experimental Results

4.2 Experiments on the Small Corpus

As noted previously, the small corpus contained approximately 1.7 MB of data from Xinhua news. We processed all the strings in the corpus with lengths one to ten ...

Table 1 Some of the words extracted from the small corpus.

Economy of GuangDong and Hong Kong; new region of PuDong; Sihanouk (name); Italian Team; nature protection region; Administration Committee of PLO; UNESCO; Association of Relations Across the Taiwan Straits (ARATS); GuangDa International Trust Investment Company; YiZheng Chemical Fibre United Company; Ocean Petroleum Parent Company of China; XiaoLangDi Irrigation Hinge Project; French Open Tennis; Hong Kong Special Administration Region; United Nations Security Council; Innovation of the Economy System; Asia Development Bank; most-favoured-nation clause; Christopher (name); Preparing Committee; Mandela (name); UBA Championship Cup

... of the extracted words that were not contained ...

Table 2 Experiments on the threshold–precision relationship of the small corpus. Columns: Threshold, Precision, Number of words.

Table 3 Experiments on the word length–precision relationship of the small corpus. Columns: Word length, Precision, Number of words.

Table 4 Experiments on the threshold–partial recall relationship of the small corpus. Columns: Threshold, Partial recall, Number of words.

We checked only a small portion (a random sample of approximately 2,000 sentences) of the total corpus. We used this sample to find meaningful words by hand. The result of automatic extraction from the whole corpus was then compared with that of hand extraction of words in the sample sentences. The partial recall was computed as the number of words in both sets divided by the number of words in the human extraction set. We list the experimental partial-recall values in Table 4. We analyzed the missed words for the instance with the threshold of two. Some of the words were missed because they occurred only once, which was less than the threshold. Some were missed because they occurred only in very restricted environments. This means that although they appeared more than once in the corpus, their accessor variety value was only one. In the latter case, we could extract the strings that contained such strings as substrings. The details are discussed in the section on error analysis.

Table 5 Some words extracted from the large corpus.

Huang Yanping (Chinese name); desolate road; Bi Tong Ling (name of a Chinese medicine); goose feather fan; send love by swan goose; MaGu offers birthday present; beloved hometown; right of independent management; the Peking Technology Museum; the Institute of East China; fake and bad merchandise; the Peking Sunday newspaper; penmanship; temple fair; photo; socialistic modernization

4.3 Experiments on the Large Corpus

The corpus that was used for these experiments was the TREC Chinese corpus (Linguistic Data Consortium, n.d.), which contains 160,000 articles, including articles that were published in People’s Daily from 1991 to 1993 and a portion of the news articles released by the Xinhua News Agency in 1994 and 1995. In the experiment, we extracted words with lengths of two to seven characters. The data contained approximately 7,000,000 sentences. We first eliminated the non-Chinese characters. All of the experiments that were carried out on the small corpus were also conducted on the large corpus.

In Table 5 we show some words that were extracted from the large corpus. Notice that these legitimate words cannot be found in the system dictionary or in the list that was extracted from the small corpus. In Table 6, we show the overall precision performance. The performance trends that were observed in Table 2 can also be observed here. However, as this corpus is much larger than the previous one, many characters have the chance to occur together to form spurious words. That is why the precision is now much lower than that for the small corpus. Nevertheless, the larger corpus, in which a correct word can occur in many more environments than in the small corpus, suggests that we can improve the precision by using a large threshold for the accessor variety value without overly degrading the partial recall. For example, when the threshold is set to nine, the precision is as high as 73.4% and the partial recall remains as high as 80.4%.

Table 6 Experiments on the threshold–precision relationship of the large corpus. Columns: Threshold, Precision, Number of words.

Table 7 Experiments on the word length–precision relationship of the large corpus with threshold three. Columns: Word length, Precision, Partial recall, Number of words.

Table 8 Experiments on the word length–precision relationship of the large corpus with threshold nine. Columns: Word length, Precision, Partial recall, Number of words.

The precision and partial-recall performance with respect to the word length was also tested on the large corpus. The same sample was used, and the results for thresholds three and nine are shown in Tables 7 and 8, respectively. Note that there is a great jump in the precision for word lengths two and three after we change the threshold from three to nine, but the partial recall does not change much. For longer words, the method performs well even with threshold three.

The next experiment was intended to test the partial-recall performance of all the words with lengths from two to seven. The result is shown in Table 9, which indicates that the partial-recall value is satisfactory even with a large threshold. This means that we can extract most of the words in the corpus.

Table 9 Experiments on the threshold–partial recall relationship of the large corpus. Columns: Threshold, Partial recall, Number of words.

4.4 Experiments on Numeric-Type Compounds

In this section, we consider numeric-type compounds. Some of the compounds of this type that were extracted from the large corpus are listed in Table 10.

Table 10 Numeric-type compounds extracted.

March 2nd; May Fourth, 1992; first time; the two sides of the Strait; relationship between two countries; thirty Kilograms or so; one hundred Hong Kong dollars; two hundred ounces; forty thousand mu

4.5 Error Analysis

Two kinds of errors occurred: the extraction of meaningless strings as meaningful words and the neglect of meaningful words. Some errors of the two types are listed below.

4.5.1 Meaningless Strings Extracted. A number of meaningless strings were extracted: for example, the Republic of Bosnia and ... come ‘city ...

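Table 2 (small corpus)
Threshold         3        4        5        6        7        8        9
Precision         83.8%    89.6%    94.1%    96.8%    97.4%    97.3%    97.7%
Number of words   14,468   8,648    6,147    4,757    3,800    3,162    2,734

Table 3 (small corpus)
Word length       3        4        5        6        7        8        9
Precision         56.6%    91.4%    85.1%    90.4%    89.4%    90.1%    80.3%
Number of words   2,532    3,417    712      493      180      111      61

Table 4 (small corpus)
Threshold         3        4        5        6        7        8        9
Partial recall    66.5%    59.0%    54.3%    50.3%    47.1%    44.0%    41.5%
Number of words   14,468   8,648    6,147    4,757    3,800    3,162    2,734

Table 6 (large corpus)
Threshold         3           4         5         6         7         8         9
Precision         58.3%       69.0%     70.3%     70.4%     73.8%     74.2%     73.4%
Number of words   1,269,378   788,964   562,407   432,830   349,511   291,688   249,904

Table 7 (large corpus, threshold three)
Word length       3         4         5         6         7
Precision         22.9%     68.9%     67.0%     76.0%     70.7%
Partial recall    83.5%     80.9%     83.3%     81.6%     64.3%
Number of words   335,557   360,413   141,153   123,392   42,836

Table 8 (large corpus, threshold nine)
Word length       3        4        5        6        7
Precision         52.7%    74.6%    75.0%    86.9%    89.4%
Partial recall    73.0%    70.2%    63.9%    63.2%    42.9%
Number of words   55,015   78,868   18,775   15,663   4,383

Table 9 (large corpus)
Threshold         3           4         5         6         7         8         9
Partial recall    87.2%       85.6%     84.2%     83.0%     82.0%     81.2%     80.4%
Number of words   1,269,378   788,964   562,407   432,830   349,511   291,688   249,904

Conclusion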

We have described a hybrid method for extracting Chinese words from the Chinese text corpus using accessor variety and adhesive characters. We tested the performance of the method on different thresholds and word lengths and on different corpus sizes. We conclude that the method based on accessor variety and adhesive characters performs efficiently in fulfilling word extraction tasks. The precision when the small corpus was used was much larger than that with the large corpus, but the situation was opposite for partial recall. For example, when the threshold was set to three, with the small corpus the precision and partial recall were 83.8% and 66.5%, respectively, whereas with the large corpus they were 58.3% and 87.2%, respectively. When the threshold was set to nine, the corresponding numbers were 97.7% and 41.5% versus 73.4% and 80.4%.

As even human judges differ when facing the task of segmenting text into words and test corpora differ from system to system (Sproat et al. 1996), it is very difficult to compare methods. To convincingly illustrate the efficiency of our method, we chose one of the most direct ways: We implemented Chang and Su’s (1997) method and our own method on a corpus, the size of which was similar to the one that was used in their paper. We chose their paper as reference for two reasons: One was that their approach was iterative and unsupervised, just like ours, and one was that their method was complicated and integrated several commonly used word-filtering techniques (including Viterbi training, mutual information, entropy, and joint Gaussian mixture density function) to improve their result. Their system contains two modules: One is the segmentation module, which is used to segment words and calculate the frequencies of the words; the other is the filtering module, which is used to rank the likelihood ratios of words, to filter out from the augmented dictionary those words with low likelihood ratios, and further to add those with high likelihood ratios into the augmented dictionary. The system iteratively repeats these two modules until a predefined condition is fulfilled. We will show that even compared to such a deliberate approach, our simple method is marginally better. For simplicity, we will use IT to refer to Chang and Su’s method, where IT implies iterative, and AV to refer to our method, where the symbol AV implies accessor variety.

We combined PD9208.SGML and PD9209.SGML (files of People’s Daily published in August and September 1992, which is a proportion of the TREC Chinese corpus [Linguistic Data Consortium, n.d.]) to form a file of 376,053 sentences after the clearing step (notice that in Chang and Su’s [1997] paper, a corpus of 311,591 sentences was used). We conducted two comparison experiments, one for extracting words with lengths of two to four characters and the other for extracting words with lengths of two to seven characters. The reason for selecting these two sets of word lengths is that Chang and Su considered only words with lengths of two to four characters, whereas in our method we consider words with lengths of two to seven characters. In both experiments, the number of iterations for IT was 21 (because Chang and Su also conducted 21 iterations), and the AV value threshold (when the AV value of a string is greater than or equal to this threshold, it is considered to be a word) for our method is three.

Because we do not segment the file in AV, it is impossible to count the precision and recall by comparing the hand-segmented sample sentences with the automatically segmented sample sentences. (In this case, sample sentences are first obtained, then they are segmented both by hand and automatically by the method under examination. The precision is equal to the number of words that are extracted both by hand and automatically, divided by the total number of words that are extracted automatically. The recall is equal to the number of words that are extracted both by hand and automatically, divided by the total number of words that are extracted by hand.) This evaluation method was applied in Chang and Su’s (1997) original work. Instead, we evaluated both IT and AV with the method that we described in the previous sections. We randomly chose 1,000 words of each word length (in the first experiment, word length varied from two to four, and in the second experiment, word length varied from two to seven) from the output dictionary that was generated by each method. The precision of each word length was then defined as the proportion of correct words among the 1,000 sample words of the same length. Regarding partial recall (we used partial recall as a substitute for traditional recall, as discussed previously), we first randomly chose sentences from the unsegmented file, and then segmented them by hand. Then we extracted words with different lengths from this set of sentences. The partial recall of each word length was then defined as the number of words of that length that were extracted both from the automatically generated output dictionary and from the hand-segmented sample sentences, divided by the total number of words of that length that were extracted from the hand-segmented sample sentences.

The system dictionaries that we used in each experiment (i.e., the dictionary used before each approach was applied) were derived from a large dictionary downloaded from http://www.mandarintools.com/segmenter.html that contains 119,538 terms from two to seven characters long. In each experiment, the size of the system dictionary and the size of the corpus were chosen to approach those of the system dictionary and the corpus that were mentioned in Chang and Su’s (1997) original work. In each experiment, all of the values of precision and partial recall of both IT and AV were counted by the same person. Therefore, the evaluation results should be reasonably credible.

In the experiment of extracting words of lengths two to four, the system dictionary contained 24,705 bigrams, 4,355 trigrams, and 4,252 four-grams, that is, a total of 33,312 entities. We randomly chose 979 sentences and segmented them by hand. Suppose the obtained word set was S. We then removed from S those segments that occurred in the original system dictionary and those segments that appeared less than five times in the corpus (the 376,053 sentences). The latter removal was undertaken because Chang and Su (1997) did not consider segments with frequency of less than five. Hence, we obtained from S 580 bigrams, 156 trigrams, and 135 four-grams. These words were considered to be new words and were extracted by hand from the sample sentences and used to test the partial recall for each method, IT and AV. In Table 11, we list the precision and partial-recall value for each word length from two to four for both IT and AV. We can see from the table that IT outperforms AV for word length two, but the situation is just opposite for word length four. With word length three, the two methods perform comparatively, with AV’s performance slightly worse. Considering that our method, AV, is much simpler than IT, we conclude that it is quite promising.

Table 11 Precision and partial recall of word lengths two to four of the first experiment on IT and AV. Columns: Bigram, Trigram, Four-gram. Rows: precision and partial recall for IT and AV.

Because we observed from this experiment that the performance of our method improves with increased word length, we conducted another experiment to further examine this phenomenon. In this experiment, we extracted words with lengths from two to seven characters. The system dictionary that we used contained 38,097 entries, with 27,986 bigrams, 4,906 trigrams, 4,834 four-grams, 238 five-grams, 89 six-grams and 44 seven-grams. We randomly chose 1,989 sentences and segmented them by hand. After filtering out the segments that appeared in the system dictionary and those with frequencies less than five, the numbers of new words of word lengths from two to seven that were extracted by hand from the sample sentences were 699, 369, 264, 54, 31, and 13, respectively. These words were used to test the partial recall. In Table 12, we list the results in the second experiment. The precision and partial-recall values were computed the same way as in Table 11.

Table 12 Precision and partial recall of word lengths two to seven of the second experiment on IT and AV. Columns: Bigram, Trigram, Four-gram, Five-gram, Six-gram, Seven-gram. Rows: precision and partial recall for IT and AV.

This table strongly indicates that AV outperforms IT for all word lengths except for word length two. Two characters have greater chances of occurring together in different environments than larger numbers of characters. This degrades the precision of our method in the case of bigrams, as the threshold that we used for AV was three, i.e., when the AV value of a bigram was greater than or equal to three, we regarded it as a word. The reason for the lower partial recall of AV with word length two is that we filtered out all the bigrams that were both absent from the system dictionary and had adhesive characters. For larger word lengths, only those grams that had specific meanings had chances of occurring together in different environments; that is, they had higher AV values, which resulted in higher precision values in our method. The reason for a higher partial recall with longer grams is that even when a longer gram with higher AV value was both absent from the system dictionary and had adhesive characters, we did not dogmatically filter it out. Alternatively, we furthered examined whether it was in one of the three styles h+core, core+t, or h+core+t (as discussed in Section 3). If it was in one of these styles, then we filtered it out.

There are several ways to explain the performance of IT being better than that of AV with bigrams but worse with longer grams. First, IT does not consider adhesive characters, which helps improve the partial recall while degrading the precision (note that in our method, we filtered out some grams with adhesive characters), as many grams contain adhesive characters and they are hard to inspect out. Second, IT uses several techniques to filter out the bad candidates for real words, which is intended to help improve the precision. But there are several deficiencies in this design. In the segmentation module, a longer segment is preferred. (For each sentence, IT tries to find the segmentation with the highest likelihood, where the likelihood is defined as the multiplication of the relative frequencies of all of the segments, and the relative frequency of one segment is defined as the frequency of that segment divided by the sum of the frequency of all grams [Chang and Su 1997]. Therefore, if a segmentation has more segments, then its likelihood value is smaller.) This will inevitably degrade the partial recall of shorter grams. On the other hand, because IT’s system dictionaries usually contain very limited numbers of longer terms, the filter module (i.e., a likelihood-ranking model) has inadequate information to correctly describe the feature functions of word class and nonword class for longer grams. This will inevitably degrade both the precision and the partial recall for longer grams, as more real words might be considered nonwords, and more nonwords might be considered real words. Finally, although the combination of several features seems comprehensive, it also generates more noise than using only one feature. We think that all of the factors that we described above can roughly explain the phenomenon that is presented in Table 12.

Comparing Table 12 to Table 11, we find that the results are slightly different even for the same word lengths. One reason for this is that we used different system dictionaries. Note that all of the results were obtained only on new words. Therefore, the size of the system dictionary will affect the result of the experiment. Usually, the larger the system dictionary is, the poorer the precision and recall that are obtained. In our experiments, there are more bigrams in the dictionary that we used in the latter experiment than in the dictionary that we used in the first experiment. That is why the precision value and the partial-recall value of bigrams are smaller than those in the first experiment. As there are similar numbers of trigrams and four-grams in both dictionaries, the results for these grams are very close in both experiments. Another factor that may lead to these differences is the use of different sample sentences and different methods to segment them by hand for testing partial recall. In the former experiment, we considered only terms with lengths from two to four characters, and hence segmented the sample sentences only to terms with lengths from two to four. In the latter experiment, we considered all terms with lengths from two to seven characters.
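The evaluation protocol used for the comparison can be summarized in a short Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, correctness of a sampled word is supplied by a caller-provided judgment function (in the experiments this judgment was made by a person), and the frequency cutoff of five follows the filtering step applied to the hand-segmented words.

```python
def precision(sampled_words, is_correct):
    """Proportion of sampled extracted words that are judged to be correct."""
    sampled_words = list(sampled_words)
    if not sampled_words:
        return 0.0
    return sum(1 for w in sampled_words if is_correct(w)) / len(sampled_words)


def partial_recall(extracted_words, hand_words):
    """Words found both automatically and by hand, divided by the
    number of words found by hand in the sample sentences."""
    hand = set(hand_words)
    if not hand:
        return 0.0
    return len(hand & set(extracted_words)) / len(hand)


def new_words(hand_segments, system_dictionary, corpus_frequency, min_freq=5):
    """Keep hand-extracted segments that are absent from the system
    dictionary and occur at least `min_freq` times in the corpus."""
    return {
        s for s in hand_segments
        if s not in system_dictionary and corpus_frequency.get(s, 0) >= min_freq
    }


if __name__ == "__main__":
    extracted = {"ab", "cd", "ef"}
    by_hand = {"ab", "cd", "gh", "ij"}
    print(partial_recall(extracted, by_hand))
    print(precision(["ab", "cd", "ef", "zz"], lambda w: w in extracted))
    print(new_words(["ab", "cd", "gh"], {"ab"}, {"cd": 7, "gh": 2}))
```

Because the hand-extracted set comes from a small sample of sentences rather than from the whole corpus, the second function measures partial recall, not traditional recall; applying it separately to the words of each length gives per-length figures of the kind reported in Tables 11 and 12.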

47.04% 85.69% 75.34% 25.75% 84.62% 81.41% 68.76% 81.48% 87.41% 49.85% 84.84% 28.28% 71.59% 59.12% 78.05% 32.71% 70.37% 56.60% 80.65% 32.62% 84.62%

Discussion of Future Work

In this work, we have proposed accessor variety as an alternative to the commonly used frequency criterion. Our promising results for word extraction make it a potentially useful method for Chinese text processing. Our approach may give rise to new research directions in other text problems as well. In addition, word extraction is the basic step for many text-processing tasks. It is related to but different from word segmentation. Extracted words can be used as the fundamental elements for related application problems, such as creating a text summary for a bundle of articles and text clustering. Furthermore, sequences of words and letters occur not only in language processing, but also in other application areas. Our method may be of some heuristic value to other related problems, such as those involving substring identification (Deng, Li, and Wang 2002; Narasimhan et al. 2002; Thijs et al. 2002), and processing of biomedical concepts (Majoros, Subramanian, and Yandell 2003). Finally, in our simple method, we process the data only once, and no iterative refinement is applied. The result is comparable to that of very comprehensive systems and even shows some improvement with longer grams. The simplicity of our method makes it especially suitable for processing large corpora.

Acknowledgments

We would like to thank Jing-Shin Chang for help in our implementation of their method IT. Many thanks also go to Chun-yu Kit for his suggestions. And we also thank the

References

Beeferman, Doug, Adam Berger, and John D. Lafferty. 1999. Statistical models for text segmentation. Machine Learning, 34(1–3):177–210.
Beijing Language Institute. 1986. Xian dai han yu pin lu ci dian [Word Frequency Dictionary of Modern Chinese]. Beijing Language Institute Press.
Chang, Jing-Shin and Keh-Yih Su. 1997. An unsupervised iterative method for Chinese new lexicon extraction. International Journal of Computational Linguistics and Chinese Language Processing, 2(2):97–148.
Chao, Yuen-Ren. 1968. A Grammar of Spoken Chinese. University of California Press, Berkeley and Los Angeles.
Chen, Ching-Yu, Shu-Fen Tseng, Chu-Ren Huang, and Keh-Jiann Chen. 1993. Some distributional properties of Mandarin Chinese—A study based on the Academia Sinica corpus. Proceedings of Pacific Asia Conference on Formal and Computational Linguistics I, pages 81–95, Taipei.
Chen, Keh-Jiann and Ming-Hong Bai. 1998. Unknown word detection for Chinese by a corpus-based learning method. International Journal of Computational Linguistics and Chinese Language Processing, 3(1):27–44.
Chien, Lee-Feng. 1995. Csmart—A high-performance Chinese document retrieval system. In Proceedings of the 1995 International Conference of Computer Processing of Oriental Languages (ICCPOL), pages 176–183, Hawaii.
Chien, Lee-Feng. 1997. PAT-tree-based keyword extraction for Chinese information retrieval. In Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 50–58, Philadelphia.
Dai, Yubin, Teck Ee Loh, and Christopher Khoo. 1999. A new statistical formula for Chinese text segmentation incorporating contextual information. In Proceedings of SIGIR ’99: the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 82–89, Berkeley, CA.
Deng, Xiaotie, Guojun Li, and Lusheng Wang. 2002. Center and distinguisher for strings with unbounded alphabet. Journal of Combinatorial Optimization, 6(4):383–400.
Garside, Roger, Geoffrey Leech, and Geoffrey Sampson. 1987. The Computational Analysis of English: A Corpus-Based Approach. Longman, London.
Ge, Xian-Ping, Wanda Pratt, and Padhraic Smyth. 1999. Discovering Chinese words from unsegmented text. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 271–272, Berkeley, CA.
Guo, Jin. 1997. Critical tokenization and its properties. Computational Linguistics, 23(4):569–596.
Harris, Zellig S. 1970. Morpheme boundaries within words. In Papers in Structural and Transformational Linguistics. D. Reidel, Dordrecht, pages 68–77.
Huang, Chu-Ren, Keh-Jiann Chen, and Ben-Jamin K. Tsou. 1996. Readings in Chinese Natural Language Processing. In Journal of Chinese Linguistics Monograph Series no. 9, edited by Chu-ren Huang et al., pages 1–22.
Kobayasi, Yosiyuki, Takenobu Tokumaga, and Hozumi Tanaka. 1994. Analysis of Japanese compound nouns using collocational information. In Proceedings of the 15th International Conference on Computational Linguistics (COLING’94), vol. 2, pages 865–869, Kyoto, Japan.
Kwok, Kui-Lam. 1997. Comparing representations in Chinese information retrieval. In Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 34–41, Philadelphia.
Linguistic Data Consortium. n.d. TREC Mandarin-Text Retrieval Conference Mandarin Newswire, LDC 2000T52.
Majoros, William H., G. Mani Subramanian, and Mark Yandell. 2003. Identification of key concepts in biomedical literature using a modified Markov heuristic. Bioinformatics, 19(3):402–407.
Mo, Ruo-Ping J., Yao-Jung Yang, Keh-Jiann Chen, and Chu-Ren Huang. 1996. Determinative-measure compounds in Mandarin Chinese: Formation rules and parser implementation. In Readings in Chinese Natural Language Processing, Journal of Chinese Linguistics Monograph Series no. 9, edited by Chu-ren Huang et al., pages 123–146.
Narasimhan, Giri, Changsong Bu, Yuan Gao, Xuning Wang, Ning Xu, and Kalai Mathee. 2002. Mining protein sequences for motifs. Journal of Computational Biology, 9(5):707–720.
Pachunke, Thomas, Oliver Mertineit, Klaus Wothke, and Rudolf Schmidt. 1992. Broad coverage automatic morphological segmentation of German words. In Proceedings of the 14th International Conference on Computational Linguistics (COLING’92), vol. 4, pages 1218–1222, Nantes, France.
Paola, Merlo and Suzanne Stevenson. 2001. Automatic verb classification based on statistical distribution of argument structure. Computational Linguistics, 27(3):373–408.
Ricardo, Baeza-Yates and Berthier Ribeiro-Neto. 1999. Modern Information Retrieval. ACM Press, Addison Wesley Longman.
Sproat, Richard and Chilin Shih. 1990. A statistical method for finding word boundaries in Chinese text. Computer Processing of Chinese and Oriental Languages, 4:336–351.
Sproat, Richard, Chilin Shih, William Gale, and Nancy Chang. 1996. A stochastic finite-state word-segmentation algorithm for Chinese. Computational Linguistics, 22(3):377–404.
Teahan, William J., Yingying Wen, Rodger McNab, and Ian H. Witten. 2000. A compression-based algorithm for Chinese word segmentation. Computational Linguistics, 26(3):375–393.
Thijs, Gert, Kathleen Marchal, Magali Lescot, Stephane Rombauts, Bart De Moor, Pierre Rouze, and Yves Moreau. 2002. A Gibbs sampling method to detect overrepresented motifs in the upstream regions of coexpressed genes. Journal of Computational Biology, 9(2):447–464.
Wang, Yong-Heng, Hai-Ju Su, and Yan Mo. 1990. Automatic processing of Chinese words. Journal of Chinese Information Processing, 4(4):1–11.
Weeber, Marc, Rein Vos, and R. Harald Baayen. 2000. Extracting the lowest frequency words: Pitfalls and possibilities. Computational Linguistics, 26(3):301–317.
Wu, Zimin and Gwyneth Tseng. 1993. Chinese text segmentation for text retrieval: Achievements and problems. Journal of the American Society for Information Science, 44(9):532–542.
Yamamoto, Mikio and Kenneth W. Church. 2001. Using suffix arrays to compute term frequency and document frequency for all substrings in a corpus. Computational Linguistics, 27(1):1–30.
Yun, Bo-Hyun, Ho Lee, and Hae-Chang Rim. 1995. Analysis of Korean compound nouns using statistical information. In Proceedings of the 1995 International Conference on Computer Processing of Oriental Languages (ICCPOL-95), Hawaii.
Zhang, Jian, Jianfeng Gao, and Ming Zhou. 2000. Extraction of Chinese compound words—An experimental study on a very large corpus. Proceedings of the Second Chinese Language Processing Workshop, pages 132–139, Hong Kong.