<?xml version="1.0"?><!DOCTYPE article SYSTEM "/project/take/software/searchbench_offline_processing/paperxml_generator/aclextractor/src/python/../resource/dtd/paperxml.dtd"><article><header><firstpageheader><page local="1" global="1033"/><title>Chinese Term Extraction Using Minimal Resources</title><pubinfo>Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 1033-1040, Manchester, August 2008</pubinfo><author surname="Yang" givenname="Yuhang"><org  name="Harbin Institute of Technology" country="China" city="Harbin"/></author><author surname="Lu" givenname="Qin"><org  name="Hong Kong Polytechnic University" country="Hong Kong" city="Hung Hom"/></author><author surname="Zhao" givenname="Tiejun"><org  name="Harbin Institute of Technology" country="China" city="Harbin"/></author></firstpageheader><frontmatter><p><b>Chinese Term Extraction Using Minimal Resources</b></p><p><b>Yuhang Yang</b></p><p>School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China</p><p>1983yang@gmail.com</p><p><b>Qin Lu</b></p><p>Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China</p><p>csluqin@comp.polyu.edu.hk</p><p><b>Tiejun Zhao</b></p><p>School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China</p><p>tjzhao@mtlab.hit.edu.cn</p></frontmatter><abstract>This paper presents a new approach for term extraction using minimal resources. A term candidate extraction algorithm is proposed to identify features of the relatively stable and domain independent term delimiters rather than those of the terms themselves. For term verification, a link analysis based method is proposed to calculate the relevance between term candidates and the sentences in the domain specific corpus from which the candidates are extracted. The proposed approach requires no prior domain knowledge, no general corpora, no full segmentation, and only minimal adaptation for new domains. 
Consequently, the method can be used on any domain corpus and is especially useful for resource-limited domains. Evaluations conducted on two different domains for Chinese term extraction show significant improvements over existing techniques and also verify the efficiency and relatively domain independent nature of the approach. Experiments on new term extraction also indicate that the approach is effective for identifying new terms in a domain, making it useful for domain knowledge update. </abstract></header><body><section number="1" title="Introduction"><p>© 2008. Licensed under the <i>Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported </i>license (http://creativecommons.org/licenses/by-nc-sa/3.0/). Some rights reserved.</p><p>Terms are the lexical units that represent the most fundamental knowledge of a domain. Term extraction is an essential task in domain knowledge acquisition which can be used for lexicon update, domain ontology construction, etc. Term extraction involves two steps. The first step extracts candidates by unithood calculation to qualify a string as a valid term. The second step verifies them through termhood measures (Kageura and Umino, 1996) to validate their domain specificity.</p><p>Existing techniques extract term candidates mainly by two kinds of statistics based measures: <i>internal association </i>(e.g. Schone and Jurafsky, 2001) and <i>context dependency </i>(e.g. Sornlertlamvanich et al., 2000). These techniques are also used in Chinese term candidate extraction (e.g. Luo and Sun, 2003; Ji and Lu, 2007). Domain dependent features of domain terms are used in a weighted manner to identify term boundaries. 
However, these algorithms always face a dilemma: too few features are not enough to distinguish terms from non-terms, whereas more features lead to more conflicts among the selected features in a specific instance.</p><p>Most term verification techniques use features based on the difference between the distribution of a term within a domain and across domains, such as <i>TF-IDF </i>(Salton and McGill, 1983; Frank et al., 1999) and <i>Inter-Domain Entropy </i>(Chang, 2005). Limited distribution information on term candidates in different documents is far from enough to distinguish terms from non-terms. Other studies have attempted to use more direct information. The term verification algorithm <i>TV_ConSem, </i>proposed by Ji and Lu (2007) for Chinese, calculates the percentage of context words in a domain lexicon using both frequency information and semantic information. However, this technique requires a large domain lexicon and relies heavily on both the size and the quality of the lexicon. Some supervised learning approaches have been applied to protein/gene name recognition (Zhou et al., 2005) and Chinese new word identification (Li et al., 2004) using <i>SVM </i>classifiers (Vapnik, 1995), which also require large domain corpora and annotations, and intensive training is needed for a new domain.<page local="2" global="1034"/></p><p>Current term extraction techniques (e.g. Frank et al., 1999; Chang, 2005; Ji and Lu, 2007) suffer from three major problems. The first problem is that these algorithms cannot identify certain kinds of terms, such as those with low statistical significance. The second problem is their dependency on full segmentation of Chinese text, which is particularly error-prone on domain specific data (Huang et al., 2007). 
The third problem is their dependency on a priori domain knowledge such as a domain lexicon, which makes them difficult to apply to a new domain.</p><p>In this work, the proposed algorithm extracts candidates by identifying the relatively stable and domain independent term boundary markers instead of looking for features associated with the term candidates themselves. Furthermore, a novel algorithm for term verification is proposed, which uses link analysis to calculate the relevance between term candidates and the sentences in a domain specific corpus to validate their domain specificity.</p><p>The rest of the paper is organized as follows. Section 2 describes the proposed algorithms. Section 3 explains the experiments and the performance evaluation. Section 4 is the conclusion.</p></section><section number="2" title="Methodology"><subsection number="2.1" title="Delimiters Based Term Candidate Extraction"><p>Generally speaking, sentences consist of substantives and functional words. Domain specific terms <i>(terms </i>for short) are more likely to be domain substantives. Words immediately before and after these terms, called <i>predecessors </i>and <i>successors </i>of the terms, are likely to be either functional words or other general substantives connecting terms. These predecessors and successors can be considered as markers of terms, and are referred to as <i>term delimiters </i>in this paper. In contrast to terms, delimiters are relatively stable and domain independent. Thus, they can be extracted more easily. Instead of looking for features associated with terms as in other works, this paper looks for features associated with term delimiters. That is, term delimiters are identified first. 
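</p><p>As a rough illustration of this delimiter-first idea, the following Python sketch splits a character string on the entries of a delimiter list and keeps the pieces in between. The function name <i>split_on_delimiters</i>, the toy Latin-letter strings standing in for Chinese text, and the longest-match-first handling of overlapping delimiters are assumptions made for illustration and are not taken from the paper.</p><p>
```python
import re


def split_on_delimiters(text, dlist):
    """Return the non-empty substrings of `text` lying between delimiters.

    `dlist` is any iterable of delimiter strings (hypothetical stand-in
    for a delimiter list); matching is done on raw characters.
    """
    delimiters = [d for d in dlist if d]
    if not delimiters:
        # No delimiter available: the whole string is one candidate.
        return [text] if text else []
    # Assumption: try longer delimiters first, so a multi-character
    # delimiter is not shadowed by a shorter one that is its prefix.
    ordered = sorted(delimiters, key=len, reverse=True)
    pattern = "|".join(re.escape(d) for d in ordered)
    # Drop empty pieces produced by adjacent or leading/trailing delimiters.
    return [piece for piece in re.split(pattern, text) if piece]


if __name__ == "__main__":
    # Toy string: "AB" and "C" play the role of two delimiters.
    print(split_on_delimiters("xxAByyCzz", ["AB", "C"]))
```
</p><p>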
Words between delimiters are then taken as term candidates.</p><p>The proposed delimiter identification based algorithm, referred to as <i>TCE_DI </i>(Term Candidate Extraction - Delimiter Identification), extracts term candidates from a domain corpus by using a delimiter list, referred to as the <i>DList. </i>Given a <i>DList, </i>the algorithm <i>TCE_DI </i>itself is straightforward. Consider a character string $CS = C_1 C_2 \ldots C_n$, as shown in Figure 1, where each $C_i$ is a Chinese character. Suppose there are two delimiters $D_1 = C_{i1} \ldots C_{il}$ and $D_2 = C_{j1} \ldots C_{jm}$ in $CS$, where $D_1 \in$ <i>DList </i>and $D_2 \in$ <i>DList. </i>The string $CS$ is then segmented into five substrings: $C_1 \ldots C_{ib}$, $C_{i1} \ldots C_{il}$, $C_{ia} \ldots C_{jb}$, $C_{j1} \ldots C_{jm}$, and $C_{ja} \ldots C_n$. Since $C_{i1} \ldots C_{il}$ and $C_{j1} \ldots C_{jm}$ are delimiters, $C_1 \ldots C_{ib}$, $C_{ia} \ldots C_{jb}$, and $C_{ja} \ldots C_n$ are regarded as term candidates, labeled $TC_1$, $TC_2$ and $TC_3$ in Figure 1, respectively. If $CS$ contains no delimiter, the whole string $C_1 C_2 \ldots C_n$ is regarded as one term candidate.</p><figure caption="Figure 1. Paradigm of Term Candidate Extraction"></figure><p><i>DList </i>can be obtained either from a delimiter training corpus or from a given stop word list. Given a delimiter training corpus, <i>Corpus_Training, </i>normally a domain specific corpus, and a domain lexicon <i>Lexicon, DList </i>can be obtained by the following algorithm, referred to as <i>DList_Ext </i>(Delimiter List Extraction Algorithm).</p><p><b>Step 1: </b>For each term $T_i$ in <i>Lexicon, </i>mark $T_i$ in <i>Corpus_Training </i>as a non-divisible lexical unit.</p><p><b>Step 2: </b>Segment the remaining text in <i>Corpus_Training.</i></p><p><b>Step 3: </b>Extract the predecessors and successors of all $T_i$ as delimiter candidates.</p><p><b>Step 4: </b>Remove delimiter candidates that are contained in a $T_i$ in <i>Lexicon.</i></p><p><b>Step 5: </b>Rank the delimiter candidates by frequency; the top <i>N_DI </i>items are taken as delimiters.</p><p>The <i>DList_Ext </i>algorithm basically uses known terms in a domain specific <i>Lexicon </i>to find the delimiters. The experiments later show that <i>Lexicon </i>does not need to be comprehensive. Even if a small training corpus, <i>Corpus_Training, </i>is not available, as in a language without sufficient domain specific NLP resources, a stop word list produced by experts or from a general corpus can serve as <i>DList </i>directly without using the <i>DList_Ext </i>algorithm.</p><page local="3" global="1035"/></subsection><subsection number="2.2" title="Link Analysis Based Term Verification"><p>In a domain corpus, some sentences are <i>domain relevant sentences </i>which contain more domain specific information, whereas others are <i>general sentences </i>which contain less domain information. A domain specific term is more likely to be contained in domain relevant sentences, which means that domain relevant sentences and domain specific terms have a mutually reinforcing relationship. A novel algorithm, referred to as <i>TV_LinkA </i>(Term Verification - Link Analysis), is proposed. It is based on the Hyperlink-Induced Topic Search <i>(HITS) </i>algorithm (Kleinberg, 1997), originally proposed for information retrieval, and uses link analysis to calculate the relevance between term candidates and the sentences in domain specific corpora for term verification.</p><p>In <i>TV_LinkA, </i>a node <i>p </i>can either be a sentence or a term candidate. 
If a term candidate <i>Term_C </i>is contained in a sentence <i>Sen </i>of the corpus <i>Corpus_Extract </i>from which the candidates were extracted, there is a directed link from <i>Sen </i>to <i>Term_C. </i>In this way, a graph of the candidates and the sentences in <i>Corpus_Extract </i>can be constructed, and the links between them indicate their relationships. A good <i>hub </i>in <i>Corpus_Extract </i>is a sentence that contains many good authorities; a good <i>authority </i>is a term candidate that is contained in many good hubs. Each node <i>p </i>is associated with a non-negative authority weight $w(p)^A$ and a non-negative hub weight $w(p)^H$. Link analysis in <i>TV_LinkA </i>makes use of the relationship between hubs and authorities via an iterative process to maintain and update the authority and hub weights of each node of the graph.</p><p>Let $V^A$ denote the authority vector $(w(p_1)^A, w(p_2)^A, \ldots, w(p_n)^A)$ and $V^H$ denote the hub vector $(w(p_1)^H, w(p_2)^H, \ldots, w(p_n)^H)$, where <i>n </i>is the sum of the total number of sentences and the total number of term candidates. Given weights $V^A$ and $V^H$ and the set $E$ of directed links $p \to q$, the <i>I </i>operation (over the in-pointers of a node) and the <i>O </i>operation (over the out-pointers of a node) update $w(q)^A$ and $w(p)^H$ as follows.</p><p><i>I </i>operation: $$w(q)^A = \sum_{p \to q \in E} w(p)^H \qquad (1)$$</p><p><i>O </i>operation: $$w(p)^H = \sum_{p \to q \in E} w(q)^A \qquad (2)$$</p><p>Let <i>k </i>be the iteration termination parameter and <i>z </i>be the vector $(1, 1, \ldots, 1)$, and let $V^A$ and $V^H$ be initialized to $V^A_0 = V^H_0 = z$. Hubs and authorities can then be calculated as follows.</p><p>For $i = 1, 2, \ldots, k$</p><p>Apply the <i>I </i>operation to $(V^A_{i-1}, V^H_{i-1})$, obtaining new ${V'}^A_i$.</p><p>Apply the <i>O </i>operation to $({V'}^A_i, V^H_{i-1})$, obtaining new ${V'}^H_i$.</p><p>Normalize ${V'}^A_i$ by dividing by the normalization factor $\sqrt{\sum_p (w'(p)^A)^2}$ to obtain $V^A_i$.</p><p>Normalize ${V'}^H_i$ by dividing by the normalization factor $\sqrt{\sum_p (w'(p)^H)^2}$ to obtain $V^H_i$.</p><p>End</p><p>Return $(V^A_k, V^H_k)$</p><p>In <i>Corpus_Extract, </i>term candidates with high authority in a few documents are likely to be domain specific terms, whereas candidates with high authority in many documents are more likely to be commonly used general words. Based on this observation, the termhood of each candidate term <i>Term_C, </i>denoted as <i>Termhood_C, </i>is calculated according to formula (3) defined below.</p><p>$$\mathrm{Termhood}_C = \Big(\sum_{j} w(C)^A_j\Big) \log\frac{|D|}{DF_C} \qquad (3)$$</p><p>where $w(C)^A_j$ is the authority of <i>Term_C </i>in a document $D_j$ of <i>Corpus_Extract, </i>$|D|$ is the total number of documents in <i>Corpus_Extract </i>and $DF_C$ is the total number of documents in which <i>Term_C </i>occurs. The candidates <i>Term_C </i>are then ranked according to their termhood values <i>Termhood_C, </i>and the top ranked <i>N_TCList </i>candidates are considered terms. 
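</p><p>The iteration and formula (3) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes that the graph of sentences and candidates is built separately for each document (one reading of "the authority of a candidate in a document"), that the normalization factor is the Euclidean norm, and that the natural logarithm is used. The names <i>hits_authorities</i> and <i>termhood</i> and the default of 20 iterations are hypothetical.</p><p>
```python
import math
from collections import defaultdict


def hits_authorities(sentences, k=20):
    """Authority weights of the candidates in one document.

    `sentences` is a list of sentences, each given as the list of term
    candidates it contains. Sentences act as hubs, candidates as authorities.
    """
    terms = sorted({t for sent in sentences for t in sent})
    auth = {t: 1.0 for t in terms}      # initial authority vector z
    hub = [1.0] * len(sentences)        # initial hub vector z
    for _ in range(k):
        # I operation: a candidate's authority is the sum of the hub
        # weights of the sentences that link to it.
        new_auth = {t: 0.0 for t in terms}
        for s, sent in enumerate(sentences):
            for t in set(sent):
                new_auth[t] += hub[s]
        # O operation: a sentence's hub weight is the sum of the new
        # authority weights of the candidates it links to.
        new_hub = [sum(new_auth[t] for t in set(sent)) for sent in sentences]
        # Normalization (assumed Euclidean); `or 1.0` guards empty graphs.
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub)) or 1.0
        auth = {t: v / a_norm for t, v in new_auth.items()}
        hub = [v / h_norm for v in new_hub]
    return auth


def termhood(documents, k=20):
    """Summed per-document authority times log(|D| / DF) per candidate."""
    total = defaultdict(float)   # sum of authorities over documents
    doc_freq = defaultdict(int)  # number of documents containing the candidate
    for doc in documents:
        for t, w in hits_authorities(doc, k).items():
            total[t] += w
            doc_freq[t] += 1
    n_docs = len(documents)
    return {t: total[t] * math.log(n_docs / doc_freq[t]) for t in total}
```
</p><p>Under these assumptions, a candidate that occurs in every document receives a termhood of zero, which reflects the observation above that candidates with high authority in many documents tend to be general words.</p><p>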
<i>N_TCList </i>is an algorithm parameter to be determined experimentally.</p></subsection></section><section number="3" title="Performance Evaluation"><subsection number="3.1" title="Data Preparation"><p>To evaluate the performance of the proposed algorithms for Chinese, experiments are conducted on four corpora of two different domains as listed in Table 1. <i>Corpus_IT_Small </i>and <i>Corpus_IT_Large </i>are two sets of non-overlapping academic papers in the IT domain, and <i>Corpus_IT_Small </i>is identical to the corpus used for <i>TV_ConSem </i>(Ji and Lu, 2007). <i>Corpus_Legal_Small </i>is a complete set of official Chinese criminal law articles. <i>Corpus_Legal_Large </i>includes the complete set of official Chinese constitutional law articles and Economics/Finance law articles (http://www.law-lib.com/).<page local="4" global="1036"/> Three domain lexicons used in the experiments are detailed in Table 2. <i>Lexicon_IT </i>is obtained with the term extraction algorithm of Ji and Lu (2007) followed by manual verification. <i>Lexicon_Legal </i>is likewise extracted from <i>Corpus_Legal_Small </i>with manual verification. Because legal text covers many different areas such as finance, science, advertisement, etc., the set of actual legal specific terms is relatively small. <i>Lexicon_PKU </i>contains a total of 144K manually verified IT terms supplied by the Institute of Computational Linguistics, Peking University. <i>Lexicon_PKU </i>is used as the standard term set for evaluation on the IT domain. <i>Corpus_IT_Small </i>and <i>Lexicon_IT </i>are used to obtain the delimiter list of the IT domain, <i>DList_IT. Corpus_Legal_Small </i>and <i>Lexicon_Legal </i>are used to obtain the delimiter list of the legal domain, <i>DList_Legal. Corpus_IT_Large </i>and <i>Corpus_Legal_Large </i>are used as open test data to evaluate the proposed algorithms in the IT domain and the legal domain, respectively.</p><p>To verify that the approach works with a simple stop word list without delimiter extraction, a stop word list, <i>DList_SW, </i>is also used as a reference. It consists of the 494 general purpose stop words downloaded from a Chinese NLP resource website (www.nlp.org.cn), used without any modification.</p><p>The performance of the algorithms in the IT domain is evaluated by precision according to the following formula:</p><p>$$precision = \frac{N_{Lexicon} + N_{New}}{N_{TCList}} \qquad (4)$$</p><p>where <i>N_TCList </i>is the number of term candidates in the term candidate list <i>TCList </i>extracted by an evaluated algorithm, <i>N_Lexicon </i>denotes the number of term candidates in <i>TCList </i>contained in <i>Lexicon_PKU, </i>and <i>N_New </i>denotes the number of extracted term candidates that are not in <i>Lexicon_PKU </i>yet are considered correct. Thus, <i>N_New </i>is the number of newly discovered terms with respect to <i>Lexicon_PKU. </i>The verification of all the new terms is carried out manually by two experts independently. A new term is considered correct only if both experts mark it as a correct term. As there is no reasonably large standard legal term list available, the evaluation of the legal domain in terms of precision is conducted manually. No evaluation on new term extraction is conducted for the legal domain.</p><p>To evaluate the ability of the algorithms to identify new terms in the IT domain, another measurement is applied to the IT corpus against <i>Lexicon_PKU </i>based on the following formula:</p><p>$$R_{NTE} = \frac{N_{New}}{N_{TCList}} \qquad (5)$$</p><p>where <i>N_TCList </i>and <i>N_New </i>are the same as given in formula (4). A higher <i>R_NTE </i>indicates that more extracted terms are outside of <i>Lexicon_PKU </i>and are thus considered new terms. This is similar to the measurement of out-of-vocabulary (OOV) words in Chinese segmentation. A higher <i>R_NTE </i>thus indicates that the algorithm can be useful for domain knowledge update, including lexicon expansion.</p></subsection><subsection number="3.2" title="Evaluation on Term Extraction"><p>For comparison, a statistics based term candidate extraction algorithm, <i>TCE_SEF&amp;CV, </i>which has the best performance in (Ji and Lu, 2007) and uses both internal association and external strength, is used as the reference algorithm for the evaluation of <i>TCE_DI. </i>A statistics based term verification algorithm, <i>TV_ConSem </i>(Ji and Lu, 2007), which uses semantic information within a context window, is used for the evaluation of <i>TV_LinkA. Lexicon_PKU </i>is also used in <i>TV_ConSem. </i>Two popular methods that do not separate the candidate extraction and verification steps are also used for comparison. The first one is based on <i>TF-IDF </i>(Salton and McGill, 1983; Frank et al., 1999). The second one is a supervised learning approach based on an <i>SVM </i>classifier, SVMlight (Joachims, 1999). The features used by SVMlight are shown in Table 3. Two training sets are constructed for the <i>SVM </i>classifier. The first one includes 3,337 positive examples <i>(Lexicon_IT) </i>and 5,950 negative examples extracted from <i>Corpus_IT_Small. 
</i>The second one includes 394 positive examples <i>(Lexicon_Legal) </i>and 28,051 negative examples extracted from <i>Corpus_Legal_Small.</i><page local="5" global="1037"/></p><table caption="Table 1. Different Corpora Used for Experiments" class="main" frame="box" rules="all" border="0" regular="False"><tr class="row"><td class="cell"><p>Corpus</p></td><td class="cell"><p>Domain</p></td><td class="cell"><p>Size</p></td><td class="cell"><p>Text type</p></td></tr><tr class="row"><td class="cell"><p><i>Corpus_IT_Small</i></p></td><td class="cell"><p>IT</p></td><td class="cell"><p>77K</p></td><td class="cell"><p>Academic papers</p></td></tr><tr class="row"><td class="cell"><p><i>Corpus_IT_Large</i></p></td><td class="cell"><p>IT</p></td><td class="cell"><p>6.64M</p></td><td class="cell"><p>Academic papers</p></td></tr><tr class="row"><td class="cell"><p><i>Corpus_Legal_Small</i></p></td><td class="cell"><p>Legal</p></td><td class="cell"><p>344K</p></td><td class="cell"><p>Law article</p></td></tr><tr class="row"><td class="cell"><p><i>Corpus_Legal_Large</i></p></td><td class="cell"><p>Legal</p></td><td class="cell"><p>1.04M</p></td><td class="cell"><p>Law article</p></td></tr></table><table caption="Table 2. Different Lexicons Used for Experiments" class="main" frame="box" rules="all" border="1" regular="False"><tr class="row"><td class="cell"><p>Lexicon</p></td><td class="cell"><p>Domain</p></td><td class="cell"><p>Size</p></td><td class="cell"><p>Source</p></td></tr><tr class="row"><td class="cell"><p><i>Lexicon_IT</i></p></td><td class="cell"><p>IT</p></td><td class="cell"><p>3,337</p></td><td class="cell"><p><i>Corpus_IT_Small</i></p></td></tr><tr class="row"><td class="cell"><p><i>Lexicon_Legal</i></p></td><td class="cell"><p>Legal</p></td><td class="cell"><p>394</p></td><td class="cell"><p><i>Corpus_Legal_Small</i></p></td></tr><tr class="row"><td class="cell"><p><i>Lexicon_PKU</i></p></td><td class="cell"><p>IT</p></td><td class="cell"><p>144K</p></td><td class="cell"><p>PKU</p></td></tr></table><table caption="Table 3. Features Used in the SVM Classifier"><tr class="row"><td class="cell"><p>No.</p></td><td class="cell"><p>Feature Explanation</p></td></tr><tr class="row"><td class="cell"><p>1</p></td><td class="cell"><p>Percentage of the Chinese characters occurring in <i>Lexicon_Domain</i></p></td></tr><tr class="row"><td class="cell"><p>2</p></td><td class="cell"><p>Frequency in the domain corpus</p></td></tr><tr class="row"><td class="cell"><p>3</p></td><td class="cell"><p>Frequency in the general corpus</p></td></tr><tr class="row"><td class="cell"><p>4</p></td><td class="cell"><p>Part of speech</p></td></tr><tr class="row"><td class="cell"><p>5</p></td><td class="cell"><p>The length of Chinese characters in the candidate</p></td></tr><tr class="row"><td class="cell"><p>6</p></td><td class="cell"><p>The length of non-Chinese characters in the candidate</p></td></tr><tr class="row"><td class="cell"><p>7</p></td><td class="cell"><p>Contextual evidence</p></td></tr></table><p>Figure 2 shows the performance of the proposed <i>TCE_DI </i>and <i>TV_LinkA </i>for term extraction compared to the reference algorithms for the IT domain using <i>Corpus_IT_Large. TCE_DI_IT </i>and <i>TCE_DI_Legal </i>indicate <i>TCE_DI </i>using the extracted delimiter lists <i>DList_IT </i>and <i>DList_Legal </i>with <i>N_DI </i>= 500, respectively. <i>TCE_DI_SW </i>simply uses the stop word list <i>DList_SW.</i></p><figure caption="Figure 2. Performance of Different Algorithms on IT Domain"></figure><p>As shown in Figure 2, term extraction based on <i>TCE_DI_IT </i>combined with <i>TV_LinkA </i>gives the best performance. 
It achieves 75.4% precision when the number of extracted terms <i>N_TCList </i>reaches 5,000. This is 9.6 and 29.4 percentage points higher in precision than <i>TF-IDF </i>and <i>TCE_SEF&amp;CV </i>combined with <i>TV_ConSem, </i>respectively, which translates to relative improvements in precision of over 14.8% and 63.9%, respectively.</p><p>When applying the same <i>TV_LinkA </i>algorithm for term verification, <i>TCE_DI </i>using different delimiter lists provides 24 percentage points better precision on average than the <i>TCE_SEF&amp;CV </i>algorithm, which translates to a relative improvement of over 47%. The result of applying delimiters of the legal domain <i>(DList_Legal) </i>to data in the IT domain (shown as <i>TCE_DI_Legal) </i>is better on average than using a simple general stop word list. It should be noted, however, that <i>TCE_DI_SW </i>still performs much better than the reference algorithms, which means that the delimiter based term candidate extraction algorithm can improve performance even without any domain specific training. When applying the same <i>TCE_DI_IT </i>algorithm for term candidate extraction, <i>TV_LinkA </i>provides 10 percentage points higher precision than the <i>TV_ConSem </i>algorithm, which translates to a relative improvement of over 15.3%. It is important to point out that <i>TV_LinkA </i>using only the stop word list without any domain specific knowledge performs better than <i>TV_ConSem </i>using a large domain lexicon. In other words, delimiter based extraction with link analysis uses far fewer resources and still outperforms <i>TV_ConSem.</i></p><p>The precision of <i>TCE_DI_IT </i>or <i>TCE_SEF&amp;CV </i>combined with <i>TV_ConSem </i>shows an upward trend when more terms are extracted, which seems counterintuitive. The principle of the <i>TV_ConSem </i>algorithm is that a candidate is considered a valid term if a majority of its context words already appear in the domain lexicon. 
General words are more likely to be ranked at the top because they are commonly used, which explains the low performance of <i>TV_ConSem </i>in the lower range of <i>N_TCList. </i>When <i>N_TCList </i>increases, more domain terms are included. Thus, there is an upward trend in precision. However, the upward trend reverses at around 4,500 because the measurement in percentage is too low to distinguish valid terms from non-term candidates.</p><p>It is also interesting to point out that the simple <i>TF-IDF </i>algorithm, which has rarely been used in Chinese term extraction, performs as well as the <i>SVM </i>classifier. The main reason is that the test corpus consists of academic papers, so many terms are used consistently and repeated many times in different documents, which accords with the idea of <i>TF-IDF. </i>Thus, <i>TF-IDF </i>performs relatively well because of the high-quality domain corpus. However, <i>TF-IDF, </i>as a statistics based algorithm, suffers from similar problems to other statistics based algorithms.<page local="6" global="1038"/> Thus it does not perform as well as the proposed <i>TCE_DI </i>and <i>TV_LinkA </i>algorithms.</p><p>Figure 3 shows that the proposed algorithms achieve similar performance on the legal domain. <i>TCE_DI_Legal </i>combined with <i>TV_LinkA </i>performs the best. The result of applying IT domain delimiters <i>(DList_IT) </i>to the legal domain, shown as <i>TCE_DI_IT, </i>is better on average than using the general purpose stop word list. 
This further shows that an extracted delimiter list, even from a different domain, can be more effective than a general stop word list. When applying the same <i>TV_LinkA </i>algorithm for term verification, <i>TCE_DI </i>using different delimiter lists is better than all the reference algorithms. Without a large lexicon in the Chinese legal domain, the <i>TV_ConSem </i>algorithm does not even work. <i>TV_LinkA, </i>using no prior domain knowledge for term verification, still achieves an improvement similar to that in the IT domain, where a comprehensive domain lexicon is available.</p><figure caption="Figure 3. Performance of Different Algorithms on Legal Domain"></figure><p>There are three main reasons for the performance improvements of the proposed <i>TCE_DI </i>and <i>TV_LinkA </i>algorithms. Firstly, the delimiters, which are mainly functional words (e.g. the Chinese words for "at/in" and "or") and general substantives (e.g. the Chinese words for "be" and "adopt"), can be extracted easily and are effective term boundary markers since they are quite domain independent and stable. Secondly, the granularity of the domain specific terms extracted by the proposed algorithm is much larger than that of words obtained by word segmentation. This keeps many noisy strings out of the term candidate set. Thus, the proposed delimiter based algorithm performs much better than segmentation based statistical methods. 
Thirdly, the proposed approach is not as sensitive to term frequency as other statistics based approaches because term candidates are identified without regard to the frequencies of the candidates. In the <i>TV_LinkA </i>algorithm, terms are verified by calculating the relevance between candidates and sentences instead of the distributions of terms in different types of documents. Terms with low frequencies can be identified as long as they occur in domain relevant sentences, whereas in previous approaches including <i>TF-IDF, </i>terms with less statistical significance are weeded out. For example, a long Chinese IT term meaning "hierarchical storage system", with a low frequency of 6, is extracted using the proposed approach. It cannot be identified by <i>TF-IDF </i>since its statistical information is not significant. This term cannot be extracted by the segmentation based algorithms either, because a general segmenter splits long terms into pieces, making them difficult to reunite using term extraction techniques.</p><p>It is interesting to note that the proposed approach not only achieves the best performance for both domains, but also achieves the second best when using delimiters extracted from a different domain. The results confirm that delimiters are quite stable across domains and that the relevance between candidates and sentences is effective for distinguishing terms from non-terms in different domains. In fact, the proposed approach can be applied to different domains with minimal training, or with no training if resources are limited.</p></subsection><subsection number="3.3" title="Evaluation on New Term Extraction"><p>As <i>Lexicon_PKU </i>is the only ready-to-use domain lexicon, the evaluation on new term extraction is conducted on <i>Corpus_IT_Large </i>only. 
Figure 4 shows the performance of the proposed algorithms compared with the reference algorithms in terms of <i>R</i><i>NTE</i>, the ratio of new terms among all identified terms.</p><p>The proposed algorithm <i>TCE_DI</i><i>IT </i>combined with <i>TV_LinkA </i>is the top performer over almost the whole range. When <i>N</i><i>TCList </i>reaches 5,000, its <i>R</i><i>NTE </i>is 4% (with respect to <i>TCESEF&amp;CV + TV_ConSem</i>) to 27% (with respect to <i>TF-IDF</i>) higher in absolute terms, which translates to relative improvements of over 9% to 170%, respectively. The second best performer is <i>TCE_DI</i><i>legal </i>combined with <i>TV_LinkA</i>, which uses delimiters from the legal domain. In fact, it underperforms <i>TCE_DI</i><i>IT </i>only in the lower range of <i>N</i><i>TCList</i>.<page local="7" global="1039"/> When <i>N</i><i>TCList </i>reaches 5,000, their performance is basically the same. However, the <i>TCE_DI</i><i>SW </i>algorithm, which uses stop words, performs much worse than <i>TCE_DI</i><i>IT </i>and <i>TCE_DI</i><i>legal</i>, which use extracted delimiter lists. In the <i>TCE_DI </i>algorithm, character strings are split by delimiters and the remaining parts are taken as term candidates. Consequently, if a new term contains a delimiter or a stop word as a component, it cannot be identified correctly; in particular, a new term that contains a stop word cannot be extracted correctly by <i>TCE_DI</i><i>SW</i>. New terms are less likely to contain extracted delimiters, however, because the delimiter extraction algorithm <i>DList_Ext </i>does not treat a component as a delimiter if it is contained in a term in <i>Lexicon</i><i>Domain</i>. As a result, <i>TCE_DI</i><i>SW </i>is less adaptive to domain specific data than <i>TCE_DI</i><i>IT </i>and <i>TCE_DI</i><i>legal. 
</i>That is also why <i>TCE_DI</i><i>SW </i>picks up new terms much more slowly.</p><figure caption="Figure 4. Performance of Different Algorithms for New Term Extraction"></figure><p>It is worth noting that <i>TCE_DI</i><i>IT </i>combined with <i>TV_ConSem </i>identifies more new terms in the low range of <i>N</i><i>TCList</i>. In the <i>TV_ConSem </i>algorithm, the main information used for term verification is the percentage of context words that appear in the domain lexicon. As discussed in Section 3.2, <i>TV_ConSem </i>ranks commonly used general words higher than others, which leads to its low precision for term extraction. A new term faces a similar scenario because more of its context words occur in the domain lexicon than those of other terms. Thus, <i>TV_ConSem </i>actually ranks new terms higher than other terms, which explains its greater ability to identify new terms in the low range of <i>N</i><i>TCList</i>. However, its performance drops in the high range of <i>N</i><i>TCList </i>because the percentage of context words in the domain lexicon becomes less effective at distinguishing terms from non-terms. Figure 4 also shows that <i>TF-IDF </i>and <i>SVM </i>perform the worst in new term extraction. <i>TF-IDF </i>has a relatively low ability to identify new terms because new terms are not widely used and do not recur often across many documents. 
As <i>SVM </i>is sensitive to training data, it is naturally not adaptive to new terms.</p><p>All current Chinese term extraction algorithms rely on segmentation with comprehensive lexical knowledge, and yet Chinese segmentation algorithms suffer from the OOV (out of vocabulary) problem. This makes Chinese term extraction particularly weak at extracting new terms. The proposed approach, on the other hand, is based on delimiters, which are more stable, domain independent, and OOV independent. Figure 4 shows that <i>TCE_DI </i>and <i>TV_LinkA</i>, using minimal training from different domains, can extract many more new terms than previous techniques. In fact, the proposed approach can serve as a much better tool for identifying new domain terms and can be quite effective for domain lexicon expansion.</p></subsection></section><section number="4" title="Conclusion"><p>In conclusion, this paper presents a robust term extraction approach using minimal resources. It includes a delimiter based algorithm for term candidate extraction and a link analysis based algorithm for term verification. The proposed approach is not as sensitive to term frequency as previous work. It requires no prior domain knowledge, no general corpora, no full segmentation, and minimal adaptation for new domains.</p><p>Experiments on term extraction are conducted on the IT domain and the legal domain. Evaluations indicate that the proposed approach has a number of advantages. Firstly, the proposed approach improves the precision of term extraction quite significantly. Secondly, the fact that the proposed approach achieves the best performance on two different domains verifies its domain independent nature. 
The proposed approach using delimiters extracted from a different domain also achieves the second best performance, which indicates that the delimiters are quite stable and domain independent.<page local="8" global="1040"/> The proposed approach still performs much better than the reference algorithms when using a general purpose stop word list, which means that it can improve performance even as a completely unsupervised approach without any training. Consequently, the results demonstrate that the proposed approach can be applied to different domains easily, even without training. Thirdly, the proposed approach is particularly good at identifying new terms, so it can serve as an effective tool for domain lexicon expansion.</p></section><section title="Acknowledgements"><p>This work was done while the first author was working at the Hong Kong Polytechnic University, supported by CERG Grant B-Q941 and Central Research Grant G-U297.</p></section><references><p>Chang Jing-Shin. 2005. Domain Specific Word Extraction from Hierarchical Web Documents: A First Step toward Building Lexicon Trees from Web Corpora. In <i>Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing: </i>64-71.</p><p>Chien LF. 1999. Pat-tree-based adaptive keyphrase extraction for intelligent Chinese information retrieval. <i>Information Processing and Management, </i>vol. 35: 501-521.</p><p>Eibe Frank, Gordon W. Paynter, Ian H. 
Witten, Carl Gutwin, and Craig G. Nevill-Manning. 1999. Domain-specific Keyphrase Extraction. In <i>Proceedings of 16th International Joint Conference on Artificial Intelligence IJCAI-99: </i>668-673.</p><p>Feng Haodi, Kang Chen, Xiaotie Deng, and Weimin Zheng. 2004. Accessor variety criteria for Chinese word extraction. <i>Computational Linguistics, </i>30(1):75-93.</p><p>Hiroshi Nakagawa, and Tatsunori Mori. 2002. A simple but powerful automatic term extraction method. In <i>COMPUTERM-2002 Proceedings of the 2nd International Workshop on Computational Terminology: </i>29-35. Taiwan, August 2002.</p><p>Hisamitsu T., and Y. Niwa. 2002. A measure of term representativeness based on the number of co-occurring salient words. In <i>Proceedings of the 19th COLING, 2002.</i></p><p>Huang Chu-Ren, Petr Šimon, Shu-Kai Hsieh, and Laurent Prévot. 2007. Rethinking Chinese Word Segmentation: Tokenization, Character Classification, or Wordbreak Identification. In <i>Proceedings of the ACL 2007 Demo and Poster Sessions: </i>69-72.</p><p>Joachims T. 2000. Estimating the Generalization Performance of a SVM Efficiently. In <i>Proceedings of the International Conference on Machine Learning, </i>Morgan Kaufmann, 2000.</p><p>Kageura K., and B. Umino. 1996. Methods of automatic term recognition: a review. <i>Terminology </i>3(2):259-289.</p><p>Kleinberg J. 1997. Authoritative sources in a hyperlinked environment. In <i>Proceedings of the 9th ACM-SIAM Symposium on Discrete Algorithms: </i>668-677. New Orleans, America, January 1997.</p><p>Ji Luning, and Qin Lu. 2007. Chinese Term Extraction Using Window-Based Contextual Information. In <i>Proceedings of CICLing 2007, LNCS 4394: </i>62-74.</p><p>Li Hongqiao, Chang-Ning Huang, Jianfeng Gao, and Xiaozhong Fan. 2004. The Use of SVM for Chinese New Word Identification. 
In <i>Proceedings of the 1st International Joint Conference on Natural Language Processing (IJCNLP 2004): </i>723-732. Hainan Island, China, March 2004.</p><p>Luo Shengfen, and Maosong Sun. 2003. Two-Character Chinese Word Extraction Based on Hybrid of Internal and Contextual Measures. In <i>Proceedings of the Second SIGHAN Workshop on Chinese Language Processing: </i>24-30.</p><p>McDonald, David D. 1993. Internal and External Evidence in the Identification and Semantic Categorization of Proper Names. In <i>Proceedings of the Workshop on Acquisition of Lexical Knowledge from Text: </i>32-43. Columbus, OH, June. Special Interest Group on the Lexicon of the Association for Computational Linguistics.</p><p>Nasreen AbdulJaleel and Yan Qu. 2005. Domain Term Extraction and Structuring via Link Analysis. In <i>Proceedings of the AAAI '05 Workshop on Link Analysis: </i>39-46.</p><p>Salton, G., and McGill, M.J. 1983. <i>Introduction to Modern Information Retrieval. </i>McGraw-Hill.</p><p>Schone, P. and Jurafsky, D. 2001. Is Knowledge-free Induction of Multiword Unit Dictionary Headwords a solved problem? In <i>Proceedings of EMNLP 2001.</i></p><p>Sornlertlamvanich V., Potipiti T., and Charoenporn T. 2000. Automatic Corpus-based Thai Word Extraction with the C4.5 Learning Algorithm. In <i>Proceedings of COLING 2000.</i></p><p>Vladimir N. Vapnik. 1995. <i>The Nature of Statistical Learning Theory. </i>Springer.</p><p>Zhou GD, Shen D, Zhang J, Su J, and Tan SH. 2005. Recognition of Protein/Gene Names from Text using an Ensemble of Classifiers. <i>BMC Bioinformatics, </i>6(Suppl 1):S7.</p></references></body></article>