<?xml version="1.0"?><!DOCTYPE article SYSTEM "/project/take/software/searchbench_offline_processing/paperxml_generator/aclextractor/src/python/../resource/dtd/paperxml.dtd"><article><header><firstpageheader><page local="1" global="1277"/><title>Web-based List Question Answering</title><author surname="Yang" givenname="Hui"><org  name="National University of Singapore" country="Singapore"/></author><author surname="Chua" givenname="Tat-Seng"><org  name="Konan University" country="Japan" city="Kobe"/></author></firstpageheader><frontmatter><p><b>Web-Based List Question Answering</b></p><p><b>Hui Yang, Tat-Seng Chua</b></p><p>School of Computing, National University of Singapore, 3 Science Drive 2, 117543, Singapore. yangh@lycos.co.uk, chuats@comp.nus.edu.sg</p></frontmatter><abstract>While research on question answering has become popular in recent years, the problem of efficiently locating a complete set of distinct answers to list questions in huge corpora or on the Web is still far from being solved. This paper exploits the wealth of freely available text and link structures on the Web to seek complete answers to list questions. We introduce our system, FADA, which relies on question parsing, web page classification/clustering, and content extraction to find reliable distinct answers with high recall. </abstract></header><body><section number="1" title="Introduction"><p>The Text REtrieval Conference Series (TREC) has greatly encouraged Question Answering (QA) research in recent years. The QA main task in the recent TREC-12 involved retrieving short concise answers to factoid and list questions, and answer nuggets for definition questions (Voorhees, 2003). The list task in TREC-12 required systems to assemble a set of distinct and complete exact answers as responses to questions like <i>"What are the brand names of Belgian chocolates?". 
</i>Unlike the questions in previous TREC conferences, TREC-12 list questions did not specify a target number of instances to return but expected all answers contained in the corpus. Current QA systems (Harabagiu et al., 2003; Katz et al., 2003) usually extract a ranked list of factoid answers from the top documents returned by retrieval engines. This is actually the traditional way to find factoid answers. The only difference between answering list questions and factoid questions here is that list QA systems allow for <i>multiple answers</i>, namely those whose scores are above a cut-off threshold.</p><p>An analysis of the results of TREC-12 list QA systems (Voorhees, 2003) reveals that many of them suffer severely from two general problems: <i>low recall and non-distinctive answers. </i>The median average F1 performance of list runs was only 21.3% while the best performer could only achieve 39.6% (Table 1). This unsatisfactory performance exposes the limitation of using only traditional Information Retrieval and Natural Language Processing techniques to find an exhaustive set of factoid answers as compared to only one.</p><p>In contrast to the traditional techniques, the Web has been used extensively in systems that answer factoid questions. QA researchers have explored a variety of uses of the Web, ranging from surface pattern mining (Ravichandran et al., 2002), query formulation (Yang et al., 2003), and answer validation (Magnini et al., 2002), to directly finding answers on the Web by data redundancy analysis (Brill et al., 2001). These systems demonstrated that with the help of the Web they could generally boost baseline performance by 25%-30% (Lin, 2002).</p><p>The well-known redundancy-based approach identifies the factoid answer as an N-gram appearing most frequently on the Web (Brill et al., 2001). 
This idea works well on factoid questions because factoid questions require only one instance and web documents contain a large amount of repeated information about possible answers. However, when dealing with list questions, we need to find all distinct instances and hence we cannot ignore the less frequent answer candidates. The redundancy-based approach fails to spot novel or unexpectedly valuable information in lower-ranked web pages with few occurrences.</p><p>In this paper, we propose a novel framework to employ the Web to support list question answering. Based on the observations that multiple answer instances often appear in the list or table of a single web page while multiple web pages may also contain information about the same instance, we differentiate these two types of web pages. For the first category, which we call Collection Page (CP), we need to extract table/list content from the web page. For the second category, which we call Topic Page (TP), we need to find distinct web pages relating to different answer instances. 
We will demonstrate that the resulting system, FADA (<b>F</b>ind <b>A</b>ll <b>D</b>istinct <b>A</b>nswers), can achieve effective list question answering in the TREC corpus.</p><table caption="Table 1: TREC-12 Top 5 Performers (Voorhees, 2003)" class="main" frame="box" rules="all" border="0" regular="False"><tr class="row"><td class="cell"></td><td class="cell"><p>TREC-12 Run Tag</p></td><td class="cell"><p>Avg F1</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>LCCmainS03</p></td><td class="cell"><p>0.396</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>nusmml03r2</p></td><td class="cell"><p>0.319</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>MITCSAIL03c</p></td><td class="cell"><p>0.134</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>isi03a</p></td><td class="cell"><p>0.118</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>BBN2003B</p></td><td class="cell"><p>0.097</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Average</p></td><td class="cell"><p>0.213</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr></table><page local="2" global="1278"/><p>The remainder of this paper is organized as follows. Section 2 gives the design considerations of our approach. Section 3 details our question analysis and web query formulation. Section 4 describes the web page classification and web document features used in FADA. Section 5 shows the algorithm of topic page clustering while Section 6 details the answer extraction process. Section 7 discusses experimental results. Section 8 concludes the paper.</p></section><section number="2" title="Design Considerations"><p>Our goal is to find as many distinct exact answers on the Web as possible. 
This requires us to:</p><p>• perform effective and exhaustive search; and</p><p>• extract distinct answers.</p><p>In order to perform effective search, we employ question transformation to get effective web queries. However, this is not a trivial task. If the query is too general, too many documents may be retrieved and the system would not have sufficient resources to scan through all of them. If the query is too specific, no pages may be retrieved.</p><p>Given millions of web pages returned by search engines, our strategy is to divide and conquer by first identifying Collection Pages (CP) that contain a list of answer instances. For example, for the question <i>"What breeds of dog have won the "Best in Show" award at the Westminster Dog Show?"</i>, we can find a Collection Page as shown in Figure 1 (top). Such a web page is a very good resource of answers. In general, we observe that a large number of named entities of the desired type appear in a Collection Page, typically in a list or table. Our intuition is that if we can find a Collection Page that contains almost all the answers, then the rest of the work is simply to extract answers from it or related web pages by wrapper rule induction.</p><p>Another kind of "good" web page is a Topic Page, which contains just one answer instance (Figure 1, bottom). It typically contains many named entities, which correspond to our original query terms and some other named entities of the answer target type. Given the huge amount of web data, there will be many Topic Pages that refer to the same answer instance. 
There is hence a need to group the pages and to identify a pertinent and distinctive page in order to represent a distinct answer.</p><table caption="Table 2: Web Page Classes"><tr class="row"><td class="cell"><p><b>Web page class</b></p></td><td class="cell"><p><b>Description</b></p></td></tr><tr class="row"><td class="cell"><p>Collection Page</p></td><td class="cell"><p>Containing a list of answers</p></td></tr><tr class="row"><td class="cell"><p>Topic Page</p></td><td class="cell"><p>The best page to represent an answer instance</p></td></tr><tr class="row"><td class="cell"><p>Relevant Page</p></td><td class="cell"><p>Relevant to an answer instance by providing either support or objection to the Topic Page</p></td></tr><tr class="row"><td class="cell"><p>Irrelevant Page</p></td><td class="cell"><p>Not related to any answer instance</p></td></tr></table><p>Figure 1: Examples of Collection Page (top) and Topic Page (bottom)</p><p>The rest of the top returned web pages could be either relevant or irrelevant to the question. In summary, we need to classify web pages into four classes: <i>Collection Page, Topic Page, Relevant Page, and Irrelevant Page </i>(Table 2), based on their functionality and contribution in finding list answers.</p><p>Based on the above considerations, we propose a general framework to find list answers on the Web using the following steps:</p><p>a) Retrieve a good set of web documents.</p><p>b) Identify Collection Pages and distinct Topic Pages as main resources of answers.</p><p>c) Perform clustering on other web pages based on their similarities to distinct Topic Pages to form clusters that correspond to distinct answer instances.</p><p>d) Extract answers from Collection Pages and Topic Page clusters.</p></section><section number="3" title="Question Transformation and Web Page Retrieval"><p>Agichtein et al. (2001) presented a technique for learning search engine specific query transformations for question answering. A set of transformation rules is learned from a training corpus and applied to the questions at search time. Related work can also be found in Kwok et al. 
(2001), where the user's question is processed by a parser to learn its syntactic structure, and various query modulation techniques are applied to the initial questions to get high-quality results for later answer extraction.</p><p>FADA performs question parsing to identify key question words and the expected answer type. It extracts several sets of words from the original question and identifies the detailed question classes. It then formulates a number of queries by combining the known facets together with heuristic patterns for list questions.<page local="3" global="1279"/></p><p>We perform both shallow and full parsing on a question followed by Named Entity Recognition (NER) to get the known query facets and their types. The shallow parser we used is the free online memory-based chunker<footnote anchor="1"/> and the full parser is MINIPAR<footnote anchor="2"/>. Both parsers are very efficient and usually parse 300 words within a second. The procedure of query parsing is as follows:</p><p>a) Remove head words. The head words in a question could be wh-question words and leading verbs. The list of head words includes <i>"who, what, when, where, which, how, how much, how many, list, name, give, provide, tell"</i>, etc. Removing them enables us to get the correct subject/object relation and verb in the question. For example, for the question <i>"What breeds of dog have won the 'Best in Show' award at the Westminster Dog Show?"</i>, after removing the head word, the question becomes <i>"breeds of dog have won the 'Best in Show' award at the Westminster Dog Show".</i></p><p>b) Detect subject and object for the remaining question segments by shallow parsing. 
For example, after parsing the above question, we get:</p><p>[NP1Subject breeds//NNS NP1Subject] {PNP [P of/IN P] [NP dog/NN NP] PNP} [VP1 have/VBP won/VBN VP1] [NP1Object the/DT Best/JJS NP1Object] {PNP [P in/IN P] [NP Show/NNP ''/'' award/NN NP] PNP} {PNP [P at/IN P] [NP the/DT Westminster/NNP Dog//NNP Show/NNP NP] PNP}</p><p>From the parsed sentence, we want to get the logical subject as the sentence subject or its immediate modifiers. Here we have the logical subject <i>"breeds of dog"</i>, verb <i>"won"</i>, and logical object <i>"the best in show award"</i>. If the resulting logical subject/object is the term <i>"that"</i>, as in the following parsed query for <i>"U.S. entertainers that later became politicians"</i>:</p><p>[NP U.S./NNP entertainers//NNS NP] [NP1Subject that/WDT NP1Subject] [ADVP later/RB ADVP] [VP1 became/VBD VP1] [NP1Object politicians//NNS NP1Object]</p><p>we get the noun or noun phrase before the clause as the logical subject/object. Hence, we have the logical subject <i>"entertainers"</i>, action <i>"became"</i>, and logical object <i>"politician"</i>.</p><p>c) Extract all the noun phrases as potential descriptions from the remaining question segments, which are usually prepositional phrases or clauses. For the "dog breeds" example, we get the description <i>"Westminster Dog Show"</i>.</p><footnote label="1"> http://ilk.kub.nl/cgi-bin/tstchunk/demo.pl</footnote><footnote label="2"> http://www.cs.ualberta.ca/~lindek/minipar.htm</footnote><p>d) Apply named entity recognition to the resulting description phrases by using NEParser, a fine-grained named entity recognizer used in our TREC-12 system (Yang et al., 2003). It assigns tags like "person", "location", "time", "date", "number". 
For the "dog breed" example, <i>"Westminster"</i> gets the location tag.</p><p>After the above analysis, we obtain all the known facets provided in the original question. We then make use of this knowledge to form web queries to get the right set of pages. This is a crucial task in dealing with the Web. One of the query transformation rules is given as follows:</p><doubt alpha="66.1" length="127" tooSmall="False" monospace="0.0">(list|directory|category|top|favorite )?   (:|of)? &lt;subj&gt; &lt;action&gt;? &lt;object&gt;? &lt;description1&gt;? &lt;description2&gt;?...&lt;descriptionN&gt;?</doubt><p>The rule starts the query optionally with a leading word (list, directory, category, top, or favorite), optionally followed by a colon or "of", followed by the subject phrase (&lt;subj&gt;), optionally followed by the action (&lt;action&gt;), optionally followed by the object (&lt;object&gt;) and description phrases (&lt;description1&gt;...&lt;descriptionN&gt;). In the above pattern, "?" marks the preceding element as optional, and "|" separates alternatives. For example, for the <i>"dog breed"</i> question, we form queries such as <i>"breed of dog won best in show Westminster Dog Show"</i>, <i>"directory breed of dog best in show Westminster Dog Show"</i>, and <i>"list breed of dog won best in show"</i>.</p><p>Transforming the initial natural language questions into a good query can dramatically improve the chances of finding good answers. FADA submits these queries to well-known search engines (Google, AltaVista, Yahoo) to get the top <i>1,000 </i>Web pages per search engine per query. Here we attempt to retrieve a large number of web pages to serve our goal: to find <i>All Distinct </i>answers. Usually, a large number of these web pages are redundant as they come from the same URL addresses. We remove the redundant web pages using the URL addresses as the guide. We also filter out files whose formats are neither HTML nor plain text and those whose lengths are too short or too long. 
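</p><p>The query transformation rule and the URL-based removal of redundant pages described above can be illustrated with a minimal sketch. It is not the FADA implementation: the function names expand_queries and dedupe_by_url are hypothetical, the choice to skip a connector when no leading word is present is an assumption, and so is the URL normalization (dropping a trailing slash).</p>
<p>
```python
from itertools import product

LEADING_WORDS = ["", "list", "directory", "category", "top", "favorite"]
CONNECTORS = ["", ":", "of"]


def expand_queries(subj, action="", obj="", descriptions=()):
    """Enumerate queries allowed by the rule; only subj is mandatory."""
    # Each optional slot is either omitted ("") or filled with its value.
    slots = [LEADING_WORDS, CONNECTORS, [subj]]
    for value in (action, obj, *descriptions):
        slots.append(["", value] if value else [""])
    queries = []
    for parts in product(*slots):
        lead, connector = parts[0], parts[1]
        if connector and not lead:
            continue  # assumption: a connector only follows a leading word
        query = " ".join(part for part in parts if part)
        if query not in queries:
            queries.append(query)
    return queries


def dedupe_by_url(urls):
    """Keep the first occurrence of each URL, ignoring a trailing slash (assumed)."""
    seen, unique = set(), []
    for url in urls:
        key = url.rstrip("/")
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique


if __name__ == "__main__":
    for q in expand_queries("breed of dog", "won", "best in show",
                            ["Westminster Dog Show"])[:5]:
        print(q)
```
</p><p>For the "dog breed" question, the enumeration includes the three example queries quoted above; a real system would presumably submit only a selected subset of the generated queries. 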
Hence the size of the resulting document set for each question varies from a few thousand to tens of thousands.</p></section><section number="4" title="Web Page Classification"><p>In order to group the web pages returned by the search engines into the four categories discussed earlier, it is crucial to find a good set of features to represent the web pages. Many techniques such as tf.idf (Salton and Buckley, 1988) and a stop word list have been proposed to extract lexical features to help document clustering. However, they do not work well for question answering. As pointed out by Ye et al. (2003) in their discussion on the person/organization finding task, given two resume pages about different persons, it is highly possible that they are grouped into one cluster because they share many similar words and phrases.<page local="4" global="1280"/> On the other hand, it is difficult to group together a news page and a resume page about the same target entity, due to the diversity in subject matter, word choice, literary styles and document format. To overcome this problem, they used mostly named entity and link information as the basis for clustering. Compared to their task, our task of finding good web documents containing answers is much more complex. The features are more heterogeneous, and it is more difficult to choose those that reflect the essential characteristics of list answers.</p><p>In our approach, we obtain the query words through subject/object detection and named entity recognition. We found that a large number of named entities of the same type appear in a Collection Page, typically within a list or table. In a Topic Page, there is also typically a group of named entities, which could correspond to our original query terms or answer target type. 
Therefore, named entities play important roles in semantic expression and should be used to reflect the content of the pages.</p><p>The Web track in past TREC conferences shows that URL, HTML structure, anchor text, hyperlinks, and document length tend to contain important heuristic clues for web clustering and information retrieval (Craswell and Hawking, 2002). We have found that a Topic Page is highly likely to repeat the subject in its URL, title, or at the beginning of its page. In general, if the subject appears in important locations, such as in HTML tags &lt;title&gt;, &lt;H1&gt; and &lt;H2&gt;, or appears frequently, then the corresponding pages should be Topic Pages and their topic is about the answer target.</p><p>Following the above discussion, we design a set of 29 features based on Known Named Entity Type, Answer Named Entity Type, ordinary Named Entities, list, table, URL, HTML structure, Anchor, Hyperlinks, and document length to represent the web pages. Table 3 lists the features used in our system. In the table and subsequent sections, NE refers to Named Entity.</p><p>We trained two classifiers: the Collection Page classifier and the Topic Page classifier. The former classifies web pages into Collection Pages and non-collection pages while the latter further classifies the non-collection pages into Topic Pages and Others. Both classifiers are implemented using Decision Tree C4.5 (Quinlan, 1993). We used 50 list questions from TREC-10 and TREC-11 for training and TREC-12 list questions for testing. We parse the questions, formulate web queries and collect web pages by using the algorithm described in Section 3.</p><p>Each sample is represented using the features listed in Table 3. 
Some of the decision rules are as follows:</p><p>a) OUT_Link &gt;= 25 &amp; NE &gt; 78 &amp; Answer_NE &gt;= 30 -&gt; Class CP</p><p>b) OUT_Link &lt;= 25 &amp; Answer_NE &lt;= 5 &amp; NE &gt; 46 -&gt; Class TP</p><p>c) OUT_Link &gt;= 25 &amp; URL_Depth &gt; 3 -&gt; Others</p><p>d) NE &lt;= 4 -&gt; Others</p><p>Rule a) implies that good Collection Pages should have many outlinks, NEs and especially answer NEs. Rule b) implies that good Topic Pages should have many NEs but relatively few links and answer NEs. Rule c) shows that Others have deeper URL depth, while Rule d) shows that they have fewer NEs.</p><table caption="Table 3: Web Page Features" class="main" frame="box" rules="all" border="0" regular="False"><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p></p></td><td class="cell"><p>Feature</p></td><td class="cell"><p>Meaning</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>1</p></td><td class="cell"><p>|PER|</p></td><td class="cell"><p># of Person NEs</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>2</p></td><td class="cell"><p>|ORG|</p></td><td class="cell"><p># of Organization NEs</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>3</p></td><td class="cell"><p>|LOC|</p></td><td class="cell"><p># of Location NEs</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>4</p></td><td class="cell"><p>|TME|</p></td><td class="cell"><p># of Time NEs, including date, year, month</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>5</p></td><td class="cell"><p>|NUM|</p></td><td class="cell"><p># of Number NEs</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>6</p></td><td class="cell"><p>|COD|</p></td><td class="cell"><p># of Code NEs, including phone number, zip code, etc.</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>7</p></td><td class="cell"><p>|OBJ|</p></td><td class="cell"><p># of Object NEs, including animal, planet, book, etc.</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>8</p></td><td class="cell"><p>|NE|</p></td><td class="cell"><p>Total number of the above NEs</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>9</p></td><td class="cell"><p>|Known_NE|</p></td><td class="cell"><p>Total # of NEs within the same NE type as in the question. In the "dog breed" example, it is the number of Location NEs since "Westminster" is identified as Location by NER.</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>10</p></td><td class="cell"><p>|Unknown_NE|</p></td><td class="cell"><p># of NEs belonging to other NE types. In the "dog breed" example, it is the total number of <i>Time</i> and <i>Breed</i> NEs</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>11</p></td><td class="cell"><p>|Answer_NE|</p></td><td class="cell"><p># of NEs belonging to expected answer type. In the "dog breed" example, it is the number of <i>Breed</i> NEs</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>12</p></td><td class="cell"><p>|Known_NE| / |NE|</p></td><td class="cell"><p>Ratio of |Known_NE| to |NE|</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>13</p></td><td class="cell"><p>|Unknown_NE| / |NE|</p></td><td class="cell"><p>Ratio of |Unknown_NE| to |NE|</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>14</p></td><td class="cell"><p>|Answer_NE| / |NE|</p></td><td class="cell"><p>Ratio of |Answer_NE| to |NE|</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>15</p></td><td class="cell"><p>Length</p></td><td class="cell"><p># of tokens in a page</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>16</p></td><td class="cell"><p>Content_Length</p></td><td class="cell"><p># of words in a page excluding HTML tags</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>17</p></td><td class="cell"><p>|NE|/Length</p></td><td class="cell"><p>Ratio of |NE| to |Token|</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>18</p></td><td class="cell"><p>|NE|/Content_Length</p></td><td class="cell"><p>Ratio of |NE| to |Word|</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>19</p></td><td class="cell"><p>|In_Link|</p></td><td class="cell"><p># of in-links</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>20</p></td><td class="cell"><p>|Out_Link|</p></td><td class="cell"><p># of out-links</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>21</p></td><td class="cell"><p>|All_Link|</p></td><td class="cell"><p>The sum of in-links and out-links</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>22</p></td><td class="cell"><p>Keyword_in_Title</p></td><td class="cell"><p>Boolean indicating presence of keywords in page title</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>23</p></td><td class="cell"><p>Keyword_in_URL</p></td><td class="cell"><p>Boolean indicating presence of keywords in URL</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>24</p></td><td class="cell"><p>Keyword_in_Page</p></td><td class="cell"><p>Boolean indicating presence of keywords in the page</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>25</p></td><td class="cell"><p>|Answer_NE_in_Title|</p></td><td class="cell"><p># of NEs belonging to expected answer type present in page title</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>26</p></td><td class="cell"><p>|Answer_NE_in_URL|</p></td><td class="cell"><p># of NEs belonging to expected answer type present in URL</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>27</p></td><td class="cell"><p>|&lt;li&gt;|</p></td><td class="cell"><p># of HTML tags representing a list or table, including &lt;li&gt;, &lt;ol&gt;, &lt;ul&gt;, &lt;br&gt;, &lt;td&gt;</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>28</p></td><td class="cell"><p>|&lt;li&gt;&lt;a href=|</p></td><td class="cell"><p># of HTML tags, including &lt;li&gt;, &lt;ol&gt;, &lt;ul&gt;, &lt;br&gt;, &lt;td&gt;, that represent a list/table of anchors</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>29</p></td><td class="cell"><p>URL_Depth</p></td><td class="cell"><p>The depth of URL</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr></table><page local="5" global="1281"/><p>Web page classification enables us to get Collection Pages, Topic Pages and the rest of the pages. Our experiments on TREC-12 list questions showed that we can achieve a classification precision of 91.1% and 92% for Collection Pages and Topic Pages respectively.</p></section><section number="5" title="Finding Answer Sources"><p>Based on Web page classification, we form the initial sets of Collection Pages CPSet, Topic Pages TPSet and OtherSet. In order to boost the recall, we first use the outgoing links of Collection Pages to find more Topic Pages. These outgoing pages are potential Topic Pages but do not necessarily appear among the top returned web documents. Our subsequent tests reveal that the new Topic Pages introduced by links from Collection Pages increase the overall answer recall by 23%. The new Topic Page set becomes:</p><p><i>TPSet' = TPSet + {outgoing pages of CPs}</i></p><p>Second, we select distinct Topic Pages. We compare the page similarity between each pair of Topic Pages using the algorithm below.</p><p>for each pair {<i>tpi</i>, <i>tpj</i>} in <i>TPSet'</i></p><p>if (<i>sim</i>(<i>tpi</i>, <i>tpj</i>) &gt; <b><i>θ</i></b>)</p><p>if <i>#ANE_in_tpi</i> &gt; <i>#ANE_in_tpj</i> move <i>tpj</i> into <i>OtherSet</i>;</p><p>Here the page similarity function <i>sim() </i>is a linear combination of the overlaps in <i>Known_NE</i> and <i>Answer_NE</i>, <i>URL similarity</i>, and <i>link similarity</i>. <b><i>θ</i></b> is preset at 0.75 and may be overridden by the user. <i>#ANE_in_tpi</i> is the number of named entities of answer type in Topic Page <i>tpi</i>. 
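</p><p>The selection of distinct Topic Pages can be sketched as follows. This is a minimal illustration under stated assumptions, not the FADA implementation: the Page class, the Jaccard overlap, the host-based URL similarity and the equal weights are all hypothetical, since the text only states that sim() is a linear combination of these four components. The sketch also uses a greedy variant of the pairwise loop, visiting pages in decreasing order of answer-type NE count.</p>
<p>
```python
from dataclasses import dataclass, field
from urllib.parse import urlparse

WEIGHTS = (0.25, 0.25, 0.25, 0.25)  # hypothetical; the weights are not given in the text


@dataclass
class Page:
    url: str
    known_ne: set = field(default_factory=set)   # NEs of the known (question) types
    answer_ne: set = field(default_factory=set)  # NEs of the expected answer type
    links: set = field(default_factory=set)      # outgoing link targets


def jaccard(a, b):
    """Set overlap in [0, 1]; 0.0 when both sets are empty."""
    union = a.union(b)
    return len(a.intersection(b)) / len(union) if union else 0.0


def url_similarity(u, v):
    """Crude stand-in: 1.0 when both URLs share a host, else 0.0 (assumed)."""
    return 1.0 if urlparse(u).netloc == urlparse(v).netloc else 0.0


def sim(p, q):
    """Linear combination of the four overlaps named in the text."""
    parts = (
        jaccard(p.known_ne, q.known_ne),
        jaccard(p.answer_ne, q.answer_ne),
        url_similarity(p.url, q.url),
        jaccard(p.links, q.links),
    )
    return sum(w * x for w, x in zip(WEIGHTS, parts))


def select_distinct(topic_pages, theta=0.75):
    """Return (distinct Topic Pages, pages moved to OtherSet)."""
    kept, other = [], []
    # Richer pages come first, so the page with more answer-type NEs survives.
    ranked = sorted(topic_pages, key=lambda p: len(p.answer_ne), reverse=True)
    for page in ranked:
        if any(sim(page, seed) > theta for seed in kept):
            other.append(page)
        else:
            kept.append(page)
    return kept, other
```
</p><p>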
For those pairs with high similarity, we keep the page that contains more named entities of answer type in <i>TPSet' </i>and move the other into <i>OtherSet. </i>The resulting Topic Pages in <i>TPSet' </i>are distinct and will be used as cluster seeds for the next step.</p><p>Third, we identify and dispatch Relevant Pages from <i>OtherSet </i>into appropriate clusters based on their similarities with the cluster seeds.</p><p>for each <i>rpi </i>in <i>OtherSet </i>{</p><p><i>k </i>= argmax<i>k </i>{<i>sim(rpi, tpk)</i>};</p><p>if (<i>sim(rpi, tpk) &gt; </i><b><i>T</i></b>) insert <i>rpi </i>into <i>clusterk</i>; else insert <i>rpi </i>into <i>IrrelevantSet</i>; }</p><p>Here <b><i>T </i></b>is preset at 0.55, and <i>sim() </i>is defined as above. Each cluster corresponds to a distinct answer instance. The Topic Page provides the main facts about that answer instance, while the Relevant Pages provide supporting material for it. The average ratio of correctly clustered pages is 54.1% in our experiments.</p><p>Through web page clustering, we avoid early answer redundancy and have a higher chance of finding distinct answers on the noisy Web.</p></section><section number="6" title="Answer Extraction"><subsection number="6.1" title="HTML Source Page Cleaning"><p>Many HTML web pages contain common HTML mistakes, including missing or unmatched tags, end tags in the wrong order, missing quotes around attributes, a missing / in end tags, and a missing closing &gt; in tags. We use HtmlTidy<footnote anchor="3"/> to clean up the web pages before classification and clustering. FADA also uses an efficient technique to remove advertisements. We periodically update a blacklist of advertisers from Accs-Net<footnote anchor="4"/>, a site that specializes in creating such blacklists. 
If a link address matches an entry in the blacklist, the HTML portion that contains the link is removed.</p></subsection><subsection number="6.2" title="Answer Extraction from CP"><p>Collection Pages are very good answer resources for list QA. However, to extract the "exact" answers from the resource page, we need to perform wrapper rule induction to extract the useful content. There is a large body of related work in content extraction, which enables us to process only extracted content rather than cluttered data coming directly from the web. Gupta et al. (2003) parsed HTML documents into a Document Object Model tree and extracted the main content of a web page by removing the link lists and empty tables. In contrast, our link list extractor finds all link lists, which are table cells or lists for which the ratio of the number of links to the number of non-linked words is greater than a specified ratio. We have written separate extractors for each answer target type. The answers obtained in Collection Pages are then "projected" onto the TREC AQUAINT corpus to get the TREC answers (Brill et al., 2001).</p></subsection><subsection number="6.3" title="Answer Extraction from TP Cluster"><p>Having web pages clustered for a certain question, especially when the clusters nicely match distinct answers, facilitates the task of extracting the possible answers based on the answer target type. We perform this by first analyzing the main Topic Page in each cluster. If we find multiple passages containing different answer candidates in the same Topic Page, we select the answer candidate from the passage that has the greatest variety of NE types, since it is likely to be a comprehensive description of different facets of the question topic. The answer found in the Topic Page is then "projected" onto the QA corpus to get the TREC answers, as with the Collection Page. 
If no TREC answers can be found based on the Topic Page, we go to the next most relevant page in the same cluster to search for the answer.<page local="6" global="1282"/> The process is repeated until either an answer from the cluster is found in the TREC corpus or all Relevant Pages in the cluster have been exhausted.</p><footnote label="3"> http://htmltrim.sourceforge.net/tidy.html</footnote><footnote label="4"> http://www.accs-net.com/hosts/get_hosts.html</footnote><p>For the question <i>"Which countries did the first lady Hillary Clinton visit?", </i>we extracted the <i>Locations </i>after performing Named Entity analysis on each cluster and obtained 38 country names as answers. This recall is much higher than that of the best performing system in TREC-12 (Harabagiu et al., 2003), which found 26 out of 44 answers.</p></subsection></section><section number="7" title="Evaluation on TREC-12 Question Set"><p>We used the 37 TREC-12 list questions to test the overall performance of our system and compared the answers we found in the TREC AQUAINT corpus (after <i>answer projection </i>(Brill et al. 2001)) with the answers provided by NIST.</p><subsection number="7.1" title="Tests of Web Page Classification"><p>In Section 3, the web pages are classified into three classes: Collection Pages, Topic Pages, and Others. Table 4 shows the system performance of the classification. We then perform a redistribution of classified pages, where the outgoing pages from CPs go to the TP collection, and the Relevant Pages are grouped as supportive materials into clusters, which are based on distinct Topic Pages. Nevertheless, the performance of web page classification will influence the later clustering and answer finding tasks. Table 4 shows that we could achieve an overall classification average precision of 0.897 and average recall of 0.851. 
This performance is adequate to support the subsequent steps of finding complete answers.</p><p>Table 4: Performance of Web Page Classification</p></subsection><subsection number="7.2" title="Performance and Effects of Web Page Clustering"><p>Relevant Pages are put into clusters to provide supportive material for a certain answer instance. The performance of Relevant Page dispatch/clustering is 54.1%. We also test different clustering thresholds for our web page clustering as defined in Section 5. We use the F1 measure of the TREC-12 list QA results as the basis to compare the performance of different clustering threshold combinations, as shown in Table 5. We obtain the best performance of F1 = 0.464 when T=0.55 and θ=0.75.</p><doubt alpha="0.0" length="28" tooSmall="False" monospace="0.0">θ=0.55 θ=0.65 θ=0.75 θ=0.85</doubt></subsection><subsection number="7.3" title="Overall Performance"><p>Table 6 compares a baseline list question answering system with FADA. The baseline is based on a system which we used in our participation in the TREC-12 QA task (Yang et al., 2003). It extends the traditional IR/NLP approach for factoid QA to perform list QA, as is done in most other TREC-12 systems. It achieves an average F1 of 0.319 and is ranked 2nd in the list QA task.</p><p>We test two variants of FADA: one without introducing the outgoing pages from CPs as potential TPs (FADA1), and one with (FADA2). The two variants are used to evaluate the effects of CPs in the list QA task. The results of these two variants of FADA on the TREC-12 list task are presented in Table 6.</p><table caption="Table 6: Performance on TREC-12 Test Set"><tr class="row"><td class="cell"></td><td class="cell"><p>Avg P</p></td><td class="cell"><p>Avg R</p></td><td class="cell"><p>Avg F1</p></td></tr><tr class="row"><td class="cell"><p>Baseline</p></td><td class="cell"><p>0.568</p></td><td class="cell"><p>0.264</p></td><td class="cell"><p>0.319</p></td></tr><tr class="row"><td class="cell"><p>FADA1 (w/o outgoing pages)</p></td><td class="cell"><p>0.406</p></td><td class="cell"><p>0.344</p></td><td class="cell"><p>0.372</p></td></tr><tr class="row"><td class="cell"><p>FADA2 (w/ outgoing pages)</p></td><td class="cell"><p>0.516</p></td><td class="cell"><p>0.422</p></td><td class="cell"><p>0.464</p></td></tr><tr class="row"><td class="cell"><p>TREC-12 best run</p></td><td class="cell"><p>-</p></td><td class="cell"><p>-</p></td><td class="cell"><p>0.396</p></td></tr></table><p>Even without the benefit of the outgoing pages from CPs to find potential answers, FADA1 improves the average recall by 30% and the average F1 by 16.6% relative to the baseline. The great improvement in recall is rather encouraging because it is crucial for a list QA system to find a complete set of answers, which is how list QA differs from factoid QA.</p><p>By taking advantage of the outgoing pages from CPs, FADA2 further improves performance to an average recall of 0.422 and average F1 of 0.464. It outperforms the best TREC-12 QA system (Voorhees, 2003) by 19.6% in average F1 score.</p><p>Table 6 shows that the outgoing pages from the Collection Pages (or resource pages) contribute substantially to the answer finding task. They give rise to an improvement in recall of 22.7% over FADA1, the variant that does not take advantage of outgoing pages. We think this is mainly due to the characteristics of the TREC-12 questions. Most questions ask for well-known things and famous events, people, and organizations. For questions of this kind, we can easily find a Collection Page that contains tabulated answers since there are web sites that host and maintain such information. 
For instance, "Westminster Dog Show" has an official web site<footnote anchor="5"/>.<page local="7" global="1283"/> However, for those questions that lack Collection Pages, such as <i>"Which countries did the first lady Hillary Clinton visit?", </i>we still need to rely more on Topic Pages and Relevant Pages.</p><table caption="Table 5: Clustering Threshold Effects" class="main" frame="box" rules="all" border="0" regular="False"><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><i>T=0.25</i></p></td><td class="cell"><p>0.130</p></td><td class="cell"><p>0.234</p></td><td class="cell"><p>0.324</p></td><td class="cell"><p>0.236</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><i>T=0.35</i></p></td><td class="cell"><p>0.136</p></td><td class="cell"><p>0.244</p></td><td class="cell"><p>0.338</p></td><td class="cell"><p>0.232</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><i>T=0.45</i></p></td><td class="cell"><p>0.148</p></td><td class="cell"><p>0.332</p></td><td class="cell"><p>0.428</p></td><td class="cell"><p>0.146</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b><i>T=0.55</i></b></p></td><td class="cell"><p>0.166</p></td><td class="cell"><p>0.408</p></td><td class="cell"><p><b>0.464</b></p></td><td class="cell"><p>0.244</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><i>T=0.65</i></p></td><td class="cell"><p>0.200</p></td><td class="cell"><p>0.322</p></td><td class="cell"><p>0.432</p></td><td class="cell"><p>0.236</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr></table><table class="main" frame="box" rules="all" border="1" regular="False"><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Page Class</p></td><td class="cell"><p>Avg Prec.</p></td><td class="cell"><p>Avg Rec.</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Collection</p></td><td class="cell"><p>91.1%</p></td><td class="cell"><p>89.5%</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Topic</p></td><td class="cell"><p>92.0%</p></td><td class="cell"><p>88.4%</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Relevant</p></td><td class="cell"><p>86.5%</p></td><td class="cell"><p>83.4%</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Overall</p></td><td class="cell"><p>89.7%</p></td><td class="cell"><p>85.1%</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr></table><p>With the emphasis on answer completeness and uniqueness, FADA uses a large set of documents obtained from the Web to find answers. As compared to the baseline system, this results in a drop in average answer precision although both recall and F1 are significantly improved. This is due to the fact that we seek most answers from the noisy Web directly, whereas in the baseline system, the Web is merely used to form new queries and the answers are found from the TREC AQUAINT corpus. 
We are still working to find a good balance between precision and recall.</p><p>The idea behind the FADA system is simple: since Web knowledge helps in answering factoid questions, why not list questions? Our approach in FADA demonstrates that this is possible. We believe that list QA should benefit even more than factoid QA from using Web knowledge.</p></subsection></section><section number="8" title="Conclusion"><p>We have presented the techniques used in FADA, a system which aims to find complete and distinct answers on the Web using question parsing, web page classification/clustering and content extraction. With this approach, we achieve a recall of 0.422 and F1 of 0.464, which is significantly better than the top performing systems in the TREC-12 List QA task. Our future work includes discovering answers in non-text web information, such as images. Much text information on the web is stored as images; some of it is valuable, but it cannot be accessed by our current approach.</p></section><references><p>E. Agichtein, S. Lawrence, and L. Gravano. 2001. "Learning search engine specific query transformations for question answering." In the Proceedings of the 10th ACM World Wide Web Conference (WWW 2001).</p><p>E. Brill, J. Lin, M. Banko, S. Dumais, and A. Ng. 2001. "Data-intensive question answering." In the Proceedings of the 10th Text REtrieval Conference (TREC 2001).</p><p>N. Craswell, D. Hawking. 2002. "Overview of the TREC-2002 Web Track." In the Proceedings of the 11th Text REtrieval Conference (TREC 2002).</p><p>S. Gupta, G. Kaiser, D. Neistadt, P. Grimm. 2003. "DOM-based Content Extraction of HTML Documents." In the Proceedings of the 12th ACM World Wide Web Conference (WWW 2003).</p><p>S. Harabagiu, D. Moldovan, C. Clark, M. Bowden, J. Williams, J. 
Bensley. 2003. "Answer Mining by Combining Extraction Techniques with Abductive Reasoning." In the notebook of the 12th Text REtrieval Conference (TREC 2003), 46-53.</p><p>B. Katz, J. Lin, D. Loreto, W. Hildebrandt, M. Bilotti, S. Felshin, A. Fernandes, G. Marton, F. Mora. 2003. "Integrating Web-Based and Corpus-Based Techniques for Question Answering." In the notebook of the 12th Text REtrieval Conference (TREC 2003), 472-480.</p><p>C. Kwok, O. Etzioni, and D. S. Weld. 2001. "Scaling Question Answering to the Web." In the Proceedings of the 10th ACM World Wide Web Conference (WWW 2001).</p><p>C. Y. Lin. 2002. "The Effectiveness of Dictionary and Web-Based Answer Reranking." In the Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002).</p><p>B. Magnini, M. Negri, R. Prevete and H. Tanev. 2002. "Is it the Right Answer? Exploiting Web Redundancy for Answer Validation." In the Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002), 425-432.</p><p>J. R. Quinlan. 1993. C4.5: Programs for Machine Learning. Morgan-Kaufmann, San Francisco.</p><p>D. Ravichandran and E. H. Hovy. 2002. "Learning Surface Text Patterns for a Question Answering System." In the Proceedings of the 40th ACL conference (ACL 2002).</p><p>G. Salton and C. Buckley. 1988. "Term-weighting approaches in automatic text retrieval." Information Processing and Management: an International Journal, 24(5).</p><p>E. M. Voorhees. 2003. "Overview of the TREC 2003 Question Answering Track." In the notebook of the 12th Text REtrieval Conference (TREC 2003), 14-27.</p><p>H. Yang, T. S. Chua, S. Wang, C. K. Koh. 
2003. "Structured Use of External Knowledge for Event-based Open Domain Question Answering." In the Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2003).</p><p>H. Yang, H. Cui, M. Maslennikov, L. Qiu, M. Y. Kan, T. S. Chua. 2003. "QUALIFIER in the TREC12 QA Main Task." In the notebook of the 12th Text REtrieval Conference (TREC 2003).</p><p>S. Ye, T. S. Chua, J. R. Kei. 2003. "Querying and Clustering Web Pages about Persons and Organizations." Web Intelligence 2003, 344-350.</p><footnote label="5"> http://www.westminsterkennelclub.org/</footnote></references></body></article>