<?xml version="1.0"?><!DOCTYPE article SYSTEM "/project/take/software/searchbench_offline_processing/paperxml_generator/aclextractor/src/python/../resource/dtd/paperxml.dtd"><article><header><firstpageheader><page local="1" global="849"/><title>Learning Entailment Rules for Unary Templates</title><pubinfo>Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 849-856, Manchester, August 2008</pubinfo><author surname="Szpektor" givenname="Idan"><org  name="Bar Ilan University" country="Israel" city="Ramat Gan"/></author><author surname="Dagan" givenname="Ido"><org  name="Bar Ilan University" country="Israel" city="Ramat Gan"/></author></firstpageheader><frontmatter><p><b>Learning Entailment Rules for Unary Templates</b></p><p><b>Idan Szpektor</b></p><p>Department of Computer Science Bar-Ilan University Ramat Gan, Israel</p><p>szpekti@macs.biu.ac.il</p><p><b>Ido Dagan</b></p><p>Department of Computer Science Bar-Ilan University Ramat Gan, Israel</p><p>dagan@macs.biu.ac.il</p></frontmatter><abstract>Most work on unsupervised entailment rule acquisition has focused on rules between templates with two variables, ignoring <i>unary rules</i>, that is, entailment rules between templates with a single variable. In this paper we investigate two approaches for unsupervised learning of such rules and compare the proposed methods with a binary rule learning method. The results show that the learned unary rule-sets outperform the binary rule-set. In addition, a novel directional similarity measure for learning entailment, termed <i>Balanced-Inclusion</i>, is the best performing measure.</abstract></header><body><section number="1" title="Introduction"><p>In many NLP applications, such as Question Answering (QA) and Information Extraction (IE), it is crucial to recognize whether a specific target meaning is inferred from a text. 
For example, a QA system has to deduce that <i>"SCO sued IBM"</i> is inferred from <i>"SCO won a lawsuit against IBM"</i> to answer <i>"Whom did SCO sue?"</i>. This type of reasoning has been identified as a core semantic inference paradigm by the generic <i>Textual Entailment</i> framework (Giampiccolo et al., 2007).</p><p>An important type of knowledge needed for such inference is <i>entailment rules</i>. An entailment rule specifies a directional inference relation between two <i>templates</i>, text patterns with variables, such as '<i>X</i> win lawsuit against <i>Y</i> → <i>X</i> sue <i>Y</i>'. Applying this rule by matching '<i>X</i> win lawsuit against <i>Y</i>' in the above text allows a QA system to infer '<i>X</i> sue <i>Y</i>' and identify <i>"IBM"</i>, <i>Y</i>'s instantiation, as the answer to the above question. Entailment rules capture linguistic and world-knowledge inferences and are used as an important building block within different applications, e.g. (Romano et al., 2006).</p><p>© 2008. Licensed under the <i>Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported</i> license (http://creativecommons.org/licenses/by-nc-sa/3.0/). Some rights reserved.</p><p>One reason for the limited performance of generic semantic inference systems is the lack of broad-scale knowledge-bases of entailment rules (analogous to lexical resources such as WordNet). Supervised learning of broad-coverage rule-sets is an arduous task. This sparked intensive research on unsupervised acquisition of entailment rules (and similarly paraphrases), e.g. (Lin and Pantel, 2001; Szpektor et al., 2004; Sekine, 2005).</p><p>Most unsupervised entailment rule acquisition methods learn <i>binary rules</i>, rules between templates with two variables, ignoring <i>unary rules</i>, rules between <i>unary templates</i> (templates with only one variable). 
However, a predicate quite often appears in the text with just a single variable (e.g. intransitive verbs or passives), where inference requires unary rules, e.g. '<i>X</i> take a nap → <i>X</i> sleep' (further motivations in Section 3.1).</p><p>In this paper we focus on unsupervised learning of unary entailment rules. Two learning approaches are proposed. In our main approach, rules are learned by measuring how similar the variable instantiations of two templates in a corpus are. In addition to adapting state-of-the-art similarity measures for unary rule learning, we propose a new measure, termed <i>Balanced-Inclusion</i>, which balances the notion of directionality in entailment with the common notion of symmetric semantic similarity. In a second approach, unary rules are derived from binary rules learned by state-of-the-art binary rule learning methods.</p><p>We tested the various unsupervised unary rule learning methods, as well as a binary rule learning method, on a test set derived from a standard IE benchmark.<page local="2" global="850"/> This provides the first comparison between the performance of unary and binary rule-sets. Several results arise from our evaluation: (a) while most work on unsupervised learning ignored unary rules, all tested unary methods outperformed the binary method; (b) it is better to learn unary rules directly than to derive them from a binary rule-base; (c) our proposed Balanced-Inclusion measure outperformed all other tested methods in terms of F1 measure. 
Moreover, only Balanced-Inclusion improved F1 score over a baseline inference that does not use entailment rules at all.</p></section><section number="2" title="Background"><p>This section reviews relevant distributional similarity measures, both symmetric and directional, which were applied for either lexical similarity or unsupervised entailment rule learning.</p><p>Distributional similarity measures follow the Distributional Hypothesis, which states that words that occur in the same contexts tend to have similar meanings (Harris, 1954). Various measures were proposed in the literature for assessing such similarity between two words, <i>u</i> and <i>v</i>. Given a word <i>q</i>, its set of features $F_q$ and feature weights $w_q(f)$ for $f \in F_q$, a common symmetric similarity measure is Lin similarity (Lin, 1998a):</p><p>$$Lin(u,v) = \frac{\sum_{f \in F_u \cap F_v} [w_u(f) + w_v(f)]}{\sum_{f \in F_u} w_u(f) + \sum_{f \in F_v} w_v(f)}$$</p><p>where the weight of each feature is the pointwise mutual information (pmi) between the word and the feature: $w_q(f) = \log\left[\frac{Pr(f|q)}{Pr(f)}\right]$.</p><p>Weeds and Weir (2003) proposed to measure the symmetric similarity between two words by averaging two directional (asymmetric) scores: the coverage of each word's features by the other. 
The coverage of <i>u</i> by <i>v</i> is measured by:</p><p>$$Cover(u,v) = \frac{\sum_{f \in F_u \cap F_v} w_u(f)}{\sum_{f \in F_u} w_u(f)}$$</p><p>The average can be arithmetic or harmonic:</p><p>$$WeedsA(u,v) = \frac{1}{2}[Cover(u,v) + Cover(v,u)]$$</p><p>$$WeedsH(u,v) = \frac{2 \cdot Cover(u,v) \cdot Cover(v,u)}{Cover(u,v) + Cover(v,u)}$$</p><p>Weeds et al. also used pmi for feature weights.</p><p>Binary rule learning algorithms adopted such lexical similarity approaches for learning rules between templates, where the features of each template are its variable instantiations in a corpus, such as <i>{X='SCO', Y='IBM'}</i> for the example in Section 1. Some works focused on learning rules from <i>comparable corpora</i>, containing comparable documents such as different news articles from the same date on the same topic (Barzilay and Lee, 2003; Ibrahim et al., 2003). Such corpora are highly informative for identifying variations of the same meaning, since, typically, when variable instantiations are shared across comparable documents the same predicates are described. However, it is hard to collect broad-scale comparable corpora, as the majority of texts are non-comparable.</p><p>A complementary approach is learning from the abundant regular, non-comparable, corpora. Yet, in such corpora it is harder to recognize variations of the same predicate. The DIRT algorithm (Lin and Pantel, 2001) learns non-directional binary rules for templates that are paths in a dependency parse-tree between two noun variables <i>X</i> and <i>Y</i>. The similarity between two templates <i>t</i> and <i>t'</i> is the geometric average:</p><p>$$DIRT(t,t') = \sqrt{Lin_x(t,t') \cdot Lin_y(t,t')}$$</p><p>where $Lin_x$ is the Lin similarity between <i>X</i>'s instantiations of <i>t</i> and <i>X</i>'s instantiations of <i>t'</i> in a corpus (equivalently for $Lin_y$). Some works take the combination of the two variable instantiations in each template occurrence as a single complex feature, e.g. <i>{X-Y='SCO-IBM'}</i>, and compare these complex features of <i>t</i> and <i>t'</i> (Ravichandran and Hovy, 2002; Szpektor et al., 2004; Sekine, 2005).</p><p><b>Directional Measures </b>Most rule learning methods apply a symmetric similarity measure between two templates, viewing them as paraphrasing each other. However, entailment is in general a directional relation. For example, '<i>X</i> acquire <i>Y</i> → <i>X</i> own <i>Y</i>' and 'countersuit against <i>X</i> → lawsuit against <i>X</i>'.</p><p>(Weeds and Weir, 2003) propose a directional measure for learning hyponymy between two words, '<i>l</i> → <i>r</i>', by giving more weight to the coverage of the features of <i>l</i> by <i>r</i> (with $\alpha &gt; \frac{1}{2}$):</p><p>$$WeedsD(l,r) = \alpha \, Cover(l,r) + (1-\alpha) \, Cover(r,l)$$</p><p>When $\alpha = 1$, this measure degenerates into <i>Cover(l,r)</i>, termed <i>Precision(l,r). 
</i>With <i>Precision(l,r)</i> we obtain a "soft" version of the inclusion hypothesis presented in (Geffet and Dagan, 2005), which expects <i>l</i> to entail <i>r</i> if the "important" features of <i>l</i> appear also in <i>r</i>.<page local="3" global="851"/></p><p>Similarly, the LEDIR algorithm (Bhagat et al., 2007) identifies the entailment direction between two binary templates, <i>l</i> and <i>r</i>, which participate in a relation learned by (the symmetric) DIRT, by measuring the proportion of instantiations of <i>l</i> that are covered by the instantiations of <i>r</i>.</p><p>As far as we know, only (Shinyama et al., 2002) and (Pekar, 2006) learn rules between unary templates. However, (Shinyama et al., 2002) relies on comparable corpora for identifying paraphrases and simply takes any two templates from comparable sentences that share a named entity instantiation to be paraphrases. Such an approach is not feasible for non-comparable corpora, where statistical measurement is required. (Pekar, 2006) learns rules only between templates related by local discourse (information from different documents is ignored). In addition, their template structure is limited to only verbs and their direct syntactic arguments, which may yield incorrect rules, e.g. for light verbs (see Section 5.2). To overcome this limitation, we use a more expressive template structure.</p></section><section number="3" title="Learning Unary Entailment Rules 3.1 Motivations"><p>Most unsupervised rule learning algorithms focused on learning binary entailment rules. However, using binary rules for inference is not enough. First, a predicate that can have multiple arguments may still occur with only one of its arguments. For example, in <i>"The acquisition of TCA was successful"</i>, <i>'TCA'</i> is the only argument of <i>'acquisition'</i>. Second, some predicate expressions are unary by nature. 
Examples include modifiers, such as 'the elected <i>X</i>', and intransitive verbs. In addition, it appears more tractable to learn all variations for each argument of a predicate separately than to learn them for combinations of argument pairs.</p><p>For these reasons, it seems that unary rule learning should be addressed in addition to binary rule learning. We are further motivated by the fact that some (mostly supervised) works in IE found learning unary templates useful for recognizing relevant named entities (Riloff, 1996; Sudo et al., 2003; Shinyama and Sekine, 2006), though they did not attempt to learn generic knowledge bases of entailment rules.</p><p>This paper investigates acquisition of unary entailment rules from regular non-comparable corpora. We first describe the structure of unary templates and then explore two conceivable approaches for learning unary rules. The first approach directly assesses the relation between two given templates based on the similarity of their instantiations in the corpus. The second approach, which was also mentioned in (Iftene and Balahur-Dobrescu, 2007), derives unary rules from learned binary rules.</p><subsection number="3.2" title="Unary Template Structure"><p>To learn unary rules we first need to define their structure. In this paper we work at the syntactic representation level. Texts are represented by dependency parse trees (using the Minipar parser (Lin, 1998b)) and templates by parse sub-trees.</p><p>Given a dependency parse tree, any sub-tree can be a candidate template, setting some of its nodes as variables (Sudo et al., 2003). However, the number of possible templates is exponential in the size of the sentence. In the binary rule learning literature, the main solution for exhaustively learning all rules between any pair of templates in a given corpus is to restrict the structure of templates. 
Typically, a template is restricted to be a path in a parse tree between two variable nodes (Lin and Pantel, 2001; Ibrahim et al., 2003).</p><p>Following this approach, we chose the structure of unary templates to be paths as well, where one end of the path is the template's variable. However, paths with one variable have more expressive power than paths between two variables, since the combination of two unary paths may generate a binary template that is not a path. For example, the combination of '<i>X</i> call indictable' and 'call <i>Y</i> indictable' is the template '<i>X</i> call <i>Y</i> indictable', which is not a path between <i>X</i> and <i>Y</i>.</p><p>For every noun node <i>v</i> in a parsed sentence, we generate templates with <i>v</i> as a variable as follows:</p><p>1. Traverse the path from <i>v</i> towards the root of the parse tree. Whenever a candidate predicate is encountered (any noun, adjective or verb) the path from that node to <i>v</i> is taken as a template. We stop when the first verb or clause boundary (e.g. a relative clause) is encountered, which typically represents the syntactic boundary of a specific predicate.</p><page local="4" global="852"/><p>2. To enable templates with control verbs and light verbs, e.g. '<i>X</i> help preventing', '<i>X</i> make noise', whenever a verb is encountered we generate templates that are paths between <i>v</i> and the verb's modifiers, either objects, prepositional complements or infinitive or gerund verb forms (paths ending at stop words, e.g. pronouns, are not generated).</p><p>3. To capture noun modifiers that act as predicates, e.g. 'the losing <i>X</i>', we extract template paths between <i>v</i> and each of its modifiers, nouns or adjectives, that are derived from a verb. 
We use the Catvar database to identify verb derivations (Habash and Dorr, 2003).</p><p>As an example of the procedure, the templates extracted from the sentence <i>"The losing party played it safe"</i> with <i>'party'</i> as the variable are: 'losing <i>X</i>', '<i>X</i> play' and '<i>X</i> play safe'.</p></subsection><subsection number="3.3" title="Direct Learning of Unary Rules"><p>We applied the lexical similarity measures presented in Section 2 for unary rule learning. Each argument instantiation of template <i>t</i> in the corpus is taken as a feature <i>f</i>, and the pmi between <i>t</i> and <i>f</i> is used for the feature's weight. We first adapted DIRT for unary templates (<i>unary-DIRT</i>, applying Lin-similarity to the single feature vector), as well as its output filtering by LEDIR. The various Weeds measures were also applied<footnote anchor="1"/>: symmetric arithmetic average, symmetric harmonic average, weighted arithmetic average and Precision.</p><p>After initial analysis, we found that given a right hand side template <i>r</i>, symmetric measures such as Lin (in DIRT) generally tend to prefer (score higher) relations (<i>l</i>, <i>r</i>) in which <i>l</i> and <i>r</i> are related but do not necessarily participate in an entailment or equivalence relation, e.g. the wrong rule 'kill <i>X</i> ↔ injure <i>X</i>'.</p><p>On the other hand, directional measures such as Weeds Precision tend to prefer directional rules in which the entailing template is infrequent. If an infrequent template has common instantiations with another template, the coverage of its features is typically high, whether or not an entailment relation exists between the two templates. 
This behavior generates high-score incorrect rules.</p><p>Based on this analysis, we propose a new measure that balances the two behaviors, termed <i>Balanced-Inclusion (BInc)</i>. BInc identifies entailing templates based on a directional measure but penalizes infrequent templates using a symmetric measure:</p><p>$$BInc(l,r) = \sqrt{Lin(l,r) \cdot Precision(l,r)}$$</p><footnote label="1">We applied the best performing parameter values presented in (Bhagat et al., 2007) and (Weeds and Weir, 2003).</footnote></subsection><subsection number="3.4" title="Deriving Unary Rules From Binary Rules"><p>An alternative way to learn unary rules is to first learn binary entailment rules and then derive unary rules from them. We derive unary rules from a given binary rule-base in two steps. First, for each binary rule, we generate all possible unary rules that are part of that rule (each unary template is extracted following the same procedure described in Section 3.2). For example, from '<i>X</i> find solution to <i>Y</i> → <i>X</i> solve <i>Y</i>' we generate the unary rules '<i>X</i> find → <i>X</i> solve', '<i>X</i> find solution → <i>X</i> solve', 'solution to <i>Y</i> → solve <i>Y</i>' and 'find solution to <i>Y</i> → solve <i>Y</i>'. The score of each generated rule is set to be the score of the original binary rule.</p><p>The same unary rule can be derived from different binary rules. For example, 'hire <i>Y</i> → employ <i>Y</i>' is derived both from '<i>X</i> hire <i>Y</i> → <i>X</i> employ <i>Y</i>' and 'hire <i>Y</i> for <i>Z</i> → employ <i>Y</i> for <i>Z</i>', having a different score from each original binary rule. The second step of the algorithm aggregates the different scores yielded for each derived rule to produce the final rule score. 
Three aggregation functions were tested: sum (<i>Derived-Sum</i>), average (<i>Derived-Avg</i>) and maximum (<i>Derived-Max</i>).</p></subsection></section><section number="4" title="Experimental Setup"><p>We want to evaluate learned unary and binary rule bases by their utility for NLP applications through assessing the validity of inferences that are performed in practice using the rule base.</p><p>To perform such experiments, we need a test-set of <i>seed templates</i>, which correspond to a set of target predicates, and a corpus annotated with all argument mentions of each predicate. The evaluation assesses the correctness of all argument extractions, which are obtained by matching in the corpus either the seed templates or templates that entail them according to the rule-base (the latter corresponds to <i>rule-application</i>).</p><p>Following (Szpektor et al., 2008), we found the ACE 2005 event training set<footnote anchor="2"/> useful for this purpose. This standard IE dataset includes 33 types of event predicates such as <i>Injure</i>, <i>Sue</i> and <i>Divorce</i>.</p><footnote label="2">http://projects.ldc.upenn.edu/ace/</footnote><page local="5" global="853"/><p>All event mentions are annotated in the corpus, including the instantiated arguments of the predicate. ACE guidelines specify for each event its possible arguments, each associated with a semantic role. For instance, some of the <i>Injure</i> event arguments are <i>Agent</i>, <i>Victim</i> and <i>Time</i>.</p><p>To utilize the ACE dataset for evaluating entailment rule applications, we manually represented each ACE event predicate by unary seed templates. For example, the seed templates for <i>Injure</i> are '<i>A</i> injure', 'injure <i>V</i>' and 'injure in <i>T</i>'. We mapped each event role annotation to the corresponding seed template variable, e.g. <i>'Agent'</i> to <i>A</i> and <i>'Victim'</i> to <i>V</i> in the above example. 
Templates are matched using a syntactic matcher that handles simple morpho-syntactic phenomena, as in (Szpektor and Dagan, 2007). A rule application is considered correct if the matched argument is annotated by the corresponding ACE role.</p><p>For testing binary rule-bases, we automatically generated binary seed templates from any two unary seeds that share the same predicate. For example, for <i>Injure</i> the binary seeds '<i>A</i> injure <i>V</i>', '<i>A</i> injure in <i>T</i>' and 'injure <i>V</i> in <i>T</i>' were automatically generated from the above unary seeds.</p><p>We performed two adaptations to the ACE dataset to fit it better to our evaluation needs. First, our evaluation aims at assessing the correctness of inferring a specific target semantic meaning, which is denoted by a specific predicate, using rules. Thus, four events that correspond ambiguously to multiple distinct predicates were ignored. For instance, the <i>Transfer-Money</i> event refers to both <i>donating</i> and <i>lending</i> money, and thus annotations of this event cannot be mapped to a specific seed template. We also omitted 3 events with fewer than 10 mentions, and were left with 26 events (6380 argument mentions).</p><p>Additionally, we regard all entailing mentions under the textual entailment definition as correct. However, event mentions are annotated as correct in ACE only if they explicitly describe the target event. For instance, a <i>Divorce</i> mention does entail a preceding marriage event but it does not explicitly describe it, and thus it is not annotated as a <i>Marry</i> event. To better utilize the ACE dataset, we considered for a target event the annotations of other events that entail it as being correct as well. We note that each argument was considered separately. 
For example, we marked a mention of a divorced person as entailing the marriage of that person, but did not consider the place and time of the divorce act to be those of the marriage.</p></section><section number="5" title="Results and Analysis"><p>We implemented the unary rule learning algorithms described in Section 3 and the binary DIRT algorithm (Lin and Pantel, 2001). We executed each method over the Reuters RCV1 corpus<footnote anchor="3"/>, learning for each template <i>r</i> in the corpus the top 100 rules in which <i>r</i> is entailed by another template <i>l</i>, '<i>l</i> → <i>r</i>'. All rules were learned in canonical form (Szpektor and Dagan, 2007). The rule-base learned by binary DIRT was taken as the input for deriving unary rules from binary rules.</p><p>The performance of each acquired rule-base was measured for each ACE event. We measured the percentage of correct argument mentions extracted out of all correct argument mentions annotated for the event (recall) and out of all argument mentions extracted for the event (precision). We also measured F1, their harmonic average, and report macro average <i>Recall</i>, <i>Precision</i> and <i>F1</i> over the 26 event types.</p><p>No threshold setting mechanism is suggested in the literature for the scores of the different algorithms, especially since rules for different right hand side templates have different score ranges. Thus, we follow common evaluation practice (Lin and Pantel, 2001; Geffet and Dagan, 2005) and test each learned rule-set by taking the top <i>K</i> rules for each seed template, where <i>K</i> ranges from 0 to 100. When <i>K=0</i>, no rules are used and mentions are extracted only by direct matching of seed templates.</p><p>Our rule application setting provides a rather simplistic IE system (for example, no named entity recognition or approximate template matching). 
It is thus useful for comparing different rule-bases, though the absolute extraction figures do not reflect the full potential of the rules. In Section 5.2 we analyze the full-system's errors to isolate the rules' contribution to overall system performance.</p><subsection number="5.1" title="Results"><p>In this section we focus on the best performing variations of each algorithm type: binary DIRT, unary DIRT, unary Weeds Harmonic, BInc and <i>Derived-Avg</i>. We omitted the results of methods that were clearly inferior to others: (a) <i>WeedsA</i>, <i>WeedsD</i> and <i>Weeds-Precision</i> did not increase Recall over not using rules because rules with infrequent templates scored highest and arithmetic averaging could not balance these high scores well; (b) out of the methods for deriving unary rules from binary rule-bases, <i>Derived-Avg</i> performed best; (c) filtering with (the directional) LEDIR did not improve the performance of unary DIRT.</p><page local="6" global="854"/><footnote label="3">http://about.reuters.com/researchandstandards/corpus/</footnote><p>Figure 1 presents Recall, Precision and F1 of the methods for different cutoff points. First, we observe that even when matching only the seed templates (<i>K=0</i>), unary seeds outperform the binary seeds in terms of both Precision and Recall. This surprising behavior is consistent through all rule cutoff points: all unary learning algorithms perform better than binary DIRT on all measures. The inferior behavior of binary DIRT is analyzed in Section 5.2.</p><p>The graphs show that symmetric unary approaches substantially increase recall, but dramatically decrease precision already at the top 10 rules. As a result, F1 only decreases for these methods. Lin similarity (DIRT) and Weeds-Harmonic show similar behaviors. They consistently outperform <i>Derived-Avg</i>. One reason for this is that incorrect unary rules may be derived even from correct binary rules. 
For example, from '<i>X</i> gain seat on <i>Y</i> → elect <i>X</i> to <i>Y</i>' the incorrect unary rule '<i>X</i> gain → elect <i>X</i>' is also generated. This problem is less frequent when unary rules are directly scored based on their corpus statistics.</p><p>The directional measure of BInc yields a more accurate rule-base, as can be seen by the much slower precision reduction rate compared to the other algorithms. As a result, it is the only algorithm that improves over the F1 baseline of <i>K=0</i>, with the best cutoff point at <i>K=20</i>. BInc's recall increases moderately compared to other unary learning approaches, but it is still substantially better than not using rules (a relative recall increase of 50% already at <i>K=10</i>). We found that many of the correct mentions missed by BInc but identified by other methods are due to occasional extractions of incorrect frequent rules, such as partial templates (see Section 5.2). This is reflected in the very low precision of the other methods. On the other hand, some correct rules were only learned by BInc, e.g. 'countersuit against <i>X</i> → <i>X</i> sue' and '<i>X</i> take wife → <i>X</i> marry'.</p><p>[Figure 1 panels: Precision vs. top K; Recall vs. top K; F1 vs. top K. Horizontal axis: top K rules, 0 to 100. Legend: Derived-Avg, Weeds Harmonic, Binary DIRT, Unary DIRT, Balanced-Inclusion.]</p><figure caption="Figure 1: Average Precision, Recall and F1 at different top K rule cutoff points."></figure><p>When only one argument is annotated for a specific event mention (28% of ACE predicate mentions, which account for 15% of all annotated arguments), binary rules either miss that mention, or extract both the correct argument and another incorrect one. To neutralize this bias, we also tested the various methods only on event mentions annotated with two or more arguments and obtained similar results to those presented for all mentions. 
This further emphasizes the general advantage of using unary rules over binary rules.</p><page local="7" global="855"/></subsection><subsection number="5.2" title="Analysis"><p><b>Binary-DIRT </b>We analyzed incorrect rules both for binary-DIRT and BInc by randomly sampling, for each algorithm, 200 rules that extracted incorrect mentions. We manually classified each rule '<i>l</i> → <i>r</i>' as either: (a) <i>Correct</i> - the rule is valid in some contexts of the event but extracted some incorrect mentions; (b) <i>Partial Template</i> - <i>l</i> is only a part of a correct template that entails <i>r</i>. For example, learning '<i>X</i> decide → <i>X</i> meet' instead of '<i>X</i> decide to meet → <i>X</i> meet'; (c) <i>Incorrect</i> - other incorrect rules, e.g. 'charge <i>X</i> → convict <i>X</i>'.</p><p>Table 1 summarizes the analysis and demonstrates two problems of binary-DIRT. First, relative to BInc, it tends to learn incorrect rules for high frequency templates, and therefore extracted many more incorrect mentions for the same number of incorrect rules. Second, a large percentage of the incorrect mentions extracted is due to partial templates at the rule left-hand-side. Such rules are learned because many binary templates have a more complex structure than paths between arguments. As explained in Section 3.2, the unary template structure we use is more expressive, enabling us to learn the correct rules. 
For example, BInc learned 'take <i>Y</i> into custody → arrest <i>Y</i>' while binary-DIRT learned '<i>X</i> take <i>Y</i> → <i>X</i> arrest <i>Y</i>'.</p><p><b>System Level Analysis </b>We manually analyzed the reasons for false positives (incorrect extractions) and false negatives (missed extractions) of BInc, at its best performing cutoff point (<i>K=20</i>), by sampling 200 extractions of each type.</p><p>From the false positives analysis (Table 2) we see that 39% of the errors are due to incorrect rules. The main reasons for learning such rules are those discussed in Section 3.3: (a) related templates that are not entailing; (b) infrequent templates. All learning methods suffer from these issues. As was shown by our results, BInc provides a first step towards reducing these problems. Yet, these issues require further research.</p><p>Apart from incorrectly learned rules, incorrect template matching (e.g. due to parse errors) and context mismatch together contribute 46% of the errors. Context mismatches occur when the entailing template is matched in inappropriate contexts. For example, 'slam <i>X</i> → attack <i>X</i>' should not be applied when <i>X</i> is a ball, only when it is a person. The rule-set net effect on system precision is better estimated by removing these errors and fixing the annotation errors, which yields 72% precision.</p><p>Table 1: Rule type distribution of a sample of 200 rules that extracted <i>incorrect</i> mentions. The corresponding numbers of incorrect mentions extracted by the sampled rules are shown in parentheses.</p><p>Table 2: Distribution of reasons for false positives (incorrect argument extractions) by BInc at <i>K=20</i>.</p><p>Table 3 presents the analysis of false negatives. First, we note that 12% of the arguments cannot be extracted by rules alone, due to necessary discourse analysis. Thus, a recall upper bound for entailment rules is 88%. 
Many missed extractions are due to rules that were not learned (61.5%). However, 25% of the mentions were missed because of incorrect syntactic matching of correctly learned rules. By assuming correct matches in these cases we isolate the recall of the rule-set (along with the seeds), which yields 39% recall.</p></subsection></section><section number="6" title="Conclusions"><p>We presented two approaches for unsupervised acquisition of unary entailment rules from regular (non-comparable) corpora. In the first approach, rules are directly learned based on distributional similarity measures. The second approach derives unary rules from a given rule-base of binary rules. Under the first approach we proposed a novel directional measure for scoring entailment rules, termed Balanced-Inclusion.</p><p>We tested the different approaches utilizing a standard IE test-set and compared them to binary rule learning. Our results suggest the advantage of learning unary rules: (a) unary rule-bases perform better than binary rules; (b) it is better to directly learn unary rules than to derive them from binary rule-bases.<page local="8" global="856"/> In addition, the Balanced-Inclusion measure outperformed all other tested methods.</p><table class="main" frame="box" rules="all" border="1" regular="False"><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p></p></td><td class="cell"><p>Binary DIRT</p></td><td class="cell"><p>Balanced Inclusion</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Correct</p></td><td class="cell"><p>16 (70)</p></td><td class="cell"><p>38 (91)</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Partial Template</p></td><td class="cell"><p>27 (2665)</p></td><td class="cell"><p>6 (81)</p></td><td class="cell"></td></tr><tr class="row"><td 
class="cell"></td><td class="cell"><p>Incorrect</p></td><td class="cell"><p>157 (2584)</p></td><td class="cell"><p>156 (787)</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Total</p></td><td class="cell"><p>200 (5319)</p></td><td class="cell"><p>200 (959)</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr></table><table class="main" frame="box" rules="all" border="1" regular="False"><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Reason</p></td><td class="cell"><p>% mentions</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Incorrect Rule learned</p></td><td class="cell"><p>39.0</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Context mismatch</p></td><td class="cell"><p>27.0</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Match error</p></td><td class="cell"><p>19.0</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Annotation problem</p></td><td class="cell"><p>15.0</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr></table><table caption="Table 3: Distribution of reasons for false negatives (missed argument mentions) by BInc at K=20." 
class="main" frame="box" rules="all" border="1" regular="False"><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Reason</p></td><td class="cell"><p>% mentions</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Rule not learned</p></td><td class="cell"><p>61.5</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Match error</p></td><td class="cell"><p>25.0</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Discourse analysis needed</p></td><td class="cell"><p>12.0</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Argument is predicative</p></td><td class="cell"><p>1.5</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr></table><p>In future work, we plan to explore additional unary template structures and similarity scores, and to improve rule application utilizing context matching methods such as (Szpektor et al., 2008).</p></section><section title="Acknowledgements"><p>This work was partially supported by ISF grant 1095/05, the IST Programme of the European Community under the PASCAL Network of Excellence IST-2002-506778 and the NEGEV project (www.negev-initiative.org).</p></section><references><p>Barzilay, Regina and Lillian Lee. 2003. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In <i>Proceedings of HLT-NAACL.</i></p><p>Bhagat, Rahul, Patrick Pantel, and Eduard Hovy. 2007. LEDIR: An unsupervised algorithm for learning directionality of inference rules. In <i>Proceedings of EMNLP.</i></p><p>Geffet, Maayan and Ido Dagan. 2005. 
The distributional inclusion hypotheses and lexical entailment. In <i>Proceedings of ACL.</i></p><p>Giampiccolo, Danilo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The third PASCAL recognizing textual entailment challenge. In <i>Proceedings of WTEP.</i></p><p>Habash, Nizar and Bonnie Dorr. 2003. A categorial variation database for English. In <i>Proceedings of NAACL.</i></p><p>Harris, Z. 1954. Distributional structure. <i>Word, </i>10(23):146-162.</p><p>Ibrahim, Ali, Boris Katz, and Jimmy Lin. 2003. Extracting structural paraphrases from aligned monolingual corpora. In <i>Proceedings of IWP.</i></p><p>Iftene, Adrian and Alexandra Balahur-Dobrescu. 2007. Hypothesis transformation and semantic variability rules used in recognizing textual entailment. In <i>Proceedings of WTEP.</i></p><p>Lin, Dekang and Patrick Pantel. 2001. Discovery of inference rules for question answering. In <i>Natural Language Engineering, </i>volume 7(4), pages 343-360.</p><p>Lin, Dekang. 1998a. Automatic retrieval and clustering of similar words. In <i>Proceedings of COLING-ACL.</i></p><p>Lin, Dekang. 1998b. Dependency-based evaluation of MINIPAR. In <i>Proceedings of the Workshop on Evaluation of Parsing Systems at LREC.</i></p><p>Pekar, Viktor. 2006. Acquisition of verb entailment from text. In <i>Proceedings of NAACL.</i></p><p>Ravichandran, Deepak and Eduard Hovy. 2002. Learning surface text patterns for a question answering system. In <i>Proceedings of ACL.</i></p><p>Riloff, Ellen. 1996. Automatically generating extraction patterns from untagged text. In <i>AAAI/IAAI, Vol. 2, </i>pages 1044-1049.</p><p>Romano, Lorenza, Milen Kouylekov, Idan Szpektor, Ido Dagan, and Alberto Lavelli. 2006. Investigating a generic paraphrase-based approach for relation extraction. In <i>Proceedings of EACL.</i></p><p>Sekine, Satoshi. 2005. Automatic paraphrase discovery based on context and keywords between NE pairs. 
In <i>Proceedings of IWP.</i></p><p>Shinyama, Yusuke and Satoshi Sekine. 2006. Preemptive information extraction using unrestricted relation discovery. In <i>Proceedings of HLT-NAACL.</i></p><p>Shinyama, Yusuke, Satoshi Sekine, Kiyoshi Sudo, and Ralph Grishman. 2002. Automatic paraphrase acquisition from news articles. In <i>Proceedings of HLT.</i></p><p>Sudo, Kiyoshi, Satoshi Sekine, and Ralph Grishman. 2003. An improved extraction pattern representation model for automatic IE pattern acquisition. In <i>Proceedings of ACL.</i></p><p>Szpektor, Idan and Ido Dagan. 2007. Learning canonical forms of entailment rules. In <i>Proceedings of RANLP.</i></p><p>Szpektor, Idan, Hristo Tanev, Ido Dagan, and Bonaventura Coppola. 2004. Scaling web-based acquisition of entailment relations. In <i>Proceedings of EMNLP.</i></p><p>Szpektor, Idan, Ido Dagan, Roy Bar Haim, and Jacob Goldberger. 2008. Contextual preferences. In <i>Proceedings of ACL.</i></p><p>Weeds, Julie and David Weir. 2003. A general framework for distributional similarity. In <i>Proceedings of EMNLP.</i></p></references></body></article>