<?xml version="1.0"?><!DOCTYPE article SYSTEM "/project/take/software/searchbench_offline_processing/paperxml_generator/aclextractor/src/python/../resource/dtd/paperxml.dtd"><article><header><firstpageheader><page local="1" global="374"/><title>Integrating N-best SMT Outputs into a TM System</title><pubinfo>Coling 2010: Poster Volume, pages 374-382, Beijing, August 2010</pubinfo><author surname="He" givenname="Yifan"><org  name="Dublin City University" country="Ireland" city="Dublin"/></author><author surname="Ma" givenname="Yanjun"><org  name="Dublin City University" country="Ireland" city="Dublin"/></author><author surname="Way" givenname="Andy"><org  name="Dublin City University" country="Ireland" city="Dublin"/></author><author surname="van Genabith" givenname="Josef"><org  name="Dublin City University" country="Ireland" city="Dublin"/></author></firstpageheader><frontmatter><p><b>Integrating N-best SMT Outputs into a TM System</b></p><p><b>Yifan He     Yanjun Ma    Andy Way     Josef van Genabith</b></p><p>Centre for Next Generation Localisation School of Computing Dublin City University</p><p>{yhe,yma,away,josef}@computing.dcu.ie</p></frontmatter><abstract>In this paper, we propose a novel framework to enrich Translation Memory (TM) systems with Statistical Machine Translation (SMT) outputs using ranking. In order to offer human translators multiple choices, instead of using only the top SMT output and the top TM hit, we merge the N-best output from the SMT system with the k-best hits with the highest fuzzy match scores from the TM system. The merged list is then ranked according to the prospective post-editing effort and provided to the translators to aid their work. Experiments show that our ranked output achieves 0.8747 precision at top 1 and 0.8134 precision at top 5. 
Our framework facilitates a tight integration between SMT and TM, where full advan­tage is taken of TM while high quality SMT output is availed of to improve the productivity of human translators. </abstract></header><body><section number="1" title="Introduction"><p>Translation Memories (TM) are databases that store translated segments. They are often used to assist translators and post-editors in a Computer Assisted Translation (CAT) environment by re­turning the most similar translated segments. Pro­fessional post-editors and translators have long been relying on TMs to avoid duplication of work in translation.</p><p>With the rapid development in statistical ma­chine translation (SMT), MT systems are beginning to generate acceptable translations, espe­cially in domains where abundant parallel corpora exist. It is thus natural to ask if these translations can be utilized in some way to enhance TMs.</p><p>However advances in MT are being adopted only slowly and sometimes somewhat reluctantly in professional localization and post-editing envi­ronments because of 1) the usefulness of the TM, 2) the investment and effort the company has put into TMs, and 3) the lack of robust SMT confi­dence estimation measures which are as reliable as fuzzy match scores (cf. Section 4.1.2) used in TMs. Currently the localization industry relies on TM fuzzy match scores to obtain both a good ap­proximation of post-editing effort and an estima­tion of the overall translation cost.</p><p>In a forthcoming paper, we propose a trans­lation recommendation model to better integrate MT outputs into a TM system. Using a binary classifier, we only recommend an MT output to the TM-user when the classifier is highly confi­dent that it is better than the TM output. 
In this framework, post-editors continue to work with the TM while benefiting from (better) SMT outputs; the assets in TMs are not wasted and TM fuzzy match scores can still be used to estimate (the up­per bound of) post-editing labor.</p><p>In the previous work, the binary predictor works on the 1-best output of the MT and TM sys­tems, presenting either the one or the other to the post-editor. In this paper, we develop the idea fur­ther by moving from binary prediction to ranking. We use a ranking model to merge the k-best lists of the two systems, and produce a ranked merged list for post-editing.<page local="2" global="375"/> As the list is an enriched ver­sion of the TM's k-best list, the TM related assets are better preserved and the cost estimation is still valid as an upper bound.</p><p>More specifically, we recast SMT-TM integra­tion as a ranking problem, where we apply the Ranking SVM technique to produce a ranked list of translations combining the k-best lists of both the MT and the TM systems. We use features in­dependent of the MT and TM systems for rank­ing, so that outputs from MT and TM can have the same set of features. Ideally the transla­tions should be ranked by their associated post­editing efforts, but given the very limited amounts of human annotated data, we use an automatic MT evaluation metric, TER (Snover et al., 2006), which is specifically designed to simulate post­editing effort to train and test our ranking model.</p><p>The rest of the paper is organized as follows: we first briefly introduce related research in Sec­tion 2, and review Ranking SVMs in Section 3. The formulation of the problem and experiments with the ranking models are presented in Sections 4 and 5. We analyze the post-editing effort ap­proximated by the TER metric in Section 6. 
Section 7 concludes and points out avenues for future research.</p></section><section number="2" title="Related Work"><p>There has been some work to help TM users apply MT outputs more smoothly. One strand is to improve MT confidence measures to better predict post-editing effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. To the best of our knowledge, the first paper in this area is (Specia et al., 2009a), which uses regression on both the automatic scores and scores assigned by post-editors. The method is improved in (Specia et al., 2009b), which applies Inductive Confidence Machines and a larger set of features to model post-editors' judgment of the translation quality between 'good' and 'bad', or among three levels of post-editing effort.</p><p>Another strand is to integrate high confidence MT outputs into the TM, so that the 'good' TM entries will remain untouched. In our forthcoming paper, we recommend SMT outputs to a TM user when a binary classifier predicts that SMT outputs are more suitable for post-editing for a particular sentence.</p><p>The research presented here continues the line of research in the second strand. The difference is that we do not limit ourselves to the 1-best output but try to produce a k-best output in a ranking model. The ranking scheme also enables us to show all TM hits to the user, and thus further protects the TM assets.</p><p>There has also been work to improve SMT using the knowledge from the TM. In (Simard and Isabelle, 2009), the SMT system can produce a better translation when there is an exact or close match in the corresponding TM. They use regression Support Vector Machines to model the quality of the TM segments. This is also related to our work in spirit, but our work is in the opposite direction, i.e. using SMT to enrich TM.</p><p>Moreover, our ranking model is related to reranking (Shen et al., 2004) in SMT as well. 
However, our method does not focus on producing better 1-best translation output for an SMT system, but on improving the overall quality of the k-best list that TM systems present to post-editors. Some features in our work are also different in nature from those used in MT reranking. For instance, we cannot use N-best posterior scores as they do not make sense for the TM outputs.</p></section><section number="3" title="The Support Vector Machines"><subsection number="3.1" title="The SVM Classifier"><p>Classical SVMs (Cortes and Vapnik, 1995) are binary classifiers that classify an input instance based on decision rules which minimize the regularized error function in Eq. (1):</p><doubt alpha="36.8" length="19" tooSmall="False" monospace="0.0">$$\min_{\mathbf{w},b,\xi}\ \frac{1}{2}\mathbf{w}^{T}\mathbf{w} + C\sum_{i=1}^{l}\xi_i \qquad (1)$$</doubt><doubt alpha="44.1" length="34" tooSmall="False" monospace="0.0">subject to: $y_i(\mathbf{w}^{T}\mathbf{x}_i + b) \geq 1 - \xi_i,\ \xi_i \geq 0$</doubt><p>where $(\mathbf{x}_i, y_i) \in \mathbb{R}^{n} \times \{1, -1\}$ are $l$ training instances, $\mathbf{w}$ is the weight vector, $\xi$ is the relaxation variable and $C &gt; 0$ is the penalty parameter.</p></subsection><subsection number="3.2" title="Ranking SVM for SMT-TM Integration"><p>The SVM classification algorithm is extended to the ranking case in (Joachims, 2002). For a certain group of instances, the Ranking SVM aims at producing a ranking $r$ that has the maximum Kendall's $\tau$ coefficient with the gold standard ranking $r^{*}$.<page local="3" global="376"/></p><p>Kendall's $\tau$ measures the correlation between two rankings: $\tau(r_a, r_b) = \frac{P - Q}{P + Q}$, where $P$ and $Q$ are the numbers of concordant and discordant pairs in $r_a$ and $r_b$. In practice, this is done by building constraints to minimize the number of discordant pairs $Q$. Following this basic idea, we show how the Ranking SVM can be applied to MT-TM integration as follows.</p><p>Assume that for each source sentence s, we have a set of outputs from MT, <b>M</b>, and a set of outputs from TM, <b>T. 
</b>If we have a ranking $r(s)$ over translation outputs $\mathbf{M} \cup \mathbf{T}$, where for each translation output $d \in \mathbf{M} \cup \mathbf{T}$, $(d_i, d_j) \in r(s)$ iff $d_i &lt;_{r(s)} d_j$, we can rewrite the ranking constraints as optimization constraints in an SVM, as in Eq. (2).</p><p>$$\min_{\mathbf{w},\xi}\ \frac{1}{2}\mathbf{w}^{T}\mathbf{w} + C\sum \xi_{i,j,k}$$</p><p>subject to:</p><doubt alpha="40.2" length="87" tooSmall="False" monospace="0.0">$$\forall (d_i, d_j) \in r(s_1):\ \mathbf{w}^{T}(\Phi(s_1, d_i) - \Phi(s_1, d_j)) \geq 1 - \xi_{i,j,1}$$ $$\dots$$ $$\forall (d_i, d_j) \in r(s_n):\ \mathbf{w}^{T}(\Phi(s_n, d_i) - \Phi(s_n, d_j)) \geq 1 - \xi_{i,j,n}$$ $$\xi_{i,j,k} \geq 0$$</doubt><doubt alpha="0.0" length="3" tooSmall="False" monospace="0.0">(2)</doubt><p>where $\Phi(s_n, d_i)$ is a feature vector of translation output $d_i$ given source sentence $s_n$. The Ranking SVM minimizes the number of pairs that are discordant with the gold standard ranking, in accordance with Kendall's $\tau$.</p><p>When the instances are not linearly separable, we use a mapping function $\phi$ to map the features $\mathbf{x}_i$ ($\Phi(s_n, d_i)$ in the case of ranking) to a high-dimensional space, and solve the SVM with a kernel function $K$, where $K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^{T}\phi(\mathbf{x}_j)$.</p><p>We perform our experiments with the Radial Basis Function (RBF) kernel, as in Eq. (3).</p><doubt alpha="36.1" length="36" tooSmall="False" monospace="0.0">$$K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma\lVert\mathbf{x}_i - \mathbf{x}_j\rVert^{2}),\ \gamma &gt; 0 \qquad (3)$$</doubt></subsection></section><section number="4" title="The Ranking-based Integration Model"><p>In this section we present the Ranking-based SMT-TM integration model in detail. We first introduce the k-best lists in MT (called N-best list) and TM systems (called m-best list in this section) and then move on to the problem formulation and the feature set.</p><subsection number="4.1" title="K-Best Lists in SMT and TM"><subsubsection number="4.1.1" title="The SMT N-best List"><p>The N-best list of the SMT system is generated during decoding according to the internal feature scores. 
The features include language and translation model probabilities, reordering model scores and a word penalty.</p></subsubsection><subsubsection number="4.1.2" title="The TM M-Best List and the Fuzzy"><p><b>Match Score</b></p><p>The m-best list of the TM system is generated in descending order of fuzzy match score. The fuzzy match score (Sikes, 2007) uses the similarity of the source sentences to predict the level to which a translation is reusable or editable.</p><p>The calculation of fuzzy match scores is one of the core technologies in TM systems and varies among different vendors. We compute fuzzy match cost as the minimum Edit Distance (Levenshtein, 1966) between the source and TM entry, normalized by the length of the source as in Eq. (4), as most of the current implementations are based on edit distance while allowing some additional flexible matching.</p><doubt alpha="53.8" length="26" tooSmall="False" monospace="0.0">$$\mathrm{FuzzyMatch}(t) = \min_{e} \frac{\mathrm{EditDistance}(s, e)}{\mathrm{Len}(s)} \qquad (4)$$</doubt><p>where $s$ is the source side of the TM hit $t$, and $e$ is the source side of an entry in the TM.</p></subsubsection></subsection><subsection number="4.2" title="Problem Formulation"><p>Ranking lists is a well-researched problem in the information retrieval community, and Ranking SVMs (Joachims, 2002), which optimize the ranking correlation $\tau$, have already been applied successfully in machine translation evaluation (Ye et al., 2007). 
We apply the same method here to rerank a merged list of MT and TM outputs.</p><p>Formally, given an MT-produced N-best list $\mathbf{M} = \{m_1, m_2, \ldots, m_n\}$ and a TM-produced m-best list $\mathbf{T} = \{t_1, t_2, \ldots, t_m\}$ for an input sentence $s$, we define the gold standard using the TER metric (Snover et al., 2006): for each $d \in \mathbf{M} \cup \mathbf{T}$, $(d_i, d_j) \in r(s)$ iff $\mathrm{TER}(d_i) &lt; \mathrm{TER}(d_j)$. We train and test a Ranking SVM using cross validation on a data set created according to this criterion. Ideally the gold standard would be created by human annotators. We choose to use TER as large-scale annotation is not yet available for this task.<page local="4" global="377"/> Furthermore, TER has a high correlation with the HTER score (Snover et al., 2006), which is the TER score using the post-edited MT output as a reference, and is used as an estimation of post-editing effort.</p></subsection><subsection number="4.3" title="The Feature Set"><p>When building features for the Ranking SVM, we are limited to features that are independent of the MT and TM systems. We experiment with the system-independent fluency and fidelity features below, which capture translation fluency and adequacy, respectively.</p><subsubsection number="4.3.1" title="Fluency Features"><p><b>Source-side Language Model Scores. </b>We compute the LM probability and perplexity of the input source sentence on a language model trained on the source-side training data of the SMT system, which is also the TM database. The inputs that have lower perplexity on this language model are more similar to the data set on which the SMT system is built.</p><p><b>Target-side Language Model Scores. </b>We compute the LM probability and perplexity as a measure of the fluency of the translation.</p></subsubsection><subsubsection number="4.3.2" title="Fidelity Features"><p><b>The Pseudo-Source Fuzzy Match Score. </b>We translate the output back to obtain a pseudo source sentence. 
We compute the fuzzy match score between the original source sentence and this pseudo-source. If the MT/TM performs well enough, these two sentences should be the same or very similar. Therefore the fuzzy match score here gives an estimation of the confidence level of the output.</p><p><b>The IBM Model 1 Score. </b>We compute the IBM Model 1 score in both directions to measure the correspondence between the source and target, as it serves as a rough estimation of how good a translation it is on the word level.</p></subsubsection></subsection></section><section number="5" title="Experiments"><subsection number="5.1" title="Experimental Settings"><doubt alpha="40.0" length="10" tooSmall="False" monospace="0.0">5.1.1 Data</doubt><p>Our raw data set is an English-French trans­lation memory with technical translation from a multi-national IT security company, consisting of 51K sentence pairs. We randomly select 43K to train an SMT system and translate the English side of the remaining 8K sentence pairs, which is used to run cross validation. Note that the 8K sentence pairs are from the same TM, so that we are able to create a gold standard by ranking the TER scores of the MT and TM outputs.</p><p>Duplicated sentences are removed from the data set, as those will lead to an exact match in the TM system and will not be translated by trans­lators. The average sentence length of the training set is 13.5 words and the size of the training set is comparable to the (larger) translation memories used in the industry.</p><subsubsection number="5.1.2" title="SMT and TM systems"><p>We use a standard log-linear PB-SMT model (Och and Ney, 2002): GIZA++ imple­mentation of IBM word alignment model 4, the phrase-extraction heuristics described in (Koehn et al., 2003), minimum-error-rate training (Och, 2003), a 5-gram language model with Kneser-Ney smoothing trained with SRILM (Stolcke, 2002) on the English side of the training data, and Moses (Koehn et al., 2007) to decode. 
We train a system in the opposite direction using the same data to produce the pseudo-source sentences.</p><p>We merge distinct 5-best lists from MT and TM systems to produce a new ranking. To create the distinct list for the SMT system, we search over a 100-best list and keep the top-5 distinct outputs. Our data set consists of mainly short sentences, leading to many duplications in the N-best output of the SMT decoder. In such cases, top-5 distinct outputs are good representations of the SMT's output.</p></subsubsection></subsection><subsection number="5.2" title="Training, Tuning and Testing the"><p><b>Ranking SVM</b></p><p>We run training and prediction of the Ranking SVM in 4-fold cross validation. We use the <i>SVMlight<footnote anchor="1"/> </i>toolkit to perform training and testing.<page local="5" global="378"/></p><p>When using the Ranking SVM with the RBF kernel, we have two free parameters to tune: the cost parameter $C$ in Eq. (1) and the radius parameter $\gamma$ in Eq. (3). We optimize $C$ and $\gamma$ using a brute-force grid search before running cross-validation, maximizing precision at top-5 with an inner 3-fold cross validation on the (outer) Fold-1 training set. We search within the range $[2^{-6}, 2^{9}]$, with a step size of 2 on the exponent.</p></subsection><subsection number="5.3" title="The Gold Standard"><doubt alpha="100.0" length="4" tooSmall="True" monospace="0.0">Top1</doubt><doubt alpha="75.0" length="4" tooSmall="True" monospace="0.0">Top3</doubt><doubt alpha="75.0" length="4" tooSmall="True" monospace="0.0">Top5</doubt><figure caption="Figure 1: MT and TM's percentage in gold standard"></figure><p>Figure 1 shows the composition of translations in the gold standard. Each source sentence is associated with a list of translations from two sources, i.e. MT output and TM matches. This list of translations is ranked from best to worst according to TER scores. 
The figure shows that over 80% of the translations are from the MT system if we only consider the top-1 translation. As the number of top translations we consider increases, more TM matches can be seen. On the one hand, this does show a large gap in quality between MT output and TM matches; on the other hand, however, it also reveals that we will have to ensure two objectives in ranking: the first is to rank the 80% MT translations higher and the second is to keep the 20% 'good' TM hits in the Top-5. We design our evaluation metrics accordingly.</p></subsection><subsection number="5.4" title="Evaluation Metrics"><p>The aim of this research is to provide post-editors with translations that in many cases are easier to edit than the original TM output. As we formulate this as a ranking problem, it is natural to measure the quality of the ranking output by the number of better translations that are ranked high. Sometimes the top TM output is the easiest to edit; in such a case we need to ensure that this translation has a high rank, otherwise the system performance will degrade.</p><footnote label="1">http://svmlight.joachims.org/</footnote><p>Based on this observation, we introduce the idea of <i>relevant </i>translations, and our evaluation metrics: PREC@k and HIT@k.</p><p><b>Relevant Translations. </b>We borrow the idea of <i>relevance </i>from the IR community to define the idea of translations worth ranking high. For a source sentence $s$ which has a top TM hit $t$, we define an MT/TM output $m$ as relevant if $\mathrm{TER}(m) \leq \mathrm{TER}(t)$. According to the definition, relevant translations should need no more post-edits than the original top hit from the TM system. Clearly the top TM hit is always relevant.</p><p><b>PREC@k. </b>We calculate the precision (PREC@k) of the ranking for evaluation. Assuming that there are $n$ relevant translations in the top k list for a source sentence $s$, we have PREC@k = $n/k$ for $s$. 
We test PREC@k, for <i>k = </i>1...10, in order to evaluate the overall quality of the ranking.</p><p><b>HIT@k. </b>We also estimate the probability of having one of the relevant translations in the top k, denoted as HIT@k. For a source sentence s, HIT@k equals 1 if there is at least one relevant translation in top k, and 0 otherwise. This measures the quality of the best translation in top k, which is the translation the post-editor will find and work on if she reads till the kth place in the list. HIT@k equals 1.0 at the end of the list.</p><p>We report the mean PREC@k and HIT@k for all s with the 0.95 confidence interval.</p></subsection><subsection number="5.5" title="Experimental Results"><p>In Table 1 we report PREC@k and HIT@k for k = 1...10. The ranking receives 0.8747 PREC@1, which means that most of the top ranked translations have at least the same quality as the top TM output. We notice that precision remains above 0.8 till k = 5, leading us to conclude that most of the <i>relevant </i>translations are ranked in the top-5 positions in the list.</p><page local="6" global="379"/><table caption="Table 1: PREC@k and HIT@k of Ranking"><tr class="row"><td class="cell"></td><td class="cell"><p>PREC %</p></td><td class="cell"><p>HIT %</p></td></tr><tr class="row"><td class="cell"><p>k=1</p></td><td class="cell"><p>87.47±1.60</p></td><td class="cell"><p>87.47±1.60</p></td></tr><tr class="row"><td class="cell"><p>k=2</p></td><td class="cell"><p>85.42±1.07</p></td><td class="cell"><p>93.36±0.53</p></td></tr><tr class="row"><td class="cell"><p>k=3</p></td><td class="cell"><p>84.13±0.94</p></td><td class="cell"><p>95.74±0.61</p></td></tr><tr class="row"><td class="cell"><p>k=4</p></td><td class="cell"><p>82.79±0.57</p></td><td class="cell"><p>97.08±0.26</p></td></tr><tr class="row"><td class="cell"><p>k=5</p></td><td class="cell"><p>81.34±0.51</p></td><td class="cell"><p>98.04±0.23</p></td></tr><tr class="row"><td class="cell"><p>k=6</p></td><td class="cell"><p>79.26±0.59</p></td><td class="cell"><p>99.41±0.25</p></td></tr><tr class="row"><td class="cell"><p>k=7</p></td><td class="cell"><p>74.99±0.53</p></td><td class="cell"><p>99.66±0.29</p></td></tr><tr class="row"><td class="cell"><p>k=8</p></td><td class="cell"><p>70.87±0.59</p></td><td class="cell"><p>99.84±0.10</p></td></tr><tr class="row"><td class="cell"><p>k=9</p></td><td class="cell"><p>67.23±0.48</p></td><td class="cell"><p>99.94±0.08</p></td></tr><tr class="row"><td class="cell"><p>k=10</p></td><td class="cell"><p>64.00±0.46</p></td><td class="cell"><p>100.0±0.00</p></td></tr></table><p>Using the HIT@k scores we can further confirm this argument. The HIT@k score grows steadily from 0.8747 to 0.9941 for k = 1...6, so most often there will be at least one <i>relevant </i>translation in top-6 for the post-editor to work with. After that, room for improvement becomes very small.</p><p>In sum, both the PREC@k scores and the HIT@k scores show that the ranking model effectively integrates the two translation sources (MT and TM) into one merged k-best list, and ranks the <i>relevant </i>translations higher.</p><table caption="Table 2: PREC@k - MT and TM Systems"><tr class="row"><td class="cell"></td><td class="cell"><p>MT %</p></td><td class="cell"><p>TM %</p></td></tr><tr class="row"><td class="cell"><p>k=1</p></td><td class="cell"><p>85.87±1.32</p></td><td class="cell"><p>100.0±0.00</p></td></tr><tr class="row"><td class="cell"><p>k=2</p></td><td class="cell"><p>82.52±1.60</p></td><td class="cell"><p>73.58±1.04</p></td></tr><tr class="row"><td class="cell"><p>k=3</p></td><td class="cell"><p>80.05±1.11</p></td><td class="cell"><p>62.45±1.14</p></td></tr><tr class="row"><td class="cell"><p>k=4</p></td><td class="cell"><p>77.92±0.95</p></td><td class="cell"><p>56.11±1.11</p></td></tr><tr class="row"><td class="cell"><p>k=5</p></td><td class="cell"><p>76.22±0.87</p></td><td class="cell"><p>51.78±0.78</p></td></tr></table><p>To measure whether the ranking model is effective compared to pure MT or TM outputs, we report the PREC@k of those outputs in Table 2. The k-best output used in this table is ranked by the MT or TM system, without being ranked by our model. We see that the ranked outputs consistently outperform the MT outputs for all <i>k</i><i> </i>= 1<i>...</i><i> </i>5 w.r.t. precision at a significant level, indicating that our system preserves some high quality hits from the TM.</p><p>The TM outputs alone are generally of much lower quality than the MT and Ranked outputs, as is shown by the precision scores for <i>k</i><i> </i>= 2<i>...</i>5. But TM translations obtain 1.0 PREC@1 according to the definition of the PREC calculation. 
Note that this does not mean that those outputs will need less post-editing (cf. Section 6.1), but rather indicates that each one of these outputs meets the lowest acceptable criterion to be <i>relevant.</i></p></subsection></section><section number="6" title="Analysis of Post-Editing Effort"><p>A natural question follows the PREC and HIT numbers: after reading the ranked k-best list, will the post-editors edit less than they would have to if they did not have access to the list? This question would be best answered by human post-editors in a large-scale experimental setting. As we have not yet conducted a manual post-editing experiment, we try to measure the post-editing effort implied by our model with the edit statistics captured by the TER metric, sorted into four types: <i>Insertion, Substitution, Deletion </i>and <i>Shift. </i>We report the average number of edits incurred along with the 0.95 confidence interval.</p><subsection number="6.1" title="Top-1 Edit Statistics"><p>We report the results on the 1-best output of TM, MT and our ranking system in Table 3.</p><p>In the single best results, it is easy to see that the 1-best output from the MT system requires the least post-editing effort. This is not surprising given the distribution of the gold standard in Section 5.3, where most MT outputs are of better quality than the TM hits.</p><p>Moreover, since TM translations are generally of much lower quality, as is indicated by the numbers in Table 3 (e.g. 2x as many substitutions and 3x as many deletions compared to MT), mistakenly including even a few of them in the ranking output will increase the loss in the edit statistics. This explains why the ranking model has better ranking precision in Tables 1 and 2, but seems to incur more edit effort. 
However, in practice post-editors can neglect an obvious 'bad' translation very quickly.</p></subsection><subsection number="6.2" title="Top-k Edit Statistics"><p>We report edit statistics of the Top-3 and Top-5 outputs in Tables 4 and 5, respectively. For each system we report two sets of statistics: the Best- statistics calculated on the best output (according to TER score) in the list, and the Mean- statistics calculated on the whole Top-k list.<page local="7" global="380"/></p><table caption="Table 3: Edit Statistics on Ranked MT and TM Outputs - Single Best"><tr class="row"><td class="cell"></td><td class="cell"><p>Insertion</p></td><td class="cell"><p>Substitution</p></td><td class="cell"><p>Deletion</p></td><td class="cell"><p>Shift</p></td></tr><tr class="row"><td class="cell"><p>TM-Top1</p></td><td class="cell"><p>0.7554 ± 0.0376</p></td><td class="cell"><p>4.2461 ± 0.0960</p></td><td class="cell"><p>2.9173 ± 0.1027</p></td><td class="cell"><p>1.1275 ± 0.0509</p></td></tr><tr class="row"><td class="cell"><p>MT-Top1</p></td><td class="cell"><p>0.9959 ± 0.0385</p></td><td class="cell"><p>2.2793 ± 0.0628</p></td><td class="cell"><p>0.8940 ± 0.0353</p></td><td class="cell"><p>1.2821 ± 0.0575</p></td></tr><tr class="row"><td class="cell"><p>Rank-Top1</p></td><td class="cell"><p>1.0674 ± 0.0414</p></td><td class="cell"><p>2.6990 ± 0.0699</p></td><td class="cell"><p>1.1246 ± 0.0412</p></td><td class="cell"><p>1.2800 ± 0.0570</p></td></tr></table><p>The Mean- numbers give a general overview of the ranking quality, but they are strongly influenced by the poor TM hits that can easily be neglected in practice. To control the impact of those TM hits, we rely on the Best- numbers to estimate the edits performed on the translations that are more likely to be used by post-editors.</p><p>In Table 4, the ranking output's edit statistics are closer to those of the MT output than in the Top-1 case in Table 3. Table 5 continues this tendency, in which the Best-in-Top5 Ranking output requires marginally fewer <i>Substitution </i>and <i>Deletion </i>operations and significantly fewer <i>Insertion </i>and <i>Shift </i>operations (starred) than its MT counterpart. 
This shows that when more of the list is explored, the advantage of the ranking model - utilizing multiple translation sources - begins to compensate for the possibly large number of edits required by poor TM hits and finally leads to reduced post-editing effort.</p><p>There are several explanations for why the relative performance of the ranking model improves when <i>k</i><i> </i>increases, as compared to other models. The most obvious explanation is that a single poor translation is less likely to hurt edit statistics on a k-best list with large <i>k, </i>if most of the translations in the k-best list are of good quality. We see from Tables 1 and 2 that the ranking output is of better quality than the MT and TM outputs w.r.t. precision. For a larger k, the small number of incorrectly ranked translations are less likely to be chosen as the Best- translation and hold back the Best- numbers.</p><p>A further reason is related to our ranking model, which optimizes Kendall's $\tau$ score. Accordingly, the output might not be optimal when we evaluate the Top-1 output, but will behave better when we evaluate on the list. This is also in accordance with our aim, which is to enrich the TM with MT outputs and help the post-editor, instead of choosing the translation for the post-editor.</p></subsection><subsection number="6.3" title="Comparing the MT, TM and Ranking Outputs"><p>One of the interesting findings from Tables 3 and 4 is that according to the TER edit statistics, the MT outputs generally need a smaller number of edits than the TM and Ranking outputs. This certainly confirms the necessity to integrate MT into today's TM systems.</p><p>However, this fact should not lead to the conclusion that TMs should be replaced by MT completely. First of all, all of our experiments exclude exact TM matches, as those translations will simply be reused and not translated. 
While this is a realistic setting in the translation industry, it removes all sentences for which the TM works best from our evaluations.</p><p>Furthermore, Table 5 shows that the Best-in-Top5 Ranking output performs better than the MT outputs, hence there are TM outputs that lead to a smaller number of edits. As <i>k </i>increases, the ranking model is able to better utilize these outputs.</p><p>Finally, in this task we concentrate on ranking useful translations higher, but we are not interested in how useless translations are ranked. Ranking SVM optimizes the ranking of the whole list, which is slightly different from what we actually require. One option is to use other optimization techniques that can make use of this property to get better Top-k edit statistics for a smaller k. Another option is obviously to perform regression directly on the number of edits instead of modeling the ranking. We plan to explore these ideas in future work.</p></subsection></section><section number="7" title="Conclusions and Future Work"><p>In this paper we present a novel ranking-based model to integrate SMT into a TM system, in order to facilitate the work of post-editors. In such<page local="8" global="381"/> a model, the user of the TM will be presented with an augmented k-best list, consisting of translations from both the TM and the MT systems, and ranked according to ascending prospective post-editing effort.</p><table caption="Table 4: Edit Statistics on Ranked MT and TM Outputs - Top 3"></table><p>Insertion Substitution Deletion Shift</p><p>From the post-editors' point of view, the TM remains intact. And unlike in the binary translation recommendation, where only one translation recommendation is provided, the ranking model offers k-best post-editing candidates, enabling the user to use more resources when translating. 
As we do not actually throw away any translation produced from the TM, the assets represented by the TM are preserved and the related estimation of the upper bound cost is still valid.</p><p>We extract system independent features from the MT and TM outputs and use Ranking SVMs to train the ranking model, which outperforms both the TM's and MT's k-best list w.r.t. precision at k, for all ks.</p><p>We also analyze the edit statistics of the integrated k-best output using the TER edit statistics. Our ranking model results in a slightly increased number of edits compared to the MT output (apparently held back by a small number of poor TM outputs that are ranked high) for a smaller k, but requires fewer edits than both the MT and the TM output for a larger k.</p><p>This work can be extended in a number of ways. Most importantly, we plan to conduct a user study to validate the effectiveness of the method and to gather HTER scores to train a better ranking model. Furthermore, we will try to experiment with learning models that can further reduce the number of edit operations on the top ranked translations. We also plan to improve the adaptability of this method and apply it beyond a specific domain and language pair.</p></section><section title="Acknowledgements"><p>This research is supported by the Science Foundation Ireland (Grant 07/CE/I1142) as part of the Centre for Next Generation Localisation (www.cngl.ie) at Dublin City University. 
We thank Symantec for providing the TM database and the anonymous reviewers for their insightful comments.</p><table caption="Table 5: Edit Statistics on Ranked MT and TM Outputs" class="main" frame="box" rules="all" border="1" regular="False"><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"><p>Insertion</p></td><td class="cell"></td><td class="cell"></td><td class="cell"><p>Substitution</p></td><td class="cell"></td><td class="cell"></td><td class="cell"><p>Deletion</p></td><td class="cell"></td><td class="cell"></td><td class="cell"><p>Shift</p></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>TM-Best-in-Top3</p></td><td class="cell"><p>0.4241</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0250</p></td><td class="cell"><p>3.7395</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0887</p></td><td class="cell"><p>2.9561</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0966</p></td><td class="cell"><p>0.9738</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0505</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>TM-Mean-Top3</p></td><td class="cell"><p>0.6718</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0200</p></td><td class="cell"><p>5.1428</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0559</p></td><td class="cell"><p>3.6192</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0649</p></td><td class="cell"><p>1.3233</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0310</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>MT-Best-in-Top3</p></td><td class="cell"><p>0.7696</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0351</p></td><td class="cell"><p>1.9210</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0610</p></td><td class="cell"><p>0.7706</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0332</p></td><td 
class="cell"><p>1.0842</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0545</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>MT-Mean-Top3</p></td><td class="cell"><p>1.1296</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0229</p></td><td class="cell"><p>2.4405</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0368</p></td><td class="cell"><p>0.9341</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0209</p></td><td class="cell"><p>1.3797</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0344</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Rank-Best-in-Top3</p></td><td class="cell"><p>0.8170</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0355</p></td><td class="cell"><p>2.0744</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0608</p></td><td class="cell"><p>0.8410</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0338</p></td><td class="cell"><p>1.0399</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0529</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Rank-Mean-Top3</p></td><td class="cell"><p>1.0942</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0234</p></td><td class="cell"><p>2.7437</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0392</p></td><td class="cell"><p>1.0786</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0231</p></td><td class="cell"><p>1.3309</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0334</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td 
class="cell"></td></tr></table><table class="main" frame="box" rules="all" border="1" regular="False"><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"><p>Insertion</p></td><td class="cell"></td><td class="cell"></td><td class="cell"><p>Substitution</p></td><td class="cell"></td><td class="cell"></td><td class="cell"><p>Deletion</p></td><td class="cell"></td><td class="cell"></td><td class="cell"><p>Shift</p></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>TM-Best-in-Top5</p></td><td class="cell"><p>0.4239</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0250</p></td><td class="cell"><p>3.7319</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0885</p></td><td class="cell"><p>2.9552</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0967</p></td><td class="cell"><p>0.9673</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0504</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>TM-Mean-Top5</p></td><td class="cell"><p>0.6143</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0147</p></td><td class="cell"><p>5.5092</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0473</p></td><td class="cell"><p>3.9451</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0521</p></td><td class="cell"><p>1.3737</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0240</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>MT-Best-in-Top5</p></td><td class="cell"><p>0.7690</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0351</p></td><td class="cell"><p>1.9163</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0610</p></td><td class="cell"><p>0.7685</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0332</p></td><td class="cell"><p>1.0811</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0544</p></td><td class="cell"></td></tr><tr class="row"><td 
class="cell"></td><td class="cell"><p>MT-Mean-Top5</p></td><td class="cell"><p>1.1912</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0182</p></td><td class="cell"><p>2.5326</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0291</p></td><td class="cell"><p>0.9487</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0165</p></td><td class="cell"><p>1.4305</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0272</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Rank-Best-in-Top5</p></td><td class="cell"><p>0.7246</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0338*</p></td><td class="cell"><p>1.8887</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0598</p></td><td class="cell"><p>0.7562</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0327</p></td><td class="cell"><p>0.9705</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0515*</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Rank-Mean-Top5</p></td><td class="cell"><p>1.1173</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0181</p></td><td class="cell"><p>2.8777</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0312</p></td><td class="cell"><p>1.1585</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0200</p></td><td class="cell"><p>1.3675</p></td><td class="cell"><p>±</p></td><td class="cell"><p>0.0260</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr></table></section><references><p>Cortes, Corinna and Vladimir Vapnik. 1995. Support-vector networks. 
<i>Machine Learning, </i>20(3):273-297.</p><p>Joachims, Thorsten. 2002. Optimizing search engines using clickthrough data. In <i>KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, </i>pages 133-142, New York, NY, USA.</p><page local="9" global="382"/><p>Koehn, Philipp, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In <i>Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (NAACL/HLT-2003), </i>pages 48-54, Edmonton, Alberta, Canada.</p><p>Koehn, Philipp, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In <i>Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions (ACL-2007), </i>pages 177-180, Prague, Czech Republic.</p><p>Levenshtein, Vladimir Iosifovich. 1966. Binary codes capable of correcting deletions, insertions, and reversals. <i>Soviet Physics Doklady, </i>10(8):707-710.</p><p>Och, Franz Josef and Hermann Ney. 2002. Discriminative training and maximum entropy models for statistical machine translation. In <i>Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL-2002), </i>pages 295-302, Philadelphia, PA, USA.</p><p>Och, Franz Josef. 2003. Minimum error rate training in statistical machine translation. In <i>Proceedings of the 41st Annual Meeting on Association for Computational Linguistics (ACL-2003), </i>pages 160-167, Morristown, NJ, USA.</p><p>Shen, Libin, Anoop Sarkar, and Franz Josef Och. 2004. Discriminative reranking for machine translation. 
In <i>HLT-NAACL 2004: Main Proceedings, </i>pages 177-184, Boston, Massachusetts, USA. Association for Computational Linguistics.</p><p>Sikes, Richard. 2007. Fuzzy matching in theory and practice. <i>Multilingual, </i>18(6):39-43.</p><p>Simard, Michel and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In <i>Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), </i>pages 120-127, Ottawa, Ontario, Canada.</p><p>Snover, Matthew, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In <i>Proceedings of Association for Machine Translation in the Americas (AMTA-2006), </i>pages 223-231, Cambridge, MA, USA.</p><p>Specia, Lucia, Nicola Cancedda, Marc Dymetman, Marco Turchi, and Nello Cristianini. 2009a. Estimating the sentence-level quality of machine translation systems. In <i>Proceedings of the 13th Annual Conference of the European Association for Machine Translation (EAMT-2009), </i>pages 28-35, Barcelona, Spain.</p><p>Specia, Lucia, Craig Saunders, Marco Turchi, Zhuoran Wang, and John Shawe-Taylor. 2009b. Improving the confidence of machine translation quality estimates. In <i>Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), </i>pages 136-143, Ottawa, Ontario, Canada.</p><p>Stolcke, Andreas. 2002. SRILM - an extensible language modeling toolkit. In <i>Proceedings of the Seventh International Conference on Spoken Language Processing, </i>volume 2, pages 901-904, Denver, CO, USA.</p><p>Ye, Yang, Ming Zhou, and Chin-Yew Lin. 2007. Sentence level machine translation evaluation as a ranking. In <i>Proceedings of the Second Workshop on Statistical Machine Translation, </i>pages 240-247, Prague, Czech Republic.</p></references></body></article>