The Alignment Template Approach to Statistical Machine Translation

Franz Josef Och ∗
Google

Hermann Ney †
RWTH Aachen

A phrase-based statistical machine translation approach, the alignment template approach, is described. This translation approach allows for general many-to-many relations between words. Thereby, the context of words is taken into account in the translation model, and local changes in word order from source to target language can be learned explicitly. The model is described using a log-linear modeling approach, which is a generalization of the often used source–channel approach. Thereby, the model is easier to extend than classical statistical machine translation systems. We describe in detail the process for learning phrasal translations, the feature functions used, and the search algorithm. The evaluation of this approach is performed on three different tasks. For the German–English speech Verbmobil task, we analyze the effect of various system components. On the French–English Canadian Hansards task, the alignment template system obtains significantly better results than a single-word-based translation model. In the Chinese–English 2002 National Institute of Standards and Technology (NIST) machine translation evaluation it yields statistically significantly better NIST scores than all competing research and commercial translation systems.

1. Introduction

Machine translation (MT) is a hard problem, because natural languages are highly complex, many words have various meanings and different possible translations, sentences might have various readings, and the relationships between linguistic entities are often vague. In addition, it is sometimes necessary to take world knowledge into account. The number of relevant dependencies is much too large and those dependencies are too complex to take them all into account in a machine translation system. Given these boundary conditions, a machine translation system has to make decisions (produce translations) given incomplete knowledge. In such a case, a principled approach to solving that problem is to use the concepts of statistical decision theory to try to make optimal decisions given incomplete knowledge. This is the goal of statistical machine translation.

The use of statistical techniques in machine translation has led to dramatic improvements in the quality of research systems in recent years. For example, the statistical approaches of the Verbmobil evaluations (Wahlster 2000) or the U.S. National Institute of Standards and Technology (NIST)/TIDES MT evaluations 2001 through 2003¹ obtain the best results. In addition, the field of statistical machine translation is rapidly progressing, and the quality of systems is getting better and better. An important factor in these improvements is definitely the availability of large amounts of data for training statistical models. Yet the modeling, training, and search methods have also improved since the field of statistical machine translation was pioneered by IBM in the late 1980s and early 1990s (Brown et al. 1990; Brown et al. 1993; Berger et al. 1994). This article focuses on an important improvement, namely, the use of (generalized) phrases instead of just single words as the core elements of the statistical translation model.

We describe in Section 2 the basics of our statistical translation model. We suggest the use of a log-linear model to incorporate the various knowledge sources into an overall translation system and to perform discriminative training of the free model parameters. This approach can be seen as a generalization of the originally suggested source–channel modeling framework for statistical machine translation.

In Section 3, we describe the statistical alignment models used to obtain a word alignment and techniques for learning phrase translations from word alignments. Here, the term phrase just refers to a consecutive sequence of words occurring in text and has to be distinguished from the use of the term in a linguistic sense. The learned bilingual phrases are not constrained by linguistic phrase boundaries. Compared to the word-based statistical translation models in Brown et al. (1993), this model is based on a (statistical) phrase lexicon instead of a single-word-based lexicon. Looking at the results of the recent machine translation evaluations, this approach seems currently to give the best results, and an increasing number of researchers are working on different methods for learning phrase translation lexica for machine translation purposes (Marcu and Wong 2002; Venugopal, Vogel, and Waibel 2003; Tillmann 2003; Koehn, Och, and Marcu 2003). Our approach to learning a phrase translation lexicon works in two stages: In the first stage, we compute an alignment between words, and in the second stage, we extract the aligned phrase pairs. In our machine translation system, we then use generalized versions of these phrases, called alignment templates, that also include the word alignment and use word classes instead of the words themselves.

In Section 4, we describe the various components of the statistical translation model. The backbone of the translation model is the alignment template feature function, which requires that a translation of a new sentence be composed of a set of alignment templates that covers the source sentence and the produced translation. Other feature functions score the well-formedness of the produced target language sentence (i.e., language model feature functions), the number of produced words, or the order of the alignment templates. Note that all components of our statistical machine translation model are purely data-driven and that there is no need for linguistically annotated corpora. This is an important advantage compared to syntax-based translation models (Yamada and Knight 2001; Gildea 2003; Charniak, Knight, and Yamada 2003) that require a parser for source or target language.

In Section 5, we describe in detail our search algorithm and discuss an efficient implementation. We use a dynamic-programming-based beam search algorithm that allows a trade-off between efficiency and quality. We also discuss the use of heuristic functions to reduce the number of search errors for a fixed beam size.

In Section 6, we describe various results obtained on different tasks. For the German–English Verbmobil task, we analyze the effect of various system components. On the French–English Canadian Hansards task, the alignment template system obtains significantly better results than a single-word-based translation model. In the Chinese–English 2002 NIST machine translation evaluation it yields results that are significantly better statistically than all competing research and commercial translation systems.

∗ 1600 Amphitheatre Parkway, Mountain View, CA 94043. E-mail: och@google.com.
† Lehrstuhl für Informatik VI, Computer Science Department, RWTH Aachen–University of Technology, Ahornstr. 55, 52056 Aachen, Germany. E-mail: ney@cs.rwth-aachen.de.
Submission received: 19 November 2002; Revised submission received: 7 October 2003; Accepted for publication: 1 June 2004.
© 2004 Association for Computational Linguistics. Computational Linguistics, Volume 30, Number 4.

2. Log-Linear Models for Statistical Machine Translation

We are given a source (French) sentence $f = f_1^J = f_1, \ldots, f_j, \ldots, f_J$, which is to be translated into a target (English) sentence $e = e_1^I = e_1, \ldots, e_i, \ldots, e_I$. Among all possible target sentences, we will choose the sentence with the highest probability:²

$$\hat{e}_1^I = \operatorname*{argmax}_{e_1^I} \left\{ \Pr(e_1^I \mid f_1^J) \right\} \tag{1}$$

The argmax operation denotes the search problem, that is, the generation of the output sentence in the target language.

Figure 1: Architecture of the translation approach based on a log-linear modeling approach.

As an alternative to the often used source–channel approach (Brown et al. 1993), we directly model the posterior probability $\Pr(e_1^I \mid f_1^J)$ (Och and Ney 2002). An especially well-founded framework for doing this is the maximum-entropy framework (Berger, Della Pietra, and Della Pietra 1996). In this framework, we have a set of $M$ feature functions $h_m(e_1^I, f_1^J)$, $m = 1, \ldots, M$. For each feature function, there exists a model parameter $\lambda_m$, $m = 1, \ldots, M$. The direct translation probability is given by

\begin{align}
\Pr(e_1^I \mid f_1^J) &= p_{\lambda_1^M}(e_1^I \mid f_1^J) \tag{2} \\
&= \frac{\exp\left[\sum_{m=1}^{M} \lambda_m h_m(e_1^I, f_1^J)\right]}{\sum_{{e'}_1^I} \exp\left[\sum_{m=1}^{M} \lambda_m h_m({e'}_1^I, f_1^J)\right]} \tag{3}
\end{align}

This approach has been suggested by Papineni, Roukos, and Ward (1997, 1998) for a natural language understanding task.

We obtain the following decision rule:

\begin{align*}
\hat{e}_1^I &= \operatorname*{argmax}_{e_1^I} \left\{ \Pr(e_1^I \mid f_1^J) \right\} \\
&= \operatorname*{argmax}_{e_1^I} \left\{ \sum_{m=1}^{M} \lambda_m h_m(e_1^I, f_1^J) \right\}
\end{align*}

Hence, the time-consuming renormalization in equation (3) is not needed in search. The overall architecture of the log-linear modeling approach is summarized in Figure 1.

A standard criterion on a parallel training corpus consisting of $S$ sentence pairs $\{(f_s, e_s) : s = 1, \ldots, S\}$ for log-linear models is the maximum class posterior probability criterion, which can be derived from the maximum-entropy principle:

$$\hat{\lambda}_1^M = \operatorname*{argmax}_{\lambda_1^M} \left\{ \sum_{s=1}^{S} \log p_{\lambda_1^M}(e_s \mid f_s) \right\} \tag{4}$$

This corresponds to maximizing the equivocation or maximizing the likelihood of the direct-translation model. This direct optimization of the posterior probability in Bayes' decision rule is referred to as discriminative training (Ney 1995) because we directly take into account the overlap in the probability distributions. The optimization problem under this criterion has very nice properties: There is one unique global optimum, and there are algorithms (e.g., gradient descent) that are guaranteed to converge to the global optimum. Yet the ultimate goal is to obtain good translation quality on unseen test data. An alternative training criterion therefore directly optimizes translation quality as measured by an automatic evaluation criterion (Och 2003).

Typically, the translation probability $\Pr(e_1^I \mid f_1^J)$ is decomposed via additional hidden variables. To include these dependencies in our log-linear model, we extend the feature functions to include the dependence on the additional hidden variable. Using for example the alignment $a_1^J$ as hidden variable, we obtain $M$ feature functions of the form $h_m(e_1^I, f_1^J, a_1^J)$, $m = 1, \ldots, M$ and the following model:

$$\Pr(e_1^I, a_1^J \mid f_1^J) = \frac{\exp\left(\sum_{m=1}^{M} \lambda_m h_m(e_1^I, f_1^J, a_1^J)\right)}{\sum_{{e'}_1^I, {a'}_1^J} \exp\left(\sum_{m=1}^{M} \lambda_m h_m({e'}_1^I, f_1^J, {a'}_1^J)\right)}$$

Obviously, we can perform the same step for translation models with an even richer set of hidden variables than only the alignment $a_1^J$.
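To make the decision rule concrete, the following minimal Python sketch scores a small list of candidate translations with a log-linear model. The candidate sentences, the feature names (`lm`, `tm`, `wp`), the feature values, and the weights are all illustrative assumptions and are not taken from the experiments; the renormalization of equation (3) is restricted here to the explicit candidate list.

```python
import math


def loglinear_score(features, weights):
    """Unnormalized log-score: sum over m of lambda_m * h_m(e, f)."""
    return sum(weights[name] * value for name, value in features.items())


def posterior(candidates, weights):
    """Equation (3), with the sum in the denominator taken over `candidates` only."""
    scores = {e: loglinear_score(h, weights) for e, h in candidates.items()}
    top = max(scores.values())  # subtract the maximum for numerical stability
    z = sum(math.exp(s - top) for s in scores.values())
    return {e: math.exp(s - top) / z for e, s in scores.items()}


def decide(candidates, weights):
    """Decision rule: the argmax needs only the unnormalized scores."""
    return max(candidates, key=lambda e: loglinear_score(candidates[e], weights))


# Hypothetical feature values h_m(e, f) for two candidate translations.
candidates = {
    "yes , I think": {"lm": -4.0, "tm": -2.0, "wp": 4.0},
    "yes I think so": {"lm": -5.0, "tm": -3.5, "wp": 4.0},
}
# Hypothetical model parameters lambda_m.
weights = {"lm": 1.0, "tm": 1.0, "wp": 0.1}

best = decide(candidates, weights)
probs = posterior(candidates, weights)
```

Because the denominator of equation (3) is the same for every candidate, `decide` never calls `posterior`; both select the same sentence.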

¹ http://www.nist.gov/speech/tests/mt/.
² The notational convention employed in this article is as follows. We use the symbol $\Pr(\cdot)$ to denote general probability distributions with (nearly) no specific assumptions. In contrast, for model-based probability distributions, we use the generic symbol $p(\cdot)$.

3. Learning Translation Lexica

In translation this section, lexica we that describe are the methods basis of for the learning machine the translation single-word system and phrase-based described in (4) ) Och and Ney The Alignment Template Approach to Statistical Machine Translation Section are 4. used First, to we learn introduce word the basic alignment. concepts Then, we of statistical describe how alignment these models, alignments which used to learn bilingual phrasal translations. can be 3.1 Statistical Alignment Models In (statistical) alignment models Pr ( f J 1 ,a J 1 | e I 1 ) , a “hidden” alignment a = a J duced that describes a mapping from a source position to a target position 1 is intro- relationship between the translation model and the j alignment model is given by a j . The Pr ( f J 1 | e I 1 )= ∑ Pr ( f J 1 ,a J 1 | e I 1 )

a1 J The for alignment source a J words 1 may that contain are not alignments aligned with a any j = 0 with In the statistical model depends on target the “empty” a set word. word e 0 to account general, learned from training data. To express the of dependence unknown of the parameters model on the θ that set, use the following notation: parameter is we (5)

$$\Pr(f_1^J, a_1^J \mid e_1^I) = p_\theta(f_1^J, a_1^J \mid e_1^I) \tag{6}$$

A detailed description of different specific statistical alignment models can be found in Brown et al. (1993) and Och and Ney (2003). Here, we use the hidden Markov model (HMM) alignment model (Vogel, Ney, and Tillmann 1996) and Model 4 of Brown et al. (1993) to compute the word alignment for the parallel training corpus.

To train the unknown parameters $\theta$, we are given a parallel training corpus consisting of $S$ sentence pairs $\{(f_s, e_s) : s = 1, \ldots, S\}$. For each sentence pair $(f_s, e_s)$, the alignment variable is denoted by $a = a_1^J$. The unknown parameters $\theta$ are determined by maximizing the likelihood on the parallel training corpus:

$$\hat{\theta} = \operatorname*{argmax}_{\theta} \left\{ \prod_{s=1}^{S} \left[ \sum_{a} p_\theta(f_s, a \mid e_s) \right] \right\} \tag{7}$$

This optimization can be performed using the expectation maximization (EM) algorithm (Dempster, Laird, and Rubin 1977). For a given sentence pair there are a large number of alignments. The alignment $\hat{a}_1^J$ that has the highest probability (under a certain model) is also called the Viterbi alignment (of that model):

$$\hat{a}_1^J = \operatorname*{argmax}_{a_1^J} \; p_{\hat{\theta}}(f_1^J, a_1^J \mid e_1^I) \tag{8}$$

A detailed comparison of the quality of these Viterbi alignments for various statistical alignment models compared to human-made word alignments can be found in Och and Ney (2003).
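As an illustration of equation (8), the sketch below computes a Viterbi alignment for a lexicon-only model in the style of Model 1 of Brown et al. (1993), for which the maximization decomposes over source positions. This is a deliberate simplification: the HMM and Model 4 alignment models used here contain further dependencies that are omitted. The function name, the table `t`, and its probability values are hypothetical.

```python
def viterbi_alignment(src_words, tgt_words, t, empty_word="NULL"):
    """Return a_1..a_J for a lexicon-only model.

    Assumes p(f, a | e) is proportional to the product over j of t(f_j | e_{a_j}).
    Position 0 stands for the "empty" word e_0; positions 1..I are the target words.
    """
    targets = [empty_word] + list(tgt_words)
    alignment = []
    for f in src_words:
        # In this simplified model each source position is maximized independently.
        best_i = max(range(len(targets)), key=lambda i: t.get((f, targets[i]), 0.0))
        alignment.append(best_i)
    return alignment


# Hypothetical lexicon probabilities t(f | e), for illustration only.
t = {
    ("das", "the"): 0.7, ("das", "house"): 0.1, ("das", "NULL"): 0.05,
    ("Haus", "house"): 0.8, ("Haus", "the"): 0.05,
    ("ja", "NULL"): 0.6, ("ja", "the"): 0.1,
}

a = viterbi_alignment(["das", "Haus", "ja"], ["the", "house"], t)
```

In the toy example the third source word is best explained by the empty word, so it receives the alignment value 0.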

3.2 Symmetrization

The or more baseline target alignment words. Therefore, model does lexical not correspondences allow a source word like the to be German aligned compound with two word word must Zahnarzttermin be mapped for onto dentist’s two or appointment more target words. cause problems Therefore, because the resulting a single Viterbi source alignment of the standard alignment models has a systematic loss in recall. Here, we Computational Linguistics Volume 30, Number 4 Figure Example 2 of a (symmetrized) word alignment ( Verbmobil task). describe alignment various models methods by applying for performing a heuristic postprocessing a symmetrization step of that our combines directed the statistical ments in both translation directions (source to target, target to source). Figure 2 shows align- an example To solve of this a symmetrized problem, we alignment. train in both translation directions. For each sentence pair, we i,b compute two 0 denote Viterbi the alignments sets of alignments a J 1 and in b I 1 . the Let two A 1 Viterbi = { ( a j ,j alignments. ) | a j > 0 } increase 2 { ( i ) | b i > } To and A = one alignment the quality matrix of the using alignments, one of the we following can combine combination (symmetrize) methods: A 1 and A 2 into A • • • Intersection: A = A 1 ∩ A 2 . Union: A = A 1 ∪ A 2 . Refined determined. method: The elements In a first step, of this the intersection intersection result A from = A 1 both ∩ A 2 Viterbi is alignments alignment and iteratively are therefore by adding very reliable. alignments Then, i,j we extend occurring the A ( ) only in the Och and Ney The Alignment Template Approach to Statistical Machine Translation

alignment in A, or if A 1 the or in the alignment following conditions A 2 if both neither hold: f j nor e i have an alignment

The or a alignment vertical ( i,j ) has neighbor a i,j horizontal , i,j neighbor 1 that is ( i − 1 1,j ) already , ( in i + 1,j A. ) ( − The set A i,j does not ) ( horizontal and ) } vertical contain + ) ∪{ ( neighbors. alignments with both Obviously, ments with the a intersection higher yields precision an and a alignment lower consisting and lower precision the recall. alignment. The of only The union one-to-one refined yields a higher align- a of combined alignment recall often able to improve precision and recall compared to the method Whether higher precision higher is nonsymmetrized is ments. a or a recall preferred depends on align- application of recall the word more alignment. important. For the Therefore, purpose of statistical MT, it the seems final higher is we use the union or the refined that a method The to obtain resulting a symmetrized symmetrized alignment matrix. combination translation how many lexica times and ) by alignments computing are then are aligned relative used to | divided frequencies train p e f by the count using single-word-based ( the of count how N many ( e,f ) of the word e occurs: f N ( f ) times f p ( e | f )= N ( e,f ) [BAR] N ( f ) • • 3.3 Bilingual Contiguous Phrases In of this source section, language we present words a method and for target learning language relationships words. between , n as input a general word alignment This algorithm, whole phrases will m be called takes matrix which tion 3.2). The following, output phrase-extract is a set describe of bilingual the criterion phrases. 
(Sec- In the the word we alignment matrix: that defines the set of phrases that consistent with BP ( f 1 J ,e I 1 ,A ) = {( f j j + m ,e i i + n ) : ∀ ( i ′ ,j ′ ) ∈ A : j ≤ j ′ ≤ j + m ↔ i ≤ i ′ ≤ i + n Hence, stituted the by set all bilingual of all bilingual phrase phrases pairs in that which are the all consistent only the words of target words language within with the phrase the alignment source and the language is con- phrase of phrase. the target are aligned language that we require phrase with that are aligned at least only one word with the in the words source of the language source phrase language words aligned Note empty with at or least target one language word of phrases the target that language would correspond phrase. As to a the result “empty there word” are no be of the These word-based source phrases statistical can be computed alignment phrases in one language and checking straightforwardly models. whether the aligned by enumerating guage consecutive, with possible words words in the all the exception of other possible are that are not aligned lan- all. Figure 3 gives the algorithm that computes the phrases. The algo- at rithm target takes seven language into phrases. account that result Table possibly phrase-extract 1 shows unaligned the the bilingual words at phrases the boundaries containing of the from application of algorithm between source this to the alignment two or and of Figure 2. words is (9) ∧∃ ( i ′ ,j ′ ) ∈ A : j ≤ j ′ ≤ j + m ∧ i ≤ i ′ ≤ i + n } Computational Linguistics Volume 30, Number 4 Table 1 Examples of two- to to seven-word the bilingual alignment of phrases Figure 2. 
obtained by applying the algorithm phrase-extract [BAR] ja , ja , ja , ich denke ich ja , ja ich , ich denke mal denke mal mal , , also , ich denke , ich , ich , ich denke mal , denke mal mal mal , , , ich denke also also ich denke wir ich denke mal ich denke mal mal mal also , , ich denke , also ich denke mal , also wir wollten wir denke denke mal , mal also , denke mal denke , also mal wir , also wollten wir , also , also , also wir wollten wir also wir also wollten wir wir in wollten in unserer Abteilung unserer in in unserer unserer Abteilung Abteilung ein ein neues neues Netzwerk Netzwerk unserer Abteilung aufbauen ein neues Netzwerk ein neues ein neues Netzwerk neues aufbauen Netzwerk pressive It should power. be If emphasized a consecutive that phrase this constraint in one language to consecutive is translated phrases into limits two or the three ex- nonconsecutive phrase pair learned phrases by this in the approach. other language, In principle, there this is approach no corresponding to learning bilingual from secutive a word-aligned in source corpus and could target be extended language straightforwardly as well. Informal to experiments handle noncon- phrases shown that phrases phrases allowing for especially nonconsecutive increases phrases the percentage significantly of wrong increases phrases. the number have extracted and Therefore, of we consider only consecutive phrases.

3.4 Alignment Templates

In replacing the following, words with we add word generalization classes and capability also by storing to the bilingual the alignment phrase information lexicon by for each phrase pair. These generalized and alignment-annotated phrase pairs are called alignment templates . Formally, an alignment template z is a triple F J ′ ( ,E I ′ 1 1 ,Ã ) yes yes , yes , , I yes , I I think yes , I think think , , I , well , , I I think , I think , I think , think , , well I think well we I I I think think think , , , well well I think , well we we think , plan to think think , , well well think , well we we , well plan to , , well well we we well we plan to well we plan we plan to to in in our our a set new up network department a new network in our department in our department our a new department set a new up a network new network new network Och and Ney The Alignment Template Approach to Statistical Machine Translation Figure quasi-consecutive Algorithm 3 phrase-extract TP is a predicate for extracting that tests phrases whether from the a word-aligned set of words sentence TP is consecutive, pair. Here with the possible exception ( ) of words that are not aligned. that describes the alignment Ã between a source class sequence F J ′ 1 and a target class sequence a E I ′ to bilingual 1 . If phrase each word together corresponds with an to alignment one class, within an alignment this phrase. template Figure corresponds examples of alignment templates. 4 shows The alignment Ã is represented as a matrix with J ′ matrix element with · ( I ′ value 1 means that the words at the corresponding + 1 ) binary positions elements. are A aligned, aligned with and the a target value word, 0 means then that it is the aligned words with are not the aligned. empty word If a source , which word is is not imaginary position 0. 
e 0 at the i = The classes used in F J ′ 1 and E I ′ the method described in Och (1999) 1 and are automatically constitute a partition trained bilingual of the vocabulary classes using source each and target instance language. of a In general, we are not limited to disjoint classes as long of as specific class. specific In the following, word we use is disambiguated, classes. it would be possible the class function that is, uniquely C to map belongs words to to their a instead of Hence, the automatically trained word to employ classes parts-of-speech The of classes instead of the words used here. or semantic categories generalization. use For example, if there exist classes themselves in source has contain names, it possible and the target advantage is that an alignment language of better town template learned using a specific that town name can be generalized to other town names.

Figure 4: Examples of alignment templates obtained in training.

In the following, $\tilde{e}$ and $\tilde{f}$ denote target and source phrases, respectively. To train the probability of applying an alignment template $p(z = (F_1^{J'}, E_1^{I'}, \tilde{A}) \mid \tilde{f})$, we use an extended version of the algorithm phrase-extract from Section 3.3. All bilingual phrases that are consistent with the alignment are extracted together with the alignment within this bilingual phrase. Thus, we obtain a count $N(z)$ of how often an alignment template occurred in the aligned training corpus. The probability of using an alignment template to translate a specific source language phrase $\tilde{f}$ is estimated by means of relative frequency:

$$p(z = (F_1^{J'}, E_1^{I'}, \tilde{A}) \mid \tilde{f}) = \frac{N(z) \cdot \delta(F_1^{J'}, C(\tilde{f}))}{N(C(\tilde{f}))} \tag{10}$$

To reduce the memory requirement of the alignment templates, we compute these probabilities only for phrases up to a certain maximal length in the source language. Depending on the size of the corpus, the maximal length in the experiments is between four and seven words. In addition, we remove alignment templates that have a probability lower than a certain threshold. In the experiments, we use a threshold of 0.01.

It should be emphasized that this algorithm for computing aligned phrase pairs and their associated probabilities is very easy to implement. The joint translation model suggested by Marcu and Wong (2002) tries to learn phrases as part of a full EM algorithm, which leads to very large memory requirements and a rather complicated training algorithm. A comparison of the two approaches can be found in Koehn, Och, and Marcu (2003).
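The relative-frequency estimate of equation (10) and the threshold pruning can be sketched as follows. The input format (a list of extracted phrase instances with their internal links), the separate class maps for source and target words, the class names, and the toy data are assumptions for illustration; in particular, $N(C(\tilde{f}))$ is counted here over the extracted phrase instances.

```python
from collections import Counter


def estimate_templates(instances, src_class, tgt_class, threshold=0.01):
    """Estimate p(z | source class sequence) by relative frequency.

    Each instance is (source phrase, target phrase, links), where links is a
    set of (i, j) position pairs inside the phrase pair.
    """
    template_count = Counter()  # N(z)
    source_count = Counter()    # N(C(f~)), counted over extracted instances
    for src_phrase, tgt_phrase, links in instances:
        F = tuple(src_class[w] for w in src_phrase)
        E = tuple(tgt_class[w] for w in tgt_phrase)
        template_count[(F, E, frozenset(links))] += 1
        source_count[F] += 1
    # Keying each template by F plays the role of the delta term in equation (10):
    # a template is applicable only to phrases with a matching class sequence.
    probs = {z: n / source_count[z[0]] for z, n in template_count.items()}
    # Remove templates whose probability is lower than the threshold.
    return {z: p for z, p in probs.items() if p >= threshold}


# Hypothetical word classes and phrase instances.
src_class = {"nach": "S_PREP", "München": "S_TOWN", "Köln": "S_TOWN"}
tgt_class = {"to": "T_PREP", "for": "T_PREP2", "Munich": "T_TOWN", "Cologne": "T_TOWN"}
links = {(0, 0), (1, 1)}
instances = [
    (("nach", "München"), ("to", "Munich"), links),
    (("nach", "Köln"), ("to", "Cologne"), links),
    (("nach", "Köln"), ("for", "Cologne"), links),
]

templates = estimate_templates(instances, src_class, tgt_class)
pruned = estimate_templates(instances, src_class, tgt_class, threshold=0.5)
z_to = (("S_PREP", "S_TOWN"), ("T_PREP", "T_TOWN"), frozenset(links))
```

The two instances with different town names collapse into the same template `z_to`, which is the generalization effect of using classes instead of words.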

4. Translation Model

To describe our translation model based on the alignment templates described in the previous section in a formal way, we first decompose both the source sentence $f_1^J$ and the target sentence $e_1^I$ into a sequence of phrases ($k = 1, \ldots, K$):

$$f_1^J = \tilde{f}_1^K, \quad \tilde{f}_k = f_{j_{k-1}+1}, \ldots, f_{j_k} \tag{11}$$

$$e_1^I = \tilde{e}_1^K, \quad \tilde{e}_k = e_{i_{k-1}+1}, \ldots, e_{i_k} \tag{12}$$

Note that there are a large number of possible segmentations of a sentence pair into $K$ phrase pairs. In the following, we will describe the model for a specific segmentation. Eventually, however, a model can be described in which the specific segmentation is not known when new text is translated. Hence, as part of the overall search process (Section 5), we also search for the optimal segmentation.

To allow possible reordering of phrases, we introduce an alignment on the phrase level $\pi_1^K$ between the source phrases $\tilde{f}_1^K$ and the target phrases $\tilde{e}_1^K$. Hence, $\pi_1^K$ is a permutation of the phrase positions $1, \ldots, K$ and indicates that the phrases $\tilde{e}_k$ and $\tilde{f}_{\pi_k}$ are translations of one another. We assume that for the translation between these phrases a specific alignment template $z_k$ is used:

$$\tilde{e}_k \stackrel{z_k}{\longleftrightarrow} \tilde{f}_{\pi_k}$$

Hence, our model has the following hidden variables:

$$\pi_1^K, \; z_1^K$$

Figure 5 gives an example of the word alignment and phrase alignment of a German–English sentence pair.

We describe our model using a log-linear modeling approach. Hence, all knowledge sources are described as feature functions that include the given source language string $f_1^J$, the target language string $e_1^I$, and the above-stated hidden variables. Hence, we have the following functional form of all feature functions:

$$h(e_1^I, f_1^J, \pi_1^K, z_1^K)$$

Figure 6 gives an overview of the decisions made in the alignment template model. First, the source sentence words $f_1^J$ are grouped into phrases $\tilde{f}_1^K$. For each phrase $\tilde{f}$ an alignment template $z$ is chosen and the sequence of chosen alignment templates is reordered (according to $\pi_1^K$). Then, every phrase $\tilde{f}$ produces its translation $\tilde{e}$ (using the corresponding alignment template $z$). Finally, the sequence of phrases $\tilde{e}_1^K$ constitutes the sequence of words $e_1^I$.
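The sequence of decisions just described can be traced with a few lines of Python. In this sketch, plain phrase pairs stand in for alignment templates (word classes and the template-internal alignment are omitted), the segmentation and the permutation are given rather than searched for, and the function name and data layout are assumptions. The phrase pairs are among those listed in Table 1.

```python
def translate_with_segmentation(src_words, boundaries, permutation, phrase_table):
    """Apply a fixed segmentation and phrase-level alignment.

    boundaries:  j_0 = 0 < j_1 < ... < j_K = J (end positions of source phrases)
    permutation: pi_1 .. pi_K, 1-based indices of source phrases in target order
    """
    K = len(boundaries) - 1
    assert sorted(permutation) == list(range(1, K + 1)), "pi must permute 1..K"
    # Group the source words into phrases f~_1 .. f~_K.
    src_phrases = [tuple(src_words[boundaries[k]:boundaries[k + 1]]) for k in range(K)]
    # Visit the source phrases in target order and emit their translations.
    target = []
    for pi_k in permutation:
        target.extend(phrase_table[src_phrases[pi_k - 1]])
    return target


phrase_table = {
    ("wir", "wollten"): ["we", "plan", "to"],
    ("in", "unserer", "Abteilung"): ["in", "our", "department"],
    ("ein", "neues", "Netzwerk", "aufbauen"): ["set", "up", "a", "new", "network"],
}
src = "wir wollten in unserer Abteilung ein neues Netzwerk aufbauen".split()

output = translate_with_segmentation(src, [0, 2, 5, 9], [1, 3, 2], phrase_table)
```

The permutation `[1, 3, 2]` moves the translation of the third source phrase in front of the translation of the second, which is the kind of local reordering the phrase-level alignment is meant to capture.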

Figure 5: Example of segmentation of German sentence and its English translation into alignment templates.

Figure 6: Dependencies in the alignment template model.

4.1 Feature Functions

4.1.1 Alignment Template Selection. To score the use of an alignment template, we use the probability $p(z \mid \tilde{f})$ defined in Section 3. We establish a corresponding feature function by multiplying the probability of all used alignment templates and taking the logarithm:

$$h_{\mathrm{AT}}(e_1^I, f_1^J, \pi_1^K, z_1^K) = \log \prod_{k=1}^{K} p\big(z_k \mid f_{j_{\pi_k - 1}+1}^{j_{\pi_k}}\big) \tag{13}$$

Here, $j_{\pi_k - 1} + 1$ is the position of the first word of alignment template $z_k$ in the source language sentence and $j_{\pi_k}$ is the position of the last word of that alignment template.

Note that this feature function requires that a translation of a new sentence be composed of a set of alignment templates that covers both the source sentence and the produced translation. There is no notion of "empty phrase" that corresponds to the "empty word" in word-based statistical alignment models. The alignment on the phrase level is actually a permutation, and no insertions or deletions are allowed.

4.1.2 Word Selection. For scoring the use of target language words, we use a lexicon probability $p(e \mid f)$, which is estimated using relative frequencies as described in Section 3.2. The target word $e$ depends on the aligned source words. If we denote the resulting word alignment matrix by $A := A_{\pi_1^K, z_1^K}$ and the predicted word class for word $e_i$ by $E_i$, then the feature function $h_{\mathrm{WRD}}$ is defined as follows:

$$h_{\mathrm{WRD}}(e_1^I, f_1^J, \pi_1^K, z_1^K) = \log \prod_{i=1}^{I} p\big(e_i \mid \{f_j \mid (i,j) \in A\}, E_i\big) \tag{14}$$

For $p(e_i \mid \{f_j \mid (i,j) \in A\})$ we use a uniform mixture of a single-word model $p(e \mid f)$, which is constrained to predict only words that are in the predicted word class $E_i$:

$$p\big(e_i \mid \{f_j \mid (i,j) \in A\}, E_i\big) = \frac{\sum_{\{j \mid (i,j) \in A\}} p(e_i \mid f_j)}{|\{j \mid (i,j) \in A\}|} \cdot \delta(C(e_i), E_i)$$

A disadvantage of this model is that the word order is ignored in the translation model. The translations the day after tomorrow or after the day tomorrow for the German word übermorgen receive an identical probability. Yet the first one should obtain a significantly higher probability. Hence, we also include a dependence on the word positions in the lexicon model $p(e \mid f, i, j)$:

$$p\Big(e_i \;\Big|\; f_j, \; \sum_{i'=1}^{i-1} [(i', j) \in A], \; \sum_{j'=1}^{j-1} [(i, j') \in A]\Big) \tag{15}$$

Here, $[(i', j) \in A]$ is 1 if $(i', j) \in A$ and 0 otherwise. As a result, the word $e_i$ depends not only on the aligned French word $f_j$, but also on the number of preceding French words aligned with $e_i$ and on the number of the preceding English words aligned with $f_j$. This model distinguishes the positions within a phrasal translation. The number of parameters of $p(e \mid f, i, j)$ is significantly higher than that of $p(e \mid f)$ alone. Hence, there is a data estimation problem especially for words that rarely occur. Therefore, we linearly interpolate the models $p(e \mid f)$ and $p(e \mid f, i, j)$.

4.1.3 Phrase Alignment. The phrase alignment feature simply takes into account that very often a monotone alignment is a correct alignment. Hence, the feature function $h_{\mathrm{AL}}$ measures the "amount of nonmonotonicity" by summing over the distance (in the source language) of alignment templates that are consecutive in the target language:

$$h_{\mathrm{AL}}(e_1^I, f_1^J, \pi_1^K, z_1^K) = \sum_{k=1}^{K+1} \big| j_{\pi_k - 1} - j_{\pi_{k-1}} \big| \tag{16}$$

Here, $j_{\pi_0}$ is defined to equal 0 and $j_{\pi_{K+1} - 1}$ is defined to equal $J$. The above-stated sum includes $k = K + 1$ to include the distance from the end position of the last phrase to the end of sentence.

The sequence of $K = 6$ alignment templates in Figure 5 corresponds to the following sum of seven jump distances: $0 + 0 + 1 + 3 + 2 + 0 + 0 = 6$.

4.1.4 Language Model Features. As a default language model feature, we use a standard backing-off word-based trigram language model (Ney, Generet, and Wessel 1995):

$$h_{\mathrm{LM}}(e_1^I, f_1^J, \pi_1^K, z_1^K) = \log \prod_{i=1}^{I+1} p(e_i \mid e_{i-2}, e_{i-1}) \tag{17}$$

In addition, we use a 5-gram class-based language model:

$$h_{\mathrm{CLM}}(e_1^I, f_1^J, \pi_1^K, z_1^K) = \log \prod_{i=1}^{I+1} p\big(C(e_i) \mid C(e_{i-4}), \ldots, C(e_{i-1})\big) \tag{18}$$

The use of the language model feature in equation (18) helps take long-range dependencies better into account.

4.1.5 Word Penalty. To improve the scoring for different target sentence lengths, we also use as a feature the number of produced target language words (i.e., the length of the produced target language sentence):

$$h_{\mathrm{WP}}(e_1^I, f_1^J, \pi_1^K, z_1^K) = I \tag{19}$$

Without this feature, we typically observe that the produced sentences tend to be too short.

4.1.6 Conventional Lexicon. We also use a feature that counts how many entries of a conventional lexicon co-occur in the given sentence pair. Therefore, the weight for the provided conventional dictionary can be learned:

h LEX ( e I 1 ,f 1 J , 1 K ,z K 1 )= #CO-OCCURRENCES ( LEX, e I 1 ,f 1 J π ) The matically intuition trained is that lexicon the conventional and therefore dictionary should get LEX a larger is more weight. reliable than the autoused 4.1.7 Additional is that we can Features. add numerous A major features advantage that of baseline system. Here, will deal the with log-linear specific modeling MT ourselves to the problems approach statistical we restrict described of the of features. Yet of we could and use target grammatical language. features that relate certain grammatical set dependencies counts how we many source define arguments the following the main feature, verb of For which a sentence example, has a has using nonzero in the a function value source if or the target k ( · ) that sentence, in each of the can two sentences has the same number of arguments: verb

h ( f 1 J ,e I 1 , 1 K ,z K 1 )= δ ( k ( f 1 J π ) ,k ( e I 1 )) In the the dialogue same way, act classification. we can introduce semantic features or pragmatic features such as ( 17 ) ( 18 ) ( 19 ) ( 20 ) ( 21 ) Och and Ney The Alignment Template Approach to Statistical Machine Translation
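As an illustration of the phrase alignment feature in equation (16), the following minimal Python sketch sums the jump distances for a given segmentation. The function name `phrase_alignment_distance` and the representation of each alignment template as a `(start, end)` pair of 1-based source positions, listed in target order, are assumptions made for this example and are not taken from the original system.

```python
def phrase_alignment_distance(spans, J):
    """Sum of jump distances (h_AL) for source spans listed in target order.

    spans: list of (start, end) source positions, 1-based and inclusive
           (hypothetical representation chosen for this sketch).
    J:     length of the source sentence.
    """
    total = 0
    prev_end = 0  # j_{pi_0} is defined to equal 0
    for start, end in spans:
        # distance between the end of the previous template and the
        # position just before the start of the current template
        total += abs((start - 1) - prev_end)
        prev_end = end
    # final term (k = K + 1): distance from the last template to the sentence end
    total += abs(J - prev_end)
    return total


monotone = [(1, 2), (3, 3), (4, 6)]
reordered = [(1, 2), (5, 6), (3, 4)]
print(phrase_alignment_distance(monotone, 6))   # 0
print(phrase_alignment_distance(reordered, 6))  # 0 + 2 + 4 + 2 = 8
```

A monotone segmentation yields a distance of zero, so with a suitably signed scaling factor the feature favors translations that keep the source order.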

Training

For the three different tasks on which we report results, we use two different training approaches. For the Verbmobil task, we train the model parameters $\lambda_1^M$ according to the maximum class posterior probability criterion (equation (4)). For the French–English Hansards task and the Chinese–English NIST task, we simply tune the model parameters by coordinate descent on held-out data with respect to the automatic evaluation metric employed, using as a starting point the model parameters obtained on the Verbmobil task. Note that this tuning depends on the starting point of the model parameters and is not guaranteed to converge to the global optimum on the training data. As a result, this approach is limited to a very small number of model parameters. An efficient algorithm for performing this tuning for a larger number of model parameters can be found in Och (2003).

A standard approach to training the log-linear model parameters of the maximum class posterior probability criterion is the GIS (Generalized Iterative Scaling) algorithm (Darroch and Ratcliff 1972). To apply this algorithm, we have to solve various practical problems. The renormalization needed in equation (3) requires a sum over many possible sentences, for which we do not know of an efficient algorithm. Hence, we approximate this sum by extracting a large set of highly probable sentences as a sample from the space of all possible sentences (n-best approximation). The set of considered sentences is computed by means of an appropriately extended version of the search algorithm described in Section 5.

Using an n-best approximation, we might face the problem that the parameters trained with the GIS algorithm yield worse translation results even on the training corpus. This can happen because with the modified model scaling factors, the n-best list can change significantly and can include sentences that have not been taken into account in training. Using these sentences, the new model parameters might perform worse than the old model parameters. To avoid this problem, we proceed as follows. In a first step, we perform a search, compute an n-best list, and use this n-best list to train the model parameters. Second, we use the new model parameters in a new search and compute a new n-best list, which is combined with the existing n-best list. Third, using this extended n-best list, new model parameters are computed. This process is iterated until the resulting n-best list does not change. In this algorithm, convergence is guaranteed, as in the limit the n-best list will contain all possible translations. In practice, the algorithm converges after five to seven iterations. In our experiments this final n-best list contains about 500–1000 alternative translations.

We might have the problem that none of the given reference translations is part of the n-best list because the n-best list is too small or because the search algorithm performs pruning which in principle limits the possible translations that can be produced given a certain input sentence. To solve this problem, we define as reference translation for maximum-entropy training each sentence that has the minimal number of word errors with respect to any of the reference translations in the n-best list. More details of the training procedure can be found in Och and Ney (2002).
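The iterative procedure described above (search, merge the new n-best list into the existing one, retrain, repeat until the list stops changing) can be sketched as the following loop. This is only a schematic outline: `decode` and `train` are hypothetical placeholders standing in for the actual search and the GIS training, and the iteration cap is an assumption added for safety.

```python
def iterate_nbest_training(decode, train, initial_params, max_iterations=20):
    """Alternate between search and training on a growing n-best list.

    decode(params) -> iterable of candidate translations (hashable)
    train(nbest)   -> new parameters estimated on the accumulated list
    Both callables are placeholders for the real search and training steps.
    """
    params = initial_params
    nbest = set()
    for iteration in range(1, max_iterations + 1):
        new_candidates = set(decode(params))
        if new_candidates <= nbest:
            # the search produced nothing new: the n-best list is stable
            return params, nbest, iteration
        nbest |= new_candidates   # combine with the existing n-best list
        params = train(nbest)     # retrain on the extended list
    return params, nbest, max_iterations


# Toy stand-ins: "decoding" returns two entries of a fixed pool,
# "training" maps the list size to an integer parameter.
pool = ["a", "b", "c", "d"]
params, nbest, iterations = iterate_nbest_training(
    decode=lambda p: pool[p:p + 2],
    train=lambda candidates: len(candidates) // 2,
    initial_params=0,
)
print(params, sorted(nbest), iterations)
```

Because the accumulated list only grows and the set of possible translations is finite, the loop must eventually stop, which mirrors the convergence argument given in the text.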

Search

In this section, we describe an efficient search architecture for the alignment template model.

General Concept

In general, the search problem for statistical MT even using only Model 1 of Brown et al. (1993) is NP-complete (Knight 1999). Therefore, we cannot expect to develop efficient search algorithms that are guaranteed to solve the problem without search errors. Yet for practical applications it is acceptable to commit some search errors (Section 6.1.2). Hence, the art of developing a search algorithm lies in finding suitable approximations and heuristics that allow an efficient search without committing too many search errors.

In the development of the search algorithm described in this section, our main aim is that the search algorithm should be efficient. It should be possible to translate a sentence of reasonable length within a few seconds of computing time. We accept that the search algorithm sometimes results in search errors, as long as the impact on translation quality is minor. Yet it should be possible to reduce the number of search errors by increasing computing time. In the limit, it should be possible to search without search errors. The search algorithm should not impose any principal limitations. We also expect that the search algorithm be able to scale up to very long sentences with an acceptable computing time.

To meet these aims, it is necessary to have a mechanism that restricts the search effort. We accomplish such a restriction by searching in a breadth-first manner with pruning: beam search. In pruning, we constrain the set of considered translation candidates (the "beam") only to the promising ones. We compare in beam search those hypotheses that cover different parts of the input sentence. This makes the comparison of the probabilities problematic. Therefore, we integrate an admissible estimation of the remaining probabilities to arrive at a complete translation (Section 5.6).

Many of the other search approaches suggested in the literature do not meet the described aims:

• Neither optimal A* search (Och, Ueffing, and Ney 2001) nor optimal integer programming (Germann et al. 2001) for statistical MT allows efficient search for long sentences.

• Greedy search algorithms (Wang 1998; Germann et al. 2001) typically commit severe search errors (Germann et al. 2001).

• Other approaches to solving the search problem obtain polynomial time algorithms by assuming monotone alignments (Tillmann et al. 1997) or imposing a simplified recombination structure (Nießen et al. 1998). Others make simplifying assumptions about the search space (García-Varea, Casacuberta, and Ney 1998; García-Varea et al. 2001), as does the original IBM stack search decoder (Berger et al. 1994). All these simplifications ultimately make the search problem simpler but introduce fundamental search errors.

In the following, we describe our search algorithm based on the concept of beam search, which allows a trade-off between efficiency and quality by adjusting the size of the beam. The search algorithm can be easily adapted to other phrase-based translation models. For single-word-based search in MT, a similar algorithm has been described in Tillmann and Ney (2003).

Search Problem

Putting everything together and performing search in maximum approximation, we obtain the following decision rule:

$$\hat{e}_1^I = \operatorname*{argmax}_{e_1^I, \pi_1^K, z_1^K} \Bigl\{ \sum_{m=1}^{M} \lambda_m \cdot h_m(e_1^I, f_1^J, \pi_1^K, z_1^K) \Bigr\} \tag{22}$$

Using the four feature functions AT, AL, WRD, and LM, we obtain the following decision rule:³

\begin{align}
\hat{e}_1^I = {} & \operatorname*{argmax}_{e_1^I, \pi_1^K, z_1^K} \Bigl\{ \tag{23} \\
& \sum_{i=1}^{I} \bigl( \lambda_{LM} \log p(e_i \mid e_{i-2}, e_{i-1}) + \lambda_{WRD} \log p(e_i \mid \{f_j \mid (i,j) \in A\}, E_i) \bigr) \tag{24} \\
& + \sum_{k=1}^{K} \bigl( \lambda_{AT} \log p(z_k \mid f_{j_{\pi_k - 1}+1}^{j_{\pi_k}}) + \lambda_{AL} \cdot |j_{\pi_k - 1} - j_{\pi_{k-1}}| \bigr) \tag{25} \\
& + \lambda_{AL} \cdot (J - j_{\pi_K}) + \lambda_{LM} \log p(\mathrm{EOS} \mid e_{I-1}, e_I) \Bigr\} \tag{26}
\end{align}

Here, we have grouped the contributions of the various feature functions into those for each word (from LM and WRD, expression (24)), those for every alignment template (from AT and AL, expression (25)), and those for the end of sentence (expression (26)), which includes a term $\log p(\mathrm{EOS} \mid e_{I-1}, e_I)$ for the end-of-sentence language model probability.

To extend this decision rule for the word penalty (WP) feature function, we simply obtain an additional term $\lambda_{WP}$ for each word. The class-based 5-gram language model (CLM) can be included like the trigram language model. Note that all these feature functions decompose nicely into contributions for each produced target language word or for each covered source language word. This makes it possible to develop an efficient dynamic programming search algorithm. Not all feature functions have this nice property: For the conventional lexicon feature function (LEX), we obtain an additional term in our decision rule which depends on the full sentence. Therefore, this feature function will not be integrated in the dynamic programming search but instead will be used to rerank the set of candidate translations produced by the search.
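To make the reranking step concrete, the sketch below scores candidate translations with a weighted sum of feature values, as in decision rule (22), and picks the best one once a sentence-level feature such as LEX is added. The function names, the feature values, and the scaling factors are invented for illustration only; they are not values from the experiments.

```python
def loglinear_score(features, weights):
    """Weighted sum of feature values: sum over m of lambda_m * h_m."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def rerank(candidates, weights):
    """Return the candidate with the highest log-linear score.

    candidates: list of (translation, feature dict) pairs, for example the
    n-best output of the search with the sentence-level LEX count added.
    """
    return max(candidates, key=lambda cand: loglinear_score(cand[1], weights))


# Hypothetical candidates and scaling factors, chosen only for this example.
candidates = [
    ("candidate A", {"LM": -10.0, "WRD": -8.0, "LEX": 1}),
    ("candidate B", {"LM": -9.5, "WRD": -8.0, "LEX": 0}),
]
search_weights = {"LM": 1.0, "WRD": 1.0}               # features used in search
rerank_weights = {"LM": 1.0, "WRD": 1.0, "LEX": 2.0}   # LEX added for reranking

print(rerank(candidates, search_weights)[0])  # candidate B
print(rerank(candidates, rerank_weights)[0])  # candidate A
```

The example shows why a feature that depends on the full sentence can still influence the final choice even though it is kept out of the dynamic programming search.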

Structure of Search Space

We have to structure the search space in a suitable way to search efficiently. In our search algorithm, we generate search hypotheses that correspond to prefixes of target language sentences. Each hypothesis is the translation of a part of the source language sentence. A hypothesis is extended by appending one target word. The set of all hypotheses can be structured as a graph with a source node representing the sentence start, goal nodes representing complete translations, and intermediate nodes representing partial translations. There is a directed edge between hypotheses $n_1$ and $n_2$ if the hypothesis $n_2$ is obtained by appending one word to hypothesis $n_1$. Each edge has associated costs resulting from the contributions of all feature functions. Finally, our search problem can be reformulated as finding the optimal path through this graph.

In the first step, we determine the set of all source phrases $\tilde{f}$ for which an applicable alignment template exists. Every possible application of an alignment template $z = (F_1^{J'}, E_1^{I'}, \tilde{A})$ to a subsequence $f_j^{j+J'-1}$ of the source sentence is called an alignment template instantiation $Z = (z, j)$. Hence, the set of all alignment template instantiations for the source sentence $f_1^J$ is

$$\bigl\{ Z = (z, j) \bigm| z = (F_1^{J'}, E_1^{I'}, \tilde{A}) \wedge \exists j : p(z \mid f_j^{j+J'-1}) > 0 \bigr\} \tag{27}$$

If the source sentence contains words that have not been seen in the training data, we introduce a new alignment template that performs a one-to-one translation of each of these words by itself.

In the second step, we determine a set of probable target language words for each target word position in the alignment template instantiation. Only these words are then hypothesized in the search. We call this selection of highly probable words observation pruning (Tillmann and Ney 2000). As a criterion for a word $e$ at position $i$ in the alignment template instantiation, we use

$$\delta(E_i, C(e)) \cdot \sum_{j=0}^{J'} \frac{\tilde{A}(i,j)}{\sum_{i'} \tilde{A}(i',j)} \cdot p(e \mid f_j) \tag{28}$$

In our experiments, we hypothesize only the five best-scoring words.

A decision is a triple $d = (Z, e, l)$ consisting of an alignment template instantiation $Z$, the generated word $e$, and the index $l$ of the generated word in $Z$. A hypothesis $n$ corresponds to a valid sequence of decisions $d_1^i$. The possible decisions are as follows:

1. Start a new alignment template: $d_i = (Z_i, e_i, 1)$. In this case, the index $l = 1$. This decision can be made only if the previous decision $d_{i-1}$ finished an alignment template and if the newly chosen alignment template instantiation does not overlap with any previously chosen alignment template instantiation. The resulting decision score corresponds to the contribution of the LM and the WRD features (expression (24)) for the produced word and the contribution of AL and AT features (expression (25)) for the started alignment template.

2. Extend an alignment template: $d_i = (Z_i, e_i, l)$. This decision can be made only if the previous decision uses the same alignment template instantiation and has as index $l - 1$: $d_{i-1} = (Z_i, e_{i-1}, l - 1)$. The resulting decision score corresponds to the contribution of the LM and the WRD features (expression (24)).

3. Finish the translation of a sentence: $d_i = (\mathrm{EOS}, \mathrm{EOS}, 0)$. In this case, the hypothesis is marked as a goal hypothesis. This decision is possible only if the previous decision $d_{i-1}$ finished an alignment template and if the alignment template instantiations completely cover the input sentence. The resulting decision score corresponds to the contribution of expression (26).

Any valid and complete sequence of decisions $d_1^{I+1}$ uniquely corresponds to a certain translation $e_1^I$, a segmentation into $K$ phrases, a phrase alignment $\pi_1^K$, and a sequence of alignment template instantiations $z_1^K$. The sum of the decision scores is equal to the corresponding score described in expressions (24)–(26).

A straightforward representation of all hypotheses would be the prefix tree of all possible sequences of decisions. Obviously, there would be a large redundancy in this search space representation, because there are many search nodes that are indistinguishable in the sense that the subtrees following these search nodes are identical. We can recombine these identical search nodes; that is, we have to maintain only the most probable hypothesis (Bellman 1957).

Figure 7: Algorithm for breadth-first search with pruning.

In general, the criterion for recombining a set of nodes is that the hypotheses can be distinguished by neither language nor translation model. In performing recombination, we obtain a search graph instead of a search tree. The exact criterion for performing recombination for the alignment templates is described in Section 5.5.
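The observation pruning step of this section (expression (28)) can be sketched as follows. The data layout is an assumption made for the example: the template alignment is a 0/1 matrix indexed as `align[i][j]`, the lexicon is a nested dictionary, positions are 0-based, and no separate empty-word position is modeled. All names and probabilities are hypothetical.

```python
def observation_scores(i, target_class, align, source_words, lexicon, word_class):
    """Score candidate words for target position i of one template instantiation.

    align[i][j]  : 1 if target position i is aligned to source position j
    lexicon[f]   : dict mapping target words e to p(e | f)
    word_class[e]: class C(e) of a target word
    """
    scores = {}
    for j, f in enumerate(source_words):
        column_sum = sum(row[j] for row in align)
        if align[i][j] == 0 or column_sum == 0:
            continue
        weight = align[i][j] / column_sum  # share of source word j given to position i
        for e, prob in lexicon.get(f, {}).items():
            if word_class.get(e) == target_class:  # delta(E_i, C(e))
                scores[e] = scores.get(e, 0.0) + weight * prob
    return scores


def observation_pruning(scores, n_best=5):
    """Keep only the n_best highest-scoring words (five in the text)."""
    return sorted(scores, key=scores.get, reverse=True)[:n_best]


# Toy data, invented for illustration.
align = [[1, 0], [0, 1]]
source_words = ["guten", "morgen"]
lexicon = {
    "guten": {"good": 0.7, "well": 0.2},
    "morgen": {"morning": 0.6, "tomorrow": 0.3},
}
word_class = {"good": "ADJ", "well": "ADJ", "morning": "N", "tomorrow": "N"}

scores = observation_scores(1, "N", align, source_words, lexicon, word_class)
print(observation_pruning(scores, n_best=1))  # ['morning']
```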

³ Note that here some of the simplifying notation of Section 4 has been used.

Search Algorithm

Theoretically, we could use any graph search algorithm to search the optimal path in the search space. We use a breadth-first search algorithm with pruning. This approach offers very good possibilities for adjusting the trade-off between quality and efficiency. In pruning, we always compare hypotheses that have produced the same number of target words.

Figure 7 shows a structogram of the algorithm. As the search space increases exponentially, it is not possible to explicitly represent it. Therefore, we represent the search space implicitly, using the functions Extend and Recombine. The function Extend produces new hypotheses extending the current hypothesis by one word. Some hypotheses might be identical or indistinguishable by the language and translation models. These are recombined by the function Recombine. We expand the search space such that only hypotheses with the same number of target language words are recombined.

In the pruning step, we use two different types of pruning. First, we perform pruning relative to the score $\hat{Q}$ of the current best hypothesis. We ignore all hypotheses that have a probability lower than $\log(t_p) + \hat{Q}$, where $t_p$ is an adjustable pruning parameter. This type of pruning can be performed when the hypothesis extensions are computed. Second, in histogram pruning (Steinbiss, Tran, and Ney 1994), we maintain only the best $N_p$ hypotheses. The two pruning parameters $t_p$ and $N_p$ have to be optimized with respect to the trade-off between efficiency and quality.
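The two pruning types can be sketched in a few lines. The function below is a simplified illustration that assumes hypotheses are given as `(log_score, payload)` pairs which have all produced the same number of target words; the function name and this data layout are assumptions of the example.

```python
import math


def prune(hypotheses, t_p, n_p):
    """Apply threshold pruning followed by histogram pruning.

    hypotheses: list of (log_score, payload) pairs
    t_p:        threshold parameter, 0 < t_p <= 1
    n_p:        maximum number of hypotheses to keep
    """
    if not hypotheses:
        return []
    best = max(score for score, _ in hypotheses)   # score of the best hypothesis
    threshold = math.log(t_p) + best
    survivors = [h for h in hypotheses if h[0] >= threshold]
    survivors.sort(key=lambda h: h[0], reverse=True)
    return survivors[:n_p]                         # keep only the best n_p


beam = [(-1.0, "a"), (-2.0, "b"), (-9.0, "c"), (-1.5, "d")]
print(prune(beam, t_p=1e-2, n_p=2))  # [(-1.0, 'a'), (-1.5, 'd')]
```

With `t_p = 1e-2` the threshold lies about 4.6 below the best score, so hypothesis "c" is removed by threshold pruning and "b" by histogram pruning.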

Implementation

In this section, we describe various issues involved in performing an efficient implementation of a search algorithm for the alignment template approach.

A very important design decision in the implementation is the representation of a hypothesis. Theoretically, it would be possible to represent search hypotheses only by the associated decision and a back-pointer to the previous hypothesis. Yet this would be a very inefficient representation for the implementation of the operations that have to be performed in the search. The hypothesis representation should contain all information required to perform efficiently the computations needed in the search but should contain no more information than that, to keep the memory consumption small.

In search, we produce hypotheses $n$, each of which contains the following information:

1. e: the final target word produced

2. h: the state of the language model (to predict the following word)

3. $c = c_1^J$: the coverage vector representing the already covered positions of the source sentence ($c_j = 1$ means the position $j$ is covered, $c_j = 0$ means the position $j$ is not covered)

4. Z: a reference to the alignment template instantiation that produced the final target word

5. l: the position of the final target word in the alignment template instantiation

6. Q ( n ) : the accumulated score of all previous decisions

7. n′: a reference to the previous hypothesis

Using this representation, we can perform the following operations very efficiently:

• Determining whether a specific alignment template instantiation can be used to extend a hypothesis. To do this, we check whether the positions of the alignment template instantiation are still free in the hypothesis coverage vector.

• Checking whether a hypothesis is final. To do this, we determine whether the coverage vector contains no uncovered position. Using a bit vector as representation, the operation to check whether a hypothesis is final can be implemented very efficiently.

• Checking whether two hypotheses can be recombined. The criterion for recombining two hypotheses $n_1 = (e_1, h_1, c_1, Z_1, l_1)$ and $n_2 = (e_2, h_2, c_2, Z_2, l_2)$ is

\begin{align*}
& h_1 = h_2 && \text{(identical language model state)} \\
\wedge\ & c_1 = c_2 && \text{(identical coverage vector)} \\
\wedge\ & \bigl( (Z_1 = Z_2 \wedge l_1 = l_2) && \text{(alignment template instantiation is identical)} \\
& \ \vee\ (J(Z_1) = l_1 \wedge J(Z_2) = l_2) \bigr) && \text{(alignment template instantiation finished)}
\end{align*}

We compare in beam search those hypotheses that cover different parts of the input sentence. This makes the comparison of the probabilities problematic. Therefore, we integrate an admissible estimation of the remaining probabilities to arrive at a complete translation. Details of the heuristic function for the alignment templates are provided in the next section.
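The hypothesis representation and the three operations of this section can be sketched as follows. This is one possible layout under stated assumptions, not the original implementation: the coverage vector is stored as an integer bit mask, positions are 0-based, an instantiation counts as finished once its last target word has been produced, and all class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Instantiation:
    template_id: int
    start: int        # first covered source position (0-based)
    source_len: int   # number of source words covered
    target_len: int   # number of target words produced

    def mask(self) -> int:
        """Bit mask of the source positions covered by this instantiation."""
        return ((1 << self.source_len) - 1) << self.start


@dataclass
class Hypothesis:
    word: str                       # e: final target word produced
    lm_state: Tuple[str, ...]       # h: language model state
    coverage: int                   # c: coverage vector as bit mask
    inst: Optional[Instantiation]   # Z: instantiation that produced the word
    pos: int                        # l: position of the word within Z
    score: float                    # Q(n): accumulated score
    prev: Optional["Hypothesis"] = None  # n': previous hypothesis


def can_start(hyp: Hypothesis, inst: Instantiation) -> bool:
    """An instantiation is usable if none of its positions is covered yet."""
    return (hyp.coverage & inst.mask()) == 0


def is_final(hyp: Hypothesis, J: int) -> bool:
    """A hypothesis is final if all J source positions are covered."""
    return hyp.coverage == (1 << J) - 1


def recombination_key(hyp: Hypothesis):
    """Hypotheses with equal keys satisfy the recombination criterion."""
    finished = hyp.inst is None or hyp.pos == hyp.inst.target_len
    tail = None if finished else (hyp.inst, hyp.pos)
    return (hyp.lm_state, hyp.coverage, tail)
```

During recombination, hypotheses could be stored in a dictionary indexed by `recombination_key`, keeping only the entry with the highest `score` for each key.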

Heuristic Function

To improve the comparability of search hypotheses, we introduce heuristic functions. A heuristic function estimates the probabilities of reaching the goal node from a certain search node. An admissible heuristic function is always an optimistic estimate; that is, for each search node, the product of edge probabilities of reaching a goal node is always equal to or smaller than the estimated probability. For an A*-based search algorithm, a good heuristic function is crucial to being able to translate long sentences. For a beam search algorithm, the heuristic function has a different motivation. It is used to improve the scoring of search hypotheses. The goal is to make the probabilities of all hypotheses more comparable, in order to minimize the chance that the hypothesis leading to the optimal translation is pruned away.

Heuristic functions for search in statistical MT have been used in Wang and Waibel (1997) and Och, Ueffing, and Ney (2001). Wang and Waibel (1997) have described a simple heuristic function for Model 2 of Brown et al. (1993) that was not admissible. Och, Ueffing, and Ney (2001) have described an admissible heuristic function for Model 4 of Brown et al. (1993) and an almost-admissible heuristic function that is empirically obtained.

We have to keep in mind that a heuristic function is helpful only if the overhead introduced in computing the heuristic function is more than compensated for by the gain obtained through a better pruning of search hypotheses. The heuristic functions described in the following are designed such that their computation can be performed efficiently.

The basic idea for developing a heuristic function for an alignment model is that all source sentence positions that have not been covered so far still have to be translated to complete the sentence. If we have an estimation $r_X(j)$ of the optimal score for translating position $j$, then the value of the heuristic function $R_X(n)$ for a node $n$ can be inferred by summing over the contribution for every position $j$ that is not in the coverage vector $c(n)$ (here $X$ denotes different possibilities to choose the heuristic function):

$$R_X(n) = \sum_{j \notin c(n)} r_X(j) \tag{29}$$

The situation in the case of the alignment template approach is more complicated, as not every word is translated alone, but typically the words are translated in context. Therefore, the basic quantity for the heuristic function in the case of the alignment template approach is a function $r(Z)$ that assigns to every alignment template instantiation $Z$ a maximal probability. Using $r(Z)$, we can induce a position-dependent heuristic function $r(j)$:

$$r(j) := \max_{Z :\, j(Z) \le j \le j(Z)+J(Z)-1} \frac{r(Z)}{J(Z)} \tag{30}$$

Here, $J(Z)$ denotes the number of source language words produced by the alignment template instantiation $Z$ and $j(Z)$ denotes the position of the first source language word. It can be easily shown that if $r(Z)$ is admissible, then $r(j)$ is also admissible. We have to show that for all nonoverlapping sequences $Z_1^K$ the following holds:

$$\sum_{k=1}^{K} r(Z_k) \le \sum_{j \in c(Z_1^K)} r(j) \tag{31}$$

Here, $c(Z_1^K)$ denotes the set of all positions covered by the sequence of alignment templates $Z_1^K$. This can be shown easily:

\begin{align}
\sum_{k=1}^{K} r(Z_k) &= \sum_{k=1}^{K} \sum_{j=1}^{J(Z_k)} \frac{r(Z_k)}{J(Z_k)} \tag{32} \\
&= \sum_{j \in c(Z_1^K)} \frac{r(Z_{k(j)})}{J(Z_{k(j)})} \tag{33} \\
&\le \sum_{j \in c(Z_1^K)} \max_{Z :\, j(Z) \le j \le j(Z)+J(Z)-1} \frac{r(Z)}{J(Z)} \tag{34}
\end{align}

Here, $k(j)$ denotes the phrase index $k$ that includes the target language word position $j$.

In the following, we develop various heuristic functions of increasing complexity. The simplest realization of a heuristic function $r(Z)$ takes into account only the prior probability of an alignment template instantiation:

$$R_{AT}(Z = (z, j)) = \lambda_{AT} \cdot \log p(z \mid f_j^{j+J(z)-1}) \tag{35}$$

The lexicon model can be integrated as follows:

$$R_{WRD}(Z) = \lambda_{WRD} \cdot \sum_{j'=j(Z)}^{j(Z)+J(Z)-1} \max_{e} \log p(e \mid f_{j'}) \tag{36}$$

The language model can be incorporated by considering that for each target word there exists an optimal language model probability:

$$p_L(e) = \max_{e', e''} p(e \mid e', e'') \tag{37}$$

Here, we assume a trigram language model. In general, it is necessary to maximize over all possible different language model histories. We can also combine the language model and the lexicon model into one heuristic function:

$$R_{WRD+LM}(Z) = \sum_{j'=j(Z)}^{j(Z)+J(Z)-1} \max_{e} \bigl[ \lambda_{WRD} \log p(e \mid f_{j'}) + \lambda_{LM} \log p_L(e) \bigr] \tag{38}$$

Figure 8: Algorithm min-jumps to compute the minimum number of needed jumps $D(c_1^J, j)$ to complete the translation.

To include the phrase alignment probability in the heuristic function, we compute the minimum sum of all jump widths that is needed to complete the translation. This sum can be computed efficiently using the algorithm shown in Figure 8. Then, an admissible heuristic function for the jump width is obtained by

$$R_{AL}(c, j) = \lambda_{AL} \cdot D(c, j) \tag{39}$$

Combining all the heuristic functions for the various models, we obtain as final heuristic function for a search hypothesis $n$

$$R(n) = R_{AL}(c(n), j(n)) + \sum_{j \notin c(n)} \bigl( R_{AT}(j) + R_{WRD+LM}(j) \bigr) \tag{40}$$
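The step from instantiation scores $r(Z)$ to position-dependent scores $r(j)$ (equation (30)) and the summation over uncovered positions (equation (29)) can be sketched as follows. The tuple layout `(start, length, r_Z)`, the 0-based positions, and the example scores are assumptions made for this illustration; the jump-width term of equation (39) is not included.

```python
def position_heuristic(instantiations, J):
    """Compute r(j) as the maximum of r(Z) / J(Z) over all Z covering j.

    instantiations: list of (start, length, r_Z), where r_Z is a log score
    J:              number of source positions
    """
    r = [float("-inf")] * J
    for start, length, r_z in instantiations:
        share = r_z / length  # distribute the score evenly over the covered positions
        for j in range(start, start + length):
            r[j] = max(r[j], share)
    return r


def rest_cost(r, coverage):
    """Sum r(j) over all positions that are not yet covered."""
    return sum(r_j for r_j, covered in zip(r, coverage) if not covered)


# Hypothetical log scores for instantiations over a three-word source sentence.
instantiations = [(0, 1, -1.0), (1, 1, -2.0), (0, 2, -2.0), (2, 1, -0.5)]
r = position_heuristic(instantiations, 3)
print(r)                                    # [-1.0, -1.0, -0.5]
print(rest_cost(r, [True, False, False]))   # -1.5
```

In the toy example, covering all three positions with the two-word and the one-word instantiation scores -2.5, which does not exceed the estimate of -2.5 summed over the positions, in line with inequality (31).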

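Table 2
Statistics for Verbmobil task: training corpus (Train), conventional dictionary (Lex), development corpus (Dev), test corpus (Test) (Words*: words without punctuation marks).

                              No Preprocessing        With Preprocessing
                              German      English     German      English
Train   Words*                418,974     453,612     420,919     450,297
        Singletons              3,453       1,698       3,570       1,763
        Vocabulary              7,940       4,673       8,102       4,780
Dev     Trigram perplexity          -        28.1           -        26.3
Test    Trigram perplexity          -        30.5           -        29.9

Results

Results on the Verbmobil Task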

We present results on the Verbmobil task, which is a speech translation task in the domain of appointment scheduling, travel planning, and hotel reservation (Wahlster 2000). Table 2 shows the statistics for this task. We use a training corpus, which is used to train the alignment template model and the language models, a development corpus, which is used to estimate the model scaling factors, and a test corpus. On average, 3.32 reference translations for the development corpus and 5.14 reference translations for the test corpus are used.

A standard vocabulary had been defined for the various speech recognizers used in Verbmobil. However, not all words of this vocabulary were observed in the training corpus. Therefore, the translation vocabulary was extended semiautomatically by adding about 13,000 German–English entries from an online bilingual lexicon available on the Web. The resulting lexicon contained not only word-word entries, but also multi-word translations, especially for the large number of German compound words. To counteract the sparseness of the training data, a couple of straightforward rule-based preprocessing steps were applied before any other type of processing:

• normalization of
  • numbers
  • time and date phrases
  • spelling (e.g., don't → do not)

• splitting of German compound words.

So far, in machine translation research there is no generally accepted criterion for the evaluation of experimental results. Therefore, we use various criteria. In the following experiments, we use:

• WER (word error rate)/mWER (multireference word error rate): The WER is computed as the minimum number of substitution, insertion, and deletion operations that have to be performed to convert the generated sentence into the target sentence. In the case of the multireference word error rate, for each test sentence, not just a single reference translation is used, as for the WER, but a whole set of reference translations. For each translation hypothesis, the edit distance to the most similar sentence is calculated (Nießen et al. 2000).

• PER (position-independent WER): A shortcoming of the WER is the fact that it requires a perfect word order. An acceptable sentence can have a word order different from that of the target sentence, so the WER measure alone could be misleading. To overcome this problem, we introduce as an additional measure the position-independent word error rate. This measure compares the words in the two sentences, ignoring the word order.

• BLEU (bilingual evaluation understudy) score: This score measures the precision of unigrams, bigrams, trigrams, and 4-grams with respect to the whole set of reference translations, with a penalty for too-short sentences (Papineni et al. 2001). Unlike all other evaluation criteria used here, BLEU measures accuracy, that is, the opposite of error rate. Hence, the larger BLEU scores, the better.

In the following, we analyze the effect of various system components: alignment template length, search pruning, and language model n-gram size. A systematic evaluation of the alignment template system comparing it with other translation approaches (e.g., rule-based) has been performed in the Verbmobil project and is described in Tessiore and von Hahn (2000). There, the alignment-template-based system achieved a significantly larger number of "approximately correct" translations than the competing translation systems (Ney, Och, and Vogel 2001).

6.1.1 Effect of Alignment Template Length. Table 3 shows the effect of constraining the maximum length of the alignment templates in the source language. Typically, it is necessary to restrict the alignment template length to keep memory requirements low. We see that using alignment templates with only one or two words in the source languages results in very bad translation quality. Yet using alignment templates with lengths as small as three words yields optimal results.

6.1.2 Effect of Pruning and Heuristic Function. In the following, we analyze the effect of beam search pruning and of the heuristic function. We use the following criteria:

• Number of search errors: A search error occurs when the search algorithm misses the most probable translation and produces a translation which is less probable. As we typically cannot efficiently compute the probability of the optimal translation, we cannot efficiently compute the number of search errors. Yet we can compute a lower bound on the number of search errors by comparing the translation found under specific pruning thresholds with the best translation that we have found using very conservative pruning thresholds.

• Average translation time per sentence: Pruning is used to adjust the trade-off between efficiency and quality. Hence, we present the average time needed to translate one sentence of the test corpus.

• Translation quality (mWER, BLEU): Typically, a sentence can have many different correct translations. Therefore, a search error does not necessarily result in poorer translation quality. It is even possible that a search error can improve translation quality. Hence, we analyze the effect of search on translation quality, using the automatic evaluation criteria mWER and BLEU.

Table 3
Effect of alignment template length on translation quality.

AT length   PER [%]   mWER [%]   BLEU [%]
1           29.8      39.9       44.6
2           27.0      33.0       53.6
3           26.5      30.7       56.1
4           26.9      31.4       55.7
5           26.8      31.4       55.7
6           26.5      30.9       56.0
7           26.5      30.9       56.1

Tables 4 and 5 show the effect of the pruning parameter t_p with the histogram pruning parameter N_p = 50,000. Tables 6 and 7 show the effect of the pruning parameter N_p with the pruning parameter t_p = 10^-12. In all four tables, we provide the results for using no heuristic function and three variants of an increasingly informative heuristic function. The first is an estimate of the alignment template and the lexicon probability (AT+WRD), the second adds an estimate of the language model (+LM) probability, and the third also adds the alignment probability (+AL). These heuristic functions are described in Section 5.6.

Table 4
Effect of pruning parameter t_p and heuristic function on search efficiency for direct-translation model (N_p = 50,000).

Table 5
Effect of pruning parameter t_p and heuristic function on error rate for direct-translation model (N_p = 50,000). Columns: mWER and BLEU [%] for no heuristic function, AT+WRD, +LM, and +AL.

Table 6
Effect of pruning parameter N_p and heuristic function on search efficiency for direct-translation model (t_p = 10^-12).

          AT+WRD                     +LM                        +AL
N_p       time [s]   search errors   time [s]   search errors   time [s]   search errors
1         0.0        238             0.0        238             0.0        232
10        0.0        154             0.0        148             0.0        98
100       0.3        69              0.3        60              0.2        21
1,000     2.3        33              2.4        27              2.0        5
10,000    18.3       10              21.1       5               14.3       1
50,000    119.2      5               146.2      2               75.2       0

Table 7
Effect of pruning parameter N_p and heuristic function on error rate for direct-translation model (t_p = 10^-12).
N_p: 1, 10, 100, 1,000, 10,000, 50,000.
No heuristic function (mWER and BLEU values): 64.4 46.6 46.9 29.7 37.8 41.0 51.5 49.8 35.5 35.7 53.1 52.9

Without a heuristic function, even more than a hundred seconds per sentence cannot guarantee search-error-free translation. We draw the conclusion that a good heuristic function is very important to obtaining an efficient search algorithm.

In addition, the search errors have a more severe effect on the error rates if we do not use a heuristic function. If we compare the error rates in Table 7, which correspond to about 55 search errors in Table 6, we obtain an mWER of 36.7% (53 search errors) using no heuristic function and an mWER of 32.6% (57 search errors) using the combined heuristic function. The reason is that without a heuristic function, often the "easy" part of the input sentence is translated first. This yields severe reordering errors.

Table 8
Effect of the length of the language model history (Unigram/Bigram/Trigram: word-based; CLM: class-based 5-gram). Columns: language model type, PP, PER [%], mWER [%], BLEU [%]. Rows: Unigram, Bigram, Trigram, Trigram + CLM.

6.1.3 Effect of the Length of the Language Model History. In this work, we use only n-gram-based language models. Ideally, we would like to take into account long-range dependencies. Yet long n-grams are seen rarely and are therefore rarely used on unseen data. Therefore, we expect that extending the history length will at some point not improve translation quality further.

Table 8 shows the effect of the length of the language model history on translation quality. We see that the language model perplexity improves from 4,781 for a unigram model to 29.9 for a trigram model. The corresponding translation quality improves from an mWER of 45.9% to an mWER of 31.8%. The largest effect seems to come from taking into account the bigram dependence, which achieves an mWER of 32.9%. If we perform log-linear interpolation of a trigram model with a class-based 5-gram model, we observe an additional small improvement in translation quality to an mWER of 30.9%.
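The log-linear interpolation of a trigram model with a class-based 5-gram model mentioned above can be illustrated with a minimal sketch. It assumes each language model is available as a log-probability for a candidate sentence; the function name, the weights, and the toy probabilities are hypothetical and are not taken from the system described here.

```python
import math

def loglinear_lm_logscore(logp_trigram, logp_class5gram,
                          w_trigram=1.0, w_class=0.5):
    """Weighted sum of log-probabilities, i.e. the log of the unnormalised
    product p_trigram**w_trigram * p_class**w_class (weights are illustrative)."""
    return w_trigram * logp_trigram + w_class * logp_class5gram

# Toy candidates: (trigram log-probability, class-based 5-gram log-probability).
candidates = {
    "the house is small": (math.log(1e-4), math.log(1e-3)),
    "small the house is": (math.log(2e-4), math.log(1e-6)),
}

# The trigram model alone would prefer the second candidate; the interpolated
# score prefers the first because the class-based model penalises the second.
best = max(candidates, key=lambda s: loglinear_lm_logscore(*candidates[s]))
```

In a log-linear translation model, each language model would simply be one more feature function with its own weight, which is why this kind of combination needs no change to the search itself.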

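The lower bound on the number of search errors used as an evaluation criterion in this section can be sketched as follows. The sketch assumes that, for every test sentence, the model score of the best translation found is recorded once under the pruning setting being tested and once under very conservative pruning; the function and argument names are hypothetical.

```python
def search_error_lower_bound(tested_scores, conservative_scores, tol=1e-9):
    """Count sentences for which the conservative run found a translation with
    a strictly higher model score than the tested pruning setting did.

    This is only a lower bound: the conservative run may itself have missed
    the most probable translation.
    """
    if len(tested_scores) != len(conservative_scores):
        raise ValueError("both runs must cover the same sentences")
    return sum(1 for t, c in zip(tested_scores, conservative_scores)
               if t < c - tol)

# Toy log-scores for three sentences; only the second one is a search error.
tested = [-10.0, -12.5, -8.0]
conservative = [-10.0, -11.0, -8.0]
n_errors = search_error_lower_bound(tested, conservative)
```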
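Table 4, values (time per sentence [s] and number of search errors):

          no heuristic function   AT+WRD              +LM                 +AL
t_p       time [s]   errors       time [s]   errors   time [s]   errors   time [s]   errors
10^-2
10^-4     0.2        97           0.2        57       0.3        40       0.2        13
10^-6     2.0        61           2.8        21       4.1        11       1.8        3
10^-8     11.9       41           15.0       7        19.9       5        9.5        1
10^-10    45.6       38           50.9       6        65.2       3        32.0       1
10^-12    114.6      34           119.2      5        146.2      2        75.2       0

Table 5, values (mWER and BLEU [%]):

          AT+WRD          +LM             +AL
t_p       mWER   BLEU     mWER   BLEU     mWER   BLEU
10^-2
10^-4     35.0   53.8     32.3   55.0     30.7   55.9
10^-6     31.8   55.0     30.9   55.6     30.8   56.0
10^-8     31.4   55.7     31.2   55.7     30.9   56.0
10^-10    31.3   55.8     31.0   55.9     30.8   56.0
10^-12    31.2   55.9     31.0   55.9     30.8   56.0

No heuristic function (mWER and BLEU values): 46.8 39.8 50.9 37.1 51.3 35.7 53.0 36.1 52.9 35.7 52.9

Table 8, values:

Language model type   PP      PER [%]   mWER [%]   BLEU [%]
Unigram               4,781             45.9
Bigram                38.3    26.9      32.9       53.0
Trigram               29.9    26.8      31.8       55.2
Trigram + CLM         -       26.5      30.9       56.1

Results on the Hansards task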

The Hansards task involves the proceedings of the Canadian parliament, which are kept by law in both French and English. About three million parallel sentences of this bilingual data have been made available by the Linguistic Data Consortium (LDC). Here, we use a subset of the data containing only sentences of up to 30 words. Table 9 shows the training and test corpus statistics.

The results for French to English and for English to French are shown in Table 10. Because of memory limitations, the maximum alignment template length has been restricted to four words. We compare here against the single-word-based search for Model 4 described in Tillmann (2001). We see that the alignment template approach obtains significantly better results than the single-word-based search.
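The Hansards results are reported as WER and PER. As a rough illustration of what these two criteria measure, the sketch below computes both for a single hypothesis against a single reference. It assumes the common formulations (word-level edit distance for WER, a bag-of-words comparison for PER, both normalised by the reference length); these definitions and the function names are assumptions rather than the exact evaluation code used for Table 10.

```python
from collections import Counter

def wer(hyp, ref):
    """Word error rate: word-level Levenshtein distance / reference length."""
    prev = list(range(len(ref) + 1))          # distances for the empty hypothesis
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(prev[j] + 1,              # delete hypothesis word
                           cur[j - 1] + 1,           # insert reference word
                           prev[j - 1] + (h != r)))  # substitute or match
        prev = cur
    return prev[-1] / len(ref)

def per(hyp, ref):
    """Position-independent error rate: like WER, but word order is ignored."""
    matches = sum((Counter(hyp) & Counter(ref)).values())
    return (max(len(hyp), len(ref)) - matches) / len(ref)

hyp = "the house is small".split()
ref = "the small house".split()
# wer(hyp, ref) is 1.0 (three edits over three reference words), whereas
# per(hyp, ref) is 1/3, because only the extra word "is" counts once order is ignored.
```

The gap between the two numbers on the same sentence pair is what makes PER useful next to WER: it separates lexical choice errors from word order errors.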

Results on Chinese–English

Various statistical, example-based, and rule-based MT systems for a Chinese–English news domain were evaluated in the NIST 2002 MT evaluation. Using the alignment template approach described in this article, we participated in these evaluations.

The problem domain is the translation of Chinese news text into English. Table 11 gives an overview on the training and test data. The English vocabulary consists of full-form words that have been converted to lowercase letters. The number of sentences has been artificially increased by adding certain parts of the original training material more than once to the training corpus, in order to give larger weight to those parts of the training corpus that consist of high-quality aligned Chinese news text and are therefore expected to be especially helpful for the translation of the test data.

The Chinese language poses special problems because the boundaries of Chinese words are not marked. Chinese text is provided as a sequence of characters, and it is unclear which characters have to be grouped together to obtain entities that can be interpreted as words. For statistical MT, it would be possible to ignore this fact and treat the Chinese characters as elementary units and translate them into English. Yet preliminary experiments showed that the existing alignment models produce better results if the Chinese characters are segmented in a preprocessing step into single words. We use the LDC segmentation tool.

For the English corpus, the following preprocessing steps are applied. First, the corpus is tokenized; it is then segmented into sentences, and all uppercase characters are converted to lowercase. As the final evaluation criterion does not distinguish case, it is not necessary to deal with the case information. Then, the preprocessed Chinese and English corpora are sentence aligned. From the resulting corpus, we automatically replace translations in which the lengths of the source and target sentences are significantly different. In addition, only sentences with less than 60 words in English and Chinese are used.

To improve the translation of Chinese numbers, we use a categorization of Chinese number and date expressions. For the statistical learning, all number and date expressions are replaced with one of two generic symbols, $number or $date. The number and date expressions are subjected to a rule-based translation by simple lexicon lookup. The translation of the number and date expressions is inserted into the output using the alignment information. For Chinese and English, this categorization is implemented independently of the other language.

To evaluate MT quality on this task, NIST made available the NIST-09 evaluation tool. This tool provides a modified BLEU score by computing a weighted precision of n-grams modified by a length penalty for very short translations. Table 12 shows the results of the official evaluation performed by NIST in June 2002. With a score of 7.65, the results obtained were statistically significantly better than any other competing approach. Differences in the NIST score larger than 0.12 are statistically significant at the 95% level. We conclude that the developed alignment template approach is also applicable to unrelated language pairs such as Chinese–English and that the developed statistical models indeed seem to be largely language-independent. Table 13 shows various example translations.

Table 9
Corpus statistics for Hansards task (Words*: words without punctuation marks). Columns: French, English. Rows: Training (Sentences, Words, Words*, Vocabulary, Singletons); Test (Sentences, Words, Trigram perplexity).

Table 10
Translation results on the Hansards task.

                                                         English → French
Translation approach                                     WER [%]       PER [%]
Alignment templates                                      60.9          47.9
Single-word-based (monotone search, reordering search)   66.0, 66.6    54.4, 56.3

Table 11
Corpus statistics for Chinese–English corpora, large data track (Words*: words without punctuation marks). Columns: Chinese and English with no preprocessing; Chinese and English with preprocessing. Rows: Train (Sentences, Unique sentences, Words, Words*, Vocabulary, Singletons); Lex (Entries, Extended vocabulary); Dev (Sentences, Words, Trigram perplexity); Test (Sentences, Words).

Table 12
Results of Chinese–English NIST MT evaluation, June 2002, large data track (NIST-09 score: larger values are better).

System                                          NIST-09 score
Alignment template approach                     7.65
Competing research systems                      5.03–7.34
Best of six commercial off-the-shelf systems    6.08
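The number categorization step described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: it handles only digit-based number tokens (not Chinese numerals or dates), the regular expression and function names are hypothetical, and the translated numbers are re-inserted in left-to-right order, whereas the system described here uses the alignment information for that step.

```python
import re

# Hypothetical pattern: digit groups optionally separated by "." or ",".
NUMBER = re.compile(r"^\d+(?:[.,]\d+)*$")

def categorize(tokens):
    """Replace number tokens by the generic symbol $number.

    Returns the categorized tokens and the original number strings, so that
    they can be translated separately (for example by lexicon lookup).
    """
    out, originals = [], []
    for tok in tokens:
        if NUMBER.match(tok):
            out.append("$number")
            originals.append(tok)
        else:
            out.append(tok)
    return out, originals

def restore(tokens, translations):
    """Re-insert translated numbers, assuming their order is preserved."""
    it = iter(translations)
    return [next(it) if tok == "$number" else tok for tok in tokens]

source = "about 46.959 billion in 1995".split()
categorized, numbers = categorize(source)
# categorized: ['about', '$number', 'billion', 'in', '$number']
roundtrip = restore(categorized, numbers)
```

Replacing every number by one symbol means the statistical models see a single frequent token instead of many rare ones, which is the motivation for the categorization.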

Table 9, values:

                                French        English
Training   Words                              22,163,092
           Words*               22,175,069    20,063,378
           Vocabulary           100,269       78,332
           Singletons           40,199        31,319
Test       Words                97,646        88,773
           Trigram perplexity   -             179.8

Table 11, values:
31,175,023 67,103 15,324 76,182 26,361 - 24,540 1,645,631 29,212,384 33,044,374 26,828,721 30,849,149 92,488 24,933 100,704 80,977 32,267 993 237,154 - 878 5,336 54,190 25,852 - 24,144 32,511,418 26,344 85,116 93,350 171,922 31,607 -

Conclusions

We have presented a framework for statistical MT for natural languages which is more general than the widely used source–channel approach. It allows a baseline MT system to be extended easily by adding new feature functions. We have described the alignment template approach for statistical machine translation, which uses two different alignment levels: a phrase-level alignment between phrases and a word-level alignment between single words. As a result the context of words has a greater influence, and the changes in word order from source to target language can be learned explicitly. An advantage of this method is that machine translation is learned fully automatically through the use of a bilingual training corpus. We have shown that the presented approach is capable of achieving better translation results on various tasks compared to other statistical, example-based, or rule-based translation systems. This is especially interesting, as our system is structured simpler than many competing systems.

We expect that better translation can be achieved by using models that go beyond the flat phrase segmentation that we perform in our model. A promising avenue is to gradually extend the model to take into account to some extent the recursive structure of natural languages using ideas from Wu and Wong (1998) or Alshawi, Bangalore, and Douglas (2000). We expect other improvements as well from learning nonconsecutive phrases in source or target language and from better generalization methods for the learned-phrase pairs.

Table 13
Example translations for Chinese–English MT.

Reference:   Significant Accomplishment Achieved in the Economic Construction of the Fourteen Open Border Cities in China
Translation: The opening up of the economy of China's fourteen City made significant achievements in construction

Reference:   Xinhua News Agency, Beijing, Feb. 12 - Exciting accomplishment has been achieved in 1995 in the economic construction of China's fourteen border cities open to foreigners.
Translation: Xinhua News Agency, Beijing, February 12 - China's opening up to the outside world of the 1995 in the fourteen border pleased to obtain the construction of the economy.

Reference:   Foreign Investment in Jiangsu's Agriculture on the Increase
Translation: To increase the operation of foreign investment in Jiangsu agriculture

Reference:   According to the data provided today by the Ministry of Foreign Trade and Economic Cooperation, as of November this year, China has actually utilized 46.959 billion US dollars of foreign capital, including 40.007 billion US dollars of direct investment from foreign businessmen.
Translation: The external economic and trade cooperation Department today provided that this year, the foreign capital actually utilized by China on November to US $46.959 billion, including of foreign company direct investment was US $40.007 billion.

Reference:   According to officials from the Provincial Department of Agriculture and Forestry of Jiangsu, the "Three-Capital" ventures approved by agencies within the agricultural system of Jiangsu Province since 1994 have numbered more than 500 and have utilized over 700 million US dollars worth of foreign capital, respectively three times and seven times more than in 1993.
Translation: Jiangsu Province for the Secretaries said that, from the 1994 years, Jiangsu Province system the approval of the "three-funded" enterprises, there are more than 500, foreign investment utilization rate of more than US $700 million, 1993 years before three and seven.

Reference:   The actual amount of foreign capital has also increased more than 30% as compared with the same period last year.
Translation: The actual amount of foreign investment has increased by more than 30% compared with the same period last year.

Reference:   Import and Export in Pudong New District Exceeding 9 billion US dollars This Year
Translation: Foreign trade imports and exports of this year to the Pudong new Region exceeds US $9 billion

Acknowledgments

The work reported here was carried out while the first author was with the Lehrstuhl für Informatik VI, Computer Science Department, RWTH Aachen–University of Technology.

References

Alshawi, Hiyan, Srinivas Bangalore, and Shona Douglas. 2000. Learning dependency translation models as collections of finite state head transducers. Computational Linguistics, 26(1):45–60.
Bellman, Richard. 1957. Dynamic Programming. Princeton University Press, Princeton.
Berger, Adam L., Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, John R. Gillett, John D. Lafferty, Harry Printz, and Lubos Ureš. 1994. The Candide system for machine translation. In Proceedings of the ARPA Workshop on Human Language Technology, pages 157–162, Plainsboro, NJ, March.
Berger, Adam L., Stephen A. Della Pietra, and Vincent J. Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–72.
Brown, Peter F., J. Cocke, Stephen A. Della Pietra, Vincent J. Della Pietra, Frederick Jelinek, John D. Lafferty, Robert L. Mercer, and Paul S. Roossin. 1990. A statistical approach to machine translation. Computational Linguistics, 16(2):79–85.
Brown, Peter F., Stephen A. Della Pietra, Vincent J. Della Pietra, and R. L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.
Charniak, Eugene, Kevin Knight, and Kenji Yamada. 2003. Syntax-based language models for machine translation. In MT Summit IX, pages 40–46, New Orleans, September.
Darroch, J. N. and D. Ratcliff. 1972. Generalized iterative scaling for log-linear models. Annals of Mathematical Statistics, 43:1470–1480.
Dempster, A. P., N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–22.
García-Varea, Ismael, Francisco Casacuberta, and Hermann Ney. 1998. An iterative, DP-based search algorithm for statistical machine translation. In Proceedings of the International Conference on Spoken Language Processing (ICSLP'98), pages 1235–1238, Sydney, November.
García-Varea, Ismael, Franz Josef Och, Hermann Ney, and Francisco Casacuberta. 2001. Refined lexicon models for statistical machine translation using a maximum entropy approach. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL), pages 204–211, Toulouse, France, July.
Germann, Ulrich, Michael Jahr, Kevin Knight, Daniel Marcu, and Kenji Yamada. 2001. Fast decoding and optimal decoding for machine translation. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL), pages 228–235, Toulouse, France, July.
Gildea, Daniel. 2003. Loosely tree-based alignment for machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL), pages 80–87, Sapporo, Japan, July.
Knight, Kevin. 1999. Decoding complexity in word-replacement translation models. Computational Linguistics, 25(4):607–615.
Koehn, Philipp, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the Human Language Technology and North American Association for Computational Linguistics Conference (HLT/NAACL), pages 127–133, Edmonton, Alberta.
Marcu, Daniel and William Wong. 2002. A phrase-based, joint probability model for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-2002), pages 133–139, Philadelphia, July.
Ney, Hermann. 1995. On the probabilistic-interpretation of neural-network classifiers and discriminative training criteria. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(2):107–119.
Ney, Hermann, Margit Generet, and Frank Wessel. 1995. Extensions of absolute discounting for language modeling. In Proceedings of the Fourth European Conference on Speech Communication and Technology, pages 1245–1248, Madrid, September.
Ney, Hermann, Franz Josef Och, and Stephan Vogel. 2001. The RWTH system for statistical translation of spoken dialogues. In Proceedings of the ARPA Workshop on Human Language Technology, San Diego, March.
Nießen, Sonja, Franz Josef Och, Gregor Leusch, and Hermann Ney. 2000. An evaluation tool for machine translation: Fast evaluation for machine translation research. In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC), pages 39–45, Athens, May.
Nießen, Sonja, Stephan Vogel, Hermann Ney, and Christoph Tillmann. 1998. A DP-based search algorithm for statistical machine translation. In COLING-ACL '98: 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, pages 960–967, Montreal, August.
Och, Franz Josef. 1999. An efficient method for determining bilingual word classes. In EACL '99: Ninth Conference of the European Chapter of the Association for Computational Linguistics, pages 71–76, Bergen, Norway, June.
Och, Franz Josef. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL), pages 160–167, Sapporo, Japan, July.
Och, Franz Josef and Hermann Ney. 2002. Discriminative training and maximum entropy models for statistical machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 295–302, Philadelphia, July.
Och, Franz Josef and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.
Och, Franz Josef, Nicola Ueffing, and Hermann Ney. 2001. An efficient A* search algorithm for statistical machine translation. In Data-Driven Machine Translation Workshop, pages 55–62, Toulouse, France, July.
Papineni, Kishore A., Salim Roukos, and R. Todd Ward. 1997. Feature-based language understanding. In European Conference on Speech Communication and Technology, pages 1435–1438, Rhodes, Greece, September.
Papineni, Kishore A., Salim Roukos, and R. Todd Ward. 1998. Maximum likelihood and discriminative training of direct translation models. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pages 189–192, Seattle, May.
Papineni, Kishore A., Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2001. Bleu: A method for automatic evaluation of machine translation. Technical Report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, NY.
Steinbiss, Volker, Bach-Hiep Tran, and Hermann Ney. 1994. Improvements in beam search. In Proceedings of the International Conference on Spoken Language Processing (ICSLP'94), pages 2143–2146, Yokohama, Japan, September.
Tessiore, Lorenzo and Walther von Hahn. 2000. Functional validation of a machine interpretation system: Verbmobil. In Wolfgang Wahlster, editor, Verbmobil: Foundations of Speech-to-Speech Translations, pages 611–631. Springer, Berlin.
Tillmann, Christoph. 2001. Word Re-ordering and Dynamic Programming Based Search Algorithms for Statistical Machine Translation. Ph.D. thesis, Computer Science Department, RWTH Aachen, Germany.
Tillmann, Christoph. 2003. A projection extension algorithm for statistical machine translation. In Michael Collins and Mark Steedman, editors, Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 1–8, Sapporo, Japan.
Tillmann, Christoph and Hermann Ney. 2000. Word re-ordering and DP-based search in statistical machine translation. In COLING '00: The 18th International Conference on Computational Linguistics, pages 850–856, Saarbrücken, Germany, July.
Tillmann, Christoph and Hermann Ney. 2003. Word reordering and a dynamic programming beam search algorithm for statistical machine translation. Computational Linguistics, 29(1):97–133.
Tillmann, Christoph, Stephan Vogel, Hermann Ney, and Alex Zubiaga. 1997. A DP-based search using monotone alignments in statistical translation. In Proceedings of the 35th Annual Conference of the Association for Computational Linguistics, pages 289–296, Madrid, July.
Venugopal, Ashish, Stephan Vogel, and Alex Waibel. 2003. Effective phrase translation extraction from alignment models. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL), pages 319–326, Sapporo, Japan, July.
Vogel, Stephan, Hermann Ney, and Christoph Tillmann. 1996. HMM-based word alignment in statistical translation. In COLING '96: The 16th International Conference on Computational Linguistics, pages 836–841, Copenhagen, August.
Wahlster, Wolfgang, editor. 2000. Verbmobil: Foundations of Speech-to-Speech Translations. Springer, Berlin.
Wang, Ye-Yi. 1998. Grammar Inference and Statistical Machine Translation. Ph.D. thesis, School of Computer Science, Language Technologies Institute, Carnegie Mellon University, Pittsburgh.
Wang, Ye-Yi and Alex Waibel. 1997. Decoding algorithm in statistical translation. In Proceedings of the 35th Annual Conference of the Association for Computational Linguistics, pages 366–372, Madrid, July.
Wu, Dekai and William Wong. 1998. Machine translation with a stochastic grammatical channel. In COLING-ACL '98: 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, pages 1408–1414, Montreal, August.
Yamada, Kenji and Kevin Knight. 2001. A syntax-based statistical translation model. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL), pages 523–530, Toulouse, France, July.