PaTrans - A Patent Translation System

Bjarne Orsnes, Bradley Music & Bente Maegaard
Center for Language Technology
Njalsgade 80

Abstract

This paper describes PaTrans - a fully automatic production MT system designed for producing raw translations of patent texts from English into Danish. First we describe the backbone of the system: the EUROTRA research project and prototype. Then we give an overview of the translation process and the basic functionality of PaTrans, and finally we describe some recent extensions for improving processing efficiency and the translation quality of unexpected input encountered in real-life texts.

Introduction

PaTrans¹ is a fully automatic machine translation system designed for English-Danish translation of patent texts. It is based on the linguistic specifications and to some extent on the software of the EUROTRA project of the European Community (Copeland et al., 1991a; Copeland et al., 1991b). PaTrans consists of a core grammar and translation module and a host of peripheral utilities: term databases, general databases, editors for pre- and postediting, document handling facilities, and facilities for creating and updating term databases. In this short presentation we will concentrate on the grammar, lexicon and translation module and on some of the new features of PaTrans.

¹ PaTrans was developed for Lingtech A/S.

From EUROTRA to PaTrans

EUROTRA was the European Community MT research programme. The Community started the programme in 1982, with the goal of creating an advanced system for automatic translation capable of treating all the official working languages of the Community. When the programme finished in 1992, it had delivered a huge amount of research results and an implemented prototype of a multilingual translation system. The PaTrans development relies on the prototype resources (Maegaard and Hansen, 1995), the system architecture and linguistic specifications, as well as on the experienced staff created by EUROTRA.

The EUROTRA Prototype

EUROTRA was a transfer-based multilingual MT project. Because of the multilinguality, the prototype was quite "clean" in terms of separate modules for analysis, transfer and synthesis of the various languages and language pairs.

Software

The software component consisted of the translation kernel, used for analysis, transfer and generation. The translation kernel had mechanisms for treating grammar rules, dictionary information and mapping rules.

Lingware

For all languages, the project produced a large grammar and a general language dictionary. Though these were insufficient for the task at hand, the PaTrans development could build on the English and Danish grammars and dictionaries, as well as on the transfer module from English into Danish.

Customizing EUROTRA

Patent texts are characterised by the vocabulary they contain: terms belonging to the field treated, e.g. chemistry, and patent document terms of a more legal nature. But patent documents are also characterised by the frequency of some linguistic phenomena and the absence of others: for example, we had to develop a treatment of lists and enumeration, and conversely we could simplify the treatment of modality considerably. The current maintenance and further development of the system continues this text type specific line. The success of the system is mainly based on this fundamental principle of tailoring it to a specific text type and subject field.

An overview of the Translation Process

Document handling

The document handling step has four main functions:

• Format Preservation: Input to document handling is a text from a text processing system which has been marked up in SGML. The SGML codes denote e.g. titles, paragraphs, text segments that should not be translated, etc. All information about document layout is stored separately and taken away from the translation process.

• Formula Recognition: The document handler automatically recognises certain text typical untranslatable units, such as chemical formulas and tables.

• Term Recognition: Terms and multi-word units are also recognised at this stage. In this context, words are treated as terms if they are subject specific or if they have a unique translation in the given text type. They are recognised during text handling and have their translation equivalent attached to them along with morphosyntactic information for both source and target language.

• Segmentation: Finally the text is separated into units for translation, i.e. sentences, for which various recognition patterns have been set up. In some patent texts of specific subject fields, the sentences are extremely long. In these cases, there is no point in trying to arrive at a complete parse of the whole sentence, since the parse is most likely to fail and processing will be too space and time consuming. Therefore the document handler attempts to arrive at a meaningful partition of the sentences by identifying sentence internal boundaries and submitting the individual subparts for translation.
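The segmentation step can be pictured with a minimal sketch. Everything specific in it is an assumption made for illustration: the paper gives neither the recognition patterns nor a length limit, so the sentence pattern, the `MAX_WORDS` threshold and the choice of the semicolon as a sentence-internal boundary are all hypothetical.

```python
import re

MAX_WORDS = 40  # assumed threshold; the paper states no figure


def segment(text, max_words=MAX_WORDS):
    """Split text into translation units.

    Sentences are found with a simple (assumed) pattern: end punctuation,
    whitespace, then a capital letter. Sentences longer than max_words are
    further split after each semicolon (an assumed internal boundary).
    """
    sentences = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    units = []
    for sentence in sentences:
        if len(sentence.split()) <= max_words:
            units.append(sentence)
        else:
            parts = re.split(r"(?<=;)\s+", sentence)
            units.extend(part.strip() for part in parts if part.strip())
    return units


if __name__ == "__main__":
    claim = ("The method comprises heating the mixture; "
             "cooling the product; and filtering it.")
    for unit in segment(claim, max_words=5):
        print(unit)
```

Each subpart would then be submitted to translation on its own, which trades a complete parse of the whole sentence for a bounded parsing cost per unit.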

Disambiguation

Before the text is passed on to the parser, it is subjected to a thorough process of disambiguation. This is one of the new features of PaTrans compared to the EUROTRA model and will be discussed in detail below.

Source language analysis

Since PaTrans is based on the transfer translation model, the surface strings of the text are sequentially transformed into an intermediate representation defined by several mapping principles.

During source language analysis the sentences are assigned a surface syntactic structure. This surface syntactic structure is converted into a language-neutral transfer representation ordering the constituents of the sentence in a canonical order with heads preceding arguments and arguments preceding modifiers (Copeland et al., 1991a). The transfer representation is a reflection of the argument structure of the predicates, where information about surface syntactic realization appears as features on the individual nodes. Function words (conjunctions, determiners, prepositional case markers) are featurized, and tense/aspect and negation are represented in language-neutral features.

The output of source language analysis is thus a tree with multilayered information including syntactic and morphosyntactic features, as well as the syntactic/semantic relationships between the predicators and the arguments.
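The canonical ordering of the transfer representation (heads before arguments, arguments before modifiers) can be illustrated with a small sketch. The `Node` class, the role names `head`, `arg` and `mod`, and the example features are hypothetical and are not the EUROTRA formalism; the sketch only shows the reordering idea.

```python
from dataclasses import dataclass, field

# Assumed role inventory: heads precede arguments, arguments precede modifiers.
ROLE_RANK = {"head": 0, "arg": 1, "mod": 2}


@dataclass
class Node:
    role: str
    label: str
    features: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def canonicalise(node):
    """Return a copy of the tree with daughters in canonical order.

    sorted() is stable, so daughters with the same role keep their
    surface order relative to each other.
    """
    ordered = sorted(node.children, key=lambda child: ROLE_RANK[child.role])
    return Node(node.role, node.label, dict(node.features),
                [canonicalise(child) for child in ordered])


if __name__ == "__main__":
    # Surface order: modifier, argument, head, argument.
    surface = Node("top", "S", {"tense": "past"}, [
        Node("mod", "yesterday"),
        Node("arg", "catalyst", {"det": "definite"}),
        Node("head", "dissolve"),
        Node("arg", "salt", {"det": "definite"}),
    ])
    print([child.label for child in canonicalise(surface).children])
```

Note how the determiners appear only as features on the argument nodes, mirroring the featurization of function words described above.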

At all levels, sets of preference rules based on heuristic principles select among competing analyses, e.g. for PP-attachment (Bennett and Paggio, 1993).

Transfer

PaTrans adheres to simple transfer, i.e. the substitution of source language lexical units with target language lexical units by means of lexical transfer rules,² while the source language structural representation is mapped directly onto the target language transfer representation which is input to the generation module. There are two main reasons why complex transfer (i.e. transfer where the structure of the input representation is altered) is kept at a minimum:

• Complex transfer is costly inasmuch as the general applicability of the rules is usually very restricted.

• A transfer rule applies to any object matching its left-hand side and performs the mapping defined on the right-hand side. Due to the 'fail-soft' mechanism (discussed below), the structure of the objects which the transfer rules must apply to cannot be fully predicted. In order for complex transfer to work in all cases, rules must be set up not only for correctly parsed input structures, but also for the special fail-soft structures. For this reason, complex transfer is costly and is only used for frequent phenomena considered crucial for good translation, e.g. converting certain English ing-forms into Danish relative clauses.

² Recall that this only applies to words of the general vocabulary which require disambiguation during analysis and not to terms.

Target syntactic generation

During generation, the transfer representation is mapped onto a target syntactic structure through intermediate representational levels. At the first level, the target language lexical units are looked up in the lexical database and monolingually relevant features are calculated on the basis of the language-neutral representation, e.g. tense and aspect.

At the second level (the relational level) surface syntactic functions are calculated and certain function words, such as prepositional markers, are inserted. Finally, the relational structure is mapped onto the level defining the constituent structure of the target language sentence. At this level all information with independent lexical expression is present.

Target morphological generation

PaTrans has a highly developed morphological module which provides an almost complete coverage of Danish inflectional morphology. The module is based on structure-building rules which allow for downwards expansion. Regular inflection, syncope and gemination are accounted for, while only completely irregular word forms have to be coded in their entirety. PaTrans also has a limited strategy for translating compounds compositionally. Generally, compounds are coded in the (terminological) dictionaries, but the parser tries to translate compounds which are not coded in the dictionaries by translating their individual subparts.

Document generation

Finally, the document generation module inserts all SGML-markers and all items which have been marked as untranslatable (tables, formulas, numbers etc.), and a separate conversion programme converts the output into WordPerfect format.³

The lexica

PaTrans distinguishes two kinds of vocabularies: the general vocabulary and the terminological vocabularies.

• The general vocabulary is stored in a monolingual English dictionary, a monolingual Danish dictionary separated into a syntactic and a morphological level, and a bilingual transfer dictionary.

• The terminology is divided into subject specific databases. As PaTrans is used for a number of different subject fields, the priority of the databases is user-defined and flexible. The user specifies which term bases are to be used for a translation job, and in which order of priority. When a term is found in one term base, it is not looked up further in the subsequent databases.
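The prioritised term base lookup amounts to a first-match search over an ordered list of databases. The sketch below assumes each term base is a plain dictionary; the function name `lookup_term`, the base names and the sample entries are invented for illustration and do not come from the PaTrans term bases.

```python
def lookup_term(term, term_bases, priority):
    """Return (base_name, translation) from the first base containing term.

    priority is the user-defined order of term base names for this job.
    Later bases are not consulted once a match has been found.
    """
    for name in priority:
        translation = term_bases[name].get(term)
        if translation is not None:
            return name, translation
    return None  # not a term: left to the general vocabulary


if __name__ == "__main__":
    # Illustrative entries only.
    term_bases = {
        "chemistry": {"solution": "opløsning"},
        "legal": {"solution": "løsning", "claim": "krav"},
    }
    print(lookup_term("solution", term_bases, ["chemistry", "legal"]))
    print(lookup_term("solution", term_bases, ["legal", "chemistry"]))
```

Changing only the priority list changes which translation is chosen, which is what makes the same set of term bases reusable across subject fields.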

³ Until now, all texts have been delivered in WordPerfect, but the conversion programme may of course be adapted to other text processing systems.

PaTerm Coding Tool

For ease of maintenance and updating, PaTrans has a special coding tool. As mentioned above, the PaTrans term bases contain terms as well as words and expressions which behave like terms, i.e. which have unique translations. New terms occur in each and every patent document which is submitted for translation. Consequently, it is important that the user, who is not necessarily a computational linguist, can encode terms in an efficient and precise way. The PaTerm coding tool provides a screen with fields to fill in, and in most cases an answer is proposed by the system, so that the user has to make just one acceptance keystroke. Care has been taken to present the most frequent, and therefore most probable, answer on the top of the list. PaTerm asks the minimum number of questions and computes the remaining linguistic information from the answers received. This also saves time for the user.

Special Features

Error Recovery

Since the system runs in a practical environment, it must never fail to produce an output, even if it encounters an unanalysable sentence. Consequently, a fail-soft mechanism was introduced. The fail-soft mechanism works at all levels of representation. If the parser fails to assign a well-formed structure to the input, a path is selected from the chart which spans the greatest amount of the input, and already created constituents are collected. The quality of fail-soft output varies considerably and recent work has attempted to improve the results of fail-soft. Disambiguation of individual words, the selection of appropriate readings and the determination of individual constituents at a very early stage are crucial in arriving at a 'best-fit' parse.
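Selecting a path through the chart that spans as much of the input as possible can be sketched as a small dynamic programme. This is an illustration under assumptions, not the PaTrans algorithm: chart edges are modelled as `(start, end, label)` triples over token positions, and ties are broken in favour of fewer constituents, which the paper does not specify.

```python
def best_fit_path(n_tokens, edges):
    """Pick non-overlapping chart edges covering the most tokens.

    best[i] holds (tokens covered, -number of edges, chosen edges)
    for the prefix tokens[0:i]. Tokens not covered by any chosen
    edge are simply skipped.
    """
    best = [(0, 0, [])]
    for i in range(1, n_tokens + 1):
        candidates = [best[i - 1]]  # leave token i-1 uncovered
        for start, end, label in edges:
            if end == i:
                covered, neg_count, path = best[start]
                candidates.append((covered + end - start,
                                   neg_count - 1,
                                   path + [(start, end, label)]))
        # Prefer more coverage, then fewer constituents (assumed tie-break).
        best.append(max(candidates, key=lambda c: (c[0], c[1])))
    return best[n_tokens][2]


if __name__ == "__main__":
    # No edge spans all six tokens, so the parse as a whole has failed.
    chart = [(0, 2, "NP"), (2, 5, "VP"), (1, 4, "X"), (5, 6, "ADV"), (0, 3, "S")]
    print(best_fit_path(6, chart))
    # → [(0, 2, 'NP'), (2, 5, 'VP'), (5, 6, 'ADV')]
```

The sketch also shows why early disambiguation matters: the fewer spurious edges the chart contains, the more likely the widest-coverage path is also a linguistically sensible one.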

Interestingly, there are some fundamental difficulties in combining advanced MT with fail-soft strategies. The most striking example of this is the fact that PaTrans aims at a very deep analysis of the source text, and at the same time the formalism allows for non-monotonic mappings between levels of representation. Due to the unexpected and to some extent unpredictable structure of fail-soft analyses, subsequent grammar rules may fail to apply, resulting in output representations where information, e.g. about the degree of adjectives and other information stemming from function words, has been lost. Current efforts consequently aim at preserving information at all levels.

Tagging

Before the text is submitted to the parser, the text is tagged, i.e. the tagger tries to determine the part-of-speech of the individual words based on local cooccurrence restrictions. There are two reasons why the tagger has been integrated into the system:

• Since the overall translation system is unification-based, words are disambiguated by the application of all possible rules, which is highly inefficient.

• If the sentence is fail-softed, one intermediate analysis is picked from the chart, which means that not all words may have been disambiguated properly by the grammar rules. If, however, the words have been disambiguated and impossible readings have been discarded prior to parsing, the 'best-fit' parse is considerably better than it would otherwise have been.

The tagger is a public-domain, rule-based tagger. It has been trained on a corpus of the Wall Street Journal and on patent texts within the subject field. In addition, it has been augmented with several 'local' contextual rules developed by the linguists working with PaTrans. The integration of the tagger has not only provided for more efficient processing but, more importantly, also for a higher quality of the translations of fail-softed sentences. Current efforts aim at improving the performance of the tagger.
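One common design for a trainable rule-based tagger assigns each word its most likely tag and then corrects tags with contextual rules. The sketch below follows that design as an assumption; the paper does not describe the internals of the tagger used in PaTrans. The tag names, the toy lexicon and the single rule are invented for illustration.

```python
# Toy lexicon: most likely tag per word (illustrative entries only).
LEXICON = {
    "the": "DET", "filter": "NN", "to": "TO",
    "mixture": "NN", "is": "VBZ", "used": "VBN",
}

# Contextual rules: (from_tag, to_tag, tag of the preceding word).
RULES = [("NN", "VB", "TO")]


def tag(words, lexicon=LEXICON, rules=RULES, default="NN"):
    """Assign initial tags from the lexicon, then apply contextual rules."""
    tags = [lexicon.get(word.lower(), default) for word in words]
    for from_tag, to_tag, prev_tag in rules:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return list(zip(words, tags))


if __name__ == "__main__":
    for word, pos in tag("the filter is used to filter the mixture".split()):
        print(word, pos)
```

Here the second occurrence of "filter" is retagged as a verb because it follows "to", so the parser never has to consider its noun reading.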

Preparsing

The original EUROTRA-parser has been augmented with special rules which apply before the actual grammar rules (Music, 1993). The goal is to enable more efficient handling of long sentences that are otherwise unprocessable given moderate resources. With pre-rules, sentences are segmented via pattern-matching, before they are sent to the parser. In this way, the number of parse paths that the system has to consider is reduced considerably.
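As a rough picture of pattern-based pre-rules, the sketch below brackets explicitly marked intervals so that a parser could treat each bracketed span as a single unit. It assumes pre-rules can be approximated by regular expressions over the word string; the actual PaTrans rule formalism is not shown in the paper, and the rule set, the category name `INTERVAL` and the bracket notation are hypothetical.

```python
import re

# Hypothetical pre-rules as (category, pattern) pairs, applied in order.
PRE_RULES = [
    ("INTERVAL", re.compile(r"\bfrom \S+ to \S+")),
    ("INTERVAL", re.compile(r"\bbetween \S+ and \S+")),
]


def preparse(sentence, rules=PRE_RULES):
    """Wrap every pattern match in a labelled bracket."""
    for category, pattern in rules:
        sentence = pattern.sub(
            lambda match, cat=category: f"[{cat} {match.group(0)}]",
            sentence,
        )
    return sentence


if __name__ == "__main__":
    print(preparse("The mixture is heated from 20 to 80 degrees"))
    # → The mixture is heated [INTERVAL from 20 to 80] degrees
```

Every span that is bracketed before parsing removes a set of alternative attachments from consideration, which is where the reduction in parse paths comes from.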

To give greater power to the preparser, pre-rule application has been made cyclic. This means that the output from one rule application (or one application cycle) is used as input to a new cycle which starts at the beginning of the rule set. In principle then, any rule can feed (i.e. create the preconditions needed for application of) any other rule, while at the same time allowing prioritization of rules.

The pre-rules not only add structure to the input, they are also used for lexical disambiguation based on collocatives and immediate context. Whereas the rule-based tagger described above is able to determine the part-of-speech of individual words based on prior training and contextual rules, pre-rules can select individual readings of words within the same part-of-speech. Pre-rules have been developed for lexical disambiguation and for parsing of adverbial phrases, complex verb groups, coordinated that-clauses, indexed lists, valency-bound prepositional phrases and explicitly marked intervals (e.g. from ... to, between ... and).

The effects of pre-rules are twofold: on the one hand they assign structure to the input at a shallow level, which nevertheless increases processing efficiency considerably; on the other hand they also improve fail-soft results, since inappropriate readings of words in a given context are discarded at an early stage.

Performance

PaTrans is in everyday use at the translation agency Lingtech, where it is being used for all texts which are suited for it in its current version, i.e. chemical, biochemical, medical etc. patents, and gradually also a considerable amount of mechanical patents. PaTrans is making the translation process faster and more efficient, and it has proven to be a good business for Lingtech, saving around 50% of the raw translator cost.

Conclusion

PaTrans is a running production translation system producing cost-effective raw translations of patent texts.
But PaTrans is also a project which combines academic research and practical applications and which has shown that MT is viable in limited domains. Current work concentrates on improving the coordination of the rule-based part of the system and the fail-soft component.

References

Bennett, P. and Paggio, P., editors (1993). Preference in Eurotra, volume 3 of Studies in Machine Translation and Natural Language Processing. Commission of the European Communities, Luxembourg.

Copeland, C., Durand, J., Krauwer, S., and Maegaard, B., editors (1991a). The Eurotra Linguistic Specifications, volume 1 of Studies in Machine Translation and Natural Language Processing. Commission of the European Communities, Luxembourg.

Copeland, C., Durand, J., Krauwer, S., and Maegaard, B., editors (1991b). The Eurotra Formal Specifications, volume 2 of Studies in Machine Translation and Natural Language Processing. Commission of the European Communities, Luxembourg.

Maegaard, B. and Hansen, V. (1995). PaTrans - Machine Translation of Patent Texts. From Research to Practical Application. In Convention Digest: Second Language Engineering Convention, London, pages 1-8.

Music, B. (1993). Preparsing in the PaTrans MT System. In Bits and Bytes: Datalingvistisk Forenings årsmøde nr. 3, pages 82-90. Institut for Sprog og Kommunikation, Odense Universitet.