<?xml version="1.0"?><!DOCTYPE article SYSTEM "/project/take/software/searchbench_offline_processing/paperxml_generator/aclextractor/src/python/../resource/dtd/paperxml.dtd"><article><header><firstpageheader><page local="1" global="1053"/><title>Distributing and Porting General Linguistic Tools</title><author surname="Genthial" givenname="Damien"><org  name="GRENOBLE Cedex" country="FRANCE"/></author><author surname="Courtin" givenname="Jacques"><org  name="GRENOBLE Cedex" country="FRANCE"/></author><author surname="Menezo" givenname="Jacques"><org  name="GRENOBLE Cedex" country="FRANCE"/></author></firstpageheader><frontmatter><p><b>Distributing and Porting General Linguistic Tools</b></p><p><b>Damien Genthial, Jacques Courtin and Jacques Menézo</b></p><p>TRILAN-CLIPS, IMAG-Campus, BP 53 F-38040 GRENOBLE Cedex, FRANCE {Damien.Genthial Jacques.Courtin Jacques.Menezo}@imag.fr</p></frontmatter><abstract>Our main motivation is to build general and adaptable linguistic tools, and in doing so we have faced the problem of their portability. We first give a quick description of the linguistic tools we have at hand and explain why linguistic tools, unlike other software tools, present particular portability problems. We then discuss code portability and data portability, and we describe the method we have used for a French lexicon, showing that portability leads to a more "natural" computational lexicon. Finally, we propose the use of a command language to interface the tools with more complex applications, and we show that this technique facilitates the integration of tools from various sources, allows a better exploitation of linguistic resources and makes it easier to distribute the tools over several machines. </abstract></header><body><section number="1." title="Introduction"><p>Our main motivation is to build general and adaptable linguistic tools, and in doing so we have faced the problem of the portability of these tools. 
The problem arose sharply when we decided to implement a distributed version of the tools. The idea is to have bricks with which to build complex linguistic systems and to make communication between bricks possible and easy. We have three points in mind:</p><p>• integration of tools from various sources: the linguistic system need not know the details of the internal architecture of the tools it uses, so it should be easier to substitute one tool for another (for example, you can easily change the morphological parser);</p><p>• better exploitation of linguistic resources by embedding them in very general tools;</p><p>• the possibility of distributing the tools over several machines of a network, allowing tools to be shared by several users (and thus the cost can also be shared).</p><p>After a quick description of the tools we have at hand, we will explain why linguistic tools, unlike other software tools, present particular portability problems. We will then discuss user interface portability and propose a simple method which, while making this portability easier, is also a good step towards distributed tools and easy communication between them.</p></section><section number="2." title="Linguistic tools at hand and motivation"><p>We have a complete morphological system based on a general finite state transducer. Its main characteristics are its reversibility (the same data are used for parsing and generating) and its adaptability (the system includes editors which permit easy and interactive modification of the data). This system is operational on PC and Macintosh architectures with a real-size French lexicon, but switching from one architecture to the other is a painful process, mainly because switching the system also implies switching the lexicon (see next section).</p><p>We also have three lexical correctors: one based on similarity keys, another on phonetics and a third, more original, which corrects flexional errors in French. 
All these tools are operational on PC architecture only.</p><p>Finally, we have two syntactic parsers which build dependency structures. One is based on the notion of dependency relations and is very fast but has limited expressive power. The other uses typed feature structures to increase this power but pays for it with slower parses. Both work on PC and Macintosh.</p><p>The interesting point came when we decided to make all these tools available on Unix systems. The goal is to gain flexibility and power by distributing the linguistic tools in a client/server architecture. With such an architecture, tools are easier to use and are sharable among applications. For example, as proposed by (Genthial, 1994), a phonetic/graphic transducer which implements a lexical correction can also be used in a syntactic corrector to determine the most probable correction. Tools can also be dispatched on different machines, so that one can, for example, write on a PC or Mac and use the linguistic tools of a Unix server.</p><p>So the problem to solve looks like a software engineering one: we have a lot of code, written in different programming languages on two different machines, and we want to implement it on a new architecture.<page local="2" global="1054"/> But we have added a heavier constraint: we want the code and data obtained on the new architecture (Unix) to be easily (by easily we mean in only a few minutes) put back on the other ones (Mac and PC).</p></section><section number="3." title="Code and data portability"><p>Code portability is not specific to computational linguistics; it is a well-known problem in the software engineering domain. But implementing a linguistic application also means implementing a large amount of data, and thus raises the problem of data portability. Considering the morphological level, for example, implies coding a lexicon including words with their category, their morphological properties,... 
Categories and properties are symbols chosen by the linguist, who can always choose symbols which can be expressed in the same way on different machines and are thus portable. But words are character strings, coded with the character set of the machine used, and so the portability of the word list relies upon the portability of this character set. The ASCII character set, which is the basic set on almost every machine, is fully portable but it does not contain every character of every natural language: using the French <i>ê, à </i>or ç implies the use of an extended character set which is not portable.</p><p>After a short discussion of code portability, we will present a method to achieve data portability.</p><doubt alpha="44.4" length="9" tooSmall="False" monospace="0.0">3.1. Code</doubt><p>Code portability is closely tied to the programming language used for writing programs: the more portable the language, the more portable the code. That is the reason why we chose the Pascal programming language in the early 70's: the language was well defined and we used only the standard features. But the language has evolved and this evolution has led to incompatibility between versions.</p><p>By contrast, the C language was standardised in 1989 by ANSI and we can now speak of real portability of code from one architecture to another. We therefore chose to use C, and most of the rewriting of Pascal units into C modules was done by a Pascal to C translator.</p><p>But one problem remains: we want to put the C translation back on the original machine with minimal work, and the original code includes a user interface with pull-down menus and dialogues which are impossible to translate as is. 
So we have made an effort to cut the C version into two parts:</p><p>• the user interface, which is highly non-portable and must be rewritten on every machine (see section 4 for a discussion of interface portability);</p><p>• the tool kernels, written in strict ANSI-C.</p><p>Thanks to the language standard, the kernels (about 8000 lines of code) have been compiled, without changing even a comma, on Macintosh, PC and two different Unix machines.</p><doubt alpha="44.4" length="9" tooSmall="False" monospace="0.0">3.2. Data</doubt><p>Two kinds of data may be used in linguistic applications: textual data and binary data. Most of them are textual because they can easily be printed, displayed and modified with the standard tools of the host system. But sometimes you need to compile data to gain efficiency: the application becomes faster and uses less disk space.</p><p>Binary data in linguistic applications are, for example, integers, bit vectors coding properties, floating-point numbers coding statistics and so on. Their portability is not a real problem because one can easily translate them into textual form on the original machine, put this form on the target machine, and compile them back.</p><p>As said before, portability of textual data relies upon portability of the character set, so using the ASCII set ensures great portability but forbids writing special characters. Such special characters include all French accented letters <i>(à, â, é, è, ê,...) </i>which can be coded (and typed) on every machine, but the codes are different from one machine to another. 
Moreover, all special character codes are above the maximal ASCII code and this entails a disturbing side effect: when sorting the words of a lexicon, you get all words starting with an accented letter at the end of the list (see the example in Figure 1).</p><p>errer oui outre ouïe vent érudit ôter <i>Figure 1 :</i><i> Sorted accented strings on a PC</i></p><p>When the lexicon is big enough, the word <i>érudit </i>is far from <i>errer, </i>which is computationally sound but unacceptable for the common user.</p><p>We have therefore defined an internal code for special characters based on the ASCII character set. The code is a reduced version of one defined by GETA in (Boitet, 1982): an accented letter is coded with the letter without accent, a vertical bar, and a number corresponding to the accent (see examples in Figure 2)<footnote anchor="1"/>.<page local="3" global="1055"/></p><doubt alpha="37.5" length="40" tooSmall="False" monospace="0.0">à -&gt; a|2 â -&gt; a|3 é -&gt; e|1 è -&gt; e|2 ê -&gt; e|3</doubt><p><i>Figure 2 : Examples of the code for accented characters</i></p><p>All textual data are then completely portable provided that source and target machines use ASCII. But there are two drawbacks: you cannot ask the user to learn this code and you cannot use the standard string comparison functions. For the first problem, we simply write two procedures: one for reading strings and one for writing them. Their purpose is to translate from one representation to the other so that the user has no need to know the internal code: he can type special letters as usual on his keyboard. For the second, the solution is to write our own comparison function, which is not so difficult and has an advantage: we can implement a "natural" order on words (the order used in paper dictionaries). We then obtain a human-friendly order which can also have a computational advantage in correction systems. 
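</p><p>The reading and writing procedures and the comparison function described above can be sketched as follows. This is only an illustrative sketch in Python, not the authors' ANSI-C code: the function names are hypothetical, the accent numbers (1 = acute, 2 = grave, 3 = circumflex) are read off Figure 2, and the tie-breaking rule (the last differing accent decides, as in French paper dictionaries) is an assumption.</p><p><![CDATA[
```python
import re
import unicodedata
from functools import cmp_to_key

# Accent numbers as read off Figure 2: 1 = acute, 2 = grave, 3 = circumflex.
# Accents not shown there (diaeresis, cedilla, ...) are deliberately left out.
MARK_TO_DIGIT = {"\u0301": "1", "\u0300": "2", "\u0302": "3"}
DIGIT_TO_MARK = {digit: mark for mark, digit in MARK_TO_DIGIT.items()}


def to_internal(text):
    """Reading procedure (hypothetical name): user text -> internal code."""
    decomposed = unicodedata.normalize("NFD", text)  # accented letter -> base + mark
    return "".join("|" + MARK_TO_DIGIT[ch] if ch in MARK_TO_DIGIT else ch
                   for ch in decomposed)


def from_internal(code):
    """Writing procedure (hypothetical name): internal code -> user text."""
    marked = re.sub(r"\|([123])", lambda m: DIGIT_TO_MARK[m.group(1)], code)
    return unicodedata.normalize("NFC", marked)


def letters(code):
    """Split a coded word into (base letter, accent number) pairs; 0 = no accent."""
    return [(base, int(digit or 0))
            for base, digit in re.findall(r"(.)(?:\|([123]))?", code)]


def natural_compare(a, b):
    """Compare two coded words: base letters first, accents only to break ties."""
    la, lb = letters(a), letters(b)
    bases_a = [base for base, _ in la]
    bases_b = [base for base, _ in lb]
    if bases_a != bases_b:
        return -1 if bases_a < bases_b else 1
    # Same base letters: assumed rule, the last differing accent decides.
    for (_, acc_a), (_, acc_b) in zip(reversed(la), reversed(lb)):
        if acc_a != acc_b:
            return -1 if acc_a < acc_b else 1
    return 0


def natural_sort(words):
    """Sort user-visible words in dictionary order through the internal code."""
    coded = sorted((to_internal(w) for w in words), key=cmp_to_key(natural_compare))
    return [from_internal(c) for c in coded]


if __name__ == "__main__":
    print(to_internal("côté"))
    print(natural_sort(["vent", "érudit", "ôter", "errer", "côté", "cote"]))
```
]]></p><p>Under these assumptions, the strings typed by the user are converted on input, sorted by base letter first, and converted back on output, so the internal code never reaches the user. </p><p>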
Consider for example the four French words <i>cote, côte, coté </i>and <i>côté: </i>their proximity in the lexicon is a guarantee that a corrector will find the correction if one is used for the other, a guarantee that you cannot have with the preceding order (765 root words lie between the roots <i>cote </i>and côté in our French root dictionary, which contains a total of 35 000 roots).</p><p>With this code, we get textual portability of data and a natural dictionary order which is preserved on all machines where the dictionary is implemented.</p><doubt alpha="53.1" length="32" tooSmall="False" monospace="0.0">4. Driving tools with a</doubt><p><b>command language</b></p><p>Once you have achieved the portability of your software kernels, you are faced with the portability of the user interfaces. Here you have two choices:</p><p>1. write portable interfaces by using very simple textual interactions with the user so that you can write the code in ANSI-C;</p><footnote label="1">The code defined by the Text Encoding Initiative (Sperberg, 1994), derived from SGML, is usable for electronic transfer, but a little cumbersome for a lexicon which might contain as many as 200 or 300 thousand words.</footnote><p>2. 
write a modern interface, closely tied to the graphical interface of the host machine, and partially or completely rewrite it each time you want to implement it on a new architecture.</p><p>We have chosen to proceed in two steps:</p><p>• first take the first option, even if we get a very poor user interface, not acceptable on modern graphic computers; such interfaces are very easy to write and at least make it possible to debug the tools.</p><p>• then take the second option, trying to minimise the rewriting cost and, moreover, to make the kernels completely independent of the interface.</p><p>To minimise the rewriting cost, we use a graphical library which is freely available and portable from one machine to another.</p><p>To make the kernels completely independent of the interface, we propose to have a user interface which is strictly limited to communication with the user. The architecture is a client/server one, where the user interface (the client) calls the kernels (the servers) for linguistic processing (see Figure 3).</p><p><i>CLI: Command Language Interface</i></p><doubt alpha="100.0" length="3" tooSmall="False" monospace="0.0">CLI</doubt><p><b>Morphological Parser</b> <b>Morphological Generator</b> <b>Similarity Key Correction</b> <b>Phonetic/Graphic Transducer</b> <b>Data/Lexicons Editors</b></p><p><i>Figure 3 : Distributed Architecture with separated user interface</i></p><p>You can imagine as many clients as you need, for example:</p><p>• one for a lemmatiser which calls only the morphological parser and generator;</p><p>• one, more complex, for a detection/correction system, which uses all the tools to produce corrections of lexical errors;</p><page local="4" global="1056"/><p>• one, with pull-down menus and windows, devoted only to the editors (modification of the lexicons, of the linguistic data,...).</p><p>Of course, all interfaces share the tools with the others, and it must be easy to add a new tool to an interface (for example a new correction method) 
or to substitute a given tool by another (one can change the phonetic/graphic transducer to get an improved version).</p><p>To obtain this flexibility and to make possible the distribution of tools (on the same machine or across a network), we propose, like (Boitet, 1994) in the white-board architecture, to add a manager to each module. Our manager takes the form of a textual command language which is used to drive the module; (Antworth, 1990) has used such a command language interface in PC-KIMMO.</p><p>The general form of a command would be the following:</p><doubt alpha="52.5" length="40" tooSmall="False" monospace="0.0">verb(arg1 =&gt; param1; arg2 =&gt; param2;...)</doubt><p>where <b>verb </b>is the command and where <b>arg1, arg2,... </b>and <b>param1, param2,... </b>are respectively the names and the values of its parameters.</p><p>Parameter values could be integers, floating point numbers, booleans, objects (denoted with the same syntax as a command), or a list of the preceding.</p><p><b>Examples:</b></p><p><b>Parse(string =&gt; "to_be_parsed")</b> <b>Generate(word =&gt; "aimer"; filter =&gt; filter(category =&gt; "verb"; variables =&gt; ["present", "singular", "3rd_person"]))</b> <b>List(dictionnary =&gt; "dict_name")</b> <b>Add_dictionnary(word =&gt; "to_add"; like =&gt; "paradigm")</b></p><p>Each tool must be built on the same frame: it reads from only one input stream (its standard input) and writes to only one output stream (its standard output), and its main algorithm is an interpreter.</p><p>Using such a command language interface brings four main advantages:</p><p>• it can be used as the only (but rough) interface for a given tool;</p><p>• you can write programs in this language and thus automate the use of the tool;</p><p>• the interpreter does not use any machine-specific feature, so the entire tool can be written in strict ANSI-C and 
thus be highly portable (without changing a comma);</p><p>• connecting the tool to a more sophisticated interface program is very easy: it requires only the ability to pass text from one application to the other. You can, for example, put a morphological parser on a machine so that it can be called by electronic mail: you send the string to be parsed in a mail and the answer contains the words, with their category and properties.</p></section><section number="5." title="Conclusion"><p>We have used the portability framework presented in this paper for the main tools of our system: a morphological parser and a morphological generator, which use a lexicon of roots and endings to parse or generate about 250 000 French forms. The lexicon must be converted back to textual form and recompiled when porting from Mac to PC, but the whole process does not take more than a dozen minutes. By contrast, thanks to the similarity in their architectures, the same lexicon can be used on Mac and on Unix machines.</p><p>Concerning the code, we now have portable versions of the tools mentioned above, plus a lexical disambiguator and a lexical corrector using similarity keys. We are able to deliver libraries for these tools (and their data for French) on Mac, PC and Unix.</p></section><references><p>E.L. Antworth (1990). PC-KIMMO: A Two-level Processor for Morphological Analysis, <i>Summer Institute of Linguistics, </i>Dallas, Texas.</p><p>Christian Boitet (1982). Le point sur ARIANE-78. <i>Rapport ADI 811423, </i>GETA-Champollion et CAP SOGETI France, Grenoble.</p><p>Christian Boitet and Marc Seligman (1994). The "white-board" architecture: a way to integrate heterogeneous components of NLP systems. <i>CoLing'94, Kyoto, Japan, August 94, </i>Vol. 1, pp 426-430.</p><p>Damien Genthial and Jacques Courtin (1994). Towards a More User-Friendly Correction. <i>CoLing'94, Kyoto, Japan, August 94, </i>pp 1083-1088.</p><p>C.M. Sperberg-McQueen and L. Burnard (1994). 
Guidelines for Electronic Text Encoding and Interchange, <i>in press, </i>Chicago and Oxford.</p></references></body></article>