Efficiency Considerations for LFG-Parsers - Incremental and Table-Lookup Techniques

Abstract

The paper presents a concise description of the LFG-Parser-Generator developed at the EWH in Koblenz. Special attention is paid to efficiency considerations that speed up the system in the execution phase: the lexicon is separated, LL(k) parsing tables are used, and some preliminary unifications are carried out before the actual execution. The run time system follows the single-pass strategy and produces the f-structures simultaneously with the processing of the c-structures.

General Considerations of Parsing Efficiency

Basic parsing techniques (both shift-reduce and recursive descent) seem to be inherently inefficient inasmuch as they proceed strictly according to the sequence of the rules in the grammar and are not able to exploit the surrounding (preceding and following) syntactic information. Their scope is limited to a single rule, and they jump mechanically to the sequentially next rule even if such a move is obviously abortive and must be immediately abandoned (Winograd 1983, 108-115; Phillips 1984; Hellwig 1988).

Parsing tables, as they are conceived in current compiler construction devices for LR(k) and LL(k) languages, have two advantages: 1. they make the information provided by the grammar accessible throughout the entire processing and not just at the point where it happens to occur, and 2. they can be constructed algorithmically (Aho/Ullman 1979).

The LFG-Model of the EWH: General Design

The Koblenz LFG-Parser-Generator is an interactive system designed to create and to test grammars for natural languages according to the linguistic philosophy of LFG as conceived in Bresnan and Kaplan (1982). Both lexicon and syntax closely follow the original format specifications. The system can be divided into two main phases, preprocessing and actual execution:

1. Preprocessing of the input grammar (including the lexicon) generates the executable code, which in turn involves two logically distinct steps:
   • generating the PROLOG code and
   • optimizing the PROLOG code;
2. the actual execution phase analyses the input string and produces the f-structures.

Code Generation

In the preprocessing phase the grammar rules are entered into the system and translated into executable PROLOG code. This part of the system is written in PASCAL. The implementation includes facilities for the treatment of the metavariables ⇑ and ⇓ needed for the treatment of long-distance dependencies (Weisweber 1986). The grammar may contain both optional categories and multiply recurring categories (marked by the Kleene star operator *).

Figure 1: The LFG-System in Koblenz

In order to facilitate the generation of the tables with the reach relations, the phrase-structure portion of the rules of the grammar (c-structure rules) is extracted and stored as an additional, separate data set.

Code Optimization

The second task of the preprocessor is to produce more efficient PROLOG code. Optimization covers the construction of the parsing table and code revision.

In order to speed up the actual analysis in the execution phase, the preprocessor constructs a table of reach relations on the basis of first and follow sets, connecting nonterminal and preterminal nodes with a lookahead of 1.

The definition of the first and follow sets is based on a context-free grammar (Aho/Ullman (1979, 186-192, 429-30)): $G = (N, \Sigma, \mathcal{P}, S)$, with $\alpha, \beta \in (N \cup \Sigma)^{*}$ and $A \in N$.

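The first set of a string $\alpha$ (in particular, of a nonterminal symbol $A$) is defined as the set of preterminal symbols which can occur in the leftmost position of a string derived from it:

$$\mathrm{FIRST}(\alpha) = \{\, a \in \Sigma \mid \alpha \Rightarrow^{*} a\beta \,\} \cup \{\, \varepsilon \mid \alpha \Rightarrow^{*} \varepsilon \,\}$$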

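The follow set of a nonterminal $A$ is defined as the set of preterminals which may occur immediately after $A$, i.e. the first sets of the strings that can follow $A$:

$$\mathrm{FOLLOW}(A) := \{\, a \in \Sigma \mid S \Rightarrow^{*} \alpha A \beta \wedge a \in \mathrm{FIRST}(\beta) \,\} \cup \{\, \$ \mid S \Rightarrow^{*} \alpha A \,\}$$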

Contrary to the standard definition of the terms (op. cit.), the Koblenz system does not exclude the application to left-recursive constructions. The reach relations are built up uniformly both for left-recursive and for all other constructions.

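The first and follow sets make it possible to define the reach relations, which state, for a nonterminal $A$ (on the stack) and a preterminal symbol $a$ (located in the input string), by which production rule(s) the preterminal can be accessed:

$$\mathrm{REACH}(A, a, P) \Leftrightarrow \exists P \in \mathcal{P} \text{ with } P = A \to \alpha : \bigl(a \in \mathrm{FIRST}(\alpha) \wedge \neg(a = \varepsilon)\bigr) \vee \bigl(a \in \mathrm{FOLLOW}(A) \wedge \alpha \Rightarrow^{*} \varepsilon\bigr)$$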

The reach relations are valid for all context-free languages and extend the applicability of LL(1) tables to them in general. They are calculated over the first and follow sets and stored in tables for the execution phase. The practical construction of the table of reach relations is based on the systematic separation of dictionary and grammar rules, without which the construction of the table would not be feasible.
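
The construction described above can be illustrated with a minimal Python sketch. It is not the Koblenz implementation (which generates PROLOG code); the toy grammar, the names `RULES`, `EPS` and `END`, and the fixed-point iteration are assumptions made for illustration only. The sketch computes first and follow sets over preterminals and collects, for every pair of nonterminal and lookahead symbol, the rule(s) by which the preterminal can be reached.

```python
from collections import defaultdict

EPS, END = "eps", "$"   # hypothetical markers for the empty string and end of input

# Hypothetical toy grammar over preterminals DET, N, PN, V.
RULES = [
    ("S", ("NP", "VP")),
    ("NP", ("DET", "N")),
    ("NP", ("PN",)),
    ("VP", ("V", "OBJ")),
    ("OBJ", ("NP",)),
    ("OBJ", ()),          # empty production: the object is optional
]
NONTERMINALS = {lhs for lhs, _ in RULES}


def first_of_string(symbols, first):
    """FIRST of a symbol string, given the FIRST sets found so far."""
    result = set()
    for sym in symbols:
        sym_first = first[sym] if sym in NONTERMINALS else {sym}
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result
    result.add(EPS)       # all symbols (or the empty string) can derive eps
    return result


def compute_first():
    first = defaultdict(set)
    changed = True
    while changed:        # fixed-point iteration; left recursion needs no special case
        changed = False
        for lhs, rhs in RULES:
            new = first_of_string(rhs, first)
            if not new <= first[lhs]:
                first[lhs] |= new
                changed = True
    return first


def compute_follow(first, start="S"):
    follow = defaultdict(set)
    follow[start].add(END)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in RULES:
            for i, sym in enumerate(rhs):
                if sym not in NONTERMINALS:
                    continue
                rest = first_of_string(rhs[i + 1:], first)
                new = rest - {EPS}
                if EPS in rest:           # nothing obligatory follows sym in this rule
                    new |= follow[lhs]
                if not new <= follow[sym]:
                    follow[sym] |= new
                    changed = True
    return follow


def compute_reach(first, follow):
    """Map (nonterminal, lookahead) to the list of rules that can reach the lookahead."""
    reach = defaultdict(list)
    for rule in RULES:
        lhs, rhs = rule
        rhs_first = first_of_string(rhs, first)
        lookaheads = rhs_first - {EPS}
        if EPS in rhs_first:              # rule can derive eps: use FOLLOW(lhs)
            lookaheads |= follow[lhs]
        for a in lookaheads:
            reach[(lhs, a)].append(rule)
    return reach


FIRST = compute_first()
FOLLOW = compute_follow(FIRST)
REACH = compute_reach(FIRST, FOLLOW)
```

Each table cell holds a list rather than a single rule, so that grammars which are not LL(1) simply yield several candidate rules for the same cell.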

There are a number of grammatically predefined f-descriptions which can be preprocessed in advance, independently of the actual input, reducing the number of unifications at run time. Preliminary unification of f-structures can be carried out in the following configurations:

• If an f-description subsumes another f-description, the subsumed f-description can be regarded as already unified and dropped. In the execution phase the system will use only the subsuming (i.e. larger) f-description. E.g. if a dictionary entry in the PROLOG code produced in the preprocessing phase has the specification (↑SUBJ NUM) = SG and simultaneously !(↑SUBJ), the latter can be safely dropped in order to avoid the vacuous unification of the explicit subject in the execution phase.
• If an f-description is unified with new attributes, hitherto not used in the grammar, the operation will always succeed, regardless of the actual value of the attributes. Unifications of this type can be carried out safely in advance, regardless of possible later changes of the attribute value.
• There are further minor f-structure configurations which can be simplified before the actual unification in the execution phase.

The current optimization will recognize some of these special cases and replace the general unification procedures by specialized and hence more restricted procedures already at the time of code generation. The general broad unification procedures (merge functions) are substituted here by more specific and computationally less expensive procedures.

Figure 2: The run time system (recoverable components: monitor with help facilities, pretty printer and traces; lexicon lookup; construction of the reach relations table; LL(1)-parser; unification and incremental construction of f-structures; completeness test)
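
The first configuration listed above, dropping a subsumed f-description, can be sketched in Python as follows. The representation is hypothetical: an f-description is a pair of an attribute path and a value, and `None` as value stands for a requirement that the path merely exists, which is how the specification !(↑SUBJ) is read here (an assumption). The function name `prune_subsumed` is likewise made up for this illustration.

```python
def prune_subsumed(descriptions):
    """Drop existence requirements already guaranteed by a defining equation."""
    defining_paths = [path for path, value in descriptions if value is not None]
    kept = []
    for path, value in descriptions:
        is_existential = value is None
        # The requirement is implied if some defining equation extends its path.
        implied = any(p[:len(path)] == path for p in defining_paths)
        if is_existential and implied:
            continue
        kept.append((path, value))
    return kept


entry = [
    (("SUBJ", "NUM"), "SG"),   # (↑SUBJ NUM) = SG
    (("SUBJ",), None),         # !(↑SUBJ), read as: a subject must exist
    (("TENSE",), "PRES"),      # unrelated equation, kept as it is
]
pruned = prune_subsumed(entry)
```

After pruning, only the two defining equations remain, so the run time system has one unification less to perform for this entry.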

The Run Time System

Firstly, the run time system can be characterized by the basic separation of lexicon lookup and actual parsing. The separation of lexicon rules and syntactic rules is based on the linguistic insight that the two components (lexicon and grammar) reflect entirely different language properties. The division can also be supported by considerations of processing efficiency.

The lexicon lookup is carried out at the beginning of the processing, and it immediately allows the rejection of input in case of missing entries in the lexicon. The user can enter another word on the spot and proceed with the processing of the same sentence.
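
A minimal Python sketch of such an up-front lookup is given below. The lexicon contents and the function name `lookup_sentence` are hypothetical; the point is only that all words are checked before any parsing starts, so that missing entries are reported at once.

```python
# Hypothetical lexicon mapping word forms to preterminal categories.
LEXICON = {"the": "DET", "dog": "N", "barks": "V"}


def lookup_sentence(sentence, lexicon):
    """Return (preterminal list, missing words); parsing starts only if nothing is missing."""
    words = sentence.split()
    missing = [w for w in words if w not in lexicon]
    if missing:
        return None, missing          # reject before any parsing effort is spent
    return [(w, lexicon[w]) for w in words], []


tagged, missing = lookup_sentence("the dog barks", LEXICON)
rejected, unknown = lookup_sentence("the cat barks", LEXICON)
```

In an interactive setting the caller could prompt the user for a replacement for each word in `unknown` and repeat the lookup on the corrected sentence.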

The next step is the inspection of the LL(1) tables by means of which the reach relations are established. The table of reach relations provides the optimal subset of grammatical symbols and connects them to the lexical entries occurring in the actual input sentence.
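
How a table of this kind steers the analysis can be sketched with a small table-driven recognizer in Python. The table below is hand-written for a hypothetical toy grammar and, as a simplifying assumption, holds exactly one rule per cell; a table with several rules per cell would additionally require backtracking over the alternatives.

```python
# Hypothetical table: (nonterminal on the stack, lookahead preterminal) -> right-hand side.
REACH = {
    ("S", "DET"): ("NP", "VP"),
    ("S", "PN"): ("NP", "VP"),
    ("NP", "DET"): ("DET", "N"),
    ("NP", "PN"): ("PN",),
    ("VP", "V"): ("V", "OBJ"),
    ("OBJ", "DET"): ("NP",),
    ("OBJ", "PN"): ("NP",),
    ("OBJ", "$"): (),                 # optional object expands to nothing at end of input
}
NONTERMINALS = {"S", "NP", "VP", "OBJ"}


def recognise(preterminals, start="S"):
    """Accept the preterminal sequence if the table leads to a complete analysis."""
    tokens = list(preterminals) + ["$"]
    stack = [start]
    pos = 0
    while stack:
        top = stack.pop()
        look = tokens[pos]
        if top in NONTERMINALS:
            rhs = REACH.get((top, look))
            if rhs is None:           # no rule reaches the lookahead: reject at once
                return False
            stack.extend(reversed(rhs))
        elif top == look:             # preterminal on the stack matches the input
            pos += 1
        else:
            return False
    return tokens[pos] == "$"
```

The recognizer never tries a rule that cannot reach the next preterminal, which is the saving the table is meant to provide over trying the rules in their textual order.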

Secondly, the run time system is characterized by the single-pass strategy of processing, i.e. the input is read in only once, merging two fundamental tasks of LFG in a single step: 1. the construction of the c-structures and 2. the unification of the f-structures.

A special treatment is necessary for left-recursive constructions. The entries in the LL(1) table for potential left recursions may be used only as long as the repetition is not spurious; otherwise their further application is suspended.

At the time of the processing of phrase structure rules, the associated functional description is processed immediately. At this point the nodes relevant to the functional assignments are easily accessible as the left-hand side symbol (for the metavariable ↑) and the right-hand side symbols (for the metavariable ↓) of the rules.

As the input is processed, the f-structure is constructed incrementally, step by step. All available attributes and values are merged together as soon as they emerge, which is efficient for at least two reasons: 1. there is no need to store and reprocess the accumulated f-equations in an additional step, and 2. merging the f-descriptions incrementally operates with smaller chunks, which implies faster unification.
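
The incremental merging can be illustrated by a minimal Python sketch, assuming that f-structures are represented as nested dictionaries with atomic values as strings. The function names and the sample contributions are hypothetical, and the sketch ignores reentrancy and constraining equations.

```python
def unify(f, g):
    """Return the unification of two f-structures (nested dicts), or None on a clash."""
    result = dict(f)
    for attr, value in g.items():
        if attr not in result:
            result[attr] = value
        elif isinstance(result[attr], dict) and isinstance(value, dict):
            merged = unify(result[attr], value)
            if merged is None:
                return None
            result[attr] = merged
        elif result[attr] != value:   # conflicting atomic values
            return None
    return result


def build_incrementally(contributions):
    """Merge each contribution as soon as it becomes available."""
    f_structure = {}
    for piece in contributions:
        f_structure = unify(f_structure, piece)
        if f_structure is None:
            return None               # the clash is detected at the word that causes it
    return f_structure


# Hypothetical contributions for "the dog barks", in the order the words are read.
pieces = [
    {"SUBJ": {"SPEC": "DEF"}},
    {"SUBJ": {"PRED": "dog", "NUM": "SG"}},
    {"PRED": "bark<SUBJ>", "TENSE": "PRES", "SUBJ": {"NUM": "SG"}},
]
result = build_incrementally(pieces)
clash = build_incrementally([{"SUBJ": {"NUM": "PL"}}, {"SUBJ": {"NUM": "SG"}}])
```

Because each merge involves only the small contribution of the current rule or word, a failing analysis is abandoned at the first incompatible piece rather than after the whole sentence has been read.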

The incremental processing means that at the end of the input sentence the analysis is complete and solved and does not need to be scanned again in order to solve a series of f-equations. There is only one single control operation at the end of the sentence checking the wellformedness (completeness and exhaustiveness) of the output.
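
One part of such a final control operation, the completeness check, might look as follows in Python. This is a sketch under strong assumptions: f-structures are nested dictionaries, and the functions required by a predicate are listed under a hypothetical attribute `GOVERNS`; the paper does not specify how the check is implemented.

```python
def is_complete(f_structure):
    """True if every function listed under GOVERNS is present, at every level."""
    for function in f_structure.get("GOVERNS", ()):
        if function not in f_structure:
            return False
    return all(
        is_complete(value)
        for value in f_structure.values()
        if isinstance(value, dict)
    )


complete = {"PRED": "bark", "GOVERNS": ("SUBJ",), "SUBJ": {"PRED": "dog"}}
incomplete = {"PRED": "see", "GOVERNS": ("SUBJ", "OBJ"), "SUBJ": {"PRED": "dog"}}
```

The check walks the finished f-structure once and does not solve any equations, in line with the single control operation described above.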

The single-pass model therefore differs from the Kaplan-Bresnan model by lacking a separate processing phase for the accumulated f-structures following the generation of c-structures. In fact there is no explicit need for retaining the c-structures, except for their possible display in tutorials and in tracing erroneous productions while testing the rules of the input grammar.

The current implementation delivers both the c-structure and the f-structure of the input sentence. In case of multiple interpretations all c-structures and all valid f-structures are displayed in succession.

Adequacy and Efficiency of Grammars

LFG grammars have mostly been studied from the point of view of linguistic adequacy, i.e. they have been developed in order to cover substantial aspects of natural language syntax phenomena. The parser should help the working linguist to find the optimal grammar for a particular language and to test the individual rules of the grammar as well as the general formalism.

Parsing efficiency can be studied at least at three different levels:

1. in view of the efficiency of developing grammars (concerns the work of the linguist),
2. in view of optimizing the processing of the input grammar (concerns the preprocessing phase),
3. in view of optimizing the processing of the input sentence (concerns the execution phase and the run time system).

The present study is concerned mostly with the third type of efficiency, i.e. with improving parsing efficiency, presupposing linguistic adequacy of the model. Practical efficiency of standard parsing techniques depends on the amount of backtracking and reprocessing needed in cases of erroneous analysis. While using standard recursive descent parsing techniques guarantees the comprehensive coverage of the whole language, it does not exploit available information in an optimal manner. Parsing tables with REACH relations entail more information than single phrase structure rules: they accumulate information on the distribution of symbols in the whole grammar and enable the parser to build up the correct c-structure at the first attempt. If for some reason no valid structure is produced, ordinary backtracking applies and the input string is reprocessed.

In the final version of the paper further details of current improvements will be reported.

Literature used

Aho, Alfred V. and Ullman, Jeffrey D. (1979) Principles of Compiler Design. Reading, Massachusetts et al., Addison-Wesley Publishing Company.

Bresnan, Joan (1982) (ed.) The mental representation of grammatical relations. Cambridge, Massachusetts, The MIT Press.

Hellwig, Peter (in print) Parsing natürlicher Sprachen: Grundlagen und Realisierungen. In: Bátori, I., Lenders, W. and Putschke, W. (eds): Computational Linguistics - An International Handbook on Computer Oriented Language Research and Applications. Berlin, Walter de Gruyter.

Kindermann, Jörg and Meier, Justus (1986) An extension of LR-Parsing for Lexical-Functional Grammar. Universität Bielefeld, Fakultät LiLi, Forschungsschwerpunkt Sprach- und Textverarbeitung. (To be published in: Reyle, U. (ed.): Word Order and Parsing in Unification Grammars).

Phillips, Brian (1984) An object-oriented parser. In: Bara, Bruno G. and Guida, Giovanni (eds.) Computational Models of Natural Language Processing. Amsterdam et al., North-Holland, 297-321.

Tomita, Masaru (1987) An Efficient Augmented-Context-Free Parsing Algorithm. CL 13:31-46.

Weisweber, Wilhelm (1986) Ein Parsergenerator für die lexical functional grammar (LFG). EWH Rheinland-Pfalz - Abteilung Koblenz - Fachberichte Informatik 4/86.

Winograd, Terry (1983) Language as a Cognitive Process - Syntax. Reading, Massachusetts et al., Addison-Wesley Publishing Company.