<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:ns2="http://www.tei-c.org/ns/Examples">
    <teiHeader>
        <fileDesc>
            <titleStmt>
                <title>Feasible Learnability of Formal Grammars and the Theory of Natural Language Acquisition</title>
            </titleStmt>
        </fileDesc>
    </teiHeader>
    <text>
        <front/>
        <body>
            <div>
                <p>Naoki ABE, Department of Computer and Information Science</p>
                <p>University of Pennsylvania</p>
                <p>Philadelphia, PA 19104-6389</p>
                <p>Abstract</p>
                <p>We propose to apply a complexity theoretic notion of feasible learnability called &quot;polynomial learnability&quot; to the evaluation of grammatical formalisms for linguistic description. Polynomial learnability was originally defined by Valiant in the context of boolean concept learning and subsequently generalized by Blumer et al. to infinitary domains. We give a clear, intuitive exposition of this notion of learnability and of what characteristics of a collection of languages may or may not help feasible learnability under this paradigm. In particular, we present a novel, nontrivial constraint on the degree of &quot;locality&quot; of grammars which allows a rich class of mildly context sensitive languages to be feasibly learnable. We discuss possible implications of this observation to the theory of natural language acquisition.</p>
                <p>1. Introduction</p>
                <p>A central issue of linguistic theory is the &quot;projection problem&quot;, which was originally proposed by Noam Chomsky [?] and subsequently led to much of the development in modern linguistics. This problem poses the question: &quot;How is it possible for human infants to acquire their native language on the basis of casual exposure to limited data in a short amount of time?&quot; The proposed solution is that the human infant in effect &quot;knows&quot; what the natural language that it is trying to learn could possibly be. Another way to look at it is that there is a relatively small set of possible grammars that it would be able to learn, and its learning strategy, implicitly or explicitly, takes advantage of this a priori knowledge. The goal of linguistic theory, then, is to characterize this set of possible grammars by specifying the constraints, often called the &quot;Universal Grammar&quot;.</p>
                <p>The theory of inductive inference offers a precise solution to this problem, by characterizing exactly what collections of (or, dually, &quot;constraints on&quot;) languages satisfy the requirement for being the set of possible grammars, i.e. are learnable. A theory of &quot;feasible&quot; inference is particularly interesting because the language acquisition process of a human infant is feasible, not to mention its relevance to the technological counterpart of such a problem.</p>
                <p>In this paper, we investigate the learnability of formal grammars for linguistic description with respect to a complexity theoretic notion of feasible learnability called 'polynomial learnability'. Polynomial learnability was originally developed by Valiant [?], [?] in the context of learning boolean concepts from examples, and subsequently generalized by Blumer et al. for arbitrary concepts [?]. We apply this criterion of feasible learnability to subclasses of formal grammars that are of considerable linguistic interest. Specifically, we present a novel, nontrivial constraint on grammars called &quot;k-locality&quot;, which enables a rich class of mildly context sensitive grammars called Ranked Node Rewriting Grammars (RNRG's) to be feasibly learnable. We discuss possible implications of this result to the theory of natural language acquisition.</p>
            </div>
            <div1>
                <head xml:id="sec2">Polynomial Learnability</head>
                <div2>
                    <head xml:id="sec2ol">Formal Modeling of Learning</head>
                    <p>What constitutes a good model of the learning behavior? Below we list five basic elements that any formal model of learning must contain (cf. [13]).</p>
                    <p>1. Objects to be learned: Let us call them 'knacks' for full generality. The question of learnability is asked of a collection of knacks.</p>
                    <p>2. Environment: The way in which 'data' are available to the learner.</p>
                    <p>3. Hypotheses: Descriptions for 'knacks', usually expressed in a certain language.</p>
                    <p>4. Learners: In general, functions from data to hypotheses.</p>
                    <p>5. Criterion of Learning: Defines precisely what is meant by the question: when is a learner said to 'learn' a given collection of 'knacks' on the basis of data obtained through the environment?</p>
                    <p>In most cases 'knacks' can be thought of as subsets of some universe (set) of objects, from which examples are drawn.¹ (Such a set is often called the 'domain' of the learning problem.) The obvious example is the definition of what a language is in the theory of natural language syntax. Syntactically, the English language is nothing but the set of all grammatical sentences, although this is subject to much philosophical controversy. The corresponding mathematical notion of a formal language is one that is free of such a controversy. A formal language is a subset of the set of all strings in Σ* for some alphabet Σ. Clearly Σ* is the domain. The characterization of a knack as a subset of a universe is in fact a very general one. For example, a boolean concept of n variables is a subset of the set of all assignments to those n variables, often written <formula notation="TeX">2^{n}</formula>. Positive examples in this case are assignments to the n variables which 'satisfy' the concept in question.</p>
                    <p>When the 'knacks' under consideration can in fact be thought of as subsets of some domain, the overall picture of a learning model looks like the one given in Figure 1.</p>
                    <p>[Figure 1: A Learning Model. Components shown: the Environment, the Knacks, the Hypotheses, the Domain, the Learner, the Criterion.]</p>
                    <note n="1" place="below">First order structures are an example in which languages are more than just subsets of some set [14].</note>
                    <p>2.2 Polynomial Learnability</p>
                    <p>Polynomial learnability departs from the classic paradigm of language learning, 'identification in the limit',² in at least two important aspects. It enforces a higher demand on the time complexity by requiring that the learner converge in polynomial time, but on the other hand relaxes the criterion of what constitutes a 'correct' grammar by employing an approximate and probabilistic notion of correctness, or accuracy to be precise. Furthermore, this notion of correctness is intricately tied to both the time complexity requirement and the way in which the environment presents examples to the learner. Specifically, the environment is assumed to present to the learner examples from the domain with respect to an unknown (to the learner) but fixed probability distribution, and the accuracy of a hypothesis is measured with respect to that same probability distribution. This way, the learner is, so to speak, protected from 'bad' presentations of a knack. We now make these ideas precise by specifying the five essential parameters of this learning paradigm.</p>
                    <p>1. Objects to be learned are languages, or subsets of Σ* for some fixed alphabet Σ. Although we do not specify a priori the language in which to express these grammars,³ for each collection of languages <formula notation="TeX">\mathcal{L}</formula> of which we ask the learnability, we fix a class of grammars <formula notation="TeX">\mathcal{G}</formula> (such that <formula notation="TeX">L(\mathcal{G}) = \mathcal{L}</formula>, where we write <formula notation="TeX">L(\mathcal{G})</formula> to mean <formula notation="TeX">\{L(G) \mid G \in \mathcal{G}\}</formula>) with respect to which we will define the notion of 'complexity' or 'size' of a language. We take the number of bits it takes to write down a grammar under a reasonable,⁴ fixed encoding scheme to be the size of the grammar. The size of a language is then defined as the size of a minimal grammar for it. (For a language L, we write size(L) for its size.)</p>
                    <p>2. The environment produces a string in Σ* with a time-invariant probability distribution unknown to the learner, pairs it with either 0 or 1 depending on whether the string is in the language in question or not, and gives it to the learner. It repeats this process indefinitely.</p>
                    <p>3. The hypotheses are expressed as grammars. The class of grammars allowed as hypotheses, say <formula notation="TeX">\mathcal{H}</formula>, is not necessarily required to generate exactly the class <formula notation="TeX">\mathcal{L}</formula> of languages to be learned. In general, when a collection <formula notation="TeX">\mathcal{L}</formula> can be learned by a learner which only outputs hypotheses from a class <formula notation="TeX">\mathcal{H}</formula>, we say that <formula notation="TeX">\mathcal{L}</formula> is learnable by <formula notation="TeX">\mathcal{H}</formula>, and in particular, when <formula notation="TeX">\mathcal{L} = L(\mathcal{G})</formula> is learnable by <formula notation="TeX">\mathcal{G}</formula>, the class of representations <formula notation="TeX">\mathcal{G}</formula> is said to be properly learnable. (See [6].)</p>
                    <p>4. Learners passively receive an infinite sequence of positive and negative examples in the manner described above, and at each initial (finite) segment of such a sequence, output a hypothesis. In other words, they are functions from finite sequences of positive and negative examples⁵ to grammars.</p>
                    <p>5. A learning function is said to polynomially learn a collection of languages just in case it is computable in time polynomial in the length of the input sample, and for arbitrary degrees of accuracy ε and confidence δ, its output on a sample produced by the environment in the manner described above for any language L in that collection will be an ε-approximation of the unknown language L with confidence probability at least 1 − δ, no matter what the unknown distribution is, as long as the number of strings in the sample exceeds <formula notation="TeX">p(\epsilon^{-1}, \delta^{-1}, \mathrm{size}(L))</formula> for some fixed polynomial p. Here, a grammar G is an ε-approximation of a language L if the probability that the distribution assigns to the symmetric difference⁶ of L and L(G) is at most ε.</p>
                    <note n="2" place="below">Identification in the limit was originally proposed and studied by Gold [8], and has subsequently been generalized in many different ways. See for example [13] for a comprehensive treatment of this and related paradigms.</note>
                    <note n="3" place="below">Potentially any Turing program could be a hypothesis.</note>
                    <note n="4" place="below">By a reasonable encoding, we mean one which can represent n different grammars using O(log n) bits.</note>
                    <note n="5" place="below">In the sequel, we shall call them 'labeled samples'.</note>
                    <note n="6" place="below">The symmetric difference between two sets A and B is (A − B) ∪ (B − A).</note>
                    <p>2.3 Occam Algorithm</p>
                    <p>Blumer et al. [5] have shown an extremely interesting result revealing a connection between reliable data compression and polynomial learnability. Occam's Razor is a principle in the philosophy of science which stipulates that a shorter theory is to be preferred as long as it remains adequate. Blumer et al. define a precise version of such a notion in the present context of learning, which they call an Occam Algorithm, and establish a relation between the existence of such an algorithm and polynomial learnability: if there exists a polynomial time algorithm which reliably &quot;compresses&quot; any sample of any language in a given collection to a provably small consistent grammar for it, then such an algorithm polynomially learns that collection in the limit. We state this theorem in a slightly weaker form.</p>
                    <p>Definition 2.1 Let <formula notation="TeX">\mathcal{L}</formula> be a language collection with associated representation <formula notation="TeX">\mathcal{H}</formula> with size function &quot;size&quot;. (Define a sequence of subclasses of <formula notation="TeX">\mathcal{H}</formula> by <formula notation="TeX">\mathcal{H}_{n} = \{G \in \mathcal{H} \mid \mathrm{size}(G) \le n\}</formula>.) Then <formula notation="TeX">\mathcal{A}</formula> is an Occam algorithm for <formula notation="TeX">\mathcal{L}</formula> with range size f(n, m) if and only if:</p>
                    <p><formula notation="TeX">\forall L \in \mathcal{L}</formula>, <formula notation="TeX">\forall S \subset \mathrm{graph}(L)</formula>,⁷ if size(L) = n and |S| = m then <formula notation="TeX">\mathcal{A}(S)</formula> is consistent with S and <formula notation="TeX">\mathcal{A}(S) \in \mathcal{H}_{f(n,m)}</formula>, and <formula notation="TeX">\mathcal{A}</formula> runs in time polynomial in the length of S.</p>
                    <p>Theorem 2.1 (Blumer et al.) If <formula notation="TeX">\mathcal{A}</formula> is an Occam algorithm for <formula notation="TeX">\mathcal{L}</formula> with range size <formula notation="TeX">f(n,m) = O(n^{k} m^{\alpha})</formula> for some <formula notation="TeX">k \ge 1</formula> and <formula notation="TeX">0 &lt; \alpha &lt; 1</formula>, then <formula notation="TeX">\mathcal{A}</formula> polynomially learns <formula notation="TeX">\mathcal{L}</formula> in the limit.</p>
                    <p>We give below an intuitive explication of why an Occam Algorithm polynomially learns in the limit. Suppose <formula notation="TeX">\mathcal{A}</formula> is an Occam Algorithm for <formula notation="TeX">\mathcal{L}</formula>, and let <formula notation="TeX">L \in \mathcal{L}</formula> be the language to be learned, and n its size. Then for an arbitrary sample for L of an arbitrary size, a minimal consistent language for it will never have size larger than size(L) itself. Hence the output of <formula notation="TeX">\mathcal{A}</formula> on a sample of size m will always be one of the hypotheses in <formula notation="TeX">\mathcal{H}_{f(n,m)}</formula>, whose cardinality is at most <formula notation="TeX">2^{f(n,m)}</formula>. As the sample size m grows, its effect on the probability that any consistent hypothesis in <formula notation="TeX">\mathcal{H}_{f(n,m)}</formula> is accurate will (polynomially) soon dominate that of the growth of the hypothesis class, whose log-cardinality f(n, m) is less than linear in the sample size.</p>
                    <note n="7" place="below">For any language L, <formula notation="TeX">\mathrm{graph}(L) = \{\langle x, 0\rangle \mid x \in L\} \cup \{\langle x, 1\rangle \mid x \notin L\}</formula>.</note>
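                    <p>The counting argument behind Theorem 2.1 can be sketched numerically. The snippet below is only a minimal illustration and not the bound proved by Blumer et al.: it assumes a hypothetical range size f(n, m) = c·n^k·m^α with made-up constants (c, k, alpha), and it uses nothing more than the elementary union bound, namely that the chance that some hypothesis with error above ε is still consistent with m independent examples is at most 2^f(n,m)·(1 − ε)^m. All function names are illustrative.</p>
                    <eg xml:space="preserve"><![CDATA[
```python
import math


def range_size(n, m, c=1.0, k=1, alpha=0.5):
    """Hypothetical range size f(n, m) = c * n**k * m**alpha, in bits."""
    return c * (n ** k) * (m ** alpha)


def log_failure_bound(n, m, eps, **kw):
    """Natural log of the union bound 2**f(n, m) * (1 - eps)**m."""
    return range_size(n, m, **kw) * math.log(2) + m * math.log(1.0 - eps)


def sufficient_sample_size(n, eps, delta, **kw):
    """Smallest m >= 1 for which the union bound is at most delta."""
    target = math.log(delta)
    hi = 1
    while log_failure_bound(n, hi, eps, **kw) > target:
        hi *= 2  # the linear term eventually wins because alpha < 1
    lo = hi // 2  # the bound still fails at lo (or lo == 0)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if log_failure_bound(n, mid, eps, **kw) > target:
            lo = mid
        else:
            hi = mid
    return hi


if __name__ == "__main__":
    # Language of size 10 bits, accuracy 0.1, confidence 0.95.
    print(sufficient_sample_size(n=10, eps=0.1, delta=0.05))
```
]]></eg>
                    <p>Because the exponent f(n, m) grows less than linearly in m while the term m·ln(1 − ε) decreases linearly, the doubling loop always terminates; this is the sense in which the sample size soon dominates the growth of the hypothesis class.</p>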
                </div2>
                <div2>
                    <head xml:id="sec3">Ranked Node Rewriting Grammars</head>
                    <p>In this section, we define the class of mildly context sensitive grammars under consideration, or Ranked Node Rewriting Grammars (RNRG's). RNRG's are based on the underlying ideas of Tree Adjoining Grammars (TAG's)⁸ and are also a special case of context free tree grammars [15] in which unrestricted use of variables for moving, copying and deleting is not permitted. In other words, each rewriting in this system replaces a &quot;ranked&quot; nonterminal node of, say, rank j with an &quot;incomplete&quot; tree containing exactly j edges that have no descendants. If we define a hierarchy of languages generated by subclasses of RNRG's having nodes and rules with bounded rank j (<formula notation="TeX">RNRL_{j}</formula>), then <formula notation="TeX">RNRL_{0} = CFL</formula> and <formula notation="TeX">RNRL_{1} = TAL</formula>.⁹ We formally define these grammars below.</p>
                    <p>Definition 3.1 (Preliminaries) The following definitions are necessary for the sequel. (i) The set of labeled directed trees over an alphabet Σ is denoted <formula notation="TeX">\mathcal{T}_{\Sigma}</formula>. (ii) The rank of an &quot;incomplete&quot; tree is the number of outgoing edges with no descendants. (iii) The rank of a node is the number of outgoing edges. (iv) The rank of a symbol is defined if the rank of any node labeled by it is always the same, and equals that rank. (v) A ranked alphabet is one in which every symbol has a rank. (vi) We write rank(x) for the rank of anything x, if it is defined.</p>
                    <p>Definition 3.2 (Ranked Node Rewriting Grammars) A ranked node rewriting grammar G is a quintuple <formula notation="TeX">\langle \Sigma_{N}, \Sigma_{T}, \#, I_{G}, R_{G} \rangle</formula> where: (i) <formula notation="TeX">\Sigma_{N}</formula> is a ranked nonterminal alphabet. (ii) <formula notation="TeX">\Sigma_{T}</formula> is a terminal alphabet disjoint from <formula notation="TeX">\Sigma_{N}</formula>. We let <formula notation="TeX">\Sigma = \Sigma_{N} \cup \Sigma_{T}</formula>. (iii) # is a distinguished symbol distinct from any member of Σ, indicating &quot;an outgoing edge with no descendant&quot;.¹⁰ (iv) <formula notation="TeX">I_{G}</formula> is a finite set of labeled trees over Σ. We refer to <formula notation="TeX">I_{G}</formula> as the &quot;initial trees&quot; of the grammar. (v) <formula notation="TeX">R_{G}</formula> is a finite set of rewriting rules: <formula notation="TeX">R_{G} \subset \{\langle A, \alpha \rangle \mid A \in \Sigma_{N} \text{ and } \alpha \in \mathcal{T}_{\Sigma \cup \{\#\}} \text{ and } \mathrm{rank}(A) = \mathrm{rank}(\alpha)\}</formula>. (In the sequel, we write <formula notation="TeX">A \to \alpha</formula> for the rewriting rule <formula notation="TeX">\langle A, \alpha \rangle</formula>.) (vi) <formula notation="TeX">\mathrm{rank}(G) = \max\{\mathrm{rank}(A) \mid A \in \Sigma_{N}\}</formula>.</p>
                    <p>We emphasize that the nonterminal vs. terminal distinction above does not coincide with the internal node vs. frontier node distinction. (See Examples 3.1 - 3.3.) Having defined the notions of 'rewriting' and 'derivation' in the obvious manner, the tree language of a grammar is then defined as the set of trees over the terminal alphabet which can be derived from the grammar.¹¹ This is analogous to the way the string language of a rewriting grammar in the Chomsky hierarchy is defined.</p>
                    <p>Definition 3.3 (Tree Languages and String Languages) The tree language and string language of a RNRG G, denoted T(G) and L(G) respectively, are defined as follows: T(G) is the set of trees in <formula notation="TeX">\mathcal{T}_{\Sigma_{T}}</formula> that are derivable in G from some initial tree in <formula notation="TeX">I_{G}</formula>, and <formula notation="TeX">L(G) = \{\mathrm{yield}(\delta) \mid \delta \in T(G)\}</formula>.</p>
                    <note n="8" place="below">Tree adjoining grammars were introduced as a formalism for linguistic description by Joshi et al. [10], [9]. Various formal and computational properties of TAG's were studied in [17]. Their linguistic relevance was demonstrated in [12].</note>
                    <note n="9" place="below">This hierarchy is different from the hierarchy of &quot;meta-TAL's&quot; invented and studied extensively by Weir in [20].</note>
                    <note n="10" place="below">In context free tree grammars in [15], variables are used in place of #. These variables can then be used in rewriting rules to move, copy, or erase subtrees. It is this restriction of avoiding such use of variables that keeps RNRG's within the class of efficiently recognizable rewriting systems called &quot;Linear context free rewriting systems&quot; ([18]).</note>
                    <note n="11" place="below">This is how an &quot;obligatory adjunction constraint&quot; in the tree adjoining grammar formalism can be simulated.</note>
                </div2>
                <div2>
                    <head xml:id="sec9:"></head>
                    <p>If we now define a hierarchy of languages generated by subclasses of RNRG's with bounded ranks, context free languages (CFL) and tree adjoining languages (TAL) constitute the first two members of the hierarchy.</p>
                    <p>Definition 3.4 For each <formula notation="TeX">j \in N</formula>, <formula notation="TeX">RNRG_{j} = \{G \mid G \in RNRG \text{ and } \mathrm{rank}(G) \le j\}</formula>. For each <formula notation="TeX">j \in N</formula>, <formula notation="TeX">RNRL_{j} = \{L(G) \mid G \in RNRG_{j}\}</formula>.</p>
                    <p>Theorem 3.1 <formula notation="TeX">RNRL_{0} = CFL</formula> and <formula notation="TeX">RNRL_{1} = TAL</formula>.</p>
                    <p>We now give some examples of grammars in this hierarchy,¹² which also illustrate the way in which the weak generative capacity of different levels of this hierarchy increases progressively.</p>
                    <p>Example 3.1 <formula notation="TeX">L_{1} = \{a^{n} b^{n} \mid n \in N\} \in CFL</formula> is generated by the following <formula notation="TeX">RNRG_{0}</formula> grammar, where α is shown in Figure 2. <formula notation="TeX">G_{1} = \langle \{S\}, \{s, a, b\}, \#, \{S\}, \{S \to \alpha, S \to s(\lambda)\} \rangle</formula></p>
                    <p>Example 3.2 <formula notation="TeX">L_{2} = \{a^{n} b^{n} c^{n} d^{n} \mid n \in N\} \in TAL</formula> is generated by the following <formula notation="TeX">RNRG_{1}</formula> grammar, where β is shown in Figure 2. <formula notation="TeX">G_{2} = \langle \{S\}, \{s, a, b, c, d\}, \#, \{S(\lambda)\}, \{S \to \beta, S \to s(\#)\} \rangle</formula></p>
                    <p>Example 3.3 <formula notation="TeX">L_{3} = \{a^{n} b^{n} c^{n} d^{n} e^{n} f^{n} \mid n \in N\} \notin TAL</formula> is generated by the following <formula notation="TeX">RNRG_{2}</formula> grammar, where γ is shown in Figure 2. <formula notation="TeX">G_{3} = \langle \{S\}, \{s, a, b, c, d, e, f\}, \#, \{S(\lambda, \lambda)\}, \{S \to \gamma, S \to s(\#, \#)\} \rangle</formula></p>
                    <p>[Figure 2: α, β, γ and deriving 'aabbccddeeff' by <formula notation="TeX">G_{3}</formula>.]</p>
                </div2>
                <div2>
                    <head xml:id="sec4">K-Local Grammars</head>
                    <p>The notion of 'locality' of a grammar we define in this paper is a measure of how much global dependency there is within the grammar. By global dependency within a grammar, we mean the interactions that exist between different rules and nonterminals in the grammar. As is intuitively clear, allowing an unbounded amount of global interaction is a major, though not the only, cause of a combinatorial explosion in a search for a right grammar. K-locality limits the amount of such interaction by bounding the number of different rules that can participate in any single derivation.</p>
                    <note n="12" place="below">Simpler trees are represented as term structures, whereas more involved trees are shown in the figure. Also note that we use uppercase letters for nonterminals and lowercase for terminals.</note>
                    <note n="13" place="below">Some linguistic motivations of this extension of TAG's are argued for by the author in [1].</note>
                    <p>Formally, the notion of &quot;k-locality&quot; of a grammar is defined with respect to a formulation of derivations due originally to Vijay-Shankar, Weir, and Joshi ([19]), which is a generalization of the notion of parse trees for CFG's. In their formulation, a derivation is a tree recording the history of rewritings. The root of a derivation tree is labeled with an initial tree, and the rest of the nodes with rewriting rules. Each edge corresponds to a rewriting; the edge from a rule (host rule) to another rule (applied rule) is labeled with the address of the node in the host tree at which the rewriting takes place.</p>
                    <p>The degree of locality of a derivation is the number of distinct kinds of rewritings that appear in it. In terms of a derivation tree, the degree of locality is the number of different kinds of edges in it, where two edges are equivalent just in case the two end nodes are labeled by the same rules, and the edges themselves are labeled by the same node address.</p>
                    <p>Definition 4.1 Let <formula notation="TeX">\mathcal{D}(G)</formula> denote the set of all derivation trees of G, and let <formula notation="TeX">\tau \in \mathcal{D}(G)</formula>. Then the degree of locality of τ, written locality(τ), is defined as follows: <formula notation="TeX">\mathrm{locality}(\tau) = \mathrm{card}\{\langle p, q, \eta \rangle \mid \text{there is an edge in } \tau \text{ from a node labeled with } p \text{ to another labeled with } q \text{, and the edge is itself labeled with } \eta\}</formula>. The degree of locality of a grammar is the maximum of those of all its derivations.</p>
                    <p>Definition 4.2 A RNRG G is called k-local if <formula notation="TeX">\max\{\mathrm{locality}(\tau) \mid \tau \in \mathcal{D}(G)\} \le k</formula>. We write k-Local-RNRG = {G | G ∈ RNRG and G is k-Local} and k-Local-RNRL = {L(G) | G ∈ k-Local-RNRG}, etc.</p>
                    <p>Example 4.1 <formula notation="TeX">L_{1} = \{a^{n} b^{n} a^{m} b^{m} \mid n, m \in N\} \in</formula> 4-Local-<formula notation="TeX">RNRL_{0}</formula>, since all the derivations of <formula notation="TeX">G_{1} = \langle \{S\}, \{s, a, b\}, \#, \{s(S, S)\}, \{S \to s(a, S, b), S \to \lambda\} \rangle</formula> generating <formula notation="TeX">L_{1}</formula> have degree of locality at most 4. For example, the derivation for the string <formula notation="TeX">a^{3} b^{3} a b</formula> has degree of locality 4 as shown in Figure 3.</p>
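                    <p>Definition 4.1 can be made concrete with a small sketch. The encoding below is an assumption made for illustration only: a derivation tree node is a pair of a label and a mapping from node addresses to child derivation nodes, addresses are taken to be 1-based child positions in the host tree, and the names init, r1 and r2 are hypothetical labels for the initial tree s(S, S) and the two rules S → s(a, S, b) and S → λ of Example 4.1.</p>
                    <eg xml:space="preserve"><![CDATA[
```python
from collections import namedtuple

# A derivation-tree node: `label` names an initial tree (at the root) or a
# rewriting rule; `children` maps a node address in the host tree to the
# derivation node of the rule applied at that address.
Node = namedtuple("Node", ["label", "children"])


def edge_kinds(tree):
    """Collect the distinct (host label, applied label, address) triples."""
    kinds = set()
    stack = [tree]
    while stack:
        host = stack.pop()
        for address, applied in host.children.items():
            kinds.add((host.label, applied.label, address))
            stack.append(applied)
    return kinds


def locality(tree):
    """Degree of locality: the number of distinct kinds of edges."""
    return len(edge_kinds(tree))


def chain(times):
    """Apply r1 = S -> s(a, S, b) `times` times, then r2 = S -> lambda."""
    node = Node("r2", {})
    for _ in range(times):
        node = Node("r1", {2: node})  # the S of s(a, S, b) sits at address 2
    return node


# Derivation of a^3 b^3 a b from the initial tree s(S, S).
derivation = Node("init", {1: chain(3), 2: chain(1)})
print(locality(derivation))  # 4, the value stated in Example 4.1
```
]]></eg>
                    <p>Under this encoding the four kinds of edges are (init, r1, 1), (init, r1, 2), (r1, r1, 2) and (r1, r2, 2); repeating r1 more often adds edges but no new kinds, which is why the bound of 4 holds for every derivation of this grammar.</p>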
                    <p>Because locality of a derivation is the number of distinct kinds of rewritings, inclusive of the positions at which they take place, k-locality also puts a bound on the number of nonterminal occurrences in any rule. In fact, had we defined the notion of k-locality by the two conditions: (i) at most k rules take part in any derivation, (ii) each rule is k-bounded,¹⁴ the analogous learnability result would follow essentially by the same argument. So, k-locality in effect forces a grammar to be an unbounded union of boundedly simple grammars, each with a bounded number of rules, each rule boundedly small, with a bounded number of nonterminals. This fact is captured formally by the existence of the following normal form with only a polynomial expansion factor.</p>
                    <p>Lemma 4.1 (K-Local Normal Form) For every k-Local-<formula notation="TeX">RNRG_{j}</formula> G, if we let n = size(G), then there is a <formula notation="TeX">RNRG_{j}</formula> G' such that</p>
                    <p>1. L(G') = L(G).</p>
                    <p>2. G' is in k-local normal form, i.e. <formula notation="TeX">G' = \bigcup \{H_{i} \mid i \in I_{G'}\}</formula> such that: (a) each <formula notation="TeX">H_{i}</formula> has a nonterminal set that is disjoint from that of any other <formula notation="TeX">H_{j}</formula>; (b) each <formula notation="TeX">H_{i}</formula> is k-simple, that is, (i) each <formula notation="TeX">H_{i}</formula> contains exactly 1 initial tree, (ii) each <formula notation="TeX">H_{i}</formula> contains at most k rules, (iii) each <formula notation="TeX">H_{i}</formula> contains at most k nonterminal occurrences.</p>
                    <p>3. <formula notation="TeX">\mathrm{size}(G') = O(n^{k+1})</formula>.</p>
                    <p>[Figure 3: Degree of locality of a derivation of <formula notation="TeX">a^{3} b^{3} a b</formula> by <formula notation="TeX">G_{1}</formula>; locality(τ) = 4.]</p>
                    <p>Crucially, the constraint of k-locality on RNRG's is an interesting one because not only is each k-local subclass an exponential class containing infinitely many infinite languages, but also the k-local subclasses of the RNRG hierarchy become progressively more complex as we go higher in the hierarchy. In particular, for each j, <formula notation="TeX">RNRG_{j}</formula> can &quot;count up to&quot; 2(j + 1), and for each k ≥ 2, k-local-<formula notation="TeX">RNRG_{j}</formula> can also count up to 2(j + 1).¹⁵ We summarize these properties of k-local-RNRL's below.</p>
                    <p>Theorem 4.1 For every k ∈ N,</p>
                    <p>1. ∀j ∈ N, <formula notation="TeX">\bigcup_{k \in N}</formula> k-local-<formula notation="TeX">RNRL_{j}</formula> = <formula notation="TeX">RNRL_{j}</formula>.</p>
                    <p>2. ∀j ∈ N ∀k ≥ 3, k-local-<formula notation="TeX">RNRL_{j+1}</formula> is incomparable with <formula notation="TeX">RNRL_{j}</formula>.</p>
                    <p>3. ∀j, k ∈ N, k-local-<formula notation="TeX">RNRL_{j}</formula> is a proper subset of (k+1)-local-<formula notation="TeX">RNRL_{j}</formula>.</p>
                    <p>4. ∀j ∈ N ∀k ≥ 2, k-local-<formula notation="TeX">RNRL_{j}</formula> contains infinitely many infinite languages.</p>
                    <p>Informal Proof:</p>
                    <p>1 is obvious because for each grammar in <formula notation="TeX">RNRG_{j}</formula>, the degree of locality of the grammar is finite.</p>
                    <p>As for 2, we note that the languages in the sequence <formula notation="TeX">L_{i} = \{a_{1}^{n} a_{2}^{n} \cdots a_{2i}^{n} \mid n \in N\}</formula> (for the first three of which we gave example grammars) are each in 3-local-<formula notation="TeX">RNRL_{i-1}</formula> but not in <formula notation="TeX">RNRL_{i-2}</formula>.</p>
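                    <p>To verify 3, we give the following sequence of languages <formula notation="TeX">L_{j,k}</formula> such that for each j and k, <formula notation="TeX">L_{j,k}</formula> is in k-local-<formula notation="TeX">RNRL_{j}</formula> but not in (k−1)-local-<formula notation="TeX">RNRL_{j}</formula>. Intuitively this is because k-local languages can have at most O(k) mutually independent dependencies in a single sentence. Example 4.2 For each j, k ∈ N, let <formula notation="TeX">L_{j,k} = \{ a_{1}^{n_{1}} \cdots a_{2(j+1)}^{n_{1}} \, a_{1}^{n_{2}} \cdots a_{2(j+1)}^{n_{2}} \cdots a_{1}^{n_{k}} \cdots a_{2(j+1)}^{n_{k}} \mid n_{1}, n_{2}, \ldots, n_{k} \in N \}</formula>.</p>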
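                    <p>4 is obvious because <formula notation="TeX">\mathcal{L}_{\infty} = \{L_{w} \mid w \in \Sigma^{*}\}</formula>, where <formula notation="TeX">L_{w} = \{w^{n} \mid n \in N\}</formula>, is a subset of 2-local-<formula notation="TeX">RNRL_{0}</formula>, and hence is a subset of k-local-<formula notation="TeX">RNRL_{j}</formula> for every j and k ≥ 2. <formula notation="TeX">\mathcal{L}_{\infty}</formula> clearly contains infinitely many infinite languages. □</p>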
                </div2>
            </div1>
            <note n="14" place="below">'K-bounded' here means k nonterminal occurrences in each rule [4]. For instance, a context free grammar in Chomsky Normal Form has only 2-bounded rules.</note>
            <div1>
                <head xml:id="sec5">K-Local Languages Are Learnable</head>
                <p>It turns out that each k-local subclass of each <formula notation="TeX">RNRL_{j}</formula> is polynomially learnable.</p>
                <p>Theorem 5.1 For each j and k, k-local-<formula notation="TeX">RNRL_{j}</formula> is polynomially learnable.</p>
                <p>This theorem can be proved by exhibiting an Occam Algorithm (cf. Subsection 2.3) for this class with a range size which is logarithmic in the sample size, and polynomial in the size of a minimal consistent grammar. We omit a detailed proof and give an informal outline of the proof.</p>
                <p>1. By the Normal Form Lemma, for any k-local-RNRG G, there is a language equivalent k-local-RNRG H in k-local normal form whose size is only polynomially larger than the size of G.</p>
                <p>2. The number of k-simple grammars is a priori infinite, but for a given positive sample, the number of such grammars that are 'relevant' to that sample (i.e. which could have been used to derive any of the examples) is polynomially bounded in the length of the sample. This follows essentially from the non-erasure and non-copying properties of RNRG's. (See [3] for detail.)</p>
                <p>3. Out of the set of k-simple grammars in the normal form thus obtained, the ones that are inconsistent with the negative sample are eliminated. Such a filtering can be seen to be performable in polynomial time, appealing to the result of Vijay-Shankar, Weir and Joshi [18] that Linear Context Free Rewriting Systems (LCFRS's) are polynomial time recognizable. That RNRG's are indeed LCFRS's also follows from the non-erasure and non-copying properties.</p>
                <p>4. What we have at this stage is a polynomially bounded set of k-simple grammars of varying sizes which are all consistent with the input sample. The 'relevant' part¹⁶ of a minimal consistent grammar in k-local normal form is guaranteed to be a subset of this set of grammars. What an Occam algorithm needs to do, then, is to find some subset of this set of k-simple grammars that &quot;covers&quot; all the points in the positive sample, and has a total size that is provably only polynomially larger than the minimal total size of a subset that covers the positive sample and is less than linear in the sample size.</p>
                <p>5. We formalize this as a variant of the &quot;Set Cover&quot; problem which we call &quot;Weighted Set Cover&quot; (WSC), and prove (in [2]) the existence of an approximation algorithm with a performance guarantee which suffices to ensure that the output of <formula notation="TeX">\mathcal{A}</formula> will be a basis set consistent with the sample which is provably only polynomially larger than a minimal one, and is less than linear in the sample size. The algorithm runs in time polynomial in the size of a minimal consistent grammar and the sample length.</p>
                <note n="15" place="below">A class of grammars <formula notation="TeX">\mathcal{G}</formula> is said to be able to &quot;count up to&quot; j just in case <formula notation="TeX">\{a_{1}^{n} a_{2}^{n} \cdots a_{j}^{n} \mid n \in N\} \in \{L(G) \mid G \in \mathcal{G}\}</formula> but <formula notation="TeX">\{a_{1}^{n} a_{2}^{n} \cdots a_{j+1}^{n} \mid n \in N\} \notin \{L(G) \mid G \in \mathcal{G}\}</formula>.</note>
                <note n="16" place="below">This notion is to be made precise.</note>
                <p>6 Discussion: Possible Implications to the Theory of Natural Language Acquisition</p>
                <p>We have shown that a single, nontrivial constraint of 'k-locality' allows a rich class of mildly context sensitive languages, which are argued by some [9] to be an upper bound of weak generative capacity that may be needed by a linguistic formalism, to be learnable. Let us recall that k-locality puts a bound on the amount of global interactions between different parts (rules) of a grammar. Although the most concise description of natural language might require an almost unbounded amount of such interactions, it is conceivable that the actual grammar that is acquired by humans has a bounded degree of interactions, and thus in some cases may involve some inefficiency and redundancy. To illustrate the nature of the inefficiency introduced by 'forcing' a grammar to be k-local, consider the following. The syntactic category of a noun phrase seems to be essentially context independent in the sense that a noun phrase in a subject position and a noun phrase in an object position are more or less syntactically equivalent. Such a 'generalization' contributes to the 'global' interaction in a grammar. Thus, for a k-local grammar (for some relatively small k) to account for it, it may have to repeat the same set of noun phrase rules for different constructions.</p>
                <p>As is stated in Section 4, for each fixed k, there are clearly a lot of languages (in a given class) which could not be generated by a k-local grammar. However, it is also the case that many languages for which the most concise grammar is not a k-local grammar can be generated by a less concise (and thus perhaps less explanatory) grammar which is k-local. In some sense, this is similar to the well-known distinction of 'competence' and 'performance'. It is conceivable that performance grammars which are actually acquired by humans are in some sense much less efficient and less explanatory than a competence grammar for the same language. After all, when the 'projection problem' asks: 'How is it possible for human infants to acquire their native languages...', it does not seem necessary that it be asking the question with respect to 'competence grammars', for what we know is that the set of 'performance grammars' is feasibly learnable. The possibility that we are suggesting here is that 'k-locality' is not visible in competence grammars; however, it is implicitly there, so that the languages generated by the class of competence grammars, which are not necessarily k-local, are indeed all k-local languages for some fixed 'k'.</p>
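                <p>Returning briefly to the proof outline of Theorem 5.1, the covering step (step 5) can be pictured with a small sketch. The text does not spell out the approximation algorithm proved in [2], so the code below substitutes the textbook greedy heuristic for weighted set cover, which repeatedly picks the candidate with the lowest weight per newly covered point. The candidate names, weights (standing in for grammar sizes in bits) and covered strings are all invented for illustration.</p>
                <eg xml:space="preserve"><![CDATA[
```python
def greedy_weighted_cover(points, candidates):
    """Greedy heuristic for weighted set cover.

    `candidates` maps a (hypothetical) k-simple grammar name to a pair
    (weight, set of sample points it derives). Returns the chosen names.
    """
    uncovered = set(points)
    chosen = []
    while uncovered:
        best, best_ratio = None, None
        for name in sorted(candidates):  # sorted for a deterministic result
            weight, covered = candidates[name]
            gain = len(uncovered.intersection(covered))
            if gain == 0:
                continue
            ratio = weight / gain  # cost per newly covered point
            if best_ratio is None or ratio < best_ratio:
                best, best_ratio = name, ratio
        if best is None:
            raise ValueError("some sample points cannot be covered")
        chosen.append(best)
        uncovered.difference_update(candidates[best][1])
    return chosen


if __name__ == "__main__":
    positive_sample = {"ab", "aabb", "c"}
    candidates = {
        "g1": (5, {"ab", "aabb"}),
        "g2": (3, {"ab"}),
        "g3": (3, {"aabb"}),
        "g4": (2, {"c"}),
    }
    print(greedy_weighted_cover(positive_sample, candidates))
```
]]></eg>
                <p>In this toy instance the heuristic selects g4 and then g1, for a total weight of 7, rather than the three smaller candidates with total weight 8. Whether the greedy guarantee matches the performance guarantee required in step 5 is not something this sketch establishes.</p>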
            </div1>
            <div1>
                <head xml:id="sec7">Conclusions</head>
                <p>We have investigated the application of complexity theory to the evaluation of grammatical systems as linguistic formalisms from the point of view of feasible learnability. In particular, we have demonstrated that a single, natural and non-trivial constraint of &quot;locality&quot; on the grammars allows a rich class of mildly context sensitive languages to be feasibly learnable, in a well-defined complexity theoretic sense. Our work differs from recent works on efficient learning of formal languages, for example by Angluin ([4]), in that it uses only examples and no other powerful oracles. We hope to have demonstrated that learning formal grammars need not necessarily be computationally intractable, and that the investigation of alternative formulations of this problem is a worthwhile endeavour.</p>
                <p>Acknowledgment</p>
                <p>The research reported herein was supported in part by an IBM graduate fellowship awarded to the author. The author gratefully acknowledges his advisor, Scott Weinstein, for his guidance and encouragement throughout his research. He has also benefitted from valuable discussions with Aravind Joshi and David Weir. Finally, he wishes to thank Haim Levkowitz and Ethel Schuster for their kind help in formatting this paper.</p>
                <p>References</p>
                <p>[1] Naoki Abe. Generalization of tree adjunction as ranked node rewriting. 1987. Unpublished manuscript.</p>
                <p>[2] Naoki Abe. Polynomial learnability and locality of formal grammars. In 26th Meeting of A.C.L., June 1988.</p>
                <p>[3] Naoki Abe. Polynomially learnable subclasses of mildly context sensitive languages. 1987. Unpublished manuscript.</p>
                <p>[4] Dana Angluin. Learning k-bounded context-free grammars. Technical Report YALEU/DCS/TR-557, Yale University, August 1987.</p>
                <p>[5] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. Classifying learnable geometric concepts with the Vapnik-Chervonenkis dimension. In Proc. 18th ACM Symp. on Theory of Computation, pages 243 - 282, 1986.</p>
                <p>[6] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. Learnability and the Vapnik-Chervonenkis Dimension. Technical Report UCSC CRL-87-20, University of California at Santa Cruz, November 1987.</p>
                <p>[7] Noam Chomsky. Aspects of the Theory of Syntax. The MIT Press, 1965.</p>
                <p>[8] E. Mark Gold. Language identification in the limit. Information and Control, 10:447-474, 1967.</p>
                <p>[9] A. K. Joshi. How much context-sensitivity is necessary for characterizing structural description - tree adjoining grammars. In D. Dowty, L. Karttunen, and A. Zwicky, editors, Natural Language Processing - Theoretical, Computational, and Psychological Perspectives, Cambridge University Press, 1983.</p>
                <p>[10] Aravind K. Joshi, Leon Levy, and Masako Takahashi. Tree adjunct grammars. Journal of Computer and System Sciences, 10:136-163, 1975.</p>
                <p>[11] M. Kearns, M. Li, L. Pitt, and L. Valiant. On the learnability of boolean formulae. In Proc. 19th ACM Symp. on Theory of Computation, pages 285 - 295, 1987.</p>
                <p>[12] A. Kroch and A. K. Joshi. Linguistic relevance of tree adjoining grammars. 1989. To appear in Linguistics and Philosophy.</p>
                <p>[13] Daniel N. Osherson, Michael Stob, and Scott Weinstein. Systems That Learn. The MIT Press, 1986.</p>
                <p>[14] Daniel N. Osherson and Scott Weinstein. Identification in the limit of first order structures. Journal of Philosophical Logic, 15:55 - 81, 1986.</p>
                <p>[15] William C. Rounds. Context-free grammars on trees. In ACM Symposium on Theory of Computing, pages 143-148, 1969.</p>
                <p>[16] Leslie G. Valiant. A theory of the learnable. Communications of A.C.M., 27:1134-1142, 1984.</p>
                <p>[17] K. Vijay-Shanker and A. K. Joshi. Some computational properties of tree adjoining grammars. In 23rd Meeting of A.C.L., 1985.</p>
                <p>[18] K. Vijay-Shanker, D. J. Weir, and A. K. Joshi. Characterizing structural descriptions produced by various grammatical formalisms. In 25th Meeting of A.C.L., 1987.</p>
                <p>[19] K. Vijay-Shanker, D. J. Weir, and A. K. Joshi. On the progression from context-free to tree adjoining languages. In A. Manaster-Ramer, editor, Mathematics of Language, John Benjamins, 1986.</p>
                <p>[20] David J. Weir. From Context-Free Grammars to Tree Adjoining Grammars and Beyond - A dissertation proposal. Technical Report MS-CIS-87-42, University of Pennsylvania, 1987.</p>
            </div1>
        </body>
        <back/>
    </text>
</TEI>
