<?xml version="1.0"?><!DOCTYPE article SYSTEM "/project/take/software/searchbench_offline_processing/paperxml_generator/aclextractor/src/python/../resource/dtd/paperxml.dtd"><article><header><firstpageheader><page local="1"/><title>A Statistically Emergent Approach for Language Processing: Application to Modeling Context Effects in Ambiguous Chinese Word Boundary Perception</title><pubinfo>© 1996 Association for Computational Linguistics</pubinfo><author surname="Gan" givenname="Kok-Wee"><org  name="National University of Singapore" country="Singapore"/></author><author surname="Palmer" givenname="Martha"><org  name="Konan University" country="Japan" city="Kobe"/></author><author surname="Lua" givenname="Kim-Teng"><org  name="National University of Singapore" country="Singapore"/></author></firstpageheader><frontmatter><p><b>A Statistically Emergent Approach for Language Processing: Application to Modeling Context Effects in Ambiguous Chinese Word Boundary Perception</b></p><p><b>Kok-Wee Gan* Martha Palmer†</b></p><p>Hong Kong University of Science and Technology</p><p>University of Pennsylvania</p><p><b>Kim-Teng Lua*</b></p><p>National University of Singapore</p></frontmatter><abstract><b><i>This paper proposes that the process of language understanding can be modeled as a collective phenomenon that emerges from a myriad of microscopic and diverse activities. The process is analogous to the crystallization process in chemistry. The essential features of this model are: asynchronous parallelism; temperature-controlled randomness; and statistically emergent active symbols. A computer program that tests this model on the task of capturing the effect of context on the perception of ambiguous word boundaries in Chinese sentences is presented. The program adopts a holistic approach in which word identification forms an integral component of sentence analysis. 
Various types of knowledge, from statistics to linguistics, are seamlessly integrated for the tasks of word boundary disambiguation as well as sentential analysis. Our experimental results showed that the model is able to address the word boundary ambiguity problems effectively.</i></b> </abstract></header><body><section number="1." title="Introduction"><p>This paper suggests that the language understanding process can be effectively modeled as the statistical outcome of a large number of independent activities occurring in parallel. There is no global controller deciding which processes to run next. All processing is done locally by many simple, independent agents that make their decisions stochastically. The system is self-organizing, with coherent behavior being a statistically emergent property of the system as a whole. The model, in a nutshell, simulates language understanding as a crystallization process. This process consists of a series of hierarchical, structure-building activities in which high-level linguistic structures are formed from their constituents and get properly hooked up to each other as the process converges.</p><p>The essential features of the model are:</p><p>• The process of sentence analysis is a series of computational activities that determine how various constituents in a sentence can be meaningfully related.</p><p><b>* Department of Computer Science, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong</b> <b>† Department of Computer Information Science, University of Pennsylvania, Philadelphia, PA 19104-6389 </b><b>Department of Information Systems &amp; Computer Science, National University of Singapore, Lower Kent Ridge Road, Singapore 119260, Republic of Singapore</b><page local="2" global="532"/></p><p>• All computational activities are carried out by a large number of procedures known as <b>codelets.</b></p><p>• A linguistic structure is not built by a single codelet. 
Rather, it is constructed by a sequence of codelets. The execution of this sequence of codelets is interleaved with that of other codelets that are responsible for building other structures.</p><p>• The order in which structures are built is not explicitly programmed, but is an emergent outcome of chains of codelets working in an asynchronous parallel mode.</p><p>• Computational activities are a combination of top-down and bottom-up activities.</p><p>• Computational activities are indirectly guided by a semantic network of linguistic concepts, which ensures that these activities do not operate independently of the system's representation of the context of a sentence.</p><p>• Decision making is stochastic, with the amount of randomness being controlled by a parameter known as the <b>computational temperature.</b></p><p>We have applied our model to the task of capturing the effect of context on the perception of ambiguous word boundaries in Chinese sentences (Gan 1993). Our approach differs from existing work on Chinese word segmentation (Liang 1983; Wang, Wang, and Bai 1991; Fan and Tsai 1988; Chang, Chen, and Chen 1991; Chiang et al. 1992; Sproat and Shih 1990; Wu and Su 1993; Lua and Gan 1994; Lai et al. 1992; Sproat et al. 1994; Sproat et al. 1996) primarily in that our system performs sentence interpretation, in addition to word boundary identification. Our system figures out where the word boundaries of a sentence are by determining how various constituents in a sentence can be meaningfully related. The relations the system builds represent its interpretation of the sentence. In the initial stage of a run, the system constructs relations between characters of a sentence. Through a spreading activation mechanism, the system gradually shifts to the construction of words and of relations between words. Later, the system progresses to identifying and constructing chunks (in other words, phrases), and to establishing connections between chunks. 
Note that there is no top-level executive that decides the order of these activities. At any given time, the system stochastically selects one action to execute. Efforts toward building different structures are therefore interleaved, sometimes cooperating and sometimes competing. The system's high-level behavior thus arises from its low-level stochastic actions.</p><p>We will give a detailed description of this application in this paper. In Section 2, we introduce the problem of ambiguous Chinese word boundary perception, and follow, in Section 3, with a summary of the current practices in Chinese word identification. We describe our model in Section 4, and show a sample run of our program in Section 5 to illustrate the behavior of the model. We discuss the model in Section 6, compare it with others in Section 7, and explore areas for future research in Section 8.</p></section><section number="2." title="Ambiguous Chinese Word Boundary Perception"><p>A written Chinese sentence consists of a series of evenly spaced Chinese characters. Each character corresponds to one syllable. A word in Chinese can be made up of a single character, such as 饭 <b><i>fàn </i></b>'rice', or it can be a combination of two or more characters, such as 水果 <b><i>shuǐguǒ </i></b>'fruit'.<page local="3" global="533"/> It is possible that the component characters of a word are free<footnote anchor="1"/>, such as 水 <b><i>shuǐ </i></b>and 果 <b><i>guǒ </i></b>of the word 水果 <b><i>shuǐguǒ </i></b>'fruit', which mean 'water' and 'fruit' respectively. For any two Chinese characters in a sentence, denoted as <b><i>x </i></b>and <b><i>y</i></b>, if <b><i>xy </i></b>cannot be combined together to function as a word, a single word boundary exists between these two characters. If <b><i>x </i></b>and <b><i>y </i></b>can be constituents of the same word, yet at the same time may also be free, then word boundary ambiguity exists in these two characters. 
If there is a unique word boundary before <b><i>x </i></b>and after <b><i>y</i></b>, we refer to the ambiguity existing in <b><i>xy </i></b>as a <b>combination ambiguity. </b>On the other hand, if there is a word boundary ambiguity between the characters <b><i>xy </i></b>and the character that precedes or follows them, say <b><i>z</i></b>, and these three characters can be grouped into either <b><i>xy z </i></b>or <b><i>x yz, </i></b>then we say that an <b>overlap ambiguity </b>exists. A sentence that allows an ambiguous fragment to have multiple word boundaries will end up with more than one interpretation. This type of ambiguity is called <b>global ambiguity </b>with respect to the sentence. On the other hand, if only one way of segmenting the word boundary of an ambiguous fragment is allowed in a sentence, we call this <b>local ambiguity </b>with respect to the sentence. Global ambiguity can only be resolved with discourse knowledge. An example for each category is shown in (1) to (4).<footnote anchor="2"/> Throughout this paper, we follow the guidelines on Chinese word segmentation adopted in China.<footnote anchor="3"/> <b>Overlap, Local Ambiguity</b> <b><i>zhè wèi zhíyuán gōngzuò de yālì hěn dà </i></b>this CL<footnote anchor="4"/> worker work STRUC<footnote anchor="5"/> pressure very great 'This worker faces great pressure in his work.'</p><doubt alpha="45.0" length="20" tooSmall="False" monospace="0.0">(1) 这位职员工作的压力很大。</doubt><p>The underlined fragment 员工作 <b><i>yuán gōng zuò </i></b>in (1) has overlap, local ambiguity. The middle character 工 <b><i>gōng </i></b>can combine with the previous character 员 <b><i>yuán </i></b>to form the word 员工 <b><i>yuángōng </i></b>'worker', leaving the third character functioning as a monosyllabic word 作 <b><i>zuò </i></b>'do'. The middle character can also combine with the next character to form the word 工作 <b><i>gōngzuò </i></b>'work', leaving the first character alone. 
The sentence containing this fragment allows only one way of segmenting the word boundary, which is shown in (1). The character 员 <b><i>yuán </i></b>combines with the character preceding it, 职 <b><i>zhí, </i></b>to form the bisyllabic word 职员 <b><i>zhíyuán </i></b>'worker', and the two characters 工 <b><i>gōng </i></b>and 作 <b><i>zuò </i></b>form a word.</p><p><b>Overlap, Global Ambiguity</b></p><doubt alpha="37.9" length="29" tooSmall="False" monospace="0.0">(2) a. 我们 要 学生 活 得 有 意义</doubt><doubt alpha="65.5" length="148" tooSmall="False" monospace="0.0">wǒmen yào xuéshēng huó dé yǒu yìyì we want student live CSC⁶ have meaning 'We want our students to have a meaningful life.'</doubt><p><b>1 A free character is one which can occur independently as a word (Li and Thompson 1981).</b></p><p><b>2 The characters underlined in sentences (1) to (4) are the locations of word boundary ambiguities we would like to focus on. This convention will be used throughout this paper.</b></p><p><b>3 See <i>Contemporary Chinese Language Words Segmentation Standard Used for Information Processing, </i>fifth edition, 1988, published in China.</b></p><p><b>4 CL stands for a CLassifier.</b></p><p><b>5 STRUC stands for the STRUCture word </b>的 <b><i>de.</i></b></p><p><b>6 CSC stands for the Complex Stative Construction word <i>de.</i></b></p><page local="4" global="534"/><doubt alpha="43.5" length="23" tooSmall="False" monospace="0.0">b. 我们 要 学 生活 得 有 意义</doubt><doubt alpha="60.0" length="95" tooSmall="False" monospace="0.0">wǒmen yào xué shēnghuó dé yǒu yìyì we want learn life CSC have meaning</doubt><p>'We want to learn how to lead a meaningful life.'</p><p>The fragment 学生活 <b><i>xué shēng huó </i></b>also has overlap ambiguity, where the middle character can either combine with the first character to form a word, or combine with the last character to form a word. 
The sentence containing this fragment has two plausible interpretations as shown in (2a) and (2b). Both alternatives, 学生 活 <b><i>xuéshēng huó </i></b>'student live' (2a) and 学 生活 <b><i>xué shēnghuó </i></b>'learn life' (2b), are acceptable.</p><p><b>Combination, Local Ambiguity</b> <b><i>nǐ de biǎoqíng shífēn huájī</i></b></p><doubt alpha="38.5" length="13" tooSmall="False" monospace="0.0">(3) 你 的 表情 十分 滑稽</doubt><doubt alpha="62.7" length="59" tooSmall="False" monospace="0.0">you STRUC look very funny 'You look very funny.'</doubt><p>In (3), the two characters in the fragment 十分 <b><i>shífēn </i></b>can either function as two autonomous words 十 <b><i>shí </i></b>'ten' and 分 <b><i>fēn </i></b>'mark', or they can combine together to function as a bisyllabic word 十分 <b><i>shífēn </i></b>'very'. Given the sentential context of (3), however, only the second alternative is correct.</p><p><b>Combination, Global Ambiguity</b> <b><i>wǒmen dōu hěn <u>nán</u> guò </i></b>we all very hard live 'We all have a hard life.'</p><doubt alpha="43.8" length="16" tooSmall="False" monospace="0.0">(4) a. 我们 都 很 难 过</doubt><doubt alpha="63.6" length="11" tooSmall="False" monospace="0.0">b. 我们 都 很 难过</doubt><doubt alpha="61.6" length="73" tooSmall="False" monospace="0.0">wǒmen dōu hěn nánguò we all very sad 'We all feel very sad.'</doubt><p>The fragment 难过 <b><i>nánguò </i></b>also has combination ambiguity. It differs from (3) in that the sentence in which it appears has two plausible interpretations. Hence, this fragment can either be segmented as 难 <b><i>nán </i></b>'hard' and 过 <b><i>guò </i></b>'live' in (4a), or as 难过 <b><i>nánguò </i></b>'sad' in (4b).</p><p>Word boundary ambiguity is a very common phenomenon in written Chinese, due to the fact that a large number of words in modern Chinese are formed from free characters (Chao 1957). 
The problem also exists in continuous speech recognition research, where correct interpretation of word boundaries in an utterance requires linguistic and nonlinguistic information. However, people have a fascinating ability to fluidly perceive groups of characters as words in one context but break these groups apart in a different context. This human capability highlights the fact that there is a continual interaction between word identification and sentence interpretation. We are therefore motivated to study how our statistically emergent model can be used to simulate the interactions between word identification and sentence analysis. In particular, we want to study how the model (i) handles fragments with local ambiguities, such as those in sentences (1) and (3), when they appear in different sentential contexts and (ii) handles fragments with global ambiguities, such as those in sentences (2) and (4), when there is no discourse information.</p><page local="5" global="535"/></section><section number="3." title="Existing Approaches"><p>Traditionally, word identification has been treated as a preprocessing issue, distinct from sentence analysis. We will therefore only discuss current practices in word identification, leaving sentence analysis aside. Several techniques have been used in word identification, ranging from simple pattern matching, to statistical approaches, to rule-based methods. The most popular pattern-matching method is based on the Maximum Matching heuristics, commonly known as the MM method (Liang 1983; Wang, Wang, and Bai 1991). This method scans a sentence from left to right. In each step, the longest matched substring is selected as a word by dictionary look-up. For example, in sentence (5) (glossed as computer STRUC invention implication profound, 'The invention of the computer has profound implications.'), 
the first three characters are identified as the word 计算机 <b><i>jìsuànjī </i></b>'computer' because it is the longest matched substring found in a word dictionary. With the same reasoning, the words 的 <b><i>de </i></b>'STRUC', 发明 <b><i>fāmíng </i></b>'invention', 意义 <b><i>yìyì </i></b>'implication', and 重大 <b><i>zhòngdà </i></b>'profound' are identified.</p><doubt alpha="66.7" length="12" tooSmall="False" monospace="0.0">(5) 计算机 的 发明 意义 重大</doubt><doubt alpha="65.9" length="41" tooSmall="False" monospace="0.0">jìsuànjī de fāmíng yìyì zhòngdà</doubt><p>Statistical techniques include the relaxation approach (Fan and Tsai 1988; Chang, Chen, and Chen 1991; Chiang et al. 1992), the mutual information approach (Sproat and Shih 1990; Wu and Su 1993; Lua and Gan 1994), and the Markov model (Lai et al. 1992). These approaches make use of co-occurrence frequencies of characters in a large corpus of written texts to achieve word segmentation without getting into deep syntactic and semantic analysis. For example, the relaxation approach uses the usage frequencies of words and the adjacency constraints among words to iteratively derive the most plausible assignment of characters into word classes. First, all possible words in a sentence are identified and assigned initial probabilities based on their usage frequency. These probabilities are updated iteratively by employing the consistency constraints among neighboring words. Impossible combinations are gradually filtered out, leading to the identification of the most likely combination. The mutual information approach is similar to the relaxation approach in principle. Here, mutual information is used to measure how strongly two characters are associated. The mutual information score is derived from the ratio of the co-occurrence frequency of two characters to the frequency of each character. In a sentence, the mutual information score for each pair of adjacent characters is determined. 
The pair having the highest score is grouped together. The sentence is split into two parts by the two characters just grouped. The same procedure is applied to each part recursively. Eventually, all word boundaries will be identified.</p><p>Both the pattern-matching and the statistical approaches are simple and easy to implement. It is well known, however, that they perform poorly when presented with ambiguous fragments that have alternate word boundaries in different sentential contexts. For instance, the fragment 十分 <b><i>shífēn, </i></b>which is a bisyllabic word in sentence (3), functions as two separate words in sentence (6).</p><doubt alpha="51.6" length="124" tooSmall="False" monospace="0.0">(6) 他 只 考 到 十 分 tā zhǐ kǎo dào shí fēn he only score ASP ten mark 'He scores only ten marks.'</doubt><page local="6" global="536"/><p>The MM method will regard this fragment as a bisyllabic word 十分 <b><i>shífēn </i></b>'very' regardless of the sentential context in (3) and (6), since this word is longer than either of the two monosyllabic words 十 <b><i>shí </i></b>'ten' and 分 <b><i>fēn </i></b>'mark'. As a result, this method fails to correctly identify the word boundaries in sentence (6). Among statistical approaches, the mutual information method (Lua and Gan 1994), for example, also identifies the same fragment as a bisyllabic word in both sentences (3) and (6)<footnote anchor="7"/>.</p><p>By checking the structural relationships among words in a sentence, rule-based approaches aim to overcome limitations faced by pattern-matching and statistical approaches. However, many of the rules in existing rule-based systems (Huang 1989; Yao, Zheng, and Wu 1990; Yeh and Lee 1991; He, Xu, and Sun 1991; Chen and Liu 1992) are either arbitrary and word-specific, or overly general. For example,
For example,</p><doubt alpha="100.0" length="4" tooSmall="False" monospace="0.0">Rule</doubt><p>Given an ambiguous fragment <b><i>xyz </i></b>where <b><i>x, </i></b>z, <b><i>xy, </i></b>and <b><i>yz </i></b>are all possible words, if <b><i>x </i></b>can be analyzed as a so-called direction word, segment the fragment as <b><i>x yz, </i></b>else segment it as <b><i>xy z </i></b>(Liang 1990).</p><p>This syntactic rule works in sentence (7).</p><doubt alpha="40.0" length="10" tooSmall="False" monospace="0.0">(7)(èmTm±_</doubt><p><b><i>tä fü xià <u>shënzi </u></i></b>he bend down body 'He bends down his body.'</p><p>The fragment Tlr-f <b><i>xià shën zi </i></b>in sentence (7) is ambiguous. As T <b><i>xià </i></b>'down' is a direction word, the fragment is segmented as T <b><i>m?</i></b><b><i> </i></b><b><i>xià shënzi </i></b>'down body', which is as desired.</p><p>Similarly, this rule will segment the fragment ^rHA <b><i>wài guô rén </i></b>as <b><i>ft </i></b>HA <b><i>wài guôrên </i></b>'out citizen', since <b><i>ft wài </i></b>'out' is also a direction word. Therefore, when this fragment appears in sentence (8a), <b><i>ta shi wàiguôrén </i></b>he COPULA foreigner 'He is a foreigner.' the word boundaries identified will be:</p><doubt alpha="57.1" length="14" tooSmall="False" monospace="0.0">(8)a.  fÉmftmA</doubt><doubt alpha="50.0" length="18" tooSmall="False" monospace="0.0">b. ffem       ftma</doubt><p><b><i>tâ    shi wài guôrén</i></b> he   COPULA   out citizen which is incorrect.</p><p>Examples (7) and (8) illustrate that although syntactic information has been incor­porated in word segmentation, there are still errors. In contrast, people are extremely flexible in their perception of word boundaries of ambiguous fragments appearing in different sentential contexts. We believe that the separation of word identification from the task of analysis accounts for the difference in performance. 
This has motivated us to study how word identification and sentence analysis can be integrated.</p><p><b>7 This result is reported in Gan (1994).</b></p><page local="7" global="537"/></section><section number="4." title="The Statistically Emergent Model"><p>This model is inspired by the work done in the Fluid Analogies Research Group (Hofstadter 1983; Meredith 1986; Mitchell 1990; French 1992). There are four main components in this model: (i) the <b>conceptual network, </b>which is a network of nodes and links representing some permanent linguistic concepts; (ii) the <b>workspace, </b>which is the working area in which high-level linguistic structures representing the system's current understanding of a sentence are built and modified; (iii) the <b>coderack, </b>which is a pool of structure-building agents (codelets) waiting to run; and (iv) the <b>computational temperature, </b>which is an approximate measure of the amount of disorganization in the system's understanding of a sentence.</p><subsection number="4.1" title="The Conceptual Network"><p>The conceptual network is a network of nodes and links representing some permanent linguistic concepts (Figure 1).</p><p>In the network, a node represents a concept. For example, the node labeled <b><i>character </i></b>represents the concept of character; the node <b><i>word </i></b>represents the concept of word; the node <b><i>chunk </i></b>represents the concept of chunk; the nodes <b><i>character-1, character-2, </i></b>up to <b><i>character-n </i></b>represent the actual characters in a sentence; the <b><i>affix </i></b>and <b><i>affinity </i></b>nodes represent the concepts of relations between characters; the nodes <b><i>classifier, reflexive adjective, structure, </i></b>etc., represent the concepts of relations between words; the nodes <b><i>agent, patient, theme, </i></b>etc., represent the concepts of relations between chunks.</p><p>A link represents an association between two nodes. 
There are four types of links:</p><p>(i) <b>category-of links, </b>or <b><i>is-a </i></b>links, which connect instances to types, for example, the connections from <b><i>character-1, character-2, </i></b>up to <b><i>character-n </i></b>to the <b><i>character </i></b>node; (ii) <b>has-instance links, </b>the converse of category-of links; (iii) <b>has-relation links, </b>which associate a node with the relations it contributes to, for example, the connection from the <b><i>character </i></b>node to the <b><i>affix </i></b>node represents that the <b><i>character </i></b>node contributes to the character-based relation named <b><i>affix; </i></b>(iv) <b>part-of links, </b>which represent <b><i>part-of </i></b>relations between two nodes. The direction of a <b><i>part-of </i></b>link, for instance, the link from the <b><i>character </i></b>node to the <b><i>word </i></b>node, is interpreted as 'the <b><i>character </i></b>is part of the <b><i>word'.</i></b></p><p>During a run of the program, nodes become activated when perceived to be relevant, and decay when no longer perceived to be relevant. Nodes also spread activation to their neighbors, and thus concepts closely associated with relevant concepts also become relevant. The activation levels of nodes can be affected by processes that take place in the workspace. Several nodes in the network (e.g., <b><i>agent, patient, word, chunk, </i></b>etc.), when activated, are able to exert top-down influences on the types of activities that may occur in the workspace in subsequent processing. The context-dependent activation of nodes enables the system to dynamically decide what is relevant at a given point in time, and influences what types of actions the system engages in.</p></subsection><subsection number="4.2" title="The Workspace"><p>The workspace is meant to be the region where the system does the parsing and construction required to understand a sentence. 
This area can be thought of as corre­sponding to the locus of the creation and modification of mental representations that occurs in the mind as one tries to form a coherent understanding of a sentence. The construction process is done by a large number of processing agents.</p><p>Figure 2 shows an example of a possible state of the workspace when the system is processing sentence (9).</p><page local="8" global="538"/><p><b>chunk </b><b>—a—</b> <b>adjective </b><b>-</b><b>*</b><b>-</b> <b>predicate ^</b></p><doubt alpha="100.0" length="5" tooSmall="False" monospace="0.0">agent</doubt><doubt alpha="0.0" length="1" tooSmall="False" monospace="0.0">/</doubt><doubt alpha="100.0" length="7" tooSmall="False" monospace="0.0">patient</doubt></subsection></section><section number="7" title="theme"><doubt alpha="71.4" length="7" tooSmall="False" monospace="0.0">word—K—</doubt><p><b>* object name</b> <b>character</b>-1 <b>character-2</b> <b>character-3</b> <b>character-n * * character C</b> <b>lexical marker ^</b> <b>classifier</b> <b>reflexive adjective</b> <b>structure</b> <b>coordination</b></p><doubt alpha="100.0" length="4" tooSmall="False" monospace="0.0">goal</doubt><doubt alpha="100.0" length="6" tooSmall="False" monospace="0.0">source</doubt><doubt alpha="100.0" length="4" tooSmall="False" monospace="0.0">tune</doubt><p><b>complex </b><b>*|</b><b> Stative construction</b> <b>^ affinity</b> <b>Legends:</b><b></b></p><doubt alpha="66.7" length="21" tooSmall="False" monospace="0.0">Av\N|judgment\\V\VX1-</doubt><doubt alpha="100.0" length="6" tooSmall="False" monospace="0.0">manner</doubt><doubt alpha="63.6" length="11" tooSmall="False" monospace="0.0">\\\\Adegree</doubt><doubt alpha="66.7" length="3" tooSmall="False" monospace="0.0">Vv—</doubt><p><b>&lt;</b><b>-</b><b>&gt; has-instance &amp; category-of link</b> <b>--</b><b>&gt; has-relation link</b></p><doubt alpha="55.6" length="18" tooSmall="False" monospace="0.0">----&gt; part-of link</doubt><doubt alpha="0.0" 
length="1" tooSmall="False" monospace="0.0">\</doubt><p><b>Figure 1</b></p><p>The conceptual network.</p><doubt alpha="29.4" length="34" tooSmall="False" monospace="0.0">(9) 她 本人 生 了 三 个 孩子</doubt><p><b><i>tā <u>běnrén</u> shēng le sān gè háizi</i></b> she self give birth ASP three CL child 'She herself has given birth to three children.'</p><p>There are three types of objects that may exist in the workspace: character objects, word objects, and chunk objects. The Chinese characters in Figure 2 not enclosed by rectangles, namely, the characters 三 <b><i>sān </i></b>and 个 <b><i>gè, </i></b>are character objects. When a few Chinese characters are enclosed by a rectangle, for example 本人 <b><i>běnrén, </i></b>it indicates that these characters make up a word object. The constituent characters of the word still exist in the workspace but they become less explicit in the figure. If a group of characters is enclosed by two rectangles, for example, the character 生 <b><i>shēng, </i></b>it indicates that a chunk object exists, made up of word objects. 
In short, the immediate constituents of a word object are character objects, and those of a chunk object are<page local="9" global="539"/></p><doubt alpha="60.0" length="15" tooSmall="False" monospace="0.0">\\^ A*|quantity</doubt><doubt alpha="33.3" length="3" tooSmall="False" monospace="0.0">V]-</doubt><table class="main" frame="box" rules="all" border="1" regular="False"><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>aspect</b></p></td><td class="cell"><p></p></td><td class="cell"><p></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>direction</b></p></td><td class="cell"><p></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>demonstrative</b></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>question</b></p></td><td class="cell"><p></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr></table><p>Figure 2</p><p>A possible state of the workspace.</p><p><b>,—</b><b> Non-linguistic — statistical</b> <b>Types of relations</b> <b>— Linguistic</b> <b>between character objects</b> <b>between word objects</b> <b>between chunk objects</b></p><p>Figure 3</p><p>An overview of the types of relations.</p><p>word objects. It is possible to have unitary constituency whereby one object is the only part of another object. The chunk object 生 <b><i>shēng </i></b>'give birth' is an example.</p><p>Each object in the workspace has a list of descriptions not shown in Figure 2. 
For example, descriptions of character objects include their morphological category (stem/affix) and whether they are bound or unbound.<footnote anchor="8"/> Descriptions of word objects include their categorial information and sense. Descriptions of chunk objects may also include these two descriptions, except that here, these two descriptions are derived from the category and the sense of the word that is the governor.</p><p>The directed arc connecting two objects in Figure 2 denotes a linguistic relation between the objects connected. We adopt the dependency grammar notation (Tesnière 1959; Mel'cuk 1988) in which the object pointed to by an arrow is the dependent while the object where the arrow originates is the governor. The undirected arc connecting the characters 孩 <b><i>hái </i></b>and 子 <b><i>zi </i></b>in Figure 2 represents a statistical relation, and statistical relations are undirected in our representation.</p><p>An overview of our classification of relations is shown in Figure 3.</p><p>A list of all types of relations is summarized in Table 1; a detailed exposition can be found in Gan (1994).</p><p>In Figure 2, the connection between the word objects 她 <b><i>tā </i></b>'she' and 本人 <b><i>běnrén </i></b>'self' is a reflexive adjective relation, the connection between the word objects 生 <b><i>shēng </i></b>'give birth' and 了 <b><i>le </i></b>'ASP' is an aspectual relation, and the two arcs connecting the character objects 孩 <b><i>hái </i></b>and 子 <b><i>zi </i></b>are affix and affinity relations.</p><p><b>8 A bound character cannot occur independently as a word.</b></p><table class="main" frame="box" rules="all" border="1" regular="False"><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p></p></td><td class="cell"><p></p></td><td class="cell"><p></p></td><td 
class="cell"><p><b>affinity</b></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b><i>m   </i></b># a</p></td><td class="cell"><p></p></td><td class="cell"><p><b>7    </b>H <b>ffl</b></p></td><td class="cell"><p><b>% 7</b></p><p><b><i>A</i></b></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>T L</b></p></td><td class="cell"><p></p></td><td class="cell"><p></p></td><td class="cell"><p></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>reflexive adjective    aspectual affix</b></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr></table><page local="10" global="540"/><doubt alpha="71.4" length="7" tooSmall="False" monospace="0.0">Table 1</doubt><p>A list of all types of relations.</p><p>Object Type       Relation Type Example</p><subsection number="4.3" title="The Coderack"><p>The building of linguistic structures (e.g., word and chunk objects, descriptions of objects, relations between objects) is carried out by a large number of agents known as codelets. These codelets reside in a data structure called the coderack. A codelet is a piece of code that carries out some small, local task that is part of the process of building a linguistic structure. For example, one codelet may check for the possibility of building an aspectual relation between the words *<b>ë </b><b><i>shêng </i></b>'give birth' and <b>t" </b><b><i>le </i></b>'ASP' of sentence (9). There are several codelet types. Each type is responsible for building one of the relations shown in Table 1. In addition, there are <b>word </b>and <b>chunk </b>codelet types, which are responsible for the construction of words and chunks. Two special codelet types, namely, <b>breaker </b>and <b>answer, </b>will be explained in Section 5. 
Here, we distinguish between codelets and codelet types: a codelet type is a prewritten piece of code, while codelets are instances of a codelet type.</p><p>In the initial stage, when the program is presented with a sentence, the default codelets initialized in the coderack are <b>affix </b>and <b>affinity </b>codelets. They construct relations between character objects. Some default bottom-up word codelets are also posted to determine whether monosyllabic words can be constructed from character objects. When the <b><i>word </i></b>node in the conceptual network becomes activated by activation spreading from the <b><i>character </i></b>node, more top-down word codelets are posted. When word objects are constructed, nodes denoting relevant relations between words are activated. These nodes in turn cause the posting of codelets that build relations between word objects. Again, by activation spreading to the <b><i>chunk </i></b>node, codelets that build chunk objects are posted, which further leads to the posting of codelets that determine how chunk objects can be related.<page local="11" global="541"/></p><table class="main" frame="box" rules="all" border="0" regular="False"><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>Object Type</b></p></td><td class="cell"><p><b>Relation Type</b></p></td><td class="cell"><p><b>Object </b>1</p></td><td class="cell"><p><b>Object </b>2</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>character</b></p></td><td class="cell"><p><b>affinity relation</b></p></td><td class="cell"><p><b><i>m</i></b></p></td><td class="cell"><p></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>character</b></p></td><td class="cell"><p><b>affix relation</b></p></td><td class="cell"><p><b><i>m</i></b></p></td><td 
class="cell"><p>—</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>word</b></p></td><td class="cell"><p><b>classifier relation</b></p></td><td class="cell"><p><b><i>m</i></b><b>'eu</b></p></td><td class="cell"><p><b>3£ </b><b>'snake'</b></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>word</b></p></td><td class="cell"><p><b>reflexive adjective relation</b></p></td><td class="cell"><p><b>flfeffl </b><b>'they'</b></p></td><td class="cell"><p><b>'self</b></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>word</b></p></td><td class="cell"><p><b>structure relation</b></p></td><td class="cell"><p>'STRUC</p></td><td class="cell"><p><b>'father'</b></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>word</b></p></td><td class="cell"><p><b>coordination relation</b></p></td><td class="cell"><p><b>fD </b><b>'and'</b></p></td><td class="cell"><p>'Lisi'</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>word</b></p></td><td class="cell"><p><b>adjective relation</b></p></td><td class="cell"><p><i>M 'blue'</i></p></td><td class="cell"><p><b>^ 'sky'</b></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>word</b></p></td><td class="cell"><p><b>complex stative relation</b></p></td><td class="cell"><p><b>f# </b>'STRUC</p></td><td class="cell"><p><i>V} </i><b>'good'</b></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>word</b></p></td><td class="cell"><p><b>attitude relation</b></p></td><td class="cell"><p><b><i>mt </i></b><b>'really'</b></p></td><td class="cell"><p><b>£ 'go'</b></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>word</b></p></td><td class="cell"><p><b>disposal relation</b></p></td><td class="cell"><p>ffl 
'BA'</p></td><td class="cell"><p><b>'door'</b></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>word</b></p></td><td class="cell"><p><b>quantity relation</b></p></td><td class="cell"><p><b>Sff3 </b><b>'we-</b></p></td><td class="cell"><p><b>SS 'all'</b></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>word</b></p></td><td class="cell"><p><b>manner relation</b></p></td><td class="cell"><p><b>ll 'able'</b></p></td><td class="cell"><p><b>ff§ 'sing'</b></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>word</b></p></td><td class="cell"><p><b>degree relation</b></p></td><td class="cell"><p>îb <b>'very'</b></p></td><td class="cell"><p>i?!)! <b>'nervous'</b></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>word</b></p></td><td class="cell"><p><b>aspectual relation</b></p></td><td class="cell"><p>IB <b>'sleep'</b></p></td><td class="cell"><p><b>T </b>'ASP'</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>word</b></p></td><td class="cell"><p><b>direction relation</b></p></td><td class="cell"><p><b>#7 </b><b>'table'</b></p></td><td class="cell"><p><b>_h 'on'</b></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>word</b></p></td><td class="cell"><p><b>demonstrative relation</b></p></td><td class="cell"><p><i>m </i><b>'this'</b></p></td><td class="cell"><p>ô <b>'fish'</b></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>word</b></p></td><td class="cell"><p><b>interrogative relation</b></p></td><td class="cell"><p><i>WWt </i><b>'what'</b></p></td><td class="cell"><p><b>Bf^ 'time'</b></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>chunk</b></p></td><td class="cell"><p><b>agent relation</b></p></td><td 
class="cell"><p><b>fife </b><b>'he'</b></p></td><td class="cell"><p><b>JTfiST 'broke'</b></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>chunk</b></p></td><td class="cell"><p><b>patient relation</b></p></td><td class="cell"><p>Pj <b>'door'</b></p></td><td class="cell"><p><b>'broke'</b></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>chunk</b></p></td><td class="cell"><p><b>theme relation</b></p></td><td class="cell"><p><b><i>fê. </i></b><b>'chant'</b></p></td><td class="cell"><p><i>S </i><b>'scripture'</b></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>chunk</b></p></td><td class="cell"><p><b>source relation</b></p></td><td class="cell"><p><b>ÎJÈfpH 'from </b>China'</p></td><td class="cell"><p>IhIJK <b>'return'</b></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>chunk</b></p></td><td class="cell"><p><b>goal relation</b></p></td><td class="cell"><p><b>SIMM 'to room'</b></p></td><td class="cell"><p><b><i>m </i></b><i>'get'</i></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>chunk</b></p></td><td class="cell"><p><b>time relation</b></p></td><td class="cell"><p><b>'today'</b></p></td><td class="cell"><p><b>^Flrlß 'not well'</b></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr></table><p>Note that there is no top-level executive deciding the order in which codelets are executed. At any given time, one of the existing codelets is selected to execute. The selection is a stochastic one, and it is a function of the relative urgencies of all existing codelets. 
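</p><p>The selection scheme just described can be pictured with a minimal sketch. The names Coderack, post, step, and run are hypothetical and are not taken from the program. The sketch assumes that a codelet is drawn with probability proportional to its raw urgency, and it leaves out the temperature regulation discussed in Section 5.2.</p><p>
```python
import random


class Coderack:
    """Pool of pending codelets; one is drawn at random per cycle (hypothetical sketch)."""

    def __init__(self, seed=None):
        self._pending = []                 # list of (urgency, action) pairs
        self._rng = random.Random(seed)

    def post(self, urgency, action):
        # Urgency is an integer from 1 (least urgent) to 7 (most urgent).
        if urgency not in range(1, 8):
            raise ValueError("urgency must be an integer from 1 to 7")
        self._pending.append((urgency, action))

    def step(self):
        """Run one cycle; return False when no codelet is left to run."""
        if not self._pending:
            return False
        # Assumption: selection probability is proportional to raw urgency.
        weights = [urgency for urgency, _ in self._pending]
        index = self._rng.choices(range(len(self._pending)), weights=weights)[0]
        _, action = self._pending.pop(index)
        action(self)                       # a codelet may post follow-up codelets
        return True


def run(coderack, max_cycles=1000):
    """Execute cycles until the coderack is empty or the cycle budget is spent."""
    cycles = 0
    while cycles != max_cycles and coderack.step():
        cycles += 1
    return cycles
```
</p><p>No component of this sketch decides an execution order in advance: the order in which structures are attempted follows only from the repeated weighted draws, and a codelet can extend the pool by posting successors.</p><p>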
The urgency of a codelet is a number assigned at the time of its creation to represent the importance of the task that it is supposed to carry out (an integer from 1 to 7, with 1 the least urgent and 7 the most urgent). Many codelets are independent, and they run in parallel. Therefore, efforts towards building different structures are interleaved, sometimes cooperating and sometimes competing. The rate at which a structure is built is a function of the urgencies of its dedicated codelets. More promising structures are explored at high speeds and others at lower speeds. Almost all codelets make one or more stochastic decisions, and the high-level behavior of the program arises from the combination of thousands of these very small choices. In other words, the system's high-level behavior arises from its low-level stochastic substrate. To summarize, the macroscopic behavior of the system is not preprogrammed; the details of how it emerges from the low-level stochastic architecture of the system are given in Sections 5.2 and 5.3.</p></subsection><subsection number="4.4" title="The Computational Temperature"><p>The computational temperature is an approximate measure of the amount of coherency in the system's interpretation of a sentence: its value at a given time is a function of the amount and quality of the linguistic structures that have been built in the workspace. The computational temperature is in turn used to control the amount of randomness in the local actions of codelets. If many good linguistic structures have been built, the temperature will be low, and the system will make decisions less randomly. When few good linguistic structures have been found, the temperature will be high, leading to many more random decisions and hence to more diverse paths being explored by codelets.<footnote anchor="9"/></p><p>The notion of temperature used here is similar to that in simulated annealing (Kirkpatrick, Gelatt, and Vecchi 1983). 
Both start with a high temperature, allowing all sorts of random steps to be taken, and slowly cool the system down by lowering the temperature. However, the decrease in temperature in our system is not necessarily monotonic. It varies according to the amount of coherency in the system's interpretation of a sentence. Thus, our system has an extra degree of flexibility, which allows uphill steps in temperature; in effect, this means that the system is annealing at the metalevel as well.</p></subsection></section><section number="5." title="An Example"><p>We will use a sample run of the program on sentence (9) to illustrate many central features of the model, including the selection of a codelet, the selection of competing alternatives, and the interaction between the workspace and the conceptual network. Note that this section would be overwhelmed with details if a step-by-step explanation were given. A detailed trace of the system's execution on this sentence can be found in Gan (1994), and a short description of the program's behavior can be found in Gan (1993). Here, only selected snapshots are highlighted.</p><p>Sentence (9) is an example with local, overlap, and combination ambiguities in the fragment 本人生 <b><i>běn rén shēng. </i></b>Without considering the sentential context, these three characters have three possible segmentations: 本 人 生 <b><i>běn rén shēng </i></b>'CL human give birth', 本人 生 <b><i>běnrén shēng </i></b>'self give birth', or 本 人生 <b><i>běn rénshēng </i></b>'CL life'. Given the sentential context of (9), however, only the second alternative is correct.</p><p><b>9 "Diverse paths" refers to different ways of analyzing the structure of a sentence.</b></p><page local="12" global="542"/><subsection number="5.1" title="Initial Setup"><p>When the parsing process starts, the program is presented with the sentence. 
The temperature is clamped at 100 for the first 80 cycles to ensure that diverse paths are explored initially (the temperature ranges from 0 to 100). A <b>cycle </b>is the execution of one codelet. The number 80 was chosen by intuition and trial and error; it is not necessarily optimal. The workspace is initialized with nine character objects, each corresponding to a character of the sentence. Since the workspace contains only character objects, the only relevant concepts are character, affinity, affix, and each character of the sentence. The corresponding nodes in the conceptual network, namely <b><i>character, affinity, affix, </i></b>她 <b><i>tā, </i></b>本 <b><i>běn, </i></b>up to 子 <b><i>zi, </i></b>are set to full activation. Fourteen instances of the word codelet are posted to the coderack. They are responsible for identifying and constructing monosyllabic words. Twenty instances of the affinity codelet are also posted to identify and construct affinity relations between characters. Eight instances of the affix codelet are posted to identify and construct affix relations between characters. In general, the number of codelets posted is a function of the length of a sentence.</p></subsection><subsection number="5.2" title="Selection of a Codelet"><p>Among all codelet instances that exist in the coderack, only one is stochastically selected to execute at each cycle. The choice of which codelet instance to execute depends on three factors: (i) its urgency, (ii) the number of codelet instances in the coderack that are of the same type as the individual instance, and (iii) the current temperature. At cycle 0, the coderack contains the statistics shown in Table 2. The temperature-regulated urgency ($U_t$) is derived in the following way:</p><p>$$U_t = U^{(120 - t)/30} \qquad (1)$$</p><p>where <b><i>U </i></b>is the original urgency and <b><i>t </i></b>denotes the temperature, which lies in the range [0, 100]. 
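</p><p>As a rough numerical check, the following sketch evaluates the temperature-regulated urgency for the initial coderack of Table 2. It assumes that the exponent in equation (1) is (120 - t)/30, a reading inferred from the values in Table 2, that regulated urgencies are rounded to the nearest integer, and that the chance of drawing an instance of a codelet type is proportional to its quantity times its regulated urgency. The function names and the TABLE_2 dictionary are illustrative only.</p><p>
```python
def regulated_urgency(urgency, temperature):
    """Temperature-regulated urgency, assuming the exponent (120 - t) / 30."""
    return urgency ** ((120 - temperature) / 30)


def selection_probabilities(coderack, temperature):
    """Probability of drawing an instance of each codelet type.

    `coderack` maps a codelet type to (urgency, quantity). Regulated
    urgencies are rounded to integers, as the values in Table 2 appear to be.
    """
    weights = {
        name: quantity * round(regulated_urgency(urgency, temperature))
        for name, (urgency, quantity) in coderack.items()
    }
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Initial coderack of Table 2: codelet type -> (urgency, quantity).
TABLE_2 = {"word": (2, 14), "affinity": (3, 20), "affix": (3, 8)}

if __name__ == "__main__":
    for t in (100, 0):
        probs = selection_probabilities(TABLE_2, t)
        print(t, {name: f"{100 * p:.2f}%" for name, p in probs.items()})
```
</p><p>Under these assumptions the regulated urgencies come out as 2 for every type at temperature 100, and as 16 and 81 at temperature 0, which agrees with Table 2.</p><p>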
This equation magnifies differences in urgency values when the temperature is low. Conversely, at high temperatures, it minimizes differences in urgency values. The idea is to let the system explore diverse paths when the temperature is high, and to keep it on a single search path when the temperature is low.</p><p>At cycle 0, where the temperature is 100, the temperature-regulated urgencies of the three codelet types are the same. The probabilities of selecting an instance of a word codelet, an affinity codelet, and an affix codelet are 33.3%, 47.6%, and 19.1% respectively.</p><p><b>Table </b>2</p><p>Initial state of the coderack.</p><table class="main" frame="box" rules="all" border="1" regular="False"><tr class="row"><td class="cell"><p>Codelet Type</p></td><td class="cell"><p>Urgency (<b><i>U</i></b>)</p></td><td class="cell"><p>Temperature-regulated Urgency (<b><i>t </i></b>= 100)</p></td><td class="cell"><p>Temperature-regulated Urgency (<b><i>t </i></b>= 0)</p></td><td class="cell"><p>Quantity</p></td></tr><tr class="row"><td class="cell"><p>word</p></td><td class="cell"><p>2</p></td><td class="cell"><p>2</p></td><td class="cell"><p>16</p></td><td class="cell"><p>14</p></td></tr><tr class="row"><td class="cell"><p>affinity</p></td><td class="cell"><p>3</p></td><td class="cell"><p>2</p></td><td class="cell"><p>81</p></td><td class="cell"><p>20</p></td></tr><tr class="row"><td class="cell"><p>affix</p></td><td class="cell"><p>3</p></td><td class="cell"><p>2</p></td><td class="cell"><p>81</p></td><td class="cell"><p>8</p></td></tr></table><page local="13" global="543"/><p><b>Figure 4</b></p><p>State of the workspace at cycle 17.</p><p>These probabilities are derived as follows:</p><p>$$P(C_i) = \frac{Q_i \, U_{i,t}}{\sum_{j=1}^{n} Q_j \, U_{j,t}} \qquad (2)$$</p><p>where $Q_i$ and $Q_j$ are the quantities of codelet types $C_i$ and $C_j$ respectively, $U_{i,t}$ and $U_{j,t}$ are the urgencies of codelet types $C_i$ and $C_j$ at temperature <b><i>t </i></b>respectively, and <b><i>n </i></b>is the total number of codelet types.</p><p>Supposing that the coderack contains the same types of codelets with the same quantities, but the temperature is 0, the probabilities of selecting an instance of a word codelet, an affinity codelet, and an affix codelet become 8.99%, 65.01%, and 26.00% respectively. Therefore, at low temperatures, codelets with high urgency are preferred.</p></subsection><subsection number="5.3" title="Construction of Linguistic Structures"><p>Linguistic structures include high-level objects (words and chunks) and relations between two objects (see Table 1). In this run, for example, an affinity relation between the character objects 本 <b><i>běn </i></b>and 人 <b><i>rén </i></b>is constructed by an instance of an affinity codelet at cycle 17 (Figure 4).</p><p>An affinity codelet works on any two adjacent character objects to evaluate whether an affinity relation should be built between these two characters. The affinity relation is a quantitative measure that reflects how strongly two characters co-occur statistically. It is derived from mutual information (Fano 1961), which compares the probability that two characters occur together with the probability expected if they were independent. 
Mathematically, it is:</p><p>$$A(a,b) = \log_2 \frac{P(a,b)}{P(a)\,P(b)} \qquad (3)$$</p><p>where <b><i>A(a,b) </i></b>is the affinity relation between the character objects <b><i>a </i></b>and <b><i>b, P(a,b) </i></b>is the probability that the two character objects co-occur consecutively, and <b><i>P(a) </i></b>and <b><i>P(b) </i></b>are the probabilities that <b><i>a </i></b>and <b><i>b </i></b>occur independently. To derive affinity relations between characters, we use the usage frequencies of 6,768 Chinese characters specified in the GB2312-80 standard, and the usage frequencies of 46,520 words derived from a corpus. The total usage frequency of these words is 13,019,814. (The data was obtained from Liang Nanyuan, Beijing University of Aeronautics and Astronautics.)</p><p>Note that efforts towards building different structures are interleaved, as many codelets are independent and run in parallel. Apart from the initial set of codelets present at the onset of processing, new codelets are sometimes created by old codelets to continue working on a task in progress, and these codelets may in turn create other codelets, and so on.<page local="14" global="544"/> The cycle in which a structure is built is not preprogrammed. Rather, it emerges from the statistics of the interaction of all codelets in the coderack.</p></subsection><subsection number="5.4" title="Selection of Competing Structures"><p>It may happen that a structure being constructed is in conflict with an existing structure. In this run, for example, an affinity relation between the characters 人 <b><i>rén </i></b>and 生 <b><i>shēng </i></b>is being considered at cycle 79. This structure is in conflict with the previously constructed affinity relation between the characters 本 <b><i>běn </i></b>and 人 <b><i>rén. </i></b>Which competing structure wins is decided stochastically as a function of two factors: (i) the strengths of the competing structures, and (ii) the temperature. The strength of a structure is an approximate measure of how promising the structure is. It is an integer ranging between 0 and 100, inclusive. 
The strengths of different structures are derived according to either linguistic knowledge encoded in the lexicon or certain statistical measures. Equation (3) is a key factor in deriving the strength of an affinity relation. In this run, the strength of the proposed affinity relation between the characters 人 <b><i>rén </i></b>and 生 <b><i>shēng </i></b>is 55, while that of the existing affinity relation between the characters 本 <b><i>běn </i></b>and 人 <b><i>rén </i></b>is 56. These two values are adjusted by the temperature according to equation (4).</p><p>$$S_t = S^{(120 - t)/40} \qquad (4)$$</p><p>where <b><i>S<sub>t</sub> </i></b>is the temperature-regulated strength, <b><i>S </i></b>is the original strength, and <b><i>t </i></b>is the temperature. The effect of equation (4) is similar to that of equation (1): it maximizes differences in strength values at low temperatures and minimizes them at high temperatures. At cycle 79, the temperature is still clamped at 100, and hence the temperature-regulated strengths of these two competing structures are both 7 (rounded to the nearest integer). The decision about which structure should win is therefore a random one, as both have an equal probability of success. According to equation (4), at low temperatures, it is increasingly difficult for a new structure of lesser strength to win in competition against existing structures of greater strength. Since the system's behavior is more random at high temperatures, it is able to explore diverse paths in the initial stage when little structure has been built. When a large number of structures deemed to be good have been found, which entails a low temperature, the system will proceed in a more deterministic fashion, always preferring good paths to bad ones. Indeed, in this case, the new affinity relation between the characters 人 <b><i>rén </i></b>and 生 <b><i>shēng </i></b>has won. 
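</p><p>The effect of temperature on such a competition can be illustrated with a small sketch. It reads equation (4) as a power with exponent (120 - t)/40, and it assumes that a proposed structure wins with probability proportional to its regulated strength. The text states only that the decision is stochastic and depends on strength and temperature, so this rule and the function names are illustrative assumptions.</p><p>
```python
def regulated_strength(strength, temperature):
    """Temperature-regulated strength, read as S ** ((120 - t) / 40)."""
    return strength ** ((120 - temperature) / 40)


def challenger_win_probability(challenger, incumbent, temperature):
    """Chance that a proposed structure displaces an existing one.

    Assumption: the win probability is proportional to regulated strength.
    """
    new = regulated_strength(challenger, temperature)
    old = regulated_strength(incumbent, temperature)
    return new / (new + old)


if __name__ == "__main__":
    # Strengths from the run described above: proposed 55, existing 56.
    print(round(regulated_strength(55, 100)), round(regulated_strength(56, 100)))
    # A clearly weaker challenger (30 against 60) at high and low temperature.
    print(challenger_win_probability(30, 60, 100))
    print(challenger_win_probability(30, 60, 0))
```
</p><p>With strengths 55 and 56 at temperature 100, both regulated values round to 7, so the contest is close to an even draw. For a challenger of strength 30 against an incumbent of strength 60, the assumed rule gives roughly 0.41 at temperature 100 and roughly 0.11 at temperature 0.</p><p>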
Instead of being destroyed, the losing affinity relation between the characters 本 <b><i>běn </i></b>and 人 <b><i>rén </i></b>is retained, but it becomes dormant in the workspace.</p></subsection><subsection number="5.5" title="The Interaction between the Workspace and the Conceptual Network"><p>Activated nodes in the conceptual network spread activation to their neighbors, and thus concepts closely related to relevant concepts also become relevant. In this run, for example, the nodes <b><i>word </i></b>and <b><i>chunk </i></b>become activated at cycle 80 due to activation spreading from the <b><i>character </i></b>node. Activated nodes influence what tasks the system will focus on subsequently through the posting of top-down codelets. For example, at cycle 80, the activated <b><i>word </i></b>node causes the proportion of <b><i>word </i></b>codelets to increase to 93%. This illustrates an important feature of the system: the context-dependent activation of nodes enables the system to decide dynamically what is relevant at a given point in time, and it influences what actions to take through the posting of top-down codelets.</p><page local="15" global="545"/><p><b>Figure 5</b></p><p>State of the workspace at cycle 180.</p></subsection><subsection number="5.6" title="Detection and Resolution of Erroneous Structures"><p>By the end of cycle 180, the following structures have been built (Figure 5):</p><p>• active relations: affinity relations between the characters 人 <b><i>rén </i></b>and 生 <b><i>shēng, </i></b>and between 孩 <b><i>hái </i></b>and 子 <b><i>zi; </i></b>an affix relation between the characters 孩 <b><i>hái </i></b>and 子 <b><i>zi;</i></b></p><p>• active word objects: 孩子 <b><i>háizi </i></b>'child', 人生 <b><i>rénshēng </i></b>'life', and 本 <b><i>běn </i></b>'CL';</p><p>• active chunk objects: 人生 <b><i>rénshēng </i></b>'life', and 孩子 <b><i>háizi </i></b>'child';</p><p>• dormant relations: an affinity relation between the characters 本 <b><i>běn 
</i></b>and 人 <b><i>rén.</i></b></p><p>Among them, the word 本 <b><i>běn </i></b>'CL' is a classifier. This word has activated the <b><i>classifier </i></b>node in the conceptual network, which in turn causes the posting of <b><i>classifier </i></b>codelets to the coderack. The responsibility of this type of codelet is to explore the possibility of establishing a classifier relation between a classifier and an <b>object name.<footnote anchor="10"/></b> The use of a classifier is in general idiosyncratic. This type of idiosyncrasy is encoded in the lexicon. Since 本 <b><i>běn </i></b>cannot be the classifier of the object name 人生 <b><i>rénshēng </i></b>'life', a special type of codelet known as a breaker codelet is posted to the coderack. The role of a breaker is to identify erroneous linguistic structures and set them to dormant, restoring any dormant competing structure when necessary.</p><p>At cycle 187, a breaker codelet is executed that examines the structures that are "in trouble", namely the words 本 <b><i>běn </i></b>'CL' and 人生 <b><i>rénshēng </i></b>'life'. Since the component characters of the second word can be free, the breaker codelet concludes that this is an erroneous grouping. The word 人生 <b><i>rénshēng </i></b>'life' is made dormant. The other structures that support the word 人生 <b><i>rénshēng </i></b>'life', namely the affinity relation between the characters 人 <b><i>rén </i></b>and 生 <b><i>shēng </i></b>and the chunk 人生 <b><i>rénshēng </i></b>'life', are also made dormant. The competing alternative, the affinity relation between the characters 本 <b><i>běn </i></b>and 人 <b><i>rén, </i></b>is reactivated. This snapshot also illustrates an important feature of the system: syntactic analysis can be performed without waiting for the system to complete the task of word identification.</p><p><b>10 The term object name is borrowed from Meaning-Text linguistics (Mel'cuk 1988). 
It refers to words that cannot have a semantic dependent. A more formal attempt to define this term can be found in Polguère (to appear).</b></p><page local="16" global="546"/><p><b>Figure 6</b></p><p>State of the workspace at cycle 373.</p></subsection><subsection number="5.7" title="The Final State"><p>Figure 6 shows the state of the workspace at the end of cycle 373. For easy reference, sentence (9) is repeated here:</p><p>(9) 她 本人 生 了 三 个 孩子</p><p><b><i>tā běnrén shēng le sān gè háizi</i></b></p><p>she self give birth ASP three CL child</p><p>'She herself has given birth to three children.'</p><p>The structures built are:</p><p>• active relations: affinity relations between the characters 本 <b><i>běn </i></b>and 人 <b><i>rén, </i></b>and between 孩 <b><i>hái </i></b>and 子 <b><i>zi; </i></b>an affix relation between the characters 孩 <b><i>hái </i></b>and 子 <b><i>zi; </i></b>a reflexive adjective relation between the words 她 <b><i>tā </i></b>'she' and 本人 <b><i>běnrén </i></b>'self'; a classifier relation between the words 个 <b><i>gè </i></b>'CL' and 孩子 <b><i>háizi </i></b>'child'; a quantity relation between the words 三 <b><i>sān </i></b>'three' and 孩子 <b><i>háizi </i></b>'child'; an aspectual relation between the words 生 <b><i>shēng </i></b>'give birth' and 了 <b><i>le </i></b>'ASP';</p><p>• active words: 她 <b><i>tā </i></b>'she', 本人 <b><i>běnrén </i></b>'self', 生 <b><i>shēng </i></b>'give birth', 了 <b><i>le </i></b>'ASP', 三 <b><i>sān </i></b>'three', 个 <b><i>gè </i></b>'CL', and 孩子 <b><i>háizi </i></b>'child';</p><p>• active chunks: 她本人 <b><i>tā běnrén </i></b>'she herself', 生 <b><i>shēng </i></b>'give birth', and 三个孩子 <b><i>sān gè háizi </i></b>'three CL children';</p><p>• dormant relations: an affinity relation between the characters 人 <b><i>rén </i></b>and 生 <b><i>shēng;</i></b></p><p>• dormant words: 人生 <b><i>rénshēng </i></b>'life';</p><p>• dormant chunks: 人生 <b><i>rénshēng </i></b>'life'.</p><p>Comparing the above structures with the complete analysis of the sentence in Figure 7 (for simplicity, we have omitted relations between characters in Figure 7), we observe that the system has not yet constructed the agent and theme relations. 
They were not identified because the system came to a stop at cycle 381, after an instance of the <b>answer </b>codelet was executed.<page local="17" global="547"/></p><p><b>Figure 7</b></p><p>A complete analysis of sentence <b>(9).</b></p><p><b>Figure 8</b></p><p>A graph of structures constructed against cycle number.</p><p>This type of codelet reports on the word boundaries of a sentence. 
The system currently adopts a greedy approach and starts posting large numbers of this type of codelet as soon as it has identified a plausible interpretation of the word boundaries of a sentence. Hence, although instances of agent and theme codelets were present in the coderack, they were overwhelmed by the ubiquitous answer codelets.</p><p>Figure 8 summarizes the cycle numbers at which the various types of structures were constructed during this run. In this figure we see that affinity relations are built earlier than words, reflecting the system's preference for words of greater lengths. The system makes use of statistical information (the mutual information scores) to make quick and reliable guesses of the locations of these words. It can also be observed that overall, there is a gradual shift in the types of operations executed, from being character-centered initially, to word-centered, and then to chunk-centered. From time to time, however, the construction of different types of structures is interleaved.</p></subsection></section><section number="6." title="System Performance and Discussions"><p>Thirty ambiguous fragments that have alternating word boundaries in different sentential contexts were presented to the system, and the system was able to resolve all the ambiguities. The test set covers the four types of word boundary ambiguities described in Section 2.<page local="18" global="548"/> When the sentential contexts of locally ambiguous fragments (both the overlap and combination type) were varied, our system was able to identify the correct word boundaries. When the system was presented with sentences with global ambiguities, it produced all the plausible alternative word boundaries. However, in any single run on such a sentence, only one alternative is generated. The system's behavior is similar to human performance in the goblet/faces recognition problem in perception (Hoffman and Richards 1984). 
We cannot see both the goblet and the faces at the same time, but we are able to switch back and forth between these two interpretations. The frequencies of generating all the alternatives vary from one sentence to another. It is important to note that such frequencies are not meant to indicate some kind of "goodness" measure of alternative word boundary interpretations. Neither are they meant to reflect the preferences of a human. They are merely a reflection of the usage frequencies of Chinese characters and words in our dictionary.</p><p>[Figure 8 plot: cycle number (vertical axis, 0 to 400) against structure type (affinity, affix, word, chunk, reflexive adjective, classifier, quantity, aspectual).]</p><p>The system's ability to generate different word boundaries for a globally ambiguous sentence arises from its stochastic search mechanism, which does not rule out a priori certain possibilities. This feature enables the system to occasionally discover less-obvious interpretations of word boundaries. 
For example, in addition to the two apparent ways of aligning the fragment <b>已经过 </b><b><i>yǐjīngguò </i></b>as either <b>已经 过 </b><b><i>yǐjīng guò </i></b>'already over' or <b>已 经过 </b><b><i>yǐ jīngguò </i></b>'already go through' in sentences (10a) and (10b), a less-obvious possibility that the system has identified is: <b>已 经 过 </b><b><i>yǐ jīng guò </i></b>'already experience over', where <b>过 </b><b><i>guò </i></b>'over' is the complement of <b>经 </b><b><i>jīng </i></b>'experience'.</p><doubt alpha="38.5" length="26" tooSmall="False" monospace="0.0">(10)a.   我 已经 过 了 学生 时代</doubt><doubt alpha="64.3" length="42" tooSmall="False" monospace="0.0">wǒ yǐjīng guò le xuéshēng shídài</doubt><p>I already over ASP student period</p><p>'My student days are over.'</p><doubt alpha="50.0" length="18" tooSmall="False" monospace="0.0">b.   我 已 经过 了 学生 时代</doubt><p><b><i>wǒ yǐ jīngguò le xuéshēng shídài</i></b></p><p>I already go through ASP student period</p><p>'I have already gone through the period as a student.'</p><p>c. <b>我 已 经 过 了 学生 时代</b></p><doubt alpha="62.8" length="43" tooSmall="False" monospace="0.0">wǒ yǐ jīng guò le xuéshēng shídài</doubt><p>I already experience over ASP student period</p><p>'I have already experienced student life.'</p><p>The system rarely produces the less-obvious interpretations. This demonstrates that its mechanisms are able to strike an effective balance between random search and deterministic search, imbuing it with both flexibility and robustness.</p><p>An issue that arises from the nondeterministic feature of the system is: will the word boundaries of a locally ambiguous sentence vary at different runs? To address this, we ran the program with each sentence 20 times. We found that for sentences covered by our current set of linguistic descriptions, the system arrived at the same word boundaries despite different paths being taken at each run. 
For linguistic phenomena not yet covered, suboptimal solutions may sometimes be generated. For example, when the program worked on sentence (10), it produced sentence (11) once as the answer.</p><page local="19" global="549"/><doubt alpha="0.0" length="4" tooSmall="False" monospace="0.0">(11)</doubt><p>'China has many resources which have either been exploited or not yet been exploited.'</p><p>In this run, the bisyllabic word <b>开发 </b><b><i>kāifā </i></b>'develop' was wrongly identified as two monosyllabic words <b>开 </b><b><i>kāi </i></b>'open' and <b>发 </b><b><i>fā </i></b>'distribute'. Determining the proper use of two juxtaposed predicates, such as <b>开 </b><b><i>kāi </i></b>'open' and <b>发 </b><b><i>fā </i></b>'distribute' in this case, requires a careful study of serial verb constructions. It is inevitable that the system would make such a mistake as our linguistic descriptions have not yet covered this phenomenon.</p><p>In comparison, consider the performance of a strictly statistical approach based on mutual information (Lua and Gan 1994): the latter wrongly identified the word boundaries in 11 out of the 30 ambiguous fragments. For the 6 fragments that appear in globally ambiguous sentences, the mutual information approach gave only one interpretation of the word boundaries. In terms of processing speed, the mutual information approach took an average of 110.4 ms to process one character; our approach took 1.7 s.<footnote anchor="11"/> The extra time in our approach is spent in parsing sentences.</p></section><section number="7." title="Conclusion"><p>In this paper, we reported on a stochastically emergent model for language processing and described its application to the modeling of context effects in ambiguous Chinese word boundary interpretation. The model simulates language processing as a collective phenomenon that emerges from a myriad of microscopic and diverse activities. 
The proposed mechanism, whereby word objects and chunk objects are formed by the hooking up of character objects as the latter are gradually cooled down, is analogous to the crystallization process in chemistry.</p><p><b>11 The mutual information approach was written in Borland C, version 2.0 while the new approach was written in Borland C++, version 3.0. Both ran on a 33 MHz, 386 machine.</b></p><table class="main" frame="box" rules="all" border="0" regular="False"><tr class="row"><td class="cell"><p><b>中国</b></p></td><td class="cell"><p><b>已</b></p></td><td class="cell"><p><b>开发</b></p></td><td class="cell"><p><b>和</b></p></td><td class="cell"><p><b>尚</b></p></td><td class="cell"><p><b>未</b></p></td></tr><tr class="row"><td class="cell"><p><b><i>zhōngguó</i></b></p></td><td class="cell"><p><b><i>yǐ</i></b></p></td><td class="cell"><p><b><i>kāifā</i></b></p></td><td class="cell"><p><b><i>hé</i></b></p></td><td class="cell"><p><b><i>shàng</i></b></p></td><td class="cell"><p><b><i>wèi</i></b></p></td></tr><tr class="row"><td class="cell"><p>China</p></td><td class="cell"><p>already</p></td><td class="cell"><p>exploit</p></td><td class="cell"><p>and</p></td><td class="cell"><p>yet</p></td><td class="cell"><p>not</p></td></tr><tr class="row"><td class="cell"><p><b>开发</b></p></td><td class="cell"><p><b>的</b></p></td><td class="cell"><p><b>资源</b></p></td><td class="cell"><p><b>都</b></p></td><td class="cell"><p><b>很</b></p></td><td class="cell"><p><b>多</b></p></td></tr><tr class="row"><td class="cell"><p><b><i>kāifā</i></b></p></td><td class="cell"><p><b><i>de</i></b></p></td><td class="cell"><p><b><i>zīyuán</i></b></p></td><td class="cell"><p><b><i>dōu</i></b></p></td><td class="cell"><p><b><i>hěn</i></b></p></td><td class="cell"><p><b><i>duō</i></b></p></td></tr><tr class="row"><td class="cell"><p>exploit</p></td><td class="cell"><p>STRUC</p></td><td class="cell"><p>resource</p></td><td class="cell"><p>all</p></td><td class="cell"><p>very</p></td><td class="cell"><p>many</p></td></tr></table><table class="main" frame="box" rules="all" border="0" regular="False"><tr class="row"><td class="cell"><p><b>中国</b></p></td><td class="cell"><p><b>已</b></p></td><td class="cell"><p><b>开发</b></p></td><td class="cell"><p><b>和</b></p></td><td class="cell"><p><b>尚</b></p></td><td class="cell"><p><b>未</b></p></td><td class="cell"><p></p></td></tr><tr class="row"><td class="cell"><p><b><i>zhōngguó</i></b></p></td><td class="cell"><p><b><i>yǐ</i></b></p></td><td class="cell"><p><b><i>kāifā</i></b></p></td><td class="cell"><p><b><i>hé</i></b></p></td><td class="cell"><p><b><i>shàng</i></b></p></td><td class="cell"><p><b><i>wèi</i></b></p></td><td class="cell"><p></p></td></tr><tr class="row"><td class="cell"><p>China</p></td><td class="cell"><p>already</p></td><td class="cell"><p>exploit</p></td><td class="cell"><p>and</p></td><td class="cell"><p>yet</p></td><td class="cell"><p>not</p></td><td class="cell"><p></p></td></tr><tr class="row"><td class="cell"><p><b>开</b></p></td><td class="cell"><p><b>发</b></p></td><td class="cell"><p><b>的</b></p></td><td class="cell"><p><b>资源</b></p></td><td class="cell"><p><b>都</b></p></td><td class="cell"><p><b>很</b></p></td><td class="cell"><p><b>多</b></p></td></tr><tr class="row"><td class="cell"><p><b><i>kāi</i></b></p></td><td class="cell"><p><b><i>fā</i></b></p></td><td class="cell"><p><b><i>de</i></b></p></td><td class="cell"><p><b><i>zīyuán</i></b></p></td><td class="cell"><p><b><i>dōu</i></b></p></td><td class="cell"><p><b><i>hěn</i></b></p></td><td class="cell"><p><b><i>duō</i></b></p></td></tr><tr class="row"><td class="cell"><p>open</p></td><td class="cell"><p>distribute</p></td><td class="cell"><p>STRUC</p></td><td class="cell"><p>resource</p></td><td class="cell"><p>all</p></td><td class="cell"><p>very</p></td><td class="cell"><p>many</p></td></tr></table><p>Our application is distinct from existing work in two main respects:</p><p>• Word identification: We show that the full power of natural language processing can be brought to bear on the issue of word identification effectively and seamlessly. The model is able to resolve ambiguities appearing in different sentential contexts. This is an improvement over statistical approaches such as the relaxation method (Fan and Tsai 1988), which generates all possible ways of grouping the characters of a sentence into words, and then uses some scoring function to select the<page local="20" global="550"/> best combination. At the same time, this model eliminates the use of ad hoc rules, as syntactic and semantic analysis are interleaved with word identification. This application is diametrically opposed to the reductionist approach of separating word segmentation and sentence analysis into two distinct stages. 
We have argued that our approach can avoid the computational problem of combinatorial explosion as the architecture has appropriate mechanisms to regulate run-time resources dynamically.</p><p>• Sentence analysis: We show that a sentence can be analyzed without assuming a presegmented input. The main feature is that there is no fixed, predetermined order of morphological, syntactic, and semantic analysis, since the control mechanism is a nondeterministic one. Essentially, the order in which these analyses are carried out is dependent on what has been discovered so far by the system, and the system's perception of what is relevant to the task it is currently investigating.</p><p>The essential idea of the proposed model is that of stochastically guided convergence to what is called a globally optimum state. This model shares some features with APRIL (Annealing Parser for Realistic Input Language) (Sampson, Haigh, and Atwell 1989). APRIL uses simulated annealing to determine the most plausible parse tree of a sentence. It begins with an arbitrary tree. Many local modifications are generated randomly. They are either adopted or rejected according to their effect on a plausibility measure. Modifications that improve the plausibility measure are always accepted, while unfavorable modifications are rejected only if the loss of merit exceeds a certain threshold. The threshold value is generated randomly but its mean value decreases according to some predefined schedule. This differs from the behavior of the computational temperature in our system, which does not decrease monotonically. 
Our system further differs from APRIL in the following aspects: (i) APRIL begins with an arbitrary parse tree whereas our system begins with no parse structure; (ii) APRIL's plausibility measure is defined using statistics collected from a treebank of manually parsed English text while ours is derived from mutual information statistics and linguistic constraints; (iii) parse trees in APRIL are of the immediate-constituency type while ours are dependency-based. That is, nodes in our system are either characters, words, or chunks. There are no nonterminal nodes defined with grammatical categories.</p><p>Our model also shares some features with connectionist models, such as fine-grained parallelism, local actions, competition, spreading activation, and statistically emergent effects from a large number of small, subcognitive events. On the other hand, the representation of concepts is quite different: they are encoded as atomic, symbolic primitives instead of being distributed as weighted connections between nodes in a network, as is common in connectionist systems. Therefore, in terms of the degree to which concepts are distributed, our representation has a strong symbolic flavor; in terms of the extent to which high-level behavior emerges from lower-level processes, ours has a strong subsymbolic orientation. By providing an account of the language understanding process at such an intermediate level of description, we hope that our results will provide a guide to connectionists studying how such intermediate-level structures can emerge from neurons or cell-assemblies in the brain.</p><page local="21" global="551"/></section><section number="8." title="Future Work"><p>Our application, which handles only thirty sentences at present, has enabled us to focus on the mechanisms that underlie the process of sentence comprehension, and their interactions. 
With the progress made in this study, which would not have been possible if we had plunged straight into large-scale unrestricted texts, our next concern would be to address the issue of scalability. There are two aspects to this issue.</p><p>• The effect of various parameter values chosen for the formulae shown in Section 5 on the operation of the program: These values are set by trial-and-error. They are not specifically tailored to our test set. To finesse these parameters in order to completely weed out unpromising search paths is impossible, since decision making in the system is stochastic. We therefore do not anticipate that the setting of the various parameter values is an issue during scaling up. The values of the parameters may affect the rate of convergence, but they will not affect the accuracy of the system in terms of the analysis results.</p><p>• The possibility of generating thousands of codelets as a result of using a large lexicon: We do not expect such a scenario to occur. Instead, having a large lexicon means that the system is able to handle more sentences. The number of codelets spawned to process a sentence is determined by the number of characters and words in the sentence, and the types of words and chunks in the sentence, not by the size of the lexicon. In addition, there are built-in mechanisms to manage the growth of codelets. We have demonstrated in Section 5 how we have made use of statistics (the maximum matching heuristics and mutual information) to avoid generating all possible word boundary combinations. The sample run in Section 5 has also demonstrated that the program need not finish executing all codelets in the coderack before it is allowed to stop, and that simpler and more clear-cut decisions tend to be made before the more subtle ones. 
Furthermore, certain features of the system, namely, the stochastic selection of a codelet by relative urgencies, the use of the conceptual network as a top-down controller, and the interactions between the conceptual network and the workspace, enable the system to dynamically decide on the number and the types of codelets to be generated.</p><p>The real bottleneck when scaling up is the acquisition of linguistic descriptions, as our current work has limited breadth and depth of coverage. Therefore, the current system has less practical value to people working on the word segmentation problem, where the main concern is to develop algorithms that work for large-scale text. However, the proposed model provides a useful architecture for us to study the root of what people do when they encounter unknown words in text. This issue of unknown-word resolution has been the single major problem in the segmentation of unrestricted text. Understanding how higher-level knowledge is brought to bear on this issue is essential to the design of an effective solution. Hence, our next goal is to apply the model to handle the unknown-word problem, including treatments of unknown compounds such as personal names, previously unseen place names, foreign names in transliteration, and company names.</p><page local="22" global="552"/></section><section title="Acknowledgments"><p>Throughout the course of this work, we have benefited from discussions with Alain Polguère, Melanie Mitchell, Robert French, Ngai-lai Cheng, Chew Lim Tan, Loke Soo Hsu, Gee Kim Yeo, Guojin, Zhibiao Wu, and Paul Wu. We would like to express our thanks to them. We are also grateful to the reviewers for their insightful comments and suggestions.</p></section><references><p>Chang, Jyun-Sheng, C. D. Chen, and S. D. Chen. 1991. Chinese word segmentation through constraint satisfaction and statistical optimization (in Chinese). In <b><i>Proceedings of ROCLING-IV, </i></b>R.O.C. 
Computational Linguistics Conference, pages 147-165.</p><p>Chao, Yuen-Ren. 1957. Formal and semantic discrepancies between different levels of Chinese structure. <b><i>Bulletin of The Institute of History and Philosophy, </i></b>XXVIII: 1-16.</p><p>Chen, Keh-Jiann and Shing-Huan Liu. 1992. Word identification for Mandarin Chinese sentences. In <b><i>Proceedings of COLING-92, </i></b>pages 101-107.</p><p>Chiang, Tung-Hui, Jing-Shin Chang, Ming-Yu Lin, and Keh-Yih Su. 1992. Statistical models for word segmentation and unknown resolution. In <b><i>Proceedings of ROCLING V, </i></b>R.O.C. Computational Linguistics Conference, pages 121-146.</p><p>Fan, Charng-Kang and Wen-Hsiang Tsai. 1988. Automatic word identification in Chinese sentences by the relaxation technique. <b><i>Computer Processing of Chinese and Oriental Languages, </i></b>4(1): 33-56.</p><p>Fano, Robert M. 1961. <b><i>Transmission of Information. </i></b>MIT Press, Cambridge, MA.</p><p>French, Robert M. 1992. <b><i>Tabletop: An Emergent, Stochastic Computer Model of Analogy-Making. </i></b>Ph.D. thesis, University of Michigan.</p><p>Gan, Kok-Wee. 1993. Integrating word boundary identification with sentence understanding. In <b><i>Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, </i></b>pages 301-303. Ohio State University, June.</p><p>Gan, Kok-Wee. 1994. <b><i>Integrating Word Boundary Disambiguation with Sentence Understanding. </i></b>Ph.D. thesis, Department of Information Systems &amp; Computer Science, National University of Singapore.</p><p>He, Ke-Kang, Hui Xu, and Bo Sun. 1991. Design principle of expert system for automatic word segmentation in written Chinese (in Chinese). <b><i>Journal of Chinese Information Processing, </i></b>5(2): 1-14.</p><p>Hoffman, Donald D. and Whitman A. Richards. 1984. Parts of recognition. <b><i>Cognition, </i></b>18: 65-96.</p><p>Hofstadter, Douglas R. 1983. The architecture of JUMBO. In <b><i>Proceedings of the International Machine Learning Workshop, </i></b>edited by Ryszard Michalski.</p><p>Huang, Xiang-Xi. 1989. A produce-test approach to automatic segmentation of written Chinese (in Chinese). <b><i>Journal of Chinese Information Processing, </i></b>3(4): 42-48.</p><p>Kirkpatrick, S., C. D. Gelatt Jr., and M. P. Vecchi. 1983. Optimization by simulated annealing. <b><i>Science, </i></b>220: 671-680.</p><p>Lai, T. B. Y., S. C. Lun, C. F. Sun, and M. S. Sun. 1992. A tagging-based first-order Markov model approach to automatic word identification for Chinese sentences. In <b><i>Proceedings of the 1992 International Conference on Computer Processing of Chinese &amp; Oriental Languages, </i></b>pages 17-23.</p><p>Li, Charles N. and Sandra A. Thompson. 1981. <b><i>Mandarin Chinese: A Functional Reference Grammar. </i></b>University of California Press.</p><p>Liang, Nan-Yuan. 1983. Automatic word segmentation in written Chinese and an automatic word segmentation system: CDWS (in Chinese). In <b><i>Proceedings of the National Chinese Language Processing System.</i></b></p><p>Liang, Nan-Yuan. 1990. The knowledge of Chinese words segmentation (in Chinese). <b><i>Journal of Chinese Information Processing, </i></b>4(2): 29-33.</p><p>Lua, Kim-Teng and Kok-Wee Gan. 1994. An application of information theory in Chinese word segmentation. <b><i>Computer Processing of Chinese &amp; Oriental Languages, </i></b>8(1): 115-123.</p><p>Mel'čuk, Igor A. 1988. <b><i>Dependency Syntax: Theory And Practice. </i></b>State University Press of New York.</p><p>Meredith, Marsha J. 1986. Seek-Whence: A model of pattern perception. Technical Report 214, Computer Science Department, Indiana University, Bloomington, IN.</p><p>Mitchell, Melanie. 1990. <b><i>Copycat: A Computer Model of High-Level Perception and Conceptual Slippage in Analogy-Making. </i></b>Ph.D. thesis, University of Michigan.</p><p>Polguère, Alain. To appear. 
Meaning-text semantic networks as a formal language. In <b><i>Current Issues In Meaning-Text Linguistics, </i></b>edited by Leo Wanner.</p><page local="23" global="553"/><p>Sampson, Geoffrey, Robin Haigh, and Eric Atwell. 1989. Natural language analysis by stochastic optimization: A progress report on project APRIL. <b><i>Journal of Experimental and Theoretical Artificial Intelligence, </i></b>1(4): 271-287.</p><p>Sproat, Richard and Chilin Shih. 1990. A statistical method for finding word boundaries in Chinese text. <b><i>Computer Processing of Chinese &amp; Oriental Languages, </i></b>4(4): 336-351.</p><p>Sproat, Richard, Chilin Shih, William Gale, and Nancy Chang. 1994. A stochastic finite-state word-segmentation algorithm for Chinese. In <b><i>Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, </i></b>pages 66-73.</p><p>Sproat, Richard, Chilin Shih, William Gale, and Nancy Chang. 1996. A stochastic finite-state word-segmentation algorithm for Chinese. <b><i>Computational Linguistics, </i></b>22(3).</p><p>Tesnière, Lucien. 1959. <b><i>Elements de la syntaxe structurale. </i></b>Klincksieck, Paris.</p><p>Wang, Xiaog-Long, Kai-Zhu Wang, and Xiao-Hua Bai. 1991. Separating syllables and characters into words in natural language understanding (in Chinese). <b><i>Journal of Chinese Information Processing, </i></b>5(3): 48-58.</p><p>Wu, Ming-Wen and Keh-Yih Su. 1993. Corpus-based automatic compound extraction with mutual information and relative frequency count. In <b><i>Proceedings of R.O.C. Computational Linguistics Conference VI, </i></b>pages 207-216.</p><p>Yao, Tian-Shun, Gui-Ping Zhang, and Ying-Ming Wu. 1990. A rule-based Chinese automatic segmenting system (in Chinese). <b><i>Journal of Chinese Information Processing, </i></b>4(1): 37-43.</p><p>Yeh, Ching-Long and Hsi-Jian Lee. 1991. Rule-based word identification for Mandarin Chinese sentences: A unification approach. 
<b><i>Computer Processing of Chinese &amp; Oriental Languages, </i></b>5(2): 97-118.</p><page local="24" global="554"/></references></body></article>