<?xml version="1.0"?><!DOCTYPE article SYSTEM "/project/take/software/searchbench_offline_processing/paperxml_generator/aclextractor/src/python/../resource/dtd/paperxml.dtd"><article><header><firstpageheader><page local="1" global="689"/><title>Exploring variation across biomedical subdomains</title><pubinfo>Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 689-697, Beijing, August 2010</pubinfo><author surname="Lippincott" givenname="Tom"><org  name="University of Cambridge" country="United Kingdom" city="Cambridge"/></author><author surname="Ó Séaghdha" givenname="Diarmuid"><org  name="University of Cambridge" country="United Kingdom" city="Cambridge"/></author><author surname="Sun" givenname="Lin"><org  name="University of Cambridge" country="United Kingdom" city="Cambridge"/></author><author surname="Korhonen" givenname="Anna"><org  name="University of Cambridge" country="United Kingdom" city="Cambridge"/></author></firstpageheader><frontmatter><p><b>Exploring variation across biomedical subdomains</b></p><p><b>Tom Lippincott </b>and <b>Diarmuid Ó Séaghdha </b>and <b>Lin Sun </b>and <b>Anna Korhonen</b></p><p>Computer Laboratory, University of Cambridge</p><p>{tl318,do242,ls418,alk23}@cam.ac.uk</p></frontmatter><abstract>Previous research has demonstrated the importance of handling differences between domains such as "newswire" and "biomedicine" when porting NLP systems from one domain to another. In this paper we identify the related issue of <i>subdomain variation, </i>i.e., differences between subsets of a domain that might be expected to behave homogeneously. Using a large corpus of research articles, we explore how subdomains of biomedicine vary across a variety of linguistic dimensions and discover that there is rich variation. We conclude that an awareness of such variation is necessary when deploying NLP systems for use in single or multiple subdomains. 
</abstract></header><body><section number="1" title="Introduction"><p>One of the most noticeable trends in the past decade of Natural Language Processing (NLP) research has been the deployment of language processing technology to meet the information retrieval and extraction needs of scientists in other disciplines. This meeting of fields has proven mutually beneficial: scientists increasingly rely on automated tools to help them cope with the exponentially expanding body of publications in their field, while NLP researchers have been spurred to address new conceptual problems in theirs. Among the fundamental advances from the NLP perspective has been the realisation that tools which perform well on textual data from one source may fail to do so on another unless they are tailored to the new source in some way. This has led to significant interest in the idea of contrasting <i>domains </i>and the concomitant problem of <i>domain adaptation,</i> as well as the production of manually annotated domain-specific corpora.<footnote anchor="1"/></p><p>One definition of <i>domain variation </i>associates it with differences in the underlying probability distributions from which different sets of data are drawn (Daumé III and Marcu, 2006). The concept also mirrors the notion of variation across thematic subjects and the corpus-linguistic notions of <i>register </i>and <i>genre </i>(Biber, 1988). In addition to the differences in vocabulary that one would expect to observe, domains can vary in many linguistic variables that affect NLP systems. The scientific domain which has received the most attention (and is the focus of this paper) is the biomedical domain. Notable examples of corpus construction projects for the biomedical domain are PennBioIE (Kulick et al., 2004) and GENIA (Kim et al., 2003). 
These corpora have been used to develop systems for a range of processing tasks, from entity recognition (Jin et al., 2006) to parsing (Hara et al., 2005) to coreference resolution (Nguyen and Kim, 2008).</p><p>An implicit assumption in much previous work on biomedical NLP has been that particular subdomains of biomedical literature - typically molecular biology - can be used as a model of biomedical language in general. For example, GENIA consists of abstracts dealing with a specific set of subjects in molecular biology, while PennBioIE covers abstracts in two specialised domains, cancer genomics and the behaviour of a particular class of enzymes. This assumption of representativeness is understandable because linguistic annotation is labour-intensive and it may not be worthwhile to produce annotated corpora for multiple subdomains within a single discipline if there is little task-relevant variation across those subdomains.<page local="2" global="690"/> However, such conclusions should not be drawn before studying the actual degree of difference between the subdomains of interest.</p><footnote label="1">A workshop dedicated to domain adaptation is co-located with ACL 2010.</footnote><p>One of the principal goals of this paper is to map how the concept of "biomedical language", often construed as a monolithic entity, is composed of diverse patterns of behaviour at more fine-grained topical levels. Hence we study linguistic variation in a broad biomedical corpus of abstracts and full papers, the PMC Open Access Subset.<footnote anchor="2"/> We select a range of lexical and structural phenomena for quantitative investigation. 
The results indicate that common subdomains for resource development are not representative of biomedical text in general and furthermore that different linguistic features often partition the subdomains in quite different ways.</p></section><section number="2" title="Related Work"><p>A number of researchers have explored the differences between non-technical and scientific language. Biber and Gray (2010) describe two distinctive syntactic characteristics of academic writing which set it apart from general English. Firstly, in academic writing additional information is most commonly integrated by pre- and post-modification of phrases rather than by the addition of extra clauses. Secondly, academic writing places greater demands on the reader by omitting non-essential information, through the frequent use of passivisation, nominalisation and noun compounding. Biber and Gray also show that these tendencies towards "less elaborate and less explicit" language have become more pronounced in recent history.</p><p>We now turn to corpus studies that focus on biomedical writing. Verspoor et al. (2009) use measurements of lexical and structural variation to demonstrate that Open Access and subscription-based journal articles in a specific domain (mouse genomics) are sufficiently similar that research on the former can be taken as representative of the latter. While their primary goal is different from ours and they do not consider variation across multiple domains, they do compare their mouse genomics corpus with small reference corpora drawn from newswire and general biomedical sources. 
This analysis unsurprisingly finds differences between the domain and newswire corpora across many linguistic dimensions; more interestingly for our purposes, the comparison of domain text to the broader biomedical superdomain shows a more complex picture with similarities in some aspects (e.g., passivisation and negation) and dissimilarities in others (e.g., sentence length, semantic features).</p><footnote label="2">http://www.ncbi.nlm.nih.gov/pmc/about/openftlist.html</footnote><p>Friedman et al. (2002) document the "sublanguages" associated with two biomedical domains: clinical reports and molecular biology articles. They set out restricted ontologies and frequent co-occurrence templates for the two domains and discuss the similarities and differences between them, but they do not perform any quantitative analysis.</p><p>Other researchers have focused on specific phenomena, rather than cataloguing a broad scope of variation. Cohen et al. (2008) carry out a detailed analysis of argument realisation with respect to verbs and nominalisations, using the GENIA and PennBioIE corpora. Nguyen and Kim (2008) compare the behaviour of anaphoric pronouns in newswire and biomedical corpora; they improve the performance of a pronoun resolver by incorporating their observations, thus demonstrating the importance of capturing domain-specific phenomena. Nguyen and Kim's findings are discussed in more detail in Section 5.4 below.</p></section><section number="3" title="Subdomains in the OpenPMC Corpus"><p>The Open Access Subset of PubMed Central (OpenPMC) is the largest publicly available corpus of full-text articles in the biomedical domain. OpenPMC comprises 169,338 articles drawn from 1233 medical journals, totalling approximately 400 million words. The NIH maintains a one-to-many mapping from journals to 122 subject areas (NIH, 2009b). This covers about 400 of the OpenPMC journals, but these account for over 70% of the database by byte size and word count. Journals are assigned up to five subject areas, with the majority assigned one (69%) or two (26%) subjects. In this paper we adopt the OpenPMC subject areas (e.g. "Pulmonary Medicine", "Genetics", "Psychiatry") as the basis for subdomain comparison.</p><page local="3" global="691"/><p>Subdomains shown in Figure 1 (horizontal axis: word count in millions): Genetics; Medical Informatics; Neoplasms; Molecular Biology; Biochemistry; Public Health; Environmental Health; Microbiology; Medicine; Biotechnology; Neurology; Science; Communicable Diseases; Cell Biology; Rheumatology; Critical Care; Tropical Medicine; Physiology; Pulmonary Medicine; Endocrinology; Virology; Botany; Vascular Diseases; Ophthalmology; Genetics, Medical; Embryology; Psychiatry; Biomedical Engineering; Veterinary Medicine; Pediatrics; Gastroenterology; Geriatrics; Pharmacology; Obstetrics; Education; Complementary Therapies; Ethics.</p><p>Figure 1: OpenPMC word count by subdomain. Dark colouring indicates data assigned a single subdomain; each lighter shade indicates an additional overlapping subdomain.</p></section><section number="4" title="Methodology"><subsection number="4.1" title="Data selection and preprocessing"><p>An important initial question was how to treat data with multiple classifications: we only consider journals assigned a single subdomain, to avoid the added complexity of interactions in data from overlapping subdomains. To ensure sufficient data for comparing a variety of linguistic features, we discard the subdomains with fewer than one million words meeting the single-subdomain criterion. After review, we also drop the "Biology" subdomain, which appears to function as a catch-all for many loosely related areas. 
Figure 1 shows the distribution of data across the subjects we use, by word count, with lighter-coloured areas representing data that is assigned multiple subjects. These subjects provide a convenient starting point for dividing the corpus into subdomains (hereafter, "subdomain" will be used rather than "subject"). We also add a reference subdomain, "Newswire", composed of a 6 million word random sample from the English Gigaword corpus (Graff et al., 2005). The final data set has a total of 39 subdomains.</p><p>Articles in the OpenPMC corpus are formatted according to a standard XML tag set (NIH, 2009a). We first convert each article to plain text, ignoring "non-content" elements such as tables and formulas, and split the result into sentences, aggregating the results by subdomain.</p></subsection><subsection number="4.2" title="Feature extraction"><p>We investigate subdomain variation in our corpus across a range of lexical, syntactic, sentential and discourse features. The corpus is lemmatised, tagged and parsed using the C&amp;C pipeline (Curran et al., 2007) with the adapted part-of-speech and lexical category tagging models produced by Rimell and Clark (2009) for biomedical parsing.</p><p>From this output we count occurrences of noun, verb, adjective and adverb lemmas, part-of-speech (POS) tags, grammatical relations (GRs), chunks, and lexical categories. The lemma features are Zipfian-distributed items from an open class, so we have experimented with filtering low-frequency items at various thresholds to reduce noise and improve processing speed. The other feature sets can be viewed as closed classes, where filtering is unnecessary.</p><p>Since verbs are central to the meaning and structure of sentences, we consider their special behavior by constructing features for each verb's distribution over other grammatical properties. Several grammatical properties are captured by pairing each verb with its POS (indicating e.g. 
tense, such as present, past, and present participle). Voice is determined from additional annotation output by the C&amp;C parser. Table 1 shows the POS distribution for the verb "restrict" in two subdomains from the corpus. Finally, we record distributions over verb subcategorization frames (SCFs) taken by each verb, and over the GRs it participates in. SCFs were extracted using the system of Preiss et al. (2007).</p><page local="4" global="692"/><table caption="Table 1: Distribution over POS tags for verb &quot;restrict&quot;, in two subdomains" class="main" frame="box" rules="all" border="1" regular="False"><tr class="row"><td class="cell"><p>Subdomain</p></td><td class="cell"><p>VB</p></td><td class="cell"><p>VBG</p></td><td class="cell"><p>VBN</p></td><td class="cell"><p>VBP</p></td><td class="cell"><p>VBZ</p></td></tr><tr class="row"><td class="cell"><p>Medical Informatics</p></td><td class="cell"><p>.35</p></td><td class="cell"><p>.29</p></td><td class="cell"><p>.16</p></td><td class="cell"><p>.09</p></td><td class="cell"><p>.11</p></td></tr><tr class="row"><td class="cell"><p>Cell Biology</p></td><td class="cell"><p>.14</p></td><td class="cell"><p>.43</p></td><td class="cell"><p>.05</p></td><td class="cell"><p>.10</p></td><td class="cell"><p>.29</p></td></tr></table><p>To facilitate a more robust and interpretable analysis of vocabulary differences, we estimate a "topic model" of the corpus with Latent Dirichlet Allocation (Blei et al., 2003) using the MALLET toolkit.<footnote anchor="3"/> As preprocessing we divide the corpus into articles, removing stopwords and words shorter than 3 characters. The Gibbs sampling procedure is parameterised to induce 100 topics, each giving a coherent cluster of related words learned from the data, and to run for 1000 iterations. We collate the predicted distribution over topics for each article in a subdomain, weighted by article word count, to produce a topic distribution for the subdomain.</p></subsection><subsection number="4.3" title="Measurements of divergence"><p>Our goal is to illustrate the presence or absence of differences between the feature sets, and to do so we calculate the Jensen-Shannon divergence (JSD) and the Pearson correlation. 
Jensen-Shannon divergence is a bounded, symmetric measure of the divergence between probability distributions, while Pearson correlation quantifies the linear relationship between two real-valued samples.</p><p>The count-features are weighted, for a given subdomain, by the feature's log-likelihood between the subdomain's data and the rest of the corpus. Log-likelihood has been shown to perform well when comparing counts of potentially low-frequency features (Rayson and Garside, 2000) such as those found in Zipfian-distributed data. This serves to place more weight in the comparison on items that are distinctive of the subdomain with respect to the entire corpus.</p><p>While the count-features are treated as a single distribution for the purposes of JSD, the verbwise-features are composed of many distributions, one for each verb lemma. Our approach is to combine the JSD of the verbs, weighted by the log-likelihood of the verb lemma between the two subdomains in question, and normalize the distances to the interval [0, 1]. Using the lemma's log-likelihood assumes that, when a verb's distribution behaves differently in a subdomain, its frequency changes as well.</p><footnote label="3">http://mallet.cs.umass.edu</footnote><p>We present the results as dendrograms and heat maps. Dendrograms are tree structures that illustrate the results of hierarchical clustering. We perform hierarchical clustering on the inter-subdomain divergences for each set of features. The algorithm begins with each instance (in our case, each subdomain) as a singleton cluster, and repeatedly joins the two most similar clusters until all the data is clustered together. The order of these merges is recorded as a tree structure that can be visualized as a dendrogram in which the length of a branch represents the distance between its child nodes. 
Similarity between clusters is calculated using average distance between all members, known as "average linking".</p><p>Heat maps show the pairwise calculation of a metric in a grid of squares, where square <i>(x, y)</i> is shaded according to the value of <i>metric(sub_x, sub_y)</i>. For our measurements of JSD, black represents 0 (i.e. identical distributions) and white represents the metric's theoretical maximum of 1. We also inscribe the actual value inside each square. The dendrograms present all 39 subdomains, while for readability the heat maps present 12 subdomains selected for representativeness.</p></subsection></section><section number="5" title="Results"><p>Different thresholds for filtering low-frequency terms had little effect on the divergence measures, and served mainly to improve processing time. We therefore report results using a cutoff of 150 occurrences (over the entire 234 million word data set) and log-likelihood weights. The results of Pearson correlation and JSD show similar trends, and since JSD is specifically designed for comparing distributions we only report the latter.</p><page local="5" global="693"/><subsection number="5.1" title="Vocabulary and lexical features"><p>Differences in vocabulary are what first come to mind when describing subdomains. Word features are fundamental components for systems such as POS taggers and lexicalised parsers; one therefore expects that these systems will be affected by variation in lexical distributions. Figure 2a uses JSD calculated on each subdomain's distribution over 100 LDA-induced topics to compare vocabulary distributions. Subdomains related to molecular biology (Genetics, Molecular Biology) show the smallest divergences, an interesting fact since these are heavily used in building resources for BioNLP. 
The dendrogram shows a rough division into "public policy", "patient-centric", "applied" and "microscopic" subdomains, with the distance between unrelated subdomains such as Biochemistry and Pediatrics almost as large as their respective differences from Newswire.</p><p>We omit figures for variation over noun, verb and adjective lemmas due to space restrictions; in general, these correlate with the variation in LDA topics though there are some differences. Figure 2b shows JSD calculated on distributions over adverb lemmas. Part of the variation is due to characteristic markers of scientific argument ("therefore", "significantly", "statistically"). A more interesting factor is the coining of domain-specific adverbs, an example of the tendency in scientific text to use complex lexical items and premodifiers rather than additional clauses. This also has the effect of moving subdomain-specific objects and processes from verbs and nouns to adverbs. This behavior seems non-continuous, in that subdomains either make heavy, or almost no, use of it: for example, Pediatrics has no subdomain-specific items among its top ten adverbs by log-likelihood, while Neoplasms has "histologically", "immunohistochemically" and "subcutaneously". These information-dense terms could prove useful for tasks like automatic curation of subdomain vocabularies, where they imply relationships between their components, the items they modify, etc.</p></subsection><subsection number="5.2" title="Verb distributional behavior"><p>Modelling verb behavior is important for both syntactic (Collins, 2003) and semantic (Korhonen et al., 2008) processing, and subdomains are known to conscript verbs into specific roles that change the distributions of their syntactic properties (Roland and Jurafsky, 1998). The four properties over which we considered verbs' distributions (SCF, POS, GR and voice) produced similar inter-subdomain JSD values. 
Figure 2c demonstrates how verbs differ between subdomains with respect to SCFs. For example, while the Pediatrics subdomain uses the verb "govern" in a single SCF among its 12 possibilities, the Genetics subdomain distributes its usage over 7 of them. Two subdomains may both use "restrict" with high frequency (e.g. Molecular Biology and Ethics), but with different frequency distributions over SCFs.</p></subsection><subsection number="5.3" title="Syntax"><p>It is difficult to measure syntactic complexity accurately without access to a hand-annotated treebank, but it is well known that sentence length correlates strongly with processing difficulty (Collins, 1996). The first column of Table 2 gives average sentence lengths (excluding punctuation and "sentences" of fewer than three words) for selected domains. All standard errors are &lt; 0.1. It is clear that all biomedical subdomains typically use longer sentences than newswire, though there is also variation within biomedicine, from an average length of 27 words in Molecular Biology to 24.5 words in Pediatrics.</p><p>"Packaging" information in complex pre- and/or post-modified noun phrases is a characteristic feature of academic writing (Biber and Gray, 2010). This increases the information density of a sentence but brings with it syntactic and semantic ambiguities. For example, the difficulty of resolving the internal structure of noun-noun compounds and strings of prepositional phrases has been the focus of ongoing research in NLP; these phenomena have also been identified as significant challenges in biomedical language processing (Rosario and Hearst, 2001; Schuman and Bergler, 2006). The second and third columns of Table 2 present average lengths for full noun phrases, defined as every word dominated by a head noun in the grammatical relation graph for a sentence, and for base nominals, defined as nouns plus premodifying adjectives and nouns only. All standard errors are &lt; 0.01. 
Newswire text uses the simplest noun phrase structures; there is notable variation across PMC domains. Full NP and base nominal lengths do not always correlate; for example, Cell Biology uses relatively long base NPs (nominalisations and multitoken names in particular) but relatively simple full NP structures.</p><page local="6" global="694"/><p>(a) LDA-induced distribution over topics</p><p>(b) Adverb lemma frequencies</p><page local="7" global="695"/><figure caption="Figure 2: Subdomain variation plotted as heat maps and dendrograms"></figure></subsection><subsection number="5.4" title="Coreference"><p>Resolving coreferential terms is a crucial and challenging task when extracting information from texts in any domain. Nguyen and Kim (2008) compare the use of pronouns in the newswire and biomedical domains, using the GENIA corpus as representative of the latter. Among the differences observed between the domains were the absence of any personal pronouns other than third-person neuter pronouns in the GENIA corpus, and a greater proportion of demonstrative pronouns in GENIA than in the ACE or MUC newswire corpora. 
Corroborating the importance of domain modelling, Nguyen and Kim demonstrate that tailoring a pronoun resolution system to specific properties of the biomedical domain improves performance.</p><p>As our corpus is not annotated for coreference we restrict our attention to types that are reliably coreferential: masculine/feminine personal pronouns (<i>he, she </i>and case variations), neuter personal pronouns (<i>they, it </i>and variations) and definite NPs with demonstrative determiners such as <i>this </i>and <i>that</i>. To filter out pleonastic pronouns we used a combination of the C&amp;C parser's pleonasm tag and heuristics based on Lappin and Leass (1994). To filter out the most common class of non-anaphoric demonstrative NPs we simply discarded any matching the pattern <i>this ... paper|study|article</i>.</p><p>Table 3 presents statistics for selected types of coreferential noun phrases in a number of domains. The results generally agree with the findings of Nguyen and Kim (2008): biomedical text is on average 200 times less likely than news text to use gendered pronouns and twice as likely to use anaphoric definite noun phrases. At the domain level, however, there is clear variation within the biomedical corpus. In contrast to Nguyen and Kim's observations about GENIA, some domains do make non-negligible use of gendered pronouns, most notably Ethics (usually to refer to other scholars) and domains such as Psychiatry and Pediatrics where studies of actual patients are common. 
All biomedical domains use demonstrative NPs more frequently than newswire and only one (Ethics) matches newswire for frequent use of neuter 3rd-person pronouns.</p></subsection></section><section number="6" title="Conclusion"><p>In this paper we have explored the phenomenon of linguistic variation at a finer-grained level than previous NLP research, focusing on subdomains rather than traditional domains such as "newswire" and "biomedicine".<page local="8" global="696"/> We have identified patterns of variation across dimensions of vocabulary, syntax and discourse that are known to be of importance for NLP applications. While the magnitude of variation between subdomains is unsurprisingly less pronounced than between coarser domains, subdomain variation clearly does exist and should be taken into account when considering the generalisability of systems trained and evaluated on specific subdomains, for example molecular biology.</p><table caption="Table 2: Average sentence, NP and base nominal lengths across domains" class="main" frame="box" rules="all" border="1" regular="False"><tr class="row"><td class="cell"></td><td class="cell"><p>Subdomain</p></td><td class="cell"><p>Sentence length</p></td><td class="cell"><p>Subdomain</p></td><td class="cell"><p>Full NP length</p></td><td class="cell"><p>Subdomain</p></td><td class="cell"><p>Base nominal length</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Mol. 
Biology</p></td><td class="cell"><p>27.0</p></td><td class="cell"><p>Biochemistry</p></td><td class="cell"><p>4.03</p></td><td class="cell"><p>Biochemistry</p></td><td class="cell"><p>1.85</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Genetics</p></td><td class="cell"><p>26.6</p></td><td class="cell"><p>Genetics</p></td><td class="cell"><p>3.90</p></td><td class="cell"><p>Neoplasms</p></td><td class="cell"><p>1.85</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Cell Biology</p></td><td class="cell"><p>26.3</p></td><td class="cell"><p>Critical Care</p></td><td class="cell"><p>3.86</p></td><td class="cell"><p>Mol. Biology</p></td><td class="cell"><p>1.84</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Ethics</p></td><td class="cell"><p>26.2</p></td><td class="cell"><p>Neoplasms</p></td><td class="cell"><p>3.85</p></td><td class="cell"><p>Genetics</p></td><td class="cell"><p>1.83</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>PMC Average</p></td><td class="cell"><p>25.9</p></td><td class="cell"><p>PMC Average</p></td><td class="cell"><p>3.85</p></td><td class="cell"><p>PMC Average</p></td><td class="cell"><p>1.80</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Biochemistry</p></td><td class="cell"><p>25.8</p></td><td class="cell"><p>Pediatrics</p></td><td class="cell"><p>3.84</p></td><td class="cell"><p>Cell Biology</p></td><td class="cell"><p>1.80</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Neoplasms</p></td><td class="cell"><p>25.5</p></td><td class="cell"><p>Med. 
Informatics</p></td><td class="cell"><p>3.84</p></td><td class="cell"><p>Critical Care</p></td><td class="cell"><p>1.80</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Psychiatry</p></td><td class="cell"><p>25.3</p></td><td class="cell"><p>Comm. Diseases</p></td><td class="cell"><p>3.81</p></td><td class="cell"><p>Med. Informatics</p></td><td class="cell"><p>1.78</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Critical Care</p></td><td class="cell"><p>25.0</p></td><td class="cell"><p>Therapeutics</p></td><td class="cell"><p>3.80</p></td><td class="cell"><p>Comm. Diseases</p></td><td class="cell"><p>1.78</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Therapeutics</p></td><td class="cell"><p>24.9</p></td><td class="cell"><p>Mol. Biology</p></td><td class="cell"><p>3.79</p></td><td class="cell"><p>Therapeutics</p></td><td class="cell"><p>1.75</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Comm. Diseases</p></td><td class="cell"><p>24.9</p></td><td class="cell"><p>Psychiatry</p></td><td class="cell"><p>3.77</p></td><td class="cell"><p>Psychiatry</p></td><td class="cell"><p>1.75</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Med. 
Informatics</p></td><td class="cell"><p>24.6</p></td><td class="cell"><p>Ethics</p></td><td class="cell"><p>3.69</p></td><td class="cell"><p>Pediatrics</p></td><td class="cell"><p>1.73</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Pediatrics</p></td><td class="cell"><p>24.6</p></td><td class="cell"><p>Cell Biology</p></td><td class="cell"><p>3.55</p></td><td class="cell"><p>Ethics</p></td><td class="cell"><p>1.65</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Newswire</p></td><td class="cell"><p>19.1</p></td><td class="cell"><p>Newswire</p></td><td class="cell"><p>3.18</p></td><td class="cell"><p>Newswire</p></td><td class="cell"><p>1.60</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr></table><p>Future work includes directly evaluating the effect of subdomain variation on practical tasks, investigating further dimensions of variation such as nominalisation usage, and learning alternative subdomain taxonomies directly from the corpus text. 
Ultimately, we expect that a more nuanced understanding of subdomain effects will have tangible benefits for many applications of scientific language processing.</p></section><section title="Acknowledgements"><p>This work was supported by EPSRC grant EP/G051070/1, the Royal Society (AK) and a Dorothy Hodgkin Postgraduate Award (LS).</p><table caption="Table 3: Frequency of coreferential types (proportion of all NPs) across domains" class="main" frame="box" rules="all" border="1" regular="False"><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Pronouns (neuter, 3rd)</p></td><td class="cell"><p>Pronouns (non-neuter, 3rd)</p></td><td class="cell"><p>Demonstrative NPs</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Ethics</p></td><td class="cell"><p>0.0658</p></td><td class="cell"><p>Newswire</p></td><td class="cell"><p>0.0591</p></td><td class="cell"><p>Genetics</p></td><td class="cell"><p>0.0275</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Newswire</p></td><td class="cell"><p>0.0607</p></td><td class="cell"><p>Ethics</p></td><td class="cell"><p>0.0037</p></td><td class="cell"><p>Med. Informatics</p></td><td class="cell"><p>0.0263</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Therapeutics</p></td><td class="cell"><p>0.0354</p></td><td class="cell"><p>Pediatrics</p></td><td class="cell"><p>0.0015</p></td><td class="cell"><p>Biochemistry</p></td><td class="cell"><p>0.0263</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Med. 
Informatics</p></td><td class="cell"><p>0.0346</p></td><td class="cell"><p>Psychiatry</p></td><td class="cell"><p>0.0009</p></td><td class="cell"><p>Ethics</p></td><td class="cell"><p>0.0260</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Psychiatry</p></td><td class="cell"><p>0.0342</p></td><td class="cell"><p>Comm. Diseases</p></td><td class="cell"><p>0.0009</p></td><td class="cell"><p>Mol. Biology</p></td><td class="cell"><p>0.0251</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Pediatrics</p></td><td class="cell"><p>0.0308</p></td><td class="cell"><p>Therapeutics</p></td><td class="cell"><p>0.0005</p></td><td class="cell"><p>PMC Average</p></td><td class="cell"><p>0.0226</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>PMC Average</p></td><td class="cell"><p>0.0284</p></td><td class="cell"><p>PMC Average</p></td><td class="cell"><p>0.0005</p></td><td class="cell"><p>Cell Biology</p></td><td class="cell"><p>0.0210</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Genetics</p></td><td class="cell"><p>0.0275</p></td><td class="cell"><p>Critical Care</p></td><td class="cell"><p>0.0004</p></td><td class="cell"><p>Comm. Diseases</p></td><td class="cell"><p>0.0207</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Critical Care</p></td><td class="cell"><p>0.0272</p></td><td class="cell"><p>Neoplasms</p></td><td class="cell"><p>0.0002</p></td><td class="cell"><p>Neoplasms</p></td><td class="cell"><p>0.0205</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Mol. Biology</p></td><td class="cell"><p>0.0258</p></td><td class="cell"><p>Med. 
Informatics</p></td><td class="cell"><p>0.0002</p></td><td class="cell"><p>Psychiatry</p></td><td class="cell"><p>0.0201</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Biochemistry</p></td><td class="cell"><p>0.0251</p></td><td class="cell"><p>Genetics</p></td><td class="cell"><p>0.0001</p></td><td class="cell"><p>Critical Care</p></td><td class="cell"><p>0.0201</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Neoplasms</p></td><td class="cell"><p>0.0227</p></td><td class="cell"><p>Mol. Biology</p></td><td class="cell"><p>2.5 × 10^-5</p></td><td class="cell"><p>Therapeutics</p></td><td class="cell"><p>0.0192</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Cell Biology</p></td><td class="cell"><p>0.0217</p></td><td class="cell"><p>Biochemistry</p></td><td class="cell"><p>2.0 × 10^-5</p></td><td class="cell"><p>Pediatrics</p></td><td class="cell"><p>0.0191</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Comm. Diseases</p></td><td class="cell"><p>0.0213</p></td><td class="cell"><p>Cell Biology</p></td><td class="cell"><p>1.5 × 10^-5</p></td><td class="cell"><p>Newswire</p></td><td class="cell"><p>0.0118</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr></table></section><references><p>Biber, Douglas and Bethany Gray. 2010. Challenging stereotypes about academic writing: Complexity, elaboration, explicitness. <i>Journal of English for Academic Purposes, </i>9(1):2-20.</p><p>Biber, Douglas. 1988. <i>Variation Across Speech and Writing. </i>Cambridge University Press, Cambridge.</p><p>Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. 
<i>Journal of Machine Learning Research, </i>3:993-1022.</p><p>Cohen, K. Bretonnel, Martha Palmer, and Lawrence Hunter. 2008. Nominalization and alternations in biomedical language. <i>PLoS ONE, </i>3(9):e3158.</p><p>Collins, Michael John. 1996. A new statistical parser based on bigram lexical dependencies. In <i>Proceedings of ACL-96, </i>Santa Cruz, CA.</p><p>Collins, Michael. 2003. Head-driven statistical models for natural language parsing. <i>Computational Linguistics, </i>29(4):589-637.</p><p>Curran, James, Stephen Clark, and Johan Bos. 2007. Linguistically motivated large-scale NLP with C&amp;C and Boxer. In <i>Proceedings of the ACL-07 Demo and Poster Sessions, </i>Prague, Czech Republic.</p><p>Daume III, Hal and Daniel Marcu. 2006. Domain adaptation for statistical classifiers. <i>Journal of Artificial Intelligence Research, </i>26:101-126.</p><p>Friedman, Carol, Pauline Kraa, and Andrey Rzhetsky. 2002. Two biomedical sublanguages: a description based on the theories of Zellig Harris. <i>Journal of Biomedical Informatics, </i>35(4):222-235.</p><p>Graff, David, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2005. <i>English Gigaword Corpus, 2nd Edition. </i>Linguistic Data Consortium.</p><page local="9" global="697"/><p>Hara, Tadayoshi, Yusuke Miyao, and Jun'ichi Tsujii. 2005. Adapting a probabilistic disambiguation model of an HPSG parser to a new domain. In <i>Proceedings of IJCNLP-05, </i>Jeju Island, South Korea.</p><p>Jin, Yang, Ryan T. McDonald, Kevin Lerman, Mark A. Mandel, Steven Carroll, Mark Y. Liberman, Fernando C. Pereira, Raymond S. Winters, and Peter S. White. 2006. Automated recognition of malignancy mentions in biomedical literature. <i>BMC Bioinformatics, </i>7:492.</p><p>Kim, J.-D., T. Ohta, Y. Tateisi, and J. Tsujii. 2003. GENIA corpus - a semantically annotated corpus for bio-textmining. <i>Bioinformatics, </i>19(Suppl. 1):i180-i182.</p><p>Korhonen, Anna, Yuval Krymolowski, and Nigel Collier. 2008. 
The choice of features for classification of verbs in biomedical texts. In <i>Proceedings of COLING-08, </i>Manchester, UK.</p><p>Kulick, Seth, Ann Bies, Mark Liberman, Mark Mandel, Ryan McDonald, Martha Palmer, Andrew Schein, Lyle Ungar, Scott Winters, and Pete White. 2004. Integrated annotation for biomedical information extraction. In <i>Proceedings of the HLT-NAACL-04 Workshop on Linking Biological Literature, Ontologies and Databases, </i>Boston, MA.</p><p>Lappin, Shalom and Herbert J. Leass. 1994. An algorithm for pronominal anaphora resolution. <i>Computational Linguistics, </i>20(4):535-561.</p><p>Nguyen, Ngan L.T. and Jin-Dong Kim. 2008. Exploring domain differences for the design of a pronoun resolution system for biomedical text. In <i>Proceedings of COLING-08, </i>Manchester, UK.</p><p>NIH. 2009a. Journal publishing tag set. http://dtd.nlm.nih.gov/publishing/.</p><p>NIH. 2009b. National Library of Medicine: Journal subject terms. http://wwwcf.nlm.nih.gov/serials/journals/index.cfm.</p><p>Preiss, Judita, E.J. Briscoe, and Anna Korhonen. 2007. A system for large-scale acquisition of verbal, nominal and adjectival subcategorization frames from corpora. In <i>Proceedings of ACL-07, </i>Prague, Czech Republic.</p><p>Rayson, Paul and Roger Garside. 2000. Comparing corpora using frequency profiling. In <i>Proceedings of the ACL-00 Workshop on Comparing Corpora, </i>Hong Kong.</p><p>Rimell, Laura and Stephen Clark. 2009. Porting a lexicalized-grammar parser to the biomedical domain. <i>Journal of Biomedical Informatics, </i>42(5):852-865.</p><p>Roland, Douglas and Daniel Jurafsky. 1998. How verb subcategorization frequencies are affected by corpus choice. In <i>Proceedings of COLING-ACL-98, </i>Montreal, Canada.</p><p>Rosario, Barbara and Marti Hearst. 2001. 
Classifying the semantic relations in noun compounds via a domain-specific lexical hierarchy. In <i>Proceedings of EMNLP-01, </i>Pittsburgh, PA.</p><p>Schuman, Jonathan and Sabine Bergler. 2006. Postnominal prepositional phrase attachment in proteomics. In <i>Proceedings of the HLT-NAACL-06 BioNLP Workshop on Linking Natural Language and Biology, </i>New York, NY.</p><p>Verspoor, Karin, K. Bretonnel Cohen, and Lawrence Hunter. 2009. The textual characteristics of traditional and Open Access scientific journals are similar. <i>BMC Bioinformatics, </i>10:183.</p></references></body></article>