WordNet
'''WordNet''' is a [[semantic lexicon]] for the [[English language]]. It groups English words into sets of synonyms called ''[[synsets]]'', provides short, general definitions, and records the various [[semantic]] relations between these [[synonym]] sets. The purpose is twofold: to produce a combination of [[dictionary]] and [[thesaurus]] that is more intuitively usable, and to support automatic text analysis and [[artificial intelligence]] applications. The database and software tools have been released under a [[BSD License|BSD style license]] and can be downloaded and used freely. The [[database]] can also be browsed [[online]].
WordNet was created and is being maintained at the Cognitive Science Laboratory of [[Princeton University]] under the direction of [[psychology]] [[professor]] [[George Armitage Miller|George A. Miller]]. Development began in [[1985]]. Over the years, the project received about $3 million of funding, mainly from government agencies interested in [[machine translation]]. In recent years, Dr. [[Christiane Fellbaum]] has overseen the development of WordNet.
== Database contents ==
[[As of 2006]], the database contains about 150,000 words organized in over 115,000 [[synsets]] for a total of 207,000 word-sense pairs; in [[data compression|compressed]] form, it is about 12 [[megabyte]]s in size.[http://wordnet.princeton.edu/man/wnstats.7WN WNSTATS(7WN) manual page]
WordNet distinguishes between [[noun]]s, [[verb]]s, [[adjective]]s and [[adverb]]s because they follow different grammatical rules. Every synset contains a group of synonymous words or [[collocation]]s (a ''collocation'' is a sequence of words that go together to form a specific meaning, such as "[[carpool|car pool]]"); different senses of a word are in different synsets. The meaning of the synsets is further clarified with short defining ''glosses'' (definitions and/or example sentences). A typical example synset with gloss is:
: good, right, ripe -- (most suitable or right for a particular purpose; "a good time to plant tomatoes"; "the right time to act"; "the time is ripe for great sociological changes")
Most synsets are connected to other synsets via a number of semantic relations. These relations vary based on the type of word, and include:
* [[Noun]]s
**''[[hypernym]]s'': ''Y'' is a hypernym of ''X'' if every ''X'' is a (kind of) ''Y'' (''canine'' is a hypernym of ''dog'')
**''[[hyponym]]s'': ''Y'' is a hyponym of ''X'' if every ''Y'' is a (kind of) ''X'' (''dog'' is a hyponym of ''canine'')
**''coordinate terms'': ''Y'' is a coordinate term of ''X'' if ''X'' and ''Y'' share a hypernym (''wolf'' is a coordinate term of ''dog'', and ''dog'' is a coordinate term of ''wolf'')
**''[[holonymy|holonym]]'': ''Y'' is a holonym of ''X'' if ''X'' is a part of ''Y'' (''building'' is a holonym of ''window'')
**''[[meronymy|meronym]]'': ''Y'' is a meronym of ''X'' if ''Y'' is a part of ''X'' (''window'' is a meronym of ''building'')
* [[Verb]]s
**''hypernym'': the verb ''Y'' is a hypernym of the verb ''X'' if the activity ''X'' is a (kind of) ''Y'' (''to perceive'' is a hypernym of ''to listen'')
**''troponym'': the verb ''Y'' is a troponym of the verb ''X'' if the activity ''Y'' is doing ''X'' in some manner (''to lisp'' is a troponym of ''to talk'')
**''entailment'': the verb ''Y'' is entailed by ''X'' if by doing ''X'' you must be doing ''Y'' (''to sleep'' is entailed by ''to snore'')
**''coordinate terms'': those verbs sharing a common hypernym (''to lisp'' and ''to yell'')
* [[Adjective]]s
**''related nouns''
**''similar to''
**''participle of verb''
* [[Adverb]]s
**''root adjectives''
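
The noun relations listed above are largely inverses or compositions of one another, which the following minimal Python sketch illustrates. It is not WordNet's own storage format or API: the dictionaries <code>HYPERNYMS</code> and <code>PART_OF</code> and all function names are hypothetical, each synset is reduced to a single word, and the data covers only the examples given in the list.

```python
# Hypothetical toy data: each synset is named by one of its words.
# Only "is a kind of" and "is a part of" links are stored; the
# remaining relations are derived from them.
HYPERNYMS = {"dog": ["canine"], "wolf": ["canine"]}
PART_OF = {"window": ["building"]}


def inverse(relation, target):
    """Return every synset that points at `target` in `relation`."""
    return sorted(s for s, targets in relation.items() if target in targets)


def hyponyms(synset):
    """Hyponyms are the inverse of hypernyms."""
    return inverse(HYPERNYMS, synset)


def holonyms(synset):
    """Wholes that `synset` is a part of."""
    return sorted(PART_OF.get(synset, []))


def meronyms(synset):
    """Parts of `synset`; the inverse of holonyms."""
    return inverse(PART_OF, synset)


def coordinate_terms(synset):
    """Synsets that share at least one direct hypernym with `synset`."""
    result = set()
    for parent in HYPERNYMS.get(synset, []):
        result.update(hyponyms(parent))
    result.discard(synset)
    return sorted(result)


print(hyponyms("canine"))        # ['dog', 'wolf']
print(coordinate_terms("dog"))   # ['wolf']
print(meronyms("building"))      # ['window']
```

Storing each link in one direction and deriving its inverse keeps the toy data consistent by construction; whether a real implementation does the same is a design choice not covered here.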
Semantic relations apply to all members of a synset, because the members share a meaning and are mutually [[synonym]]ous. Individual words can also be connected to other words through lexical relations, including [[antonym]]s (opposites of each other) and derivationally related forms.
WordNet also provides the ''polysemy count'' of a word: the number of synsets that contain the word. If a word participates in several synsets (i.e. has several senses), then typically some senses are much more common than others. WordNet quantifies this with a ''frequency score'': several sample texts have had all their words semantically tagged with the corresponding synset, and a count is then provided indicating how often a word appears in a specific sense.
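
Both measures can be illustrated with a short Python sketch. The synset identifiers, their members and the tagged sample below are invented for illustration and are not taken from the WordNet database; the function names are likewise hypothetical.

```python
from collections import Counter

# Hypothetical synsets: identifier -> member words.
SYNSETS = {
    "java.island": {"java"},
    "java.coffee": {"java", "coffee"},
    "java.language": {"java"},
}

# Hypothetical sense-tagged sample text: (word, synset identifier) pairs.
TAGGED_SAMPLE = [
    ("java", "java.coffee"),
    ("coffee", "java.coffee"),
    ("java", "java.coffee"),
    ("java", "java.island"),
]


def polysemy_count(word):
    """Number of synsets that contain `word`."""
    return sum(1 for members in SYNSETS.values() if word in members)


def sense_frequencies(word, tagged):
    """Count how often `word` was tagged with each of its senses."""
    return Counter(sense for w, sense in tagged if w == word)


print(polysemy_count("java"))                      # 3
print(sense_frequencies("java", TAGGED_SAMPLE))
```

In this toy sample the word has three senses but only two of them are attested, which mirrors the observation that some senses are much more common than others.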
The morphology functions of the software distributed with the database try to deduce the [[lemma (linguistics)|lemma]] or [[root (linguistics)|root]] form of a [[word]] from the user's input; only the root form is stored in the database unless it has irregular inflected forms.
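
One way such a lookup could work is sketched below in Python: consult a list of irregular forms first, then try stripping common endings and accept a candidate only if it is a stored root form. The rule list, the exception list and the small vocabulary are illustrative assumptions, not WordNet's actual rule set, and <code>lemma_of</code> is a hypothetical name.

```python
# Hypothetical root forms stored in the database.
LEMMAS = {"dog", "box", "church", "fly", "mouse"}

# Hypothetical list of irregular inflected forms.
EXCEPTIONS = {"mice": "mouse"}

# Illustrative (suffix, replacement) rules for nouns, tried in order.
NOUN_SUFFIX_RULES = [("ches", "ch"), ("xes", "x"), ("ies", "y"), ("s", "")]


def lemma_of(word):
    """Return the stored root form for `word`, or None if none is found."""
    word = word.lower()
    if word in LEMMAS:
        return word
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    for suffix, replacement in NOUN_SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            # Accept a candidate only if it is a known root form.
            if candidate in LEMMAS:
                return candidate
    return None


for form in ["dogs", "boxes", "churches", "flies", "mice", "cats"]:
    print(form, "->", lemma_of(form))
```

Checking every candidate against the stored vocabulary prevents the rules from inventing roots; in this sketch ''cats'' yields no result simply because ''cat'' is absent from the toy vocabulary.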
== Knowledge structure ==
Both nouns and verbs are organized into hierarchies, defined by hypernym or ''IS A'' relationships. For instance, the first sense of the word ''dog'' has the following hypernym hierarchy, in which the words at the same level are synonyms of each other: some sense of ''dog'' is synonymous with some senses of ''domestic dog'' and ''Canis familiaris'', and so on. Each set of synonyms (''synset'') has a unique index and shares its properties, such as a gloss (or dictionary definition).
: dog, domestic dog, Canis familiaris
:: => canine, canid
::: => carnivore
:::: => placental, placental mammal, eutherian, eutherian mammal
::::: => mammal
:::::: => vertebrate, craniate
::::::: => chordate
:::::::: => animal, animate being, beast, brute, creature, fauna
::::::::: => ...
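
The hierarchy above can be followed mechanically by repeatedly moving from a synset to its hypernym. The Python sketch below assumes, for simplicity, that each synset has exactly one hypernym and names each synset by its first word; the dictionary and function names are hypothetical, and the chain stops at ''animal'' only because the listing above is truncated there.

```python
# Toy copy of the listing above: synset -> its single hypernym.
HYPERNYM = {
    "dog": "canine",
    "canine": "carnivore",
    "carnivore": "placental",
    "placental": "mammal",
    "mammal": "vertebrate",
    "vertebrate": "chordate",
    "chordate": "animal",
}


def hypernym_chain(synset):
    """Return `synset` followed by all of its ancestors, nearest first."""
    chain = [synset]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


def is_a(synset, ancestor):
    """True if `ancestor` appears strictly above `synset` in the hierarchy."""
    return ancestor in hypernym_chain(synset)[1:]


print(" => ".join(hypernym_chain("dog")))
print(is_a("dog", "mammal"))   # True
print(is_a("mammal", "dog"))   # False
```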
At the top level, these hierarchies are organized into [[base types]], 25 primitive groups for nouns, and 15 for verbs. These groups form ''lexicographic files'' at a maintenance level. These primitive groups are connected to an abstract root node that has, for some time, been assumed by various applications that use WordNet.
In the case of adjectives, the organization is different. Two opposite 'head' senses work as binary poles, while 'satellite' synonyms connect to each of the heads via synonymy relations. Thus, the hierarchies, and the concept involved with lexicographic files, do not apply here the same way they do for nouns and verbs.
The network of nouns is far deeper than that of the other parts of speech. Verbs have a far ''bushier'' structure, and adjectives are organized into many distinct clusters. Adverbs are defined in terms of the adjectives they are derived from, and thus inherit their structure from that of the adjectives.
== Psychological justification ==
The goal of WordNet was to develop a system that would be consistent with the knowledge acquired over the years about how human beings process language. [[Anomic aphasia]], for example, creates a condition that seems to selectively encumber individuals' ability to name objects; this makes the decision to partition the parts of speech into distinct hierarchies more of a principled decision than an arbitrary one.
In the case of [[hyponym]]y, psychological experiments revealed that individuals can access properties of nouns more quickly depending on when a characteristic becomes a defining property. That is, individuals can quickly verify that ''canaries can sing'' because a canary is a songbird (only one level of hyponymy), but require slightly more time to verify that ''canaries can fly'' (two levels of hyponymy) and even more time to verify ''canaries have skin'' (multiple levels of hyponymy). This suggests that we too store semantic information in a way that is much like WordNet, because we only retain the most specific information needed to differentiate one particular concept from similar concepts.[Collins A., Quillian M. R. 1972. "Experiments on Semantic Memory and Language Comprehension" in ''Cognition in Learning and Memory''. Wiley, New York.]
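
The storage scheme described in this section, in which each property is attached only to the most general concept it applies to, can be mimicked in a few lines of Python. The sketch is an illustration of the idea rather than a model from the cited study: the names are hypothetical, and placing ''has skin'' exactly three levels up is an assumption, since the text says only "multiple levels".

```python
# Hypothetical hierarchy: concept -> more general concept.
IS_A = {"canary": "songbird", "songbird": "bird", "bird": "animal"}

# Each property is stored once, at the level where it becomes defining.
PROPERTIES = {
    "songbird": {"can sing"},
    "bird": {"can fly"},
    "animal": {"has skin"},
}


def levels_to_verify(concept, prop):
    """Count the levels climbed before `prop` is found, or return None."""
    levels = 0
    node = concept
    while node is not None:
        if prop in PROPERTIES.get(node, set()):
            return levels
        node = IS_A.get(node)
        levels += 1
    return None


for prop in ["can sing", "can fly", "has skin"]:
    print(prop, levels_to_verify("canary", prop))
```

The number of levels climbed grows in the same order as the verification times reported for the three statements.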
== WordNet as an ontology ==
The hypernym/hyponym relationships among the noun synsets can be interpreted as specialization relations between conceptual categories. In other words, WordNet can be interpreted and used as a lexical [[ontology (computer science)|ontology]] in the [[computer science]] sense. However, such an ontology should normally be corrected before being used since it contains hundreds of basic semantic inconsistencies such as (i) the existence of common specializations for exclusive categories and (ii) redundancies in the specialization hierarchy. Furthermore, transforming WordNet into a lexical ontology usable for knowledge representation should normally also involve (i) distinguishing the specialization relations into subtypeOf and instanceOf relations, and (ii) associating intuitive unique identifiers to each category. Although such corrections and transformations have been performed and documented as part of the [http://www.webkb.org/doc/wn/ integration of WordNet 1.7 into the cooperatively updatable knowledge base of WebKB-2], most projects claiming to re-use WordNet for knowledge-based applications (typically, knowledge-oriented information retrieval) simply re-use it directly.
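
As an illustration of the second kind of inconsistency, the Python sketch below flags a direct specialization link as redundant when the same supertype is already reachable through another direct supertype. This is one possible reading of "redundancy", not the procedure used in the WebKB-2 integration, and the data and names are hypothetical.

```python
# Hypothetical data: category -> direct supertypes.
SUPERTYPES = {
    "dog": ["canine", "animal"],   # "animal" is also reachable via "canine"
    "canine": ["carnivore"],
    "carnivore": ["animal"],
    "animal": [],
}


def ancestors(category):
    """Return every supertype reachable from `category`."""
    seen = set()
    stack = list(SUPERTYPES.get(category, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(SUPERTYPES.get(current, []))
    return seen


def redundant_links():
    """List (category, supertype) links implied by another direct link."""
    found = []
    for category, parents in SUPERTYPES.items():
        for parent in parents:
            others = [p for p in parents if p != parent]
            if any(parent in ancestors(other) for other in others):
                found.append((category, parent))
    return found


print(redundant_links())   # [('dog', 'animal')]
```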
== Limitations ==
Unlike other dictionaries, WordNet does not include information about [[etymology]], pronunciation and the forms of [[irregular verb]]s and contains only limited information about usage.
The actual lexicographical and semantic information is maintained in ''lexicographer files'', which are then processed by a tool called ''grind'' to produce the distributed database. Both grind and the lexicographer files are freely available in a separate distribution, but modifying and maintaining the database requires expertise.
Though WordNet contains a sufficiently wide range of common words, it does not cover special domain vocabulary. Since it is primarily designed to act as an underlying database for different applications, those applications cannot be used in specific domains that are not covered by WordNet.
== Applications in Information Systems ==
WordNet has been used for a number of different purposes in information systems, including word sense disambiguation, information retrieval, automatic text classification, automatic text summarization, and even automatic crossword puzzle generation.
A project at [[Brown University]] called [[Applied Cognition Lab]], started by [[Jeff Stibel]], [[James A. Anderson]], [[Steve Reiss]] and others, created a disambiguator using WordNet in 1998.[{{cite web|url=http://www.forbes.com/1999/10/04/feat.html|title= Forbes article|publisher=Forbes}}] The project later morphed into a company called [[Simpli]], which is now owned by [[ValueClick]]. George Miller joined the company as a member of the Advisory Board. Simpli built an Internet search engine that utilized a knowledge base principally based on WordNet to disambiguate and expand keywords and synsets to help retrieve information online. WordNet was expanded upon to add increased dimensionality, such as intentionality (used for x), people ([[Albert Einstein]]) and colloquial terminology more relevant to Internet search (e.g., blogging, ecommerce). [[Neural network]] algorithms searched the expanded WordNet for related terms to disambiguate search keywords (Java, in the sense of coffee) and expand the search synset (Coffee, Drink, Joe) to improve search engine results.[{{cite web|url=http://www.infotoday.com/newsbreaks/nb1220-2.htm|title= Info Today Publication|publisher=Info Today}}] Before the company was acquired, it performed searches across search engines such as [[Google]], [[Yahoo!]], [[Ask.com]] and others.[{{cite web|url=http://www.findarticles.com/p/articles/mi_m0HDN/is_2000_May_25/ai_62332783|title= providence Journal Article|publisher=Providence Journal}}]
Another prominent example of the use of WordNet is to determine the [[semantic similarity|similarity]] between words. Various algorithms have been proposed, and these include considering the distance between the conceptual categories of words, as well as considering the hierarchical structure of the WordNet ontology. A number of these WordNet-based word similarity algorithms are implemented in a [[Perl]] package called [http://www.d.umn.edu/~tpederse/similarity.html WordNet::Similarity].
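
A simple distance-based measure of the kind mentioned above can be sketched in Python as follows. The sketch counts the links on the path between two synsets through their nearest common hypernym and maps that distance to a score between 0 and 1. It is not claimed to match any particular measure in WordNet::Similarity; the hierarchy is hypothetical toy data with one hypernym per synset, and the function names are invented.

```python
# Hypothetical toy hierarchy: synset -> its single hypernym.
HYPERNYM = {
    "dog": "canine",
    "wolf": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
}


def chain(synset):
    """Return `synset` followed by its ancestors, nearest first."""
    result = [synset]
    while result[-1] in HYPERNYM:
        result.append(HYPERNYM[result[-1]])
    return result


def path_distance(a, b):
    """Links between `a` and `b` via their nearest common ancestor."""
    depth_in_b = {s: i for i, s in enumerate(chain(b))}
    for i, s in enumerate(chain(a)):
        if s in depth_in_b:
            return i + depth_in_b[s]
    return None  # no common ancestor in the toy data


def path_similarity(a, b):
    """Map distance d to 1 / (1 + d); identical synsets score 1.0."""
    d = path_distance(a, b)
    return None if d is None else 1 / (1 + d)


print(path_distance("dog", "wolf"))    # 2
print(path_distance("dog", "cat"))     # 4
print(path_similarity("dog", "cat"))   # 0.2
```

Measures that also take the hierarchical structure into account, for example the depth of the common ancestor, would need additional information that this sketch leaves out.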
== Interfaces ==
Princeton maintains a list of [http://wordnet.princeton.edu/links related projects] that includes links to some of the widely used [[application programming interface]]s available for accessing WordNet using various programming languages and environments.
Other interfaces include the following:
* [http://www.defineitfast.com/find/ WordNet on Ajax::DefineItFast.com] allows users to browse WordNet 3.0 using an [[Ajax (programming)|Ajax]] interface.
* The [http://mfwallace.googlepages.com/jawbone.html Jawbone] project provides a [[Java (programming language)|Java]] API to the WordNet 2.1 and 3.0 data. The source code is released under the [[MIT license]].
* The [http://nltk.sourceforge.net/ Natural Language Toolkit] provides a [[Python (programming language)|Python]] API to WordNet 3.0.
* [http://search.cpan.org/perldoc?Lingua::Wordnet Lingua::Wordnet] provides a [[Perl]] interface to WordNet.
* [http://search.cpan.org/perldoc?WordNet::Similarity WordNet::Similarity] Perl module for computing measures of semantic relatedness.
* [http://www.cozyenglish.com/dictionary Dictionary::CozyEnglish] implemented a WordNet 3.0 interface that integrates with [[WordPress]]. Blog and website owners can embed this API via a set of HTML code.
* The [http://www.visualthesaurus.com/ Visual Thesaurus] is a subscription-based commercial application that presents WordNet data through an innovative and user-friendly interface.
* [[WordWeb]] is an extended dictionary based on WordNet, also available commercially as [http://wordweb.info/developer/SQL.html SQL tables] for use in other applications. Includes many additional terms, derived forms and pronunciations.
* [http://www.ug.it.usyd.edu.au/~smer3502/assignment3/form.html Visual representation of WordNet] - interface which attempts to visualise the relations.
== Related projects ==
The [[EuroWordNet]] project has produced WordNets for several European languages and linked them together; these are not freely available, however. The [[Global Wordnet]] project attempts to coordinate the production and linking of "wordnets" for all languages. [[Oxford University Press]], the publisher of the [[Oxford English Dictionary]], has voiced plans to produce its own online competitor to WordNet.
The [[eXtended WordNet]] is a project at the [[University of Texas at Dallas]] which aims to improve WordNet by semantically parsing the glosses, thus making the information contained in these definitions available for automatic knowledge processing systems. It is also freely available under a license similar to WordNet's.
The [[GCIDE]] project produces a dictionary by combining a [[public domain]] ''[[Webster's Dictionary]]'' from [[1913]] with some WordNet definitions and material provided by volunteers.
It is released under the [[copyleft]] license [[GNU General Public License|GPL]].
WordNet is also commonly re-used via mappings between the WordNet categories and the categories from other ontologies. Most often, only the top-level categories of WordNet are mapped. However, the authors of the [[Suggested Upper Merged Ontology|SUMO]] ontology have produced a mapping between all of the WordNet synsets (including nouns, verbs, adjectives and adverbs) and [[SUMO class]]es. The most recent addition of the mappings provides links to all of the more specific terms in the Mid-Level Ontology (MILO), which extends SUMO. [[OpenCyc]] has 12,000 terms linked to WordNet synonym sets.
In most works that claim to have integrated WordNet into other ontologies, the content of WordNet has not simply been corrected when semantic problems have been encountered; instead, WordNet has been used as an inspiration source but heavily re-interpreted and updated whenever suitable. This was the case when, for example, the [http://citeseer.ist.psu.edu/oltramari02restructuring.html top-level ontology of WordNet was re-structured] according to the [[OntoClean]] based approach or when WordNet was used as a primary source for constructing the lower classes of the SENSUS ontology.
[[FrameNet]] is a project similar to WordNet. It consists of a lexicon which is based on annotating over 100,000 sentences with their semantic properties. The unit in focus is the ''lexical frame'', a type of state or event together with the properties associated with it.
An independent project titled [http://wordnet.cemetech.net wordNet], with an initial lowercase w, is an ongoing project to link words and phrases via a custom [[Web crawler]].
[[Lexical markup framework]] (LMF) is a work in progress within [[ISO/TC37]] in order to define a common standardized framework for the construction of lexicons, including WordNet.
The [http://www.ceid.upatras.gr/Balkanet/ BalkaNet] project has produced WordNets for six European languages (Bulgarian, Czech, Greek, Romanian, Turkish and Serbian). For this project, a freely available XML-based WordNet editor was developed. This editor, [http://nlp.fi.muni.cz/projekty/visdic/ VisDic], is no longer in active development but is still used for the creation of various WordNets. Its successor, [http://deb.fi.muni.cz DEBVisDic], is a client-server application and is currently used for the editing of several WordNets (Dutch in the Cornetto project, Polish, Hungarian, several African languages, Chinese).
== References ==
== See also==
* [[Semantic Web]]
* [[taxonomy]]
* [[Synonym Ring]]
* [[ThoughtTreasure]]
* [[Is-a]]
* [[Hyponym]]
== External links==
* Free download of a single [http://www.tomeraider.com/ebooks/reference/dictionaries_and_thesauri/english/wordnet_dictionary_ebook--BK634.php Wordnet Dictionary] file for [[Palm OS]], [[Pocket PC]] and [[Microsoft Windows|Windows]] in [[Tomeraider]] format.
* [http://wordnet.princeton.edu/ The WordNet Home Page]
* [http://wordnet.princeton.edu/links Wordnet Related Projects] – comprehensive list of interfaces and extensions
* [http://www.globalwordnet.org/ Global Wordnet] + [http://deb.fi.muni.cz/clients-debgrid.php DEBGrid] (Global Wordnet web interface)
[[Category:Dictionaries]]
[[Category:Knowledge representation]]
[[Category:Computational linguistics]]
[[de:WordNet]]
[[es:WordNet]]
[[fr:WordNet]]
[[gl:WordNet]]
[[ko:워드넷]]
[[it:Wordnet]]
[[nl:WordNet]]
[[ja:WordNet]]
[[uk:WordNet]]
[[zh:WordNet]]