WeSearch Data Collection (WDC)
==============================

The WeSearch Data Collection is a freely redistributable, partly annotated,
comprehensive sample of user-generated content.  The WDC contains data
extracted from a range of genres of varying formality (user forums, product
review sites, blogs, and Wikipedia) and covers two different domains (NLP and
Linux).  For full details about its construction, please see:

@inproceedings{Rea:Fli:Dri:12,
  author = {Jonathon Read and Dan Flickinger and Rebecca Dridan
            and Stephan Oepen and Lilja {\O}vrelid},
  title = {The WeSearch Corpus, Treebank, and Treecache --
           A Comprehensive Sample of User-Generated Content},
  booktitle = {Proceedings of the Eighth International Conference on
               Language Resources and Evaluation},
  year = {2012},
  month = {May},
  address = {Istanbul, Turkey},
  pages = {1829--1835},
  url = {http://www.lrec-conf.org/proceedings/lrec2012/pdf/774_Paper.pdf}}


Naming Conventions:

Each ‘collection’ (i.e. a combination of domain and genre) is identified by a
three-letter code, following the pattern ‘w[ln][bfrw]’:

  w       (WeSearch Data Collection)
  [ln]    (Linux or NLP)
  [bfrw]  (blogs, forums, reviews, or wikipedia)
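The pattern above yields eight possible collection codes, which can be
enumerated mechanically; a minimal sketch in Python (the dictionary names are
illustrative, not part of the WDC itself):

```python
from itertools import product

# Domain and genre letters from the 'w[ln][bfrw]' pattern.
DOMAINS = {"l": "linux", "n": "nlp"}
GENRES = {"b": "blogs", "f": "forums", "r": "reviews", "w": "wikipedia"}

# All eight collection codes, e.g. 'wnf' = WeSearch, NLP domain, forums genre.
codes = ["w" + d + g for d, g in product(DOMAINS, GENRES)]
```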
  
Two collections draw on independently developed resources, viz. WNW and WLW,
which correspond to the WeScience Corpus and parts of the WikiWoods Corpus,
respectively.  Thus, the assignment of item identifiers in these collections
does not follow the same pattern as for the other parts of the WDC (in fact,
the WeScience data is maintained and distributed separately, for now).


Directory structure:

For each collection, there are three sub-directories making available the text
at various levels of normalization (sometimes dubbed L0, L1, and L2):

 raw: Raw HTML files (L0).  There is a separate sub-directory for each source
      website.  The files inside each source directory are HTML, named to
      correspond to their path on the source website (with ‘/’ replaced
      by ‘:’).
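The path-to-filename mapping for ‘raw’ files can be sketched as follows (the
sample path is hypothetical, chosen only to illustrate the substitution):

```python
def raw_filename(path):
    """Map a path on the source website to its stored 'raw' filename
    by replacing every '/' with ':'."""
    return path.replace("/", ":")

# A hypothetical blog-post path and its stored name:
# raw_filename("2008/05/some-post") -> "2008:05:some-post"
```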

 txt: [incr tsdb()] import files, containing HTML annotations (L1).

 gml: [incr tsdb()] import files, containing GML annotations (L2).

Additional meta-information about the construction of the WDC is recorded in
auxiliary files (in each sub-directory, as appropriate) as follows:

Xref    : Mapping from original (‘raw’) document names to the 8-digit
          identifier prefix (see below for the exact identifier format).

Account : Accounts of the deletions made to create the L1 and L2 collections.
          Each file contains one line per item in its collection.  The first
          two numbers on a line are the item identifier and the character
          offset of the origin of the post.  These are followed by zero or
          more pairs of numbers: in each pair, the first number is a
          character position relative to the origin, and the second is the
          number of characters deleted at that position.
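Under this description, an Account line could be parsed as follows (a sketch
in Python; the whitespace-separated integer layout and the sample line are
assumptions for illustration, not part of the WDC specification):

```python
def parse_account_line(line):
    """Parse one Account line: item identifier, origin offset of the post,
    then zero or more (position, deleted-character-count) pairs, where each
    position is relative to the origin."""
    fields = [int(f) for f in line.split()]
    item_id, origin = fields[0], fields[1]
    # Remaining fields pair up as (position, count).
    pairs = list(zip(fields[2::2], fields[3::2]))
    return item_id, origin, pairs

# Hypothetical line: item 121000010001, origin at offset 42,
# with 3 characters deleted at position 10 and 7 at position 25.
parse_account_line("121000010001 42 10 3 25 7")
```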


[incr tsdb()] Import Files:

These files contain text sentences ready for import into [incr tsdb()].  The 
files numbered 00–03 are reserved:

   00 : for the benefit of future generations
   01 : a test set drawn from several sources
   02 : a single source test set
   03 : a development set
   
Each line in an import file contains an item identifier and its string,
delimited by ‘ |’.  Identifiers take the form: ‘DGSPPPPPIIII’.

   D = domain (1=linux, 2=nlp)
   G = genre (2=blogs, 3=forums, 4=reviews, 5=wiki)
   S = source (a unique number with respect to domain and genre)
        121 = embraceubuntu.com
        122 = ubuntu.philipcasey.com
        123 = www.linuxscrew.com
        124 = www.markshuttleworth.com
        125 = www.ubuntugeek.com
        126 = www.ubuntu-unleashed.com
        221 = blog.cyberling.org
        222 = gameswithwords.fieldofscience.com
        223 = lingpipe-blog.com
        224 = nlpers.blogspot.com
        225 = thelousylinguist.blogspot.com
   P = post (a unique number with respect to domain, genre and source)
   I = item (a unique number with respect to domain, genre, source and post)
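Putting the line format and the identifier scheme together, an import-file
line could be decoded like this (a sketch; the sample line is hypothetical,
and the field names are ours, not part of the WDC):

```python
def parse_import_line(line):
    """Split an import-file line on the ' |' delimiter into
    (identifier, string)."""
    identifier, text = line.split(" |", 1)
    return identifier.strip(), text

def decode_identifier(identifier):
    """Decode a 12-digit 'DGSPPPPPIIII' item identifier."""
    assert len(identifier) == 12 and identifier.isdigit()
    return {
        "domain": identifier[0],   # 1=linux, 2=nlp
        "genre": identifier[1],    # 2=blogs, 3=forums, 4=reviews, 5=wiki
        "source": identifier[2],   # unique within domain and genre
        "post": identifier[3:8],   # unique within domain, genre, and source
        "item": identifier[8:12],  # unique within the post
    }

# Hypothetical line for source 121 (embraceubuntu.com):
ident, text = parse_import_line("121000010001 |The quick brown fox.")
fields = decode_identifier(ident)
```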