version: 0.9 (20-feb-12)

this directory contains documents and code for the 2012 ACL Special
Workshop on Rediscovering 50 Years of Discoveries, specifically the
Contributed Task.  these files were collected through a collaboration of
the Language Technology Laboratory at DFKI (c/o Ulrich Schaefer) and the
Department of Informatics at the University of Oslo (UiO; c/o Jonathon
Read and Stephan Oepen).  for background, please see:

  http://translit.i2r.a-star.edu.sg/r50/taskintro/

the directory 'pdfextract/' contains an almost ready-to-run copy of
PDFExtract, a program to extract textual content (organized in
TEI-compliant XML) from the so-called PDF text stream.  for background,
please see:

  http://www.duo.uio.no/sok/work.html?WORKID=120077
  http://elacin.github.com/PDFExtract/

the directory 'anthology/' contains a collection of files from the ACL
Anthology in various formats:

  + '.url' --- the download link for the actual PDF
  + '.paper.xml' --- textual content extracted from "O"CR (by DFKI)
  + '.pdfextract.xml' --- textual content extracted from the text stream
    (by UiO)

because not all files in the Anthology were born digital, PDFExtract
output is missing for about half the files: in the initial release,
there is PDFExtract output only for documents that are part of the ACL
Anthology Reference Corpus (ARC).  we are currently working to fill in
more of the documents in this format, for example LREC proceedings.

to complement what is distributed through SVN with the actual 'raw' PDF
files, a command like the following can be used (in bash(1) syntax):

  for i in ?/*.url; do wget -O "${i%.url}.pdf" "$(cat "$i")"; done
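
since PDFExtract output exists for only about half the files, it can be
handy to list the papers that still lack a '.pdfextract.xml' file.  the
following is a minimal sketch, not part of the distribution: the function
name 'missing_pdfextract' is hypothetical, and it assumes the
one-character subdirectory layout suggested by the '?/*.url' pattern in
the download command, with all files for one paper sharing a common
prefix.

```shell
# hypothetical helper: print the identifier of every paper that has a
# '.url' file but no matching '.pdfextract.xml' file next to it.
# assumes the layout <root>/<one-char-dir>/<paper>.url (an assumption
# based on the '?/*.url' glob; adjust if the actual tree differs).
missing_pdfextract() (
  root=${1:-.}
  for url in "$root"/?/*.url; do
    # skip the literal pattern when the glob matches nothing
    [ -e "$url" ] || continue
    base=${url%.url}
    # report the bare paper identifier when PDFExtract output is absent
    [ -e "$base.pdfextract.xml" ] || printf '%s\n' "${base##*/}"
  done
)
```

for example, 'missing_pdfextract . | wc -l' run at the top of the
collection would count the papers without PDFExtract output.  the body
runs in a subshell, so its variables do not leak into the calling shell.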