The ‘raw’ version of the Brown corpus was constructed by starting with the tagged version of the corpus available from and applying various (automatic and manual) transformations. Paragraph breaks from the tagged corpus were maintained. The Bergen Format I version of the data was used to inform the transformation decisions.

First, the automatic changes:

* normalises spacing
* drops the tags
* treats sentences tagged as headlines as one-line paragraphs
* removes extra spaces around punctuation, where this can be done automatically
* maps LaTeX quotes back to double straight quotes, where this can be done automatically
* removes the double punctuation that wasn't in the raw text

    for x in c*[0-9]; do
      # strip the trailing digits to get the two-letter prefix, e.g. ca01 -> ca
      base=`echo "$x" | perl -pe 's/(..)\d+/$1/;'`
      if [ ! -d "../cooked/$base" ]; then mkdir "../cooked/$base"; fi
      cat "$x" | ../scripts/ > "../cooked/$base/$x"
    done

Text matching the following patterns was manually corrected according to the Bergen Format I version:

    /^['"] /
    / ['"]$/
    / ['"] /
    /"' /
    / '[",?!;]/
    /''/
    /``/

Other errors were corrected opportunistically when they came up during these searches, but no other systematic corrections were made.

To create the unsegmented.txt file, used in the segmentation experiments:

    cat cooked/*/* | perl -pe 's/ +/ /g;' | perl -pe 's/\n/ /;' | \
      perl -pe 's/  +/\n\n/g' > unsegmented.txt

And the segmented.txt file, used for evaluation:

    cat cooked/*/* | grep -v "^$" > segmented.txt