The data used for the segmentation experiments reported in Read et al. (2012) was created from the open-source Conan Doyle Corpus (CDC), available from http://www.delph-in.net/cdc/.

To create the unsegmented.txt file used in the segmentation experiments:

    cat {baskerville,cardboard,circle,wisteria}*.txt | perl -pe 's/ +/ /g;' | \
      perl -pe 's/\n/ /;' | perl -pe 's/  +/\n\n/g' > unsegmented.txt

And to create the segmented.txt file used for evaluation:

    cat {baskerville,cardboard,circle,wisteria}*.txt | grep -v "^$" > segmented.txt
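To make the effect of the two pipelines easier to inspect, here is a minimal Python sketch of the same text transformations. It is an illustration and not part of the original recipe. It assumes the corpus files hold one sentence per line with blank lines between paragraphs, and it reads the final perl substitution as matching runs of two or more spaces, so that only former blank lines become paragraph breaks. The function names `unsegment` and `segment` and the sample text are hypothetical.

```python
import re


def unsegment(text: str) -> str:
    """Approximate the three perl substitutions on the concatenated text."""
    # Step 1: collapse runs of spaces inside each line (s/ +/ /g).
    text = re.sub(r" +", " ", text)
    # Step 2: turn every newline into a space (s/\n/ /). A blank line
    # between paragraphs therefore leaves two adjacent spaces.
    text = text.replace("\n", " ")
    # Step 3 (assumed reading): runs of two or more spaces become a
    # paragraph break, so sentence boundaries inside a paragraph vanish.
    return re.sub(r" {2,}", "\n\n", text)


def segment(text: str) -> str:
    """Approximate grep -v "^$": drop empty lines, keep all others as-is."""
    return "".join(line + "\n" for line in text.split("\n") if line != "")


if __name__ == "__main__":
    # Hypothetical toy input: two sentences, a blank line, one sentence.
    sample = "First sentence.\nSecond  sentence.\n\nNew paragraph.\n"
    print(repr(unsegment(sample)))
    print(repr(segment(sample)))
```

On the toy input, `unsegment` keeps the paragraph break but removes the line break between the first two sentences, while `segment` keeps one sentence per line and drops only the blank line. The first is what a segmenter sees as input; the second is the gold standard it is scored against.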