Evaluation scripts for boilerplate removal
cleaneval.py (version 1.0)
cleaneval.py is a fast evaluation script for the CleanEval-1 competition, written by Stefan Evert. It calculates precision, recall and F-score for automatically cleaned text dumps against a manually cleaned gold standard. Labelled or unlabelled segmentation accuracy can also be calculated in terms of precision, recall and F-score. All files must be in the format mandated for CleanEval-1.
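To make the three measures concrete, here is a minimal Python sketch of a word-level comparison between a cleaned text and a gold standard. It is not the code of cleaneval.py: the function name `prf`, the whitespace tokenisation and the use of `difflib.SequenceMatcher` for the alignment are all assumptions made for illustration, and the sketch ignores the CleanEval-1 file format and segment markers entirely.

```python
from difflib import SequenceMatcher


def prf(cleaned: str, gold: str):
    """Return (precision, recall, F-score) of `cleaned` against `gold`.

    Hypothetical helper: tokens are whitespace-separated words, and a token
    counts as correct if it falls inside a matching block of the alignment.
    """
    sys_tokens = cleaned.split()
    gold_tokens = gold.split()

    # Align the two token sequences; autojunk is disabled so that frequent
    # words are not silently ignored on long texts.
    matcher = SequenceMatcher(None, sys_tokens, gold_tokens, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())

    precision = matched / len(sys_tokens) if sys_tokens else 0.0
    recall = matched / len(gold_tokens) if gold_tokens else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Toy example: the cleaner kept two navigation words and lost one gold word.
p, r, f = prf("Home About the quick brown fox", "the quick brown fox jumps")
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")  # P=0.667 R=0.800 F=0.727
```

Precision drops when boilerplate survives in the cleaned text, recall drops when genuine content is deleted, and the F-score is the harmonic mean of the two.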
Boilerplate Removal Software
Software for automatic cleaning of Web pages (boilerplate removal)
NCleaner is a simple tool for automatic boilerplate removal, using character-level n-gram models as classifiers. Since it does not make use of HTML structure, it can also be applied to existing text dumps of Web pages. NCleaner participated in CleanEval-1 (under the working title StupidOS), where it achieved competitive results for text cleanup (though not for segmentation accuracy). The first official release includes both a fully portable Perl implementation and C code for faster processing on supported platforms (more than 20 million words per hour on a standard desktop computer).
You can check out cutting-edge source code with the latest bug fixes from the sf.net repository with the following command:
svn co svn://svn.code.sf.net/p/webascorpus/code/cleaneval/Text-NCleaner Text-NCleaner
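The idea of using character-level n-gram models as classifiers can be sketched in a few lines of Python. This is not NCleaner's implementation: the class `CharNgramModel`, the function `classify`, the trigram order, the add-one smoothing and the tiny training texts are all assumptions chosen to keep the example short. One model is trained on clean text and one on boilerplate, and a segment is assigned to whichever model gives it the higher log-probability.

```python
import math
from collections import Counter


class CharNgramModel:
    """Hypothetical character n-gram model with add-one smoothing."""

    def __init__(self, n=3):
        self.n = n
        self.ngrams = Counter()    # counts of full n-grams
        self.contexts = Counter()  # counts of the (n-1)-character histories
        self.alphabet = set()

    def _pad(self, text):
        # Left-pad with spaces so the first characters also have a history.
        return " " * (self.n - 1) + text

    def train(self, texts):
        for text in texts:
            padded = self._pad(text)
            self.alphabet.update(padded)
            for i in range(len(text)):
                gram = padded[i:i + self.n]
                self.ngrams[gram] += 1
                self.contexts[gram[:-1]] += 1

    def logprob(self, text, vocab_size):
        padded = self._pad(text)
        total = 0.0
        for i in range(len(text)):
            gram = padded[i:i + self.n]
            numerator = self.ngrams[gram] + 1
            denominator = self.contexts[gram[:-1]] + vocab_size
            total += math.log(numerator / denominator)
        return total


def classify(segment, clean_model, boiler_model):
    """Label a text segment by comparing the two model scores."""
    vocab_size = len(clean_model.alphabet | boiler_model.alphabet | set(segment))
    clean_score = clean_model.logprob(segment, vocab_size)
    boiler_score = boiler_model.logprob(segment, vocab_size)
    return "clean" if clean_score >= boiler_score else "boilerplate"


# Made-up training data, far too small for real use.
clean_model = CharNgramModel()
clean_model.train([
    "The committee published its annual report on Tuesday.",
    "Researchers found that the method works well on news text.",
])
boiler_model = CharNgramModel()
boiler_model.train([
    "Home | About | Contact | Login",
    "Copyright 2008 | Privacy Policy | Terms of Use",
])

print(classify("Home | Contact | Privacy Policy", clean_model, boiler_model))
# boilerplate
```

Because the models only look at character sequences, the same classifier works on plain text dumps where no HTML markup is left, which is the property the description above highlights.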
Google Web1T5 and other N-Gram Databases
Software for Google Web 1T 5-Grams and other N-Gram Databases
Web1T5-Easy (version 1.1)
Web1T5-Easy is a collection of Perl scripts for indexing and querying the Google Web 1T 5-Grams database with the open-source database engine SQLite. This package offers a quick and convenient way to build an interactively searchable version of the Web1T5 database, including a full collocation analysis and a simple but powerful Web interface. It is not designed as a high-performance Web service, and it requires a considerable amount of disk space (approx. 220 GiB) as well as patience (indexing may take up to 2 weeks on a state-of-the-art server).
You will soon be able to check out cutting-edge source code from the sf.net repository with the following command:
svn co svn://svn.code.sf.net/p/webascorpus/code/ngrams/Web1T5-Easy/trunk Web1T5-Easy
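The general approach of loading n-gram counts into SQLite and querying them with wildcards can be illustrated with a small Python sketch. It does not reproduce the Web1T5-Easy Perl scripts or their database layout: the table name `ngrams`, the columns `w1`..`wN` and `freq`, the helper functions `build_index` and `query`, and the sample counts are all hypothetical. The input is assumed to be lines of the form "n-gram, tab, frequency".

```python
import sqlite3


def build_index(lines, n, db_path=":memory:"):
    """Load n-gram lines ("w1 ... wn<TAB>freq") into a SQLite table."""
    conn = sqlite3.connect(db_path)
    columns = ", ".join(f"w{i} TEXT" for i in range(1, n + 1))
    conn.execute(f"CREATE TABLE ngrams ({columns}, freq INTEGER)")

    rows = []
    for line in lines:
        ngram, freq = line.rstrip("\n").split("\t")
        words = ngram.split(" ")
        if len(words) != n:
            continue  # skip lines that are not n-grams of the expected size
        rows.append((*words, int(freq)))

    placeholders = ", ".join("?" * (n + 1))
    conn.executemany(f"INSERT INTO ngrams VALUES ({placeholders})", rows)
    # Index the first word so that lookups by w1 avoid a full table scan.
    conn.execute("CREATE INDEX idx_ngrams_w1 ON ngrams (w1)")
    conn.commit()
    return conn


def query(conn, pattern):
    """Return matching rows, most frequent first; None is a wildcard slot."""
    conditions, params = [], []
    for i, word in enumerate(pattern, start=1):
        if word is not None:
            conditions.append(f"w{i} = ?")
            params.append(word)
    where = " AND ".join(conditions) or "1"
    sql = f"SELECT * FROM ngrams WHERE {where} ORDER BY freq DESC"
    return conn.execute(sql, params).fetchall()


# Invented trigram counts, for illustration only.
sample = ["web as corpus\t120", "web as a\t300", "corpus of web\t45"]
conn = build_index(sample, n=3)
print(query(conn, ["web", "as", None]))
# [('web', 'as', 'a', 300), ('web', 'as', 'corpus', 120)]
```

Storing every word as text keeps the sketch readable, but at the scale of the full database a more compact encoding and carefully chosen indexes matter a great deal, which is consistent with the disk space and indexing time quoted above.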
Web1T5-Easy was written by Stefan Evert. An online demonstration is available here: