CleanEval Package
Evaluation scripts for boilerplate removal
-
cleaneval.py (version 1.0)
cleaneval.py is a fast evaluation script for the CleanEval-1 competition, written by Stefan Evert. It calculates precision, recall and F-score for automatically cleaned text dumps against a manually cleaned gold standard. All files must be in the format mandated for CleanEval-1. The script can also calculate labelled or unlabelled segmentation accuracy, likewise in terms of precision, recall and F-score.
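To illustrate what precision, recall and F-score mean for text cleanup, here is a minimal sketch in Python. It is not the code of cleaneval.py: the function name `cleanup_scores`, the whitespace tokenisation and the use of `difflib.SequenceMatcher` for alignment are assumptions made for this example, and it works on plain strings rather than on files in the CleanEval-1 format.

```python
from difflib import SequenceMatcher


def cleanup_scores(system_text, gold_text):
    """Return (precision, recall, f_score) for a cleaned text against a gold standard.

    Hypothetical helper for illustration only. Tokens are whitespace-separated
    words; the two token sequences are aligned with difflib, and every aligned
    token counts as a true positive.
    """
    system_tokens = system_text.split()
    gold_tokens = gold_text.split()

    # autojunk=False disables difflib's heuristic that ignores frequent items.
    matcher = SequenceMatcher(None, system_tokens, gold_tokens, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())

    # Precision: share of the system output that also appears in the gold text.
    precision = matched / len(system_tokens) if system_tokens else 0.0
    # Recall: share of the gold text that the system output retained.
    recall = matched / len(gold_tokens) if gold_tokens else 0.0
    # F-score: harmonic mean of precision and recall.
    if precision + recall > 0:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    return precision, recall, f_score


if __name__ == "__main__":
    # The system output kept two boilerplate words and lost one gold word.
    system = "Home About the quick brown fox"
    gold = "the quick brown fox jumps"
    print(cleanup_scores(system, gold))
```

In this toy case four words are aligned, so precision is 4/6 (two boilerplate words were left in) and recall is 4/5 (one gold-standard word was lost).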
Boilerplate Removal Software
Software for automatic cleaning of Web pages (boilerplate removal)
-
NCleaner-1.0
NCleaner is a simple tool for automatic boilerplate removal, using character-level n-gram models as classifiers. Since it does not make use of HTML structure, it can also be applied to existing text dumps of Web pages. NCleaner participated in CleanEval-1 (under the working title StupidOS), where it achieved competitive results for text cleanup (though not for segmentation accuracy). The first official release includes both a fully portable Perl implementation and C code for faster processing on supported platforms (more than 20 million words per hour on a standard desktop computer).
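The general idea of using character-level n-gram models as classifiers can be sketched as follows. This is a toy illustration in Python, not NCleaner's implementation (which is in Perl and C): the class `CharNgramModel`, the function `classify`, the trigram order, the add-one smoothing and the tiny training sets are all assumptions made for the example. One model is trained on clean text and one on boilerplate, and a segment is assigned to whichever model gives it the higher log-probability.

```python
import math
from collections import Counter


class CharNgramModel:
    """Toy character n-gram model with add-one smoothing (hypothetical)."""

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = Counter()    # counts of full n-grams
        self.context_counts = Counter()  # counts of (n-1)-character contexts
        self.alphabet = set()            # characters seen during training

    def _pad(self, text):
        # Pad with spaces so the first characters also have a full context.
        return " " * (self.n - 1) + text

    def train(self, texts):
        for text in texts:
            padded = self._pad(text)
            self.alphabet.update(padded)
            for i in range(len(text)):
                gram = padded[i:i + self.n]
                self.ngram_counts[gram] += 1
                self.context_counts[gram[:-1]] += 1

    def log_prob(self, text, vocab_size):
        """Sum of smoothed log-probabilities of each character given its context."""
        padded = self._pad(text)
        total = 0.0
        for i in range(len(text)):
            gram = padded[i:i + self.n]
            numerator = self.ngram_counts[gram] + 1
            denominator = self.context_counts[gram[:-1]] + vocab_size
            total += math.log(numerator / denominator)
        return total


def classify(segment, content_model, boilerplate_model):
    """Label a text segment by comparing the two models' log-probabilities."""
    # Shared vocabulary size, plus one slot for characters never seen in training.
    vocab_size = len(content_model.alphabet | boilerplate_model.alphabet) + 1
    content_score = content_model.log_prob(segment, vocab_size)
    boilerplate_score = boilerplate_model.log_prob(segment, vocab_size)
    return "content" if content_score >= boilerplate_score else "boilerplate"


# Tiny made-up training sets, for illustration only.
content_model = CharNgramModel()
content_model.train([
    "The committee published its annual report on Tuesday.",
    "Researchers found that the method improves accuracy.",
])
boilerplate_model = CharNgramModel()
boilerplate_model.train([
    "Home | About | Contact | Login",
    "Copyright 2007 All rights reserved | Privacy Policy",
])

if __name__ == "__main__":
    for segment in ["Contact | Privacy | Login",
                    "The report found that accuracy improves."]:
        print(segment, "->", classify(segment, content_model, boilerplate_model))
```

Because the models only look at character sequences, a classifier of this kind needs no HTML markup and can be run on segments of an existing text dump.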