Software for automatic cleaning of Web pages (boilerplate removal)
NCleaner is a simple tool for automatic boilerplate removal that uses character-level n-gram models as classifiers. Since it does not make use of HTML structure, it can also be applied to existing text dumps of Web pages. NCleaner participated in CleanEval-1 (under the working title StupidOS), where it achieved competitive results for text cleanup (though not for segmentation accuracy). The first official release includes both a fully portable Perl implementation and C code for faster processing on supported platforms (more than 20 million words per hour on a standard desktop computer).
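To illustrate the general idea of using character-level n-gram models as classifiers, here is a minimal Python sketch. It is not NCleaner's actual code or interface: the class `CharNgramModel`, the function `classify`, the add-k smoothing constant, the choice of trigrams, and the toy training sentences are all assumptions made for this example. One model is trained on clean text and one on boilerplate, and a segment is assigned to whichever model gives it the higher average log-probability per character.

```python
import math
from collections import Counter


class CharNgramModel:
    """Character n-gram model with add-k smoothing (hypothetical, for illustration)."""

    def __init__(self, n=3, k=0.1):
        self.n = n                   # n-gram order (assumed value, not NCleaner's)
        self.k = k                   # smoothing constant (assumed value)
        self.ngrams = Counter()      # counts of history + next character
        self.histories = Counter()   # counts of the (n-1)-character history
        self.vocab = set()

    def _pad(self, text):
        # Start and end markers so that segment boundaries are modelled too.
        return "\x02" * (self.n - 1) + text + "\x03"

    def train(self, texts):
        for text in texts:
            padded = self._pad(text)
            self.vocab.update(padded)
            for i in range(self.n - 1, len(padded)):
                hist = padded[i - self.n + 1:i]
                self.ngrams[hist + padded[i]] += 1
                self.histories[hist] += 1

    def avg_logprob(self, text):
        """Average smoothed log-probability per character of `text`."""
        padded = self._pad(text)
        v = len(self.vocab) + 1      # one extra slot for unseen characters
        total = 0.0
        count = 0
        for i in range(self.n - 1, len(padded)):
            hist = padded[i - self.n + 1:i]
            num = self.ngrams[hist + padded[i]] + self.k
            den = self.histories[hist] + self.k * v
            total += math.log(num / den)
            count += 1
        return total / count


def classify(segment, clean_model, boiler_model):
    """Label a text segment by comparing the two models' scores."""
    if clean_model.avg_logprob(segment) >= boiler_model.avg_logprob(segment):
        return "clean"
    return "boilerplate"


# Toy training data, invented for this sketch.
clean_model = CharNgramModel()
clean_model.train([
    "The committee met on Tuesday to discuss the new budget proposal.",
    "Researchers found that the method works well on noisy data.",
])

boiler_model = CharNgramModel()
boiler_model.train([
    "Home | About | Contact | Login",
    "Copyright 2008 All rights reserved | Privacy Policy | Terms of Use",
])

print(classify("Privacy Policy | Terms of Use | Contact", clean_model, boiler_model))
print(classify("The committee met to discuss the proposal.", clean_model, boiler_model))
```

Because the models only look at character sequences, a sketch like this works on plain text dumps as well as on text extracted from HTML. A realistic setup would need far more training data than the four sentences used here.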
You can check out the latest source code, including recent bug fixes, from the sf.net repository with the following command:
svn co svn://svn.code.sf.net/p/webascorpus/code/cleaneval/Text-NCleaner Text-NCleaner