Latest news:

New software releases: cleaneval.py and NCleaner-1.0

Welcome to the Web as Corpus community!

The World Wide Web has become an unprecedented and virtually inexhaustible source of authentic natural language data (also called a corpus) for researchers in linguistics, natural language processing, artificial intelligence and many other fields. Part of the appeal of this resource is the fast and easy access provided by commercial search engines like Google. However, the limited search functionality of such services and the lack of linguistic annotation (e.g. part-of-speech tagging) make them unsuitable for most scientific purposes.

Therefore, researchers interested in the Web as Corpus have joined forces to build their own Web corpora and “linguistic search engines”, or to find clever ways of using the limited information offered by the commercial search engines. This Web site is a central home for the Web as Corpus community and is jointly administered by its members. It provides general information about the Web as Corpus and about current activities such as conferences and competitions, and it offers open-source software and data sets for download.

More detailed information about specific aspects of the Web as Corpus can be found on several other pages (click on Other sites in the navigation menu on the left for links to these pages), and community discussions take place on the SIGWAC mailing list.