Skip to main content

Generate large textual corpora for almost any language by crawling the web

Project description

pypi badge GPL License

webcorpus is an end-to-end tool used to crawl and generate datasets from the crawled data. It can be used to generate monolingual corpora and has various processors to create labelled datasets automatically. webcorpus is particulary suited for low-resource languages which need automated methods for creating large-scale datasets.

This project has been used to generate IndicCorp, a large-scale corpora for Indic languages, and some datasets for IndicGLUE.

Installation

Make sure you have java installed on your system. Next, install it using pip:

sudo pip3 install webcorpus

Usage

To build the dataset, we first need to crawl the web and then process the crawls to create the final dataset.

Step 1: Crawling Sources

To start crawling websites, you first need to start the webcorpus crawling server:

webcorpus start

Once the server has started, you can start crawls using the following command.

webcorpus crawl --path <path> --name <name> --url <url> --log <path> [--host <ip address>]

You can see the status of the crawls anytime by executing:

webcorpus log [--host <ip address>]

The last two steps can also been remotely, which can be useful in distributed mode where you are running multiple webcorpus servers.

Step 2: Processing Corpus
webcorpus process --operation <operation code> --lang <lang code> --input <input path> --output <output path>

Currently, the following processing operations are supported: extract_arts, extract_sents, extract_genres, archive.

Project details


Release history Release notifications | RSS feed

This version

0.2

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Files for webcorpus, version 0.2
Filename, size File type Python version Upload date Hashes
Filename, size webcorpus-0.2-py3-none-any.whl (55.1 kB) File type Wheel Python version py3 Upload date Hashes View
Filename, size webcorpus-0.2.tar.gz (35.4 kB) File type Source Python version None Upload date Hashes View

Supported by

AWS AWS Cloud computing Datadog Datadog Monitoring DigiCert DigiCert EV certificate Facebook / Instagram Facebook / Instagram PSF Sponsor Fastly Fastly CDN Google Google Object Storage and Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Salesforce Salesforce PSF Sponsor Sentry Sentry Error logging StatusPage StatusPage Status page