
Generate large textual corpora for almost any language by crawling the web

Project description

License: GPL

webcorpus is an end-to-end tool for crawling the web and generating datasets from the crawled data. It can produce monolingual corpora and ships with various processors that create labelled datasets automatically. webcorpus is particularly suited to low-resource languages, which need automated methods for creating large-scale datasets.

This project has been used to generate IndicCorp, a large-scale corpus for Indic languages, and several datasets for IndicGLUE.

Installation

Make sure you have Java installed on your system. Then install webcorpus using pip:

sudo pip3 install webcorpus
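
To confirm the install, you can inspect the package metadata with pip itself (a standard pip command, not a webcorpus feature):

pip3 show webcorpus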

Usage

To build a dataset, we first crawl the web and then process the crawled data into the final corpus.

Step 1: Crawling Sources

To start crawling websites, you first need to start the webcorpus crawling server:

webcorpus start

Once the server has started, you can launch crawls with the following command:

webcorpus crawl --path <path> --name <name> --url <url> --log <path> [--host <ip address>]
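
For example, the following starts a crawl of a hypothetical news site; the crawl name, URL, and paths are placeholders to replace with your own:

webcorpus crawl --path ./crawls --name ta-news --url https://www.example.com --log ./logs/ta-news.log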

You can check the status of your crawls at any time by executing:

webcorpus log [--host <ip address>]

The last two commands can also be run remotely, which is useful in distributed mode when you are running multiple webcorpus servers. See the example below.
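
For instance, assuming a second webcorpus server is reachable at 10.0.0.5 (a placeholder address), you could start a crawl on it and follow its status from your local machine:

webcorpus crawl --path /data/crawls --name ta-news --url https://www.example.com --log /data/logs/ta-news.log --host 10.0.0.5

webcorpus log --host 10.0.0.5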

Step 2: Processing Corpus

Once a crawl has finished, run the processor to turn the crawled data into the final dataset:

webcorpus process --operation <operation code> --lang <lang code> --input <input path> --output <output path>

Currently, the following processing operations are supported: extract_arts, extract_sents, extract_genres, archive.
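
As a sketch, the following would extract sentences from a finished crawl to build a monolingual corpus; the paths are placeholders, and the language code is assumed here to be an ISO 639-1 code such as ta for Tamil (check the project documentation for the exact codes expected):

webcorpus process --operation extract_sents --lang ta --input ./crawls/ta-news --output ./corpus/ta-sents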

Project details


Release history

This version

0.2

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

webcorpus-0.2.tar.gz (35.4 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

webcorpus-0.2-py3-none-any.whl (55.1 kB)

Uploaded Python 3

File details

Details for the file webcorpus-0.2.tar.gz.

File metadata

  • Download URL: webcorpus-0.2.tar.gz
  • Upload date:
  • Size: 35.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.49.0 CPython/3.8.2

File hashes

Hashes for webcorpus-0.2.tar.gz
Algorithm     Hash digest
SHA256        36c6a63027936309fcbda4e0957d23007b78639e873925ba43360f51350aedfc
MD5           0637de14f05e1d554daced3f82d02338
BLAKE2b-256   1e2e8ff88ca95c23f989ba067b9b2c390350059d41d1415b10bdad85d2450b4d

See more details on using hashes here.
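
As a quick way to verify a download, you can compute the SHA256 digest of the file on disk with the standard sha256sum utility (not part of webcorpus) and compare it against the value listed above:

sha256sum webcorpus-0.2.tar.gz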

File details

Details for the file webcorpus-0.2-py3-none-any.whl.

File metadata

  • Download URL: webcorpus-0.2-py3-none-any.whl
  • Upload date:
  • Size: 55.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.49.0 CPython/3.8.2

File hashes

Hashes for webcorpus-0.2-py3-none-any.whl
Algorithm     Hash digest
SHA256        e21533ef788ed23f13b29947ee4c93a32e4f7d890b13c6d97ff49f9e0432b36c
MD5           bb6a8523b030d07d2a65a46e7c6b79af
BLAKE2b-256   f4745fb341a62ee85233198ac15b87fb8023ebd78d62109cfb8dd07b09268e7c

See more details on using hashes here.
