A Python module to generate a tokenized dump of Wikipedia for NLP

WiToKit

Welcome to WiToKit, a Python toolkit to download and generate preprocessed Wikipedia dumps for all languages.

WiToKit converts a Wikipedia archive into a single .txt file, with one (tokenized) sentence per line.

Note: WiToKit currently only supports xx-pages-articles.xml.xx.bz2 Wikipedia archives corresponding to articles, templates, media/file descriptions, and primary meta-pages.

Install

pip3 install witokit

On Python 3.5, you may need to pass the --process-dependency-links flag (note that pip 19.0 and later no longer accept it):

pip3 install witokit --process-dependency-links

Use

Download

To download a .bz2-compressed Wikipedia XML dump, do:

witokit download \
  --lang lang_wp_code \
  --date wiki_date \
  --output /abs/path/to/output/dir/where/to/store/bz2/archives \
  --num-threads num_cpu_threads

For example, to download the latest English Wikipedia, do:

witokit download --lang en --date latest --output /abs/path/to/output/dir --num-threads 2

The --lang parameter expects the WP (language) code of the desired Wikipedia archive (e.g. en for English). The full list of Wikipedias and their WP codes is available on the List of Wikipedias page on Wikimedia Meta-Wiki.

The --date parameter expects a string matching one of the dates listed on the Wikimedia dump site for the given Wikipedia (e.g. https://dumps.wikimedia.org/enwiki/ for the English Wikipedia).

Important: keep --num-threads <= 3 to avoid rejection by the Wikimedia servers.
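When downloads are scripted, the thread limit can be checked before the CLI is invoked. The following is a minimal sketch rather than part of WiToKit: build_download_command and MAX_THREADS are hypothetical names, and only the flags documented above are used.

```python
import subprocess

MAX_THREADS = 3  # limit recommended above to avoid rejection by Wikimedia servers


def build_download_command(lang, date, output_dir, num_threads=2):
    """Return the argv list for `witokit download` (hypothetical helper)."""
    if not 1 <= num_threads <= MAX_THREADS:
        raise ValueError(f"num_threads must be between 1 and {MAX_THREADS}")
    return [
        "witokit", "download",
        "--lang", lang,
        "--date", date,
        "--output", output_dir,
        "--num-threads", str(num_threads),
    ]


if __name__ == "__main__":
    cmd = build_download_command("en", "latest", "/abs/path/to/output/dir")
    print(" ".join(cmd))
    # Uncomment to actually run the download (requires witokit to be installed):
    # subprocess.run(cmd, check=True)
```

Building the command as a list, rather than a single string, avoids shell quoting issues if the output path contains spaces.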

Extract

To extract the content of the downloaded .bz2 archives, do:

witokit extract \
  --input /abs/path/to/downloaded/wikipedia/bz2/archives \
  --num-threads num_cpu_threads
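Conceptually, extraction amounts to decompressing each .bz2 archive into an XML file. The sketch below illustrates that idea with the Python standard library only; it is not WiToKit's actual implementation. The name decompress_bz2 is hypothetical, and writing the output next to the archive with the .bz2 suffix dropped is an assumption.

```python
import bz2
import shutil
from pathlib import Path


def decompress_bz2(archive_path):
    """Decompress one .bz2 file next to the original and return the new path."""
    archive = Path(archive_path)
    if archive.suffix != ".bz2":
        raise ValueError(f"not a .bz2 archive: {archive}")
    target = archive.with_suffix("")  # assumption: output name drops the .bz2 suffix
    with bz2.open(archive, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream in chunks, since dumps can be very large
    return target
```

Streaming with shutil.copyfileobj keeps memory usage flat regardless of archive size.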

Process

To preprocess the content of the extracted XML archives and output a single tokenized .txt file, with one sentence per line, do:

# --lower is optional: if set, the text is lowercased
witokit process \
  --input /abs/path/to/wikipedia/extracted/xml/archives \
  --output /abs/path/to/single/output/txt/file \
  --lower \
  --num-threads num_cpu_threads

Preprocessing for all languages is performed with Polyglot.
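Because the processed file holds one tokenized sentence per line, downstream code can stream it without loading the whole file into memory. The sketch below assumes tokens are separated by whitespace, which the description above does not state explicitly; iter_sentences and count_tokens are hypothetical helper names.

```python
from collections import Counter


def iter_sentences(txt_path):
    """Yield each non-empty line as a list of tokens."""
    with open(txt_path, encoding="utf-8") as stream:
        for line in stream:
            tokens = line.split()  # assumption: whitespace-separated tokens
            if tokens:
                yield tokens


def count_tokens(txt_path):
    """Return a Counter of token frequencies over the whole file."""
    counts = Counter()
    for tokens in iter_sentences(txt_path):
        counts.update(tokens)
    return counts
```

A frequency table built this way is a common first step before building a vocabulary for word embeddings or language models.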

Sample

You can also use WiToKit to sample the content of a preprocessed .txt file, using:

# --percent: percentage of total lines to keep
# --balance: if set, balances the sampling; otherwise, only the top n sentences are kept
witokit sample \
  --input /abs/path/to/witokit/preprocessed/txt/file \
  --percent \
  --balance
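The difference between the two sampling modes can be illustrated with a small sketch. This is one possible reading, not WiToKit's code: sample_lines is a hypothetical helper, and interpreting "balanced" as evenly spaced picks across the whole file is an assumption.

```python
def sample_lines(lines, percent, balance=False):
    """Keep roughly `percent` percent of `lines` (hypothetical helper)."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    total = len(lines)
    keep = int(total * percent / 100)
    if keep == 0:
        return []
    if not balance:
        return lines[:keep]  # top n sentences only
    # assumption: "balanced" means evenly spaced picks across the whole file
    return [lines[i * total // keep] for i in range(keep)]
```

Taking only the top n sentences favors whichever articles appear first in the dump, whereas spreading the picks draws from the whole file.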

Release history

1.1.0  Sep 12, 2019 (this version)
1.0.2  Sep 12, 2019
1.0.1  Sep 12, 2019
1.0.0  Sep 12, 2019
0.4.0  Jan 29, 2019
0.2.1  Jan 28, 2019
0.2.0  Jan 28, 2019
0.1.14  Dec 12, 2018
0.1.13  Nov 24, 2018
0.1.12  Nov 24, 2018
0.1.10  Nov 24, 2018
0.1.8  Nov 18, 2018
0.1.7  Nov 17, 2018
0.1.6  Nov 17, 2018
0.1.5  Nov 16, 2018
0.1.4  Nov 16, 2018
0.1.3  Nov 16, 2018
0.1.2  Nov 16, 2018
0.1.1  Nov 16, 2018
0.1.0  Nov 16, 2018

Download files

Source Distribution

witokit-1.1.0.tar.gz (8.2 kB)

Uploaded Sep 12, 2019 Source

File details

Details for the file witokit-1.1.0.tar.gz.

File metadata

Download URL: witokit-1.1.0.tar.gz
Upload date: Sep 12, 2019
Size: 8.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.12.1 pkginfo/1.5.0.1 requests/2.21.0 setuptools/39.2.0 requests-toolbelt/0.8.0 tqdm/4.35.0 CPython/3.6.5

File hashes

Hashes for witokit-1.1.0.tar.gz
Algorithm	Hash digest
SHA256	`c0126f1fde980ded5e1359977387b09a45be66c645a931c350988c948433ebb0`
MD5	`4b978064d5f0d24e66edfc8b570c3393`
BLAKE2b-256	`5969338c656703434d75a92b37220729f822d17a523f121173c1b65388866406`
