A Python module to generate a tokenized dump of Wikipedia for NLP
WiToKit
Welcome to WiToKit, a Python toolkit to download and generate preprocessed Wikipedia dumps for all languages.
WiToKit can be used to convert a Wikipedia archive into a single .txt file, with one (tokenized) sentence per line.
Note: WiToKit currently only supports xx-pages-articles.xml.xx.bz2 Wikipedia archives, which correspond to articles, templates, media/file descriptions, and primary meta-pages.
Install
pip3 install witokit
On Python 3.5 you may need to pass the --process-dependency-link flag:
pip3 install witokit --process-dependency-link
Use
Download
To download a .bz2-compressed Wikipedia XML dump, do:
witokit download \
--lang lang_wp_code \
--date wiki_date \
--output /abs/path/to/output/dir/where/to/store/bz2/archives \
--num-threads num_cpu_threads
For example, to download the latest English Wikipedia, do:
witokit download --lang en --date latest --output /abs/path/to/output/dir --num-threads 2
The --lang parameter expects the WP (language) code corresponding to the desired Wikipedia archive.
Check out the full list of Wikipedias with their corresponding WP codes here.
The --date parameter expects a string corresponding to one of the dates listed on the Wikimedia dump site for the given Wikipedia (e.g. https://dumps.wikimedia.org/enwiki/ for the English Wikipedia).
Important: keep --num-threads <= 3 to avoid rejection from Wikimedia servers.
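The download options above can also be assembled from a script. The following is a minimal sketch and not part of WiToKit: the helper names `dump_index_url` and `build_download_command` are hypothetical, the URL pattern is inferred from the English example above (WP codes containing hyphens may map to a different directory name), and the thread cap simply mirrors the note about Wikimedia servers.

```python
MAX_THREADS = 3  # cap taken from the note above about Wikimedia servers


def dump_index_url(lang, date="latest"):
    """Return the dump index URL for a WP code and date (assumed pattern)."""
    return "https://dumps.wikimedia.org/{}wiki/{}/".format(lang, date)


def build_download_command(lang, date, output_dir, num_threads=2):
    """Build the argument list for `witokit download` from the documented flags."""
    if not 1 <= num_threads <= MAX_THREADS:
        raise ValueError(
            "num_threads must be between 1 and {}".format(MAX_THREADS)
        )
    return [
        "witokit", "download",
        "--lang", lang,
        "--date", date,
        "--output", output_dir,
        "--num-threads", str(num_threads),
    ]


if __name__ == "__main__":
    # Print the command for the latest English Wikipedia.
    print(" ".join(build_download_command("en", "latest", "/abs/path/to/output/dir")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once witokit is installed; rejecting thread counts above 3 before launching avoids starting a download that the servers may refuse.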
Extract
To extract the content of the downloaded .bz2 archives, do:
witokit extract \
--input /abs/path/to/downloaded/wikipedia/bz2/archives \
--num-threads num_cpu_threads
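To illustrate what extracting a .bz2 archive involves, here is a minimal sketch using Python's standard `bz2` module. It is not WiToKit's implementation, and the function name `decompress_bz2` is hypothetical; it only shows how a single compressed file can be streamed to disk without loading it fully into memory.

```python
import bz2
import shutil


def decompress_bz2(src_path, dst_path, chunk_size=1024 * 1024):
    """Stream-decompress a .bz2 file to dst_path, one chunk at a time."""
    with bz2.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)
    return dst_path
```

Streaming matters here because Wikipedia dumps are large; reading a whole archive into memory before writing it out would be impractical for most languages.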
Process
To preprocess the content of the extracted XML archives and output a single tokenized .txt file, one sentence per line, do:
witokit process \
--input /abs/path/to/wikipedia/extracted/xml/archives \
--output /abs/path/to/single/output/txt/file \
--lower \
--num-threads num_cpu_threads
The --lower flag is optional: if set, the text is lowercased.
Preprocessing for all languages is performed with Polyglot.
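The target output format (one tokenized sentence per line, optionally lowercased) can be illustrated with a small sketch. This is a deliberately naive stand-in and not what WiToKit does: it uses regular expressions instead of Polyglot, so it will mis-split abbreviations and will not handle languages without whitespace-delimited words. The function name `to_tokenized_lines` is hypothetical.

```python
import re

# Naive assumptions: a sentence ends at ., ! or ? followed by whitespace,
# and a token is either a run of word characters or a single punctuation mark.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
_TOKEN = re.compile(r"\w+|[^\w\s]")


def to_tokenized_lines(text, lower=False):
    """Return one space-separated tokenized sentence per list element."""
    if lower:
        text = text.lower()
    lines = []
    for sentence in _SENTENCE_END.split(text.strip()):
        tokens = _TOKEN.findall(sentence)
        if tokens:  # skip empty fragments
            lines.append(" ".join(tokens))
    return lines


if __name__ == "__main__":
    for line in to_tokenized_lines("Hello, world! This is WiToKit.", lower=True):
        print(line)
```

Writing each element of the returned list on its own line gives a file in the same shape as the one described above.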
Sample
You can also use WiToKit to sample the content of a preprocessed .txt file, using:
witokit sample \
--input /abs/path/to/witokit/preprocessed/txt/file \
--percent percent_of_lines_to_keep \
--balance
The --percent option sets the percentage of total lines to keep. If --balance is set, sampling is balanced; otherwise only the top n sentences are kept.
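The difference between the two sampling modes can be sketched as follows. This is an illustration under assumptions, not WiToKit's code: the text here does not define "balanced", so the sketch assumes it means taking evenly spaced lines across the whole file, while the default mode takes the first n lines. The function name `sample_lines` is hypothetical.

```python
def sample_lines(lines, percent, balance=False):
    """Keep `percent` percent of `lines`.

    balance=False: take the first n lines (the "top n sentences" mode).
    balance=True: take n lines evenly spaced over the input (assumed meaning).
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range (0, 100]")
    n_keep = int(len(lines) * percent / 100)
    if n_keep == 0:
        return []
    if not balance:
        return lines[:n_keep]
    step = len(lines) / n_keep  # spacing between kept lines
    return [lines[int(i * step)] for i in range(n_keep)]


if __name__ == "__main__":
    corpus = ["sentence {}".format(i) for i in range(10)]
    print(sample_lines(corpus, 20))                # first two lines
    print(sample_lines(corpus, 20, balance=True))  # lines 0 and 5
```

Taking only the top n lines is fast but biased towards the articles that come first in the dump; spreading the sample across the file is one way to reduce that bias.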
Source Distribution
File details
Details for the file witokit-1.1.0.tar.gz.
File metadata
- Download URL: witokit-1.1.0.tar.gz
- Upload date:
- Size: 8.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.12.1 pkginfo/1.5.0.1 requests/2.21.0 setuptools/39.2.0 requests-toolbelt/0.8.0 tqdm/4.35.0 CPython/3.6.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | c0126f1fde980ded5e1359977387b09a45be66c645a931c350988c948433ebb0
MD5 | 4b978064d5f0d24e66edfc8b570c3393
BLAKE2b-256 | 5969338c656703434d75a92b37220729f822d17a523f121173c1b65388866406
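A downloaded source distribution can be checked against the SHA256 digest listed above. The following is a minimal sketch using Python's standard `hashlib`; the function names are hypothetical, and the local file path is whatever location the archive was saved to.

```python
import hashlib

# SHA256 digest of witokit-1.1.0.tar.gz, copied from the table above.
EXPECTED_SHA256 = "c0126f1fde980ded5e1359977387b09a45be66c645a931c350988c948433ebb0"


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Return True if the file's SHA256 digest equals the expected one."""
    return sha256_of_file(path) == expected.lower()
```

Calling `matches_expected("witokit-1.1.0.tar.gz")` on the downloaded archive should return True if the file is intact.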