Expanda
The universal integrated corpus-building environment.
Introduction
Expanda is an integrated corpus-building environment that provides ready-made pipelines for building a corpus dataset. Building a corpus dataset involves several complicated steps such as parsing, shuffling, and tokenization. When the corpora are gathered from different applications, each source format also has to be parsed separately. With Expanda, you describe the build in a configuration file and build the whole corpus in a single run.
Main Features
- Easy to build, and simple to extend with new extensions
- Manages the build environment systematically
- Fast builds through performance optimization (even though it is written in Python)
- Supports multi-processing
- Very low memory usage
- No need to write new code for each corpus; just add one line for each new corpus.
Dependencies
- nltk
- ijson
- tqdm>=4.46.0
- mwparserfromhell>=0.5.4
- tokenizers>=0.7.0
- kss==1.3.1
Installation
With pip
Expanda can be installed using pip as follows:
$ pip install expanda
From source
You can install from source by cloning the repository and running:
$ git clone https://github.com/affjljoo3581/Expanda.git
$ cd Expanda
$ python setup.py install
Build your first dataset
Let's build a Wikipedia dataset with Expanda. First, install Expanda:
$ pip install expanda
Next, create a workspace for building the dataset:
$ mkdir workspace
$ cd workspace
Then download a Wikipedia dump file. In this example, we use only part of the wiki. Either download the file through the browser, move it to workspace/src, and rename it to wiki.xml.bz2, or run the commands below instead:
$ mkdir src
$ wget -O src/wiki.xml.bz2 https://dumps.wikimedia.org/enwiki/20200520/enwiki-20200520-pages-articles1.xml-p1p30303.bz2
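Before building, it can be useful to confirm that the downloaded archive is a readable bz2 file. The following is a minimal sketch using only the Python standard library; the helper name `peek_bz2` is hypothetical and is not part of Expanda.

```python
import bz2
from itertools import islice


def peek_bz2(path, num_lines=5):
    """Return the first `num_lines` decoded lines of a bz2-compressed text file."""
    # bz2.open in text mode decompresses lazily, so only the beginning of the
    # archive is read, even for a very large dump.
    with bz2.open(path, mode="rt", encoding="utf-8") as stream:
        return [line.rstrip("\n") for line in islice(stream, num_lines)]
```

Calling `peek_bz2("src/wiki.xml.bz2")` from the workspace directory should return the first few lines of the XML dump if the download succeeded; a corrupted or truncated file raises an error instead.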
After downloading the dump file, we need to set up the configuration file. Create an expanda.cfg file with the following content:
[expanda.ext.wikipedia]
num-cores = 6

[tokenization]
unk-token = <unk>
control-tokens = <s>
    </s>
    <pad>

[build]
input-files =
    --expanda.ext.wikipedia src/wiki.xml.bz2
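To see how the multi-line values in this file are read, here is a minimal sketch that parses the same configuration with Python's standard `configparser`. How Expanda itself reads the file is an assumption here; the function name `parse_build_config` is hypothetical, and the sketch assumes the continuation lines are indented, as INI multi-line values require.

```python
import configparser

# The example configuration, with continuation lines indented.
CONFIG_TEXT = """\
[expanda.ext.wikipedia]
num-cores = 6

[tokenization]
unk-token = <unk>
control-tokens = <s>
    </s>
    <pad>

[build]
input-files =
    --expanda.ext.wikipedia src/wiki.xml.bz2
"""


def parse_build_config(text):
    """Return (num_cores, control_tokens, inputs) parsed from the config text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)

    num_cores = parser.getint("expanda.ext.wikipedia", "num-cores")
    # A multi-line value comes back as one string with embedded newlines.
    control_tokens = parser.get("tokenization", "control-tokens").split()

    # Each non-empty line of input-files pairs an extension flag with a path.
    inputs = []
    for line in parser.get("build", "input-files").splitlines():
        if line.strip():
            flag, path = line.split()
            inputs.append((flag.lstrip("-"), path))
    return num_cores, control_tokens, inputs
```

With the text above, the function yields one input entry per line under `input-files`, which is why adding another corpus only takes one more line in the `[build]` section.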
The current directory structure of workspace should be as follows:
workspace
├── src
│   └── wiki.xml.bz2
└── expanda.cfg
Now we are ready to build! Run Expanda by using:
$ expanda build
The build prints output like the following:
[*] execute extension [expanda.ext.wikipedia] for [src/wiki.xml.bz2]
[nltk_data] Downloading package punkt to /home/user/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
[*] merge extracted texts.
[*] start shuffling merged corpus...
[*] optimum stride: 17, buckets: 34
[*] create temporary bucket files.
[*] successfully shuffle offsets. total offsets: 102936
[*] shuffle input file: 100%|████████████████████| 102936/102936 [00:02<00:00, 34652.03it/s]
[*] start copying buckets to the output file.
[*] finish copying buckets. remove the buckets...
[*] complete preparing corpus. start training tokenizer...
[00:00:59] Reading files ████████████████████ 100
[00:00:04] Tokenize words ████████████████████ 405802 / 405802
[00:00:00] Count pairs ████████████████████ 405802 / 405802
[00:00:01] Compute merges ████████████████████ 6332 / 6332
[*] create tokenized corpus.
[*] tokenize corpus: 100%|█████████████████████| 1749902/1749902 [00:28<00:00, 61958.55it/s]
[*] split the corpus into train and test dataset.
[*] remove temporary directory.
[*] finish building corpus.
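The log above mentions shuffling offsets rather than text. As an illustration of why that keeps memory usage low, here is a simplified sketch of an offset-based line shuffle. It is not Expanda's implementation and does not model the stride or the temporary bucket files that the log reports; the function name `shuffle_lines` is hypothetical.

```python
import random


def shuffle_lines(input_path, output_path, seed=None):
    """Shuffle the lines of a file while holding only byte offsets in memory."""
    # Pass 1: record the byte offset at which every line starts.
    offsets = []
    with open(input_path, "rb") as src:
        while True:
            position = src.tell()
            if not src.readline():
                break
            offsets.append(position)

    # Shuffle the offsets, which are much smaller than the lines themselves.
    random.Random(seed).shuffle(offsets)

    # Pass 2: seek to each offset in shuffled order and copy that line.
    with open(input_path, "rb") as src, open(output_path, "wb") as dst:
        for position in offsets:
            src.seek(position)
            line = src.readline()
            # The last line of the input may lack a trailing newline.
            dst.write(line if line.endswith(b"\n") else line + b"\n")
    return len(offsets)
```

Random seeks over a large file are slow on disk, which is one plausible reason a real implementation would group the offsets into buckets before copying; that motivation is an assumption, not something stated by the project.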
If the build succeeds, the workspace has the following directory tree:
workspace
├── build
│   ├── corpus.raw.txt
│   ├── corpus.train.txt
│   ├── corpus.test.txt
│   └── vocab.txt
├── src
│   └── wiki.xml.bz2
└── expanda.cfg
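A quick sanity check after the build is to count the lines in each output file. The sketch below assumes the files are line-oriented text, which is an assumption about the output format rather than documented behavior; the helper name `summarize_build` is hypothetical.

```python
import os


def summarize_build(build_dir="build"):
    """Return a dict mapping each existing output file name to its line count."""
    names = ["corpus.raw.txt", "corpus.train.txt", "corpus.test.txt", "vocab.txt"]
    summary = {}
    for name in names:
        path = os.path.join(build_dir, name)
        # Skip files that are missing instead of failing the whole check.
        if not os.path.isfile(path):
            continue
        # Iterate in binary mode so large files are streamed, not loaded.
        with open(path, "rb") as stream:
            summary[name] = sum(1 for _ in stream)
    return summary
```

Running `summarize_build()` from the workspace directory gives a rough view of how the corpus was divided between the train and test files.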