Integrated Corpus-Building Environment

Project description

Expanda

The universal integrated corpus-building environment.

Introduction

Expanda is an integrated corpus-building environment. It provides integrated pipelines for building corpus datasets. Building a corpus dataset involves several complicated stages, such as parsing, shuffling, and tokenization, and when the corpora come from different sources, parsing their various formats becomes a problem of its own. Expanda lets you build the whole corpus at once simply by writing a build configuration.

Main Features

  • Easy to build, simple to extend with new extensions
  • Manages the build environment systematically
  • Fast builds through performance optimization (even though it is written in Python)
  • Supports multi-processing
  • Extremely low memory usage
  • No need to write new code for each corpus; adding a new corpus is a single configuration line (see the sketch below)
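
For example, corpora are registered in expanda.cfg (the full configuration format is shown later on this page) with one line per corpus under the [build] section. The dump file names in this sketch are hypothetical:

[build]
input-files         =
    --expanda.ext.wikipedia     src/enwiki.xml.bz2
    --expanda.ext.wikipedia     src/kowiki.xml.bz2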

Dependencies

  • nltk
  • ijson
  • tqdm>=4.46.0
  • mwparserfromhell>=0.5.4
  • tokenizers>=0.7.0
  • kss==1.3.1
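
These packages are declared as install requirements, so pip should pull them in automatically when installing Expanda. If you want to install them by hand (for a development setup, say), the equivalent command would be:

$ pip install nltk ijson "tqdm>=4.46.0" "mwparserfromhell>=0.5.4" "tokenizers>=0.7.0" "kss==1.3.1"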

Installation

With pip

Expanda can be installed using pip as follows:

$ pip install expanda
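
To confirm that the package was installed, you can inspect its metadata with pip:

$ pip show expanda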

From source

You can install from source by cloning the repository and running:

$ git clone https://github.com/affjljoo3581/Expanda.git
$ cd Expanda
$ python setup.py install

Build your first dataset

Let's build a Wikipedia dataset using Expanda. First of all, install Expanda:

$ pip install expanda

Next, create a workspace for the build by running:

$ mkdir workspace
$ cd workspace

Then, download a Wikipedia dump file from https://dumps.wikimedia.org/. In this example, we are going to test with a part of enwiki. Download the file through a web browser, move it to workspace/src, and rename it to wiki.xml.bz2. Alternatively, run the commands below:

$ mkdir src
$ wget -O src/wiki.xml.bz2 https://dumps.wikimedia.org/enwiki/20200520/enwiki-20200520-pages-articles1.xml-p1p30303.bz2

After downloading the dump file, we need to set up the configuration file. Create an expanda.cfg file with the following contents:

# Options for the Wikipedia extension: number of CPU cores to use.
[expanda.ext.wikipedia]
num-cores           = 6

# Special tokens used when training the tokenizer.
[tokenization]
unk-token           = <unk>
control-tokens      = <s>
                      </s>
                      <pad>

# Input corpora: one line per corpus, pairing an extension with a source file.
[build]
input-files         =
    --expanda.ext.wikipedia     src/wiki.xml.bz2

The current directory structure of the workspace should be as follows:

workspace
├── src
│   └── wiki.xml.bz2
└── expanda.cfg

Now we are ready to build! Run Expanda with:

$ expanda build

Then you should see output like the following:

[*] execute extension [expanda.ext.wikipedia] for [src/wiki.xml.bz2]
[nltk_data] Downloading package punkt to /home/user/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[*] merge extracted texts.
[*] start shuffling merged corpus...
[*] optimum stride: 17, buckets: 34
[*] create temporary bucket files.
[*] successfully shuffle offsets. total offsets: 102936
[*] shuffle input file: 100%|████████████████████| 102936/102936 [00:02<00:00, 34652.03it/s]
[*] start copying buckets to the output file.
[*] finish copying buckets. remove the buckets...
[*] complete preparing corpus. start training tokenizer...
[00:00:59] Reading files                            ████████████████████                 100
[00:00:04] Tokenize words                           ████████████████████ 405802   /   405802
[00:00:00] Count pairs                              ████████████████████ 405802   /   405802
[00:00:01] Compute merges                           ████████████████████ 6332     /     6332

[*] create tokenized corpus.
[*] tokenize corpus: 100%|█████████████████████| 1749902/1749902 [00:28<00:00, 61958.55it/s]
[*] split the corpus into train and test dataset.
[*] remove temporary directory.
[*] finish building corpus.

If the dataset builds successfully, you will get the following directory tree:

workspace
├── build
│   ├── corpus.raw.txt
│   ├── corpus.train.txt
│   ├── corpus.test.txt
│   └── vocab.txt
├── src
│   └── wiki.xml.bz2
└── expanda.cfg
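
Judging from the build log, corpus.raw.txt holds the merged raw text, corpus.train.txt and corpus.test.txt are the train/test splits of the tokenized corpus, and vocab.txt is the trained vocabulary. A quick sanity check with standard shell tools (the exact numbers depend on the dump you downloaded):

$ wc -l build/corpus.train.txt build/corpus.test.txt
$ head build/vocab.txt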

