langumo
The unified corpus building environment for Language Models.
Introduction
langumo is a unified corpus building environment for Language Models.
langumo provides pipelines for building text-based datasets. Constructing a dataset requires a complicated pipeline (e.g. parsing, shuffling, and tokenization). Moreover, when corpora are collected from different sources, extracting text from their various formats becomes a problem. langumo helps you build a dataset from corpora in diverse formats, all at once.
Main features
- Easy to build, simple to add a new corpus format.
- Fast building through performance optimizations (even though it is written in Python).
- Supports multi-processing in parsing corpora.
- Extremely low memory usage.
- All-in-one environment. Never mind the internal procedures!
- No need to write code for a new corpus; simply add it to the build configuration (see the sketch after this list).
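For example, corpora from different sources can be combined by listing each file under inputs with an appropriate parser, and langumo merges them into a single dataset. A minimal sketch, reusing the WikipediaParser entry from the quick start below; the second path and parser name are hypothetical placeholders, not part of langumo:
langumo:
  inputs:
  - path: src/enwiki-20200901-pages-articles1.xml-p1p30303.bz2
    parser: langumo.parsers.WikipediaParser
  - path: src/another-corpus.txt # hypothetical second source
    parser: your_package.YourParser # hypothetical; any importable parser class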
Dependencies
- nltk
- colorama
- pyyaml>=5.3.1
- tqdm>=4.46.0
- tokenizers==0.8.1
- mwparserfromhell>=0.5.4
- kss==1.3.1
Installation
With pip
langumo can be installed using pip as follows:
$ pip install langumo
From source
You can install langumo from source by cloning the repository and running:
$ git clone https://github.com/affjljoo3581/langumo.git
$ cd langumo
$ python setup.py install
Quick start guide
Let's build a Wikipedia dataset. First, install langumo in your virtual environment.
$ pip install langumo
After installing langumo, create a workspace to use for the build.
$ mkdir workspace
$ cd workspace
Before creating the dataset, we need a Wikipedia dump file (the source of the dataset). You can get various versions of Wikipedia dump files from https://dumps.wikimedia.org/. In this tutorial, we will use a part of an English Wikipedia dump file. Download the file with your browser and move it to workspace/src, or simply use wget to fetch the file in the terminal:
$ wget -P src https://dumps.wikimedia.org/enwiki/20200901/enwiki-20200901-pages-articles1.xml-p1p30303.bz2
langumo needs a build configuration file which contains the details of the dataset. Create a build.yml file in workspace and write the following:
langumo:
  inputs:
  - path: src/enwiki-20200901-pages-articles1.xml-p1p30303.bz2
    parser: langumo.parsers.WikipediaParser

  build:
    parsing:
      num-workers: 8 # The number of CPU cores you have.

    tokenization:
      vocab-size: 32000 # The vocabulary size.
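The num-workers comment suggests matching the worker count to your machine's CPU cores. A quick way to check that number, using only Python's standard library:
import os

# Print the number of CPU cores; use this value for num-workers.
print(os.cpu_count())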
Now we are ready to create our first dataset. Run langumo!
$ langumo
Then you can see the below outputs:
[*] import file from src/enwiki-20200901-pages-articles1.xml-p1p30303.bz2
[*] parse raw-formatted corpus file with WikipediaParser
[*] merge 1 files into one
[*] shuffle raw corpus file: 100%|██████████████████████████████| 118042/118042 [00:01<00:00, 96965.15it/s]
[00:00:10] Reading files (256 Mo) ███████████████████████████████████ 100
[00:00:00] Tokenize words ███████████████████████████████████ 418863 / 418863
[00:00:01] Count pairs ███████████████████████████████████ 418863 / 418863
[00:00:02] Compute merges ███████████████████████████████████ 28942 / 28942
[*] export the processed file to build/vocab.txt
[*] tokenize sentences with WordPiece model: 100%|███████████████| 236084/236084 [00:23<00:00, 9846.67it/s]
[*] split validation corpus - 23609 of 236084 lines
[*] export the processed file to build/corpus.train.txt
[*] export the processed file to build/corpus.eval.txt
After building the dataset, workspace will contain the following files:
workspace
├── build
│ ├── corpus.eval.txt
│ ├── corpus.train.txt
│ └── vocab.txt
├── src
│ └── enwiki-20200901-pages-articles1.xml-p1p30303.bz2
└── build.yml
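The files under build are plain text: vocab.txt holds the learned WordPiece vocabulary, and corpus.train.txt and corpus.eval.txt hold the tokenized training and evaluation text. Below is a minimal sketch of consuming them, assuming vocab.txt stores one token per line and the corpus files store whitespace-separated tokens (these layouts are assumptions, not guarantees from langumo):
# A minimal sketch (not part of langumo) for consuming the build artifacts.
# Assumption: vocab.txt stores one token per line (ids assigned by line
# order) and the corpus files store whitespace-separated tokens.

# Build a token -> id mapping from the vocabulary file.
with open("build/vocab.txt", encoding="utf-8") as f:
    vocab = {token.rstrip("\n"): idx for idx, token in enumerate(f)}

# Read the first tokenized sentence from the training corpus.
with open("build/corpus.train.txt", encoding="utf-8") as f:
    tokens = f.readline().split()

# Map tokens to ids, falling back to the [UNK] id for unknown tokens.
unk_id = vocab.get("[UNK]")
ids = [vocab.get(token, unk_id) for token in tokens]
print(tokens[:5], ids[:5])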
Usage
usage: langumo [-h] [config]
The unified corpus building environment for Language Models.
positional arguments:
config langumo build configuration
optional arguments:
-h, --help show this help message and exit
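For example, to run a build with an explicit configuration file:
$ langumo build.yml
In the quick start above, langumo was run without arguments; the optional config argument lets you point at a specific build configuration file instead.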
Documentation
You can find the langumo documentation on the website.
License
langumo is Apache-2.0 licensed.
File details
Details for the file langumo-0.2.0.tar.gz.
File metadata
- Download URL: langumo-0.2.0.tar.gz
- Upload date:
- Size: 20.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/47.1.0 requests-toolbelt/0.9.1 tqdm/4.50.0 CPython/3.7.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | fc4d5209e8f283ddc3f424543b3a7e6c7037aa7a2ab7c0bd1e457b60b2c6405c
MD5 | a57eacdb6eac7f3db5e349ad880d103a
BLAKE2b-256 | cbb487206abdbc9ca4806f66fdb9a441d0586e51a107977ae01e3c514c8df3a9
File details
Details for the file langumo-0.2.0-py3-none-any.whl.
File metadata
- Download URL: langumo-0.2.0-py3-none-any.whl
- Upload date:
- Size: 27.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/47.1.0 requests-toolbelt/0.9.1 tqdm/4.50.0 CPython/3.7.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | 1be64ef6fa02857f1c52de663a5a03d1fbeb41efcbfa885433b6d0fcbb04a346
MD5 | acb5199ea118f1374abff52b0d4984d0
BLAKE2b-256 | 4be6a54bd0dd9b5dc206092ad2ee776accb94bb19627461e84caa50804dfd3ee