
langumo

The unified corpus building environment for Language Models.


Introduction

langumo is a unified corpus building environment for Language Models. It provides pipelines for building text-based datasets. Constructing a dataset requires a complicated pipeline (e.g. parsing, shuffling and tokenization), and when corpora are collected from different sources, extracting text from their various formats becomes a problem of its own. langumo helps you build one dataset from corpora in diverse formats, all at once.
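As the quick start below shows, each source is listed as an input of the build configuration together with its parser. Here is a sketch of a build that mixes two sources at once; the second path and parser name are hypothetical placeholders, not part of langumo:

langumo:
  inputs:
  - path: src/enwiki-20200901-pages-articles1.xml-p1p30303.bz2
    parser: langumo.parsers.WikipediaParser
  - path: src/another-corpus.txt       # hypothetical second source
    parser: my_project.MyCorpusParser  # hypothetical custom parser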

Main features

  • Easy to build; simple to add a new corpus format.
  • Fast builds through performance optimizations, even though it is written in Python.
  • Supports multiprocessing when parsing corpora.
  • Extremely low memory usage.
  • All-in-one environment: never mind the internal procedures!
  • No code needed for a new corpus; simply add it to the build configuration.

Dependencies

  • nltk
  • colorama
  • pyyaml>=5.3.1
  • tqdm>=4.46.0
  • tokenizers==0.8.1
  • mwparserfromhell>=0.5.4
  • kss==1.3.1

Installation

With pip

langumo can be installed using pip as follows:

$ pip install langumo

From source

You can install langumo from source by cloning the repository and running:

$ git clone https://github.com/affjljoo3581/langumo.git
$ cd langumo
$ python setup.py install
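Alternatively, since the repository ships a standard setup.py, installing the checkout with pip should work as well:

$ pip install .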

Quick start guide

Let's build a Wikipedia dataset. First, install langumo in your virtual environment.

$ pip install langumo

After installing langumo, create a workspace directory for the build.

$ mkdir workspace
$ cd workspace

Before creating the dataset, we need a Wikipedia dump file as the source of the dataset. You can get various versions of Wikipedia dump files from https://dumps.wikimedia.org. In this tutorial, we will use part of a Wikipedia dump. Download the file with your browser and move it to workspace/src, or simply fetch it with wget in the terminal:

$ wget -P src https://dumps.wikimedia.org/enwiki/20200901/enwiki-20200901-pages-articles1.xml-p1p30303.bz2

langumo needs a build configuration file that describes the details of the dataset. Create a build.yml file in workspace with the following contents:

langumo:
  inputs:
  - path: src/enwiki-20200901-pages-articles1.xml-p1p30303.bz2
    parser: langumo.parsers.WikipediaParser

  build:
    parsing:
      num-workers: 8 # The number of CPU cores you have.

    tokenization:
      vocab-size: 32000 # The vocabulary size.

Now we are ready to create our first dataset. Run langumo!

$ langumo

You should see output like the following:

[*] import file from src/enwiki-20200901-pages-articles1.xml-p1p30303.bz2
[*] parse raw-formatted corpus file with WikipediaParser
[*] merge 1 files into one
[*] shuffle raw corpus file: 100%|██████████████████████████████| 118042/118042 [00:01<00:00, 96965.15it/s]
[00:00:10] Reading files (256 Mo)                   ███████████████████████████████████                 100
[00:00:00] Tokenize words                           ███████████████████████████████████ 418863   /   418863
[00:00:01] Count pairs                              ███████████████████████████████████ 418863   /   418863
[00:00:02] Compute merges                           ███████████████████████████████████ 28942    /    28942
[*] export the processed file to build/vocab.txt
[*] tokenize sentences with WordPiece model: 100%|███████████████| 236084/236084 [00:23<00:00, 9846.67it/s]
[*] split validation corpus - 23609  of 236084 lines
[*] export the processed file to build/corpus.train.txt
[*] export the processed file to build/corpus.eval.txt

After the build finishes, workspace contains the following files:

workspace
├── build
│   ├── corpus.eval.txt
│   ├── corpus.train.txt
│   └── vocab.txt
├── src
│   └── enwiki-20200901-pages-articles1.xml-p1p30303.bz2
└── build.yml
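The outputs are plain text files: corpus.train.txt and corpus.eval.txt hold the tokenized training and evaluation corpora, and vocab.txt is the trained WordPiece vocabulary. As a quick sanity check, the vocabulary should load with the tokenizers library that langumo itself depends on. A minimal sketch, assuming build/vocab.txt is a standard WordPiece vocab file:

from tokenizers import BertWordPieceTokenizer

# Load the vocabulary produced by the build.
tokenizer = BertWordPieceTokenizer("build/vocab.txt")

# Tokenize an arbitrary sentence with the trained vocabulary.
encoding = tokenizer.encode("langumo builds corpora for language models.")
print(encoding.tokens)  # subword tokens
print(encoding.ids)     # their vocabulary ids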

Usage

usage: langumo [-h] [config]

The unified corpus building environment for Language Models.

positional arguments:
  config      langumo build configuration

optional arguments:
  -h, --help  show this help message and exit
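Since config is optional, running langumo with no arguments picks up the build configuration from the working directory, as in the quick start above. A configuration path can also be given explicitly:

$ langumo build.yml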

Documentation

You can find the langumo documentation on the website.

License

langumo is Apache-2.0 Licensed.

