
langumo

The unified corpus building environment for Language Models.



Introduction

langumo is a unified corpus-building environment for Language Models. langumo provides pipelines for building text-based datasets. Constructing a dataset requires a complicated pipeline (e.g. parsing, shuffling, and tokenization), and when corpora are collected from different sources, extracting text from the various formats becomes a problem. langumo helps you build a dataset from these diverse formats, simply and all at once.
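The pipeline stages mentioned above can be sketched in plain Python. This is a toy illustration of the parse → shuffle → tokenize flow, not langumo's actual API; the function names and whitespace tokenization here are simplifications:

```python
import random

def parse(raw_docs):
    # Stand-in for format-specific parsing (e.g. extracting text
    # from wiki markup): normalize each raw document to plain text.
    return [doc.strip().lower() for doc in raw_docs]

def shuffle(lines, seed=42):
    # Shuffle with a fixed seed so builds are reproducible.
    rng = random.Random(seed)
    shuffled = list(lines)
    rng.shuffle(shuffled)
    return shuffled

def tokenize(lines):
    # Whitespace splitting stands in for subword tokenization.
    return [line.split() for line in lines]

corpus = ["Hello World", "Corpus Building", "Language Models"]
tokens = tokenize(shuffle(parse(corpus)))
assert len(tokens) == 3
```

langumo chains stages like these internally, so you only describe the inputs and build options in a configuration file.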

Main features

  • Easy to use, and simple to extend with new corpus formats.
  • Fast builds through performance optimizations (even though it is written in Python).
  • Multi-processing support for parsing corpora.
  • Very low memory usage.
  • All-in-one environment: no need to worry about the internal procedures.
  • No code required for a new corpus. Simply add it to the build configuration.

Dependencies

  • nltk
  • colorama
  • pyyaml>=5.3.1
  • tqdm>=4.46.0
  • tokenizers==0.8.1
  • mwparserfromhell>=0.5.4
  • kss==1.3.1

Installation

With pip

langumo can be installed using pip as follows:

$ pip install langumo

From source

You can install langumo from source by cloning the repository and running:

$ git clone https://github.com/affjljoo3581/langumo.git
$ cd langumo
$ python setup.py install

Quick start guide

Let's build a Wikipedia dataset. First, install langumo in your virtual environment.

$ pip install langumo

After installing langumo, create a workspace directory for the build.

$ mkdir workspace
$ cd workspace

Before creating the dataset, we need a Wikipedia dump file as the source. Various versions of the Wikipedia dumps are available from https://dumps.wikimedia.org. In this tutorial, we will use part of an English Wikipedia dump. Download the file with your browser and move it to workspace/src, or simply fetch it with wget in the terminal:

$ wget -P src https://dumps.wikimedia.org/enwiki/20200901/enwiki-20200901-pages-articles1.xml-p1p30303.bz2

langumo needs a build configuration file which describes the details of the dataset. Create a build.yml file in workspace with the following contents:

langumo:
  inputs:
  - path: src/enwiki-20200901-pages-articles1.xml-p1p30303.bz2
    parser: langumo.parsers.WikipediaParser

  build:
    parsing:
      num-workers: 8 # The number of CPU cores you have.

    tokenization:
      vocab-size: 32000 # The vocabulary size.
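Since the build configuration is plain YAML, you can sanity-check it with pyyaml (one of langumo's dependencies) before running a build. The snippet below just loads the example configuration above and inspects its values; it is a convenience check, not part of langumo's API:

```python
import yaml  # pyyaml, a langumo dependency

config_text = """
langumo:
  inputs:
  - path: src/enwiki-20200901-pages-articles1.xml-p1p30303.bz2
    parser: langumo.parsers.WikipediaParser

  build:
    parsing:
      num-workers: 8

    tokenization:
      vocab-size: 32000
"""

config = yaml.safe_load(config_text)["langumo"]

# The parser is referenced by its import path, and build options
# are grouped by pipeline stage.
print(config["inputs"][0]["parser"])  # langumo.parsers.WikipediaParser
print(config["build"]["parsing"]["num-workers"])
print(config["build"]["tokenization"]["vocab-size"])
```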

Now we are ready to create our first dataset. Run langumo!

$ langumo

Then you will see output like the following:

[*] import file from src/enwiki-20200901-pages-articles1.xml-p1p30303.bz2
[*] parse raw-formatted corpus file with WikipediaParser
[*] merge 1 files into one
[*] shuffle raw corpus file: 100%|██████████████████████████████| 118042/118042 [00:01<00:00, 96965.15it/s]
[00:00:10] Reading files (256 Mo)                   ███████████████████████████████████                 100
[00:00:00] Tokenize words                           ███████████████████████████████████ 418863   /   418863
[00:00:01] Count pairs                              ███████████████████████████████████ 418863   /   418863
[00:00:02] Compute merges                           ███████████████████████████████████ 28942    /    28942
[*] export the processed file to build/vocab.txt
[*] tokenize sentences with WordPiece model: 100%|███████████████| 236084/236084 [00:23<00:00, 9846.67it/s]
[*] split validation corpus - 23609  of 236084 lines
[*] export the processed file to build/corpus.train.txt
[*] export the processed file to build/corpus.eval.txt
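The log above shows a roughly 90/10 train/validation split (23,609 of 236,084 lines held out). A simple holdout like that can be sketched as follows; the 0.1 ratio is inferred from the output, and the exact rounding langumo uses may differ:

```python
def split_validation(lines, ratio=0.1):
    # Hold out the trailing `ratio` fraction of lines for evaluation.
    n_eval = int(len(lines) * ratio)
    return lines[:-n_eval], lines[-n_eval:]

lines = [f"sentence {i}" for i in range(100)]
train, evaluation = split_validation(lines)
print(len(train), len(evaluation))
```

The train and evaluation portions are then exported as corpus.train.txt and corpus.eval.txt respectively.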

After the build finishes, the workspace will contain the following files:

workspace
├── build
│   ├── corpus.eval.txt
│   ├── corpus.train.txt
│   └── vocab.txt
├── src
│   └── enwiki-20200901-pages-articles1.xml-p1p30303.bz2
└── build.yml

Usage

usage: langumo [-h] [config]

The unified corpus building environment for Language Models.

positional arguments:
  config      langumo build configuration

optional arguments:
  -h, --help  show this help message and exit

Documentation

You can find the langumo documentation on the website.

License

langumo is Apache-2.0 Licensed.


