
langumo

The unified corpus building environment for Language Models.


Introduction

langumo is a unified corpus building environment for Language Models. It provides pipelines for building text-based datasets. Constructing a dataset requires a complicated pipeline (e.g. parsing, shuffling and tokenization), and when corpora are collected from different sources, extracting data from the various formats at the same time becomes a problem. langumo makes it simple to build a single dataset from corpora in diverse formats, all at once.

Main features

  • Easy to build, simple to add a new corpus format.
  • Fast building through performance optimizations (even though it is written in Python).
  • Supports multi-processing when parsing corpora.
  • Extremely low memory usage.
  • All-in-one environment. Never mind the detailed procedures!
  • No need to write code for a new corpus. Simply add it to the build configuration.

Dependencies

  • nltk
  • colorama
  • pyyaml>=5.3.1
  • tqdm>=4.46.0
  • tokenizers>=0.8.0
  • mwparserfromhell>=0.5.4
  • kss==1.3.1

Installation

With pip

langumo can be installed using pip as follows:

$ pip install langumo

From source

You can install langumo from source by cloning the repository and running:

$ git clone https://github.com/affjljoo3581/langumo.git
$ cd langumo
$ python setup.py install

Quick start guide

Build your first dataset

Let's build a Wikipedia dataset. First, install langumo in your virtual environment.

$ pip install langumo

After installing langumo, create a workspace directory to use during the build.

$ mkdir workspace
$ cd workspace

Before creating the dataset, we need a Wikipedia dump file (which will be the source of our dataset). You can get various versions of Wikipedia dump files from the Wikimedia dumps site (https://dumps.wikimedia.org/). In this tutorial, we will use a part of a Wikipedia dump file. Download the file with your browser and move it to the workspace/src directory, or simply fetch the file with wget in the terminal:

$ mkdir src
$ wget -P src https://dumps.wikimedia.org/enwiki/20200901/enwiki-20200901-pages-articles1.xml-p1p30303.bz2

To build the dataset, langumo needs a build configuration file which contains the details of the dataset. Create a build.yml file in the workspace directory and write the following to it:

langumo:
  inputs:
  - path: src/enwiki-20200901-pages-articles1.xml-p1p30303.bz2
    parser: langumo.parsers.WikipediaParser

  build:
    parsing:
      num-workers: 8 # The number of CPU cores you have.

    tokenization:
      vocab-size: 32000 # The vocabulary size.

Now we are ready to create our first dataset. Run langumo!

$ langumo

Then you will see output like the following:

[*] import file from src/enwiki-20200901-pages-articles1.xml-p1p30303.bz2
[*] parse raw-formatted corpus file with WikipediaParser
[*] merge 1 files into one
[*] shuffle raw corpus file: 100%|██████████████████████████████| 118042/118042 [00:01<00:00, 96965.15it/s]
[00:00:10] Reading files (256 Mo)                   ███████████████████████████████████                 100
[00:00:00] Tokenize words                           ███████████████████████████████████ 418863   /   418863
[00:00:01] Count pairs                              ███████████████████████████████████ 418863   /   418863
[00:00:02] Compute merges                           ███████████████████████████████████ 28942    /    28942
[*] export the processed file to build/vocab.txt
[*] tokenize sentences with WordPiece model: 100%|███████████████| 236084/236084 [00:23<00:00, 9846.67it/s]
[*] split validation corpus - 23609  of 236084 lines
[*] export the processed file to build/corpus.train.txt
[*] export the processed file to build/corpus.eval.txt

After building the dataset, the workspace directory will contain the following files:

workspace
├── build
│   ├── corpus.eval.txt
│   ├── corpus.train.txt
│   └── vocab.txt
├── src
│   └── enwiki-20200901-pages-articles1.xml-p1p30303.bz2
└── build.yml

Write a custom Parser

langumo supports custom Parsers so that corpora in various formats can be used in building. In this tutorial, we are going to see how to build the Amazon Review Data (2018) dataset with langumo.

The basic form of a Parser class is as follows:

from typing import Iterable
import langumo.building
import langumo.utils

class AmazonReviewDataParser(langumo.building.Parser):
    def extract(self, raw: langumo.utils.AuxiliaryFile) -> Iterable[str]:
        # Yield raw articles or documents from the input file.
        pass

    def parse(self, text: str) -> str:
        # Convert a single raw article into plain text.
        pass

The extract method yields articles or documents from the raw-formatted file, and the parse method returns the parsed content of each raw article yielded by extract.

To implement the parser, let's analyse the Amazon Review Data (2018) dataset. Its data format is one review per line in JSON (i.e. JSON Lines). That is, each line is a JSON-formatted review like the following:

{
    "image": ["https://images-na.ssl-images-amazon.com/images/I/71eG75FTJJL._SY88.jpg"], 
    "overall": 5.0, 
    "vote": "2", 
    "verified": true, 
    "reviewTime": "01 1, 2018", 
    "reviewerID": "AUI6WTTT0QZYS", 
    "asin": "5120053084", 
    "style": {
        "Size:": "Large", 
        "Color:": "Charcoal"
    }, 
    "reviewerName": "Abbey", 
    "reviewText": "I now have 4 of the 5 available colors of this shirt... ", 
    "summary": "Comfy, flattering, discreet--highly recommended!", 
    "unixReviewTime": 1514764800
}

We only need the content of each review's reviewText field. So the parser should simply take reviewText from the JSON objects yielded by the extract method.

def parse(self, text: str) -> str:
    return json.loads(text)['reviewText']

Meanwhile, as mentioned above, reviews are separated by newlines, so the extract method should yield each line of the file. Note that the raw files are compressed in gzip format.

def extract(self, raw: langumo.utils.AuxiliaryFile) -> Iterable[str]:
    # Open in text mode ('rt') so that each yielded line is a str.
    with gzip.open(raw.name, 'rt') as fp:
        yield from fp
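
Putting the two methods together, the whole parser could look like the sketch below. The imports and class layout follow the skeleton shown earlier; details such as handling reviews without a reviewText field are omitted.

import gzip
import json
from typing import Iterable

import langumo.building
import langumo.utils

class AmazonReviewDataParser(langumo.building.Parser):
    def extract(self, raw: langumo.utils.AuxiliaryFile) -> Iterable[str]:
        # Each line of the gzip-compressed file is a single JSON review.
        with gzip.open(raw.name, 'rt') as fp:
            yield from fp

    def parse(self, text: str) -> str:
        # Keep only the review body.
        return json.loads(text)['reviewText']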

That's all! You've just implemented a parser for Amazon Review Data (2018). Now you can use this parser in a build configuration. Suppose the parser class lives in the myexample.parsers package. Here is an example build configuration:

langumo:
  inputs:
  - path: src/AMAZON_FASHION_5.json.gz
    parser: myexample.parsers.AmazonReviewDataParser

  # other configurations...

Usage

Command-line usage

usage: langumo [-h] [config]

The unified corpus building environment for Language Models.

positional arguments:
  config      langumo build configuration

optional arguments:
  -h, --help  show this help message and exit

Details of build configuration

Every build configuration file contains the langumo namespace at the top level. The subsections below describe each option; a full example combining them appears at the end of this section.

langumo.workspace

The path of the workspace directory where temporary files are saved. It is deleted automatically after building the dataset. Default: tmp

langumo.inputs

The list of input corpus files. Each item contains path and parser, which specify the input file path and the fully qualified class name of its parser respectively.

langumo.outputs

langumo creates a trained vocabulary file for the WordPiece tokenizer, as well as tokenized datasets for training and evaluation. You can configure the output paths in this section.

  • vocabulary: The output path of the trained vocabulary file. Default: build/vocab.txt
  • train-corpus: The output path of the split dataset for training. Default: build/corpus.train.txt
  • eval-corpus: The output path of the split dataset for evaluation. Default: build/corpus.eval.txt

langumo.build.parsing

After each article is parsed to plain text by a Parser, langumo automatically splits the article into groups so that their lengths fit the given limits. You can configure the details of parsing raw-formatted corpora.

  • num-workers: The number of Parser.parse processes. We recommend setting this to the number of CPU cores. Default: 1
  • language: The language of your dataset. langumo will load the corresponding sentence tokenizer to split articles into groups. Default: en
  • newline: The delimiter of paragraphs. Precisely, all line-break characters in articles are replaced with this token, because the contents themselves are also separated by line-break characters. Default: [NEWLINE]
  • min-length: The minimum length of each content group. Default: 0
  • max-length: The maximum length of each content group. Default: 1024

langumo.build.splitting

Language models are trained on a train dataset and evaluated on an evaluation dataset. Usually the evaluation dataset is taken from the (extremely large) train dataset to preserve the domain. You can configure the size of the evaluation dataset.

  • validation-ratio: The ratio of the evaluation dataset to the train dataset. Default: 0.1

langumo.build.tokenization

You can configure the details of both training the tokenizer and tokenizing sentences.

  • prebuilt-vocab: The path of a prebuilt vocabulary file. If you want to use a prebuilt vocabulary rather than train a new tokenizer, specify its path in this option. Note that subset-size, vocab-size and limit-alphabet are then ignored.
  • subset-size: The size of the subset of the dataset used to train the tokenizer. It is not efficient to train a tokenizer on the whole dataset, and using a subset does not hurt if the dataset is well shuffled. Default: 1000000000
  • vocab-size: The vocabulary size. Default: 32000
  • limit-alphabet: The maximum number of different characters to keep in the alphabet. Default: 1000
  • unk-token: The token to replace unknown subwords. Default: [UNK]
  • special-tokens: The list of special tokens. They are not split into subwords. We recommend adding the langumo.build.parsing.newline token. Default: [START], [END], [PAD], [NEWLINE]
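
Putting these options together, a full build configuration might look like the sketch below. The values mirror the defaults listed above (except num-workers), and the input path and parser are just placeholders for your own corpus.

langumo:
  workspace: tmp

  inputs:
  - path: src/my-corpus.txt
    parser: langumo.parsers.EscapedStringParser

  outputs:
    vocabulary: build/vocab.txt
    train-corpus: build/corpus.train.txt
    eval-corpus: build/corpus.eval.txt

  build:
    parsing:
      num-workers: 8 # Default is 1; set to the number of CPU cores you have.
      language: en
      newline: '[NEWLINE]'
      min-length: 0
      max-length: 1024

    splitting:
      validation-ratio: 0.1

    tokenization:
      subset-size: 1000000000
      vocab-size: 32000
      limit-alphabet: 1000
      unk-token: '[UNK]'
      special-tokens: ['[START]', '[END]', '[PAD]', '[NEWLINE]']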

Builtin Parsers

langumo provides built-in Parsers to use popular datasets directly, without creating new Parsers.

WikipediaParser (langumo.parsers.WikipediaParser)

Wikipedia articles are written in MediaWiki code. You can simply use any version of the Wikipedia dump files with this parser. It internally uses the mwparserfromhell library.

EscapedStringParser (langumo.parsers.EscapedStringParser)

In the json package, there is an encode_basestring function which escapes text into a JSON-style string. For example,

Harry Potter and the Sorcerer's Stone 

CHAPTER ONE 

THE BOY WHO LIVED 

Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much.

is escaped as below:

"Harry Potter and the Sorcerer's Stone \n\nCHAPTER ONE \n\nTHE BOY WHO LIVED \n\nMr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much."

As you can see, the multi-line content is collapsed into a single line. This parser handles newline-separated contents escaped by json.encoder.encode_basestring. If you want to use your own custom dataset with langumo, consider this format.
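
For example, a minimal sketch of preparing a custom dataset in this format might look like the following. The output path and the documents list are placeholders, and the assumption here is that each escaped document occupies exactly one line of the file.

import json.encoder

documents = [
    "Harry Potter and the Sorcerer's Stone \n\nCHAPTER ONE \n\nTHE BOY WHO LIVED \n\n...",
]

with open('src/my-corpus.txt', 'w') as fp:
    for doc in documents:
        # encode_basestring escapes line breaks (and other special characters)
        # so that each document is written as a single line.
        fp.write(json.encoder.encode_basestring(doc) + '\n')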

License

langumo is Apache-2.0 Licensed.

