Dataloader tools for language modelling
Installation:
pip install lm_dataloader
Design Philosophy
- A library to unify LM data loading at large scale
- Simple interface; any tokenizer can be integrated
- Minimal changes needed to go from small to large scale (multiple GPU nodes)
- Follows fairseq / Megatron's 'mmap' data format, but with improvements:
  - Easily combine multiple datasets
  - Unified into a single 'file' (actually a directory containing a .bin and an .idx file)
  - Index files that are built on the fly are hidden files, leaving less clutter in the directory
  - More straightforward interface, better documentation
  - Inspectable with a command line tool
  - Can load from URLs
  - Can load from S3 buckets
  - Can load from GCS buckets
  - Can tokenize on the fly instead of preprocessing
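To make the .bin / .idx idea concrete, here is a minimal sketch of a memory-mapped token store using only the Python standard library. It is not lm_dataloader's actual on-disk format: the file names data.bin and data.idx, the 16-bit token width, and the helpers write_toy_dataset and read_document are all assumptions made for illustration.

```python
import mmap
import os
import struct
import tempfile
from array import array

# Assumption: token ids fit in unsigned 16-bit ints (vocabulary under 65,536).
TOKEN_FORMAT = "H"
TOKEN_SIZE = struct.calcsize(TOKEN_FORMAT)


def write_toy_dataset(directory, documents):
    """Write all token ids to data.bin and document boundaries to data.idx."""
    os.makedirs(directory, exist_ok=True)
    offsets = [0]  # offsets[i] is the index of the first token of document i
    with open(os.path.join(directory, "data.bin"), "wb") as bin_file:
        for tokens in documents:
            array(TOKEN_FORMAT, tokens).tofile(bin_file)
            offsets.append(offsets[-1] + len(tokens))
    with open(os.path.join(directory, "data.idx"), "wb") as idx_file:
        array("Q", offsets).tofile(idx_file)


def read_document(directory, doc_index):
    """Read one document by memory-mapping data.bin rather than loading it whole."""
    offsets = array("Q")
    with open(os.path.join(directory, "data.idx"), "rb") as idx_file:
        offsets.frombytes(idx_file.read())
    start, end = offsets[doc_index], offsets[doc_index + 1]
    with open(os.path.join(directory, "data.bin"), "rb") as bin_file:
        with mmap.mmap(bin_file.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            raw = mapped[start * TOKEN_SIZE : end * TOKEN_SIZE]
    return list(array(TOKEN_FORMAT, raw))


with tempfile.TemporaryDirectory() as tmp:
    dataset_dir = os.path.join(tmp, "toy_dataset.lmd")
    write_toy_dataset(dataset_dir, [[1, 2, 3], [4, 5], [6]])
    second_document = read_document(dataset_dir, 1)
```

The index file is small and holds only offsets, so a reader can locate any document without scanning the token file, which is the property that makes this kind of layout practical for very large corpora.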
Example usage
To tokenize a dataset contained in a .jsonl file (where the text to be tokenized can be accessed under the 'text' key):
import lm_dataloader as lmdl
from transformers import GPT2TokenizerFast

jsonl_path = "test.jsonl"
output = "my_dataset.lmd"
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

lmdl.encode(
    jsonl_path,
    tokenize_fn=tokenizer.encode,
    tokenizer_vocab_size=len(tokenizer),
    output_prefix=output,
    eod_token=tokenizer.eos_token_id,
)
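For reference, the input is expected to be a JSON Lines file: one JSON object per line, with the document text under the 'text' key. The sketch below shows one way to produce and read back such a file with the standard library. The helper names write_jsonl and read_texts and the sample sentences are hypothetical, not part of lm_dataloader.

```python
import json
import os
import tempfile

# Hypothetical sample documents; each record stores its text under "text".
documents = [
    {"text": "The quick brown fox jumps over the lazy dog."},
    {"text": "Language models are trained on tokenized text."},
]


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_texts(path, key="text"):
    """Return the value stored under `key` for every non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line)[key] for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    jsonl_path = os.path.join(tmp, "test.jsonl")
    write_jsonl(jsonl_path, documents)
    texts = read_texts(jsonl_path)
```

A file written this way should be usable as the jsonl_path argument in the tokenization example.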
The lmdl.encode call creates a dataset at "my_dataset.lmd", which can be loaded as an indexed torch dataset like so:
from lm_dataloader import LMDataset
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
seq_length = tokenizer.model_max_length  # or whatever the sequence length of your model is
dataset = LMDataset("my_dataset.lmd", seq_length=seq_length)

# peek at the sample at index 0
print(dataset[0])
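The roles of eod_token and seq_length can be illustrated with a small, library-independent sketch. It shows one common way to turn tokenized documents into fixed-length training samples: join the documents into a single stream, mark each boundary with an end-of-document token, and cut the stream into windows. How LMDataset actually draws its samples is not described here, so the function pack_into_sequences and its behaviour (including dropping the trailing partial window) are assumptions for illustration only.

```python
def pack_into_sequences(tokenized_documents, seq_length, eod_token):
    """Concatenate documents, separated by eod_token, and cut into fixed windows.

    Illustrative only; this is not lm_dataloader's implementation.
    """
    stream = []
    for tokens in tokenized_documents:
        stream.extend(tokens)
        stream.append(eod_token)  # mark the end of each document
    n_full = len(stream) // seq_length  # any trailing partial window is dropped
    return [stream[i * seq_length : (i + 1) * seq_length] for i in range(n_full)]


# Hypothetical token ids; 0 stands in for the end-of-document token.
docs = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]
samples = pack_into_sequences(docs, seq_length=4, eod_token=0)
# samples == [[11, 12, 13, 0], [21, 22, 0, 31], [32, 33, 34, 0]]
```

Note that a sample may span a document boundary, which is why the end-of-document token matters: it lets the model see where one document stops and the next begins.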
Command line utilities
Command line utilities are also provided to inspect and merge datasets. For example:
lm-dataloader inspect my_dataset.lmd
launches an interactive terminal to inspect the data in my_dataset.lmd, and:
lm-dataloader merge my_dataset.lmd,my_dataset_2.lmd new_dataset.lmd
merges the datasets at "my_dataset.lmd" and "my_dataset_2.lmd" into a new dataset at "new_dataset.lmd".