🛠️ datatools: Simple utilities for common data actions
Minimal scripts and reusable functions for implementing common data operations (tokenization, splitting, subsampling, packing, and more).
Built with special support for Mosaic Streaming Datasets (MDS).
Installation
Clone this repo and install via `pip install -e .`, or install from PyPI via `pip install datatools-py`.
Installation Options
- Core installation (without Hugging Face datasets support):
  `pip install datatools-py`
- Full installation (with Hugging Face datasets support):
  `pip install datatools-py[datasets]` or `pip install datatools-py[full]`
The core installation includes all necessary dependencies for working with MDS (Mosaic Streaming Datasets), JSONL, and NumPy files. The Hugging Face `datasets` library is only required if you need to load Hugging Face datasets, Arrow, or Parquet files.
Library
datatools provides core libraries that can be used to easily build custom data pipelines, specifically through `from datatools import load, process`.
Core functions
`load(path, load_options)`
Loads the dataset at the given path and automatically infers its format (e.g., compressed JSON, PyArrow, MDS, etc.) from clues such as file extensions and directory structure. It also supports MDS datasets over S3 and compressed MDS files (`.mds.zstd`, `.mds.zst`).
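For example, a minimal sketch of loading a dataset (the S3 URI and the record-access pattern shown are illustrative assumptions based on the description above):

```python
from datatools import load

# Load a local dataset; the format (JSONL, Parquet, MDS, ...) is inferred automatically
dataset = load("path/to/dataset")

# MDS datasets stored on S3 are also supported (hypothetical bucket/prefix)
# dataset = load("s3://my-bucket/path/to/mds-dataset")

print(len(dataset))   # number of records
print(dataset[0])     # assumed: records are accessible by index as dicts
```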
`process(input_dataset, process_fn, output_path, process_options)`
Processes an input dataset and writes the results to disk. It supports:
- Multi-processing with many CPUs, e.g. `ProcessOptions(num_proc=16)` (or the flag `-w 16`); see the sketch after this list
- Slurm array parallelization, e.g. `ProcessOptions(slurm_array=True)` (or `--slurm_array`), which automatically sets up `job_id` and `num_jobs` from Slurm environment variables
- Custom indexing, e.g. working only on a subset via `--index_range 0 30` or using a custom index file via `--index_path path/to/index.npy`; see `ProcessOptions` for details
- By default, output is written as mosaic-streaming MDS shards, which are merged into a single MDS dataset when the job finishes. The code also supports writing JSONL files (`--jsonl`) and ndarray files for each column (`--ndarray`); shards for these output formats are not merged automatically.
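As a rough sketch, the same `process` call can be configured for either local multi-processing or a Slurm array (only the options named above are assumed; everything else mirrors the tokenization example further down):

```python
from datatools import load, process, ProcessOptions

dataset = load("path/to/dataset")

def passthrough(data_subset):
    # Trivial process_fn that yields records unchanged
    for item in data_subset:
        yield item

# Local run with 16 worker processes (equivalent to the -w 16 flag)
process(dataset, passthrough, "path/to/output",
        process_options=ProcessOptions(num_proc=16))

# Inside a Slurm array job, derive job_id/num_jobs from the environment instead
# process(dataset, passthrough, "path/to/output",
#         process_options=ProcessOptions(slurm_array=True))
```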
The `process_fn` should be a function that takes one to three arguments (sketched below):
- A subset of the data with `len(...)` and `[...]` access
- The global indices corresponding to the subset (optional)
- The `process_id` for logging or sharding purposes (optional)
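For illustration, a sketch of the full three-argument form (the function and parameter names are chosen here for readability; only the positional meaning described above is assumed):

```python
def annotate_docs(data_subset, indices, process_id):
    # data_subset supports len(...) and [...] access;
    # indices gives each record's position in the full dataset
    for i in range(len(data_subset)):
        item = data_subset[i]
        yield {
            "global_index": int(indices[i]),  # assumed to be integer-like
            "process_id": process_id,         # useful for logging or sharding
            "text": item["text"],             # assumes a "text" column, as in the example below
        }
```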
Example
```python
from datatools import load, process, ProcessOptions
from transformers import AutoTokenizer

# Load dataset (can be JSON, Parquet, MDS, etc.)
dataset = load("path/to/dataset")

# Set up tokenizer and processing function
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

def tokenize_docs(data_subset):
    for item in data_subset:
        # Tokenize text and yield dicts with tokens and length
        tokens = tokenizer.encode(item["text"], add_special_tokens=False)
        # Chunk the text into 1024-token chunks
        for i in range(0, len(tokens), 1024):
            yield {
                "input_ids": tokens[i:i+1024],
                "length": len(tokens[i:i+1024])
            }

# Process dataset with 4 workers and write to disk
process(dataset, tokenize_docs, "path/to/output", process_options=ProcessOptions(num_proc=4))
```
Scripts
datatools comes with the following default scripts:
- `tokenize`: Tokenize datasets per document
- `pack`: Pack tokenized documents into fixed-length sequences
- `peek`: Print datasets as JSON to stdout
- `wrangle`: Subsample, merge datasets, make random splits (e.g., train/test/validation), etc.
- `merge_index`: Merge Mosaic streaming datasets in subfolders into a larger dataset
Run `<script> --help` for detailed arguments. Many scripts automatically include all arguments from `ProcessOptions` (e.g., number of processes `-w <processes>`) and `LoadOptions`.
Download files
Source Distribution: `datatools_py-0.2.tar.gz` (12.2 kB)
Built Distribution: `datatools_py-0.2-py3-none-any.whl` (11.4 kB)
File details
Details for the file datatools_py-0.2.tar.gz.
File metadata
- Download URL: datatools_py-0.2.tar.gz
- Upload date:
- Size: 12.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `81350f07df45acae22a038a89448772ed6e3f59656e995c5e2aad5375d8d8757` |
| MD5 | `7d7a646898a7131accc129d44f8136de` |
| BLAKE2b-256 | `44ed270b0c2001d5cc4c194657640fc5dd4ce31485fe758de8ac28b3a8d4dfda` |
File details
Details for the file datatools_py-0.2-py3-none-any.whl.
File metadata
- Download URL: datatools_py-0.2-py3-none-any.whl
- Upload date:
- Size: 11.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `10bb77cfb0ac7b57c1aff4394d2c7102c0ca81e1b76ed944e0fa598e7b3631ac` |
| MD5 | `c922a82df5a44e0201ed8f7c012a5676` |
| BLAKE2b-256 | `ef79d40597ec23bc954e0ef22b82e6ba472c2009b5b88cc60bfeb495926ca6fe` |