Library and scripts for common LM data utilities (tokenizing, splitting, packing, ...)

These details have not been verified by PyPI

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

🛠️ datatools: Simple utilities for common data actions

Minimal scripts and reusable functions for implementing common data operations (tokenization, splitting, subsampling, packing, and more).

Built with special support for Mosaic Streaming Datasets (MDS).

Installation
Library
- Core Functions
- Example
Scripts

Installation

Clone this repo and install via `pip install -e .`, or install from PyPI via `pip install datatools-py`.

Installation Options

Core installation (without Hugging Face datasets support):
```
pip install datatools-py
```

Full installation (with Hugging Face datasets support):
```
pip install datatools-py[datasets]
# or
pip install datatools-py[full]
```

The core installation includes all dependencies needed to work with MDS (Mosaic Streaming Datasets), JSONL, and NumPy files. The Hugging Face `datasets` library is only required if you need to load Hugging Face datasets, Arrow files, or Parquet files.

Library

datatools provides core functions for building custom data pipelines, available through `from datatools import load, process`.

Core functions

`load(path, load_options)`

Loads the dataset at the given path and automatically infers its format (e.g., compressed JSON, PyArrow, MDS) from the file format and directory structure. It also supports MDS datasets over S3 and compressed MDS files (`.mds.zstd`, `.mds.zst`).
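To make the idea of format inference concrete, here is a minimal sketch of how a format could be guessed from a path. It is not the logic `load` actually uses: `guess_format`, `SUFFIX_FORMATS`, and `COMPRESSION_SUFFIXES` are hypothetical names introduced for illustration, and the only outside fact relied on is that Mosaic Streaming writes an `index.json` file alongside its MDS shards.

```python
import tempfile
from pathlib import Path

# Hypothetical suffix table for illustration; not datatools' actual rules.
SUFFIX_FORMATS = {
    ".jsonl": "jsonl",
    ".json": "json",
    ".parquet": "parquet",
    ".arrow": "arrow",
    ".npy": "numpy",
    ".mds": "mds",
}
COMPRESSION_SUFFIXES = {".gz", ".zst", ".zstd"}


def guess_format(path):
    """Guess a dataset format from a path (hypothetical helper)."""
    p = Path(path)
    # Directory clue: an MDS dataset directory contains an index.json file.
    if p.is_dir():
        return "mds" if (p / "index.json").exists() else "unknown"
    suffixes = [s.lower() for s in p.suffixes]
    # File clue: drop a trailing compression suffix, e.g. shard.mds.zst -> .mds
    if suffixes and suffixes[-1] in COMPRESSION_SUFFIXES:
        suffixes = suffixes[:-1]
    if not suffixes:
        return "unknown"
    return SUFFIX_FORMATS.get(suffixes[-1], "unknown")


print(guess_format("data/train.jsonl.gz"))   # jsonl
print(guess_format("shard.00000.mds.zst"))   # mds

# A directory is treated as MDS only if it holds an index.json.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "index.json").write_text("{}")
    print(guess_format(tmp))                 # mds
```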

`process(input_dataset, process_fn, output_path, process_options)`

Processes an input dataset and writes the results to disk. It supports:

- Multi-processing across many CPUs, e.g. `ProcessOptions(num_proc=16)` (or the flag `-w 16`)
- Slurm array parallelization, e.g. `ProcessOptions(slurm_array=True)` (or `--slurm_array`), which automatically sets up `job_id` and `num_jobs` from Slurm environment variables
- Custom indexing, e.g. working on only a subset with `--index_range 0 30`, or using a custom index file with `--index_path path/to/index.npy`

See `ProcessOptions` for details.

By default, output is written as mosaic-streaming MDS shards, which are merged into a single MDS dataset when the job finishes. The code also supports writing JSONL files (`--jsonl`) and one ndarray file per column (`--ndarray`). Shards in these two output formats are not merged automatically.
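As an illustration of what Slurm array parallelization involves, the sketch below derives a job's slice of the dataset from the standard Slurm variables `SLURM_ARRAY_TASK_ID` and `SLURM_ARRAY_TASK_COUNT`. The helper `slurm_shard` is hypothetical, and the contiguous split is an assumption; datatools may read different variables or partition the indices differently.

```python
import os


def slurm_shard(num_items, env=None):
    """Return the [start, stop) index range for this Slurm array task.

    Hypothetical helper: a contiguous split, not necessarily what
    datatools does internally.
    """
    env = os.environ if env is None else env
    # Standard variables set by Slurm inside array jobs; fall back to one job.
    job_id = int(env.get("SLURM_ARRAY_TASK_ID", 0))
    num_jobs = int(env.get("SLURM_ARRAY_TASK_COUNT", 1))
    # Assumes task IDs run from 0 to num_jobs - 1 (e.g. sbatch --array=0-3).
    per_job = -(-num_items // num_jobs)  # ceiling division
    start = min(job_id * per_job, num_items)
    stop = min(start + per_job, num_items)
    return start, stop


# Simulate task 1 of a 4-task array over 10 items.
fake_env = {"SLURM_ARRAY_TASK_ID": "1", "SLURM_ARRAY_TASK_COUNT": "4"}
print(slurm_shard(10, fake_env))  # (3, 6)
```

Passing the environment as an argument keeps the function easy to test outside a cluster; with no Slurm variables set, it returns the full range.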

The `process_fn` should be a function that takes one to three arguments:

1. A subset of the data that supports `len(...)` and `[...]` access
2. The global indices corresponding to the subset (optional)
3. The `process_id`, for logging or sharding purposes (optional)
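A sketch of a `process_fn` that accepts all three arguments might look like the following. The argument order follows the list above; the function name `chunk_docs`, the `CHUNK_SIZE` constant, and the `source_index` and `process_id` output fields are assumptions made for illustration. The function is exercised here with a plain Python list rather than through `process`.

```python
CHUNK_SIZE = 4  # hypothetical chunk length, kept small for the demo


def chunk_docs(data_subset, indices=None, process_id=0):
    """Yield fixed-size token chunks, tagged with their source document."""
    # Only len(...) and [...] access are used on the subset.
    if indices is None:
        indices = range(len(data_subset))
    for i, global_index in enumerate(indices):
        tokens = data_subset[i]["input_ids"]
        for start in range(0, len(tokens), CHUNK_SIZE):
            chunk = tokens[start:start + CHUNK_SIZE]
            yield {
                "input_ids": chunk,
                "length": len(chunk),
                "source_index": global_index,  # position in the full dataset
                "process_id": process_id,      # which worker produced it
            }


# Stand-in for a subset handed to worker 2, holding documents 10 and 11.
subset = [{"input_ids": [1, 2, 3, 4, 5]}, {"input_ids": [6, 7]}]
for row in chunk_docs(subset, [10, 11], 2):
    print(row["source_index"], row["input_ids"])
```

Recording the global index in each output row is one way to trace chunks back to their source documents after the shards are merged.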

Example

```python
from datatools import load, process, ProcessOptions
from transformers import AutoTokenizer

# Load dataset (can be JSON, Parquet, MDS, etc.)
dataset = load("path/to/dataset")

# Set up the tokenizer and the processing function
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

def tokenize_docs(data_subset):
    for item in data_subset:
        # Tokenize the document text without adding special tokens
        tokens = tokenizer.encode(item["text"], add_special_tokens=False)

        # Split the tokens into chunks of at most 1024 tokens and
        # yield one record per chunk with its tokens and length
        for i in range(0, len(tokens), 1024):
            yield {
                "input_ids": tokens[i:i+1024],
                "length": len(tokens[i:i+1024])
            }

# Process dataset with 4 workers and write to disk
process(dataset, tokenize_docs, "path/to/output", process_options=ProcessOptions(num_proc=4))
```

Scripts

datatools comes with the following default scripts:

- `tokenize`: Tokenize datasets per document
- `pack`: Pack tokenized documents into fixed sequences
- `peek`: Print datasets as JSON to stdout
- `wrangle`: Subsample, merge datasets, make random splits (e.g., train/test/validation), etc.
- `merge_index`: Merge Mosaic streaming datasets in subfolders into a larger dataset

Run `<script> --help` for detailed arguments. Many scripts automatically include all arguments from `ProcessOptions` (e.g., number of processes `-w <processes>`) and `LoadOptions`.
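To show what packing tokenized documents into fixed sequences means, here is a minimal sketch of one common approach: concatenate the token lists and cut the stream into equal-length sequences. This is not the implementation of the `pack` script, whose options and behavior are described by `pack --help`; the function `pack_sequences` and the choice to drop the trailing remainder are assumptions.

```python
def pack_sequences(docs, seq_len):
    """Concatenate token lists and cut them into full seq_len-token sequences.

    Hypothetical helper for illustration only.
    """
    buffer = []
    for tokens in docs:
        buffer.extend(tokens)
        # Emit as many full sequences as the buffer currently holds.
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]
    # Leftover tokens shorter than seq_len are dropped in this sketch.


docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
print(list(pack_sequences(docs, seq_len=4)))  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

A real packing step usually also decides whether documents may cross sequence boundaries and how to mark those boundaries; this sketch ignores both questions.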

Release history

- 0.5: Oct 17, 2025 (this version)
- 0.4: Oct 17, 2025
- 0.3: Sep 10, 2025
- 0.2: Aug 19, 2025
- 0.1: Feb 19, 2025

File details

Details for the file datatools_py-0.4.tar.gz.

File metadata

Download URL: datatools_py-0.4.tar.gz
Upload date: Oct 17, 2025
Size: 13.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for datatools_py-0.4.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `2e3a5184dff0e0d73ee353ffac01f4820364c6ab3403720d1e8be47c46ae9b5f` |
| MD5 | `eb093fe832042768787aed34cb83d34c` |
| BLAKE2b-256 | `0f2d004a381c835198d0c12dd60180ec5704e5a90fc87c20fa60a501c46daeec` |


File details

Details for the file datatools_py-0.4-py3-none-any.whl.

File metadata

Download URL: datatools_py-0.4-py3-none-any.whl
Upload date: Oct 17, 2025
Size: 12.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for datatools_py-0.4-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `2370b9ea7e10855e6cd065e68246b1364113a604f4dac6361e8d16d0dd38c8be` |
| MD5 | `5542c5885b66d4b98ca0a47f77da5c21` |
| BLAKE2b-256 | `4e739dd5af106a07878ed72b1710dd06c3559805de64e78777a06faa8f8442e9` |

