
Python package for a high-performance C++ dataloader

Project description

The Catid Dataloader

"I couldn't think of a better name." -catid

Optimized C++ dataloader for tokenized datasets, with simple Python bindings. Tested as part of my own language model training scripts and now released separately for others to use.

Features:

  • All operations are fully pipelined with training for zero delay.
  • Compatible with any n_vocab size.
  • Uses Zstd with byte planes to save a ton of disk space.
  • Negligible memory usage (does not use memory-mapped files).
  • Fast random-access disk IO achieved using liburing.
  • All operations are fully parallelized for speed.
  • Uses an index file to accelerate data lookup.
  • Hash to verify entire dataset integrity quickly.
  • Supports fast checkpoint resume by skipping ahead a specified number of steps without re-reading.
  • Short strings are concatenated and separated by padding tokens to improve training throughput (see the sketch after this list).
  • Supports seeded random access for reproducibility.
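
To make the packing behavior concrete, here is a minimal illustrative sketch (not the library's actual implementation; the function and variable names are made up) of concatenating short tokenized strings into fixed-size rows separated by a padding token:

# Illustrative sketch only -- not the library's internal packing code.
# Short tokenized strings are appended to the current row, separated by a
# padding token, until the row would overflow; the row is then padded out
# to context_size and a new row is started.
# (Assumes each string fits within context_size.)
def pack_rows(tokenized_strings, context_size, padding_token=-1):
    rows, current = [], []
    for tokens in tokenized_strings:
        extra = len(tokens) + (1 if current else 0)  # +1 for the separator
        if current and len(current) + extra > context_size:
            current += [padding_token] * (context_size - len(current))
            rows.append(current)
            current = []
        if current:
            current.append(padding_token)  # separator between strings
        current.extend(tokens)
    if current:
        current += [padding_token] * (context_size - len(current))
        rows.append(current)
    return rows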

Benchmark results:

When pipelined with training, the dataloader takes approximately 0.01 milliseconds to read each microbatch, so basically it adds no delay to training.

Per 12 CPU cores, on an SSD with a (huge) micro-batch of 128 rows and a context size of 8192, expect about 6.25 milliseconds of read time per micro-batch (measured from Python).
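
For a rough sanity check on those numbers (a back-of-the-envelope calculation, not part of the library):

# Back-of-the-envelope throughput from the figures above.
micro_batch_size = 128           # rows per micro-batch
context_size = 8192              # tokens per row
read_ms = 6.25                   # measured read time per micro-batch

tokens_per_batch = micro_batch_size * context_size          # 1,048,576 tokens
tokens_per_second = tokens_per_batch / (read_ms / 1000.0)   # ~168 million tokens/s
print(f"{tokens_per_second / 1e6:.0f}M tokens/s")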

Installation

Install the catid_dataloader pip package:

# Optional: Create a conda environment
conda create -n dataloader python=3.10 -y && conda activate dataloader

# Install build dependencies
sudo apt install build-essential cmake

# Install the package
pip install -U catid_dataloader

Verify that it is working:

git clone https://github.com/catid/dataloader.git
cd dataloader

python test_catid_dataloader.py

Usage

From Python it looks like this:

from catid_dataloader import DataLoader, DataVerifier, EpochConfig

loader = DataLoader(data_path)

config = EpochConfig()
config.seed0 = 1234 # Seed for shuffling the data.
config.seed1 = 5678 # Second shuffle seed.
config.local_rank = 0 # GPU index on this node.
config.local_rank_count = 2 # Number of GPUs on this node.
config.padding_token = -1 # Token to use for padding.
config.micro_batch_size = 128 # Number of rows per micro-batch.
config.context_size = 8192 # Number of tokens per row.
# Minimum number of tokens in a string. Short strings above this size are concatenated and separated by padding tokens.
config.min_data_length = 32
config.start_step = 0

loader.begin_epoch(config)

while True:
    batch, is_cont, step, total_steps = loader.get_micro_batch()
    if batch is None:
        print("Dataset exhausted")
        break

The get_micro_batch() method returns four values:

  • The batch is a numpy tensor containing the tokenized text.
  • The is_cont flag is set per batch row and is True where that row's data continues from the previous step.
  • The step value is the current step (starting from 0, always less than total_steps), and total_steps is the total number of steps in the dataset.

Each value returned will be None if the dataset is exhausted.
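
As a hedged sketch of how these values might be consumed (the next-token shift and print are just examples, not prescribed by the library):

# Sketch of a consuming loop.
while True:
    batch, is_cont, step, total_steps = loader.get_micro_batch()
    if batch is None:
        break
    # batch is a numpy array of token IDs, one row per sequence.
    loss_mask = batch != config.padding_token   # e.g. ignore padding positions in the loss
    inputs, targets = batch[:, :-1], batch[:, 1:]
    print(f"step {step + 1}/{total_steps}: rows={batch.shape[0]}, continued={is_cont}")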

You can start from a specific step by setting the start_step parameter in the EpochConfig object, for resuming training from a checkpoint. Note that the context_size, micro_batch_size, min_data_length, local_rank, local_rank_count, seed0 and seed1 parameters must match the values used when the checkpoint was created; otherwise the data at each step will not match the previous training run.
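
For example, to resume the run configured above at step 1000 (a sketch; the step number is arbitrary, and every other field must match the original run):

resume = EpochConfig()
resume.seed0 = 1234              # must match the original run
resume.seed1 = 5678              # must match the original run
resume.local_rank = 0            # must match the original run
resume.local_rank_count = 2      # must match the original run
resume.padding_token = -1
resume.micro_batch_size = 128    # must match the original run
resume.context_size = 8192       # must match the original run
resume.min_data_length = 32      # must match the original run
resume.start_step = 1000         # skip ahead without re-reading earlier steps

loader.begin_epoch(resume)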

Sharding

The provision directory contains scripts to download and shard datasets on multiple remote hosts using this package.

Manual Installation from Source

Build and install the catid_dataloader pip package:

sudo apt install build-essential cmake

conda create -n dataloader python=3.10 -y && conda activate dataloader

./install.sh

Verify that it is working:

python test_catid_dataloader.py

Debugging

There are a few unit tests that can be run to verify the correctness of the code. To run them, first build the project:

sudo apt install build-essential cmake

rm -rf build
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Debug ..
make -j

./test_compressor
./test_tools
./test_worker_pool
./test_mapped_file
./test_uring_file

./test_data_prep
./test_data_loader

Credits

The latest versions of third-party libraries (such as Zstd and liburing) are included.


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

catid_dataloader-0.2.0.tar.gz (2.7 MB)


File details

Details for the file catid_dataloader-0.2.0.tar.gz.

File metadata

  • Download URL: catid_dataloader-0.2.0.tar.gz
  • Upload date:
  • Size: 2.7 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.10.14

File hashes

Hashes for catid_dataloader-0.2.0.tar.gz:

  • SHA256: e0a1c3fceba7a2c1b2ee30b152a2aab6e301b3557ed8529dd5e7c239bbbfb844
  • MD5: 265ed193ba10a203844c3171142e4f3e
  • BLAKE2b-256: 3e6ca12d2a0b175ecbcda104e5040d1a918409140d3d03bb6408f7a16d0e9238

See more details on using hashes here.
