dataio

High-performance data processing library for ML workloads.

A Rust/PyO3 data plane for Python ML — typed pipelines, async IO, zero-copy DLPack handoff to PyTorch / NumPy / JAX.
- Zero-copy handoff to torch / numpy / jax
- Slot-write fast path: stackable transforms write straight into the batch buffer
- ~1.85-2× faster than torch.utils.data.DataLoader on typical workloads
Install
pip install dataio-rs
Quickstart
import dataio

dataset = (
    dataio.from_files("data/images/**/*.jpg")
    .shuffle(seed=42)
    .decode_image(mode="rgb")
    .resize_short(256)
    .center_crop(224)
    .normalize("imagenet")
)

with dataset.batched(128).load() as loader:
    for batch in loader:
        train_step(batch.tensor, batch.keys)
load() returns a DataLoader with sane defaults (output="torch", concurrency=16, prefetch=4, pin_memory="auto"). batch.tensor / batch.tensors lazily wrap the batch for the requested framework via DLPack; batch.metadata, batch.keys, batch.indices, and batch.errors track the surviving samples.
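For example, the same chain can feed a NumPy loop by overriding those defaults — a minimal sketch that uses only the knobs and batch fields described above:

# Same dataset as in the quickstart, but yielding NumPy arrays.
with dataset.batched(128).load(output="numpy", concurrency=8) as loader:
    for batch in loader:
        images = batch.tensor      # lazily DLPack-wrapped for the "numpy" output
        keys = batch.keys          # keys of the surviving samples
        if batch.errors:           # per-sample failures, if any were dropped
            print(batch.errors[0])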
Custom batching
When the chain doesn't fit (bucket-uniform sampling, joint task batching,
length-aware packing, multi-output samples), drive the loader with your
own factory — from_batches(fn) is a pass-through and does no rebatching:
def batches():  # zero-arg factory ⇒ multi-epoch
    for chunk in batch_sampler:
        yield [my_to_sample(r) for r in chunk]

with dataio.from_batches(batches).load() as loader:
    for batch in loader:
        ...
For pre-built individual samples without your own batching:
dataio.from_samples(it).batched(N).load().
API
| Constructor / method | Returns | Notes |
|---|---|---|
| dataio.from_files(glob) | Dataset | typed chain (decode/transform/batched) |
| dataio.from_records(records) | Dataset | |
| dataio.from_manifest(path) | Dataset | |
| dataio.from_batches(fn) | Batches | pass-through, custom batching |
| dataio.from_samples(it) | Samples | auto-batched via .batched(N) |
| Dataset.batched(N) | Batches | chain terminal |
| Samples.batched(N) | Batches | |
| Batches.load(**knobs) | DataLoader | runtime (iterable + lifecycle) |
load() knobs: output ∈ {"torch","numpy","jax","dlpack"},
concurrency, prefetch, order ∈ {"submit","completion"},
pin_memory ∈ {True, False, "auto"}, ragged ∈ {"list","error","skip"},
failure_policy ∈ {"drop","raise"}, min_survivors.
A Batches spec carries no runtime state — it is cheap to build, hold, and
pass around; resources are allocated only at .load(...).
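Since nothing is allocated until .load(...), one spec can be reused across runs with different runtime knobs — a small sketch using only the calls and knobs listed above:

# Build the batching spec once; it holds no runtime state.
spec = dataio.from_files("data/images/**/*.jpg").decode_image(mode="rgb").batched(64)

# Training pass: PyTorch tensors, batches yielded as they complete.
with spec.load(output="torch", order="completion", failure_policy="drop") as loader:
    for batch in loader:
        ...

# Later, an evaluation pass over the same spec with different knobs.
with spec.load(output="numpy", prefetch=2) as loader:
    for batch in loader:
        ...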
For multi-output samples, archive entry reads, or hand-built pipelines:
dataio.lib.{Sample, Pipeline, Source, Decoder, Transform, BytesOp}.
See python/dataio/lib.pyi.
Errors and diagnostics
loader = dataset.batched(64).load(failure_policy="drop")
batch = next(iter(loader))
print(batch.errors) # [{index, key, stage, message}, ...]
print(loader.diagnostics()) # samples_submitted/completed/failed, queue stats
failure_policy="raise" (with min_survivors=N) aborts the run when
fewer than N samples in a batch survive.
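A hedged sketch of the strict mode (the exact exception type raised on abort isn't documented here, so a broad except is used):

loader = dataset.batched(64).load(failure_policy="raise", min_survivors=60)
try:
    for batch in loader:
        train_step(batch.tensor, batch.keys)
except Exception as exc:  # run aborts once a batch keeps fewer than 60 survivors
    print("aborted:", exc)
    print(loader.diagnostics())  # samples_submitted/completed/failed, queue stats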
Benchmark
Synthetic benchmark: 1024 RGB images, batch size 32, workers/concurrency 8, CPU-only:
| Loader | samples/s | speedup | p99 |
|---|---|---|---|
| torch.utils.data.DataLoader | 3,066 | 1.00× | 38 ms |
| dataio (typed chain) | 5,667 | 1.85× | 9 ms |
On real I/O-bound workloads (where S3/R2 fetches dominate), the absolute samples/s gap shrinks, but the tail-latency benefit remains and the trainer thread is still freed from data-side blocking.
uv run python benches/bench_loader_matrix.py --image-dir /path/to/images
# or --synthetic /tmp/imgs --n 1024
Development
cargo test
uvx ruff check python examples benches
uv run python -m unittest discover -s python/tests -p 'test_*.py'
uvx maturin build --release
# Editable install for local hacking:
uv run --with maturin maturin develop --release
CI runs the lint job (cargo fmt, cargo clippy, uvx ruff) and the
builder job (cargo test, wheel build, install, Python unittest) on
every push.
Examples in examples/ cover the Dataset chain, archive reads, .npy, and
.safetensors. The full PyO3 surface is in python/dataio/lib.pyi.
Source: https://github.com/Mikubill/dataio-rs · PyPI: https://pypi.org/project/dataio-rs/
File details

Details for the file dataio_rs-0.2.1.tar.gz.

- Size: 208.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.8

| Algorithm | Hash digest |
|---|---|
| SHA256 | 8522258405fbc01ba2d13d34b0f0732d2ae3ad84f4a6b8515f69dde6ee99c5f6 |
| MD5 | db5d31bb5704f5477cfc3614c4b3f981 |
| BLAKE2b-256 | f17b767ac032108d13189bac2cd9e1d6f910b3277202596b246714101de129ee |
File details

Details for the file dataio_rs-0.2.1-cp39-abi3-manylinux_2_38_x86_64.whl.

- Size: 8.5 MB
- Tags: CPython 3.9+, manylinux: glibc 2.38+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.8

| Algorithm | Hash digest |
|---|---|
| SHA256 | 58a8a394ceaebef3bac788450ba793833cfe8b560b583d3ef38d899dc60a987d |
| MD5 | cd7c0b62523c4d77ddd1d75f8f37bef9 |
| BLAKE2b-256 | bd6cf3f54fa25e030c20e700baef3739330a8178f45a87938e3caa5189f0f92f |