Zero-copy, hardware-accelerated robot-learning dataloader for Apple Silicon (MLX)
PyRoboFrames
Zero-copy, hardware-accelerated robot-learning dataloader for Apple Silicon.
PyRoboFrames feeds robot-learning training loops on Apple Silicon at hardware speed. It reads robot datasets (LeRobotDataset v3.0, with MCAP planned), decodes their multi-camera video on the Apple Media Engine via VideoToolbox, and hands the frames to MLX (and PyTorch-MPS) as arrays without a single CPU copy — turning the data path from the training bottleneck into a non-event.
Status: pre-alpha, under active construction. APIs will change and it is not yet on PyPI. The sections below describe the v0.1 goal — see What works today for the current state.
What works today
Implemented and tested (Rust core + Python):
- ✅ LeRobotDataset v3.0 readers: schema / cameras / fps; a per-episode index that resolves a global frame to (camera, video file, timestamp); and tabular state/action reading.
- ✅ Working dataloader (tabular): RoboFrameDataset.from_path(...).loader(...) iterates NumPy batches of observation.state / action with a buffered/quasi-random shuffle, drop_last, and seeded reproducibility. Works today on any LeRobotDataset v3.0.
- ✅ Temporal windows: LeRobot-style delta_timestamps return [batch, steps, dim] arrays.
- ✅ Decode scaffolding: the Decoder trait (batched seeks), a decoded-frame LRU cache, a frame-buffer pool, and per-platform backend selection (VideoToolbox / FFmpeg / CUDA NVDEC).
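The per-episode index from the first bullet can be pictured as a lookup over cumulative episode lengths. The sketch below is illustrative only and is not the Rust implementation: `EpisodeIndex`, `resolve`, and the single fixed `fps` are hypothetical, and it assumes every episode's frames are contiguous in the global numbering. A real index would also carry the camera and video file for each episode.

```python
from bisect import bisect_right
from itertools import accumulate


class EpisodeIndex:
    """Hypothetical index: global frame -> (episode, local frame, timestamp in seconds)."""

    def __init__(self, episode_lengths, fps):
        self.fps = fps
        # starts[i] is the global index of the first frame of episode i.
        self.starts = [0] + list(accumulate(episode_lengths))[:-1]
        self.total = sum(episode_lengths)

    def resolve(self, global_frame):
        if not 0 <= global_frame < self.total:
            raise IndexError(global_frame)
        # Last episode whose first frame is <= global_frame.
        episode = bisect_right(self.starts, global_frame) - 1
        local = global_frame - self.starts[episode]
        return episode, local, local / self.fps


index = EpisodeIndex(episode_lengths=[100, 50, 75], fps=25)
episode, local, timestamp = index.resolve(149)  # last frame of the second episode
```

The binary search keeps each lookup logarithmic in the number of episodes, which matters when a shuffled loader resolves one global frame per sample.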
Not usable yet (in progress):
- 🚧 Video frames: the VideoToolbox (macOS), FFmpeg, and CUDA NVDEC (Linux) decoders are still feature-gated stubs; the decode integration into the pipeline is already done and tested.
- 🚧 Zero-copy MLX output (the Apple-Silicon differentiator).
- 🚧 The validation pass (ds.validate()).
Try the working part now (state / action → NumPy)
import pyroboframes as prf

ds = prf.RoboFrameDataset.from_path("/path/to/lerobot_dataset")
print(ds)  # episodes / frames / cameras

loader = ds.loader(batch_size=64, shuffle=True)
for batch in loader:  # dict of NumPy arrays
    state = batch["observation.state"]  # [64, state_dim], float32
    action = batch["action"]  # [64, action_dim], float32
    ...  # your training step
The video/MLX dataloader shown further below is the v0.1 target, not yet shipped.
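For readers unfamiliar with the "buffered/quasi-random shuffle" mentioned under What works today, here is a minimal pure-Python sketch of the general technique. It is an assumption about the approach, not the code in the Rust core; `buffered_shuffle` and `batches` are hypothetical helper names.

```python
import random


def buffered_shuffle(items, buffer_size, seed):
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)  # a private RNG makes the order reproducible per seed
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and reuse its slot for the new item.
        slot = rng.randrange(buffer_size)
        yield buffer[slot]
        buffer[slot] = item
    # Drain whatever is left, in random order.
    rng.shuffle(buffer)
    yield from buffer


def batches(items, batch_size, drop_last):
    """Group items into lists of batch_size, optionally dropping a short final batch."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch and not drop_last:
        yield batch


order_a = list(buffered_shuffle(range(10), buffer_size=4, seed=0))
order_b = list(buffered_shuffle(range(10), buffer_size=4, seed=0))
```

A small buffer keeps reads close to sequential, which is friendly to video seeks, while a larger one approaches a full shuffle at the cost of memory.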
The problem
Robot-learning datasets store observations as MP4 video (often several cameras per episode). During training, every sample requires seeking into those videos and decoding the right frames. This decode step is the dominant cost of the data pipeline — Hugging Face's own LeRobot tracker reports training that is "completely bottlenecked by video decoding even on servers with hundreds of cores," spending more time waiting on the dataloader than on backprop (lerobot#1623).
On Apple Silicon the problem is worse, and avoidably so: the standard Python stack (torchvision / PyAV / FFmpeg software decode) runs on the CPU and leaves the dedicated Media Engine idle, then copies frames across to the GPU — copies that are pure waste on a unified-memory machine. Meanwhile the compute side (MLX, M5 Neural Accelerators) is fast and underfed.
What PyRoboFrames does
This is the v0.1 design; see What works today for what's currently built.
LeRobotDataset / MCAP         PyRoboFrames (Rust core)           your training loop
┌───────────────────┐   ┌──────────────────────────────────┐   ┌────────────────────┐
│ parquet (state /  │   │ index → sample → VideoToolbox HW │   │ MLX (Neural        │
│ action) + mp4     │──▶│ decode → IOSurface (shared mem) →│──▶│ Accelerators) or   │
│ video shards      │   │ time-synced windows, no copy     │   │ PyTorch-MPS        │
└───────────────────┘   └──────────────────────────────────┘   └────────────────────┘
- Hardware decode via Apple VideoToolbox: uses the Media Engine, not the CPU.
- Zero-copy: decoded frames live in IOSurface-backed unified memory and are wrapped as MLX arrays without a host→device transfer (there is no "device transfer" on unified memory; we stop pretending there is).
- Time-synced windows: assembles (multi-camera frames, joint state, action) windows by joining the parquet tabular data with the decoded video at matching timestamps.
- Built-in validation: flags missing frames, non-monotonic timestamps, and camera/state misalignment before they silently corrupt a training run.
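To make the validation bullet concrete, the following sketch shows two of the named checks on one camera's timestamps: non-monotonic timestamps and gaps that suggest missing frames. `check_timestamps`, its issue labels, and the gap threshold are all hypothetical; the planned ds.validate() may work differently, and camera/state misalignment is not covered here.

```python
import numpy as np


def check_timestamps(timestamps, fps, tolerance_s=1e-4):
    """Return (issue, frame_index) pairs for one camera's timestamps, in seconds."""
    ts = np.asarray(timestamps, dtype=np.float64)
    deltas = np.diff(ts)
    period = 1.0 / fps
    issues = []
    # Timestamps must strictly increase from one frame to the next.
    for i in np.flatnonzero(deltas <= 0):
        issues.append(("non_monotonic", int(i) + 1))
    # A step noticeably longer than one frame period suggests dropped frames.
    for i in np.flatnonzero(deltas > period + tolerance_s):
        issues.append(("gap", int(i) + 1))
    return issues


# Frame 3 arrives one period late and frame 4 goes backwards in time.
issues = check_timestamps([0.0, 0.1, 0.2, 0.4, 0.3], fps=10)
```

Running checks like these once, before training, is cheaper than discovering a misaligned episode through a degraded policy.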
Why a Rust core with a Python API
The audience is ML researchers, so the product is a pip-installable Python package and the Rust is invisible. Rust is the implementation language because the hot path (HW decode, IOSurface lifetime management, off-GIL prefetch, zero-copy buffer hand-off) is exactly where a safe systems language with no GIL earns its keep. The result is a fast, safe core with an ergonomic Python shell, bound together with PyO3 + maturin, and no Rust toolchain is needed to pip install it.
Installation
Not yet released. When v0.1 ships:
pip install pyroboframes # macOS / Apple Silicon wheels, no Rust toolchain needed
Wheels are built for Apple Silicon (primary target) with a portable FFmpeg fallback for other platforms.
Quickstart (planned v0.1 API)
import pyroboframes as prf

# Open a LeRobot dataset (local path or Hugging Face Hub repo id)
ds = prf.RoboFrameDataset.from_hub("lerobot/aloha_sim_insertion_human")

# Validate before training
report = ds.validate()
report.raise_if_errors()  # missing frames, timestamp gaps, cam/state mismatch

# Build a dataloader that yields MLX arrays, zero-copy, decoded on the Media Engine
loader = ds.loader(
    batch_size=64,
    cameras=["observation.images.top", "observation.images.wrist"],
    delta_timestamps={"observation.images.top": [-0.1, 0.0]},  # temporal context (LeRobot-style)
    tolerance_s=1e-4,  # snap to the nearest frame within this tolerance
    shuffle=True,
    num_workers=4,  # Rust worker pool, runs off the GIL
    output="mlx",  # or "numpy" / "torch" (MPS)
)

for batch in loader:
    frames = batch["observation.images.top"]  # mlx.core.array, already on GPU
    state = batch["observation.state"]
    action = batch["action"]
    ...  # your MLX training step
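The delta_timestamps and tolerance_s arguments above describe snapping each requested time offset to the nearest stored frame. The NumPy sketch below illustrates that idea under stated assumptions: `snap_to_frames` is a hypothetical helper, frame timestamps are sorted with at least two entries, and a query outside the tolerance is treated as an error. It is not taken from PyRoboFrames or LeRobot.

```python
import numpy as np


def snap_to_frames(frame_ts, query_ts, tolerance_s):
    """Return the index of the nearest frame for each query timestamp (seconds)."""
    frame_ts = np.asarray(frame_ts, dtype=np.float64)  # sorted, at least two frames
    query_ts = np.asarray(query_ts, dtype=np.float64)
    # Candidate neighbours on either side of each query.
    right = np.searchsorted(frame_ts, query_ts).clip(1, len(frame_ts) - 1)
    left = right - 1
    pick_left = (query_ts - frame_ts[left]) <= (frame_ts[right] - query_ts)
    nearest = np.where(pick_left, left, right)
    error = np.abs(frame_ts[nearest] - query_ts)
    if np.any(error > tolerance_s):
        raise ValueError(f"no frame within {tolerance_s} s of {query_ts[error > tolerance_s]}")
    return nearest


fps = 10
frame_ts = np.arange(50) / fps      # 0.0, 0.1, ..., 4.9
current = 2.0                       # timestamp of the sampled frame
deltas = np.array([-0.1, 0.0])      # same offsets as in the loader call above
indices = snap_to_frames(frame_ts, current + deltas, tolerance_s=1e-4)
```

With these inputs the two offsets resolve to frames 19 and 20, so the window for this camera has two steps per sample.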
Cross-platform
PyRoboFrames runs on both macOS and Linux from the same API and the same Rust core.
The platform-specific part is decode and output, selected behind a single Decoder trait:
- macOS (Apple Silicon): the optimized path, VideoToolbox hardware decode → IOSurface → zero-copy MLX. This is the differentiator.
- Linux: the same engine, decoding via FFmpeg (VAAPI where available, software otherwise) and outputting NumPy / PyTorch.
- Linux + CUDA: when CUDA libraries are present (build with --features cuda), NVIDIA NVDEC hardware decode with CUDA output for PyTorch.
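The three cases above amount to a small decision table. The Python sketch below mirrors it for illustration; `select_backend` and its return strings are hypothetical, and the real selection lives in the Rust core, where it may be decided at build time through Cargo features rather than at run time.

```python
import platform


def select_backend(system=None, machine=None, cuda_available=False):
    """Hypothetical mirror of the per-platform decode backend choice."""
    system = system or platform.system()
    machine = machine or platform.machine()
    if system == "Darwin" and machine == "arm64":
        return "videotoolbox"  # Apple Silicon: Media Engine decode
    if system == "Linux" and cuda_available:
        return "nvdec"  # NVIDIA hardware decode
    return "ffmpeg"  # portable fallback (VAAPI or software)


backend = select_backend()  # picks a backend for the current machine
```

Keeping the choice behind one function, or one trait in the Rust core, is what lets the Python API stay identical across platforms.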
Supported (target matrix)
| | v0.1 | Planned |
|---|---|---|
| Datasets | LeRobotDataset v3.0 | MCAP, RLDS, HDF5 |
| Decode (HW) | macOS: VideoToolbox · Linux: FFmpeg (VAAPI) + software · Linux+CUDA: NVDEC | ProRes, AV1 (M3+) |
| Output | macOS: MLX · all: NumPy | PyTorch (MPS/CUDA) via DLPack |
| Platform | macOS (Apple Silicon) · Linux (x86_64, aarch64) · Linux+CUDA | CUDA zero-copy output |
Benchmarks
The headline metric is decode+load throughput on Apple Silicon vs. the PyAV/CPU path. Numbers will be published here with a reproducible harness once v0.1 lands.
| Pipeline | Frames/s (M-series) | Notes |
|---|---|---|
| PyAV / CPU (baseline) | TBD | torchvision default backend |
| PyRoboFrames (VideoToolbox, zero-copy) | TBD | target: multiple× baseline |
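Until the official harness is published, a throughput number of this kind can be approximated with a few lines of Python. The sketch below is an assumption about how such a measurement might look, not the project's harness; `measure_throughput` and `fake_loader` are hypothetical, and the fake loader only sleeps to stand in for decode work.

```python
import time


def measure_throughput(batches, frames_per_batch, warmup=2):
    """Return frames per second over an iterable of batches, skipping warmup batches."""
    iterator = iter(batches)
    for _ in range(warmup):
        next(iterator, None)  # let caches and worker pools spin up
    count = 0
    start = time.perf_counter()
    for _ in iterator:
        count += 1
    elapsed = time.perf_counter() - start
    return count * frames_per_batch / elapsed if elapsed > 0 else float("inf")


def fake_loader(n_batches, delay_s=0.001):
    """Stand-in for a real dataloader: each batch costs delay_s of 'decode' time."""
    for _ in range(n_batches):
        time.sleep(delay_s)
        yield object()


fps = measure_throughput(fake_loader(12), frames_per_batch=64)
print(f"{fps:.0f} frames/s")
```

When timing a loader that yields MLX arrays, note that MLX evaluates lazily, so a fair measurement would force evaluation of each batch (for example with mx.eval) inside the timed loop.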
Roadmap
See ARCHITECTURE.md for the full design and decisions.
- v0.1 — LeRobotDataset v3.0 → hardware decode (VideoToolbox on macOS, FFmpeg on Linux) → dataloader with zero-copy MLX (macOS) / NumPy (Linux), validation, and a benchmark harness.
- v0.2 — MCAP ingest, PyTorch-MPS output via DLPack.
- v0.3 — RLDS / HDF5 ingest, multi-Mac distributed loading.
Contributing
Contributions welcome — see CONTRIBUTING.md. The Rust core lives in
crates/, the Python package in python/. The most valuable early contributions are around
the MLX zero-copy init path (see mlx#2855)
and the benchmark harness.
Prior art & acknowledgements
docs/COMPARISON.md compares PyRoboFrames against LeRobot, torchcodec,
Robo-DM, DALI, FFCV and others, and records which of their techniques we adopt (a decoded-frame
cache, buffered shuffle, batched seeks, and LeRobot's delta_timestamps/tolerance_s API).
PyRoboFrames stands on LeRobot, MLX, Apple VideoToolbox, PyO3, and the Rust FFmpeg ecosystem. It deliberately does not reinvent robotics middleware — that space is well served by Zenoh and dora-rs. It targets the one layer they leave unsolved on Apple Silicon: the training data feed.
License
MIT © Georgi Mammen Mullassery
Download files
File details
Details for the file pyroboframes-0.1.0a0.tar.gz.
File metadata
- Download URL: pyroboframes-0.1.0a0.tar.gz
- Upload date:
- Size: 40.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.14.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 896717fba4544b3f64d8222f45c40614c36e4141990ea0ebf8555dde67b0093d |
| MD5 | b2a8415f013edf083c1ba14faefd60a2 |
| BLAKE2b-256 | f3eac9ba8751913da7484c3033f967db092c34616775c69803c74bd6b57d0cef |
File details
Details for the file pyroboframes-0.1.0a0-cp310-abi3-macosx_11_0_arm64.whl.
File metadata
- Download URL: pyroboframes-0.1.0a0-cp310-abi3-macosx_11_0_arm64.whl
- Upload date:
- Size: 3.1 MB
- Tags: CPython 3.10+, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.14.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c33a77c5b06bc257073886b286a8504f6bf0adf67921d87c398f0dda41d388b5 |
| MD5 | 34c0f2acd23a53476993c83cd7d74a73 |
| BLAKE2b-256 | 892c7948335d25d3a49ae46d8cd6627c0603b720c1f04d14c916850536dca36d |