Zero-copy, hardware-accelerated robot-learning dataloader for Apple Silicon (MLX)

PyRoboFrames

Zero-copy, hardware-accelerated robot-learning dataloader for Apple Silicon.

PyRoboFrames feeds robot-learning training loops on Apple Silicon at hardware speed. It reads robot datasets (LeRobotDataset v3.0, with MCAP planned), decodes their multi-camera video on the Apple Media Engine via VideoToolbox, and hands the frames to MLX (and PyTorch-MPS) as arrays without a single CPU copy — turning the data path from the training bottleneck into a non-event.

Status: pre-alpha, under active construction. APIs will change and it is not yet on PyPI. The sections below describe the v0.1 goal — see What works today for the current state.


What works today

Implemented and tested (Rust core + Python):

  • LeRobotDataset v3.0 readers — schema / cameras / fps; a per-episode index that resolves a global frame to (camera, video file, timestamp); and tabular state/action reading.
  • Working dataloader (tabular): RoboFrameDataset.from_path(...).loader(...) iterates NumPy batches of observation.state / action with a buffered (quasi-random) shuffle, drop_last, and seeded reproducibility. Works today on any LeRobotDataset v3.0.
  • Temporal windows — LeRobot-style delta_timestamps return [batch, steps, dim] arrays.
  • Decode scaffolding — the Decoder trait (batched seeks), a decoded-frame LRU cache, a frame-buffer pool, and per-platform backend selection (VideoToolbox / FFmpeg / CUDA NVDEC).
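
As an illustration of the buffered (quasi-random) shuffle mentioned above, here is a minimal sketch in plain Python. It is not the PyRoboFrames implementation (that lives in the Rust core); the names buffered_shuffle and batches, and the buffer_size parameter, are hypothetical. The idea is to hold a fixed-size buffer of sample indices, emit a random resident whenever a new index arrives, and drain the remainder at the end, so memory stays bounded while the order is still randomized and reproducible for a given seed.

```python
import random
from typing import Iterable, Iterator, List


def buffered_shuffle(indices: Iterable[int], buffer_size: int, seed: int) -> Iterator[int]:
    """Yield every index exactly once, in a seeded quasi-random order."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)            # private RNG, so the order is reproducible
    buffer: List[int] = []
    for idx in indices:
        if len(buffer) < buffer_size:
            buffer.append(idx)           # still filling the buffer
            continue
        slot = rng.randrange(buffer_size)
        yield buffer[slot]               # emit a random resident index
        buffer[slot] = idx               # and put the new index in its place
    rng.shuffle(buffer)                  # drain what is left in random order
    yield from buffer


def batches(order: Iterable[int], batch_size: int, drop_last: bool) -> Iterator[List[int]]:
    """Group indices into batches, optionally dropping a short final batch."""
    batch: List[int] = []
    for idx in order:
        batch.append(idx)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch and not drop_last:
        yield batch
```

A larger buffer gives an order closer to a full shuffle at the cost of more resident samples; a buffer of size 1 degenerates to sequential order.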

Not usable yet (in progress):

  • 🚧 Video frames: the VideoToolbox (macOS) and FFmpeg / CUDA-NVDEC (Linux) decoders are feature-gated stubs. Only the integration of decode into the pipeline is done and tested.
  • 🚧 Zero-copy MLX output (the Apple-Silicon differentiator).
  • 🚧 The validation pass (ds.validate()).

Try the working part now (state / action → NumPy)

import pyroboframes as prf

ds = prf.RoboFrameDataset.from_path("/path/to/lerobot_dataset")
print(ds)                                   # episodes / frames / cameras
loader = ds.loader(batch_size=64, shuffle=True)

for batch in loader:                        # dict of NumPy arrays
    state  = batch["observation.state"]     # [64, state_dim], float32
    action = batch["action"]                # [64, action_dim], float32
    ...                                      # your training step

The video/MLX dataloader shown further below is the v0.1 target, not yet shipped.
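
The temporal windows listed under What works today return [batch, steps, dim] arrays. The following NumPy sketch shows one way such a window can be gathered from a single episode's tabular series. It is an illustration only: gather_windows is a hypothetical helper, and clamping at the episode boundaries is an assumption, not confirmed PyRoboFrames behaviour.

```python
import numpy as np


def gather_windows(series, anchors, deltas_s, fps):
    """Return an array of shape [batch, steps, dim].

    series   : [T, dim] per-frame values for one episode (e.g. observation.state)
    anchors  : [batch] frame indices the windows are centred on
    deltas_s : [steps] time offsets in seconds, LeRobot-style delta_timestamps
    fps      : frames per second of the episode
    """
    series = np.asarray(series)
    # Convert time offsets in seconds to whole-frame offsets.
    offsets = np.rint(np.asarray(deltas_s, dtype=np.float64) * fps).astype(np.int64)
    idx = np.asarray(anchors, dtype=np.int64)[:, None] + offsets[None, :]
    # Assumption: clamp indices that fall outside the episode to its first/last frame.
    idx = np.clip(idx, 0, len(series) - 1)
    return series[idx]


state = np.arange(20, dtype=np.float32).reshape(10, 2)    # 10 frames, dim = 2
windows = gather_windows(state, anchors=[0, 5], deltas_s=[-0.1, 0.0], fps=10)
print(windows.shape)                                       # [batch, steps, dim]
```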


The problem

Robot-learning datasets store observations as MP4 video (often several cameras per episode). During training, every sample requires seeking into those videos and decoding the right frames. This decode step is the dominant cost of the data pipeline — Hugging Face's own LeRobot tracker reports training that is "completely bottlenecked by video decoding even on servers with hundreds of cores," spending more time waiting on the dataloader than on backprop (lerobot#1623).

On Apple Silicon the problem is worse, and avoidably so: the standard Python stack (torchvision / PyAV / FFmpeg software decode) runs on the CPU and leaves the dedicated Media Engine idle, then copies frames across to the GPU — copies that are pure waste on a unified-memory machine. Meanwhile the compute side (MLX, M5 Neural Accelerators) is fast and underfed.

What PyRoboFrames does

This is the v0.1 design; see What works today for what's currently built.

LeRobotDataset / MCAP        PyRoboFrames (Rust core)              your training loop
┌───────────────────┐   ┌────────────────────────────────────┐   ┌────────────────────┐
│ parquet (state /  │   │ index → sample → VideoToolbox HW   │   │  MLX  (Neural      │
│ action) + mp4     │──▶│ decode → IOSurface (shared mem) →  │──▶│  Accelerators) or  │
│ video shards      │   │ time-synced windows, no copy       │   │  PyTorch-MPS       │
└───────────────────┘   └────────────────────────────────────┘   └────────────────────┘
  • Hardware decode via Apple VideoToolbox — uses the Media Engine, not the CPU.
  • Zero-copy — decoded frames live in IOSurface-backed unified memory and are wrapped as MLX arrays without a host→device transfer (there is no "device transfer" on unified memory; we stop pretending there is).
  • Time-synced windows — assembles (multi-camera frames, joint state, action) windows by joining the parquet tabular data with the decoded video at matching timestamps.
  • Built-in validation — flags missing frames, non-monotonic timestamps, and camera/state misalignment before they silently corrupt a training run.
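
To make the validation idea concrete, here is a small sketch of the kind of timestamp check described above: it flags non-monotonic timestamps and gaps larger than one frame period. The function name check_timestamps and its report format are hypothetical; ds.validate() is not yet implemented and may work differently.

```python
import numpy as np


def check_timestamps(timestamps, fps, tolerance_s=1e-4):
    """Return a list of human-readable issues for one camera or episode."""
    ts = np.asarray(timestamps, dtype=np.float64)
    issues = []
    if ts.size < 2:
        return issues
    dt = np.diff(ts)                     # dt[i] = ts[i + 1] - ts[i]
    expected = 1.0 / fps                 # nominal spacing between frames
    for i in np.flatnonzero(dt <= 0):
        issues.append(f"non-monotonic timestamp at frame {i + 1}")
    for i in np.flatnonzero(dt > expected + tolerance_s):
        issues.append(f"gap after frame {i}: {dt[i]:.4f}s (expected {expected:.4f}s)")
    return issues


clean = check_timestamps([0.0, 0.1, 0.2, 0.3], fps=10)
broken = check_timestamps([0.0, 0.1, 0.1, 0.4], fps=10)
```

Running such checks once, before training, is cheaper than discovering a misaligned camera after a long run.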

Why a Rust core with a Python API

The audience is ML researchers, so the product is a pip-installable Python package and the Rust stays invisible. Rust is the implementation language because the hot path (HW decode, IOSurface lifetime management, off-GIL prefetch, zero-copy buffer hand-off) is exactly where a safe systems language with no GIL earns its keep. The result is a fast, safe core with an ergonomic Python shell, built with PyO3 + maturin, and no Rust toolchain is needed to pip install it.

Installation

Not yet released. When v0.1 ships:

pip install pyroboframes        # macOS / Apple Silicon wheels, no Rust toolchain needed

Wheels are built for Apple Silicon (primary target) with a portable FFmpeg fallback for other platforms.

Quickstart (planned v0.1 API)

import pyroboframes as prf

# Open a LeRobot dataset (local path or Hugging Face Hub repo id)
ds = prf.RoboFrameDataset.from_hub("lerobot/aloha_sim_insertion_human")

# Validate before training
report = ds.validate()
report.raise_if_errors()        # missing frames, timestamp gaps, cam/state mismatch

# Build a dataloader that yields MLX arrays, zero-copy, decoded on the Media Engine
loader = ds.loader(
    batch_size=64,
    cameras=["observation.images.top", "observation.images.wrist"],
    delta_timestamps={"observation.images.top": [-0.1, 0.0]},  # temporal context (LeRobot-style)
    tolerance_s=1e-4,           # snap to the nearest frame within this tolerance
    shuffle=True,
    num_workers=4,              # Rust worker pool, runs off the GIL
    output="mlx",               # or "numpy" / "torch" (MPS)
)

for batch in loader:
    frames = batch["observation.images.top"]   # mlx.core.array, already on GPU
    state  = batch["observation.state"]
    action = batch["action"]
    ...                                          # your MLX training step
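
The tolerance_s comment in the quickstart describes snapping a requested timestamp to the nearest frame within a tolerance. A minimal NumPy sketch of that lookup follows. snap_to_frames is a hypothetical helper, and raising an error when no frame is close enough is an assumption about the eventual behaviour.

```python
import numpy as np


def snap_to_frames(frame_ts, query_ts, tolerance_s):
    """Map each query timestamp to the index of the nearest frame timestamp.

    frame_ts must be sorted ascending and contain at least two entries.
    """
    frame_ts = np.asarray(frame_ts, dtype=np.float64)
    query_ts = np.asarray(query_ts, dtype=np.float64)
    if frame_ts.size < 2:
        raise ValueError("need at least two frame timestamps")
    # Index of the first frame at or after each query, kept inside [1, n - 1].
    right = np.clip(np.searchsorted(frame_ts, query_ts), 1, frame_ts.size - 1)
    left = right - 1
    # Choose whichever neighbour is closer in time.
    pick_left = (query_ts - frame_ts[left]) <= (frame_ts[right] - query_ts)
    nearest = np.where(pick_left, left, right)
    error = np.abs(frame_ts[nearest] - query_ts)
    if np.any(error > tolerance_s):
        raise ValueError("a requested timestamp has no frame within tolerance_s")
    return nearest


frame_ts = np.arange(5) / 10.0                      # frames at 0.0, 0.1, ..., 0.4 s
indices = snap_to_frames(frame_ts, [0.1, 0.30001], tolerance_s=1e-4)
```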

Cross-platform

PyRoboFrames runs on both macOS and Linux from the same API and the same Rust core. The platform-specific part is decode and output, selected behind a single Decoder trait:

  • macOS (Apple Silicon) — the optimized path: VideoToolbox hardware decode → IOSurface → zero-copy MLX. This is the differentiator.
  • Linux — the same engine, decoding via FFmpeg (VAAPI where available, software otherwise) and outputting NumPy / PyTorch.
  • Linux + CUDA — when CUDA libraries are present (build with --features cuda), NVIDIA NVDEC hardware decode with CUDA output for PyTorch.
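
The platform split above can be pictured as a single decision function. The sketch below is a Python illustration of that decision only; in PyRoboFrames the selection happens in the Rust core behind the Decoder trait and Cargo features, and the names Backend and select_backend are hypothetical.

```python
import platform
from enum import Enum
from typing import Optional


class Backend(Enum):
    VIDEOTOOLBOX = "videotoolbox"   # macOS on Apple Silicon
    NVDEC = "nvdec"                 # Linux with CUDA
    FFMPEG = "ffmpeg"               # portable fallback (VAAPI or software)


def select_backend(system: Optional[str] = None,
                   machine: Optional[str] = None,
                   cuda_available: bool = False) -> Backend:
    """Pick a decode backend from the host platform."""
    system = system or platform.system()      # e.g. "Darwin" or "Linux"
    machine = machine or platform.machine()   # e.g. "arm64", "x86_64", "aarch64"
    if system == "Darwin" and machine == "arm64":
        return Backend.VIDEOTOOLBOX
    if system == "Linux" and cuda_available:
        return Backend.NVDEC
    return Backend.FFMPEG


print(select_backend())   # backend for the machine running this script
```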

Supported (target matrix)

| | v0.1 | Planned |
|---|---|---|
| Datasets | LeRobotDataset v3.0 | MCAP, RLDS, HDF5 |
| Decode (HW) | macOS: VideoToolbox · Linux: FFmpeg (VAAPI) + software · Linux+CUDA: NVDEC | ProRes, AV1 (M3+) |
| Output | macOS: MLX · all: NumPy | PyTorch (MPS/CUDA) via DLPack |
| Platform | macOS (Apple Silicon) · Linux (x86_64, aarch64) · Linux+CUDA | CUDA zero-copy output |

Benchmarks

The headline metric is decode+load throughput on Apple Silicon vs. the PyAV/CPU path. Numbers will be published here with a reproducible harness once v0.1 lands.

| Pipeline | Frames/s (M-series) | Notes |
|---|---|---|
| PyAV / CPU (baseline) | TBD | torchvision default backend |
| PyRoboFrames (VideoToolbox, zero-copy) | TBD | target: multiple× baseline |
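
Until the official harness is published, a throughput number of this kind can be approximated with a few lines of timing code. The sketch below is not the project's benchmark harness; measure_throughput and count_frames are hypothetical names, and it works with any iterable of batches.

```python
import time
from typing import Any, Callable, Iterable, Tuple


def measure_throughput(batches: Iterable[Any],
                       count_frames: Callable[[Any], int],
                       warmup: int = 2) -> Tuple[int, float, float]:
    """Return (frames, seconds, frames_per_second) for one pass over batches."""
    it = iter(batches)
    for _ in range(warmup):
        next(it, None)                    # discard warm-up batches (caches, lazy init)
    frames = 0
    start = time.perf_counter()
    for batch in it:
        frames += count_frames(batch)     # e.g. batch size times number of cameras
    seconds = time.perf_counter() - start
    fps = frames / seconds if seconds > 0 else float("inf")
    return frames, seconds, fps


# Stand-in "loader": five batches of four frames each.
fake_batches = [[0] * 4 for _ in range(5)]
frames, seconds, fps = measure_throughput(fake_batches, count_frames=len, warmup=1)
```

For a fair comparison, both pipelines should read the same dataset with the same batch size, and the measured loop should touch the decoded data so lazy decoders cannot skip work.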

Roadmap

See ARCHITECTURE.md for the full design and decisions.

  • v0.1 — LeRobotDataset v3.0 → hardware decode (VideoToolbox on macOS, FFmpeg on Linux) → dataloader with zero-copy MLX (macOS) / NumPy (Linux), validation, and a benchmark harness.
  • v0.2 — MCAP ingest, PyTorch-MPS output via DLPack.
  • v0.3 — RLDS / HDF5 ingest, multi-Mac distributed loading.
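
The DLPack hand-off planned for v0.2 relies on a standard protocol for sharing a buffer between array libraries without copying. The sketch below demonstrates the protocol with NumPy on both sides, purely to show the no-copy property. It says nothing about how PyRoboFrames will export frames, and import_via_dlpack is a hypothetical helper. Frameworks that implement the protocol expose a similar entry point (for example torch.from_dlpack).

```python
import numpy as np


def import_via_dlpack(producer):
    """Consumer side: wrap the producer's buffer without copying it."""
    return np.from_dlpack(producer)


frames = np.zeros((2, 4, 4, 3), dtype=np.uint8)   # stand-in for a decoded batch
view = import_via_dlpack(frames)

frames[0, 0, 0, 0] = 255                           # write through the producer...
shared = np.shares_memory(frames, view)            # ...and both arrays alias one buffer
```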

Contributing

Contributions welcome — see CONTRIBUTING.md. The Rust core lives in crates/, the Python package in python/. The most valuable early contributions are around the MLX zero-copy init path (see mlx#2855) and the benchmark harness.

Prior art & acknowledgements

docs/COMPARISON.md compares PyRoboFrames against LeRobot, torchcodec, Robo-DM, DALI, FFCV and others, and records which of their techniques we adopt (a decoded-frame cache, buffered shuffle, batched seeks, and LeRobot's delta_timestamps/tolerance_s API).

PyRoboFrames stands on LeRobot, MLX, Apple VideoToolbox, PyO3, and the Rust FFmpeg ecosystem. It deliberately does not reinvent robotics middleware — that space is well served by Zenoh and dora-rs. It targets the one layer they leave unsolved on Apple Silicon: the training data feed.

License

MIT © Georgi Mammen Mullassery
