Vet your robot datasets: diagnose, repair, and quality-score LeRobot-format episode data before you waste a training run.

robovet

Vet your robot data. Diagnose, repair, and quality-score LeRobot-format datasets — before you waste a training run on broken episodes.

$ robovet doctor ./my_dataset

  FAIL DATA-104   1 episode where metadata 'length' disagrees with the parquet
                  row count — the classic signature of a corrupted episode map.
  FAIL STATS-302  1 stat block disagrees with the actual data — every training
                  run normalizes with these numbers.
  WARN TIME-202   Loading this dataset requires tolerance_s ≥ 7.7e-03
                  (77× the default). Worst: episode 2, 7.29 ms off the grid.
  FAIL META-502   Σ episode lengths = 1086 but info.json total_frames = 1037 —
                  the metadata contradicts itself before a single file is read.

  5 fail · 4 warn · 23 pass
  UNSAFE TO TRAIN — fix the FAILs first.        (exit code 1 — CI-gate it)

Why this exists

Robot learning's bottleneck moved from models to data, and the data is quietly broken. An April 2026 audit of 10 popular open-source robot datasets found floating-point drift that breaks video decoding after ~45 episodes, a v2.1→v3.0 conversion bug that silently corrupts episode↔frame mapping (your run "works" — the policy just learns from jumbled sequences), datasets that only load with tolerance_s set to 100× the default, and no quality metrics anywhere. Hugging Face's own community-dataset cleaning run tells the same story: 111 of 240 datasets failed validation — and that pipeline is internal, not something you can run on yours. Meanwhile the 2026 consensus is that a well-curated 500-demo fine-tune beats a poorly-curated one at 10× the scale — curation tooling is the gap, not model size.

Every check in robovet maps to a documented, real-world failure. The receipts — issue numbers, audit findings, papers — live in PAIN.md.

Try it in 30 seconds (no robot required)

pip install "robovet[video]"   # quoted so shells like zsh don't glob the brackets

robovet demo ./demo          # synthetic SO-100-style dataset, 10 real-world
                             # defect classes injected (each tagged with the
                             # GitHub issue it reproduces)
robovet demo ./demo3 --v3    # same idea in the v3.0 shared-file layout
robovet doctor ./demo        # catches all of them; exit 1
robovet fix    ./demo --apply  # repairs the metadata class; .bak backups
robovet doctor ./demo        # metadata FAILs gone
robovet report ./demo -o report.html   # one self-contained, shareable page

robovet demo ./demo --clean builds the same dataset with zero defects, so you can see what all-green looks like.

Vet a Hub dataset before downloading it

pip install "robovet[hub]"
robovet doctor hf://lerobot/svla_so100_pickplace

This fetches only meta/ (a few MB), then runs every metadata-level check: structure, stats sanity, and the new META-5xx ledger cross-checks (episode↔frame index math, Σ lengths vs counters, per-episode stats freshness, video time windows). The lerobot#2401 corruption class is visible from metadata alone, so you find out a 4 GB dataset is broken before spending the bandwidth. Honest scope: the verdict says META CLEAN, never CLEAN, because values, timestamps and video decode still need the full local doctor. --meta-only works on local paths too (instant pre-check on slow disks).
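To make the ledger idea concrete, here is a minimal sketch of a metadata-only cross-check. It is not robovet's implementation: the function name check_ledger is hypothetical, and the field names (total_frames, total_episodes, episode_index, length) are assumed to follow LeRobot v2.x metadata naming.

```python
from typing import Dict, List


def check_ledger(info: Dict, episodes: List[Dict]) -> List[str]:
    """Return problems that are visible from metadata alone.

    Assumed shapes (hypothetical): `info` mirrors info.json, and each
    episode record carries `episode_index` and `length`.
    """
    problems = []

    # The sum of per-episode lengths must equal the global frame counter.
    total = sum(ep["length"] for ep in episodes)
    if total != info["total_frames"]:
        problems.append(
            f"sum of episode lengths = {total} but total_frames = {info['total_frames']}"
        )

    # The episode counter must match the number of episode records.
    if len(episodes) != info["total_episodes"]:
        problems.append(
            f"{len(episodes)} episode records but total_episodes = {info['total_episodes']}"
        )

    # Episode indices should run 0..N-1 with no gaps or repeats.
    indices = sorted(ep["episode_index"] for ep in episodes)
    if indices != list(range(len(episodes))):
        problems.append("episode_index values are not contiguous from 0")

    return problems


info = {"total_frames": 1037, "total_episodes": 3}
episodes = [
    {"episode_index": 0, "length": 400},
    {"episode_index": 1, "length": 386},
    {"episode_index": 2, "length": 300},
]
for problem in check_ledger(info, episodes):
    print("FAIL", problem)
```

Nothing here opens a parquet or video file, which is why this class of check can run before the full download.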

CI gate in 15 lines

name: robovet
on: [push, pull_request]
jobs:
  vet:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.12" }
      - run: pip install "robovet[video]"
      - run: robovet doctor ./datasets/my_task   # exit 1 on FAIL blocks the merge

Saw this error? Run this

| You hit | Run | You get |
|---|---|---|
| ValueError: timestamps … tolerance_s on load | robovet doctor → TIME-202 | the exact minimal tolerance_s, per worst episode |
| wrong frames / IndexError after a v2→v3 conversion | DATA-104/105 + META-501 | which episodes' ledgers lie, three-way cross-check |
| TorchCodec/AV1 decode errors | VIDEO-403 | per-camera codec tiers and what to re-encode |
| loss=NaN out of nowhere | DATA-107 + STATS-302 | NaN/Inf locations and stale-normalization blocks |
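As an illustration of what "the exact minimal tolerance_s" can mean, the sketch below measures how far each timestamp sits from the nearest point on the k / fps grid and reports the worst case. This is one plausible definition, assumed for illustration; robovet's TIME-202 and the loader's own check may be defined differently (for example on consecutive differences). The name minimal_tolerance_s is hypothetical.

```python
from typing import Sequence


def minimal_tolerance_s(timestamps: Sequence[float], fps: float) -> float:
    """Largest distance in seconds from any timestamp to its nearest grid point k / fps."""
    worst = 0.0
    for t in timestamps:
        nearest = round(t * fps) / fps  # closest point on the frame grid
        worst = max(worst, abs(t - nearest))
    return worst


fps = 30.0
# Third frame is recorded 7.29 ms late; the others sit on the grid.
timestamps = [0.0, 1 / 30, 2 / 30 + 0.00729, 3 / 30]
needed = minimal_tolerance_s(timestamps, fps)
print(f"tolerance_s >= {needed:.2e}")
```

Running this per episode and taking the maximum gives a single number to compare against the loader's tolerance.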

What it checks

| Group | Catches | Maps to |
|---|---|---|
| STRUCT-0xx | missing/invalid metadata, dangling episodes, orphan files | lerobot#761 (no validator for hand-rolled conversions) |
| DATA-1xx | episode↔frame mapping corruption, schema drift, NaN/Inf, dead dims | lerobot#2401 (silent v2.1→v3.0 corruption) |
| TIME-2xx | off-grid timestamps with the exact tolerance_s you'd need, non-monotonic time, cumulative FP drift | lerobot#933, lerobot#3177 |
| STATS-3xx | stored normalization stats that disagree with the data ("normalization poison"), broken quantile stats (q01/q99) | HF docs warning; phospho repair post; lerobot#2189 |
| VIDEO-4xx | video/parquet frame-count desync (including per-episode windows inside shared v3 files), codec-aware compatibility tiers (h264 ✓ / AV1 info, since it's lerobot's own default / mpeg4-hevc warn), fps mismatch | Correll-lab postmortem; phospho notes |
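The "normalization poison" case in the STATS-3xx row can be sketched as follows: recompute mean and std per dimension and flag any dimension where the stored value has drifted. This is an assumption-laden illustration, not robovet's check; stale_stats, the stored-stats layout, the use of population std, and the tolerance are all hypothetical choices.

```python
import math
import statistics
from typing import Dict, List, Sequence


def stale_stats(
    stored: Dict[str, List[float]],
    rows: Sequence[Sequence[float]],
    rel_tol: float = 1e-3,
) -> List[int]:
    """Return the dimensions whose stored mean or std disagrees with the data."""
    stale = []
    # zip(*rows) turns a list of frames into one column per dimension.
    for dim, column in enumerate(zip(*rows)):
        mean = statistics.fmean(column)
        std = statistics.pstdev(column)  # population std, an assumption
        ok = math.isclose(stored["mean"][dim], mean, rel_tol=rel_tol, abs_tol=1e-9) and (
            math.isclose(stored["std"][dim], std, rel_tol=rel_tol, abs_tol=1e-9)
        )
        if not ok:
            stale.append(dim)
    return stale


rows = [[0.0, 10.0], [1.0, 12.0], [2.0, 14.0]]
# Dimension 1 carries a mean left over from some earlier version of the data.
stored = {"mean": [1.0, 5.0], "std": [0.8165, 1.633]}
print("stale dims:", stale_stats(stored, rows))
```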

robovet doctor exits 1 on any FAIL and takes --json, so it drops straight into CI: gate dataset merges the way Codecov gates coverage.

Quality scoring (triage, not truth)

robovet score ./my_dataset --csv scores.csv

Per-episode signals, all computed in one pass: jerk smoothness, idle ratio, gripper chatter, duration outliers, action saturation, exact duplicates. This is deliberately the cheap first pass — the smoothness-first approach the 2026 curation literature (rinse, Demo-SCORE, QoQ) argues should precede expensive policy-rollout or influence-function filtering. Scores put the worst episodes in front of a human in seconds; review before you delete. Statistical flags carry practical-significance guards, so homogeneous datasets don't self-flag.
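Two of these signals are cheap enough to sketch in a few lines. The definitions below are assumptions chosen for illustration (mean absolute third finite difference for jerk, fraction of near-zero steps for idle ratio); robovet's actual formulas and thresholds may differ, and the function names are hypothetical.

```python
from typing import Sequence


def mean_abs_jerk(positions: Sequence[float], fps: float) -> float:
    """Mean absolute third finite difference of a 1-D trajectory, in units per s^3."""
    dt = 1.0 / fps
    jerks = [
        abs(positions[i + 3] - 3 * positions[i + 2] + 3 * positions[i + 1] - positions[i])
        / dt**3
        for i in range(len(positions) - 3)
    ]
    return sum(jerks) / len(jerks) if jerks else 0.0


def idle_ratio(positions: Sequence[float], eps: float = 1e-3) -> float:
    """Fraction of steps in which the value moves by less than eps."""
    steps = [abs(b - a) for a, b in zip(positions, positions[1:])]
    return sum(s < eps for s in steps) / len(steps) if steps else 0.0


smooth = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]  # constant velocity, near-zero jerk
jerky = [0.0, 0.5, 0.0, 0.5, 0.0, 0.5]   # oscillating, high jerk
print("smooth jerk:", mean_abs_jerk(smooth, fps=30.0))
print("jerky jerk: ", mean_abs_jerk(jerky, fps=30.0))
print("idle ratio: ", idle_ratio([0.0, 0.0, 0.0, 1.0, 2.0]))
```

Signals like these rank episodes for review; they do not decide on their own what gets deleted.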

Repair contract

robovet fix is dry-run by default. With --apply it rewrites only metadata (episode lengths, normalization stats, info.json counters), backs up every touched file as .bak, never modifies parquet or video payloads, and preserves everything it doesn't understand: quantile keys (q01/q99 — the v3 QUANTILES-normalization era), image-stat blocks, and unknown episode fields such as tags. A repair tool must never be the thing that deletes your data; the test suite enforces these guarantees. Frame surgery (tail-trimming desynced episodes, timestamp re-gridding) is on the roadmap under the same contract.
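The contract above (dry-run by default, .bak backup, unknown keys preserved) can be sketched for a single JSON metadata file. This is an illustration of the pattern under stated assumptions, not robovet's code: repair_counters and the example keys are hypothetical.

```python
import json
import shutil
import tempfile
from pathlib import Path
from typing import Any, Dict, Tuple


def repair_counters(
    path: Path, fixes: Dict[str, Any], apply: bool = False
) -> Dict[str, Tuple[Any, Any]]:
    """Return pending changes as {key: (old, new)}; write only when apply=True."""
    data = json.loads(path.read_text())
    changes = {k: (data.get(k), v) for k, v in fixes.items() if data.get(k) != v}
    if apply and changes:
        # Back up first, so the original is always recoverable.
        shutil.copy2(path, path.with_name(path.name + ".bak"))
        # Update in place: keys this function does not know about are left untouched.
        data.update({k: new for k, (_, new) in changes.items()})
        path.write_text(json.dumps(data, indent=2))
    return changes


with tempfile.TemporaryDirectory() as tmp:
    info = Path(tmp) / "info.json"
    info.write_text(json.dumps({"total_frames": 1037, "custom_tag": "keep me"}))

    print("dry run:", repair_counters(info, {"total_frames": 1086}))
    repair_counters(info, {"total_frames": 1086}, apply=True)
    print("after apply:", json.loads(info.read_text()))
    print("backup exists:", (Path(tmp) / "info.json.bak").exists())
```

Loading the whole document and updating only the named keys is what keeps fields such as custom_tag intact.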

Scope, honestly

  • LeRobot v2.0 / v2.1 and v3.x are both first-class for diagnosis — each has its own synthetic fixture and test suite, and v3 gets per-episode video alignment inside shared files (VIDEO-405) plus per-episode stats checks parsed from the v3 metadata. fix currently rewrites v2.x episode metadata and global stats; v3 per-episode stats regeneration is on the roadmap.
  • robovet does not merge/split/delete episodes — lerobot ships that natively now. We do what the official stack doesn't: deep validation, metadata repair, and quality triage.
  • Local-first by design. Your data never leaves your disk — deployment-specific data is a competitive asset; treat it like one.

Library use

from robovet import load_dataset, run_doctor, score_dataset

ds = load_dataset("./my_dataset")
rep = run_doctor(ds)          # rep.exit_code, rep.results, rep.counts
sc  = score_dataset(ds, scan=rep.scan)   # reuses the same single IO pass

Apache-2.0. Issues and broken-dataset war stories very welcome — if your dataset breaks in a way robovet doesn't catch, that's a bug report we want.
