The quality and synthesis layer for the open robot-learning data ecosystem.
Project description
trajlens
The quality and synthesis layer for the open robot-learning data ecosystem.
ruff for robot data — lint, fix, and generate clean LeRobotDataset datasets.
Status
Pre-v0.1 (0.1.0.dev0), under active development. Not yet on PyPI.
lint is implemented and audited against the public Hub (see Real-world audit below).
fix and web are stubs reserved for the v0.2 milestone.
Install (dev)
git clone https://github.com/<your-username>/trajlens
cd trajlens
uv venv
source .venv/bin/activate
uv pip install -e ".[dev,hub]"
The [hub] extra pulls in huggingface_hub; it's only required to lint datasets by Hub repo id rather than local path.
Usage
trajlens lint <path-or-org/dataset> # human-readable terminal report
trajlens lint <path-or-org/dataset> --json # machine-readable JSON report
trajlens lint <path-or-org/dataset> --report out.html
trajlens lint <path-or-org/dataset> --sarif out.sarif # SARIF 2.1.0, for CI annotations
trajlens lint <path-or-org/dataset> --deep # also decode video and verify per-frame stats
Exit codes follow lint-tool convention: 0 = clean, 1 = WARN present, 2 = FAIL or load ERROR — so trajlens lint composes directly into CI gates.
By default, checks that require materializing a lot of data over the network (full video decode, per-frame stats reconciliation) are skipped for Hub datasets and reported as INFO/skipped rather than run. Pass --deep to force them; expect this to be significantly slower and to fetch the full dataset.
Architecture
graph TD
subgraph Interfaces
CLI[CLI - typer]
WEB[Web dashboard - FastAPI + React, optional]
SDK[Python SDK / import]
end
subgraph Core
LOADER[Dataset Source Layer<br/>local + Hub, version-aware]
MODEL[Canonical Dataset Model<br/>typed in-memory view]
REGISTRY[Check Registry<br/>pluggable rules]
ENGINE[Check Engine<br/>runs checks, bounded]
REPORT[Report Builder<br/>terminal / json / html / sarif]
REPAIR[Repair Engine<br/>dry-run, diff, opt-in]
end
subgraph Synthesis [Pillar 3, later]
SIMBK[Sim Backend Protocol<br/>MuJoCo default]
AUG[Trajectory Augmenter<br/>MimicGen-style]
DR[Domain Randomizer]
WRITER[LeRobotDataset Writer]
end
CLI --> LOADER
WEB --> LOADER
SDK --> LOADER
LOADER --> MODEL
MODEL --> ENGINE
REGISTRY --> ENGINE
ENGINE --> REPORT
MODEL --> REPAIR
REPAIR --> WRITER
SIMBK --> AUG --> DR --> WRITER
WRITER --> MODEL
REPORT --> WEB
HUB[(Hugging Face Hub)] <--> LOADER
HUB <--> WRITER
What it checks
trajlens validates a LeRobotDataset (v2.0, v2.1, or v3.0) against its own declared metadata, independent of any particular consumer's assumptions. Checks are grouped by category and run as a check engine over each dataset:
| Category | Check | Severity | What it catches |
|---|---|---|---|
| STRUCTURAL | VERSION_DETECTED |
INFO | Reports the detected codebase_version. |
| STRUCTURAL | SCHEMA_CONSISTENCY |
FAIL | Parquet column dtypes/widths disagree with info.json's declared feature shapes. |
| STRUCTURAL | INDEX_CONTINUITY |
FAIL | Gaps or duplicates in frame_index/episode_index/global index columns. |
| STRUCTURAL | METADATA_DATA_AGREEMENT |
FAIL | Declared episode lengths/from-to boundaries disagree with actual Parquet row counts (catches #2401-class corruption). |
| STRUCTURAL | PATH_TEMPLATE_RESOLVES |
FAIL | A declared shard path (data or video) doesn't resolve to a readable file. |
| SEMANTIC | FEATURE_DIMENSIONALITY |
FAIL | A feature's actual column width doesn't match its declared shape. |
| SEMANTIC | TASK_INTEGRITY |
FAIL | A task_index reference has no corresponding, non-empty task description. |
| SEMANTIC | LANGUAGE_PRESENT |
WARN | An episode has no non-empty language/task description. |
| SEMANTIC | CAMERA_INTRINSICS_PLAUSIBLE |
INFO | Advisory; skipped where the LeRobot format carries no intrinsics field. |
| TEMPORAL | TIMESTAMP_MONOTONIC |
FAIL | Timestamps are not strictly increasing within an episode. |
| TEMPORAL | TIMESTAMP_SPACING |
WARN | Timestamp spacing is inconsistent with declared fps beyond decoder tolerance. |
| STATISTICAL | STATS_MATCH_DATA |
FAIL | Recomputed global Welford stats diverge from meta/stats.json. Skipped over Hub HTTP by default — too slow without --deep. |
| STATISTICAL | PER_EPISODE_STATS_MATCH |
WARN | Same, per-episode. Skipped over Hub HTTP by default. |
| STATISTICAL | VALUE_SANITY |
WARN | Out-of-range or NaN/Inf values in numeric features. Skipped over Hub HTTP by default. |
| VIDEO | DECODABLE_SPOTCHECK |
FAIL | A sampled video segment fails to decode. |
| KNOWNBUG | TIMESTAMP_DRIFT |
FAIL | Cumulative timestamp drift matching the known lerobot #3177 bug pattern. |
Every check's full result — message, severity, and structured details — is included in the JSON/HTML/SARIF report; the table above is the summary.
Real-world audit of the Hub
scripts/audit_hub.py runs trajlens lint --json against a random sample of public Hub datasets tagged lerobot, each in an isolated subprocess with a 60s timeout, and aggregates the results. It's how this project validates itself against the actual long tail of community datasets rather than only its own fixtures.
A 100-dataset run (2026-06-24) produced:
| Status | Count | Meaning |
|---|---|---|
| PASS | 19 | No issues found. |
| WARN | 0 | — |
| FAIL | 13 | A real check fired — schema mismatch, metadata/data disagreement, missing language, etc. |
| ERROR | 47 | Dataset failed to load (unsupported v2.x Hub streaming, malformed/missing meta/, mistagged or deleted repos) — never reached the check engine. |
| TIMEOUT | 21 | Exceeded the 60s per-dataset budget. |
These figures are from a single 100-dataset random sample (raw results: see the v0.1.0 release assets); audit_hub.py samples a fresh random subset of lerobot-tagged Hub datasets on each run, so rerunning it will produce a similarly-shaped but not identical distribution.
Of the 47 load-time ERRORs, none are trajlens bugs: about half (24) are the documented v0.1 limitation that v2.x Hub datasets can't be lazily streamed (shard paths are implicit and require a local filesystem to glob), and the rest are dead/mistagged Hub references, repos that aren't actually LeRobotDatasets (no meta/ directory at the repo root), or genuinely malformed meta/info.json (wrong dtype, missing required fields) on the dataset's side.
TIMEOUTs were investigated as a possible performance bug rather than accepted as an inherent network ceiling: profiling two small, previously-timing-out datasets (abdul004/so101_multi_task_v1, 125 episodes; Elvinky/pick_green_block_into_box, 102 episodes) found that loading a dataset's metadata over Hub HTTP was issuing dozens of small, separately-latency-bound reads per Parquet shard, and downloading the meta/ file tree one file at a time. Fixing both (single whole-shard fetch instead of scattered reads; parallelized meta/ download) brought those two datasets from 60s+ timeouts down to 33s and 11s respectively, and cut the audit's overall TIMEOUT count and mean per-dataset duration by roughly a third in before/after sampling. The remaining TIMEOUTs are concentrated in genuinely large multi-thousand-episode shards, where 60s is a real infra ceiling rather than a fixable inefficiency.
Launch audit findings
Of the 81 datasets that reached a grade (excluding ERROR/TIMEOUT, where no check ever ran), two known upstream lerobot bugs accounted for a meaningful share of the failures:
| Known bug | Prevalence (of successfully-linted datasets) |
|---|---|
KNOWNBUG.TIMESTAMP_DRIFT (#3177) |
3.1% |
STRUCTURAL.METADATA_DATA_AGREEMENT (#2401) |
18.8% |
audit_hub.py resamples a fresh random subset of lerobot-tagged Hub datasets on every run, so these are not a fixed, reproducible distribution — rerunning the audit will not return the same percentages, only a similarly-shaped one. Raw per-dataset results behind these specific numbers are attached to the v0.1.0 GitHub release as audit_results_100.json and audit_summary_100.txt.
Performance note: Hub vs. local
Linting a 100-episode dataset locally takes under 30 seconds.
Linting a Hub dataset directly (trajlens lint org/dataset) streams metadata and data shards over HTTP. It will inherently be slower than a local copy — typically under a minute for small-to-medium datasets, more for very large ones — because of unavoidable network round trips. For repeated linting, downloading the dataset locally first is still faster.
License
Apache-2.0
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file trajlens-0.1.0.tar.gz.
File metadata
- Download URL: trajlens-0.1.0.tar.gz
- Upload date:
- Size: 89.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.15
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
5b55763ada9f92a43b7734fe6fe2b1184412edef4792946ed6e083006956c47d
|
|
| MD5 |
9857190218146ef7fe5f6c6cbf3bcaaa
|
|
| BLAKE2b-256 |
d7ce080d07c5e79caf94752877831ec525ab0727867e5c00057ef515a741b6a7
|
File details
Details for the file trajlens-0.1.0-py3-none-any.whl.
File metadata
- Download URL: trajlens-0.1.0-py3-none-any.whl
- Upload date:
- Size: 68.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.15
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
49d8dd073a9504486d19b03bb51314361bbf75ece5f3aaca9e7947a948ec9748
|
|
| MD5 |
4240028815418adab4756ef9227d16d0
|
|
| BLAKE2b-256 |
99dc1f43bb4e3d346d320e164b7f6048611ac48034c510e5579551939720d661
|