Bidirectional converter and validator for AgiBot World ↔ LeRobot v3 datasets.
embodied-data
What it does
- Bidirectional conversion between AgiBot World (DigitalWorld sim + Beta/Alpha real hardware) and LeRobot v3.
- Schema-detect dispatcher: point convert at any AgiBot root and the right reader fires automatically.
- Five-check validator: schema conformance, fps consistency, timestamp monotonicity, action-dim consistency, frame ↔ video alignment.
- Batch + resume: --max-episodes for parallel conversion, meta/uuid_map.parquet for restartable jobs.
- Stdlib-first: h5py + pyarrow + av; no PyTorch dependency in the data path.
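To make the schema-detect dispatcher idea concrete, here is a minimal sketch of how layout detection plus dispatch could work. It is not the package's actual implementation: the function names, the reader stubs, and the choice of marker files (meta/info.json for LeRobot, task_info_*.json or a proprio_stats/ directory for AgiBot Beta) are assumptions made for illustration.

```python
from pathlib import Path
from typing import Callable


# Hypothetical reader stubs; the real readers in embodied-data are not shown here.
def read_lerobot(root: Path) -> str:
    return f"lerobot reader for {root.name}"


def read_agibot_beta(root: Path) -> str:
    return f"agibot-beta reader for {root.name}"


def detect_schema(root: Path) -> str:
    """Guess the dataset layout from marker files (assumed markers, not the tool's rules)."""
    if (root / "meta" / "info.json").is_file():
        return "lerobot"
    if any(root.glob("task_info_*.json")) or (root / "proprio_stats").is_dir():
        return "agibot-beta"
    raise ValueError(f"unrecognised dataset layout under {root}")


# Registry mapping a detected schema name to the reader that handles it.
READERS: dict[str, Callable[[Path], str]] = {
    "lerobot": read_lerobot,
    "agibot-beta": read_agibot_beta,
}


def dispatch(root: Path) -> str:
    """Detect the layout, then call the matching reader."""
    return READERS[detect_schema(root)](root)
```

Keeping detection separate from the reader registry means a new variant only needs one extra marker rule and one extra registry entry.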
Quick start
LeRobot's pusht is the fastest end-to-end check (no HuggingFace gating, ~30 s):
pip install --upgrade embodied-data
huggingface-cli download lerobot/pusht --repo-type dataset --local-dir ./pusht
embodied-data preview ./pusht
embodied-data validate ./pusht
preview prints a per-episode stats table; validate runs all five checks and exits non-zero on failure.
Real AgiBot data (HuggingFace gated)
AgiBot World Beta and Alpha live on HuggingFace under a gated license. Request access on the AgiBotWorld-Beta page first, then:
huggingface-cli login
huggingface-cli download agibot-world/AgiBotWorld-Beta \
--repo-type dataset \
--include "task_info_675.json" "observations/675/936938/**" "proprio_stats/675/936938.h5" \
--local-dir ./agibot_beta_root
embodied-data convert \
./agibot_beta_root/675/936938 \
/tmp/beta_v3 \
--from agibot --to lerobot-v3
embodied-data validate /tmp/beta_v3
For batch conversion of a whole task, point convert at the task root and pass --max-episodes N. Streaming-extraction tips for partial Beta downloads are in docs/schema/beta.md.
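The restartable-job behaviour can be pictured with a small sketch. Everything here is hypothetical: the tool stores its map in meta/uuid_map.parquet, whereas this example swaps in a JSON file to stay dependency-free, and it treats max_episodes as a simple per-run cap, which is an assumption about the flag rather than documented behaviour.

```python
import json
from pathlib import Path
from typing import Callable, Iterable, Optional


def convert_batch(
    episode_ids: Iterable[str],
    convert_one: Callable[[str], str],
    map_path: Path,
    max_episodes: Optional[int] = None,
) -> list[str]:
    """Convert episodes not yet recorded in the map file; return the ones converted now."""
    # Load the record of finished episodes, if a previous run left one behind.
    done: dict[str, str] = json.loads(map_path.read_text()) if map_path.exists() else {}
    converted: list[str] = []
    for episode in episode_ids:
        if max_episodes is not None and len(converted) >= max_episodes:
            break
        if episode in done:
            continue  # already converted in an earlier run
        done[episode] = convert_one(episode)
        # Persist after every episode so an interrupted job loses at most one.
        map_path.parent.mkdir(parents=True, exist_ok=True)
        map_path.write_text(json.dumps(done))
        converted.append(episode)
    return converted
```

Calling convert_batch a second time with the same map file skips everything the first call finished, which is the property that makes a long job safe to interrupt.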
Validation example
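As an illustration of what two of the five checks involve, the sketch below tests timestamp monotonicity and fps consistency on a plain list of timestamps. It is a simplified stand-in, not the validator's code: the function names, the tolerance value, and the result format are assumptions.

```python
def check_monotonic(timestamps: list[float]) -> bool:
    """True when each timestamp is strictly greater than the previous one."""
    return all(b > a for a, b in zip(timestamps, timestamps[1:]))


def check_fps(timestamps: list[float], fps: float, tol: float = 1e-3) -> bool:
    """True when every frame interval is within `tol` seconds of 1 / fps."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    expected = 1.0 / fps
    return all(
        abs((b - a) - expected) <= tol
        for a, b in zip(timestamps, timestamps[1:])
    )


def run_checks(timestamps: list[float], fps: float) -> dict[str, bool]:
    """Run both checks and report pass/fail per check name."""
    return {
        "timestamp_monotonicity": check_monotonic(timestamps),
        "fps_consistency": check_fps(timestamps, fps),
    }


def exit_code(results: dict[str, bool]) -> int:
    """Mirror the CLI convention described above: non-zero when any check fails."""
    return 0 if all(results.values()) else 1
```

For example, run_checks([0.0, 0.1, 0.1, 0.3], fps=10) fails both checks because the third timestamp repeats the second.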
Why this exists
Robotics researchers spend days rewriting the same dataset conversion scripts. AgiBot World's official convert_to_lerobot.py has carried unresolved issues for months; LeRobot's v2.0 / v2.1 / v3.0 formats are mutually incompatible; every lab writes its own timestamp alignment check. This tool is the shared layer that stops that rework.
Concrete upstream issues this project addresses or works around:
- AgiBot-World #18: task_info_*.json lookup ambiguity for sub-roots
- AgiBot-World #124: Beta vs Alpha schema divergence
- AgiBot-World #149: proprio HDF5 key drift across batches
- lerobot #2158: v2 ↔ v3 episode-index incompatibility
- lerobot #2689: fps/timestamp validation gap
Roadmap
- v0.3 (shipped on main, awaiting tag): observation.images.head_color video for Beta / Alpha (single + batch) so v3 datasets are usable for VLA training end-to-end.
- v0.3.x patches: multi-camera (fisheye / hand / back), sparse action/*/index masks, end-pose flattening, reverse Beta path (see docs/v0.3.x-patches.md).
- v0.4+: ALOHA HDF5 ingest, RLDS export, OpenX Embodiment alignment.
Cross-embodiment action-space retargeting and Chinese prompt embedding remain explicit non-goals.
Schema reference
- docs/schema/overview.md: AgiBot variant matrix
- docs/schema/digitalworld.md: DigitalWorld (sim) layout
- docs/schema/beta.md: Beta / Alpha (real hardware) layout
- docs/schema-lerobot-v3.md: LeRobot v3 target schema
Install
pip install embodied-data
embodied-data --help
Python 3.12+ required.
Development
git clone https://github.com/allenwu-blip/embodied-data.git
cd embodied-data
uv sync
uv run pytest
Coverage
- 56 commits, 3 PyPI releases (0.1.0 / 0.1.1 / 0.2.0); v0.3 staged on main
- 114 passing tests + 1 skipped (gated dataset)
- 4 upstream issue threads engaged
- 4 HuggingFace datasets exercised end-to-end (lerobot/pusht, AgiBotWorld-Beta, AgiBotWorld-Alpha, agibot-world/agibot_digital_world)
Acknowledgments
- HuggingFace LeRobot team for the v3 schema and reference datasets
- OpenDriveLab AgiBot World team for releasing Beta and Alpha under HF gating
License
MIT — see LICENSE.
Contact
Bug reports and feature requests: GitHub Issues.