Embodied-AI multimodal dataset toolkit: unified reading, writing, validation, and format conversion

modelbest_robo_dataset

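A multimodal dataset toolkit for embodied AI. It provides unified reading, writing, validation, and format conversion.

Design philosophy

Differences from LeRobot

LeRobot is an excellent robot-learning framework. This library complements it at the data layer rather than replacing it: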

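	LeRobot	modelbest_robo_dataset
Positioning	End-to-end training framework (data + policy + deployment)	Pure data toolkit (format conversion + storage + reading)
Skeleton storage	Parquet table (one row per frame)	SSTable partition (one record per episode)
Data model	Flat table: one row per frame, all feature columns flattened	Nested structure: Episode → Message → Content, with three tracks (env/user/ai)
Metadata	info.json + tasks.jsonl	MetaContent (embedded in the skeleton, serialized with Pydantic)
Modalities	Video + state	Video + state + audio + force + IMU + language instructions
Annotations	task description	task_id, user_id, scene_id, quality_rating, dim_names
Video	Raw MP4, split into files by chunk	Per-episode MP4 (default) or per-frame PNG (optional); H264 CRF23, capped at 720p
Extensibility	Built around the HuggingFace Hub ecosystem	Built around the modelbest_sdk SSTable ecosystem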

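Why not use the LeRobot format directly

Three-track model: the env/user/ai message structure naturally supports human-robot interaction scenarios (spoken corrections from a human, spoken replies from the robot); LeRobot's flat table is a poor fit for this kind of nesting.
Heterogeneous sources: RH20T has force sensors and audio, fuse has an IMU and a tactile microphone, and RoboMIND has three robot variants. The skeleton has to be flexible enough to hold these differences.
Production-grade storage: SSTable partitions support random reads for large-scale distributed training, which suits scenarios with 100,000+ episodes better than a single Parquet file.
Self-describing TimeseriesName: ai.action.delta_cartesian_position itself states the control space and whether the value is absolute or a delta, so no extra type field is needed.

Compatibility with LeRobot

LeRobotSource can read datasets in LeRobot v3.0 format directly and convert them.
The TimeseriesName naming style (env.obs.* / ai.action.*) is compatible with LeRobot community conventions.
The design of dim_names follows features.*.names.motors in LeRobot's info.json.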

Architecture

modelbest_robo_dataset/
├── data_types.py              # Episode/Message/Content type definitions
├── lerobot_state_semantics.py # LeRobot observation.state semantics inference (joint vs. end effector)
├── writer.py                  # RawEpisode → unified format
├── reader.py                  # unified format → training samples
├── validator.py               # State/Action consistency validation
└── sources/
    ├── base.py        # RawEpisode + EpisodeSource interface
    ├── lerobot.py     # LeRobot v3.0
    ├── rh20t.py       # RH20T (Shanghai Jiao Tong University)
    ├── fuse.py        # fuse/DIGIT (TFRecord/RLDS)
    └── robomind.py    # RoboMIND (HDF5)

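LeRobot `observation.state` semantics inference

In LeRobot v2/v3, observation.state has no unified schema: the same vector may hold joint angles or an end-effector pose (position + rotation + gripper). When converting to this library, writer.STATE_TYPE_MAP has to distinguish joint_position (mapped to env.obs.joint_position) from ee_pose (mapped to env.obs.cartesian_position). The library infers this from metadata first and from sampled Parquet values second; the dimension count alone is not enough to decide.

Implementation and return value

Module: lerobot_state_semantics.py
API: infer_observation_state_semantics(root, max_sample_rows=5000) -> StateSemanticsResult
Fields: label (joint_position / ee_pose / unknown), confidence (0–1), reasons (descriptions of the rules that matched), state_key, shape, sample_dim_names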

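Rule priority (overview)

Feature key name (case-insensitive): an observation.state* key containing a substring such as eef, tcp, cartesian, pose, world_pose, ee_pose, or ee_ → ee_pose (high confidence).
Dimension names (consistent with LeRobotSource._extract_dim_names: supports names.motors, names.axes, or a top-level names list; a key that exists with a JSON null value is treated as missing, to avoid exceptions):
- Leans end effector: the set of dimension names contains x, y, and z together, or the name text contains quat, axis_angle, euler, rpy, rotation, orient, or similar;
- Leans joint: the names contain joint, shoulder, elbow, wrist, finger, or similar;
- Conflict: if end-effector and joint signals coexist and an x/y/z triple is present → lean ee_pose; otherwise fall through to the numeric heuristics.
- Weak signal: names of the form motor_0, motor_1, … alone are not conclusive and must be combined with Parquet sampling.
Numeric heuristics (read the state column from data/**/*.parquet; the total row count is capped by max_sample_rows, default 5000):
- Dimension ≥ 7: for two candidate quaternion blocks, the last 4 dimensions and columns 4–7, compute the mean of |‖q‖ - 1| and take the better one; if it is below 0.05 → ee_pose;
- Dimension = 6: if the first three dimensions have metre-scale magnitudes and the last three have radian-scale magnitudes → weak ee_pose signal (lower confidence);
- Only motor_* names or no names, and the samples do not contain a unit-quaternion block → joint_position;
- Missing meta/info.json, no observation.state* feature, no usable Parquet, and so on → unknown, with the cause recorded in reasons.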
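
The unit-quaternion check in the numeric heuristics can be sketched in a few lines. This is an independent illustration, not the code in lerobot_state_semantics.py: the function names are hypothetical, "columns 4–7" is read as 1-based (indices 3 to 6), and the rows are made-up samples rather than values read from Parquet.

```python
import math
from typing import Sequence

QUAT_NORM_TOL = 0.05  # threshold quoted in the rule list


def mean_unit_norm_error(rows: Sequence[Sequence[float]], start: int, stop: int) -> float:
    """Mean of | ||q|| - 1 | over the block row[start:stop] of every row."""
    errors = [abs(math.sqrt(sum(v * v for v in row[start:stop])) - 1.0) for row in rows]
    return sum(errors) / len(errors)


def looks_like_ee_pose(rows: Sequence[Sequence[float]]) -> bool:
    """Return True when one candidate 4-dim block behaves like a unit quaternion."""
    if not rows:
        return False
    dim = len(rows[0])
    if dim < 7:
        return False
    candidates = [
        (dim - 4, dim),  # last four dimensions
        (3, 7),          # columns 4-7, read here as 1-based (assumption)
    ]
    best = min(mean_unit_norm_error(rows, a, b) for a, b in candidates)
    return best < QUAT_NORM_TOL


# Position (x, y, z) + unit quaternion + gripper opening: 8 dimensions.
pose_rows = [
    [0.40, 0.10, 0.25, 0.0, 0.0, 0.0, 1.0, 0.03],
    [0.41, 0.11, 0.25, 0.0, 0.0, math.sin(0.1), math.cos(0.1), 0.03],
]
# Seven joint angles in radians: no 4-dim block has unit norm.
joint_rows = [
    [0.1, -0.5, 0.3, -1.8, 0.2, 1.4, 0.7],
    [0.1, -0.6, 0.3, -1.7, 0.2, 1.5, 0.7],
]
print(looks_like_ee_pose(pose_rows))   # True
print(looks_like_ee_pose(joint_rows))  # False
```

Trying two candidate blocks matters because a trailing gripper dimension shifts the quaternion away from the last four columns, as in the 8-dimensional rows above.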

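Limitations

Automatic inference cannot be guaranteed 100% correct (for example, the metadata may name every dimension motor_* while the values are actually an end-effector pose). Use confidence and reasons to drive spot checks, and specify or correct the label manually in the pipeline when necessary.

Note: the single-dataset and batch scripts load lerobot_state_semantics.py directly through importlib, bypassing the package-root __init__.py, so they run in environments without PyAV installed. In an environment with the full dependencies installed, the Python API works as usual: from modelbest_robo_dataset.lerobot_state_semantics import infer_observation_state_semantics.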
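
Loading one file through importlib without touching the package root generally follows the standard-library recipe below. This is a sketch of the technique, not a copy of what the scripts do: load_module_from_path is a hypothetical helper, and a temporary stand-in file replaces lerobot_state_semantics.py so the example runs anywhere.

```python
import importlib.util
import sys
import tempfile
from pathlib import Path
from types import ModuleType


def load_module_from_path(path: Path, module_name: str) -> ModuleType:
    """Import a single .py file without importing its parent package."""
    spec = importlib.util.spec_from_file_location(module_name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot build an import spec for {path}")
    module = importlib.util.module_from_spec(spec)
    # Register before executing so code that looks itself up in sys.modules works.
    sys.modules[module_name] = module
    spec.loader.exec_module(module)
    return module


# Stand-in file; in the repository the path would point at
# lerobot_state_semantics.py instead.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "standalone_demo.py"
    demo_path.write_text("def answer():\n    return 'loaded without the package'\n")
    demo = load_module_from_path(demo_path, "standalone_demo")
    print(demo.answer())  # loaded without the package
```

Because the package's __init__.py never runs, heavy imports it would trigger are skipped; the loaded file itself must still only import what is installed.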

Command line

# A single LeRobot dataset root (must contain meta/info.json)
python scripts/infer_lerobot_state.py /path/to/lerobot_dataset
python scripts/infer_lerobot_state.py /path/to/lerobot_dataset --json --max-rows 2000

# Under a parent directory: run inference once for each direct subdirectory that contains meta/info.json
python scripts/batch_infer_lerobot_state.py /path/to/parent \
  --output-format csv -o lerobot_state_summary.csv

# Recursively find every meta/info.json (dataset root = parent of meta)
python scripts/batch_infer_lerobot_state.py /path/to/parent --recursive --output-format jsonl

batch_infer_lerobot_state.py also supports --output-format table|json and -o to write to a file.

Python usage example

from pathlib import Path
from modelbest_robo_dataset.lerobot_state_semantics import infer_observation_state_semantics

r = infer_observation_state_semantics(Path("/path/to/lerobot_dataset"))
print(r.label, r.confidence, r.reasons)
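
One way to act on confidence and reasons when checking many datasets is to flag the uncertain ones for manual review. The sketch below assumes only the label, confidence, and reasons fields; ResultStub is a stand-in for StateSemanticsResult, and the 0.8 threshold and dataset names are illustrative choices, not library defaults.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ResultStub:
    """Stand-in with the label/confidence/reasons fields of StateSemanticsResult."""
    label: str
    confidence: float
    reasons: List[str] = field(default_factory=list)


def needs_manual_review(result, min_confidence: float = 0.8) -> bool:
    """Flag unknown labels and low-confidence guesses for a human spot check."""
    return result.label == "unknown" or result.confidence < min_confidence


results = {
    "ds_a": ResultStub("ee_pose", 0.95, ["feature key contains 'eef'"]),
    "ds_b": ResultStub("joint_position", 0.55, ["motor_* names only"]),
    "ds_c": ResultStub("unknown", 0.0, ["meta/info.json missing"]),
}
to_review = sorted(name for name, r in results.items() if needs_manual_review(r))
print(to_review)  # ['ds_b', 'ds_c']
```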

Data format

Output layout:

output_dir/
├── skeleton_episode/{name}/part-XXXXX       # SSTable partition skeleton
├── data/{name}/state/chunk-000/file-000.parquet
├── data/{name}/action/chunk-000/file-000.parquet
├── videos/{name}/{cam}/episode_000000.mp4   # default: whole-episode MP4, H264 CRF23, capped at 720p
│   or episode_000000.png                    # per-frame mode: one PNG per timestep
└── meta/{name}/info.json
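
For downstream scripts that need to locate media files, the layout above can be turned into a small path helper. This is a sketch under assumptions: episode_media_path is a hypothetical function, the six-digit zero padding is inferred from episode_000000, and the camera name top is made up.

```python
from pathlib import PurePosixPath


def episode_media_path(output_dir: str, name: str, cam: str,
                       episode_index: int, per_frame: bool = False) -> PurePosixPath:
    """Build videos/{name}/{cam}/episode_XXXXXX.(mp4|png) under output_dir."""
    suffix = "png" if per_frame else "mp4"
    filename = f"episode_{episode_index:06d}.{suffix}"
    return PurePosixPath(output_dir) / "videos" / name / cam / filename


print(episode_media_path("/path/to/output", "pusht", "top", 0))
# /path/to/output/videos/pusht/top/episode_000000.mp4
print(episode_media_path("/path/to/output", "pusht", "top", 12, per_frame=True))
# /path/to/output/videos/pusht/top/episode_000012.png
```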

Episode skeleton

Each Episode contains three kinds of messages:

Role	Content	Description
`env`	MetaContent, VideoContent, TimeseriesContent(state), AudioContent	Environment perception
`user`	TextContent, AudioContent	Human intervention
`assistant`	TimeseriesContent(action), TextContent	Robot output
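
The Episode → Message → Content nesting can be pictured with plain dataclasses. The library's real types live in data_types.py and use Pydantic; the classes and fields below are simplified stand-ins written for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Content:
    kind: str      # e.g. "meta", "video", "timeseries", "text", "audio"
    payload: dict  # simplified; the library has one class per content type


@dataclass
class Message:
    role: str  # "env", "user", or "assistant"
    contents: List[Content] = field(default_factory=list)


@dataclass
class Episode:
    messages: List[Message] = field(default_factory=list)

    def by_role(self) -> Dict[str, List[Content]]:
        """Group every content item under the role of its message."""
        grouped: Dict[str, List[Content]] = {}
        for message in self.messages:
            grouped.setdefault(message.role, []).extend(message.contents)
        return grouped


episode = Episode(messages=[
    Message("env", [Content("meta", {"task_id": "task_0001"}),
                    Content("timeseries", {"name": "env.obs.joint_position"})]),
    Message("user", [Content("text", {"text": "move a little to the left"})]),
    Message("assistant", [Content("timeseries", {"name": "ai.action.joint_position"})]),
])
print({role: [c.kind for c in items] for role, items in episode.by_role().items()})
# {'env': ['meta', 'timeseries'], 'user': ['text'], 'assistant': ['timeseries']}
```

A flat one-row-per-frame table has no natural place for the user text message; in the nested form it is simply another message in the sequence.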

TimeseriesName naming convention

Names are dot-separated and follow env.obs.* / ai.action.*:

key	Description
`env.obs.joint_position`	Joint angles (absolute)
`env.obs.cartesian_position`	End-effector pose (absolute)
`env.obs.gripper_position`	Gripper opening
`env.obs.force_torque`	Force/torque
`env.obs.imu`	IMU
`ai.action.joint_position`	Absolute target joint angles
`ai.action.cartesian_position`	Absolute target end-effector pose
`ai.action.delta_joint_position`	Joint-angle delta
`ai.action.delta_cartesian_position`	End-effector pose delta

TimeseriesName itself distinguishes absolute from delta, so no extra action_type field is needed.
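
Because the name carries the semantics, a consumer can recover them by parsing the string. The parser below is a hypothetical helper written for this illustration, assuming every name has exactly three dot-separated parts as in the table; it is not part of the library's API.

```python
from typing import NamedTuple


class ParsedName(NamedTuple):
    track: str      # "env" or "ai"
    kind: str       # "obs" or "action"
    quantity: str   # e.g. "cartesian_position"
    is_delta: bool  # True when the quantity carried the delta_ prefix


def parse_timeseries_name(name: str) -> ParsedName:
    """Split 'track.kind.quantity' and detect the delta_ prefix."""
    parts = name.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected three dot-separated parts, got {name!r}")
    track, kind, quantity = parts
    is_delta = quantity.startswith("delta_")
    if is_delta:
        quantity = quantity[len("delta_"):]
    return ParsedName(track, kind, quantity, is_delta)


print(parse_timeseries_name("ai.action.delta_cartesian_position"))
# ParsedName(track='ai', kind='action', quantity='cartesian_position', is_delta=True)
print(parse_timeseries_name("env.obs.joint_position").is_delta)
# False
```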

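MetaContent structured fields

Field	Type	Description
`task_id`	str	Task ID (e.g. "task_0001", parsed from the directory name)
`quality_rating`	int	Quality rating: 0 = robot failure, 1 = task failure, 2-9 = completion quality, -1 = unlabeled
`user_id`	str	Operator ID (e.g. "user_0001")
`scene_id`	str	Scene ID (e.g. "scene_0001")
`dim_names`	dict	key = TimeseriesName, value = list of dimension names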
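
When filtering episodes by quality_rating, the integer codes in the table can be decoded as follows. Both functions are hypothetical helpers, and treating unlabeled episodes as non-successes is a choice made for this sketch rather than something the library prescribes.

```python
def describe_quality(rating: int) -> str:
    """Map a quality_rating integer to the category given in the table."""
    if rating == -1:
        return "unlabeled"
    if rating == 0:
        return "robot failure"
    if rating == 1:
        return "task failure"
    if 2 <= rating <= 9:
        return "completed"
    raise ValueError(f"quality_rating out of range: {rating}")


def is_success(rating: int) -> bool:
    """True only for ratings 2-9; unlabeled episodes are not counted as successes."""
    return describe_quality(rating) == "completed"


ratings = [-1, 0, 1, 2, 9]
print([describe_quality(r) for r in ratings])
# ['unlabeled', 'robot failure', 'task failure', 'completed', 'completed']
print([r for r in ratings if is_success(r)])
# [2, 9]
```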

Quick start

Convert a dataset

from modelbest_robo_dataset import EmbodiedWriter
from modelbest_robo_dataset.sources import LeRobotSource

source = LeRobotSource(root="/path/to/lerobot/pusht", name="pusht")
writer = EmbodiedWriter(output_dir="/path/to/output", dataset_name="pusht")

for ep_id in source.list_episodes():
    raw = source.load_episode(ep_id)
    writer.write_episode(raw)
writer.finalize()

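Per-frame episodes (one skeleton record per timestep + single-frame PNG)

By default, each source trajectory maps to one Episode, and the video is a whole-episode episode_XXXXXX.mp4. To write a separate Episode for every timestep, with each camera saved as a single PNG instead of an MP4, use EmbodiedWriter(..., one_frame_per_episode=True).

Optional: call set_shuffled_episode_indices(total_frames, seed) before writing to assign each frame-level Episode a shuffled episode_index (kept consistent with the Parquet rows and PNG file names). For LeRobot, LeRobotSource.episode_frame_counts() gives the total frame count up front (it must use the same truncation as list_episodes, for example the same max_episodes).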

from modelbest_robo_dataset import EmbodiedWriter
from modelbest_robo_dataset.sources.lerobot import LeRobotSource

source = LeRobotSource(root="/path/to/lerobot_ds", name="my_ds")
episodes = source.list_episodes()
counts = source.episode_frame_counts()
total_frames = sum(counts)

writer = EmbodiedWriter(
    output_dir="/path/to/output",
    dataset_name="my_ds",
    one_frame_per_episode=True,
)
writer.set_shuffled_episode_indices(total_frames, seed=42)

for ep_id in episodes:
    writer.write_episode(source.load_episode(ep_id))
writer.finalize()

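Notes:

When the state, action, and per-camera frame counts differ, they are aligned to the shortest length and a warning is logged.
With only video_files and no video_frames, the library does not decode the MP4 frame by frame automatically; the output may contain time series without images.
For multi-frame trajectories, the whole-episode audio_env is not written (only single-frame trajectories keep the original behavior).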
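
The shortest-length alignment in the first note amounts to the following. This is a sketch of the idea, not the writer's code: align_to_shortest and the stream names are hypothetical, and dropping the trailing entries (rather than leading ones) is an assumption.

```python
import logging
from typing import Dict, Sequence

logger = logging.getLogger("align_demo")


def align_to_shortest(streams: Dict[str, Sequence]) -> Dict[str, Sequence]:
    """Truncate every stream to the shortest length, warning on a mismatch."""
    lengths = {name: len(seq) for name, seq in streams.items()}
    shortest = min(lengths.values())
    if len(set(lengths.values())) > 1:
        logger.warning("length mismatch %s; truncating to %d", lengths, shortest)
    return {name: seq[:shortest] for name, seq in streams.items()}


aligned = align_to_shortest({
    "state": [0.0, 0.1, 0.2, 0.3],
    "action": [0.1, 0.2, 0.3],
    "cam_top": ["f0", "f1", "f2", "f3", "f4"],
})
print({name: len(seq) for name, seq in aligned.items()})
# {'state': 3, 'action': 3, 'cam_top': 3}
```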

Read data

from modelbest_robo_dataset import EmbodiedReader

reader = EmbodiedReader("/path/to/output", "pusht")
reader.summary()

sample = reader.load_sample(episode_idx=0, timestamp=1.0)
print(sample.keys())

Command-line conversion

# LeRobot (all)
python robo_dataset/scripts/convert.py --source lerobot --all --output /path/to/output

# RH20T (all)
python robo_dataset/scripts/convert.py --source rh20t --all --output /path/to/output

# fuse
python robo_dataset/scripts/convert.py --source fuse --output /path/to/output

# RoboMIND (supports three variants, puppet/franka/tiangong, detected automatically)
python robo_dataset/scripts/convert.py --source robomind \
  --input /path/to/failure_data --output /path/to/output --name robomind_failure

# Full RoboMIND conversion (about 4 hours)
python robo_dataset/scripts/convert.py --source robomind \
  --input /backup/.../robomind/failure_data --output /path/to/output \
  --name robomind_failure --max-episodes 1678

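LeRobot: per-frame PNG + shuffled `episode_index`

For cases where each sample should correspond to one image frame (videos/.../episode_XXXXXX.png) and the episode_index values across the whole dataset should be shuffled randomly (and reproducibly).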

python robo_dataset/scripts/convert.py --source lerobot \
  --input /path/to/lerobot_dataset --output /path/to/output --name my_ds \
  --one-frame-per-episode --shuffle-seed 42

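(If this repository is the root directory, replace robo_dataset/scripts/convert.py in the commands with scripts/convert.py.)

Option	Description
`--one-frame-per-episode`	Write one Episode per timestep; cameras are output as PNG instead of a whole-episode MP4.
`--shuffle-seed N`	Must be used together with the previous option. Before writing, the total frame count T is computed from the data tables, and the `episode_index` values `0..T-1` are randomly permuted with the fixed seed N.

Note: --shuffle-seed relies on the data source's episode_frame_counts(); currently LeRobot implements it. Do not use --shuffle-seed with any other --source that does not implement the method.

Skeleton records are still appended to the SSTable one source Episode after another; what is shuffled is the episode_index inside each record and the corresponding Parquet/image paths, not the physical order of records on disk.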
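
The difference between append order and assigned index can be seen in a small simulation. This sketch uses the standard-library random module, which may not be the generator the writer uses, so the concrete permutation will not necessarily match the library's output for the same seed; the frame counts are made up.

```python
import random
from typing import List


def shuffled_episode_indices(total_frames: int, seed: int) -> List[int]:
    """Return a reproducible permutation of 0..total_frames-1."""
    indices = list(range(total_frames))
    random.Random(seed).shuffle(indices)
    return indices


frame_counts = [3, 2]  # hypothetical: two source episodes with 3 and 2 frames
perm = shuffled_episode_indices(sum(frame_counts), seed=42)

# Records are still appended in source order; only the assigned index changes.
write_position = 0
for source_episode, n_frames in enumerate(frame_counts):
    for frame in range(n_frames):
        episode_index = perm[write_position]
        print(f"append #{write_position}: source ep {source_episode} frame {frame} "
              f"-> episode_{episode_index:06d}.png")
        write_position += 1
```

Calling the function again with the same total and seed returns the same permutation, which is what makes the shuffle reproducible.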

Supported datasets

Dataset	Source format	Robot	State (env.obs)	Action (ai.action)	Modalities	Structured annotations
pusht	LeRobot	-	cartesian_position	cartesian_position	video + time series	-
xarm_push_medium	LeRobot	xarm	joint_position	delta_joint_position	video + time series	-
aloha_sim_insertion	LeRobot	aloha	joint_position	joint_position	video + time series	-
fuse	TFRecord	DIGIT	cartesian_position, imu	delta_cartesian_position	video + audio + IMU	-
rh20t_cfg1~7	RH20T	various	cartesian_position, force_torque	cartesian_position	video + audio + force	task_id, user_id, scene_id, quality_rating
robomind_failure	HDF5	tiangong/puppet/franka	joint_position	joint_position	multi-video + time series	task_id, quality_rating=0 (failure data)
robomind_puppet	HDF5	puppet (dual-arm)	joint_position	joint_position	multi-video + time series	task_id
robomind_franka	HDF5	Franka	joint_position	joint_position	multi-video + time series	task_id

Dependencies

numpy
pyarrow
pydantic>=2.0
av (PyAV)
Pillow
h5py              # RoboMIND
tensorflow        # fuse (optional)
modelbest_sdk     # SSTable skeleton storage
thriftpy2         # modelbest_sdk dependency

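State/Action specification

See docs/sop_state_action.md for details.

Core principles:

Actions must use the raw data; constructing them with np.diff or similar is prohibited.
For teleoperation data without raw actions, use action[t] = state[t+1] (shifted state).
TimeseriesName is self-describing: the delta_ prefix means a delta, and no prefix means an absolute target.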
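
The shifted-state rule for teleoperation data without raw actions can be written out as below. This is a minimal sketch: shifted_state_actions is a hypothetical helper, and dropping the final state (which has no successor) is one possible way to keep the two sequences the same length; docs/sop_state_action.md may specify a different treatment.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def shifted_state_actions(states: Sequence[Vector]) -> Tuple[List[Vector], List[Vector]]:
    """Pair state[t] with action[t] = state[t+1]; the final state has no successor."""
    if len(states) < 2:
        raise ValueError("need at least two states to build a shifted-state action")
    observations = list(states[:-1])
    actions = list(states[1:])
    return observations, actions


states = [[0.0, 0.0], [0.1, 0.0], [0.1, 0.2], [0.3, 0.2]]
obs, act = shifted_state_actions(states)
print(len(obs), len(act))  # 3 3
print(act[0])              # [0.1, 0.0]
```

Actions built this way are absolute targets, so under the naming convention they would take a name without the delta_ prefix.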

Release history

0.3.1	Apr 28, 2026
0.3.0	Apr 28, 2026
0.2.0	Apr 27, 2026
0.1.0	Apr 24, 2026 (this version)

