Loader for SmolVLA robotics datasets with deterministic train/val/test splits; LeRobot‑compatible, cached locally, downloads precompiled bundles from the Hugging Face Hub or rebuilds from a CSV.
smolvladataset
Simple, reliable loader for SmolVLA robotics datasets with built‑in train/val/test splits and caching.
This library accompanies the SmolVLA paper to help the community inspect how the training dataset is composed, rebuild it from source repositories, and customize the composition or split policy. It can either download a precompiled bundle from the Hugging Face Hub or rebuild locally from a CSV list of dataset repositories.
Features
- Reproducible train/val/test splits (deterministic seed)
- LeRobot‑compatible splits (LeRobotDataset interface)
- Automatic download and local caching (Hugging Face Hub)
- Optional precompiled dataset for fast startup
- Efficient Parquet storage with light schema normalization
Installation
pip install smolvladataset
# or using uv
uv add smolvladataset
Requirements
- Python 3.11+
- Core deps: pandas, pyarrow, huggingface-hub, lerobot, datasets
Quick Start
from smolvladataset import SmolVLADataset
# Returns (train, val, test) as LeRobot‑compatible datasets
train, val, test = SmolVLADataset()
print(len(train), len(val), len(test))
# Access a row (dict of columns)
sample = train[0]
API
- SmolVLADataset(csv_list=None, *, force_download=False, force_build=False, split_config=None)
  - Loads a precompiled bundle (default) or builds from a CSV of source repos and returns a tuple (train, val, test).
  - csv_list: Path to a CSV whose first column lists HF dataset repo IDs (e.g. org/name). If omitted, a packaged default is used.
  - force_download: With a custom csv_list, rebuild from sources even if cached; otherwise re‑download the precompiled bundle.
  - force_build: Only when csv_list is omitted: build from the default list instead of downloading the precompiled bundle.
  - split_config: Optional SplitConfig(train=..., val=..., test=..., seed=...).
- SplitConfig(train=0.8, val=0.1, test=0.1, seed=<int>)
  - Proportions must sum to 1.0. The seed controls deterministic shuffling.
  - If no split_config is provided, the default configuration matches the splits published on Hugging Face.
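To illustrate what "deterministic shuffling" with fixed proportions can look like, here is a minimal sketch using only the standard library. The helper split_indices, its default seed of 0, and the choice to split by row index are assumptions made for illustration; the library's own splitting logic may differ (for example, it may split by episode or by source dataset).

```python
import random


def split_indices(n_rows, train=0.8, val=0.1, test=0.1, seed=0):
    """Hypothetical helper: shuffle row indices with a fixed seed, then cut by proportion."""
    if abs(train + val + test - 1.0) > 1e-9:
        raise ValueError("train, val and test must sum to 1.0")
    indices = list(range(n_rows))
    # A dedicated Random instance keeps the result independent of global state.
    random.Random(seed).shuffle(indices)
    n_train = int(n_rows * train)
    n_val = int(n_rows * val)
    # The remainder goes to test so every row lands in exactly one split.
    return (
        indices[:n_train],
        indices[n_train:n_train + n_val],
        indices[n_train + n_val:],
    )


first = split_indices(100, seed=123)
second = split_indices(100, seed=123)
print([len(part) for part in first])  # [80, 10, 10]
print(first == second)                # True: same seed, same split
```

Assigning the remainder to the last split avoids dropping rows when the proportions do not divide the row count evenly.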
Advanced Usage
Custom Dataset List
# Use a CSV file with Hugging Face dataset repo IDs (a packaged default is used if omitted)
train, val, test = SmolVLADataset(csv_list="path/to/datasets.csv")
Custom Split Configuration
from smolvladataset import SmolVLADataset, SplitConfig
config = SplitConfig(train=0.7, val=0.15, test=0.15, seed=123)
train, val, test = SmolVLADataset(split_config=config)
Force Rebuild or Re‑download
# With a custom CSV, forces rebuild from sources
train, val, test = SmolVLADataset(csv_list="data/datasets.csv", force_download=True)
# With the default list, re‑download the precompiled bundle
train, val, test = SmolVLADataset(force_download=True)
# Build from default list instead of using the precompiled bundle
train, val, test = SmolVLADataset(force_build=True)
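The flag combinations above can be summarized as a small decision function. This is only a reading of the documented behavior, not the library's code: the name planned_action is hypothetical, and two choices are assumptions because the documentation does not cover them (force_build is ignored when a custom CSV is given, and force_build takes precedence over force_download when both are set with the default list).

```python
def planned_action(csv_list=None, force_download=False, force_build=False):
    """Hypothetical summary of the documented flag semantics."""
    if csv_list is not None:
        # Assumption: force_build is documented only for the default list,
        # so it has no effect here.
        if force_download:
            return "rebuild from the CSV sources, ignoring the cache"
        return "build from the CSV sources, reusing the cache when present"
    if force_build:
        # Assumption: force_build wins if force_download is also set.
        return "build from the packaged default list"
    if force_download:
        return "re-download the precompiled bundle"
    return "load the precompiled bundle, reusing the cache when present"


print(planned_action())
print(planned_action(csv_list="data/datasets.csv", force_download=True))
print(planned_action(force_build=True))
```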
Cache Directory
Artifacts are cached under ~/.cache/smolvladataset/<hash>/ by default, where <hash> depends on the dataset list.
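One way a cache key can depend on the dataset list is to hash the list's contents. The sketch below shows that idea only; the function cache_dir_for, the use of SHA-256, the 16-character truncation, and the order-insensitive canonical form are all assumptions, and the hash the library actually computes is not documented here.

```python
import hashlib
from pathlib import Path


def cache_dir_for(repo_ids, root=Path.home() / ".cache" / "smolvladataset"):
    """Hypothetical: derive a cache folder name from the dataset list contents."""
    # Assumption: sorting and de-duplicating makes the key independent of
    # the order in which repositories are listed.
    canonical = "\n".join(sorted(set(repo_ids)))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return Path(root) / digest


a = cache_dir_for(["org/dataset-1", "org/dataset-2"])
b = cache_dir_for(["org/dataset-2", "org/dataset-1"])
c = cache_dir_for(["org/dataset-1"])
print(a == b)  # True: same set of repositories
print(a == c)  # False: different list, different directory
```

Keying the directory on the list means changing the CSV yields a fresh cache instead of silently reusing artifacts built from a different composition.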
Dataset List Format
The library expects a CSV file whose first column contains Hugging Face dataset repository IDs:
dataset_repo_id
org/dataset-1
org/dataset-2
Lines beginning with # are ignored.
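A minimal sketch of reading this format with the standard csv module follows. The function read_repo_ids is hypothetical and not part of the package; in particular, treating any first-column value without a "/" as a header row is an assumption made for this example.

```python
import csv
import io


def read_repo_ids(text):
    """Hypothetical parser for the list format described above."""
    repo_ids = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # blank line
        first = row[0].strip()
        if not first or first.startswith("#"):
            continue  # empty cell or comment line
        if "/" not in first:
            continue  # assumption: a header such as dataset_repo_id has no "/"
        repo_ids.append(first)
    return repo_ids


example = "dataset_repo_id\n# staging sets\norg/dataset-1\norg/dataset-2,extra\n"
print(read_repo_ids(example))  # ['org/dataset-1', 'org/dataset-2']
```

Only the first column is read, so extra columns (notes, weights, and so on) can sit alongside the repo IDs without affecting the result.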
Cache Layout
- merged.parquet: Combined dataset from all sources (includes a dataset column)
- stats.parquet: Basic per‑source statistics
- train.parquet, validation.parquet, test.parquet: Split files (optional, for HF viewer convenience)
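To check which of these files a given cache directory actually contains, something like the following sketch can help. The helper cache_report is hypothetical; the file names come from the layout above, and the demo runs against a temporary directory rather than a real cache.

```python
import tempfile
from pathlib import Path

# File names taken from the cache layout described above.
EXPECTED_FILES = (
    "merged.parquet",
    "stats.parquet",
    "train.parquet",
    "validation.parquet",
    "test.parquet",
)


def cache_report(cache_dir):
    """Hypothetical helper: map each expected file name to whether it exists."""
    cache_dir = Path(cache_dir)
    return {name: (cache_dir / name).is_file() for name in EXPECTED_FILES}


# Demo against a throwaway directory that holds only merged.parquet.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "merged.parquet").touch()
    report = cache_report(tmp)

for name, present in report.items():
    print(f"{name}: {'present' if present else 'missing'}")
```

Since the split files are described as optional, a missing train.parquet, validation.parquet, or test.parquet does not necessarily indicate a broken cache.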
See LICENSE for details.