Loader for SmolVLA robotics datasets with deterministic train/val/test splits. It is LeRobot‑compatible, caches data locally, and either downloads a precompiled bundle from the Hugging Face Hub or rebuilds the dataset from a CSV of source repositories.

smolvladataset

Simple, reliable loader for SmolVLA robotics datasets with built‑in train/val/test splits and caching.

This library accompanies the SmolVLA paper to help the community inspect how the training dataset is composed, rebuild it from source repositories, and customize the composition or split policy. It can either download a precompiled bundle from the Hugging Face Hub or rebuild locally from a CSV list of dataset repositories.

Features

  • Reproducible train/val/test splits (deterministic seed)
  • LeRobot‑compatible splits (LeRobotDataset interface)
  • Automatic download and local caching (Hugging Face Hub)
  • Optional precompiled dataset for fast startup
  • Efficient Parquet storage with light schema normalization

Installation

pip install smolvladataset
# or using uv
uv add smolvladataset

Requirements

  • Python 3.11+
  • Core deps: pandas, pyarrow, huggingface-hub, lerobot, datasets

Quick Start

from smolvladataset import SmolVLADataset

# Returns (train, val, test) as LeRobot‑compatible datasets
train, val, test = SmolVLADataset()
print(len(train), len(val), len(test))

# Access a row (dict of columns)
sample = train[0]

API

  • SmolVLADataset(csv_list=None, *, force_download=False, force_build=False, split_config=None)

    • Loads a precompiled bundle (default) or builds from a CSV of source repos and returns a tuple (train, val, test).
    • csv_list: Path to CSV whose first column lists HF dataset repo IDs (e.g. org/name). If omitted, a packaged default is used.
    • force_download: With a custom csv_list, rebuilds from the source repositories even if a cached build exists. Without csv_list, re‑downloads the precompiled bundle.
    • force_build: Applies only when csv_list is omitted. Builds from the packaged default list instead of downloading the precompiled bundle.
    • split_config: Optional SplitConfig(train=..., val=..., test=..., seed=...).
  • SplitConfig(train=0.8, val=0.1, test=0.1, seed=<int>)

    • Proportions must sum to 1.0. The seed controls deterministic shuffling.
    • If no split_config is provided, the default configuration matches the splits published on Hugging Face.
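
To make the split rules above concrete, here is a minimal sketch of a seeded split over row indices. It is not the library's implementation: `SplitSketch` and `split_indices` are hypothetical names, and the shuffle-then-slice strategy is an assumption about how a deterministic split could work.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitSketch:
    """Hypothetical stand-in for SplitConfig (not the library's class)."""
    train: float = 0.8
    val: float = 0.1
    test: float = 0.1
    seed: int = 0

    def __post_init__(self):
        # Mirror the documented rule: proportions must sum to 1.0.
        total = self.train + self.val + self.test
        if not math.isclose(total, 1.0, abs_tol=1e-9):
            raise ValueError(f"proportions must sum to 1.0, got {total}")


def split_indices(n, cfg):
    """Shuffle range(n) with a seeded RNG, then slice into three parts."""
    indices = list(range(n))
    # A private Random instance keeps the result independent of global state.
    random.Random(cfg.seed).shuffle(indices)
    n_train = round(n * cfg.train)
    n_val = round(n * cfg.val)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder goes to test
    return train, val, test
```

Calling `split_indices(100, SplitSketch(seed=123))` twice yields identical index lists, which is the property the seed is meant to guarantee.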

Advanced Usage

Custom Dataset List

# Use a CSV file with Hugging Face dataset repo IDs (a packaged default is used if omitted)
train, val, test = SmolVLADataset(csv_list="path/to/datasets.csv")

Custom Split Configuration

from smolvladataset import SmolVLADataset, SplitConfig

config = SplitConfig(train=0.7, val=0.15, test=0.15, seed=123)
train, val, test = SmolVLADataset(split_config=config)

Force Rebuild or Re‑download

# With a custom CSV, forces rebuild from sources
train, val, test = SmolVLADataset(csv_list="data/datasets.csv", force_download=True)

# With the default list, re‑download the precompiled bundle
train, val, test = SmolVLADataset(force_download=True)

# Build from default list instead of using the precompiled bundle
train, val, test = SmolVLADataset(force_build=True)
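
The interaction between csv_list, force_download, and force_build can be summarized as a small decision function. This is a sketch of the documented semantics only: `resolve_action` and its return strings are hypothetical, and the precedence of force_build over force_download when both are set is an assumption, since the documentation does not cover that case.

```python
def resolve_action(csv_list=None, *, force_download=False, force_build=False):
    """Return a label for what the loader is documented to do (illustrative)."""
    if csv_list is not None:
        # force_build is documented as applying only when csv_list is omitted,
        # so it is ignored on this branch.
        if force_download:
            return "rebuild_from_csv"
        return "use_cache_or_build_from_csv"
    if force_build:
        # Assumption: force_build wins if both flags are set.
        return "build_from_default_list"
    if force_download:
        return "redownload_precompiled"
    return "use_cache_or_download_precompiled"
```

For example, `resolve_action("data/datasets.csv", force_download=True)` returns `"rebuild_from_csv"`, matching the first snippet above.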

Cache Directory

Artifacts are cached under ~/.cache/smolvladataset/<hash>/ by default, where <hash> depends on the dataset list.
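
As an illustration of a cache key that depends on the dataset list, the sketch below derives a directory name from the repo IDs. The actual hashing scheme is not documented here, so everything in it is an assumption: the `cache_dir_for` helper, the use of SHA-256, the 16-character truncation, and the order-insensitive canonical form.

```python
import hashlib
from pathlib import Path


def cache_dir_for(repo_ids, root="~/.cache/smolvladataset"):
    """Map a list of repo IDs to a cache directory (hypothetical scheme)."""
    # Sort and de-duplicate so the same set of repos always maps to one key.
    canonical = "\n".join(sorted(set(repo_ids)))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return Path(root).expanduser() / digest
```

With this scheme, changing the dataset list changes the directory, so builds from different lists never overwrite each other.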

Dataset List Format

The library expects a CSV file whose first column contains Hugging Face dataset repository IDs:

dataset_repo_id
org/dataset-1
org/dataset-2

Lines beginning with # are ignored.
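
To check a list before passing it as csv_list, a small standard-library reader along these lines may help. It is a sketch, not the library's parser: `read_repo_ids` is a hypothetical helper, and treating the first non-comment row as a header is an assumption based on the example above.

```python
import csv


def read_repo_ids(text, has_header=True):
    """Extract first-column repo IDs from CSV text, skipping # lines."""
    # Drop blank lines and lines beginning with '#', as the format describes.
    lines = [
        line for line in text.splitlines()
        if line.strip() and not line.startswith("#")
    ]
    rows = list(csv.reader(lines))
    if has_header and rows:
        rows = rows[1:]  # assumption: first remaining row is the header
    return [row[0].strip() for row in rows if row and row[0].strip()]
```

Usage: `read_repo_ids(open("path/to/datasets.csv", encoding="utf-8").read())` returns the repo IDs in file order.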

Cache Layout

  • merged.parquet — Combined dataset from all sources (includes a dataset column)
  • stats.parquet — Basic per‑source statistics
  • train.parquet, validation.parquet, test.parquet — Split files (optional, for HF viewer convenience)
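
To see which of these artifacts exist locally, the cache root can be scanned with the standard library. This is a sketch under the layout described above; `inspect_cache` is a hypothetical helper and not part of the package.

```python
from pathlib import Path

# File names taken from the cache layout described above.
EXPECTED_FILES = (
    "merged.parquet",
    "stats.parquet",
    "train.parquet",
    "validation.parquet",
    "test.parquet",
)


def inspect_cache(root="~/.cache/smolvladataset"):
    """Return {hash_dir_name: {file_name: exists}} for each cache entry."""
    root = Path(root).expanduser()
    report = {}
    if not root.is_dir():
        return report  # nothing cached yet
    for entry in sorted(root.iterdir()):
        if entry.is_dir():
            report[entry.name] = {
                name: (entry / name).is_file() for name in EXPECTED_FILES
            }
    return report
```

Because the split files are optional, a `False` for train.parquet, validation.parquet, or test.parquet does not by itself indicate a broken cache.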

License

See LICENSE for details.
