Loader for SmolVLA robotics datasets with deterministic train/val/test splits; LeRobot‑compatible, cached locally, downloads precompiled bundles from the Hugging Face Hub or rebuilds from a CSV.

smolvladataset

Simple, reliable loader for SmolVLA robotics datasets with built‑in train/val/test splits and caching.

This library accompanies the SmolVLA paper to help the community inspect how the training dataset is composed, rebuild it from source repositories, and customize the composition or split policy. It can either download a precompiled bundle from the Hugging Face Hub or rebuild locally from a CSV list of dataset repositories.

Features

  • Reproducible train/val/test splits (deterministic seed)
  • LeRobot‑compatible splits (LeRobotDataset interface)
  • Automatic download and local caching (Hugging Face Hub)
  • Optional precompiled dataset for fast startup
  • Efficient Parquet storage with light schema normalization

Installation

pip install smolvladataset
# or using uv
uv add smolvladataset

Requirements

  • Python 3.11+
  • Core deps: pandas, pyarrow, huggingface-hub, lerobot, datasets

Quick Start

from smolvladataset import SmolVLADataset

# Returns (train, val, test) as LeRobot‑compatible datasets
train, val, test = SmolVLADataset()
print(len(train), len(val), len(test))

# Access a row (dict of columns)
sample = train[0]

API

  • SmolVLADataset(csv_list=None, *, force_download=False, force_build=False, split_config=None)

    • Loads a precompiled bundle (default) or builds from a CSV of source repos and returns a tuple (train, val, test).
    • csv_list: Path to CSV whose first column lists HF dataset repo IDs (e.g. org/name). If omitted, a packaged default is used.
    • force_download: Bypass the local cache. With a custom csv_list, rebuild from the source repos even if a cached build exists; without one, re‑download the precompiled bundle.
    • force_build: Applies only when csv_list is omitted. Builds from the packaged default list instead of downloading the precompiled bundle.
    • split_config: Optional SplitConfig(train=..., val=..., test=..., seed=...).
  • SplitConfig(train=0.8, val=0.1, test=0.1, seed=<int>)

    • Proportions must sum to 1.0. The seed controls deterministic shuffling.
    • If no split_config is provided, the default configuration matches the splits published on Hugging Face.
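
To illustrate what "proportions must sum to 1.0" and "the seed controls deterministic shuffling" can mean in practice, here is a minimal sketch of a seeded index split. It is not the library's implementation: the function name split_indices, the default seed of 0, and the choice to split by row index are assumptions made for illustration only.

```python
import random


def split_indices(n, train=0.8, val=0.1, test=0.1, seed=0):
    """Deterministically partition range(n) into train/val/test index lists.

    Hypothetical helper: the real library may split differently
    (for example by episode or by source dataset).
    """
    if abs(train + val + test - 1.0) > 1e-9:
        raise ValueError("train, val and test proportions must sum to 1.0")

    indices = list(range(n))
    # A private Random instance keeps the shuffle independent of global state,
    # so the same seed always yields the same ordering.
    random.Random(seed).shuffle(indices)

    n_train = int(n * train)
    n_val = int(n * val)
    return (
        indices[:n_train],
        indices[n_train:n_train + n_val],
        indices[n_train + n_val:],  # test takes whatever remains
    )


train_idx, val_idx, test_idx = split_indices(100, seed=123)
print(len(train_idx), len(val_idx), len(test_idx))
```

Giving the remainder to the last split avoids dropping rows when the proportions do not divide n evenly.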

Advanced Usage

Custom Dataset List

# Use a CSV file with Hugging Face dataset repo IDs (a packaged default is used if omitted)
train, val, test = SmolVLADataset(csv_list="path/to/datasets.csv")

Custom Split Configuration

from smolvladataset import SmolVLADataset, SplitConfig

config = SplitConfig(train=0.7, val=0.15, test=0.15, seed=123)
train, val, test = SmolVLADataset(split_config=config)

Force Rebuild or Re‑download

# With a custom CSV, forces rebuild from sources
train, val, test = SmolVLADataset(csv_list="data/datasets.csv", force_download=True)

# With the default list, re‑download the precompiled bundle
train, val, test = SmolVLADataset(force_download=True)

# Build from default list instead of using the precompiled bundle
train, val, test = SmolVLADataset(force_build=True)
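
The way the two flags interact with csv_list can be summarized as a small decision function. This is a sketch of the behavior described in the API section, not code from the library: resolve_action and its cached argument are hypothetical, and the assumption that force_build ignores any cached copy is mine.

```python
def resolve_action(csv_list=None, *, force_download=False, force_build=False, cached=False):
    """Return 'use_cache', 'download', or 'build' for a given flag combination.

    Hypothetical helper mirroring the documented flag semantics.
    """
    if csv_list is not None:
        # Custom list: data is always built from the source repos.
        # force_download forces a rebuild; force_build does not apply here.
        return "build" if (force_download or not cached) else "use_cache"
    if force_build:
        # Default list, but skip the precompiled bundle.
        # Assumption: this rebuilds even when a cached copy exists.
        return "build"
    # Default list: fetch the precompiled bundle unless a cached copy can be reused.
    return "download" if (force_download or not cached) else "use_cache"


print(resolve_action("data/datasets.csv", force_download=True, cached=True))
print(resolve_action(force_download=True, cached=True))
print(resolve_action(force_build=True))
```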

Cache Directory

Artifacts are cached under ~/.cache/smolvladataset/<hash>/ by default, where <hash> depends on the dataset list.
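
One way a cache directory can be made to depend on the dataset list is to hash the repo IDs. The sketch below assumes SHA-256 over the sorted IDs, truncated to 16 hex characters; the library's actual hashing scheme is not documented here, and cache_dir_for is a hypothetical name.

```python
import hashlib
from pathlib import Path


def cache_dir_for(repo_ids, root=None):
    """Map a list of repo IDs to a cache directory (illustrative scheme only)."""
    if root is None:
        root = Path.home() / ".cache" / "smolvladataset"
    # Sorting makes the hash independent of the order of entries in the list.
    payload = "\n".join(sorted(repo_ids)).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()[:16]
    return Path(root) / digest


print(cache_dir_for(["org/dataset-1", "org/dataset-2"]))
```

With a scheme like this, changing the dataset list yields a new directory, so builds from different lists do not overwrite each other.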

Dataset List Format

The library expects a CSV file whose first column contains Hugging Face dataset repository IDs:

dataset_repo_id
org/dataset-1
org/dataset-2

Lines beginning with # are ignored.
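
A list in this format can be read with the standard csv module. The following is a minimal sketch, not the library's parser: read_repo_ids is a hypothetical name, and skipping a header row literally named dataset_repo_id is an assumption based on the example above.

```python
import csv
import io


def read_repo_ids(text):
    """Extract repo IDs from the first CSV column, skipping comments and blanks."""
    # Drop blank lines and comment lines before CSV parsing.
    # Leading whitespace before '#' is tolerated here, which is slightly
    # more lenient than "lines beginning with #".
    lines = (
        ln for ln in io.StringIO(text)
        if ln.strip() and not ln.lstrip().startswith("#")
    )
    ids = []
    for row in csv.reader(lines):
        first = row[0].strip() if row else ""
        # Assumption: a header cell named 'dataset_repo_id' is not a repo ID.
        if not first or first == "dataset_repo_id":
            continue
        ids.append(first)
    return ids


example = "dataset_repo_id\n# temporarily disabled\norg/dataset-1\norg/dataset-2\n"
print(read_repo_ids(example))
```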

Cache Layout

  • merged.parquet — Combined dataset from all sources (includes a dataset column)
  • stats.parquet — Basic per‑source statistics
  • train.parquet, validation.parquet, test.parquet — Split files (optional, for HF viewer convenience)
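
To check which of these artifacts exist in a given cache directory, a short standard-library sketch is enough. The file names come from the list above; describe_cache is a hypothetical helper and is not part of the package.

```python
from pathlib import Path

# File names taken from the cache layout described above.
ARTIFACTS = (
    "merged.parquet",
    "stats.parquet",
    "train.parquet",
    "validation.parquet",
    "test.parquet",
)


def describe_cache(cache_dir):
    """Return a mapping of artifact name to whether it exists in cache_dir."""
    cache_dir = Path(cache_dir)
    return {name: (cache_dir / name).is_file() for name in ARTIFACTS}


# Example: inspect a directory (replace "." with a real cache path).
for name, present in describe_cache(".").items():
    print(f"{name}: {'present' if present else 'missing'}")
```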

License

See LICENSE for details.
