atdata

A loose federation of distributed, typed datasets built on WebDataset.

atdata provides a type-safe, composable framework for working with large-scale datasets. It combines the efficiency of WebDataset's tar-based storage with Python's type system and functional programming patterns.

Features

  • Typed Samples - Define dataset schemas using Python dataclasses with automatic msgpack serialization
  • Schema-free Exploration - Load datasets without defining a schema first using DictSample
  • Lens Transformations - Bidirectional, composable transformations between different dataset views
  • Automatic Batching - Smart batch aggregation with numpy array stacking
  • WebDataset Integration - Efficient storage and streaming for large-scale datasets
  • Flexible Data Sources - Stream from local files, HTTP URLs, or S3-compatible storage
  • HuggingFace-style API - load_dataset() with path resolution and split handling
  • Local & Atmosphere Storage - Index datasets locally with Redis or publish to ATProto network

Installation

pip install atdata

Requires Python 3.12 or later.

Quick Start

Loading Datasets

The primary way to load datasets is with load_dataset():

from atdata import load_dataset

# Load without specifying a type - returns Dataset[DictSample]
ds = load_dataset("path/to/data.tar", split="train")

# Explore the data
for sample in ds.ordered():
    print(sample.keys())      # See available fields
    print(sample["text"])     # Dict-style access
    print(sample.label)       # Attribute access
    break

Defining Typed Schemas

Once you understand your data, define a typed schema with @packable:

import atdata
from numpy.typing import NDArray

@atdata.packable
class ImageSample:
    image: NDArray
    label: str
    metadata: dict
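
Classes decorated with @packable are plain dataclasses under the hood, so constructing a sample is ordinary Python. A quick illustration (the shapes and values here are made up):

import numpy as np

sample = ImageSample(
    image=np.zeros((64, 64, 3), dtype=np.uint8),  # any NDArray is accepted
    label="example",
    metadata={"source": "synthetic"},
)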

Loading with Types

# Load with explicit type
ds = load_dataset("path/to/data-{000000..000009}.tar", ImageSample, split="train")

# Or convert from DictSample
ds = load_dataset("path/to/data.tar", split="train").as_type(ImageSample)

# Iterate over samples
for sample in ds.ordered():
    print(f"Label: {sample.label}, Image shape: {sample.image.shape}")

# Iterate with shuffling and batching
for batch in ds.shuffled(batch_size=32):
    # batch.image is automatically stacked into shape (32, ...)
    # batch.label is a list of 32 labels
    process_batch(batch.image, batch.label)
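
The comments above describe the batching contract: NDArray fields are stacked into a single array with a leading batch axis, while other fields are gathered into plain lists. A minimal sketch of that aggregation, for intuition only (not atdata's internal implementation):

import numpy as np

def stack_batch(samples: list[ImageSample]) -> tuple[np.ndarray, list[str]]:
    # Array fields stack along a new leading batch dimension...
    images = np.stack([s.image for s in samples])  # shape (B, ...)
    # ...while non-array fields become plain Python lists.
    labels = [s.label for s in samples]            # length B
    return images, labels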

Lens Transformations

Define reusable transformations between sample types:

@atdata.packable
class ProcessedSample:
    features: NDArray
    label: str

@atdata.lens
def preprocess(sample: ImageSample) -> ProcessedSample:
    features = extract_features(sample.image)
    return ProcessedSample(features=features, label=sample.label)

# Apply lens to view dataset as ProcessedSample
processed_ds = ds.as_type(ProcessedSample)

for sample in processed_ds.ordered(batch_size=None):
    # sample is now a ProcessedSample
    print(sample.features.shape)
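
Since every @packable class gets an automatic lens from DictSample (see Core Concepts below) and preprocess provides the ImageSample -> ProcessedSample step, the same view can be reached from a schema-free load by chaining as_type(). An illustrative composition, assuming the same file as the Quick Start:

processed_ds = (
    load_dataset("path/to/data.tar", split="train")
    .as_type(ImageSample)       # DictSample -> ImageSample (auto-registered lens)
    .as_type(ProcessedSample)   # ImageSample -> ProcessedSample (via preprocess)
)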

Core Concepts

DictSample

The default sample type for schema-free exploration. Provides both attribute and dict-style access:

ds = load_dataset("data.tar", split="train")

for sample in ds.ordered():
    # Dict-style access
    print(sample["field_name"])

    # Attribute access
    print(sample.field_name)

    # Introspection
    print(sample.keys())
    print(sample.to_dict())

PackableSample

Base class for typed, serializable samples. Fields annotated as NDArray are automatically handled:

@atdata.packable
class MySample:
    array_field: NDArray      # Automatically serialized
    optional_array: NDArray | None
    regular_field: str

Every @packable class automatically registers a lens from DictSample, enabling seamless conversion via .as_type().

Lens

Bidirectional transformations with getter/putter semantics:

@atdata.lens
def my_lens(source: SourceType) -> ViewType:
    # Transform source -> view
    return ViewType(...)

@my_lens.putter
def my_lens_put(view: ViewType, source: SourceType) -> SourceType:
    # Transform view -> source
    return SourceType(...)
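
As a concrete illustration, the preprocess lens from the Quick Start could be given a putter that writes an edited label back into the original ImageSample (hypothetical code; the image is carried over unchanged because it cannot be reconstructed from features):

@preprocess.putter
def preprocess_put(view: ProcessedSample, source: ImageSample) -> ImageSample:
    # Carry the original image and metadata through; only the label
    # can meaningfully round-trip from the processed view.
    return ImageSample(
        image=source.image,
        label=view.label,
        metadata=source.metadata,
    )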

Data Sources

Datasets support multiple backends via the DataSource protocol:

# String URLs (most common) - automatically wrapped in URLSource
dataset = atdata.Dataset[ImageSample]("data-{000000..000009}.tar")

# S3 with authentication (private buckets, Cloudflare R2, MinIO)
source = atdata.S3Source(
    bucket="my-bucket",
    keys=["data-000000.tar", "data-000001.tar"],
    endpoint="https://my-account.r2.cloudflarestorage.com",
    access_key="...",
    secret_key="...",
)
dataset = atdata.Dataset[ImageSample](source)

Dataset URLs

Dataset URLs use WebDataset's brace-expansion syntax to address sharded datasets; the sketch after this list shows how patterns expand:

  • Single file: "data/dataset-000000.tar"
  • Multiple shards: "data/dataset-{000000..000099}.tar"
  • Multiple patterns: "data/{train,val}/dataset-{000000..000009}.tar"
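
For intuition, the notation expands the same way the braceexpand package (which WebDataset uses) does. A standalone sketch, not part of atdata's API:

from braceexpand import braceexpand

urls = list(braceexpand("data/dataset-{000000..000002}.tar"))
# ['data/dataset-000000.tar',
#  'data/dataset-000001.tar',
#  'data/dataset-000002.tar']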

HuggingFace-style API

Load datasets with a familiar interface:

from atdata import load_dataset

# Load without type for exploration (returns Dataset[DictSample])
ds = load_dataset("./data/train-*.tar", split="train")

# Load with explicit type
ds = load_dataset("./data/train-*.tar", ImageSample, split="train")

# Load from S3 with brace notation
ds = load_dataset("s3://bucket/data-{000000..000099}.tar", ImageSample, split="train")

# Load all splits (returns DatasetDict)
ds_dict = load_dataset("./data", ImageSample)
train_ds = ds_dict["train"]
test_ds = ds_dict["test"]

# Convert DictSample to typed schema
ds = load_dataset("./data/train.tar", split="train").as_type(ImageSample)

Development

Setup

# Install uv if not already available
python -m pip install uv

# Install dependencies
uv sync

Testing

# Run all tests with coverage
uv run pytest

# Run specific test file
uv run pytest tests/test_dataset.py

# Run single test
uv run pytest tests/test_lens.py::test_lens

Building

uv build

Contributing

Contributions are welcome! This project is in beta, so the API may still evolve.

License

This project is licensed under the Mozilla Public License 2.0. See LICENSE for details.
