atdata

A loose federation of distributed, typed datasets built on WebDataset.

atdata provides a type-safe, composable framework for working with large-scale datasets. It combines the efficiency of WebDataset's tar-based storage with Python's type system and functional programming patterns.

Features

  • Typed Samples - Define dataset schemas using Python dataclasses with automatic msgpack serialization
  • Lens Transformations - Bidirectional, composable transformations between different dataset views
  • Automatic Batching - Batch aggregation that stacks NumPy array fields and collects other fields into lists
  • WebDataset Integration - Efficient storage and streaming for large-scale datasets

Installation

pip install atdata

Requires Python 3.12 or later.

Quick Start

Defining Sample Types

Use the @packable decorator to create typed dataset samples:

import atdata
from numpy.typing import NDArray

@atdata.packable
class ImageSample:
    image: NDArray
    label: str
    metadata: dict

Creating Datasets

# Open a sharded dataset whose samples are typed as ImageSample
dataset = atdata.Dataset[ImageSample]("path/to/data-{000000..000009}.tar")

# Iterate over samples in order
for sample in dataset.ordered(batch_size=None):
    print(f"Label: {sample.label}, Image shape: {sample.image.shape}")

# Iterate with shuffling and batching (process_batch stands in for your own code)
for batch in dataset.shuffled(batch_size=32):
    # batch.image is automatically stacked into shape (32, ...)
    # batch.label is a list of 32 labels
    process_batch(batch.image, batch.label)
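The batching described in the comments above can be pictured as field-wise collation: each field of the batch holds the values of that field from every sample. The sketch below uses only the standard library, with nested lists standing in for NumPy arrays. ToySample, collate, and batched are hypothetical names for illustration and are not part of atdata; with NumPy, the image column would typically be passed to np.stack to get one array of shape (batch_size, ...).

```python
from dataclasses import dataclass, fields


@dataclass
class ToySample:
    image: list  # nested list standing in for an NDArray
    label: str


def collate(samples):
    """Group a non-empty list of samples into one column per field."""
    if not samples:
        raise ValueError("cannot collate an empty batch")
    return {
        f.name: [getattr(s, f.name) for s in samples]
        for f in fields(samples[0])
    }


def batched(iterable, batch_size):
    """Yield collated batches of up to batch_size samples.

    Assumption: a final partial batch is kept rather than dropped.
    """
    buffer = []
    for sample in iterable:
        buffer.append(sample)
        if len(buffer) == batch_size:
            yield collate(buffer)
            buffer = []
    if buffer:
        yield collate(buffer)


samples = [ToySample(image=[[i, i], [i, i]], label=f"class-{i}") for i in range(5)]
batches = list(batched(samples, batch_size=2))

for batch in batches:
    print(batch["label"], len(batch["image"]))
```

Whether atdata keeps or drops a trailing partial batch is not stated here, so treat that part of the sketch as an assumption.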

Lens Transformations

Define reusable transformations between sample types:

@atdata.packable
class ProcessedSample:
    features: NDArray
    label: str

@atdata.lens
def preprocess(sample: ImageSample) -> ProcessedSample:
    # extract_features is a placeholder for your own feature-extraction function
    features = extract_features(sample.image)
    return ProcessedSample(features=features, label=sample.label)

# Apply lens to view dataset as ProcessedSample
processed_ds = dataset.as_type(ProcessedSample)

for sample in processed_ds.ordered(batch_size=None):
    # sample is now a ProcessedSample
    print(sample.features.shape)

Core Concepts

PackableSample

Base class for serializable samples. Fields annotated as NDArray, including optional ones (NDArray | None), are serialized automatically:

@atdata.packable
class MySample:
    array_field: NDArray      # Automatically serialized
    optional_array: NDArray | None
    regular_field: str

Lens

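Bidirectional transformations with getter/putter semantics. The decorated function is the getter (source to view); the putter writes a view back into a source. SourceType and ViewType below are placeholders for your own sample types: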

@atdata.lens
def my_lens(source: SourceType) -> ViewType:
    # Transform source -> view
    return ViewType(...)

@my_lens.putter
def my_lens_put(view: ViewType, source: SourceType) -> SourceType:
    # Transform view -> source
    return SourceType(...)
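To make the getter/putter pairing concrete, here is a minimal standalone sketch of a lens over plain dataclasses, together with the two round-trip properties a well-behaved lens is usually expected to satisfy. ToyLens, FullRecord, and NameView are hypothetical names used for illustration only; this is not how atdata implements lenses.

```python
from dataclasses import dataclass, replace
from typing import Callable, Generic, TypeVar

S = TypeVar("S")  # source type
V = TypeVar("V")  # view type


@dataclass(frozen=True)
class FullRecord:
    name: str
    age: int
    email: str


@dataclass(frozen=True)
class NameView:
    name: str


class ToyLens(Generic[S, V]):
    """Hypothetical stand-in for a lens: a getter plus a putter."""

    def __init__(self, get: Callable[[S], V], put: Callable[[V, S], S]):
        self.get = get  # source -> view
        self.put = put  # (view, source) -> updated source


name_lens = ToyLens(
    get=lambda src: NameView(name=src.name),
    # Only the viewed field changes; age and email come from the old source.
    put=lambda view, src: replace(src, name=view.name),
)

source = FullRecord(name="Ada", age=36, email="ada@example.org")
view = name_lens.get(source)
updated = name_lens.put(NameView(name="Grace"), source)

# GetPut: putting back an unmodified view leaves the source unchanged.
assert name_lens.put(name_lens.get(source), source) == source
# PutGet: getting after a put returns exactly the view that was put.
assert name_lens.get(updated) == NameView(name="Grace")
```

The putter takes the original source as its second argument because a view usually carries less information than the source, so the missing fields have to come from somewhere.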

Dataset URLs

Dataset URLs use WebDataset-style brace expansion to address sharded datasets:

  • Single file: "data/dataset-000000.tar"
  • Multiple shards: "data/dataset-{000000..000099}.tar"
  • Multiple patterns: "data/{train,val}/dataset-{000000..000009}.tar"
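To show which shard URLs the patterns above stand for, the following is a small standard-library sketch of brace expansion. It handles only comma lists and ascending numeric ranges, without nesting or step values, and it is not the code WebDataset itself runs; expand is a hypothetical helper name.

```python
import re
from itertools import product

_BRACE = re.compile(r"\{([^{}]*)\}")
_RANGE = re.compile(r"(\d+)\.\.(\d+)")


def _alternatives(body: str) -> list[str]:
    """Expand the inside of one brace group into its alternatives."""
    match = _RANGE.fullmatch(body)
    if match is None:
        return body.split(",")  # comma list such as "train,val"
    start, stop = match.group(1), match.group(2)
    # Zero-pad only when an endpoint is written with a leading zero.
    padded = any(len(s) > 1 and s.startswith("0") for s in (start, stop))
    width = max(len(start), len(stop)) if padded else 0
    return [str(i).zfill(width) for i in range(int(start), int(stop) + 1)]


def expand(pattern: str) -> list[str]:
    """Return every URL a brace pattern stands for, leftmost group slowest."""
    # Splitting on a regex with one capture group alternates
    # literal text (even indices) and brace bodies (odd indices).
    parts = _BRACE.split(pattern)
    literals = parts[0::2]
    groups = [_alternatives(body) for body in parts[1::2]]
    urls = []
    for combo in product(*groups):
        pieces = [literals[0]]
        for choice, literal in zip(combo, literals[1:]):
            pieces.extend((choice, literal))
        urls.append("".join(pieces))
    return urls


for url in expand("data/{train,val}/dataset-{000000..000001}.tar"):
    print(url)
```

A pattern without braces expands to itself, which corresponds to the single-file case in the list above.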

Development

Setup

# Install uv if not already available
python -m pip install uv

# Install dependencies
uv sync

Testing

# Run all tests with coverage
pytest

# Run specific test file
pytest tests/test_dataset.py

# Run single test
pytest tests/test_lens.py::test_lens

Building

uv build

Contributing

Contributions are welcome! This project is in beta, so the API may still evolve.

License

This project is licensed under the Mozilla Public License 2.0. See LICENSE for details.
