atdata
A loose federation of distributed, typed datasets built on WebDataset.
atdata provides a type-safe, composable framework for working with large-scale datasets. It combines the efficiency of WebDataset's tar-based storage with Python's type system and functional programming patterns.
Features
- Typed Samples - Define dataset schemas using Python dataclasses with automatic msgpack serialization
- Lens Transformations - Bidirectional, composable transformations between different dataset views
- Automatic Batching - Smart batch aggregation with numpy array stacking
- WebDataset Integration - Efficient storage and streaming for large-scale datasets
Installation
pip install atdata
Requires Python 3.12 or later.
Quick Start
Defining Sample Types
Use the @packable decorator to create typed dataset samples:
import atdata
from numpy.typing import NDArray

@atdata.packable
class ImageSample:
    image: NDArray
    label: str
    metadata: dict
Creating Datasets
# Create a dataset
dataset = atdata.Dataset[ImageSample]("path/to/data-{000000..000009}.tar")

# Iterate over samples in order
for sample in dataset.ordered(batch_size=None):
    print(f"Label: {sample.label}, Image shape: {sample.image.shape}")

# Iterate with shuffling and batching
for batch in dataset.shuffled(batch_size=32):
    # batch.image is automatically stacked into shape (32, ...)
    # batch.label is a list of 32 labels
    process_batch(batch.image, batch.label)
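The batching behaviour described in the comments above (one column per field, with non-array fields collected into lists) can be illustrated without atdata. The following is a minimal standard-library sketch, not atdata's implementation: `Sample` and `batched_columns` are hypothetical names, and the stacking of array fields (which could be done with `numpy.stack`) is deliberately left out so that the example has no dependencies.

```python
from dataclasses import dataclass, fields
from itertools import islice
from typing import Any, Iterable, Iterator


@dataclass
class Sample:
    """Hypothetical sample type used only for this illustration."""
    value: list[float]
    label: str


def batched_columns(samples: Iterable[Any], batch_size: int) -> Iterator[dict[str, list]]:
    """Group dataclass samples into batches and transpose each batch into columns.

    Each yielded dict maps a field name to the list of that field's values.
    The final batch may be smaller than batch_size.
    """
    it = iter(samples)
    while chunk := list(islice(it, batch_size)):
        yield {f.name: [getattr(s, f.name) for s in chunk] for f in fields(chunk[0])}


samples = [Sample(value=[float(i), float(i + 1)], label=str(i)) for i in range(5)]
batches = list(batched_columns(samples, batch_size=2))
```

Here `batches[0]["label"]` holds the labels of the first two samples, mirroring how `batch.label` is described as a list of labels.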
Lens Transformations
Define reusable transformations between sample types:
@atdata.packable
class ProcessedSample:
    features: NDArray
    label: str

@atdata.lens
def preprocess(sample: ImageSample) -> ProcessedSample:
    features = extract_features(sample.image)
    return ProcessedSample(features=features, label=sample.label)

# Apply lens to view dataset as ProcessedSample
processed_ds = dataset.as_type(ProcessedSample)
for sample in processed_ds.ordered(batch_size=None):
    # sample is now a ProcessedSample
    print(sample.features.shape)
Core Concepts
PackableSample
Base class for serializable samples. Fields annotated as NDArray are automatically handled:
@atdata.packable
class MySample:
    array_field: NDArray  # Automatically serialized
    optional_array: NDArray | None
    regular_field: str
Lens
Bidirectional transformations with getter/putter semantics:
@atdata.lens
def my_lens(source: SourceType) -> ViewType:
    # Transform source -> view
    return ViewType(...)

@my_lens.putter
def my_lens_put(view: ViewType, source: SourceType) -> SourceType:
    # Transform view -> source
    return SourceType(...)
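To make the getter/putter semantics concrete, here is a small self-contained sketch of the general lens idea in plain Python. It does not use atdata and is not its implementation; the `Lens` class, `FullRecord`, `NameView`, and `name_lens` are all hypothetical names chosen for illustration. The getter projects a source onto a narrower view, and the putter writes a modified view back while preserving the source fields the view does not cover.

```python
from dataclasses import dataclass, replace
from typing import Callable, Generic, TypeVar

S = TypeVar("S")  # source type
V = TypeVar("V")  # view type


class Lens(Generic[S, V]):
    """Hypothetical minimal lens: a getter paired with a putter."""

    def __init__(self, get: Callable[[S], V], put: Callable[[V, S], S]) -> None:
        self.get = get  # source -> view
        self.put = put  # (view, original source) -> updated source


@dataclass(frozen=True)
class FullRecord:
    name: str
    age: int
    notes: str


@dataclass(frozen=True)
class NameView:
    name: str


name_lens = Lens(
    get=lambda src: NameView(name=src.name),
    # Only the field covered by the view changes; age and notes are kept.
    put=lambda view, src: replace(src, name=view.name),
)

record = FullRecord(name="ada", age=36, notes="keep me")
view = name_lens.get(record)
updated = name_lens.put(NameView(name="grace"), record)
```

A well-behaved lens round-trips: putting back an unmodified view returns the original source, and getting after a put returns the view that was put.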
Dataset URLs
Uses WebDataset brace expansion for sharded datasets:
- Single file: "data/dataset-000000.tar"
- Multiple shards: "data/dataset-{000000..000099}.tar"
- Multiple patterns: "data/{train,val}/dataset-{000000..000009}.tar"
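To show what these patterns expand to, the sketch below implements a simplified brace expander in plain Python. It is a stand-in for illustration only, not the code WebDataset or atdata runs: `expand_braces` is a hypothetical helper that handles comma alternatives and zero-padded numeric ranges, and it assumes braces are not nested.

```python
import itertools
import re

# Matches one non-nested {...} group and captures its contents.
_BRACE = re.compile(r"\{([^{}]*)\}")


def expand_braces(pattern: str) -> list[str]:
    """Expand {a,b} alternatives and {000..009} numeric ranges (no nesting)."""
    # With a capturing group, re.split alternates: literal, group, literal, ...
    parts = _BRACE.split(pattern)
    choices: list[list[str]] = []
    for i, part in enumerate(parts):
        if i % 2 == 0:
            choices.append([part])  # literal text between groups
        elif ".." in part:
            start, end = part.split("..")
            width = len(start)  # preserve zero padding
            choices.append([str(n).zfill(width) for n in range(int(start), int(end) + 1)])
        else:
            choices.append(part.split(","))
    return ["".join(combo) for combo in itertools.product(*choices)]


shards = expand_braces("data/dataset-{000000..000099}.tar")
multi = expand_braces("data/{train,val}/dataset-{000000..000009}.tar")
```

Under these assumptions, the second pattern yields 100 shard paths and the third yields 20 (10 shards for each of `train` and `val`).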
Development
Setup
# Install uv if not already available
python -m pip install uv
# Install dependencies
uv sync
Testing
# Run all tests with coverage
pytest
# Run specific test file
pytest tests/test_dataset.py
# Run single test
pytest tests/test_lens.py::test_lens
Building
uv build
Contributing
Contributions are welcome! This project is in beta, so the API may still evolve.
License
This project is licensed under the Mozilla Public License 2.0. See LICENSE for details.