Syvain training data manifest, loading, and saving utilities

Project description

syvain-training-data

Internal Syvain data utility. No secret sauce here, just a shared helper.

This is my dataloader. There are many like it, but this one is mine. My dataloader is my best friend. It is my life. I must master it as I must master my life. My dataloader, without me, is useless. Without my dataloader, I am useless.

Install

uv add syvain-training-data

Load data

from syvain_training_data import SyvainTrainingData

training_data = SyvainTrainingData(
    s3_base_url="https://t3.storage.dev",
    region="auto",
    access_key_id="...",
    secret_access_key="...",
)


def collate(records):
    ...


loader = training_data.split_data_loader(
    "s3://my-training-bucket/path/to/data-manifest-v1.json",
    collate_fn=collate,
    dataloader_args={"batch_size": 32, "num_workers": 4},  # ... further dataloader arguments elided
)

train_batches = loader.load("train")
valid_batches = loader.load("valid")
easy_batches = loader.load("train", curriculum_stage="easy")
infinite_train_batches = loader.load("train", infinite_iter=True)
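
The `collate` body above is left open, so here is a minimal sketch of one possible shape. It assumes each record is a dict and that all records in a batch share the same keys; the real record format is not specified in this README. `take_steps` is a hypothetical helper (not part of syvain-training-data) for bounding a loop over batches requested with `infinite_iter=True`, assuming `loader.load(...)` returns an ordinary iterable.

```python
from itertools import islice


def collate(records):
    """Turn a list of per-example dicts into one dict of per-field lists.

    Assumption: records are dicts with identical keys.
    """
    if not records:
        return {}
    keys = records[0].keys()
    return {key: [record[key] for record in records] for key in keys}


def take_steps(batches, max_steps):
    """Return an iterator over at most max_steps batches.

    Hypothetical helper: it lets a loop over an endless iterable terminate.
    """
    return islice(batches, max_steps)
```

With these in place, a fixed-length training loop could read `for batch in take_steps(infinite_train_batches, 1000): ...`.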

Save data

from concurrent.futures import ProcessPoolExecutor

from syvain_training_data import SyvainTrainingData

def generate_data(split, curriculum_stage, shard_id):
    ...

def save_shard(job):
    saver, split, curriculum_stage, shard_id = job
    records = generate_data(split, curriculum_stage, shard_id)
    saver.save(split, curriculum_stage, records)


training_data = SyvainTrainingData(
    s3_base_url="https://t3.storage.dev",
    region="auto",
    access_key_id="...",
    secret_access_key="...",
)

saver = training_data.dataset_saver(
    "s3://my-training-bucket/path/to/dataset/data-manifest-v1.json",
)

jobs = [
    (saver, "train", stage, shard_id)
    for stage in ["easy", "medium", "hard"]
    for shard_id in range(32)
] + [
    (saver, "valid", None, shard_id) for shard_id in range(4)
] + [
    (saver, "test", None, shard_id) for shard_id in range(4)
]

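# On platforms that spawn worker processes (macOS, Windows), run this from
# inside an `if __name__ == "__main__":` block.
# list(...) drains the map so any exception raised in a worker surfaces here.
with ProcessPoolExecutor(max_workers=8) as pool:
    list(pool.map(save_shard, jobs))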

manifest = saver.commit_manifest()
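
The job list above can also be built by a small function, which makes the shard counts easy to change and to check. This is only a sketch: `build_jobs` and its parameter names are hypothetical, and the tuple layout simply mirrors what `save_shard` unpacks.

```python
def build_jobs(saver, train_stages, train_shards, eval_splits, eval_shards):
    """Return (saver, split, curriculum_stage, shard_id) tuples for every shard."""
    jobs = [
        (saver, "train", stage, shard_id)
        for stage in train_stages
        for shard_id in range(train_shards)
    ]
    # Evaluation splits carry no curriculum stage, so that slot is None.
    jobs += [
        (saver, split, None, shard_id)
        for split in eval_splits
        for shard_id in range(eval_shards)
    ]
    return jobs


jobs = build_jobs(
    saver=None,  # stand-in for illustration; pass the real saver in practice
    train_stages=["easy", "medium", "hard"],
    train_shards=32,
    eval_splits=["valid", "test"],
    eval_shards=4,
)
```

With the numbers from the example this yields 3 × 32 + 4 + 4 = 104 jobs, in the same order as the hand-written list.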

Copy a manifest

from syvain_training_data import SyvainTrainingData

training_data = SyvainTrainingData(
    s3_base_url="https://t3.storage.dev",
    region="auto",
    access_key_id="...",
    secret_access_key="...",
)

manifest = training_data.load_manifest("s3://my-training-bucket/shared/data-manifest-v1.json")

# Modify the manifest here if needed

training_data.save_manifest("s3://my-training-bucket/new-run/data-manifest-v1.json", manifest)
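
Every snippet above passes the credentials as literal placeholders. One way to keep real keys out of source files is to read them from the environment, sketched below. The environment variable names are made up for this example; nothing in this README says the library reads them itself. The defaults for `s3_base_url` and `region` are the values used in the snippets above.

```python
import os


def connection_kwargs(env=os.environ):
    """Build the SyvainTrainingData keyword arguments from environment variables.

    Assumption: the SYVAIN_S3_* variable names are hypothetical.
    """
    required = ("SYVAIN_S3_ACCESS_KEY_ID", "SYVAIN_S3_SECRET_ACCESS_KEY")
    missing = [name for name in required if name not in env]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {
        "s3_base_url": env.get("SYVAIN_S3_BASE_URL", "https://t3.storage.dev"),
        "region": env.get("SYVAIN_S3_REGION", "auto"),
        "access_key_id": env["SYVAIN_S3_ACCESS_KEY_ID"],
        "secret_access_key": env["SYVAIN_S3_SECRET_ACCESS_KEY"],
    }
```

The constructor call then becomes `training_data = SyvainTrainingData(**connection_kwargs())`.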


Download files

Source Distribution

syvain_training_data-0.0.120.tar.gz (7.9 kB view details)

Uploaded Source

Built Distribution

syvain_training_data-0.0.120-py3-none-any.whl (11.8 kB view details)

Uploaded Python 3

File details

Details for the file syvain_training_data-0.0.120.tar.gz.

File metadata

  • Download URL: syvain_training_data-0.0.120.tar.gz
  • Upload date:
  • Size: 7.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.11.26 {"installer":{"name":"uv","version":"0.11.26","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for syvain_training_data-0.0.120.tar.gz
Algorithm Hash digest
SHA256 ae035c23ff6e888fc123eca11c36cc12de395582b275c8f1c1e3be83275cc8d0
MD5 8b07c601024f297cecd9d9c99bd0adda
BLAKE2b-256 cd3f41879f00dc2e16f1bf235aec93f9c7044f5348fa6073b12bc95c3afc546f

File details

Details for the file syvain_training_data-0.0.120-py3-none-any.whl.

File metadata

  • Download URL: syvain_training_data-0.0.120-py3-none-any.whl
  • Upload date:
  • Size: 11.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.11.26 {"installer":{"name":"uv","version":"0.11.26","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for syvain_training_data-0.0.120-py3-none-any.whl
Algorithm Hash digest
SHA256 cf4572311c1b069ed12895416bf899c3fcf292c9dceea31b7faddebe8d271fa6
MD5 47471c71a62166c1688492c731968e4d
BLAKE2b-256 137501f9ed3e7f09b8a6f655e709163882c6c2ad9335dae223f2c5faa01330a1
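
To check a downloaded file against the SHA256 digests listed above, something like the following sketch works with only the standard library. The function names are illustrative, and the file path you pass depends on where you saved the download.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter(callable, sentinel) keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path, expected_hex):
    """Compare a file's digest with an expected hex string, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches_sha256("syvain_training_data-0.0.120-py3-none-any.whl", "cf4572311c1b069ed12895416bf899c3fcf292c9dceea31b7faddebe8d271fa6")` should return `True` for an intact copy of the wheel.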
