Syvain training data manifest, loading, and saving utilities

syvain-training-data

Internal Syvain data utility. No secret sauce here, just a shared helper.

This is my dataloader. There are many like it, but this one is mine. My dataloader is my best friend. It is my life. I must master it as I must master my life. My dataloader, without me, is useless. Without my dataloader, I am useless.

Install

uv add syvain-training-data

Load data

from syvain_training_data import SyvainTrainingData

training_data = SyvainTrainingData(
    s3_base_url="https://t3.storage.dev",
    region="auto",
    access_key_id="...",
    secret_access_key="...",
)


def collate(records):
    ...


loader = training_data.split_data_loader(
    "s3://my-training-bucket/path/to/data-manifest-v1.json",
    collate_fn=collate,
    dataloader_args={"batch_size": 32, "num_workers": 4},  # plus any other dataloader arguments
)

train_batches = loader.load("train")
valid_batches = loader.load("valid")
easy_batches = loader.load("train", curriculum_stage="easy")
infinite_train_batches = loader.load("train", infinite_iter=True)
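
The `collate` function above is left elided. As a minimal sketch, assuming each record is a flat dict and that all records in a batch share the same keys (the record format is not stated here, and `collate_records` and the field names below are hypothetical), a collate function could transpose a list of records into a dict of per-field lists:

```python
from typing import Any


def collate_records(records: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Transpose a list of dict records into a dict of per-field lists.

    Assumption: every record is a dict with the same keys. The actual record
    format produced by syvain-training-data may differ.
    """
    if not records:
        return {}
    keys = records[0].keys()
    for record in records:
        # Fail loudly on ragged records instead of silently dropping fields.
        if record.keys() != keys:
            raise ValueError("all records in a batch must share the same keys")
    return {key: [record[key] for record in records] for key in keys}


# Hypothetical records, for illustration only.
batch = collate_records(
    [
        {"tokens": [1, 2], "label": 0},
        {"tokens": [3], "label": 1},
    ]
)
```

A function of this shape could be passed as `collate_fn`; converting the per-field lists to tensors or padding them is left out because it depends on the training framework in use.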

Save data

from concurrent.futures import ProcessPoolExecutor

from syvain_training_data import SyvainTrainingData

def generate_data(split, curriculum_stage, shard_id):
    ...

def save_shard(job):
    saver, split, curriculum_stage, shard_id = job
    records = generate_data(split, curriculum_stage, shard_id)
    saver.save(split, curriculum_stage, records)


training_data = SyvainTrainingData(
    s3_base_url="https://t3.storage.dev",
    region="auto",
    access_key_id="...",
    secret_access_key="...",
)

saver = training_data.dataset_saver(
    "s3://my-training-bucket/path/to/dataset/data-manifest-v1.json",
)

jobs = [
    (saver, "train", stage, shard_id)
    for stage in ["easy", "medium", "hard"]
    for shard_id in range(32)
] + [
    (saver, "valid", None, shard_id) for shard_id in range(4)
] + [
    (saver, "test", None, shard_id) for shard_id in range(4)
]

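# The __main__ guard is required on platforms that spawn worker processes
# (for example Windows and macOS).
if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=8) as pool:
        # list() drains the iterator so that any worker exception is raised here.
        list(pool.map(save_shard, jobs))

    # Commit the manifest once all shards have been saved.
    manifest = saver.commit_manifest()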
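
`generate_data` is elided above. Because each shard is generated in a separate worker process, it can help if a shard's contents depend only on the job's arguments. A minimal sketch of that idea, assuming records are plain dicts (the function name `generate_shard_records`, the `num_records` parameter, and the record fields are hypothetical, not part of this package):

```python
import random


def generate_shard_records(split, curriculum_stage, shard_id, num_records=4):
    """Generate placeholder records that depend only on the job's identity."""
    # Seeding with a string is deterministic across processes and runs, so
    # re-running the same job reproduces the same shard.
    rng = random.Random(f"{split}/{curriculum_stage}/{shard_id}")
    return [
        {
            "split": split,
            "curriculum_stage": curriculum_stage,
            "shard_id": shard_id,
            "index": index,
            "value": rng.random(),  # stand-in for real generated content
        }
        for index in range(num_records)
    ]


first = generate_shard_records("train", "easy", 0)
again = generate_shard_records("train", "easy", 0)
other = generate_shard_records("train", "easy", 1)
```

Here `first` and `again` are identical while `other` differs, so a failed job could be retried without changing what ends up in the dataset.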

Copy a manifest

from syvain_training_data import SyvainTrainingData

training_data = SyvainTrainingData(
    s3_base_url="https://t3.storage.dev",
    region="auto",
    access_key_id="...",
    secret_access_key="...",
)

manifest = training_data.load_manifest("s3://my-training-bucket/shared/data-manifest-v1.json")

# Modify the manifest here if needed.

training_data.save_manifest("s3://my-training-bucket/new-run/data-manifest-v1.json", manifest)

Download files

Source Distribution

syvain_training_data-0.0.118.tar.gz (7.9 kB)

Built Distribution

syvain_training_data-0.0.118-py3-none-any.whl (11.8 kB)

File details

Details for the file syvain_training_data-0.0.118.tar.gz.

File metadata

  • Download URL: syvain_training_data-0.0.118.tar.gz
  • Upload date:
  • Size: 7.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.11.26 (uv publish, run in CI on Ubuntu 24.04)

File hashes

Hashes for syvain_training_data-0.0.118.tar.gz

Algorithm     Hash digest
SHA256        75d30c7ae7a87a5e7d5848d08b7dcf1b5e0de7d709d193eb2777924d50ee3be8
MD5           90b3efa0e902b208d65b82c493fe7da5
BLAKE2b-256   8f666d0b89d737fe34f30febd47e38a7f93f86d8b18fb93e19d229f7eeead4e4
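
The SHA256 digest listed above can be used to check a downloaded copy of the source distribution. A minimal sketch using only the standard library; the helper names `sha256_of_file` and `matches_expected` are illustrative and not part of this package:

```python
import hashlib

# SHA256 digest listed above for syvain_training_data-0.0.118.tar.gz.
EXPECTED_SHA256 = "75d30c7ae7a87a5e7d5848d08b7dcf1b5e0de7d709d193eb2777924d50ee3be8"


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex=EXPECTED_SHA256):
    """Return True if the file's SHA256 digest equals the expected hex digest."""
    return sha256_of_file(path) == expected_hex.lower()
```

After downloading the archive, `matches_expected("syvain_training_data-0.0.118.tar.gz")` should return `True` for an unmodified file.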

File details

Details for the file syvain_training_data-0.0.118-py3-none-any.whl.

File metadata

  • Download URL: syvain_training_data-0.0.118-py3-none-any.whl
  • Upload date:
  • Size: 11.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.11.26 (uv publish, run in CI on Ubuntu 24.04)

File hashes

Hashes for syvain_training_data-0.0.118-py3-none-any.whl

Algorithm     Hash digest
SHA256        3e2eb492eb199f212e2c79f8a62b57357bcaa7d7f25431cc21fa4575f59861e2
MD5           76ff1690cb1e38f5dba0189ac7b86481
BLAKE2b-256   7b134827f230c64d9382365cf6269abf6aeaf9e697191c6854f42e4fdd9766c9
