Syvain training data manifest, loading, and saving utilities
syvain-training-data
Internal Syvain data utility. No secret sauce here, just a shared helper.
This is my dataloader. There are many like it, but this one is mine. My dataloader is my best friend. It is my life. I must master it as I must master my life. My dataloader, without me, is useless. Without my dataloader, I am useless.
Install
uv add syvain-training-data
Load data
from syvain_training_data import SyvainTrainingData

training_data = SyvainTrainingData(
    s3_base_url="https://t3.storage.dev",
    region="auto",
    access_key_id="...",
    secret_access_key="...",
)


def collate(records):
    ...


loader = training_data.split_data_loader(
    "s3://my-training-bucket/path/to/data-manifest-v1.json",
    collate_fn=collate,
    # Add further dataloader arguments here as needed.
    dataloader_args={"batch_size": 32, "num_workers": 4},
)

train_batches = loader.load("train")
valid_batches = loader.load("valid")
easy_batches = loader.load("train", curriculum_stage="easy")
infinite_train_batches = loader.load("train", infinite_iter=True)
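
The `collate` body above is elided, and the README does not specify the record schema. As a minimal sketch of what such a function might look like, assume each record is a dict with hypothetical `tokens` (a list of ints) and `label` fields; `PAD_ID` is likewise an illustrative name, not part of the library:

```python
from typing import Any

PAD_ID = 0  # hypothetical padding token id, not defined by syvain-training-data


def collate(records: list[dict[str, Any]]) -> dict[str, list]:
    """Pad variable-length token lists to the longest one in the batch."""
    max_len = max(len(r["tokens"]) for r in records)
    tokens, mask, labels = [], [], []
    for r in records:
        n_pad = max_len - len(r["tokens"])
        tokens.append(list(r["tokens"]) + [PAD_ID] * n_pad)
        # 1 marks a real token, 0 marks padding.
        mask.append([1] * len(r["tokens"]) + [0] * n_pad)
        labels.append(r["label"])
    return {"tokens": tokens, "mask": mask, "labels": labels}


batch = collate([
    {"tokens": [5, 6, 7], "label": 1},
    {"tokens": [8], "label": 0},
])
```

A real collate function would probably convert these lists to tensors; plain lists keep the sketch free of extra dependencies.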
Save data
from concurrent.futures import ProcessPoolExecutor

from syvain_training_data import SyvainTrainingData


def generate_data(split, curriculum_stage, shard_id):
    ...


def save_shard(job):
    # Each job carries the saver plus the shard coordinates.
    saver, split, curriculum_stage, shard_id = job
    records = generate_data(split, curriculum_stage, shard_id)
    saver.save(split, curriculum_stage, records)


training_data = SyvainTrainingData(
    s3_base_url="https://t3.storage.dev",
    region="auto",
    access_key_id="...",
    secret_access_key="...",
)

saver = training_data.dataset_saver(
    "s3://my-training-bucket/path/to/dataset/data-manifest-v1.json",
)

# 3 stages x 32 train shards, plus 4 valid and 4 test shards.
jobs = [
    (saver, "train", stage, shard_id)
    for stage in ["easy", "medium", "hard"]
    for shard_id in range(32)
] + [
    (saver, "valid", None, shard_id) for shard_id in range(4)
] + [
    (saver, "test", None, shard_id) for shard_id in range(4)
]

if __name__ == "__main__":
    # Guard the pool so worker processes can import this module safely.
    with ProcessPoolExecutor(max_workers=8) as pool:
        # list() drains the iterator so worker exceptions surface here.
        list(pool.map(save_shard, jobs))

    # Commit the manifest once, after every shard has been saved.
    manifest = saver.commit_manifest()
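
`generate_data` is left elided above. Because shards are produced in parallel and may be retried, it can help to make each shard reproducible from its job key. The following is only a sketch under assumptions: the record fields (`id`, `tokens`), the `n_records` parameter, and the per-stage sequence lengths are all hypothetical and not part of the library.

```python
import random


def generate_data(split, curriculum_stage, shard_id, n_records=4):
    """Return reproducible toy records for one (split, stage, shard) job."""
    # Seeding from the job key means a rerun of the same job yields the
    # same records, and each shard is seeded differently.
    rng = random.Random(f"{split}/{curriculum_stage}/{shard_id}")
    # Hypothetical difficulty knob: harder stages get longer sequences.
    length = {"easy": 4, "medium": 8, "hard": 16}.get(curriculum_stage, 8)
    return [
        {
            "id": f"{split}-{curriculum_stage}-{shard_id}-{i}",
            "tokens": [rng.randrange(1, 100) for _ in range(length)],
        }
        for i in range(n_records)
    ]
```

Seeding `random.Random` with a string is deterministic across processes, so the worker that picks up a job does not affect its output.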
Copy a manifest
from syvain_training_data import SyvainTrainingData

training_data = SyvainTrainingData(
    s3_base_url="https://t3.storage.dev",
    region="auto",
    access_key_id="...",
    secret_access_key="...",
)

manifest = training_data.load_manifest("s3://my-training-bucket/shared/data-manifest-v1.json")
# Do modifications if needed
training_data.save_manifest("s3://my-training-bucket/new-run/data-manifest-v1.json", manifest)
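
The README does not document the manifest schema or what kind of object `load_manifest` returns, so any modification step depends on details not shown here. Purely as an illustration, assuming the manifest behaves like a JSON-style dict (an unconfirmed assumption, with a made-up `splits` layout), dropping one curriculum stage before saving the copy might look like this:

```python
import copy

# Purely illustrative stand-in: the real manifest schema is not documented here.
manifest = {
    "splits": {
        "train": {"easy": ["shard-0"], "hard": ["shard-1"]},
        "valid": {"default": ["shard-2"]},
    }
}


def without_stage(manifest, split, stage):
    """Return a deep copy of the manifest with one curriculum stage removed."""
    edited = copy.deepcopy(manifest)  # leave the loaded manifest untouched
    edited["splits"][split].pop(stage, None)
    return edited


new_manifest = without_stage(manifest, "train", "hard")
```

Editing a deep copy keeps the shared manifest object unchanged in case it is reused later in the same process.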
Download files
File details
Details for the file syvain_training_data-0.0.120.tar.gz.
File metadata
- Download URL: syvain_training_data-0.0.120.tar.gz
- Upload date:
- Size: 7.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.11.26 {"installer":{"name":"uv","version":"0.11.26","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ae035c23ff6e888fc123eca11c36cc12de395582b275c8f1c1e3be83275cc8d0 |
| MD5 | 8b07c601024f297cecd9d9c99bd0adda |
| BLAKE2b-256 | cd3f41879f00dc2e16f1bf235aec93f9c7044f5348fa6073b12bc95c3afc546f |
File details
Details for the file syvain_training_data-0.0.120-py3-none-any.whl.
File metadata
- Download URL: syvain_training_data-0.0.120-py3-none-any.whl
- Upload date:
- Size: 11.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.11.26 {"installer":{"name":"uv","version":"0.11.26","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | cf4572311c1b069ed12895416bf899c3fcf292c9dceea31b7faddebe8d271fa6 |
| MD5 | 47471c71a62166c1688492c731968e4d |
| BLAKE2b-256 | 137501f9ed3e7f09b8a6f655e709163882c6c2ad9335dae223f2c5faa01330a1 |
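
The published digests can be used to check a downloaded file before installing it. A minimal sketch using only the standard library follows; the helper names `sha256_of` and `verify` are illustrative and not part of this package.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Return True when the file's SHA256 matches the published digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `verify("syvain_training_data-0.0.120.tar.gz", "ae035c23ff6e888fc123eca11c36cc12de395582b275c8f1c1e3be83275cc8d0")` should return `True` for an intact copy of the source distribution.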