Random Access Archive

.raa files are essentially a dict header followed by the consecutive bytes of the samples. The format was made to facilitate and accelerate deep learning training on large datasets. It's written in Rust and fast, but easily accessible programmatically from Python. Most importantly, it lets you shuffle the data without sacrificing too much sequential-read performance, by shuffling blocks of contiguous data rather than individual samples. It also allows for lazy sharding.
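To make the block-shuffling idea concrete, here is a minimal pure-Python sketch of the read order it produces. It does not use rand_archive and is not the library's implementation. The names `make_blocks` and `block_shuffled_indices`, the `block_size` parameter, and the round-robin reading of "lazy sharding" (each reader picks its blocks at read time instead of splitting files on disk) are all assumptions for illustration.

```python
import random
from typing import Iterator, List


def make_blocks(num_samples: int, block_size: int) -> List[range]:
    """Split [0, num_samples) into contiguous blocks of at most block_size."""
    return [
        range(start, min(start + block_size, num_samples))
        for start in range(0, num_samples, block_size)
    ]


def block_shuffled_indices(
    num_samples: int,
    block_size: int,
    seed: int = 0,
    shard_id: int = 0,
    num_shards: int = 1,
) -> Iterator[int]:
    """Yield sample indices with block order randomized (hypothetical helper)."""
    blocks = make_blocks(num_samples, block_size)
    # Assumed sharding scheme: each shard takes every num_shards-th block,
    # decided at read time, so the file on disk is never split.
    blocks = blocks[shard_id::num_shards]
    # Only the order of blocks is shuffled; samples inside a block stay
    # contiguous, so each block is still one sequential read.
    random.Random(seed).shuffle(blocks)
    for block in blocks:
        yield from block


order = list(block_shuffled_indices(10, block_size=4, seed=0))
print(order)
```

Larger blocks keep reads more sequential but make the shuffle coarser; smaller blocks do the opposite.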

Comparison

The main advantage of this library is how extensible it is. Other libraries like WebDataset, FFCV, Streaming Dataset, and TFRecord are very batteries-included, which is great for experimentation but heavily sacrifices extensibility, since they also include data processing. Our philosophy is quite simple: you write string-byte pairs, and you read string-byte pairs. We only implement functionality that NEEDS to be implemented at the reader level for optimization, like shuffling and sharding.
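Because the archive only stores string-byte pairs, encoding and decoding of samples lives in your own code. The sketch below shows one possible way to pack a sample into a single bytes value using only the standard library. The `encode_sample` and `decode_sample` helpers and the length-prefixed JSON layout are assumptions for illustration, not part of rand_archive.

```python
import json
import struct
from typing import Tuple


def encode_sample(pixels: bytes, label: int) -> bytes:
    """Pack metadata and raw payload into one blob (hypothetical layout)."""
    meta = json.dumps({"label": label}).encode("utf-8")
    # 4-byte little-endian length prefix, then JSON metadata, then payload.
    return struct.pack("<I", len(meta)) + meta + pixels


def decode_sample(blob: bytes) -> Tuple[bytes, int]:
    """Inverse of encode_sample: recover the payload and the label."""
    (meta_len,) = struct.unpack_from("<I", blob, 0)
    meta = json.loads(blob[4 : 4 + meta_len].decode("utf-8"))
    return blob[4 + meta_len :], meta["label"]


blob = encode_sample(b"\x00\x01\x02", label=7)
pixels, label = decode_sample(blob)
```

A blob like this is the kind of bytes value you would pair with a string key when writing, and decode again after reading.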

Benchmarks:

!todo

Usage

pip install rand-archive

Writing:

from rand_archive import Writer

with Writer("test.raa") as w:
  w.write("test", b"test")  # key is a str, value is raw bytes

Reading:

from rand_archive import Reader

for _ in Reader().open_file("dummy.raa").with_shuffling():
  pass

