A simple library to load training data from HDD or SSD

Project description

split-data-loader

This package provides simple utility functions for writing and reading data (typically machine learning datasets) spread across multiple files.

There are two extremes when storing a dataset.

All the data in a single file - This is good for sequential access, but shuffling the data while reading is cumbersome.
Each frame in its own file - This creates too many tiny files and is difficult to scale.

This library uses an intermediate approach. The entire dataset is split and stored in multiple files (e.g., N = 128) called bins. This allows easy shuffling of data, and parallel processing when required.

This library also uses an index file to keep track of the order and location of each packet. It allows index based random lookup of all the input packets, distributed among all the bin files.
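The bin-plus-index idea described above can be sketched as follows. This is only an illustration of the concept, not the library's actual on-disk format: the helper names `write_bins` and `read_packet`, the `bin_<i>.dat` file names, and the round-robin assignment are all assumptions, and the index is kept in memory here rather than written to an index file.

```python
import os
import tempfile
from typing import Iterable, List, Tuple

# One index entry per packet: (bin number, byte offset in that bin, length)
IndexEntry = Tuple[int, int, int]

def write_bins(target_dir: str, packets: Iterable[bytes], splits: int) -> List[IndexEntry]:
    """Distribute packets round-robin over `splits` bin files (hypothetical layout)."""
    os.makedirs(target_dir, exist_ok=True)
    files = [open(os.path.join(target_dir, f"bin_{i}.dat"), "wb") for i in range(splits)]
    index: List[IndexEntry] = []
    try:
        for n, packet in enumerate(packets):
            bin_id = n % splits
            f = files[bin_id]
            # Record where this packet starts so it can be found again later
            index.append((bin_id, f.tell(), len(packet)))
            f.write(packet)
    finally:
        for f in files:
            f.close()
    return index

def read_packet(target_dir: str, index: List[IndexEntry], i: int) -> bytes:
    """Random lookup of packet i using the index."""
    bin_id, offset, length = index[i]
    with open(os.path.join(target_dir, f"bin_{bin_id}.dat"), "rb") as f:
        f.seek(offset)
        return f.read(length)

# Small demonstration in a temporary directory
with tempfile.TemporaryDirectory() as demo_dir:
    demo_packets = [f"packet-{i}".encode() for i in range(10)]
    demo_index = write_bins(demo_dir, demo_packets, splits=3)
    demo_lookup = read_packet(demo_dir, demo_index, 7)
```

Because every packet has an index entry, lookup by position needs only one seek in one bin file, while each bin can still be read sequentially on its own.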

Writing Data

Use write_split_data to write data to a target directory.

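from typing import Iterable

from splitdataloader import write_split_data

def example_writer() -> None:
    # Get the data source: an iterable of bytes packets.
    # some_source() is a placeholder for your own data producer.
    data_source: Iterable[bytes] = some_source()
    target_dir = "tmp/training_data"
    # Split the data across 128 bins under target_dir
    write_split_data(target_dir, data_source, splits=128)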
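Since the writer example takes an `Iterable[bytes]`, structured records have to be serialized first. A minimal sketch of one way to do that with the standard `pickle` module is shown below. The helper names `serialize_records` and `deserialize_record` are hypothetical and not part of splitdataloader, and `pickle` is just one possible encoding.

```python
import pickle
from typing import Any, Iterable, Iterator

def serialize_records(records: Iterable[Any]) -> Iterator[bytes]:
    """Lazily turn arbitrary picklable records into bytes packets."""
    for record in records:
        yield pickle.dumps(record)

def deserialize_record(packet: bytes) -> Any:
    """Inverse of serialize_records for a single packet."""
    return pickle.loads(packet)

# Example records: (feature vector, label) pairs
records = [([0.1, 0.2], 0), ([0.3, 0.4], 1)]
packets = list(serialize_records(records))
restored = [deserialize_record(p) for p in packets]
```

The generator returned by `serialize_records` could then be passed as the data source in the writer example. Only unpickle data from sources you trust.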

Reading Data

Reading data is the main objective of this library. The class SplitDataLoader handles the loading of data. It supports the following:

Getting length using len()
Random indexing using []
Data iteration (binwise), with support for shuffling

from splitdataloader import SplitDataLoader

def example_loader() -> None:
    # Directory previously written by write_split_data
    data_dir = "tmp/training_data"
    loader = SplitDataLoader(data_dir)
    # Supports len()
    print(len(loader))
    # Supports indexing
    data = loader[2]
    # Supports iteration, bin by bin, optionally shuffled
    for data in loader.iterate_binwise(shuffle=True):
        do_something(data)  # placeholder for your own processing
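The description does not spell out how binwise shuffling works internally. One plausible reading, sketched below on plain in-memory lists, is to visit the bins in random order and shuffle the items inside each bin. The function `iterate_binwise_sketch` and its `seed` parameter are assumptions made for illustration and are not the library's implementation.

```python
import random
from typing import Iterator, Optional, Sequence, TypeVar

T = TypeVar("T")

def iterate_binwise_sketch(
    bins: Sequence[Sequence[T]],
    shuffle: bool = False,
    seed: Optional[int] = None,
) -> Iterator[T]:
    """Yield items one bin at a time, optionally shuffling bins and their contents."""
    rng = random.Random(seed)
    bin_order = list(range(len(bins)))
    if shuffle:
        rng.shuffle(bin_order)  # random order of bins
    for bin_id in bin_order:
        items = list(bins[bin_id])  # only one bin is held in memory at a time
        if shuffle:
            rng.shuffle(items)  # random order inside the bin
        yield from items

example_bins = [[1, 2, 3], [4, 5], [6]]
in_order = list(iterate_binwise_sketch(example_bins))
shuffled = list(iterate_binwise_sketch(example_bins, shuffle=True, seed=0))
```

Under this reading the shuffle is not a uniform permutation of the whole dataset, since items from the same bin stay adjacent, but each bin file is still read only once.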

Multiprocessing Queue Based Iterator

If loading takes too much time, it is probably a good idea to run it in a separate process. If the code that produces the batches can be refactored into a generator, splitdataloader.MpQItr can be used to handle the loading. Data is loaded into an internal queue while earlier batches are being processed in the main process.
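Refactoring batch production into a generator can look roughly like the sketch below, which groups any iterable of items into fixed-size lists. The name `make_batches` and the list-based batch type are assumptions for illustration and are not part of splitdataloader.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def make_batches(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive batches of up to batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Final, possibly smaller, batch
        yield batch

example_batches = list(make_batches(range(7), batch_size=3))
```

A generator of this shape produces batches lazily, which is what allows them to be prepared ahead of time while the consumer is still busy.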

from typing import Iterator

from splitdataloader import MpQItr

# a tuple, class, or whatever that handles the batch
class BatchType:
    ...

# a generator function that produces the batches
def batch_generator(*args, **kwargs) -> Iterator[BatchType]:
    ...

def batch_wise_processing(*args, **kwargs):
    # Multi-processing queue based iterator
    queued_batch_iterator = MpQItr[BatchType](
        batch_generator,  # the generator function
        *args,  # args or kwargs to the generator function
        **kwargs,
    )
    for batch in queued_batch_iterator:
        do_something_with(batch)
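To illustrate the general idea of a multiprocessing queue based iterator, here is a minimal sketch built only on the standard library. It is not the implementation of MpQItr: the names `queued_iterator` and `_produce`, the `max_size` parameter, and the wrapping of items in one-element tuples are all assumptions. It also assumes a POSIX system where the "fork" start method is available.

```python
import multiprocessing as mp
from typing import Any, Callable, Iterator

def _produce(queue, generator_fn, args, kwargs) -> None:
    """Runs in the worker process: push every generated item into the queue."""
    try:
        for item in generator_fn(*args, **kwargs):
            queue.put((item,))  # wrap so that None can be a legitimate item
    finally:
        queue.put(None)  # end-of-stream marker, sent even if the generator fails

def queued_iterator(
    generator_fn: Callable[..., Iterator[Any]],
    *args: Any,
    max_size: int = 8,
    **kwargs: Any,
) -> Iterator[Any]:
    """Iterate over generator_fn(*args, **kwargs), produced in a separate process."""
    ctx = mp.get_context("fork")  # assumption: POSIX only
    queue = ctx.Queue(maxsize=max_size)  # bounded, so the worker cannot run far ahead
    worker = ctx.Process(
        target=_produce, args=(queue, generator_fn, args, kwargs), daemon=True
    )
    worker.start()
    exhausted = False
    try:
        while True:
            message = queue.get()
            if message is None:
                exhausted = True
                return
            yield message[0]
    finally:
        if not exhausted:
            worker.terminate()  # consumer stopped early; stop the worker too
        worker.join()

def squares(n: int) -> Iterator[int]:
    """Stand-in for an expensive batch generator."""
    for i in range(n):
        yield i * i

results = list(queued_iterator(squares, 5))
```

The bounded queue is the main design choice here: it lets the worker stay a few items ahead of the consumer without loading the whole dataset into memory. Items must be picklable, because they cross a process boundary.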

Release history

Version 0.0.1, released Dec 9, 2023

Download files

Source Distribution

splitdataloader-0.0.1.tar.gz (6.7 kB)

Uploaded Dec 9, 2023 Source

Built Distribution

splitdataloader-0.0.1-py3-none-any.whl (7.4 kB)

Uploaded Dec 9, 2023 Python 3

File details

Details for the file splitdataloader-0.0.1.tar.gz.

File metadata

Download URL: splitdataloader-0.0.1.tar.gz
Upload date: Dec 9, 2023
Size: 6.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.11.2

File hashes

Hashes for splitdataloader-0.0.1.tar.gz
Algorithm	Hash digest
SHA256	`ba8980a6c165080d0d55ea5aba2eb888197d95465583f3947c9725c207502546`
MD5	`46e3ca9540ebe43dd3c95f38152dfe9f`
BLAKE2b-256	`e256ab33539cbbb9cb1d230c386bc53b3640f7f8589567cdf7a7e956e2f9edee`

File details

Details for the file splitdataloader-0.0.1-py3-none-any.whl.

File metadata

Download URL: splitdataloader-0.0.1-py3-none-any.whl
Upload date: Dec 9, 2023
Size: 7.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.11.2

File hashes

Hashes for splitdataloader-0.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f8a15eff56dafbff567c44c6946de897f7270eb7186fa6d1368b10926f5b03e9`
MD5	`74cb654fb2f7ef886b36bc1de62b9a82`
BLAKE2b-256	`2ee638f0b463aa13bc51715f633835b77a93dbafafe0c3691cfd8986b388e4a3`
