A simple library to load training data from an HDD or SSD

split-data-loader

This package contains simple utility functions related to writing and reading data (typically related to machine learning) using multiple files.

There are mainly two extremes when dealing with data.

  1. All the data in a single file - This is good for sequential access, but shuffling the data while reading is cumbersome.
  2. Each frame in its own file - This creates too many tiny files and is difficult to scale.

This library uses an intermediate approach. The entire dataset is split and stored in multiple files (e.g. N = 128) called bins. This allows easy shuffling of the data and parallel processing when required.
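As a rough illustration of the bin idea, the sketch below deals packets round-robin into N in-memory bins. The helper `split_into_bins` is hypothetical and is not part of splitdataloader; how the library actually assigns packets to bins is not stated here.

```python
from typing import Iterable, List


def split_into_bins(packets: Iterable[bytes], num_bins: int) -> List[List[bytes]]:
    """Deal packets round-robin into num_bins lists (hypothetical helper)."""
    if num_bins < 1:
        raise ValueError("num_bins must be at least 1")
    bins: List[List[bytes]] = [[] for _ in range(num_bins)]
    for i, packet in enumerate(packets):
        # Packet i goes to bin i mod num_bins, so bin sizes differ by at most 1
        bins[i % num_bins].append(packet)
    return bins


packets = [f"frame-{i}".encode() for i in range(10)]
bins = split_into_bins(packets, num_bins=4)
print([len(b) for b in bins])  # [3, 3, 2, 2]
```

Because each bin holds a roughly equal share of the data, bins can be handed to separate workers or visited in a random order.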

This library also uses an index file to keep track of the order and location of each packet. It allows index-based random lookup of any input packet, wherever it is stored among the bin files.
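One way such an index could work is sketched below: each entry records the bin number, byte offset, and length of a packet, so packet i can be read with a single seek. The file naming, the in-memory index, and both helper functions are assumptions made for illustration; the library's actual on-disk format is not described here.

```python
import os
import tempfile
from typing import Iterable, List, Tuple

# (bin number, byte offset inside the bin file, packet length)
IndexEntry = Tuple[int, int, int]


def bin_path(target_dir: str, bin_id: int) -> str:
    # Hypothetical naming scheme for the bin files
    return os.path.join(target_dir, f"bin_{bin_id:04d}.dat")


def write_bins_with_index(
    target_dir: str, packets: Iterable[bytes], num_bins: int
) -> List[IndexEntry]:
    """Write packets round-robin into bin files and return the index."""
    os.makedirs(target_dir, exist_ok=True)
    files = [open(bin_path(target_dir, b), "wb") for b in range(num_bins)]
    index: List[IndexEntry] = []
    try:
        for i, packet in enumerate(packets):
            b = i % num_bins
            # tell() gives the offset at which this packet will start
            index.append((b, files[b].tell(), len(packet)))
            files[b].write(packet)
    finally:
        for f in files:
            f.close()
    return index


def read_packet(target_dir: str, index: List[IndexEntry], i: int) -> bytes:
    """Look up packet i through the index and read it from its bin file."""
    b, offset, length = index[i]
    with open(bin_path(target_dir, b), "rb") as f:
        f.seek(offset)
        return f.read(length)


demo_dir = tempfile.mkdtemp()
demo_packets = [b"a", b"bb", b"ccc", b"dddd", b"eeeee"]
demo_index = write_bins_with_index(demo_dir, demo_packets, num_bins=2)
print(read_packet(demo_dir, demo_index, 3))  # b'dddd'
```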

Writing Data

Use write_split_data to write data to a target directory.

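from typing import Iterable

from splitdataloader import write_split_data

def example_writer() -> None:
    # Get the data source: any iterable of bytes packets.
    # some_source() is a placeholder for your own data source.
    data_source: Iterable[bytes] = some_source()
    target_dir = "tmp/training_data"
    # Write the packets to target_dir, split across 128 bins
    write_split_data(target_dir, data_source, splits=128)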
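Since the writer takes an iterable of bytes, each training example has to be serialized first. The sketch below shows one possible data source that encodes dictionaries as JSON; `encode_records` and `decode_packet` are hypothetical helpers, and JSON is only one of many serialization choices.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def encode_records(records: Iterable[Dict[str, Any]]) -> Iterator[bytes]:
    """Yield one UTF-8 encoded JSON packet per record (hypothetical helper)."""
    for record in records:
        yield json.dumps(record).encode("utf-8")


def decode_packet(packet: bytes) -> Dict[str, Any]:
    """Inverse of encode_records for a single packet."""
    return json.loads(packet.decode("utf-8"))


records = [{"x": [0.0, 1.0], "y": 1}, {"x": [2.5], "y": 0}]
data_source = encode_records(records)  # an Iterable[bytes]
```

A generator like this could be passed as data_source in the writer example, and decode_packet applied to each packet when reading the data back.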

Reading Data

Reading data is the main objective of this library. The class SplitDataLoader handles the loading of data. It supports the following:

  1. Getting length using len()
  2. Random indexing using []
  3. Data iteration (binwise), with support for shuffling

from splitdataloader import SplitDataLoader

def example_loader() -> None:
    # Directory previously written by write_split_data
    data_dir = "tmp/training_data"
    loader = SplitDataLoader(data_dir)
    # Supports len()
    print(len(loader))
    # Supports indexing
    data = loader[2]
    # Supports iteration
    for data in loader.iterate_binwise(shuffle=True):
        do_something(data)
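One plausible reading of binwise iteration with shuffling is sketched below: visit the bins in a random order and shuffle the packets inside each bin before yielding them. This is an assumption about the behavior, not the library's confirmed implementation, and `iterate_binwise_sketch` is a hypothetical stand-in that works on in-memory bins.

```python
import random
from typing import Iterator, List, Optional, Sequence


def iterate_binwise_sketch(
    bins: Sequence[Sequence[bytes]],
    shuffle: bool = False,
    seed: Optional[int] = None,
) -> Iterator[bytes]:
    """Yield every packet exactly once, one bin at a time."""
    rng = random.Random(seed)
    bin_order = list(range(len(bins)))
    if shuffle:
        rng.shuffle(bin_order)  # randomize which bin is visited first
    for b in bin_order:
        items: List[bytes] = list(bins[b])
        if shuffle:
            rng.shuffle(items)  # randomize the order inside the bin
        yield from items


example_bins = [[b"a0", b"a1"], [b"b0", b"b1", b"b2"], [b"c0"]]
for packet in iterate_binwise_sketch(example_bins, shuffle=True, seed=0):
    print(packet)
```

With this scheme only one bin needs to be held in memory at a time, at the cost of a shuffle that is less uniform than shuffling the whole dataset.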

Multiprocessing Queue Based Iterator

If loading takes too much time, it is probably a good idea to run it in a separate process. If the code that produces the batches can be refactored into a generator, splitdataloader.MpQItr can be used to handle the loading. Batches are loaded into an internal queue while earlier ones are being processed in the main process.

from typing import Iterator

from splitdataloader import MpQItr

# a tuple, class, or whatever that handles the batch
class BatchType:
    ...

# a generator function that produces the batches
def batch_generator(*args, **kwargs) -> Iterator[BatchType]:
    ...

def batch_wise_processing() -> None:
    # Multi-processing queue based iterator
    queued_batch_iterator = MpQItr[BatchType](
        batch_generator,  # the generator function
        # followed by the args or kwargs to the generator function
    )
    for batch in queued_batch_iterator:
        do_something_with(batch)
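To show the general technique behind a queue based iterator, here is a minimal standard-library sketch. It is not MpQItr's implementation: `queued_iter`, `_worker`, the sentinel, and the `maxsize` and `ctx` parameters are all assumptions made for illustration. A child process drains the generator into a bounded queue while the parent consumes from it.

```python
import multiprocessing as mp
from typing import Any, Callable, Iterator

_DONE = "__done__"  # tag marking the end of the stream


def _worker(queue, gen_fn, args, kwargs) -> None:
    # Runs in the child process: push every generated item into the queue
    try:
        for item in gen_fn(*args, **kwargs):
            queue.put(("item", item))
    finally:
        # Always signal the end so the parent does not wait forever
        queue.put((_DONE, None))


def queued_iter(
    gen_fn: Callable[..., Iterator[Any]],
    *args: Any,
    maxsize: int = 8,
    ctx=None,
    **kwargs: Any,
) -> Iterator[Any]:
    """Iterate over gen_fn(*args, **kwargs), produced in a child process."""
    ctx = ctx or mp.get_context()
    queue = ctx.Queue(maxsize=maxsize)  # bounded, so the producer cannot run far ahead
    proc = ctx.Process(
        target=_worker, args=(queue, gen_fn, args, kwargs), daemon=True
    )
    proc.start()
    try:
        while True:
            tag, item = queue.get()
            if tag == _DONE:
                break
            yield item
    finally:
        proc.join(timeout=1)
        if proc.is_alive():  # e.g. the consumer stopped early
            proc.terminate()


def squares(n: int) -> Iterator[int]:
    # Stand-in for a slow batch generator
    for i in range(n):
        yield i * i
```

On platforms that start processes with spawn, the generator function must be importable at module level and the calling code should sit under an `if __name__ == "__main__":` guard. Items also have to be picklable, since they cross a process boundary.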
