
tinyloader

A minimalist multiprocessing data loader for tinygrad

Why?

PyTorch ships a DataLoader that loads data in the background, but tinygrad has no equivalent. Loading data efficiently with multiprocessing turns out to be harder than expected: pickling large amounts of data between processes is so slow that a naive multiprocessing loader can end up slower than a single-process one. To solve this problem, we built a simple, minimalist library that loads data in background worker processes directly into shared memory, avoiding the pickling cost.
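The core trick can be illustrated with the standard library's multiprocessing.shared_memory: the producer writes its payload into a named shared block and sends only the short block name across the process boundary, so the payload itself is never pickled. This is a hypothetical stdlib-only sketch of the general technique (single process for brevity), not tinyloader's internals:

```python
from multiprocessing import shared_memory

# ~1 MiB payload standing in for a large decoded ndarray.
payload = bytes(range(256)) * 4096

# The producer writes the data into a named shared-memory block.
shm = shared_memory.SharedMemory(create=True, size=len(payload))
shm.buf[: len(payload)] = payload

# Only this tiny string would cross the process boundary:
handle = shm.name

# The consumer attaches by name and reads the data without pickling it.
view = shared_memory.SharedMemory(name=handle)
received = bytes(view.buf[: len(payload)])

view.close()
shm.close()
shm.unlink()
```

Pickling the handle is O(1) regardless of payload size, which is what makes the multiprocessing approach pay off for large arrays.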

How?

To install tinyloader, simply run:

pip install tinyloader

Then, for example, you can define your own loader for loading video files like this:

import pathlib
import typing

import numpy as np
import tinygrad
from tinyloader.loader import Loader

class VideoLoader(Loader):
    def make_request(self, item: pathlib.Path) -> typing.Any:
        # Called in the main process (where load or load_with_workers is invoked).
        # It builds the request that tells a worker what to load, so the returned
        # value should ideally be cheap to pickle for transfer to another process.
        return item

    def load(self, request: pathlib.Path) -> tuple[np.typing.NDArray, ...]:
        # Called in a background worker process when multiprocessing is used.
        # load_video is your own decoding function; both x and y must be
        # numpy.ndarray values.
        x, y = load_video(request)
        return x, y

    def post_process(
        self, response: tuple[np.typing.NDArray, ...]
    ) -> tuple[tinygrad.Tensor, ...]:
        # Called in the main process (where load or load_with_workers is invoked)
        # before yielding back to the for loop. It transforms the arrays returned
        # by the `load` method into tinygrad Tensors. Be careful: the underlying
        # memory buffer of the response may be shared memory that is reused by
        # another worker after this function returns, so copy the data before
        # returning.
        x, y = response
        return (
            tinygrad.Tensor(x).contiguous().realize(),
            tinygrad.Tensor(y).contiguous().realize(),
        )
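The copy warning in post_process matters because the response can be a zero-copy view into a shared block that the pool will hand to another worker. The hazard can be demonstrated in a single process with the stdlib alone (a hypothetical sketch of buffer reuse, not tinyloader's code):

```python
from multiprocessing import shared_memory

shm = shared_memory.SharedMemory(create=True, size=8)

# First "response": a zero-copy view versus an independent copy.
shm.buf[:8] = b"frame-00"
view = shm.buf[:8]           # memoryview: still aliases the shared block
copy = bytes(shm.buf[:8])    # bytes(): an independent copy of the data

# The pool hands the same block to the next worker, which overwrites it.
shm.buf[:8] = b"frame-01"

stale = bytes(view)          # the view silently changed with the buffer
view.release()               # release the view before closing the segment
shm.close()
shm.unlink()
```

`stale` ends up holding the second frame while `copy` keeps the first, which is exactly why post_process must copy before returning.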

Next, you can use the load function to load the data like this:

from tinyloader.loader import load

video_loader = VideoLoader()
for x, y in load(loader=video_loader, items=["0.mp4", "1.mp4", ...]):
    # ... use x and y for training or testing
    pass
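Conceptually, the single-process load call just chains the three Loader methods in order. This is a hypothetical sketch of that contract (not the library's implementation), using a toy loader with plain values instead of video files:

```python
def load_single_process(loader, items):
    # Sketch of the call order: request built and post-processed in the
    # "main" side, load() done where a worker would run.
    for item in items:
        request = loader.make_request(item)
        response = loader.load(request)
        yield loader.post_process(response)

class ToyLoader:
    def make_request(self, item):
        return item

    def load(self, request):
        # Stand-in for load_video(): returns two values instead of arrays.
        return (request, request * 2)

    def post_process(self, response):
        x, y = response
        return x, y

results = list(load_single_process(ToyLoader(), [1, 2, 3]))
```

With multiprocessing, only the load step moves to workers; make_request and post_process stay in the main process.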

The load function runs everything in a single process, without multiprocessing. To speed things up with multiprocessing, use load_with_workers instead, like this:

from tinyloader.loader import load_with_workers

video_loader = VideoLoader()
for x, y in load_with_workers(loader=video_loader, items=["0.mp4", "1.mp4", ...], num_workers=8):
    # ... use x and y for training or testing
    pass

While this works fine, if you are loading a huge amount of data such as video files, it can be even slower than the single-process approach, because pickling large ndarrays is very slow. To solve the problem, you can use the SharedMemoryShim:

from multiprocessing.managers import SharedMemoryManager
from tinyloader.loader import load_with_workers
from tinyloader.loader import SharedMemoryShim

num_workers = 8

with SharedMemoryManager() as smm:
    loader = SharedMemoryShim(
        VideoLoader(),
        smm=smm,
        memory_pool_block_count=num_workers,
    )
    with load_with_workers(loader, ["0.mp4", "1.mp4", ...], num_workers) as generator:
        for x, y in generator:
            # ... use x and y for training or testing
            pass
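Here memory_pool_block_count caps how many shared-memory blocks are alive at once; setting it to the worker count presumably lets every worker hold a block while the main process drains results. A pool like that might look roughly like this round-robin sketch (hypothetical, not tinyloader's implementation):

```python
from multiprocessing import shared_memory

class BlockPool:
    """Hypothetical fixed-size pool of reusable shared-memory blocks.

    A real pool would allocate through a SharedMemoryManager; this sketch
    allocates directly to stay self-contained.
    """

    def __init__(self, block_count: int, block_size: int) -> None:
        self.blocks = [
            shared_memory.SharedMemory(create=True, size=block_size)
            for _ in range(block_count)
        ]
        self._next = 0

    def acquire(self) -> shared_memory.SharedMemory:
        # Round-robin reuse: a block is handed out again after block_count
        # more acquisitions, which is why post_process must copy data out
        # before the pool recycles the buffer.
        block = self.blocks[self._next % len(self.blocks)]
        self._next += 1
        return block

    def shutdown(self) -> None:
        for block in self.blocks:
            block.close()
            block.unlink()

pool = BlockPool(block_count=2, block_size=16)
a, b, c = pool.acquire(), pool.acquire(), pool.acquire()
reused = a is c   # with 2 blocks, the 3rd acquire reuses the 1st
pool.shutdown()
```

A pool sized below the worker count would make workers wait for free blocks; sized above it, the extra blocks just sit idle.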
