
Project description

loaderx

A Minimal Data Loader for Flax

Why Create loaderx?

While Flax supports multiple data-loading backends—including PyTorch, TensorFlow, Grain, and jax_dataloader—each comes with notable drawbacks:

  1. Installing large frameworks like PyTorch or TensorFlow just for data loading is often undesirable.
  2. Grain provides a clean API, but its real-world performance can be suboptimal.
  3. jax_dataloader defaults to using GPU memory, which may lead to inefficient memory utilization in some workflows.

Design Philosophy

loaderx is built around several core principles:

  1. A pragmatic approach that prioritizes minimal memory overhead and minimal dependencies.
  2. A strong focus on single-machine training workflows.
  3. An implementation based on NumPy semantics, supporting both a NumPy backend (for small to medium datasets) and an ArrayRecord backend (for large-scale datasets). Note that when writing ArrayRecord files, group_size must be set to 1.
  4. An endless, step-based data loader rather than the traditional epoch-based design, which aligns better with modern ML training practices.
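To illustrate the step-based idea, here is a minimal sketch in plain NumPy (not loaderx internals) of an endless loader that reshuffles after each pass and yields batches indefinitely, leaving the stopping decision to the training loop:

```python
import numpy as np

def endless_batches(data, labels, batch_size=4, seed=0):
    """Yield (data, label) batches forever, reshuffling after each pass."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    while True:
        order = rng.permutation(n)
        for start in range(0, n - batch_size + 1, batch_size):
            idx = order[start:start + batch_size]
            yield data[idx], labels[idx]

# The consumer decides how many steps to run, not the loader.
data = np.arange(20, dtype=np.float32).reshape(10, 2)
labels = np.arange(10)
loader = endless_batches(data, labels, batch_size=4)
for step, (x, y) in enumerate(loader):
    if step >= 7:  # deliberately more steps than one epoch contains
        break
```

Because the generator never raises StopIteration, there is no epoch boundary to handle; a step budget (as in the Quick Start below) is the natural stopping criterion.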

Current Limitations

Currently, loaderx only supports single-host environments and does not yet support multi-host training.

ArrayRecord write & read

A quick start guide for those who are not yet familiar with ArrayRecord.

import numpy as np
from array_record.python.array_record_module import ArrayRecordWriter
from array_record.python.array_record_data_source import ArrayRecordDataSource

# Memory-map the source array so it is not loaded into RAM all at once.
train_data = np.load('train_data.npy', mmap_mode='r')
dtype = train_data.dtype
shape = train_data[0].shape

# group_size must be 1 for loaderx-compatible files.
writer = ArrayRecordWriter("train_data.ar", options="group_size:1,zstd")
for i in range(train_data.shape[0]):
    writer.write(train_data[i].tobytes())
writer.close()  # close before reading the file back

# Random access: records are raw bytes, so restore dtype and shape.
ds = ArrayRecordDataSource("train_data.ar")
sample = np.frombuffer(ds[0], dtype=dtype).reshape(shape)
ds.close()
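Since ArrayRecord stores each record as raw bytes, the essential step above is the tobytes/frombuffer round trip, which can be verified with plain NumPy and no ArrayRecord installation:

```python
import numpy as np

# A small array standing in for one training sample.
sample = np.arange(6, dtype=np.float32).reshape(2, 3)

# Serialize to raw bytes, as done before writer.write(...).
buf = sample.tobytes()

# frombuffer yields a flat array, so dtype and shape must be
# stored out of band and reapplied on read.
restored = np.frombuffer(buf, dtype=sample.dtype).reshape(sample.shape)

assert np.array_equal(sample, restored)
```

This is why ARDataset in the Quick Start takes explicit dtype and shape arguments: the record format itself carries neither.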

Quick Start

import numpy as np
from loaderx import NPDataset, ARDataset, DataLoader

# ArrayRecord-backed data; dtype and shape must match what was written.
dataset = ARDataset('xsub/train_data.ar', dtype=np.float32, shape=(3, 300, 25, 2))
labelset = NPDataset('xsub/train_label.npy')

loader = DataLoader(dataset, labelset)

# The loader is endless, so the training loop decides when to stop.
for i, batch in enumerate(loader):
    if i >= 256:
        break

Integrating with Flax

For practical integration examples, please refer to the Data2Latent repository: https://github.com/eoeair/Data2Latent

Download files

Download the file for your platform.

Source Distribution

loaderx-0.1.4.tar.gz (5.5 kB)

Uploaded Source

Built Distribution

loaderx-0.1.4-py3-none-any.whl (6.6 kB)

Uploaded Python 3

File details

Details for the file loaderx-0.1.4.tar.gz.

File metadata

  • Download URL: loaderx-0.1.4.tar.gz
  • Upload date:
  • Size: 5.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for loaderx-0.1.4.tar.gz

  • SHA256: 738e2728cc0cd0457fbb7609722d7df5ab4b5c7ebaf595df5598c51e95aa5f51
  • MD5: 14643de49b4452343b74d628190b8ffc
  • BLAKE2b-256: df1b797eed5a7f0fe4e53294957265b6880e51a1611144e602e9c7417500900f

File details

Details for the file loaderx-0.1.4-py3-none-any.whl.

File metadata

  • Download URL: loaderx-0.1.4-py3-none-any.whl
  • Upload date:
  • Size: 6.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for loaderx-0.1.4-py3-none-any.whl

  • SHA256: 22535680dfce95bc5652423ebd978d585315932573b43e600f6be7ccadc330ac
  • MD5: 27d0f362c62a9d8a2e504e20cd9baa9a
  • BLAKE2b-256: 0a012e8ac97b834d339ba86b2fb66c76e12fddab35be0f2c41cf46b63be6c0ce
