
loaderx

A Minimal Data Loader for Flax

Why Create loaderx?

While Flax supports multiple data-loading backends—including PyTorch, TensorFlow, Grain, and jax_dataloader—each comes with notable drawbacks:

  1. Installing large frameworks like PyTorch or TensorFlow just for data loading is often undesirable.
  2. Grain provides a clean API, but its real-world performance can be suboptimal.
  3. jax_dataloader defaults to using GPU memory, which may lead to inefficient memory utilization in some workflows.

Design Philosophy

loaderx is built around several core principles:

  1. A pragmatic approach that prioritizes minimal memory overhead and minimal dependencies.
  2. A strong focus on single-machine training workflows.
  3. The implementation follows NumPy semantics, supporting both a NumPy backend (for small to medium datasets) and an ArrayRecord backend (for large-scale datasets). Note that when writing ArrayRecord files, group_size must be set to 1.
  4. An endless, step-based data loader rather than the traditional epoch-based design, which aligns better with modern ML training practices.
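To illustrate the step-based, endless design, here is a minimal sketch of such a loader in plain NumPy. This is illustrative only, not loaderx's actual implementation, and `endless_batches` is a hypothetical name:

```python
import numpy as np

def endless_batches(data, labels, batch_size, seed=0):
    """Yield (data, labels) batches forever, reshuffling after each pass.

    Illustrative sketch only -- not loaderx's implementation.
    """
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    while True:  # endless: the consumer decides when to stop
        order = rng.permutation(n)
        for start in range(0, n - batch_size + 1, batch_size):
            idx = order[start:start + batch_size]
            yield data[idx], labels[idx]

# The training loop counts steps instead of epochs:
data = np.arange(20, dtype=np.float32).reshape(10, 2)
labels = np.arange(10)
loader = endless_batches(data, labels, batch_size=4)
for step, (x, y) in enumerate(loader):
    if step >= 256:  # stop after a fixed step budget
        break
```

Because the generator reshuffles and continues indefinitely, epoch boundaries disappear and the stopping condition lives entirely in the training loop.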

Current Limitations

Currently, loaderx only supports single-host environments and does not yet support multi-host training.

ArrayRecord Write & Read

A quick start guide for those who are not yet familiar with ArrayRecord.

import numpy as np
from array_record.python.array_record_module import ArrayRecordWriter
from array_record.python.array_record_data_source import ArrayRecordDataSource

# Memory-map the source array so it is not loaded into RAM all at once.
train_data = np.load('train_data.npy', mmap_mode='r')
dtype = train_data.dtype
shape = train_data[0].shape

# group_size must be 1 for loaderx-compatible files.
writer = ArrayRecordWriter("train_data.ar", options="group_size:1,zstd")
for i in range(train_data.shape[0]):
    writer.write(train_data[i].tobytes())
writer.close()  # close the writer before reading the file back

# Random access read: records are raw bytes, so dtype and shape
# must be restored manually.
ds = ArrayRecordDataSource("train_data.ar")
sample = np.frombuffer(ds[0], dtype=dtype).reshape(shape)
ds.close()
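Note that ArrayRecord stores each record as opaque bytes: the dtype and shape are not part of the record and must be tracked by the caller, which is why they are passed back in at read time. The round trip relies on plain NumPy serialization:

```python
import numpy as np

# tobytes() discards dtype and shape; frombuffer() + reshape restores
# them only because we kept that metadata ourselves.
sample = np.arange(12, dtype=np.float32).reshape(3, 4)

payload = sample.tobytes()                  # what the writer stores
restored = np.frombuffer(payload, dtype=np.float32).reshape(3, 4)
```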

Quick Start

import numpy as np
from loaderx import NPDataset, ARDataset, DataLoader

# ArrayRecord-backed data: dtype and shape must be supplied explicitly.
dataset = ARDataset('xsub/train_data.ar', dtype=np.float32, shape=(3, 300, 25, 2))
# NumPy-backed labels.
labelset = NPDataset('xsub/train_label.npy')

loader = DataLoader(dataset, labelset)

# The loader is endless; stop after a fixed number of steps.
for i, batch in enumerate(loader):
    if i >= 256:
        break
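Since the loader is endless, training length is specified in steps rather than epochs. If you are used to thinking in epochs, the conversion is simple arithmetic; the numbers below are made up for illustration:

```python
# Convert an epoch budget into a step budget for an endless loader.
num_samples = 40_000   # illustrative dataset size
batch_size = 64
epochs = 80

steps_per_epoch = num_samples // batch_size   # 625
total_steps = epochs * steps_per_epoch        # 50_000
```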

Integrating with Flax

For practical integration examples, please refer to the Data2Latent repository: https://github.com/eoeair/Data2Latent
