
A dataloader, but for JAX


Jaxonloader

A blazingly fast dataloader for JAX that no one asked for, but here it is anyway.

Installation

Install this package using pip like so:

pip install jaxonloader

Quickstart

This package differs significantly from the PyTorch DataLoader class! JAX has no internal state, which means we have to keep track of it ourselves. Here's a minimal example that sets up MNIST:

import jax

from jaxonloader import get_mnist, make

key = jax.random.PRNGKey(0)

train, test = get_mnist()  # these are JaxonDatasets

train_loader, index = make(
    train,
    batch_size=4,
    shuffle=False,
    drop_last=True,
    key=key,
    jit=True,
)
train_loader = jax.jit(train_loader)

while batch := train_loader(index):
    data, index, done = batch
    processed_data = process_data(data)  # process_data is user-defined
    if done:
        break
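The key argument above is what drives shuffling. Because JAX randomness is stateless, each shuffle needs an explicit PRNG key, and each new epoch needs a fresh subkey. Here is an illustrative sketch of that pattern in plain JAX (this is not jaxonloader's internal code):

```python
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
data = jnp.arange(10)

# Split off a subkey per epoch; reusing the same key would
# reproduce the same permutation every time.
key, subkey = jax.random.split(key)
perm = jax.random.permutation(subkey, data.shape[0])
shuffled = data[perm]
```

Keeping the split explicit is what makes the shuffle reproducible: the same starting key always yields the same sequence of epoch permutations.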

Philosophy

The jaxonloader package is designed to be as lightweight as possible. In fact, it's only a very thin wrapper around JAX arrays! Under the hood, it uses the Equinox library to handle the stateful nature of the dataloader. Since the dataloader object is just an eqx.Module, it can be JIT-compiled and used in other JAX transformations as well (although I haven't tested this).
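To see why a stateless loader can be jitted at all, consider this minimal sketch of the underlying pattern (illustrative only, not jaxonloader's implementation): the loader is a pure function of the index, and the caller threads the index through each call.

```python
import jax
import jax.numpy as jnp

def make_loader(data, batch_size):
    # drop_last=True behaviour: only full batches are produced
    n_batches = data.shape[0] // batch_size

    def loader(index):
        start = index * batch_size
        # dynamic_slice works with a traced start index, so the
        # whole function stays jittable
        batch = jax.lax.dynamic_slice_in_dim(data, start, batch_size)
        done = index + 1 >= n_batches
        return batch, index + 1, done

    return jax.jit(loader)

data = jnp.arange(12).reshape(6, 2)
loader = make_loader(data, batch_size=2)

batches = []
index = 0
while True:
    batch, index, done = loader(index)
    batches.append(batch)
    if done:
        break
```

Since `loader` takes its state as an argument and returns the updated state, it has no side effects, which is exactly what `jax.jit` requires.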

Label & Target Handling

Due to its lightweight nature, this package - as of now - doesn't perform any kind of transformation. This means that you have to transform your data first and then pass it to the dataloader. The same goes for post-processing the data.

While in PyTorch, you would do something like this:

for x, y in train_dataloader:
    ...  # do something with x and y

In Jaxonloader, we don't split the rows of the dataset into x and y; we simply return the whole row. This means that you have to do the splitting (i.e. the data post-processing) yourself.

# MNIST example
while batch := train_loader(index):
    data, index, done = batch
    print(data.shape)  # (4, 785)
    x, y = data[:, :-1], data[:, -1]  # split the row into features and label

    # do something with x and y
    if done:
        break
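Typical post-processing after the split might scale the pixels and one-hot encode the labels. A hedged sketch (the function name and the fake batch are illustrative, not part of jaxonloader; it only assumes each row is 784 pixel values followed by the label, matching the (4, 785) shape above):

```python
import jax
import jax.numpy as jnp

def postprocess(data, num_classes=10):
    # Illustrative helper, not a jaxonloader API
    x = data[:, :-1] / 255.0           # scale pixels to [0, 1]
    y = data[:, -1].astype(jnp.int32)  # integer class labels
    return x, jax.nn.one_hot(y, num_classes)

# Fake (4, 785) batch standing in for MNIST rows
fake_batch = jnp.concatenate(
    [jnp.ones((4, 784)), jnp.arange(4.0).reshape(4, 1)], axis=1
)
x, y = postprocess(fake_batch)
```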

Roadmap

The goal is to keep this package as lightweight as possible while also providing as many datasets as possible. The next step is to gather more datasets and provide a simple API to load them.


Other backends

Other backends are not supported, and there are no plans to support them. PyTorch already has a very good dataloader, and with all the support PyTorch enjoys, there's no need to litter the world with yet another PyTorch dataloader. The same goes for TensorFlow.

If you really need a dataloader that supports all backends, check out jax-dataloader.

Then why does this package exist?

For one, I just like building things and don't really care whether they're needed. Secondly, I don't care about other backends (they are already very well supported); I only want to focus on JAX, and I needed a lightweight, easy-to-handle package that loads data in JAX.

Also, the PyTorch dataloader is slow! Iterating over the MNIST training set takes around 2.83 seconds on a MacBook M1 Pro. Unjitted, the JAX dataloader takes 1.5 seconds, and jitted it takes around 0.09 seconds, making it roughly 31 times faster than the PyTorch dataloader.
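Timing jitted JAX code fairly takes some care, because compilation happens on the first call and dispatch is asynchronous. A sketch of a timing harness under those assumptions (the workload here is a toy reduction, not the MNIST benchmark from the text):

```python
import time
import jax
import jax.numpy as jnp

def batch_sum(x):
    return jnp.sum(x, axis=1)

data = jax.random.normal(jax.random.PRNGKey(0), (1000, 784))

jitted = jax.jit(batch_sum)
# Warm-up call: triggers compilation so it isn't counted below
jitted(data).block_until_ready()

start = time.perf_counter()
for _ in range(100):
    # block_until_ready() forces async dispatch to finish,
    # otherwise we'd only measure the time to enqueue the work
    jitted(data).block_until_ready()
elapsed = time.perf_counter() - start
```

Without the warm-up and the `block_until_ready()` calls, a benchmark like the one above would under- or over-report the jitted time.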

Project details

Download files

Download the file for your platform.

Source Distribution: jaxonloader-0.2.0.tar.gz (9.5 kB)

Built Distribution: jaxonloader-0.2.0-py3-none-any.whl (9.1 kB)
