Production-ready ML data loading library with distributed training support, SIMD-accelerated transforms, pipe operator composition, HDF5/TFRecord/Zarr support, and GPU transforms. Built with C++20 for maximum performance.

TurboLoader

Production-Ready ML Data Loading Library

Overview

TurboLoader is a high-performance data loading library for machine learning workflows. Built with C++20 and featuring Python bindings, it provides efficient data loading with SIMD-accelerated transforms, custom binary formats, and distributed training support.

Core Features

Decoded Tensor Caching (v2.7.0) - cache_decoded=True for 100K+ img/s on subsequent epochs
Multiple Loader Types - FastDataLoader (8-12% faster), MemoryEfficientDataLoader, standard DataLoader
Distributed Training Support - Multi-node data loading with deterministic sharding
SIMD-Accelerated Transforms - 19 vectorized transforms using AVX2/AVX-512/NEON
TBL v2 Binary Format - Custom format with LZ4 compression for reduced storage
Framework Integration - Seamless support for PyTorch, TensorFlow, and JAX
Memory-Mapped I/O - Zero-copy file access for improved throughput
Lock-Free Queues - Concurrent data structures for efficient multi-threading
GPU JPEG Decoding - Optional NVIDIA nvJPEG support for accelerated decoding

Installation

From PyPI (Recommended)

pip install turboloader

From Source

git clone https://github.com/ALJainProjects/TurboLoader.git
cd TurboLoader
pip install -e .

System Requirements

Python: 3.10 or higher
Compiler: C++20 capable (GCC 10+, Clang 12+, MSVC 19.29+)
OS: macOS, Linux, Windows

Optional Dependencies

Install these system libraries for enhanced performance:

# macOS
brew install jpeg-turbo libpng libwebp lz4

# Ubuntu (on Debian, the libjpeg-turbo package is named libjpeg62-turbo-dev)
sudo apt-get install libjpeg-turbo8-dev libpng-dev libwebp-dev liblz4-dev

Quick Start

Basic Usage

import turboloader

# Create DataLoader
loader = turboloader.DataLoader(
    'imagenet.tar',
    batch_size=128,
    num_workers=8
)

# Iterate over batches
for batch in loader:
    for sample in batch:
        image = sample['image']  # NumPy array (H, W, C)
        label = sample['label']
        # Train your model...

With Transforms

import turboloader

# Create transforms
resize = turboloader.Resize(224, 224)
normalize = turboloader.ImageNetNormalize()
flip = turboloader.RandomHorizontalFlip(p=0.5)

# Apply transforms
loader = turboloader.DataLoader('data.tar', batch_size=64, num_workers=8)

for batch in loader:
    for sample in batch:
        img = sample['image']
        img = resize.apply(img)
        img = flip.apply(img)
        img = normalize.apply(img)
        # Ready for training
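
The three per-sample calls above can be folded into one callable. The following is a minimal sketch that assumes only what the example shows, namely that each transform exposes .apply(img). The Compose helper and the AddOne/Double stand-ins are hypothetical and are not part of the TurboLoader API; they exist so the sketch runs without TurboLoader installed.

```python
from functools import reduce


class Compose:
    """Chain objects exposing .apply(x), applied left to right (hypothetical helper)."""

    def __init__(self, *transforms):
        self.transforms = transforms

    def apply(self, img):
        # Feed the output of each transform into the next one.
        return reduce(lambda acc, t: t.apply(acc), self.transforms, img)


# Stand-in transforms operating on plain numbers instead of images.
class AddOne:
    def apply(self, x):
        return x + 1


class Double:
    def apply(self, x):
        return x * 2


pipeline = Compose(AddOne(), Double())
result = pipeline.apply(3)  # (3 + 1) * 2
```

With real transforms, the same helper would be built as Compose(resize, flip, normalize) and called once per sample.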

PyTorch Integration

import turboloader
import torch

loader = turboloader.DataLoader('imagenet.tar', batch_size=64, num_workers=8)

# Convert to PyTorch tensors
to_tensor = turboloader.ToTensor(
    format=turboloader.TensorFormat.PYTORCH_CHW
)

for batch in loader:
    images = []
    for sample in batch:
        img = to_tensor.apply(sample['image'])
        images.append(torch.from_numpy(img))

    batch_tensor = torch.stack(images)
    # Train model...

Distributed Training

import turboloader
import torch.distributed as dist

# Initialize distributed training
dist.init_process_group(backend='nccl')

# Create loader with distributed support
loader = turboloader.DataLoader(
    data_path="/data/imagenet.tar",
    batch_size=64,
    num_workers=4,
    shuffle=True,
    enable_distributed=True,
    world_rank=dist.get_rank(),
    world_size=dist.get_world_size(),
    drop_last=True
)

# Each rank automatically gets its shard
for batch in loader:
    # Your training code
    pass
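
How TurboLoader assigns samples to ranks is not described here, so the following is only a sketch of one common deterministic scheme: every rank shuffles the same index list with the same seed, then takes a strided slice. The function name shard_indices and its epoch and seed parameters are assumptions made for illustration.

```python
import random


def shard_indices(num_samples, world_rank, world_size, epoch=0, seed=0,
                  shuffle=True, drop_last=True):
    """Return the sample indices one rank would read in a given epoch."""
    indices = list(range(num_samples))
    if shuffle:
        # Same seed on every rank, so every rank computes the same permutation.
        random.Random(seed + epoch).shuffle(indices)
    if drop_last:
        # Trim the tail so that every rank receives the same number of samples.
        usable = (num_samples // world_size) * world_size
        indices = indices[:usable]
    # Strided split: rank r takes positions r, r + world_size, r + 2 * world_size, ...
    return indices[world_rank::world_size]


# 10 samples over 4 ranks: 8 usable samples, 2 per rank, no overlap.
shards = [shard_indices(10, rank, 4) for rank in range(4)]
```

Because the permutation depends only on seed and epoch, no communication between ranks is needed to agree on the split.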

Transform Library

TurboLoader includes 19 SIMD-accelerated transforms:

Core Transforms

Resize - Bilinear/Bicubic/Lanczos interpolation
Normalize - Mean/std normalization with SIMD
CenterCrop - Center region extraction
RandomCrop - Random crop with padding

Augmentation Transforms

RandomHorizontalFlip - SIMD horizontal flip
RandomVerticalFlip - SIMD vertical flip
ColorJitter - Brightness/contrast/saturation/hue
RandomRotation - Arbitrary angle rotation
GaussianBlur - Separable convolution
RandomErasing - Cutout augmentation
Pad - Border padding (CONSTANT/EDGE/REFLECT)

Advanced Transforms

RandomPosterize - Bit-depth reduction
RandomSolarize - Threshold inversion
RandomPerspective - Perspective warp
AutoAugment - Learned policies (ImageNet/CIFAR10/SVHN)
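
To make "bit-depth reduction" and "threshold inversion" concrete, here is a pure-Python sketch of both operations on 8-bit pixel values. It illustrates the arithmetic only and says nothing about TurboLoader's SIMD implementation. The function names are hypothetical, and inverting values at or above the threshold is an assumed convention.

```python
def posterize(pixels: bytes, bits: int) -> bytes:
    """Keep only the top `bits` bits of every 8-bit value."""
    if not 1 <= bits <= 8:
        raise ValueError("bits must be in 1..8")
    mask = (0xFF << (8 - bits)) & 0xFF
    return bytes(p & mask for p in pixels)


def solarize(pixels: bytes, threshold: int) -> bytes:
    """Invert every value at or above `threshold`; leave the rest unchanged."""
    return bytes(255 - p if p >= threshold else p for p in pixels)


row = bytes([0, 37, 128, 200, 255])
posterized = posterize(row, bits=3)       # mask is 0b11100000
solarized = solarize(row, threshold=128)
```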

Tensor Conversion

ToTensor - PyTorch CHW or TensorFlow HWC format

TBL v2 Binary Format

TurboLoader includes a custom binary format optimized for ML workloads:

Features

LZ4 compression for reduced storage
Memory-mapped access for fast loading
O(1) random access via indexed structure
Data integrity validation with CRC checksums
Cached image dimensions for filtered loading
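
The combination of compression, an index for O(1) access, and per-sample checksums can be sketched with the standard library alone. The layout below is illustrative and is not the TBL v2 on-disk format. zlib stands in for LZ4, which the standard library does not provide, and all names are hypothetical.

```python
import io
import struct
import zlib

# One fixed-size index entry per sample: offset, compressed length, CRC32 of raw bytes.
INDEX_ENTRY = struct.Struct("<QII")


def write_container(samples):
    """Pack samples into (blob, index)."""
    blob = io.BytesIO()
    index = bytearray()
    for raw in samples:
        packed = zlib.compress(raw)  # stand-in for LZ4
        index += INDEX_ENTRY.pack(blob.tell(), len(packed), zlib.crc32(raw))
        blob.write(packed)
    return blob.getvalue(), bytes(index)


def read_sample(blob, index, i):
    """Fetch sample i with a single index lookup, then verify its checksum."""
    offset, length, crc = INDEX_ENTRY.unpack_from(index, i * INDEX_ENTRY.size)
    raw = zlib.decompress(blob[offset:offset + length])
    if zlib.crc32(raw) != crc:
        raise ValueError(f"checksum mismatch in sample {i}")
    return raw


blob, index = write_container([b"cat" * 100, b"dog" * 100, b"bird" * 100])
second = read_sample(blob, index, 1)
```

Because every index entry has the same size, locating sample i is a multiplication rather than a scan, which is what makes random access constant-time.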

Convert TAR to TBL

import turboloader

writer = turboloader.TblWriterV2(
    output_path="/data/imagenet.tbl",
    compression=True
)

reader = turboloader.TarReader("/data/imagenet.tar")
for sample in reader:
    writer.add_sample(
        data=sample.data,
        format=sample.format,
        metadata={"label": sample.label}
    )

writer.finalize()

Documentation

Getting Started

Quick Start Notebook - Interactive tutorial for beginners
Installation Guide - Detailed setup instructions
Quick Start - Getting started examples
Troubleshooting Guide - Common issues and solutions

API Documentation

API Reference - Complete API documentation
Transforms API - All 19 transforms with examples

Framework Integration

PyTorch Integration Guide - Complete PyTorch guide
TensorFlow Integration Guide - Complete TensorFlow/Keras guide
PyTorch Lightning Example - Production-ready Lightning integration
Distributed Training (DDP) - Multi-GPU PyTorch DDP example

Examples

ImageNet ResNet50 Training - Complete training pipeline with AMP, checkpointing, TensorBoard
Distributed Training - Multi-node setup guide

Benchmarks

Head-to-head comparison with optimized PyTorch DataLoader (persistent_workers=True, prefetch_factor=4). Both loaders tested under identical conditions.

vs PyTorch DataLoader (BS=32, NW=4)

Configuration	TurboLoader	PyTorch	Speedup
uint8 CHW (resize only)	8,027 img/s	2,457 img/s	3.3x
float32 CHW (0-1 normalize)	8,456 img/s	2,040 img/s	4.1x
float32 CHW + ImageNet mean/std	8,029 img/s	2,039 img/s	3.9x

Decoded Tensor Caching (`cache_decoded=True`)

Configuration	Epoch 2 Throughput
uint8 HWC (from cache)	57,692,695 img/s
float32 CHW (from cache)	42,933,573 img/s
float32 CHW + ImageNet (from cache)	39,853,643 img/s
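
Second-epoch throughput rises because the decode step is skipped entirely. The sketch below shows that idea in its simplest form; CachingLoader is a hypothetical class and not how cache_decoded=True is implemented. The cost is memory: at the benchmark's 640x480 source resolution, one decoded uint8 RGB image occupies 640 x 480 x 3 = 921,600 bytes.

```python
class CachingLoader:
    """Decode each sample once, then serve later epochs from memory (hypothetical)."""

    def __init__(self, encoded_samples, decode):
        self.encoded = encoded_samples
        self.decode = decode
        self.cache = {}
        self.decode_calls = 0

    def __iter__(self):
        for i, blob in enumerate(self.encoded):
            if i not in self.cache:
                # First epoch only: pay the decode cost and remember the result.
                self.decode_calls += 1
                self.cache[i] = self.decode(blob)
            yield self.cache[i]


# `list` stands in for an image decoder: it turns bytes into a list of ints.
loader = CachingLoader([b"\x01\x02", b"\x03\x04"], decode=list)
epoch1 = list(loader)
epoch2 = list(loader)  # served from the cache, no decode calls
```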

Worker Scaling (BS=32, float32 CHW + ImageNet)

Workers	TurboLoader	PyTorch	Speedup
1 worker	1,585 img/s	625 img/s	2.5x
2 workers	3,383 img/s	1,184 img/s	2.9x
4 workers	7,744 img/s	2,016 img/s	3.8x
8 workers	13,327 img/s	3,047 img/s	4.4x
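
The Speedup column is the ratio of the two throughput columns. The short script below recomputes it from the table and also derives a scaling efficiency, defined here as an assumption of this sketch: TurboLoader throughput divided by perfect linear scaling from the 1-worker figure.

```python
# (workers, TurboLoader img/s, PyTorch img/s), copied from the table above.
rows = [(1, 1585, 625), (2, 3383, 1184), (4, 7744, 2016), (8, 13327, 3047)]

base_workers, base_turbo, _ = rows[0]
report = []
for workers, turbo, pytorch in rows:
    speedup = turbo / pytorch
    # 1.0 means throughput grew exactly in proportion to the worker count.
    efficiency = turbo / (base_turbo * workers / base_workers)
    report.append((workers, round(speedup, 1), round(efficiency, 2)))
```

The rounded speedups reproduce the table (2.5, 2.9, 3.8, 4.4).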

Batch Size Scaling (NW=4, float32 CHW + ImageNet)

Batch Size	TurboLoader	PyTorch	Speedup
8	7,997 img/s	2,342 img/s	3.4x
16	8,280 img/s	2,261 img/s	3.7x
32	7,418 img/s	1,946 img/s	3.8x
64	7,896 img/s	1,765 img/s	4.5x
128	7,841 img/s	1,521 img/s	5.2x

Test conditions: Apple M4 Pro, 5000 JPEG images (640x480), best of 3 trials, 100 batches per trial. PyTorch uses persistent_workers=True, prefetch_factor=4.

Key Optimizations

OpenMP parallelism for batch assembly (decode, resize, transpose, convert)
Fused SIMD deinterleave: NEON vld3q_u8 for HWC→CHW + u8→f32 + normalize in a single pass
Thread-local buffers to eliminate per-sample heap allocation under OpenMP
Pipeline reset reuses buffer pools, decoders, and memory maps across epochs
LTO (thin) for cross-TU inlining of SIMD functions
GIL released during all C++ processing
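
The fused deinterleave step can be read as: walk the interleaved RGB bytes once, and for each byte write a normalized float into the matching channel plane. Here is a scalar pure-Python sketch of that single pass. It is not the NEON code, and the mean/std constants are the commonly used ImageNet statistics, assumed rather than taken from TurboLoader.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def hwc_u8_to_chw_f32(pixels: bytes, height: int, width: int):
    """One pass over interleaved RGB bytes, producing three normalized float planes."""
    if len(pixels) != height * width * 3:
        raise ValueError("expected height * width * 3 bytes")
    # Fold the u8 -> [0, 1] scaling and the normalization into one multiply-add:
    # (p / 255 - mean) / std == p * scale + bias
    scale = [1.0 / (255.0 * s) for s in IMAGENET_STD]
    bias = [-m / s for m, s in zip(IMAGENET_MEAN, IMAGENET_STD)]
    planes = [[0.0] * (height * width) for _ in range(3)]
    for i in range(height * width):
        for c in range(3):
            planes[c][i] = pixels[3 * i + c] * scale[c] + bias[c]
    return planes  # planes[c][y * width + x]


# Two pixels: (255, 0, 128) and (0, 255, 128).
r, g, b = hwc_u8_to_chw_f32(bytes([255, 0, 128, 0, 255, 128]), height=1, width=2)
```

Doing the layout change, type conversion, and normalization together means each pixel is read and written once instead of three times.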

Note: Actual throughput depends on your hardware, image sizes, and pipeline configuration. Run the benchmark on your setup for precise figures.
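
A measurement along the lines of the test conditions above (best of 3 trials, 100 batches per trial) can be sketched as follows. This is not the project's benchmark script; measure_throughput and dummy_loader are hypothetical names, and the sketch assumes every batch holds exactly batch_size samples.

```python
import time


def measure_throughput(make_loader, batch_size, num_batches=100, trials=3):
    """Return the best images/second over `trials` runs of `num_batches` batches."""
    best = 0.0
    for _ in range(trials):
        loader = iter(make_loader())
        start = time.perf_counter()
        seen = 0
        for _ in range(num_batches):
            try:
                next(loader)
            except StopIteration:
                break  # loader ran out before num_batches
            seen += 1
        elapsed = time.perf_counter() - start
        if elapsed > 0:
            best = max(best, seen * batch_size / elapsed)
    return best


# Stand-in loader: an endless stream of dummy batches of 32 items.
def dummy_loader():
    while True:
        yield [None] * 32


rate = measure_throughput(dummy_loader, batch_size=32)
```

To time a real loader, pass a zero-argument function that constructs it, so each trial starts from a fresh iterator.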

Architecture

TurboLoader uses a multi-threaded pipeline architecture:

┌─────────────────────────────────────────────┐
│           Memory-Mapped Reader              │
│     (TAR/TBL v2 with zero-copy access)      │
└──────────────┬──────────────────────────────┘
               │
        ┌──────▼──────┐
        │Worker Pool  │
        │  (N threads)│
        ├─────────────┤
        │ Decode      │
        │ Transform   │
        │ Convert     │
        └──────┬──────┘
               │
        ┌──────▼──────────────┐
        │ Lock-Free Queue     │
        └──────┬──────────────┘
               │
        ┌──────▼──────┐
        │Python API   │
        └─────────────┘
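
The shape of this pipeline can be imitated in a few lines of Python. The sketch below is an analogy only: queue.Queue is lock-based rather than lock-free, the workers are Python threads rather than C++ threads, and run_pipeline is a hypothetical name. It also assumes that process never raises.

```python
import queue
import threading


def run_pipeline(items, process, num_workers=4, queue_size=8):
    """Fan items out to worker threads and collect results through a bounded queue."""
    items = list(items)
    tasks = queue.Queue()
    for indexed in enumerate(items):
        tasks.put(indexed)
    results = queue.Queue(maxsize=queue_size)

    def worker():
        while True:
            try:
                index, item = tasks.get_nowait()
            except queue.Empty:
                return  # no work left
            # Blocks while the queue is full, so workers cannot outrun the consumer.
            results.put((index, process(item)))

    threads = [threading.Thread(target=worker, daemon=True) for _ in range(num_workers)]
    for t in threads:
        t.start()
    # Consumer side: drain one result per input item, then restore input order.
    collected = [results.get() for _ in items]
    for t in threads:
        t.join()
    return [value for _, value in sorted(collected)]


squares = run_pipeline(range(20), lambda x: x * x)
```

The bounded results queue is what provides backpressure: decoding pauses when the training loop falls behind, which caps memory use.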

Key Components

Memory-Mapped I/O - Zero-copy file access
Worker Thread Pool - Parallel processing with per-thread decoders
SIMD Transforms - Vectorized operations (AVX2/AVX-512/NEON)
Lock-Free Queues - High-performance concurrent data structures
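
For readers unfamiliar with memory-mapped access, the standard-library mmap module shows the idea: the file is mapped into the address space and slices refer to the mapped pages instead of being read into a separate buffer. This sketch uses a throwaway temporary file with a made-up 6-byte header; it does not reflect TurboLoader's reader.

```python
import mmap
import os
import tempfile

# Write a small file to map; a real dataset file would take its place.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"HEADER" + b"payload-bytes")
    path = f.name

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        view = memoryview(mapped)
        # Slicing a memoryview references the mapped pages; nothing is copied here.
        payload_view = view[6:]
        payload = bytes(payload_view)  # the only copy, made explicitly
        # Release the views before the map closes, otherwise close() raises BufferError.
        payload_view.release()
        view.release()

os.remove(path)
```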

License

TurboLoader is released under the MIT License.

Citation

If you use TurboLoader in your research:

@software{turboloader2025,
  author = {Jain, Arnav},
  title = {TurboLoader: Production-Ready ML Data Loading},
  year = {2025},
  version = {2.7.0},
  url = {https://github.com/ALJainProjects/TurboLoader}
}

Support

Documentation: https://github.com/ALJainProjects/TurboLoader/tree/main/docs
Troubleshooting: https://github.com/ALJainProjects/TurboLoader/blob/main/docs/TROUBLESHOOTING.md
Verification Script: Run python scripts/verify_installation.py to check your setup
Issues: GitHub Issues
Discussions: GitHub Discussions
PyPI: https://pypi.org/project/turboloader/

TurboLoader - Production-ready ML data loading. 2.5-5.2x faster than optimized PyTorch DataLoader.
