Production-ready ML data loading library with distributed training support, SIMD-accelerated transforms, pipe operator composition, HDF5/TFRecord/Zarr support, and GPU transforms. Built with C++20 for maximum performance.

TurboLoader

Production-Ready ML Data Loading Library


Overview

TurboLoader is a high-performance data loading library for machine learning workflows. Built with C++20 and featuring Python bindings, it provides efficient data loading with SIMD-accelerated transforms, custom binary formats, and distributed training support.

Core Features

  • Decoded Tensor Caching (v2.7.0) - cache_decoded=True for 100K+ img/s on subsequent epochs
  • Multiple Loader Types - FastDataLoader (8-12% faster), MemoryEfficientDataLoader, standard DataLoader
  • Distributed Training Support - Multi-node data loading with deterministic sharding
  • SIMD-Accelerated Transforms - 19 vectorized transforms using AVX2/AVX-512/NEON
  • TBL v2 Binary Format - Custom format with LZ4 compression for reduced storage
  • Framework Integration - Seamless support for PyTorch, TensorFlow, and JAX
  • Memory-Mapped I/O - Zero-copy file access for improved throughput
  • Lock-Free Queues - Concurrent data structures for efficient multi-threading
  • GPU JPEG Decoding - Optional NVIDIA nvJPEG support for accelerated decoding

Installation

From PyPI (Recommended)

pip install turboloader

From Source

git clone https://github.com/ALJainProjects/TurboLoader.git
cd TurboLoader
pip install -e .

System Requirements

  • Python: 3.10 or higher
  • Compiler: C++20 capable (GCC 10+, Clang 12+, MSVC 19.29+)
  • OS: macOS, Linux, Windows

Optional Dependencies

Install for enhanced performance:

# macOS
brew install jpeg-turbo libpng libwebp lz4

# Ubuntu/Debian
sudo apt-get install libjpeg-turbo8-dev libpng-dev libwebp-dev liblz4-dev

Quick Start

Basic Usage

import turboloader

# Create DataLoader
loader = turboloader.DataLoader(
    'imagenet.tar',
    batch_size=128,
    num_workers=8
)

# Iterate over batches
for batch in loader:
    for sample in batch:
        image = sample['image']  # NumPy array (H, W, C)
        label = sample['label']
        # Train your model...

With Transforms

import turboloader

# Create transforms
resize = turboloader.Resize(224, 224)
normalize = turboloader.ImageNetNormalize()
flip = turboloader.RandomHorizontalFlip(p=0.5)

# Apply transforms
loader = turboloader.DataLoader('data.tar', batch_size=64, num_workers=8)

for batch in loader:
    for sample in batch:
        img = sample['image']
        img = resize.apply(img)
        img = flip.apply(img)
        img = normalize.apply(img)
        # Ready for training
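
The three `apply` calls above can be bundled into one object so the training loop stays short. The following is a minimal sketch, assuming only that each transform exposes an `apply(img)` method as in the snippet above. `Compose`, `Scale`, and `Offset` are hypothetical names used for illustration, not confirmed parts of the TurboLoader API; the stand-in transforms work on plain lists so the example runs without the library installed.

```python
from typing import Any, Protocol, Sequence


class Transform(Protocol):
    def apply(self, img: Any) -> Any: ...


class Compose:
    """Hypothetical helper: chain transforms that expose .apply()."""

    def __init__(self, transforms: Sequence[Transform]) -> None:
        self.transforms = list(transforms)

    def apply(self, img: Any) -> Any:
        # Feed the output of each transform into the next one, in order.
        for t in self.transforms:
            img = t.apply(img)
        return img


# Stand-in transforms so the sketch runs without TurboLoader installed.
class Scale:
    def __init__(self, factor: float) -> None:
        self.factor = factor

    def apply(self, img):
        return [v * self.factor for v in img]


class Offset:
    def __init__(self, delta: float) -> None:
        self.delta = delta

    def apply(self, img):
        return [v + self.delta for v in img]


pipeline = Compose([Scale(2.0), Offset(1.0)])
result = pipeline.apply([1.0, 2.0, 3.0])
```

With real transforms, the same pattern would read `Compose([resize, flip, normalize]).apply(sample['image'])`, provided those objects behave as shown in the snippet above.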

PyTorch Integration

import turboloader
import torch

loader = turboloader.DataLoader('imagenet.tar', batch_size=64, num_workers=8)

# Convert to PyTorch tensors
to_tensor = turboloader.ToTensor(
    format=turboloader.TensorFormat.PYTORCH_CHW
)

for batch in loader:
    images = []
    for sample in batch:
        img = to_tensor.apply(sample['image'])
        images.append(torch.from_numpy(img))

    batch_tensor = torch.stack(images)
    # Train model...

Distributed Training

import turboloader
import torch.distributed as dist

# Initialize distributed training
dist.init_process_group(backend='nccl')

# Create loader with distributed support
loader = turboloader.DataLoader(
    data_path="/data/imagenet.tar",
    batch_size=64,
    num_workers=4,
    shuffle=True,
    enable_distributed=True,
    world_rank=dist.get_rank(),
    world_size=dist.get_world_size(),
    drop_last=True
)

# Each rank automatically gets its shard
for batch in loader:
    # Your training code
    pass
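
To make "each rank gets its shard" concrete, here is a minimal sketch of one common way to shard deterministically: every rank builds the same seeded permutation, then takes a strided slice. This is an assumption about the general technique, not a description of TurboLoader's internal sharding; `shard_indices` and its parameters are hypothetical.

```python
import random


def shard_indices(num_samples, rank, world_size, seed=0, epoch=0,
                  shuffle=True, drop_last=True):
    """Return the sample indices one rank would read in one epoch."""
    indices = list(range(num_samples))
    if shuffle:
        # Same seed + epoch on every rank gives the same permutation everywhere.
        random.Random(seed + epoch).shuffle(indices)
    if drop_last:
        # Trim the tail so every rank gets the same number of samples.
        usable = (num_samples // world_size) * world_size
        indices = indices[:usable]
    # Strided split: rank r takes positions r, r + world_size, ...
    return indices[rank::world_size]


# Four ranks sharing ten samples: each gets two, and two samples are dropped.
shards = [shard_indices(10, r, world_size=4, seed=42) for r in range(4)]
```

Because the permutation depends only on `seed` and `epoch`, no communication between ranks is needed to agree on the split, and changing `epoch` reshuffles the data between passes.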

Transform Library

TurboLoader includes 19 SIMD-accelerated transforms, including the following:

Core Transforms

  • Resize - Bilinear/Bicubic/Lanczos interpolation
  • Normalize - Mean/std normalization with SIMD
  • CenterCrop - Center region extraction
  • RandomCrop - Random crop with padding

Augmentation Transforms

  • RandomHorizontalFlip - SIMD horizontal flip
  • RandomVerticalFlip - SIMD vertical flip
  • ColorJitter - Brightness/contrast/saturation/hue
  • RandomRotation - Arbitrary angle rotation
  • GaussianBlur - Separable convolution
  • RandomErasing - Cutout augmentation
  • Pad - Border padding (CONSTANT/EDGE/REFLECT)

Advanced Transforms

  • RandomPosterize - Bit-depth reduction
  • RandomSolarize - Threshold inversion
  • RandomPerspective - Perspective warp
  • AutoAugment - Learned policies (ImageNet/CIFAR10/SVHN)

Tensor Conversion

  • ToTensor - PyTorch CHW or TensorFlow HWC format

TBL v2 Binary Format

TurboLoader includes a custom binary format optimized for ML workloads:

Features

  • LZ4 compression for reduced storage
  • Memory-mapped access for fast loading
  • O(1) random access via indexed structure
  • Data integrity validation with CRC checksums
  • Cached image dimensions for filtered loading
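
The combination of an index, per-sample compression, and checksums can be illustrated with a toy container. This is a sketch under stated assumptions: the layout below is invented for illustration and is not the TBL v2 on-disk format, and `zlib` stands in for LZ4 because LZ4 is not part of the Python standard library. `pack`, `read`, and `ENTRY` are hypothetical names.

```python
import struct
import zlib

# Index entry: offset (uint64), compressed length (uint32), CRC32 (uint32).
ENTRY = struct.Struct("<QII")


def pack(samples):
    """Pack byte strings into (index_bytes, blob_bytes)."""
    index, blob = bytearray(), bytearray()
    for raw in samples:
        comp = zlib.compress(raw)
        index += ENTRY.pack(len(blob), len(comp), zlib.crc32(raw))
        blob += comp
    return bytes(index), bytes(blob)


def read(index, blob, i):
    """Fetch sample i with one index lookup, then verify its checksum."""
    offset, length, crc = ENTRY.unpack_from(index, i * ENTRY.size)
    raw = zlib.decompress(blob[offset:offset + length])
    if zlib.crc32(raw) != crc:
        raise ValueError(f"checksum mismatch for sample {i}")
    return raw


index, blob = pack([b"cat" * 100, b"dog" * 100, b"bird" * 100])
second = read(index, blob, 1)  # jumps straight to sample 1
```

Fixed-size index entries are what make the lookup O(1): the position of entry `i` is `i * ENTRY.size`, so no scan over earlier samples is needed.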

Convert TAR to TBL

import turboloader

writer = turboloader.TblWriterV2(
    output_path="/data/imagenet.tbl",
    compression=True
)

reader = turboloader.TarReader("/data/imagenet.tar")
for sample in reader:
    writer.add_sample(
        data=sample.data,
        format=sample.format,
        metadata={"label": sample.label}
    )

writer.finalize()


Benchmarks

Head-to-head comparison against a tuned PyTorch DataLoader (persistent_workers=True, prefetch_factor=4). Both loaders were tested under identical conditions.

vs PyTorch DataLoader (batch size 32, 4 workers)

| Configuration | TurboLoader | PyTorch | Speedup |
|---|---|---|---|
| uint8 CHW (resize only) | 8,027 img/s | 2,457 img/s | 3.3x |
| float32 CHW (0-1 normalize) | 8,456 img/s | 2,040 img/s | 4.1x |
| float32 CHW + ImageNet mean/std | 8,029 img/s | 2,039 img/s | 3.9x |
Decoded Tensor Caching (cache_decoded=True)

| Configuration | Epoch 2 Throughput |
|---|---|
| uint8 HWC (from cache) | 57,692,695 img/s |
| float32 CHW (from cache) | 42,933,573 img/s |
| float32 CHW + ImageNet (from cache) | 39,853,643 img/s |
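
The idea behind decoded-tensor caching is that decoding happens once per sample and later epochs only read from memory. The sketch below shows that pattern in isolation; it is an assumption about the general technique, not TurboLoader's implementation of `cache_decoded=True`. `CachedDecoder` and `fake_decode` are hypothetical names.

```python
class CachedDecoder:
    """Hypothetical wrapper: decode each sample once, then serve it from memory."""

    def __init__(self, decode):
        self.decode = decode
        self.cache = {}
        self.misses = 0

    def get(self, index, encoded):
        if index not in self.cache:
            # First visit: pay the decode cost and remember the result.
            self.misses += 1
            self.cache[index] = self.decode(encoded)
        return self.cache[index]


def fake_decode(encoded: bytes):
    # Stand-in for JPEG decoding: turn bytes into a list of pixel values.
    return list(encoded)


dataset = [b"\x01\x02", b"\x03\x04", b"\x05\x06"]
decoder = CachedDecoder(fake_decode)

# Epoch 1 decodes every sample; epoch 2 is served entirely from the cache.
for epoch in range(2):
    decoded = [decoder.get(i, enc) for i, enc in enumerate(dataset)]
```

In this sketch the cache grows with the dataset, so the approach only pays off when the decoded data fits in memory.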

Worker Scaling (batch size 32, float32 CHW + ImageNet)

| Workers | TurboLoader | PyTorch | Speedup |
|---|---|---|---|
| 1 | 1,585 img/s | 625 img/s | 2.5x |
| 2 | 3,383 img/s | 1,184 img/s | 2.9x |
| 4 | 7,744 img/s | 2,016 img/s | 3.8x |
| 8 | 13,327 img/s | 3,047 img/s | 4.4x |

Batch Size Scaling (4 workers, float32 CHW + ImageNet)

| Batch Size | TurboLoader | PyTorch | Speedup |
|---|---|---|---|
| 8 | 7,997 img/s | 2,342 img/s | 3.4x |
| 16 | 8,280 img/s | 2,261 img/s | 3.7x |
| 32 | 7,418 img/s | 1,946 img/s | 3.8x |
| 64 | 7,896 img/s | 1,765 img/s | 4.5x |
| 128 | 7,841 img/s | 1,521 img/s | 5.2x |

Test conditions: Apple M4 Pro, 5000 JPEG images (640x480), best of 3 trials, 100 batches per trial. PyTorch uses persistent_workers=True, prefetch_factor=4.
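
For readers who want to reproduce this kind of measurement, the "best of 3 trials, 100 batches per trial" protocol can be sketched as follows. This is not the project's benchmark script; `measure_throughput` and `fake_loader` are hypothetical, and the sketch assumes a batch supports `len()`.

```python
import time


def measure_throughput(make_loader, num_batches=100, trials=3):
    """Return the best images/second over several timed trials."""
    best = 0.0
    for _ in range(trials):
        loader = make_loader()  # fresh loader per trial
        seen = 0
        start = time.perf_counter()
        for i, batch in enumerate(loader):
            seen += len(batch)
            if i + 1 >= num_batches:
                break
        elapsed = time.perf_counter() - start
        if elapsed > 0:
            best = max(best, seen / elapsed)
    return best


def fake_loader(num_batches=200, batch_size=32):
    # Stand-in loader so the sketch runs anywhere; swap in a real loader factory.
    for _ in range(num_batches):
        yield [0] * batch_size


rate = measure_throughput(fake_loader, num_batches=100, trials=3)
```

Reporting the best trial rather than the mean reduces the influence of one-off stalls such as cold file caches, at the cost of hiding run-to-run variance.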

Key Optimizations

  • OpenMP parallelism for batch assembly (decode, resize, transpose, convert)
  • Fused SIMD deinterleave: NEON vld3q_u8 for HWC→CHW + u8→f32 + normalize in a single pass
  • Thread-local buffers to eliminate per-sample heap allocation under OpenMP
  • Pipeline reset reuses buffer pools, decoders, and memory maps across epochs
  • LTO (thin) for cross-TU inlining of SIMD functions
  • GIL released during all C++ processing
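
The fused pass in the list above combines three steps that are often done separately: splitting interleaved RGB into planes (HWC to CHW), converting uint8 to float, and normalizing. The scalar sketch below shows the same data flow without SIMD. It assumes pixels are scaled to [0, 1] before normalization and uses the widely used ImageNet mean/std values; whether TurboLoader uses exactly these constants is an assumption, and `hwc_u8_to_chw_f32` is a hypothetical name.

```python
MEAN = (0.485, 0.456, 0.406)   # widely used ImageNet channel means (assumed)
STD = (0.229, 0.224, 0.225)    # widely used ImageNet channel stds (assumed)


def hwc_u8_to_chw_f32(pixels, height, width):
    """Deinterleave RGBRGB... bytes into normalized R, G, B planes in one loop."""
    plane = height * width
    out = [0.0] * (3 * plane)
    for i in range(plane):
        for c in range(3):
            # Read interleaved input, write planar output, normalize on the way.
            value = pixels[3 * i + c] / 255.0
            out[c * plane + i] = (value - MEAN[c]) / STD[c]
    return out


# 1x2 image: one pure-red pixel, one pure-blue pixel.
chw = hwc_u8_to_chw_f32(bytes([255, 0, 0, 0, 0, 255]), height=1, width=2)
```

Doing all three steps in one loop means each input byte is read once and each output float is written once, which is the memory-traffic saving a fused SIMD kernel aims for.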

Note: Actual throughput depends on your hardware, image sizes, and pipeline configuration. Run the benchmark on your setup for precise figures.


Architecture

TurboLoader uses a multi-threaded pipeline architecture:

┌─────────────────────────────────────────────┐
│           Memory-Mapped Reader              │
│     (TAR/TBL v2 with zero-copy access)      │
└──────────────┬──────────────────────────────┘
               │
        ┌──────▼──────┐
        │Worker Pool  │
        │  (N threads)│
        ├─────────────┤
        │ Decode      │
        │ Transform   │
        │ Convert     │
        └──────┬──────┘
               │
        ┌──────▼──────────────┐
        │ Lock-Free Queue     │
        └──────┬──────────────┘
               │
        ┌──────▼──────┐
        │Python API   │
        └─────────────┘

Key Components

  • Memory-Mapped I/O - Zero-copy file access
  • Worker Thread Pool - Parallel processing with per-thread decoders
  • SIMD Transforms - Vectorized operations (AVX2/AVX-512/NEON)
  • Lock-Free Queues - High-performance concurrent data structures
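
The flow in the diagram (reader, worker pool, queue, consumer) can be sketched in pure Python. This is only an illustration of the pipeline shape: `queue.Queue` uses locks rather than a lock-free structure, and Python threads share the GIL, whereas the README describes C++ worker threads running with the GIL released. `run_pipeline` is a hypothetical name.

```python
import queue
import threading


def run_pipeline(samples, process, num_workers=4, queue_size=8):
    """Fan samples out to worker threads and collect processed results."""
    tasks = queue.Queue()
    results = queue.Queue(maxsize=queue_size)  # bounded: applies backpressure
    for item in enumerate(samples):
        tasks.put(item)

    def worker():
        # Each worker pulls tasks until none are left.
        while True:
            try:
                idx, sample = tasks.get_nowait()
            except queue.Empty:
                return
            results.put((idx, process(sample)))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()

    # The consumer restores the original order using the sample index.
    out = [None] * len(samples)
    for _ in range(len(samples)):
        idx, value = results.get()
        out[idx] = value
    for t in threads:
        t.join()
    return out


processed = run_pipeline(list(range(10)), lambda x: x * x)
```

The bounded results queue keeps workers from running arbitrarily far ahead of the consumer, which caps memory use when decoding is faster than training.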

License

TurboLoader is released under the MIT License.


Citation

If you use TurboLoader in your research:

@software{turboloader2025,
  author = {Jain, Arnav},
  title = {TurboLoader: Production-Ready ML Data Loading},
  year = {2025},
  version = {2.7.0},
  url = {https://github.com/ALJainProjects/TurboLoader}
}


TurboLoader - Production-ready ML data loading. 2.5-5.2x faster than optimized PyTorch DataLoader.
