
High-performance multi-framework data loading library (10,146 img/s in the project's benchmark, 12x faster than the PyTorch Optimized DataLoader configuration). Features: TensorFlow/Keras, JAX/Flax, and PyTorch support, WebDataset format, cloud storage (S3/GCS/HTTP), a SIMD-optimized JPEG decoder, 19 advanced transforms (including AutoAugment), and comprehensive benchmarking. Developed and tested on an Apple M4 Max (48 GB RAM) with C++20 and Python 3.8+.


TurboLoader

High-Performance ML Data Loading Library



Overview

TurboLoader is a high-performance data loading library designed to accelerate ML training by replacing Python's multiprocessing-based data loaders with efficient C++ native threads and thread-safe concurrent data structures.

Key Features:

  • 🚀 Native C++ Implementation with Python bindings via pybind11
  • SIMD-Optimized Transforms using AVX2/AVX-512/NEON
  • 🔒 Thread-Safe Concurrent Queues for reliable multi-threaded data passing
  • 🧵 C++ Native Threads (no Python GIL, no multiprocessing overhead)
  • 💾 Zero-Copy Memory-Mapped I/O for efficient file reading
  • 📦 WebDataset TAR Format support for sharded datasets
  • 🎯 SIMD-Accelerated Image Decoders (JPEG, PNG, WebP)
  • 🎨 19 Data Augmentation Transforms with SIMD optimization (5 new in v0.7.0)
  • 🤖 AutoAugment Policies for state-of-the-art augmentation
  • 🐍 PyTorch-Compatible API, intended as a drop-in replacement for existing data loaders

Performance

v0.7.0 Advanced Transforms (New!)

5 Additional SIMD-Accelerated Transforms:

  • RandomPosterize: Bit-depth reduction (ultra-fast bitwise ops, 336,000+ img/s)
  • RandomSolarize: Threshold-based pixel inversion (21,000+ img/s)
  • RandomPerspective: Perspective warping with SIMD interpolation (9,900+ img/s)
  • AutoAugment: Learned augmentation policies (ImageNet, CIFAR10, SVHN) (19,800+ img/s)
  • Lanczos Interpolation: High-quality downsampling for Resize (2,900+ img/s)

See BENCHMARK_RESULTS_V0.7.md for detailed performance analysis.
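
The first two transforms in the list above are simple per-pixel operations. The sketch below shows what they compute on 8-bit values in plain Python, assuming the conventional definitions: posterize keeps only the top `bits` bits, and solarize inverts values at or above a threshold. The function names `posterize` and `solarize` are illustrative and are not TurboLoader's API; the library's SIMD code is not reproduced here.

```python
def posterize(pixels, bits):
    """Keep only the top `bits` bits of each 8-bit value (bit-depth reduction)."""
    if not 1 <= bits <= 8:
        raise ValueError("bits must be in 1..8")
    mask = (0xFF << (8 - bits)) & 0xFF  # e.g. bits=3 gives 0b11100000
    return [p & mask for p in pixels]


def solarize(pixels, threshold):
    """Invert every 8-bit value that is at or above `threshold`."""
    return [255 - p if p >= threshold else p for p in pixels]


row = [0, 37, 128, 200, 255]
print(posterize(row, 3))   # [0, 32, 128, 192, 224]
print(solarize(row, 128))  # [0, 37, 127, 55, 0]
```

Posterize is a single AND per value, which is consistent with it being the fastest transform in the list.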


Overall Performance (v0.6.0)

Comprehensive Benchmark Results (2000 images, 8 workers, batch_size=32, 3 epochs):

Rank | Framework          | Throughput   | vs TurboLoader | Avg Epoch Time
1    | TurboLoader        | 10,146 img/s | 1.00x          | 0.18s
2    | TensorFlow tf.data | 7,569 img/s  | 0.75x          | 0.26s
3    | PyTorch Cached     | 3,123 img/s  | 0.31x          | 0.64s
4    | PyTorch Optimized  | 835 img/s    | 0.08x          | 2.40s
5    | PIL Baseline       | 277 img/s    | 0.03x          | 7.22s
6    | PyTorch Naive      | 85 img/s     | 0.01x          | 23.67s

Key Highlights:

  • 12x faster than PyTorch Optimized DataLoader
  • 3.2x faster than PyTorch with local file caching
  • 1.3x faster than TensorFlow tf.data
  • Stable epoch times: standard deviation of about 0.005s across epochs
  • Memory efficient: 848 MB peak memory usage
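
The speedup figures in the highlights follow directly from the throughput column of the benchmark table. A short arithmetic check, using only the numbers reported above (the helper name `speedup` is illustrative):

```python
# Throughput in images per second, copied from the benchmark table.
throughput = {
    "TurboLoader": 10146,
    "TensorFlow tf.data": 7569,
    "PyTorch Cached": 3123,
    "PyTorch Optimized": 835,
    "PIL Baseline": 277,
    "PyTorch Naive": 85,
}


def speedup(baseline, candidate="TurboLoader"):
    """How many times faster `candidate` is than `baseline`."""
    return throughput[candidate] / throughput[baseline]


for name in ("PyTorch Optimized", "PyTorch Cached", "TensorFlow tf.data"):
    print(f"{name}: {speedup(name):.2f}x")
```

The "vs TurboLoader" column is the inverse ratio, for example 835 / 10,146 ≈ 0.08 for PyTorch Optimized.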

Test Configuration:

  • Hardware: Apple M4 Max (16 cores, 48 GB RAM)
  • Dataset: 2000 synthetic 256x256 JPEG images (117 MB TAR archive)
  • Configuration: 8 workers, batch size 32, 3 epochs
  • Backend: C++ multi-threaded pipeline with SIMD optimizations

See BENCHMARK_RESULTS.md for detailed analysis and interactive benchmark report.
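
For readers who want to take a comparable measurement on their own hardware, the sketch below times a loader over several epochs and reports images per second. It is not the project's benchmark script; `measure_throughput` and `fake_next_batch` are hypothetical names, and the stand-in loader exists only so the example runs without a dataset. Any callable with the shape of `next_batch(batch_size)` can be passed instead.

```python
import time


def measure_throughput(next_batch, batch_size, num_batches, epochs=3):
    """Return images/second for each epoch.

    `next_batch` is any callable that takes a batch size and returns a
    sequence of samples.
    """
    rates = []
    for _ in range(epochs):
        seen = 0
        start = time.perf_counter()
        for _ in range(num_batches):
            seen += len(next_batch(batch_size))
        # Guard against a zero reading on very coarse timers.
        elapsed = max(time.perf_counter() - start, 1e-9)
        rates.append(seen / elapsed)
    return rates


def fake_next_batch(batch_size):
    """Stand-in loader: returns placeholder samples without doing any I/O."""
    return [None] * batch_size


# 2000 images at batch size 32 is 63 batches (the last one partial in practice).
rates = measure_throughput(fake_next_batch, batch_size=32, num_batches=63)
print([f"{r:,.0f} img/s" for r in rates])
```

Reporting each epoch separately, rather than one average, makes warm-up effects and run-to-run variation visible.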


Installation

pip install turboloader

Requirements:

  • Python 3.8+
  • C++20 compiler (GCC 10+, Clang 12+, MSVC 19.29+)
  • CMake 3.15+

Optional Dependencies:

  • libjpeg-turbo (JPEG decoding)
  • libpng (PNG decoding)
  • libwebp (WebP decoding)

Quick Start

Basic Usage

import turboloader

# Create pipeline (the batch size is passed to next_batch(), matching the
# Pipeline signature in the API Reference)
pipeline = turboloader.Pipeline(
    tar_paths=['imagenet.tar'],
    num_workers=8,
    decode_jpeg=True
)

pipeline.start()

# Get batches
for _ in range(100):
    batch = pipeline.next_batch(32)
    for sample in batch:
        img = sample.get_image()  # NumPy array (H, W, C)
        # Your training code here...

pipeline.stop()
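
If the training loop raises, `pipeline.stop()` in the example above is never reached. One way to guarantee cleanup is a small context manager. This is a sketch, not part of TurboLoader: `running` is a hypothetical helper, and `FakePipeline` is a stand-in that only mimics the `start`/`stop`/`next_batch` method names so the example runs without the library installed.

```python
from contextlib import contextmanager


@contextmanager
def running(pipeline):
    """Start `pipeline`, yield it, and always stop it afterwards."""
    pipeline.start()
    try:
        yield pipeline
    finally:
        pipeline.stop()


class FakePipeline:
    """Stand-in object that records which lifecycle methods were called."""

    def __init__(self):
        self.events = []

    def start(self):
        self.events.append("start")

    def stop(self):
        self.events.append("stop")

    def next_batch(self, batch_size):
        return list(range(batch_size))


fake = FakePipeline()
try:
    with running(fake) as p:
        p.next_batch(4)
        raise RuntimeError("training step failed")
except RuntimeError:
    pass

print(fake.events)  # ['start', 'stop']
```

With a real pipeline the call would look like `with running(turboloader.Pipeline(...)) as pipeline:`.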

With SIMD Transforms

import turboloader

# Configure SIMD-accelerated transforms
config = turboloader.TransformConfig()
config.enable_resize = True
config.resize_width = 224
config.resize_height = 224
config.enable_normalize = True
config.mean = [0.485, 0.456, 0.406]
config.std = [0.229, 0.224, 0.225]

pipeline = turboloader.Pipeline(
    tar_paths=['imagenet.tar'],
    num_workers=8,
    decode_jpeg=True,
    enable_simd_transforms=True,
    transform_config=config
)

pipeline.start()
batch = pipeline.next_batch(256)
pipeline.stop()
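
The `mean` and `std` values above are the commonly used ImageNet channel statistics. The sketch below shows the arithmetic that normalization conventionally performs on one pixel, assuming 8-bit input is first scaled to [0, 1]; the source does not state whether TurboLoader scales before normalizing, so treat that step as an assumption. `normalize_pixel` is an illustrative name, not a library function.

```python
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb, mean=MEAN, std=STD):
    """Map one 8-bit RGB pixel to normalized floats, channel by channel."""
    # Assumed convention: scale to [0, 1], subtract the mean, divide by std.
    return [(value / 255.0 - m) / s for value, m, s in zip(rgb, mean, std)]


near_mean = normalize_pixel([124, 116, 104])  # close to the channel means
white = normalize_pixel([255, 255, 255])

print([round(v, 3) for v in near_mean])
print([round(v, 3) for v in white])
```

A pixel near the dataset mean maps to values near zero, and a white pixel maps to roughly 2.2 to 2.6 depending on the channel.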

With Data Augmentation

import turboloader

# Create augmentation pipeline
aug_pipeline = turboloader.AugmentationPipeline()
aug_pipeline.add_transform(turboloader.RandomHorizontalFlip(0.5))
aug_pipeline.add_transform(turboloader.ColorJitter(brightness=0.2, contrast=0.2))
aug_pipeline.add_transform(turboloader.RandomCrop(224, 224))

# Use with data loader (planned feature)
# pipeline = turboloader.Pipeline(tar_paths=['data.tar'], augmentations=aug_pipeline)
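
To illustrate how a seeded augmentation pipeline composes transforms, here is a toy pure-Python version operating on nested lists. All class names (`ToyAugmentationPipeline`, `HorizontalFlip`, `Crop`) are hypothetical stand-ins and not TurboLoader classes; in particular, the (height, width) argument order of `Crop` is an assumption, since the source does not document the argument order of `RandomCrop`.

```python
import random


class HorizontalFlip:
    """Mirror each row with the given probability."""

    def __init__(self, probability=0.5):
        self.probability = probability

    def __call__(self, image, rng):
        if rng.random() < self.probability:
            return [row[::-1] for row in image]
        return image


class Crop:
    """Cut a random (height, width) window out of the image."""

    def __init__(self, height, width):
        self.height = height
        self.width = width

    def __call__(self, image, rng):
        top = rng.randint(0, len(image) - self.height)
        left = rng.randint(0, len(image[0]) - self.width)
        return [row[left:left + self.width]
                for row in image[top:top + self.height]]


class ToyAugmentationPipeline:
    """Apply transforms in insertion order, sharing one seeded RNG."""

    def __init__(self, seed=None):
        self.rng = random.Random(seed)
        self.transforms = []

    def add_transform(self, transform):
        self.transforms.append(transform)

    def __call__(self, image):
        for transform in self.transforms:
            image = transform(image, self.rng)
        return image


def build(seed):
    pipeline = ToyAugmentationPipeline(seed=seed)
    pipeline.add_transform(HorizontalFlip(0.5))
    pipeline.add_transform(Crop(2, 2))
    return pipeline


image = [[r * 4 + c for c in range(4)] for r in range(4)]
first = build(seed=0)(image)
second = build(seed=0)(image)
print(first == second)  # True: the same seed reproduces the same augmentation
```

Sharing one RNG across transforms is what makes a single seed sufficient to reproduce the whole chain.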

Architecture

TurboLoader is built on several high-performance components:

Core Components

  1. Thread-Safe Concurrent Queues

    • Mutex-based synchronization for reliable multi-threaded operation
    • Thread-safe data passing between reader and worker threads
    • Stable performance with high worker counts (8+ workers)
  2. Memory-Mapped I/O

    • mmap() for zero-copy file reading
    • Efficient TAR archive parsing
    • Minimizes memory allocations
  3. SIMD Transforms

    • AVX2/AVX-512 on x86_64
    • NEON on ARM (Apple Silicon, ARM servers)
    • Vectorized resize, normalize, color conversion
  4. Thread-Local Decoders

    • Per-thread JPEG/PNG/WebP decoders
    • Eliminates decoder allocation overhead
    • Maximizes cache locality
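
The memory-mapped TAR parsing described in item 2 can be sketched with Python's standard library. The example below builds a small archive, maps it with `mmap`, and indexes it by reading the 512-byte headers directly (name at byte 0, octal size at byte 124, type flag at byte 156). It is an illustration of the technique, not TurboLoader's C++ code; `index_tar` is a hypothetical helper, and it ignores the ustar prefix field and GNU long-name extensions.

```python
import io
import mmap
import os
import tarfile
import tempfile

BLOCK = 512  # TAR headers and payload padding use 512-byte blocks


def index_tar(buf):
    """Return [(name, data_offset, size)] for regular files in a TAR buffer."""
    entries = []
    offset = 0
    while offset + BLOCK <= len(buf):
        header = buf[offset:offset + BLOCK]
        if header == b"\0" * BLOCK:  # end-of-archive marker
            break
        name = header[0:100].split(b"\0", 1)[0].decode("utf-8")
        size = int(header[124:136].split(b"\0", 1)[0].strip() or b"0", 8)
        typeflag = header[156:157]
        if typeflag in (b"0", b"\0"):  # regular file
            entries.append((name, offset + BLOCK, size))
        # Skip the header plus the payload rounded up to a whole block.
        offset += BLOCK + ((size + BLOCK - 1) // BLOCK) * BLOCK
    return entries


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.tar")
    with tarfile.open(path, "w", format=tarfile.USTAR_FORMAT) as tar:
        for name, payload in [("a.jpg", b"first"), ("b.jpg", b"second!")]:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))

    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        index = index_tar(mm)
        with memoryview(mm) as view:
            # Slicing a memoryview does not copy; bytes() copies here only
            # so the demo can keep the data after the mapping is closed.
            contents = {name: bytes(view[start:start + size])
                        for name, start, size in index}

print(index)
print(contents)
```

Once the index exists, each sample is just an (offset, size) pair into the mapping, so a decoder can read the bytes in place.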

Supported Transforms

As of v0.7.0, TurboLoader ships 19 transforms (the 5 added in v0.7.0 are listed under Performance). The 7 SIMD-accelerated augmentation transforms included in v0.3.x are:

  • RandomHorizontalFlip: SIMD-optimized horizontal flip
  • RandomVerticalFlip: SIMD-optimized vertical flip
  • ColorJitter: Brightness, contrast, saturation adjustments
  • RandomRotation: Bilinear interpolation rotation
  • RandomCrop: Random crop with padding
  • RandomErasing: Cutout augmentation
  • GaussianBlur: Separable Gaussian filter (SIMD)

API Reference

Pipeline

class Pipeline:
    def __init__(
        self,
        tar_paths: List[str],
        num_workers: int = 4,
        queue_size: int = 256,
        shuffle: bool = False,
        decode_jpeg: bool = False,
        enable_simd_transforms: bool = False,
        transform_config: Optional[TransformConfig] = None
    )

    def start(self) -> None
    def stop(self) -> None
    def reset(self) -> None
    def next_batch(self, batch_size: int) -> List[Sample]
    def total_samples(self) -> int

TransformConfig

class TransformConfig:
    enable_resize: bool = False
    resize_width: int = 224
    resize_height: int = 224
    resize_method: ResizeMethod = ResizeMethod.BILINEAR

    enable_normalize: bool = False
    mean: List[float] = [0.0, 0.0, 0.0]
    std: List[float] = [1.0, 1.0, 1.0]

    enable_color_convert: bool = False
    src_color: ColorSpace = ColorSpace.RGB
    dst_color: ColorSpace = ColorSpace.RGB
    output_float: bool = False

Augmentation Transforms

class AugmentationPipeline:
    def __init__(self, seed: Optional[int] = None)
    def add_transform(self, transform: AugmentationTransform) -> None
    def clear(self) -> None
    def num_transforms(self) -> int

class RandomHorizontalFlip(AugmentationTransform):
    def __init__(self, probability: float = 0.5)

class ColorJitter(AugmentationTransform):
    def __init__(
        self,
        brightness: float = 0.0,
        contrast: float = 0.0,
        saturation: float = 0.0,
        hue: float = 0.0
    )

Roadmap

TurboLoader.0 (Q1 2025) - HIGH PRIORITY

Complete pipeline rewrite to fix critical performance issues

See ARCHITECTURE_V2.md for full design.

Core Infrastructure

  • Lock-free SPSC ring buffers (~50x faster than mutex queues)
  • Object pool for buffer reuse (eliminate malloc/free overhead)
  • Zero-copy sample struct using std::span views
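
The single-producer/single-consumer (SPSC) ring buffer named above can be sketched as follows. This Python version only illustrates the index discipline: the producer alone advances the tail and the consumer alone advances the head, so no lock is taken. It is not the planned C++ implementation, which would need atomic indices with acquire/release ordering, and it makes no claim about the speedup quoted above. `SpscRing` is a hypothetical name.

```python
import threading
import time


class SpscRing:
    """Fixed-capacity ring buffer for exactly one producer and one consumer."""

    def __init__(self, capacity):
        # One slot stays empty so that head == tail always means "empty".
        self._slots = [None] * (capacity + 1)
        self._head = 0  # next slot to read; written only by the consumer
        self._tail = 0  # next slot to write; written only by the producer

    def try_push(self, item):
        nxt = (self._tail + 1) % len(self._slots)
        if nxt == self._head:
            return False  # full
        self._slots[self._tail] = item
        self._tail = nxt  # publish only after the slot is filled
        return True

    def try_pop(self):
        if self._head == self._tail:
            return False, None  # empty
        item = self._slots[self._head]
        self._slots[self._head] = None
        self._head = (self._head + 1) % len(self._slots)
        return True, item


ring = SpscRing(capacity=8)
received = []


def producer():
    for i in range(1000):
        while not ring.try_push(i):
            time.sleep(0)  # ring is full: yield and retry


def consumer():
    while len(received) < 1000:
        ok, item = ring.try_pop()
        if ok:
            received.append(item)
        else:
            time.sleep(0)  # ring is empty: yield and retry


threads = [threading.Thread(target=producer),
           threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(received == list(range(1000)))  # True: items arrive in order
```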

I/O Layer

  • Per-worker TAR file handles (eliminate mutex bottleneck)
  • Memory-mapped I/O for true zero-copy reads
  • Worker-based sample partitioning
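
Worker-based partitioning means each worker owns a disjoint subset of sample indices, so workers never contend for the same sample. The source does not say which scheme is planned; the sketch below assumes round-robin assignment, and `partition` is a hypothetical helper.

```python
def partition(num_samples, num_workers):
    """Round-robin: worker w gets sample indices w, w + N, w + 2N, ..."""
    return [range(w, num_samples, num_workers) for w in range(num_workers)]


# Same shape as the benchmark configuration: 2000 samples, 8 workers.
parts = partition(2000, 8)
print([len(p) for p in parts])  # 250 samples per worker
print(list(parts[3][:4]))       # [3, 11, 19, 27]
```

Contiguous ranges per worker are an equally valid choice and may read a TAR archive more sequentially; round-robin keeps the load balanced when sample sizes vary along the archive.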

Decoding & Performance

  • TurboJPEG SIMD decoder integration (2-3x faster)
  • Object pool for decoded buffers
  • Fallback to libjpeg for compatibility

Testing & Validation

  • Comprehensive unit tests (all components)
  • Performance benchmarks vs PyTorch (target: >100 img/s)
  • Memory leak checks (valgrind)
  • Thread safety verification (ThreadSanitizer)

Expected Performance: 150-200 img/s (3-4x faster than PyTorch baseline)

Estimated Timeline: 11-17 hours of development

Branch: TurboLoader-rewrite


v1.0.0 (Q4 2025)

  • Production-ready API stability
  • Comprehensive documentation
  • Full test coverage
  • Performance optimization

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

Development Setup

# Clone repository
git clone https://github.com/ALJainProjects/TurboLoader.git
cd TurboLoader

# Install dependencies
brew install cmake libjpeg-turbo libpng libwebp  # macOS
# or
apt-get install cmake libjpeg-turbo8-dev libpng-dev libwebp-dev  # Ubuntu

# Build from source
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j8

# Run tests
./tests/turboloader_tests
./tests/test_simd_transforms

License

MIT License - see LICENSE for details.


Citation

If you use TurboLoader in your research, please cite:

@software{turboloader2025,
  author = {Jain, Arnav},
  title = {TurboLoader: High-Performance ML Data Loading},
  year = {2025},
  url = {https://github.com/ALJainProjects/TurboLoader}
}
