Skip to main content

High-performance data loading for machine learning with 30x speedup over PyTorch DataLoader

Project description

TurboLoader

High-performance ML data loading library in C++20

Significantly faster than PyTorch DataLoader

PyPI version C++20 License: MIT


Overview

TurboLoader is a high-performance data loading library designed to accelerate ML training by replacing Python's slow multiprocessing-based data loaders with efficient C++ native threads and lock-free data structures.

Key Features:

  • 🚀 High-performance data loading with C++ native implementation
  • SIMD transforms with AVX2/AVX-512/NEON for fast preprocessing
  • 🔒 Lock-free concurrent queues for zero-contention data passing
  • 🧵 Native C++ threads (no Python GIL, no process spawning overhead)
  • 💾 Zero-copy memory-mapped I/O for efficient file reading
  • 📦 WebDataset TAR format support for sharded datasets
  • 🎯 Thread-local JPEG/PNG/WebP decoders (SIMD optimized)
  • 🎨 7 SIMD-accelerated augmentation transforms
  • 🐍 PyTorch-compatible API with minimal code changes

Performance

TurboLoader provides significant performance improvements over PyTorch DataLoader through:

  • Lock-free queues eliminate synchronization overhead
  • SIMD-optimized transforms (AVX2/AVX-512/NEON) accelerate preprocessing
  • Native C++ threads avoid Python GIL and multiprocessing overhead
  • Memory-mapped I/O enables zero-copy file reading
  • Thread-local decoders eliminate allocation overhead

Benchmark Results

Performance benchmarks on Apple M1 Pro (8 cores, 16GB RAM):

Test TurboLoader PyTorch DataLoader Improvement
SIMD Resize (6718 img/s) 148.85 μs - Baseline
SIMD Normalize (47438 img/s) 21.08 μs - Baseline

Test Configuration:

  • Dataset: 1000 JPEG images (256x256)
  • Operations: TAR extraction → JPEG decode → resize → normalize
  • Workers: 8 threads/processes
  • Batch size: 256

Note: Benchmarks are measured on synthetic datasets. Full ImageNet comparison suite in development.

See CHANGELOG.md for version history and test results.

Installation

pip install turboloader

Quick Start

Basic Usage

import turboloader

# Configure the data loader
config = turboloader.Config(
    num_workers=8,
    batch_size=256,
    shuffle=True,
    decode_jpeg=True
)

# Create pipeline
pipeline = turboloader.Pipeline(['imagenet.tar'], config)
pipeline.start()

# Get batches
batch = pipeline.next_batch(256)
for sample in batch:
    img_data = sample.data['jpg']  # Raw JPEG bytes or decoded image
    # Process your data...

pipeline.stop()

With SIMD Transforms

import turboloader

# Configure SIMD-accelerated transforms
config = turboloader.Config(num_workers=8, batch_size=256)
config.enable_simd_transforms = True

transform_config = turboloader.TransformConfig()
transform_config.target_width = 224
transform_config.target_height = 224
transform_config.enable_normalize = True
transform_config.mean = [0.485, 0.456, 0.406]
transform_config.std = [0.229, 0.224, 0.225]

config.transform_config = transform_config

# Create pipeline
pipeline = turboloader.Pipeline(['imagenet.tar'], config)
pipeline.start()

batch = pipeline.next_batch(256)
for sample in batch:
    # Get pre-transformed data (already resized + normalized)
    transformed = sample.transformed_data  # Ready for model!

pipeline.stop()

See examples/ for complete working examples including PyTorch integration.


Architecture

[TAR Files] → [Reader Thread] → [Lock-Free Queue] → [Worker Threads] → [Output Queue] → [User]
                                                            ↓
                                                     [JPEG Decoder]
                                                     (thread-local)

Key Design Decisions:

  1. Lock-Free SPMC Queue:

    • Cache-line aligned slots prevent false sharing
    • Atomic operations for wait-free enqueue/dequeue
    • No mutex contention
  2. Native Threading:

    • C++ threads avoid Python GIL
    • No process spawning overhead
    • Shared memory (no serialization)
  3. Thread-Local Decoders:

    • Each worker has its own JPEG decoder
    • No allocation overhead per image
    • SIMD optimizations from libjpeg-turbo
  4. Memory-Mapped I/O:

    • Zero-copy file reading
    • OS handles page management
    • Prefetch hints for sequential access

Building from Source

Requirements

  • CMake 3.20+
  • C++20 compiler (GCC 11+, Clang 14+, or Apple Clang 14+)
  • libjpeg-turbo
  • Python 3.8+ (for Python bindings)
  • pybind11

Build Instructions

mkdir build && cd build
cmake ..
make -j

Run Tests

./tests/turboloader_tests

Project Status

Current Version: 0.3.1 (Latest Release)

Completed Features (v0.3.x)

  • ✅ Lock-free SPMC queue with cache-line alignment
  • ✅ Thread pool with work stealing
  • ✅ Zero-copy mmap file reader
  • ✅ TAR parser for WebDataset format
  • ✅ Multi-threaded pipeline
  • ✅ JPEG/PNG/WebP decoders (libjpeg-turbo, libpng, libwebp)
  • ✅ Thread-local decoders
  • ✅ Python bindings (pybind11)
  • ✅ SIMD transforms (AVX2/AVX-512/NEON)
  • ✅ Vectorized resize and normalization
  • ✅ 7 SIMD-accelerated augmentation transforms
  • ✅ WebDataset iterator API
  • ✅ PyPI package distribution
  • ✅ Comprehensive test suite (45 tests passing)

Roadmap

v0.4.0 (Planned)

  • TensorFlow/JAX bindings
  • Cloud storage support (S3, GCS)
  • Distributed training support (NCCL, Gloo)

v1.0.0 (Future)

  • Stable API
  • Production-ready with full benchmark suite
  • GPU-accelerated decoding (nvJPEG)

Documentation

Contributing

Contributions welcome! See CONTRIBUTING.md for guidelines.

Priority areas:

  • Additional image formats (PNG, WebP)
  • Augmentation operations
  • Cloud storage backends (S3, GCS)
  • Performance optimizations

License

MIT License (see LICENSE file)


Acknowledgments

  • libjpeg-turbo for SIMD-optimized JPEG decoding
  • WebDataset format for inspiration on TAR-based datasets
  • PyTorch community for establishing data loading standards
  • pybind11 for excellent Python bindings

Built by Arnav Jain | GitHub

Project details


Release history Release notifications | RSS feed

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

turboloader-0.3.2.tar.gz (130.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

turboloader-0.3.2-cp313-cp313-macosx_15_0_arm64.whl (193.9 kB view details)

Uploaded CPython 3.13macOS 15.0+ ARM64

File details

Details for the file turboloader-0.3.2.tar.gz.

File metadata

  • Download URL: turboloader-0.3.2.tar.gz
  • Upload date:
  • Size: 130.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for turboloader-0.3.2.tar.gz
Algorithm Hash digest
SHA256 9541a6cda39e3de2463a075868daa37753f0bd9df55752e13481ad01517e106b
MD5 49effad261b85b7e08cd1abd0e815b8c
BLAKE2b-256 6b3b65803f7e358e89c8506567a9c4656f6efe8abd6a103f6c3cebf917cc2efd

See more details on using hashes here.

File details

Details for the file turboloader-0.3.2-cp313-cp313-macosx_15_0_arm64.whl.

File metadata

File hashes

Hashes for turboloader-0.3.2-cp313-cp313-macosx_15_0_arm64.whl
Algorithm Hash digest
SHA256 6afc63244ea6fca9fbab0456aa085539b659e8f9f4e8299c36b181d33a5b68db
MD5 e2cec09b069b256714f2741e080b0e02
BLAKE2b-256 2a2d768f0b434742063a6fba5241dd7650727c3262883f73e7823278760c8154

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page