
High-performance data loading for machine learning with 30x speedup over PyTorch DataLoader


TurboLoader

High-performance ML data loading library in C++20

30-35x faster than PyTorch DataLoader in preliminary benchmarks on ImageNet-style data



Overview

TurboLoader is a high-performance data loading library designed to accelerate ML training by replacing Python's slow multiprocessing-based data loaders with efficient C++ native threads and lock-free data structures.

Key Features:

  • 🚀 30-35x speedup over PyTorch DataLoader on ImageNet-style workloads (preliminary; see Benchmark Methodology)
  • SIMD transforms with AVX2/AVX-512/NEON for fast preprocessing
  • 🔒 Lock-free concurrent queues for low-contention data passing
  • 🧵 Native C++ threads (no Python GIL, no process spawning overhead)
  • 💾 Zero-copy memory-mapped I/O for efficient file reading
  • 📦 WebDataset TAR format support for sharded datasets
  • 🎯 Thread-local JPEG decoders using libjpeg-turbo (SIMD optimized)
  • 🐍 Drop-in replacement for PyTorch DataLoader with minimal code changes

Performance

In preliminary benchmarks, TurboLoader achieves a 30-35x speedup over PyTorch DataLoader on ImageNet-style workloads. The main contributors are:

  • Lock-free queues, which avoid mutex-based synchronization overhead
  • SIMD-optimized transforms (AVX2/AVX-512/NEON), which accelerate preprocessing
  • Native C++ threads, which avoid the Python GIL and multiprocessing overhead
  • Memory-mapped I/O, which enables zero-copy file reading
  • Thread-local decoders, which avoid allocating a new decoder for every image
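
The thread-local decoder idea can be illustrated in pure Python. This is only a sketch of the pattern, not TurboLoader's C++ implementation: `FakeDecoder`, `get_decoder`, and `decode_all` are hypothetical names, and the "decode" step is a placeholder checksum rather than real JPEG decoding. Each worker thread lazily creates one decoder with its own scratch buffer and reuses it for every item it handles.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeDecoder:
    """Hypothetical stand-in for a JPEG decoder with a reusable scratch buffer."""

    instances = 0
    _count_lock = threading.Lock()

    def __init__(self):
        with FakeDecoder._count_lock:
            FakeDecoder.instances += 1
        self.scratch = bytearray(1024)  # allocated once per thread, then reused

    def decode(self, payload: bytes) -> int:
        # Placeholder work: copy into the scratch buffer and checksum it.
        n = min(len(payload), len(self.scratch))
        self.scratch[:n] = payload[:n]
        return sum(self.scratch[:n])


_tls = threading.local()


def get_decoder() -> FakeDecoder:
    """Return this thread's decoder, creating it on first use."""
    decoder = getattr(_tls, "decoder", None)
    if decoder is None:
        decoder = _tls.decoder = FakeDecoder()
    return decoder


def decode_all(payloads, num_workers=4):
    """Decode every payload on a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(lambda p: get_decoder().decode(p), payloads))
```

However many payloads are processed, at most `num_workers` decoders are ever constructed, which is the property the bullet above relies on.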

Benchmark Methodology

Performance was measured on ImageNet-style TAR files with the following setup:

  • Hardware: Apple M1 Pro (8 cores), 16 GB RAM
  • Dataset: 1000 JPEG images, 256x256 resolution
  • Operations: TAR extraction → JPEG decode → resize → normalize
  • Comparison: PyTorch DataLoader with same operations

Note: The full benchmark suite is in progress. Current results are preliminary and based on synthetic datasets; real-world ImageNet benchmarks are planned.
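
For readers who want to reproduce a comparison on their own hardware, a throughput measurement along these lines may help. It is a minimal sketch, not the project's benchmark script: `measure_throughput` and `speedup` are hypothetical helpers, and `next_batch` is any zero-argument callable that produces one batch (for example a wrapper around a TurboLoader pipeline or a PyTorch DataLoader iterator).

```python
import time


def measure_throughput(next_batch, num_batches, batch_size, warmup=2):
    """Return samples per second for a callable that yields one batch per call."""
    # Warm-up calls are excluded so thread start-up and cache effects
    # do not distort the timed region.
    for _ in range(warmup):
        next_batch()

    start = time.perf_counter()
    for _ in range(num_batches):
        next_batch()
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero reading

    return (num_batches * batch_size) / elapsed


def speedup(candidate_rate, baseline_rate):
    """Ratio of two throughputs measured with the same operations and data."""
    return candidate_rate / baseline_rate
```

A speedup figure is only meaningful when both loaders perform the same operations (TAR extraction, decode, resize, normalize) on the same data.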

See ARCHITECTURE.md for implementation details and examples/ for usage patterns.

Installation

pip install turboloader

Quick Start

Basic Usage

import turboloader

# Configure the data loader
config = turboloader.Config(
    num_workers=8,
    batch_size=256,
    shuffle=True,
    decode_jpeg=True
)

# Create pipeline
pipeline = turboloader.Pipeline(['imagenet.tar'], config)
pipeline.start()

# Get batches
batch = pipeline.next_batch(256)
for sample in batch:
    img_data = sample.data['jpg']  # Raw JPEG bytes or decoded image
    # Process your data...

pipeline.stop()
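
The `sample.data['jpg']` lookup above follows the WebDataset convention, in which consecutive TAR members that share a basename form one sample and each file extension becomes a key. The sketch below shows that grouping with the standard library only. It assumes TurboLoader's parser behaves the same way, which this README does not spell out; `iter_samples` and `make_tar` are hypothetical helpers.

```python
import io
import tarfile


def iter_samples(tar_bytes: bytes):
    """Yield dicts such as {'__key__': '000', 'jpg': b'...', 'cls': b'...'}."""
    current_key, sample = None, {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Split "dir/000.jpg" into key "dir/000" and extension "jpg".
            dirname, _, filename = member.name.rpartition("/")
            stem, _, ext = filename.partition(".")
            key = f"{dirname}/{stem}" if dirname else stem
            if key != current_key:
                if sample:
                    yield sample
                current_key, sample = key, {"__key__": key}
            sample[ext] = tar.extractfile(member).read()
    if sample:
        yield sample


def make_tar(files):
    """Build an in-memory TAR archive from (name, payload) pairs."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in files:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()
```

Because grouping relies on members being adjacent in the archive, files belonging to one sample must be written next to each other when a shard is created.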

With SIMD Transforms

import turboloader

# Configure SIMD-accelerated transforms
config = turboloader.Config(num_workers=8, batch_size=256)
config.enable_simd_transforms = True

transform_config = turboloader.TransformConfig()
transform_config.target_width = 224
transform_config.target_height = 224
transform_config.enable_normalize = True
transform_config.mean = [0.485, 0.456, 0.406]
transform_config.std = [0.229, 0.224, 0.225]

config.transform_config = transform_config

# Create pipeline
pipeline = turboloader.Pipeline(['imagenet.tar'], config)
pipeline.start()

batch = pipeline.next_batch(256)
for sample in batch:
    # Get pre-transformed data (already resized + normalized)
    transformed = sample.transformed_data  # Ready for model!

pipeline.stop()

See examples/ for complete working examples including PyTorch integration.
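
To make the `mean` and `std` settings concrete, here is the per-channel arithmetic that normalization conventionally performs, written in plain Python. It assumes 8-bit pixel values are first scaled to [0, 1], as in the common torchvision convention; whether TurboLoader's SIMD path uses exactly this scaling is an assumption. `normalize_pixel` and `normalize_image` are hypothetical helper names.

```python
MEAN = (0.485, 0.456, 0.406)  # per-channel means from the example above
STD = (0.229, 0.224, 0.225)   # per-channel standard deviations


def normalize_pixel(rgb, mean=MEAN, std=STD):
    """Map one 8-bit RGB pixel to floats using (value / 255 - mean) / std."""
    return tuple((v / 255.0 - m) / s for v, m, s in zip(rgb, mean, std))


def normalize_image(pixels, mean=MEAN, std=STD):
    """Normalize a flat list of RGB pixels."""
    return [normalize_pixel(p, mean, std) for p in pixels]
```

A SIMD implementation applies the same formula to many pixels per instruction; the arithmetic per value is unchanged.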


Architecture

[TAR Files] → [Reader Thread] → [Lock-Free Queue] → [Worker Threads] → [Output Queue] → [User]
                                                            ↓
                                                     [JPEG Decoder]
                                                     (thread-local)
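
The data flow in the diagram can be modelled in a few lines of Python. This sketch only mirrors the topology (one reader, a bounded work queue, several workers, an output queue); it uses `queue.Queue`, which is lock-based, in place of TurboLoader's lock-free C++ queue, and `run_pipeline` and its `decode` argument are hypothetical names.

```python
import queue
import threading

_STOP = object()  # sentinel marking the end of the stream


def run_pipeline(items, decode, num_workers=4, capacity=64):
    """Push items through reader -> work queue -> workers -> output queue."""
    work_q = queue.Queue(maxsize=capacity)  # bounded: applies back-pressure to the reader
    out_q = queue.Queue()

    def reader():
        for item in items:
            work_q.put(item)
        for _ in range(num_workers):  # one stop marker per worker
            work_q.put(_STOP)

    def worker():
        while True:
            item = work_q.get()
            if item is _STOP:
                out_q.put(_STOP)
                return
            out_q.put(decode(item))

    threads = [threading.Thread(target=reader)]
    threads += [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()

    # The caller plays the role of [User]: drain until every worker has finished.
    results, finished = [], 0
    while finished < num_workers:
        item = out_q.get()
        if item is _STOP:
            finished += 1
        else:
            results.append(item)

    for t in threads:
        t.join()
    return results
```

Results arrive in completion order rather than input order, which is typical for a multi-worker pipeline of this shape.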

Key Design Decisions:

  1. Lock-Free SPMC Queue:

    • Cache-line aligned slots prevent false sharing
    • Atomic operations for lock-free enqueue/dequeue
    • No mutex contention
  2. Native Threading:

    • C++ threads avoid Python GIL
    • No process spawning overhead
    • Shared memory (no serialization)
  3. Thread-Local Decoders:

    • Each worker has its own JPEG decoder
    • No allocation overhead per image
    • SIMD optimizations from libjpeg-turbo
  4. Memory-Mapped I/O:

    • Zero-copy file reading
    • OS handles page management
    • Prefetch hints for sequential access
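
As a rough illustration of the memory-mapped I/O point, the Python sketch below maps a file read-only, gives the kernel a sequential-access hint where the platform supports one, and reads byte ranges through a `memoryview` without first copying the file into a Python buffer. It is an analogy to the C++ reader, not its implementation; `checksum_ranges` is a hypothetical helper.

```python
import mmap


def checksum_ranges(path, ranges):
    """Sum the bytes in each (offset, length) range of a memory-mapped file."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Optional prefetch hint; madvise is not available on every platform.
        if hasattr(mm, "madvise") and hasattr(mmap, "MADV_SEQUENTIAL"):
            mm.madvise(mmap.MADV_SEQUENTIAL)

        # Slicing a memoryview yields a view onto the mapped pages, not a copy.
        with memoryview(mm) as view:
            return [sum(view[start:start + length]) for start, length in ranges]
```

The operating system pages data in on demand, so only the ranges that are actually touched are read from disk.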

Building from Source

Requirements

  • CMake 3.20+
  • C++20 compiler (GCC 11+, Clang 14+, or Apple Clang 14+)
  • libjpeg-turbo
  • Python 3.8+ (for Python bindings)
  • pybind11

Build Instructions

mkdir build && cd build
cmake ..
make -j

Run Tests

./tests/turboloader_tests

Project Status

Current Version: 0.2.0 (Initial PyPI Release)

Completed Features

  • ✅ Lock-free SPMC queue with cache-line alignment
  • ✅ Thread pool with work stealing
  • ✅ Zero-copy mmap file reader
  • ✅ TAR parser for WebDataset format
  • ✅ Multi-threaded pipeline
  • ✅ libjpeg-turbo integration
  • ✅ Thread-local decoders
  • ✅ Python bindings (pybind11)
  • ✅ SIMD transforms (AVX2/AVX-512/NEON)
  • ✅ Vectorized resize and normalization
  • ✅ PyPI package distribution

Roadmap

v0.3.0 (Planned)

  • WebDataset iterator API
  • Additional image formats (PNG, WebP)
  • Augmentation operations (rotation, color jitter)

v0.4.0 (Planned)

  • TensorFlow/JAX bindings
  • Cloud storage support (S3, GCS)
  • Distributed training support

v1.0.0 (Future)

  • Stable API
  • Production-ready
  • Comprehensive benchmark suite

Documentation

Contributing

Contributions welcome! See CONTRIBUTING.md for guidelines.

Priority areas:

  • Additional image formats (PNG, WebP)
  • Augmentation operations
  • Cloud storage backends (S3, GCS)
  • Performance optimizations

License

MIT License (see LICENSE file)


Acknowledgments

  • libjpeg-turbo for SIMD-optimized JPEG decoding
  • WebDataset format for inspiration on TAR-based datasets
  • PyTorch community for establishing data loading standards
  • pybind11 for excellent Python bindings

Built by Arnav Jain | GitHub


