Skip to main content

High-performance data loading for machine learning with 30x speedup over PyTorch DataLoader

Project description

TurboLoader

High-performance ML data loading library in C++20

Significantly faster than PyTorch DataLoader

PyPI version C++20 License: MIT


Overview

TurboLoader is a high-performance data loading library designed to accelerate ML training by replacing Python's slow multiprocessing-based data loaders with efficient C++ native threads and lock-free data structures.

Key Features:

  • 🚀 High-performance data loading with C++ native implementation
  • SIMD transforms with AVX2/AVX-512/NEON for fast preprocessing
  • 🔒 Lock-free concurrent queues for zero-contention data passing
  • 🧵 Native C++ threads (no Python GIL, no process spawning overhead)
  • 💾 Zero-copy memory-mapped I/O for efficient file reading
  • 📦 WebDataset TAR format support for sharded datasets
  • 🎯 Thread-local JPEG/PNG/WebP decoders (SIMD optimized)
  • 🎨 7 SIMD-accelerated augmentation transforms
  • 🐍 PyTorch-compatible API with minimal code changes

Performance

TurboLoader provides significant performance improvements over PyTorch DataLoader through:

  • Lock-free queues eliminate synchronization overhead
  • SIMD-optimized transforms (AVX2/AVX-512/NEON) accelerate preprocessing
  • Native C++ threads avoid Python GIL and multiprocessing overhead
  • Memory-mapped I/O enables zero-copy file reading
  • Thread-local decoders eliminate allocation overhead

Benchmark Results

Performance benchmarks on Apple M1 Pro (8 cores, 16GB RAM):

Test TurboLoader PyTorch DataLoader Improvement
SIMD Resize (6718 img/s) 148.85 μs - Baseline
SIMD Normalize (47438 img/s) 21.08 μs - Baseline

Test Configuration:

  • Dataset: 1000 JPEG images (256x256)
  • Operations: TAR extraction → JPEG decode → resize → normalize
  • Workers: 8 threads/processes
  • Batch size: 256

Note: Benchmarks are measured on synthetic datasets. Full ImageNet comparison suite in development.

See CHANGELOG.md for version history and test results.

Installation

pip install turboloader

Quick Start

Basic Usage

import turboloader

# Configure the data loader
config = turboloader.Config(
    num_workers=8,
    batch_size=256,
    shuffle=True,
    decode_jpeg=True
)

# Create pipeline
pipeline = turboloader.Pipeline(['imagenet.tar'], config)
pipeline.start()

# Get batches
batch = pipeline.next_batch(256)
for sample in batch:
    img_data = sample.data['jpg']  # Raw JPEG bytes or decoded image
    # Process your data...

pipeline.stop()

With SIMD Transforms

import turboloader

# Configure SIMD-accelerated transforms
config = turboloader.Config(num_workers=8, batch_size=256)
config.enable_simd_transforms = True

transform_config = turboloader.TransformConfig()
transform_config.target_width = 224
transform_config.target_height = 224
transform_config.enable_normalize = True
transform_config.mean = [0.485, 0.456, 0.406]
transform_config.std = [0.229, 0.224, 0.225]

config.transform_config = transform_config

# Create pipeline
pipeline = turboloader.Pipeline(['imagenet.tar'], config)
pipeline.start()

batch = pipeline.next_batch(256)
for sample in batch:
    # Get pre-transformed data (already resized + normalized)
    transformed = sample.transformed_data  # Ready for model!

pipeline.stop()

See examples/ for complete working examples including PyTorch integration.


Architecture

[TAR Files] → [Reader Thread] → [Lock-Free Queue] → [Worker Threads] → [Output Queue] → [User]
                                                            ↓
                                                     [JPEG Decoder]
                                                     (thread-local)

Key Design Decisions:

  1. Lock-Free SPMC Queue:

    • Cache-line aligned slots prevent false sharing
    • Atomic operations for wait-free enqueue/dequeue
    • No mutex contention
  2. Native Threading:

    • C++ threads avoid Python GIL
    • No process spawning overhead
    • Shared memory (no serialization)
  3. Thread-Local Decoders:

    • Each worker has its own JPEG decoder
    • No allocation overhead per image
    • SIMD optimizations from libjpeg-turbo
  4. Memory-Mapped I/O:

    • Zero-copy file reading
    • OS handles page management
    • Prefetch hints for sequential access

Building from Source

Requirements

  • CMake 3.20+
  • C++20 compiler (GCC 11+, Clang 14+, or Apple Clang 14+)
  • libjpeg-turbo
  • Python 3.8+ (for Python bindings)
  • pybind11

Build Instructions

mkdir build && cd build
cmake ..
make -j

Run Tests

./tests/turboloader_tests

Project Status

Current Version: 0.3.1 (Latest Release)

Completed Features (v0.3.x)

  • ✅ Lock-free SPMC queue with cache-line alignment
  • ✅ Thread pool with work stealing
  • ✅ Zero-copy mmap file reader
  • ✅ TAR parser for WebDataset format
  • ✅ Multi-threaded pipeline
  • ✅ JPEG/PNG/WebP decoders (libjpeg-turbo, libpng, libwebp)
  • ✅ Thread-local decoders
  • ✅ Python bindings (pybind11)
  • ✅ SIMD transforms (AVX2/AVX-512/NEON)
  • ✅ Vectorized resize and normalization
  • ✅ 7 SIMD-accelerated augmentation transforms
  • ✅ WebDataset iterator API
  • ✅ PyPI package distribution
  • ✅ Comprehensive test suite (45 tests passing)

Roadmap

v0.4.0 (Planned)

  • TensorFlow/JAX bindings
  • Cloud storage support (S3, GCS)
  • Distributed training support (NCCL, Gloo)

v1.0.0 (Future)

  • Stable API
  • Production-ready with full benchmark suite
  • GPU-accelerated decoding (nvJPEG)

Documentation

Contributing

Contributions welcome! See CONTRIBUTING.md for guidelines.

Priority areas:

  • Additional image formats (PNG, WebP)
  • Augmentation operations
  • Cloud storage backends (S3, GCS)
  • Performance optimizations

License

MIT License (see LICENSE file)


Acknowledgments

  • libjpeg-turbo for SIMD-optimized JPEG decoding
  • WebDataset format for inspiration on TAR-based datasets
  • PyTorch community for establishing data loading standards
  • pybind11 for excellent Python bindings

Built by Arnav Jain | GitHub

Project details


Release history Release notifications | RSS feed

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

turboloader-0.3.3.tar.gz (130.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

turboloader-0.3.3-cp313-cp313-macosx_15_0_arm64.whl (193.9 kB view details)

Uploaded CPython 3.13macOS 15.0+ ARM64

File details

Details for the file turboloader-0.3.3.tar.gz.

File metadata

  • Download URL: turboloader-0.3.3.tar.gz
  • Upload date:
  • Size: 130.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for turboloader-0.3.3.tar.gz
Algorithm Hash digest
SHA256 1ed8ae693bb1c68cdd1a8d9478b7f200a0130161c5ac8724f920953365d8983a
MD5 c99f694ae6efd8c8309881bf2cb9fa61
BLAKE2b-256 26fc1493e42ccaa16fc5d234109919006b6c41096aa5315d25f61a2b9ad2d6a6

See more details on using hashes here.

File details

Details for the file turboloader-0.3.3-cp313-cp313-macosx_15_0_arm64.whl.

File metadata

File hashes

Hashes for turboloader-0.3.3-cp313-cp313-macosx_15_0_arm64.whl
Algorithm Hash digest
SHA256 805a9d2f54773c362e3b445844473ec42ad73774e3661be3e2042c84ce7f2ddf
MD5 ed415b44c674f618d59b7fb1094ec490
BLAKE2b-256 fc5f229296d1dff8e33b58cf6675a6f4b886d4530b8580aba6cc0fdd5358e7df

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page