High-performance ML data loading library (10,146 img/s, 12x faster than PyTorch). Features: TBL v2 format with LZ4 compression (45-65% smaller than TAR), 19 SIMD-accelerated transforms (AVX2/NEON), AutoAugment policies, PyTorch/TensorFlow tensor conversion, lock-free concurrent queues, memory-mapped I/O, WebDataset TAR format, professional documentation, and interactive benchmark web app. C++20 implementation with Python bindings.

TurboLoader

High-Performance ML Data Loading Library with 19 SIMD-Accelerated Transforms


Overview

TurboLoader is a high-performance data loading library that reaches up to 21,036 images/second with 16 workers and is about 12x faster than the PyTorch DataLoader in the CPU comparison below. It gets there through a native C++ implementation, SIMD-accelerated transforms, and lock-free concurrent queues.

Key Features

  • 12x Faster than PyTorch DataLoader (optimized)
  • TBL v2 Binary Format - LZ4 compression (40-60% smaller than TAR), O(1) memory streaming writer, 4,875 img/s conversion (new in v1.5.0)
  • GPU-Accelerated JPEG Decoding - NVIDIA nvJPEG support for 10x faster decoding when CUDA is available (new in v1.2.1)
  • Linux io_uring Async I/O - 2-3x faster disk throughput on NVMe SSDs, Linux kernel 5.1+ (new in v1.2.1)
  • Smart Batching - Reduces padding by 15-25%, ~1.2x throughput boost (new in v1.2.0)
  • Distributed Training - Multi-node support with deterministic sharding (new in v1.2.0)
  • 19 SIMD-Accelerated Transforms (AVX2/AVX-512/NEON)
  • Data Integrity Validation - CRC32/CRC16 checksums for reliable data loading (new in v1.5.0)
  • Cached Image Dimensions - Fast filtered loading without decoding (new in v1.5.0)
  • Prefetching Pipeline (overlaps I/O with computation)
  • Zero-Copy Tensor Conversion (PyTorch/TensorFlow)
  • Lock-Free Concurrent Queues (50x faster than mutex-based)
  • Memory-Mapped I/O (52+ Gbps TAR parsing)
  • AutoAugment Policies (ImageNet, CIFAR10, SVHN)
  • Thread-Safe Architecture (no Python GIL)
  • Professional Documentation (Read the Docs)

Performance

What's New in v1.5.0

  • TBL v2 Binary Format: Next-generation custom format with LZ4 compression (40-60% space savings vs TAR), streaming O(1) memory writer, CRC32/CRC16 checksums, cached image dimensions (width/height in index), rich metadata support (JSON/Protobuf/MessagePack)
  • High-Speed Conversion: 4,875 img/s TAR→TBL conversion throughput with parallel processing
  • Cache-Optimized: 64-byte aligned headers, 24-byte index entries for maximum CPU cache efficiency
  • Data Integrity: Per-sample CRC32 checksums for compressed data, CRC16 for index validation
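
The index-plus-checksum idea above can be sketched in a few lines of standard-library Python. This is only an illustration: the field layout below is hypothetical (chosen so that it totals 24 bytes) and is not the real TBL v2 layout, and `zlib` stands in for LZ4, which is not in the standard library.

```python
import struct
import zlib

# Hypothetical 24-byte index entry (NOT the real TBL v2 layout):
# offset (u64), compressed size (u32), raw size (u32),
# width (u16), height (u16), CRC32 of the compressed payload (u32).
INDEX_ENTRY = struct.Struct("<QIIHHI")
assert INDEX_ENTRY.size == 24


def pack_sample(raw, offset, width, height):
    """Compress one sample and build its index entry."""
    payload = zlib.compress(raw)  # stand-in for LZ4
    entry = INDEX_ENTRY.pack(
        offset, len(payload), len(raw), width, height, zlib.crc32(payload)
    )
    return entry, payload


def unpack_sample(entry, payload):
    """Validate the checksum first, then decompress."""
    _offset, comp_size, raw_size, _w, _h, crc = INDEX_ENTRY.unpack(entry)
    if len(payload) != comp_size or zlib.crc32(payload) != crc:
        raise ValueError("checksum mismatch: sample is corrupt")
    raw = zlib.decompress(payload)
    if len(raw) != raw_size:
        raise ValueError("size mismatch after decompression")
    return raw


def read_dimensions(entry):
    """Read the cached width/height without touching the payload."""
    _offset, _comp, _raw, width, height, _crc = INDEX_ENTRY.unpack(entry)
    return width, height
```

Because the dimensions live in the fixed-size entry, a filter such as "only images at least 224 pixels wide" can scan the index alone and skip decompression and decoding entirely.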

Previous Releases

v1.2.0:

  • Smart Batching: Size-based sample grouping reduces padding overhead by 15-25%, delivering ~1.2x throughput improvement
  • Distributed Training: Multi-node data loading with deterministic sharding, compatible with PyTorch DDP, Horovod, and DeepSpeed
  • Scalability: Scales from 2,180 img/s (1 worker) to 21,036 img/s (16 workers), a 9.65x speedup (about 60% parallel efficiency)
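
To see why size-based grouping reduces padding, here is a minimal sketch. It assumes every sample in a batch is padded to the largest sample in that batch, and it uses a plain sort as the simplest possible grouping; the function names and the sample sizes are made up for illustration and say nothing about how TurboLoader's Smart Batching is actually implemented.

```python
def padding_waste(sizes, batch_size):
    """Total padding when each sample is padded to its batch's largest size."""
    waste = 0
    for start in range(0, len(sizes), batch_size):
        batch = sizes[start:start + batch_size]
        waste += sum(max(batch) - s for s in batch)
    return waste


def size_grouped(sizes):
    """Put similar sizes next to each other (sorting is the simplest bucketing)."""
    return sorted(sizes)


# Made-up sample sizes in arrival order.
sizes = [224, 512, 256, 480, 230, 500, 240, 520]
naive = padding_waste(sizes, batch_size=4)
grouped = padding_waste(size_grouped(sizes), batch_size=4)
```

A practical version would usually shuffle within size buckets rather than sort globally, so that batches stay random from epoch to epoch.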

v1.1.0:

  • AVX-512 SIMD Support: 2x vector width on compatible hardware (Intel Skylake-X+, AMD Zen 4+)
  • Prefetching Pipeline: Overlaps I/O with computation for reduced epoch time
  • TBL v1 Binary Format: 12.4% smaller files, 100,000 samples/s conversion, instant random access
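
The prefetching idea (load the next items while the current one is being consumed) can be sketched with a background thread and a bounded queue. This is a generic standard-library illustration, not TurboLoader's C++ pipeline; `prefetch` and `depth` are names chosen here.

```python
import queue
import threading


def prefetch(iterable, depth=2):
    """Yield items from `iterable`, reading up to `depth` items ahead in a thread."""
    buf = queue.Queue(maxsize=depth)  # bounded, so the producer cannot run away
    done = object()  # sentinel marking the end of the stream

    def producer():
        try:
            for item in iterable:
                buf.put(item)  # blocks while the buffer is full
        finally:
            buf.put(done)  # always unblock the consumer, even on error

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()
        if item is done:
            return
        yield item
```

Wrapping any slow iterator, for example `for batch in prefetch(slow_reader()):`, lets I/O for the next batch overlap with work on the current one. The sketch does not propagate producer exceptions or handle a consumer that stops early; a production version would need both.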

Framework Comparison (v1.0.0)

Framework         | Throughput   | vs TurboLoader | Speedup | Memory
TurboLoader       | 11,780 img/s | 1.00x          | 305x    | Low
PyTorch Optimized | 39 img/s     | 0.003x         | -       | Standard

Test Config: Apple M4 Max, 1000 images, 4 workers, batch_size=32, 5 epochs

See Benchmark Results for detailed analysis.
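
For readers who want to reproduce an images-per-second figure on their own data, a generic timing harness might look like the sketch below. It is not the project's benchmark script; it only assumes that the loader is an iterable of batches and that each batch supports `len()`. The name `measure_throughput` and its parameters are chosen here for illustration.

```python
import time


def measure_throughput(batches, warmup_batches=0):
    """Return (images_seen, images_per_second) over an iterable of batches."""
    it = iter(batches)
    for _ in range(warmup_batches):
        next(it, None)  # discard warm-up batches (thread start-up, cold caches)
    count = 0
    start = time.perf_counter()
    for batch in it:
        count += len(batch)
    elapsed = time.perf_counter() - start
    rate = count / elapsed if elapsed > 0 else float("inf")
    return count, rate
```

Usage would be along the lines of `count, rate = measure_throughput(loader, warmup_batches=2)`. Results depend heavily on image size, storage, and worker count, so compare numbers only across runs with the same configuration.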

Scalability (v1.2.0)

Workers | Throughput   | Speedup vs 1 Worker | Efficiency
1       | 2,180 img/s  | 1.00x               | 100%
2       | 4,020 img/s  | 1.84x               | 92%
4       | 6,755 img/s  | 3.10x               | 77%
8       | 6,973 img/s  | 3.20x               | 40%
16      | 21,036 img/s | 9.65x               | 60%

Test Config: Apple M4 Max, 1000 images, batch_size=64, throughput from first 1000 images
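
The last two columns of the table above follow from the throughput column: speedup is throughput divided by the single-worker throughput, and efficiency is speedup divided by the worker count. The short script below recomputes them from the published throughputs.

```python
# Workers -> measured throughput in img/s, copied from the table above.
measured = {1: 2180, 2: 4020, 4: 6755, 8: 6973, 16: 21036}


def scaling(measured):
    """Return {workers: (speedup, efficiency_percent)} relative to 1 worker."""
    base = measured[1]
    rows = {}
    for workers, throughput in measured.items():
        speedup = throughput / base
        rows[workers] = (round(speedup, 2), round(100 * speedup / workers))
    return rows


rows = scaling(measured)
```

An efficiency of 100% at every row would be perfectly linear scaling; the measured values fall to 40% at 8 workers and recover to 60% at 16.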

Transform Performance

Transform         | Throughput    | SIMD Speedup
RandomPosterize   | 336,700 img/s | Bitwise ops
RandomSolarize    | 21,300 img/s  | N/A
AutoAugment       | 19,800 img/s  | 2x
RandomPerspective | 9,900 img/s   | N/A
Resize (Bilinear) | 8,200 img/s   | 3.2x
ColorJitter       | 5,100 img/s   | 2.1x
GaussianBlur      | 2,400 img/s   | 4.5x

Installation

From PyPI (Recommended)

pip install turboloader

From Source

git clone https://github.com/ALJainProjects/TurboLoader.git
cd TurboLoader
pip install -e .

System Requirements

  • Python: 3.8+
  • Compiler: C++20 (GCC 10+, Clang 12+, MSVC 19.29+)
  • OS: macOS, Linux, Windows

Optional (Recommended):

# macOS
brew install jpeg-turbo libpng libwebp

# Ubuntu/Debian
sudo apt-get install libjpeg-turbo8-dev libpng-dev libwebp-dev

See Installation Guide for details.


Quick Start

Basic Usage

import turboloader

# Create DataLoader
loader = turboloader.DataLoader(
    'imagenet.tar',
    batch_size=128,
    num_workers=8
)

# Iterate over batches
for batch in loader:
    for sample in batch:
        image = sample['image']  # NumPy array (H, W, C)
        label = sample['label']
        # Train your model...

With Transforms

import turboloader

# Create SIMD-accelerated transforms
resize = turboloader.Resize(224, 224, turboloader.InterpolationMode.BILINEAR)
normalize = turboloader.ImageNetNormalize(to_float=True)
flip = turboloader.RandomHorizontalFlip(p=0.5)
color_jitter = turboloader.ColorJitter(brightness=0.2, contrast=0.2)

# Apply to images
loader = turboloader.DataLoader('data.tar', batch_size=64, num_workers=8)

for batch in loader:
    for sample in batch:
        img = sample['image']

        # Apply transforms (SIMD-accelerated)
        img = resize.apply(img)
        img = flip.apply(img)
        img = color_jitter.apply(img)
        img = normalize.apply(img)

        # Ready for training!

PyTorch Integration

import turboloader
import torch

# Create loader with tensor conversion
loader = turboloader.DataLoader('imagenet.tar', batch_size=64, num_workers=8)

# PyTorch-compatible tensor format
to_tensor = turboloader.ToTensor(
    format=turboloader.TensorFormat.PYTORCH_CHW,
    normalize=True
)
normalize = turboloader.ImageNetNormalize(to_float=True)

# Training loop
for batch in loader:
    images = []
    labels = []

    for sample in batch:
        img = to_tensor.apply(sample['image'])
        img = normalize.apply(img)
        images.append(torch.from_numpy(img))
        labels.append(sample['label'])

    batch_tensor = torch.stack(images)
    # Train model...

AutoAugment

import turboloader

# Use learned augmentation policies
autoaugment = turboloader.AutoAugment(
    policy=turboloader.AutoAugmentPolicy.IMAGENET
)

loader = turboloader.DataLoader('data.tar', batch_size=128, num_workers=8)

for batch in loader:
    for sample in batch:
        img = autoaugment.apply(sample['image'])
        # State-of-the-art augmentation applied!

See Getting Started Guide for more examples.

Python API Limitations (v1.5.1)

The Python bindings expose most C++ functionality, but some features are C++-only or not yet exposed:

Available in Python (v1.5.1):

  • ✅ TBL v2 Reader/Writer (NEW) - Full support for reading and writing TBL v2 files with LZ4 compression
  • ✅ Transform Compose() (NEW) - Chain multiple transforms into a single pipeline
  • ✅ All 19 SIMD-accelerated transforms - Full transform API with composition support
  • ✅ DataLoader - High-performance data loading with TAR/WebDataset support
  • ✅ GPU-accelerated JPEG decoding (nvJPEG) - Automatic when CUDA available
  • ✅ Remote TAR support (HTTP, S3, GCS) - Via DataLoader

C++ Only (Not Yet in Python):

  • ⚠️ Smart Batching configuration - Available in C++ API only; Python uses default batching
  • ⚠️ Distributed training primitives - Use PyTorch DDP/Horovod with TurboLoader DataLoader
  • ⚠️ Direct TBL v1 Reader/Writer - v1.5.0+ focuses on TBL v2; use C++ API for v1 format
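
When sharding is handled outside the loader, the core of deterministic sharding is small: every rank builds the same seeded permutation and takes a disjoint strided slice of it. The sketch below is a generic illustration in that spirit; `shard_indices` and its parameters are hypothetical names, not part of the TurboLoader API.

```python
import random


def shard_indices(num_samples, rank, world_size, epoch=0, seed=0):
    """Return the sample indices that one rank should read in a given epoch."""
    order = list(range(num_samples))
    # Same seed on every rank -> same permutation everywhere, no communication.
    random.Random(seed + epoch).shuffle(order)
    # Strided slices are disjoint and together cover every sample exactly once.
    return order[rank::world_size]
```

Changing `epoch` reshuffles the data while keeping the ranks in agreement. Note that the shards can differ in length by one sample when `num_samples` is not divisible by `world_size`; frameworks typically pad or drop samples to even them out.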

Transform Compose Example:

import turboloader
import numpy as np

# Create a transform pipeline
pipeline = turboloader.Compose([
    turboloader.Resize(224, 224),
    turboloader.RandomHorizontalFlip(0.5),
    turboloader.ColorJitter(brightness=0.2, contrast=0.2),
    turboloader.ImageNetNormalize()
])

# Apply the entire pipeline with a single call
img = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)  # upper bound is exclusive
transformed = pipeline.apply(img)  # or pipeline(img)
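
Conceptually, a composed pipeline just feeds the output of each transform into the next. The pure-Python sketch below shows that mechanism with toy transforms; it is a hypothetical re-implementation for illustration, not TurboLoader's `Compose`, and `AddOne` and `Double` are stand-ins for real image transforms.

```python
class Compose:
    """Chain transforms; each one must expose an .apply(x) method."""

    def __init__(self, transforms):
        self.transforms = list(transforms)

    def apply(self, x):
        for transform in self.transforms:
            x = transform.apply(x)  # output of one step is input to the next
        return x

    __call__ = apply  # allow pipeline(x) as well as pipeline.apply(x)


class AddOne:
    def apply(self, x):
        return x + 1


class Double:
    def apply(self, x):
        return x * 2


toy_pipeline = Compose([AddOne(), Double()])
result = toy_pipeline.apply(3)
```

Order matters: reversing the two toy transforms gives a different result, just as resizing before or after a crop does for images.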

For C++ API features, see C++ Documentation.


Feature Comparison

Feature TurboLoader PyTorch TensorFlow FFCV DALI
Throughput (CPU) 10,146 img/s 835 img/s 7,569 img/s 15,000 img/s 8,000 img/s
SIMD Transforms 19 0 0 14 GPU only
Lock-Free Queues
Zero-Copy I/O
AutoAugment
Custom Format TBL v2 (LZ4) Any Any .beton Any
Compression 40-60% savings ~60%
Data Integrity CRC32/CRC16
GPU Decode nvJPEG
Memory (2K imgs) 848 MB 1,523 MB 1,245 MB ~900 MB 1,200+ MB
Ease of Use ⭐⭐⭐⭐⭐ ⭐⭐⭐⭐⭐ ⭐⭐⭐⭐ ⭐⭐⭐ ⭐⭐⭐
License MIT BSD Apache Apache Apache

Transform Library

TurboLoader includes 19 SIMD-accelerated transforms:

Core Transforms

  • Resize - Bilinear/Bicubic/Lanczos interpolation (3.2x faster)
  • Normalize - SIMD FMA operations (5.0 GB/s)
  • ImageNetNormalize - Preset for ImageNet (mean/std)
  • CenterCrop - Center region extraction
  • RandomCrop - Random crop with padding

Augmentation Transforms

  • RandomHorizontalFlip - SIMD horizontal flip (10.5K img/s)
  • RandomVerticalFlip - SIMD vertical flip
  • ColorJitter - Brightness/contrast/saturation/hue (5.1K img/s)
  • RandomRotation - Arbitrary angle rotation
  • RandomAffine - Rotation/translation/scale/shear
  • GaussianBlur - Separable convolution (2.4K img/s, 4.5x faster)
  • RandomErasing - Cutout augmentation (8.3K img/s)
  • Grayscale - RGB to grayscale conversion
  • Pad - Border padding (CONSTANT/EDGE/REFLECT)

Advanced Transforms (v0.7.0+)

  • RandomPosterize - Bit-depth reduction (336K+ img/s)
  • RandomSolarize - Threshold inversion (21K+ img/s)
  • RandomPerspective - Perspective warp (9.9K+ img/s)
  • AutoAugment - Learned policies (ImageNet/CIFAR10/SVHN)
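
Posterize and solarize are simple per-pixel operations, which is why they vectorize so well. The sketch below shows the arithmetic on a flat list of 8-bit values. It assumes the common conventions that posterize keeps the top `bits` bits and that solarize inverts values at or above the threshold; TurboLoader's exact conventions are not stated here, and the `Random` variants would additionally apply the operation only with some probability.

```python
def posterize(pixels, bits):
    """Keep only the top `bits` bits of each 8-bit value."""
    if not 1 <= bits <= 8:
        raise ValueError("bits must be between 1 and 8")
    mask = (0xFF << (8 - bits)) & 0xFF  # e.g. bits=2 -> 0b11000000
    return [p & mask for p in pixels]


def solarize(pixels, threshold):
    """Invert every 8-bit value at or above `threshold`."""
    return [255 - p if p >= threshold else p for p in pixels]
```

Each output value depends only on the corresponding input value, so a SIMD implementation can process a full vector register of pixels with a single AND (posterize) or a compare-and-blend (solarize).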

Tensor Conversion

  • ToTensor - PyTorch CHW or TensorFlow HWC format

See Transforms API for complete reference.


Architecture

┌─────────────────────────────────────────────────────────────┐
│                    TurboLoader Pipeline                      │
└──────────┬──────────────────────────────────────────────────┘
           │
    ┌──────▼──────┐
    │  Main Thread │
    └──────┬───────┘
           │
    ┌──────▼───────────────────────────────────────────────────┐
    │    Memory-Mapped Reader (TAR/TBL v2) (52+ Gbps)           │
    │  • mmap() zero-copy access                                │
    │  • TBL v2: LZ4 decompression with CRC32 validation        │
    │  • TAR: 512-byte header parsing                           │
    │  • Cached dimensions for fast filtering                   │
    └──────┬───────────────────────────────────────────────────┘
           │
    ┌──────▼───────────────────────────────────────────────────┐
    │          Worker Thread Pool (N threads)                   │
    │                                                            │
    │  ┌────────────────┐  ┌────────────────┐                  │
    │  │  Worker 1      │  │  Worker N      │                  │
    │  ├────────────────┤  ├────────────────┤                  │
    │  │ LZ4 Decompress │  │ LZ4 Decompress │  TBL v2          │
    │  │ JPEG Decode    │  │ JPEG Decode    │  libjpeg-turbo   │
    │  │ SIMD Transforms│  │ SIMD Transforms│  AVX2/NEON       │
    │  │ Tensor Convert │  │ Tensor Convert │  Zero-copy       │
    │  └────────┬───────┘  └────────┬───────┘                  │
    └───────────┼──────────────────┼─────────────────────────┘
                │                  │
         ┌──────▼──────────────────▼──────┐
         │   Lock-Free Output Queue       │  50x faster
         │   (SPSC ring buffer)            │  than mutex
         └──────┬─────────────────────────┘
                │
         ┌──────▼──────────────┐
         │   Python Iterator   │
         └─────────────────────┘
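
The "TAR: 512-byte header parsing" step in the diagram relies on the fixed ustar layout: each member starts with a 512-byte header holding the name (bytes 0-99), the size as octal text (bytes 124-135), and a type flag (byte 156), followed by the data padded to a multiple of 512 bytes. The sketch below walks an uncompressed TAR buffer this way. It is a simplified illustration, not TurboLoader's parser: it ignores the ustar name prefix field, pax and GNU long-name extensions, and base-256 sizes for very large files.

```python
import io
import tarfile

BLOCK = 512


def iter_tar_members(buf):
    """Yield (name, data_offset, size) for each regular file in a TAR buffer."""
    view = memoryview(buf)
    pos = 0
    while pos + BLOCK <= len(view):
        header = bytes(view[pos:pos + BLOCK])
        if header == b"\0" * BLOCK:  # an all-zero block marks end of archive
            break
        name = header[0:100].split(b"\0", 1)[0].decode("utf-8")
        size_field = header[124:136].split(b"\0", 1)[0].strip()
        size = int((size_field or b"0").decode("ascii"), 8)  # octal text
        data_offset = pos + BLOCK
        if header[156:157] in (b"0", b"\0"):  # regular file
            yield name, data_offset, size
        # Skip the data, which is padded up to the next 512-byte boundary.
        pos = data_offset + (size + BLOCK - 1) // BLOCK * BLOCK


def build_demo_tar():
    """Create a tiny in-memory archive to parse."""
    out = io.BytesIO()
    with tarfile.open(fileobj=out, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name, payload in [("0001.jpg", b"x" * 700), ("0001.cls", b"7")]:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return out.getvalue()


archive = build_demo_tar()
members = list(iter_tar_members(archive))
```

Since the function only needs the buffer protocol, the same code accepts an `mmap.mmap` object, which is how a reader can locate every sample's bytes without copying the file into memory.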

Key Components:

  1. TBL v2 Format - LZ4 compression (40-60% savings), CRC32/CRC16 validation, cached dimensions
  2. Memory-Mapped I/O - Zero-copy TAR/TBL parsing (52+ Gbps)
  3. SIMD Transforms - AVX2/AVX-512/NEON vectorized operations
  4. Lock-Free Queues - Cache-aligned atomic operations
  5. Thread-Local Decoders - Per-worker JPEG/PNG/WebP instances
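
The index protocol behind a single-producer/single-consumer ring buffer can be sketched as follows. Python has no atomics or memory-ordering controls, so this shows only the structure: in a C++ version `head` and `tail` would be atomic variables with acquire/release ordering, placed on separate cache lines. It is not TurboLoader's implementation.

```python
class RingBuffer:
    """Fixed-capacity single-producer/single-consumer ring buffer (structure only)."""

    def __init__(self, capacity):
        # One slot stays unused so that head == tail always means "empty".
        self._slots = [None] * (capacity + 1)
        self._head = 0  # next slot to read; only the consumer writes it
        self._tail = 0  # next slot to write; only the producer writes it

    def try_push(self, item):
        """Producer side: return False instead of blocking when full."""
        nxt = (self._tail + 1) % len(self._slots)
        if nxt == self._head:
            return False
        self._slots[self._tail] = item
        self._tail = nxt  # publish only after the slot has been written
        return True

    def try_pop(self):
        """Consumer side: return (True, item), or (False, None) when empty."""
        if self._head == self._tail:
            return False, None
        item = self._slots[self._head]
        self._slots[self._head] = None
        self._head = (self._head + 1) % len(self._slots)
        return True, item
```

Because each index has exactly one writer, neither side ever needs to take a lock; the producer only reads `head` and the consumer only reads `tail`.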

See Architecture Guide for detailed design.


Documentation

Getting Started

API Reference

Guides

Benchmarks

Development


Roadmap

v1.5.0 (Current - Production/Stable)

  • ✅ TBL v2 Binary Format with LZ4 compression (40-60% space savings)
  • ✅ Streaming O(1) memory writer for efficient conversion
  • ✅ CRC32/CRC16 checksums for data integrity validation
  • ✅ Cached image dimensions (16-bit width/height) in the index
  • ✅ Rich metadata support (JSON, Protobuf, MessagePack)
  • ✅ 4,875 img/s TAR→TBL conversion throughput
  • ✅ 64-byte cache-aligned headers, 24-byte index entries
  • ✅ tar_to_tbl converter with parallel processing

v1.4.0

  • ✅ Format converter benchmarks and documentation
  • ✅ Comprehensive TAR/TBL performance analysis
  • ✅ Access pattern comparison (sequential vs random)

v1.3.0

  • ✅ Performance optimizations and stability improvements
  • ✅ Enhanced documentation and examples

v1.2.1

  • ✅ GPU JPEG decoding (nvJPEG with automatic CPU fallback)
  • ✅ Linux io_uring async I/O (2-3x faster disk throughput)

v1.2.0

  • ✅ Smart Batching (15-25% padding reduction, ~1.2x throughput boost)
  • ✅ Distributed Training (multi-node support with deterministic sharding)
  • ✅ Scaling to 16 workers (21,036 img/s peak, 9.65x over 1 worker)

v1.1.0

  • ✅ AVX-512 SIMD optimizations for modern CPUs
  • ✅ Prefetching pipeline for reduced latency
  • ✅ TBL v1 binary format (12.4% smaller than TAR)

v1.6.0+ (Future)

  • ZSTD compression option (higher compression ratios)
  • Video dataloader enhancements
  • Cloud storage optimizations (S3/GCS streaming)
  • Advanced distributed training features
  • Extended test suite (5000+ images, multiple formats)

See CHANGELOG.md for version history.


Contributing

Contributions are welcome! Please see the Contributing Guide for:

  • Development setup
  • Code style guidelines
  • Pull request process
  • Testing requirements

License

TurboLoader is released under the MIT License.


Citation

If you use TurboLoader in your research:

@software{turboloader2025,
  author = {Jain, Arnav},
  title = {TurboLoader: High-Performance ML Data Loading},
  year = {2025},
  version = {1.5.0},
  url = {https://github.com/ALJainProjects/TurboLoader}
}

TurboLoader v1.5.0 - Production-ready ML data loading with TBL v2 format. Fast. Efficient. Reliable.
