Automatic 4x training speedup for PyTorch models

These details have not been verified by PyPI

Project links

Project description

PyTorch AutoTune

🚀 Automatic 4x training speedup for PyTorch models!

🎯 Features

4x Training Speedup: Validated 4.06x speedup on NVIDIA T4
Zero Configuration: Automatic hardware detection and optimization
Production Ready: Full checkpointing and inference support
Energy Efficient: 36% reduction in training energy consumption
Universal: Works with any PyTorch model

📦 Installation

pip install pytorch-autotune

🚀 Quick Start

from pytorch_autotune import quick_optimize
import torchvision.models as models

# Any PyTorch model
model = models.resnet50()

# One line to optimize!
model, optimizer, scaler = quick_optimize(model)

# Now train with 4x speedup!
for epoch in range(num_epochs):
    for data, target in train_loader:
        data, target = data.cuda(), target.cuda()
        
        optimizer.zero_grad(set_to_none=True)
        
        # Mixed precision training
        with torch.amp.autocast('cuda'):
            output = model(data)
            loss = criterion(output, target)
        
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

🎮 Advanced Usage

from pytorch_autotune import AutoTune

# Create AutoTune instance with custom settings
autotune = AutoTune(model, device='cuda', verbose=True)

# Custom optimization
model, optimizer, scaler = autotune.optimize(
    optimizer_name='AdamW',
    learning_rate=0.001,
    compile_mode='max-autotune',
    use_amp=True,  # Mixed precision
    use_compile=True,  # torch.compile
    use_fused=True,  # Fused optimizer
)

# Benchmark to measure speedup
results = autotune.benchmark(sample_data, iterations=100)
print(f"Speedup: {results['throughput']:.1f} iter/sec")

📊 Benchmarks

Tested on NVIDIA Tesla T4 GPU with PyTorch 2.7.1:

Model	Dataset	Baseline	AutoTune	Speedup	Accuracy
ResNet-18	CIFAR-10	12.04s	2.96s	4.06x	+4.7%
ResNet-50	ImageNet	45.2s	11.3s	4.0x	Maintained
EfficientNet-B0	CIFAR-10	30.2s	17.5s	1.73x	+0.8%
Vision Transformer	CIFAR-100	55.8s	19.4s	2.87x	+1.2%

Energy Efficiency Results

Configuration	Energy (J)	Time (s)	Energy Savings
Baseline	324	4.7	-
AutoTune	208	3.1	35.8%

🔧 How It Works

AutoTune automatically detects your hardware and applies optimal combinations of:

Mixed Precision Training (AMP)
- FP16 on T4/V100
- BF16 on A100/H100
- Automatic loss scaling
torch.compile() Optimization
- Graph compilation for faster execution
- Automatic kernel fusion
- Hardware-specific optimizations
Fused Optimizers
- Single-kernel optimizer updates
- Reduced memory traffic
- Better GPU utilization
Hardware-Specific Settings
- TF32 for Ampere GPUs
- Channels-last memory format for CNNs
- Optimal batch size detection

🖥️ Supported Hardware

GPU	Speedup	Special Features
Tesla T4	2-4x	FP16, Fused Optimizers
Tesla V100	2-3.5x	FP16, Tensor Cores
A100	3-4.5x	BF16, TF32, Tensor Cores
RTX 3090/4090	2.5-4x	FP16, TF32
H100	3.5-5x	FP8, BF16, TF32

📚 API Reference

AutoTune Class

AutoTune(model, device='cuda', batch_size=None, verbose=True)

Parameters:

model: PyTorch model to optimize
device: Device to use ('cuda' or 'cpu')
batch_size: Optional batch size for auto-detection
verbose: Print optimization details

optimize() Method

model, optimizer, scaler = autotune.optimize(
    optimizer_name='AdamW',
    learning_rate=0.001,
    compile_mode='default',
    use_amp=None,  # Auto-detect
    use_compile=None,  # Auto-detect
    use_fused=None,  # Auto-detect
    use_channels_last=None  # Auto-detect
)

quick_optimize() Function

model, optimizer, scaler = quick_optimize(model, **kwargs)

One-line optimization with automatic settings.

💡 Tips for Best Performance

Use Latest PyTorch: Version 2.0+ for torch.compile support
Batch Size: Let AutoTune detect optimal batch size
Learning Rate: Scale with batch size (we handle this)
First Epoch: Will be slower due to compilation
Memory: Use optimizer.zero_grad(set_to_none=True)

📈 Real-World Examples

Computer Vision

import torchvision.models as models
from pytorch_autotune import quick_optimize

# ResNet for ImageNet
model = models.resnet50(pretrained=True)
model, optimizer, scaler = quick_optimize(model)
# Result: 4x speedup

# EfficientNet for CIFAR
model = models.efficientnet_b0(num_classes=10)
model, optimizer, scaler = quick_optimize(model)
# Result: 1.7x speedup

Transformers

from transformers import AutoModel
from pytorch_autotune import AutoTune

# BERT model
model = AutoModel.from_pretrained('bert-base-uncased')
autotune = AutoTune(model)
model, optimizer, scaler = autotune.optimize()
# Result: 2.5x speedup

🐛 Troubleshooting

Issue: First epoch is slow

Solution: This is normal - torch.compile needs to compile the graph. Subsequent epochs will be fast.

Issue: Out of memory

Solution: AutoTune may increase memory usage slightly. Reduce batch size by 10-20%.

Issue: Accuracy drop

Solution: Use gradient clipping and adjust learning rate:

torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

Issue: Not seeing speedup

Solution: Ensure you're using:

GPU (not CPU)
PyTorch 2.0+
Compute-intensive model (not memory-bound)

📚 Citation

If you use PyTorch AutoTune in your research, please cite:

@software{pytorch_autotune_2024,
  title = {PyTorch AutoTune: Automatic 4x Training Speedup},
  author = {Shrivastava, Chinmay},
  year = {2024},
  url = {https://github.com/JonSnow1807/pytorch-autotune},
  version = {1.0.1}
}

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Fork the repository
Create your feature branch (git checkout -b feature/AmazingFeature)
Commit your changes (git commit -m 'Add some AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Open a Pull Request

🗺️ Roadmap

Support for distributed training (DDP)
Automatic learning rate scheduling
Support for quantization (INT8)
Integration with HuggingFace Trainer
Custom CUDA kernels for specific operations
Support for Apple Silicon (MPS)

👨‍💻 Author

Chinmay Shrivastava

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

PyTorch team for torch.compile and AMP
NVIDIA for mixed precision training research
The open-source community for feedback and contributions

⭐ Star History

Made with ❤️ by Chinmay Shrivastava

If this project helped you, please consider giving it a ⭐!

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

1.0.2

Aug 10, 2025

1.0.1

Aug 10, 2025

1.0.0

Aug 10, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pytorch_autotune-1.0.2.tar.gz (12.3 kB view details)

Uploaded Aug 10, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

pytorch_autotune-1.0.2-py3-none-any.whl (9.8 kB view details)

Uploaded Aug 10, 2025 Python 3

File details

Details for the file pytorch_autotune-1.0.2.tar.gz.

File metadata

Download URL: pytorch_autotune-1.0.2.tar.gz
Upload date: Aug 10, 2025
Size: 12.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.10

File hashes

Hashes for pytorch_autotune-1.0.2.tar.gz
Algorithm	Hash digest
SHA256	`1923f6ef96126f29a893424b7e180c971082b36f7a4c79f72321f6c0a600c399`
MD5	`cd396d9895deac77f06caf98871f7f23`
BLAKE2b-256	`62bdb8719ef415677a4efbeb952d28c910e3229b2c2e7cb21601d0eb9ea67f85`

See more details on using hashes here.

File details

Details for the file pytorch_autotune-1.0.2-py3-none-any.whl.

File metadata

Download URL: pytorch_autotune-1.0.2-py3-none-any.whl
Upload date: Aug 10, 2025
Size: 9.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.10

File hashes

Hashes for pytorch_autotune-1.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8e08cc2876fa459f9495fbeb3bfea681ff5cdf251a1eda1f68525a94dc134e98`
MD5	`a233a702b696fb5447c8abe6d777cbbb`
BLAKE2b-256	`fea5b5782c8272448b6e9d44430db75ed53956af34b41d1fc269155eaeb006a3`

See more details on using hashes here.

pytorch-autotune 1.0.2

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

PyTorch AutoTune

🎯 Features

📦 Installation

🚀 Quick Start

🎮 Advanced Usage

📊 Benchmarks

Energy Efficiency Results

🔧 How It Works

🖥️ Supported Hardware

📚 API Reference

AutoTune Class

optimize() Method

quick_optimize() Function

💡 Tips for Best Performance

📈 Real-World Examples

Computer Vision

Transformers

🐛 Troubleshooting

Issue: First epoch is slow

Issue: Out of memory

Issue: Accuracy drop

Issue: Not seeing speedup

📚 Citation

🤝 Contributing

🗺️ Roadmap

👨‍💻 Author

📄 License

🙏 Acknowledgments

⭐ Star History

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes