Automatic Recovery Controller - Auto-detect and recover from neural network training failures
ARC - Automatic Recovery Controller
Auto-detect and recover from neural network training failures.
ARC automatically detects NaN/Inf losses, gradient explosions, OOM errors, and silent failures, then recovers your training without losing progress. Accuracy collapse and mode collapse are currently detection-only (see What ARC Handles).
Key Results
| Metric | ARC v4.0 |
|---|---|
| Recovery Rate | 100% (on test suite) |
| Overhead | 27% (ARC Lite) |
| Recoveries (ARC vs torchft) | 3/3 vs 1/3 |
| Max Model Size | 1.5B params |
🚀 Quick Start
Installation
pip install arc-training
Basic Usage (3 lines!)
from arc import WeightRollback

# Initialize
arc = WeightRollback(model, optimizer)

# Training loop
for batch in dataloader:
    loss = model(batch)
    # ARC handles everything
    action = arc.step(loss)
    if not action.rolled_back:
        loss.backward()
        optimizer.step()
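To make the idea behind the loop above concrete, here is a minimal, framework-free sketch of snapshot-and-rollback. It is not ARC's implementation: the class name `RollbackGuard`, the `spike_factor` and `window` parameters, and the "loss above 10x the recent mean" rule are all assumptions chosen for illustration, and a plain dict stands in for model and optimizer state.

```python
import copy
import math


class RollbackGuard:
    """Hypothetical illustration of snapshot-and-rollback; not ARC's real class."""

    def __init__(self, state, spike_factor=10.0, window=20):
        self.state = state                # mutable dict standing in for model/optimizer state
        self.spike_factor = spike_factor  # assumed rule: loss > factor * recent mean is a spike
        self.window = window              # number of healthy losses kept for the mean
        self._recent = []
        self._good_state = copy.deepcopy(state)

    def _is_bad(self, loss):
        if not math.isfinite(loss):       # NaN or +/-Inf
            return True
        if not self._recent:
            return False
        # Assumes positive losses; a signed loss would need a different spike rule.
        mean = sum(self._recent) / len(self._recent)
        return loss > self.spike_factor * mean

    def step(self, loss):
        """Return True if state was rolled back, False if the loss looked healthy."""
        if self._is_bad(loss):
            # Restore in place so callers holding a reference see the old values.
            self.state.clear()
            self.state.update(copy.deepcopy(self._good_state))
            return True
        # The current state produced a healthy loss, so it becomes the new snapshot.
        self._recent = (self._recent + [loss])[-self.window:]
        self._good_state = copy.deepcopy(self.state)
        return False


# Toy usage: "w" plays the role of a model weight.
state = {"w": 2.0}
guard = RollbackGuard(state)
for loss in [1.0, float("nan"), 0.9]:
    if not guard.step(loss):
        state["w"] -= 0.5  # stand-in for loss.backward(); optimizer.step()
```

The snapshot is refreshed only when a loss looks healthy, so a bad loss always rolls back to the state that existed before the update that caused it. Copying the full state on every step is the simplest choice but also the most expensive one; a real controller would likely snapshot less often.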
Lightning Integration
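import pytorch_lightning as pl  # assumed import; newer installs use `import lightning.pytorch as pl`
from arc.integrations import ARCCallback

trainer = pl.Trainer(
    callbacks=[ARCCallback()]
)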
📊 Configurations
| Config | Overhead | Use Case |
|---|---|---|
| ARC Lite | 27% | Production training |
| ARC Full | 44% | Debugging unstable runs |
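The table does not say how overhead is measured. A common reading is the relative slowdown of a guarded run against an unguarded baseline, and the sketch below computes it that way. The helper names `time_steps` and `overhead_percent` are hypothetical and are not part of ARC.

```python
import time


def time_steps(step_fn, n_steps=100):
    """Wall-clock seconds taken to call step_fn n_steps times."""
    start = time.perf_counter()
    for _ in range(n_steps):
        step_fn()
    return time.perf_counter() - start


def overhead_percent(baseline_seconds, guarded_seconds):
    """Relative slowdown of the guarded run, as a percentage of the baseline."""
    if baseline_seconds <= 0:
        raise ValueError("baseline_seconds must be positive")
    return 100.0 * (guarded_seconds - baseline_seconds) / baseline_seconds
```

Under this definition, a baseline of 100 s and a guarded run of 127 s would correspond to the 27% figure. When timing GPU work, synchronize the device before reading the clock, otherwise asynchronous kernels make the numbers misleading.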
🛡️ What ARC Handles
| Failure Type | Detection | Recovery | Status |
|---|---|---|---|
| NaN/Inf Loss | ✅ | ✅ | Validated |
| Loss Explosion | ✅ | ✅ | Validated |
| Gradient Explosion | ✅ | ✅ | Validated |
| OOM (all stages) | ✅ | ✅ | Validated |
| Accuracy Collapse | ✅ | ⚠️ | Detection only |
| Mode Collapse | ✅ | ⚠️ | Detection only |
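One widely used way to recover from OOM is to retry the failing step on smaller chunks of the batch. The sketch below shows that pattern in plain Python; it is an assumption about the general technique, not a description of how ARC handles OOM. The function name `run_with_oom_backoff` and its parameters are hypothetical, and `MemoryError` stands in for the framework's own exception (with PyTorch on CUDA you would pass `torch.cuda.OutOfMemoryError` in `oom_errors`).

```python
def run_with_oom_backoff(train_step, batch, min_chunk=1, oom_errors=(MemoryError,)):
    """Run train_step over batch, halving the chunk size whenever an OOM error is raised.

    Returns (outputs, final_chunk_size). Chunks that already succeeded are never re-run.
    """
    chunk = len(batch)
    start = 0
    outputs = []
    while start < len(batch):
        try:
            outputs.append(train_step(batch[start:start + chunk]))
            start += chunk  # advance only after the chunk succeeded
        except oom_errors:
            if chunk <= min_chunk:
                raise  # cannot shrink further; surface the error
            chunk = max(min_chunk, chunk // 2)
    return outputs, chunk


# Toy usage: pretend anything larger than 2 items does not fit in memory.
def fake_step(items):
    if len(items) > 2:
        raise MemoryError("simulated OOM")
    return sum(items)


outputs, final_chunk = run_with_oom_backoff(fake_step, list(range(8)))
```

Splitting a batch changes the effective batch size per update, so in real training the chunks are usually combined with gradient accumulation to keep the optimization behaviour comparable.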
📈 Benchmarks
- Recovery Rate: 100% (160/160 induced failures)
- Statistical Significance: p < 0.001
- Models Tested: CNN, ViT, Transformer, Diffusion (up to 1.5B params)
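A 160/160 result is an observed rate, not a guarantee about future runs. As a rough aid for reading it, the sketch below computes the exact one-sided lower confidence bound on the true success rate when every trial succeeded. This is standard binomial arithmetic and is not taken from the ARC benchmarks; the function name is hypothetical, and the test behind the reported p-value is not stated in the source.

```python
def success_rate_lower_bound(n_trials, alpha=0.05):
    """One-sided exact (Clopper-Pearson) lower bound when all n_trials succeeded.

    If the true success rate is p, the chance of n straight successes is p ** n.
    The bound is the p at which that chance drops to alpha.
    """
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return alpha ** (1.0 / n_trials)


bound = success_rate_lower_bound(160)  # roughly 0.981 at 95% confidence
```

In other words, 160 successes out of 160 are consistent with a true recovery rate of about 98.1% or higher at the 95% level, for failures resembling the induced ones.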
📜 Citation
@software{arc2026,
title={ARC: Automatic Recovery Controller for Neural Network Training},
author={Kaushik, Aryan},
year={2026},
url={https://github.com/aryankaushik/arc-training}
}
📄 License
AGPL-3.0 License - see LICENSE for details.
Copyright (c) 2026 Aryan Kaushik. All rights reserved.