Autonomous Recovery Controller - Auto-detect and recover from neural network training failures
ARC - Autonomous Recovery Controller
A real-time fault-tolerance framework for neural network training.
ARC monitors training signals (gradients, loss curvature, and Fisher Information) to predict failures and recover from them before they crash your run. It uses a Mamba-based state-space model with Evidential Deep Learning for uncertainty-aware failure prediction.
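To make the idea of monitoring a training signal concrete, here is a minimal sketch of a loss monitor in plain Python. It is not ARC's implementation: the `LossMonitor` class, its `window` and `explosion_factor` parameters, and the status strings are all hypothetical, and a simple running-average rule stands in for the learned predictor described above.

```python
import math
from collections import deque


class LossMonitor:
    """Hypothetical helper, not part of ARC: flags non-finite or exploding losses."""

    def __init__(self, window: int = 20, explosion_factor: float = 10.0):
        self.history = deque(maxlen=window)  # recent healthy loss values
        self.explosion_factor = explosion_factor

    def check(self, loss: float) -> str:
        """Return 'nan_inf', 'explosion', or 'ok' for one scalar loss value."""
        if not math.isfinite(loss):
            return "nan_inf"
        if self.history:
            baseline = sum(self.history) / len(self.history)
            # Compare against the running average of recent healthy losses.
            if loss > self.explosion_factor * max(baseline, 1e-12):
                return "explosion"
        self.history.append(loss)  # only healthy values update the baseline
        return "ok"


monitor = LossMonitor(window=5, explosion_factor=10.0)
statuses = [monitor.check(v) for v in [1.0, 0.9, 0.8, 50.0, float("nan"), 0.7]]
print(statuses)  # ['ok', 'ok', 'ok', 'explosion', 'nan_inf', 'ok']
```

Only healthy values enter the baseline, so a single spike does not raise the threshold for the steps that follow it.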
Key Results
| Metric | ARC v4.0 |
|---|---|
| Recovery Rate | 100% on core failure types (NaN, Inf, explosion) |
| Overhead | ~35% (small models), higher on larger models |
| vs torchft | 3/3 vs 1/3 recoveries |
| Models Tested | YOLOv11, DINOv2, Llama-Style, SD-UNet (up to 33M params) |
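The table does not say how overhead was measured. One common definition, assumed here, is the fractional slowdown of a monitored run relative to an unmonitored baseline. The function name and the timings below are illustrative only and are not taken from the ARC benchmarks.

```python
def relative_overhead(baseline_seconds: float, monitored_seconds: float) -> float:
    """Fractional slowdown of a monitored run relative to an unmonitored one."""
    if baseline_seconds <= 0:
        raise ValueError("baseline_seconds must be positive")
    return (monitored_seconds - baseline_seconds) / baseline_seconds


# Illustrative timings only, not measurements from the ARC benchmarks.
overhead = relative_overhead(baseline_seconds=100.0, monitored_seconds=135.0)
print(f"{overhead:.0%}")  # 35%
```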
Quick Start
Installation
pip install arc-training
Basic Usage
from arc import Arc

# Wrap your model and optimizer
controller = Arc(model, optimizer)

# Training loop
for batch in dataloader:
    loss = model(batch)

    # ARC monitors, predicts, and recovers automatically
    action = controller.step(loss)
    if not action.rolled_back:
        loss.backward()
        optimizer.step()
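The `rolled_back` flag in the loop above suggests a snapshot-and-restore pattern. The following toy sketch shows that general pattern on a one-parameter problem in plain Python. It is an assumption about the kind of mechanism involved, not ARC's actual recovery logic: `train_with_rollback`, the injected fault, and the per-step snapshot policy are all hypothetical.

```python
import copy
import math


def train_with_rollback(steps: int = 10, lr: float = 0.1, bad_step: int = 5):
    """Toy gradient descent on (w - 3)^2 with a snapshot/rollback guard."""
    state = {"w": 0.0}
    snapshot = copy.deepcopy(state)  # last known-good state
    rollbacks = 0
    for step in range(steps):
        grad = 2.0 * (state["w"] - 3.0)
        if step == bad_step:
            grad = float("nan")  # injected fault, for illustration only
        state["w"] -= lr * grad
        if not math.isfinite(state["w"]):
            state = copy.deepcopy(snapshot)  # restore the last healthy state
            rollbacks += 1
        else:
            snapshot = copy.deepcopy(state)  # update was healthy: keep it
    return state["w"], rollbacks


final_w, rollbacks = train_with_rollback()
print(final_w, rollbacks)
```

A real training run would snapshot model and optimizer state far less often than every step, since copying large state is expensive. The per-step copy here only keeps the example short.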
Lightning Integration
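import pytorch_lightning as pl  # assumed import; the original snippet uses `pl` without defining it
from arc import ArcCallback

trainer = pl.Trainer(
    callbacks=[ArcCallback()]
)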
Configurations
| Config | Overhead | Use Case |
|---|---|---|
| ARC Lite | Lower | Production training |
| ARC Full | Higher | Debugging unstable runs |
Failure Coverage
| Failure Type | Detection | Recovery | Status |
|---|---|---|---|
| NaN/Inf Loss | Yes | Yes | Validated |
| Loss Explosion | Yes | Yes | Validated |
| Gradient Explosion | Yes | Yes | Validated |
| OOM (all stages) | Yes | Yes | Validated |
| Accuracy Collapse | Yes | Partial | Detection only |
| Mode Collapse | Yes | Partial | Detection only |
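As an illustration of how two of the failure types in this table could be told apart, the sketch below classifies a set of gradients by their global L2 norm. It is a simplified stand-in written in plain Python. The function names, the `max_norm` threshold, and the status strings are hypothetical and do not come from ARC.

```python
import math
from typing import Sequence


def global_grad_norm(grads: Sequence[Sequence[float]]) -> float:
    """L2 norm over all gradient values, treating each tensor as a flat list."""
    return math.sqrt(sum(g * g for tensor in grads for g in tensor))


def classify_gradients(grads: Sequence[Sequence[float]], max_norm: float = 100.0) -> str:
    """Return 'nan_inf', 'gradient_explosion', or 'ok' (hypothetical labels)."""
    norm = global_grad_norm(grads)
    if not math.isfinite(norm):
        return "nan_inf"  # at least one NaN or Inf value (or an overflow)
    if norm > max_norm:
        return "gradient_explosion"
    return "ok"


print(classify_gradients([[3.0, 4.0]]))           # ok (norm is 5.0)
print(classify_gradients([[300.0], [400.0]]))     # gradient_explosion (norm is 500.0)
print(classify_gradients([[1.0, float("nan")]]))  # nan_inf
```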
Benchmarks
- Core Failure Recovery: 100% (NaN, Inf, explosion across all tests)
- Modern Model Recovery: 5/8 induced failures recovered
- torchft Comparison: ARC 3/3 vs torchft 1/3
- Models Tested: YOLOv11, DINOv2-Small, Llama-Style, SD-UNet
Citation
@software{arc2026,
title={ARC: Autonomous Recovery Controller for Neural Network Training},
author={Kaushik, Aryan},
year={2026},
url={https://github.com/a-kaushik2209/ARC}
}
License
AGPL-3.0 License - see LICENSE for details.
Copyright (c) 2026 Aryan Kaushik. All rights reserved.
Download files
File details
Details for the file arc_training-4.0.1.tar.gz.
File metadata
- Download URL: arc_training-4.0.1.tar.gz
- Upload date:
- Size: 157.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ed246eb565dcda19a5915ada6834dab98d2fefdee0c27355c25fd0edbd2ef2cf |
| MD5 | b473c43c0d8c2f9d05afafeb1a34c0b8 |
| BLAKE2b-256 | d601a120e29d768887f655bc84c6eaac8f2ad34d3f2feb0223089f7c57b3dd7f |
File details
Details for the file arc_training-4.0.1-py3-none-any.whl.
File metadata
- Download URL: arc_training-4.0.1-py3-none-any.whl
- Upload date:
- Size: 208.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fd1488663c58d52ff76648c826d66b4c54241860e88fe8879d28079506813ca7 |
| MD5 | d6a3661aa62a6a6b0b2547159ea618cb |
| BLAKE2b-256 | f41bba3ec9583657bc8c1a9e0912a1758dd6d447c4702d4b193fcd30ac755ef5 |
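The published digests can be checked against a downloaded file with Python's standard `hashlib` module. This is a minimal sketch: the helper names `file_digests` and `matches_published` are hypothetical, and BLAKE2b-256 is computed as BLAKE2b with a 32-byte digest.

```python
import hashlib
from pathlib import Path


def file_digests(path: str) -> dict:
    """Compute the three digest types listed above for one file."""
    data = Path(path).read_bytes()
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "md5": hashlib.md5(data).hexdigest(),
        "blake2b-256": hashlib.blake2b(data, digest_size=32).hexdigest(),
    }


def matches_published(path: str, expected_sha256: str) -> bool:
    """True if the file's SHA256 digest equals the published value."""
    return file_digests(path)["sha256"] == expected_sha256.strip().lower()
```

For example, after downloading the source distribution, `matches_published("arc_training-4.0.1.tar.gz", "ed246eb565dcda19a5915ada6834dab98d2fefdee0c27355c25fd0edbd2ef2cf")` should return `True` if the file is intact.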