Automatic Recovery Controller - Auto-detect and recover from neural network training failures

These details have not been verified by PyPI

Project links

Project description

ARC

Autonomous Recovery Controller for Neural Network Training

Real-time fault tolerance that monitors, predicts, and recovers from training failures — automatically.

3 lines of code · ~18% GPU overhead · 100% recovery on induced failures · 20K–355M parameters validated

Quick Start · Architecture · Benchmarks

The Problem

Training neural networks is fragile. A single NaN gradient, an OOM spike, or an exploding loss at hour 47 of a 48-hour run can destroy days of compute. Engineers waste enormous time adding manual checkpointing, writing recovery scripts, and babysitting long runs.

ARC eliminates this entirely. It wraps your training loop with an autonomous controller that:

Monitors — Tracks multi-signal telemetry (loss trajectory, gradient norms, weight health, optimizer state integrity)
Predicts — Tracks gradient norm trends and detects exponential growth before failures become irreversible
Recovers — Automatically rolls back to the last healthy checkpoint and applies corrective measures (LR reduction, weight perturbation)

You keep training. ARC keeps it alive.

Quick Start

Installation

pip install arc-training

Or install from source:

git clone https://github.com/a-kaushik2209/ARC.git
cd ARC
pip install -e .

3-Line Integration

from arc import Arc

controller = Arc(model, optimizer)

for batch in dataloader:
    loss = model(batch)
    action = controller.step(loss)       # monitor + protect

    if not action.rolled_back:           # normal path
        loss.backward()
        optimizer.step()

That's it. ARC handles NaN detection, gradient explosion recovery, checkpoint management, and learning rate adjustment — all behind controller.step().

Architecture

ARC is a modular multi-signal monitoring system:

arc/
├── core/            Self-healing engine with rollback + LR reduction
├── signals/         Multi-signal collectors (gradient, loss, weight, optimizer state)
├── features/        Feature extraction, normalization, and buffering
├── prediction/      Signal-based failure prediction (logistic regression + MLP)
├── intervention/    Recovery strategies (LR reduction, gradient clipping, weight perturbation)
├── checkpointing/   Checkpoint management with circular buffer
├── introspection/   Fisher Information, Hessian approximation, loss landscape analysis
├── physics/         Lyapunov stability analysis, FFT oscillation detection
├── uncertainty/     Conformal prediction for calibrated stability assessment
└── evaluation/      Benchmarking and validation harness

Signal Pipeline

Training Step
     │
     ▼
┌─────────────────────────────────────────────────────────────┐
│  Signal Collectors                                          │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌───────────────┐  │
│  │ Gradient  │ │ Loss     │ │ Weight   │ │ Optimizer     │  │
│  │ Norm/Ent. │ │ Trend/Var│ │ Norm/NaN │ │ State Norm    │  │
│  └─────┬────┘ └─────┬────┘ └─────┬────┘ └──────┬────────┘  │
│        └──────┬──────┴──────┬─────┘             │           │
│               ▼             ▼                   ▼           │
│         Feature Extractor (12 features)                     │
│               │                                             │
│               ▼                                             │
│    ┌─────────────────────┐    ┌──────────────────────────┐  │
│    │  Heuristic Detector │    │  MLP Predictor           │  │
│    │  (instant response) │    │  (97.5% acc, 0 FP)       │  │
│    └─────────┬───────────┘    └────────────┬─────────────┘  │
│              └──────────┬─────────────────┘                 │
│                         ▼                                   │
│              Risk Assessment + Recovery Decision            │
└─────────────────────────┬───────────────────────────────────┘
                          │
              ┌───────────┴───────────┐
              │   HEALTHY             │──── Continue training
              │   WARNING             │──── Increase monitoring, prepare checkpoint
              │   FAILURE             │──── Rollback to checkpoint + corrective action
              └───────────────────────┘

Failure Coverage

Category	Failure Type	Detection	Recovery
Numeric	NaN / Inf Loss	Instant	Rollback + LR reduction
Numeric	Loss Explosion	Instant	Rollback + LR reduction
Numeric	Gradient Explosion	Instant	Rollback + gradient clipping
Numeric	Weight Corruption	Instant	Rollback from checkpoint
Silent	Optimizer State Reset	Detected	Rollback + state restoration
Silent	Silent Weight Drift	Detected	Alert + optional rollback
Silent	LR Spike	Instant	Rollback + LR correction

Benchmarks

All numbers below are from reproducible experiment scripts with fixed seeds.

Baseline Comparison (25 scenarios)

4 methods × 5 failure types × 5 seeds. Script: experiments/baseline_comparison.py

Method	Detection	Recovery
No Protection	52.0%	0.0%
Gradient Clipping	20.0%	0.0%
Loss-Only Monitor	80.0%	80.0%
Full ARC	100%	100%

Failure Prediction (200 scenarios)

4 architectures × 5 failure types × 5 seeds × 2 labels, 5-fold CV. Script: experiments/prediction_200_v2.py

Classifier	Accuracy	Precision	Recall	F1
Logistic Reg (12f)	95.5% ± 1.9%	100%	91.0%	0.953 ± 2.6%
MLP (12f)	97.5% ± 2.2%	100%	95.0%	0.974 ± 2.8%

Ablation Study (35 scenarios)

7 failure types × 5 seeds. Script: experiments/ablation_experiment.py

Configuration	Detection	Δ from Full
Full ARC (all components)	85.7%	---
− Weight Health	85.7%	0.0%
− Gradient Monitoring	85.7%	0.0%
− Loss Monitoring	85.7%	0.0%
− Optimizer State	71.4%	−14.3%
Loss Only (baseline)	71.4%	−14.3%

Defense in depth: Weight/gradient/loss provide redundant coverage (any one catches most failures). Optimizer state monitoring is uniquely valuable for silent failures.

Overhead (measured, CPU)

Script: experiments/overhead_measurement.py

Component	Time (ms)	% of ARC Total
Gradient Norm	0.12	9.0%
Weight Statistics	1.06	76.9%
Loss Analysis	0.01	0.6%
Checkpoint (amort.)	0.13	9.6%
Forecasting	0.06	4.1%
Total ARC	1.38	100%

Model Scale	Parameters	ARC Overhead	Relative
Small MLP	50K	0.86 ms	~60%
Medium CNN	288K	1.38 ms	~10%
Large CNN	2.5M	7.04 ms	~9.5%

Large Model Stress Test

Script: experiments/validate_claims_phase2.py

Model	Params	Failure Type	ARC Recovery	Rollbacks
NanoGPT	10M	LR Spike (50×)	✓	2
ResNet-50	25.6M	Loss Singularity	✓	1
GPT-2 Small	50M	NaN Bomb	✓	4
SD-UNet	60M	Gradient Attack	✓	4
ViT-Base	86M	Inf Nuke	✓	1
GPT-2 Medium	117M	NaN Bomb	✓	3

Theoretical Foundation

ARC integrates six mathematical frameworks, each experimentally validated:

Framework	Purpose	Validation
Fisher Information	Parameter importance weighting for recovery	11.5× separation ratio (important vs unimportant params)
Lyapunov Stability	Online stability estimation from parameter velocity	10× higher exponent under instability
FFT Oscillation Detection	Periodic behaviour detection in training dynamics	6.9× power ratio at oscillation frequency
Conformal Prediction	Distribution-free coverage guarantees for stability	≥99% empirical coverage at all target levels
Elastic Weight Consolidation	Knowledge preservation during recovery	0.4% lower post-recovery loss
Loss Landscape Analysis	Sharpness-based instability prediction	12.2× higher sharpness before failure

Known Limitations

ARC is honest about what it cannot do:

fp16 models: Models loaded in fp16 require proper mixed precision setup (autocast + GradScaler). ARC monitors training — it does not manage dtype conversion
Scale ceiling: Validated up to 355M parameters (GPT-2 Medium). Behaviour above this is not yet confirmed
First checkpoint: No checkpoint exists before the first save — very early failures are unrecoverable
Data problems: ARC cannot detect data corruption, label noise, or adversarial poisoning
Distributed training: Multi-GPU (DDP/FSDP) is not yet supported
Non-PyTorch: Only PyTorch is supported

Citation

@article{kaushik2026arc,
  title   = {ARC: Autonomous Recovery Controller for Fault-Tolerant Neural Network Training},
  author  = {Kaushik, Aryan},
  year    = {2026},
  note    = {Maharaja Agrasen Institute of Technology, New Delhi}
}

Built to make neural network training unkillable.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

4.2.2

Mar 16, 2026

4.2.1 yanked

Mar 15, 2026

4.2.0 yanked

Mar 15, 2026

4.1.0 yanked

Mar 8, 2026

4.0.1 yanked

Mar 8, 2026

4.0.0 yanked

Mar 8, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

arc_training-4.2.2.tar.gz (164.2 kB view details)

Uploaded Mar 16, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

arc_training-4.2.2-py3-none-any.whl (211.0 kB view details)

Uploaded Mar 16, 2026 Python 3

File details

Details for the file arc_training-4.2.2.tar.gz.

File metadata

Download URL: arc_training-4.2.2.tar.gz
Upload date: Mar 16, 2026
Size: 164.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.6

File hashes

Hashes for arc_training-4.2.2.tar.gz
Algorithm	Hash digest
SHA256	`d566b90518255b0d3f224718bb5c7e26fd458a7e41dd2cc42d4dffe1da7bc34d`
MD5	`b978ed2a7bba74a608961f51d26039ea`
BLAKE2b-256	`14819da2a7225469c36a9d56e60407dc1b521175f4bc68dc0c1310aac8cb37c2`

See more details on using hashes here.

File details

Details for the file arc_training-4.2.2-py3-none-any.whl.

File metadata

Download URL: arc_training-4.2.2-py3-none-any.whl
Upload date: Mar 16, 2026
Size: 211.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.6

File hashes

Hashes for arc_training-4.2.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`39f1543c9fa54a58bb1f7fa068e592cc1d63d382942a78d6725022d8eecbe7af`
MD5	`a963afa042d1b09ed01405ef425f85e2`
BLAKE2b-256	`d2f4cd8903d90d6ef59c05f7368901b87d50e08a0fb83ed151c0dc3603bdedec`

See more details on using hashes here.

arc-training 4.2.2

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

ARC

Autonomous Recovery Controller for Neural Network Training

The Problem

Quick Start

Installation

3-Line Integration

Architecture

Signal Pipeline

Failure Coverage

Benchmarks

Baseline Comparison (25 scenarios)

Failure Prediction (200 scenarios)

Ablation Study (35 scenarios)

Overhead (measured, CPU)

Large Model Stress Test

Theoretical Foundation

Known Limitations

Citation

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes