Autonomous Recovery Controller - Auto-detect and recover from neural network training failures

ARC - Autonomous Recovery Controller

Python 3.8+ | License: AGPL v3

A real-time fault-tolerance framework for neural network training.

ARC monitors training signals — gradients, loss curvature, Fisher Information — to predict and recover from failures before they crash your run. It uses a Mamba-based state-space model with Evidential Deep Learning for uncertainty-aware failure prediction.

Key Results

| Metric | ARC v4.0 |
|---|---|
| Recovery Rate | 100% on core failure types (NaN, Inf, explosion) |
| Overhead | ~35% (small models), higher on larger models |
| vs torchft | 3/3 vs 1/3 recoveries |
| Models Tested | YOLOv11, DINOv2, Llama-Style, SD-UNet (up to 33M params) |

Quick Start

Installation

pip install arc-training

Basic Usage

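from arc import Arc

# Wrap your model and optimizer (model, optimizer, and dataloader are
# assumed to be defined already, as in any PyTorch training script)
controller = Arc(model, optimizer)

# Training loop
for batch in dataloader:
    optimizer.zero_grad()  # clear gradients left over from the previous step
    loss = model(batch)    # assumes the model's forward pass returns the loss

    # ARC monitors, predicts, and recovers automatically
    action = controller.step(loss)

    # Skip the update if ARC rolled the run back to an earlier state
    if not action.rolled_back:
        loss.backward()
        optimizer.step()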
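
To make the idea of a rollback concrete, here is a minimal pure-Python sketch of the snapshot-and-restore pattern on a toy one-parameter problem. It is not ARC's implementation: `RollbackGuard`, `train`, and the injected fault are hypothetical names and choices made for illustration only, and the sketch reacts to a non-finite loss rather than predicting it.

```python
import copy
import math


class RollbackGuard:
    """Hypothetical helper (not part of ARC): keeps the last known-good state."""

    def __init__(self, state):
        self._good = copy.deepcopy(state)

    def check(self, state, loss):
        """Return (state_to_use, rolled_back)."""
        if math.isfinite(loss):
            # Healthy step: remember this state as the new restore point.
            self._good = copy.deepcopy(state)
            return state, False
        # Non-finite loss: discard the current state and restore the snapshot.
        return copy.deepcopy(self._good), True


def train(steps=20, lr=0.1, corrupt_at=5):
    """Minimize (w - 3)^2 by gradient descent, with one injected fault."""
    state = {"w": 0.0}
    guard = RollbackGuard(state)
    rollbacks = 0
    for step in range(steps):
        if step == corrupt_at:
            state["w"] = float("inf")  # simulate a blown-up weight
        loss = (state["w"] - 3.0) ** 2
        state, rolled_back = guard.check(state, loss)
        if rolled_back:
            rollbacks += 1
            continue  # skip the update for this step
        grad = 2.0 * (state["w"] - 3.0)
        state["w"] -= lr * grad
    return state["w"], rollbacks


final_w, n_rollbacks = train()
```

The loop mirrors the shape of the usage example above: the health check runs before the update, and the update is skipped whenever a rollback happened.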

Lightning Integration

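import pytorch_lightning as pl  # assumed: provides the `pl` alias used below
from arc import ArcCallback

# Attach ARC to the trainer as a callback
trainer = pl.Trainer(
    callbacks=[ArcCallback()]
)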

Configurations

| Config | Overhead | Use Case |
|---|---|---|
| ARC Lite | Lower | Production training |
| ARC Full | Higher | Debugging unstable runs |

Failure Coverage

| Failure Type | Detection | Recovery | Status |
|---|---|---|---|
| NaN/Inf Loss | Yes | Yes | Validated |
| Loss Explosion | Yes | Yes | Validated |
| Gradient Explosion | Yes | Yes | Validated |
| OOM (all stages) | Yes | Yes | Validated |
| Accuracy Collapse | Yes | Partial | Detection only |
| Mode Collapse | Yes | Partial | Detection only |
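
As a rough illustration of how the first three failure types in the table can be told apart, the sketch below uses simple threshold rules in pure Python. This is an assumption-laden toy, not ARC's detector (which the description says is a learned, uncertainty-aware predictor): `FailureDetector`, its labels, and its default thresholds are all hypothetical.

```python
import math
from collections import deque


class FailureDetector:
    """Hypothetical rule-based detector; thresholds are illustrative assumptions."""

    def __init__(self, window=20, loss_factor=10.0, grad_norm_limit=1e3):
        self.history = deque(maxlen=window)  # recent healthy loss values
        self.loss_factor = loss_factor
        self.grad_norm_limit = grad_norm_limit

    def check(self, loss, grads):
        """Return a failure label, or None if the step looks healthy."""
        if not math.isfinite(loss):
            return "nan_inf_loss"
        # L2 norm of the gradient, given as a flat list of floats.
        grad_norm = math.sqrt(sum(g * g for g in grads))
        if not math.isfinite(grad_norm) or grad_norm > self.grad_norm_limit:
            return "gradient_explosion"
        if self.history:
            baseline = sum(self.history) / len(self.history)
            # Flag a loss far above the recent running average.
            if loss > self.loss_factor * max(baseline, 1e-12):
                return "loss_explosion"
        # Only healthy steps update the baseline.
        self.history.append(loss)
        return None


detector = FailureDetector()
status = detector.check(loss=1.0, grads=[0.1, 0.2])
```

Failed steps are deliberately kept out of the running average so that a single spike does not raise the baseline and mask the next one.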

Benchmarks

Core Failure Recovery: 100% (NaN, Inf, explosion across all tests)
Modern Model Recovery: 5/8 induced failures recovered
torchft Comparison:    ARC 3/3 vs torchft 1/3
Models Tested:         YOLOv11, DINOv2-Small, Llama-Style, SD-UNet

Citation

@software{arc2026,
  title={ARC: Autonomous Recovery Controller for Neural Network Training},
  author={Kaushik, Aryan},
  year={2026},
  url={https://github.com/a-kaushik2209/ARC}
}

License

AGPL-3.0 License - see LICENSE for details.

Copyright (c) 2026 Aryan Kaushik. All rights reserved.
