Automatic Recovery Controller - Auto-detect and recover from neural network training failures

ARC - Automatic Recovery Controller

Python 3.8+ | License: AGPL v3

Auto-detect and recover from neural network training failures.

ARC automatically detects NaN/Inf losses, gradient explosions, OOM errors, and silent failures, then recovers your training run without losing progress.
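To make the detection side concrete, here is a minimal sketch of how non-finite losses and sudden loss spikes could be flagged. It is not ARC's implementation: the `LossMonitor` class, its `window` and `spike_factor` parameters, and the returned labels are all hypothetical names chosen for illustration.

```python
import math
from collections import deque


class LossMonitor:
    """Hypothetical helper (not part of ARC): flags bad scalar loss values."""

    def __init__(self, window=20, spike_factor=10.0):
        self.history = deque(maxlen=window)  # recent healthy losses only
        self.spike_factor = spike_factor     # how large a jump counts as an explosion

    def check(self, loss):
        """Return 'nan_inf', 'explosion', or 'ok' for a scalar loss."""
        if not math.isfinite(loss):
            return "nan_inf"
        if self.history:
            baseline = sum(self.history) / len(self.history)
            if loss > self.spike_factor * max(baseline, 1e-12):
                return "explosion"
        # Only healthy values update the baseline, so one spike cannot mask the next.
        self.history.append(loss)
        return "ok"
```

In a PyTorch loop the scalar would typically come from `loss.item()`.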

Key Results

| Metric | ARC v4.0 |
|---|---|
| Recovery rate | 100% (on test suite) |
| Overhead | 27% (ARC Lite) |
| Recoveries vs. torchft | 3/3 (ARC) vs. 1/3 (torchft) |
| Max model size | 1.5B params |

🚀 Quick Start

Installation

pip install arc-training

Basic Usage (3 lines!)

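from arc import WeightRollback

# Assumes model, optimizer, and dataloader are already defined.
arc = WeightRollback(model, optimizer)

# Training loop
for batch in dataloader:
    optimizer.zero_grad()  # clear gradients left over from the previous step
    loss = model(batch)

    # ARC inspects the loss and rolls the weights back if it detects a failure
    action = arc.step(loss)

    # Skip the update for this batch when a rollback happened
    if not action.rolled_back:
        loss.backward()
        optimizer.step()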
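The loop above relies on a rollback when a step goes wrong. As a rough illustration of the general checkpoint-and-rollback idea (not ARC's actual code), the sketch below keeps a periodic deep copy of a plain dict that stands in for the model weights and restores it when the loss is non-finite. `SnapshotRollback`, `every`, and the dict-based state are hypothetical.

```python
import copy
import math


class SnapshotRollback:
    """Hypothetical illustration of checkpoint-and-rollback, not ARC's code."""

    def __init__(self, state, every=10):
        self.state = state      # mutable dict standing in for model weights
        self.every = every      # take a snapshot every `every` healthy steps
        self.steps = 0
        self._snapshot = copy.deepcopy(state)

    def step(self, loss):
        """Return True if the state was restored because the loss was non-finite."""
        if not math.isfinite(loss):
            # Restore in place so callers holding a reference see the old values.
            self.state.clear()
            self.state.update(copy.deepcopy(self._snapshot))
            return True
        self.steps += 1
        if self.steps % self.every == 0:
            self._snapshot = copy.deepcopy(self.state)
        return False
```

With PyTorch, the snapshot would presumably hold copies of `model.state_dict()` and `optimizer.state_dict()`, restored through `load_state_dict`. The snapshot interval trades memory and copy time against how many steps are repeated after a rollback.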

Lightning Integration

import pytorch_lightning as pl  # or: import lightning.pytorch as pl

from arc.integrations import ARCCallback

trainer = pl.Trainer(
    callbacks=[ARCCallback()]
)

📊 Configurations

| Config | Overhead | Use Case |
|---|---|---|
| ARC Lite | 27% | Production training |
| ARC Full | 44% | Debugging unstable runs |
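To put the overhead figures in wall-clock terms, the small helper below assumes that "overhead" means the relative increase in training time over an unmonitored baseline. The README does not define the term, so treat that reading and the function name as assumptions.

```python
def wall_clock_with_overhead(baseline_hours, overhead_pct):
    """Estimated run time, assuming overhead is a relative slowdown in percent."""
    return baseline_hours * (1.0 + overhead_pct / 100.0)


# A 10-hour baseline run under each configuration from the table above.
lite_hours = wall_clock_with_overhead(10, 27)  # roughly 12.7 hours
full_hours = wall_clock_with_overhead(10, 44)  # roughly 14.4 hours
```

Under this reading, ARC Lite pays for itself once an unprotected run would be expected to lose more than about 2.7 hours of progress per 10 hours of training.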

🛡️ What ARC Handles

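| Failure Type | Detection | Recovery | Status |
|---|---|---|---|
| NaN/Inf Loss | Yes | Yes | Validated |
| Loss Explosion | Yes | Yes | Validated |
| Gradient Explosion | Yes | Yes | Validated |
| OOM (all stages) | Yes | Yes | Validated |
| Accuracy Collapse | Yes | No | ⚠️ Detection only |
| Mode Collapse | Yes | No | ⚠️ Detection only |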
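For the OOM row, one common recovery strategy is to retry the failed step with smaller chunks. The sketch below illustrates that idea with Python's built-in `MemoryError`. It is not ARC's mechanism, and `run_with_oom_backoff`, `step_fn`, and `min_size` are hypothetical names.

```python
def run_with_oom_backoff(step_fn, batch, min_size=1):
    """Run step_fn over batch, halving the chunk size after each MemoryError.

    Returns (per-chunk results, chunk size that finally worked).
    """
    size = max(len(batch), 1)
    while True:
        try:
            chunks = [batch[i:i + size] for i in range(0, len(batch), size)]
            return [step_fn(chunk) for chunk in chunks], size
        except MemoryError:
            if size <= min_size:
                raise  # cannot shrink further, so surface the error
            size = max(min_size, size // 2)
```

On a GPU the exception to catch would be `torch.cuda.OutOfMemoryError` in recent PyTorch versions, and splitting a batch usually needs gradient accumulation to keep the effective batch size unchanged.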

📈 Benchmarks

Recovery Rate: 100% (160/160 induced failures)
Statistical Significance: p < 0.001
Models Tested: CNN, ViT, Transformer, Diffusion (up to 1.5B params)
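A 160/160 result can also be read as a lower bound on the true recovery rate. The sketch below computes the exact one-sided binomial (Clopper-Pearson) bound for the all-successes case, where it reduces to alpha^(1/n). It assumes the 160 induced failures were independent trials, which the README does not state; the function name is hypothetical.

```python
def recovery_rate_lower_bound(successes, trials, confidence=0.95):
    """One-sided exact lower bound on the success rate when every trial succeeded."""
    if trials <= 0 or successes != trials:
        raise ValueError("this closed form only covers the all-successes case")
    alpha = 1.0 - confidence
    # Solve p**trials == alpha for p: the smallest rate still consistent with the data.
    return alpha ** (1.0 / trials)


bound = recovery_rate_lower_bound(160, 160)  # a little above 0.98
```

Under that assumption, 160 recoveries out of 160 support a true recovery rate of at least about 98% at 95% confidence on failures similar to those in the test suite.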

📜 Citation

@software{arc2026,
  title={ARC: Automatic Recovery Controller for Neural Network Training},
  author={Kaushik, Aryan},
  year={2026},
  url={https://github.com/aryankaushik/arc-training}
}

📄 License

AGPL-3.0 License - see LICENSE for details.

Copyright (c) 2026 Aryan Kaushik. All rights reserved.

Download files

| File | Type | Size | Tags |
|---|---|---|---|
| arc_training-4.0.0.tar.gz | Source distribution | 156.8 kB | Source |
| arc_training-4.0.0-py3-none-any.whl | Built distribution | 207.9 kB | Python 3 |

Both files were uploaded via twine/6.2.0 CPython/3.12.6, without Trusted Publishing.

Hashes for arc_training-4.0.0.tar.gz

| Algorithm | Hash digest |
|---|---|
| SHA256 | aa7da5c397947d344486e5da8c6458f23cd48e514f5c8506f55f0e73ea8cb11b |
| MD5 | d2157a6e33c9817ece70bfe84241fb27 |
| BLAKE2b-256 | 211339562854658ba2020ae18920869330c33a2e141891bac6c52def0fe16165 |

Hashes for arc_training-4.0.0-py3-none-any.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | b51b185106d9a7179ed2eeaa2a8c7a5c86816d79c42e25e5e881f3757dec7b4f |
| MD5 | fe88dcaa0145c71804bd97e607a7ada1 |
| BLAKE2b-256 | 33ee3c9fa49657d0198ca4ff14176e3f238711c90dbb7a18edd4b640c59f58cb |
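To check a downloaded file against the SHA256 digests listed above, a short script using only the standard library is enough. This is a generic sketch; the helper names are illustrative and the path in the usage note is wherever you saved the file.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Hex SHA256 digest of a file, read in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """True if the file's SHA256 digest equals the published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_published_hash("arc_training-4.0.0.tar.gz", "aa7da5c3...")` with the full digest from the listing should return `True` for an intact download.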
