Training stability tools for synthetic demo

🌊 CandorFlow

Early Warning System for Training Instabilities


⚠️ Important Notice

This repository contains a SIMPLIFIED, PUBLIC DEMONSTRATION of CandorFlow concepts.

This is NOT the full proprietary system. Many advanced features, algorithms, and optimizations are intentionally excluded. See What Is NOT Included for details.


📖 Overview

CandorFlow is a training stability monitoring and intervention system designed to detect and prevent neural network training instabilities before they cause divergence.

This public repository demonstrates:

  • A simplified stability metric λ(t) based on gradient norm variance
  • Basic threshold-based monitoring
  • Automatic checkpoint rollback on instability detection
  • Learning rate reduction for recovery
  • Minimal working examples with toy models

What is λ(t)?

The lambda metric λ(t) is a stability indicator that tracks training health over time. In this simplified demo, it measures gradient norm variance as a proxy for instability.

High λ(t) → Training is becoming unstable
Low λ(t) → Training is stable
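
As a rough illustration of the idea, the sketch below computes the variance of a short window of gradient norms for a steady run and for a run whose last step spikes. The helper name gradient_norm_variance and the sample numbers are made up for this example; the exact formula used by the package is not spelled out here, so treat this as an assumption rather than the package's implementation.

```python
import statistics


def gradient_norm_variance(grad_norms):
    """Population variance of a window of gradient norms (a stand-in for λ(t))."""
    if len(grad_norms) < 2:
        return 0.0  # not enough history to measure spread
    return statistics.pvariance(grad_norms)


# Hypothetical gradient norms recorded over five consecutive steps.
steady = [1.00, 1.02, 0.98, 1.01, 0.99]   # norms hover around 1.0
spiking = [1.00, 1.02, 0.98, 1.01, 6.00]  # the last step blows up

lambda_steady = gradient_norm_variance(steady)
lambda_spiking = gradient_norm_variance(spiking)

print(f"steady:  {lambda_steady:.4f}")
print(f"spiking: {lambda_spiking:.4f}")
```

A single large gradient norm is enough to push the variance up by several orders of magnitude, which is what makes it usable as a simple early warning signal.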


🎯 Features in This Demo

✅ What This Repo Contains (Safe/Public Demo)

  • Simplified λ(t) metric: Gradient norm variance-based instability detection
  • Basic stability controller: Threshold monitoring with rollback capabilities
  • Checkpoint management: Automatic saving and restoration
  • Learning rate adaptation: Halving on instability detection
  • Minimal training loop: Toy example with intentional instability
  • Visualization tools: Plot λ(t) curves and stability phases
  • Jupyter notebook: Interactive demo with explanations
  • Reproducible examples: Fully runnable on CPU or GPU

🚫 What Is NOT Included (Proprietary)

The full CandorFlow system includes many advanced features that are NOT in this public demo:

Core Algorithms

  • Universal scaling law for λ(t)
  • Reflexive ridge equation and closed-form solutions
  • Cross-domain invariants (works across NLP, vision, RL, etc.)
  • Jacobian spectral analysis for stability prediction
  • Multi-signal fusion (loss, gradients, activations, etc.)

Advanced Control

  • Real-time stability engine with predictive modeling
  • Reflexive decay algorithms for adaptive intervention
  • Temporal smoothing with active inference
  • Dynamic threshold adaptation based on training phase
  • HPC-optimized control loops for large-scale training

Domain Extensions

  • ECG anomaly detection applications
  • Earthquake early warning systems
  • Financial market stability monitoring
  • General-purpose time series instability detection

Performance

  • Production-grade optimizations for minimal overhead
  • Distributed training integration (DeepSpeed, FSDP, etc.)
  • Hardware acceleration (CUDA kernels, etc.)

For access to the full proprietary system, please contact us.


🚀 Installation

Prerequisites

  • Python 3.8 or higher
  • pip package manager

Option 1: Install from GitHub (Recommended)

Install CandorFlow directly from the repository:

pip install git+https://github.com/CandorSystem/CandorFlow.git

After installation, you can import CandorFlow from anywhere:

from candorflow import compute_lambda, StabilityController
from candorflow.demo import run_demo, plot_results

Option 2: Development Installation

For contributing, modifying the code, or running examples from the repository:

git clone https://github.com/CandorSystem/CandorFlow.git
cd CandorFlow
pip install -e .

This installs the package in editable mode, so changes to the source code are immediately reflected.

Option 3: Manual Setup (Not Recommended)

If you prefer not to install the package:

git clone https://github.com/CandorSystem/CandorFlow.git
cd CandorFlow
pip install -r requirements.txt
python examples/run_demo.py  # Must run from repo directory

Note: With this approach, you'll need to add the repository to your Python path or run scripts from the repository root directory.


💻 Usage

Quick Start: Run the Training Demo

After installation, run the demo:

from candorflow.demo import run_demo, plot_results

# Run the demo
results = run_demo()

# Generate plots
plot_results(results)

Or use the command-line wrapper:

python examples/run_demo.py

This will:

  1. Create a small MLP neural network
  2. Train it on synthetic data
  3. Compute λ(t) at each step
  4. Inject a synthetic instability spike at step 30
  5. Demonstrate automatic detection and rollback
  6. Generate two plots:
    • plots/lambda_curve.png - λ(t) over time with intervention markers
    • plots/stability_phases.png - Color-coded stability zones

Expected output:

✓ Saved plots to plots/
  - lambda_curve.png
  - stability_phases.png

Colab Integration

The demo is designed for easy Colab integration:

!pip install candorflow

from candorflow.demo import run_demo, plot_results
results = run_demo(steps=50, spike_step=30, threshold=2.0)
plot_results(results)

All training logic lives in candorflow.demo, so there is no need to write a training loop in Colab.


📁 Repository Structure

CandorFlow/
│
├── README.md                   # This file
├── pyproject.toml              # Package configuration (pip install)
├── requirements.txt            # Python dependencies
├── LICENSE                     # MIT License
│
├── candorflow/                 # Main package
│   ├── __init__.py            # Public API (compute_lambda, StabilityController)
│   ├── demo.py                # Complete training demo (all logic here)
│   ├── lambda_metric.py       # Simplified λ(t) computation
│   ├── stability_controller.py # Basic monitoring & intervention
│   ├── utils.py               # Checkpoint and logging utilities
│   └── version.py             # Version information
│
├── examples/                   # Runnable demos
│   └── run_demo.py            # Thin wrapper to run demo
│
├── notebooks/                  # Jupyter notebooks
│   └── CandorFlow_Demo.ipynb  # Interactive tutorial
│
└── plots/                      # Output directory for plots
    └── (generated files)

🔬 How It Works (Simplified Version)

1. Monitor Training with λ(t)

from candorflow import compute_lambda, StabilityController

# During training loop
lambda_value = compute_lambda(
    model=model,
    loss=loss,
    gradient_history=gradient_history
)

2. Automatic Intervention

controller = StabilityController(threshold=2.0)

action = controller.update(
    lambda_value=lambda_value,
    model=model,
    optimizer=optimizer,
    step=step
)

if action["action"] == "rollback":
    print("Instability detected - rolling back to stable checkpoint")

3. Training Continues Safely

The controller automatically:

  • Saves checkpoints when training is stable
  • Detects when λ(t) exceeds threshold
  • Rolls back to last stable state
  • Reduces learning rate
  • Resumes training
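
The cycle above can be sketched without any deep learning framework. The ToyController class below is hypothetical and is not the StabilityController shipped in the package: it keeps the checkpoint in memory, represents the training state as a plain dict, and uses "continue" and "warning" as assumed action names (only "rollback" appears in the examples above).

```python
import copy


class ToyController:
    """Hypothetical threshold controller; not the candorflow implementation."""

    def __init__(self, threshold=2.0, lr_reduction_factor=0.5):
        self.threshold = threshold
        self.lr_reduction_factor = lr_reduction_factor
        self.checkpoint = None  # last state seen while stable
        self.rollbacks = 0

    def update(self, lambda_value, state, step):
        """Return (action, state); `state` is a dict with 'weights' and 'lr'.

        `step` mirrors the documented signature and is unused in this sketch.
        """
        if lambda_value <= self.threshold:
            # Stable: remember this state so we can return to it later.
            self.checkpoint = copy.deepcopy(state)
            return "continue", state
        if self.checkpoint is None:
            # Unstable before any stable step was seen: nothing to restore.
            return "warning", state
        # Unstable: restore the last stable state and shrink the learning rate.
        restored = copy.deepcopy(self.checkpoint)
        restored["lr"] *= self.lr_reduction_factor
        self.checkpoint = copy.deepcopy(restored)
        self.rollbacks += 1
        return "rollback", restored


controller = ToyController(threshold=2.0)
state = {"weights": [0.1, 0.2], "lr": 0.01}

action, state = controller.update(0.3, state, step=1)  # stable, checkpoint saved
state["weights"] = [9.0, -9.0]                         # pretend an update diverged
action, state = controller.update(4.0, state, step=2)  # above threshold

print(action, state)
```

Storing the reduced learning rate back into the checkpoint means that repeated rollbacks keep shrinking the step size instead of retrying with the same one.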

📊 Example Results

After running the demo, you'll find two plots in plots/:

Lambda Curve with Interventions:

  • Blue line: λ(t) stability metric over time
  • Purple dashed line: Instability threshold
  • Orange markers: Rollback + LR reduction events
  • Red markers: Warnings

Stability Phases:

  • Green zone: Stable training
  • Orange zone: Warning (approaching threshold)
  • Red zone: Unstable (intervention triggered)
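
The three zones can be thought of as a simple mapping from λ(t) to a color. The function below is a sketch under assumptions: the name stability_zone is invented for this example, and the warning boundary at 80% of the threshold (warning_fraction) is a guess, since the cutoff used by the demo plots is not stated here.

```python
def stability_zone(lambda_value, threshold=2.0, warning_fraction=0.8):
    """Map a λ(t) value to a plot zone; warning_fraction is an assumed setting."""
    if lambda_value > threshold:
        return "red"      # unstable: intervention triggered
    if lambda_value >= warning_fraction * threshold:
        return "orange"   # warning: approaching threshold
    return "green"        # stable training


zones = [stability_zone(v) for v in (0.3, 1.7, 2.5)]
print(zones)
```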

🧪 Running Tests

The demo includes built-in validation:

# Run the training demo and generate plots (includes self-checks)
python examples/run_demo.py

📚 Documentation

API Reference

compute_lambda(model, loss, history_window=10, gradient_history=None)

Compute simplified λ(t) stability metric.

Parameters:

  • model (torch.nn.Module): Neural network model
  • loss (torch.Tensor): Current loss value (with grad_fn)
  • history_window (int): Number of past gradient norms to track
  • gradient_history (list): List to store gradient history (modified in-place)

Returns:

  • lambda_value (float): Stability metric (higher = more unstable)
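
The in-place behavior of gradient_history and the role of history_window can be illustrated with a framework-free sketch. The update_lambda function below is hypothetical: it takes an already computed gradient norm instead of a model and loss, and it assumes the metric is the population variance of the most recent history_window norms.

```python
import statistics


def update_lambda(grad_norm, gradient_history, history_window=10):
    """Append grad_norm to gradient_history in place and return the window variance."""
    gradient_history.append(grad_norm)
    # Trim in place so the caller's list never grows past the window.
    del gradient_history[:-history_window]
    if len(gradient_history) < 2:
        return 0.0  # a single sample has no spread
    return statistics.pvariance(gradient_history)


history = []  # owned by the caller and reused across training steps
for norm in [1.0, 1.1, 0.9, 1.0, 5.0]:
    lam = update_lambda(norm, history, history_window=3)

print(history, lam)
```

Because the list is mutated rather than replaced, the caller keeps one list for the whole run and passes the same object on every step.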

StabilityController(threshold, checkpoint_dir, lr_reduction_factor)

Training stability monitor and intervention system.

Parameters:

  • threshold (float): λ(t) value above which to trigger intervention
  • checkpoint_dir (str): Directory for saving checkpoints
  • lr_reduction_factor (float): Factor to reduce LR by (default: 0.5)

Methods:

  • update(lambda_value, model, optimizer, step): Update controller and take action if needed
  • get_summary(): Get training statistics

🤝 Contributing

This is a demonstration repository. Contributions are welcome for:

  • Bug fixes in demo code
  • Documentation improvements
  • Additional visualization examples
  • Educational content

Note: This repo intentionally excludes proprietary algorithms. Please do not submit PRs attempting to implement advanced features from the full system.


📧 Contact

For questions about this demo:

  • Open an issue on GitHub

For inquiries about the full proprietary CandorFlow system:

  • Email: [your-email@example.com]
  • Website: [https://candorflow.example.com]
  • Patents: [Patent application numbers]

📄 License

This simplified demonstration code is released under the MIT License. See LICENSE for details.

Important: The full CandorFlow system, including its proprietary algorithms and commercial applications, is NOT covered by this license. Please contact us for commercial licensing.


📖 Citation

If you use this demo code in your research or project, please cite:

@software{candorflow2025,
  title={CandorFlow: Training Stability Monitoring System},
  author={[Your Name]},
  year={2025},
  url={https://github.com/CandorSystem/CandorFlow},
  note={Simplified public demonstration version}
}

🙏 Acknowledgments

This simplified demo is provided for educational purposes to demonstrate basic concepts in training stability monitoring.

The full CandorFlow system represents significant research and development investment and is protected by pending patents.


⭐ Star History

If you find this demo helpful, please consider starring the repository!


Built for responsible and safe AI.


Project Support & Affiliations

CandorFlow and the Candor Systems project are supported by several leading industry startup programs:

  • NVIDIA Inception
  • Google Cloud Startup
  • AWS Activate

These affiliations provide cloud credits, compute resources, and technical support for ongoing research and development.

Note: These affiliations indicate participation in early-stage startup support programs and do not imply endorsement of CandorFlow's algorithms or proprietary systems.
