Causal inference engine for deep learning training dynamics

NeuralDBG

A causal inference engine for deep learning training that provides structured explanations of neural network training failures. Understand why your model failed during training through semantic analysis and abductive reasoning, not raw tensor inspection.

License: MIT | Python 3.8+

Overview

NeuralDBG treats training as a semantic trace of learning dynamics rather than a black box. It extracts meaningful events and provides causal hypotheses about training failures, enabling researchers to:

  • Identify gradient health transitions (stable -> vanishing/saturated)
  • Detect activation regime shifts (normal -> saturated/dead)
  • Detect optimizer instability (loss plateaus, spikes, divergence)
  • Catch data anomalies (NaN, Inf, distribution shifts)
  • Track propagation of instabilities through network layers
  • Generate ranked causal explanations for training failures

Unlike traditional monitoring tools (TensorBoard, Weights & Biases), NeuralDBG focuses on causal inference rather than metric tracking.

Key Features

  • Semantic Event Extraction: Detects meaningful transitions in training dynamics
  • Causal Compression: Identifies first occurrences and propagation patterns
  • Post-Mortem Reasoning: Provides ranked hypotheses about failure causes
  • Optimizer Instability Detection: Tracks loss plateaus, spikes, and divergence
  • Data Anomaly Detection: Catches NaN, Inf, and distribution shifts in inputs
  • Event Collapsing: Merges sequential events into summary traces
  • Compiler-Aware: Operates at module boundaries to survive torch.compile
  • Non-Invasive: Wraps an existing PyTorch training loop in a context manager, leaving the model and optimizer code unchanged
  • Minimal API: Focused on explanations, not raw data dumps
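The idea of a semantic event, a transition between regimes rather than a raw metric value, can be illustrated in plain Python. This is a minimal sketch and not NeuralDBG's implementation: the function names and both thresholds are assumptions chosen for the example.

```python
from typing import List, Tuple

# Hypothetical thresholds; sensible values depend on the model and are
# not taken from NeuralDBG.
VANISHING_BELOW = 1e-6
EXPLODING_ABOVE = 1e3


def classify(grad_norm: float) -> str:
    """Map a gradient norm to a coarse health regime."""
    if grad_norm < VANISHING_BELOW:
        return "vanishing"
    if grad_norm > EXPLODING_ABOVE:
        return "exploding"
    return "stable"


def extract_transitions(grad_norms: List[float]) -> List[Tuple[int, str, str]]:
    """Return (step, old_regime, new_regime) for every regime change."""
    transitions = []
    previous = None
    for step, norm in enumerate(grad_norms):
        regime = classify(norm)
        if previous is not None and regime != previous:
            transitions.append((step, previous, regime))
        previous = regime
    return transitions


norms = [0.5, 0.3, 1e-4, 1e-7, 1e-8, 1e-8]
print(extract_transitions(norms))  # [(3, 'stable', 'vanishing')]
```

Six raw measurements reduce to a single event: the step at which the layer left the stable regime.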

Quick Start

Installation

pip install neuraldbg

Contributor Onboarding

To set up the repository as a new collaborator, run:

make bootstrap

This one-command setup:

  • verifies or recreates .venv
  • installs runtime, development, and MLflow/MLOps dependencies
  • activates the repository git hooks
  • installs the project in editable mode

Then activate the environment:

source .venv/bin/activate

Validation sync is intentionally opt-in because it depends on VALIDATION_BUNDLE_TOKEN and rewrites protected local files:

bash scripts/bootstrap.sh --with-validation-sync

Docker Development (Hermetic Workspace)

Use Docker to keep a reproducible local environment across machines and contributors.

# Build image
docker-compose build

# Start the dev container (one-command startup)
docker-compose up -d

# Open a shell in the running workspace
docker-compose exec neuraldbg-dev bash

Equivalent shortcuts via Makefile:

make build
make up
make shell

Run tests inside Docker:

docker-compose run --rm neuraldbg-dev bash -lc "pytest"

Or:

make test-docker

Persistent volumes are mounted to:

  • /data (host: ./data)
  • /models (host: ./models)
  • /outputs (host: ./outputs)

Stop containers:

docker-compose down

Basic Usage

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from neuraldbg import NeuralDbg

# Your existing model and training setup
model = nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Linear(5, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

# Placeholder data so the example runs end to end; substitute your own DataLoader
dataset = TensorDataset(torch.randn(64, 10), torch.randn(64, 1))
dataloader = DataLoader(dataset, batch_size=8)

# Wrap your training loop
with NeuralDbg(model) as dbg:
    for step, (inputs, targets) in enumerate(dataloader):
        optimizer.zero_grad()

        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

        # Events are extracted automatically

# After a training failure, query for explanations
explanations = dbg.explain_failure()
if explanations:  # guard against an empty result when nothing failed
    # Illustrative output:
    # "Gradient vanishing originated in layer 'linear1' at step 234, likely due to LR × activation mismatch (confidence: 0.87)"
    print(explanations[0])

Inference API

# Get ranked causal hypotheses for the failure
hypotheses = dbg.get_causal_hypotheses()

# Query specific causal chains
chain = dbg.trace_causal_chain('vanishing_gradients')

# Check for coupled failures
couplings = dbg.detect_coupled_failures()
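To make the notion of a causal chain concrete, the sketch below orders layers by the step at which a given event type first appeared in each. It is an illustration in plain Python, not the logic behind trace_causal_chain: the helper first_occurrences, the tuple layout, and the sample events are all assumptions.

```python
from typing import Dict, List, Tuple

# (step, layer, event_type) tuples; layer names and steps are made up.
events: List[Tuple[int, str, str]] = [
    (240, "linear2", "vanishing_gradients"),
    (234, "linear1", "vanishing_gradients"),
    (236, "linear1", "vanishing_gradients"),
    (120, "linear2", "saturated_activations"),
]


def first_occurrences(
    events: List[Tuple[int, str, str]], event_type: str
) -> List[Tuple[int, str]]:
    """Return (first_step, layer) pairs for one event type, earliest first."""
    first: Dict[str, int] = {}
    for step, layer, kind in events:
        if kind == event_type and (layer not in first or step < first[layer]):
            first[layer] = step
    return sorted((step, layer) for layer, step in first.items())


chain = first_occurrences(events, "vanishing_gradients")
print(chain)  # [(234, 'linear1'), (240, 'linear2')]
```

The earliest entry is a candidate origin, and the later entries suggest the order in which the problem reached other layers. Temporal order alone does not establish causation, which is why the result is best read as a hypothesis.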

Optimizer Instability Detection

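# Reuses model, optimizer, and criterion from Basic Usage.
# The batch and step count below are placeholders.
inputs, targets = torch.randn(8, 10), torch.randn(8, 1)
num_steps = 100

with NeuralDbg(model) as dbg:
    for step in range(num_steps):
        dbg.step = step  # record the current step index on the debugger
        optimizer.zero_grad()  # clear gradients left over from the previous step
        output = model(inputs)
        loss = criterion(output, targets)
        loss.backward()

        # Feed loss values for optimizer instability detection
        dbg.record_loss(loss.item())

        optimizer.step()

# Detect loss plateaus, spikes, or divergence
hypotheses = dbg.explain_failure("optimizer_instability")
for h in hypotheses:
    print(h.description)  # "Loss spike detected at step 50..."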
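The three loss patterns named above can be told apart with simple checks on a recent window of loss values. The following is a sketch under stated assumptions, not NeuralDBG's detector: the function name detect_instability, the window size, and both thresholds are hypothetical, and the loss is assumed to be non-negative.

```python
import math
from statistics import mean
from typing import List, Optional


def detect_instability(
    losses: List[float],
    window: int = 5,            # hypothetical history length
    spike_ratio: float = 3.0,   # hypothetical: latest loss vs. recent mean
    plateau_tol: float = 1e-4,  # hypothetical: max spread counted as flat
) -> Optional[str]:
    """Classify the most recent loss against the preceding `window` values."""
    if len(losses) <= window:
        return None  # not enough history yet
    latest = losses[-1]
    if math.isnan(latest) or math.isinf(latest):
        return "divergence"
    history = losses[-window - 1:-1]
    if latest > spike_ratio * mean(history):
        return "spike"
    recent = history + [latest]
    if max(recent) - min(recent) < plateau_tol:
        return "plateau"
    return None


print(detect_instability([1.0, 0.9, 0.8, 0.7, 0.6, 5.0]))  # spike
print(detect_instability([0.5] * 6))                        # plateau
```

Checking for non-finite values first keeps NaN and Inf out of the mean and spread calculations for the latest step.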

Data Anomaly Detection

Data anomalies (NaN, Inf, distribution shifts) are detected automatically from layer inputs during the forward pass, so no extra API call is needed:

with NeuralDbg(model) as dbg:
    # ... training loop ...
    pass

# Check for data issues
hypotheses = dbg.explain_failure("data_anomaly")
for h in hypotheses:
    print(h.description)  # "NaN values detected in input to layer 'linear1'..."
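As an illustration of what such input checks involve, the sketch below flags NaN and Inf values and applies a simple z-test on the batch mean as a stand-in for distribution-shift detection. It works on a flat list of floats rather than tensors, and the function name check_batch, the reference statistics, and the threshold are assumptions, not part of NeuralDBG.

```python
import math
from statistics import mean
from typing import List, Optional


def check_batch(
    values: List[float],
    ref_mean: float,           # mean observed on earlier, trusted batches
    ref_std: float,            # standard deviation on those batches
    z_threshold: float = 4.0,  # hypothetical cut-off
) -> Optional[str]:
    """Return the first anomaly found in a batch, or None if it looks clean."""
    if not values:
        return None
    if any(math.isnan(v) for v in values):
        return "nan"
    if any(math.isinf(v) for v in values):
        return "inf"
    # Distance of the batch mean from the reference mean, measured in
    # standard errors of the mean.
    std_error = ref_std / math.sqrt(len(values))
    if std_error > 0 and abs(mean(values) - ref_mean) / std_error > z_threshold:
        return "distribution_shift"
    return None


print(check_batch([0.1, float("nan")], ref_mean=0.0, ref_std=1.0))  # nan
print(check_batch([5.0] * 16, ref_mean=0.0, ref_std=1.0))           # distribution_shift
```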

Event Collapsing

Compress sequential events in the same layer into summary traces:

# Get compressed event timeline
# (the leading underscore marks _collapse_events as a private method)
collapsed = dbg._collapse_events()
print(f"{len(dbg.events)} raw events -> {len(collapsed)} collapsed")
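The merging step can be sketched independently of the library. This minimal version assumes events are (step, layer, event_type) tuples and merges consecutive events that share a layer and type. The tuple layouts and the function name collapse are assumptions for illustration, and the actual collapsing rules in NeuralDBG may differ.

```python
from typing import List, Tuple

Event = Tuple[int, str, str]             # (step, layer, event_type)
Trace = Tuple[str, str, int, int, int]   # (layer, event_type, first_step, last_step, count)


def collapse(events: List[Event]) -> List[Trace]:
    """Merge runs of consecutive events that share a layer and event type."""
    traces: List[Trace] = []
    for step, layer, kind in sorted(events):  # process in step order
        if traces and traces[-1][0] == layer and traces[-1][1] == kind:
            same_layer, same_kind, first, _, count = traces[-1]
            traces[-1] = (same_layer, same_kind, first, step, count + 1)
        else:
            traces.append((layer, kind, step, step, 1))
    return traces


raw = [
    (10, "linear1", "vanishing"),
    (11, "linear1", "vanishing"),
    (12, "linear1", "vanishing"),
    (13, "linear2", "dead"),
    (14, "linear1", "vanishing"),
]
collapsed = collapse(raw)
print(f"{len(raw)} raw events -> {len(collapsed)} collapsed")  # 5 raw events -> 3 collapsed
```

A run is broken as soon as a different layer or event type intervenes, so the event at step 14 starts a new trace instead of extending the first one.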

Architecture

Core Components

  • Semantic Event Extractor: Detects meaningful transitions in learning dynamics
  • Causal Compressor: Identifies patterns and propagation in training failures
  • Post-Mortem Reasoner: Generates ranked hypotheses about failure causes
  • Compiler-Aware Monitor: Operates at safe boundaries for optimization compatibility

Event Types

Event Type                 | Source                 | Detects
gradient_health_transition | Backward hooks         | Vanishing, exploding, saturated gradients
activation_regime_shift    | Forward hooks          | Dead neurons, saturated activations
optimizer_instability      | record_loss()          | Loss plateaus, spikes, divergence
data_anomaly               | Forward hooks (inputs) | NaN, Inf, distribution shifts

Event Structure

Each semantic event represents:

  • Transition type (gradient_health, activation_regime, optimizer_instability, data_anomaly)
  • Layer/parameter identifier
  • Step range of occurrence
  • Confidence score
  • Causal metadata (propagation patterns, coupled failures)
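The fields listed above map naturally onto a small record type. The dataclass below is a hypothetical sketch of such a record. The class name SemanticEvent, the field names, and the validation rules are assumptions and do not describe NeuralDBG's internal classes.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass(frozen=True)
class SemanticEvent:
    transition_type: str         # e.g. "gradient_health", "data_anomaly"
    layer: str                   # layer or parameter identifier
    step_range: Tuple[int, int]  # (first_step, last_step), inclusive
    confidence: float            # assumed here to lie between 0.0 and 1.0
    metadata: Dict[str, object] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Reject records that could not describe a real event.
        start, end = self.step_range
        if start > end:
            raise ValueError("step_range must be (first_step, last_step)")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0.0 and 1.0")


event = SemanticEvent(
    transition_type="gradient_health",
    layer="linear1",
    step_range=(234, 250),
    confidence=0.87,
    metadata={"propagated_to": ["linear2"]},
)
print(event.layer, event.step_range)  # linear1 (234, 250)
```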

Target Users

  • ML Researchers seeking causal explanations for training failures
  • PhD Students analyzing learning dynamics in novel architectures
  • Research Engineers understanding optimization instabilities

Not intended for production monitoring, metric tracking, or no-code users.

Supported Failure Types

  • vanishing_gradients -- Root cause + saturation coupling
  • exploding_gradients -- First layer to explode
  • dead_neurons -- Neuron death in activation layers
  • saturated_activations -- Activation saturation patterns
  • optimizer_instability -- Loss plateaus, spikes, divergence (with gradient cross-reference)
  • data_anomaly -- NaN/Inf/distribution shift in inputs

Limitations (MVP Scope)

  • PyTorch only
  • Focus on semantic events, not tensor inspection
  • Command-line interface only
  • Compiler-aware (torch.compile compatible)

Contributing

This is an MVP focused on proving the concept of causal inference for training dynamics. Contributions should align with the core mission of providing structured explanations for training failures.

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure all tests pass
  5. Submit a pull request

License

MIT License - see LICENSE.md for details.

Citation

If you use NeuralDBG in your research, please cite:

@misc{neuraldbg2025,
  title={NeuralDBG: A Causal Inference Engine for Deep Learning Training Dynamics},
  author={SENOUVO Jacques-Charles Gad},
  year={2025},
  url={https://github.com/Lemniscate-world/Neural}
}
