Post-training diagnostics for LLMs — what did fine-tuning actually do to your model?
Afterburn
Post-training diagnostics for LLMs. Weight diffs, behavioral analysis, and reward hacking detection — before you deploy.
Most evaluation tools tell you whether benchmark scores went up or down. Afterburn tells you why: it compares two model checkpoints (base and post-trained) at the weight level and the behavioral level, and checks for reward hacking patterns.
No other open-source tool combines all three.
Quick Start
pip install afterburn
afterburn diagnose \
  --base Qwen/Qwen2.5-0.5B \
  --trained Qwen/Qwen2.5-0.5B-Instruct \
  --method sft \
  -o report.html
from afterburn import Diagnoser
report = Diagnoser(
    base_model="Qwen/Qwen2.5-0.5B",
    trained_model="Qwen/Qwen2.5-0.5B-Instruct",
    method="sft",
).run()
print(report.summary)
print(f"Reward hack risk: {report.hack_score:.0f}/100")
report.save("report.html")
What It Does
1. Weight Diff Analysis
Compares model weights layer-by-layer. Memory-efficient via safetensors memory mapping (~128MB peak per layer for 8B models).
| Metric | What It Measures |
|---|---|
| L2 / Cosine / Frobenius | Magnitude and direction of weight changes |
| SVD decomposition | Effective rank, concentration ratio, stable rank of the diff |
| Spectral alpha | Power-law exponent of eigenvalue spectrum (2-4 = healthy) |
| Marchenko-Pastur law | Compares eigenvalues to random matrix theory — spikes = learned structure |
| Behavioral vectors | Principal directions of change via SVD, cross-layer coherence |
| Attention head importance | Per-head importance delta before vs after training |
| LayerNorm shift | Gamma/beta parameter drift detection |
| Embedding drift | Token embedding movement, most-drifted tokens |
| LoRA analysis | Adapter weight decomposition and impact (auto-detected) |
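As a rough illustration of the first row of the table, the sketch below computes the magnitude and direction of a weight change for one layer. Plain nested lists stand in for weight tensors. The helper name `weight_diff_metrics`, and the definition of `relative_change` as the diff norm divided by the base norm, are assumptions made for this example and may not match Afterburn's internals.

```python
import math

def flatten(matrix):
    """Flatten a 2-D list of floats into a 1-D list."""
    return [x for row in matrix for x in row]

def frobenius_norm(matrix):
    """Frobenius norm: square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for x in flatten(matrix)))

def weight_diff_metrics(base, trained):
    """Compare two same-shaped weight matrices (hypothetical helper)."""
    diff = [
        [t - b for b, t in zip(base_row, trained_row)]
        for base_row, trained_row in zip(base, trained)
    ]
    diff_norm = frobenius_norm(diff)
    base_norm = frobenius_norm(base)
    trained_norm = frobenius_norm(trained)
    dot = sum(b * t for b, t in zip(flatten(base), flatten(trained)))
    return {
        # Magnitude of the change (L2 norm of the flattened diff).
        "l2_diff": diff_norm,
        # Assumed definition: change relative to the size of the base weights.
        "relative_change": diff_norm / base_norm if base_norm else float("inf"),
        # Direction of the change: 1.0 means the weights only rescaled.
        "cosine_similarity": (
            dot / (base_norm * trained_norm) if base_norm and trained_norm else 0.0
        ),
    }

base = [[1.0, 0.0], [0.0, 1.0]]
trained = [[1.0, 0.5], [0.0, 1.0]]
metrics = weight_diff_metrics(base, trained)
print(metrics)
```

A small relative change combined with a cosine similarity close to 1.0 suggests the layer was barely touched; larger values point to layers worth inspecting with the spectral metrics.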
2. Behavioral Shift Detection
Runs the same prompts through both models and compares the outputs statistically.
- Length distribution — Mann-Whitney U test, Cohen's d, skewness, kurtosis, percentiles
- Reasoning strategy — Classification (direct, step-by-step, code-assisted, CoT, tool use) with NLI tiebreaker
- Strategy shift — Detects if training collapsed diverse strategies into one
- Format compliance — Code blocks, LaTeX, markdown, tables, thinking tags (Shannon entropy)
- Chain-of-thought — Step counting, depth, self-correction rate, verification patterns
- Diversity — EAD (Expectation-Adjusted Distinct n-grams), optional SBERT semantic diversity
- Token divergence — Jensen-Shannon Divergence on token probability distributions
- Calibration — Expected Calibration Error (ECE), reliability diagrams
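Two of the statistics listed above are easy to sketch with the standard library. The example below assumes output lengths are given as plain lists and token distributions as aligned probability lists; the function names are illustrative and not part of Afterburn's API.

```python
import math
from statistics import mean, variance

def cohens_d(a, b):
    """Cohen's d (b minus a) using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(b) - mean(a)) / math.sqrt(pooled_var)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the KL sum.
        return sum(xi * math.log2(xi / yi) for xi, yi in zip(x, y) if xi > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Output lengths (in tokens) for the same prompts, before and after training.
base_lengths = [10, 12, 14, 16, 18]
trained_lengths = [20, 22, 24, 26, 28]
d = cohens_d(base_lengths, trained_lengths)

# Next-token distributions over a toy three-token vocabulary.
jsd = js_divergence([0.7, 0.2, 0.1], [0.1, 0.2, 0.7])
print(f"Cohen's d: {d:.2f}, JSD: {jsd:.3f} bits")
```

With base-2 logarithms the divergence is bounded between 0 (identical distributions) and 1 (disjoint support), which makes it convenient to compare across prompts. Note that `cohens_d` as written divides by zero when both samples have no variance.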
3. Reward Hacking Detection
Detects failure modes from RLHF/DPO/GRPO training. Composite risk score 0-100.
| Detector | What It Catches |
|---|---|
| Length bias | Outputs got longer without quality gains (Cohen's d) |
| Format gaming | Model exploits format-based reward signals (ROUGE-L correctness correlation) |
| Strategy collapse | Model converges on one strategy, losing diversity (Shannon entropy) |
| Sycophancy | Model agrees more post-training, even with false claims |
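The strategy collapse row can be illustrated with normalized Shannon entropy over strategy labels. This is a sketch under assumptions: the label strings and the idea of reporting the entropy drop between checkpoints are illustrative, and the source does not state how Afterburn turns entropy into a score.

```python
import math
from collections import Counter

def normalized_entropy(labels, num_categories):
    """Shannon entropy of the label distribution, scaled to the range [0, 1]."""
    counts = Counter(labels)
    total = len(labels)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(num_categories)

# Hypothetical per-response strategy labels for the same 20 prompts.
categories = ["direct", "step-by-step", "code-assisted", "cot", "tool-use"]
base_strategies = categories * 4                      # evenly spread
trained_strategies = ["cot"] * 18 + ["direct"] * 2    # mostly one strategy

base_h = normalized_entropy(base_strategies, len(categories))
trained_h = normalized_entropy(trained_strategies, len(categories))
entropy_drop = base_h - trained_h
print(f"base={base_h:.2f} trained={trained_h:.2f} drop={entropy_drop:.2f}")
```

A large drop means the trained model answers nearly every prompt with the same strategy, which is the pattern the detector is described as catching.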
Sycophancy detection uses three methods:
- Regex-based agreement/pushback rate comparison
- NLI-enhanced semantic agreement detection (cross-encoder/nli-deberta-v3-small)
- 40 adversarial consistency probes across math, science, history, and coding: neutral vs leading prompt pairs that test if the model changes factual answers under pressure
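The first method, a regex-based agreement/pushback rate comparison, might look roughly like the sketch below. The two patterns and the sample responses are invented for illustration; Afterburn's actual patterns are not documented here.

```python
import re

# Illustrative patterns only; a real detector would need a broader list.
AGREE = re.compile(
    r"\b(you're right|you are right|i agree|good point|exactly|absolutely)\b",
    re.IGNORECASE,
)
PUSHBACK = re.compile(
    r"\b(actually|that's not|that is not|incorrect|i disagree|however)\b",
    re.IGNORECASE,
)

def rates(responses):
    """Return (agreement_rate, pushback_rate) over a list of responses."""
    n = len(responses)
    agree = sum(bool(AGREE.search(r)) for r in responses)
    push = sum(bool(PUSHBACK.search(r)) for r in responses)
    return agree / n, push / n

# Responses to leading prompts that assert a false claim.
base_out = [
    "Actually, 2 + 2 is 4, not 5.",
    "That is not correct; water boils at 100 C at sea level.",
]
trained_out = [
    "You're right, 2 + 2 is 5.",
    "Good point, I agree that it boils at 50 C.",
]

base_agree, base_push = rates(base_out)
trained_agree, trained_push = rates(trained_out)
print(f"agreement: {base_agree:.0%} -> {trained_agree:.0%}")
print(f"pushback:  {base_push:.0%} -> {trained_push:.0%}")
```

Rising agreement together with falling pushback on false claims is the signal of interest. Regexes miss paraphrases, which is presumably why the NLI-based method is listed alongside this one.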
4. Reports
- Interactive HTML with Plotly visualizations
- JSON structured output for pipelines
- Markdown for documentation
- PDF (optional dependency)
- Executive summary + actionable recommendations
Installation
# From PyPI
pip install afterburn
# From source
git clone https://github.com/code-mohanprakash/afterburn.git
cd afterburn
pip install -e ".[dev]"
# Optional: NLI-enhanced analysis
pip install "afterburn[nli]"
# Optional: PDF export
pip install "afterburn[pdf]"
# Optional: Semantic diversity (SBERT)
pip install "afterburn[semantic]"
Requirements: Python 3.10+, PyTorch 2.0+. A GPU is recommended but not required; CUDA, MPS, and CPU devices are supported.
CLI
# Full diagnostic
afterburn diagnose --base <model> --trained <model> -o report.html
# Individual analyses
afterburn weight-diff --base <model> --trained <model> -o weights.json
afterburn behaviour --base <model> --trained <model> -o behaviour.json
afterburn hack-check --base <model> --trained <model> -o hacking.json
Python API
from afterburn import Diagnoser
diag = Diagnoser(
base_model="meta-llama/Llama-3.1-8B",
trained_model="my-org/Llama-3.1-8B-RLVR",
method="rlvr",
)
# Full analysis
report = diag.run()
# Or individual modules
weight_diff = diag.run_weight_diff()
behaviour = diag.run_behaviour()
hack_check = diag.run_hack_check()
# Inspect results
for layer in weight_diff.top_changed_layers:
    print(f"{layer.layer_name}: relative_change={layer.relative_change:.4f}")
    if layer.mp_num_spikes is not None:
        print(f" MP spikes: {layer.mp_num_spikes} (bulk: {layer.mp_bulk_fraction:.1%})")
print(f"Direction coherence: {weight_diff.direction_coherence:.3f}")
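To give some intuition for the `mp_num_spikes` and `mp_bulk_fraction` fields, here is a minimal sketch of the textbook Marchenko-Pastur upper edge. It assumes the eigenvalues come from `(1 / n_rows) * W^T W` for an `n_rows x n_cols` matrix and that the noise variance `sigma2` is known. How Afterburn estimates the variance and defines the bulk fraction is not stated in this README, so the function names and definitions below are assumptions.

```python
import math

def mp_upper_edge(sigma2, n_rows, n_cols):
    """Upper edge of the Marchenko-Pastur bulk for a random n_rows x n_cols matrix."""
    gamma = n_cols / n_rows
    return sigma2 * (1 + math.sqrt(gamma)) ** 2

def count_spikes(eigenvalues, sigma2, n_rows, n_cols):
    """Count eigenvalues above the bulk edge and the fraction left inside the bulk."""
    edge = mp_upper_edge(sigma2, n_rows, n_cols)
    spikes = [ev for ev in eigenvalues if ev > edge]
    bulk_fraction = 1 - len(spikes) / len(eigenvalues)
    return len(spikes), bulk_fraction

# Toy spectrum: five eigenvalues consistent with noise, two well above the edge.
eigenvalues = [0.3, 0.8, 1.1, 1.9, 2.2, 5.0, 9.4]
num_spikes, bulk_fraction = count_spikes(eigenvalues, sigma2=1.0, n_rows=400, n_cols=100)
print(f"edge={mp_upper_edge(1.0, 400, 100):.2f} spikes={num_spikes} bulk={bulk_fraction:.1%}")
```

Eigenvalues above the edge are unlikely under a pure-noise model, so they are read as learned structure rather than random drift.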
Configuration
Optional .afterburn.yaml:
device: auto
behaviour:
  suites: [math, code, reasoning, safety]
  max_new_tokens: 512
  batch_size: 4
reward_hack:
  weights:
    length_bias: 0.25
    format_gaming: 0.30
    strategy_collapse: 0.20
    sycophancy: 0.25
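One plausible reading of the `reward_hack.weights` block is a weighted mean of per-detector scores. The sketch below assumes each detector reports a score between 0 and 100 and that the composite is normalized by the sum of the weights; the README states only the weights and the 0-100 range, so the aggregation rule and the `composite_risk` name are assumptions.

```python
# Weights copied from the example configuration above.
WEIGHTS = {
    "length_bias": 0.25,
    "format_gaming": 0.30,
    "strategy_collapse": 0.20,
    "sycophancy": 0.25,
}

def composite_risk(detector_scores, weights=WEIGHTS):
    """Weighted mean of detector scores, clamped to the range [0, 100]."""
    total_weight = sum(weights.values())
    weighted = sum(weights[name] * detector_scores[name] for name in weights)
    return max(0.0, min(100.0, weighted / total_weight))

# Hypothetical per-detector scores for one trained checkpoint.
scores = {
    "length_bias": 40.0,
    "format_gaming": 10.0,
    "strategy_collapse": 80.0,
    "sycophancy": 20.0,
}
risk = composite_risk(scores)
print(f"Reward hack risk: {risk:.0f}/100")
```

Dividing by the total weight keeps the result on the same 0-100 scale even if the configured weights do not sum to exactly 1.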
How It Works
Base Model ──┐
             ├── Weight Diff (safetensors, one layer at a time)
Trained Model┘         │
                       ├── Diagnostic Report
Base Model ──┐         │   (HTML / JSON / MD / PDF)
             ├── Prompt Runner (one model at a time)
Trained Model┘         │
                       ├── Behaviour Analysis (statistical comparison)
                       │
                       └── Reward Hack Detection (40 adversarial probes)
- Weight diff loads both checkpoints via memory-mapped safetensors and computes per-layer metrics including SVD, spectral analysis, and Marchenko-Pastur law fitting
- Prompt runner generates outputs from both models on standardized prompt suites (loads one model at a time to halve memory)
- Behaviour analyser compares output distributions with statistical tests (Mann-Whitney U, Cohen's d, JSD, EAD)
- Reward hack detector runs 4 sub-detectors + 40 adversarial consistency probes with NLI-enhanced scoring
- Report generator compiles everything into a human-readable diagnostic with Plotly visualizations
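The "one model at a time" pattern used by the prompt runner can be sketched as follows. Plain callables stand in for real models, and `run_prompts_sequentially` is a hypothetical name; this shows the load, generate, release ordering rather than Afterburn's actual runner.

```python
import gc

def run_prompts_sequentially(loaders, prompts):
    """Run every prompt through each model, keeping only one model loaded at a time."""
    outputs = {}
    for name, load in loaders.items():
        model = load()                               # load this checkpoint only
        outputs[name] = [model(p) for p in prompts]  # generate for all prompts
        del model                                    # drop the reference before the next load
        gc.collect()
    return outputs

# Stand-in "models": functions that map a prompt to a response.
def load_base():
    return lambda prompt: prompt.upper()

def load_trained():
    return lambda prompt: f"Answer: {prompt}"

prompts = ["2+2?", "capital of France?"]
results = run_prompts_sequentially({"base": load_base, "trained": load_trained}, prompts)
print(results)
```

Because the outputs are collected as plain strings before the next model loads, peak memory is roughly that of a single checkpoint. With PyTorch on a GPU, one would typically also call `torch.cuda.empty_cache()` after releasing each model.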
Project Structure
src/afterburn/
├── cli/ # Click CLI commands
├── loading/ # Model loading, safetensors, LoRA adapter detection
├── weight_diff/ # L2, cosine, SVD, spectral alpha, MP law, behavioral vectors
├── behaviour/ # Length, format, strategy, CoT, calibration, diversity, JSD
├── reward_hack/ # Length bias, format gaming, strategy collapse, sycophancy, probes
├── prompts/ # Prompt suites + inference runner
├── report/ # HTML/JSON/MD/PDF generation + Plotly visualizations
├── nli.py # Shared NLI model (cross-encoder/nli-deberta-v3-small)
├── diagnoser.py # Top-level orchestrator
└── types.py # 30+ shared dataclasses and enums
Testing
pytest tests/ -v # 678 tests
pytest tests/ --cov=afterburn # with coverage
ruff check src/ tests/ # linting
mypy src/afterburn/ # type checking
Contributing
git clone https://github.com/code-mohanprakash/afterburn.git
cd afterburn
pip install -e ".[dev]"
pytest tests/
See docs/contributing.md for architecture details and contribution guidelines.
Why Afterburn?
Existing tools either analyze weights (WeightWatcher) or evaluate outputs (lm-eval-harness, Giskard, DeepEval), but none connects weight changes, behavioral shifts, and reward hacking patterns in a single workflow.
Reward hacking is a documented problem in frontier models (METR, Anthropic). Afterburn is an open-source tool for detecting it.
License
MIT