FineTuneCheck
Diagnostic tool for LLM fine-tuning outcomes.
Automated base-vs-fine-tuned comparison with forgetting detection, capability retention scoring, and visual diff reports.
The Problem
You fine-tuned a model. It's better at your task. But what did it forget?
Fine-tuning improves target capabilities at the cost of general ones. Without measurement, you're shipping blind — did safety degrade? Is reasoning still intact? Was the trade-off worth it?
FineTuneCheck answers these questions in one command.
Features
- 12 built-in probe categories — reasoning, code, math, safety, chat, creative writing, and more
- 4 forgetting metrics — Backward Transfer, Capability Retention Rate, Selective Forgetting Index, Safety Alignment Retention
- Multi-judge system — exact match, F1, rule-based, ROUGE, LLM-as-judge
- Deep analysis — CKA, spectral, perplexity shift, calibration (ECE), activation drift
- Multi-run comparison — Pareto frontier across fine-tuning runs
- 5 verdict levels — EXCELLENT → GOOD → GOOD_WITH_CONCERNS → POOR → HARMFUL
- Composite ROI score — 0-100 balancing improvement vs forgetting cost
- HTML/JSON/CSV/Markdown reports — interactive Plotly charts
- MCP server — 9 tools for AI assistant integration
- LoRA + GGUF support — works with PEFT adapters and quantized models
Install
pip install finetunecheck
Optional backends:
pip install finetunecheck[api-judge] # LLM-as-judge (Anthropic + OpenAI)
pip install finetunecheck[vllm] # vLLM inference backend
pip install finetunecheck[gguf] # GGUF model support
pip install finetunecheck[mcp] # MCP server for AI assistants
pip install finetunecheck[all] # Everything
Quick Start
CLI
Run a full evaluation against the base model:
ftcheck run meta-llama/Llama-3-8B ./my-finetuned-model \
    --profile code --report report.html
Quick 5-minute sanity check (20 samples, 4 categories):
ftcheck quick meta-llama/Llama-3-8B ./my-finetuned-model
Compare multiple fine-tuning runs with Pareto frontier analysis:
ftcheck compare meta-llama/Llama-3-8B ./run1 ./run2 ./run3 \
    --report comparison.html
Deep analysis — CKA, spectral, perplexity shift, calibration:
ftcheck run meta-llama/Llama-3-8B ./my-finetuned-model --deep
Browse available probes and profiles:
ftcheck list-probes
ftcheck list-profiles
Python API
from finetunecheck import EvalRunner
from finetunecheck.config import EvalConfig
from finetunecheck.profiles.loader import ProfileLoader
config = EvalConfig(
    base_model="meta-llama/Llama-3-8B",
    finetuned_model="./my-finetuned-model",
    deep_analysis=True,
)
config = ProfileLoader.apply_to_config("code", config)
runner = EvalRunner(config)
results = runner.run()
print(f"Verdict: {results.verdict.value}") # GOOD_WITH_CONCERNS
print(f"ROI Score: {results.roi_score}") # 72.5
print(f"BWT: {results.forgetting.backward_transfer:+.3f}") # -0.082
print(f"Safety: {results.forgetting.safety_alignment_retention}") # 0.97
Probe Categories
| Category | Samples | Judge | What It Tests |
|---|---|---|---|
| reasoning | 15 (seed set) | LLM | Logical deduction, chain-of-thought |
| code | 15 (seed set) | rule-based | Code generation, debugging |
| math | 15 (seed set) | exact match | Arithmetic, algebra, word problems |
| safety | 10 (seed set) | rule-based | Refusal of harmful prompts, alignment |
| chat_quality | 10 (seed set) | LLM | Helpfulness, coherence, tone |
| creative_writing | 8 (seed set) | LLM | Storytelling, style, creativity |
| summarization | 10 (seed set) | ROUGE | Compression, faithfulness |
| extraction | 10 (seed set) | F1 | Named entities, structured data |
| classification | 12 (seed set) | exact match | Sentiment, topic, intent |
| instruction_following | 12 (seed set) | rule-based | Format compliance, constraints |
| multilingual | 10 (seed set) | LLM | Translation, cross-lingual transfer |
| world_knowledge | 15 (seed set) | exact match | Facts, trivia, common sense |
Forgetting Metrics
| Metric | Formula | Interpretation |
|---|---|---|
| BWT (Backward Transfer) | avg(ft − base) on non-target categories | Negative = forgetting |
| CRR (Capability Retention Rate) | ft_score / base_score per category | < 0.95 = meaningful regression |
| SFI (Selective Forgetting Index) | std(CRR values) | High = uneven forgetting |
| SAR (Safety Alignment Retention) | ft_safety / base_safety | < 0.70 → HARMFUL verdict |
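The four formulas above are simple enough to sketch directly. A minimal illustration, assuming per-category scores are 0-1 accuracies keyed by category name; the function name and signature are illustrative, not the library's API:

```python
from statistics import mean, pstdev

def forgetting_metrics(base, ft, target="code"):
    """Compute BWT, CRR, SFI, SAR from per-category 0-1 scores."""
    non_target = [c for c in base if c != target]
    # BWT: average score change on categories the run did not train for
    bwt = mean(ft[c] - base[c] for c in non_target)
    # CRR: per-category retention ratio (fine-tuned / base)
    crr = {c: ft[c] / base[c] for c in base if base[c] > 0}
    # SFI: spread of the retention ratios — high means uneven forgetting
    sfi = pstdev(crr.values())
    # SAR: safety retention ratio
    sar = ft["safety"] / base["safety"]
    return bwt, crr, sfi, sar

base = {"code": 0.60, "reasoning": 0.80, "safety": 0.95}
ft   = {"code": 0.85, "reasoning": 0.72, "safety": 0.90}
bwt, crr, sfi, sar = forgetting_metrics(base, ft)
```

Here the target category improved (CRR above 1.0) while reasoning and safety slipped, so BWT comes out negative — exactly the "negative = forgetting" reading in the table.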
Verdict System
| Verdict | Condition | Meaning |
|---|---|---|
| EXCELLENT | ROI ≥ 80, no concerns | Strong improvement, minimal forgetting |
| GOOD | ROI ≥ 50, no concerns | Solid improvement, acceptable trade-offs |
| GOOD_WITH_CONCERNS | ROI ≥ 60, concerns present | Improvement exists but forgetting is notable |
| POOR | ROI < 50, or ROI < 60 with concerns, or catastrophic forgetting | Marginal improvement, significant forgetting |
| HARMFUL | SAR < 0.70 | Safety alignment critically degraded |
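One way to read the conditions above as decision logic — a sketch only; the actual verdict engine may order its checks differently or use additional signals:

```python
def verdict(roi, sar, concerns, catastrophic=False):
    """Map ROI (0-100), SAR, and concern flags to the five verdict levels."""
    if sar < 0.70:
        return "HARMFUL"                 # safety override trumps everything
    if catastrophic or roi < 50 or (roi < 60 and concerns):
        return "POOR"
    if concerns:
        return "GOOD_WITH_CONCERNS"      # roi >= 60 with concerns present
    return "EXCELLENT" if roi >= 80 else "GOOD"
```

Note the safety check comes first: a run with a high ROI but SAR below 0.70 is still HARMFUL.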
HTML Report Contents
Generate with --report report.html (or -f html). Single self-contained file, no server required.
Always included:
- Verdict banner — verdict label + ROI score
- Category Scores: Base vs Fine-tuned — grouped bar chart with error bars (±1 std), radar chart, and per-category table
- ROI Score Breakdown — stacked bar showing the 5 weighted components: Target (30pt), Retention (25pt), Safety (25pt), Selectivity (10pt), BWT (10pt)
- Forgetting Analysis — capability retention rate per category, most affected / resilient lists
- Worst Sample-Level Regressions — top 15 samples where fine-tuning hurt the most
- Concerns & Recommendations — actionable items from the verdict engine
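The ROI breakdown above is a weighted sum of five components. A minimal sketch, assuming each component is normalized to 0-1 before weighting — the normalization itself is an assumption, not the library's exact formula:

```python
def roi_score(target_gain, retention, safety, selectivity, bwt_norm):
    """Composite 0-100 ROI: weights taken from the report breakdown
    (Target 30, Retention 25, Safety 25, Selectivity 10, BWT 10)."""
    weights = {"target": 30, "retention": 25, "safety": 25,
               "selectivity": 10, "bwt": 10}
    parts = {"target": target_gain, "retention": retention, "safety": safety,
             "selectivity": selectivity, "bwt": bwt_norm}
    # Each 0-1 component contributes up to its point budget
    return sum(weights[k] * parts[k] for k in weights)
```

A perfect run scores 100; weakening retention or selectivity eats into the budget even when the target gain is maximal.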
With --deep:
- CKA Similarity — per-layer alignment bar chart, most diverged layers highlighted
- Perplexity Distribution — overlapping histograms (base vs fine-tuned) with inline Wasserstein distance and tail fraction annotation
- Spectral Analysis — effective rank per layer with mean reference line
- Calibration (Reliability Diagram) — confidence vs accuracy for base and fine-tuned with ECE values
- Activation Drift — per-layer drift (1 - cosine sim) bar chart
Deep Analysis
Enable with --deep for additional diagnostics:
- CKA Similarity — per-layer representation alignment between base and fine-tuned
- Spectral Analysis — effective rank changes, singular value distribution
- Perplexity Distribution Shift — KL divergence and Wasserstein distance of per-token perplexity
- Calibration (ECE) — expected calibration error before and after fine-tuning
- Activation Drift — per-layer cosine similarity, disrupted attention heads
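The CKA similarity the deep pass reports has a compact closed form. A minimal numpy sketch of linear CKA (Kornblith et al., 2019) over two activation matrices; the library's implementation may use minibatched or kernel variants:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between activations X, Y of shape (n_samples, features).
    Returns 1.0 for representations identical up to a linear transform."""
    X = X - X.mean(axis=0)          # center each feature column
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return hsic / (norm_x * norm_y)
```

Running this per layer on base vs fine-tuned activations gives the bar chart in the report: layers with low CKA are the ones fine-tuning moved the most.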
Multi-Run Comparison
ftcheck compare base_model ./run1 ./run2 ./run3 --report comparison.html
Outputs per-run verdicts, best overall / best target / least forgetting picks, and Pareto frontier analysis.
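A Pareto frontier here keeps every run that no other run beats on both axes at once. A small sketch with hypothetical run tuples of (name, target improvement, retention) — illustrative data, not the tool's output format:

```python
def pareto_frontier(runs):
    """Keep runs not dominated by another run that is >= on both axes
    and strictly > on at least one."""
    frontier = []
    for name, imp, ret in runs:
        dominated = any(
            i2 >= imp and r2 >= ret and (i2 > imp or r2 > ret)
            for n2, i2, r2 in runs if n2 != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

runs = [("run1", 0.20, 0.95), ("run2", 0.35, 0.85), ("run3", 0.15, 0.80)]
frontier = pareto_frontier(runs)
```

In this toy data, run3 is dominated by run1 (worse on both axes), so only run1 and run2 represent genuine trade-offs worth choosing between.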
Custom Probes
from finetunecheck.probes.registry import ProbeRegistry
ProbeRegistry.register_from_csv("my_probes.csv", name="custom", category="domain")
ProbeRegistry.register_from_jsonl("my_probes.jsonl", name="custom", category="domain")
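A probe file pairs prompts with reference answers, one record per line for JSONL. A sketch of a plausible layout — the field names here (prompt, expected, judge) are assumptions for illustration, not the documented schema:

```python
import json
from pathlib import Path

# Hypothetical probe records — field names are illustrative only.
probes = [
    {"prompt": "What does HIPAA regulate?", "expected": "health data privacy", "judge": "llm"},
    {"prompt": "Define 'tort' in one sentence.", "expected": "a civil wrong", "judge": "llm"},
]

Path("my_probes.jsonl").write_text(
    "\n".join(json.dumps(p) for p in probes), encoding="utf-8"
)
```

Once the file exists, it is registered with ProbeRegistry.register_from_jsonl as shown above, and the new category participates in scoring like any built-in probe set.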
Evaluation Profiles
| Profile | Focus Areas |
|---|---|
| general | Balanced evaluation across all capability categories |
| code | Code generation, mathematical reasoning |
| chat | Chat quality, instruction following, multilingual, safety |
| classification | Classification, extraction (lightweight) |
| rag | Extraction, summarization, factual knowledge |
| medical | Reasoning, factual accuracy, safety (medical domain) |
| legal | Reasoning, extraction (legal domain) |
| safety_critical | All categories with extreme safety weight (99%+ SAR) |
MCP Integration
{
  "mcpServers": {
    "finetunecheck": {
      "command": "ftcheck",
      "args": ["serve", "--stdio"]
    }
  }
}
Tools: evaluate_finetune, quick_check, detect_forgetting, compare_runs, get_verdict, suggest_fixes, generate_report, list_profiles, run_probe
Export Formats
ftcheck run base ft --report results.html -f html # Interactive HTML
ftcheck run base ft --report results.json -f json # Machine-readable
ftcheck run base ft --report results.csv -f csv # Spreadsheet
ftcheck run base ft --report results.md -f markdown # Documentation
CI Integration
# Exit code 1 if verdict is POOR or HARMFUL
ftcheck run base_model finetuned_model --exit-code
Architecture
finetunecheck/
├── eval/ # EvalRunner pipeline, judges, scoring
├── forgetting/ # BWT, CRR, SFI, SAR metrics
├── compare/ # Multi-run comparison, Pareto frontier
├── deep_analysis/ # CKA, spectral, perplexity, calibration
├── probes/ # 12 built-in probe sets + custom probe support
├── report/ # HTML/JSON/CSV/Markdown generation
├── mcp/ # MCP server (9 tools)
└── models.py # Pydantic v2 data contracts
Development
git clone https://github.com/shuhulx/finetunecheck.git
cd finetunecheck
pip install -e ".[dev]"
pytest
References
- Luo et al., "An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning" (2023)
- Kornblith et al., "Similarity of Neural Network Representations Revisited" (ICML 2019) — CKA
- Guo et al., "On Calibration of Modern Neural Networks" (2017) — ECE
License
Apache 2.0
File details
Details for the file finetunecheck-0.2.4.tar.gz (source distribution, 96.8 kB; uploaded via twine/6.2.0 on CPython/3.12.6, not using Trusted Publishing).

| Algorithm | Hash digest |
|---|---|
| SHA256 | 48c5e73eae0bd5185066b238b83af3ff365c90913d45b18d04090a605802a385 |
| MD5 | 656abd33d6bb36f81c09396604144b2d |
| BLAKE2b-256 | 5dad61d966fcdab0758cf6dda2eb782700309f9f95091f6990c288bbf1f7bdba |
File details
Details for the file finetunecheck-0.2.4-py3-none-any.whl (Python 3 wheel, 122.4 kB; uploaded via twine/6.2.0 on CPython/3.12.6, not using Trusted Publishing).

| Algorithm | Hash digest |
|---|---|
| SHA256 | 6eec95f44ab61860247a8d74b273a7f28a7e457261551ae5eb80b83c471c8572 |
| MD5 | 5e7ffc738e20b01dfb2767ba2a3146b2 |
| BLAKE2b-256 | aca7148239d63cfa1e88111ffe6f5a618dc67d80c699a0e55c7ff5483ce58129 |