FineTuneCheck
Diagnostic tool for LLM fine-tuning outcomes.
Automated base-vs-fine-tuned comparison with forgetting detection, capability retention scoring, and visual diff reports.
The Problem
You fine-tuned a model. It's better at your task. But what did it forget?
Fine-tuning improves target capabilities, often at the cost of general ones. Without measurement, you're shipping blind:
- Did safety alignment degrade?
- Is reasoning still intact?
- Are code capabilities broken?
- Was the trade-off worth it?
FineTuneCheck answers these questions in one command.
Features
- 12 built-in probe categories — reasoning, code, math, safety, chat quality, creative writing, summarization, extraction, classification, instruction following, multilingual, world knowledge
- 4 forgetting metrics — Backward Transfer (BWT), Capability Retention Rate (CRR), Selective Forgetting Index (SFI), Safety Alignment Retention (SAR)
- Multi-judge system — exact match, F1, rule-based, ROUGE, LLM-as-judge
- Deep analysis — CKA similarity, spectral analysis, perplexity distribution shift, calibration (ECE), activation drift
- Multi-run comparison — Pareto frontier analysis across fine-tuning runs
- 5 verdict levels — EXCELLENT → GOOD → GOOD_WITH_CONCERNS → POOR → HARMFUL
- Composite ROI score — 0-100 score balancing improvement vs forgetting cost
- HTML/JSON/CSV/Markdown reports — interactive Plotly charts, exportable results
- MCP server — 9 tools for AI assistant integration
- LoRA + GGUF support — works with PEFT adapters and quantized models
Install
```bash
pip install finetunecheck
```
With optional backends:
```bash
pip install finetunecheck[api-judge]  # LLM-as-judge (Anthropic + OpenAI)
pip install finetunecheck[vllm]       # vLLM inference backend
pip install finetunecheck[gguf]       # GGUF model support
pip install finetunecheck[mcp]        # MCP server for AI assistants
pip install finetunecheck[all]        # Everything
```
Quick Start
CLI
```bash
# Full evaluation
ftcheck run meta-llama/Llama-3-8B ./my-finetuned-model \
  --profile code --report report.html

# Quick 5-minute check (20 samples, 4 categories)
ftcheck quick meta-llama/Llama-3-8B ./my-finetuned-model

# Compare multiple fine-tuning runs
ftcheck compare meta-llama/Llama-3-8B ./run1 ./run2 ./run3 \
  --report comparison.html

# Deep analysis (CKA, spectral, perplexity, calibration)
ftcheck run meta-llama/Llama-3-8B ./my-finetuned-model --deep

# List available probes and profiles
ftcheck list-probes
ftcheck list-profiles
```
Python API
```python
from finetunecheck import EvalRunner
from finetunecheck.config import EvalConfig

config = EvalConfig(
    base_model="meta-llama/Llama-3-8B",
    finetuned_model="./my-finetuned-model",
    profile="code",
    deep_analysis=True,
)

runner = EvalRunner(config)
results = runner.run()

print(f"Verdict: {results.verdict.value}")                         # GOOD_WITH_CONCERNS
print(f"ROI Score: {results.roi_score}")                           # 72.5
print(f"BWT: {results.forgetting.backward_transfer:+.3f}")         # -0.082
print(f"Safety: {results.forgetting.safety_alignment_retention}")  # 0.97
```
Probe Categories
| Category | Samples | Judge | What It Tests |
|---|---|---|---|
| reasoning | 100+ | LLM | Logical deduction, chain-of-thought |
| code | 100+ | rule-based | Code generation, debugging |
| math | 100+ | exact match | Arithmetic, algebra, word problems |
| safety | 100+ | rule-based | Refusal of harmful prompts, alignment |
| chat_quality | 100+ | LLM | Helpfulness, coherence, tone |
| creative_writing | 100+ | LLM | Storytelling, style, creativity |
| summarization | 100+ | ROUGE | Compression, faithfulness |
| extraction | 100+ | F1 | Named entities, structured data |
| classification | 100+ | exact match | Sentiment, topic, intent |
| instruction_following | 100+ | rule-based | Format compliance, constraints |
| multilingual | 100+ | LLM | Translation, cross-lingual transfer |
| world_knowledge | 100+ | exact match | Facts, trivia, common sense |
Forgetting Metrics
| Metric | Formula | Interpretation |
|---|---|---|
| BWT (Backward Transfer) | avg(ft − base) on non-target categories | Negative = forgetting |
| CRR (Capability Retention Rate) | ft_score / base_score per category | < 0.95 = meaningful regression |
| SFI (Selective Forgetting Index) | std(CRR values) | High = uneven forgetting |
| SAR (Safety Alignment Retention) | ft_safety / base_safety | < 0.90 → HARMFUL verdict |
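To make the formulas concrete, here is a minimal self-contained sketch of how the four metrics compose (illustrative only, not the library's internals; the category names, scores, and `target` value are invented for the example):

```python
import statistics

# Per-category scores in [0, 1] for the base and fine-tuned model
# (invented values for illustration).
base = {"code": 0.55, "reasoning": 0.70, "math": 0.62, "safety": 0.95}
ft   = {"code": 0.78, "reasoning": 0.64, "math": 0.57, "safety": 0.92}
target = "code"

held_out = [c for c in base if c != target]  # non-target categories

# BWT: mean score change off-target; negative means forgetting.
bwt = statistics.mean(ft[c] - base[c] for c in held_out)

# CRR: per-category retention ratio; < 0.95 flags meaningful regression.
crr = {c: ft[c] / base[c] for c in held_out}

# SFI: spread of the retention ratios; high means uneven forgetting.
sfi = statistics.pstdev(crr.values())

# SAR: safety retention; < 0.90 triggers the HARMFUL verdict.
sar = ft["safety"] / base["safety"]

print(f"BWT={bwt:+.3f}  SFI={sfi:.3f}  SAR={sar:.3f}")
```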
Verdict System
| Verdict | ROI Score | Meaning |
|---|---|---|
| EXCELLENT | 85-100 | Strong improvement, minimal forgetting |
| GOOD | 70-84 | Solid improvement, acceptable trade-offs |
| GOOD_WITH_CONCERNS | 50-69 | Improvement exists but forgetting is notable |
| POOR | 25-49 | Marginal improvement, significant forgetting |
| HARMFUL | 0-24 | Safety degraded or catastrophic forgetting |
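The band lookup itself is mechanical. A sketch of the mapping (not the package's code, and only the default path: SAR < 0.90 forces HARMFUL per the metrics table, and the Quick Start example pairs an ROI of 72.5 with GOOD_WITH_CONCERNS, so other forgetting signals can evidently demote a band):

```python
def roi_band(roi_score: float) -> str:
    """Map a 0-100 ROI score to its verdict band (sketch of the table above)."""
    bands = [(85, "EXCELLENT"), (70, "GOOD"), (50, "GOOD_WITH_CONCERNS"),
             (25, "POOR"), (0, "HARMFUL")]
    return next(name for floor, name in bands if roi_score >= floor)

print(roi_band(88.0))  # EXCELLENT
print(roi_band(62.0))  # GOOD_WITH_CONCERNS
```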
Deep Analysis
Enable with --deep for additional diagnostics:
- CKA Similarity — per-layer representation alignment between base and fine-tuned
- Spectral Analysis — effective rank changes, singular value distribution
- Perplexity Distribution Shift — KL divergence and Wasserstein distance of per-token perplexity
- Calibration (ECE) — expected calibration error before and after fine-tuning
- Activation Drift — per-layer cosine similarity, disrupted attention heads
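For orientation, the first of these has a compact closed form: linear CKA (Kornblith et al., 2019) compares two activation matrices directly. A standalone numpy sketch of the per-layer quantity, independent of FineTuneCheck's implementation:

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between activation matrices of shape (samples, features)."""
    X = X - X.mean(axis=0)  # center each feature
    Y = Y - Y.mean(axis=0)
    # CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    return (np.linalg.norm(Y.T @ X, "fro") ** 2
            / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")))

acts = np.random.randn(256, 64)
print(linear_cka(acts, acts))                      # 1.0: identical layers
print(linear_cka(acts, np.random.randn(256, 64)))  # near 0: unrelated layers
```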
Multi-Run Comparison
```bash
ftcheck compare base_model ./run1 ./run2 ./run3 --report comparison.html
```
Outputs:
- Per-run verdicts and ROI scores
- Best overall (highest ROI)
- Best target performance (highest target improvement)
- Least forgetting (highest mean CRR)
- Pareto frontier — runs that aren't dominated on any metric
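"Not dominated" here means no other run is at least as good on every metric and strictly better on at least one. A minimal sketch of that filter (the metric names are invented for the example):

```python
def pareto_frontier(runs: dict[str, dict[str, float]]) -> list[str]:
    """Keep runs that no other run dominates (all metrics higher-is-better)."""
    def dominates(a: dict, b: dict) -> bool:
        return all(a[m] >= b[m] for m in a) and any(a[m] > b[m] for m in a)
    return [name for name, scores in runs.items()
            if not any(dominates(other, scores)
                       for o, other in runs.items() if o != name)]

runs = {
    "run1": {"target_improvement": 0.21, "mean_crr": 0.93},
    "run2": {"target_improvement": 0.15, "mean_crr": 0.99},
    "run3": {"target_improvement": 0.14, "mean_crr": 0.97},  # beaten by run2
}
print(pareto_frontier(runs))  # ['run1', 'run2']
```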
Custom Probes
```python
from finetunecheck.probes.registry import ProbeRegistry

# From CSV
ProbeRegistry.register_from_csv("my_probes.csv", name="custom", category="domain")

# From JSONL
ProbeRegistry.register_from_jsonl("my_probes.jsonl", name="custom", category="domain")
```
CSV format: input,reference,difficulty,tags
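An illustrative my_probes.csv (rows invented for the example; difficulty and tags follow whatever vocabulary your domain uses):

```csv
input,reference,difficulty,tags
"What default port does HTTPS use?","443",easy,"networking;facts"
"Translate 'thank you' into French.","merci",easy,"translation"
"Factor x^2 - 5x + 6.","(x - 2)(x - 3)",medium,"algebra"
```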
MCP Integration
Add to your AI assistant's MCP config:
```json
{
  "mcpServers": {
    "finetunecheck": {
      "command": "ftcheck",
      "args": ["serve", "--stdio"]
    }
  }
}
```
9 MCP tools: run_evaluation, quick_check, compare_runs, get_forgetting_report, list_probes, list_profiles, get_probe_details, analyze_deep, generate_report
Evaluation Profiles
| Profile | Focus Areas |
|---|---|
| default | All 12 categories |
| code | Code generation, reasoning, instruction following |
| chat | Chat quality, safety, instruction following |
| safety | Thorough safety and alignment evaluation |
| math | Mathematical reasoning, problem solving |
| multilingual | Cross-lingual capabilities |
Export Formats
```bash
ftcheck run base ft --report results.html -f html     # Interactive HTML
ftcheck run base ft --report results.json -f json     # Machine-readable
ftcheck run base ft --report results.csv -f csv       # Spreadsheet
ftcheck run base ft --report results.md -f markdown   # Documentation
```
CI Integration
```bash
# Exit code 1 if verdict is POOR or HARMFUL
ftcheck run base_model finetuned_model --exit-code
```
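Wired into CI, this becomes a merge gate. A hypothetical GitHub Actions job (the model paths and step layout are placeholders, not something FineTuneCheck ships):

```yaml
jobs:
  forgetting-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Evaluate fine-tune
        run: |
          pip install finetunecheck
          # --exit-code fails this step when the verdict is POOR or HARMFUL
          ftcheck run meta-llama/Llama-3-8B ./my-finetuned-model \
            --exit-code --report report.html
      - name: Keep the report even when the gate fails
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: finetunecheck-report
          path: report.html
```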
Security
- Models loaded via HuggingFace Transformers (no pickle/torch.load)
- YAML parsed with safe_load
- Jinja2 templates with autoescape
- No secrets in reports or logs
- Disk cache for baseline results (safe serialization)
Architecture
```
finetunecheck/
├── eval/           # EvalRunner pipeline, judges, scoring
├── forgetting/     # BWT, CRR, SFI, SAR metrics
├── compare/        # Multi-run comparison, Pareto frontier
├── deep_analysis/  # CKA, spectral, perplexity, calibration
├── probes/         # 12 built-in probe sets + custom probe support
├── report/         # HTML/JSON/CSV/Markdown generation
├── mcp/            # MCP server (9 tools)
└── models.py       # Pydantic v2 data contracts
```
References
- Luo et al., "An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning" (2023)
- Kornblith et al., "Similarity of Neural Network Representations Revisited" (ICML 2019) — CKA
- Guo et al., "On Calibration of Modern Neural Networks" (ICML 2017) — ECE
License
Apache 2.0