Automated base vs fine-tuned LLM comparison with forgetting detection, capability retention scoring, and visual diff reports.

These details have not been verified by PyPI

Project links

Project description

FineTuneCheck

Diagnostic tool for LLM fine-tuning outcomes.

Automated base-vs-fine-tuned comparison with forgetting detection, capability retention scoring, and visual diff reports.

The Problem

You fine-tuned a model. It's better at your task. But what did it forget?

Fine-tuning improves target capabilities at the cost of general ones. Without measurement, you're shipping blind — did safety degrade? Is reasoning still intact? Was the trade-off worth it?

FineTuneCheck answers these questions in one command.

Features

12 built-in probe categories — reasoning, code, math, safety, chat, creative writing, and more
4 forgetting metrics — Backward Transfer, Capability Retention Rate, Selective Forgetting Index, Safety Alignment Retention
Multi-judge system — exact match, F1, rule-based, ROUGE, LLM-as-judge
Deep analysis — CKA, spectral, perplexity shift, calibration (ECE), activation drift
Multi-run comparison — Pareto frontier across fine-tuning runs
5 verdict levels — EXCELLENT → GOOD → GOOD_WITH_CONCERNS → POOR → HARMFUL
Composite ROI score — 0-100 balancing improvement vs forgetting cost
HTML/JSON/CSV/Markdown reports — interactive Plotly charts
MCP server — 9 tools for AI assistant integration
LoRA + GGUF support — works with PEFT adapters and quantized models

Install

pip install finetunecheck

Optional backends:

pip install finetunecheck[api-judge]   # LLM-as-judge (Anthropic + OpenAI)
pip install finetunecheck[vllm]        # vLLM inference backend
pip install finetunecheck[gguf]        # GGUF model support
pip install finetunecheck[mcp]         # MCP server for AI assistants
pip install finetunecheck[all]         # Everything

Quick Start

CLI

Run a full evaluation against the base model:

ftcheck run meta-llama/Llama-3-8B ./my-finetuned-model \
  --profile code --report report.html

Quick 5-minute sanity check (20 samples, 4 categories):

ftcheck quick meta-llama/Llama-3-8B ./my-finetuned-model

Compare multiple fine-tuning runs with Pareto frontier analysis:

ftcheck compare meta-llama/Llama-3-8B ./run1 ./run2 ./run3 \
  --report comparison.html

Deep analysis — CKA, spectral, perplexity shift, calibration:

ftcheck run meta-llama/Llama-3-8B ./my-finetuned-model --deep

Browse available probes and profiles:

ftcheck list-probes
ftcheck list-profiles

Python API

from finetunecheck import EvalRunner
from finetunecheck.config import EvalConfig

config = EvalConfig(
    base_model="meta-llama/Llama-3-8B",
    finetuned_model="./my-finetuned-model",
    profile="code",
    deep_analysis=True,
)

runner = EvalRunner(config)
results = runner.run()

print(f"Verdict: {results.verdict.value}")        # GOOD_WITH_CONCERNS
print(f"ROI Score: {results.roi_score}")           # 72.5
print(f"BWT: {results.forgetting.backward_transfer:+.3f}")  # -0.082
print(f"Safety: {results.forgetting.safety_alignment_retention}")  # 0.97

Probe Categories

Category	Samples	Judge	What It Tests
reasoning	15 (seed set)	LLM	Logical deduction, chain-of-thought
code	15 (seed set)	rule-based	Code generation, debugging
math	15 (seed set)	exact match	Arithmetic, algebra, word problems
safety	10 (seed set)	rule-based	Refusal of harmful prompts, alignment
chat_quality	10 (seed set)	LLM	Helpfulness, coherence, tone
creative_writing	8 (seed set)	LLM	Storytelling, style, creativity
summarization	10 (seed set)	ROUGE	Compression, faithfulness
extraction	10 (seed set)	F1	Named entities, structured data
classification	12 (seed set)	exact match	Sentiment, topic, intent
instruction_following	12 (seed set)	rule-based	Format compliance, constraints
multilingual	10 (seed set)	LLM	Translation, cross-lingual transfer
world_knowledge	15 (seed set)	exact match	Facts, trivia, common sense

Forgetting Metrics

Metric	Formula	Interpretation
BWT (Backward Transfer)	avg(ft − base) on non-target categories	Negative = forgetting
CRR (Capability Retention Rate)	ft_score / base_score per category	< 0.95 = meaningful regression
SFI (Selective Forgetting Index)	std(CRR values)	High = uneven forgetting
SAR (Safety Alignment Retention)	ft_safety / base_safety	< 0.70 → HARMFUL verdict

Verdict System

Verdict	Condition	Meaning
EXCELLENT	ROI ≥ 80, no concerns	Strong improvement, minimal forgetting
GOOD	ROI ≥ 50, no concerns	Solid improvement, acceptable trade-offs
GOOD_WITH_CONCERNS	ROI ≥ 60, concerns present	Improvement exists but forgetting is notable
POOR	ROI < 50, or ROI < 60 with concerns, or catastrophic forgetting	Marginal improvement, significant forgetting
HARMFUL	SAR < 0.70	Safety alignment critically degraded

HTML Report Contents

Generate with --report report.html (or -f html). Single self-contained file, no server required.

Always included:

Verdict banner — verdict label + ROI score
Category Scores: Base vs Fine-tuned — grouped bar chart with error bars (±1 std), radar chart, and per-category table
ROI Score Breakdown — stacked bar showing the 5 weighted components: Target (30pt), Retention (25pt), Safety (25pt), Selectivity (10pt), BWT (10pt)
Forgetting Analysis — capability retention rate per category, most affected / resilient lists
Worst Sample-Level Regressions — top 15 samples where fine-tuning hurt the most
Concerns & Recommendations — actionable items from the verdict engine

With --deep:

CKA Similarity — per-layer alignment bar chart, most diverged layers highlighted
Perplexity Distribution — overlapping histograms (base vs fine-tuned) with inline Wasserstein distance and tail fraction annotation
Spectral Analysis — effective rank per layer with mean reference line
Calibration (Reliability Diagram) — confidence vs accuracy for base and fine-tuned with ECE values
Activation Drift — per-layer drift (1 - cosine sim) bar chart

Deep Analysis

Enable with --deep for additional diagnostics:

CKA Similarity — per-layer representation alignment between base and fine-tuned
Spectral Analysis — effective rank changes, singular value distribution
Perplexity Distribution Shift — KL divergence and Wasserstein distance of per-token perplexity
Calibration (ECE) — expected calibration error before and after fine-tuning
Activation Drift — per-layer cosine similarity, disrupted attention heads

Multi-Run Comparison

ftcheck compare base_model ./run1 ./run2 ./run3 --report comparison.html

Outputs per-run verdicts, best overall / best target / least forgetting picks, and Pareto frontier analysis.

Custom Probes

from finetunecheck.probes.registry import ProbeRegistry

ProbeRegistry.register_from_csv("my_probes.csv", name="custom", category="domain")
ProbeRegistry.register_from_jsonl("my_probes.jsonl", name="custom", category="domain")

Evaluation Profiles

Profile	Focus Areas
`general`	Balanced evaluation across all capability categories
`code`	Code generation, mathematical reasoning
`chat`	Chat quality, instruction following, multilingual, safety
`classification`	Classification, extraction (lightweight)
`rag`	Extraction, summarization, factual knowledge
`medical`	Reasoning, factual accuracy, safety (medical domain)
`legal`	Reasoning, extraction (legal domain)
`safety_critical`	All categories with extreme safety weight (99%+ SAR)

MCP Integration

{
  "mcpServers": {
    "finetunecheck": {
      "command": "ftcheck",
      "args": ["serve", "--stdio"]
    }
  }
}

Tools: run_evaluation, quick_check, compare_runs, get_forgetting_report, list_probes, list_profiles, get_probe_details, analyze_deep, generate_report

Export Formats

ftcheck run base ft --report results.html -f html       # Interactive HTML
ftcheck run base ft --report results.json -f json       # Machine-readable
ftcheck run base ft --report results.csv -f csv         # Spreadsheet
ftcheck run base ft --report results.md -f markdown     # Documentation

CI Integration

# Exit code 1 if verdict is POOR or HARMFUL
ftcheck run base_model finetuned_model --exit-code

Architecture

finetunecheck/
├── eval/           # EvalRunner pipeline, judges, scoring
├── forgetting/     # BWT, CRR, SFI, SAR metrics
├── compare/        # Multi-run comparison, Pareto frontier
├── deep_analysis/  # CKA, spectral, perplexity, calibration
├── probes/         # 12 built-in probe sets + custom probe support
├── report/         # HTML/JSON/CSV/Markdown generation
├── mcp/            # MCP server (9 tools)
└── models.py       # Pydantic v2 data contracts

Development

git clone https://github.com/shuhulx/finetunecheck.git
cd finetunecheck
pip install -e ".[dev]"
pytest

References

Luo et al., "An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning" (2023)
Kornblith et al., "Similarity of Neural Network Representations Revisited" (ICML 2019) — CKA
Guo et al., "On Calibration of Modern Neural Networks" (2017) — ECE

Changelog

0.2.1

Docs: Added HTML Report Contents section — full list of always-included and --deep charts
Docs: Filled changelog for 0.1.7 through 0.2.0

0.2.0

Added: ROI Score Breakdown chart — stacked horizontal bar with 5 weighted components (Target, Retention, Safety, Selectivity, BWT) showing how the composite score was built
Added: Error bars on Category Scores bar chart using std_score — shows score variance across probe samples for both base and fine-tuned
Added: Inline annotation on Perplexity Distribution chart — Wasserstein distance and tail fraction (% samples where FT perplexity > 2× base) shown directly in chart
Added: HTML Report Contents section in README

0.1.9

Fixed: SFI (Selective Forgetting Index) now uses sample variance with Bessel's correction (/ n-1) instead of population variance
Fixed: find_regressions default threshold corrected from 0.3 → 0.1 to match EvalRunner

0.1.8

Fixed: LLMJudge _parse_judgment regex now handles nested braces with re.DOTALL — previously failed on multi-line JSON responses

0.1.7

Fixed: BWT metric now uses normalized -1.0 for missing categories (consistent with CRR)
Fixed: SAR dict handling when ft_safety is not a dict
Fixed: ROI score clamps BWT to [-1.0, inf) before normalization

0.1.6

Fixed: BWT metric now uses normalized -1.0 for missing categories (consistent with CRR)
Fixed: SAR dict handling when ft_safety is not a dict
Fixed: ROI score clamps BWT to [-1.0, inf) before normalization
Fixed: ExactMatchJudge no longer allows substring matches
Fixed: All judge batch methods validate input lengths
Fixed: LLM judge parse failures now logged instead of silently returning 0.5
Fixed: ExecutionJudge temp file cleanup in proper try/finally
Fixed: Path traversal hardening in MCP report generation
Fixed: MCP server logs full tracebacks on errors
Fixed: CLI validates num_samples > 0
Fixed: adapter_config.json parsing validates JSON structure
Fixed: Backend cleanup on partial model load failure
Added: Warnings for SFI infinity filtering and placeholder probe fallback
Added: ROI weight validation (non-negative, at least one > 0)
Added: Dependency upper bounds (torch<3, transformers<5, pydantic<3)

License

Apache 2.0

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.2.4

Apr 5, 2026

0.2.2

Mar 20, 2026

This version

0.2.1

Mar 20, 2026

0.2.0

Mar 20, 2026

0.1.9

Mar 19, 2026

0.1.8

Mar 19, 2026

0.1.7

Mar 19, 2026

0.1.6

Mar 10, 2026

0.1.5

Mar 8, 2026

0.1.4

Mar 8, 2026

0.1.3

Feb 28, 2026

0.1.2

Feb 28, 2026

0.1.1

Feb 28, 2026

0.1.0

Feb 28, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

finetunecheck-0.2.1.tar.gz (97.1 kB view details)

Uploaded Mar 20, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

finetunecheck-0.2.1-py3-none-any.whl (122.6 kB view details)

Uploaded Mar 20, 2026 Python 3

File details

Details for the file finetunecheck-0.2.1.tar.gz.

File metadata

Download URL: finetunecheck-0.2.1.tar.gz
Upload date: Mar 20, 2026
Size: 97.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.6

File hashes

Hashes for finetunecheck-0.2.1.tar.gz
Algorithm	Hash digest
SHA256	`b363ca491bf7468354afca9ebb0309db7ee6d0c342148102e8686ed527b77975`
MD5	`0d638dc616be7dcf061a39a402bd6b11`
BLAKE2b-256	`34f13e1eb06bfa053bab16b9a5eb5953cdaf43c3dc3f22fd2c5dfc78342c7242`

See more details on using hashes here.

File details

Details for the file finetunecheck-0.2.1-py3-none-any.whl.

File metadata

Download URL: finetunecheck-0.2.1-py3-none-any.whl
Upload date: Mar 20, 2026
Size: 122.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.6

File hashes

Hashes for finetunecheck-0.2.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`eef04bcf2dbd0a9e36cb14e84ccae05a143ec5fca5b450722abda0a4ba04826e`
MD5	`52a955bcfb2f80327562a926e3cb5260`
BLAKE2b-256	`1026bcedb46a33f379b122c9c348d8587b17b88ac2dc4e0199e7497f9c5a87df`

See more details on using hashes here.

finetunecheck 0.2.1

Navigation

Verified details

Maintainers

Meta

Unverified details

Project links

Meta

Classifiers

Project description

FineTuneCheck

The Problem

Features

Install

Quick Start

CLI

Python API

Probe Categories

HTML Report Contents

Deep Analysis

Multi-Run Comparison

Custom Probes

Evaluation Profiles

MCP Integration

Export Formats

CI Integration

Architecture

Development

References

Changelog

0.2.1

0.2.0

0.1.9

0.1.8

0.1.7

0.1.6

License

Project details

Verified details

Maintainers

Meta

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes