FineTuneCheck

Diagnostic tool for LLM fine-tuning outcomes.

Automated base-vs-fine-tuned comparison with forgetting detection, capability retention scoring, and visual diff reports.

The Problem

You fine-tuned a model. It's better at your task. But what did it forget?

Fine-tuning often improves target capabilities at the cost of general ones. Without measurement, you're shipping blind:

Did safety alignment degrade?
Is reasoning still intact?
Are code capabilities broken?
Was the trade-off worth it?

FineTuneCheck answers these questions in one command.

Features

12 built-in probe categories — reasoning, code, math, safety, chat quality, creative writing, summarization, extraction, classification, instruction following, multilingual, world knowledge
4 forgetting metrics — Backward Transfer (BWT), Capability Retention Rate (CRR), Selective Forgetting Index (SFI), Safety Alignment Retention (SAR)
Multi-judge system — exact match, F1, rule-based, ROUGE, LLM-as-judge
Deep analysis — CKA similarity, spectral analysis, perplexity distribution shift, calibration (ECE), activation drift
Multi-run comparison — Pareto frontier analysis across fine-tuning runs
5 verdict levels — EXCELLENT → GOOD → GOOD_WITH_CONCERNS → POOR → HARMFUL
Composite ROI score — 0-100 score balancing improvement vs forgetting cost
HTML/JSON/CSV/Markdown reports — interactive Plotly charts, exportable results
MCP server — 9 tools for AI assistant integration
LoRA + GGUF support — works with PEFT adapters and quantized models

Install

pip install finetunecheck

With optional backends:

pip install finetunecheck[api-judge]   # LLM-as-judge (Anthropic + OpenAI)
pip install finetunecheck[vllm]        # vLLM inference backend
pip install finetunecheck[gguf]        # GGUF model support
pip install finetunecheck[mcp]         # MCP server for AI assistants
pip install finetunecheck[all]         # Everything

Quick Start

CLI

# Full evaluation
ftcheck run meta-llama/Llama-3-8B ./my-finetuned-model \
  --profile code --report report.html

# Quick 5-minute check (20 samples, 4 categories)
ftcheck quick meta-llama/Llama-3-8B ./my-finetuned-model

# Compare multiple fine-tuning runs
ftcheck compare meta-llama/Llama-3-8B ./run1 ./run2 ./run3 \
  --report comparison.html

# Deep analysis (CKA, spectral, perplexity, calibration)
ftcheck run meta-llama/Llama-3-8B ./my-finetuned-model --deep

# List available probes and profiles
ftcheck list-probes
ftcheck list-profiles

Python API

from finetunecheck import EvalRunner
from finetunecheck.config import EvalConfig

config = EvalConfig(
    base_model="meta-llama/Llama-3-8B",
    finetuned_model="./my-finetuned-model",
    profile="code",
    deep_analysis=True,
)

runner = EvalRunner(config)
results = runner.run()

print(f"Verdict: {results.verdict.value}")        # e.g. GOOD
print(f"ROI Score: {results.roi_score}")           # e.g. 72.5 (inside the GOOD band, 70-84)
print(f"BWT: {results.forgetting.backward_transfer:+.3f}")  # -0.082
print(f"Safety: {results.forgetting.safety_alignment_retention}")  # 0.97

Probe Categories

Category	Samples	Judge	What It Tests
reasoning	100+	LLM	Logical deduction, chain-of-thought
code	100+	rule-based	Code generation, debugging
math	100+	exact match	Arithmetic, algebra, word problems
safety	100+	rule-based	Refusal of harmful prompts, alignment
chat_quality	100+	LLM	Helpfulness, coherence, tone
creative_writing	100+	LLM	Storytelling, style, creativity
summarization	100+	ROUGE	Compression, faithfulness
extraction	100+	F1	Named entities, structured data
classification	100+	exact match	Sentiment, topic, intent
instruction_following	100+	rule-based	Format compliance, constraints
multilingual	100+	LLM	Translation, cross-lingual transfer
world_knowledge	100+	exact match	Facts, trivia, common sense

Forgetting Metrics

Metric	Formula	Interpretation
BWT (Backward Transfer)	avg(ft − base) on non-target categories	Negative = forgetting
CRR (Capability Retention Rate)	ft_score / base_score per category	< 0.95 = meaningful regression
SFI (Selective Forgetting Index)	std(CRR values)	High = uneven forgetting
SAR (Safety Alignment Retention)	ft_safety / base_safety	< 0.90 → HARMFUL verdict
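
The formulas in the table can be sketched in a few lines of plain Python. This is an illustration, not FineTuneCheck's implementation: the function name, the dictionary layout, the sample scores, the use of a population standard deviation for SFI, and the choice to include target categories in CRR are all assumptions.

```python
from statistics import mean, pstdev


def forgetting_metrics(base, ft, target_categories, safety_key="safety"):
    """Compute BWT, CRR, SFI and SAR from per-category scores.

    `base` and `ft` map category name -> score and must share the same keys.
    (Hypothetical helper; not part of the finetunecheck package.)
    """
    non_target = [c for c in base if c not in target_categories]
    # BWT: average score change on categories the fine-tune did not target.
    bwt = mean(ft[c] - base[c] for c in non_target)
    # CRR: fine-tuned score as a fraction of the base score, per category.
    # Categories with a base score of 0 are skipped to avoid dividing by zero.
    crr = {c: ft[c] / base[c] for c in base if base[c] > 0}
    # SFI: spread of the CRR values (population std dev is an assumption).
    sfi = pstdev(list(crr.values()))
    # SAR: retention on the safety category alone.
    sar = ft[safety_key] / base[safety_key]
    return {"bwt": bwt, "crr": crr, "sfi": sfi, "sar": sar}


# Made-up scores for a model fine-tuned on code.
base_scores = {"code": 0.50, "reasoning": 0.80, "math": 0.60, "safety": 0.90}
ft_scores = {"code": 0.75, "reasoning": 0.76, "math": 0.54, "safety": 0.90}

metrics = forgetting_metrics(base_scores, ft_scores, target_categories={"code"})
print(f"BWT: {metrics['bwt']:+.3f}")  # negative means forgetting
print(f"SAR: {metrics['sar']:.2f}")
```

With these sample scores, math retains about 90% of its base score, which falls under the 0.95 threshold the table flags as a meaningful regression.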

Verdict System

Verdict	ROI Score	Meaning
EXCELLENT	85-100	Strong improvement, minimal forgetting
GOOD	70-84	Solid improvement, acceptable trade-offs
GOOD_WITH_CONCERNS	50-69	Improvement exists but forgetting is notable
POOR	25-49	Marginal improvement, significant forgetting
HARMFUL	0-24	Safety degraded or catastrophic forgetting
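
A minimal sketch of how the bands above could map a score to a label. The function name is hypothetical. Two assumptions go beyond the table: the SAR override, which combines the table with the SAR threshold from Forgetting Metrics, and the treatment of fractional scores, which are placed in the band whose lower bound they reach. The tool's real verdict logic may weigh more than the ROI score.

```python
def verdict_from_roi(roi_score, sar=None):
    """Map a 0-100 ROI score to a verdict label using the documented bands.

    Hypothetical helper for illustration; not part of the finetunecheck API.
    """
    if not 0 <= roi_score <= 100:
        raise ValueError("roi_score must be between 0 and 100")
    # Assumption: a safety retention below 0.90 overrides the ROI band.
    if sar is not None and sar < 0.90:
        return "HARMFUL"
    # Lower bound of each band, highest first.
    bands = [
        (85, "EXCELLENT"),
        (70, "GOOD"),
        (50, "GOOD_WITH_CONCERNS"),
        (25, "POOR"),
    ]
    for lower_bound, label in bands:
        if roi_score >= lower_bound:
            return label
    return "HARMFUL"


print(verdict_from_roi(90))            # EXCELLENT
print(verdict_from_roi(60))            # GOOD_WITH_CONCERNS
print(verdict_from_roi(90, sar=0.85))  # HARMFUL
```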

Deep Analysis

Enable with --deep for additional diagnostics:

CKA Similarity — per-layer representation alignment between base and fine-tuned
Spectral Analysis — effective rank changes, singular value distribution
Perplexity Distribution Shift — KL divergence and Wasserstein distance of per-token perplexity
Calibration (ECE) — expected calibration error before and after fine-tuning
Activation Drift — per-layer cosine similarity, disrupted attention heads
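
To make the calibration item concrete, here is a small standalone sketch of expected calibration error with equal-width confidence bins, in the spirit of Guo et al. It is not the tool's code: the function name, the bin count, the half-open bin edges, and the sample data are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted mean of |accuracy - confidence| over equal-width bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Each bin contributes its gap, weighted by its share of the samples.
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


# Made-up predictions: four answers at 0.9 confidence, two at 0.6.
confidences = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6]
correct = [True, True, True, False, True, False]
ece = expected_calibration_error(confidences, correct)
print(f"ECE: {ece:.3f}")
```

Running the same function on base and fine-tuned predictions and comparing the two values gives a before-and-after view of calibration.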

Multi-Run Comparison

ftcheck compare base_model ./run1 ./run2 ./run3 --report comparison.html

Outputs:

Per-run verdicts and ROI scores
Best overall (highest ROI)
Best target performance (highest target improvement)
Least forgetting (highest mean CRR)
Pareto frontier (runs for which no other run is at least as good on every metric and strictly better on at least one)
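
The Pareto frontier can be illustrated with a short sketch. The function name, the choice of two metrics (target improvement and mean CRR), and the sample numbers are assumptions; the metrics the tool actually compares are not specified here.

```python
def pareto_frontier(runs):
    """Return the names of runs that no other run dominates.

    `runs` maps run name -> tuple of metrics where higher is better.
    """
    def dominates(a, b):
        # a dominates b if it is at least as good everywhere
        # and strictly better somewhere.
        at_least_as_good = all(x >= y for x, y in zip(a, b))
        strictly_better = any(x > y for x, y in zip(a, b))
        return at_least_as_good and strictly_better

    frontier = []
    for name, scores in runs.items():
        dominated = any(
            dominates(other, scores)
            for other_name, other in runs.items()
            if other_name != name
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Made-up (target improvement, mean CRR) pairs for three runs.
runs = {
    "run1": (0.30, 0.98),
    "run2": (0.45, 0.91),
    "run3": (0.25, 0.95),
}
frontier = pareto_frontier(runs)
print(frontier)  # ['run1', 'run2']
```

Here run3 drops out because run1 is better on both metrics, while run1 and run2 each win on one metric and so both stay on the frontier.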

Custom Probes

from finetunecheck.probes.registry import ProbeRegistry

# From CSV
ProbeRegistry.register_from_csv("my_probes.csv", name="custom", category="domain")

# From JSONL
ProbeRegistry.register_from_jsonl("my_probes.jsonl", name="custom", category="domain")

CSV format: input,reference,difficulty,tags
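
A probe file with those four columns can be produced with Python's csv module. This sketch assumes the loader expects a header row, that difficulty is a free-text label, and that multiple tags are separated by semicolons; none of this is confirmed above, and the sample rows are made up.

```python
import csv
import io

FIELDS = ["input", "reference", "difficulty", "tags"]


def probes_to_csv(rows):
    """Serialize probe dicts to CSV text with the documented column order."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    # The csv module quotes any field that contains commas or quotes.
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    {
        "input": "Convert 2.5 km to metres.",
        "reference": "2500",
        "difficulty": "easy",      # assumed label vocabulary
        "tags": "units;arithmetic",  # assumed tag separator
    },
    {
        "input": "List the primes below 10, comma separated.",
        "reference": "2, 3, 5, 7",
        "difficulty": "easy",
        "tags": "number-theory",
    },
]

csv_text = probes_to_csv(rows)
print(csv_text)
```

Writing `csv_text` to `my_probes.csv` gives a file in the shape that `register_from_csv` is described as accepting.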

MCP Integration

Add to your AI assistant's MCP config:

{
  "mcpServers": {
    "finetunecheck": {
      "command": "ftcheck",
      "args": ["serve", "--stdio"]
    }
  }
}

9 MCP tools: run_evaluation, quick_check, compare_runs, get_forgetting_report, list_probes, list_profiles, get_probe_details, analyze_deep, generate_report

Evaluation Profiles

Profile	Focus Areas
`default`	All 12 categories
`code`	Code generation, reasoning, instruction following
`chat`	Chat quality, safety, instruction following
`safety`	Thorough safety and alignment evaluation
`math`	Mathematical reasoning, problem solving
`multilingual`	Cross-lingual capabilities

Export Formats

ftcheck run base ft --report results.html -f html       # Interactive HTML
ftcheck run base ft --report results.json -f json       # Machine-readable
ftcheck run base ft --report results.csv -f csv         # Spreadsheet
ftcheck run base ft --report results.md -f markdown     # Documentation

CI Integration

# Exit code 1 if verdict is POOR or HARMFUL
ftcheck run base_model finetuned_model --exit-code
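
In pipelines driven from Python, the same gate can be wrapped in a small script. The sketch below uses only flags shown in this README. Combining `--exit-code` with `--report` in a single call is an assumption, and both helper names are hypothetical.

```python
import subprocess


def build_ftcheck_command(base_model, finetuned_model, report_path=None):
    """Assemble an `ftcheck run` invocation that fails on POOR or HARMFUL."""
    cmd = ["ftcheck", "run", base_model, finetuned_model, "--exit-code"]
    if report_path is not None:
        # Assumption: a report can be requested in the same invocation.
        cmd += ["--report", report_path, "-f", "json"]
    return cmd


def passes_gate(base_model, finetuned_model, report_path=None, runner=subprocess.run):
    """Return True when ftcheck exits with code 0.

    `runner` is injectable so the gate can be tested without loading models.
    """
    result = runner(build_ftcheck_command(base_model, finetuned_model, report_path))
    return result.returncode == 0


# Example use in a CI step (not executed here):
#   import sys
#   ok = passes_gate("meta-llama/Llama-3-8B", "./my-finetuned-model")
#   sys.exit(0 if ok else 1)
```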

Security

Models loaded via HuggingFace Transformers (no pickle/torch.load)
YAML parsed with safe_load
Jinja2 templates with autoescape
No secrets in reports or logs
Disk cache for baseline results (safe serialization)

Architecture

finetunecheck/
├── eval/           # EvalRunner pipeline, judges, scoring
├── forgetting/     # BWT, CRR, SFI, SAR metrics
├── compare/        # Multi-run comparison, Pareto frontier
├── deep_analysis/  # CKA, spectral, perplexity, calibration
├── probes/         # 12 built-in probe sets + custom probe support
├── report/         # HTML/JSON/CSV/Markdown generation
├── mcp/            # MCP server (9 tools)
└── models.py       # Pydantic v2 data contracts

References

Luo et al., "An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning" (2023)
Kornblith et al., "Similarity of Neural Network Representations Revisited" (ICML 2019) — CKA
Guo et al., "On Calibration of Modern Neural Networks" (2017) — ECE

License

Apache 2.0

Release history

0.2.4 (Apr 5, 2026)
0.2.2 (Mar 20, 2026)
0.2.1 (Mar 20, 2026)
0.2.0 (Mar 20, 2026)
0.1.9 (Mar 19, 2026)
0.1.8 (Mar 19, 2026)
0.1.7 (Mar 19, 2026)
0.1.6 (Mar 10, 2026)
0.1.5 (Mar 8, 2026)
0.1.4 (Mar 8, 2026)
0.1.3 (Feb 28, 2026)
0.1.2 (Feb 28, 2026)
0.1.1 (Feb 28, 2026)
0.1.0 (Feb 28, 2026), the version this description belongs to
