Automated base vs fine-tuned LLM comparison with forgetting detection, capability retention scoring, and visual diff reports.

FineTuneCheck

Diagnostic tool for LLM fine-tuning outcomes.

Requires Python 3.10+.


The Problem

You fine-tuned a model. It's better at your task. But what did it forget?

Fine-tuning usually improves target capabilities, often at the cost of general ones. Without measuring both sides, you're shipping blind:

  • Did safety alignment degrade?
  • Is reasoning still intact?
  • Are code capabilities broken?
  • Was the trade-off worth it?

FineTuneCheck answers these questions in one command.

Features

  • 12 built-in probe categories — reasoning, code, math, safety, chat quality, creative writing, summarization, extraction, classification, instruction following, multilingual, world knowledge
  • 4 forgetting metrics — Backward Transfer (BWT), Capability Retention Rate (CRR), Selective Forgetting Index (SFI), Safety Alignment Retention (SAR)
  • Multi-judge system — exact match, F1, rule-based, ROUGE, LLM-as-judge
  • Deep analysis — CKA similarity, spectral analysis, perplexity distribution shift, calibration (ECE), activation drift
  • Multi-run comparison — Pareto frontier analysis across fine-tuning runs
  • 5 verdict levels — EXCELLENT → GOOD → GOOD_WITH_CONCERNS → POOR → HARMFUL
  • Composite ROI score — 0-100 score balancing improvement vs forgetting cost
  • HTML/JSON/CSV/Markdown reports — interactive Plotly charts, exportable results
  • MCP server — 9 tools for AI assistant integration
  • LoRA + GGUF support — works with PEFT adapters and quantized models

Install

pip install finetunecheck

With optional backends:

pip install finetunecheck[api-judge]   # LLM-as-judge (Anthropic + OpenAI)
pip install finetunecheck[vllm]        # vLLM inference backend
pip install finetunecheck[gguf]        # GGUF model support
pip install finetunecheck[mcp]         # MCP server for AI assistants
pip install finetunecheck[all]         # Everything

Quick Start

CLI

# Full evaluation
ftcheck run meta-llama/Llama-3-8B ./my-finetuned-model \
  --profile code --report report.html

# Quick 5-minute check (20 samples, 4 categories)
ftcheck quick meta-llama/Llama-3-8B ./my-finetuned-model

# Compare multiple fine-tuning runs
ftcheck compare meta-llama/Llama-3-8B ./run1 ./run2 ./run3 \
  --report comparison.html

# Deep analysis (CKA, spectral, perplexity, calibration)
ftcheck run meta-llama/Llama-3-8B ./my-finetuned-model --deep

# List available probes and profiles
ftcheck list-probes
ftcheck list-profiles

Python API

from finetunecheck import EvalRunner
from finetunecheck.config import EvalConfig

config = EvalConfig(
    base_model="meta-llama/Llama-3-8B",
    finetuned_model="./my-finetuned-model",
    profile="code",
    deep_analysis=True,
)

runner = EvalRunner(config)
results = runner.run()

# The values in the comments are illustrative, not real output.
print(f"Verdict: {results.verdict.value}")        # e.g. GOOD
print(f"ROI Score: {results.roi_score}")           # e.g. 72.5
print(f"BWT: {results.forgetting.backward_transfer:+.3f}")  # e.g. -0.082
print(f"Safety: {results.forgetting.safety_alignment_retention}")  # e.g. 0.97

Probe Categories

| Category | Samples | Judge | What It Tests |
|---|---|---|---|
| reasoning | 100+ | LLM | Logical deduction, chain-of-thought |
| code | 100+ | rule-based | Code generation, debugging |
| math | 100+ | exact match | Arithmetic, algebra, word problems |
| safety | 100+ | rule-based | Refusal of harmful prompts, alignment |
| chat_quality | 100+ | LLM | Helpfulness, coherence, tone |
| creative_writing | 100+ | LLM | Storytelling, style, creativity |
| summarization | 100+ | ROUGE | Compression, faithfulness |
| extraction | 100+ | F1 | Named entities, structured data |
| classification | 100+ | exact match | Sentiment, topic, intent |
| instruction_following | 100+ | rule-based | Format compliance, constraints |
| multilingual | 100+ | LLM | Translation, cross-lingual transfer |
| world_knowledge | 100+ | exact match | Facts, trivia, common sense |
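
To make the "exact match" and "F1" judge columns concrete, here is a minimal sketch of how such scorers are commonly written. It is not FineTuneCheck's own implementation: the function names and the normalization rules (lowercasing, stripping punctuation) are assumptions chosen for illustration.

```python
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, split on whitespace.

    The normalization rules are an assumption for this sketch.
    """
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both sequences.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Exact match suits short, closed-form answers (math, classification), while token F1 gives partial credit when an extraction contains the reference plus extra words.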

Forgetting Metrics

| Metric | Formula | Interpretation |
|---|---|---|
| BWT (Backward Transfer) | avg(ft − base) on non-target categories | Negative = forgetting |
| CRR (Capability Retention Rate) | ft_score / base_score per category | < 0.95 = meaningful regression |
| SFI (Selective Forgetting Index) | std(CRR values) | High = uneven forgetting |
| SAR (Safety Alignment Retention) | ft_safety / base_safety | < 0.90 → HARMFUL verdict |
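
The four formulas above can be sketched directly from per-category scores. This is an illustration under stated assumptions, not the library's code: the function name, the dictionary layout, the use of population standard deviation for SFI, and computing SFI over all categories (rather than only non-target ones) are all choices made here.

```python
from statistics import mean, pstdev


def forgetting_metrics(base, finetuned, target_categories, safety_category="safety"):
    """Compute BWT, CRR, SFI, and SAR from {category: score} dictionaries.

    Hypothetical helper; scores are assumed to lie in (0, 1].
    """
    non_target = [c for c in base if c not in target_categories]
    # BWT: average score change on categories the fine-tune did not target.
    bwt = mean(finetuned[c] - base[c] for c in non_target)
    # CRR: per-category ratio of fine-tuned score to base score.
    crr = {c: finetuned[c] / base[c] for c in base if base[c] > 0}
    # SFI: spread of the CRR values (population std is an assumption).
    sfi = pstdev(list(crr.values()))
    # SAR: the retention ratio of the safety category.
    sar = crr[safety_category]
    # Categories below the 0.95 threshold from the table above.
    regressed = [c for c, r in crr.items() if r < 0.95]
    return {"bwt": bwt, "crr": crr, "sfi": sfi, "sar": sar, "regressed": regressed}


# Made-up scores for a code-focused fine-tune.
base_scores = {"code": 0.50, "reasoning": 0.80, "math": 0.60, "safety": 0.90}
ft_scores = {"code": 0.75, "reasoning": 0.78, "math": 0.54, "safety": 0.90}

metrics = forgetting_metrics(base_scores, ft_scores, target_categories={"code"})
print(f"BWT: {metrics['bwt']:+.3f}")
print(f"SAR: {metrics['sar']:.2f}")
print(f"Regressed: {metrics['regressed']}")
```

In this made-up example the code score rises while math drops to 90% of its base score, so BWT is negative and math is flagged as a meaningful regression.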

Verdict System

| Verdict | ROI Score | Meaning |
|---|---|---|
| EXCELLENT | 85-100 | Strong improvement, minimal forgetting |
| GOOD | 70-84 | Solid improvement, acceptable trade-offs |
| GOOD_WITH_CONCERNS | 50-69 | Improvement exists but forgetting is notable |
| POOR | 25-49 | Marginal improvement, significant forgetting |
| HARMFUL | 0-24 | Safety degraded or catastrophic forgetting |
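
Read together with the SAR rule from the Forgetting Metrics section, the table suggests a mapping like the one sketched below. The function name is hypothetical, and two details are assumptions: fractional scores are assigned by each band's lower bound, and a SAR below 0.90 overrides the ROI score. The actual verdict logic may weigh additional signals.

```python
def verdict_from_scores(roi_score: float, sar: float) -> str:
    """Map a 0-100 ROI score and a safety retention ratio to a verdict label."""
    if not 0 <= roi_score <= 100:
        raise ValueError("roi_score must be between 0 and 100")
    # Assumed override: degraded safety alignment is HARMFUL regardless of ROI.
    if sar < 0.90:
        return "HARMFUL"
    # Lower bounds of each band, highest first.
    bands = [
        (85, "EXCELLENT"),
        (70, "GOOD"),
        (50, "GOOD_WITH_CONCERNS"),
        (25, "POOR"),
    ]
    for lower_bound, label in bands:
        if roi_score >= lower_bound:
            return label
    return "HARMFUL"


print(verdict_from_scores(72.5, sar=0.97))
print(verdict_from_scores(95.0, sar=0.85))
```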

Deep Analysis

Enable with --deep for additional diagnostics:

  • CKA Similarity — per-layer representation alignment between base and fine-tuned
  • Spectral Analysis — effective rank changes, singular value distribution
  • Perplexity Distribution Shift — KL divergence and Wasserstein distance of per-token perplexity
  • Calibration (ECE) — expected calibration error before and after fine-tuning
  • Activation Drift — per-layer cosine similarity, disrupted attention heads
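
As one example of these diagnostics, expected calibration error can be sketched in a few lines. This follows the standard equal-width binning definition (Guo et al.) and is not the tool's own code; the function name, the bin count, and the left-inclusive bin edges are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between accuracy and confidence across bins.

    confidences: predicted probabilities in [0, 1] for the chosen answers.
    correct: booleans, True where the chosen answer was right.
    """
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_confidence = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Each bin contributes in proportion to the samples it holds.
        ece += (len(members) / n) * abs(accuracy - avg_confidence)
    return ece


# Made-up example: four answers at 95% confidence, three of them correct.
ece = expected_calibration_error([0.95, 0.95, 0.95, 0.95], [True, True, True, False])
print(f"ECE: {ece:.2f}")
```

Computing this once for the base model and once for the fine-tuned model shows whether fine-tuning made the model more overconfident.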

Multi-Run Comparison

ftcheck compare base_model ./run1 ./run2 ./run3 --report comparison.html

Outputs:

  • Per-run verdicts and ROI scores
  • Best overall (highest ROI)
  • Best target performance (highest target improvement)
  • Least forgetting (highest mean CRR)
  • Pareto frontier: runs that no other run dominates, meaning no other run is at least as good on every metric and strictly better on at least one
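
The Pareto frontier in the last item can be computed with a simple pairwise dominance check. The sketch below assumes every metric is "higher is better" and uses hypothetical run names and scores; it does not reflect how the compare command is implemented internally.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_frontier(runs):
    """Return the names of runs that no other run dominates.

    runs: {name: tuple of metrics}, all metrics higher-is-better.
    """
    return [
        name
        for name, scores in runs.items()
        if not any(
            dominates(other_scores, scores)
            for other_name, other_scores in runs.items()
            if other_name != name
        )
    ]


# Made-up metrics: (target improvement, mean CRR).
example_runs = {
    "run1": (0.30, 0.98),
    "run2": (0.45, 0.90),
    "run3": (0.25, 0.95),
}
frontier = pareto_frontier(example_runs)
print(frontier)
```

Here run3 is worse than run1 on both metrics, so it drops out. run1 and run2 trade target gain against retention, so both stay on the frontier.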

Custom Probes

from finetunecheck.probes.registry import ProbeRegistry

# From CSV
ProbeRegistry.register_from_csv("my_probes.csv", name="custom", category="domain")

# From JSONL
ProbeRegistry.register_from_jsonl("my_probes.jsonl", name="custom", category="domain")

CSV format: input,reference,difficulty,tags
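
A probe file with those four columns can be produced with the standard csv module. In this sketch the header row, the difficulty labels, and the semicolon-separated tags are assumptions, since the README only lists the column names. The helper name and example rows are hypothetical.

```python
import csv

PROBE_COLUMNS = ["input", "reference", "difficulty", "tags"]

# Made-up domain probes; the difficulty and tag values are illustrative only.
EXAMPLE_ROWS = [
    {
        "input": "What is the maximum dose of drug X for adults?",
        "reference": "4 g per day",
        "difficulty": "medium",
        "tags": "dosage;safety",
    },
    {
        "input": "Expand the abbreviation 'BID' on a prescription.",
        "reference": "twice a day",
        "difficulty": "easy",
        "tags": "terminology",
    },
]


def write_probe_csv(path, rows):
    """Write probe rows to a CSV file with the documented column order."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=PROBE_COLUMNS)
        writer.writeheader()
        writer.writerows(rows)


# Usage: write_probe_csv("my_probes.csv", EXAMPLE_ROWS), then register the
# file with ProbeRegistry.register_from_csv as shown above.
```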

MCP Integration

Add to your AI assistant's MCP config:

{
  "mcpServers": {
    "finetunecheck": {
      "command": "ftcheck",
      "args": ["serve", "--stdio"]
    }
  }
}

9 MCP tools: run_evaluation, quick_check, compare_runs, get_forgetting_report, list_probes, list_profiles, get_probe_details, analyze_deep, generate_report

Evaluation Profiles

| Profile | Focus Areas |
|---|---|
| default | All 12 categories |
| code | Code generation, reasoning, instruction following |
| chat | Chat quality, safety, instruction following |
| safety | Thorough safety and alignment evaluation |
| math | Mathematical reasoning, problem solving |
| multilingual | Cross-lingual capabilities |

Export Formats

ftcheck run base ft --report results.html -f html       # Interactive HTML
ftcheck run base ft --report results.json -f json       # Machine-readable
ftcheck run base ft --report results.csv -f csv         # Spreadsheet
ftcheck run base ft --report results.md -f markdown     # Documentation

CI Integration

# Exit code 1 if verdict is POOR or HARMFUL
ftcheck run base_model finetuned_model --exit-code
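
If the pipeline is driven from Python rather than a shell step, the same gate can be wrapped with subprocess. This is a sketch: the helper names are hypothetical, it uses only flags shown in this README, and it treats any non-zero exit code as a failed gate, which is an assumption about how errors other than a POOR or HARMFUL verdict are reported.

```python
import subprocess


def build_command(base_model, finetuned_model, report_path=None):
    """Assemble the ftcheck invocation used as a CI gate."""
    command = ["ftcheck", "run", base_model, finetuned_model, "--exit-code"]
    if report_path is not None:
        # Also keep a machine-readable report as a build artifact.
        command += ["--report", report_path, "-f", "json"]
    return command


def passes_gate(base_model, finetuned_model, report_path=None, runner=subprocess.run):
    """Run the evaluation and return True only when the exit code is 0.

    The runner parameter exists so the function can be tested without
    ftcheck installed.
    """
    completed = runner(build_command(base_model, finetuned_model, report_path), check=False)
    return completed.returncode == 0
```

A deployment script can then call passes_gate and stop the release when it returns False.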

Security

  • Models loaded via HuggingFace Transformers (no pickle/torch.load)
  • YAML parsed with safe_load
  • Jinja2 templates with autoescape
  • No secrets in reports or logs
  • Disk cache for baseline results (safe serialization)

Architecture

finetunecheck/
├── eval/           # EvalRunner pipeline, judges, scoring
├── forgetting/     # BWT, CRR, SFI, SAR metrics
├── compare/        # Multi-run comparison, Pareto frontier
├── deep_analysis/  # CKA, spectral, perplexity, calibration
├── probes/         # 12 built-in probe sets + custom probe support
├── report/         # HTML/JSON/CSV/Markdown generation
├── mcp/            # MCP server (9 tools)
└── models.py       # Pydantic v2 data contracts

References

  • Luo et al., "An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning" (2023)
  • Kornblith et al., "Similarity of Neural Network Representations Revisited" (ICML 2019) — CKA
  • Guo et al., "On Calibration of Modern Neural Networks" (2017) — ECE

License

Apache 2.0
