Local, embedding-based LLM model equivalence scoring for migration validation
MERIDIAN
Model Equivalence and Regression via Intent Drift In AI Networks
A lightweight Python library for validating LLM model equivalence when a vendor deprecates a model and you need to migrate to a replacement.
The Problem
When OpenAI deprecates gpt-4-0613 or Anthropic retires claude-2, enterprise teams have no established, reusable methodology to validate that the replacement produces semantically equivalent outputs for their specific workload. Traditional software testing compares exact outputs, which is useless for non-deterministic LLM responses. Existing benchmarks (MMLU, HELM) measure absolute capability, not relative equivalence between two specific models on your use case.
How MERIDIAN Is Different
Recent work (arXiv:2604.27082, arXiv:2507.05573, arXiv:2604.27789) describes migration validation processes using LLM-as-judge evaluation or human review. MERIDIAN takes a different approach:
| | Existing approaches | MERIDIAN |
|---|---|---|
| Scoring method | LLM-as-judge or human eval | Sentence-transformer cosine similarity |
| Cloud dependency | Requires API calls to score | Runs entirely locally |
| Cost | Per-token API cost to evaluate | Free after model download |
| Reproducibility | Non-deterministic (LLM judge) | Deterministic |
| Framing | Evaluation problem | Regression testing problem |
| Format | Research process descriptions | Reusable open-source library |
Core insight: embed the old and new model outputs with a sentence-transformer, compute the cosine similarity of each pair, and flag pairs that fall below a drift threshold. This is the same technique as canvas-heal (UI locator healing), applied to a different problem surface.
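The scoring step can be illustrated without downloading any model. The sketch below is not MERIDIAN's code: `cosine_similarity` is a hypothetical helper, and the short vectors are toy stand-ins for the much longer embeddings a sentence-transformer would produce.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings" (assumption: real ones come from a
# sentence-transformer and have hundreds of dimensions).
old_vec = [1.0, 2.0, 3.0]
new_vec = [2.0, 4.0, 6.0]     # same direction as old_vec, different length
other_vec = [3.0, 0.0, -1.0]  # orthogonal to old_vec

print(cosine_similarity(old_vec, new_vec))    # close to 1: same direction
print(cosine_similarity(old_vec, other_vec))  # 0: unrelated direction
```

Because cosine similarity ignores vector length, two outputs that point in the same direction in embedding space score near 1 regardless of magnitude.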
Three-Tier Gate
                    Cosine Similarity
─────────────────────────────────────────────────────────
0.0 ──────────── 0.75 ──────────── 0.92 ──────────── 1.0
      DRIFTED            REVIEW          EQUIVALENT
       (flag)         (human eye)       (auto-pass)
Thresholds are configurable. Defaults (0.92 / 0.75) are starting points — calibrate them against a small human-labeled set for your domain. See the accompanying paper for a calibration procedure derived from the deepseek-chat (V3) → deepseek-reasoner (R1) empirical study.
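As a rough sketch of how such a gate behaves, the function below maps a similarity score to one of the three tiers. `classify` is a hypothetical name, not MERIDIAN's API, and treating a score exactly on a threshold as belonging to the higher tier is an assumption.

```python
def classify(similarity: float,
             equivalent_threshold: float = 0.92,
             review_threshold: float = 0.75) -> str:
    """Map a cosine similarity to EQUIVALENT, REVIEW, or DRIFTED."""
    if review_threshold > equivalent_threshold:
        raise ValueError("review_threshold must not exceed equivalent_threshold")
    # Assumption: boundaries are inclusive on the higher tier.
    if similarity >= equivalent_threshold:
        return "EQUIVALENT"
    if similarity >= review_threshold:
        return "REVIEW"
    return "DRIFTED"


for score in (0.97, 0.80, 0.40):
    print(score, classify(score))
```

Lowering `equivalent_threshold` auto-passes more pairs; raising `review_threshold` sends more borderline pairs straight to the drifted bucket.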
Installation
pip install meridian-regression
Or from source:
git clone https://github.com/mandavillivijay/meridian
cd meridian
pip install -e ".[dev]"
Quickstart
1. Build your golden dataset
Create a JSON file with outputs from both models for each prompt:
[
  {
    "prompt": "What is the capital of France?",
    "intent": "factual",
    "old_output": "The capital of France is Paris.",
    "new_output": "Paris is the capital city of France."
  }
]
Intent categories: factual, generative, classification, structured_output.
Run your old model and new model on the same prompts, save the outputs. MERIDIAN doesn't call any APIs — you bring the outputs.
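Since you assemble this file yourself, a quick pre-check can catch malformed records before a run. The sketch below uses only the field names and intent categories listed above; `validate_records` is a hypothetical helper, and whether MERIDIAN's own loader performs equivalent checks is not something this example assumes.

```python
import json

REQUIRED_FIELDS = ("prompt", "intent", "old_output", "new_output")
INTENTS = {"factual", "generative", "classification", "structured_output"}


def validate_records(records: list) -> list:
    """Return human-readable problems; an empty list means no issues found."""
    problems = []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict):
            problems.append(f"record {i}: expected an object, got {type(rec).__name__}")
            continue
        # Every required field must be a non-empty string.
        for field in REQUIRED_FIELDS:
            value = rec.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"record {i}: missing or empty '{field}'")
        # A present intent must be one of the documented categories.
        intent = rec.get("intent")
        if isinstance(intent, str) and intent.strip() and intent not in INTENTS:
            problems.append(f"record {i}: unknown intent '{intent}'")
    return problems


sample = json.loads("""
[
  {"prompt": "What is the capital of France?", "intent": "factual",
   "old_output": "The capital of France is Paris.",
   "new_output": "Paris is the capital city of France."},
  {"prompt": "Summarise this ticket.", "intent": "summary",
   "old_output": "Customer cannot log in.", "new_output": ""}
]
""")

for problem in validate_records(sample):
    print(problem)
```

Here the first record passes and the second is reported twice: once for the empty `new_output` and once for the unlisted intent.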
2. Run the pipeline
from meridian.runner import run
report = run("datasets/my_golden_set.json")
print(report.summary)
# "94.0% of outputs are semantically equivalent, 4.0% show minor drift
# requiring human review, 2.0% show significant drift (regression flagged)."
3. Use the report
print(f"Equivalent: {report.equivalent_pct}%")
print(f"Wilson 95% CI: [{report.wilson_lower:.3f}, {report.wilson_upper:.3f}]")
JSON and markdown reports are written to reports/ automatically.
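For readers unfamiliar with the Wilson score interval, the sketch below computes one from a count of passing pairs. It is a standalone illustration, not MERIDIAN's implementation: `wilson_interval` is a hypothetical helper, and it is an assumption that the reported interval covers the proportion of equivalent outputs.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple:
    """Wilson score interval for a binomial proportion (z=1.959964 is ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)


# Example: 47 of 50 sampled pairs scored as equivalent.
lower, upper = wilson_interval(47, 50)
print(f"Wilson 95% CI: [{lower:.3f}, {upper:.3f}]")
```

Unlike the simple normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly when the observed proportion is close to 0 or 1, which is the common case for a migration that mostly passes.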
Advanced Usage
from meridian.runner import run
report = run(
    "datasets/my_golden_set.json",
    sample_n=50,                 # stratified sample of 50 prompts
    seed=42,                     # reproducible sampling
    equivalent_threshold=0.90,
    review_threshold=0.70,
    report_stem="sonnet_migration_v2",
)
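To make the `sample_n` and `seed` options concrete, here is one way a seeded stratified sample can work. This is a sketch under assumptions, not the library's `stratified_sample`: it stratifies on the `intent` field and allocates slots proportionally using largest remainders, and `stratified_sample_sketch` is a hypothetical name.

```python
import random
from collections import Counter, defaultdict


def stratified_sample_sketch(records, n, seed=None, key="intent"):
    """Draw n records while keeping each stratum's share of the dataset."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n >= len(records):
        return list(records)
    rng = random.Random(seed)  # private RNG, so the seed fully fixes the draw
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    # Proportional quotas, rounded down, then top up by largest remainder
    # so the quotas sum to exactly n.
    exact = {k: n * len(v) / len(records) for k, v in groups.items()}
    quota = {k: int(share) for k, share in exact.items()}
    leftover = n - sum(quota.values())
    by_remainder = sorted(exact, key=lambda k: exact[k] - quota[k], reverse=True)
    for k in by_remainder[:leftover]:
        quota[k] += 1
    sample = []
    for k in sorted(groups):  # fixed stratum order keeps output reproducible
        sample.extend(rng.sample(groups[k], quota[k]))
    return sample


records = (
    [{"id": i, "intent": "factual"} for i in range(60)]
    + [{"id": i, "intent": "generative"} for i in range(60, 90)]
    + [{"id": i, "intent": "classification"} for i in range(90, 100)]
)
picked = stratified_sample_sketch(records, 10, seed=42)
print(Counter(rec["intent"] for rec in picked))
```

With a 60/30/10 split and `n=10`, each intent keeps its share (6, 3, and 1 records), and repeating the call with the same seed returns the same records.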
Using modules directly
from meridian.sampler import load, stratified_sample
from meridian.scorer import DriftScorer
from meridian.reporter import Reporter
records = load("datasets/my_golden_set.json")
records = stratified_sample(records, n=50, seed=42)
scorer = DriftScorer()
results = scorer.score_all(records)
reporter = Reporter()
report = reporter.build(results)
reporter.write(report, stem="my_run")
Bringing your own adapter
If you want to populate outputs programmatically rather than from a JSON file, implement the ModelAdapter protocol:
from meridian.adapters.base import ModelAdapter
class MyAdapter:
    def complete(self, prompt: str) -> str:
        # call your model here
        ...

    def name(self) -> str:
        return "my-model-v2"
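One possible way to use two such adapters is to generate the golden dataset records directly. In the sketch below, `ModelAdapterSketch` is a local stand-in that mirrors only the two methods shown above (the real protocol lives in `meridian.adapters.base`), while `EchoAdapter` and `build_records` are hypothetical names invented for the example.

```python
import json
from typing import Protocol


class ModelAdapterSketch(Protocol):
    """Local mirror of the two methods shown above (assumption)."""

    def complete(self, prompt: str) -> str: ...

    def name(self) -> str: ...


class EchoAdapter:
    """Stand-in for a real model client; it just decorates the prompt."""

    def __init__(self, label: str, prefix: str) -> None:
        self._label = label
        self._prefix = prefix

    def complete(self, prompt: str) -> str:
        return f"{self._prefix}{prompt}"

    def name(self) -> str:
        return self._label


def build_records(prompts, old: ModelAdapterSketch, new: ModelAdapterSketch) -> list:
    """Run both adapters on each (prompt, intent) pair and collect records."""
    return [
        {
            "prompt": prompt,
            "intent": intent,
            "old_output": old.complete(prompt),
            "new_output": new.complete(prompt),
        }
        for prompt, intent in prompts
    ]


old_model = EchoAdapter("old-model", "Answer: ")
new_model = EchoAdapter("new-model", "Response: ")
golden = build_records(
    [("What is the capital of France?", "factual")], old_model, new_model
)
print(json.dumps(golden, indent=2))
```

The resulting list has the same shape as the golden dataset JSON, so it can be written to a file and passed to the pipeline like any hand-built dataset.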
Project Structure
meridian/
├── meridian/
│   ├── models.py       # Pydantic data models
│   ├── embedder.py     # Sentence-transformer wrapper (singleton)
│   ├── scorer.py       # Three-tier drift gate
│   ├── reporter.py     # Aggregate verdict + JSON/markdown output
│   ├── sampler.py      # Dataset loading + stratified sampling
│   ├── runner.py       # End-to-end pipeline entry point
│   └── adapters/
│       └── base.py     # ModelAdapter Protocol (extension point)
├── datasets/           # Example golden datasets
├── reports/            # Generated reports
└── tests/              # pytest suite (106 tests)
Running Tests
pytest
Author
Vijay Mandavilli — Quality Engineering Lead, Cognida AI, Hyderabad, India
License
MIT