Local, embedding-based LLM model equivalence scoring for migration validation

MERIDIAN

Model Equivalence and Regression via Intent Drift In AI Networks

A lightweight Python library for validating LLM model equivalence when a vendor deprecates a model and you need to migrate to a replacement.

The Problem

When OpenAI deprecates gpt-4-0613 or Anthropic retires claude-2, enterprise teams have no established, reusable methodology to validate that the replacement produces semantically equivalent outputs for their specific workload. Traditional software testing asserts exact outputs, which is useless for non-deterministic LLM responses. Existing benchmarks (MMLU, HELM) measure absolute capability, not relative equivalence between two specific models on your use case.

How MERIDIAN Is Different

Recent work (arXiv:2604.27082, arXiv:2507.05573, arXiv:2604.27789) describes migration validation processes using LLM-as-judge evaluation or human review. MERIDIAN takes a different approach:

	Existing approaches	MERIDIAN
Scoring method	LLM-as-judge or human eval	Sentence-transformer cosine similarity
Cloud dependency	Requires API calls to score	Runs entirely locally
Cost	Per-token API cost to evaluate	Free after model download
Reproducibility	Non-deterministic (LLM judge)	Deterministic
Framing	Evaluation problem	Regression testing problem
Format	Research process descriptions	Reusable open-source library

Core insight: embed old and new model outputs using a sentence-transformer, compute the cosine similarity of each pair, and flag pairs that fall below a drift threshold. This is the same technique used by canvas-heal (UI locator healing), applied to a different problem surface.
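
To make the scoring step concrete, here is a minimal sketch of cosine similarity in plain Python. The toy vectors stand in for sentence-transformer embeddings, and `cosine_similarity` is a hypothetical helper written for illustration, not part of MERIDIAN's API.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings" (assumption: real ones would come from
# a sentence-transformer and have hundreds of dimensions).
old_vec = [0.2, 0.7, 0.1]
new_vec = [0.25, 0.65, 0.05]
similarity = cosine_similarity(old_vec, new_vec)
print(f"similarity = {similarity:.3f}")
```

Vectors pointing in nearly the same direction score close to 1.0, which is why paraphrases of the same answer tend to land in the high range.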

Three-Tier Gate

Cosine Similarity
─────────────────────────────────────────────────────────
0.0 ──────────── 0.75 ──────────── 0.92 ──────────── 1.0
     DRIFTED          REVIEW              EQUIVALENT
     (flag)         (human eye)          (auto-pass)

Thresholds are configurable. The defaults (0.92 for EQUIVALENT, 0.75 for REVIEW) are starting points; calibrate them against a small human-labeled set for your domain. See the accompanying paper for a calibration procedure derived from the deepseek-chat (V3) → deepseek-reasoner (R1) empirical study.
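
The gate in the diagram can be sketched as a small function. This is an illustration only: `classify` is a hypothetical name, and treating a score exactly on a boundary as belonging to the higher tier is an assumption, since the boundary behavior is not specified here.

```python
def classify(similarity: float,
             equivalent_threshold: float = 0.92,
             review_threshold: float = 0.75) -> str:
    """Map a cosine similarity score to one of the three tiers."""
    if review_threshold > equivalent_threshold:
        raise ValueError("review_threshold must not exceed equivalent_threshold")
    if similarity >= equivalent_threshold:  # assumption: boundaries are inclusive
        return "EQUIVALENT"
    if similarity >= review_threshold:
        return "REVIEW"
    return "DRIFTED"


for score in (0.97, 0.81, 0.40):
    print(score, classify(score))
```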

Installation

pip install meridian-regression

Or from source:

git clone https://github.com/mandavillivijay/meridian
cd meridian
pip install -e ".[dev]"

Quickstart

1. Build your golden dataset

Create a JSON file with outputs from both models for each prompt:

[
  {
    "prompt": "What is the capital of France?",
    "intent": "factual",
    "old_output": "The capital of France is Paris.",
    "new_output": "Paris is the capital city of France."
  }
]

Intent categories: factual, generative, classification, structured_output.

Run your old and new models on the same prompts and save the outputs. MERIDIAN doesn't call any APIs; you bring the outputs.
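
Before running the pipeline, it can help to sanity-check the file. The sketch below assumes that every record needs the four keys shown in the example and that `intent` must be one of the listed categories; `validate_records` is a hypothetical helper, not a MERIDIAN function.

```python
import json

REQUIRED_KEYS = {"prompt", "intent", "old_output", "new_output"}
INTENTS = {"factual", "generative", "classification", "structured_output"}


def validate_records(records: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means nothing was found."""
    problems = []
    for i, rec in enumerate(records):
        missing = REQUIRED_KEYS - rec.keys()
        if missing:
            problems.append(f"record {i}: missing keys {sorted(missing)}")
        if rec.get("intent") not in INTENTS:
            problems.append(f"record {i}: unknown intent {rec.get('intent')!r}")
    return problems


# The second record is deliberately malformed to show what gets reported.
sample = json.loads("""
[
  {"prompt": "What is the capital of France?", "intent": "factual",
   "old_output": "The capital of France is Paris.",
   "new_output": "Paris is the capital city of France."},
  {"prompt": "Summarize this ticket.", "intent": "summary",
   "old_output": "User cannot log in."}
]
""")
for problem in validate_records(sample):
    print(problem)
```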

2. Run the pipeline

from meridian.runner import run

report = run("datasets/my_golden_set.json")
print(report.summary)
# "94.0% of outputs are semantically equivalent, 4.0% show minor drift
#  requiring human review, 2.0% show significant drift (regression flagged)."

3. Use the report

print(f"Equivalent: {report.equivalent_pct}%")
print(f"Wilson 95% CI: [{report.wilson_lower:.3f}, {report.wilson_upper:.3f}]")

JSON and markdown reports are written to reports/ automatically.

Advanced Usage

from meridian.runner import run

report = run(
    "datasets/my_golden_set.json",
    sample_n=50,              # stratified sample of 50 prompts
    seed=42,                  # reproducible sampling
    equivalent_threshold=0.90,
    review_threshold=0.70,
    report_stem="sonnet_migration_v2",
)

Using modules directly

from meridian.sampler import load, stratified_sample
from meridian.scorer import DriftScorer
from meridian.reporter import Reporter

records = load("datasets/my_golden_set.json")
records = stratified_sample(records, n=50, seed=42)

scorer = DriftScorer()
results = scorer.score_all(records)

reporter = Reporter()
report = reporter.build(results)
reporter.write(report, stem="my_run")

Bringing your own adapter

If you want to populate outputs programmatically rather than from a JSON file, implement the ModelAdapter protocol:

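from meridian.adapters.base import ModelAdapter

# Assuming ModelAdapter is a typing.Protocol, MyAdapter satisfies it
# structurally; it does not need to inherit from ModelAdapter.
class MyAdapter:
    def complete(self, prompt: str) -> str:
        # call your model here and return its text output
        ...

    def name(self) -> str:
        return "my-model-v2"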
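
As a rough sketch of how two adapters could populate golden-dataset records, the example below defines a local stand-in protocol with the same two methods and a toy adapter that returns canned answers. `ModelAdapterSketch`, `CannedAdapter`, and `build_records` are all hypothetical names for illustration; how MERIDIAN itself consumes adapters is not shown here.

```python
import json
from typing import Protocol


class ModelAdapterSketch(Protocol):
    """Hypothetical stand-in for meridian.adapters.base.ModelAdapter."""

    def complete(self, prompt: str) -> str: ...

    def name(self) -> str: ...


class CannedAdapter:
    """Toy adapter that returns canned answers instead of calling a model."""

    def __init__(self, label: str, answers: dict[str, str]) -> None:
        self._label = label
        self._answers = answers

    def complete(self, prompt: str) -> str:
        return self._answers[prompt]

    def name(self) -> str:
        return self._label


def build_records(prompts: list[tuple[str, str]],
                  old: ModelAdapterSketch,
                  new: ModelAdapterSketch) -> list[dict]:
    """Run both adapters on each (prompt, intent) pair and pair up the outputs."""
    return [
        {"prompt": prompt, "intent": intent,
         "old_output": old.complete(prompt), "new_output": new.complete(prompt)}
        for prompt, intent in prompts
    ]


question = "What is the capital of France?"
old_model = CannedAdapter("old-model", {question: "The capital of France is Paris."})
new_model = CannedAdapter("new-model", {question: "Paris is the capital city of France."})
records = build_records([(question, "factual")], old_model, new_model)
print(json.dumps(records, indent=2))
```

The resulting list has the same shape as the golden dataset JSON, so it can be written to a file with `json.dump` and passed to the pipeline.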

Project Structure

meridian/
├── meridian/
│   ├── models.py       # Pydantic data models
│   ├── embedder.py     # Sentence-transformer wrapper (singleton)
│   ├── scorer.py       # Three-tier drift gate
│   ├── reporter.py     # Aggregate verdict + JSON/markdown output
│   ├── sampler.py      # Dataset loading + stratified sampling
│   ├── runner.py       # End-to-end pipeline entry point
│   └── adapters/
│       └── base.py     # ModelAdapter Protocol (extension point)
├── datasets/           # Example golden datasets
├── reports/            # Generated reports
└── tests/              # pytest suite (106 tests)

Running Tests

pytest

Author

Vijay Mandavilli — Quality Engineering Lead, Cognida AI, Hyderabad, India

License

MIT

Project details

Release history

This version

0.1.0

Jul 2, 2026

Download files

Source Distribution

meridian_regression-0.1.0.tar.gz (30.0 kB)

Uploaded Jul 2, 2026 Source

Built Distribution

meridian_regression-0.1.0-py3-none-any.whl (12.6 kB)

Uploaded Jul 2, 2026 Python 3

File details

Details for the file meridian_regression-0.1.0.tar.gz.

File metadata

Download URL: meridian_regression-0.1.0.tar.gz
Upload date: Jul 2, 2026
Size: 30.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for meridian_regression-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`87e0a1f1c225ad39090e82e2629f475e13893fff8a5820d4722f3efbb5a21df1`
MD5	`e25d135cbf8e739ebf1a744d3ea1c7fd`
BLAKE2b-256	`a78c891b37aa3c5cb3e69ac5584b35c52c95213b8da94c233cccde501a8c6db3`

File details

Details for the file meridian_regression-0.1.0-py3-none-any.whl.

File metadata

Download URL: meridian_regression-0.1.0-py3-none-any.whl
Upload date: Jul 2, 2026
Size: 12.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for meridian_regression-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2f85bfdf3bd9eabb1ae87b7aed64346404d0f6348fe42e5d091c6e26c02ce5d5`
MD5	`52637b46cb90438479665376d2f3f0e4`
BLAKE2b-256	`bbea6d18e421a5af087dd722a82385ca6156f162b8be7c2e85f4f2fe877bdacd`
