Dependabot for AI models — catch model-migration regressions before they hit production.

Modelpin

Dependabot for AI models. Know before the model breaks you.

A provider ships a new model (or retires the one you depend on). Modelpin replays your app's real behavior on the new model, decides whether anything actually regressed despite model randomness, and posts a PR-style report — so you find out in a pull request, not in production.

CLI: modelpin (alias mp). License: Apache-2.0.


Why this exists (and why you can trust the verdict)

Models are non-deterministic. Run the same prompt twice and the words change. So the naive way to "test a new model" — diff the text — cries wolf on every run. An alerter that cries wolf is worse than no alerter: you mute it, then it misses the real break.

Modelpin's entire design optimizes for one north-star metric: false-positive rate. The promise is narrow and falsifiable: if Modelpin says it broke, it broke. Everything below is in service of that promise — and where the evidence is thin, this README says so plainly.

This is meant to be the independent, no-BS tool: it measures behavior change relative to your app, it never declares one model globally "better," and the whole harness is open source so you can reproduce it and disagree.


Quickstart

Install from source (not on PyPI yet — see Install):

git clone https://github.com/samarthputhraya/modelpin
cd modelpin
pip install -e ".[dev,providers]"

Try it offline, no API key (30 seconds)

Modelpin ships a fake provider that replays canned traces, so you can see the whole pipeline — baseline, candidate replay, behavioral diff, report — with zero cost and no key:

mp baseline --provider fake --fixtures examples/traces/demo_traces.json \
  --model claude-opus-4-6 \
  --scenarios-dir examples/scenarios --config examples/modelpin.yaml

mp check --to claude-opus-4-7 --from claude-opus-4-6 \
  --provider fake --fixtures examples/traces/demo_traces.json \
  --scenarios-dir examples/scenarios --config examples/modelpin.yaml

You'll get a per-scenario verdict (unchanged / changed_minor / regression), a confidence score, a one-line plain-English explanation per scenario, and a Markdown report written to .modelpin/last-report.md. mp check exits non-zero only on a real regression — that's the CI gate.

The real flow, on your own app

# 1. Scaffold modelpin.yaml + scenarios/ (never overwrites existing files)
mp init

# 2. See which models your repo already depends on, and where
mp scan

# 3. Add a scenario or two (a JSON file per representative case — see below),
#    then record how your current model behaves, N times
export OPENAI_API_KEY=sk-...        # your key, read from the env — never stored
mp baseline                         # uses models[0] + providers[0] from modelpin.yaml

# 4. Replay your scenarios on a candidate model and diff the behavior
mp check --to gpt-5.5

A scenario is a small JSON file (one per case) under scenarios/. The one mp init writes:

{
  "id": "greeting",
  "name": "Simple greeting",
  "kind": "single",
  "input": {"messages": [{"role": "user", "content": "Say hello in one short sentence."}]},
  "assertions": {"must_contain": ["hello"]}
}

Scenarios can also be agent runs: set "kind": "agent", add "tools" (and canned "tool_results") to input, and Modelpin drives a multi-turn model↔tool loop so trajectories like lookup_order → issue_refund actually emerge during replay. Eight worked examples spanning tool trajectories, semantic equivalence, refusals, and output format live in examples/suite/.
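
To make the assertions block of the greeting scenario concrete, here is a minimal sketch of how a must_contain / must_not_contain check could be turned into a pass rate across runs. It is not Modelpin's code: the helper names, the case-insensitive substring matching, and the five sample outputs are all assumptions made for illustration.

```python
import json

# The scenario that `mp init` writes, copied from above.
SCENARIO_JSON = """
{
  "id": "greeting",
  "name": "Simple greeting",
  "kind": "single",
  "input": {"messages": [{"role": "user", "content": "Say hello in one short sentence."}]},
  "assertions": {"must_contain": ["hello"]}
}
"""


def passes_assertions(output: str, assertions: dict) -> bool:
    """Check one output against the text assertions.

    Assumption: matching is a case-insensitive substring test.
    """
    text = output.lower()
    required = assertions.get("must_contain", [])
    forbidden = assertions.get("must_not_contain", [])
    has_required = all(s.lower() in text for s in required)
    has_forbidden = any(s.lower() in text for s in forbidden)
    return has_required and not has_forbidden


def assertion_pass_rate(outputs: list[str], assertions: dict) -> float:
    """Fraction of runs whose output satisfies every assertion."""
    if not outputs:
        raise ValueError("need at least one run")
    return sum(passes_assertions(o, assertions) for o in outputs) / len(outputs)


scenario = json.loads(SCENARIO_JSON)

# Hypothetical outputs from five runs of the same prompt.
outputs = [
    "Hello there!",
    "Hi, nice to meet you.",
    "Hello, how can I help?",
    "Hello!",
    "Greetings.",
]
rate = assertion_pass_rate(outputs, scenario["assertions"])
print(rate)  # 0.6
```

Treating the assertion as a rate rather than a single pass/fail is what lets a later comparison look at baseline and candidate as two distributions.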


See it in your PR (GitHub Action)

The point of Modelpin is that the answer shows up at review time. It ships a real composite GitHub Action: it installs Modelpin, optionally records a baseline, runs mp check, posts a sticky PR comment (found-and-updated in place via a hidden marker — no comment spam), and fails the job on a regression. Drop this at .github/workflows/modelpin.yml:

name: Modelpin

on:
  pull_request:
  workflow_dispatch:        # trigger by hand the day a provider ships a new model

permissions:
  contents: read
  pull-requests: write      # so the action can post/update the PR comment

jobs:
  model-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: samarthputhraya/modelpin/actions@v1
        with:
          from: gpt-4o-mini       # the model you depend on today (committed baseline)
          to: gpt-5.5             # the candidate to vet before adopting
          provider: openai
          runs: "5"
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}     # BYO-key from repo secrets — never inline a key
          # If your judge_model lives on another provider, add its key too:
          # GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
          # GROQ_API_KEY:   ${{ secrets.GROQ_API_KEY }}

Action inputs: to (required), from, provider, config, scenarios-dir, runs, match, baseline, comment, fail-on-regression, github-token, modelpin-spec, python-version, working-directory. Outputs: verdict-exit-code and report-path. The usual pattern is to commit your baseline so CI only replays the candidate; flip baseline: "true" to record fresh (needs the old model still reachable). Copy-paste workflow: examples/github-workflow.yml.
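
The sticky PR comment described above (found and updated in place via a hidden marker) can be sketched as a pure decision function. This is an illustration under assumptions, not the Action's implementation: the marker string and the function name are hypothetical, and the comment dictionaries only mirror the id and body fields that GitHub issue comments carry. No network call is made here.

```python
# Hypothetical hidden marker; the real Action's marker text is not documented here.
MARKER = "<!-- modelpin-report -->"


def plan_comment_update(existing_comments: list[dict], report_md: str) -> dict:
    """Decide whether to create a new PR comment or update the existing one.

    Each entry in existing_comments is expected to have "id" and "body" keys.
    """
    body = f"{MARKER}\n{report_md}"
    for comment in existing_comments:
        if MARKER in comment.get("body", ""):
            # A previous report exists: overwrite it instead of posting again.
            return {"action": "update", "comment_id": comment["id"], "body": body}
    return {"action": "create", "body": body}


first = plan_comment_update(
    [{"id": 1, "body": "LGTM"}],
    "## Modelpin report\nAll scenarios unchanged.",
)
second = plan_comment_update(
    [{"id": 1, "body": "LGTM"}, {"id": 2, "body": first["body"]}],
    "## Modelpin report\n1 regression.",
)
print(first["action"], second["action"], second["comment_id"])  # create update 2
```

Because the marker is an HTML comment, it stays invisible in the rendered PR while still letting a later run find its own earlier comment.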


How the behavioral diff works (the moat)

Modelpin decides "did it really change?" from multiple signals over multiple runs, then gates every regression behind a distributional significance test plus an effect-size floor. A single odd run never trips it; a majority that merely flips between two equally-likely behaviors never trips it. Here is the whole decision rule, no hand-waving:

1. Multi-run, not single-shot. Each scenario runs N times (runs: in config; default 5, minimum 2 — a single run can't form a distribution, so --runs 1 is rejected outright). Baseline and candidate are both sampled, so the comparison is distribution-vs-distribution.

2. Structural signals (per run, no network, deterministic):

  • Tool-call trajectory match with four modes — strict | unordered | subset | superset (--match) — so you choose how strict "same plan" means for your agent.
  • Output format / assertion validity — your scenario's must_contain / must_not_contain text assertions, checked as a rate across runs.
  • Refusal detection — did the model start declining requests it used to answer?
  • Latency / token deltas — captured and reported, but informational only; they never gate the verdict (latency is jittery; a token bump isn't a behavior regression).
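
As a rough illustration of the four --match modes, the sketch below compares two tool-call trajectories given as lists of tool names. The function name is hypothetical, and the direction of subset and superset (candidate relative to baseline) is an assumption; check the Modelpin source for the exact semantics.

```python
from collections import Counter


def trajectories_match(baseline: list[str], candidate: list[str], mode: str = "strict") -> bool:
    """Compare two tool-call trajectories under one of four match modes."""
    if mode == "strict":
        # Same tools, same order, same count.
        return baseline == candidate
    base, cand = Counter(baseline), Counter(candidate)
    if mode == "unordered":
        # Same tools and counts, order ignored.
        return base == cand
    if mode == "subset":
        # Assumption: the candidate may drop calls but must not add new ones.
        return all(cand[tool] <= base[tool] for tool in cand)
    if mode == "superset":
        # Assumption: the candidate may add calls but must keep every baseline call.
        return all(base[tool] <= cand[tool] for tool in base)
    raise ValueError(f"unknown match mode: {mode!r}")


baseline = ["lookup_order", "issue_refund"]
reordered = ["issue_refund", "lookup_order"]
extra_call = ["lookup_order", "issue_refund", "send_email"]  # send_email is a made-up tool

print(trajectories_match(baseline, reordered, "strict"))      # False
print(trajectories_match(baseline, reordered, "unordered"))   # True
print(trajectories_match(baseline, ["lookup_order"], "subset"))  # True
print(trajectories_match(baseline, extra_call, "superset"))   # True
```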

3. Semantic signal (optional LLM-as-judge): a low-temperature judge answers the only question that matters — do these answers mean / accomplish the same thing? This catches the structural blind spot: two answers that are textually different but identical in meaning ("The total is $5." vs "5 dollars."). The judge is injected and optional — with no judge_model set (and always on the offline fake path) the diff stays purely structural and makes zero network calls, so CI can run for $0. The judge is independent of the two models being compared, so it can arbitrate a cross-vendor check.
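
The "injected and optional" judge can be pictured as a plain callable that the diff either receives or does not. The sketch below is a hypothetical shape, not Modelpin's API: the Judge type, the function names, and the pairing of baseline and candidate runs by position are assumptions, and the toy judge is a regex stand-in for a real low-temperature LLM call.

```python
import re
from typing import Callable, Optional

# A judge answers one question: do these two answers mean the same thing?
Judge = Callable[[str, str], bool]


def semantic_divergence_rate(
    baseline: list[str],
    candidate: list[str],
    judge: Optional[Judge] = None,
) -> Optional[float]:
    """Fraction of paired runs the judge considers different in meaning.

    Returns None when no judge is injected, so the diff stays purely structural.
    Assumption: runs are paired by position.
    """
    if judge is None:
        return None
    pairs = list(zip(baseline, candidate))
    if not pairs:
        raise ValueError("need at least one pair of runs")
    diverged = sum(1 for b, c in pairs if not judge(b, c))
    return diverged / len(pairs)


def toy_number_judge(a: str, b: str) -> bool:
    """Toy stand-in for an LLM judge: equivalent if both mention the same numbers."""
    return re.findall(r"\d+", a) == re.findall(r"\d+", b)


baseline_runs = ["The total is $5.", "The total is $5."]
candidate_runs = ["5 dollars.", "7 dollars."]

print(semantic_divergence_rate(baseline_runs, candidate_runs))                    # None
print(semantic_divergence_rate(baseline_runs, candidate_runs, toy_number_judge))  # 0.5
```

Passing the judge in as an argument is what keeps the offline path free of network calls: with no judge there is simply nothing to call.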

4. The statistics that kill false alarms. Every gating signal goes through an exact two-sample permutation test (modelpin/diff/stats.py — no SciPy, deterministic, so golden tests stay reproducible). A signal counts as a regression only when both:

  • the candidate distribution differs from baseline at p ≤ 0.05 (ALPHA), and
  • the effect clears a conservative size floor — tool-call shift ≥ 0.5 total-variation distance (MIN_TOOL_TVD), refusal-rate rise ≥ 0.34 (MIN_REFUSAL_DELTA), or semantic-divergence rate ≥ 0.5 over baseline (MIN_SEMANTIC_DELTA).
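
The two-part gate above can be sketched for the refusal signal as follows. This is not the code in modelpin/diff/stats.py: the test statistic (absolute difference in means), the two-sided form, and the function names are assumptions. Only ALPHA and MIN_REFUSAL_DELTA take their values from the text.

```python
from itertools import combinations

ALPHA = 0.05              # significance level named above
MIN_REFUSAL_DELTA = 0.34  # refusal-rate floor named above


def exact_permutation_p(baseline: list[float], candidate: list[float]) -> float:
    """Two-sided exact permutation p-value for the difference in means."""
    pooled = baseline + candidate
    n_base, n_cand = len(baseline), len(candidate)
    total = sum(pooled)
    observed = abs(sum(candidate) / n_cand - sum(baseline) / n_base)
    extreme = 0
    relabelings = 0
    # Enumerate every way to relabel n_base of the pooled runs as "baseline".
    for idx in combinations(range(len(pooled)), n_base):
        base_sum = sum(pooled[i] for i in idx)
        diff = abs((total - base_sum) / n_cand - base_sum / n_base)
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
        relabelings += 1
    return extreme / relabelings


def refusal_regressed(baseline: list[int], candidate: list[int]) -> bool:
    """Regression only if the rise is both significant and large enough."""
    delta = sum(candidate) / len(candidate) - sum(baseline) / len(baseline)
    p_value = exact_permutation_p(baseline, candidate)
    return p_value <= ALPHA and delta >= MIN_REFUSAL_DELTA


# 1 = the run was a refusal, 0 = the run answered.
one_odd_run = refusal_regressed([0, 0, 0, 0, 0], [0, 0, 0, 0, 1])
all_refused = refusal_regressed([0, 0, 0, 0, 0], [1, 1, 1, 1, 1])
print(one_odd_run, all_refused)  # False True
```

With 5 runs per side there are 252 relabelings, so full enumeration is cheap and the result is deterministic. In the single-odd-run case every relabeling is as extreme as the observed one, so the p-value is 1.0 and the gate stays shut.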

The size floor is what stops a statistically significant but practically trivial jitter from firing once N grows large. These floors are intentionally conservative — biased toward missing a borderline change rather than inventing one — because a miss is a false negative (the safe direction for a trust product), while a false alarm erodes trust permanently.
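
For the tool-call floor, total-variation distance between two sets of runs can be computed as below. This is a sketch under one assumption in particular: each run's whole trajectory is treated as a single categorical outcome. Modelpin may define the distribution differently. The function name is hypothetical; MIN_TOOL_TVD is the floor named above.

```python
from collections import Counter

MIN_TOOL_TVD = 0.5  # tool-call floor named above


def total_variation_distance(
    baseline_runs: list[list[str]],
    candidate_runs: list[list[str]],
) -> float:
    """TVD between the empirical distributions of tool-call trajectories."""
    base = Counter(tuple(run) for run in baseline_runs)
    cand = Counter(tuple(run) for run in candidate_runs)
    n_base, n_cand = len(baseline_runs), len(candidate_runs)
    outcomes = set(base) | set(cand)
    # TVD is half the L1 distance between the two probability vectors.
    return 0.5 * sum(abs(base[o] / n_base - cand[o] / n_cand) for o in outcomes)


full = ["lookup_order", "issue_refund"]
short = ["lookup_order"]

baseline_runs = [full] * 5
mild_shift = [full, full, full, short, short]  # 2 of 5 runs skip the refund
total_shift = [short] * 5                      # every run skips the refund

print(round(total_variation_distance(baseline_runs, mild_shift), 3))   # 0.4
print(round(total_variation_distance(baseline_runs, total_shift), 3))  # 1.0
```

Under this definition the mild shift (0.4) stays below the 0.5 floor, while the total shift (1.0) clears it and would still need to pass the significance test to count.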

Output: each scenario gets a verdict, a confidence score, the underlying signals, and a one-line explanation. A structural tool-call / refusal break or a calibrated semantic divergence is a CI-failing regression; format/assertion drift alone is changed_minor (reported, doesn't fail the build).
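
The mapping from signals to a verdict described above fits in a few lines. The sketch is illustrative only: the function names and boolean inputs are hypothetical, and the exit code value 1 is an assumption (the text only promises a non-zero exit on a regression).

```python
def scenario_verdict(
    tool_regressed: bool,
    refusal_regressed: bool,
    semantic_regressed: bool,
    assertion_drift: bool,
) -> str:
    """Combine gated signals into one of the three documented verdicts."""
    if tool_regressed or refusal_regressed or semantic_regressed:
        return "regression"
    if assertion_drift:
        # Format/assertion drift alone is reported but does not fail the build.
        return "changed_minor"
    return "unchanged"


def exit_code(verdicts: list[str]) -> int:
    """Non-zero only when at least one scenario is a regression."""
    return 1 if "regression" in verdicts else 0


verdicts = [
    scenario_verdict(False, False, False, False),
    scenario_verdict(False, False, False, True),
]
print(verdicts, exit_code(verdicts))  # ['unchanged', 'changed_minor'] 0
```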

The false-positive evidence — and its limits, stated plainly

Result: 0/8 false positives on a held-out 8-scenario suite (a model judged against itself, judge on, all unchanged at confidence 1.00). On that same run, the two perturbations that genuinely changed behavior were caught, and one prompt-injection the model resisted was correctly left unchanged (not a false negative). Corroborated by additional same-model and cross-model splits. Full writeup: docs/fp-measurement.md.

The semantic judge's escalation threshold is calibrated on a labeled set in examples/calibration/ that is deliberately distinct from the held-out suite (so it can't leak into the 0/8 number): equivalent-but-reworded pairs land at divergence 0.0, real meaning changes at ≥ 0.8, leaving an empty gap around the 0.5 floor — 0 false positives. FP-safety was re-checked with an independent judge (a different model arbitrating) and re-validated on the held-out suite after promoting semantic divergence from changed_minor to a CI-failing regression (still 0/8).

This is a first calibration. Do not over-trust it. The honest limitations, documented in docs/STATUS.md:

  • the calibration set is small (≈6+6 pairs) and the perturbations are synthetic, not harvested from real migrations;
  • recall on subtle changes was 4/6 — it can miss a subtle real change (again, the safe direction);
  • the judge is OpenAI-only so far;
  • the structural floors are FP-validated by the held-out suite but not yet swept on a labeled set.

Planned before any high-stakes reliance: ≥30 pairs including real migration traces, and a non-OpenAI judge. We'd rather you know this than discover it.


Cross-vendor (including a free third vendor)

A model migration isn't always within one lab. Modelpin diffs across vendors through one engine; a separate judge model arbitrates meaning-equivalence.

| Provider | Status |
| --- | --- |
| OpenAI | Live (Chat Completions), multi-turn tool loops |
| Google / Gemini | Live (google-genai), multi-turn tool loops, cross-vendor proven |
| OpenAI-compatible hosts (groq, openrouter, together, cerebras) | Live (the OpenAI adapter pointed at the host's base_url) |
| Anthropic | Stub: raises NotImplementedError (deferred until a paid key is in play) |

What we observed (open suite, our settings):

  • gpt-4o-mini vs gemini-3.1-flash-lite, 5 runs × 8 scenarios, OpenAI judge on → 8/8 unchanged: the cross-vendor judge genuinely fired and found the two vendors behaviorally equivalent on this suite.
  • gpt-4o-mini vs llama-3.3-70b-versatile on Groq, same suite → 8/8 unchanged.

Free third vendor, zero cost: Groq serves Llama models over the OpenAI-compatible API and has a free tier, so a free key makes a zero-cost cross-vendor check:

export GROQ_API_KEY=...     # free at console.groq.com
mp check --provider groq --from gpt-4o-mini --to llama-3.3-70b-versatile

A caveat worth stating: open-model hosts rotate ids but don't retire on a lab's fixed schedule the way the big providers do, so Groq/OpenRouter/etc. are a genuine cross-vendor bonus and an architecture proof — not the core migration wedge.


Bring your own key

Modelpin replays with the end user's own API key, always read from the environment, never hardcoded, shipped, or stored (cost stays yours; provider ToS stays clean):

  • OPENAI_API_KEY
  • GEMINI_API_KEY (or GOOGLE_API_KEY)
  • GROQ_API_KEY (and the equivalents for other OpenAI-compatible hosts)

In CI, supply these as repo secrets (see the workflow above). Error text is scrubbed of sk- / Bearer tokens, so a failed call never leaks your key into a log, traceback, or PR comment.
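
A minimal sketch of the kind of scrubbing described above, assuming simple regular expressions are enough. The patterns, the placeholder text, and the function name are illustrative guesses, not Modelpin's actual rules.

```python
import re

# Assumed patterns: an "sk-" style key and an HTTP "Bearer <token>" credential.
_SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9_\-]{8,}"),
    re.compile(r"(?i)bearer\s+[A-Za-z0-9._\-]+"),
]


def scrub(text: str) -> str:
    """Replace anything that looks like a credential before it is logged."""
    for pattern in _SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text


print(scrub("401 for key sk-abc123DEF456ghi789"))
# 401 for key [REDACTED]
print(scrub("Authorization: Bearer abc.def-123 rejected"))
# Authorization: [REDACTED] rejected
```

Scrubbing at the point where an error message is formed, rather than at each place it is printed, means the same text is safe in a log, a traceback, and a PR comment.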


CLI reference

| Command | What it does |
| --- | --- |
| mp init [dir] | Scaffold modelpin.yaml + scenarios/ (never overwrites). |
| mp scan [path] | Detect which AI models the repo depends on, and where. |
| mp baseline | Record current model behavior for your scenarios (N runs). |
| mp check --to <model> | Replay scenarios on a new model, diff vs baseline, write the PR-style report, fail CI on a regression. |
| mp version | Print the Modelpin version. |
| mp report | Not built yet; prints a "coming soon" notice. The public Modelpin Report suite is a stub. |

Shared flags on baseline / check: --from / --model, --provider, --runs, --match (strict | unordered | subset | superset), --config, --scenarios-dir, --store-dir, and --fixtures (with --provider fake).


Install

Not on PyPI yet (publishing is pending). For now, from source:

git clone https://github.com/samarthputhraya/modelpin
cd modelpin
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev,providers]"     # Python 3.12+; 'providers' pulls the live SDKs
mp version

Once published, the intended install is:

pip install "modelpin[providers]"     # or: pipx install "modelpin[providers]"

The providers extra pulls in the openai, google-genai, and anthropic SDKs. The base install runs the offline fake path with no provider SDKs at all.


What this is not (non-goals, on purpose)

Modelpin is a migration tool, and stays one. It is not:

  • a general eval / observability platform,
  • prompt management,
  • a model gateway or host,
  • an absolute "which model is best" leaderboard.

It measures behavior change relative to your app — not abstract quality. Saying no to that scope is what keeps the false-positive promise honest and the tool small enough to trust.

Honest-framing rules (this is a trust product)

  • Any public / measurement claim is phrased as "on our open suite, under these settings, we observed…", never "Model X is worse." The harness and scenarios are open source so anyone can rerun and disagree. That's the whole point of being the independent voice.
  • We don't overclaim and we don't falsely undersell. The engine is real and cross-vendor proven. At the same time, Anthropic is a stub, mp report isn't built, the judge calibration is a documented first pass, and it's pre-PyPI. All of these are true at once.

Status

Phase 0 (core engine MVP) — essentially complete. Live-validated cross-vendor (OpenAI ↔ Google ↔ Groq/Llama); held-out false-positive rate 0/8; multi-turn replay; a real GitHub Action; PyPI-ready packaging (0.1.0, builds a clean wheel, mp + modelpin entry points); 132 tests passing, ruff + black clean. Stubs/TODOs (Anthropic adapter, mp report) are called out above, and it is not yet published to PyPI or the GitHub Marketplace.

The full engineering record and roadmap live in docs/STATUS.md. Next up: publish, the first public Modelpin Report (gated on a real model launch), and the Anthropic adapter.

License

Apache-2.0. See LICENSE. The open-source core (CLI, engine, Action) is and stays open; any future hosted tier lives in a separate, proprietary package.

Repo: https://github.com/samarthputhraya/modelpin
