Dependabot for AI models — catch model-migration regressions before they hit production.

Modelpin

Dependabot for AI models. Know before the model breaks you.

A provider ships a new model (or retires the one you depend on). Modelpin replays your app's real behavior on the new model, decides whether anything actually regressed despite model randomness, and posts a PR-style report — so you find out in a pull request, not in production.

CLI: modelpin (alias mp). License: Apache-2.0.


Why this exists (and why you can trust the verdict)

Models are non-deterministic. Run the same prompt twice and the words change. So the naive way to "test a new model" — diff the text — cries wolf on every run. An alerter that cries wolf is worse than no alerter: you mute it, then it misses the real break.

Modelpin's entire design optimizes for one north-star metric: false-positive rate. The promise is narrow and falsifiable: if Modelpin says it broke, it broke. Everything below is in service of that promise — and where the evidence is thin, this README says so plainly.

This is meant to be the independent, no-BS tool: it measures behavior change relative to your app, it never declares one model globally "better," and the whole harness is open source so you can reproduce it and disagree.


Quickstart

Install (Python 3.12+):

pip install "modelpin[providers]"      # or: pipx install "modelpin[providers]"
modelpin version                        # -> modelpin 0.1.0

Windows PowerShell: run modelpin …, not mp …. PowerShell ships a built-in mp alias (Move-ItemProperty) that shadows the CLI. The mp alias works everywhere else (cmd, bash, zsh) and via mp.exe; on PowerShell either use modelpin, call mp.exe, or run Remove-Item Alias:mp once per session (add it to your $PROFILE to make it permanent).

Try it offline, no API key (30 seconds)

Modelpin ships a fake provider that replays canned traces, so you can see the whole pipeline — baseline, candidate replay, behavioral diff, report — with zero cost and no key:

mp baseline --provider fake --fixtures examples/traces/demo_traces.json \
  --model claude-opus-4-6 \
  --scenarios-dir examples/scenarios --config examples/modelpin.yaml

mp check --to claude-opus-4-7 --from claude-opus-4-6 \
  --provider fake --fixtures examples/traces/demo_traces.json \
  --scenarios-dir examples/scenarios --config examples/modelpin.yaml

You'll get a per-scenario verdict (unchanged / changed_minor / regression), a confidence score, a one-line plain-English explanation per scenario, and a Markdown report written to .modelpin/last-report.md. mp check exits non-zero only on a real regression — that's the CI gate.

The real flow, on your own app

# 1. Scaffold modelpin.yaml + scenarios/ (never overwrites existing files)
mp init

# 2. See which models your repo already depends on, and where
mp scan

# 3. Add a scenario or two (a JSON file per representative case — see below),
#    then record how your current model behaves, N times
export OPENAI_API_KEY=sk-...        # your key, read from the env — never stored
mp baseline                         # uses models[0] + providers[0] from modelpin.yaml

# 4. Replay your scenarios on a candidate model and diff the behavior
mp check --to gpt-5.5

A scenario is a small JSON file (one per case) under scenarios/. The one mp init writes:

{
  "id": "greeting",
  "name": "Simple greeting",
  "kind": "single",
  "input": {"messages": [{"role": "user", "content": "Say hello in one short sentence."}]},
  "assertions": {"must_contain": ["hello"]}
}

Scenarios can also be agent runs: set "kind": "agent", add "tools" (and canned "tool_results") to input, and Modelpin drives a multi-turn model↔tool loop so trajectories like lookup_order → issue_refund actually emerge during replay. Eight worked examples spanning tool trajectories, semantic equivalence, refusals, and output format live in examples/suite/.
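As a rough illustration of how a scenario's text assertions could be turned into a per-scenario pass rate across runs, here is a minimal sketch. The function name `assertion_pass_rate` is hypothetical, and the case-insensitive substring match is an assumption; Modelpin's actual matching rules may differ.

```python
import json


def assertion_pass_rate(scenario: dict, outputs: list[str]) -> float:
    """Fraction of runs whose output satisfies the scenario's text assertions."""
    assertions = scenario.get("assertions", {})
    must = [s.lower() for s in assertions.get("must_contain", [])]
    must_not = [s.lower() for s in assertions.get("must_not_contain", [])]

    def ok(text: str) -> bool:
        # ASSUMPTION: case-insensitive substring matching (not confirmed by the README).
        lowered = text.lower()
        return all(s in lowered for s in must) and not any(s in lowered for s in must_not)

    return sum(ok(o) for o in outputs) / len(outputs)


# The scenario that `mp init` writes, loaded from its JSON form.
scenario = json.loads("""
{
  "id": "greeting",
  "name": "Simple greeting",
  "kind": "single",
  "input": {"messages": [{"role": "user", "content": "Say hello in one short sentence."}]},
  "assertions": {"must_contain": ["hello"]}
}
""")

# Three made-up model outputs, standing in for three runs of the scenario.
outputs = ["Hello there!", "Hi, nice to meet you.", "Hello, friend."]
rate = assertion_pass_rate(scenario, outputs)  # two of the three runs contain "hello"
```

Scoring a rate rather than a single pass/fail is what lets baseline and candidate be compared as distributions.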


See it in your PR (GitHub Action)

The point of Modelpin is that the answer shows up at review time. It ships a real composite GitHub Action: it installs Modelpin, optionally records a baseline, runs mp check, posts a sticky PR comment (found-and-updated in place via a hidden marker — no comment spam), and fails the job on a regression. Drop this at .github/workflows/modelpin.yml:

name: Modelpin

on:
  pull_request:
  workflow_dispatch:        # trigger by hand the day a provider ships a new model

permissions:
  contents: read
  pull-requests: write      # so the action can post/update the PR comment

jobs:
  model-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: samarthputhraya/modelpin@v1
        with:
          from: gpt-4o-mini       # the model you depend on today (committed baseline)
          to: gpt-5.5             # the candidate to vet before adopting
          provider: openai
          runs: "5"
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}     # BYO-key from repo secrets — never inline a key
          # If your judge_model lives on another provider, add its key too:
          # GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
          # GROQ_API_KEY:   ${{ secrets.GROQ_API_KEY }}

Action inputs: to (required), from, provider, config, scenarios-dir, runs, match, baseline, comment, fail-on-regression, github-token, modelpin-spec, python-version, working-directory. Outputs: verdict-exit-code and report-path. The usual pattern is to commit your baseline so CI only replays the candidate; flip baseline: "true" to record fresh (needs the old model still reachable). Copy-paste workflow: examples/github-workflow.yml.
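The sticky-comment behavior (find the previous report by a hidden marker, update it in place, otherwise create a new one) can be sketched as a pure decision function. The marker string and the `plan_comment` helper below are hypothetical, and the comment dictionaries only mimic the general shape of an issue-comment listing; this is not the Action's actual code.

```python
# ASSUMPTION: the real marker text is not documented; this one is illustrative.
MARKER = "<!-- modelpin-report -->"


def plan_comment(existing: list[dict], report_md: str) -> dict:
    """Decide whether to update a prior Modelpin comment or create a new one."""
    body = f"{MARKER}\n{report_md}"
    for comment in existing:
        if MARKER in comment.get("body", ""):
            # A previous report exists: overwrite it instead of adding another comment.
            return {"action": "update", "comment_id": comment["id"], "body": body}
    return {"action": "create", "body": body}


comments = [
    {"id": 1, "body": "LGTM"},
    {"id": 2, "body": MARKER + "\nold report"},
]
plan = plan_comment(comments, "new report")
```

Because the marker is an HTML comment, it stays invisible in the rendered PR comment while remaining searchable in the raw body.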


How the behavioral diff works (the moat)

Modelpin decides "did it really change?" from multiple signals over multiple runs, then gates every regression behind a distributional significance test plus an effect-size floor. A single odd run never trips it; a majority that merely flips between two equally likely behaviors never trips it. Here is the whole decision rule, no hand-waving:

1. Multi-run, not single-shot. Each scenario runs N times (runs: in config; default 5, minimum 2 — a single run can't form a distribution, so --runs 1 is rejected outright). Baseline and candidate are both sampled, so the comparison is distribution-vs-distribution.

2. Structural signals (per run, no network, deterministic):

  • Tool-call trajectory match with four modes — strict | unordered | subset | superset (--match) — so you choose how strict "same plan" means for your agent.
  • Output format / assertion validity — your scenario's must_contain / must_not_contain text assertions, checked as a rate across runs.
  • Refusal detection — did the model start declining requests it used to answer?
  • Latency / token deltas — captured and reported, but informational only; they never gate the verdict (latency is jittery; a token bump isn't a behavior regression).
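To make the four trajectory-match modes concrete, here is one plausible reading of them as a small function over tool-call names. The direction chosen for subset and superset (candidate relative to baseline) is an assumption, as is the helper name `trajectories_match`; check the project's own docs before relying on either.

```python
from collections import Counter


def trajectories_match(baseline: list[str], candidate: list[str], mode: str = "strict") -> bool:
    """Compare two tool-call trajectories under one of four match modes."""
    b, c = Counter(baseline), Counter(candidate)
    if mode == "strict":
        return baseline == candidate      # same calls, same order
    if mode == "unordered":
        return b == c                     # same calls, any order
    if mode == "subset":
        # ASSUMPTION: the candidate makes no call the baseline did not make.
        return not (c - b)
    if mode == "superset":
        # ASSUMPTION: the candidate makes at least every call the baseline made.
        return not (b - c)
    raise ValueError(f"unknown match mode: {mode}")


base = ["lookup_order", "issue_refund"]
reordered = ["issue_refund", "lookup_order"]
shorter = ["lookup_order"]
longer = ["lookup_order", "issue_refund", "send_email"]  # send_email is a made-up tool name
```

Under this reading, `reordered` fails strict but passes unordered, `shorter` passes only subset, and `longer` passes only superset.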

3. Semantic signal (optional LLM-as-judge): a low-temperature judge answers the only question that matters — do these answers mean / accomplish the same thing? This catches the structural blind spot: two answers that are textually different but identical in meaning ("The total is $5." vs "5 dollars."). The judge is injected and optional — with no judge_model set (and always on the offline fake path) the diff stays purely structural and makes zero network calls, so CI can run for $0. The judge is independent of the two models being compared, so it can arbitrate a cross-vendor check.

4. The statistics that kill false alarms. Every gating signal goes through an exact two-sample permutation test (modelpin/diff/stats.py — no SciPy, deterministic, so golden tests stay reproducible). A signal counts as a regression only when both:

  • the candidate distribution differs from baseline at p ≤ 0.05 (ALPHA), and
  • the effect clears a conservative size floor — tool-call shift ≥ 0.5 total-variation distance (MIN_TOOL_TVD), refusal-rate rise ≥ 0.34 (MIN_REFUSAL_DELTA), or semantic-divergence rate ≥ 0.5 over baseline (MIN_SEMANTIC_DELTA).
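The two-part gate can be sketched for the refusal signal as follows. This is not the contents of modelpin/diff/stats.py: the test statistic (two-sided absolute difference in means) and the helper names are assumptions, and only the constants ALPHA and MIN_REFUSAL_DELTA come from the text above.

```python
from itertools import combinations

ALPHA = 0.05
MIN_REFUSAL_DELTA = 0.34


def mean(xs: list[int]) -> float:
    return sum(xs) / len(xs)


def exact_permutation_p(baseline: list[int], candidate: list[int]) -> float:
    """Exact two-sample permutation p-value for |mean(candidate) - mean(baseline)|."""
    pooled = list(baseline) + list(candidate)
    n_cand = len(candidate)
    n_base = len(pooled) - n_cand
    observed = abs(mean(candidate) - mean(baseline))
    total = sum(pooled)
    hits = count = 0
    # Enumerate every way of relabelling the pooled runs as "candidate".
    for idx in combinations(range(len(pooled)), n_cand):
        cand_sum = sum(pooled[i] for i in idx)
        stat = abs(cand_sum / n_cand - (total - cand_sum) / n_base)
        hits += stat >= observed - 1e-12  # small tolerance for float rounding
        count += 1
    return hits / count


def refusal_regressed(baseline: list[int], candidate: list[int]) -> bool:
    """Regression only if the shift is both significant and large enough."""
    delta = mean(candidate) - mean(baseline)
    return exact_permutation_p(baseline, candidate) <= ALPHA and delta >= MIN_REFUSAL_DELTA


# 1 = the run was a refusal, 0 = it was answered (illustrative values, 5 runs each).
never_refuses = [0, 0, 0, 0, 0]
one_odd_run = [1, 0, 0, 0, 0]
mostly_refuses = [1, 1, 1, 1, 0]
p_mostly = exact_permutation_p(never_refuses, mostly_refuses)
```

With 5 runs per side there are only 252 relabellings, so full enumeration is cheap and fully deterministic. In this sketch a single odd refusal does not gate, while four refusals out of five do.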

The size floor is what stops a statistically significant but practically trivial jitter from firing once N grows large. These floors are intentionally conservative — biased toward missing a borderline change rather than inventing one — because a miss is a false negative (the safe direction for a trust product), while a false alarm erodes trust permanently.
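For the tool-call signal, the size floor is expressed as a total-variation distance between the baseline and candidate trajectory distributions. A minimal sketch of that computation, assuming each run is summarized as an ordered list of tool names (the function name and the example runs are illustrative):

```python
from collections import Counter

MIN_TOOL_TVD = 0.5


def total_variation_distance(baseline_runs: list[list[str]], candidate_runs: list[list[str]]) -> float:
    """TVD between the empirical distributions of tool-call trajectories."""
    p = Counter(tuple(run) for run in baseline_runs)
    q = Counter(tuple(run) for run in candidate_runs)
    n_p, n_q = len(baseline_runs), len(candidate_runs)
    # Half the L1 distance between the two relative-frequency tables.
    return 0.5 * sum(abs(p[k] / n_p - q[k] / n_q) for k in set(p) | set(q))


full = ["lookup_order", "issue_refund"]
partial = ["lookup_order"]

baseline_runs = [full] * 5
mild_shift = [full] * 3 + [partial] * 2   # 2 of 5 runs skip the refund
large_shift = [full] * 2 + [partial] * 3  # 3 of 5 runs skip the refund

tvd_mild = total_variation_distance(baseline_runs, mild_shift)
tvd_large = total_variation_distance(baseline_runs, large_shift)
```

Here the mild shift stays under MIN_TOOL_TVD and the large shift clears it. Clearing the floor is necessary but not sufficient: the significance test still has to pass as well.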

Output: each scenario gets a verdict, a confidence score, the underlying signals, and a one-line explanation. A structural tool-call / refusal break or a calibrated semantic divergence is a CI-failing regression; format/assertion drift alone is changed_minor (reported, doesn't fail the build).
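The mapping from gated signals to a verdict, and from verdicts to the CI exit status, might look roughly like this. The function names and the boolean inputs are hypothetical, and the specific exit code 1 is an assumption (the README only promises "non-zero").

```python
def scenario_verdict(tool_break: bool, refusal_break: bool,
                     semantic_break: bool, assertion_drift: bool) -> str:
    """Combine already-gated signals into one per-scenario verdict."""
    if tool_break or refusal_break or semantic_break:
        return "regression"      # fails CI
    if assertion_drift:
        return "changed_minor"   # reported, does not fail the build
    return "unchanged"


def ci_exit_code(verdicts: list[str]) -> int:
    # ASSUMPTION: 1 stands in for whatever non-zero status the CLI returns.
    return 1 if "regression" in verdicts else 0


verdicts = [
    scenario_verdict(False, False, False, False),
    scenario_verdict(False, False, False, True),
]
```

In this sketch, format or assertion drift on its own yields changed_minor and an exit code of 0, so it shows up in the report without blocking the merge.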

The false-positive evidence — and its limits, stated plainly

Result: 0/8 false positives on a held-out 8-scenario suite (a model judged against itself, judge on, all unchanged at confidence 1.00). On that same run, the two perturbations that genuinely changed behavior were caught, and one prompt-injection the model resisted was correctly left unchanged (not a false negative). Corroborated by additional same-model and cross-model splits. Full writeup: docs/fp-measurement.md.

The semantic judge's escalation threshold is calibrated on a labeled set in examples/calibration/ that is deliberately distinct from the held-out suite (so it can't leak into the 0/8 number): equivalent-but-reworded pairs land at divergence 0.0, real meaning changes at ≥ 0.8, leaving an empty gap around the 0.5 floor — 0 false positives. FP-safety was re-checked with an independent judge (a different model arbitrating) and re-validated on the held-out suite after promoting semantic divergence from changed_minor to a CI-failing regression (still 0/8).

This is a first calibration. Do not over-trust it. The honest limitations, documented in docs/STATUS.md:

  • the calibration set is small (≈6+6 pairs) and the perturbations are synthetic, not harvested from real migrations;
  • recall on subtle changes was 4/6 — it can miss a subtle real change (again, the safe direction);
  • the judge is OpenAI-only so far;
  • the structural floors are FP-validated by the held-out suite but not yet swept on a labeled set.

Planned before any high-stakes reliance: ≥30 pairs including real migration traces, and a non-OpenAI judge. We'd rather you know this than discover it.


Cross-vendor (including a free third vendor)

A model migration isn't always within one lab. Modelpin diffs across vendors through one engine; a separate judge model arbitrates meaning-equivalence.

| Provider | Status |
| --- | --- |
| OpenAI | Live (Chat Completions), multi-turn tool loops |
| Google / Gemini | Live (google-genai), multi-turn tool loops, cross-vendor proven |
| OpenAI-compatible hosts (groq, openrouter, together, cerebras) | Live (the OpenAI adapter pointed at the host's base_url) |
| Anthropic | Stub: raises NotImplementedError (deferred until a paid key is in play) |

What we observed (open suite, our settings):

  • gpt-4o-mini vs gemini-3.1-flash-lite, 5 runs × 8 scenarios, OpenAI judge on → 8/8 unchanged: the cross-vendor judge genuinely fired and found the two vendors behaviorally equivalent on this suite.
  • gpt-4o-mini vs llama-3.3-70b-versatile on Groq, same suite → 8/8 unchanged.

Free third vendor, zero cost: Groq serves Llama models over the OpenAI-compatible API and has a free tier, so a free key makes a zero-cost cross-vendor check:

export GROQ_API_KEY=...     # free at console.groq.com
mp check --provider groq --from gpt-4o-mini --to llama-3.3-70b-versatile

A caveat worth stating: open-model hosts rotate ids but don't retire on a lab's fixed schedule the way the big providers do, so Groq/OpenRouter/etc. are a genuine cross-vendor bonus and an architecture proof — not the core migration wedge.


Bring your own key

Modelpin replays with the end user's own API key, always read from the environment, never hardcoded, shipped, or stored (cost stays yours; provider ToS stays clean):

  • OPENAI_API_KEY
  • GEMINI_API_KEY (or GOOGLE_API_KEY)
  • GROQ_API_KEY (and the equivalents for other OpenAI-compatible hosts)

In CI, supply these as repo secrets (see the workflow above). Error text is scrubbed of sk- / Bearer tokens, so a failed call never leaks your key into a log, traceback, or PR comment.
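Scrubbing of this kind is typically a couple of regular-expression substitutions applied to error text before it is logged or posted. The patterns and the `scrub` helper below are an illustrative guess at the approach, not Modelpin's actual implementation.

```python
import re

# ASSUMPTION: these patterns are illustrative; the real scrubber may match more or less.
_BEARER = re.compile(r"\bbearer\s+[A-Za-z0-9._~+/=\-]+", re.IGNORECASE)
_SK_KEY = re.compile(r"\bsk-[A-Za-z0-9_\-]+")


def scrub(text: str) -> str:
    """Mask Bearer tokens and sk- style keys in an error message."""
    # Bearer first, so "Bearer sk-..." collapses into a single mask.
    text = _BEARER.sub("Bearer ***", text)
    return _SK_KEY.sub("sk-***", text)


# Obviously fake key material, used only to exercise the patterns.
msg = scrub("401 Unauthorized for key sk-abc123XYZ")
header = scrub("Authorization: Bearer sk-abc123XYZ")
```

The leading word boundary keeps ordinary words that merely contain "sk-" (for example "task-force") from being masked.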


CLI reference

| Command | What it does |
| --- | --- |
| mp init [dir] | Scaffold modelpin.yaml + scenarios/ (never overwrites). |
| mp scan [path] | Detect which AI models the repo depends on, and where. |
| mp baseline | Record current model behavior for your scenarios (N runs). |
| mp check --to <model> | Replay scenarios on a new model, diff vs baseline, write the PR-style report, fail CI on a regression. |
| mp version | Print the Modelpin version. |
| mp report | Not built yet; prints a "coming soon" notice. The public Modelpin Report suite is a stub. |

Shared flags on baseline / check: --from / --model, --provider, --runs, --match (strict|unordered|subset|superset), --config, --scenarios-dir, --store-dir, and --fixtures (with --provider fake).


Install

pip install "modelpin[providers]"     # or: pipx install "modelpin[providers]"   (Python 3.12+)
modelpin version

The providers extra pulls in the openai, google-genai, and anthropic SDKs. The bare pip install modelpin (no extra) runs the offline fake path with no provider SDKs at all.

From source (for development):

git clone https://github.com/samarthputhraya/modelpin
cd modelpin
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev,providers]"

What this is not (non-goals, on purpose)

Modelpin is a migration tool, and stays one. It is not:

  • a general eval / observability platform,
  • prompt management,
  • a model gateway or host,
  • an absolute "which model is best" leaderboard.

It measures behavior change relative to your app — not abstract quality. Saying no to that scope is what keeps the false-positive promise honest and the tool small enough to trust.

Honest-framing rules (this is a trust product)

  • Any public / measurement claim is phrased as "on our open suite, under these settings, we observed…", never "Model X is worse." The harness and scenarios are open source so anyone can rerun and disagree. That's the whole point of being the independent voice.
  • We don't overclaim and we don't falsely undersell. The engine is real and cross-vendor proven. At the same time, Anthropic is a stub, mp report isn't built, and the judge calibration is a documented first pass. All true at once.

Status

Phase 0 (core engine MVP) — essentially complete; v0.1.0 published on PyPI. Live-validated cross-vendor (OpenAI ↔ Google ↔ Groq/Llama); held-out false-positive rate 0/8; multi-turn replay; a real GitHub Action; pip install "modelpin[providers]"; 132 tests passing, ruff + black clean. Stubs/TODOs (Anthropic adapter, mp report) are called out above. Not yet listed on the GitHub Marketplace.

The full engineering record and roadmap live in docs/STATUS.md. Next up: publish, the first public Modelpin Report (gated on a real model launch), and the Anthropic adapter.

License

Apache-2.0. See LICENSE. The open-source core (CLI, engine, Action) is and stays open; any future hosted tier lives in a separate, proprietary package.

Repo: https://github.com/samarthputhraya/modelpin
