promptry

Regression protection for LLM pipelines

PyPI · npm · CI · Python 3.10+ · License: MIT

Sentry for prompts. Sentry catches when your code breaks. promptry catches when your prompts break — versions them, runs eval suites in CI, and flags regressions or drift against a baseline. Local-first. No SaaS.

from promptry import track, suite, assert_semantic

# track() content-hashes your prompt and stores a new version if it changed
prompt = track(system_prompt, "rag-qa")
response = llm.chat(system=prompt, ...)

# suites are regular Python functions. run them via CLI or in CI.
@suite("rag-regression")
def test_quality():
    response = my_pipeline("What is photosynthesis?")
    assert_semantic(response, "Converts light into chemical energy")

When a suite regresses against its baseline, promptry reports what changed:

Overall score: 0.910 -> 0.720  REGRESSION

Probable cause:
  -> Prompt changed (v3 -> v4)
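A hint like this can be produced by diffing the metadata stored with each run. The sketch below is a hypothetical reconstruction of that logic (the field names and helper are assumptions, not promptry's actual implementation):

```python
# Hypothetical root-cause hinting: compare the tracked inputs of the
# current run against the baseline run and report whatever changed.
def probable_cause(current: dict, baseline: dict) -> list[str]:
    hints = []
    if current.get("prompt_version") != baseline.get("prompt_version"):
        hints.append(
            f"Prompt changed (v{baseline['prompt_version']} -> v{current['prompt_version']})"
        )
    if current.get("model") != baseline.get("model"):
        hints.append(f"Model changed ({baseline['model']} -> {current['model']})")
    return hints or ["No tracked inputs changed; likely model nondeterminism"]

print(probable_cause(
    {"prompt_version": 4, "model": "gpt-4o-mini"},
    {"prompt_version": 3, "model": "gpt-4o-mini"},
))  # ['Prompt changed (v3 -> v4)']
```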

Install

pip install promptry                       # core
pip install promptry[semantic]             # + semantic assertions (sentence-transformers)
pip install promptry[dashboard]            # + web dashboard
pip install promptry[semantic,dashboard]   # everything

Quick start

promptry init                              # scaffold project + starter eval
promptry run smoke-test --module evals     # run it
PASS test_basic_quality (142ms)
  semantic (0.891) ok

Overall: PASS  score: 0.891
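The `semantic (0.891)` score is the kind of number an embedding-similarity check produces. A minimal sketch of cosine-similarity scoring, using toy vectors in place of the sentence-transformers embeddings the `[semantic]` extra would supply:

```python
import math

# Cosine similarity between two embedding vectors: 1.0 means identical
# direction, 0.0 means orthogonal (unrelated) meaning.
def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

response_vec  = [0.82, 0.41, 0.33]   # stand-in embedding of the response
reference_vec = [0.79, 0.48, 0.30]   # stand-in embedding of the reference

score = cosine(response_vec, reference_vec)
assert score > 0.85   # an assertion passes when the score clears a threshold
```

Real embeddings have hundreds of dimensions, but the pass/fail mechanics are the same: embed both texts, compare, threshold.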

Features

| Feature | What it does |
| --- | --- |
| Prompt versioning | Content-hashed, automatic dedup |
| Eval suites | Semantic, schema, LLM-as-judge, JSON, regex, grounding assertions |
| Regression detection | Compare against baselines, get root-cause hints |
| Drift detection | Catch slow quality degradation over time |
| Model comparison | Statistical comparison against historical baseline (not just snapshots) |
| Cost tracking | Token usage and cost per prompt, aggregated reports |
| Safety templates | 25 starter jailbreak / injection / PII tests; add your own |
| MCP server | Expose everything as tools for Claude, Cursor, VS Code, etc. |
| Dashboard | Web UI for eval history, prompt diffs, model comparison, cost |
| JS/TS client | Ship prompt events from frontend/Node apps |

Dashboard

pip install promptry[dashboard]
promptry dashboard

Dashboard views: Overview, Suite Detail, Prompts, Models, Cost.

How it differs

| | Promptfoo | DeepEval | RAGAS | LangSmith | promptry |
| --- | --- | --- | --- | --- | --- |
| Language | TypeScript | Python | Python | Python + JS | Python + JS |
| Local-first | Yes | Cloud push | Yes | SaaS only | SQLite |
| Prompt versioning | Via git + YAML | No | No | Prompt Hub | Automatic |
| Drift over time | No | No | No | Dashboards | Regression window |
| Root cause hints | No | No | No | No | Yes |
| Safety / red-team | Yes | Yes | No | No | 25 starters |
| MCP server | Plugin | Partial | No | No | Native |
| Vendor | OpenAI-owned | Independent | Independent | LangChain | Independent |
| Cost | Free | Freemium | Free | Freemium | Free |

Honest caveats: Promptfoo has more assertion types and a larger red-team corpus. RAGAS has the gold-standard RAG metrics (faithfulness, context precision, answer relevancy). LangSmith has better multi-user dashboards and deeper LangChain integration. promptry's niche is the combo of local SQLite + automatic versioning + CI-native + MCP server in one Python-first package.

GitHub Action

Run eval suites in CI with one line. On pull requests it posts (or updates) a single comment summarizing the eval: overall score, pass/fail counts, and any regressed tests vs. the previous run. View on Marketplace.

# .github/workflows/eval.yml
name: Eval
on: [push, pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write  # required for PR comments
    steps:
      - uses: actions/checkout@v4
      - uses: bihanikeshav/promptry@v0.6.0
        with:
          suite: rag-regression
          module: evals
          compare: prod  # optional — compare against baseline

Example PR comment on a regression:

## promptry eval: rag-regression

| | Current | Baseline | Delta |
|---|---|---|---|
| Overall score | 0.891 | 0.910 | -0.019 |
| Passed | 8/10 | 9/10 | -1 |
| Status | REGRESSED | PASS | |

**Regressions:**
- `test_photosynthesis_answer`: semantic 0.89 -> 0.72 (-0.17)
- `test_schema_validation`: passed -> **failed**

_Generated by [promptry](https://github.com/bihanikeshav/promptry)_

Subsequent pushes edit the same comment instead of spamming new ones.
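The usual way an action keeps a single comment current is to tag it with a hidden HTML marker and update in place whenever the marker is found. A sketch of that pattern (assumed mechanics; the marker string and helper are hypothetical, not the action's actual code):

```python
# Find-or-update ("upsert") pattern for a bot comment: an invisible HTML
# comment identifies our comment among all PR comments.
MARKER = "<!-- promptry-eval -->"  # hypothetical marker string

def upsert_comment(existing: list[dict], body: str) -> list[dict]:
    body = f"{MARKER}\n{body}"
    for comment in existing:
        if comment["body"].startswith(MARKER):
            comment["body"] = body          # update: PATCH the old comment
            return existing
    existing.append({"body": body})         # create: POST a new comment
    return existing

comments: list[dict] = []
upsert_comment(comments, "Overall: 0.91 PASS")
upsert_comment(comments, "Overall: 0.89 REGRESSED")
assert len(comments) == 1                   # still a single comment
```

Against the real GitHub REST API, the loop would page through the PR's issue comments and issue a PATCH or POST accordingly; the list here stands in for that.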

| Input | Required | Default | Description |
| --- | --- | --- | --- |
| `suite` | Yes | | Eval suite name |
| `module` | Yes | | Python module containing the suite |
| `compare` | No | | Baseline tag to compare against |
| `python-version` | No | `3.12` | Python version |
| `extras` | No | `semantic` | pip extras to install |
| `pr-comment` | No | `true` | Post/update a PR comment with results |
| `github-token` | No | `${{ github.token }}` | Token used to post PR comments |

MCP server

claude mcp add promptry -- promptry mcp    # Claude Code

Works with Claude Desktop, Cursor, Windsurf, VS Code. See full setup.

Documentation

The full guide covers all assertions, cost tracking, model comparison, safety templates, notifications, storage modes, JS client, CLI reference, MCP setup, and config options.

Honest caveats

  • Early-stage. v0.7, solo-maintained, small user base. API is stable but bus-factor is one. Issues welcome.
  • "No API keys" applies to the framework only. SQLite storage and the CLI need nothing. assert_llm, assert_grounded, and cost tracking all need your own LLM provider key.
  • Drift detection is a rolling-window regression on scores. Works for steady degradation over a configurable window (default 30 runs). It is not a formal hypothesis test — see drift detection docs for exactly what it does and does not do.
  • Safety templates are starters, not comprehensive coverage. 25 curated prompts across 6 categories. For serious red-teaming look at garak or PyRIT. Bring your own templates via templates.toml.
  • Cost tracking uses hardcoded rate tables. Fine for rough estimates; won't reflect batching discounts, prompt caching, or provider price changes. Reconcile against invoices for finance.
  • Auto-instrumentation is opt-in. promptry.integrations.openai and .litellm wrap clients automatically; otherwise you add track() manually. Explicit by default.
  • No hosted multi-user UI. For that, look at LangSmith or Arize.
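The rolling-window regression behind drift detection can be pictured as fitting a least-squares slope to the last N overall scores and flagging a sustained negative trend. A sketch under that assumption (promptry's exact thresholds and method may differ; see its drift docs):

```python
# Least-squares slope of the last `window` scores against run index.
# A clearly negative slope suggests steady degradation rather than noise.
def drift_slope(scores: list[float], window: int = 30) -> float:
    recent = scores[-window:]
    n = len(recent)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(recent) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, recent))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var  # score change per run

# Ten runs drifting from 0.92 toward 0.83: the slope is clearly negative.
scores = [0.92 - 0.01 * i for i in range(10)]
assert drift_slope(scores) < -0.005
```

A single bad run barely moves the slope, which is what lets a windowed fit separate drift from ordinary run-to-run noise.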

License

MIT

Download files

Download the file for your platform.

Source Distribution

promptry-0.7.0.tar.gz (297.1 kB)

Uploaded Source

Built Distribution


promptry-0.7.0-py3-none-any.whl (266.9 kB)

Uploaded Python 3

File details

Details for the file promptry-0.7.0.tar.gz.

File metadata

  • Download URL: promptry-0.7.0.tar.gz
  • Size: 297.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for promptry-0.7.0.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `dc2a37aaa75392b51612bb5e67a6ce79d881e639df23fd8f12a079a9c282e2f6` |
| MD5 | `c7ccac7e6e93abe810a1016fb0772ee3` |
| BLAKE2b-256 | `74fa93dab82613d0f4166d549a647404d80b9e3e523a5bb1afb242df9f89ab81` |


Provenance

The following attestation bundles were made for promptry-0.7.0.tar.gz:

Publisher: publish-pypi.yml on bihanikeshav/promptry

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file promptry-0.7.0-py3-none-any.whl.

File metadata

  • Download URL: promptry-0.7.0-py3-none-any.whl
  • Size: 266.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for promptry-0.7.0-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `e0bf986cbf67e28d18bfe16f6433f473474c8755b39e417cac341e6f76d5b871` |
| MD5 | `dcba3a4a0c5e6086bdf0e8f540eb6a2f` |
| BLAKE2b-256 | `bac1871eb6c1382b484d04febac16cedf499276f12b6a9d7fc128f1cb492e164` |


Provenance

The following attestation bundles were made for promptry-0.7.0-py3-none-any.whl:

Publisher: publish-pypi.yml on bihanikeshav/promptry

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
