Regression protection for LLM pipelines

promptry

Python 3.10+ · License: MIT

Local-first prompt observability that lives in your repo. Version your prompts, write eval suites in Python, track the cost of every call, edit prompts live, and catch regressions in CI. One pip install, one SQLite file, zero services — your prompts never leave your laptop.

Try the live demo → · Integration guide · Docs

from promptry import track, suite, assert_semantic

# track() content-hashes your prompt and stores a new version if it changed.
# system_prompt and llm stand in for your own prompt string and LLM client.
prompt = track(system_prompt, "rag-qa")
response = llm.chat(system=prompt, ...)

# Suites are regular Python functions. Run them via the CLI or in CI.
# my_pipeline stands in for your own application code.
@suite("rag-regression")
def test_quality():
    response = my_pipeline("What is photosynthesis?")
    assert_semantic(response, "Converts light into chemical energy")
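
To make "content-hashes your prompt and stores a new version if it changed" concrete, here is a minimal standard-library sketch of the idea. The `track_sketch` name, the table layout, and the choice of SHA-256 are assumptions for illustration only; they are not promptry's actual implementation or schema.

```python
import hashlib
import sqlite3


def track_sketch(conn: sqlite3.Connection, name: str, text: str) -> int:
    """Return the version of `text` under `name`, storing a new row only if the content changed."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS prompts ("
        "name TEXT, version INTEGER, sha256 TEXT, body TEXT, "
        "PRIMARY KEY (name, version))"
    )
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    latest = conn.execute(
        "SELECT version, sha256 FROM prompts WHERE name = ? "
        "ORDER BY version DESC LIMIT 1",
        (name,),
    ).fetchone()
    if latest is not None and latest[1] == digest:
        return latest[0]  # same content as the latest version: no new row
    version = 1 if latest is None else latest[0] + 1
    conn.execute(
        "INSERT INTO prompts VALUES (?, ?, ?, ?)", (name, version, digest, text)
    )
    conn.commit()
    return version


conn = sqlite3.connect(":memory:")
v1 = track_sketch(conn, "rag-qa", "You are a helpful assistant.")
v1_again = track_sketch(conn, "rag-qa", "You are a helpful assistant.")
v2 = track_sketch(conn, "rag-qa", "You are a concise assistant.")
print(v1, v1_again, v2)  # 1 1 2
```

This sketch only compares against the latest stored version, so reverting to an older prompt text would create a new version number rather than reuse the old one.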

When a suite regresses against its baseline, promptry reports what changed:

Overall score: 0.910 -> 0.720  REGRESSION

Probable cause:
  -> Prompt changed (v3 -> v4)
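
One way to picture how a report like this could be derived is to compare the metadata of the baseline run and the current run, and list whichever tracked inputs differ. The sketch below assumes a hypothetical `Run` record, a hypothetical `explain` helper, and an arbitrary score tolerance of 0.05; none of these names or thresholds come from promptry itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """Hypothetical summary of one eval run."""
    score: float
    prompt_version: int
    model: str


def explain(baseline: Run, current: Run, tolerance: float = 0.05) -> list[str]:
    """Build report lines; list changed inputs only when the score dropped past `tolerance`."""
    lines = [f"Overall score: {baseline.score:.3f} -> {current.score:.3f}"]
    if baseline.score - current.score <= tolerance:
        return lines
    lines[0] += "  REGRESSION"
    causes = []
    if current.prompt_version != baseline.prompt_version:
        causes.append(
            f"Prompt changed (v{baseline.prompt_version} -> v{current.prompt_version})"
        )
    if current.model != baseline.model:
        causes.append(f"Model changed ({baseline.model} -> {current.model})")
    lines.append("")
    lines.append("Probable cause:")
    lines.extend(f"  -> {cause}" for cause in (causes or ["No tracked input changed"]))
    return lines


report = explain(Run(0.910, 3, "model-a"), Run(0.720, 4, "model-a"))
print("\n".join(report))
```

With the example runs, only the prompt version differs, so the sketch prints the same three pieces of information as the report above: the score drop, the `REGRESSION` flag, and the prompt version change.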

Install

pip install promptry                         # core
pip install "promptry[semantic]"             # + semantic assertions (sentence-transformers)
pip install "promptry[dashboard]"            # + web dashboard
pip install "promptry[semantic,dashboard]"   # everything

Quick start

promptry init                              # scaffold project + starter eval
promptry run smoke-test --module evals     # run it
PASS test_basic_quality (142ms)
  semantic (0.891) ok

Overall: PASS  score: 0.891

Features

| Feature | What it does |
|---|---|
| Prompt versioning | Content-hashed, automatic dedup, grouped by module. No manual bumps, no YAML, no git dance. |
| Live prompt CMS | render_prompt() serves dashboard-edited {{name}} templates with no redeploy. Edit a prompt in the browser, your app picks it up on the next call. Substitution is value-driven, so JSON braces and literal $ are never mistaken for variables. |
| Semantic prompt search | Search the registry by meaning and flag near-duplicate prompts (likely forks to consolidate). Embeddings with a lexical fallback. |
| Environment promotion | dev → staging → prod tags gate every edit before it reaches users. Promote a version, roll one back. |
| Python-native suites | @suite decorators, not YAML. Loops, fixtures, and your IDE's debugger all work. |
| Deterministic assertions | Semantic, schema, JSON, regex, grounding, tool-use. Zero API calls at CI time. |
| LLM-as-judge | Opt-in, not default. You decide when to spend tokens on evaluation. |
| Drift detection | Mann-Whitney U on a rolling window with real p-values, on both eval scores and live production telemetry (cost, latency, output length, rating). |
| Regression diff | Tells you what changed (prompt version, model, or data), not just that it broke. |
| Regression bisect | Walks the run history to pinpoint the first run that broke a test. |
| SLO gates | [slo] latency budgets fail CI on performance regressions, independent of the eval score. |
| Judge-cost attribution | LLM-judge spend estimated and summed per eval run, so you see what evaluation itself costs. |
| Eval-from-trace | Promote a real captured invocation into a per-prompt golden set, then re-run it against any model to check accuracy. |
| Model comparison | Statistical comparison against the historical baseline, not snapshot-to-snapshot. |
| Invocations ledger | Every call recorded: tokens, cost, latency, model. Opt-in sampled request/response trace capture; per-call ratings/feedback via POST /api/feedback. |
| Cost tracking | Per-model pricing with module → prompt → call drill-down, per-call template-vs-payload split, and a coverage check that flags un-priced models. Cache-aware, across OpenAI, Anthropic, Gemini, Grok. |
| Budgets | Daily and monthly spend caps with breach alerts. |
| PII / secret scanning | Captured request/response text is scanned for API keys, private keys, JWTs, emails, SSNs, and card numbers; the dashboard warns with masked findings. |
| Safety suite | 25 jailbreak / injection / PII / encoding templates across 6 categories. Extensible via templates.toml. |
| MCP server | First-class: your LLM agent drives the whole test runner. Native, not a plugin. |
| Dashboard | Local web UI for eval history, prompt registry + live editing, cost drill-down, model comparison, invocation traces, and a multi-model playground. No account, no cloud. |
| Project config | Committable .promptry/config.toml (models, judge, dashboard prefs, pricing overrides). API keys via env. |
| JS/TS client | Ship prompt events from frontend/Node apps to the same SQLite store. |
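
For readers unfamiliar with the test named in the drift detection row, the sketch below computes a two-sided Mann-Whitney U test in pure Python, comparing a baseline window of scores with a recent window. It uses the normal approximation without tie or continuity correction, which is a simplification chosen for brevity; the function name and sample data are made up, and promptry's own computation may differ.

```python
import math


def mann_whitney_u(a: list[float], b: list[float]) -> tuple[float, float]:
    """Return (U statistic for `a`, two-sided p-value) using the normal approximation.

    Both samples must be non-empty. No tie or continuity correction is applied,
    so p-values for small or heavily tied samples are only rough.
    """
    values = sorted(a + b)
    # Assign each distinct value its midrank (average of the 1-based positions it occupies).
    rank_of: dict[float, float] = {}
    i = 0
    while i < len(values):
        j = i
        while j < len(values) and values[j] == values[i]:
            j += 1
        rank_of[values[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    n1, n2 = len(a), len(b)
    u1 = sum(rank_of[x] for x in a) - n1 * (n1 + 1) / 2
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return u1, p


baseline = [0.91, 0.90, 0.92, 0.89, 0.93, 0.90, 0.91, 0.92]
recent = [0.72, 0.75, 0.70, 0.74, 0.73, 0.71, 0.76, 0.72]
u, p = mann_whitney_u(baseline, recent)
print(u, p < 0.01)  # 64.0 True
```

Because the test is rank-based, it makes no assumption that scores are normally distributed, which suits bounded metrics such as similarity scores or skewed ones such as latency.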

Dashboard

pip install "promptry[dashboard]"
promptry dashboard

Overview: eval health and spend at a glance; drill into evals or cost for detail.

Prompts: the prompt registry, grouped by module. Click any prompt to inspect versions, diffs, and stats.

Prompt detail: edit the live {{name}}-placeholder template, with variable pills and promotion tags.

Cost: spend drilled module → prompt → the priciest individual calls.

Invocation: a single call, broken into fixed template overhead vs the variable payload you fed in.

Playground: render a prompt and compare it across models before promoting to a suite.
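
The live templates edited in the prompt detail view use `{{name}}` placeholders, and the feature table describes substitution as value-driven. One plausible reading, sketched below, is that only placeholders whose names appear in the supplied values are replaced, so JSON braces and `$` signs pass through untouched. The `render_sketch` function is hypothetical and is not the signature or behavior of promptry's `render_prompt()`.

```python
import re


def render_sketch(template: str, values: dict[str, str]) -> str:
    """Replace {{key}} only for keys present in `values`; leave all other text as-is."""
    if not values:
        return template
    names = "|".join(re.escape(key) for key in values)
    pattern = re.compile(r"\{\{\s*(" + names + r")\s*\}\}")
    # A function replacement avoids any special handling of backslashes in the values.
    return pattern.sub(lambda match: values[match.group(1)], template)


template = 'Reply to {{question}} as JSON: {"price": "$5"}'
rendered = render_sketch(template, {"question": "What is photosynthesis?"})
print(rendered)  # Reply to What is photosynthesis? as JSON: {"price": "$5"}
```

Building the pattern from the supplied keys means an unknown `{{placeholder}}` is left in the output rather than raising, which makes missing variables easy to spot in a rendered prompt.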

Why promptry

Three things you won't get elsewhere — together, in one tool:

  1. Code, not YAML. Suites are pytest-style decorators. Loops, fixtures, debugger breakpoints, IDE autocomplete. Promptfoo makes you generate YAML from Python scripts once your suite grows past a few dozen tests. Just skip the round trip.
  2. Local by design. One SQLite file. No account, no API key for the framework, no cloud to trust. LangSmith and DeepEval's flagship features push your prompts and outputs to their servers — disqualifying for regulated industries, IP-sensitive work, or anyone who reads their procurement policy.
  3. No per-run judge tax. Most assertions are deterministic: semantic similarity, schema, JSON, regex, grounding, tool-use. CI runs cost $0. RAGAS's headline metrics (faithfulness, answer relevancy, context precision) all need judge-model calls — every run costs tokens, adds latency, and drifts when the judge model updates. We treat LLM-as-judge as an opt-in, not a default.

| | Promptfoo | RAGAS | LangSmith | DeepEval | promptry |
|---|---|---|---|---|---|
| Config | YAML | Python metrics | SaaS UI | Python | Python decorators |
| Data location | Local | Local | Their cloud | Local + push | Local SQLite |
| Account required | No | No | Yes | No (for OSS) | No, ever |
| CI cost per run | Mixed | Per-judge-call | Trace volume | Per-judge-call | $0 (deterministic) |
| Prompt versioning | Manual + git | None | Prompt Hub | None | Automatic content-hash |
| Live prompt editing | None | None | Prompt Hub (cloud) | None | Dashboard, no redeploy |
| Drift detection | None | None | Dashboards only | None | Mann-Whitney U + p-values |
| Cost budgets + alerts | None | None | Usage charts only | None | Daily/monthly caps |
| MCP server | Plugin | None | None | Partial | Native |
| Commercial tier | Promptfoo Enterprise | None | LangSmith (SaaS) | Confident AI | None planned |
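
To illustrate why deterministic assertions cost nothing at CI time, the sketch below checks a model output with a JSON key check and a regex, using only the standard library and no network calls. The helper names are invented for this example and are not promptry's assertion API.

```python
import json
import re


def has_json_keys(output: str, required: set[str]) -> bool:
    """True if `output` parses as a JSON object containing every required key."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required.issubset(data)


def matches(output: str, pattern: str) -> bool:
    """True if the regex `pattern` is found anywhere in `output`."""
    return re.search(pattern, output) is not None


output = '{"answer": "Plants convert light into chemical energy.", "sources": ["doc-12"]}'
schema_ok = has_json_keys(output, {"answer", "sources"})
regex_ok = matches(output, r"chemical energy")
print(schema_ok, regex_ok)  # True True
```

Checks like these return the same result for the same output on every run, which is the property that separates them from judge-model scoring.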

GitHub Action

Run eval suites in CI with a single workflow step. On pull requests, the action posts (or updates) a single comment summarizing the eval: overall score, pass/fail counts, and any regressed tests vs. the previous run. View on Marketplace.

# .github/workflows/eval.yml
name: Eval
on: [push, pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write  # required for PR comments
    steps:
      - uses: actions/checkout@v4
      - uses: bihanikeshav/promptry@v0.6.0
        with:
          suite: rag-regression
          module: evals
          compare: prod  # optional — compare against baseline

Example PR comment on a regression:

## promptry eval: rag-regression

| | Current | Baseline | Delta |
|---|---|---|---|
| Overall score | 0.891 | 0.910 | -0.019 |
| Passed | 8/10 | 9/10 | -1 |
| Status | REGRESSED | PASS | |

**Regressions:**
- `test_photosynthesis_answer`: semantic 0.89 -> 0.72 (-0.17)
- `test_schema_validation`: passed -> **failed**

_Generated by [promptry](https://github.com/bihanikeshav/promptry)_

Subsequent pushes edit the same comment instead of spamming new ones.
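
A common way to get this "edit, don't spam" behavior is to embed a hidden marker in the comment body and look for it among the pull request's existing comments. The sketch below shows only that decision logic as a pure function; the marker format, the `plan_comment` name, and the comment dictionaries are assumptions, and the action's real mechanism is not documented here.

```python
# Hypothetical hidden marker; HTML comments are not rendered in GitHub markdown.
MARKER = "<!-- promptry-eval:{suite} -->"


def plan_comment(existing: list[dict], suite: str, body: str) -> dict:
    """Decide whether to update a previous comment for `suite` or create a new one."""
    marker = MARKER.format(suite=suite)
    full_body = f"{marker}\n{body}"
    for comment in existing:
        if marker in comment["body"]:
            return {"action": "update", "comment_id": comment["id"], "body": full_body}
    return {"action": "create", "body": full_body}


first = plan_comment([], "rag-regression", "## promptry eval: rag-regression")
posted = [{"id": 7, "body": first["body"]}]
second = plan_comment(posted, "rag-regression", "## promptry eval: rag-regression (rerun)")
print(first["action"], second["action"], second["comment_id"])  # create update 7
```

Keying the marker on the suite name would let several suites each keep their own comment on the same pull request.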

| Input | Required | Default | Description |
|---|---|---|---|
| suite | Yes | | Eval suite name |
| module | Yes | | Python module containing the suite |
| compare | No | | Baseline tag to compare against |
| python-version | No | 3.12 | Python version |
| extras | No | semantic | pip extras to install |
| pr-comment | No | true | Post/update a PR comment with results |
| github-token | No | ${{ github.token }} | Token used to post PR comments |

MCP server

claude mcp add promptry -- promptry mcp    # Claude Code

Works with Claude Desktop, Cursor, Windsurf, VS Code. See full setup.

Documentation

The full guide covers all assertions, cost tracking, model comparison, safety templates, notifications, storage modes, JS client, CLI reference, MCP setup, and config options.

Scope

Promptry is local-first by design. If you need a hosted, always-on observability product for production traffic with team seats and SSO, use LangSmith or Arize — different product category. Promptry runs against one SQLite file on your machine: wire it into CI so a bad prompt change never reaches production, manage your live prompts from the dashboard, and keep a per-call ledger of cost and traces without sending anything to a vendor.

Shipped: everything in the feature table above, across Python + JS + CLI + dashboard + MCP + GitHub Action — including the live prompt CMS with environment promotion, the per-call invocations ledger with opt-in request/response capture and feedback ingest, cost-by-module drill-down with budgets, and regression bisect.

On the roadmap: agent trajectory analysis and LLM-powered root cause.

License

MIT
