Regression protection for LLM pipelines

Maintainers

bihanikeshav

Project description

promptry

Local-first prompt observability that lives in your repo. Version your prompts, write eval suites in Python, track the cost of every call, edit prompts live, and catch regressions in CI. One pip install, one SQLite file, zero services — your prompts never leave your laptop.

Try the live demo → · Integration guide · Docs

from promptry import track, suite, assert_semantic

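# track() content-hashes your prompt and stores a new version if it changed.
# system_prompt is your prompt string and llm is your own client; the "..."
# stands for whatever other arguments that client takes.
prompt = track(system_prompt, "rag-qa")
response = llm.chat(system=prompt, ...)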

# suites are regular Python functions. run them via CLI or in CI.
@suite("rag-regression")
def test_quality():
    response = my_pipeline("What is photosynthesis?")
    assert_semantic(response, "Converts light into chemical energy")
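
The content-hash versioning described above can be illustrated with a minimal sketch. This is not promptry's implementation: the function name `track_sketch`, the table layout, and the choice of SHA-256 are assumptions made for illustration. It only shows the general idea that identical text maps to the same version and changed text creates a new one.

```python
import hashlib
import sqlite3


def track_sketch(conn: sqlite3.Connection, text: str, name: str) -> int:
    """Store `text` under `name` if its hash is new; return the version number.

    Hypothetical helper for illustration, not part of promptry's API.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS prompts ("
        "name TEXT, version INTEGER, sha256 TEXT, body TEXT, "
        "UNIQUE(name, sha256))"
    )
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    row = conn.execute(
        "SELECT version FROM prompts WHERE name = ? AND sha256 = ?",
        (name, digest),
    ).fetchone()
    if row is not None:
        return row[0]  # same content seen before: reuse its version
    latest = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM prompts WHERE name = ?",
        (name,),
    ).fetchone()[0]
    conn.execute(
        "INSERT INTO prompts VALUES (?, ?, ?, ?)",
        (name, latest + 1, digest, text),
    )
    conn.commit()
    return latest + 1


conn = sqlite3.connect(":memory:")
v1 = track_sketch(conn, "Answer using only the context.", "rag-qa")
v1_again = track_sketch(conn, "Answer using only the context.", "rag-qa")
v2 = track_sketch(conn, "Answer using only the context. Cite sources.", "rag-qa")
```

In this sketch, reverting a prompt to earlier text returns the earlier version number rather than creating a new one; whether promptry behaves the same way is not stated here.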

When a suite regresses against its baseline, promptry reports what changed:

Overall score: 0.910 -> 0.720  REGRESSION

Probable cause:
  -> Prompt changed (v3 -> v4)

Install

pip install promptry                       # core
pip install "promptry[semantic]"           # + semantic assertions (sentence-transformers)
pip install "promptry[dashboard]"          # + web dashboard
pip install "promptry[semantic,dashboard]" # everything

Quick start

promptry init                              # scaffold project + starter eval
promptry run smoke-test --module evals     # run it

PASS test_basic_quality (142ms)
  semantic (0.891) ok

Overall: PASS  score: 0.891

Features

Feature	What it does
Prompt versioning	Content-hashed, automatic dedup, grouped by module. No manual bumps, no YAML, no git dance.
Live prompt CMS	`render_prompt()` serves dashboard-edited `{{name}}` templates with no redeploy. Edit a prompt in the browser, your app picks it up on the next call. Substitution is value-driven, so JSON braces and literal `$` are never mistaken for variables.
Semantic prompt search	Search the registry by meaning and flag near-duplicate prompts (likely forks to consolidate). Embeddings with a lexical fallback.
Environment promotion	dev → staging → prod tags gate every edit before it reaches users. Promote a version, roll one back.
Python-native suites	`@suite` decorators, not YAML. Loops, fixtures, and your IDE's debugger all work.
Deterministic assertions	Semantic, schema, JSON, regex, grounding, tool-use. Zero API calls at CI time.
LLM-as-judge	Opt-in, not default. You decide when to spend tokens on evaluation.
Drift detection	Mann-Whitney U on a rolling window with real p-values — on eval scores and on live production telemetry (cost, latency, output length, rating).
Regression diff	Tells you what changed — prompt version, model, or data — not just that it broke.
Regression bisect	Walks the run history to pinpoint the first run that broke a test.
SLO gates	`[slo]` latency budgets fail CI on performance regressions, independent of the eval score.
Judge-cost attribution	LLM-judge spend estimated and summed per eval run, so you see what evaluation itself costs.
Eval-from-trace	Promote a real captured invocation into a per-prompt golden set, then re-run it against any model to check accuracy.
Model comparison	Statistical comparison against the historical baseline, not snapshot-to-snapshot.
Invocations ledger	Every call recorded: tokens, cost, latency, model. Opt-in sampled request/response trace capture; per-call ratings/feedback via `POST /api/feedback`.
Cost tracking	Per-model pricing with module → prompt → call drill-down, per-call template-vs-payload split, and a coverage check that flags un-priced models. Cache-aware, across OpenAI, Anthropic, Gemini, Grok.
Budgets	Daily and monthly spend caps with breach alerts.
PII / secret scanning	Captured request/response text is scanned for API keys, private keys, JWTs, emails, SSNs, and card numbers; the dashboard warns with masked findings.
Safety suite	25 jailbreak / injection / PII / encoding templates across 6 categories. Extensible via `templates.toml`.
MCP server	First-class: your LLM agent drives the whole test runner. Native, not a plugin.
Dashboard	Local web UI for eval history, prompt registry + live editing, cost drill-down, model comparison, invocation traces, and a multi-model playground. No account, no cloud.
Project config	Committable `.promptry/config.toml` (models, judge, dashboard prefs, pricing overrides). API keys via env.
JS/TS client	Ship prompt events from frontend/Node apps to the same SQLite store.
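
As a rough illustration of the regression bisect row above, the sketch below walks a run history to find where the current failing streak began. The data shape (`(run_id, passed)` pairs) and the function name are hypothetical, and this is a plain linear walk rather than a binary search; how promptry stores and searches its run history is not shown in this document.

```python
from typing import Optional, Sequence, Tuple


def first_breaking_run(history: Sequence[Tuple[str, bool]]) -> Optional[str]:
    """Return the id of the run that started the current failing streak.

    `history` is ordered oldest to newest as (run_id, passed) pairs.
    Returns None if the newest run passed or the history is empty.
    """
    breaking = None
    # Walk backwards from the newest run while the test keeps failing.
    for run_id, passed in reversed(history):
        if passed:
            break
        breaking = run_id
    return breaking


history = [("run-1", True), ("run-2", True), ("run-3", False), ("run-4", False)]
culprit = first_breaking_run(history)
```

A linear walk is used here because a binary search is only valid when the history is strictly "passing, then failing"; a flaky test breaks that assumption.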

Dashboard

pip install "promptry[dashboard]"
promptry dashboard

Overview: eval health and spend at a glance. Drill into evals or cost for detail.

Prompts: the prompt registry, grouped by module. Click any prompt to inspect versions, diffs, and stats.

A prompt detail view: edit the live `{{name}}` template, with variable pills and promotion tags.

Cost: drilled module → prompt → the priciest individual calls.

Invocation: a single call, broken into fixed template overhead vs the variable payload you fed in.

The playground: render a prompt and compare it across models before promoting to a suite.
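
To make the "value-driven" substitution idea from the feature table concrete, here is a minimal sketch. It is not `render_prompt()` itself: the function name `render_sketch` and the decision to leave unknown placeholders untouched are assumptions. The point it illustrates is that only `{{key}}` for keys you actually supply is replaced, so JSON braces and a literal `$` pass through unchanged.

```python
import re
from typing import Mapping


def render_sketch(template: str, values: Mapping[str, object]) -> str:
    """Fill {{key}} placeholders for the keys in `values`, in a single pass.

    Hypothetical helper for illustration, not part of promptry's API.
    """
    if not values:
        return template
    # Build one pattern that matches only the placeholders we have values for.
    names = "|".join(re.escape(key) for key in values)
    pattern = re.compile(r"\{\{(" + names + r")\}\}")
    # A function replacement avoids backslash/group expansion in the values.
    return pattern.sub(lambda m: str(values[m.group(1)]), template)


template = 'Answer as JSON like {"answer": "..."}. Budget: $5. Q: {{question}}'
rendered = render_sketch(template, {"question": "What is {{x}}?"})
```

Because the substitution is a single pass, a value that itself contains `{{x}}` is inserted verbatim and not expanded again.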

Why promptry

Three things you won't get elsewhere — together, in one tool:

Code, not YAML. Suites are pytest-style decorators. Loops, fixtures, debugger breakpoints, IDE autocomplete. Promptfoo makes you generate YAML from Python scripts once your suite grows past a few dozen tests. Just skip the round trip.
Local by design. One SQLite file. No account, no API key for the framework, no cloud to trust. LangSmith and DeepEval's flagship features push your prompts and outputs to their servers — disqualifying for regulated industries, IP-sensitive work, or anyone who reads their procurement policy.
No per-run judge tax. Most assertions are deterministic: semantic similarity, schema, JSON, regex, grounding, tool-use. CI runs cost $0. RAGAS's headline metrics (faithfulness, answer relevancy, context precision) all need judge-model calls — every run costs tokens, adds latency, and drifts when the judge model updates. We treat LLM-as-judge as an opt-in, not a default.
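
For a sense of what "deterministic" means in the last point, the sketch below shows two checks that need no model call: a JSON-shape check and a regex check, both using only the standard library. The helper names are hypothetical and are not promptry's assertion functions; they only illustrate the category.

```python
import json
import re
from typing import Iterable


def is_json_with_keys(text: str, required: Iterable[str]) -> bool:
    """True if `text` parses as a JSON object containing every required key."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(key in data for key in required)


def matches_pattern(text: str, pattern: str) -> bool:
    """True if `pattern` is found anywhere in `text`."""
    return re.search(pattern, text) is not None


response = '{"answer": "Converts light into chemical energy", "sources": ["doc-12"]}'
json_ok = is_json_with_keys(response, ["answer", "sources"])
cites_doc = matches_pattern(response, r"doc-\d+")
```

Checks like these return the same result on every run for the same output, which is why they add no token cost or judge drift in CI.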

	Promptfoo	RAGAS	LangSmith	DeepEval	promptry
Config	YAML	Python metrics	SaaS UI	Python	Python decorators
Data location	Local	Local	Their cloud	Local + push	Local SQLite
Account required	No	No	Yes	No (for OSS)	No, ever
CI cost per run	Mixed	Per-judge-call	Trace volume	Per-judge-call	$0 (deterministic)
Prompt versioning	Manual + git	None	Prompt Hub	None	Automatic content-hash
Live prompt editing	None	None	Prompt Hub (cloud)	None	Dashboard, no redeploy
Drift detection	None	None	Dashboards only	None	Mann-Whitney U + p-values
Cost budgets + alerts	None	None	Usage charts only	None	Daily/monthly caps
MCP server	Plugin	None	None	Partial	Native
Commercial tier	Promptfoo Enterprise	None	LangSmith (SaaS)	Confident AI	None planned
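
The Mann-Whitney U test named in the drift detection row can be sketched in pure Python. This is an illustrative implementation using the normal approximation with tie correction, not promptry's code; the window sizes, significance threshold, and sample scores below are made up for the example. For production use, `scipy.stats.mannwhitneyu` is a maintained alternative.

```python
import math
from typing import Sequence, Tuple


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Two-sided Mann-Whitney U test using the normal approximation.

    Returns (U statistic for x, approximate p-value). Ties get average
    ranks and the variance is tie-corrected. No continuity correction.
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y], key=lambda t: t[0])
    n = n1 + n2
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        t = j - i  # size of this group of tied values
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        tie_term += t ** 3 - t
        i = j
    u_x = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_x, 1.0  # every value identical: no evidence of a shift
    z = (u_x - mean_u) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail probability
    return u_x, p


# Made-up eval scores: an older baseline window vs. a recent window.
baseline = [0.91, 0.90, 0.92, 0.89, 0.93, 0.91, 0.90, 0.92]
recent = [0.72, 0.75, 0.70, 0.74, 0.73, 0.71, 0.76, 0.72]
u_stat, p_value = mann_whitney_u(baseline, recent)
drifted = p_value < 0.01  # threshold chosen for the example only
```

The test compares ranks rather than means, so it does not assume the scores are normally distributed. The normal approximation is less reliable for very small windows.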

GitHub Action

Run eval suites in CI with one line. On pull requests it posts (or updates) a single comment summarizing the eval: overall score, pass/fail counts, and any regressed tests vs. the previous run. View on Marketplace.

# .github/workflows/eval.yml
name: Eval
on: [push, pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write  # required for PR comments
    steps:
      - uses: actions/checkout@v4
      - uses: bihanikeshav/promptry@v0.6.0
        with:
          suite: rag-regression
          module: evals
          compare: prod  # optional — compare against baseline

Example PR comment on a regression:

## promptry eval: rag-regression

| | Current | Baseline | Delta |
|---|---|---|---|
| Overall score | 0.891 | 0.910 | -0.019 |
| Passed | 8/10 | 9/10 | -1 |
| Status | REGRESSED | PASS | |

**Regressions:**
- `test_photosynthesis_answer`: semantic 0.89 -> 0.72 (-0.17)
- `test_schema_validation`: passed -> **failed**

_Generated by [promptry](https://github.com/bihanikeshav/promptry)_

Subsequent pushes edit the same comment instead of spamming new ones.
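
One common way to get this "edit, don't spam" behavior is to tag the comment with a hidden marker and look for it on later runs. The sketch below shows that logic against an in-memory list; the marker string and function name are hypothetical, and how the promptry action actually identifies its comment is not documented here. In a real workflow the list would come from GitHub's issue comments API.

```python
from typing import Dict, List

# Hypothetical hidden marker; not necessarily what the promptry action uses.
MARKER = "<!-- promptry-eval -->"


def upsert_comment(comments: List[Dict[str, str]], body: str) -> str:
    """Update the comment carrying MARKER if present, else append a new one."""
    tagged = f"{MARKER}\n{body}"
    for comment in comments:
        if MARKER in comment["body"]:
            comment["body"] = tagged
            return "updated"
    comments.append({"body": tagged})
    return "created"


thread = [{"body": "LGTM"}]
first = upsert_comment(thread, "Overall score: 0.910")
second = upsert_comment(thread, "Overall score: 0.891")
```

After both calls the thread still holds two comments: the human one and a single eval summary carrying the latest score.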

Input	Required	Default	Description
`suite`	Yes		Eval suite name
`module`	Yes		Python module containing the suite
`compare`	No		Baseline tag to compare against
`python-version`	No	`3.12`	Python version
`extras`	No	`semantic`	pip extras to install
`pr-comment`	No	`true`	Post/update a PR comment with results
`github-token`	No	`${{ github.token }}`	Token used to post PR comments

MCP server

claude mcp add promptry -- promptry mcp    # Claude Code

Works with Claude Desktop, Cursor, Windsurf, VS Code. See full setup.

Documentation

The full guide covers all assertions, cost tracking, model comparison, safety templates, notifications, storage modes, JS client, CLI reference, MCP setup, and config options.

Scope

Promptry is local-first by design. If you need a hosted, always-on observability product for production traffic with team seats and SSO, use LangSmith or Arize — different product category. Promptry runs against one SQLite file on your machine: wire it into CI so a bad prompt change never reaches production, manage your live prompts from the dashboard, and keep a per-call ledger of cost and traces without sending anything to a vendor.

Shipped: everything in the feature table above, across Python + JS + CLI + dashboard + MCP + GitHub Action — including the live prompt CMS with environment promotion, the per-call invocations ledger with opt-in request/response capture and feedback ingest, cost-by-module drill-down with budgets, and regression bisect.

On the roadmap: agent trajectory analysis and LLM-powered root cause.

License

MIT


Release history

Version	Released
1.0.0 (this version)	May 26, 2026
0.7.0	Apr 17, 2026
0.6.1	Apr 17, 2026
0.6.0	Apr 12, 2026
0.5.0	Apr 12, 2026
0.4.0	Apr 10, 2026
0.3.0	Mar 18, 2026
0.1.0	Mar 4, 2026

Download files

Source Distribution

promptry-1.0.0.tar.gz (291.8 kB)

Uploaded May 26, 2026 Source

Built Distribution

promptry-1.0.0-py3-none-any.whl (254.9 kB)

Uploaded May 26, 2026 Python 3

File details

Details for the file promptry-1.0.0.tar.gz.

File metadata

Download URL: promptry-1.0.0.tar.gz
Upload date: May 26, 2026
Size: 291.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for promptry-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`6a533da97d447e7725c0103f40d532945f195e930acf54525a4cf4a4eb392d34`
MD5	`c32d35ac2104f22117c39c4d8ccb2050`
BLAKE2b-256	`d62d0c709a5e1591d30fe47249455af8009d4de6fe3ffa038134e9d5c7f41379`

See more details on using hashes here.
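
If you download the file manually, you can check it against the SHA256 digest listed above. The following is a small sketch using only the standard library; the helper name and the local file path are assumptions, and the expected digest is the one shown in the table for the source distribution.

```python
import hashlib
from pathlib import Path


def sha256_matches(path: Path, expected_hex: str, chunk_size: int = 1 << 20) -> bool:
    """Hash the file in chunks and compare against the expected hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.strip().lower()


# Assumes the sdist was saved to the current directory under its original name.
sdist = Path("promptry-1.0.0.tar.gz")
expected = "6a533da97d447e7725c0103f40d532945f195e930acf54525a4cf4a4eb392d34"
if sdist.exists():
    print("digest OK" if sha256_matches(sdist, expected) else "digest MISMATCH")
```

Reading in chunks keeps memory use flat regardless of file size.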

Provenance

The following attestation bundles were made for promptry-1.0.0.tar.gz:

Publisher: publish-pypi.yml on bihanikeshav/promptry

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: promptry-1.0.0.tar.gz
- Subject digest: 6a533da97d447e7725c0103f40d532945f195e930acf54525a4cf4a4eb392d34
- Sigstore transparency entry: 1631754712
- Sigstore integration time: May 26, 2026
Source repository:
- Permalink: bihanikeshav/promptry@1880f45106a74760ac510f5c81cf9892cee3f7d3
- Branch / Tag: refs/tags/v1.0.0rc1
- Owner: https://github.com/bihanikeshav
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-pypi.yml@1880f45106a74760ac510f5c81cf9892cee3f7d3
- Trigger Event: release

File details

Details for the file promptry-1.0.0-py3-none-any.whl.

File metadata

Download URL: promptry-1.0.0-py3-none-any.whl
Upload date: May 26, 2026
Size: 254.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for promptry-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`edaa31020daddc474a054c96bce47c23a3dfefd476b88de5ece37a04365dce16`
MD5	`e2fa4f8bee8f21659a85ea3e303941d0`
BLAKE2b-256	`c386db1d87c4ea0d1b5d90a34dfce57d0b2474a370e42ec74fa6d527dbe79b4d`

See more details on using hashes here.

Provenance

The following attestation bundles were made for promptry-1.0.0-py3-none-any.whl:

Publisher: publish-pypi.yml on bihanikeshav/promptry

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: promptry-1.0.0-py3-none-any.whl
- Subject digest: edaa31020daddc474a054c96bce47c23a3dfefd476b88de5ece37a04365dce16
- Sigstore transparency entry: 1631754747
- Sigstore integration time: May 26, 2026
Source repository:
- Permalink: bihanikeshav/promptry@1880f45106a74760ac510f5c81cf9892cee3f7d3
- Branch / Tag: refs/tags/v1.0.0rc1
- Owner: https://github.com/bihanikeshav
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-pypi.yml@1880f45106a74760ac510f5c81cf9892cee3f7d3
- Trigger Event: release
