
A minimalist LLMOps framework for prompt versioning, evaluation and regression testing.


PromptForge 🔨

I changed a prompt in production. The urgency classifier dropped from 100% to 75%. Nobody noticed for two weeks. That's the problem PromptForge solves.

PromptForge is a minimalist, open-source LLMOps framework for prompt versioning, evaluation, and regression testing. Built by someone who wrote a book on prompt engineering and got tired of "vibes-based" quality control.


The Problem

You change a prompt. You run it manually on 3 examples. It "feels better". You ship it.

Two days later, a category of inputs silently degrades. You have no baseline, no metrics, no diff. You have a hunch.

PromptForge treats prompts like code: versioned, tested, diffed, and auditable.


Real Example – Support Ticket Triage

Here's a real scenario: an AI system that classifies customer support tickets by category, urgency, and responsible team.

The prompt was "working". But was it really?

We ran PromptForge against 8 real support cases and discovered:

Evaluator             | Mean Score | Failure Rate | Cases
json_validity         |      1.000 |         0.0% |     8   ✅
schema_match          |      1.000 |         0.0% |     8   ✅
field_match_category  |      1.000 |         0.0% |     8   ✅
field_match_urgency   |      0.750 |        25.0% |     8   ⚠️  ← problem found
field_match_team      |      1.000 |         0.0% |     8   ✅

PromptForge pinpointed the exact failures:

Case | Customer Message                                    | Expected | Got  | Status
t004 | "Can't login since yesterday, password is correct." | critical | high | ❌
t005 | "My subscription was cancelled without warning."    | critical | high | ❌

Root cause: the prompt had no definition of what "critical" means for this company. The model couldn't distinguish "high" from "critical".

The fix: explicit urgency definitions (v1.1.0)

We added a definitions block to the prompt:

- "critical": user completely blocked OR data loss OR account access lost OR active incorrect charge
- "high": important feature broken but workaround exists OR charge resolved but no refund yet
- "medium": performance degradation or delays affecting work
- "low": feature requests, questions, suggestions

The result, proven with data rather than gut feeling:

promptforge diff --baseline <v1.0.0-run> --candidate <v1.1.0-run>

Evaluator             | Baseline | Candidate | Delta  | Status
field_match_category  |    1.000 |     1.000 | +0.000 | – unchanged
field_match_team      |    1.000 |     1.000 | +0.000 | – unchanged
field_match_urgency   |    0.750 |     1.000 | +0.250 | ✅ IMPROVED
json_validity         |    1.000 |     1.000 | +0.000 | – unchanged
schema_match          |    1.000 |     1.000 | +0.000 | – unchanged

✓ No regressions detected.

+25 percentage points on urgency. Zero regressions. Proven.

This is what you normally don't have. Without PromptForge, you change a prompt, test it on two examples, and ship hoping for the best. With PromptForge, you have written, reproducible proof.
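The regression check at the heart of promptforge diff is easy to reason about. A minimal sketch, assuming each run reduces to a map of evaluator name to mean score (an illustration of the idea, not PromptForge's actual code):

```python
def diff_runs(baseline: dict[str, float], candidate: dict[str, float],
              tolerance: float = 1e-9) -> dict[str, str]:
    """Label every evaluator shared by both runs as improved, regressed,
    or unchanged, based on the delta of its mean score."""
    verdicts = {}
    for name in sorted(set(baseline) & set(candidate)):
        delta = candidate[name] - baseline[name]
        if delta > tolerance:
            verdicts[name] = f"+{delta:.3f} IMPROVED"
        elif delta < -tolerance:
            verdicts[name] = f"{delta:.3f} REGRESSED"
        else:
            verdicts[name] = "+0.000 unchanged"
    return verdicts

# The v1.0.0 -> v1.1.0 comparison from the table above:
v1_0 = {"field_match_urgency": 0.750, "json_validity": 1.000}
v1_1 = {"field_match_urgency": 1.000, "json_validity": 1.000}
print(diff_runs(v1_0, v1_1))
```

A CI gate can then fail the build whenever any verdict ends in REGRESSED.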


Core Concepts

Concept    | What it is
PromptSpec | A YAML file defining your prompt template, system prompt, inputs, output contract, and model params
Dataset    | A golden set of {input, expected} cases: real examples with known correct answers
Run        | One execution of a PromptSpec against a Dataset; produces scores per case
Evaluator  | A function that scores each output (heuristic or LLM-as-judge)
Diff       | A comparison between two Runs showing regressions and improvements
Report     | A Markdown report with ASCII charts, failure analysis, and automated insights
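For a concrete sense of what one Dataset case carries, here it is expressed as plain Python. The real format is YAML, and the category and team values below are illustrative assumptions, not values from the project's actual dataset:

```python
# One golden case: a real input paired with its known-correct answer.
# Field names mirror the concepts above; "account" and "auth" are
# hypothetical labels chosen for illustration.
golden_case = {
    "id": "t004",
    "input": {"message": "Can't login since yesterday, password is correct."},
    "expected": {"category": "account", "urgency": "critical", "team": "auth"},
}

# An evaluator such as field_match_urgency compares the model output's
# "urgency" field against golden_case["expected"]["urgency"].
```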

Quickstart

# Install
pip install promptforge-llmops

# Set your API key (OpenAI, Anthropic, or any OpenAI-compatible provider like Groq)
# .env file:
# OPENAI_API_KEY=your-key-here
# OPENAI_BASE_URL=https://api.groq.com/openai/v1  ← optional, for Groq (free tier available)

# Scaffold a new prompt interactively
promptforge new

# Or initialise a project manually
promptforge init

# Validate your files
promptforge validate \
  --prompt prompts/my_prompt.yaml \
  --dataset datasets/my_golden.yaml

# Run evaluation
promptforge eval \
  --prompt prompts/my_prompt.yaml \
  --dataset datasets/my_golden.yaml \
  --config configs/my_config.yaml

# Compare two runs (detect regressions)
promptforge diff --baseline <run_id_A> --candidate <run_id_B>

# View score evolution across versions
promptforge history --prompt my_prompt

# Generate Markdown report
promptforge report --run <run_id> --out report.md

# View recent runs
promptforge runs

promptforge new – Interactive Wizard

The fastest way to get started. One command creates all three files you need:

$ promptforge new

🔨 PromptForge – New Prompt Wizard

  Prompt name: support_triage
  Description: Classifies customer support tickets
  Provider [openai]: openai
  Model [llama-3.3-70b-versatile]:
  Output format (text/json) [json]: json
  Version [0.1.0]:

  ✓ Created prompts/support_triage.yaml
  ✓ Created datasets/support_triage_golden.yaml
  ✓ Created configs/support_triage.yaml

Next step:
  promptforge eval \
    --prompt prompts/support_triage.yaml \
    --dataset datasets/support_triage_golden.yaml \
    --config configs/support_triage.yaml

System Prompt Support

Define a system_prompt separately from your user template, the way modern models work best:

id: support_triage
version: 1.2.0
system_prompt: "You are a precise support triage agent. Always respond with valid JSON only."
template: |
  Classify the following message: {{ message }}

PromptForge sends them as separate messages to the API. Changes to either the system prompt or the template are tracked in the content hash, so a diff will catch regressions even if only the system prompt changed.
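One way such a combined hash can work, sketched with the standard library (PromptForge's exact hashing scheme may differ):

```python
import hashlib
import json

def content_hash(system_prompt: str, template: str) -> str:
    """Hash both messages together so a change to either produces a
    new hash. A sketch of the idea, not the library's actual code."""
    payload = json.dumps(
        {"system_prompt": system_prompt, "template": template},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

base = content_hash("You are a triage agent.", "Classify: {{ message }}")
tweaked = content_hash("You are a strict triage agent.", "Classify: {{ message }}")
assert base != tweaked  # a system-prompt-only change is still detected
```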


LLM-as-Judge Evaluators

Beyond heuristics, PromptForge supports LLM-as-judge evaluation using rubrics. Define a rubric YAML:

# rubrics/support_quality.yaml
rubric_id: support_quality
judge_model: llama-3.3-70b-versatile
dimensions:
  - name: clarity
    scale: [1, 2, 3, 4, 5]
    instruction: "Is the reason field clear and easy to understand for a support agent?"
  - name: accuracy
    scale: [1, 2, 3, 4, 5]
    instruction: "Does the classification correctly reflect the customer's problem?"
  - name: completeness
    scale: [1, 2, 3, 4, 5]
    instruction: "Does the response include all required fields with meaningful values?"

Add it to your config:

evaluators:
  - type: heuristic
    name: json_validity
  - type: judge
    name: quality
    config:
      rubric: rubrics/support_quality.yaml

Each dimension generates a separate normalised score (0.0–1.0) in the run summary:

Evaluator             | Mean Score | Failure Rate | Cases
json_validity         |      1.000 |         0.0% |     8   ✅
field_match_urgency   |      1.000 |         0.0% |     8   ✅
quality_clarity       |      1.000 |         0.0% |     8   ✅
quality_accuracy      |      1.000 |         0.0% |     8   ✅
quality_completeness  |      1.000 |         0.0% |     8   ✅
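The normalisation itself is straightforward. A sketch, assuming a raw judge score is mapped linearly from its scale onto 0.0–1.0 (the library's exact formula is not documented here):

```python
def normalise(raw: int, scale: list[int]) -> float:
    """Map a raw judge score onto 0.0-1.0 linearly across its scale."""
    lo, hi = min(scale), max(scale)
    return (raw - lo) / (hi - lo)

assert normalise(5, [1, 2, 3, 4, 5]) == 1.0   # top of the scale
assert normalise(3, [1, 2, 3, 4, 5]) == 0.5   # midpoint
assert normalise(1, [1, 2, 3, 4, 5]) == 0.0   # bottom
```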

Score History

Track how your prompt evolves over time:

$ promptforge history --prompt support_triage

📈 Evolution – support_triage
┏━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Version ┃ Date       ┃ fm_urgency ┃ fm_cat...  ┃ Trend         ┃
┡━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ v1.0.0  │ 2026-03-07 │ 0.75 ████░ │ 1.00 █████ │ –             │
│ v1.1.0  │ 2026-03-07 │ 1.00 █████ │ 1.00 █████ │ ↑ 1 improved  │
│ v1.2.0  │ 2026-03-08 │ 1.00 █████ │ 1.00 █████ │ ↑ 3 improved  │
└─────────┴────────────┴────────────┴────────────┴───────────────┘
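The Trend column can be reproduced from two consecutive runs' mean scores. A sketch of the idea, not the library's rendering code:

```python
def trend(prev: dict[str, float], curr: dict[str, float]) -> str:
    """Count evaluators whose mean score rose since the previous version."""
    improved = sum(1 for name in curr if name in prev and curr[name] > prev[name])
    return f"↑ {improved} improved" if improved else "–"

# v1.0.0 -> v1.1.0 from the table above:
v1_0 = {"fm_urgency": 0.75, "fm_category": 1.00}
v1_1 = {"fm_urgency": 1.00, "fm_category": 1.00}
print(trend(v1_0, v1_1))  # → "↑ 1 improved"
```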

Use as a Library

PromptForge can also be used directly in Python, no CLI required:

from dotenv import load_dotenv
load_dotenv()

from promptforge import PromptSpec, Dataset, RunConfig, EvalPipeline
from promptforge.store.db import init_db
from promptforge.store.repositories import ScoreRepository
from promptforge.eval.aggregations import aggregate_run_scores

init_db()

ps = PromptSpec.from_yaml("prompts/support_triage.yaml")
ds = Dataset.from_file("datasets/support_golden.yaml")
rc = RunConfig.from_yaml("configs/support_triage.yaml")

pipeline = EvalPipeline(ps, ds, rc)
run_id = pipeline.run()

scores = ScoreRepository().get_by_run(run_id)
agg = aggregate_run_scores(scores)

all_pass = all(s["mean"] >= 0.9 for s in agg.values())
if all_pass:
    print("✅ Prompt approved – safe to promote to production.")
else:
    print("❌ Prompt failed – review failures before promoting.")

This makes it easy to integrate PromptForge into CI/CD pipelines, APIs, or monitoring systems.


The Workflow That Changes Everything

1. You have a prompt that works
   → promptforge new (2 min to scaffold everything)

2. Define 10–20 real input/expected cases
   → golden dataset YAML (done once, reused forever)

3. Run: promptforge eval
   → get scores per case, mean score, failure rate

4. Change the prompt → run eval again
   → promptforge diff shows exactly what improved and what regressed

5. promptforge history --prompt <name>
   → see the full evolution of your prompt over time

6. promptforge report
   → Markdown report with ASCII charts to share with your team

Supported Evaluators

Evaluator     | Type         | What it checks
json_validity | heuristic    | Output is valid JSON
schema_match  | heuristic    | All required fields are present
field_match   | heuristic    | A specific field matches the expected value
keyword_match | heuristic    | Required keywords appear in output
length_ok     | heuristic    | Output is within character limit
exact_match   | heuristic    | Output matches expected text exactly
judge         | LLM-as-judge | Semantic quality scored by a rubric
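To make the heuristic rows concrete, here are hedged sketches of json_validity and field_match: illustrations of what they check, not the library's actual functions.

```python
import json

def json_validity(output: str) -> float:
    """1.0 if the output parses as JSON, else 0.0."""
    try:
        json.loads(output)
        return 1.0
    except json.JSONDecodeError:
        return 0.0

def field_match(output: str, field: str, expected: str) -> float:
    """1.0 if a specific field of the JSON output equals the expected value."""
    try:
        return 1.0 if json.loads(output).get(field) == expected else 0.0
    except json.JSONDecodeError:
        return 0.0

out = '{"category": "auth", "urgency": "high"}'
assert json_validity(out) == 1.0
assert field_match(out, "urgency", "critical") == 0.0  # the t004/t005 failure mode
```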

Supported Providers

Provider                         | Config
OpenAI (GPT-4o, GPT-4o-mini)     | provider: openai
Anthropic (Claude 3, Claude 3.5) | provider: anthropic
Groq (Llama, Mixtral), free tier | provider: openai + OPENAI_BASE_URL=https://api.groq.com/openai/v1
Any OpenAI-compatible API        | provider: openai + custom OPENAI_BASE_URL

Project Structure

src/promptforge/
  core/       # PromptSpec, Dataset, RunConfig, Templating
  llm/        # Provider adapters (OpenAI, Anthropic)
  eval/       # Heuristics, LLM-as-judge, Regression
  store/      # SQLite persistence
  reporting/  # Markdown reports, CLI tables
  utils/      # Hashing, redaction, JSONL helpers

prompts/      # Your PromptSpec YAML files
datasets/     # Your golden datasets
configs/      # Your RunConfig YAML files
rubrics/      # Your LLM-as-judge rubric YAML files
.promptforge/ # SQLite database (auto-created)

Design Philosophy

  • Prompts are artefacts, not strings. Version them. Hash them. Diff them.
  • Quality is measured, not felt. Every run produces scores. Every change produces a delta.
  • LLM-as-judge is a measuring instrument, not truth. Use it with rubrics, not blind trust.
  • Minimal dependencies. Maximum auditability.
  • Works with free-tier providers. No excuses not to test.

Changelog

v0.2.0

  • LLM-as-judge evaluators with rubric YAML support
  • promptforge new – interactive wizard to scaffold prompts, datasets and configs
  • promptforge history – visual score evolution across prompt versions
  • System prompt support (system_prompt field in PromptSpec)
  • Automatic markdown code block stripping in JSON outputs
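The code-block stripping mentioned above handles models that wrap their JSON in Markdown fences. A sketch of that behaviour, not the library's actual code:

```python
import json
import re

def strip_code_fences(text: str) -> str:
    """Remove a wrapping ```json ... ``` fence so the payload parses."""
    match = re.match(r"^```(?:json)?\s*\n(.*?)\n```\s*$", text.strip(), re.DOTALL)
    return match.group(1) if match else text

raw = '```json\n{"urgency": "critical"}\n```'
print(json.loads(strip_code_fences(raw)))  # → {'urgency': 'critical'}
```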

v0.1.0

  • Core eval pipeline with heuristic evaluators
  • promptforge eval, diff, report, runs, dashboard, validate
  • SQLite persistence for runs and scores
  • OpenAI and Anthropic provider adapters

CI/CD Integration

Add prompt regression testing to any GitHub Actions workflow:

# .github/workflows/prompt-eval.yml
name: Prompt Eval

on:
  push:
    paths:
      - "prompts/**"
      - "datasets/**"
      - "configs/**"

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run PromptForge eval
        id: pf
        uses: MPrazeres-1983/promptforge@v1
        with:
          prompt: prompts/support_triage.yaml
          dataset: datasets/support_golden.yaml
          config: configs/support_triage.yaml
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}
          openai-base-url: ${{ secrets.OPENAI_BASE_URL }}
          fail-on-regression: "true"

The action automatically installs promptforge-llmops, runs the eval, and fails the workflow if regressions are detected. Available on the GitHub Marketplace.

Inputs:

Input              | Required | Description
prompt             | ✅       | Path to PromptSpec YAML
dataset            | ✅       | Path to Dataset YAML or JSONL
config             | ✅       | Path to RunConfig YAML
openai-api-key     | ✅       | API key for OpenAI or compatible provider
openai-base-url    | ❌       | Base URL for Groq or other compatible providers
baseline-run-id    | ❌       | Run ID to diff against (enables regression detection)
fail-on-regression | ❌       | Fail workflow on regressions (default: true)

License

MIT © Mário Prazeres
