# promptlab ⚡
Automated testing for LLM prompts. Write test cases in YAML, run them against Claude or OpenAI, get pass/fail results in your terminal.
Like pytest, but for prompts.
```shell
pip install promptlab-cli
promptlab run tests/
```

```
✅ summarize_article :: returns_short_summary PASS (1.2s)
✅ summarize_article :: mentions_key_points PASS (1.1s)
❌ translate_text :: preserves_tone FAIL (0.9s)
   Expected: contains "formal"
   Got: "Here is the translated text in a casual style..."
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Results: 2 passed, 1 failed, 3 total (3.2s)
```
## Why?
You're building an app with Claude or GPT. Your prompt works today. Tomorrow you tweak it and something breaks. You don't notice until a user complains.
promptlab catches prompt regressions before they ship. Define what good output looks like, run tests on every change, and know immediately if something broke.
## Quickstart

### Install

```shell
pip install promptlab-cli
```
### Set your API key

```shell
export ANTHROPIC_API_KEY=sk-ant-...
# or
export OPENAI_API_KEY=sk-...
```
### Write a test file

Create `tests/summarize.yaml`:
```yaml
prompt: |
  Summarize this article in 2-3 sentences:

  {{ article }}

model: claude-sonnet-4-20250514

tests:
  - name: short_summary
    vars:
      article: |
        The Federal Reserve held interest rates steady on Wednesday,
        keeping the benchmark rate in the 5.25%-5.50% range. Chair
        Jerome Powell said the committee needs more confidence that
        inflation is moving toward the 2% target before cutting rates.
    assert:
      - type: max_tokens
        value: 100
      - type: contains
        value: "Federal Reserve"
      - type: contains
        value: "interest rate"

  - name: handles_empty_input
    vars:
      article: ""
    assert:
      - type: not_contains
        value: "error"
      - type: min_length
        value: 10
```
### Run it

```shell
promptlab run tests/
```
## Test File Format

Each `.yaml` file defines a prompt and its test cases:
```yaml
# The prompt template. Use {{ variable }} for inputs.
prompt: |
  You are a helpful assistant. {{ instruction }}

# Which model to use
model: claude-sonnet-4-20250514  # or gpt-4o, claude-haiku-4-5-20251001, etc.

# Optional system prompt
system: "You are a concise technical writer."

# Optional model parameters
temperature: 0
max_tokens: 500

# Test cases
tests:
  - name: test_name
    vars:
      instruction: "Explain what a CPU does in one sentence."
    assert:
      - type: contains
        value: "processor"
      - type: max_tokens
        value: 50
```
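Rendering a template like this is plain variable substitution. A minimal sketch of how `{{ variable }}` filling could work (illustrative only, not promptlab's actual template engine):

```python
import re

def render(template: str, vars: dict) -> str:
    """Replace each {{ name }} placeholder with its value from vars."""
    def sub(match: re.Match) -> str:
        name = match.group(1)
        if name not in vars:
            raise KeyError(f"missing template variable: {name}")
        return str(vars[name])
    # Allow optional whitespace inside the braces: {{name}} or {{ name }}.
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", sub, template)

prompt = render(
    "You are a helpful assistant. {{ instruction }}",
    {"instruction": "Explain what a CPU does in one sentence."},
)
```

Raising on a missing variable (rather than leaving the placeholder in place) means a typo in `vars` fails the test loudly instead of silently sending a malformed prompt to the model.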
## Assertion Types
| Type | Description | Example |
|---|---|---|
| `contains` | Output must contain this string (case-insensitive) | `value: "machine learning"` |
| `not_contains` | Output must NOT contain this string | `value: "I'm sorry"` |
| `starts_with` | Output must start with this string | `value: "Sure"` |
| `regex` | Output must match this regex pattern | `value: "\\d{4}"` |
| `max_tokens` | Output must be at most N tokens | `value: 100` |
| `min_length` | Output must be at least N characters | `value: 50` |
| `max_length` | Output must be at most N characters | `value: 500` |
| `equals` | Output must exactly equal this string | `value: "42"` |
| `llm_judge` | Ask another LLM to evaluate the output | `value: "Is this response helpful and accurate?"` |
## LLM-as-Judge

`llm_judge` is the most powerful assertion type: it uses a second LLM call to evaluate output quality.
```yaml
tests:
  - name: helpful_response
    vars:
      question: "How do I fix a memory leak in Python?"
    assert:
      - type: llm_judge
        value: "Does this response provide specific, actionable debugging steps? Answer YES or NO."
```
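Because the judge prompt asks for a YES/NO verdict, the runner only needs to read the leading token of the judge's reply. A hypothetical verdict parser (promptlab's actual logic may differ):

```python
def parse_verdict(judge_reply: str) -> bool:
    """Interpret a YES/NO judge reply, tolerating casing and punctuation."""
    stripped = judge_reply.strip()
    if not stripped:
        raise ValueError("judge returned an empty reply")
    first = stripped.split(None, 1)[0].strip(".,!:").upper()
    if first == "YES":
        return True
    if first == "NO":
        return False
    raise ValueError(f"judge gave no clear YES/NO verdict: {judge_reply!r}")
```

Raising on an ambiguous reply (instead of defaulting to fail) surfaces judge-prompt problems, which are otherwise easy to mistake for real regressions.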
## Supported Models

**Anthropic (Claude):**

- `claude-sonnet-4-20250514`
- `claude-haiku-4-5-20251001`
- Set the `ANTHROPIC_API_KEY` environment variable

**OpenAI:**

- `gpt-4o`
- `gpt-4o-mini`
- Set the `OPENAI_API_KEY` environment variable
## CLI Commands

```shell
# Run all test files in a directory
promptlab run tests/

# Run a single test file
promptlab run tests/summarize.yaml

# Verbose output (show full LLM responses)
promptlab run tests/ --verbose

# Output results as JSON
promptlab run tests/ --json

# Dry run (show what would be tested without calling APIs)
promptlab run tests/ --dry-run
```
## Use Cases
- Prompt regression testing — Run tests in CI/CD to catch regressions
- Prompt comparison — Test the same cases across different models
- Guard rails validation — Verify your prompt rejects harmful inputs
- Output format checking — Ensure structured output matches expectations
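For the CI/CD use case, a workflow that runs the suite on every pull request might be sketched like this (the workflow file, job name, and secret name are illustrative assumptions, not part of promptlab):

```yaml
# .github/workflows/prompt-tests.yml (hypothetical example)
name: prompt-tests
on: [pull_request]
jobs:
  promptlab:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install promptlab-cli
      - run: promptlab run tests/
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```

Because each test run calls paid APIs, keeping the suite on `pull_request` only (rather than every push) is one way to bound cost.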
## Development

```shell
git clone https://github.com/vigp17/promptlab.git
cd promptlab
pip install -e ".[dev]"
pytest
```
## Roadmap
- YAML test definitions
- Claude and OpenAI support
- 9 assertion types including LLM-as-judge
- CLI with colored output
- Cost tracking per test run
- HTML report generation
- Parallel test execution
- GitHub Actions integration
- Prompt versioning and diff
- Custom scoring functions
## License
MIT