CI-first prompt testing framework for LLMs
PromptCheck
PromptCheck is a CI-first test harness for LLM prompts. Write tests for GPT/Claude outputs in YAML, gate pull requests on the results, and see pass/fail summaries posted as comments, so prompt regressions are caught before they reach users.
Install & Run
pip install promptcheck
promptcheck init # creates a config and test scaffold
promptcheck run # runs all prompt tests
Need a full example? See the `example/` directory or the Quick-Start Guide.
Get Started
Ready to dive in?
- See a basic test example in the `example/` directory.
- Follow our Quick-Start Guide for step-by-step setup instructions.
Why Prompt Testing Matters
LLMs can break without warning — even small prompt changes or model updates can cause major regressions. PromptCheck automates prompt evaluation like unit tests automate code quality.
What does PromptCheck do? Key Concepts in Automated LLM Evaluation
When you tweak a prompt, swap models, or refactor your agent code, PromptCheck runs a battery of tests in CI (ROUGE, regex, token cost, latency, etc.) and fails the pull request if quality regresses or cost spikes.
Think pytest + coverage, but for LLM output.
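To make the "fail the pull request if quality regresses or cost spikes" idea concrete, here is a minimal sketch of a regression gate that compares a run against a stored baseline. It is an illustration only: `RunSummary`, `find_regressions`, and the tolerance values are hypothetical names and numbers, not PromptCheck's API or its `run.json` schema, and the README does not say whether PromptCheck uses a baseline comparison or fixed thresholds.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    """Aggregate numbers for one test run (hypothetical shape)."""
    rouge_l: float   # mean ROUGE-L F-score, between 0.0 and 1.0
    cost_usd: float  # total API cost of the run in US dollars


def find_regressions(baseline: RunSummary, current: RunSummary,
                     max_quality_drop: float = 0.05,
                     max_cost_increase: float = 0.20) -> list[str]:
    """Return one human-readable reason per failed check (empty list = pass)."""
    failures = []
    # Quality gate: absolute drop in score beyond the allowed tolerance.
    if baseline.rouge_l - current.rouge_l > max_quality_drop:
        failures.append(
            f"ROUGE-L dropped from {baseline.rouge_l:.2f} to {current.rouge_l:.2f}")
    # Cost gate: relative increase beyond the allowed tolerance.
    if current.cost_usd > baseline.cost_usd * (1 + max_cost_increase):
        failures.append(
            f"cost rose from ${baseline.cost_usd:.4f} to ${current.cost_usd:.4f}")
    return failures


baseline = RunSummary(rouge_l=0.82, cost_usd=0.0100)
current = RunSummary(rouge_l=0.70, cost_usd=0.0150)
problems = find_regressions(baseline, current)
for reason in problems:
    print("FAIL:", reason)
```

In a CI job, a non-empty `problems` list would translate into a non-zero exit code, which is what blocks the pull request.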
Key Features for Effective LLM Testing
- Easy setup — drop a YAML test file, add the GitHub Action, done.
- Multi‑provider — OpenAI, Groq, OpenRouter (more soon).
- Metrics out‑of‑the‑box — exact/regex match, ROUGE‑L, BLEU (optional), token‑count, latency, cost.
- Readable reports: Action log output and (coming soon) PR comment bot. A `run.json` artifact is produced.
- Fast to extend: write your own metric in <30 lines (standard Python).
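As a rough illustration of how small a metric can be, the sketch below computes a ROUGE-L F1 score from the longest common subsequence of whitespace tokens. It is a standalone function under simplifying assumptions (lowercasing, no stemming, F1 rather than a weighted F-score); it is not PromptCheck's own ROUGE implementation, and it does not show PromptCheck's metric plug-in interface, which this README does not describe.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f(reference: str, candidate: str) -> float:
    """ROUGE-L F1 on lowercase whitespace tokens (no stemming)."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


print(rouge_l_f("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
```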
| Feature | Free | Pro |
|---|---|---|
| CLI & GitHub Action | ✅ | ✅ |
| 30-day result history | ✅ | ✅ |
| Unlimited history & charts | — | ✅ |
| Slack alerts | — | ✅ |
How the YAML works (tests/*.yaml)
A test file contains a list of test cases. Here's an example structure:
- id: "openrouter_greet_test_001"
  name: "OpenRouter Basic Greeting Test"
  description: "Tests a basic greeting prompt."
  type: "llm_generation"
  input_data:
    prompt: "Briefly introduce yourself and greet the user."
  expected_output:
    # For regex_match, a pattern to find in the LLM's output
    regex_pattern: ".+" # Example: matches any non-empty string
  metric_configs:
    - metric: "regex_match"
    - metric: "token_count"
      parameters:
        count_types: ["completion", "total"]
    - metric: "latency"
      threshold: # Optional: fail if conditions aren't met
        value: 10000 # e.g., latency_ms <= 10000
    - metric: "cost"
  model_config:
    provider: "openrouter"
    model_name: "mistralai/mistral-7b-instruct"
    parameters:
      temperature: 0.7
      max_tokens: 50
      timeout_s: 25.0
      retry_attempts: 2
  tags: ["openrouter", "greeting"]
Add more cases in `tests/`. Thresholds (such as `value` for latency or `f_score` for ROUGE) are defined in the `threshold` object of each `metric_configs` entry.
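The following sketch shows one way the threshold semantics described here could be evaluated for a single response. The dict mirrors a subset of the YAML fields from the example test case; the `evaluate` function and its handling of a missing threshold are assumptions made for illustration, not PromptCheck's actual runner.

```python
import re

# Subset of the example test case, written as a plain dict.
case = {
    "expected_output": {"regex_pattern": ".+"},
    "metric_configs": [
        {"metric": "regex_match"},
        {"metric": "latency", "threshold": {"value": 10000}},
    ],
}


def evaluate(case: dict, output: str, latency_ms: float) -> dict[str, bool]:
    """Return pass/fail per metric for one model response (hypothetical helper)."""
    results = {}
    for cfg in case["metric_configs"]:
        name = cfg["metric"]
        if name == "regex_match":
            pattern = case["expected_output"]["regex_pattern"]
            results[name] = re.search(pattern, output) is not None
        elif name == "latency":
            # Assumption: without a threshold the metric is recorded but never fails.
            limit = cfg.get("threshold", {}).get("value")
            results[name] = limit is None or latency_ms <= limit
    return results


print(evaluate(case, "Hi, I'm an assistant. Hello!", latency_ms=850.0))
# {'regex_match': True, 'latency': True}
```

An empty response fails `regex_match` because `.+` needs at least one character, and a 20-second response fails `latency` against the 10000 ms limit.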
Installation Options & Development Setup
# From PyPI (once 0.1.0+ is live)
# pip install promptcheck
# With optional BLEU metric (requires NLTK)
# pip install promptcheck[bleu]
# For development:
poetry install # Installs base dependencies
poetry install --extras bleu # Installs with BLEU support
Releasing (maintainers)
# 1. Ensure tests pass and docs are updated
# 2. Bump version in pyproject.toml
poetry version <new_version> # e.g., 0.1.0, 0.2.0b1
# 3. Build the package
poetry build
# 4. Publish to TestPyPI first (configured in pyproject.toml or via poetry config)
# poetry config pypi-token.testpypi <YOUR_TESTPYPI_TOKEN>
poetry publish -r testpypi
# 5. Test TestPyPI package thoroughly (e.g., in a clean venv or CI)
# 6. Publish to PyPI (prod)
# poetry config pypi-token.pypi <YOUR_PYPI_TOKEN>
poetry publish
# 7. Tag the release in Git
git tag v<new_version> # e.g., v0.1.0
git push origin v<new_version>
Documentation
📖 Docs: Quick-Start Guide · YAML Reference (Coming Soon!)
Roadmap
- PR comment bot (✅/❌ matrix in‑line)
- Hosted dashboard (Supabase)
- Async runner for large test suites
- More metrics and LLM provider integrations
Contributing
- Fork & clone the repository.
- Set up your development environment: `poetry install --extras bleu` (to include all deps).
- Run tests locally: `poetry run promptcheck run tests/` (or a specific file). Keep it green!
- Make your changes and add tests for new features.
- Open a Pull Request.
Feedback & Questions
Found an issue or have a question? We'd love to hear from you! Please open an issue or start a discussion.
License
License: Business Source License 1.1
PromptCheck is free to use for evaluation and non-production use. For commercial licenses, contact us.