pytest-eval
LLM testing for humans.
Bring LLM evaluation into your existing pytest workflow.
No custom test runners. No new concepts. Just pytest.
Install
pip install pytest-eval
Quick Start
# No imports needed. The ai fixture IS the API.
def test_chatbot(ai):
    response = my_chatbot("What is the capital of France?")
    assert ai.similar(response, "Paris is the capital of France")
pytest -v
tests/test_chatbot.py::test_chatbot PASSED
✓ similar ███████████████ 0.94 ≥0.80
──────────────────────────────────────────────────────
pytest-eval v0.1.0
──────────────────────────────────────────────────────
Test Result Score Cost
──────────────────────────────────────────────────────
test_chatbot ✓ ██████████████░ 0.94 $0
──────────────────────────────────────────────────────
1 tests │ 1 passed │ $0.0000 total
──────────────────────────────────────────────────────
That's it. No LLMTestCase objects, no custom runner, no cloud dashboard.
Why pytest-eval?
| | DeepEval | pytest-eval |
|---|---|---|
| Basic test | ~15 lines, 4 imports | ~3 lines, 0 imports |
| Test runner | `deepeval test run` | `pytest` |
| Metrics | 50+ to learn | ~10 methods on one fixture |
| Dependencies | 30+ (OpenTelemetry, gRPC, Sentry...) | 4 core |
| Telemetry | Cloud dashboard by default | None. Fully local. |
Methods
| Method | What it does | Cost |
|---|---|---|
| `ai.similar(a, b, threshold=0.8)` | Semantic similarity check | Free (local) |
| `ai.similarity_score(a, b)` | Returns similarity float 0–1 | Free (local) |
| `ai.judge(text, criteria)` | LLM evaluates against criteria | $ |
| `ai.grounded(response, context)` | RAG faithfulness check | $ |
| `ai.relevant(response, query)` | Answer relevancy | $ |
| `ai.hallucinated(response, context)` | Detect unsupported claims | $ |
| `ai.toxic(text)` | Toxicity detection | Free |
| `ai.biased(text)` | Bias detection | Free |
| `ai.valid_json(text, schema=None)` | JSON validation + Pydantic parsing | Free |
| `ai.assert_snapshot(value, name)` | Regression testing vs saved baseline | Free (local) |
| `ai.metric(name, text, **kw)` | Run a custom registered metric | Varies |
| `ai.cost` | Cumulative $ for this test | n/a |
| `ai.latency` | Cumulative seconds for this test | n/a |
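Free methods run locally and need no API key. Similarity and snapshot checks use sentence-transformers embeddings; toxicity and bias detection use detoxify, installed with the `safety` extra.
$ methods call an LLM API (OpenAI by default, which requires `OPENAI_API_KEY`).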
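The README does not say how the similarity score is computed. A common approach with sentence-transformers embeddings is cosine similarity compared against a threshold, and the sketch below assumes that. The function names and the toy 3-dimensional vectors are hypothetical; real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def is_similar(u, v, threshold=0.8):
    """Pass when the similarity score reaches the threshold."""
    return cosine_similarity(u, v) >= threshold


# Toy stand-ins for embeddings (assumed values, for illustration only).
paris_answer = [0.9, 0.1, 0.2]
paris_reference = [0.8, 0.2, 0.1]
unrelated = [0.0, 1.0, 0.0]

print(is_similar(paris_answer, paris_reference))  # nearly parallel vectors
print(is_similar(paris_answer, unrelated))        # nearly orthogonal vectors
```

Under this assumption, raising `threshold` makes the check stricter: two answers must point in almost the same direction in embedding space to pass.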
Examples
Semantic Similarity (free, local)
def test_capital(ai):
    response = my_chatbot("What is the capital of France?")
    assert ai.similar(response, "Paris is the capital of France")
LLM-as-Judge
def test_tone(ai):
    response = my_chatbot("I want to cancel my subscription")
    assert ai.judge(response, "Response is polite and offers help")
Structured Output
from pydantic import BaseModel

class City(BaseModel):
    name: str
    country: str

def test_structured(ai):
    response = my_llm("Give me Paris info as JSON")
    city = ai.valid_json(response, City)
    assert city.country == "France"
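To make the two steps in "JSON validation + Pydantic parsing" concrete, here is a minimal standard-library sketch of the same idea: parse the text as JSON, then check it against an expected shape. It is not the library's implementation. A dataclass stands in for the Pydantic model, and `parse_city` is a hypothetical helper.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class City:
    name: str
    country: str


def parse_city(text: str) -> City:
    """Parse JSON text and build a City, raising ValueError on bad input."""
    data = json.loads(text)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    expected = {f.name for f in fields(City)}
    missing = expected - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    # Ignore extra keys and keep only the declared fields.
    return City(**{key: data[key] for key in expected})


city = parse_city('{"name": "Paris", "country": "France"}')
print(city.country)
```

Returning the parsed object, rather than a bare boolean, is what lets a test go on to assert on individual fields such as `city.country`.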
RAG Pipeline
def test_rag(ai):
    query = "What is our refund policy?"
    docs = retriever.get_relevant_docs(query)
    response = generator.generate(query, docs)
    assert ai.grounded(response, docs)
    assert ai.relevant(response, query)
    assert not ai.hallucinated(response, docs)
Snapshot Regression
def test_regression(ai):
    response = my_chatbot("What are your business hours?")
    ai.assert_snapshot(response, name="business_hours", threshold=0.85)
# First run saves baseline. Next runs compare.
# Update baselines when intentional changes are made:
pytest --snapshot-update
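The save-then-compare flow can be sketched with the standard library alone. This is an illustration under stated assumptions, not the plugin's code: `check_snapshot` and the JSON file layout are hypothetical, and `difflib.SequenceMatcher` (a character-level ratio) is swapped in for the semantic similarity the plugin uses.

```python
import difflib
import json
import tempfile
from pathlib import Path


def check_snapshot(value, name, snapshot_dir, threshold=0.85, update=False):
    """Save a baseline on first run (or on update); otherwise compare to it."""
    path = Path(snapshot_dir) / f"{name}.json"
    if update or not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps({"value": value}), encoding="utf-8")
        return True  # nothing to compare against yet
    baseline = json.loads(path.read_text(encoding="utf-8"))["value"]
    # Stand-in for semantic similarity: character-level match ratio in [0, 1].
    score = difflib.SequenceMatcher(None, baseline, value).ratio()
    return score >= threshold


with tempfile.TemporaryDirectory() as tmp:
    hours = "We are open 9am to 5pm, Monday to Friday."
    first = check_snapshot(hours, "business_hours", tmp)   # saves the baseline
    same = check_snapshot(hours, "business_hours", tmp)    # identical text
    drifted = check_snapshot(
        "Please contact sales for pricing.", "business_hours", tmp
    )                                                      # unrelated text
    print(first, same, drifted)
```

The `update` flag plays the role that `--snapshot-update` plays on the command line: it overwrites the baseline instead of comparing against it.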
Multi-Model Comparison
import pytest

@pytest.mark.parametrize("model", ["gpt-4o", "claude-sonnet-4-20250514", "llama-3.1-8b"])
def test_accuracy(ai, model):
    response = call_llm(model=model, prompt="What is 2+2?")
    assert ai.similar(response, "4")
Custom Metrics
from pytest_eval import Metric, MetricResult

@Metric.register("brand_voice")
def brand_voice(text: str, **kwargs) -> MetricResult:
    # Count how many of the polite phrases appear in the text.
    formal = sum(1 for w in ["please", "thank you"] if w in text.lower())
    score = min(formal / 2, 1.0)
    return MetricResult(score=score, passed=score >= kwargs.get("threshold", 0.5))

def test_brand(ai):
    response = my_chatbot("I want to cancel my subscription")
    assert ai.metric("brand_voice", response, threshold=0.7)
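It can help to work through the scores this metric produces before picking a threshold. The sketch below repeats only the scoring arithmetic from `brand_voice`, without the library; `brand_voice_score` is a hypothetical helper name.

```python
def brand_voice_score(text: str) -> float:
    """Same arithmetic as the metric above: fraction of polite phrases found."""
    formal = sum(1 for w in ["please", "thank you"] if w in text.lower())
    return min(formal / 2, 1.0)


both = brand_voice_score("Thank you for waiting. Please hold.")  # → 1.0
one = brand_voice_score("Please hold.")                          # → 0.5
none = brand_voice_score("Hold on.")                             # → 0.0
print(both, one, none)
```

Because the score can only be 0.0, 0.5, or 1.0, a threshold of 0.7 passes only responses that contain both phrases, while the default of 0.5 passes responses that contain either one.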
Configuration
pyproject.toml
[tool.pytest.ini_options]
ai_provider = "openai"
ai_model = "gpt-4o-mini"
ai_embedding_model = "local"
ai_threshold = 0.8
ai_budget = 5.00
ai_snapshot_dir = ".pytest_eval_snapshots"
Environment Variables
OPENAI_API_KEY=sk-...
PYTEST_EVAL_PROVIDER=openai
PYTEST_EVAL_MODEL=gpt-4o-mini
PYTEST_EVAL_BUDGET=5.00
CLI Options
pytest --ai-provider=openai # Provider
pytest --ai-model=gpt-4o # Model
pytest --ai-threshold=0.9 # Similarity threshold
pytest --ai-budget=2.00 # Cap spending per run
pytest --ai-report=report.json # JSON report output
pytest --ai-verbose # Show scores for passing tests
pytest --snapshot-update # Update snapshot baselines
pytest -m ai # Run only @pytest.mark.ai tests
pytest -m "not cost_high" # Skip expensive tests
Precedence: CLI > env vars > pyproject.toml > defaults
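The precedence rule can be pictured as a layered lookup in which the first layer that sets a key wins. The sketch below uses `collections.ChainMap` to show the idea. It is not how the plugin is implemented; `resolve_settings`, the key names, and the default values are all hypothetical.

```python
from collections import ChainMap


def resolve_settings(cli, env, pyproject, defaults):
    """Merge four layers; earlier layers override later ones."""
    # Drop unset (None) values so an omitted CLI flag does not mask lower layers.
    layers = [
        {key: value for key, value in layer.items() if value is not None}
        for layer in (cli, env, pyproject, defaults)
    ]
    # ChainMap searches its mappings left to right and returns the first hit.
    return dict(ChainMap(*layers))


settings = resolve_settings(
    cli={"model": "gpt-4o", "budget": None},         # --ai-model given, no --ai-budget
    env={"model": "gpt-4o-mini", "budget": "2.00"},  # PYTEST_EVAL_* variables
    pyproject={"budget": "5.00", "threshold": 0.8},  # [tool.pytest.ini_options]
    defaults={"provider": "openai", "threshold": 0.5},
)
print(settings["model"], settings["budget"], settings["threshold"])
```

Here the model comes from the CLI, the budget from the environment (the CLI left it unset), the threshold from `pyproject.toml`, and the provider from the defaults.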
Providers
pytest-eval supports multiple LLM providers:
pip install 'pytest-eval[openai]' # OpenAI (default)
pip install 'pytest-eval[anthropic]' # Anthropic
pip install 'pytest-eval[litellm]' # 100+ providers via LiteLLM
pip install 'pytest-eval[safety]' # Toxicity/bias detection (detoxify)
pip install 'pytest-eval[all]' # Everything
Local embeddings (sentence-transformers) are always included — no API key needed for similar(), similarity_score(), and assert_snapshot().
Rich Failure Messages
Every assertion failure explains what happened:
AssertionError: Semantic similarity below threshold
actual: "The capital of France is Lyon"
expected: "The capital of France is Paris"
similarity: 0.72
threshold: 0.85
reason: Texts differ on the key fact (Lyon vs Paris)
TUI Output
pytest-eval renders score bars and a summary table directly in your terminal:
- Per-test metric detail lines (with `-v` or `--ai-verbose`)
- Session summary table with visual score bars
- Cost tracking per test and per session
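A score bar like the one in the summary table can be drawn by filling a fixed-width row of block characters in proportion to the score. This is a guess at the approach, not the plugin's rendering code. `render_bar` and the width of 15 are assumptions based on the sample output in Quick Start.

```python
def render_bar(score: float, width: int = 15) -> str:
    """Draw a fixed-width bar: filled blocks for the score, light blocks for the rest."""
    clamped = max(0.0, min(1.0, score))  # keep out-of-range scores inside the bar
    filled = round(clamped * width)
    return "█" * filled + "░" * (width - filled)


bar = render_bar(0.94)
print(f"{bar} 0.94")
```

With a width of 15, a score of 0.94 fills 14 cells and leaves one light cell, which matches the summary row shown in Quick Start.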
Contributing
See CONTRIBUTING.md.
License
MIT
File details
Details for the file pytest_eval-0.1.0.tar.gz.
File metadata
- Download URL: pytest_eval-0.1.0.tar.gz
- Upload date:
- Size: 28.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7948397d3e69566536f51e174d004136a52ccfce9f2e5182b984059384c7b742 |
| MD5 | c3b08a2206081c5b85df4669f8c80f3b |
| BLAKE2b-256 | 6a7e24e51a9be2ea497549bd6e4ad1c198581457d0d8d4d02f367c2bd13d9fe0 |
Provenance
The following attestation bundles were made for pytest_eval-0.1.0.tar.gz:
Publisher: ci.yml on doganarif/pytest-eval
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pytest_eval-0.1.0.tar.gz
- Subject digest: 7948397d3e69566536f51e174d004136a52ccfce9f2e5182b984059384c7b742
- Sigstore transparency entry: 938749165
- Sigstore integration time:
- Permalink: doganarif/pytest-eval@cd20ea60af43ac4b7078ed679f421f287268cc8e
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/doganarif
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci.yml@cd20ea60af43ac4b7078ed679f421f287268cc8e
- Trigger Event: push
File details
Details for the file pytest_eval-0.1.0-py3-none-any.whl.
File metadata
- Download URL: pytest_eval-0.1.0-py3-none-any.whl
- Upload date:
- Size: 29.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6cda928c26243dd3e8d15f5208f070a768bd4fce9d62974fe5ffd6548f30c433 |
| MD5 | cb480f5618bcbf7c7d310ae7ebe30b99 |
| BLAKE2b-256 | bcce9e93f54d95349075e9fc25e86f7e8afe587e61d2ac59d38bbb7da516f01c |
Provenance
The following attestation bundles were made for pytest_eval-0.1.0-py3-none-any.whl:
Publisher: ci.yml on doganarif/pytest-eval
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pytest_eval-0.1.0-py3-none-any.whl
- Subject digest: 6cda928c26243dd3e8d15f5208f070a768bd4fce9d62974fe5ffd6548f30c433
- Sigstore transparency entry: 938749184
- Sigstore integration time:
- Permalink: doganarif/pytest-eval@cd20ea60af43ac4b7078ed679f421f287268cc8e
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/doganarif
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci.yml@cd20ea60af43ac4b7078ed679f421f287268cc8e
- Trigger Event: push