pytest-eval

LLM testing for humans.

Bring LLM evaluation into your existing pytest workflow.
No custom test runners. No new concepts. Just pytest.


Install

pip install pytest-eval

Quick Start

# No imports needed. The ai fixture IS the API.

def test_chatbot(ai):
    response = my_chatbot("What is the capital of France?")
    assert ai.similar(response, "Paris is the capital of France")

pytest -v
tests/test_chatbot.py::test_chatbot PASSED
    ✓ similar       ███████████████  0.94  ≥0.80

  ──────────────────────────────────────────────────────
  pytest-eval                                     v0.1.0
  ──────────────────────────────────────────────────────
    Test           Result Score                    Cost
  ──────────────────────────────────────────────────────
    test_chatbot     ✓   ██████████████░  0.94      $0
  ──────────────────────────────────────────────────────
    1 tests  │  1 passed  │  $0.0000 total
  ──────────────────────────────────────────────────────

That's it. No LLMTestCase objects, no custom runner, no cloud dashboard.

Why pytest-eval?

                DeepEval                                pytest-eval
Basic test      ~15 lines, 4 imports                    ~3 lines, 0 imports
Test runner     deepeval test run                       pytest
Metrics         50+ to learn                            ~10 methods on one fixture
Dependencies    30+ (OpenTelemetry, gRPC, Sentry...)    4 core
Telemetry       Cloud dashboard by default              None. Fully local.

Methods

Method                              What it does                          Cost
ai.similar(a, b, threshold=0.8)     Semantic similarity check             Free (local)
ai.similarity_score(a, b)           Returns similarity float 0–1          Free (local)
ai.judge(text, criteria)            LLM evaluates against criteria        $
ai.grounded(response, context)      RAG faithfulness check                $
ai.relevant(response, query)        Answer relevancy                      $
ai.hallucinated(response, context)  Detect unsupported claims             $
ai.toxic(text)                      Toxicity detection                    Free
ai.biased(text)                     Bias detection                        Free
ai.valid_json(text, schema=None)    JSON validation + Pydantic parsing    Free
ai.assert_snapshot(value, name)     Regression testing vs saved baseline  Free (local)
ai.metric(name, text, **kw)         Run a custom registered metric        Varies
ai.cost                             Cumulative $ for this test
ai.latency                          Cumulative seconds for this test

Free methods run on local models (sentence-transformers) and need no API key. $ methods call an LLM API (OpenAI by default) and require OPENAI_API_KEY.
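To make the threshold in ai.similar(a, b, threshold=0.8) concrete, here is a minimal sketch of a similarity check. It assumes the score is a cosine similarity between two embedding vectors, which this README does not confirm; the helper names and the toy 3-dimensional vectors are hypothetical and stand in for real sentence embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def is_similar(u, v, threshold=0.8):
    """True when the similarity score reaches the threshold."""
    return cosine_similarity(u, v) >= threshold


# Hypothetical toy "embeddings"; real ones have hundreds of dimensions.
paris_a = [0.9, 0.1, 0.2]
paris_b = [0.8, 0.2, 0.1]
lyon = [0.1, 0.9, 0.3]

print(is_similar(paris_a, paris_b))  # vectors point the same way
print(is_similar(paris_a, lyon))     # vectors point in different directions
```

Raising the threshold makes the check stricter: two answers then have to be closer in meaning before the assertion passes.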

Examples

Semantic Similarity (free, local)

def test_capital(ai):
    response = my_chatbot("What is the capital of France?")
    assert ai.similar(response, "Paris is the capital of France")

LLM-as-Judge

def test_tone(ai):
    response = my_chatbot("I want to cancel my subscription")
    assert ai.judge(response, "Response is polite and offers help")

Structured Output

from pydantic import BaseModel

class City(BaseModel):
    name: str
    country: str

def test_structured(ai):
    response = my_llm("Give me Paris info as JSON")
    city = ai.valid_json(response, City)
    assert city.country == "France"
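The parse-then-validate idea behind a structured-output check can be sketched without any dependencies. This is not how ai.valid_json is implemented; it is a stdlib-only illustration that uses a dataclass where the example above uses a Pydantic model, and parse_as is a hypothetical helper name.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class City:
    name: str
    country: str


def parse_as(text, schema):
    """Parse JSON text into a `schema` instance, or raise ValueError."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    values = {}
    for field in fields(schema):
        if field.name not in data:
            raise ValueError(f"missing field: {field.name}")
        if not isinstance(data[field.name], field.type):
            raise ValueError(f"wrong type for field: {field.name}")
        values[field.name] = data[field.name]
    return schema(**values)


city = parse_as('{"name": "Paris", "country": "France"}', City)
print(city.country)
```

Returning the parsed object, rather than a bare boolean, is what lets the test go on to assert on individual fields.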

RAG Pipeline

def test_rag(ai):
    query = "What is our refund policy?"
    docs = retriever.get_relevant_docs(query)
    response = generator.generate(query, docs)

    assert ai.grounded(response, docs)
    assert ai.relevant(response, query)
    assert not ai.hallucinated(response, docs)

Snapshot Regression

def test_regression(ai):
    response = my_chatbot("What are your business hours?")
    ai.assert_snapshot(response, name="business_hours", threshold=0.85)

# First run saves baseline. Next runs compare.
# Update baselines when intentional changes are made:
pytest --snapshot-update
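The save-then-compare flow described above can be sketched as follows. This is an illustration under stated assumptions, not pytest-eval's code: check_snapshot is a hypothetical helper, the one-file-per-name layout is invented, and difflib's character-level ratio is a stand-in for the semantic similarity the plugin uses.

```python
import difflib
import tempfile
from pathlib import Path


def check_snapshot(value, name, snapshot_dir, threshold=0.85, update=False):
    """Compare `value` with the saved baseline for `name`.

    The first call (or update=True) writes the baseline and returns True.
    """
    path = Path(snapshot_dir) / f"{name}.txt"
    if update or not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(value, encoding="utf-8")
        return True
    baseline = path.read_text(encoding="utf-8")
    # Stand-in score: character-level ratio, not semantic similarity.
    score = difflib.SequenceMatcher(None, baseline, value).ratio()
    return score >= threshold


with tempfile.TemporaryDirectory() as tmp:
    hours = "We are open 9 to 5, Monday to Friday."
    print(check_snapshot(hours, "business_hours", tmp))   # saves the baseline
    print(check_snapshot(hours, "business_hours", tmp))   # identical text
    print(check_snapshot("zzz", "business_hours", tmp))   # unrelated text
```

The update flag plays the role of --snapshot-update: it overwrites the baseline instead of comparing against it.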

Multi-Model Comparison

import pytest

@pytest.mark.parametrize("model", ["gpt-4o", "claude-sonnet-4-20250514", "llama-3.1-8b"])
def test_accuracy(ai, model):
    response = call_llm(model=model, prompt="What is 2+2?")
    assert ai.similar(response, "4")

Custom Metrics

from pytest_eval import Metric, MetricResult

@Metric.register("brand_voice")
def brand_voice(text: str, **kwargs) -> MetricResult:
    formal = sum(1 for w in ["please", "thank you"] if w in text.lower())
    score = min(formal / 2, 1.0)
    return MetricResult(score=score, passed=score >= kwargs.get("threshold", 0.5))

def test_brand(ai):
    response = my_chatbot("Where is my order?")
    assert ai.metric("brand_voice", response, threshold=0.7)
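For readers curious how a name-based lookup like ai.metric("brand_voice", ...) can work, here is a minimal sketch of a decorator registry. It is an assumption about the general pattern, not pytest-eval's implementation: Result, register, run_metric, and the word_count metric are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Result:
    score: float
    passed: bool


# Maps a metric name to the function that computes it.
_REGISTRY: Dict[str, Callable[..., Result]] = {}


def register(name):
    """Decorator that stores the function under `name` and returns it unchanged."""
    def decorator(fn):
        _REGISTRY[name] = fn
        return fn
    return decorator


def run_metric(name, text, **kwargs):
    """Look up a registered metric by name and call it."""
    if name not in _REGISTRY:
        raise KeyError(f"unknown metric: {name}")
    return _REGISTRY[name](text, **kwargs)


@register("word_count")
def word_count(text, **kwargs):
    limit = kwargs.get("max_words", 50)
    n = len(text.split())
    return Result(score=min(n / limit, 1.0), passed=n <= limit)


print(run_metric("word_count", "a b c", max_words=5))
```

Keyword arguments pass straight through to the metric function, which is why a per-call threshold can be read with kwargs.get in the example above.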

Configuration

pyproject.toml

[tool.pytest.ini_options]
ai_provider = "openai"
ai_model = "gpt-4o-mini"
ai_embedding_model = "local"
ai_threshold = 0.8
ai_budget = 5.00
ai_snapshot_dir = ".pytest_eval_snapshots"

Environment Variables

OPENAI_API_KEY=sk-...
PYTEST_EVAL_PROVIDER=openai
PYTEST_EVAL_MODEL=gpt-4o-mini
PYTEST_EVAL_BUDGET=5.00

CLI Options

pytest --ai-provider=openai    # Provider
pytest --ai-model=gpt-4o       # Model
pytest --ai-threshold=0.9      # Similarity threshold
pytest --ai-budget=2.00        # Cap spending per run
pytest --ai-report=report.json # JSON report output
pytest --ai-verbose            # Show scores for passing tests
pytest --snapshot-update       # Update snapshot baselines
pytest -m ai                   # Run only @pytest.mark.ai tests
pytest -m "not cost_high"      # Skip expensive tests

Precedence: CLI > env vars > pyproject.toml > defaults
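The precedence rule can be read as a layered lookup in which the first layer that sets a value wins. The sketch below illustrates that rule only; resolve_settings is a hypothetical helper, and the values in DEFAULTS are placeholders rather than the plugin's documented defaults.

```python
from collections import ChainMap

# Placeholder defaults for illustration only.
DEFAULTS = {"provider": "openai", "model": "gpt-4o-mini", "threshold": 0.8}


def resolve_settings(cli, env, file_config, defaults=DEFAULTS):
    """Merge the four layers; earlier layers win. None means "not set"."""
    layers = [
        {key: value for key, value in layer.items() if value is not None}
        for layer in (cli, env, file_config, defaults)
    ]
    # ChainMap searches the layers in order and returns the first match.
    return dict(ChainMap(*layers))


settings = resolve_settings(
    cli={"model": "gpt-4o", "threshold": None},       # --ai-model=gpt-4o
    env={"model": "env-model", "threshold": 0.9},     # environment variables
    file_config={"provider": "anthropic"},            # pyproject.toml
)
print(settings)
```

Here the model comes from the CLI, the threshold from the environment because the CLI left it unset, and the provider from pyproject.toml.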

Providers

pytest-eval supports multiple LLM providers:

pip install 'pytest-eval[openai]'     # OpenAI (default)
pip install 'pytest-eval[anthropic]'  # Anthropic
pip install 'pytest-eval[litellm]'    # 100+ providers via LiteLLM
pip install 'pytest-eval[safety]'     # Toxicity/bias detection (detoxify)
pip install 'pytest-eval[all]'        # Everything

Local embeddings (sentence-transformers) are always included, so similar(), similarity_score(), and assert_snapshot() need no API key.

Rich Failure Messages

Every assertion failure explains what happened:

AssertionError: Semantic similarity below threshold
  actual:     "The capital of France is Lyon"
  expected:   "The capital of France is Paris"
  similarity: 0.72
  threshold:  0.85
  reason:     Texts differ on the key fact (Lyon vs Paris)

TUI Output

pytest-eval renders score bars and a summary table directly in your terminal:

  • Per-test metric detail lines (with -v or --ai-verbose)
  • Session summary table with visual score bars
  • Cost tracking per test and per session
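Per-test and per-session cost tracking with a spending cap can be sketched as a small accumulator. This is an illustration, not the plugin's code: CostTracker and BudgetExceeded are hypothetical names, and refusing a charge before it is recorded is an assumption about how a cap such as --ai-budget might be enforced.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a charge would push the session total over the budget."""


class CostTracker:
    def __init__(self, budget=None):
        self.budget = budget   # session cap in dollars, or None for no cap
        self.per_test = {}     # test name -> cumulative cost

    @property
    def total(self):
        """Session total across all tests."""
        return sum(self.per_test.values())

    def record(self, test_name, cost):
        """Add `cost` to a test's running total, enforcing the budget."""
        new_total = self.total + cost
        if self.budget is not None and new_total > self.budget:
            raise BudgetExceeded(
                f"${new_total:.4f} would exceed the ${self.budget:.2f} budget"
            )
        self.per_test[test_name] = self.per_test.get(test_name, 0.0) + cost


tracker = CostTracker(budget=0.05)
tracker.record("test_tone", 0.02)
tracker.record("test_tone", 0.01)
tracker.record("test_rag", 0.01)
print(tracker.per_test, tracker.total)
```

Keeping per-test totals in a dictionary gives both views from one structure: each entry feeds a row in the summary table, and their sum is the session total.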

Contributing

See CONTRIBUTING.md.

License

MIT


