LLM testing for humans.

Maintainers

arifcodes

pytest-eval

Bring LLM evaluation into your existing pytest workflow.
No custom test runners. No new concepts. Just pytest.

Install

pip install pytest-eval

Quick Start

# No imports needed. The ai fixture IS the API.

def test_chatbot(ai):
    response = my_chatbot("What is the capital of France?")
    assert ai.similar(response, "Paris is the capital of France")

pytest -v

tests/test_chatbot.py::test_chatbot PASSED
    ✓ similar       ███████████████  0.94  ≥0.80

  ──────────────────────────────────────────────────────
  pytest-eval                                     v0.1.0
  ──────────────────────────────────────────────────────
    Test           Result Score                    Cost
  ──────────────────────────────────────────────────────
    test_chatbot     ✓   ██████████████░  0.94      $0
  ──────────────────────────────────────────────────────
    1 tests  │  1 passed  │  $0.0000 total
  ──────────────────────────────────────────────────────

That's it. No LLMTestCase objects, no custom runner, no cloud dashboard.

Why pytest-eval?

	DeepEval	pytest-eval
Basic test	~15 lines, 4 imports	~3 lines, 0 imports
Test runner	`deepeval test run`	`pytest`
Metrics	50+ to learn	~10 methods on one fixture
Dependencies	30+ (OpenTelemetry, gRPC, Sentry...)	4 core
Telemetry	Cloud dashboard by default	None. Fully local.

Methods

Method	What it does	Cost
`ai.similar(a, b, threshold=0.8)`	Semantic similarity check	Free (local)
`ai.similarity_score(a, b)`	Returns similarity float 0–1	Free (local)
`ai.judge(text, criteria)`	LLM evaluates against criteria	$
`ai.grounded(response, context)`	RAG faithfulness check	$
`ai.relevant(response, query)`	Answer relevancy	$
`ai.hallucinated(response, context)`	Detect unsupported claims	$
`ai.toxic(text)`	Toxicity detection	Free
`ai.biased(text)`	Bias detection	Free
`ai.valid_json(text, schema=None)`	JSON validation + Pydantic parsing	Free
`ai.assert_snapshot(value, name)`	Regression testing vs saved baseline	Free (local)
`ai.metric(name, text, **kw)`	Run a custom registered metric	Varies
`ai.cost`	Cumulative $ for this test	—
`ai.latency`	Cumulative seconds for this test	—

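Free methods run locally and need no API key: similarity and snapshot checks use sentence-transformers, and toxicity and bias detection use detoxify (installed via the `safety` extra). Methods marked $ call an LLM API (OpenAI by default) and require OPENAI_API_KEY.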

Examples

Semantic Similarity (free, local)

def test_capital(ai):
    response = my_chatbot("What is the capital of France?")
    assert ai.similar(response, "Paris is the capital of France")
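
For intuition about what a threshold such as 0.80 means, the sketch below computes cosine similarity between two toy vectors and compares the score to a threshold. It is an illustration only: the vectors are made-up numbers, `cosine_similarity` and `is_similar` are hypothetical helpers, and the README does not describe how `ai.similar` computes its score beyond saying it uses local sentence-transformers models.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)

def is_similar(a, b, threshold=0.8):
    """Hypothetical helper: pass when the score meets the threshold."""
    return cosine_similarity(a, b) >= threshold

# Toy "embeddings" (made-up numbers, not real model output).
response_vec = [0.9, 0.1, 0.3]
expected_vec = [0.8, 0.2, 0.4]
unrelated_vec = [-0.1, 0.9, -0.5]

close_match = is_similar(response_vec, expected_vec)
far_match = is_similar(response_vec, unrelated_vec)
```

Raising the threshold makes the check stricter; a real embedding model would supply the vectors in place of the hand-written lists.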

LLM-as-Judge

def test_tone(ai):
    response = my_chatbot("I want to cancel my subscription")
    assert ai.judge(response, "Response is polite and offers help")

Structured Output

from pydantic import BaseModel

class City(BaseModel):
    name: str
    country: str

def test_structured(ai):
    response = my_llm("Give me Paris info as JSON")
    city = ai.valid_json(response, City)
    assert city.country == "France"
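
A check of this kind has two steps: parse the text as JSON, then validate the result against a schema. The sketch below shows those steps with the standard library only, using a dataclass as a stand-in for the Pydantic model. `parse_city` is a hypothetical helper, not part of pytest-eval, and the real `ai.valid_json` may behave differently.

```python
import json
from dataclasses import dataclass

@dataclass
class City:
    name: str
    country: str

def parse_city(text):
    """Parse text as JSON and build a City, raising ValueError on bad input."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = {"name", "country"} - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    for field in ("name", "country"):
        if not isinstance(data[field], str):
            raise ValueError(f"field {field!r} must be a string")
    return City(name=data["name"], country=data["country"])

city = parse_city('{"name": "Paris", "country": "France"}')
```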

RAG Pipeline

def test_rag(ai):
    query = "What is our refund policy?"
    docs = retriever.get_relevant_docs(query)
    response = generator.generate(query, docs)

    assert ai.grounded(response, docs)
    assert ai.relevant(response, query)
    assert not ai.hallucinated(response, docs)

Snapshot Regression

def test_regression(ai):
    response = my_chatbot("What are your business hours?")
    ai.assert_snapshot(response, name="business_hours", threshold=0.85)

# First run saves baseline. Next runs compare.
# Update baselines when intentional changes are made:
pytest --snapshot-update
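
The save-then-compare flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not pytest-eval's implementation: `check_snapshot` is a hypothetical function, the one-JSON-file-per-snapshot layout is invented, and `difflib.SequenceMatcher` (a character-level ratio) stands in for the semantic similarity the plugin uses.

```python
import json
import tempfile
from difflib import SequenceMatcher
from pathlib import Path

def check_snapshot(value, name, snapshot_dir, threshold=0.85, update=False):
    """Return (passed, score). Saves a baseline on first run or when update=True."""
    path = Path(snapshot_dir) / f"{name}.json"
    if update or not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps({"value": value}), encoding="utf-8")
        return True, 1.0
    baseline = json.loads(path.read_text(encoding="utf-8"))["value"]
    # Stand-in score: character-level ratio in [0, 1], not semantic similarity.
    score = SequenceMatcher(None, baseline, value).ratio()
    return score >= threshold, score

with tempfile.TemporaryDirectory() as tmp:
    # First call saves the baseline; later calls compare against it.
    first = check_snapshot("We are open 9am to 5pm.", "business_hours", tmp)
    same = check_snapshot("We are open 9am to 5pm.", "business_hours", tmp)
    drifted = check_snapshot("Please contact sales.", "business_hours", tmp)
```

Passing `update=True` plays the role of `--snapshot-update` in this sketch: it overwrites the stored baseline instead of comparing against it.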

Multi-Model Comparison

import pytest

@pytest.mark.parametrize("model", ["gpt-4o", "claude-sonnet-4-20250514", "llama-3.1-8b"])
def test_accuracy(ai, model):
    response = call_llm(model=model, prompt="What is 2+2?")
    assert ai.similar(response, "4")

Custom Metrics

from pytest_eval import Metric, MetricResult

@Metric.register("brand_voice")
def brand_voice(text: str, **kwargs) -> MetricResult:
    formal = sum(1 for w in ["please", "thank you"] if w in text.lower())
    score = min(formal / 2, 1.0)
    return MetricResult(score=score, passed=score >= kwargs.get("threshold", 0.5))

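def test_brand(ai):
    # my_chatbot is your own application code, as in the earlier examples.
    response = my_chatbot("Can you help me with my order?")
    assert ai.metric("brand_voice", response, threshold=0.7)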
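
A decorator-based registration pattern like the one above can be sketched with a plain dictionary. Everything in this block (`REGISTRY`, `register`, `run_metric`, `Result`) is hypothetical and only illustrates the general pattern; it is not pytest-eval's internals, and whether `ai.metric` returns a bool or a result object is not stated in the README.

```python
from dataclasses import dataclass

@dataclass
class Result:
    score: float
    passed: bool

    def __bool__(self):
        # Lets `assert run_metric(...)` work directly on the result.
        return self.passed

REGISTRY = {}

def register(name):
    """Decorator factory: store the function under `name` and return it unchanged."""
    def decorator(func):
        REGISTRY[name] = func
        return func
    return decorator

def run_metric(name, text, **kwargs):
    """Look up a registered metric by name and call it."""
    if name not in REGISTRY:
        raise KeyError(f"no metric registered as {name!r}")
    return REGISTRY[name](text, **kwargs)

@register("brand_voice")
def brand_voice(text, **kwargs):
    formal = sum(1 for w in ["please", "thank you"] if w in text.lower())
    score = min(formal / 2, 1.0)
    return Result(score=score, passed=score >= kwargs.get("threshold", 0.5))

polite = run_metric("brand_voice", "Please hold on. Thank you!", threshold=0.7)
curt = run_metric("brand_voice", "No.", threshold=0.7)
```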

Configuration

pyproject.toml

[tool.pytest.ini_options]
ai_provider = "openai"
ai_model = "gpt-4o-mini"
ai_embedding_model = "local"
ai_threshold = 0.8
ai_budget = 5.00
ai_snapshot_dir = ".pytest_eval_snapshots"

Environment Variables

OPENAI_API_KEY=sk-...
PYTEST_EVAL_PROVIDER=openai
PYTEST_EVAL_MODEL=gpt-4o-mini
PYTEST_EVAL_BUDGET=5.00

CLI Options

pytest --ai-provider=openai    # Provider
pytest --ai-model=gpt-4o       # Model
pytest --ai-threshold=0.9      # Similarity threshold
pytest --ai-budget=2.00        # Cap spending per run
pytest --ai-report=report.json # JSON report output
pytest --ai-verbose            # Show scores for passing tests
pytest --snapshot-update       # Update snapshot baselines
pytest -m ai                   # Run only @pytest.mark.ai tests
pytest -m "not cost_high"      # Skip expensive tests

Precedence: CLI > env vars > pyproject.toml > defaults
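
The precedence rule can be pictured as a lookup that checks each source in order and returns the first value it finds. The sketch below is an assumption-laden illustration: `resolve_option` and the dictionaries are hypothetical, the key naming simply mirrors the `--ai-*`, `PYTEST_EVAL_*`, and `ai_*` names shown above, and the plugin's actual resolution code may differ.

```python
# Illustrative defaults; only "openai" and 0.8 are stated in the README.
DEFAULTS = {"provider": "openai", "threshold": "0.8"}

def resolve_option(name, cli_args, env, file_config, defaults=DEFAULTS):
    """Return the first value found, checking sources from highest to lowest priority."""
    env_key = f"PYTEST_EVAL_{name.upper()}"
    file_key = f"ai_{name}"
    for source, key in ((cli_args, name), (env, env_key), (file_config, file_key)):
        value = source.get(key)
        if value is not None:
            return value
    return defaults.get(name)

cli_args = {"model": "gpt-4o"}  # e.g. from --ai-model=gpt-4o
env = {"PYTEST_EVAL_MODEL": "gpt-4o-mini", "PYTEST_EVAL_BUDGET": "2.00"}
file_config = {"ai_budget": "5.00", "ai_provider": "anthropic"}

model = resolve_option("model", cli_args, env, file_config)        # CLI wins
budget = resolve_option("budget", cli_args, env, file_config)      # env beats file
provider = resolve_option("provider", cli_args, env, file_config)  # file beats default
threshold = resolve_option("threshold", cli_args, env, file_config)  # falls to default
```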

Providers

pytest-eval supports multiple LLM providers:

pip install 'pytest-eval[openai]'     # OpenAI (default)
pip install 'pytest-eval[anthropic]'  # Anthropic
pip install 'pytest-eval[litellm]'    # 100+ providers via LiteLLM
pip install 'pytest-eval[safety]'     # Toxicity/bias detection (detoxify)
pip install 'pytest-eval[all]'        # Everything

Local embeddings (sentence-transformers) are always included — no API key needed for similar(), similarity_score(), and assert_snapshot().

Rich Failure Messages

Every assertion failure explains what happened:

AssertionError: Semantic similarity below threshold
  actual:     "The capital of France is Lyon"
  expected:   "The capital of France is Paris"
  similarity: 0.72
  threshold:  0.85
  reason:     Texts differ on the key fact (Lyon vs Paris)

TUI Output

pytest-eval renders score bars and a summary table directly in your terminal:

- Per-test metric detail lines (with -v or --ai-verbose)
- Session summary table with visual score bars
- Cost tracking per test and per session

Contributing

See CONTRIBUTING.md.

License

MIT

Project details

Release history

This version: 0.1.0 (Feb 11, 2026)

Download files

Source Distribution

pytest_eval-0.1.0.tar.gz (28.7 kB)

Uploaded Feb 11, 2026 Source

Built Distribution

pytest_eval-0.1.0-py3-none-any.whl (29.9 kB)

Uploaded Feb 11, 2026 Python 3

File details

Details for the file pytest_eval-0.1.0.tar.gz.

File metadata

Download URL: pytest_eval-0.1.0.tar.gz
Upload date: Feb 11, 2026
Size: 28.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pytest_eval-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`7948397d3e69566536f51e174d004136a52ccfce9f2e5182b984059384c7b742`
MD5	`c3b08a2206081c5b85df4669f8c80f3b`
BLAKE2b-256	`6a7e24e51a9be2ea497549bd6e4ad1c198581457d0d8d4d02f367c2bd13d9fe0`

Provenance

The following attestation bundles were made for pytest_eval-0.1.0.tar.gz:

Publisher: ci.yml on doganarif/pytest-eval

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pytest_eval-0.1.0.tar.gz
- Subject digest: 7948397d3e69566536f51e174d004136a52ccfce9f2e5182b984059384c7b742
- Sigstore transparency entry: 938749165
- Sigstore integration time: Feb 11, 2026
Source repository:
- Permalink: doganarif/pytest-eval@cd20ea60af43ac4b7078ed679f421f287268cc8e
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/doganarif
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci.yml@cd20ea60af43ac4b7078ed679f421f287268cc8e
- Trigger Event: push

File details

Details for the file pytest_eval-0.1.0-py3-none-any.whl.

File metadata

Download URL: pytest_eval-0.1.0-py3-none-any.whl
Upload date: Feb 11, 2026
Size: 29.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pytest_eval-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`6cda928c26243dd3e8d15f5208f070a768bd4fce9d62974fe5ffd6548f30c433`
MD5	`cb480f5618bcbf7c7d310ae7ebe30b99`
BLAKE2b-256	`bcce9e93f54d95349075e9fc25e86f7e8afe587e61d2ac59d38bbb7da516f01c`

Provenance

The following attestation bundles were made for pytest_eval-0.1.0-py3-none-any.whl:

Publisher: ci.yml on doganarif/pytest-eval

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pytest_eval-0.1.0-py3-none-any.whl
- Subject digest: 6cda928c26243dd3e8d15f5208f070a768bd4fce9d62974fe5ffd6548f30c433
- Sigstore transparency entry: 938749184
- Sigstore integration time: Feb 11, 2026
Source repository:
- Permalink: doganarif/pytest-eval@cd20ea60af43ac4b7078ed679f421f287268cc8e
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/doganarif
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci.yml@cd20ea60af43ac4b7078ed679f421f287268cc8e
- Trigger Event: push
