Skip to main content

A pytest plugin for testing LLM outputs using semantic similarity matching

Project description

pytest-semantic

A pytest plugin for testing LLM outputs using semantic similarity matching instead of exact string comparison.

Installation

pip install pytest-semantic

Or with uv:

uv pip install pytest-semantic

Quick Start

def test_llm_greeting(semantic_matcher):
    """Test that LLM generates appropriate greetings."""
    llm_response = my_llm("Say hello")

    matcher = semantic_matcher(
        valid=["Hello!", "Hi there!", "Greetings!"],
    )

    assert llm_response == matcher

Features

  • Semantic Matching: Compare text responses based on meaning, not exact strings
  • Flexible Configuration: Configure thresholds globally or per-test
  • Custom Embeddings: Use your own embedding models or functions
  • Offline-First: Works locally with included sentence-transformers model
  • Clear Error Messages: Detailed failure messages with similarity scores
  • Easy Integration: Simple pytest fixture-based API

Usage

Basic Usage

Use the semantic_matcher fixture to create matchers with valid examples:

def test_llm_response(semantic_matcher):
    response = generate_llm_response("What is the capital of France?")

    matcher = semantic_matcher(
        valid=["Paris", "The capital is Paris", "Paris is the capital"],
    )

    assert response == matcher

With Invalid Examples

Provide invalid examples to strengthen the matching:

def test_sentiment_classification(semantic_matcher):
    result = classify_sentiment("I love this product!")

    matcher = semantic_matcher(
        valid=["positive", "good", "happy"],
        invalid=["negative", "bad", "sad", "neutral"],
    )

    assert result == matcher

If you don't provide invalid examples, random word combinations are automatically generated as a baseline.

Custom Thresholds

Adjust matching sensitivity per test:

def test_with_custom_threshold(semantic_matcher):
    matcher = semantic_matcher(
        valid=["Python programming"],
        threshold=0.2,        # Difference between valid/invalid similarity
        min_similarity=0.6,   # Minimum absolute similarity to valid examples
    )

    assert "Python coding" == matcher

Reusable Matchers

Create reusable matchers with fixtures:

import pytest

@pytest.fixture
def greeting_matcher(semantic_matcher):
    return semantic_matcher(
        valid=["Hello!", "Hi there!", "Hey!"],
    )

@pytest.fixture
def farewell_matcher(semantic_matcher):
    return semantic_matcher(
        valid=["Goodbye!", "See you!", "Bye!"],
    )

def test_conversation(greeting_matcher, farewell_matcher):
    assert llm.greet() == greeting_matcher
    assert llm.say_goodbye() == farewell_matcher

Custom Embedding Functions

Use your own embedding function:

def test_with_custom_embeddings(semantic_matcher):
    def my_embed_function(text: str) -> list:
        # Your custom embedding logic
        return openai.embeddings.create(input=text, model="text-embedding-3-small")

    matcher = semantic_matcher(
        valid=["Hello"],
        custom_embed_fn=my_embed_function,
    )

    assert "Hi" == matcher

Configuration

Configure default values in pytest.ini:

[pytest]
semantic_threshold = 0.15
semantic_min_similarity = 0.5
semantic_model = all-MiniLM-L6-v2

Or in pyproject.toml:

[tool.pytest.ini_options]
semantic_threshold = 0.15
semantic_min_similarity = 0.5
semantic_model = "all-MiniLM-L6-v2"

Configuration Options

  • semantic_threshold (default: 0.15): Minimum difference between similarity to valid examples vs invalid examples
  • semantic_min_similarity (default: 0.5): Minimum absolute similarity score to valid examples (0-1 range)
  • semantic_model (default: "all-MiniLM-L6-v2"): Sentence-transformers model name

How It Works

  1. Embeddings: Text is converted to vector embeddings using sentence-transformers
  2. Similarity Calculation: Cosine similarity is computed between response and examples
  3. Dual Criteria:
    • Response must be at least min_similarity similar to valid examples
    • Response must be at least threshold more similar to valid vs invalid examples

This dual-criteria approach prevents false positives while ensuring meaningful matches.

Error Messages

When a test fails, you get detailed information:

AssertionError: Semantic similarity check failed
  Response: "Bonjour"
  Similarity to valid examples: 0.342
  Similarity to invalid examples: 0.156
  Difference: 0.186 (threshold: 0.150)
  Failure reason: Response similarity (0.342) is below minimum threshold (0.500)
  Closest valid example: 'Hello!' (similarity: 0.389)
  Valid examples:
    - 'Hello!'
    - 'Hi there!'
    - 'Greetings!'

API Reference

semantic_matcher(valid, invalid=None, threshold=None, min_similarity=None, model_name=None, custom_embed_fn=None)

Creates a semantic matcher for comparing text responses.

Parameters:

  • valid (List[str]): List of valid example responses (required)
  • invalid (List[str], optional): List of invalid examples (random words if not provided)
  • threshold (float, optional): Override default threshold (0-1 range)
  • min_similarity (float, optional): Override default minimum similarity (0-1 range)
  • model_name (str, optional): Override default sentence-transformers model
  • custom_embed_fn (Callable, optional): Custom embedding function (str) -> List[float]

Returns: SemanticMatcher instance that can be used with == operator

SemanticMatcher.check(response: str) -> bool

Explicitly check if a response matches. Raises SemanticAssertionError on failure.

Examples

Testing LLM Text Generation

def test_story_generation(semantic_matcher):
    """Test that LLM generates creative stories."""
    story = llm.generate_story(prompt="A robot learning to paint")

    matcher = semantic_matcher(
        valid=[
            "A robot discovers art and creativity",
            "An AI learns to express itself through painting",
            "A mechanical being explores artistic expression",
        ],
        threshold=0.1,  # Allow more variation for creative content
    )

    assert story == matcher

Testing Classification

def test_intent_classification(semantic_matcher):
    """Test intent classification accuracy."""
    intent = classify_intent("I want to cancel my subscription")

    matcher = semantic_matcher(
        valid=["cancel", "cancellation", "unsubscribe"],
        invalid=["help", "question", "purchase", "upgrade"],
    )

    assert intent == matcher

Testing Summarization

def test_summarization(semantic_matcher):
    """Test that summaries capture key points."""
    long_text = "..." # Long article
    summary = llm.summarize(long_text)

    matcher = semantic_matcher(
        valid=[
            "Article discusses climate change impacts",
            "The text is about environmental challenges",
        ],
        min_similarity=0.4,  # Lower threshold for summaries
    )

    assert summary == matcher

Development

Setup

git clone https://github.com/tombedor/pytest-semantic.git
cd pytest-semantic
uv sync

Running Tests

uv run pytest tests/

Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure all tests pass
  5. Submit a pull request

License

MIT License - see LICENSE file for details.

Credits

Built with:

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pytest_semantic-0.1.0.tar.gz (12.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pytest_semantic-0.1.0-py3-none-any.whl (12.8 kB view details)

Uploaded Python 3

File details

Details for the file pytest_semantic-0.1.0.tar.gz.

File metadata

  • Download URL: pytest_semantic-0.1.0.tar.gz
  • Upload date:
  • Size: 12.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.8

File hashes

Hashes for pytest_semantic-0.1.0.tar.gz
Algorithm Hash digest
SHA256 a69084e2168363fd4c297655f25f31a06e3abb027ba43db8ba64765536e06f61
MD5 65b45db626bb7ce9d26a1abfa72f172d
BLAKE2b-256 8504ce430bc0ce8aedb0f581540b6a61da7d62f6a7202674ee608d0c2f7db185

See more details on using hashes here.

File details

Details for the file pytest_semantic-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for pytest_semantic-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 d16758f39a99d4c99b44ee669d71096e824211fbc167e7913e1f7fa3461b74b2
MD5 c1ff3773d63f8a6c0af958366700b38d
BLAKE2b-256 ba964d74a5df5cfe7a83fb3409025f5cc63fd21ab7fd8ad57d6fc7b89ee59926

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page