A pytest plugin for testing LLM outputs using semantic similarity matching
pytest-semantic
A pytest plugin for testing LLM outputs using semantic similarity matching instead of exact string comparison.
Installation
pip install pytest-semantic
Or with uv:
uv pip install pytest-semantic
Quick Start
def test_llm_greeting(semantic_matcher):
    """Test that LLM generates appropriate greetings."""
    llm_response = my_llm("Say hello")
    matcher = semantic_matcher(
        valid=["Hello!", "Hi there!", "Greetings!"],
    )
    assert llm_response == matcher
Features
- Semantic Matching: Compare text responses based on meaning, not exact strings
- Flexible Configuration: Configure thresholds globally or per-test
- Custom Embeddings: Use your own embedding models or functions
- Offline-First: Works locally with included sentence-transformers model
- Clear Error Messages: Detailed failure messages with similarity scores
- Easy Integration: Simple pytest fixture-based API
Usage
Basic Usage
Use the semantic_matcher fixture to create matchers with valid examples:
def test_llm_response(semantic_matcher):
    response = generate_llm_response("What is the capital of France?")
    matcher = semantic_matcher(
        valid=["Paris", "The capital is Paris", "Paris is the capital"],
    )
    assert response == matcher
With Invalid Examples
Provide invalid examples to strengthen the matching:
def test_sentiment_classification(semantic_matcher):
    result = classify_sentiment("I love this product!")
    matcher = semantic_matcher(
        valid=["positive", "good", "happy"],
        invalid=["negative", "bad", "sad", "neutral"],
    )
    assert result == matcher
If you don't provide invalid examples, random word combinations are automatically generated as a baseline.
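How the plugin builds this automatic baseline is not documented here. The following is only a minimal sketch of the idea, assuming a small made-up word list and a seeded generator; `VOCAB` and `random_baseline` are illustrative names, not part of pytest-semantic.

```python
import random

# Hypothetical vocabulary; the plugin's real word source is not documented.
VOCAB = ["river", "copper", "window", "planet", "ladder", "orange", "violin", "marble"]


def random_baseline(n_examples=5, words_per_example=3, seed=0):
    """Return n_examples strings, each made of randomly chosen unrelated words."""
    rng = random.Random(seed)  # seeded so repeated runs produce the same baseline
    return [
        " ".join(rng.sample(VOCAB, words_per_example))
        for _ in range(n_examples)
    ]


baseline = random_baseline()
```

Unrelated text like this gives a floor similarity that a response has to beat, which is why explicit invalid examples that sit close to the valid ones usually make a stricter test.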
Custom Thresholds
Adjust matching sensitivity per test:
def test_with_custom_threshold(semantic_matcher):
    matcher = semantic_matcher(
        valid=["Python programming"],
        threshold=0.2,  # Difference between valid/invalid similarity
        min_similarity=0.6,  # Minimum absolute similarity to valid examples
    )
    assert "Python coding" == matcher
Reusable Matchers
Create reusable matchers with fixtures:
import pytest


@pytest.fixture
def greeting_matcher(semantic_matcher):
    return semantic_matcher(
        valid=["Hello!", "Hi there!", "Hey!"],
    )


@pytest.fixture
def farewell_matcher(semantic_matcher):
    return semantic_matcher(
        valid=["Goodbye!", "See you!", "Bye!"],
    )


def test_conversation(greeting_matcher, farewell_matcher):
    assert llm.greet() == greeting_matcher
    assert llm.say_goodbye() == farewell_matcher
Custom Embedding Functions
Use your own embedding function:
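import openai


def test_with_custom_embeddings(semantic_matcher):
    def my_embed_function(text: str) -> list:
        # Your custom embedding logic; this example calls the OpenAI embeddings API
        response = openai.embeddings.create(input=text, model="text-embedding-3-small")
        # Return the embedding vector itself, not the whole API response object
        return response.data[0].embedding

    matcher = semantic_matcher(
        valid=["Hello"],
        custom_embed_fn=my_embed_function,
    )
    assert "Hi" == matcher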
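A custom embedding function does not have to call a remote service. As a rough sketch, the function below hashes tokens into a fixed-size vector using only the standard library, which can be handy for checking test wiring without network access. `hashed_bow_embed` is a hypothetical helper, and it only captures token overlap, not meaning, so it is no substitute for a real embedding model.

```python
import hashlib
import math
from typing import List


def hashed_bow_embed(text: str, dims: int = 64) -> List[float]:
    """Hash each lowercase token into one of `dims` buckets, then L2-normalize."""
    vec = [0.0] * dims
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dims
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    # An empty string has no tokens, so return the zero vector unchanged.
    return [v / norm for v in vec] if norm else vec
```

Assuming the plugin accepts any `(str) -> List[float]` callable, as the API reference states, this could be passed as `custom_embed_fn=hashed_bow_embed`.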
Configuration
Configure default values in pytest.ini:
[pytest]
semantic_threshold = 0.15
semantic_min_similarity = 0.5
semantic_model = all-MiniLM-L6-v2
Or in pyproject.toml:
[tool.pytest.ini_options]
semantic_threshold = 0.15
semantic_min_similarity = 0.5
semantic_model = "all-MiniLM-L6-v2"
Configuration Options
- semantic_threshold (default: 0.15): Minimum difference between similarity to valid examples vs invalid examples
- semantic_min_similarity (default: 0.5): Minimum absolute similarity score to valid examples (0-1 range)
- semantic_model (default: "all-MiniLM-L6-v2"): Sentence-transformers model name
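For readers curious how a pytest plugin can expose options like these, here is a hedged sketch using pytest's `parser.addini` and `config.getini` hooks. It is not taken from the plugin's source; `DEFAULTS` and `read_semantic_settings` are illustrative names. Ini values registered this way arrive as strings, so the numeric ones are converted explicitly.

```python
# Illustrative defaults, mirroring the documented values above.
DEFAULTS = {
    "semantic_threshold": "0.15",
    "semantic_min_similarity": "0.5",
    "semantic_model": "all-MiniLM-L6-v2",
}


def pytest_addoption(parser):
    """Register the ini options so pytest accepts them in pytest.ini or pyproject.toml."""
    for name, default in DEFAULTS.items():
        parser.addini(name, help=f"pytest-semantic setting: {name}", default=default)


def read_semantic_settings(config):
    """Read the ini values (strings) and convert the numeric ones to float."""
    return {
        "threshold": float(config.getini("semantic_threshold")),
        "min_similarity": float(config.getini("semantic_min_similarity")),
        "model": config.getini("semantic_model"),
    }
```

In a real plugin, `pytest_addoption` would live in the plugin module and `read_semantic_settings` would be called with `request.config` inside a fixture.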
How It Works
- Embeddings: Text is converted to vector embeddings using sentence-transformers
- Similarity Calculation: Cosine similarity is computed between response and examples
- Dual Criteria:
  - Response must be at least min_similarity similar to valid examples
  - Response must be at least threshold more similar to valid vs invalid examples
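This dual-criteria approach reduces false positives: the absolute floor rejects responses that are unrelated to every example, and the margin rejects responses that are about as close to the invalid examples as to the valid ones.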
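The two criteria can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the plugin's implementation: in particular, it assumes the similarity to a set of examples is the mean over that set, which is not confirmed by this README. The names `cosine`, `decide`, and `matches` are hypothetical.

```python
import math
from statistics import mean
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def decide(valid_sim: float, invalid_sim: float,
           threshold: float = 0.15, min_similarity: float = 0.5) -> bool:
    """Apply both criteria: an absolute floor and a margin over the invalid examples."""
    return valid_sim >= min_similarity and (valid_sim - invalid_sim) >= threshold


def matches(response: Vector, valid: Sequence[Vector],
            invalid: Sequence[Vector], **kwargs) -> bool:
    # Assumption: per-set similarity is the mean over that set's examples.
    valid_sim = mean(cosine(response, v) for v in valid)
    invalid_sim = mean(cosine(response, v) for v in invalid)
    return decide(valid_sim, invalid_sim, **kwargs)


# Toy 2-D "embeddings" so the arithmetic is easy to follow.
valid = [[1.0, 0.0], [0.8, 0.6]]
invalid = [[0.0, 1.0]]
close_to_valid = matches([1.0, 0.0], valid, invalid)    # valid about 0.9, invalid 0.0
close_to_invalid = matches([0.6, 0.8], valid, invalid)  # valid about 0.78, invalid 0.8
```

The second response clears the 0.5 floor but is slightly closer to the invalid example than to the valid ones, so the margin criterion rejects it.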
Error Messages
When a test fails, you get detailed information:
AssertionError: Semantic similarity check failed
Response: "Bonjour"
Similarity to valid examples: 0.342
Similarity to invalid examples: 0.156
Difference: 0.186 (threshold: 0.150)
Failure reason: Response similarity (0.342) is below minimum threshold (0.500)
Closest valid example: 'Hello!' (similarity: 0.389)
Valid examples:
- 'Hello!'
- 'Hi there!'
- 'Greetings!'
API Reference
semantic_matcher(valid, invalid=None, threshold=None, min_similarity=None, model_name=None, custom_embed_fn=None)
Creates a semantic matcher for comparing text responses.
Parameters:
- valid (List[str]): List of valid example responses (required)
- invalid (List[str], optional): List of invalid examples (random words if not provided)
- threshold (float, optional): Override default threshold (0-1 range)
- min_similarity (float, optional): Override default minimum similarity (0-1 range)
- model_name (str, optional): Override default sentence-transformers model
- custom_embed_fn (Callable, optional): Custom embedding function (str) -> List[float]
Returns: SemanticMatcher instance that can be used with == operator
SemanticMatcher.check(response: str) -> bool
Explicitly check if a response matches. Raises SemanticAssertionError on failure.
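To illustrate how an object can support both `response == matcher` and an explicit `check()` call, here is a toy stand-in. `ToyMatcher` and `ToyAssertionError` are hypothetical and use exact lowercase matching instead of embeddings; the real SemanticMatcher may be implemented differently. The expression works with the string on the left because `str.__eq__` returns `NotImplemented` for an unknown type, so Python falls back to the matcher's `__eq__`.

```python
class ToyAssertionError(AssertionError):
    """Stand-in for the plugin's SemanticAssertionError."""


class ToyMatcher:
    def __init__(self, valid):
        self.valid = {v.lower() for v in valid}

    def check(self, response: str) -> bool:
        # Real matching would compare embeddings; exact match keeps the sketch short.
        if response.lower() not in self.valid:
            raise ToyAssertionError(
                f"{response!r} does not match any of {sorted(self.valid)}"
            )
        return True

    def __eq__(self, other):
        if not isinstance(other, str):
            return NotImplemented
        # Delegating to check() means a failed comparison raises with a detailed message.
        return self.check(other)
```

Raising from the comparison, rather than returning False, is what lets a plain `assert response == matcher` surface a detailed failure message.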
Examples
Testing LLM Text Generation
def test_story_generation(semantic_matcher):
    """Test that LLM generates creative stories."""
    story = llm.generate_story(prompt="A robot learning to paint")
    matcher = semantic_matcher(
        valid=[
            "A robot discovers art and creativity",
            "An AI learns to express itself through painting",
            "A mechanical being explores artistic expression",
        ],
        threshold=0.1,  # Allow more variation for creative content
    )
    assert story == matcher
Testing Classification
def test_intent_classification(semantic_matcher):
    """Test intent classification accuracy."""
    intent = classify_intent("I want to cancel my subscription")
    matcher = semantic_matcher(
        valid=["cancel", "cancellation", "unsubscribe"],
        invalid=["help", "question", "purchase", "upgrade"],
    )
    assert intent == matcher
Testing Summarization
def test_summarization(semantic_matcher):
    """Test that summaries capture key points."""
    long_text = "..."  # Long article
    summary = llm.summarize(long_text)
    matcher = semantic_matcher(
        valid=[
            "Article discusses climate change impacts",
            "The text is about environmental challenges",
        ],
        min_similarity=0.4,  # Lower minimum similarity for summaries
    )
    assert summary == matcher
Development
Setup
git clone https://github.com/tombedor/pytest-semantic.git
cd pytest-semantic
uv sync
Running Tests
uv run pytest tests/
Contributing
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass
- Submit a pull request
License
MIT License - see LICENSE file for details.
Credits
Built with:
- sentence-transformers for embeddings
- pytest testing framework