The most comprehensive LLM testing framework for Python
PyLLMTest 🚀
PyLLMTest is a testing framework designed specifically for LLM applications. It covers semantic assertions, snapshot testing, RAG testing, cost tracking, and prompt optimization, so you can build, test, and optimize AI-powered applications with confidence.
🌟 Why PyLLMTest?
Testing LLM applications is fundamentally different from traditional software testing. PyLLMTest solves the unique challenges of LLM testing:
- ✅ Semantic Assertions - Test meaning, not exact strings
- ✅ Snapshot Testing - Detect regressions with semantic awareness
- ✅ Multi-Provider Support - OpenAI, Anthropic, and more
- ✅ RAG Testing - Comprehensive retrieval and generation testing
- ✅ Cost Tracking - Monitor token usage and costs
- ✅ Prompt Optimization - A/B test and optimize prompts
- ✅ Performance Benchmarking - Track latency and quality
- ✅ Async Support - Full async/await compatibility
- ✅ Beautiful Reporting - Rich test reports and metrics
📦 Installation
# Basic installation
pip install pyllmtest
# With OpenAI support
pip install pyllmtest[openai]
# With Anthropic support
pip install pyllmtest[anthropic]
# With all providers and features
pip install pyllmtest[all]
🚀 Quick Start
Basic Test
from pyllmtest import LLMTest, expect, OpenAIProvider

provider = OpenAIProvider(model="gpt-4-turbo")

@LLMTest(provider=provider)
def test_summarization(ctx):
    response = ctx.complete("Summarize: AI is transforming industries...")
    # Semantic assertions
    expect(response.content).to_be_shorter_than(100, unit="words")
    expect(response.content).to_contain("AI")
    expect(response.content).to_preserve_facts(["transform", "industries"])

# Run the test
result = test_summarization()
print(f"Test {'PASSED' if result.passed else 'FAILED'}")
Snapshot Testing
from pyllmtest import SnapshotManager

snapshot_mgr = SnapshotManager()

@LLMTest(provider=provider)
def test_with_snapshot(ctx):
    response = ctx.complete("What are the primary colors?")
    # Automatically detects semantic changes
    snapshot_mgr.assert_matches_snapshot(
        name="primary_colors",
        actual_content=response.content
    )
Async Testing
import asyncio

@LLMTest(provider=provider)
async def test_parallel_completions(ctx):
    # Start all three completions, then wait for them together
    tasks = [
        ctx.acomplete("Explain Python"),
        ctx.acomplete("Explain JavaScript"),
        ctx.acomplete("Explain Rust")
    ]
    responses = await asyncio.gather(*tasks)
    for resp in responses:
        expect(resp.content).to_be_longer_than(50, unit="words")
📚 Core Features
1. Semantic Assertions
Unlike traditional assertions, PyLLMTest understands meaning:
# Traditional (brittle)
assert "artificial intelligence" in response # Fails if the model writes "AI" instead
# PyLLMTest (semantic)
expect(response).to_match_semantic("artificial intelligence", threshold=0.9)
expect(response).to_preserve_facts(["machine learning", "neural networks"])
expect(response).not_to_hallucinate(source_text=original_document)
Available Assertions:
- to_contain() / not_to_contain() - Check for substrings
- to_match_regex() - Regex matching
- to_be_shorter_than() / to_be_longer_than() - Length checks
- to_be_concise() / to_be_detailed() - Quality checks
- to_preserve_facts() - Fact preservation
- not_to_hallucinate() - Hallucination detection
- to_be_valid_json() / to_match_schema() - Format validation
- to_match_semantic() - Semantic similarity
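To make the idea of a similarity threshold concrete, here is a minimal sketch of how a check like to_match_semantic() could be built on cosine similarity between embedding vectors. This is an assumption about the general technique, not a description of PyLLMTest internals: the names cosine_similarity and matches_semantically are hypothetical, and the three-dimensional vectors are made up where a real check would use vectors from an embedding model.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def matches_semantically(actual_vec: Sequence[float],
                         expected_vec: Sequence[float],
                         threshold: float = 0.9) -> bool:
    """Hypothetical helper: pass when similarity reaches the threshold."""
    return cosine_similarity(actual_vec, expected_vec) >= threshold


# Toy "embeddings", invented for illustration only.
expected = [1.0, 0.0, 1.0]
close = [0.9, 0.1, 1.1]
unrelated = [0.0, 1.0, 0.0]

print(matches_semantically(close, expected))      # near-parallel vectors
print(matches_semantically(unrelated, expected))  # orthogonal vectors
```

Raising the threshold makes the check stricter; a value near 1.0 behaves almost like exact matching, while lower values tolerate more paraphrasing.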
2. Snapshot Testing
Save "golden" outputs and detect regressions:
snapshot_mgr = SnapshotManager(
    snapshot_dir=".snapshots",
    update_mode=False, # Set to True to update snapshots
    semantic_threshold=0.9 # Accept outputs that are at least 90% semantically similar
)
# First run: saves snapshot
# Subsequent runs: compares with snapshot
snapshot_mgr.assert_matches_snapshot("test_name", actual_content)
Features:
- Semantic comparison - Not just exact matching
- Version tracking - Track snapshot history
- Diff generation - See what changed
- Update mode - Review and approve changes
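The save-then-compare workflow can be sketched in a few lines. This is a simplified illustration under stated assumptions, not PyLLMTest's implementation: check_snapshot is a hypothetical function, the JSON file layout is invented, and difflib's lexical ratio stands in for a real semantic similarity score.

```python
import difflib
import json
import tempfile
from pathlib import Path


def check_snapshot(snapshot_dir: Path, name: str, actual: str,
                   threshold: float = 0.9, update: bool = False) -> bool:
    """Return True if `actual` is close enough to the stored snapshot.

    The first call for a name (or any call with update=True) writes the
    snapshot and passes.
    """
    path = snapshot_dir / f"{name}.json"
    if update or not path.exists():
        snapshot_dir.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps({"content": actual}), encoding="utf-8")
        return True
    stored = json.loads(path.read_text(encoding="utf-8"))["content"]
    # Lexical ratio in [0, 1]; a stand-in for a semantic score.
    similarity = difflib.SequenceMatcher(None, stored, actual).ratio()
    return similarity >= threshold


with tempfile.TemporaryDirectory() as tmp:
    snapshots = Path(tmp) / ".snapshots"
    first = check_snapshot(snapshots, "primary_colors", "Red, yellow, and blue.")
    same = check_snapshot(snapshots, "primary_colors", "Red, yellow, and blue.")
    changed = check_snapshot(snapshots, "primary_colors", "I cannot answer that.")
    print(first, same, changed)
```

The first call creates the snapshot, the second matches it exactly, and the third falls below the threshold and is reported as a regression.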
3. Multi-Provider Support
Seamlessly switch between providers:
from pyllmtest import OpenAIProvider, AnthropicProvider
# OpenAI
openai_provider = OpenAIProvider(
    model="gpt-4-turbo",
    api_key="your-key" # or use OPENAI_API_KEY env var
)

# Anthropic
anthropic_provider = AnthropicProvider(
    model="claude-3-5-sonnet-20241022",
    api_key="your-key" # or use ANTHROPIC_API_KEY env var
)

# Use in tests
@LLMTest(provider=openai_provider)
def test_openai(ctx):
    ...

@LLMTest(provider=anthropic_provider)
def test_anthropic(ctx):
    ...
4. Metrics Tracking
Track everything:
from pyllmtest import MetricsTracker

metrics = MetricsTracker()

# Automatic tracking in tests
@LLMTest(provider=provider)
def test_with_metrics(ctx):
    response = ctx.complete("query") # Automatically tracked

# Print comprehensive report
metrics.print_summary()

# Export to JSON/CSV
metrics.export_json("metrics.json")
metrics.export_csv("requests.csv")
Tracked Metrics:
- Total requests and tokens
- Prompt vs completion tokens
- Cost breakdown by model/provider
- Latency percentiles (p50, p95, p99)
- Per-model and per-provider stats
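As a rough illustration of how figures like these can be derived from raw request records, the sketch below computes nearest-rank latency percentiles and a per-request cost. The helper names and the per-1,000-token prices are hypothetical placeholders, not PyLLMTest functions or real provider pricing, and the library may use a different percentile method.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least
    pct percent of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def request_cost_usd(prompt_tokens: int, completion_tokens: int,
                     prompt_price_per_1k: float,
                     completion_price_per_1k: float) -> float:
    """Cost of one request given per-1,000-token prices."""
    return (prompt_tokens / 1000 * prompt_price_per_1k
            + completion_tokens / 1000 * completion_price_per_1k)


latencies_ms = [890.0, 950.0, 1100.0, 1200.0, 1250.0,
                1300.0, 1400.0, 1600.0, 1800.0, 2100.0]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)

# Placeholder prices, chosen only to make the arithmetic easy to follow.
cost = request_cost_usd(2100, 3320,
                        prompt_price_per_1k=0.01,
                        completion_price_per_1k=0.03)
print(p50, p95, round(cost, 4))
```

Prompt and completion tokens are priced separately because providers typically charge different rates for each, which is why the tracker reports them as separate totals.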
5. RAG Testing
Test retrieval-augmented generation:
from pyllmtest import RAGTester, RetrievedDocument
def my_retrieval_fn(query: str):
    # Your retrieval logic
    return [
        RetrievedDocument(
            content="Document content",
            score=0.95,
            metadata={"source": "doc.txt"}
        )
    ]

def my_generation_fn(query: str, docs: list):
    # Your generation logic
    return "Generated response"

rag_tester = RAGTester(
    retrieval_fn=my_retrieval_fn,
    generation_fn=my_generation_fn
)

result = rag_tester.test_query(
    query="What is AI?",
    expected_facts=["artificial", "intelligence"]
)

# Assertions
rag_tester.assert_retrieval_quality(result, min_docs=3, min_relevance=0.8)
rag_tester.assert_context_used(result)
rag_tester.assert_no_hallucination(result)
rag_tester.assert_performance(result, max_total_ms=1000)
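To show what checks of this kind might look at, here is a small standalone sketch. It assumes that min_docs is a minimum document count, that min_relevance applies to every retrieved document, and that expected facts are matched as case-insensitive substrings; PyLLMTest may define these differently. Doc, retrieval_problems, and facts_covered are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document."""
    content: str
    score: float
    metadata: Dict[str, str] = field(default_factory=dict)


def retrieval_problems(docs: List[Doc], min_docs: int,
                       min_relevance: float) -> List[str]:
    """Collect human-readable problems instead of failing on the first one."""
    problems = []
    if len(docs) < min_docs:
        problems.append(f"expected at least {min_docs} documents, got {len(docs)}")
    weak = [d for d in docs if d.score < min_relevance]
    if weak:
        problems.append(f"{len(weak)} document(s) scored below {min_relevance}")
    return problems


def facts_covered(response: str, expected_facts: List[str]) -> bool:
    """Case-insensitive substring check for every expected fact."""
    text = response.lower()
    return all(fact.lower() in text for fact in expected_facts)


docs = [Doc("AI is artificial intelligence.", 0.95),
        Doc("Unrelated note.", 0.40)]
problems = retrieval_problems(docs, min_docs=3, min_relevance=0.8)
covered = facts_covered("Artificial Intelligence (AI) mimics human reasoning.",
                        ["artificial", "intelligence"])
print(problems, covered)
```

Returning a list of problems rather than raising immediately makes it easier to report every retrieval issue for a query in one test run.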
6. Prompt Optimization
A/B test and optimize prompts:
from pyllmtest import PromptOptimizer, PromptVariant

# my_quality_fn is a scoring function you supply
optimizer = PromptOptimizer(provider=provider, quality_fn=my_quality_fn)

variants = [
    PromptVariant(
        id="detailed",
        template="Provide a detailed explanation of {topic}",
        description="Detailed prompt"
    ),
    PromptVariant(
        id="concise",
        template="Briefly explain {topic}",
        description="Concise prompt"
    )
]

test_inputs = [
    {"topic": "machine learning"},
    {"topic": "neural networks"}
]

# Compare prompts
results = optimizer.compare_prompts(variants, test_inputs)
optimizer.print_comparison(results)

# Find best prompt
best_id = optimizer.find_best_prompt(
    results,
    optimize_for="balanced", # "quality", "cost", "latency", or "balanced"
    quality_threshold=0.8
)
print(f"Best prompt: {best_id}")
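One plausible way to interpret the optimize_for modes is sketched below. The scoring formula for "balanced" (an equal-weight average of quality, normalized cost savings, and normalized latency savings) is an assumption made for illustration, and VariantStats and pick_best are hypothetical names; the actual selection logic in PyLLMTest may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VariantStats:
    variant_id: str
    quality: float     # 0..1, higher is better
    cost_usd: float    # lower is better
    latency_ms: float  # lower is better


def pick_best(stats: List[VariantStats], optimize_for: str = "balanced",
              quality_threshold: float = 0.0) -> Optional[str]:
    """Return the id of the best variant, or None if none meets the threshold."""
    eligible = [s for s in stats if s.quality >= quality_threshold]
    if not eligible:
        return None
    if optimize_for == "quality":
        return max(eligible, key=lambda s: s.quality).variant_id
    if optimize_for == "cost":
        return min(eligible, key=lambda s: s.cost_usd).variant_id
    if optimize_for == "latency":
        return min(eligible, key=lambda s: s.latency_ms).variant_id
    if optimize_for != "balanced":
        raise ValueError(f"unknown optimize_for: {optimize_for!r}")

    # Normalize cost and latency against the worst eligible variant.
    max_cost = max(s.cost_usd for s in eligible) or 1.0
    max_latency = max(s.latency_ms for s in eligible) or 1.0

    def score(s: VariantStats) -> float:
        return (s.quality
                + (1 - s.cost_usd / max_cost)
                + (1 - s.latency_ms / max_latency)) / 3

    return max(eligible, key=score).variant_id


stats = [
    VariantStats("detailed", quality=0.92, cost_usd=0.012, latency_ms=2100),
    VariantStats("concise", quality=0.85, cost_usd=0.004, latency_ms=900),
]
print(pick_best(stats, "quality"))
print(pick_best(stats, "balanced", quality_threshold=0.8))
```

Applying the quality threshold before ranking means a cheap but low-quality prompt can never win on cost alone.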
7. Test Suites
Organize tests into suites:
@LLMTest(provider=provider, suite="nlp_tests", name="test_sentiment")
def test_sentiment(ctx):
    ...

@LLMTest(provider=provider, suite="nlp_tests", name="test_translation")
def test_translation(ctx):
    ...

# Run all tests
test_sentiment()
test_translation()

# Get suite summary
suite = LLMTest.get_suite("nlp_tests")
summary = suite.get_summary()
print(f"Pass rate: {summary['pass_rate']:.1f}%")
print(f"Total cost: ${summary['total_cost_usd']:.4f}")
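For readers who want to see how a summary with these two keys could be aggregated, here is a minimal sketch. It assumes pass_rate is a percentage (as the formatting above suggests) and that costs are summed per test; Outcome and summarize are hypothetical names and the extra keys are illustrative.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Outcome:
    """Hypothetical record of one test run."""
    name: str
    passed: bool
    cost_usd: float


def summarize(outcomes: List[Outcome]) -> Dict[str, float]:
    total = len(outcomes)
    passed = sum(1 for o in outcomes if o.passed)
    return {
        "total": total,
        "passed": passed,
        "failed": total - passed,
        # Percentage in the range 0..100; 0.0 for an empty suite.
        "pass_rate": 100.0 * passed / total if total else 0.0,
        "total_cost_usd": sum(o.cost_usd for o in outcomes),
    }


outcomes = [
    Outcome("test_sentiment", passed=True, cost_usd=0.003),
    Outcome("test_translation", passed=False, cost_usd=0.002),
]
summary = summarize(outcomes)
print(f"Pass rate: {summary['pass_rate']:.1f}%")
print(f"Total cost: ${summary['total_cost_usd']:.4f}")
```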
🎯 Advanced Features
Streaming Support
@LLMTest(provider=provider)
async def test_streaming(ctx):
    full_content = ""
    async for chunk in provider.stream("Explain quantum computing"):
        full_content += chunk.content
        if chunk.is_final:
            expect(full_content).to_be_detailed()
Custom Assertions
import re

def is_valid_email(text: str) -> bool:
    # Simple pattern check; not a full RFC-compliant validator
    pattern = r'^[\w\.-]+@[\w\.-]+\.\w+$'
    return bool(re.match(pattern, text))

expect(response.content).to_satisfy(
    is_valid_email,
    message="Response must be a valid email"
)
Semantic Deduplication
from pyllmtest.utils.semantic import semantic_deduplication
texts = [
    "Machine learning is a subset of AI",
    "ML is part of artificial intelligence", # Similar to above
    "Deep learning uses neural networks"
]
unique_texts = semantic_deduplication(texts, provider, threshold=0.95)
# Returns: ["Machine learning is a subset of AI", "Deep learning uses neural networks"]
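The general shape of threshold-based deduplication is a greedy pass that keeps a text only if it is not too similar to anything already kept. The sketch below shows that pass with a pluggable similarity function. It is an illustration, not PyLLMTest's code: deduplicate and lexical_similarity are hypothetical names, and the difflib-based similarity only catches near-identical strings, whereas an embedding-based score would be needed to catch paraphrases like the example above.

```python
import difflib
from typing import Callable, List


def deduplicate(texts: List[str],
                similarity: Callable[[str, str], float],
                threshold: float = 0.95) -> List[str]:
    """Keep each text unless it is at least `threshold` similar to a kept one."""
    kept: List[str] = []
    for text in texts:
        if all(similarity(text, k) < threshold for k in kept):
            kept.append(text)
    return kept


def lexical_similarity(a: str, b: str) -> float:
    """Character-level ratio in [0, 1]; a stand-in for a semantic score."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()


sample_texts = [
    "Machine learning is a subset of AI",
    "Machine learning is a subset of AI.",  # near-duplicate of the first
    "Deep learning uses neural networks",
]
unique = deduplicate(sample_texts, lexical_similarity, threshold=0.95)
print(unique)
```

Because the pass is greedy, the first occurrence of each group is the one that survives, so input order determines which wording is kept.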
Semantic Clustering
from pyllmtest.utils.semantic import cluster_texts
texts = [
    "Python is great for AI",
    "JavaScript is used for web dev",
    "TensorFlow is an ML framework",
    "React is a web framework"
]
clusters = cluster_texts(texts, provider, num_clusters=2)
# Groups similar texts together
📊 Reporting
Console Reports
# Automatic beautiful console output
metrics.print_summary()
Output:
============================================================
METRICS SUMMARY
============================================================
Total Requests: 10
Total Tokens: 5,420
Prompt Tokens: 2,100
Completion Tokens: 3,320
Total Cost: $0.0542
Latency:
Average: 1,234.56ms
Min: 890.12ms
Max: 2,100.45ms
P50: 1,200.00ms
P95: 1,800.00ms
P99: 2,000.00ms
============================================================
Export Options
# JSON export
metrics.export_json("report.json")
# CSV export (detailed request log)
metrics.export_csv("requests.csv")
🔧 Configuration
Environment Variables
# OpenAI
export OPENAI_API_KEY=your-key
# Anthropic
export ANTHROPIC_API_KEY=your-key
Provider Configuration
provider = OpenAIProvider(
    model="gpt-4-turbo",
    timeout=60,
    max_retries=3,
    temperature=0.7
)
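A setting like max_retries usually means that a failed request is attempted again a limited number of times, often with a growing delay between attempts. The sketch below illustrates that common pattern with exponential backoff. It is an assumption about typical behavior, not PyLLMTest's actual retry logic; call_with_retries, base_delay_s, and the flaky function are hypothetical.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def call_with_retries(fn: Callable[[], T], max_retries: int = 3,
                      base_delay_s: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying up to max_retries times with exponential backoff."""
    attempt = 0
    while True:
        try:
            return fn()
        except Exception:
            if attempt >= max_retries:
                raise  # retries exhausted: surface the last error
            sleep(base_delay_s * 2 ** attempt)  # 0.5s, 1s, 2s, ...
            attempt += 1


calls = {"count": 0}


def flaky() -> str:
    """Fails twice with a transient error, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


delays: List[float] = []
# Record the delays instead of sleeping so the example runs instantly.
result = call_with_retries(flaky, max_retries=3, base_delay_s=0.5,
                           sleep=delays.append)
print(result, delays)
```

Passing the sleep function as a parameter keeps the retry logic testable without real waiting.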
📖 Examples
Check out the examples/ directory for:
- comprehensive_example.py - All features demonstrated
- basic_testing.py - Simple getting started
- rag_testing.py - RAG system testing
- prompt_optimization.py - Prompt A/B testing
- async_testing.py - Async patterns
🤝 Contributing
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
🙏 Acknowledgments
Built with ❤️ for the AI community.
Special thanks to:
- OpenAI for their amazing APIs
- Anthropic for Claude
- The Python testing community
Rahul Malik
- Email: rm324556@gmail.com
- GitHub: @RahulMK22
- LinkedIn: https://www.linkedin.com/in/rahul-malik-b0791a1a7/
📞 Support
- 📧 Email: rm324556@gmail.com
- 🐛 Issues: GitHub Issues
⭐ Star History
If you find PyLLMTest useful, please consider giving it a star on GitHub!
📄 License
MIT License - see LICENSE file for details.
Copyright (c) 2024 Rahul Malik
Made with 🚀 by Rahul Malik
Making LLM testing as easy as it should be.
File details
Details for the file pyllmtest-1.0.1.tar.gz.
File metadata
- Download URL: pyllmtest-1.0.1.tar.gz
- Upload date:
- Size: 40.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 16ef3e935c5539bb7b0b1f0b97d9daaec1eda41caeecfa961af387abd1ee90f2 |
| MD5 | d75d16c7316b84abe08c10241b86c04b |
| BLAKE2b-256 | 6be0016b21257ef1602e716b1ab976a168db2d81ed0f41dc23d6ebde3435efc3 |
File details
Details for the file pyllmtest-1.0.1-py3-none-any.whl.
File metadata
- Download URL: pyllmtest-1.0.1-py3-none-any.whl
- Upload date:
- Size: 31.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 23b70c32206116832796c65ef490fc625957f8eca4defbfe0c6ef8cb8f50d596 |
| MD5 | d11ef924be4c5d973640f09fd83503e6 |
| BLAKE2b-256 | d1f30b747402c9218a2a06e9e010d73f775c09314f207de506593ab5ffbfdaa3 |
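If you download a file manually, you can compare its SHA256 digest with the value listed above. The following sketch uses only the Python standard library; sha256_of_file and verify_download are hypothetical helper names, and the demonstration hashes a small temporary file rather than the real package.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Union


def sha256_of_file(path: Union[str, Path], chunk_size: int = 65536) -> str:
    """Hex SHA256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Union[str, Path], expected_hex: str) -> bool:
    """True if the file's SHA256 digest equals the published value."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Demonstration on a throwaway file containing the bytes b"abc".
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"abc")
    ok = verify_download(
        sample,
        "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad",
    )
    print(ok)
```

To check a real download, pass the path of the downloaded file and the SHA256 value from the matching table, for example verify_download("pyllmtest-1.0.1.tar.gz", "16ef3e935c5539bb7b0b1f0b97d9daaec1eda41caeecfa961af387abd1ee90f2").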