PyLLMTest 🚀

The Most Comprehensive LLM Testing Framework for Python

PyLLMTest is a testing framework designed specifically for LLM applications. It combines semantic assertions, snapshot testing, RAG testing, metrics tracking, and prompt optimization so you can build and test AI-powered applications with confidence.

🌟 Why PyLLMTest?

Testing LLM applications is fundamentally different from traditional software testing: the same prompt can produce differently worded output on each run, so exact-string assertions are brittle. PyLLMTest addresses these challenges with:

✅ Semantic Assertions - Test meaning, not exact strings
✅ Snapshot Testing - Detect regressions with semantic awareness
✅ Multi-Provider Support - OpenAI, Anthropic, and more
✅ RAG Testing - Comprehensive retrieval and generation testing
✅ Cost Tracking - Monitor token usage and costs
✅ Prompt Optimization - A/B test and optimize prompts
✅ Performance Benchmarking - Track latency and quality
✅ Async Support - Full async/await compatibility
✅ Beautiful Reporting - Rich test reports and metrics

📦 Installation

# Basic installation
pip install pyllmtest

# With OpenAI support
pip install pyllmtest[openai]

# With Anthropic support
pip install pyllmtest[anthropic]

# With all providers and features
pip install pyllmtest[all]

🚀 Quick Start

Basic Test

from pyllmtest import LLMTest, expect, OpenAIProvider

provider = OpenAIProvider(model="gpt-4-turbo")

@LLMTest(provider=provider)
def test_summarization(ctx):
    response = ctx.complete("Summarize: AI is transforming industries...")
    
    # Semantic assertions
    expect(response.content).to_be_shorter_than(100, unit="words")
    expect(response.content).to_contain("AI")
    expect(response.content).to_preserve_facts(["transform", "industries"])

# Run the test
result = test_summarization()
print(f"Test {'PASSED' if result.passed else 'FAILED'}")

Snapshot Testing

from pyllmtest import SnapshotManager

snapshot_mgr = SnapshotManager()

@LLMTest(provider=provider)
def test_with_snapshot(ctx):
    response = ctx.complete("What are the primary colors?")
    
    # Automatically detects semantic changes
    snapshot_mgr.assert_matches_snapshot(
        name="primary_colors",
        actual_content=response.content
    )

Async Testing

import asyncio

@LLMTest(provider=provider)
async def test_parallel_completions(ctx):
    tasks = [
        ctx.acomplete("Explain Python"),
        ctx.acomplete("Explain JavaScript"),
        ctx.acomplete("Explain Rust")
    ]
    
    responses = await asyncio.gather(*tasks)
    
    for resp in responses:
        expect(resp.content).to_be_longer_than(50, unit="words")
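
The concurrency pattern in the async test above can be tried without an API key. This is a minimal sketch using only the standard library; `fake_complete` and `run_all` are hypothetical stand-ins for `ctx.acomplete()` and the test body, not part of PyLLMTest.

```python
import asyncio


async def fake_complete(prompt: str) -> str:
    """Hypothetical stand-in for an async LLM call such as ctx.acomplete()."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"Answer to: {prompt}"


async def run_all(prompts: list[str]) -> list[str]:
    # gather() runs the coroutines concurrently and returns results in input order
    return await asyncio.gather(*(fake_complete(p) for p in prompts))


if __name__ == "__main__":
    for answer in asyncio.run(run_all(["Explain Python", "Explain Rust"])):
        print(answer)
```

Because `asyncio.gather()` preserves input order, each response can be matched back to its prompt by index.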

📚 Core Features

1. Semantic Assertions

Unlike traditional string assertions, PyLLMTest assertions compare meaning:

# Traditional (brittle)
assert "artificial intelligence" in response  # Fails if AI says "AI"

# PyLLMTest (semantic)
expect(response).to_match_semantic("artificial intelligence", threshold=0.9)
expect(response).to_preserve_facts(["machine learning", "neural networks"])
expect(response).not_to_hallucinate(source_text=original_document)

Available Assertions:

to_contain() / not_to_contain() - Check for substrings
to_match_regex() - Regex matching
to_be_shorter_than() / to_be_longer_than() - Length checks
to_be_concise() / to_be_detailed() - Quality checks
to_preserve_facts() - Fact preservation
not_to_hallucinate() - Hallucination detection
to_be_valid_json() / to_match_schema() - Format validation
to_match_semantic() - Semantic similarity
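
To make the idea of a similarity threshold concrete, here is a rough sketch of a threshold check. It is not how PyLLMTest implements `to_match_semantic()`: it uses bag-of-words cosine similarity as a simple stand-in, whereas real semantic matching would normally rely on embeddings. The function names are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the word-count vectors of two texts."""
    va, vb = bag_of_words(a), bag_of_words(b)
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def matches_semantically(actual: str, expected: str, threshold: float = 0.9) -> bool:
    """Pass when the similarity score reaches the threshold."""
    return cosine_similarity(actual, expected) >= threshold
```

A word-count stand-in cannot tell that "AI" and "artificial intelligence" mean the same thing, which is exactly the gap embedding-based comparison is meant to close.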

2. Snapshot Testing

Save "golden" outputs and detect regressions:

snapshot_mgr = SnapshotManager(
    snapshot_dir=".snapshots",
    update_mode=False,  # Set to True to update snapshots
    semantic_threshold=0.9  # Minimum semantic similarity (0.9) required to match
)

# First run: saves snapshot
# Subsequent runs: compares with snapshot
snapshot_mgr.assert_matches_snapshot("test_name", actual_content)

Features:

Semantic comparison - Not just exact matching
Version tracking - Track snapshot history
Diff generation - See what changed
Update mode - Review and approve changes
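
The save-then-compare flow can be sketched in a few lines. This is an illustration under stated assumptions, not the `SnapshotManager` implementation: `check_snapshot` is a hypothetical helper, snapshots are plain text files, and `difflib.SequenceMatcher` (character-level similarity) stands in for semantic comparison.

```python
import difflib
import tempfile
from pathlib import Path


def check_snapshot(snapshot_dir: Path, name: str, actual: str,
                   threshold: float = 0.9, update: bool = False) -> bool:
    """Save the snapshot on first run or in update mode; otherwise compare."""
    path = snapshot_dir / f"{name}.txt"
    if update or not path.exists():
        path.write_text(actual, encoding="utf-8")
        return True
    expected = path.read_text(encoding="utf-8")
    # ratio() is a character-level stand-in for a semantic similarity score
    ratio = difflib.SequenceMatcher(None, expected, actual).ratio()
    return ratio >= threshold


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        snapshots = Path(tmp)
        # First call stores the snapshot; later calls compare against it
        print(check_snapshot(snapshots, "primary_colors", "Red, yellow, and blue."))
        print(check_snapshot(snapshots, "primary_colors", "Red, yellow and blue."))
```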

3. Multi-Provider Support

Seamlessly switch between providers:

from pyllmtest import OpenAIProvider, AnthropicProvider

# OpenAI
openai_provider = OpenAIProvider(
    model="gpt-4-turbo",
    api_key="your-key"  # or use OPENAI_API_KEY env var
)

# Anthropic
anthropic_provider = AnthropicProvider(
    model="claude-3-5-sonnet-20241022",
    api_key="your-key"  # or use ANTHROPIC_API_KEY env var
)

# Use in tests
@LLMTest(provider=openai_provider)
def test_openai(ctx):
    ...

@LLMTest(provider=anthropic_provider)
def test_anthropic(ctx):
    ...

4. Metrics Tracking

Track requests, tokens, cost, and latency:

from pyllmtest import MetricsTracker

metrics = MetricsTracker()

# Automatic tracking in tests
@LLMTest(provider=provider)
def test_with_metrics(ctx):
    response = ctx.complete("query")  # Automatically tracked

# Print comprehensive report
metrics.print_summary()

# Export to JSON/CSV
metrics.export_json("metrics.json")
metrics.export_csv("requests.csv")

Tracked Metrics:

Total requests and tokens
Prompt vs completion tokens
Cost breakdown by model/provider
Latency percentiles (p50, p95, p99)
Per-model and per-provider stats
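
For readers unfamiliar with latency percentiles, here is one common way to compute them, the nearest-rank method. This is a sketch only; the method `MetricsTracker` actually uses is not documented here, and the `percentile` helper is hypothetical.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    latencies_ms = [890.0, 1200.0, 1250.0, 1800.0, 2100.0]
    for p in (50, 95, 99):
        print(f"P{p}: {percentile(latencies_ms, p):.2f}ms")
```

With few samples, P95 and P99 both collapse to the slowest request, so tail percentiles only become informative once a test run issues many calls.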

5. RAG Testing

Test retrieval-augmented generation:

from pyllmtest import RAGTester, RetrievedDocument

def my_retrieval_fn(query: str):
    # Your retrieval logic
    return [
        RetrievedDocument(
            content="Document content",
            score=0.95,
            metadata={"source": "doc.txt"}
        )
    ]

def my_generation_fn(query: str, docs: list):
    # Your generation logic
    return "Generated response"

rag_tester = RAGTester(
    retrieval_fn=my_retrieval_fn,
    generation_fn=my_generation_fn
)

result = rag_tester.test_query(
    query="What is AI?",
    expected_facts=["artificial", "intelligence"]
)

# Assertions
rag_tester.assert_retrieval_quality(result, min_docs=3, min_relevance=0.8)
rag_tester.assert_context_used(result)
rag_tester.assert_no_hallucination(result)
rag_tester.assert_performance(result, max_total_ms=1000)
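
The kind of check behind these assertions can be sketched in plain Python. This is one plausible reading of `min_docs` and `min_relevance`, not the documented behavior of `RAGTester`; `Doc`, `retrieval_quality_ok`, and `missing_facts` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document with a relevance score."""
    content: str
    score: float
    metadata: dict = field(default_factory=dict)


def retrieval_quality_ok(docs: list[Doc], min_docs: int, min_relevance: float) -> bool:
    """True when at least min_docs documents score at or above min_relevance."""
    relevant = [d for d in docs if d.score >= min_relevance]
    return len(relevant) >= min_docs


def missing_facts(response: str, expected_facts: list[str]) -> list[str]:
    """Return the expected facts absent from the response (case-insensitive substring check)."""
    lowered = response.lower()
    return [fact for fact in expected_facts if fact.lower() not in lowered]
```

Note that the retrieval function in the example above returns a single document, so a check with `min_docs=3` would fail under this reading until the retriever returns more results.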

6. Prompt Optimization

A/B test and optimize prompts:

from pyllmtest import PromptOptimizer, PromptVariant

optimizer = PromptOptimizer(provider=provider, quality_fn=my_quality_fn)

variants = [
    PromptVariant(
        id="detailed",
        template="Provide a detailed explanation of {topic}",
        description="Detailed prompt"
    ),
    PromptVariant(
        id="concise",
        template="Briefly explain {topic}",
        description="Concise prompt"
    )
]

test_inputs = [
    {"topic": "machine learning"},
    {"topic": "neural networks"}
]

# Compare prompts
results = optimizer.compare_prompts(variants, test_inputs)
optimizer.print_comparison(results)

# Find best prompt
best_id = optimizer.find_best_prompt(
    results,
    optimize_for="balanced",  # "quality", "cost", "latency", or "balanced"
    quality_threshold=0.8
)

print(f"Best prompt: {best_id}")
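
The example above passes `my_quality_fn` without defining it. Below is a minimal sketch of what such a function could look like, assuming the optimizer calls it with the response text and expects a score between 0 and 1. That signature is an assumption, so check the library's documentation before relying on it; `make_quality_fn` is a hypothetical helper.

```python
from typing import Callable


def make_quality_fn(required_terms: list[str], max_words: int) -> Callable[[str], float]:
    """Build a scorer that rewards term coverage and penalizes overlong responses."""

    def quality_fn(response_text: str) -> float:
        text = response_text.lower()
        if required_terms:
            hits = sum(term.lower() in text for term in required_terms)
            coverage = hits / len(required_terms)
        else:
            coverage = 1.0
        n_words = len(response_text.split())
        # Full marks up to max_words, then scale down in proportion to the excess
        length_score = 1.0 if n_words <= max_words else max_words / n_words
        return coverage * length_score

    return quality_fn


# Assumed usage: score responses about neural networks, capped at 150 words
my_quality_fn = make_quality_fn(["neural", "network"], max_words=150)
```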

7. Test Suites

Organize tests into suites:

@LLMTest(provider=provider, suite="nlp_tests", name="test_sentiment")
def test_sentiment(ctx):
    ...

@LLMTest(provider=provider, suite="nlp_tests", name="test_translation")
def test_translation(ctx):
    ...

# Run all tests
test_sentiment()
test_translation()

# Get suite summary
suite = LLMTest.get_suite("nlp_tests")
summary = suite.get_summary()

print(f"Pass rate: {summary['pass_rate']:.1f}%")
print(f"Total cost: ${summary['total_cost_usd']:.4f}")
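
For reference, a suite summary with these two fields could be computed as follows. This is a sketch, not the library's code: `Outcome` and `summarize` are hypothetical, and only the `pass_rate` and `total_cost_usd` keys come from the example above.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    """Hypothetical record of one test run."""
    name: str
    passed: bool
    cost_usd: float


def summarize(outcomes: list[Outcome]) -> dict:
    """Aggregate outcomes into a pass rate (percent) and total cost."""
    total = len(outcomes)
    passed = sum(o.passed for o in outcomes)
    return {
        "total": total,
        "pass_rate": 100.0 * passed / total if total else 0.0,
        "total_cost_usd": sum(o.cost_usd for o in outcomes),
    }
```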

🎯 Advanced Features

Streaming Support

@LLMTest(provider=provider)
async def test_streaming(ctx):
    full_content = ""
    
    async for chunk in provider.stream("Explain quantum computing"):
        full_content += chunk.content
        
        if chunk.is_final:
            expect(full_content).to_be_detailed()

Custom Assertions

import re

def is_valid_email(text: str) -> bool:
    # Simple shape check (name@domain.tld); not a full email address validator
    pattern = r'^[\w\.-]+@[\w\.-]+\.\w+$'
    return bool(re.match(pattern, text))

expect(response.content).to_satisfy(
    is_valid_email,
    message="Response must be a valid email"
)
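
Going by the example above, `to_satisfy()` appears to accept any callable that takes the text and returns a boolean. Under that assumption, predicates can be built with a small factory. The sketch below defines a hypothetical `has_json_keys` helper that checks a response is a JSON object containing certain keys.

```python
import json
from typing import Callable


def has_json_keys(*required: str) -> Callable[[str], bool]:
    """Build a predicate that checks for a JSON object containing all required keys."""

    def predicate(text: str) -> bool:
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            return False  # not valid JSON at all
        return isinstance(data, dict) and all(key in data for key in required)

    return predicate
```

The resulting predicate would be passed in place of `is_valid_email`, for example `has_json_keys("name", "email")`.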

Semantic Deduplication

from pyllmtest.utils.semantic import semantic_deduplication

texts = [
    "Machine learning is a subset of AI",
    "ML is part of artificial intelligence",  # Similar to above
    "Deep learning uses neural networks"
]

unique_texts = semantic_deduplication(texts, provider, threshold=0.95)
# Expected result when the first two texts score at or above the threshold:
# ["Machine learning is a subset of AI", "Deep learning uses neural networks"]
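
The deduplication idea itself is simple: keep a text only if it is not too similar to anything already kept. Here is a sketch with a pluggable similarity function. It uses word-set Jaccard overlap as a stand-in for embedding similarity, so it is not equivalent to `semantic_deduplication`, and both function names are hypothetical.

```python
import re
from typing import Callable


def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two texts, from 0.0 to 1.0."""
    ta = set(re.findall(r"\w+", a.lower()))
    tb = set(re.findall(r"\w+", b.lower()))
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def deduplicate(texts: list[str],
                similarity: Callable[[str, str], float] = jaccard,
                threshold: float = 0.95) -> list[str]:
    """Greedy pass: keep a text only if it is below the threshold against every kept text."""
    kept: list[str] = []
    for text in texts:
        if all(similarity(text, existing) < threshold for existing in kept):
            kept.append(text)
    return kept
```

The first occurrence always wins in this greedy scheme, so input order determines which of two near-duplicates survives.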

Semantic Clustering

from pyllmtest.utils.semantic import cluster_texts

texts = [
    "Python is great for AI",
    "JavaScript is used for web dev",
    "TensorFlow is an ML framework",
    "React is a web framework"
]

clusters = cluster_texts(texts, provider, num_clusters=2)
# Groups similar texts together

📊 Reporting

Console Reports

# Print a formatted summary to the console
metrics.print_summary()

Output:

============================================================
METRICS SUMMARY
============================================================
Total Requests: 10
Total Tokens: 5,420
  Prompt Tokens: 2,100
  Completion Tokens: 3,320
Total Cost: $0.0542

Latency:
  Average: 1,234.56ms
  Min: 890.12ms
  Max: 2,100.45ms
  P50: 1,200.00ms
  P95: 1,800.00ms
  P99: 2,000.00ms
============================================================

Export Options

# JSON export
metrics.export_json("report.json")

# CSV export (detailed request log)
metrics.export_csv("requests.csv")

🔧 Configuration

Environment Variables

# OpenAI
export OPENAI_API_KEY=your-key

# Anthropic
export ANTHROPIC_API_KEY=your-key
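
The "explicit key, else environment variable" lookup described in the provider examples can be sketched as below. This shows the general pattern only; `resolve_api_key` is a hypothetical helper, and the providers' actual lookup logic may differ.

```python
import os
from typing import Optional


def resolve_api_key(explicit_key: Optional[str], env_var: str) -> str:
    """Prefer an explicitly passed key, then fall back to the environment variable."""
    key = explicit_key or os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"No API key found: pass api_key or set {env_var}")
    return key
```

Keeping keys in environment variables rather than in source files avoids committing secrets to version control.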

Provider Configuration

provider = OpenAIProvider(
    model="gpt-4-turbo",
    timeout=60,
    max_retries=3,
    temperature=0.7
)

📖 Examples

Check out the examples/ directory for:

comprehensive_example.py - All features demonstrated
basic_testing.py - Simple getting started
rag_testing.py - RAG system testing
prompt_optimization.py - Prompt A/B testing
async_testing.py - Async patterns

🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

📄 License

MIT License - see LICENSE file for details.

🙏 Acknowledgments

Built with ❤️ for the AI community.

Special thanks to:

OpenAI for their amazing APIs
Anthropic for Claude
The Python testing community

📞 Support

📧 Email: support@pyllmtest.dev
💬 Discord: Join our community
📖 Docs: docs.pyllmtest.dev
🐛 Issues: GitHub Issues

⭐ Star History

If you find PyLLMTest useful, please consider giving it a star on GitHub!

Made with 🚀 by developers, for developers

Making LLM testing as easy as it should be.

Release history

1.0.1 - Dec 13, 2025
1.0.0 - Dec 13, 2025 (this version)
