PyLLMTest 🚀

The Most Comprehensive LLM Testing Framework for Python

License: MIT | Python 3.8+

PyLLMTest is a testing framework designed specifically for LLM applications. It combines semantic assertions, snapshot testing, RAG testing, metrics tracking, and prompt optimization so you can build, test, and optimize AI-powered applications with confidence.

🌟 Why PyLLMTest?

Testing LLM applications is fundamentally different from traditional software testing. PyLLMTest solves the unique challenges of LLM testing:

  • Semantic Assertions - Test meaning, not exact strings
  • Snapshot Testing - Detect regressions with semantic awareness
  • Multi-Provider Support - OpenAI, Anthropic, and more
  • RAG Testing - Comprehensive retrieval and generation testing
  • Cost Tracking - Monitor token usage and costs
  • Prompt Optimization - A/B test and optimize prompts
  • Performance Benchmarking - Track latency and quality
  • Async Support - Full async/await compatibility
  • Beautiful Reporting - Rich test reports and metrics

📦 Installation

# Basic installation
pip install pyllmtest

# With OpenAI support
pip install pyllmtest[openai]

# With Anthropic support
pip install pyllmtest[anthropic]

# With all providers and features
pip install pyllmtest[all]

🚀 Quick Start

Basic Test

from pyllmtest import LLMTest, expect, OpenAIProvider

provider = OpenAIProvider(model="gpt-4-turbo")

@LLMTest(provider=provider)
def test_summarization(ctx):
    response = ctx.complete("Summarize: AI is transforming industries...")
    
    # Semantic assertions
    expect(response.content).to_be_shorter_than(100, unit="words")
    expect(response.content).to_contain("AI")
    expect(response.content).to_preserve_facts(["transform", "industries"])

# Run the test
result = test_summarization()
print(f"Test {'PASSED' if result.passed else 'FAILED'}")

Snapshot Testing

from pyllmtest import SnapshotManager

snapshot_mgr = SnapshotManager()

@LLMTest(provider=provider)
def test_with_snapshot(ctx):
    response = ctx.complete("What are the primary colors?")
    
    # Automatically detects semantic changes
    snapshot_mgr.assert_matches_snapshot(
        name="primary_colors",
        actual_content=response.content
    )

Async Testing

import asyncio

@LLMTest(provider=provider)
async def test_parallel_completions(ctx):
    tasks = [
        ctx.acomplete("Explain Python"),
        ctx.acomplete("Explain JavaScript"),
        ctx.acomplete("Explain Rust")
    ]
    
    responses = await asyncio.gather(*tasks)
    
    for resp in responses:
        expect(resp.content).to_be_longer_than(50, unit="words")

📚 Core Features

1. Semantic Assertions

Unlike exact-string assertions, PyLLMTest assertions compare meaning rather than wording:

# Traditional (brittle)
assert "artificial intelligence" in response  # Fails if the model writes "AI" instead

# PyLLMTest (semantic)
expect(response).to_match_semantic("artificial intelligence", threshold=0.9)
expect(response).to_preserve_facts(["machine learning", "neural networks"])
expect(response).not_to_hallucinate(source_text=original_document)

Available Assertions:

  • to_contain() / not_to_contain() - Check for substrings
  • to_match_regex() - Regex matching
  • to_be_shorter_than() / to_be_longer_than() - Length checks
  • to_be_concise() / to_be_detailed() - Quality checks
  • to_preserve_facts() - Fact preservation
  • not_to_hallucinate() - Hallucination detection
  • to_be_valid_json() / to_match_schema() - Format validation
  • to_match_semantic() - Semantic similarity
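
To make the `threshold` argument of `to_match_semantic()` more concrete, here is a minimal sketch of threshold-based semantic matching. It assumes similarity is measured as cosine similarity between embedding vectors, which this README does not confirm. The helper names and the toy 3-dimensional "embeddings" are hypothetical and are not part of the PyLLMTest API.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def matches_semantically(vec_a, vec_b, threshold=0.9):
    """True when the two vectors are at least `threshold` similar."""
    return cosine_similarity(vec_a, vec_b) >= threshold

# Hypothetical toy embeddings; a real setup would get these from an embedding model
ai = [0.9, 0.1, 0.0]
artificial_intelligence = [0.8, 0.2, 0.0]
cooking = [0.0, 0.1, 0.9]

print(matches_semantically(ai, artificial_intelligence))
print(matches_semantically(ai, cooking))
```

A higher threshold makes the assertion stricter. With these toy vectors, the first pair points in nearly the same direction and the second pair does not.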

2. Snapshot Testing

Save "golden" outputs and detect regressions:

snapshot_mgr = SnapshotManager(
    snapshot_dir=".snapshots",
    update_mode=False,  # Set to True to update snapshots
    semantic_threshold=0.9  # Minimum semantic similarity to count as a match
)

# First run: saves snapshot
# Subsequent runs: compares with snapshot
snapshot_mgr.assert_matches_snapshot("test_name", actual_content)

Features:

  • Semantic comparison - Not just exact matching
  • Version tracking - Track snapshot history
  • Diff generation - See what changed
  • Update mode - Review and approve changes

3. Multi-Provider Support

Seamlessly switch between providers:

from pyllmtest import OpenAIProvider, AnthropicProvider

# OpenAI
openai_provider = OpenAIProvider(
    model="gpt-4-turbo",
    api_key="your-key"  # or use OPENAI_API_KEY env var
)

# Anthropic
anthropic_provider = AnthropicProvider(
    model="claude-3-5-sonnet-20241022",
    api_key="your-key"  # or use ANTHROPIC_API_KEY env var
)

# Use in tests
@LLMTest(provider=openai_provider)
def test_openai(ctx):
    ...

@LLMTest(provider=anthropic_provider)
def test_anthropic(ctx):
    ...

4. Metrics Tracking

Track requests, tokens, cost, and latency:

from pyllmtest import MetricsTracker

metrics = MetricsTracker()

# Automatic tracking in tests
@LLMTest(provider=provider)
def test_with_metrics(ctx):
    response = ctx.complete("query")  # Automatically tracked

# Print comprehensive report
metrics.print_summary()

# Export to JSON/CSV
metrics.export_json("metrics.json")
metrics.export_csv("requests.csv")

Tracked Metrics:

  • Total requests and tokens
  • Prompt vs completion tokens
  • Cost breakdown by model/provider
  • Latency percentiles (p50, p95, p99)
  • Per-model and per-provider stats
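
For readers unfamiliar with latency percentiles, the sketch below shows one common way to compute p50, p95, and p99 from a list of request latencies. It uses the nearest-rank method. Which method `MetricsTracker` uses is not stated here, so treat this as an illustration only. The `percentile` helper and the sample latencies are hypothetical.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Multiply before dividing to avoid floating-point drift such as 0.3 * 10
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical request latencies in milliseconds
latencies_ms = [890.0, 950.0, 1100.0, 1200.0, 1250.0,
                1300.0, 1400.0, 1600.0, 1800.0, 2100.0]

for pct in (50, 95, 99):
    print(f"P{pct}: {percentile(latencies_ms, pct):.2f}ms")
```

With only ten samples, p95 and p99 both land on the slowest request. Tail percentiles become meaningful only with larger request counts.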

5. RAG Testing

Test retrieval-augmented generation:

from pyllmtest import RAGTester, RetrievedDocument

def my_retrieval_fn(query: str):
    # Your retrieval logic
    return [
        RetrievedDocument(
            content="Document content",
            score=0.95,
            metadata={"source": "doc.txt"}
        )
    ]

def my_generation_fn(query: str, docs: list):
    # Your generation logic
    return "Generated response"

rag_tester = RAGTester(
    retrieval_fn=my_retrieval_fn,
    generation_fn=my_generation_fn
)

result = rag_tester.test_query(
    query="What is AI?",
    expected_facts=["artificial", "intelligence"]
)

# Assertions
rag_tester.assert_retrieval_quality(result, min_docs=3, min_relevance=0.8)
rag_tester.assert_context_used(result)
rag_tester.assert_no_hallucination(result)
rag_tester.assert_performance(result, max_total_ms=1000)
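
The exact semantics of `min_docs` and `min_relevance` are not spelled out above. One plausible reading is "at least `min_docs` retrieved documents must have a score of `min_relevance` or higher". The sketch below implements that reading in plain Python. `Doc` and `check_retrieval_quality` are hypothetical stand-ins, not the library's classes.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document."""
    content: str
    score: float
    metadata: dict = field(default_factory=dict)

def check_retrieval_quality(docs, min_docs=3, min_relevance=0.8):
    """Raise AssertionError unless enough documents clear the relevance bar."""
    relevant = [d for d in docs if d.score >= min_relevance]
    if len(relevant) < min_docs:
        raise AssertionError(
            f"expected at least {min_docs} documents with score >= "
            f"{min_relevance}, got {len(relevant)}"
        )
    return relevant

docs = [
    Doc("AI overview", 0.95),
    Doc("ML basics", 0.85),
    Doc("History of AI", 0.81),
    Doc("Cooking tips", 0.20),
]
relevant = check_retrieval_quality(docs)
print([d.content for d in relevant])
```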

6. Prompt Optimization

A/B test and optimize prompts:

from pyllmtest import PromptOptimizer, PromptVariant

# my_quality_fn is your own scoring function, defined elsewhere
optimizer = PromptOptimizer(provider=provider, quality_fn=my_quality_fn)

variants = [
    PromptVariant(
        id="detailed",
        template="Provide a detailed explanation of {topic}",
        description="Detailed prompt"
    ),
    PromptVariant(
        id="concise",
        template="Briefly explain {topic}",
        description="Concise prompt"
    )
]

test_inputs = [
    {"topic": "machine learning"},
    {"topic": "neural networks"}
]

# Compare prompts
results = optimizer.compare_prompts(variants, test_inputs)
optimizer.print_comparison(results)

# Find best prompt
best_id = optimizer.find_best_prompt(
    results,
    optimize_for="balanced",  # "quality", "cost", "latency", or "balanced"
    quality_threshold=0.8
)

print(f"Best prompt: {best_id}")
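
Comparing variants means every template is run against every test input. The sketch below shows that expansion, assuming the `{topic}` placeholders are filled in the same way as Python's `str.format`, which the README does not confirm. `expand_prompts` is a hypothetical helper, not part of `PromptOptimizer`.

```python
from itertools import product

templates = {
    "detailed": "Provide a detailed explanation of {topic}",
    "concise": "Briefly explain {topic}",
}
test_inputs = [
    {"topic": "machine learning"},
    {"topic": "neural networks"},
]

def expand_prompts(templates, inputs):
    """Pair every variant with every input and fill in the placeholders."""
    return [
        (variant_id, template.format(**values))
        for (variant_id, template), values in product(templates.items(), inputs)
    ]

prompts = expand_prompts(templates, test_inputs)
for variant_id, prompt in prompts:
    print(f"{variant_id}: {prompt}")
```

Two variants and two inputs produce four prompts, so the number of LLM calls (and the cost) grows with the product of the two lists.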

7. Test Suites

Organize tests into suites:

@LLMTest(provider=provider, suite="nlp_tests", name="test_sentiment")
def test_sentiment(ctx):
    ...

@LLMTest(provider=provider, suite="nlp_tests", name="test_translation")
def test_translation(ctx):
    ...

# Run all tests
test_sentiment()
test_translation()

# Get suite summary
suite = LLMTest.get_suite("nlp_tests")
summary = suite.get_summary()

print(f"Pass rate: {summary['pass_rate']:.1f}%")
print(f"Total cost: ${summary['total_cost_usd']:.4f}")

🎯 Advanced Features

Streaming Support

@LLMTest(provider=provider)
async def test_streaming(ctx):
    full_content = ""
    
    async for chunk in provider.stream("Explain quantum computing"):
        full_content += chunk.content
        
        if chunk.is_final:
            expect(full_content).to_be_detailed()

Custom Assertions

import re

def is_valid_email(text: str) -> bool:
    # Simple pattern check; not a full RFC 5322 validator
    pattern = r'^[\w\.-]+@[\w\.-]+\.\w+$'
    return bool(re.match(pattern, text))

expect(response.content).to_satisfy(
    is_valid_email,
    message="Response must be a valid email"
)

Semantic Deduplication

from pyllmtest.utils.semantic import semantic_deduplication

texts = [
    "Machine learning is a subset of AI",
    "ML is part of artificial intelligence",  # Similar to above
    "Deep learning uses neural networks"
]

unique_texts = semantic_deduplication(texts, provider, threshold=0.95)
# Returns: ["Machine learning is a subset of AI", "Deep learning uses neural networks"]
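
As an illustration of what threshold-based deduplication does, here is a minimal greedy sketch: each text is kept unless it is too similar to one already kept. The algorithm `semantic_deduplication` actually uses is not documented here. The `deduplicate` helper and the toy 2-dimensional vectors are assumptions made for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.hypot(*a) * math.hypot(*b)
    return dot / norms if norms else 0.0

def deduplicate(items, vectors, threshold=0.95):
    """Keep each item unless it is at least `threshold` similar to a kept one."""
    kept = []  # Indices of the items kept so far
    for i, vec in enumerate(vectors):
        if all(cosine(vec, vectors[j]) < threshold for j in kept):
            kept.append(i)
    return [items[i] for i in kept]

texts = [
    "Machine learning is a subset of AI",
    "ML is part of artificial intelligence",
    "Deep learning uses neural networks",
]
# Hypothetical embeddings: the first two point in nearly the same direction
toy_vectors = [[1.0, 0.1], [0.98, 0.12], [0.1, 1.0]]

unique = deduplicate(texts, toy_vectors)
print(unique)
```

Because the first occurrence always wins, the order of the input decides which of two near-duplicates survives.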

Semantic Clustering

from pyllmtest.utils.semantic import cluster_texts

texts = [
    "Python is great for AI",
    "JavaScript is used for web dev",
    "TensorFlow is an ML framework",
    "React is a web framework"
]

clusters = cluster_texts(texts, provider, num_clusters=2)
# Groups similar texts together

📊 Reporting

Console Reports

# Print a formatted summary to the console
metrics.print_summary()

Output:

============================================================
METRICS SUMMARY
============================================================
Total Requests: 10
Total Tokens: 5,420
  Prompt Tokens: 2,100
  Completion Tokens: 3,320
Total Cost: $0.0542

Latency:
  Average: 1,234.56ms
  Min: 890.12ms
  Max: 2,100.45ms
  P50: 1,200.00ms
  P95: 1,800.00ms
  P99: 2,000.00ms
============================================================

Export Options

# JSON export
metrics.export_json("report.json")

# CSV export (detailed request log)
metrics.export_csv("requests.csv")

🔧 Configuration

Environment Variables

# OpenAI
export OPENAI_API_KEY=your-key

# Anthropic
export ANTHROPIC_API_KEY=your-key
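
The provider examples accept either an explicit `api_key` or an environment variable. A common way to resolve the two is sketched below, assuming the explicit argument takes precedence over the environment. That precedence is an assumption, and `resolve_api_key` is a hypothetical helper rather than a PyLLMTest function.

```python
import os

def resolve_api_key(explicit_key=None, env_var="OPENAI_API_KEY", environ=None):
    """Return the explicit key if given, otherwise fall back to the environment."""
    environ = os.environ if environ is None else environ
    key = explicit_key or environ.get(env_var)
    if not key:
        raise RuntimeError(f"No API key given and {env_var} is not set")
    return key

# A plain dict stands in for os.environ so the example has no side effects
fake_env = {"OPENAI_API_KEY": "from-env"}
print(resolve_api_key(environ=fake_env))
print(resolve_api_key("explicit", environ=fake_env))
```

Reading keys from the environment keeps them out of source control, which is why the environment variables above are usually the safer option.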

Provider Configuration

provider = OpenAIProvider(
    model="gpt-4-turbo",
    timeout=60,
    max_retries=3,
    temperature=0.7
)

📖 Examples

Check out the examples/ directory for:

  • comprehensive_example.py - All features demonstrated
  • basic_testing.py - Simple getting started
  • rag_testing.py - RAG system testing
  • prompt_optimization.py - Prompt A/B testing
  • async_testing.py - Async patterns

🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

📄 License

MIT License - see LICENSE file for details.

🙏 Acknowledgments

Built with ❤️ for the AI community.

Special thanks to:

  • OpenAI for their amazing APIs
  • Anthropic for Claude
  • The Python testing community

⭐ Star History

If you find PyLLMTest useful, please consider giving it a star on GitHub!


Made with 🚀 by developers, for developers

Making LLM testing as easy as it should be.
