The most comprehensive LLM testing framework for Python

PyLLMTest 🚀

License: MIT | Python 3.8+

PyLLMTest is a testing framework designed specifically for LLM applications. It combines semantic assertions, snapshot testing, RAG testing, metrics tracking, and prompt optimization so you can build, test, and optimize AI-powered applications with confidence.

🌟 Why PyLLMTest?

Testing LLM applications is fundamentally different from traditional software testing. PyLLMTest solves the unique challenges of LLM testing:

  • Semantic Assertions - Test meaning, not exact strings
  • Snapshot Testing - Detect regressions with semantic awareness
  • Multi-Provider Support - OpenAI, Anthropic, and more
  • RAG Testing - Comprehensive retrieval and generation testing
  • Cost Tracking - Monitor token usage and costs
  • Prompt Optimization - A/B test and optimize prompts
  • Performance Benchmarking - Track latency and quality
  • Async Support - Full async/await compatibility
  • Beautiful Reporting - Rich test reports and metrics

📦 Installation

# Basic installation
pip install pyllmtest

# With OpenAI support
pip install pyllmtest[openai]

# With Anthropic support
pip install pyllmtest[anthropic]

# With all providers and features
pip install pyllmtest[all]

🚀 Quick Start

Basic Test

from pyllmtest import LLMTest, expect, OpenAIProvider

provider = OpenAIProvider(model="gpt-4-turbo")

@LLMTest(provider=provider)
def test_summarization(ctx):
    response = ctx.complete("Summarize: AI is transforming industries...")
    
    # Semantic assertions
    expect(response.content).to_be_shorter_than(100, unit="words")
    expect(response.content).to_contain("AI")
    expect(response.content).to_preserve_facts(["transform", "industries"])

# Run the test
result = test_summarization()
print(f"Test {'PASSED' if result.passed else 'FAILED'}")

Snapshot Testing

from pyllmtest import LLMTest, SnapshotManager

# `provider` is the OpenAIProvider instance created in the Basic Test above

snapshot_mgr = SnapshotManager()

@LLMTest(provider=provider)
def test_with_snapshot(ctx):
    response = ctx.complete("What are the primary colors?")
    
    # Automatically detects semantic changes
    snapshot_mgr.assert_matches_snapshot(
        name="primary_colors",
        actual_content=response.content
    )

Async Testing

import asyncio

from pyllmtest import LLMTest, expect

@LLMTest(provider=provider)
async def test_parallel_completions(ctx):
    tasks = [
        ctx.acomplete("Explain Python"),
        ctx.acomplete("Explain JavaScript"),
        ctx.acomplete("Explain Rust")
    ]
    
    responses = await asyncio.gather(*tasks)
    
    for resp in responses:
        expect(resp.content).to_be_longer_than(50, unit="words")

📚 Core Features

1. Semantic Assertions

Unlike traditional assertions, PyLLMTest understands meaning:

# Traditional (brittle)
assert "artificial intelligence" in response  # Fails if the model writes "AI" instead

# PyLLMTest (semantic)
expect(response).to_match_semantic("artificial intelligence", threshold=0.9)
expect(response).to_preserve_facts(["machine learning", "neural networks"])
expect(response).not_to_hallucinate(source_text=original_document)

Available Assertions:

  • to_contain() / not_to_contain() - Check for substrings
  • to_match_regex() - Regex matching
  • to_be_shorter_than() / to_be_longer_than() - Length checks
  • to_be_concise() / to_be_detailed() - Quality checks
  • to_preserve_facts() - Fact preservation
  • not_to_hallucinate() - Hallucination detection
  • to_be_valid_json() / to_match_schema() - Format validation
  • to_match_semantic() - Semantic similarity
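
To make the role of the `threshold` argument concrete, here is a minimal, self-contained sketch of a similarity check. It is not PyLLMTest's implementation: it assumes a simple bag-of-words cosine similarity as a stand-in for whatever embedding-based comparison the library performs, and the names `bag_of_words`, `cosine_similarity`, and `matches_semantically` are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count its alphanumeric word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the word-count vectors of two texts."""
    va, vb = bag_of_words(a), bag_of_words(b)
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def matches_semantically(actual: str, expected: str, threshold: float = 0.9) -> bool:
    """Hypothetical check: pass when similarity reaches the threshold."""
    return cosine_similarity(actual, expected) >= threshold
```

Word overlap scores "artificial intelligence" and "AI" as unrelated, which is exactly the gap an embedding-based comparison is meant to close; the sketch only illustrates how a similarity score and a threshold combine into a pass/fail result.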

2. Snapshot Testing

Save "golden" outputs and detect regressions:

snapshot_mgr = SnapshotManager(
    snapshot_dir=".snapshots",
    update_mode=False,  # Set to True to update snapshots
    semantic_threshold=0.9  # Outputs at least 90% semantically similar to the snapshot count as a match
)

# First run: saves snapshot
# Subsequent runs: compares with snapshot
snapshot_mgr.assert_matches_snapshot("test_name", actual_content)

Features:

  • Semantic comparison - Not just exact matching
  • Version tracking - Track snapshot history
  • Diff generation - See what changed
  • Update mode - Review and approve changes
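
The save-then-compare cycle described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's code: `check_snapshot` is a hypothetical helper, the JSON file layout is invented for the example, and `difflib.SequenceMatcher.ratio()` (a character-level measure) stands in for a real semantic comparison.

```python
import difflib
import json
import tempfile
from pathlib import Path


def check_snapshot(snapshot_dir: Path, name: str, actual: str,
                   threshold: float = 0.9, update: bool = False) -> bool:
    """Save the snapshot on first run (or in update mode); otherwise compare."""
    path = snapshot_dir / f"{name}.json"
    if update or not path.exists():
        snapshot_dir.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps({"content": actual}))
        return True
    saved = json.loads(path.read_text())["content"]
    # Character-level similarity as a stand-in for semantic similarity.
    similarity = difflib.SequenceMatcher(None, saved, actual).ratio()
    return similarity >= threshold


with tempfile.TemporaryDirectory() as tmp:
    snapshots = Path(tmp)
    # First run: nothing stored yet, so the output becomes the snapshot.
    first = check_snapshot(snapshots, "primary_colors", "Red, yellow, and blue.")
    # Second run with identical output: matches the snapshot.
    same = check_snapshot(snapshots, "primary_colors", "Red, yellow, and blue.")
    # Third run with unrelated output: falls below the threshold.
    changed = check_snapshot(snapshots, "primary_colors", "Completely different text here!")
```

Passing `update=True` overwrites the stored content, which mirrors the idea of an update mode used to approve an intentional change.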

3. Multi-Provider Support

Seamlessly switch between providers:

from pyllmtest import OpenAIProvider, AnthropicProvider

# OpenAI
openai_provider = OpenAIProvider(
    model="gpt-4-turbo",
    api_key="your-key"  # or use OPENAI_API_KEY env var
)

# Anthropic
anthropic_provider = AnthropicProvider(
    model="claude-3-5-sonnet-20241022",
    api_key="your-key"  # or use ANTHROPIC_API_KEY env var
)

# Use in tests
@LLMTest(provider=openai_provider)
def test_openai(ctx):
    ...

@LLMTest(provider=anthropic_provider)
def test_anthropic(ctx):
    ...

4. Metrics Tracking

Track requests, tokens, cost, and latency:

from pyllmtest import MetricsTracker

metrics = MetricsTracker()

# Automatic tracking in tests
@LLMTest(provider=provider)
def test_with_metrics(ctx):
    response = ctx.complete("query")  # Automatically tracked

# Print comprehensive report
metrics.print_summary()

# Export to JSON/CSV
metrics.export_json("metrics.json")
metrics.export_csv("requests.csv")

Tracked Metrics:

  • Total requests and tokens
  • Prompt vs completion tokens
  • Cost breakdown by model/provider
  • Latency percentiles (p50, p95, p99)
  • Per-model and per-provider stats
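
For readers unfamiliar with the metrics listed above, the following sketch shows one common way to compute latency percentiles and a token-based cost estimate. It assumes the nearest-rank percentile definition and per-1,000-token pricing; PyLLMTest may compute these differently, and `percentile`, `estimate_cost`, and the prices used are hypothetical.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  prompt_price_per_1k: float, completion_price_per_1k: float) -> float:
    """Cost in USD, given separate (hypothetical) prices per 1,000 tokens."""
    return (prompt_tokens / 1000 * prompt_price_per_1k
            + completion_tokens / 1000 * completion_price_per_1k)


latencies_ms = [890.0, 1200.0, 1150.0, 2100.0, 1300.0]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)

# Illustrative prices only; real prices depend on the provider and model.
cost = estimate_cost(2100, 3320, prompt_price_per_1k=0.01, completion_price_per_1k=0.03)
```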

5. RAG Testing

Test retrieval-augmented generation:

from pyllmtest import RAGTester, RetrievedDocument

def my_retrieval_fn(query: str):
    # Your retrieval logic
    return [
        RetrievedDocument(
            content="Document content",
            score=0.95,
            metadata={"source": "doc.txt"}
        )
    ]

def my_generation_fn(query: str, docs: list):
    # Your generation logic
    return "Generated response"

rag_tester = RAGTester(
    retrieval_fn=my_retrieval_fn,
    generation_fn=my_generation_fn
)

result = rag_tester.test_query(
    query="What is AI?",
    expected_facts=["artificial", "intelligence"]
)

# Assertions
rag_tester.assert_retrieval_quality(result, min_docs=3, min_relevance=0.8)
rag_tester.assert_context_used(result)
rag_tester.assert_no_hallucination(result)
rag_tester.assert_performance(result, max_total_ms=1000)
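
As a rough illustration of what the retrieval and fact checks above might verify, here is a standalone sketch. It assumes that `min_docs` and `min_relevance` mean "at least this many documents score at or above this relevance" and that expected facts are matched as case-insensitive substrings; both are assumptions about the semantics, and `Doc`, `retrieval_quality_ok`, and `missing_facts` are hypothetical names rather than part of the library.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document."""
    content: str
    score: float


def retrieval_quality_ok(docs: List[Doc], min_docs: int, min_relevance: float) -> bool:
    """True if at least min_docs documents score at or above min_relevance."""
    relevant = [d for d in docs if d.score >= min_relevance]
    return len(relevant) >= min_docs


def missing_facts(response: str, expected_facts: List[str]) -> List[str]:
    """Return the expected facts that do not appear (case-insensitively) in the response."""
    lowered = response.lower()
    return [fact for fact in expected_facts if fact.lower() not in lowered]


docs = [Doc("AI overview", 0.95), Doc("ML primer", 0.85), Doc("Cooking tips", 0.40)]
answer = "Artificial intelligence is the simulation of human reasoning by machines."
```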

6. Prompt Optimization

A/B test and optimize prompts:

from pyllmtest import PromptOptimizer, PromptVariant

optimizer = PromptOptimizer(provider=provider, quality_fn=my_quality_fn)

variants = [
    PromptVariant(
        id="detailed",
        template="Provide a detailed explanation of {topic}",
        description="Detailed prompt"
    ),
    PromptVariant(
        id="concise",
        template="Briefly explain {topic}",
        description="Concise prompt"
    )
]

test_inputs = [
    {"topic": "machine learning"},
    {"topic": "neural networks"}
]

# Compare prompts
results = optimizer.compare_prompts(variants, test_inputs)
optimizer.print_comparison(results)

# Find best prompt
best_id = optimizer.find_best_prompt(
    results,
    optimize_for="balanced",  # "quality", "cost", "latency", or "balanced"
    quality_threshold=0.8
)

print(f"Best prompt: {best_id}")
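
The selection step can be pictured as "filter by quality, then optimize one metric". The sketch below shows that idea for the "quality", "cost", and "latency" modes only; how the library weighs metrics in "balanced" mode is not documented here, so it is left out. The function `find_best`, the metric keys, and the sample numbers are all hypothetical.

```python
from typing import Dict, Optional

Metrics = Dict[str, float]


def find_best(results: Dict[str, Metrics], optimize_for: str = "quality",
              quality_threshold: float = 0.8) -> Optional[str]:
    """Pick a variant id among those whose quality meets the threshold."""
    eligible = {vid: m for vid, m in results.items() if m["quality"] >= quality_threshold}
    if not eligible:
        return None
    if optimize_for == "quality":
        return max(eligible, key=lambda vid: eligible[vid]["quality"])
    if optimize_for in ("cost", "latency"):
        key = "cost_usd" if optimize_for == "cost" else "latency_ms"
        return min(eligible, key=lambda vid: eligible[vid][key])
    raise ValueError(f"unsupported optimize_for: {optimize_for!r}")


# Made-up measurements for three prompt variants.
sample = {
    "detailed": {"quality": 0.92, "cost_usd": 0.004, "latency_ms": 1800.0},
    "concise": {"quality": 0.85, "cost_usd": 0.001, "latency_ms": 700.0},
    "terse": {"quality": 0.60, "cost_usd": 0.0005, "latency_ms": 400.0},
}
```

With these numbers, "terse" is cheapest and fastest but is excluded by the quality threshold, so optimizing for cost or latency selects "concise".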

7. Test Suites

Organize tests into suites:

@LLMTest(provider=provider, suite="nlp_tests", name="test_sentiment")
def test_sentiment(ctx):
    ...

@LLMTest(provider=provider, suite="nlp_tests", name="test_translation")
def test_translation(ctx):
    ...

# Run all tests
test_sentiment()
test_translation()

# Get suite summary
suite = LLMTest.get_suite("nlp_tests")
summary = suite.get_summary()

print(f"Pass rate: {summary['pass_rate']:.1f}%")
print(f"Total cost: ${summary['total_cost_usd']:.4f}")
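
The summary fields used above (`pass_rate` as a percentage, `total_cost_usd`) can be derived from per-test outcomes as in this sketch. The `Outcome` record and `summarize` function are hypothetical and only illustrate the arithmetic; the actual summary may contain other fields.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Outcome:
    """Hypothetical per-test record."""
    name: str
    passed: bool
    cost_usd: float


def summarize(outcomes: List[Outcome]) -> Dict[str, float]:
    """Aggregate outcomes into counts, a pass rate in percent, and total cost."""
    total = len(outcomes)
    passed = sum(1 for o in outcomes if o.passed)
    return {
        "total": total,
        "passed": passed,
        "pass_rate": 100.0 * passed / total if total else 0.0,
        "total_cost_usd": sum(o.cost_usd for o in outcomes),
    }


outcomes = [
    Outcome("test_sentiment", True, 0.002),
    Outcome("test_translation", False, 0.001),
]
summary = summarize(outcomes)
```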

🎯 Advanced Features

Streaming Support

@LLMTest(provider=provider)
async def test_streaming(ctx):
    full_content = ""
    
    async for chunk in provider.stream("Explain quantum computing"):
        full_content += chunk.content
        
        if chunk.is_final:
            expect(full_content).to_be_detailed()
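
To try the accumulate-while-streaming pattern without calling a real provider, a fake async stream is enough. In this sketch, `Chunk`, `fake_stream`, and `collect` are hypothetical stand-ins; only the `content` and `is_final` attribute names are taken from the example above.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator


@dataclass
class Chunk:
    """Hypothetical stream chunk with the two attributes used above."""
    content: str
    is_final: bool


async def fake_stream(text: str) -> AsyncIterator[Chunk]:
    """Yield the text one word at a time, marking the last chunk as final."""
    words = text.split()
    for i, word in enumerate(words):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield Chunk(content=word + " ", is_final=(i == len(words) - 1))


async def collect(stream: AsyncIterator[Chunk]) -> str:
    """Concatenate chunk contents until the stream is exhausted."""
    full_content = ""
    async for chunk in stream:
        full_content += chunk.content
    return full_content.strip()


result = asyncio.run(collect(fake_stream("Qubits can be in superposition")))
```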

Custom Assertions

import re

def is_valid_email(text: str) -> bool:
    # Simple shape check: local part, "@", then a domain ending in a dot suffix
    pattern = r'^[\w\.-]+@[\w\.-]+\.\w+$'
    return bool(re.match(pattern, text))

expect(response.content).to_satisfy(
    is_valid_email,
    message="Response must be a valid email"
)

Semantic Deduplication

from pyllmtest.utils.semantic import semantic_deduplication

texts = [
    "Machine learning is a subset of AI",
    "ML is part of artificial intelligence",  # Similar to above
    "Deep learning uses neural networks"
]

unique_texts = semantic_deduplication(texts, provider, threshold=0.95)
# Returns: ["Machine learning is a subset of AI", "Deep learning uses neural networks"]
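
Threshold-based deduplication is often implemented greedily: keep a text only if it is not too similar to anything already kept. The sketch below shows that pattern under the assumption that the library works along similar lines; `deduplicate` and `char_similarity` are hypothetical, and character-level similarity from `difflib` stands in for embedding similarity, so it will not recognize paraphrases the way the example above does.

```python
import difflib
from typing import Callable, List


def char_similarity(a: str, b: str) -> float:
    """Case-insensitive character-level similarity in [0, 1]."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()


def deduplicate(texts: List[str],
                similarity: Callable[[str, str], float] = char_similarity,
                threshold: float = 0.95) -> List[str]:
    """Greedy pass: keep a text only if it is below the threshold against every kept text."""
    kept: List[str] = []
    for text in texts:
        if all(similarity(text, seen) < threshold for seen in kept):
            kept.append(text)
    return kept


unique = deduplicate(["Hello world", "hello world", "Goodbye"])
```

Because the similarity function is a parameter, an embedding-based function can be swapped in without changing the greedy loop.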

Semantic Clustering

from pyllmtest.utils.semantic import cluster_texts

texts = [
    "Python is great for AI",
    "JavaScript is used for web dev",
    "TensorFlow is an ML framework",
    "React is a web framework"
]

clusters = cluster_texts(texts, provider, num_clusters=2)
# Groups similar texts together

📊 Reporting

Console Reports

# Print a formatted summary to the console
metrics.print_summary()

Output:

============================================================
METRICS SUMMARY
============================================================
Total Requests: 10
Total Tokens: 5,420
  Prompt Tokens: 2,100
  Completion Tokens: 3,320
Total Cost: $0.0542

Latency:
  Average: 1,234.56ms
  Min: 890.12ms
  Max: 2,100.45ms
  P50: 1,200.00ms
  P95: 1,800.00ms
  P99: 2,000.00ms
============================================================

Export Options

# JSON export
metrics.export_json("report.json")

# CSV export (detailed request log)
metrics.export_csv("requests.csv")

🔧 Configuration

Environment Variables

# OpenAI
export OPENAI_API_KEY=your-key

# Anthropic
export ANTHROPIC_API_KEY=your-key
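
The "explicit key or environment variable" behavior noted in the provider examples usually comes down to a small lookup like the one sketched here. The precedence (explicit argument first, then the environment) is an assumption, and `resolve_api_key` is a hypothetical helper, not a PyLLMTest function.

```python
import os
from typing import Optional


def resolve_api_key(explicit_key: Optional[str], env_var: str) -> str:
    """Prefer an explicitly passed key; otherwise fall back to the environment."""
    key = explicit_key or os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"No API key: pass one explicitly or set {env_var}")
    return key
```

Keeping keys in environment variables avoids committing them to source control alongside the tests.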

Provider Configuration

provider = OpenAIProvider(
    model="gpt-4-turbo",
    timeout=60,
    max_retries=3,
    temperature=0.7
)
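
A setting like `max_retries=3` typically means "retry a failed call up to three more times before giving up". The following sketch shows that pattern with exponential backoff; it is a generic illustration, not the provider's retry logic, and `call_with_retries`, `base_delay`, and `flaky` are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(fn: Callable[[], T], max_retries: int = 3,
                      base_delay: float = 0.5) -> T:
    """Call fn; on failure, retry up to max_retries times with exponential backoff."""
    attempt = 0
    while True:
        try:
            return fn()
        except Exception:
            if attempt >= max_retries:
                raise  # retries exhausted: surface the last error
            time.sleep(base_delay * (2 ** attempt))
            attempt += 1


calls = {"count": 0}


def flaky() -> str:
    """Fail twice with a transient error, then succeed."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


result = call_with_retries(flaky, max_retries=3, base_delay=0.0)
```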

📖 Examples

Check out the examples/ directory for:

  • comprehensive_example.py - All features demonstrated
  • basic_testing.py - Simple getting started
  • rag_testing.py - RAG system testing
  • prompt_optimization.py - Prompt A/B testing
  • async_testing.py - Async patterns

🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

🙏 Acknowledgments

Built with ❤️ for the AI community.

Special thanks to:

  • OpenAI for their amazing APIs
  • Anthropic for Claude
  • The Python testing community

Rahul Malik

⭐ Star History

If you find PyLLMTest useful, please consider giving it a star on GitHub!


📄 License

MIT License - see LICENSE file for details.

Copyright (c) 2024 Rahul Malik


Made with 🚀 by Rahul Malik

Making LLM testing as easy as it should be.
