
llm_app_test


A semantic testing framework for LLM applications that uses LLMs to validate semantic equivalence in test outputs.

✨ Test your LLM apps in minutes, not hours

🚀 CI/CD ready out of the box

💰 Cost-effective testing solution

🔧 No infrastructure needed

You can go straight to the docs here: https://Shredmetal.github.io/llmtest/

Due to the number of downloads I am seeing on pypistats.org, I am including these instructions in case a beta update breaks something on your end:

Emergency Rollback Instructions

If you experience issues with version 0.1.0b4, you can roll back to the previous stable version (0.1.0b3.post3) using one of these methods:

Method 1: Direct Installation of Previous Version

pip uninstall llm-app-test 
pip install llm-app-test==0.1.0b3.post3

Method 2: Force Reinstall (if Method 1 fails)

pip install --force-reinstall llm-app-test==0.1.0b3.post3

Verification

After rolling back, verify the installation:

import llm_app_test
print(llm_app_test.__version__)  # Should show 0.1.0b3.post3
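
If the installed package does not expose a version attribute in your environment, a fallback that relies only on the standard library is to read the distribution metadata (assuming the distribution name llm-app-test):

from importlib.metadata import version

# Read the installed version straight from the package metadata
print(version("llm-app-test"))  # Should show 0.1.0b3.post3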

What llm_app_test Does

  • Tests LLM applications (not the LLMs themselves)
  • Validates system message + prompt template outputs
  • Ensures semantic equivalence of responses
  • Tests the parts YOU control in your LLM application (see the sketch after this list)
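
For example, here is a minimal pytest-style sketch of testing the part you control; generate_greeting below is a hypothetical stand-in for your own system message + prompt template code:

from llm_app_test.semanticassert.semantic_assert import SemanticAssertion

def generate_greeting(name: str) -> str:
    # Hypothetical application code: in a real app this would run your
    # system message + prompt template through your LLM of choice.
    return f"Hello {name}, how are you today?"

def test_greeting_is_polite_and_personalised():
    semantic_assert = SemanticAssertion()
    semantic_assert.assert_semantic_match(
        actual=generate_greeting("Alice"),
        expected_behavior="A polite greeting addressing Alice by name"
    )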

What llm_app_test Doesn't Do

  • Test LLM model performance (that's the provider's responsibility)
  • Validate base model capabilities
  • Test model reliability
  • Handle model safety features

Screenshots

What if you could just use:

semantic_assert.assert_semantic_match(
    actual=actual_output,
    expected_behavior=expected_behavior
)

and get a pass/fail to test your LLM apps? Well, that's what I'm trying to do. Anyway, seeing is believing so:

Here's llm_app_test passing a test case:

[Screenshot: test_pass.jpg]

Here's llm_app_test failing a test case (and providing the reason why it failed):

[Screenshot: test_fail.jpg]

Finally, here's llm_app_test passing a test case involving a complex reasoning chain, given the simple, natural language instruction of:

A complex, multi-step, scientific explanation.
Must maintain logical consistency across all steps.

[Screenshot: complex_reasoning_chain_pass.jpg]

Why llm_app_test?

Testing LLM applications is challenging because:

  • Outputs are non-deterministic
  • Semantic meaning matters more than exact matches
  • Traditional testing approaches don't work well
  • Integration into CI/CD pipelines is complex

llm_app_test solves these challenges by:

  • Using LLMs to evaluate semantic equivalence (see the comparison sketch after this list)
  • Providing a clean, maintainable testing framework
  • Offering simple CI/CD integration
  • Supporting multiple LLM providers
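
To make the contrast concrete, here is a short sketch: an exact-match assert breaks whenever the model rephrases its answer, while the semantic assertion describes the expected behaviour instead (the sample output string is illustrative):

from llm_app_test.semanticassert.semantic_assert import SemanticAssertion

actual_output = "Hi Alice! Hope you're doing well today."

# Brittle: any rewording from the LLM fails this test.
# assert actual_output == "Hello Alice, how are you?"

# Robust: describe the expected behaviour and let the framework judge
# semantic equivalence.
SemanticAssertion().assert_semantic_match(
    actual=actual_output,
    expected_behavior="A friendly greeting addressing Alice"
)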

When to Use llm_app_test

  • Testing application-level LLM integration
  • Validating prompt engineering
  • Testing system message effectiveness
  • Ensuring consistent response patterns

When Not to Use llm_app_test

  • Testing base LLM performance
  • Evaluating model capabilities
  • Testing model safety features

Quick Example

from llm_app_test.semanticassert.semantic_assert import SemanticAssertion

semantic_assert = SemanticAssertion()
semantic_assert.assert_semantic_match(
    actual="Hello Alice, how are you?",
    expected_behavior="A polite greeting addressing Alice"
)

Installation

pip install llm-app-test

Documentation

Full documentation available at: https://Shredmetal.github.io/llmtest/

  • Installation Guide
  • Quick Start Guide
  • API Reference
  • Best Practices
  • CI/CD Integration
  • Configuration Options

License

MIT

Contributing

This project is at an early stage and aims to be an important testing library for LLM applications.

Want to contribute? Great! Some areas we're looking for help:

  • Additional LLM provider support
  • Performance optimizations
  • Test coverage improvements
  • Documentation
  • CI/CD integration examples
  • Test result caching
  • Literally anything else you can think of, I'm all out of ideas, I'm not even sure starting this project was a smart one.

Please:

  1. Fork the repository
  2. Create a feature branch
  3. Submit a Pull Request

For major changes, please open an issue first to discuss what you would like to change, or YOLO in a PR, bonus points if you can insult me in a way that makes me laugh.

Please adhere to clean code principles and include appropriate tests... or else. 🗡️

Reporting Issues

If you encounter issues:

  1. Create an issue on our GitHub repository
  2. Include your Python version and environment details
  3. Describe the problem you encountered with version 0.1.0b4

🆘 Support

⚠️ Important Note About Rate Limits - If Running Large Numbers of Tests (a pacing sketch follows the tables below):

Anthropic Rate Limits:

Tier 1:

| Model | Maximum Requests per minute (RPM) | Maximum Tokens per minute (TPM) | Maximum Tokens per day (TPD) |
|---|---|---|---|
| Claude 3.5 Sonnet 2024-10-22 | 50 | 40,000 | 1,000,000 |
| Claude 3.5 Sonnet 2024-06-20 | 50 | 40,000 | 1,000,000 |
| Claude 3 Opus | 50 | 20,000 | 1,000,000 |

Tier 2:

| Model | Maximum Requests per minute (RPM) | Maximum Tokens per minute (TPM) | Maximum Tokens per day (TPD) |
|---|---|---|---|
| Claude 3.5 Sonnet 2024-10-22 | 1,000 | 80,000 | 2,500,000 |
| Claude 3.5 Sonnet 2024-06-20 | 1,000 | 80,000 | 2,500,000 |
| Claude 3 Opus | 1,000 | 40,000 | 2,500,000 |

OpenAI Rate Limits

Tier 1:

| Model | RPM | RPD | TPM | Batch Queue Limit |
|---|---|---|---|---|
| gpt-4o | 500 | - | 30,000 | 90,000 |
| gpt-4o-mini | 500 | 10,000 | 200,000 | 2,000,000 |
| gpt-4o-realtime-preview | 100 | 100 | 20,000 | - |
| gpt-4-turbo | 500 | - | 30,000 | 90,000 |

Tier 2:

| Model | RPM | TPM | Batch Queue Limit |
|---|---|---|---|
| gpt-4o | 5,000 | 450,000 | 1,350,000 |
| gpt-4o-mini | 5,000 | 2,000,000 | 20,000,000 |
| gpt-4o-realtime-preview | 200 | 40,000 | - |
| gpt-4-turbo | 5,000 | 450,000 | 1,350,000 |
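
If your suite is large enough to brush up against these limits, one simple mitigation is to pace test execution. A minimal sketch using a pytest autouse fixture (the one-second delay is an assumption; tune it to your provider and tier):

import time
import pytest

@pytest.fixture(autouse=True)
def pace_llm_calls():
    # Crude throttle: wait after each test so a large suite stays under
    # the provider's requests-per-minute limit.
    yield
    time.sleep(1)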

