A framework for testing LLM performance using pydanticAI
LLM Tester
A powerful Python framework for benchmarking, comparing, and optimizing various LLM providers through structured data extraction tasks. The framework relies on Pydantic models to define the data structures and reports a percentage accuracy score and a cost for each provider.
Purpose
LLM Tester solves three key challenges in LLM development and evaluation:
- Consistent Evaluation: Objectively measure how accurately different LLMs extract structured data
- Prompt Optimization: Automatically refine prompts to improve extraction accuracy
- Cost Analysis: Track token usage and costs across providers to optimize for performance/cost ratio
The framework is designed to help you determine which LLM provider and model best suits your specific data extraction needs, while also helping optimize prompts for maximum accuracy.
You will quickly see that accuracy fluctuates considerably even between runs. I also use this to spot when models are "having a bad day." Multi-pass runs and sway (run-to-run variance) calculation are also on the way, as well as sway calculation over time.
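The run-to-run fluctuation ("sway") mentioned above can be quantified with a simple mean/spread pass over repeated accuracy scores. This is an illustrative stdlib sketch, not framework code, and the score list is hypothetical:

```python
import statistics

def sway(accuracy_scores: list[float]) -> dict:
    """Summarize run-to-run fluctuation of accuracy scores (in percent)."""
    return {
        "mean": statistics.mean(accuracy_scores),
        "stdev": statistics.stdev(accuracy_scores),  # sample standard deviation
        "spread": max(accuracy_scores) - min(accuracy_scores),
    }

# Five hypothetical runs of the same test case against one provider
summary = sway([91.0, 87.5, 93.0, 85.0, 90.5])
print(summary)
```

Tracking this summary over time is what makes "bad day" detection possible: a provider whose mean holds steady while its spread grows is becoming less reliable even if headline accuracy looks unchanged.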
Architecture
LLM Tester features a flexible, pluggable architecture that supports multiple integration methods:
Pluggable LLM Providers
The system supports three types of provider implementations:
- Native Implementations: Direct integration with provider APIs (OpenAI, Anthropic, Mistral, Google, OpenRouter)
  - Provider-specific code is encapsulated in dedicated classes
  - Each provider has standardized configuration in config.json (Note: OpenRouter dynamically fetches model details such as cost and limits from its API, overriding static config values)
  - Token usage and costs are automatically tracked
- PydanticAI Integration: Use the PydanticAI library as an abstraction layer
  - Leverage PydanticAI's structured data extraction capabilities
  - Benefit from PydanticAI's optimizations and error handling
  - Use the same Pydantic models across different providers
- Mock Implementations: Test without API keys
  - Simulate provider responses for development and testing
  - Include realistic token counts and timing
  - Great for CI/CD pipelines or offline development
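A mock implementation boils down to returning a canned, schema-shaped response together with plausible token counts. The following stdlib sketch shows the idea only; the class and field names are illustrative, not the framework's actual API:

```python
import json
import time

class MockProvider:
    """Returns a canned response instead of calling a real LLM API."""

    def __init__(self, canned: dict):
        self.canned = canned

    def get_response(self, prompt: str, source: str) -> dict:
        start = time.monotonic()
        text = json.dumps(self.canned)
        # Rough token estimate: ~4 characters per token
        return {
            "text": text,
            "prompt_tokens": (len(prompt) + len(source)) // 4,
            "completion_tokens": len(text) // 4,
            "elapsed_s": time.monotonic() - start,
        }

mock = MockProvider({"title": "Senior Engineer", "company": "Acme"})
resp = mock.get_response("Extract the job ad fields.", "Acme hires a Senior Engineer...")
print(resp["text"])
```

Because the response shape matches a real provider's, the rest of the pipeline (validation, accuracy scoring, cost tracking) can run unchanged in CI without any API keys.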
Adding a new provider requires minimal effort - just create a directory under llm_tester/llms/ with a provider implementation and configuration file.
Features
- Test multiple LLM providers (OpenAI, Anthropic, Mistral, Google)
- Validate responses against Pydantic models
- Calculate accuracy compared to expected results
- Optimize prompts for better performance
- Generate detailed test reports
- Centralized configuration management
- Enhanced mock response system for testing without API keys
- Track token usage and cost across providers
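The accuracy calculation listed above is, at its core, a field-by-field comparison between extracted and expected data. A simplified stdlib sketch of such scoring (the framework's real algorithm may handle nested fields and partial matches differently):

```python
def accuracy(extracted: dict, expected: dict) -> float:
    """Percentage of expected fields whose extracted value matches exactly."""
    if not expected:
        return 100.0
    matches = sum(1 for key, value in expected.items() if extracted.get(key) == value)
    return 100.0 * matches / len(expected)

expected = {"title": "Data Engineer", "company": "Acme", "remote": True}
extracted = {"title": "Data Engineer", "company": "ACME Inc.", "remote": True}
print(f"{accuracy(extracted, expected):.1f}%")  # 2 of 3 fields match
```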
Supported Models
- Job Advertisements: Extract structured job information including title, company, skills, etc.
- Product Descriptions: Extract product details including specifications, pricing, etc.
Installation
# Clone the repository
# git clone https://github.com/yourusername/llm-tester.git # Replace with actual repo URL
cd llm-tester
# Create and activate virtual environment (recommended)
python3 -m venv venv
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Configure API Keys (Interactive)
python -m llm_tester.cli configure keys
# This will prompt for missing keys found in provider configs and offer to save them to llm_tester/.env
Make sure your API keys are set in llm_tester/.env or as environment variables. The configure keys command helps with this.
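For reference, llm_tester/.env holds one key per provider; the exact variable name for each provider is defined by env_key in its config.json. The names below are typical assumptions, not guaranteed:

```
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
MISTRAL_API_KEY=...
GOOGLE_API_KEY=...
OPENROUTER_API_KEY=sk-or-...
```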
Running via CLI
The primary way to run tests and manage the tool is via the llm-tester command-line interface (after installation via pip install -e .).
# Make sure the virtual environment is activated
source venv/bin/activate
# Show help and available commands
llm-tester --help
# --- Running Tests ---
# Run tests using all enabled providers and their default models
llm-tester run
# Run tests for specific providers
llm-tester run --providers openai anthropic
# Run tests using specific LLM models for providers
llm-tester run --providers openai openrouter --models openai:gpt-4o --models openrouter:google/gemini-pro-1.5
# Run tests and save report to a file
llm-tester run --output my_report.md
# Run tests with prompt optimization
llm-tester run --optimize
# Output test results as JSON instead of Markdown
llm-tester run --json
# Filter tests by name (e.g., only 'simple' tests in 'job_ads') - Note: Filtering not fully implemented yet
# llm-tester run --filter job_ads/simple
# Increase verbosity for debugging
llm-tester run -vv
# --- Listing Information ---
# List available extraction schemas (test modules)
llm-tester schemas list
# List available test cases and configured providers/models without running tests
llm-tester list
# List specific providers and their models for the list command
llm-tester list --providers openai --models openai:gpt-4o
# --- Configuration & Management ---
# Configure API Keys (Interactive Prompt)
llm-tester configure keys
# List all discoverable providers and their enabled/disabled status
llm-tester providers list
# Enable a provider (adds to or creates enabled_providers.json)
llm-tester providers enable openrouter
# Disable a provider (removes from enabled_providers.json)
llm-tester providers disable google
# List LLM models within a specific provider's config and their status
llm-tester providers manage list openrouter
# Enable a specific LLM model within a provider's config
llm-tester providers manage enable openrouter anthropic/claude-3-haiku
# Disable a specific LLM model within a provider's config
llm-tester providers manage disable openai gpt-3.5-turbo
# Update LLM Model Info (e.g., pricing/limits) from OpenRouter API
llm-tester providers manage update openrouter
# Get LLM-assisted model recommendations for a task (Interactive Prompt)
llm-tester recommend-model
# --- Interactive Mode ---
# Launch the interactive menu
llm-tester interactive
Usage
from llm_tester import LLMTester
# Initialize tester with providers
tester = LLMTester(providers=["openai", "anthropic", "google", "mistral"])
# Run tests
results = tester.run_tests()
# Generate report
report = tester.generate_report(results)
print(report)
# Run optimized tests
optimized_results = tester.run_optimized_tests()
optimized_report = tester.generate_report(optimized_results, optimized=True)
Provider System
LLM Tester uses a pluggable provider system that makes it easy to add and configure different LLM providers:
Native Provider Integration
To use a native provider integration:
tester = LLMTester(providers=["openai", "anthropic", "google", "mistral"])
Native providers directly call the respective provider's API with optimized parameters.
PydanticAI Integration
To use the PydanticAI integration:
tester = LLMTester(providers=["pydantic_ai"])
This will use PydanticAI's extraction capabilities with your specified model.
Mock Testing
For testing without API keys:
tester = LLMTester(providers=["mock"])
Mock providers simulate responses based on the test case structure.
Adding New Providers
- Create a new directory in llm_tester/llms/your_provider/
- Implement a provider class that inherits from BaseLLM (see llm_tester/llms/base.py)
- Create a config.json file with provider settings (name, env_key, etc.) and a list of models with their details (cost, tokens). See existing provider configs for examples. (Note: For OpenRouter, costs and token limits are fetched dynamically via update-models or on load, overriding static values in config.json.)
- Add from .provider import YourProviderClass to the provider's __init__.py to ensure discovery
- Optionally, enable the provider using python -m llm_tester.cli providers enable your_provider
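Putting the steps above together, a provider's config.json might look something like this. The field names follow the description above, but the exact keys and pricing units may differ, so check an existing provider config before copying:

```json
{
  "name": "your_provider",
  "env_key": "YOUR_PROVIDER_API_KEY",
  "models": [
    {
      "name": "your-model-small",
      "default": true,
      "cost_input": 0.25,
      "cost_output": 1.25,
      "max_input_tokens": 128000,
      "max_output_tokens": 4096
    }
  ]
}
```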
Adding New Extraction Models
- Create a new directory in llm_tester/models/your_model_type/
- Implement your Pydantic model in model.py with these components:
  - Define your model class extending BaseModel
  - Add class variables for module configuration: MODULE_NAME, TEST_DIR, REPORT_DIR
  - Implement the get_test_cases() class method
  - Implement the save_module_report() and save_module_cost_report() class methods
- Create the test structure:
  - Create llm_tester/models/your_model_type/tests/ directory
  - Add sources/ for input data files
  - Add prompts/ for prompt templates
  - Add expected/ for expected output JSON
  - Create reports/ directory for module-specific reports
- Add appropriate __init__.py files to ensure proper imports
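A minimal sketch of such a model.py, assuming pydantic is installed. The class variables mirror the configuration names described above, but the exact hooks the framework expects (get_test_cases() and the report methods are omitted here) should be copied from an existing module such as job_ads:

```python
from typing import ClassVar, List, Optional

from pydantic import BaseModel

class JobAd(BaseModel):
    """Extraction schema for job advertisements."""

    # Module configuration read by the test discovery machinery
    MODULE_NAME: ClassVar[str] = "job_ads"
    TEST_DIR: ClassVar[str] = "tests"
    REPORT_DIR: ClassVar[str] = "reports"

    # Fields the LLM must extract from the source text
    title: str
    company: str
    skills: List[str] = []
    salary: Optional[str] = None

ad = JobAd(title="Data Engineer", company="Acme", skills=["Python", "SQL"])
print(ad.title, ad.skills)
```

ClassVar-annotated attributes are treated as plain class configuration by Pydantic rather than as extraction fields, which is what allows module metadata and schema fields to live on the same class.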
NOTE: You can also add a new model/module with the CLI tool.
Verifying Provider Setup
You can verify your provider setup, check configurations, and see LLM model availability using the CLI:
# List discovered providers and enabled status
llm-tester providers list
# List LLM models within a specific provider's config
llm-tester providers manage list <provider_name>
# Check API keys (will prompt if missing)
llm-tester configure keys
The old ./verify_providers.py script is no longer used; use the commands above instead.
General implementation notes
This package was initially written using Claude Code with only minimal manual intervention and edits. Further improvements were made with Cline, using Gemini 2.5. All LLM-generated code is reviewed and tested by the author, and all architectural decisions are mine.
License
MIT
© 2025 Timo Railo