A framework for testing LLM performance using PydanticAI

LLM Tester

A powerful Python framework for benchmarking, comparing, and optimizing LLM providers through structured data extraction tasks. The framework relies on Pydantic models to define data structures and reports a percentage accuracy score and a cost for each provider.

Purpose

LLM Tester solves three key challenges in LLM development and evaluation:

  1. Consistent Evaluation: Objectively measure how accurately different LLMs extract structured data
  2. Prompt Optimization: Automatically refine prompts to improve extraction accuracy
  3. Cost Analysis: Track token usage and costs across providers to optimize for performance/cost ratio

The framework is designed to help you determine which LLM provider and model best suits your specific data extraction needs, while also helping optimize prompts for maximum accuracy.
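Field-level accuracy scoring against an expected result can be sketched as below. This is an illustrative simplification, not the framework's actual scoring code, which may weight fields or handle nested structures differently.

```python
# Hypothetical sketch of percentage-accuracy scoring: compare each expected
# top-level field against what the model extracted.

def accuracy(expected: dict, extracted: dict) -> float:
    """Return the percentage of expected fields extracted correctly."""
    if not expected:
        return 100.0
    correct = sum(1 for k, v in expected.items() if extracted.get(k) == v)
    return 100.0 * correct / len(expected)

expected = {"title": "Data Engineer", "company": "Acme", "remote": True}
extracted = {"title": "Data Engineer", "company": "Acme Inc", "remote": True}
print(accuracy(expected, extracted))  # 2 of 3 fields match
```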

You will quickly see that accuracy fluctuates considerably even between runs. I also use this to spot when models are "having a bad day." Multi-pass runs and variance ("sway") calculation are on the way, including sway calculation over time.

Architecture

LLM Tester features a flexible, pluggable architecture that supports multiple integration methods:

Pluggable LLM Providers

The system supports three types of provider implementations:

  1. Native Implementations: Direct integration with provider APIs (OpenAI, Anthropic, Mistral, Google, OpenRouter)

    • Provider-specific code is encapsulated in dedicated classes
    • Each provider has standardized configuration in config.json (Note: OpenRouter dynamically fetches model details like cost/limits from its API, overriding static config values).
    • Token usage and costs are automatically tracked
  2. PydanticAI Integration: Use the PydanticAI library as an abstraction layer

    • Leverage PydanticAI's structured data extraction capabilities
    • Benefit from PydanticAI's optimizations and error handling
    • Use the same Pydantic models across different providers
  3. Mock Implementations: Test without API keys

    • Simulate provider responses for development and testing
    • Include realistic token counts and timing
    • Great for CI/CD pipelines or offline development

Adding a new provider requires minimal effort: just create a directory under llm_tester/llms/ with a provider implementation and configuration file.

Features

  • Test multiple LLM providers (OpenAI, Anthropic, Mistral, Google)
  • Validate responses against Pydantic models
  • Calculate accuracy compared to expected results
  • Optimize prompts for better performance
  • Generate detailed test reports
  • Centralized configuration management
  • Enhanced mock response system for testing without API keys
  • Track token usage and cost across providers
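Cost tracking boils down to multiplying token counts by per-token prices. The sketch below uses made-up placeholder rates and model names; real prices come from each provider's config (or, for OpenRouter, its API).

```python
# Sketch of per-run cost calculation from token counts.
# Prices and model names here are illustrative placeholders, not real rates.

PRICES_PER_MTOK = {  # (input, output) USD per million tokens -- hypothetical
    "openai:gpt-4o": (2.50, 10.00),
    "anthropic:claude-3-haiku": (0.25, 1.25),
}

def run_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    p_in, p_out = PRICES_PER_MTOK[model]
    return (prompt_tokens * p_in + completion_tokens * p_out) / 1_000_000

cost = run_cost("openai:gpt-4o", 1200, 300)
print(f"${cost:.6f}")
```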

Built-in Extraction Models

  1. Job Advertisements

    • Extract structured job information including title, company, skills, etc.
  2. Product Descriptions

    • Extract product details including specifications, pricing, etc.
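A job-ad extraction schema might look like the following. This is a hypothetical sketch; the package's actual JobAd model likely defines more fields and validators.

```python
# Illustrative Pydantic schema for job-ad extraction (field names assumed).
from typing import List, Optional
from pydantic import BaseModel

class JobAd(BaseModel):
    title: str
    company: str
    skills: List[str] = []
    salary_range: Optional[str] = None  # optional: many ads omit salary

ad = JobAd(title="Data Engineer", company="Acme", skills=["Python", "SQL"])
print(ad.title, ad.company)
```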

Installation

# Clone the repository
# git clone https://github.com/yourusername/llm-tester.git # Replace with actual repo URL
cd llm-tester

# Create and activate virtual environment (recommended)
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Configure API Keys (Interactive)
python -m llm_tester.cli configure keys
# This will prompt for missing keys found in provider configs and offer to save them to llm_tester/.env

Make sure your API keys are set in llm_tester/.env or as environment variables. The configure keys command helps with this.

Running via CLI

The primary way to run tests and manage the tool is via the llm-tester command-line interface (after installation via pip install -e .).

# Make sure the virtual environment is activated
source venv/bin/activate

# Show help and available commands
llm-tester --help

# --- Running Tests ---

# Run tests using all enabled providers and their default models
llm-tester run

# Run tests for specific providers
llm-tester run --providers openai anthropic

# Run tests using specific LLM models for providers
llm-tester run --providers openai openrouter --models openai:gpt-4o --models openrouter:google/gemini-pro-1.5

# Run tests and save report to a file
llm-tester run --output my_report.md

# Run tests with prompt optimization
llm-tester run --optimize

# Output test results as JSON instead of Markdown
llm-tester run --json

# Filter tests by name (e.g., only 'simple' tests in 'job_ads') - Note: Filtering not fully implemented yet
# llm-tester run --filter job_ads/simple

# Increase verbosity for debugging
llm-tester run -vv

# --- Listing Information ---

# List available extraction schemas (test modules)
llm-tester schemas list

# List available test cases and configured providers/models without running tests
llm-tester list

# List specific providers and their models for the list command
llm-tester list --providers openai --models openai:gpt-4o

# --- Configuration & Management ---

# Configure API Keys (Interactive Prompt)
llm-tester configure keys

# List all discoverable providers and their enabled/disabled status
llm-tester providers list

# Enable a provider (adds to or creates enabled_providers.json)
llm-tester providers enable openrouter

# Disable a provider (removes from enabled_providers.json)
llm-tester providers disable google

# List LLM models within a specific provider's config and their status
llm-tester providers manage list openrouter

# Enable a specific LLM model within a provider's config
llm-tester providers manage enable openrouter anthropic/claude-3-haiku

# Disable a specific LLM model within a provider's config
llm-tester providers manage disable openai gpt-3.5-turbo

# Update LLM Model Info (e.g., pricing/limits) from OpenRouter API
llm-tester providers manage update openrouter

# Get LLM-assisted model recommendations for a task (Interactive Prompt)
llm-tester recommend-model

# --- Interactive Mode ---

# Launch the interactive menu
llm-tester interactive

Usage

from llm_tester import LLMTester

# Initialize tester with providers
tester = LLMTester(providers=["openai", "anthropic", "google", "mistral"])

# Run tests
results = tester.run_tests()

# Generate report
report = tester.generate_report(results)
print(report)

# Run optimized tests
optimized_results = tester.run_optimized_tests()
optimized_report = tester.generate_report(optimized_results, optimized=True)

Provider System

LLM Tester uses a pluggable provider system that makes it easy to add and configure different LLM providers:

Native Provider Integration

To use a native provider integration:

tester = LLMTester(providers=["openai", "anthropic", "google", "mistral"])

Native providers directly call the respective provider's API with optimized parameters.

PydanticAI Integration

To use the PydanticAI integration:

tester = LLMTester(providers=["pydantic_ai"])

This will use PydanticAI's extraction capabilities with your specified model.

Mock Testing

For testing without API keys:

tester = LLMTester(providers=["mock"])

Mock providers simulate responses based on the test case structure.
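What a mock provider returns can be sketched as canned JSON plus plausible usage numbers. The function and field names below are illustrative, not the package's real mock API.

```python
# Hypothetical mock response: echo the test case's expected output as if a
# model had produced it, with rough word-count "token" figures.
import json

def mock_response(test_case: dict) -> dict:
    payload = json.dumps(test_case["expected"])
    return {
        "text": payload,
        "prompt_tokens": len(test_case["prompt"].split()),
        "completion_tokens": len(payload.split()),
    }

case = {"prompt": "Extract the job title.", "expected": {"title": "Data Engineer"}}
resp = mock_response(case)
print(resp["text"])
```

Because the response is derived from the test case itself, runs are deterministic and need no API keys, which is what makes this useful in CI.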

Adding New Providers

  1. Create a new directory in llm_tester/llms/your_provider/
  2. Implement a provider class that inherits from BaseLLM (see llm_tester/llms/base.py).
  3. Create a config.json file with provider settings (name, env_key, etc.) and a list of models with their details (cost, tokens). See existing provider configs for examples. (Note: For OpenRouter, costs and token limits are fetched dynamically via update-models or on load, overriding static values in config.json).
  4. Add from .provider import YourProviderClass to the provider's __init__.py to ensure discovery.
  5. Optionally, enable the provider using python -m llm_tester.cli providers enable your_provider.
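Step 2 can be sketched as below. The real abstract interface lives in llm_tester/llms/base.py; the method name and signature here are assumptions for illustration, and the base class is stubbed in so the example is self-contained.

```python
# Hypothetical provider skeleton. Consult llm_tester/llms/base.py for the
# real BaseLLM interface; `call` below is an illustrative stand-in.
from abc import ABC, abstractmethod

class BaseLLM(ABC):  # stand-in for llm_tester.llms.base.BaseLLM
    @abstractmethod
    def call(self, prompt: str, model: str) -> str: ...

class YourProvider(BaseLLM):
    def __init__(self, api_key: str):
        self.api_key = api_key

    def call(self, prompt: str, model: str) -> str:
        # A real implementation would invoke the provider's SDK or HTTP API
        # here and return the model's text response.
        raise NotImplementedError("wire up the provider SDK")
```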

Adding New Extraction Models

  1. Create a new directory in llm_tester/models/your_model_type/
  2. Implement your Pydantic model in model.py with these components:
    • Define your model class extending BaseModel
    • Add class variables for module configuration: MODULE_NAME, TEST_DIR, REPORT_DIR
    • Implement the get_test_cases() class method
    • Implement the save_module_report() and save_module_cost_report() class methods
  3. Create the test structure:
    • Create llm_tester/models/your_model_type/tests/ directory
    • Add sources/ for input data files
    • Add prompts/ for prompt templates
    • Add expected/ for expected output JSON
    • Create reports/ directory for module-specific reports
  4. Add appropriate __init__.py files to ensure proper imports

NOTE: You can also add a new model/module with the CLI tool.
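Step 2 above can be sketched as the following module skeleton. The class variables and method signatures are assumptions based on the description; see the existing modules under llm_tester/models/ for the authoritative shape.

```python
# Hypothetical extraction-model module: Pydantic schema plus the module
# configuration class variables and a stubbed get_test_cases().
from typing import ClassVar, List
from pydantic import BaseModel

class ProductReview(BaseModel):  # hypothetical new model type
    MODULE_NAME: ClassVar[str] = "product_reviews"
    TEST_DIR: ClassVar[str] = "llm_tester/models/product_reviews/tests"
    REPORT_DIR: ClassVar[str] = "llm_tester/models/product_reviews/reports"

    product: str
    rating: int

    @classmethod
    def get_test_cases(cls) -> List[dict]:
        # A real implementation would scan TEST_DIR for matching
        # sources/, prompts/, and expected/ files.
        return [{"module": cls.MODULE_NAME, "name": "simple"}]
```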

Verifying Provider Setup

You can verify your provider setup, check configurations, and see LLM model availability using the CLI:

# List discovered providers and enabled status
llm-tester providers list

# List LLM models within a specific provider's config
llm-tester providers manage list <provider_name>

# Check API keys (will prompt if missing)
llm-tester configure keys

The old ./verify_providers.py script is no longer used; use the commands above instead.

General implementation notes

This package was initially written using Claude Code, with only minimal manual intervention and edits. Further improvements were made with Cline, using Gemini 2.5. LLM-generated code is reviewed and tested by the author, and all architectural decisions are mine.

License

MIT


© 2025 Timo Railo

Download files

Download the file for your platform.

Source Distribution

pydantic_llm_tester-0.1.0.tar.gz (99.6 kB view details)

Uploaded Source

Built Distribution


pydantic_llm_tester-0.1.0-py3-none-any.whl (128.6 kB view details)

Uploaded Python 3

File details

Details for the file pydantic_llm_tester-0.1.0.tar.gz.

File metadata

  • Download URL: pydantic_llm_tester-0.1.0.tar.gz
  • Upload date:
  • Size: 99.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for pydantic_llm_tester-0.1.0.tar.gz
Algorithm Hash digest
SHA256 02a9ff99190138514950dc3ec2d8d161158d95c7a146b60fd02d9ba2ff83026b
MD5 d816923f65bd003d61fafe5e8421448d
BLAKE2b-256 752176b3b109b2eb3e3859a23d76716f43980d671d891b5e9aed38ff59505d66


Provenance

The following attestation bundles were made for pydantic_llm_tester-0.1.0.tar.gz:

Publisher: publish-to-pypi.yml on madviking/pydantic-llm-tester

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file pydantic_llm_tester-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for pydantic_llm_tester-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 0458e116fae5656c58e1b2554ca8e90cbd346d162d3cf9a5ad9eb9c477f04fd1
MD5 41c303852c2e3b5afec574ea01ac9237
BLAKE2b-256 fba5489852c6f72c682ace367f26ddb8b82d132a3768cd938f7bcdfc9e9da855


Provenance

The following attestation bundles were made for pydantic_llm_tester-0.1.0-py3-none-any.whl:

Publisher: publish-to-pypi.yml on madviking/pydantic-llm-tester

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
