A lightweight LLM evaluation framework for comparing and testing AI providers
JUDGE LLM
A lightweight, extensible Python framework for evaluating and comparing LLM providers. Test your AI agents systematically with multi-turn conversations, cost tracking, and comprehensive reporting.
Purpose
JUDGE LLM helps you evaluate AI agents and LLM providers by running test cases against your models and measuring:
- Response quality (exact matching, semantic similarity, ROUGE scores)
- Cost & latency (token usage, execution time, budget compliance)
- Conversation flow (tool uses, multi-turn interactions)
- Safety & custom metrics (extensible evaluation logic)
Perfect for regression testing, A/B testing providers, and ensuring production-grade quality.
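To make the response-quality metrics above concrete, here is a minimal sketch of a ROUGE-1 style score checked against a threshold. It is an illustration only, not JUDGE LLM's implementation; the function names `rouge1_f1` and `passes` are hypothetical, and only the `similarity_threshold` name comes from the configs in this README.

```python
from collections import Counter


def rouge1_f1(expected: str, actual: str) -> float:
    """Unigram-overlap F1 between a reference and a candidate response."""
    ref = Counter(expected.lower().split())
    cand = Counter(actual.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def passes(expected: str, actual: str, similarity_threshold: float = 0.8) -> bool:
    """Hypothetical gate: the score must reach the configured threshold."""
    return rouge1_f1(expected, actual) >= similarity_threshold
```

A response that covers only half of the reference tokens scores below 0.8 in this sketch, so it would fail a gate set at that threshold.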
Features
- Multiple Providers: Gemini, Google ADK, ADK HTTP, Mock, and custom providers with registry-based extensibility
- Built-in Evaluators: Response similarity, trajectory validation, cost/latency checks, embedding similarity, LLM-as-judge, sub-agent chain validation
- Custom Components: Create and register custom providers, evaluators, and reporters
- Registry System: Register once in defaults, use everywhere by name
- Rich Reports: Console tables, interactive HTML dashboard, JSON exports, SQLite database, plus custom reporters
- Parallel Execution: Run evaluations concurrently with configurable workers
- Quality Gates: Fail CI/CD builds when thresholds are violated (configurable)
- Config-Driven: YAML configs with smart defaults or programmatic Python API
- Default Config: Reusable configurations with component registration
- Per-Test Overrides: Fine-tune evaluator thresholds per test case
- Environment Variables: Auto-loads .env for secure API key management
- Telemetry: Optional OpenTelemetry tracing with Phoenix, Jaeger, and OTLP support
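The per-test override feature can be pictured as a two-level merge: evaluator defaults first, then whatever a single test case overrides. The sketch below assumes a shallow merge and a hypothetical helper name, `effective_config`; the library's actual merge rules may differ.

```python
from typing import Optional


def effective_config(defaults: dict, override: Optional[dict]) -> dict:
    """Return evaluator settings for one test case.

    Assumption: keys in the per-test override win, and keys it does not
    mention fall back to the evaluator-level defaults.
    """
    merged = dict(defaults)       # copy so the shared defaults stay untouched
    merged.update(override or {})
    return merged


evaluator_defaults = {"similarity_threshold": 0.8}
strict_case = effective_config(evaluator_defaults, {"similarity_threshold": 0.95})
normal_case = effective_config(evaluator_defaults, None)
```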
Installation
From Source
git clone https://github.com/HiHelloAI/judge-llm.git
cd judge-llm
pip install -e .
From PyPI (when published)
pip install judge-llm
With Optional Dependencies
# Install with Gemini provider support
pip install judge-llm[gemini]
# Install with Google ADK provider support
pip install judge-llm[google_adk]
# Install with dev dependencies
pip install judge-llm[dev]
# Install with OpenTelemetry support
pip install judge-llm[telemetry]
# Install with Arize Phoenix observability
pip install judge-llm[phoenix]
Setup Environment Variables
JUDGE LLM automatically loads environment variables from a .env file:
# Copy the example file
cp .env.example .env
# Edit .env and add your API keys
nano .env
.env file:
# Google Gemini API Key
GOOGLE_API_KEY=your-google-api-key-here
The .env file is automatically loaded when you import the library or run the CLI. Never commit .env to version control - it's already in .gitignore.
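For readers unsure what auto-loading a .env file amounts to, the sketch below parses KEY=value lines and copies them into an environment mapping. It is not the loader JUDGE LLM uses; `parse_env` and `apply_env` are hypothetical names, and the choice that already-set variables take precedence is an assumption.

```python
import os


def parse_env(text: str) -> dict:
    """Parse KEY=value lines, skipping blank lines and # comments."""
    pairs = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip optional surrounding quotes from the value.
        pairs[key.strip()] = value.strip().strip("'\"")
    return pairs


def apply_env(pairs: dict, environ=os.environ) -> None:
    """Copy parsed pairs into the environment.

    Assumption: variables that are already set are left alone.
    """
    for key, value in pairs.items():
        environ.setdefault(key, value)
```

Usage would look like `apply_env(parse_env(open(".env").read()))` before any provider reads `GOOGLE_API_KEY`.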
Quick Start
CLI Usage
# Run evaluation from config file
judge-llm run --config config.yaml
# Run with inline arguments (supports .json, .yaml, or .yml)
judge-llm run --dataset ./data/eval.yaml --provider mock --agent-id my_agent --report html --output report.html
# Validate configuration
judge-llm validate --config config.yaml
# List available components
judge-llm list providers
judge-llm list evaluators
judge-llm list reporters
# Generate dashboard from database
judge-llm dashboard --db results.db --output dashboard.html
Python API
from judge_llm import evaluate
# From config file
report = evaluate(config="config.yaml")
# Programmatic API (supports .json, .yaml, or .yml datasets)
report = evaluate(
    dataset={"loader": "local_file", "paths": ["./data/eval.yaml"]},
    providers=[{"type": "mock", "agent_id": "my_agent"}],
    evaluators=[{"type": "response_evaluator", "config": {"similarity_threshold": 0.8}}],
    reporters=[{"type": "console"}, {"type": "html", "output_path": "./report.html"}],
)
print(f"Success: {report.success_rate:.1%} | Cost: ${report.total_cost:.4f}")
Configuration
Minimal config.yaml:
dataset:
  loader: local_file
  paths: [./data/eval.json]  # Supports .json, .yaml, or .yml files
providers:
  - type: gemini
    agent_id: my_agent
    model: gemini-2.0-flash-exp
evaluators:
  - type: response_evaluator
    config: {similarity_threshold: 0.8}
reporters:
  - type: console
  - type: html
    output_path: ./report.html
Advanced config with quality gates:
agent:
  fail_on_threshold_violation: true  # Exit with error if evaluations fail (default: true)
  parallel_execution: true           # Run tests in parallel
  max_workers: 4                     # Number of parallel workers
  num_runs: 3                        # Run each test 3 times
dataset:
  loader: local_file
  paths: [./data/eval.yaml]
providers:
  - type: gemini
    agent_id: production_agent
    model: gemini-2.0-flash-exp
evaluators:
  - type: response_evaluator
    config:
      similarity_threshold: 0.85  # Minimum 85% similarity required
  - type: cost_evaluator
    config:
      max_cost_per_case: 0.05     # Maximum $0.05 per test
reporters:
  - type: database
    db_path: ./results.db         # Track results over time
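The `parallel_execution`, `max_workers`, and `num_runs` settings above describe a fan-out of (test case, run) jobs over a worker pool. A minimal sketch of that idea with the standard library follows; `run_all` and `run_case` are hypothetical names, and a thread pool is an assumption rather than a statement about how JUDGE LLM schedules work.

```python
from concurrent.futures import ThreadPoolExecutor


def run_all(test_cases, run_case, max_workers=4, num_runs=3):
    """Run every test case num_runs times across a thread pool.

    run_case(case, run_index) is a caller-supplied function (hypothetical)
    that executes one test case once and returns its result.
    """
    jobs = [(case, run) for case in test_cases for run in range(num_runs)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in job order, regardless of finish order.
        return list(pool.map(lambda job: run_case(*job), jobs))
```

Because `map` preserves submission order, results can be matched back to their test case and run index without extra bookkeeping.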
Use in CI/CD:
# Fails with exit code 1 if any evaluator thresholds are violated
judge-llm run --config ci-config.yaml
# Or disable failures for monitoring
# Set fail_on_threshold_violation: false in config
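The quality-gate behavior described here reduces to mapping evaluator outcomes to a process exit code. The sketch below shows that mapping under an assumed result shape (a list of dicts with a `passed` key); neither the shape nor the `exit_code` helper is taken from the library.

```python
def exit_code(results, fail_on_threshold_violation=True):
    """Return 1 when any evaluation failed and the gate is enabled, else 0.

    `results` is a hypothetical list like [{"test": "t1", "passed": True}].
    """
    violated = any(not r["passed"] for r in results)
    return 1 if (violated and fail_on_threshold_violation) else 0


results = [{"test": "t1", "passed": True}, {"test": "t2", "passed": False}]
gated = exit_code(results)                                          # CI build fails
monitor = exit_code(results, fail_on_threshold_violation=False)    # monitoring only
```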
Dataset File Formats:
JUDGE LLM supports both JSON and YAML formats for evaluation datasets. Use whichever format you prefer:
# Using JSON dataset
dataset:
  loader: local_file
  paths: [./data/eval.json]
# Using YAML dataset
dataset:
  loader: local_file
  paths: [./data/eval.yaml]
# Using multiple datasets (mixed formats)
dataset:
  loader: local_file
  paths:
    - ./data/eval1.json
    - ./data/eval2.yaml
# Using directory loader with pattern
dataset:
  loader: directory
  paths: [./data]
  pattern: "*.yaml"  # or "*.json" or "*.yml"
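The directory loader's `pattern` option can be read as a glob filter over file names, restricted to the supported extensions. The sketch below illustrates that selection step only; `select_files` is a hypothetical helper, and the real loader's matching and ordering rules are not documented here.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

SUPPORTED = {".json", ".yaml", ".yml"}  # formats named in this README


def select_files(names, pattern="*.yaml"):
    """Keep paths whose file name matches the glob and has a supported suffix."""
    selected = []
    for name in names:
        path = PurePosixPath(name)
        if path.suffix in SUPPORTED and fnmatchcase(path.name, pattern):
            selected.append(name)
    return sorted(selected)  # sorted for a deterministic load order (assumption)
```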
Google ADK Provider Configuration:
For agents built with Google's Agent Development Kit (ADK):
providers:
  - type: google_adk
    agent_id: my_adk_agent
    agent_metadata:
      module_path: "tool_agent.agent"  # Path to your agent module
      agent_name: "root_agent"         # Agent variable name (default: "root_agent")
      root_path: "."                   # Root path for imports (optional)
The ADK provider automatically:
- Loads your ADK agent from the specified module
- Converts between Judge LLM and ADK formats
- Validates tool usage and conversation flow
- Reports results in standard Judge LLM format
See examples/09-google-adk-agent/ for a complete working example.
ADK HTTP Provider Configuration:
For evaluating agents deployed as remote HTTP services:
providers:
  - type: adk_http
    agent_id: my_remote_agent
    endpoint_url: "http://localhost:8000/run_sse"
    auth_type: bearer  # bearer, api_key, basic, none
    timeout: 60
    app_name: "my_app"
    model: gemini-2.0-flash
evaluators:
  - type: response_evaluator
  - type: trajectory_evaluator
  - type: subagent_evaluator   # Validate agent transfer chains
  - type: llm_judge_evaluator  # LLM-as-judge quality assessment
    config:
      evaluation_type: comprehensive
Lifecycle Callbacks (Extensibility):
The ADK HTTP provider exposes lifecycle callbacks that you can override by subclassing:
| Callback | When | Context keys |
|---|---|---|
| on_before_session_create | Before session creation request | payload, headers, url, app_name, user_id, session_id |
| on_after_session_create | After session created | session_id, response, app_name, user_id |
| on_before_run | Before sending message | payload, headers, message, session_id, system_instruction |
| on_after_run | After collecting events | events, session_id, response |
# my_provider.py
from judge_llm.providers.adk_http_provider import ADKHTTPProvider

class MyADKHTTPProvider(ADKHTTPProvider):
    def on_before_run(self, context):
        # Add a custom header to every run request
        context["headers"]["X-Trace-Id"] = "my-trace"
        return context

# Register and use via config
providers:
  - type: custom
    module_path: ./my_provider.py
    class_name: MyADKHTTPProvider
    register_as: my_adk_http
    agent_id: my_agent
    endpoint_url: http://localhost:8000/run_sse
See examples/10-adk-http-agent/ for a complete working example.
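To show how the lifecycle callbacks fit together without a running endpoint, here is a self-contained sketch of the hook pattern. `SketchProvider` is a hypothetical stand-in, not `ADKHTTPProvider`: it only mimics the order in which `on_before_run` and `on_after_run` are called and the context keys listed in the table, and it replaces the HTTP call with an echo.

```python
class SketchProvider:
    """Hypothetical stand-in that models only the hook flow."""

    def on_before_run(self, context):
        return context  # default: leave the request untouched

    def on_after_run(self, context):
        return context  # default: leave the collected events untouched

    def run(self, message, session_id):
        context = {
            "payload": {"message": message},
            "headers": {},
            "message": message,
            "session_id": session_id,
            "system_instruction": None,
        }
        context = self.on_before_run(context)
        # Stand-in for the HTTP request: echo the message back as one event.
        events = [{"text": f"echo: {context['message']}"}]
        after = self.on_after_run(
            {"events": events, "session_id": session_id, "response": None}
        )
        return context["headers"], after["events"]


class TracingProvider(SketchProvider):
    def on_before_run(self, context):
        context["headers"]["X-Trace-Id"] = "my-trace"
        return context
```

Overriding a single hook is enough to change the outgoing headers while the rest of the run flow stays in the base class.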
See the examples/ directory for complete configuration examples including default configs, custom evaluators, and advanced features.
Custom Component Registration
JUDGE LLM supports registering custom providers, evaluators, and reporters for reuse across projects.
Method 1: Register in Default Config
Create .judge_llm.defaults.yaml in your project root:
# Register custom components once
providers:
  - type: custom
    module_path: ./my_providers/anthropic.py
    class_name: AnthropicProvider
    register_as: anthropic  # ← Use this name in test configs
evaluators:
  - type: custom
    module_path: ./my_evaluators/safety.py
    class_name: SafetyEvaluator
    register_as: safety
reporters:
  - type: custom
    module_path: ./my_reporters/slack.py
    class_name: SlackReporter
    register_as: slack
Then use them by name in any test config:
# test.yaml - clean and simple!
providers:
  - type: anthropic  # ← Uses registered custom provider
    agent_id: claude
evaluators:
  - type: safety     # ← Uses registered custom evaluator
reporters:
  - type: slack      # ← Uses registered custom reporter
    config: {webhook_url: "${SLACK_WEBHOOK}"}  # Quoted so the braces parse as a string in YAML flow syntax
Method 2: Programmatic Registration
from judge_llm import evaluate, register_provider, register_evaluator, register_reporter
from my_components import CustomProvider, SafetyEvaluator, SlackReporter
# Register components
register_provider("my_provider", CustomProvider)
register_evaluator("safety", SafetyEvaluator)
register_reporter("slack", SlackReporter)
# Use by name
report = evaluate(
    dataset={"loader": "local_file", "paths": ["./tests.json"]},
    providers=[{"type": "my_provider", "agent_id": "test"}],
    evaluators=[{"type": "safety"}],
    reporters=[{"type": "slack", "config": {"webhook_url": "..."}}],
)
Benefits:
- ✅ DRY - Register once, use everywhere
- ✅ Team Standardization - Share defaults across team
- ✅ Clean Configs - Test configs reference components by name
- ✅ Easy Updates - Change implementation in one place
See examples/default_config_reporters/ for complete examples.
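The "register once, use by name" idea behind both methods can be sketched as a small name-to-class registry. This is a generic illustration, not the library's registry: `register`, `create`, and this `SlackReporter` are hypothetical, and the spec shape mirrors the `type` and `config` keys used in the configs above.

```python
from collections import defaultdict

# kind -> name -> class; a stand-in for whatever registry the library keeps.
_REGISTRY = defaultdict(dict)


def register(kind, name, cls):
    """Store a component class under a short name."""
    if name in _REGISTRY[kind]:
        raise ValueError(f"{kind} '{name}' is already registered")
    _REGISTRY[kind][name] = cls


def create(kind, spec):
    """Build a component from a config entry such as {"type": "slack", "config": {...}}."""
    cls = _REGISTRY[kind][spec["type"]]  # KeyError for unknown names
    return cls(**spec.get("config", {}))


class SlackReporter:
    """Hypothetical reporter used only to exercise the registry."""

    def __init__(self, webhook_url):
        self.webhook_url = webhook_url


register("reporter", "slack", SlackReporter)
```

With the class registered, a config entry only needs the short name, which is what keeps test configs free of module paths.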
Testing Examples
Explore 10 complete examples in the examples/ directory, from basic setup to advanced features:
| Example | Description | Key Features |
|---|---|---|
| 01-gemini-agent | Basic Gemini agent evaluation | Simple setup, response evaluation, CLI & Python API usage |
| 02-default-config | Using default configuration files | Config merging, .judge_llm.defaults.yaml, reducing duplication |
| 03-custom-evaluator | Creating custom evaluators | Extending BaseEvaluator, custom validation logic, registration |
| 04-safety-long-conversation | Multi-turn safety evaluation | Long conversations, PII detection, toxicity analysis, LLM-as-judge |
| 05-evaluator-config-override | Per-test-case evaluator overrides | Fine-grained threshold control, two-level configuration |
| 06-database-reporter | SQLite database reporter | Historical tracking, trend analysis, SQL queries, cost monitoring |
| 07-custom-reporter | Creating custom reporters | CSV reporter example, config-based & programmatic registration |
| 08-default-config-reporters | Registering custom components | Register providers, evaluators, and reporters in defaults |
| 09-google-adk-agent | Google ADK agent evaluation | ADK integration, tool usage validation, agent module loading |
| 10-adk-http-agent | Remote ADK HTTP agent evaluation | HTTP/SSE streaming, multi-agent chains, sub-agent evaluation |
Running Examples
Each example includes configuration files, datasets, and detailed instructions:
# Navigate to any example
cd examples/01-gemini-agent
# Run via CLI
judge-llm run --config config.yaml
# Or run via Python script
python run_evaluation.py
# Or use the convenience shell script
./run.sh
Example Categories
Getting Started:
- Start with 01-gemini-agent for basic usage
- Use 02-default-config to learn configuration best practices
Customization:
- 03-custom-evaluator - Build domain-specific evaluation logic
- 07-custom-reporter - Create custom output formats
- 08-default-config-reporters - Organize custom components
Advanced Features:
- 04-safety-long-conversation - Production-ready safety evaluation with LLM-as-judge
- 05-evaluator-config-override - Fine-tune evaluations per test case
- 06-database-reporter - Track metrics over time with SQL queries
- 09-google-adk-agent - Evaluate Google ADK agents seamlessly
- 10-adk-http-agent - Evaluate remote agents via HTTP with multi-agent chain validation
Built-in Components
Providers
- Gemini - Google's Gemini models (requires GOOGLE_API_KEY in .env)
- Google ADK - Google's Agent Development Kit for local agentic workflows (requires google-adk package)
- ADK HTTP - Remote ADK HTTP endpoints with SSE streaming, multi-auth, and agent chain tracking (requires httpx)
- Mock - Built-in test provider, no setup required
- Custom - Extend BaseProvider for your own LLM providers (OpenAI, Anthropic, etc.)
Evaluators
- ResponseEvaluator - Compare responses (exact, semantic similarity, ROUGE)
- TrajectoryEvaluator - Validate tool uses, conversation flow, and argument matching
- CostEvaluator - Enforce cost thresholds
- LatencyEvaluator - Enforce latency thresholds
- EmbeddingSimilarityEvaluator - Semantic similarity using embeddings (Gemini, OpenAI, sentence-transformers)
- LLMJudgeEvaluator - LLM-as-judge for relevance, hallucination, quality, and factuality assessment
- SubAgentEvaluator - Validate agent transfer chains in multi-agent orchestration systems
- Custom - Extend BaseEvaluator for custom logic (safety, compliance, etc.)
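As an illustration of what trajectory validation involves, the sketch below compares an expected sequence of tool calls with the calls an agent actually made. The data shape (dicts with `name` and `args`) and the strict, in-order comparison are assumptions; TrajectoryEvaluator's real matching rules may be more flexible.

```python
def trajectory_matches(expected, actual, check_args=True):
    """Return True when both tool-call sequences agree step by step.

    Each call is a hypothetical dict like {"name": "search", "args": {...}}.
    """
    if len(expected) != len(actual):
        return False
    for exp, act in zip(expected, actual):
        if exp["name"] != act["name"]:
            return False
        if check_args and exp.get("args", {}) != act.get("args", {}):
            return False
    return True


expected = [{"name": "search", "args": {"q": "weather"}}, {"name": "summarize"}]
same_tools_other_args = [{"name": "search", "args": {"q": "news"}}, {"name": "summarize"}]
```

Turning off `check_args` accepts `same_tools_other_args`, which is useful when only the order of tools matters.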
Reporters
- ConsoleReporter - Rich terminal output with colored tables
- HTMLReporter - Interactive HTML dashboard
- JSONReporter - Machine-readable JSON export
- DatabaseReporter - SQLite database for historical tracking
- Custom - Extend BaseReporter for custom formats (CSV, Slack, Datadog, etc.)
Reports & Dashboard
HTML Dashboard
Interactive web interface with:
- Sidebar: Summary metrics + execution list with color-coded status
- Main Panel: Execution details, evaluator scores, conversation history
- Features: Dark mode, responsive, self-contained (works offline)
Console Output
Rich formatted tables with live execution progress
JSON Export
Machine-readable results for programmatic analysis
SQLite Database
Persistent storage for:
- Historical trend tracking
- Regression detection
- Cost analysis over time
- SQL-based queries
# Generate dashboard from database
judge-llm dashboard --db results.db --output dashboard.html
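The kind of SQL-based cost analysis mentioned above might look like the following. The table and column names (`results`, `run_id`, `test_id`, `passed`, `cost`) are invented for this sketch and are not the DatabaseReporter schema; inspect your own results.db before adapting the query.

```python
import sqlite3


def demo_connection():
    """Build an in-memory database with a hypothetical results table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE results (run_id TEXT, test_id TEXT, passed INTEGER, cost REAL)"
    )
    conn.executemany(
        "INSERT INTO results VALUES (?, ?, ?, ?)",
        [
            ("run-1", "t1", 1, 0.01),
            ("run-1", "t2", 1, 0.02),
            ("run-2", "t1", 1, 0.02),
            ("run-2", "t2", 0, 0.03),
        ],
    )
    return conn


def cost_and_pass_rate(conn):
    """Total cost and pass rate per run, oldest run first."""
    query = (
        "SELECT run_id, SUM(cost), AVG(passed) "
        "FROM results GROUP BY run_id ORDER BY run_id"
    )
    return conn.execute(query).fetchall()
```

Comparing consecutive rows from `cost_and_pass_rate` is one simple way to spot a regression or a cost increase between runs.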
Telemetry & Observability (OpenTelemetry)
Judge LLM includes optional OpenTelemetry instrumentation for deep observability into evaluation runs. Disabled by default with zero overhead when not enabled.
Quick Start
# Install telemetry dependencies
pip install judge-llm[telemetry]
# Run with console tracing
judge-llm run --config config.yaml --telemetry
# Run with OTLP exporter (Jaeger, Grafana Tempo, etc.)
judge-llm run --config config.yaml --telemetry --telemetry-exporter otlp
Arize Phoenix Integration
# Install Phoenix support
pip install judge-llm[phoenix]
# Start Phoenix server
pip install arize-phoenix && phoenix serve
# Run with Phoenix tracing
judge-llm run --config config.yaml --telemetry --telemetry-exporter phoenix
# Open http://localhost:6006 to view traces
Enable via YAML Config
agent:
  telemetry:
    enabled: true
    exporter: phoenix  # "console", "otlp", or "phoenix"
    endpoint: http://localhost:6006
Enable via Environment Variables
export JUDGE_LLM_TELEMETRY=true
export OTEL_EXPORTER_TYPE=phoenix
export PHOENIX_COLLECTOR_ENDPOINT=http://localhost:6006
What Gets Traced
Every evaluation creates a span tree with detailed attributes:
judge_llm.evaluate [CHAIN]
├── judge_llm.execute_task [CHAIN, session.id]
│ ├── judge_llm.provider.execute [LLM, input/output, tokens]
│ │ ├── adk_http.create_session [TOOL, HTTP req/res]
│ │ └── adk_http.send_and_collect [LLM, HTTP req/res, tokens]
│ └── judge_llm.evaluator.evaluate [EVALUATOR, score]
└── judge_llm.reporter.generate [per reporter]
Phoenix-specific features:
- Sessions - spans are grouped by session.id for multi-turn conversation tracking
- Input/Output - full request payloads and agent response text visible on each span
- HTTP details - request/response bodies, headers, and status codes on HTTP spans
- LLM metadata - model name, token counts (prompt/completion/total) via OpenInference conventions
See the Telemetry Guide for complete documentation.
Development
# Setup
pip install -e ".[dev]"
# Run tests
pytest
# Format code
black judge_llm && ruff check judge_llm
Contributions welcome! Fork, create a feature branch, add tests, and submit a PR.
License
Licensed under CC BY-NC-SA 4.0 - Free for non-commercial use with attribution. See LICENSE for details.
For commercial licensing, contact the maintainers.