A lightweight LLM evaluation framework for comparing and testing AI providers

JUDGE LLM

A lightweight, extensible Python framework for evaluating and comparing LLM providers. Test your AI agents systematically with multi-turn conversations, cost tracking, and comprehensive reporting.

Purpose

JUDGE LLM helps you evaluate AI agents and LLM providers by running test cases against your models and measuring:

Response quality (exact matching, semantic similarity, ROUGE scores)
Cost & latency (token usage, execution time, budget compliance)
Conversation flow (tool uses, multi-turn interactions)
Safety & custom metrics (extensible evaluation logic)

It is suited to regression testing, A/B comparison of providers, and enforcing quality thresholds before a release.

Features

Multiple Providers: Gemini, Google ADK, ADK HTTP, Mock, and custom providers with registry-based extensibility
Built-in Evaluators: Response similarity, trajectory validation, cost/latency checks, embedding similarity, LLM-as-judge, sub-agent chain validation
Custom Components: Create and register custom providers, evaluators, and reporters
Registry System: Register once in defaults, use everywhere by name
Rich Reports: Console tables, interactive HTML dashboard, JSON exports, SQLite database, plus custom reporters
Parallel Execution: Run evaluations concurrently with configurable workers
Quality Gates: Fail CI/CD builds when thresholds are violated (configurable)
Config-Driven: YAML configs with smart defaults or programmatic Python API
Default Config: Reusable configurations with component registration
Per-Test Overrides: Fine-tune evaluator thresholds per test case
Environment Variables: Auto-loads .env for secure API key management
Telemetry: Optional OpenTelemetry tracing with Phoenix, Jaeger, and OTLP support

Installation

From Source

git clone https://github.com/HiHelloAI/judge-llm.git
cd judge-llm
pip install -e .

From PyPI

pip install judge-llm

With Optional Dependencies

# Install with Gemini provider support
# (quote the argument so shells such as zsh do not expand the brackets)
pip install "judge-llm[gemini]"

# Install with Google ADK provider support
pip install "judge-llm[google_adk]"

# Install with dev dependencies
pip install "judge-llm[dev]"

# Install with OpenTelemetry support
pip install "judge-llm[telemetry]"

# Install with Arize Phoenix observability
pip install "judge-llm[phoenix]"

Setup Environment Variables

JUDGE LLM automatically loads environment variables from a .env file:

# Copy the example file
cp .env.example .env

# Edit .env and add your API keys
nano .env

.env file:

# Google Gemini API Key
GOOGLE_API_KEY=your-google-api-key-here

The .env file is automatically loaded when you import the library or run the CLI. Never commit .env to version control - it's already in .gitignore.
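
To make the `.env` format above concrete, here is a minimal sketch of how `KEY=VALUE` lines can be parsed. It is illustrative only: the helper name `parse_env` is hypothetical, and this README does not say how JUDGE LLM itself loads the file.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


sample = "# Google Gemini API Key\nGOOGLE_API_KEY=your-google-api-key-here\n"
parsed = parse_env(sample)
print(parsed)
```

Keeping secrets in `.env` rather than in `config.yaml` means the config file can be committed safely.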

Quick Start

CLI Usage

# Run evaluation from config file
judge-llm run --config config.yaml

# Run with inline arguments (supports .json, .yaml, or .yml)
judge-llm run --dataset ./data/eval.yaml --provider mock --agent-id my_agent --report html --output report.html

# Validate configuration
judge-llm validate --config config.yaml

# List available components
judge-llm list providers
judge-llm list evaluators
judge-llm list reporters

# Generate dashboard from database
judge-llm dashboard --db results.db --output dashboard.html

Python API

from judge_llm import evaluate

# From config file
report = evaluate(config="config.yaml")

# Programmatic API (supports .json, .yaml, or .yml datasets)
report = evaluate(
    dataset={"loader": "local_file", "paths": ["./data/eval.yaml"]},
    providers=[{"type": "mock", "agent_id": "my_agent"}],
    evaluators=[{"type": "response_evaluator", "config": {"similarity_threshold": 0.8}}],
    reporters=[{"type": "console"}, {"type": "html", "output_path": "./report.html"}]
)

print(f"Success: {report.success_rate:.1%} | Cost: ${report.total_cost:.4f}")

Configuration

Minimal config.yaml:

dataset:
  loader: local_file
  paths: [./data/eval.json]  # Supports .json, .yaml, or .yml files

providers:
  - type: gemini
    agent_id: my_agent
    model: gemini-2.0-flash-exp

evaluators:
  - type: response_evaluator
    config: {similarity_threshold: 0.8}

reporters:
  - type: console
  - type: html
    output_path: ./report.html
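
As a rough illustration of what a `similarity_threshold` of 0.8 means, the sketch below scores two strings and compares the score to the threshold. It uses the standard library's `difflib.SequenceMatcher` as a stand-in; the metric JUDGE LLM actually applies is not specified here, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(expected: str, actual: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in metric, not the library's."""
    return SequenceMatcher(None, expected.lower(), actual.lower()).ratio()


def passes(expected: str, actual: str, similarity_threshold: float = 0.8) -> bool:
    """A case passes when its similarity reaches the configured threshold."""
    return similarity(expected, actual) >= similarity_threshold


expected = "Paris is the capital of France."
print(passes(expected, "Paris is the capital of France."))  # identical text
print(passes(expected, "I don't know."))                    # unrelated text
```

Raising the threshold makes the check stricter; an exact match always scores 1.0 under this metric.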

Advanced config with quality gates:

agent:
  fail_on_threshold_violation: true  # Exit with error if evaluations fail (default: true)
  parallel_execution: true            # Run tests in parallel
  max_workers: 4                      # Number of parallel workers
  num_runs: 3                         # Run each test 3 times

dataset:
  loader: local_file
  paths: [./data/eval.yaml]

providers:
  - type: gemini
    agent_id: production_agent
    model: gemini-2.0-flash-exp

evaluators:
  - type: response_evaluator
    config:
      similarity_threshold: 0.85  # Minimum 85% similarity required
  - type: cost_evaluator
    config:
      max_cost_per_case: 0.05      # Maximum $0.05 per test

reporters:
  - type: database
    db_path: ./results.db  # Track results over time
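
The `parallel_execution`, `max_workers`, and `num_runs` settings above can be pictured with the following sketch, which fans out every (test case, run index) pair to a thread pool. It is an assumption about the general shape of such a runner, not JUDGE LLM's implementation; `run_case` and `run_all` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def run_case(case_id, run_index):
    """Hypothetical placeholder for executing one test case once."""
    return {"case": case_id, "run": run_index, "passed": True}


def run_all(case_ids, num_runs=3, max_workers=4):
    """Run every case num_runs times, using at most max_workers threads."""
    jobs = [(case, run) for case in case_ids for run in range(num_runs)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the submitted jobs.
        return list(pool.map(lambda job: run_case(*job), jobs))


results = run_all(["greeting", "refund"], num_runs=3, max_workers=4)
print(len(results))
```

Repeating each case several times helps separate flaky model output from a genuine regression.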

Use in CI/CD:

# Fails with exit code 1 if any evaluator thresholds are violated
judge-llm run --config ci-config.yaml

# Or disable failures for monitoring
# Set fail_on_threshold_violation: false in config
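
The gating behaviour described above can be sketched as a small function that maps evaluator outcomes to a process exit code. The `EvalResult` type and `exit_code` function are hypothetical and only mirror the documented semantics of `fail_on_threshold_violation`.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Hypothetical stand-in for one evaluator outcome; not a judge_llm type."""
    name: str
    score: float
    threshold: float

    @property
    def passed(self) -> bool:
        return self.score >= self.threshold


def exit_code(results, fail_on_threshold_violation=True):
    """Return 1 if gating is on and any result is below its threshold, else 0."""
    violated = [r for r in results if not r.passed]
    return 1 if (fail_on_threshold_violation and violated) else 0


results = [
    EvalResult("response_evaluator", score=0.91, threshold=0.85),
    EvalResult("response_evaluator", score=0.80, threshold=0.85),
]
print(exit_code(results))                                     # gating on
print(exit_code(results, fail_on_threshold_violation=False))  # monitoring only
```

A CI system treats any non-zero exit code as a failed step, which is what turns a threshold violation into a failed build.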

Dataset File Formats:

JUDGE LLM supports both JSON and YAML formats for evaluation datasets. Use whichever format you prefer:

# Using JSON dataset
dataset:
  loader: local_file
  paths: [./data/eval.json]

# Using YAML dataset
dataset:
  loader: local_file
  paths: [./data/eval.yaml]

# Using multiple datasets (mixed formats)
dataset:
  loader: local_file
  paths:
    - ./data/eval1.json
    - ./data/eval2.yaml

# Using directory loader with pattern
dataset:
  loader: directory
  paths: [./data]
  pattern: "*.yaml"  # or "*.json" or "*.yml"
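
One way to support mixed formats, as in the configs above, is to pick the parser from the file extension. The sketch below is a guess at that general approach rather than the library's loader; `load_dataset` and `find_datasets` are hypothetical names, the dataset content is made up, and the YAML branch assumes PyYAML is installed.

```python
import json
import tempfile
from pathlib import Path


def load_dataset(path):
    """Parse one dataset file, choosing the parser from the file extension."""
    path = Path(path)
    suffix = path.suffix.lower()
    text = path.read_text(encoding="utf-8")
    if suffix == ".json":
        return json.loads(text)
    if suffix in {".yaml", ".yml"}:
        import yaml  # PyYAML; imported lazily so JSON-only use needs no extra package
        return yaml.safe_load(text)
    raise ValueError(f"Unsupported dataset format: {suffix!r}")


def find_datasets(directory, pattern="*.json"):
    """Return matching files in a stable order, like a directory loader with a pattern."""
    return sorted(Path(directory).glob(pattern))


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical dataset content; the real schema is not shown in this README.
    (Path(tmp) / "eval1.json").write_text(json.dumps({"cases": ["a", "b"]}), encoding="utf-8")
    (Path(tmp) / "notes.txt").write_text("ignored", encoding="utf-8")
    found = [p.name for p in find_datasets(tmp, "*.json")]
    loaded = [load_dataset(p) for p in find_datasets(tmp, "*.json")]

print(found, loaded)
```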

Google ADK Provider Configuration:

For agents built with Google's Agent Development Kit (ADK):

providers:
  - type: google_adk
    agent_id: my_adk_agent
    agent_metadata:
      module_path: "tool_agent.agent"  # Path to your agent module
      agent_name: "root_agent"         # Agent variable name (default: "root_agent")
      root_path: "."                   # Root path for imports (optional)

The ADK provider automatically:

Loads your ADK agent from the specified module
Converts between Judge LLM and ADK formats
Validates tool usage and conversation flow
Reports results in standard Judge LLM format

See examples/09-google-adk-agent/ for a complete working example.

ADK HTTP Provider Configuration:

For evaluating agents deployed as remote HTTP services:

providers:
  - type: adk_http
    agent_id: my_remote_agent
    endpoint_url: "http://localhost:8000/run_sse"
    auth_type: bearer          # bearer, api_key, basic, none
    timeout: 60
    app_name: "my_app"
    model: gemini-2.0-flash

evaluators:
  - type: response_evaluator
  - type: trajectory_evaluator
  - type: subagent_evaluator   # Validate agent transfer chains
  - type: llm_judge_evaluator  # LLM-as-judge quality assessment
    config:
      evaluation_type: comprehensive

Lifecycle Callbacks (Extensibility):

The ADK HTTP provider exposes lifecycle callbacks that you can override by subclassing:

Callback	When	Context keys
`on_before_session_create`	Before session creation request	`payload, headers, url, app_name, user_id, session_id`
`on_after_session_create`	After session created	`session_id, response, app_name, user_id`
`on_before_run`	Before sending message	`payload, headers, message, session_id, system_instruction`
`on_after_run`	After collecting events	`events, session_id, response`

# my_provider.py
from judge_llm.providers.adk_http_provider import ADKHTTPProvider

class MyADKHTTPProvider(ADKHTTPProvider):
    def on_before_run(self, context):
        context["headers"]["X-Trace-Id"] = "my-trace"
        return context

# Register and use via config
providers:
  - type: custom
    module_path: ./my_provider.py
    class_name: MyADKHTTPProvider
    register_as: my_adk_http
    agent_id: my_agent
    endpoint_url: http://localhost:8000/run_sse

See examples/10-adk-http-agent/ for a complete working example.
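
The callback mechanism above follows the template-method pattern: a base class calls overridable hooks at fixed points, and a subclass changes behaviour by overriding them. The sketch below imitates that flow with a hypothetical `SketchProvider` base class and a fake "request"; it is not the real `ADKHTTPProvider`, and only the context keys are taken from the table above.

```python
class SketchProvider:
    """Hypothetical base class that only imitates the documented hook order."""

    def on_before_run(self, context):
        return context

    def on_after_run(self, context):
        return context

    def run(self, message, session_id):
        context = {
            "payload": {"message": message},
            "headers": {},
            "message": message,
            "session_id": session_id,
        }
        context = self.on_before_run(context)
        # Stand-in for the HTTP call: echo the payload back as a single event.
        events = [{"echo": context["payload"]["message"]}]
        after = self.on_after_run(
            {"events": events, "session_id": session_id, "response": None}
        )
        return context["headers"], after["events"]


class TracingProvider(SketchProvider):
    def on_before_run(self, context):
        # Same idea as the subclass shown earlier: add a header before sending.
        context["headers"]["X-Trace-Id"] = "my-trace"
        return context


headers, events = TracingProvider().run("hello", session_id="s1")
print(headers, events)
```

Because each hook receives and returns the context, a subclass can adjust headers or payloads without reimplementing the request logic.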

See the examples/ directory for complete configuration examples including default configs, custom evaluators, and advanced features.

Custom Component Registration

JUDGE LLM supports registering custom providers, evaluators, and reporters for reuse across projects.

Method 1: Register in Default Config

Create .judge_llm.defaults.yaml in your project root:

# Register custom components once
providers:
  - type: custom
    module_path: ./my_providers/anthropic.py
    class_name: AnthropicProvider
    register_as: anthropic  # ← Use this name in test configs

evaluators:
  - type: custom
    module_path: ./my_evaluators/safety.py
    class_name: SafetyEvaluator
    register_as: safety

reporters:
  - type: custom
    module_path: ./my_reporters/slack.py
    class_name: SlackReporter
    register_as: slack

Then use them by name in any test config:

# test.yaml - clean and simple!
providers:
  - type: anthropic  # ← Uses registered custom provider
    agent_id: claude

evaluators:
  - type: safety  # ← Uses registered custom evaluator

reporters:
  - type: slack  # ← Uses registered custom reporter
    config: {webhook_url: "${SLACK_WEBHOOK}"}  # quoted so the braces parse as a string

Method 2: Programmatic Registration

from judge_llm import evaluate, register_provider, register_evaluator, register_reporter
from my_components import CustomProvider, SafetyEvaluator, SlackReporter

# Register components
register_provider("my_provider", CustomProvider)
register_evaluator("safety", SafetyEvaluator)
register_reporter("slack", SlackReporter)

# Use by name
report = evaluate(
    dataset={"loader": "local_file", "paths": ["./tests.json"]},
    providers=[{"type": "my_provider", "agent_id": "test"}],
    evaluators=[{"type": "safety"}],
    reporters=[{"type": "slack", "config": {"webhook_url": "..."}}]
)

Benefits:

✅ DRY - Register once, use everywhere
✅ Team Standardization - Share defaults across team
✅ Clean Configs - Test configs reference components by name
✅ Easy Updates - Change implementation in one place

See examples/08-default-config-reporters/ for complete examples.
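
The "register once, use by name" idea can be sketched with a plain dictionary that maps a name to a class. This is a generic illustration under assumed names (`register`, `create`, `EchoProvider`), not the internals of `register_provider` or the other registration functions.

```python
_REGISTRY = {}


def register(kind, name, cls):
    """Store a component class under (kind, name); refuse silent overwrites."""
    key = (kind, name)
    if key in _REGISTRY:
        raise ValueError(f"{kind} {name!r} is already registered")
    _REGISTRY[key] = cls


def create(kind, spec):
    """Build a component from a config entry such as {'type': 'echo', ...}."""
    spec = dict(spec)  # copy so the caller's config is not mutated
    name = spec.pop("type")
    try:
        cls = _REGISTRY[(kind, name)]
    except KeyError:
        raise KeyError(f"Unknown {kind} type: {name!r}") from None
    return cls(**spec)


class EchoProvider:
    """Hypothetical provider used only for this example."""

    def __init__(self, agent_id):
        self.agent_id = agent_id


register("provider", "echo", EchoProvider)
provider = create("provider", {"type": "echo", "agent_id": "test"})
print(type(provider).__name__, provider.agent_id)
```

Looking components up by name is what lets a test config stay short: it names the type and passes only the settings that differ.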

Testing Examples

Explore 10 complete examples in the examples/ directory, from basic setup to advanced features:

Example	Description	Key Features
01-gemini-agent	Basic Gemini agent evaluation	Simple setup, response evaluation, CLI & Python API usage
02-default-config	Using default configuration files	Config merging, `.judge_llm.defaults.yaml`, reducing duplication
03-custom-evaluator	Creating custom evaluators	Extending `BaseEvaluator`, custom validation logic, registration
04-safety-long-conversation	Multi-turn safety evaluation	Long conversations, PII detection, toxicity analysis, LLM-as-judge
05-evaluator-config-override	Per-test-case evaluator overrides	Fine-grained threshold control, two-level configuration
06-database-reporter	SQLite database reporter	Historical tracking, trend analysis, SQL queries, cost monitoring
07-custom-reporter	Creating custom reporters	CSV reporter example, config-based & programmatic registration
08-default-config-reporters	Registering custom components	Register providers, evaluators, and reporters in defaults
09-google-adk-agent	Google ADK agent evaluation	ADK integration, tool usage validation, agent module loading
10-adk-http-agent	Remote ADK HTTP agent evaluation	HTTP/SSE streaming, multi-agent chains, sub-agent evaluation

Running Examples

Each example includes configuration files, datasets, and detailed instructions:

# Navigate to any example
cd examples/01-gemini-agent

# Run via CLI
judge-llm run --config config.yaml

# Or run via Python script
python run_evaluation.py

# Or use the convenience shell script
./run.sh

Example Categories

Getting Started:

Start with 01-gemini-agent for basic usage
Use 02-default-config to learn configuration best practices

Customization:

03-custom-evaluator - Build domain-specific evaluation logic
07-custom-reporter - Create custom output formats
08-default-config-reporters - Organize custom components

Advanced Features:

04-safety-long-conversation - Production-ready safety evaluation with LLM-as-judge
05-evaluator-config-override - Fine-tune evaluations per test case
06-database-reporter - Track metrics over time with SQL queries
09-google-adk-agent - Evaluate Google ADK agents seamlessly
10-adk-http-agent - Evaluate remote agents via HTTP with multi-agent chain validation

Built-in Components

Providers

Gemini - Google's Gemini models (requires GOOGLE_API_KEY in .env)
Google ADK - Google's Agent Development Kit for local agentic workflows (requires google-adk package)
ADK HTTP - Remote ADK HTTP endpoints with SSE streaming, multi-auth, and agent chain tracking (requires httpx)
Mock - Built-in test provider, no setup required
Custom - Extend BaseProvider for your own LLM providers (OpenAI, Anthropic, etc.)

Evaluators

ResponseEvaluator - Compare responses (exact, semantic similarity, ROUGE)
TrajectoryEvaluator - Validate tool uses, conversation flow, and argument matching
CostEvaluator - Enforce cost thresholds
LatencyEvaluator - Enforce latency thresholds
EmbeddingSimilarityEvaluator - Semantic similarity using embeddings (Gemini, OpenAI, sentence-transformers)
LLMJudgeEvaluator - LLM-as-judge for relevance, hallucination, quality, and factuality assessment
SubAgentEvaluator - Validate agent transfer chains in multi-agent orchestration systems
Custom - Extend BaseEvaluator for custom logic (safety, compliance, etc.)

Reporters

ConsoleReporter - Rich terminal output with colored tables
HTMLReporter - Interactive HTML dashboard
JSONReporter - Machine-readable JSON export
DatabaseReporter - SQLite database for historical tracking
Custom - Extend BaseReporter for custom formats (CSV, Slack, Datadog, etc.)

Reports & Dashboard

HTML Dashboard

Interactive web interface with:

Sidebar: Summary metrics + execution list with color-coded status
Main Panel: Execution details, evaluator scores, conversation history
Features: Dark mode, responsive, self-contained (works offline)

Console Output

Rich formatted tables with live execution progress

JSON Export

Machine-readable results for programmatic analysis

SQLite Database

Persistent storage for:

Historical trend tracking
Regression detection
Cost analysis over time
SQL-based queries

# Generate dashboard from database
judge-llm dashboard --db results.db --output dashboard.html
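
Storing results in SQLite makes trend questions a matter of SQL. The sketch below builds an in-memory table and computes pass rate and total cost per day. The `runs` table and its columns are invented for illustration; the schema that DatabaseReporter actually writes is not documented here, so inspect `results.db` before adapting the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema, not the one written by DatabaseReporter.
conn.execute(
    "CREATE TABLE runs (run_date TEXT, test_id TEXT, passed INTEGER, cost REAL)"
)
conn.executemany(
    "INSERT INTO runs VALUES (?, ?, ?, ?)",
    [
        ("2026-02-01", "t1", 1, 0.010),
        ("2026-02-01", "t2", 1, 0.020),
        ("2026-02-02", "t1", 1, 0.010),
        ("2026-02-02", "t2", 0, 0.030),
    ],
)

# Pass rate and total cost per day, oldest first.
trend = conn.execute(
    "SELECT run_date, AVG(passed), SUM(cost) FROM runs "
    "GROUP BY run_date ORDER BY run_date"
).fetchall()
conn.close()

for run_date, pass_rate, total_cost in trend:
    print(run_date, f"{pass_rate:.0%}", f"${total_cost:.3f}")
```

A drop in pass rate between two consecutive dates is the kind of signal regression detection looks for.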

Telemetry & Observability (OpenTelemetry)

Judge LLM includes optional OpenTelemetry instrumentation for deep observability into evaluation runs. It is disabled by default and adds no overhead unless enabled.

Quick Start

# Install telemetry dependencies
pip install "judge-llm[telemetry]"

# Run with console tracing
judge-llm run --config config.yaml --telemetry

# Run with OTLP exporter (Jaeger, Grafana Tempo, etc.)
judge-llm run --config config.yaml --telemetry --telemetry-exporter otlp

Arize Phoenix Integration

# Install Phoenix support
pip install "judge-llm[phoenix]"

# Start Phoenix server
pip install arize-phoenix && phoenix serve

# Run with Phoenix tracing
judge-llm run --config config.yaml --telemetry --telemetry-exporter phoenix
# Open http://localhost:6006 to view traces

Enable via YAML Config

agent:
  telemetry:
    enabled: true
    exporter: phoenix    # "console", "otlp", or "phoenix"
    endpoint: http://localhost:6006

Enable via Environment Variables

export JUDGE_LLM_TELEMETRY=true
export OTEL_EXPORTER_TYPE=phoenix
export PHOENIX_COLLECTOR_ENDPOINT=http://localhost:6006
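
For illustration, the three variables above could be read into a settings dictionary as follows. The variable names come from this README; the accepted truthy values, the `console` default, and the `telemetry_settings` helper are assumptions, since the library's own parsing is not shown.

```python
import os


def telemetry_settings(environ=None):
    """Read telemetry options from a mapping (defaults to os.environ)."""
    env = os.environ if environ is None else environ
    flag = env.get("JUDGE_LLM_TELEMETRY", "false").strip().lower()
    return {
        "enabled": flag in {"1", "true", "yes"},               # assumed truthy values
        "exporter": env.get("OTEL_EXPORTER_TYPE", "console"),  # assumed default
        "endpoint": env.get("PHOENIX_COLLECTOR_ENDPOINT"),
    }


settings = telemetry_settings(
    {
        "JUDGE_LLM_TELEMETRY": "true",
        "OTEL_EXPORTER_TYPE": "phoenix",
        "PHOENIX_COLLECTOR_ENDPOINT": "http://localhost:6006",
    }
)
print(settings)
```

Passing the environment in as a mapping keeps the function easy to test without touching the real process environment.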

What Gets Traced

Every evaluation creates a span tree with detailed attributes:

judge_llm.evaluate                                [CHAIN]
├── judge_llm.execute_task                        [CHAIN, session.id]
│   ├── judge_llm.provider.execute                [LLM, input/output, tokens]
│   │   ├── adk_http.create_session               [TOOL, HTTP req/res]
│   │   └── adk_http.send_and_collect             [LLM, HTTP req/res, tokens]
│   └── judge_llm.evaluator.evaluate              [EVALUATOR, score]
└── judge_llm.reporter.generate                   [per reporter]

Phoenix-specific features:

Sessions — spans are grouped by session.id for multi-turn conversation tracking
Input/Output — full request payloads and agent response text visible on each span
HTTP details — request/response bodies, headers, and status codes on HTTP spans
LLM metadata — model name, token counts (prompt/completion/total) via OpenInference conventions

See the Telemetry Guide for complete documentation.

Development

# Setup
pip install -e ".[dev]"

# Run tests
pytest

# Format and lint code
black judge_llm && ruff check judge_llm

Contributions welcome! Fork, create a feature branch, add tests, and submit a PR.

License

Licensed under CC BY-NC-SA 4.0 - Free for non-commercial use with attribution. See LICENSE for details.

For commercial licensing, contact the maintainers.

Release history

Version	Release date
1.0.23	Feb 3, 2026
1.0.22	Feb 3, 2026
1.0.21	Feb 2, 2026
1.0.20	Feb 2, 2026
1.0.19	Feb 2, 2026
1.0.17	Feb 2, 2026
1.0.16	Feb 2, 2026
1.0.15	Feb 2, 2026
1.0.14	Feb 2, 2026
1.0.13	Feb 2, 2026
1.0.12	Feb 2, 2026
1.0.11 (this version)	Feb 2, 2026
1.0.10	Feb 2, 2026
1.0.8	Feb 2, 2026
1.0.7	Feb 2, 2026
1.0.6	Feb 2, 2026
1.0.5	Jan 26, 2026
1.0.4	Oct 27, 2025
1.0.3	Oct 21, 2025
1.0.2	Oct 21, 2025
1.0.1	Oct 21, 2025
1.0.0	Oct 21, 2025

Download files

Source Distribution

judge_llm-1.0.11.tar.gz (98.4 kB)

Uploaded Feb 2, 2026 Source

Built Distribution

judge_llm-1.0.11-py3-none-any.whl (112.3 kB)

Uploaded Feb 2, 2026 Python 3

File details

Details for the file judge_llm-1.0.11.tar.gz.

File metadata

Download URL: judge_llm-1.0.11.tar.gz
Upload date: Feb 2, 2026
Size: 98.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.18

File hashes

Hashes for judge_llm-1.0.11.tar.gz
Algorithm	Hash digest
SHA256	`e831c1da6185c453f20cadc70bcda8d446d9914ee94805f98fe800a66fa5f3d8`
MD5	`b12e5bb2be3b41cdaa56f45af060e62d`
BLAKE2b-256	`8c859348e1ee124695e4ac25a934cd931ace87005b401851fb9f91c72368177e`

File details

Details for the file judge_llm-1.0.11-py3-none-any.whl.

File metadata

Download URL: judge_llm-1.0.11-py3-none-any.whl
Upload date: Feb 2, 2026
Size: 112.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.18

File hashes

Hashes for judge_llm-1.0.11-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a5af0e45c16d5502b91ee51a42a65269413db27442c78c39594ccd4bf9ed84ef`
MD5	`929ca7a87c60c322b9a8647959358baa`
BLAKE2b-256	`dcc352e15382e1044ae2e83e70820fc6c51f183dd53b6f8fe6b712661eace5b2`
