A lightweight LLM evaluation framework for comparing and testing AI providers

JUDGE LLM

A lightweight, extensible Python framework for evaluating and comparing LLM providers. Test your AI agents systematically with multi-turn conversations, cost tracking, and comprehensive reporting.

Purpose

JUDGE LLM helps you evaluate AI agents and LLM providers by running test cases against your models and measuring:

  • Response quality (exact matching, semantic similarity, ROUGE scores)
  • Cost & latency (token usage, execution time, budget compliance)
  • Conversation flow (tool uses, multi-turn interactions)
  • Safety & custom metrics (extensible evaluation logic)
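As a rough illustration of the response-quality metrics listed above, the sketch below implements exact matching, a character-level similarity ratio, and a unigram-overlap F1 score (the idea behind ROUGE-1) using only the standard library. The function names are hypothetical and this is not JUDGE LLM's implementation; in particular, `char_similarity` is a surface-level stand-in, not a true semantic similarity.

```python
from collections import Counter
from difflib import SequenceMatcher


def exact_match(expected: str, actual: str) -> bool:
    """True when the strings are identical after trimming and case-folding."""
    return expected.strip().casefold() == actual.strip().casefold()


def char_similarity(expected: str, actual: str) -> float:
    """Character-level similarity ratio in [0, 1] computed by difflib."""
    return SequenceMatcher(None, expected, actual).ratio()


def rouge1_f1(expected: str, actual: str) -> float:
    """Unigram-overlap F1 over lowercase whitespace tokens."""
    ref = Counter(expected.lower().split())
    hyp = Counter(actual.lower().split())
    overlap = sum((ref & hyp).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

A threshold such as `similarity_threshold: 0.8` would then be compared against a score like these to decide pass or fail.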

Typical uses include regression testing, A/B comparisons between providers, and gating releases on quality thresholds.

Features

  • Multiple Providers: Gemini, Google ADK, ADK HTTP, Mock, and custom providers with registry-based extensibility
  • Built-in Evaluators: Response similarity, trajectory validation, cost/latency checks, embedding similarity, LLM-as-judge, sub-agent chain validation
  • Custom Components: Create and register custom providers, evaluators, and reporters
  • Registry System: Register once in defaults, use everywhere by name
  • Rich Reports: Console tables, interactive HTML dashboard, JSON exports, SQLite database, plus custom reporters
  • Parallel Execution: Run evaluations concurrently with configurable workers
  • Quality Gates: Fail CI/CD builds when thresholds are violated (configurable)
  • Config-Driven: YAML configs with smart defaults or programmatic Python API
  • Default Config: Reusable configurations with component registration
  • Per-Test Overrides: Fine-tune evaluator thresholds per test case
  • Environment Variables: Auto-loads .env for secure API key management
  • Telemetry: Optional OpenTelemetry tracing with Phoenix, Jaeger, and OTLP support
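The Default Config and Per-Test Overrides features both come down to layering one configuration over another. A minimal sketch of such a merge is shown below, assuming that override values win and nested mappings merge key by key; the function name and the `match`/`case_sensitive` keys are hypothetical, and the library's actual merge rules may differ.

```python
from copy import deepcopy


def merge_config(base: dict, override: dict) -> dict:
    """Return a new dict where `override` wins; nested dicts merge recursively."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Shared defaults for an evaluator (hypothetical keys apart from the threshold).
defaults = {"similarity_threshold": 0.8, "match": {"case_sensitive": False}}
# One test case tightens only the threshold.
per_test = {"similarity_threshold": 0.95}
effective = merge_config(defaults, per_test)
```

Returning a fresh dictionary keeps the shared defaults untouched, so one test's override cannot leak into the next.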

Installation

From Source

git clone https://github.com/HiHelloAI/judge-llm.git
cd judge-llm
pip install -e .

From PyPI

pip install judge-llm

With Optional Dependencies

# Install with Gemini provider support
pip install judge-llm[gemini]

# Install with Google ADK provider support
pip install judge-llm[google_adk]

# Install with dev dependencies
pip install judge-llm[dev]

# Install with OpenTelemetry support
pip install judge-llm[telemetry]

# Install with Arize Phoenix observability
pip install judge-llm[phoenix]

Setup Environment Variables

JUDGE LLM automatically loads environment variables from a .env file:

# Copy the example file
cp .env.example .env

# Edit .env and add your API keys
nano .env

.env file:

# Google Gemini API Key
GOOGLE_API_KEY=your-google-api-key-here

The .env file is automatically loaded when you import the library or run the CLI. Never commit .env to version control - it's already in .gitignore.
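For readers unfamiliar with `.env` loading, the sketch below shows the general idea: parse `KEY=VALUE` lines and copy them into the process environment. It is an illustration only; how JUDGE LLM loads the file internally is not described here, and the choice not to overwrite variables that are already set is an assumption.

```python
import os


def parse_env_lines(lines):
    """Parse KEY=VALUE lines, skipping blank lines and `#` comments."""
    values = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")  # drop optional quotes
    return values


def load_env(lines, environ=os.environ):
    """Copy parsed values into `environ`, keeping variables that already exist."""
    for key, value in parse_env_lines(lines).items():
        environ.setdefault(key, value)
```

With this behaviour, a key exported in the shell would take precedence over the same key in `.env`.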

Quick Start

CLI Usage

# Run evaluation from config file
judge-llm run --config config.yaml

# Run with inline arguments (supports .json, .yaml, or .yml)
judge-llm run --dataset ./data/eval.yaml --provider mock --agent-id my_agent --report html --output report.html

# Validate configuration
judge-llm validate --config config.yaml

# List available components
judge-llm list providers
judge-llm list evaluators
judge-llm list reporters

# Generate dashboard from database
judge-llm dashboard --db results.db --output dashboard.html

Python API

from judge_llm import evaluate

# From config file
report = evaluate(config="config.yaml")

# Programmatic API (supports .json, .yaml, or .yml datasets)
report = evaluate(
    dataset={"loader": "local_file", "paths": ["./data/eval.yaml"]},
    providers=[{"type": "mock", "agent_id": "my_agent"}],
    evaluators=[{"type": "response_evaluator", "config": {"similarity_threshold": 0.8}}],
    reporters=[{"type": "console"}, {"type": "html", "output_path": "./report.html"}]
)

print(f"Success: {report.success_rate:.1%} | Cost: ${report.total_cost:.4f}")

Configuration

Minimal config.yaml:

dataset:
  loader: local_file
  paths: [./data/eval.json]  # Supports .json, .yaml, or .yml files

providers:
  - type: gemini
    agent_id: my_agent
    model: gemini-2.0-flash-exp

evaluators:
  - type: response_evaluator
    config: {similarity_threshold: 0.8}

reporters:
  - type: console
  - type: html
    output_path: ./report.html

Advanced config with quality gates:

agent:
  fail_on_threshold_violation: true  # Exit with error if evaluations fail (default: true)
  parallel_execution: true            # Run tests in parallel
  max_workers: 4                      # Number of parallel workers
  num_runs: 3                         # Run each test 3 times

dataset:
  loader: local_file
  paths: [./data/eval.yaml]

providers:
  - type: gemini
    agent_id: production_agent
    model: gemini-2.0-flash-exp

evaluators:
  - type: response_evaluator
    config:
      similarity_threshold: 0.85  # Minimum 85% similarity required
  - type: cost_evaluator
    config:
      max_cost_per_case: 0.05      # Maximum $0.05 per test

reporters:
  - type: database
    db_path: ./results.db  # Track results over time
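To make the `parallel_execution`, `max_workers`, and `num_runs` settings concrete, here is a minimal sketch of running every (test case, run) pair on a worker pool. It assumes a thread pool and uses a hypothetical `run_case` stand-in; the framework's own scheduling may work differently.

```python
from concurrent.futures import ThreadPoolExecutor


def run_case(case_id: str, run_index: int) -> dict:
    """Stand-in for one provider call plus evaluation; returns a fake result."""
    return {"case": case_id, "run": run_index, "passed": True}


def run_all(case_ids, num_runs=3, max_workers=4):
    """Execute every (case, run) pair on a thread pool, preserving input order."""
    tasks = [(case_id, run) for case_id in case_ids for run in range(num_runs)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in task order even if they finish out of order.
        return list(pool.map(lambda task: run_case(*task), tasks))


results = run_all(["greeting", "refund"], num_runs=3, max_workers=4)
```

Two cases with `num_runs: 3` produce six executions, at most four of which run at the same time.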

Use in CI/CD:

# Fails with exit code 1 if any evaluator thresholds are violated
judge-llm run --config ci-config.yaml

# Or disable failures for monitoring
# Set fail_on_threshold_violation: false in config
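When `fail_on_threshold_violation` is set to `false`, or when a script calls `evaluate()` directly, a pipeline can apply its own gate. The sketch below is one way to do that, assuming only the `success_rate` and `total_cost` attributes shown in the Python API example above; the function name and threshold parameters are hypothetical.

```python
from types import SimpleNamespace


def gate_exit_code(report, min_success_rate=1.0, max_total_cost=None) -> int:
    """Return 0 when the report meets both thresholds, 1 otherwise."""
    if report.success_rate < min_success_rate:
        return 1
    if max_total_cost is not None and report.total_cost > max_total_cost:
        return 1
    return 0


# Stand-in for the object returned by evaluate(); only the two attributes
# used above are provided.
fake_report = SimpleNamespace(success_rate=0.9, total_cost=0.02)
```

A CI script could finish with `sys.exit(gate_exit_code(report, 0.85, 0.05))` so the build fails only on the conditions the team cares about.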

Dataset File Formats:

JUDGE LLM supports both JSON and YAML formats for evaluation datasets. Use whichever format you prefer:

# Using JSON dataset
dataset:
  loader: local_file
  paths: [./data/eval.json]

# Using YAML dataset
dataset:
  loader: local_file
  paths: [./data/eval.yaml]

# Using multiple datasets (mixed formats)
dataset:
  loader: local_file
  paths:
    - ./data/eval1.json
    - ./data/eval2.yaml

# Using directory loader with pattern
dataset:
  loader: directory
  paths: [./data]
  pattern: "*.yaml"  # or "*.json" or "*.yml"
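The directory loader's `pattern` option behaves like a file glob. A minimal sketch of that idea follows, restricted to JSON so it needs only the standard library; the function names and the `cases` key in the sample file are hypothetical, and YAML files would additionally need a YAML parser.

```python
import json
import tempfile
from pathlib import Path


def find_dataset_files(root, pattern="*.json"):
    """Return files directly under `root` that match `pattern`, sorted by name."""
    return sorted(Path(root).glob(pattern))


def load_json_datasets(root, pattern="*.json"):
    """Parse every matching JSON file and map file name to parsed content."""
    return {
        path.name: json.loads(path.read_text(encoding="utf-8"))
        for path in find_dataset_files(root, pattern)
    }


# Demonstration in a throwaway directory with one dataset and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "eval1.json").write_text('{"cases": []}', encoding="utf-8")
    Path(tmp, "notes.txt").write_text("ignored", encoding="utf-8")
    loaded = load_json_datasets(tmp)
```

Sorting the matches keeps the load order stable across operating systems, which helps when comparing runs.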

Google ADK Provider Configuration:

For agents built with Google's Agent Development Kit (ADK):

providers:
  - type: google_adk
    agent_id: my_adk_agent
    agent_metadata:
      module_path: "tool_agent.agent"  # Path to your agent module
      agent_name: "root_agent"         # Agent variable name (default: "root_agent")
      root_path: "."                   # Root path for imports (optional)

The ADK provider automatically:

  • Loads your ADK agent from the specified module
  • Converts between Judge LLM and ADK formats
  • Validates tool usage and conversation flow
  • Reports results in standard Judge LLM format
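Loading an agent from `module_path` and `agent_name` can be pictured as a dynamic import followed by an attribute lookup. The sketch below shows that pattern with `importlib`; it is an assumption about the general approach, not the provider's actual code, and `load_agent` is a hypothetical name.

```python
import importlib
import sys


def load_agent(module_path, agent_name="root_agent", root_path="."):
    """Import `module_path` with `root_path` on sys.path and return the named attribute."""
    if root_path not in sys.path:
        sys.path.insert(0, root_path)  # make the project importable
    module = importlib.import_module(module_path)
    try:
        return getattr(module, agent_name)
    except AttributeError:
        raise AttributeError(
            f"module {module_path!r} has no attribute {agent_name!r}"
        ) from None
```

With the configuration above, the equivalent call would be `load_agent("tool_agent.agent", "root_agent", ".")`.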

See examples/09-google-adk-agent/ for a complete working example.

ADK HTTP Provider Configuration:

For evaluating agents deployed as remote HTTP services:

providers:
  - type: adk_http
    agent_id: my_remote_agent
    endpoint_url: "http://localhost:8000/run_sse"
    auth_type: bearer          # bearer, api_key, basic, none
    timeout: 60
    app_name: "my_app"
    model: gemini-2.0-flash

evaluators:
  - type: response_evaluator
  - type: trajectory_evaluator
  - type: subagent_evaluator   # Validate agent transfer chains
  - type: llm_judge_evaluator  # LLM-as-judge quality assessment
    config:
      evaluation_type: comprehensive

See examples/10-adk-http-agent/ for a complete working example.
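The four `auth_type` values map naturally onto HTTP request headers. The sketch below shows one plausible mapping; the `X-API-Key` header name and the function signature are assumptions, since services differ on how API keys are sent, and the provider's real behaviour may not match.

```python
import base64


def build_auth_headers(auth_type, token=None, username=None, password=None):
    """Translate an auth_type string into HTTP request headers."""
    if auth_type == "none":
        return {}
    if auth_type == "bearer":
        return {"Authorization": f"Bearer {token}"}
    if auth_type == "api_key":
        # Header name is an assumption; check what your service expects.
        return {"X-API-Key": token}
    if auth_type == "basic":
        raw = f"{username}:{password}".encode("utf-8")
        return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}
    raise ValueError(f"unsupported auth_type: {auth_type!r}")
```

Raising on unknown values surfaces a typo in the config immediately instead of sending an unauthenticated request.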

See the examples/ directory for complete configuration examples including default configs, custom evaluators, and advanced features.

Custom Component Registration

JUDGE LLM supports registering custom providers, evaluators, and reporters for reuse across projects.

Method 1: Register in Default Config

Create .judge_llm.defaults.yaml in your project root:

# Register custom components once
providers:
  - type: custom
    module_path: ./my_providers/anthropic.py
    class_name: AnthropicProvider
    register_as: anthropic  # ← Use this name in test configs

evaluators:
  - type: custom
    module_path: ./my_evaluators/safety.py
    class_name: SafetyEvaluator
    register_as: safety

reporters:
  - type: custom
    module_path: ./my_reporters/slack.py
    class_name: SlackReporter
    register_as: slack

Then use them by name in any test config:

# test.yaml - clean and simple!
providers:
  - type: anthropic  # ← Uses registered custom provider
    agent_id: claude

evaluators:
  - type: safety  # ← Uses registered custom evaluator

reporters:
  - type: slack  # ← Uses registered custom reporter
    config: {webhook_url: "${SLACK_WEBHOOK}"}  # quoted so the braces parse as a string
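The `${SLACK_WEBHOOK}` placeholder suggests environment-variable substitution in config values. How JUDGE LLM performs it is not documented here; the sketch below shows one common approach, with the hypothetical helper `expand_env` leaving unknown names untouched rather than failing.

```python
import os
import re

_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value, environ=os.environ):
    """Recursively replace ${NAME} in strings; unknown names are left as-is."""
    if isinstance(value, str):
        return _VAR.sub(lambda m: environ.get(m.group(1), m.group(0)), value)
    if isinstance(value, dict):
        return {k: expand_env(v, environ) for k, v in value.items()}
    if isinstance(value, list):
        return [expand_env(v, environ) for v in value]
    return value  # numbers, booleans, None pass through unchanged
```

Keeping secrets in the environment and referencing them by name means the config file itself can be committed safely.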

Method 2: Programmatic Registration

from judge_llm import evaluate, register_provider, register_evaluator, register_reporter
from my_components import CustomProvider, SafetyEvaluator, SlackReporter

# Register components
register_provider("my_provider", CustomProvider)
register_evaluator("safety", SafetyEvaluator)
register_reporter("slack", SlackReporter)

# Use by name
report = evaluate(
    dataset={"loader": "local_file", "paths": ["./tests.json"]},
    providers=[{"type": "my_provider", "agent_id": "test"}],
    evaluators=[{"type": "safety"}],
    reporters=[{"type": "slack", "config": {"webhook_url": "..."}}]
)

Benefits:

  • DRY - Register once, use everywhere
  • Team Standardization - Share defaults across team
  • Clean Configs - Test configs reference components by name
  • Easy Updates - Change implementation in one place

See examples/08-default-config-reporters/ for complete examples.
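The "register once, use by name" idea is a plain registry pattern. The sketch below illustrates it with a dictionary that maps names to classes; every name in it (`register`, `create`, `EchoProvider`) is hypothetical and independent of the `register_provider` function shown above, whose internals are not documented here.

```python
_PROVIDERS = {}


def register(name, cls, registry=_PROVIDERS):
    """Store `cls` under `name`; reject duplicates to catch typos early."""
    if name in registry:
        raise ValueError(f"{name!r} is already registered")
    registry[name] = cls


def create(spec, registry=_PROVIDERS):
    """Build an instance from a config entry like {"type": "echo", "agent_id": "x"}."""
    options = dict(spec)  # copy so the caller's config is not mutated
    cls = registry[options.pop("type")]
    return cls(**options)


class EchoProvider:
    """Toy provider that returns its input unchanged."""

    def __init__(self, agent_id):
        self.agent_id = agent_id

    def run(self, prompt):
        return prompt


register("echo", EchoProvider)
provider = create({"type": "echo", "agent_id": "test"})
```

Because configs refer only to the registered name, swapping the implementation behind `"echo"` requires no change to any test config.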

Testing Examples

Explore 10 complete examples in the examples/ directory, from basic setup to advanced features:

Example | Description | Key Features
--- | --- | ---
01-gemini-agent | Basic Gemini agent evaluation | Simple setup, response evaluation, CLI & Python API usage
02-default-config | Using default configuration files | Config merging, .judge_llm.defaults.yaml, reducing duplication
03-custom-evaluator | Creating custom evaluators | Extending BaseEvaluator, custom validation logic, registration
04-safety-long-conversation | Multi-turn safety evaluation | Long conversations, PII detection, toxicity analysis, LLM-as-judge
05-evaluator-config-override | Per-test-case evaluator overrides | Fine-grained threshold control, two-level configuration
06-database-reporter | SQLite database reporter | Historical tracking, trend analysis, SQL queries, cost monitoring
07-custom-reporter | Creating custom reporters | CSV reporter example, config-based & programmatic registration
08-default-config-reporters | Registering custom components | Register providers, evaluators, and reporters in defaults
09-google-adk-agent | Google ADK agent evaluation | ADK integration, tool usage validation, agent module loading
10-adk-http-agent | Remote ADK HTTP agent evaluation | HTTP/SSE streaming, multi-agent chains, sub-agent evaluation

Running Examples

Each example includes configuration files, datasets, and detailed instructions:

# Navigate to any example
cd examples/01-gemini-agent

# Run via CLI
judge-llm run --config config.yaml

# Or run via Python script
python run_evaluation.py

# Or use the convenience shell script
./run.sh

Example Categories

Getting Started:

  • Start with 01-gemini-agent for basic usage
  • Use 02-default-config to learn configuration best practices

Customization:

  • 03-custom-evaluator - Build domain-specific evaluation logic
  • 07-custom-reporter - Create custom output formats
  • 08-default-config-reporters - Organize custom components

Advanced Features:

  • 04-safety-long-conversation - Production-ready safety evaluation with LLM-as-judge
  • 05-evaluator-config-override - Fine-tune evaluations per test case
  • 06-database-reporter - Track metrics over time with SQL queries
  • 09-google-adk-agent - Evaluate Google ADK agents seamlessly
  • 10-adk-http-agent - Evaluate remote agents via HTTP with multi-agent chain validation

Built-in Components

Providers

  • Gemini - Google's Gemini models (requires GOOGLE_API_KEY in .env)
  • Google ADK - Google's Agent Development Kit for local agentic workflows (requires google-adk package)
  • ADK HTTP - Remote ADK HTTP endpoints with SSE streaming, multi-auth, and agent chain tracking (requires httpx)
  • Mock - Built-in test provider, no setup required
  • Custom - Extend BaseProvider for your own LLM providers (OpenAI, Anthropic, etc.)

Evaluators

  • ResponseEvaluator - Compare responses (exact, semantic similarity, ROUGE)
  • TrajectoryEvaluator - Validate tool uses, conversation flow, and argument matching
  • CostEvaluator - Enforce cost thresholds
  • LatencyEvaluator - Enforce latency thresholds
  • EmbeddingSimilarityEvaluator - Semantic similarity using embeddings (Gemini, OpenAI, sentence-transformers)
  • LLMJudgeEvaluator - LLM-as-judge for relevance, hallucination, quality, and factuality assessment
  • SubAgentEvaluator - Validate agent transfer chains in multi-agent orchestration systems
  • Custom - Extend BaseEvaluator for custom logic (safety, compliance, etc.)
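To give a feel for what trajectory validation with argument matching involves, the sketch below compares an expected sequence of tool calls against the sequence an agent actually made. The data shape (`name` and `args` keys) and the function name are assumptions for illustration; the TrajectoryEvaluator's real input format and matching rules may differ.

```python
def tool_calls_match(expected, actual, check_args=True):
    """Compare two ordered lists of tool calls shaped like {"name": ..., "args": {...}}."""
    if len(expected) != len(actual):
        return False
    for want, got in zip(expected, actual):
        if want["name"] != got["name"]:
            return False
        if check_args and want.get("args", {}) != got.get("args", {}):
            return False
    return True


expected_calls = [{"name": "lookup_order", "args": {"order_id": "42"}}]
actual_calls = [{"name": "lookup_order", "args": {"order_id": "43"}}]
```

Here the tool name matches but the argument does not, so the strict check fails while `check_args=False` passes.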

Reporters

  • ConsoleReporter - Rich terminal output with colored tables
  • HTMLReporter - Interactive HTML dashboard
  • JSONReporter - Machine-readable JSON export
  • DatabaseReporter - SQLite database for historical tracking
  • Custom - Extend BaseReporter for custom formats (CSV, Slack, Datadog, etc.)

Reports & Dashboard

HTML Dashboard

Interactive web interface with:

  • Sidebar: Summary metrics + execution list with color-coded status
  • Main Panel: Execution details, evaluator scores, conversation history
  • Features: Dark mode, responsive, self-contained (works offline)

Console Output

Rich formatted tables with live execution progress

JSON Export

Machine-readable results for programmatic analysis

SQLite Database

Persistent storage for:

  • Historical trend tracking
  • Regression detection
  • Cost analysis over time
  • SQL-based queries

# Generate dashboard from database
judge-llm dashboard --db results.db --output dashboard.html
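The kind of trend query a results database enables can be sketched with the standard `sqlite3` module. The table name, columns, and rows below are invented for illustration; the schema that DatabaseReporter actually writes is not documented here, so inspect your own `results.db` before adapting the query.

```python
import sqlite3

# In-memory database with a hypothetical schema and made-up sample rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE runs (run_date TEXT, case_id TEXT, passed INTEGER, cost REAL)"
)
conn.executemany(
    "INSERT INTO runs VALUES (?, ?, ?, ?)",
    [
        ("2024-01-01", "greeting", 1, 0.010),
        ("2024-01-01", "refund", 1, 0.020),
        ("2024-01-02", "greeting", 1, 0.010),
        ("2024-01-02", "refund", 0, 0.030),
    ],
)

# Pass rate and total cost per day, oldest first.
trend = conn.execute(
    "SELECT run_date, AVG(passed), ROUND(SUM(cost), 4) "
    "FROM runs GROUP BY run_date ORDER BY run_date"
).fetchall()
conn.close()
```

A drop in the daily pass rate or a jump in daily cost between consecutive rows is the signal to look for when checking for regressions.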

Telemetry & Observability (OpenTelemetry)

Judge LLM includes optional OpenTelemetry instrumentation for deep observability into evaluation runs. It is disabled by default and adds no overhead unless enabled.

Quick Start

# Install telemetry dependencies
pip install judge-llm[telemetry]

# Run with console tracing
judge-llm run --config config.yaml --telemetry

# Run with OTLP exporter (Jaeger, Grafana Tempo, etc.)
judge-llm run --config config.yaml --telemetry --telemetry-exporter otlp

Arize Phoenix Integration

# Install Phoenix support
pip install judge-llm[phoenix]

# Start Phoenix server
pip install arize-phoenix && phoenix serve

# Run with Phoenix tracing
judge-llm run --config config.yaml --telemetry --telemetry-exporter phoenix
# Open http://localhost:6006 to view traces

Enable via YAML Config

agent:
  telemetry:
    enabled: true
    exporter: phoenix    # "console", "otlp", or "phoenix"
    endpoint: http://localhost:6006

Enable via Environment Variables

export JUDGE_LLM_TELEMETRY=true
export OTEL_EXPORTER_TYPE=phoenix
export PHOENIX_COLLECTOR_ENDPOINT=http://localhost:6006

What Gets Traced

Every evaluation creates a span tree with detailed attributes:

judge_llm.evaluate
├── judge_llm.execute_task          [per eval_case x provider x run]
│   ├── judge_llm.provider.execute  (success, cost, tokens)
│   │   ├── adk_http.create_session (HTTP status, session ID)
│   │   └── adk_http.send_and_collect (events, retries, errors)
│   └── judge_llm.evaluator.evaluate (score, passed/failed)
└── judge_llm.reporter.generate     [per reporter]

See the Telemetry Guide for complete documentation.

Development

# Setup
pip install -e ".[dev]"

# Run tests
pytest

# Format code
black judge_llm && ruff check judge_llm

Contributions welcome! Fork, create a feature branch, add tests, and submit a PR.

License

Licensed under CC BY-NC-SA 4.0 - Free for non-commercial use with attribution. See LICENSE for details.

For commercial licensing, contact the maintainers.
