The open-source post-building layer for Agent Behavior Monitoring.

The Continuous-Improvement Stack for Agents

Detect failures, triage root causes, and ship fixes backed by production data.

Overview

Judgeval is an open-source Python SDK for agent improvement. It provides tracing and agent-judge evaluation for LLM-powered applications — so you can detect failures, understand what went wrong, and validate fixes against real production cases before shipping.

To get started, dive into the docs.

Why Judgeval

OpenTelemetry-based tracing -- Instrument any function with @Tracer.observe(). Automatically captures inputs, outputs, and LLM token usage. Built on OpenTelemetry for full compatibility with existing observability stacks.

Agent judges -- Define prompt-based scorers to evaluate agent behaviors at scale. Judges produce structured behaviors — scored, labeled outputs that describe how your agent acted — which accumulate into a searchable record of agent behavior over time. Run judges against live production traffic or replay them on historical traces to validate fixes before shipping.

Online monitoring -- Automatically score live production traffic server-side with no latency impact. Detected behaviors surface as structured signals — configure Slack alerts so regressions and recurrences never go unnoticed.

Broad integrations -- Auto-instrumentation for OpenAI, Anthropic, Google GenAI, and Together AI. Framework support for LangGraph, OpenLit, and Claude Agent SDK.
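
To make the decorator-based tracing idea concrete, here is a minimal sketch of how a decorator can capture a function's inputs and output. It is a plain-Python illustration of the general pattern, not judgeval's implementation: the `observe` function and the `captured_spans` list below are hypothetical stand-ins, and a real tracer would export spans through OpenTelemetry rather than append them to a list.

```python
import functools

# Illustrative in-memory sink (assumption); a real tracer would export spans.
captured_spans = []

def observe(span_type="function"):
    """Hypothetical decorator that records a call's inputs and output."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            span = {
                "name": func.__name__,
                "span_type": span_type,
                "inputs": {"args": args, "kwargs": kwargs},
            }
            try:
                span["output"] = func(*args, **kwargs)
                return span["output"]
            except Exception as exc:
                # Record the failure, then let the caller see it unchanged.
                span["error"] = repr(exc)
                raise
            finally:
                # Runs on both success and failure, so every call leaves a span.
                captured_spans.append(span)
        return wrapper
    return decorator

@observe(span_type="tool")
def add(a, b):
    return a + b

add(2, 3)
```

After the call, `captured_spans` holds one record with the function name, the span type, the arguments, and the return value.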

Quickstart

Install the SDK:

pip install judgeval

Set your credentials:

export JUDGMENT_API_KEY=...
export JUDGMENT_ORG_ID=...
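
Before initializing the tracer, it can help to fail fast when either variable is missing. The following is a minimal sketch using only the standard library; the helper name `missing_credentials` is hypothetical and not part of judgeval.

```python
import os

REQUIRED_VARS = ("JUDGMENT_API_KEY", "JUDGMENT_ORG_ID")

def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_credentials()` at startup and raising an error when the list is non-empty surfaces a misconfigured environment before any traced code runs.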

Add observability to your agent with two lines of setup:

from judgeval import Tracer, wrap
from openai import OpenAI

Tracer.init(project_name="my-project")
client = wrap(OpenAI())

@Tracer.observe(span_type="tool")
def search(query: str) -> str:
    # `vector_db` is a placeholder for your own retrieval client; it is not provided by judgeval.
    results = vector_db.search(query)
    return results

@Tracer.observe(span_type="agent")
def run_agent(question: str) -> str:
    context = search(question)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"{context}\n\n{question}"}],
    )
    return response.choices[0].message.content

run_agent("What is the capital of the United States?")

Integrations

Supports OpenAI, Anthropic, Google GenAI, Together AI, LangGraph, OpenLit, and Claude Agent SDK. See the full integrations docs.

CLI

Manage agents, traces, judges, behaviors, and evaluations from the terminal. Query trace history, deploy judges, inspect detected behaviors, and run evals against production data — all without leaving your shell. See the CLI repo and docs.

MCP Server

Connect Judgment to any MCP-compatible AI tool. Query agent traces, invoke judges, browse detected behaviors, and surface failures directly inside your AI assistant or IDE. See the docs.

Judgeval is created and maintained by Judgment Labs.
