The open-source post-building layer for Agent Behavior Monitoring.

Agent Behavior Monitoring

Track and judge agent behavior in online and offline settings. Set up Sentry-style alerts and analyze agent behavior at scale.

Overview

Judgeval is an open-source Python SDK for agent behavior monitoring. It provides tracing, evaluation, and online monitoring for LLM-powered applications, enabling you to catch failures in real time and improve agents from production data.

To get started, try one of the cookbooks below or dive into the docs.

Why Judgeval

OpenTelemetry-based tracing -- Instrument any function with @Tracer.observe(). Automatically captures inputs, outputs, and LLM token usage. Built on OpenTelemetry for full compatibility with existing observability stacks.

Hosted and custom evaluation -- Run evaluations against Judgment's hosted scorers (faithfulness, answer relevancy, instruction adherence, etc.) or define your own Judge classes with binary, numeric, or categorical response types.

Online monitoring -- Score live production traffic asynchronously with Tracer.async_evaluate(). Runs server-side with no latency impact. Configure Slack alerts for failures.

Custom scorer hosting -- Upload arbitrary Python scorers to run in secure Firecracker microVMs. Any logic you can express in Python -- LLM-as-a-judge, code checks, multi-step pipelines -- can run as a hosted scorer.

Dataset management and prompt versioning -- Store golden evaluation sets, version prompt templates with {{variable}} syntax, and tag versions for production/staging workflows.

Broad integrations -- Auto-instrumentation for OpenAI, Anthropic, Google GenAI, and Together AI. Framework support for LangGraph, OpenLit, and Claude Agent SDK.

Quickstart

Install the SDK:

pip install judgeval

Set your credentials (create a free account if you don't have keys):

export JUDGMENT_API_KEY=...
export JUDGMENT_ORG_ID=...

Tracing

Add observability to your agent with two lines of setup:

from judgeval import Tracer, wrap
from openai import OpenAI

Tracer.init(project_name="my-project")
client = wrap(OpenAI())

@Tracer.observe(span_type="tool")
def search(query: str) -> str:
    results = vector_db.search(query)
    return results

@Tracer.observe(span_type="agent")
def run_agent(question: str) -> str:
    context = search(question)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"{context}\n\n{question}"}],
    )
    return response.choices[0].message.content

run_agent("What is the capital of the United States?")

All traces are delivered to your Judgment dashboard's trajectory view.

Online Monitoring

Score live traffic asynchronously inside any traced function. Evaluations run server-side after the span completes:

@Tracer.observe(span_type="agent")
def run_agent(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    answer = response.choices[0].message.content

    Tracer.async_evaluate(
        "answer_relevancy",
        {"input": question, "actual_output": answer},
    )

    return answer

Offline Evaluation

Use the Judgeval client to run batch evaluations against hosted scorers:

from judgeval import Judgeval
from judgeval.data import Example

client = Judgeval(project_name="my-project")
evaluation = client.evaluation.create()

results = evaluation.run(
    examples=[
        Example.create(
            input="What is 2+2?",
            actual_output="4",
            expected_output="4",
        ),
    ],
    scorers=["faithfulness", "answer_relevancy"],
    eval_run_name="nightly-eval",
)

Results are returned as ScoringResult objects and displayed in the dashboard.
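A minimal sketch of inspecting results programmatically (the attribute names success, scorers_data, name, score, and reason are assumptions about the ScoringResult shape, not a documented contract):

for result in results:
    # Assumed fields: a pass/fail flag plus per-scorer details.
    print("pass" if result.success else "fail")
    for scorer in result.scorers_data:
        print(f"  {scorer.name}: {scorer.score} -- {scorer.reason}")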

Custom Judges

Define your own evaluation logic by subclassing Judge with a response type:

from judgeval.judges import Judge
from judgeval.hosted.responses import BinaryResponse
from judgeval.data import Example

class CorrectnessJudge(Judge[BinaryResponse]):
    async def score(self, data: Example) -> BinaryResponse:
        correct = data["expected_output"].lower() in data["actual_output"].lower()
        return BinaryResponse(
            value=correct,
            reason="Contains expected answer" if correct else "Missing expected answer",
        )

Three response types are available:

Type                 Value  Use case
BinaryResponse       bool   Pass/fail checks
NumericResponse      float  Continuous scores (0.0 -- 1.0)
CategoricalResponse  str    Classification into defined categories
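A categorical judge follows the same pattern (a sketch assuming CategoricalResponse takes value and reason keyword arguments like BinaryResponse; the category names here are illustrative):

from judgeval.judges import Judge
from judgeval.hosted.responses import CategoricalResponse
from judgeval.data import Example

class ToneJudge(Judge[CategoricalResponse]):
    async def score(self, data: Example) -> CategoricalResponse:
        # Illustrative logic: classify the answer's tone into made-up categories.
        text = data["actual_output"].lower()
        category = "casual" if any(w in text for w in ("hey", "lol")) else "formal"
        return CategoricalResponse(
            value=category,
            reason=f"Detected a {category} tone",
        )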

Scaffold and upload your judge via the CLI:

judgeval scorer init -t binary -n CorrectnessJudge
judgeval scorer upload correctness_judge.py -p my-project

Once uploaded, your judge runs in a secure Firecracker microVM and can be used with Tracer.async_evaluate() for online monitoring.
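For example, mirroring the online monitoring snippet above, the uploaded judge can be referenced from a traced function (a sketch assuming hosted judges are addressed by name, the same string convention used for the built-in scorers):

from judgeval import Tracer, wrap
from openai import OpenAI

Tracer.init(project_name="my-project")
llm = wrap(OpenAI())

@Tracer.observe(span_type="agent")
def answer_question(question: str) -> str:
    response = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    answer = response.choices[0].message.content

    Tracer.async_evaluate(
        "CorrectnessJudge",  # assumed: hosted judges are referenced by class name
        {"input": question, "actual_output": answer},
    )

    return answer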

Datasets

Manage golden evaluation sets through the platform:

from judgeval import Judgeval
from judgeval.data import Example

client = Judgeval(project_name="my-project")

dataset = client.datasets.create(
    name="golden-set",
    examples=[
        Example.create(input="What is 2+2?", expected_output="4"),
        Example.create(input="Capital of France?", expected_output="Paris"),
    ],
)

dataset = client.datasets.get(name="golden-set")

Datasets support import from JSON/YAML, batch appending, and export.
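A common pattern is driving a batch evaluation from a stored dataset (a sketch; the examples attribute on the fetched dataset is an assumption about its shape):

dataset = client.datasets.get(name="golden-set")
evaluation = client.evaluation.create()

results = evaluation.run(
    examples=dataset.examples,  # assumed attribute holding the stored Examples
    scorers=["answer_relevancy"],
    eval_run_name="golden-set-run",
)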

Prompt Versioning

Version and tag prompt templates with {{variable}} placeholders:

client = Judgeval(project_name="my-project")

prompt = client.prompts.create(
    name="system-prompt",
    prompt="You are a helpful assistant for {{product}}. Answer in {{language}}.",
    tags=["production"],
)

prompt = client.prompts.get(name="system-prompt", tag="production")
compiled = prompt.compile(product="Acme Search", language="English")
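The compiled string can then feed a traced LLM call, so the exact prompt text is captured in the span inputs (a minimal sketch combining the tracing and prompt pieces above):

from judgeval import Tracer, wrap
from openai import OpenAI

Tracer.init(project_name="my-project")
llm = wrap(OpenAI())

@Tracer.observe(span_type="agent")
def answer(question: str) -> str:
    response = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": compiled},  # compiled prompt from above
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content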

Integrations

LLM Providers

Wrap any supported client with wrap() for automatic span creation and token/cost tracking:

from judgeval import wrap
from openai import OpenAI          # pip install openai
from anthropic import Anthropic    # pip install anthropic
from google import genai           # pip install google-genai
from together import Together      # pip install together

client = wrap(OpenAI())          # OpenAI
client = wrap(Anthropic())       # Anthropic
client = wrap(genai.Client())    # Google GenAI
client = wrap(Together())        # Together AI

Frameworks

Framework         Setup
LangGraph         from judgeval.integrations import Langgraph; Langgraph.initialize()
OpenLit           from judgeval.integrations import Openlit; Openlit.initialize()
Claude Agent SDK  from judgeval.integrations import setup_claude_agent_sdk; setup_claude_agent_sdk()
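For example, a LangGraph app only needs the initialize call before any graph runs (a sketch; it assumes Langgraph.initialize() auto-instruments subsequent graph executions, and the toy graph is purely illustrative):

from typing_extensions import TypedDict
from langgraph.graph import StateGraph, START, END
from judgeval import Tracer
from judgeval.integrations import Langgraph

Tracer.init(project_name="my-project")
Langgraph.initialize()  # assumed to hook LangGraph execution into Judgeval tracing

class State(TypedDict):
    question: str
    answer: str

def respond(state: State) -> dict:
    # Toy node; a real app would call an LLM here.
    return {"answer": f"You asked: {state['question']}"}

graph = StateGraph(State)
graph.add_node("respond", respond)
graph.add_edge(START, "respond")
graph.add_edge("respond", END)
app = graph.compile()

app.invoke({"question": "What is Judgeval?"})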

Cookbooks

Topic           Notebook        Description
Online ABM      Research Agent  Monitor agent behavior in production
Custom Scorers  HumanEval       Build custom evaluators for your agents

Browse the full cookbook repository or watch video tutorials.

Judgeval is created and maintained by Judgment Labs.
