
The open-source post-building layer for Agent Behavior Monitoring.

Agent Behavior Monitoring

Track and judge agent behavior in online and offline setups. Set up Sentry-style alerts and analyze agent behaviors at scale.


Overview

Judgeval is an open-source Python SDK for agent behavior monitoring. It provides tracing, evaluation, and online monitoring for LLM-powered applications, enabling you to catch failures in real time and improve agents from production data.

To get started, try one of the cookbooks below or dive into the docs.

Why Judgeval

OpenTelemetry-based tracing -- Instrument any function with @Tracer.observe(). Automatically captures inputs, outputs, and LLM token usage. Built on OpenTelemetry for full compatibility with existing observability stacks.

Hosted and custom evaluation -- Run evaluations against Judgment's hosted scorers (faithfulness, answer relevancy, instruction adherence, etc.) or define your own Judge classes with binary, numeric, or categorical response types.

Online monitoring -- Score live production traffic asynchronously with Tracer.async_evaluate(). Runs server-side with no latency impact. Configure Slack alerts for failures.

Custom scorer hosting -- Upload arbitrary Python scorers to run in secure Firecracker microVMs. Any logic you can express in Python -- LLM-as-a-judge, code checks, multi-step pipelines -- can run as a hosted scorer.

Dataset management and prompt versioning -- Store golden evaluation sets, version prompt templates with {{variable}} syntax, and tag versions for production/staging workflows.

Broad integrations -- Auto-instrumentation for OpenAI, Anthropic, Google GenAI, and Together AI. Framework support for LangGraph, OpenLit, and Claude Agent SDK.
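To make the tracing idea concrete, here is a toy decorator illustrating the capture pattern behind @Tracer.observe() -- a simplified sketch, not the SDK's actual OpenTelemetry-based implementation (the real SDK exports spans to a collector rather than a list):

```python
import functools
import time

# Toy span store; the real SDK exports spans via OpenTelemetry instead.
SPANS: list = []

def observe(span_type: str):
    """Toy decorator: records a function's inputs, output, and duration."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.time()
            result = fn(*args, **kwargs)
            SPANS.append({
                "name": fn.__name__,
                "span_type": span_type,
                "inputs": {"args": args, "kwargs": kwargs},
                "output": result,
                "duration_s": time.time() - start,
            })
            return result
        return wrapper
    return decorator

@observe(span_type="tool")
def add(a: int, b: int) -> int:
    return a + b

add(2, 3)
```

Each call produces one span record with the function name, span type, captured inputs, and output -- the same shape of data you see on a trace in the dashboard.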

Quickstart

Install the SDK:

pip install judgeval

Set your credentials (create a free account if you don't have keys):

export JUDGMENT_API_KEY=...
export JUDGMENT_ORG_ID=...

Tracing

Add observability to your agent with two lines of setup:

from judgeval import Tracer, wrap
from openai import OpenAI

Tracer.init(project_name="my-project")
client = wrap(OpenAI())

@Tracer.observe(span_type="tool")
def search(query: str) -> str:
    # vector_db is a placeholder for your own retrieval client
    results = vector_db.search(query)
    return results

@Tracer.observe(span_type="agent")
def run_agent(question: str) -> str:
    context = search(question)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"{context}\n\n{question}"}],
    )
    return response.choices[0].message.content

run_agent("What is the capital of the United States?")

All traces are delivered to your Judgment dashboard:

Judgment Platform Trajectory View

Online Monitoring

Score live traffic asynchronously inside any traced function. Evaluations run server-side after the span completes:

@Tracer.observe(span_type="agent")
def run_agent(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    answer = response.choices[0].message.content

    Tracer.async_evaluate(
        "answer_relevancy",
        {"input": question, "actual_output": answer},
    )

    return answer

Custom Scorer Online ABM

Offline Evaluation

Use the Judgeval client to run batch evaluations against hosted scorers:

from judgeval import Judgeval
from judgeval.data import Example

client = Judgeval(project_name="my-project")
evaluation = client.evaluation.create()

results = evaluation.run(
    examples=[
        Example.create(
            input="What is 2+2?",
            actual_output="4",
            expected_output="4",
        ),
    ],
    scorers=["faithfulness", "answer_relevancy"],
    eval_run_name="nightly-eval",
)

Results are returned as ScoringResult objects and displayed in the dashboard.
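A common next step is rolling results up into per-scorer pass rates. The sketch below uses plain dicts with illustrative field names (scorer, passed) -- these are not the ScoringResult object's actual attribute names:

```python
from collections import defaultdict

# Illustrative result records; field names are hypothetical,
# not the SDK's ScoringResult attributes.
results = [
    {"scorer": "faithfulness", "passed": True},
    {"scorer": "faithfulness", "passed": False},
    {"scorer": "answer_relevancy", "passed": True},
]

def pass_rates(results):
    """Fraction of passing examples per scorer."""
    totals, passes = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["scorer"]] += 1
        passes[r["scorer"]] += r["passed"]
    return {scorer: passes[scorer] / totals[scorer] for scorer in totals}

print(pass_rates(results))  # {'faithfulness': 0.5, 'answer_relevancy': 1.0}
```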

Custom Judges

Define your own evaluation logic by subclassing Judge with a response type:

from judgeval.judges import Judge
from judgeval.hosted.responses import BinaryResponse
from judgeval.data import Example

class CorrectnessJudge(Judge[BinaryResponse]):
    async def score(self, data: Example) -> BinaryResponse:
        correct = data["expected_output"].lower() in data["actual_output"].lower()
        return BinaryResponse(
            value=correct,
            reason="Contains expected answer" if correct else "Missing expected answer",
        )

Three response types are available:

Type                 Value  Use case
BinaryResponse       bool   Pass/fail checks
NumericResponse      float  Continuous scores (0.0 -- 1.0)
CategoricalResponse  str    Classification into defined categories
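To make the numeric case concrete, here is the kind of scoring logic a NumericResponse judge might wrap, written as a plain SDK-free function (the function itself is an illustration, not a hosted scorer):

```python
def token_overlap_score(expected: str, actual: str) -> float:
    """Fraction of expected tokens present in the actual output, in [0.0, 1.0]."""
    expected_tokens = set(expected.lower().split())
    actual_tokens = set(actual.lower().split())
    if not expected_tokens:
        return 1.0  # nothing expected: trivially satisfied
    return len(expected_tokens & actual_tokens) / len(expected_tokens)

token_overlap_score("the capital is Paris", "Paris is the capital of France")
```

Inside a Judge[NumericResponse] subclass, the score() method would return NumericResponse wrapping a value like this.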

Scaffold and upload a judge via the CLI:

judgeval scorer init -t binary -n CorrectnessJudge
judgeval scorer upload correctness_judge.py -p my-project

Once uploaded, your judge runs in a secure Firecracker microVM and can be used with Tracer.async_evaluate() for online monitoring.

Datasets

Manage golden evaluation sets through the platform:

from judgeval import Judgeval
from judgeval.data import Example

client = Judgeval(project_name="my-project")

dataset = client.datasets.create(
    name="golden-set",
    examples=[
        Example.create(input="What is 2+2?", expected_output="4"),
        Example.create(input="Capital of France?", expected_output="Paris"),
    ],
)

dataset = client.datasets.get(name="golden-set")

Datasets support import from JSON/YAML, batch appending, and export.
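A golden set as JSON might look like the sketch below; the field names mirror Example.create() above, though the exact import schema judgeval expects may differ:

```python
import json

# Hypothetical golden-set file contents; fields mirror Example.create().
golden_json = """[
  {"input": "What is 2+2?", "expected_output": "4"},
  {"input": "Capital of France?", "expected_output": "Paris"}
]"""

examples = json.loads(golden_json)
print(len(examples))  # 2
```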

Prompt Versioning

Version and tag prompt templates with {{variable}} placeholders:

client = Judgeval(project_name="my-project")

prompt = client.prompts.create(
    name="system-prompt",
    prompt="You are a helpful assistant for {{product}}. Answer in {{language}}.",
    tags=["production"],
)

prompt = client.prompts.get(name="system-prompt", tag="production")
compiled = prompt.compile(product="Acme Search", language="English")
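The compile step fills {{variable}} placeholders with keyword arguments. A simplified stand-in for prompt.compile() (not the SDK's implementation) can be sketched with a regex substitution:

```python
import re

def compile_template(template: str, **variables) -> str:
    """Simplified stand-in for prompt.compile(): fills {{name}} placeholders."""
    def replace(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"missing template variable: {name}")
        return str(variables[name])
    return re.sub(r"\{\{(\w+)\}\}", replace, template)

compile_template(
    "You are a helpful assistant for {{product}}. Answer in {{language}}.",
    product="Acme Search",
    language="English",
)
```

Raising on a missing variable (rather than leaving the placeholder in place) catches template/argument mismatches early; whether the SDK behaves the same way is an assumption here.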

Integrations

LLM Providers

Wrap any supported client with wrap() for automatic span creation and token/cost tracking:

from judgeval import wrap

client = wrap(OpenAI())          # OpenAI
client = wrap(Anthropic())       # Anthropic
client = wrap(genai.Client())    # Google GenAI
client = wrap(Together())        # Together AI

Frameworks

Framework         Setup
LangGraph         from judgeval.integrations import Langgraph; Langgraph.initialize()
OpenLit           from judgeval.integrations import Openlit; Openlit.initialize()
Claude Agent SDK  from judgeval.integrations import setup_claude_agent_sdk; setup_claude_agent_sdk()

See the docs for the full list of supported integrations.

Judgeval is created and maintained by Judgment Labs.
