
The open-source post-building layer for Agent Behavior Monitoring.

Agent Behavior Monitoring

Track and judge agent behavior in online and offline setups. Set up Sentry-style alerts and analyze agent behaviors at scale.


Overview

Judgeval is an open-source Python SDK for agent behavior monitoring. It provides tracing, evaluation, and online monitoring for LLM-powered applications, enabling you to catch failures in real time and improve agents from production data.

To get started, try one of the cookbooks below or dive into the docs.

Why Judgeval

OpenTelemetry-based tracing -- Instrument any function with @Tracer.observe(). Automatically captures inputs, outputs, and LLM token usage. Built on OpenTelemetry for full compatibility with existing observability stacks.

Hosted and custom evaluation -- Run evaluations against Judgment's hosted scorers (faithfulness, answer relevancy, instruction adherence, etc.) or define your own Judge classes with binary, numeric, or categorical response types.

Online monitoring -- Score live production traffic asynchronously with Tracer.async_evaluate(). Runs server-side with no latency impact. Configure Slack alerts for failures.

Custom scorer hosting -- Upload arbitrary Python scorers to run in secure Firecracker microVMs. Any logic you can express in Python -- LLM-as-a-judge, code checks, multi-step pipelines -- can run as a hosted scorer.

Dataset management and prompt versioning -- Store golden evaluation sets, version prompt templates with {{variable}} syntax, and tag versions for production/staging workflows.

Broad integrations -- Auto-instrumentation for OpenAI, Anthropic, Google GenAI, and Together AI. Framework support for LangGraph, OpenLit, and Claude Agent SDK.
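To make the tracing idea concrete, here is a toy decorator illustrating the capture pattern behind @Tracer.observe() -- a simplified sketch, not the SDK's actual OpenTelemetry-based implementation (the real SDK exports spans to a collector rather than a list):

```python
import functools
import time

# Toy span store; the real SDK exports spans via OpenTelemetry instead.
SPANS: list = []

def observe(span_type: str):
    """Toy decorator: records a function's inputs, output, and duration."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.time()
            result = fn(*args, **kwargs)
            SPANS.append({
                "name": fn.__name__,
                "span_type": span_type,
                "inputs": {"args": args, "kwargs": kwargs},
                "output": result,
                "duration_s": time.time() - start,
            })
            return result
        return wrapper
    return decorator

@observe(span_type="tool")
def add(a: int, b: int) -> int:
    return a + b

add(2, 3)
```

Each call produces one span record with the function name, span type, captured inputs, and output -- the same shape of data you see on a trace in the dashboard.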

Quickstart

Install the SDK:

pip install judgeval

Set your credentials (create a free account if you don't have keys):

export JUDGMENT_API_KEY=...
export JUDGMENT_ORG_ID=...

Tracing

Add observability to your agent with two lines of setup:

from judgeval import Tracer, wrap
from openai import OpenAI

Tracer.init(project_name="my-project")
client = wrap(OpenAI())

@Tracer.observe(span_type="tool")
def search(query: str) -> str:
    # vector_db is a placeholder for your own retrieval client
    results = vector_db.search(query)
    return results

@Tracer.observe(span_type="agent")
def run_agent(question: str) -> str:
    context = search(question)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"{context}\n\n{question}"}],
    )
    return response.choices[0].message.content

run_agent("What is the capital of the United States?")

All traces are delivered to your Judgment dashboard:

Judgment Platform Trajectory View

Online Monitoring

Score live traffic asynchronously inside any traced function. Evaluations run server-side after the span completes:

@Tracer.observe(span_type="agent")
def run_agent(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    answer = response.choices[0].message.content

    Tracer.async_evaluate(
        "answer_relevancy",
        {"input": question, "actual_output": answer},
    )

    return answer

Custom Scorer Online ABM

Offline Evaluation

Use the Judgeval client to run batch evaluations against hosted scorers:

from judgeval import Judgeval
from judgeval.data import Example

client = Judgeval(project_name="my-project")
evaluation = client.evaluation.create()

results = evaluation.run(
    examples=[
        Example.create(
            input="What is 2+2?",
            actual_output="4",
            expected_output="4",
        ),
    ],
    scorers=["faithfulness", "answer_relevancy"],
    eval_run_name="nightly-eval",
)

Results are returned as ScoringResult objects and displayed in the dashboard.
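A common next step is rolling results up into per-scorer pass rates. The sketch below uses plain dicts with illustrative field names (scorer, passed) -- these are not the ScoringResult object's actual attribute names:

```python
from collections import defaultdict

# Illustrative result records; field names are hypothetical,
# not the SDK's ScoringResult attributes.
results = [
    {"scorer": "faithfulness", "passed": True},
    {"scorer": "faithfulness", "passed": False},
    {"scorer": "answer_relevancy", "passed": True},
]

def pass_rates(results):
    """Fraction of passing examples per scorer."""
    totals, passes = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["scorer"]] += 1
        passes[r["scorer"]] += r["passed"]
    return {scorer: passes[scorer] / totals[scorer] for scorer in totals}

print(pass_rates(results))  # {'faithfulness': 0.5, 'answer_relevancy': 1.0}
```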

Custom Judges

Define your own evaluation logic by subclassing Judge with a response type:

from judgeval.judges import Judge
from judgeval.hosted.responses import BinaryResponse
from judgeval.data import Example

class CorrectnessJudge(Judge[BinaryResponse]):
    async def score(self, data: Example) -> BinaryResponse:
        correct = data["expected_output"].lower() in data["actual_output"].lower()
        return BinaryResponse(
            value=correct,
            reason="Contains expected answer" if correct else "Missing expected answer",
        )

Three response types are available:

Type                 Value  Use case
BinaryResponse       bool   Pass/fail checks
NumericResponse      float  Continuous scores (0.0 -- 1.0)
CategoricalResponse  str    Classification into defined categories
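To make the numeric case concrete, here is the kind of scoring logic a NumericResponse judge might wrap, written as a plain SDK-free function (the function itself is an illustration, not a hosted scorer):

```python
def token_overlap_score(expected: str, actual: str) -> float:
    """Fraction of expected tokens present in the actual output, in [0.0, 1.0]."""
    expected_tokens = set(expected.lower().split())
    actual_tokens = set(actual.lower().split())
    if not expected_tokens:
        return 1.0  # nothing expected: trivially satisfied
    return len(expected_tokens & actual_tokens) / len(expected_tokens)

token_overlap_score("the capital is Paris", "Paris is the capital of France")
```

Inside a Judge[NumericResponse] subclass, the score() method would return NumericResponse wrapping a value like this.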

Scaffold and upload a judge via the CLI:

judgeval scorer init -t binary -n CorrectnessJudge
judgeval scorer upload correctness_judge.py -p my-project

Once uploaded, your judge runs in a secure Firecracker microVM and can be used with Tracer.async_evaluate() for online monitoring.

Datasets

Manage golden evaluation sets through the platform:

from judgeval import Judgeval
from judgeval.data import Example

client = Judgeval(project_name="my-project")

dataset = client.datasets.create(
    name="golden-set",
    examples=[
        Example.create(input="What is 2+2?", expected_output="4"),
        Example.create(input="Capital of France?", expected_output="Paris"),
    ],
)

dataset = client.datasets.get(name="golden-set")

Datasets support import from JSON/YAML, batch appending, and export.
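A golden set as JSON might look like the sketch below; the field names mirror Example.create() above, though the exact import schema judgeval expects may differ:

```python
import json

# Hypothetical golden-set file contents; fields mirror Example.create().
golden_json = """[
  {"input": "What is 2+2?", "expected_output": "4"},
  {"input": "Capital of France?", "expected_output": "Paris"}
]"""

examples = json.loads(golden_json)
print(len(examples))  # 2
```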

Prompt Versioning

Version and tag prompt templates with {{variable}} placeholders:

client = Judgeval(project_name="my-project")

prompt = client.prompts.create(
    name="system-prompt",
    prompt="You are a helpful assistant for {{product}}. Answer in {{language}}.",
    tags=["production"],
)

prompt = client.prompts.get(name="system-prompt", tag="production")
compiled = prompt.compile(product="Acme Search", language="English")
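The compile step fills {{variable}} placeholders with keyword arguments. A simplified stand-in for prompt.compile() (not the SDK's implementation) can be sketched with a regex substitution:

```python
import re

def compile_template(template: str, **variables) -> str:
    """Simplified stand-in for prompt.compile(): fills {{name}} placeholders."""
    def replace(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"missing template variable: {name}")
        return str(variables[name])
    return re.sub(r"\{\{(\w+)\}\}", replace, template)

compile_template(
    "You are a helpful assistant for {{product}}. Answer in {{language}}.",
    product="Acme Search",
    language="English",
)
```

Raising on a missing variable (rather than leaving the placeholder in place) catches template/argument mismatches early; whether the SDK behaves the same way is an assumption here.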

Integrations

LLM Providers

Wrap any supported client with wrap() for automatic span creation and token/cost tracking:

from judgeval import wrap

client = wrap(OpenAI())          # OpenAI
client = wrap(Anthropic())       # Anthropic
client = wrap(genai.Client())    # Google GenAI
client = wrap(Together())        # Together AI

Frameworks

Framework         Setup
LangGraph         from judgeval.integrations import Langgraph; Langgraph.initialize()
OpenLit           from judgeval.integrations import Openlit; Openlit.initialize()
Claude Agent SDK  from judgeval.integrations import setup_claude_agent_sdk; setup_claude_agent_sdk()

See the docs for the full list of supported integrations.

Judgeval is created and maintained by Judgment Labs.
