The open source post-building layer for Agent Behavior Monitoring.
Agent Behavior Monitoring
Track and judge agent behavior in online and offline setups. Set up Sentry-style alerts and analyze agent behaviors at scale.
Overview
Judgeval is an open-source Python SDK for agent behavior monitoring. It provides tracing, evaluation, and online monitoring for LLM-powered applications, enabling you to catch failures in real time and improve agents from production data.
To get started, try one of the cookbooks below or dive into the docs.
Why Judgeval
- OpenTelemetry-based tracing -- Instrument any function with @Tracer.observe(). Automatically captures inputs, outputs, and LLM token usage. Built on OpenTelemetry for full compatibility with existing observability stacks.
- Hosted and custom evaluation -- Run evaluations against Judgment's hosted scorers (faithfulness, answer relevancy, instruction adherence, etc.) or define your own Judge classes with binary, numeric, or categorical response types.
- Online monitoring -- Score live production traffic asynchronously with Tracer.async_evaluate(). Runs server-side with no latency impact. Configure Slack alerts for failures.
- Custom scorer hosting -- Upload arbitrary Python scorers to run in secure Firecracker microVMs. Any logic you can express in Python -- LLM-as-a-judge, code checks, multi-step pipelines -- can run as a hosted scorer.
- Dataset management and prompt versioning -- Store golden evaluation sets, version prompt templates with {{variable}} syntax, and tag versions for production/staging workflows.
- Broad integrations -- Auto-instrumentation for OpenAI, Anthropic, Google GenAI, and Together AI. Framework support for LangGraph, OpenLit, and Claude Agent SDK.
Quickstart
Install the SDK:
```bash
pip install judgeval
```
Set your credentials (create a free account if you don't have keys):
```bash
export JUDGMENT_API_KEY=...
export JUDGMENT_ORG_ID=...
```
Tracing
Add observability to your agent with two lines of setup:
```python
from judgeval import Tracer, wrap
from openai import OpenAI

Tracer.init(project_name="my-project")
client = wrap(OpenAI())

@Tracer.observe(span_type="tool")
def search(query: str) -> str:
    # vector_db stands in for whatever retrieval client your agent uses
    results = vector_db.search(query)
    return results

@Tracer.observe(span_type="agent")
def run_agent(question: str) -> str:
    context = search(question)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"{context}\n\n{question}"}],
    )
    return response.choices[0].message.content

run_agent("What is the capital of the United States?")
```
All traces are delivered to your Judgment dashboard.
Online Monitoring
Score live traffic asynchronously inside any traced function. Evaluations run server-side after the span completes:
```python
@Tracer.observe(span_type="agent")
def run_agent(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    answer = response.choices[0].message.content
    Tracer.async_evaluate(
        "answer_relevancy",
        {"input": question, "actual_output": answer},
    )
    return answer
```
Offline Evaluation
Use the Judgeval client to run batch evaluations against hosted scorers:
```python
from judgeval import Judgeval
from judgeval.data import Example

client = Judgeval(project_name="my-project")
evaluation = client.evaluation.create()

results = evaluation.run(
    examples=[
        Example.create(
            input="What is 2+2?",
            actual_output="4",
            expected_output="4",
        ),
    ],
    scorers=["faithfulness", "answer_relevancy"],
    eval_run_name="nightly-eval",
)
```
Results are returned as ScoringResult objects and displayed in the dashboard.
Custom Judges
Define your own evaluation logic by subclassing Judge with a response type:
```python
from judgeval.judges import Judge
from judgeval.hosted.responses import BinaryResponse
from judgeval.data import Example

class CorrectnessJudge(Judge[BinaryResponse]):
    async def score(self, data: Example) -> BinaryResponse:
        correct = data["expected_output"].lower() in data["actual_output"].lower()
        return BinaryResponse(
            value=correct,
            reason="Contains expected answer" if correct else "Missing expected answer",
        )
```
Three response types are available:
| Type | Value | Use case |
|---|---|---|
| BinaryResponse | bool | Pass/fail checks |
| NumericResponse | float | Continuous scores (0.0 -- 1.0) |
| CategoricalResponse | str | Classification into defined categories |
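A numeric judge follows the same pattern. The sketch below assumes NumericResponse is importable from judgeval.hosted.responses alongside BinaryResponse and accepts the same value/reason arguments:

```python
from judgeval.judges import Judge
from judgeval.hosted.responses import NumericResponse  # assumed import path, mirroring BinaryResponse
from judgeval.data import Example

class OverlapJudge(Judge[NumericResponse]):
    """Hypothetical judge: fraction of expected tokens present in the actual output."""

    async def score(self, data: Example) -> NumericResponse:
        expected = set(data["expected_output"].lower().split())
        actual = set(data["actual_output"].lower().split())
        overlap = len(expected & actual) / len(expected) if expected else 0.0
        return NumericResponse(
            value=overlap,  # continuous score in [0.0, 1.0]
            reason=f"{overlap:.0%} of expected tokens found in actual output",
        )
```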
Scaffold and upload via the CLI:

```bash
judgeval scorer init -t binary -n CorrectnessJudge
judgeval scorer upload correctness_judge.py -p my-project
```
Once uploaded, your judge runs in a secure Firecracker microVM and can be used with Tracer.async_evaluate() for online monitoring.
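For instance, assuming an uploaded judge can be referenced by name just like the hosted scorers in the online monitoring example, wiring it into a traced function might look like this (a sketch, not verified against the API):

```python
@Tracer.observe(span_type="agent")
def run_agent(question: str, expected: str) -> str:
    answer = generate_answer(question)  # hypothetical helper standing in for your agent logic
    Tracer.async_evaluate(
        "CorrectnessJudge",  # assumption: uploaded judges are addressable by name like hosted scorers
        {"expected_output": expected, "actual_output": answer},
    )
    return answer
```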
Datasets
Manage golden evaluation sets through the platform:
```python
from judgeval import Judgeval
from judgeval.data import Example

client = Judgeval(project_name="my-project")

dataset = client.datasets.create(
    name="golden-set",
    examples=[
        Example.create(input="What is 2+2?", expected_output="4"),
        Example.create(input="Capital of France?", expected_output="Paris"),
    ],
)

dataset = client.datasets.get(name="golden-set")
```
Datasets support import from JSON/YAML, batch appending, and export.
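A stored dataset can also feed an offline evaluation run. The sketch below assumes attribute access on stored examples and a dataset.examples accessor, neither of which is confirmed above, so check the datasets docs for the exact API:

```python
from judgeval import Judgeval
from judgeval.data import Example

client = Judgeval(project_name="my-project")
dataset = client.datasets.get(name="golden-set")
evaluation = client.evaluation.create()

# Generate fresh outputs for each stored example, then score them.
examples = [
    Example.create(
        input=ex.input,                     # assumed attribute access on stored examples
        actual_output=run_agent(ex.input),  # run_agent from the quickstart
        expected_output=ex.expected_output,
    )
    for ex in dataset.examples              # hypothetical accessor; see the datasets docs
]

results = evaluation.run(
    examples=examples,
    scorers=["answer_relevancy"],
    eval_run_name="golden-set-regression",
)
```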
Prompt Versioning
Version and tag prompt templates with {{variable}} placeholders:
```python
from judgeval import Judgeval

client = Judgeval(project_name="my-project")

prompt = client.prompts.create(
    name="system-prompt",
    prompt="You are a helpful assistant for {{product}}. Answer in {{language}}.",
    tags=["production"],
)

prompt = client.prompts.get(name="system-prompt", tag="production")
compiled = prompt.compile(product="Acme Search", language="English")
```
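The compiled template is a plain string, so it can go straight into a model call. For example, as a system message with a wrapped OpenAI client as in the quickstart:

```python
from judgeval import wrap
from openai import OpenAI

openai_client = wrap(OpenAI())

response = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": compiled},  # compiled prompt from above
        {"role": "user", "content": "How do I reset my password?"},
    ],
)
print(response.choices[0].message.content)
```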
Integrations
LLM Providers
Wrap any supported client with wrap() for automatic span creation and token/cost tracking:
```python
from judgeval import wrap
from openai import OpenAI
from anthropic import Anthropic
from google import genai
from together import Together

client = wrap(OpenAI())        # OpenAI
client = wrap(Anthropic())     # Anthropic
client = wrap(genai.Client())  # Google GenAI
client = wrap(Together())      # Together AI
```
Frameworks
| Framework | Setup |
|---|---|
| LangGraph | from judgeval.integrations import Langgraph; Langgraph.initialize() |
| OpenLit | from judgeval.integrations import Openlit; Openlit.initialize() |
| Claude Agent SDK | from judgeval.integrations import setup_claude_agent_sdk; setup_claude_agent_sdk() |
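As a sketch of where the framework setup fits, here is a minimal LangGraph app traced through the integration. It assumes Langgraph.initialize() only needs to be called once before the graph runs; the graph itself uses the standard langgraph API:

```python
from typing import TypedDict

from judgeval import Tracer
from judgeval.integrations import Langgraph
from langgraph.graph import StateGraph, START, END

Tracer.init(project_name="my-project")
Langgraph.initialize()  # instrument LangGraph runs with judgeval tracing (per the table above)

class State(TypedDict):
    question: str
    answer: str

def answer_node(state: State) -> dict:
    # Placeholder node; in a real agent this would call your LLM.
    return {"answer": f"You asked: {state['question']}"}

builder = StateGraph(State)
builder.add_node("answer", answer_node)
builder.add_edge(START, "answer")
builder.add_edge("answer", END)
graph = builder.compile()

graph.invoke({"question": "What is the capital of France?"})
```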
Cookbooks
| Topic | Notebook | Description |
|---|---|---|
| Online ABM | Research Agent | Monitor agent behavior in production |
| Custom Scorers | HumanEval | Build custom evaluators for your agents |
Browse the full cookbook repository or watch video tutorials.
Links
- Documentation
- Judgment Platform
- Self-Hosting Guide
- Custom Scorers Guide
- Online Evaluation Guide
- Cookbook Repository
- Video Tutorials
Judgeval is created and maintained by Judgment Labs.