judgeval

The open source post-building layer for Agent Behavior Monitoring.

Project description

Agent Behavior Monitoring (ABM)

Track and judge any agent behavior in online and offline setups. Set up Sentry-style alerts and analyze agent behaviors / topic patterns at scale!

[NEW] 🎆 Agent Reinforcement Learning

Train your agents with multi-turn reinforcement learning using judgeval and Fireworks AI! Judgeval's ABM now integrates with Fireworks' Reinforcement Fine-Tuning (RFT) endpoint, supporting gpt-oss, qwen3, Kimi2, DeepSeek, and more.

Judgeval's agent monitoring infra provides a simple harness for integrating GRPO into any Python agent, giving builders a quick method to try RL with minimal code changes to their existing agents!

await trainer.train(
    agent_function=your_agent_function,  # entry point to your agent
    scorers=[RewardScorer()],  # Custom scorer you define based on task criteria, acts as reward
    prompts=training_prompts  # Tasks
)

That's it! Judgeval automatically manages trajectory collection and reward tagging - your agent can learn from production data with minimal code changes.
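For intuition about what GRPO does with those rewards: it samples a group of trajectories for the same prompt and scores each one relative to the group's mean reward, so no separate value model is needed. The sketch below shows only that normalization step in plain Python. `group_relative_advantages` is a hypothetical helper, not part of judgeval or Fireworks, and the hosted training pipeline may differ in detail.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    # eps avoids division by zero when every rollout earns the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of the same task, scored 0 or 1 by a reward scorer
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Rollouts that beat the group average get a positive advantage and the rest get a negative one. When all rewards in a group are equal, every advantage is zero and that group contributes no learning signal.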

👉 Check out the Wikipedia Racer notebook, where an agent learns to navigate Wikipedia using RL, to see Judgeval in action.

You can view and monitor training progress for free via the Judgment Dashboard.

Judgeval Overview

Judgeval is an open-source framework for agent behavior monitoring. Judgeval offers a toolkit to track and judge agent behavior in online and offline setups, enabling you to convert interaction data from production/test environments into improved agents. To get started, try running one of the notebooks below or dive deeper in our docs.

Our mission is to unlock the power of production data for agent development, enabling teams to improve their apps by catching real-time failures and optimizing over their users' preferences.

📚 Cookbooks

Try Out	Notebook	Description
RL	Wikipedia Racer	Train agents with reinforcement learning
Online ABM	Research Agent	Monitor agent behavior in production
Custom Scorers	HumanEval	Build custom evaluators for your agents
Offline Testing	[Get Started For Free]	Compare how different prompts, models, or agent configs affect performance across ANY metric

You can access our repo of cookbooks.

You can find a list of video tutorials for Judgeval use cases.

Why Judgeval?

🤖 Simple to run multi-turn RL: Optimize your agents with multi-turn RL without managing compute infrastructure or data pipelines. Just add a few lines of code to your existing agent code and train!

⚙️ Custom Evaluators: You are not restricted to monitoring with prefab scorers. Judgeval provides simple abstractions for custom Python scorers, supporting any LLM-as-a-judge rubrics/models and code-based scorers that integrate with our live agent-tracking infrastructure. Learn more

🚨 Production Monitoring: Run any custom scorer in a hosted, virtualized secure container to flag agent behaviors online in production. Get Slack alerts for failures and add custom hooks to address regressions before they impact users. Learn more

📊 Behavior/Topic Grouping: Group agent runs by behavior type or topic for deeper analysis. Drill down into subsets of users, agents, or use cases to reveal patterns of agent behavior.

🧪 Run experiments on your agents: Compare different prompts, models, or agent configs across customer segments. Measure which changes improve agent performance and decrease bad agent behaviors.

🛠️ Quickstart

Get started with Judgeval by installing our SDK using pip:

pip install judgeval

Ensure you have your JUDGMENT_API_KEY and JUDGMENT_ORG_ID environment variables set to connect to the Judgment Platform.

export JUDGMENT_API_KEY=...
export JUDGMENT_ORG_ID=...

If you don't have keys, create an account for free on the platform!
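Before starting a traced run, it can help to confirm that both variables are actually visible to the Python process. The snippet below is a small sketch using only the standard library. `missing_judgment_vars` is a hypothetical helper, not part of the judgeval SDK.

```python
import os
from collections.abc import Mapping

REQUIRED_VARS = ("JUDGMENT_API_KEY", "JUDGMENT_ORG_ID")


def missing_judgment_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example with an explicit mapping instead of the real environment
print(missing_judgment_vars({"JUDGMENT_API_KEY": "abc"}))  # → ['JUDGMENT_ORG_ID']
```

Calling `missing_judgment_vars()` with no argument checks the real environment. An empty list means both variables are set.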

Start monitoring with Judgeval

from judgeval.tracer import Tracer, wrap
from judgeval.data import Example
from judgeval.scorers import AnswerRelevancyScorer
from openai import OpenAI


judgment = Tracer(project_name="default_project")
client = wrap(OpenAI())  # tracks all LLM calls

@judgment.observe(span_type="tool")
def format_question(question: str) -> str:
    # dummy tool
    return f"Question : {question}"

@judgment.observe(span_type="function")
def run_agent(prompt: str) -> str:
    task = format_question(prompt)
    response = client.chat.completions.create(
        model="gpt-5-mini",
        messages=[{"role": "user", "content": task}]
    )
    answer = response.choices[0].message.content  # text of the model's reply

    judgment.async_evaluate(  # trigger online monitoring
        scorer=AnswerRelevancyScorer(threshold=0.5),  # swap with any scorer
        example=Example(input=task, actual_output=answer),  # customize to your data
        model="gpt-5",
    )
    return answer

run_agent("What is the capital of the United States?")

Running this code delivers monitoring results to your free platform account. They should look like this:

Judgment Platform Trajectory View

Customizable Scorers Over Agent Behavior

Judgeval's strongest suit is full customization of the scorers you can use for online monitoring. You are not restricted to single-prompt LLM judges or prefab scorers: if you can express your scorer in Python code, judgeval can monitor it! Under the hood, judgeval hosts your scorer in a virtualized secure container, enabling online monitoring for any scorer.

First, create a behavior scorer in a file called helpfulness_scorer.py:

from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer

# Define custom example class
class QuestionAnswer(Example):
    question: str
    answer: str

# Define a server-hosted custom scorer
class HelpfulnessScorer(ExampleScorer):
    name: str = "Helpfulness Scorer"
    server_hosted: bool = True  # Enable server hosting
    async def a_score_example(self, example: QuestionAnswer):
        # Custom scoring logic for agent behavior
        # Can be an arbitrary combination of code and LLM calls
        if len(example.answer) > 10 and "?" not in example.answer:
            self.reason = "Answer is detailed and provides helpful information"
            return 1.0
        else:
            self.reason = "Answer is too brief or unclear"
            return 0.0
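Before uploading, the scoring rule itself can be sanity-checked locally without any judgeval dependency. The sketch below mirrors the heuristic in `a_score_example` above using a plain dataclass. `QA` and `helpfulness` are hypothetical stand-ins for illustration, not part of the SDK.

```python
from dataclasses import dataclass


@dataclass
class QA:  # hypothetical stand-in for the QuestionAnswer example class
    question: str
    answer: str


def helpfulness(example: QA) -> tuple[float, str]:
    """Same rule as the scorer: the answer must be longer than 10 chars and not a question."""
    if len(example.answer) > 10 and "?" not in example.answer:
        return 1.0, "Answer is detailed and provides helpful information"
    return 0.0, "Answer is too brief or unclear"


print(helpfulness(QA("Capital of the US?", "Washington, D.C. is the capital.")))
print(helpfulness(QA("Capital of the US?", "Why do you ask?")))
```

The first call scores 1.0 and the second scores 0.0 because the answer contains a question mark. Keeping the rule in a pure function like this makes it easy to unit test before it runs against live traffic.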

Then deploy your scorer to Judgment's infrastructure:

echo "pydantic" > requirements.txt
uv run judgeval upload_scorer helpfulness_scorer.py requirements.txt

Now you can instrument your agent with monitoring and online evaluation:

from judgeval.tracer import Tracer, wrap
from helpfulness_scorer import HelpfulnessScorer, QuestionAnswer
from openai import OpenAI

judgment = Tracer(project_name="default_project")
client = wrap(OpenAI())  # tracks all LLM calls

@judgment.observe(span_type="tool")
def format_task(question: str) -> str:  # replace with your prompt engineering
    return f"Please answer the following question: {question}"

@judgment.observe(span_type="tool")
def answer_question(prompt: str) -> str:  # replace with your LLM system calls
    response = client.chat.completions.create(
        model="gpt-5-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

@judgment.observe(span_type="function")
def run_agent(question: str) -> str:
    task = format_task(question)
    answer = answer_question(task)

    # Add online evaluation with server-hosted scorer
    judgment.async_evaluate(
        scorer=HelpfulnessScorer(),
        example=QuestionAnswer(question=question, answer=answer),
        sampling_rate=0.9  # Evaluate 90% of agent runs
    )

    return answer

if __name__ == "__main__":
    result = run_agent("What is the capital of the United States?")
    print(result)

Congratulations! Your online eval result should look like this:

Custom Scorer Online ABM

You can now run any online scorer in secure Firecracker microVMs with no latency impact on your applications.
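The `sampling_rate=0.9` argument in the example above is described as evaluating 90% of agent runs. One common way to implement that kind of sampling is an independent random draw per run, sketched below in plain Python. This illustrates the general technique only and is an assumption, not a description of judgeval's internals. `should_evaluate` is a hypothetical name.

```python
import random


def should_evaluate(sampling_rate: float, rng: random.Random) -> bool:
    """Return True for roughly `sampling_rate` of calls (Bernoulli sampling)."""
    if not 0.0 <= sampling_rate <= 1.0:
        raise ValueError("sampling_rate must be between 0 and 1")
    return rng.random() < sampling_rate


rng = random.Random(0)  # seeded so the run is reproducible
runs = 10_000
evaluated = sum(should_evaluate(0.9, rng) for _ in range(runs))
print(f"evaluated {evaluated / runs:.1%} of runs")
```

Lowering the rate trades evaluation coverage for lower scoring cost, which matters most when the scorer itself calls an LLM.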

Judgeval is created and maintained by Judgment Labs.

Project details

Release history

1.0.5

Apr 9, 2026

1.0.4

Apr 6, 2026

1.0.3

Apr 4, 2026

1.0.2

Mar 27, 2026

1.0.1

Mar 27, 2026

1.0.0

Mar 26, 2026

0.32.1

Mar 24, 2026

0.32.0

Mar 18, 2026

0.31.0

Mar 16, 2026

0.30.0

Mar 10, 2026

This version

0.29.0

Mar 4, 2026

0.28.1

Feb 28, 2026

0.28.0

Feb 19, 2026

0.27.1

Feb 10, 2026

0.27.0

Feb 6, 2026

0.26.2

Feb 5, 2026

0.26.1

Feb 4, 2026

0.26.0

Jan 29, 2026

0.25.1

Jan 28, 2026

0.25.0

Jan 27, 2026

0.24.3

Jan 26, 2026

0.24.2

Jan 23, 2026

0.24.1

Jan 22, 2026

0.24.0

Jan 17, 2026

0.23.12

Dec 22, 2025

0.23.11

Dec 17, 2025

0.23.10

Dec 14, 2025

0.23.9

Dec 12, 2025

0.23.8

Dec 11, 2025

0.23.7

Dec 4, 2025

0.23.6

Dec 3, 2025

0.23.5

Dec 2, 2025

0.23.4

Dec 2, 2025

0.23.3

Dec 1, 2025

0.23.2

Nov 29, 2025

0.23.1

Nov 27, 2025

0.23.0

Nov 26, 2025

0.22.8

Nov 25, 2025

0.22.7

Nov 24, 2025

0.22.6

Nov 19, 2025

0.22.5

Nov 18, 2025

0.22.4

Nov 18, 2025

0.22.3

Nov 14, 2025

0.22.2

Nov 8, 2025

0.22.1

Nov 6, 2025

0.22.0

Nov 5, 2025

0.21.0

Nov 4, 2025

0.20.1

Oct 30, 2025

0.20.0

Oct 26, 2025

0.19.0

Oct 23, 2025

0.18.0

Oct 23, 2025

0.17.0

Oct 16, 2025

0.16.8

Oct 15, 2025

0.16.7

Oct 14, 2025

0.16.6

Oct 12, 2025

0.16.5

Oct 11, 2025

0.16.4

Oct 10, 2025

0.16.3

Oct 9, 2025

0.16.2

Oct 9, 2025

0.16.1

Oct 9, 2025

0.16.0

Oct 8, 2025

0.15.0

Oct 5, 2025

0.14.1

Sep 29, 2025

0.14.0

Sep 28, 2025

0.13.1

Sep 27, 2025

0.13.0

Sep 25, 2025

0.12.0

Sep 19, 2025

0.11.0

Sep 16, 2025

0.10.1

Sep 11, 2025

0.10.0

Sep 11, 2025

0.9.4

Sep 7, 2025

0.9.3

Sep 4, 2025

0.9.2

Sep 3, 2025

0.9.1

Sep 3, 2025

0.9.0

Sep 3, 2025

0.8.0

Aug 26, 2025

0.7.1

Aug 17, 2025

0.7.0

Aug 16, 2025

0.6.0

Aug 11, 2025

0.5.0

Aug 5, 2025

0.4.0

Aug 1, 2025

0.3.2

Jul 30, 2025

0.3.1

Jul 30, 2025

0.3.0

Jul 29, 2025

0.2.0

Jul 24, 2025

0.1.0

Jul 19, 2025

0.0.55

Jul 18, 2025

0.0.54

Jul 12, 2025

0.0.53

Jul 12, 2025

0.0.52

Jul 11, 2025

0.0.51

Jul 10, 2025

0.0.50

Jul 5, 2025

0.0.49

Jul 5, 2025

0.0.48

Jul 4, 2025

0.0.47

Jul 3, 2025

0.0.46

Jul 3, 2025

0.0.44

Jun 23, 2025

0.0.43

Jun 23, 2025

0.0.42

Jun 18, 2025

0.0.41

Jun 6, 2025

0.0.40

May 31, 2025

0.0.39

May 21, 2025

0.0.38

May 20, 2025

0.0.37

May 16, 2025

0.0.36

May 6, 2025

0.0.35

Apr 29, 2025

0.0.34

Apr 29, 2025

0.0.33

Apr 28, 2025

0.0.32

Apr 24, 2025

0.0.31

Apr 20, 2025

0.0.30

Apr 13, 2025

0.0.29

Apr 13, 2025

0.0.28

Apr 13, 2025

0.0.27

Apr 9, 2025

0.0.26

Apr 3, 2025

0.0.25

Mar 26, 2025

0.0.24

Mar 24, 2025

0.0.23

Mar 23, 2025

0.0.22

Mar 23, 2025

0.0.21

Mar 21, 2025

0.0.20

Mar 15, 2025

0.0.19

Mar 11, 2025

0.0.18

Mar 11, 2025

0.0.17

Mar 7, 2025

0.0.16

Mar 5, 2025

0.0.14

Mar 3, 2025

0.0.13

Feb 28, 2025

0.0.12

Feb 26, 2025

0.0.11

Feb 26, 2025

0.0.10

Feb 18, 2025

0.0.9

Feb 12, 2025

0.0.8

Feb 11, 2025

0.0.7

Feb 6, 2025

0.0.6

Feb 6, 2025

0.0.5

Feb 6, 2025

0.0.4

Feb 5, 2025

0.0.3

Jan 31, 2025

0.0.2

Jan 23, 2025

0.0.1

Jan 23, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

judgeval-0.29.0.tar.gz (23.2 MB)

Uploaded Mar 4, 2026 Source

Built Distribution

judgeval-0.29.0-py3-none-any.whl (123.8 kB)

Uploaded Mar 4, 2026 Python 3

File details

Details for the file judgeval-0.29.0.tar.gz.

File metadata

Download URL: judgeval-0.29.0.tar.gz
Upload date: Mar 4, 2026
Size: 23.2 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for judgeval-0.29.0.tar.gz
Algorithm	Hash digest
SHA256	`5fe0a10814389a3fa05779de56eae38e83b2a1453c696309abbdfb5f51f7fd35`
MD5	`e7119225196f6ffa3c9d939265d29e4d`
BLAKE2b-256	`c0dae590c9f5ec3521f5bc639787ba0e9dbfe5ab9cf8204bf9180d10e9b24e66`

See more details on using hashes here.

File details

Details for the file judgeval-0.29.0-py3-none-any.whl.

File metadata

Download URL: judgeval-0.29.0-py3-none-any.whl
Upload date: Mar 4, 2026
Size: 123.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for judgeval-0.29.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0aee0ff61d42c5d0b644e622999d34ddaf6ceb9b284ecf73aace1690c0f5d4f5`
MD5	`f4f4edd4bbc0a82e026561e02a35adbe`
BLAKE2b-256	`4020585c976e92a5f781ecebffd87862702219c0ef48a909970de851138370ea`

See more details on using hashes here.
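To check a downloaded file against the SHA256 digests listed above, the standard library's `hashlib` is enough. The sketch below streams the file in chunks and compares the result with the published hex digest. `sha256_of` and `matches` are hypothetical helper names, and the file path in the usage comment assumes the wheel was saved to the current directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large archives are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: Path, expected_hex: str) -> bool:
    """Compare the computed digest with the published one, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()


# Usage (assumes the wheel is in the current directory):
# matches(Path("judgeval-0.29.0-py3-none-any.whl"),
#         "0aee0ff61d42c5d0b644e622999d34ddaf6ceb9b284ecf73aace1690c0f5d4f5")
```

A `False` result means the file differs from what was uploaded and should not be installed.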
