Agent behavior testing for LLM apps. Auto-capture agent runs, eval tool calls and traces, catch regressions in CI.

Rubric

Agent behavior testing for LLM apps.

Test what your agent did — tools called, arguments, trace, latency — not just what it said. Catch regressions in CI before they ship.

Python 3.9+ · License: MIT


The problem

Your agent passed every manual check. Then a prompt tweak shipped, and it quietly stopped calling lookup_order and started answering from memory. The responses still look fine — string-match evals and LLM judges that only see the final output can't catch it.

Rubric tests the behavior: which tools were called, with what arguments, in what order, whether forbidden tools were avoided, how clean the reasoning trace was, and how fast it ran. Zero required dependencies, fully local, MIT.

pip install rubric-eval

Test a LangGraph agent in 60 seconds

No callbacks, no wrappers, no manual wiring. Rubric extracts tool calls, arguments, outputs, errors, the full trace, latency, and token usage from the messages your agent already produces.

import rubriceval as rubric
from langgraph.prebuilt import create_react_agent

# `model` and the three tools are assumed to be defined elsewhere in your app.
agent = create_react_agent(model, tools=[lookup_order, create_ticket, send_email])

report = rubric.evaluate(
    test_cases=rubric.run_langgraph(agent, scenarios=[
        rubric.AgentScenario(
            input="Where is my order #ORD-9821?",
            expected_tools=["lookup_order"],
        ),
        rubric.AgentScenario(
            input="My account is locked, this is urgent.",
            expected_tools=["create_ticket"],
            forbidden_tools=["send_email"],   # must not bypass the ticketing system
        ),
    ]),
    metrics=[
        rubric.ToolCallAccuracy(),            # right tools? no forbidden ones?
        rubric.TraceQuality(),                # no loops, within step budget?
        rubric.LatencyMetric(max_ms=3000),
    ],
    output_html="report.html",
    output_json="report.json",
)

Console output for the second scenario, where the agent sent an email instead of opening a ticket:

  [2/2] My account is locked, this is urgent.
    ❌ Score: 0.667
        ✗ tool_call_accuracy: 0.000 — Missing expected tools: ['create_ticket']; Called forbidden tools: ['send_email']
        ✓ trace_quality: 1.000 — Trace looks clean. 3 steps taken.
        ✓ latency: 1.000 — Latency 1840ms is within budget (3000ms).
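
The failing line above combines two checks: an expected tool is missing and a forbidden tool was called. A minimal plain-Python sketch of that kind of check follows. It is not Rubric's implementation: `check_tool_calls` and its all-or-nothing scoring are assumptions made for illustration, and the overall score is assumed to be the unweighted mean of the three metric scores, which is consistent with the 0.667 shown.

```python
from typing import List, Tuple


def check_tool_calls(
    called: List[str],
    expected: List[str],
    forbidden: List[str],
) -> Tuple[float, str]:
    """Hypothetical all-or-nothing check: 1.0 only if every expected tool
    was called and no forbidden tool was called."""
    missing = [t for t in expected if t not in called]
    violated = [t for t in forbidden if t in called]
    problems = []
    if missing:
        problems.append(f"Missing expected tools: {missing}")
    if violated:
        problems.append(f"Called forbidden tools: {violated}")
    if problems:
        return 0.0, "; ".join(problems)
    return 1.0, "All expected tools called; no forbidden tools called."


# The second scenario: the agent emailed instead of opening a ticket.
score, reason = check_tool_calls(
    called=["send_email"],
    expected=["create_ticket"],
    forbidden=["send_email"],
)

# Assumed aggregation: unweighted mean of the three metric scores
# (tool_call_accuracy, trace_quality, latency).
overall = round((score + 1.0 + 1.0) / 3, 3)
```

Checking the tools that were called, rather than the reply text, is what lets a test fail even when the final answer reads well.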

Already have a result from agent.invoke()? One call:

case = rubric.from_langgraph(result, expected_tools=["lookup_order"])

Not on LangGraph? rubric.from_messages() accepts any OpenAI-format message list (role / content / tool_calls), so it also works with raw OpenAI tool-calling loops. Langfuse and LangSmith trace exports load via load_langfuse() / load_langsmith(), and you can always construct an AgentTestCase by hand.
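
To show what an OpenAI-format message list carries, here is a small standalone sketch that pulls the tool calls out of one. It does not use or reproduce Rubric's code; `extract_tool_calls` is a hypothetical helper, and the `order_id` argument name is invented for the example.

```python
import json
from typing import Any, Dict, List


def extract_tool_calls(messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Collect tool calls, in order, from the assistant messages."""
    calls = []
    for message in messages:
        if message.get("role") != "assistant":
            continue
        for call in message.get("tool_calls") or []:
            function = call["function"]
            calls.append({
                "name": function["name"],
                # The OpenAI format encodes arguments as a JSON string.
                "arguments": json.loads(function["arguments"] or "{}"),
            })
    return calls


messages = [
    {"role": "user", "content": "Where is my order #ORD-9821?"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_1",
            "type": "function",
            "function": {
                "name": "lookup_order",
                "arguments": '{"order_id": "ORD-9821"}',
            },
        }],
    },
    {"role": "tool", "tool_call_id": "call_1", "content": "Shipped"},
    {"role": "assistant", "content": "Your order has shipped."},
]

tool_calls = extract_tool_calls(messages)
```

Everything a behavior test needs (tool name, arguments, call order) is already in the transcript, which is why no callbacks or wrappers are required.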


Catch regressions in CI

Rubric ships a GitHub Action that runs your evals on every PR, diffs against a baseline, and posts the result as a comment — like Codecov, but for agent behavior.

# .github/workflows/eval.yml
name: Agent Evals
on: [pull_request]

permissions:
  pull-requests: write

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: Kareem-Rashed/rubric-eval@v0.2.0
        with:
          eval-file: evals/regression.py
          baseline: evals/baseline.json
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

The PR comment looks like this:

🧪 Rubric eval — 🔻 1 regression

|           | Baseline       | Current       | Δ        |
|-----------|----------------|---------------|----------|
| Pass rate | 100.0% (12/12) | 91.7% (11/12) | 🔻 -0.08 |
| Avg score | 0.96           | 0.89          | 🔻 -0.07 |

🔻 Regressions (1)

  • Urgent — account locked — pass → fail (score 1.00 → 0.67)
    • tool_call_accuracy: 1.00 → 0.00 — Missing expected tools: ['create_ticket']; Called forbidden tools: ['send_email']

The same diff is available locally and in any CI system:

rubric run evals/regression.py --output-json current.json
rubric compare current.json --baseline evals/baseline.json --fail-on-regression

rubric compare flags pass→fail regressions, score drops on still-passing tests, fixed tests, and new/removed tests — with the failing metric's reason inline.
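
The categories listed above can be illustrated with a short baseline diff in plain Python. This is a sketch under assumptions, not the logic of `rubric compare`: the simplified result shape (test name mapped to `passed` and `score`), the `compare` function, and the sample data are all invented for illustration, and the real report JSON may be structured differently.

```python
from typing import Dict, List, TypedDict


class Result(TypedDict):
    passed: bool
    score: float


def compare(
    baseline: Dict[str, Result],
    current: Dict[str, Result],
) -> Dict[str, List[str]]:
    """Bucket each test by how it changed relative to the baseline."""
    diff: Dict[str, List[str]] = {
        "regressions": [], "score_drops": [], "fixed": [], "new": [], "removed": [],
    }
    for name, cur in current.items():
        base = baseline.get(name)
        if base is None:
            diff["new"].append(name)
        elif base["passed"] and not cur["passed"]:
            diff["regressions"].append(name)   # pass -> fail
        elif not base["passed"] and cur["passed"]:
            diff["fixed"].append(name)         # fail -> pass
        elif cur["passed"] and cur["score"] < base["score"]:
            diff["score_drops"].append(name)   # still passing, lower score
    diff["removed"] = [name for name in baseline if name not in current]
    return diff


# Made-up sample data.
baseline = {
    "urgent_account_locked": {"passed": True, "score": 1.0},
    "order_lookup": {"passed": True, "score": 1.0},
    "refund_request": {"passed": False, "score": 0.4},
    "legacy_greeting": {"passed": True, "score": 1.0},
}
current = {
    "urgent_account_locked": {"passed": False, "score": 0.67},
    "order_lookup": {"passed": True, "score": 0.9},
    "refund_request": {"passed": True, "score": 0.8},
    "password_reset": {"passed": True, "score": 1.0},
}

diff = compare(baseline, current)
# Assumed behavior mirroring --fail-on-regression: non-zero exit on any regression.
exit_code = 1 if diff["regressions"] else 0
```

Keeping score drops separate from regressions lets CI block on hard failures while still reporting tests that are drifting toward failure.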


Metrics

Agent behavior (the core)

| Metric | Checks |
|--------|--------|
| ToolCallAccuracy() | Expected tools called, forbidden tools avoided, optional order check |
| TraceQuality() | No loops, within step budget |
| TaskCompletion() | The task was actually finished |
| ToolCallEfficiency() | No redundant or wasted tool calls |
| SafetyCompliance() | No unsafe actions in the trace |
| ReasoningQuality() | Coherent multi-step reasoning |
| ContextUtilization() | Provided context was actually used |
| LatencyMetric(max_ms=...) | Within latency budget |
| CostMetric(max_cost_usd=...) | Within cost budget |
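
As an illustration of the "no loops, within step budget" idea, here is a minimal trace check in plain Python. It is a sketch, not how TraceQuality() is implemented: `check_trace`, the default budget of 10 steps, and the rule that more than two identical consecutive calls counts as a loop are all assumptions.

```python
from typing import List, Tuple


def check_trace(
    steps: List[Tuple[str, str]],
    max_steps: int = 10,
    max_repeats: int = 2,
) -> Tuple[bool, str]:
    """steps is a list of (tool_name, serialized_arguments) pairs."""
    if len(steps) > max_steps:
        return False, f"{len(steps)} steps exceeds budget of {max_steps}."
    run = 1  # length of the current run of identical consecutive calls
    for previous, current in zip(steps, steps[1:]):
        run = run + 1 if current == previous else 1
        if run > max_repeats:
            return False, f"Loop detected: {current[0]} repeated {run} times in a row."
    return True, f"Trace looks clean. {len(steps)} steps taken."


lookup = ("lookup_order", '{"order_id": "ORD-9821"}')  # hypothetical arguments

clean = check_trace([lookup, ("send_email", '{"to": "user"}')])
looping = check_trace([lookup, lookup, lookup])
```

Comparing arguments as well as tool names avoids flagging legitimate repeated calls, such as looking up two different orders in a row.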

Output quality

| Metric | Checks |
|--------|--------|
| LLMJudge(criteria=...) | Custom LLM-based scoring; works with any callable (OpenAI, Anthropic, Ollama) |
| GEval(name=..., criteria=...) | Chain-of-thought LLM evaluation |
| HallucinationScore() | Output grounded in the provided context (LLM judge or NLI mode) |
| SemanticSimilarity(threshold=...) | Embedding similarity vs expected output ([semantic] extra) |
| RougeScore() | ROUGE overlap for summarization ([rouge] extra) |
| ExactMatch() / Contains() / NotContains() / RegexMatch() | String checks, zero dependencies |

LLM-judge metrics support repeated runs with flakiness detection — Rubric reports the score variance so you know when your judge, not your agent, is the unstable part.
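
The idea behind variance-based flakiness detection can be sketched with the standard library alone. This is an illustration under assumptions, not Rubric's implementation: `judge_stability`, the choice of population variance, and the 0.01 threshold are invented, and the two judges are offline stand-ins for a real LLM call.

```python
import statistics
from itertools import cycle
from typing import Callable, Tuple


def judge_stability(
    judge: Callable[[str], float],
    output: str,
    runs: int = 5,
    max_variance: float = 0.01,
) -> Tuple[float, float, bool]:
    """Score the same output several times; return (mean, variance, flaky)."""
    scores = [judge(output) for _ in range(runs)]
    variance = statistics.pvariance(scores)
    return statistics.mean(scores), variance, variance > max_variance


# Stand-in judges so the sketch runs offline; a real judge would call an LLM.
_unstable_scores = cycle([0.9, 0.4, 0.8, 0.3, 0.9])


def stable_judge(text: str) -> float:
    return 0.8


def unstable_judge(text: str) -> float:
    return next(_unstable_scores)


stable = judge_stability(stable_judge, "Your order has shipped.")
unstable = judge_stability(unstable_judge, "Your order has shipped.")
```

Because the agent output is held fixed across runs, any spread in the scores comes from the judge rather than from the agent.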

Custom metrics

from rubriceval import BaseMetric, MetricResult

class NoApologySpam(BaseMetric):
    name = "no_apology_spam"
    threshold = 1.0

    def measure(self, test_case) -> MetricResult:
        count = test_case.actual_output.lower().count("sorry")
        return MetricResult(
            metric_name=self.name,
            score=1.0 if count <= 1 else 0.0,
            passed=count <= 1,
            reason=f"'sorry' appears {count} time(s).",
        )

pytest integration

Evals as regular tests, via the built-in rubric_eval fixture:

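import rubriceval as rubric

# `agent` is the LangGraph agent under test, built as in the quick-start above.
def test_agent_routes_urgent_requests(rubric_eval):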
    result = agent.invoke({"messages": [{"role": "user", "content": "Account locked, urgent!"}]})
    rubric_eval.add(
        rubric.from_langgraph(result,
                              expected_tools=["create_ticket"],
                              forbidden_tools=["send_email"]),
        metrics=[rubric.ToolCallAccuracy()],
    )
    # auto-asserts at end of test

CLI

rubric run evals/regression.py                      # run an eval file
rubric run evals/regression.py --output-html report.html --output-json report.json
rubric run evals/regression.py --quiet --fail-on-error
rubric compare current.json --baseline baseline.json --fail-on-regression
rubric version

The HTML report is a single self-contained file with per-test traces, tool calls, and per-metric breakdowns. Open it locally or attach it to CI artifacts; no server is needed.


Why Rubric

  • Behavior-first. Most eval frameworks score the final answer. Rubric's core abstraction is the agent run — tools, arguments, trace, latency — because that's where agent bugs actually live.
  • Zero wiring. run_langgraph() / from_messages() turn the messages you already have into test cases. No SDK to thread through your app.
  • CI-native. Baseline diffing, PR comments, and exit codes are built in, not a paid add-on.
  • Zero required dependencies, fully local. pip install rubric-eval pulls in nothing else. Your prompts and traces never leave your machine.
  • Independent and MIT-licensed. Not owned by an AI company or a platform vendor — no pressure to route your evals through anyone's cloud.

Roadmap

See ROADMAP.md. Next up: auto-capture for CrewAI, the OpenAI Agents SDK, and MCP servers; baseline auto-update on merge; dataset loaders.

Contributing

Contributions are welcome — the issues tagged good first issue are genuinely scoped to a first PR.

git clone https://github.com/Kareem-Rashed/rubric-eval
cd rubric-eval
pip install -e ".[dev]"
pytest tests/

See CONTRIBUTING.md for guidelines.

License

MIT © Kareem Rashed
