The independent LLM & AI agent evaluation framework. pytest for AI.

📐 Rubric

Python 3.9+ · License: MIT

Not owned by any AI company. Open source forever.

Now that Promptfoo has joined OpenAI, the community needs a neutral eval framework. Rubric is built by developers, for developers — no conflict of interest.


Why Rubric?

Feature                    | Rubric            | DeepEval            | Promptfoo
Open source                | ✅ MIT            | ✅ Apache           | ✅ MIT (now OpenAI-owned)
Agent trace evaluation     | ✅ First-class    | ❌ Limited          | ❌ No
Zero required dependencies |                   | ❌ Requires LLM API | ❌ Requires Node.js
Works with any LLM         | ✅ Any callable   |                     |
pytest integration         | ✅ Native fixture |                     | ❌ YAML-based
Local HTML dashboard       | ✅ Built-in       | 💰 Paid cloud       | ❌ No
Owned by AI company        | ❌ Independent    | ❌ Independent      | ✅ OpenAI

Install

pip install rubric-eval

# Optional extras (install what you need):
pip install "rubric-eval[semantic]"   # SemanticSimilarity metric
pip install "rubric-eval[openai]"     # LLM judge via OpenAI
pip install "rubric-eval[anthropic]"  # LLM judge via Anthropic
pip install "rubric-eval[all]"        # Everything

Quick Start

import rubriceval as rubric

# my_llm is your own function: it takes a prompt string and returns the model's reply

# 1. Define test cases
test_cases = [
    rubric.TestCase(
        input="What is the capital of France?",
        actual_output=my_llm("What is the capital of France?"),
        expected_output="The capital of France is Paris.",
    ),
    rubric.TestCase(
        input="What is 2 + 2?",
        actual_output=my_llm("What is 2 + 2?"),
        expected_output="4",
    ),
]

# 2. Run evaluation
report = rubric.evaluate(
    test_cases=test_cases,
    metrics=[
        rubric.ExactMatch(),
        rubric.Contains("Paris"),
        rubric.SemanticSimilarity(threshold=0.8),
    ],
    output_html="report.html",   # beautiful local dashboard
    output_json="report.json",   # for CI/CD
)

# 3. View results
# evaluate() prints a full summary when verbose=True.
# Call report.print_summary() yourself only if you pass verbose=False to evaluate().

Output:

🔍 Rubric — Running 2 test case(s) with 3 metric(s)...

  [1/2] What is the capital of France?
    ✅ Score: 1.000
        ✓ exact_match: 1.000
        ✓ contains: 1.000
        ✓ semantic_similarity: 0.952

  [2/2] What is 2 + 2?
    ✅ Score: 1.000

============================================================
  RUBRIC EVALUATION REPORT
  Total: 2   ✅ Passed: 2   Pass Rate: 100.0%   Avg Score: 1.000
============================================================

Agent Evaluation (Rubric's Superpower)

Unlike other frameworks that only check final output, Rubric evaluates the entire agent execution — tool calls, reasoning trace, latency, and task completion.

import rubriceval as rubric

# Your agent returns tool calls and a trace
result = my_agent.run("Book a flight from Cairo to Paris")

test = rubric.AgentTestCase(
    input="Book a flight from Cairo to Paris",
    actual_output=result.output,

    # Which tools MUST be called?
    expected_tools=["search_flights", "book_flight"],

    # Which tools must NOT be called? (safety guardrails)
    forbidden_tools=["send_email", "charge_card"],

    # Pass the actual tool calls your agent made
    tool_calls=result.tool_calls,

    # Pass the full reasoning trace
    trace=result.trace,

    latency_ms=result.latency_ms,
    max_steps=10,  # agent should complete in ≤ 10 steps
)

report = rubric.evaluate(
    test_cases=[test],
    metrics=[
        rubric.ToolCallAccuracy(check_order=True),  # Did it call the right tools?
        rubric.TraceQuality(penalize_loops=True),   # Did it avoid looping?
        rubric.TaskCompletion(),                     # Did it actually finish?
        rubric.LatencyMetric(max_ms=5000),           # Was it fast enough?
        rubric.CostMetric(max_cost_usd=0.05),        # Was it cheap enough?
    ],
)
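
To make concrete what a tool-call check can involve, here is a minimal sketch in plain Python. It is not Rubric's implementation: the helper name check_tool_calls and its pass/fail rules are assumptions made for illustration, and tool calls are simplified to a list of tool names.

```python
from typing import Dict, Sequence


def check_tool_calls(
    called: Sequence[str],
    expected: Sequence[str],
    forbidden: Sequence[str] = (),
    check_order: bool = False,
) -> Dict[str, object]:
    """Hypothetical helper: compare called tool names against expectations."""
    missing = [t for t in expected if t not in called]
    violations = [t for t in called if t in forbidden]

    in_order = True
    if check_order and not missing:
        # Expected tools must appear as a subsequence of the calls made.
        remaining = iter(called)
        in_order = all(tool in remaining for tool in expected)

    passed = not missing and not violations and in_order
    return {
        "passed": passed,
        "missing": missing,
        "violations": violations,
        "in_order": in_order,
    }


result = check_tool_calls(
    called=["search_flights", "book_flight"],
    expected=["search_flights", "book_flight"],
    forbidden=["send_email", "charge_card"],
    check_order=True,
)
print(result["passed"])
```

Treating order as a subsequence test lets the agent make extra calls between the expected ones while still requiring that the expected tools occur in sequence.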

pytest Integration

Rubric integrates natively with pytest — write your evals as regular tests.

# test_my_llm.py
import rubriceval as rubric

# my_llm and agent below are your own model function and agent object.

def test_factual_accuracy(rubric_eval):
    rubric_eval.add(
        rubric.TestCase(
            input="What is the capital of Egypt?",
            actual_output=my_llm("What is the capital of Egypt?"),
            expected_output="Cairo",
        ),
        metrics=[rubric.Contains("Cairo"), rubric.SemanticSimilarity(threshold=0.8)],
    )
    # Auto-asserts at end of test — no extra code needed

def test_agent_books_flight(rubric_eval):
    result = agent.run("Book a flight to Paris")
    rubric_eval.add(
        rubric.AgentTestCase(
            input="Book a flight to Paris",
            actual_output=result.output,
            expected_tools=["search_flights", "book_flight"],
            tool_calls=result.tool_calls,
        ),
        metrics=[rubric.ToolCallAccuracy(), rubric.TaskCompletion()],
    )

# Run the tests
pytest tests/ -v

LLM-as-Judge

Use any LLM to evaluate response quality with custom criteria. Works with OpenAI, Anthropic, Ollama, or any callable.

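import rubriceval as rubric
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def my_judge(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    # content can be None, so fall back to an empty string
    return response.choices[0].message.content or ""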

report = rubric.evaluate(
    test_cases=test_cases,
    metrics=[
        rubric.LLMJudge(
            criteria="Is the response accurate, concise, and helpful?",
            judge_fn=my_judge,
            threshold=0.7,
        ),
        rubric.GEval(
            name="coherence",
            criteria="The response is logically consistent and well-structured.",
            judge_fn=my_judge,
        ),
    ],
)
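
Because a judge is just a callable that takes a prompt string and returns a string, it can be wrapped like any other function. The sketch below caches identical prompts with the standard library so repeated runs do not pay for the same judgment twice. Here fake_judge is a stand-in for a real model call, and the reply text it returns is an assumption, not Rubric's expected format.

```python
from functools import lru_cache
from typing import Callable, List


def cached(judge_fn: Callable[[str], str], maxsize: int = 1024) -> Callable[[str], str]:
    """Wrap a judge callable so identical prompts are answered from memory."""

    @lru_cache(maxsize=maxsize)
    def wrapper(prompt: str) -> str:
        return judge_fn(prompt)

    return wrapper


calls: List[str] = []


def fake_judge(prompt: str) -> str:
    # Stand-in for a real LLM call; records each invocation.
    calls.append(prompt)
    return "score: 0.9"


judge = cached(fake_judge)
judge("Is this answer helpful?")
judge("Is this answer helpful?")  # served from the cache
print(len(calls))
```

Caching is only appropriate when repeatable scores are wanted, for example when the judge model runs at temperature 0.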

Available Metrics

String Matching (no dependencies)

Metric                 | Description
ExactMatch()           | Exact string comparison (case-insensitive by default)
Contains(substring)    | Output contains required string(s)
NotContains(forbidden) | Output does NOT contain forbidden strings
RegexMatch(pattern)    | Output matches a regex pattern
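
The checks in this table are conceptually simple. A rough plain-Python equivalent is sketched below, assuming case-insensitive comparison and whitespace trimming. The function names are illustrative, and Rubric's own normalization rules may differ.

```python
import re


def exact_match(actual: str, expected: str, ignore_case: bool = True) -> bool:
    a, e = actual.strip(), expected.strip()
    return a.lower() == e.lower() if ignore_case else a == e


def contains(actual: str, *required: str) -> bool:
    # Every required substring must be present.
    return all(s.lower() in actual.lower() for s in required)


def not_contains(actual: str, *forbidden: str) -> bool:
    # No forbidden substring may be present.
    return not any(s.lower() in actual.lower() for s in forbidden)


def regex_match(actual: str, pattern: str) -> bool:
    return re.search(pattern, actual) is not None


answer = "The capital of France is Paris."
print(exact_match(answer, "the capital of france is paris."))
print(contains(answer, "Paris"))
print(not_contains(answer, "London"))
print(regex_match(answer, r"\bParis\b"))
```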

Semantic (requires pip install "rubric-eval[semantic]")

Metric                            | Description
SemanticSimilarity(threshold=0.8) | Cosine similarity via sentence-transformers
RougeScore(rouge_type="rougeL")   | ROUGE overlap score for summarization
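
Cosine similarity compares the direction of two embedding vectors: 1.0 means the same direction and 0.0 means orthogonal. The sketch below computes it on hand-written vectors using only the standard library. In practice the vectors would come from an embedding model; the numbers here are made up for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


expected_vec = [1.0, 2.0, 3.0]
actual_vec = [2.0, 4.0, 6.0]  # same direction, different length
score = cosine_similarity(expected_vec, actual_vec)
print(score >= 0.8)
```

A threshold such as 0.8 then turns the continuous score into a pass or fail decision.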

LLM Judge (requires a judge LLM: a hosted API key or a local model)

Metric                        | Description
LLMJudge(criteria=...)        | Custom LLM-based scoring
GEval(name=..., criteria=...) | Chain-of-thought LLM evaluation

Agent & Performance

Metric                        | Description
ToolCallAccuracy()            | Were the right tools called? Were forbidden tools avoided?
TraceQuality()                | Did the agent avoid loops and stay within step budget?
TaskCompletion()              | Did the agent complete the task?
LatencyMetric(max_ms=5000)    | Was the response within latency budget?
CostMetric(max_cost_usd=0.01) | Was the API cost within budget?
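
One simple way to think about trace checks is sketched below: flag a loop when the same step repeats back-to-back more than a set number of times, and compare the step count to a budget. This is an assumed heuristic for illustration, not the logic TraceQuality actually uses, and the trace is simplified to a list of step labels.

```python
from typing import List, Optional


def has_loop(steps: List[str], max_repeats: int = 2) -> bool:
    """True if any step occurs more than max_repeats times in a row."""
    run = 0
    previous: Optional[str] = None
    for step in steps:
        run = run + 1 if step == previous else 1
        if run > max_repeats:
            return True
        previous = step
    return False


def within_budget(steps: List[str], max_steps: int) -> bool:
    return len(steps) <= max_steps


trace = ["search_flights", "search_flights", "search_flights", "book_flight"]
print(has_loop(trace))
print(within_budget(trace, max_steps=10))
```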

Custom Metrics

from rubriceval import BaseMetric, MetricResult

class MyCustomMetric(BaseMetric):
    name = "my_metric"
    threshold = 0.5

    def measure(self, test_case) -> MetricResult:
        score = 1.0 if "good" in test_case.actual_output else 0.0
        return MetricResult(
            metric_name=self.name,
            score=score,
            passed=score >= self.threshold,
            reason="Output contains 'good'." if score else "Output lacks 'good'.",
        )
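
The same pattern extends to structural checks. The sketch below scores whether the output parses as JSON. So that it runs without Rubric installed, it defines minimal stand-ins for BaseMetric, MetricResult, and a test case. These stand-ins are assumptions that mirror only the fields used in the example above; in a real project, replace them with the imports from rubriceval.

```python
import json
from dataclasses import dataclass


# Stand-ins (assumptions) so the sketch runs on its own.
# In a real project: from rubriceval import BaseMetric, MetricResult
@dataclass
class MetricResult:
    metric_name: str
    score: float
    passed: bool
    reason: str


class BaseMetric:
    name = "base"
    threshold = 0.5


@dataclass
class FakeTestCase:
    actual_output: str


class ValidJSON(BaseMetric):
    name = "valid_json"
    threshold = 1.0

    def measure(self, test_case) -> MetricResult:
        try:
            json.loads(test_case.actual_output)
            score, reason = 1.0, "Output is valid JSON."
        except (TypeError, ValueError) as exc:
            # json.JSONDecodeError is a subclass of ValueError.
            score, reason = 0.0, f"Output is not valid JSON: {exc}"
        return MetricResult(
            metric_name=self.name,
            score=score,
            passed=score >= self.threshold,
            reason=reason,
        )


good = ValidJSON().measure(FakeTestCase('{"city": "Paris"}'))
bad = ValidJSON().measure(FakeTestCase("Paris"))
print(good.passed, bad.passed)
```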

CLI

# Run an eval file
rubric run my_evals.py

# With HTML and JSON reports
rubric run my_evals.py --output-html report.html --output-json report.json

# Check version
rubric version

CI/CD Integration

# .github/workflows/eval.yml
name: LLM Evaluation
on: [push, pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install "rubric-eval[openai]"  # the openai extra is only needed if your evals use an OpenAI judge
      - run: rubric run evals/regression.py --output-json report.json
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - uses: actions/upload-artifact@v4
        if: always()  # upload the report even when the eval step fails
        with:
          name: rubric-report
          path: report.json

Or fail the run from Python by passing raise_on_failure=True:

rubric.evaluate(
    test_cases=test_cases,
    metrics=[...],
    raise_on_failure=True,  # calls sys.exit(1) if any test fails
)

Roadmap

  • 🌐 Web dashboard (local server with history)
  • 📊 Dataset management (load from CSV/JSONL)
  • 🔄 Regression detection (alert when pass rate drops)
  • 🔗 LangChain / LlamaIndex / CrewAI integrations
  • 📱 Slack/Discord notifications on eval failure
  • 🔴 Real-time production monitoring

Contributing

Rubric is built in the open. Contributions welcome!

git clone https://github.com/Kareem-Rashed/rubric-eval
cd rubric-eval
pip install -e ".[dev]"
pytest tests/

See CONTRIBUTING.md for guidelines.


License

MIT © Kareem Rashed


Built at AUC 🏛️ · Cairo, Egypt 🇪🇬
If Rubric saves you time, consider giving it a ⭐
