The independent LLM & AI agent evaluation framework. pytest for AI.

📐 Rubric

The independent LLM & AI agent evaluation framework.

Not owned by any AI company. Open source forever.

Now that Promptfoo has joined OpenAI, the community needs a neutral eval framework. Rubric is built by developers, for developers — no conflict of interest.

Why Rubric?

	Rubric	DeepEval	Promptfoo
Open source	✅ MIT	✅ Apache	✅ MIT (now OpenAI-owned)
Agent trace evaluation	✅ First-class	❌ Limited	❌ No
Zero required dependencies	✅	❌ Requires LLM API	❌ Requires Node.js
Works with any LLM	✅ Any callable	✅	✅
pytest integration	✅ Native fixture	✅	❌ YAML-based
Local HTML dashboard	✅ Built-in	💰 Paid cloud	❌ No
Owned by AI company	❌ Independent	❌ Independent	✅ OpenAI

Install

pip install rubric-eval

# Optional extras (install what you need):
pip install "rubric-eval[semantic]"   # SemanticSimilarity metric
pip install "rubric-eval[openai]"     # LLM judge via OpenAI
pip install "rubric-eval[anthropic]"  # LLM judge via Anthropic
pip install "rubric-eval[all]"        # Everything

Quick Start

import rubriceval as rubric

# my_llm is your own function: it takes a prompt string and returns the model's reply.

# 1. Define test cases
test_cases = [
    rubric.TestCase(
        input="What is the capital of France?",
        actual_output=my_llm("What is the capital of France?"),
        expected_output="The capital of France is Paris.",
    ),
    rubric.TestCase(
        input="What is 2 + 2?",
        actual_output=my_llm("What is 2 + 2?"),
        expected_output="4",
    ),
]

# 2. Run evaluation
report = rubric.evaluate(
    test_cases=test_cases,
    metrics=[
        rubric.ExactMatch(),
        rubric.Contains("Paris"),
        rubric.SemanticSimilarity(threshold=0.8),
    ],
    output_html="report.html",   # beautiful local dashboard
    output_json="report.json",   # for CI/CD
)

# 3. View results: evaluate() already prints a full summary when verbose=True.
# Call report.print_summary() yourself only if you passed verbose=False to evaluate().

Output:

🔍 Rubric — Running 2 test case(s) with 3 metric(s)...

  [1/2] What is the capital of France?
    ✅ Score: 1.000
        ✓ exact_match: 1.000
        ✓ contains: 1.000
        ✓ semantic_similarity: 0.952

  [2/2] What is 2 + 2?
    ✅ Score: 1.000

============================================================
  RUBRIC EVALUATION REPORT
  Total: 2   ✅ Passed: 2   Pass Rate: 100.0%   Avg Score: 1.000
============================================================

Agent Evaluation (Rubric's Superpower)

Unlike other frameworks that only check final output, Rubric evaluates the entire agent execution — tool calls, reasoning trace, latency, and task completion.

import rubriceval as rubric

# Your agent returns tool calls and a trace
result = my_agent.run("Book a flight from Cairo to Paris")

test = rubric.AgentTestCase(
    input="Book a flight from Cairo to Paris",
    actual_output=result.output,

    # Which tools MUST be called?
    expected_tools=["search_flights", "book_flight"],

    # Which tools must NOT be called? (safety guardrails)
    forbidden_tools=["send_email", "charge_card"],

    # Pass the actual tool calls your agent made
    tool_calls=result.tool_calls,

    # Pass the full reasoning trace
    trace=result.trace,

    latency_ms=result.latency_ms,
    max_steps=10,  # agent should complete in ≤ 10 steps
)

report = rubric.evaluate(
    test_cases=[test],
    metrics=[
        rubric.ToolCallAccuracy(check_order=True),  # Did it call the right tools?
        rubric.TraceQuality(penalize_loops=True),   # Did it avoid looping?
        rubric.TaskCompletion(),                     # Did it actually finish?
        rubric.LatencyMetric(max_ms=5000),           # Was it fast enough?
        rubric.CostMetric(max_cost_usd=0.05),        # Was it cheap enough?
    ],
)
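
The expected_tools, forbidden_tools, and check_order ideas in the example above can be sketched in plain Python. This is a minimal illustration of one plausible scoring rule, not Rubric's actual ToolCallAccuracy implementation: the function name `tool_call_score`, the scoring formula, and the assumption that tool calls are plain tool-name strings are all hypothetical.

```python
from typing import Sequence


def tool_call_score(
    called: Sequence[str],
    expected: Sequence[str],
    forbidden: Sequence[str] = (),
    check_order: bool = False,
) -> float:
    """Hypothetical rule: fraction of expected tools that were called.

    Returns 0.0 outright if any forbidden tool was called. With
    check_order=True, expected tools only count when they appear in the
    expected relative order (other calls may be interleaved).
    """
    if any(tool in forbidden for tool in called):
        return 0.0
    if not expected:
        return 1.0
    if check_order:
        # Walk the call list once, matching expected tools as a subsequence.
        matched = 0
        for tool in called:
            if matched < len(expected) and tool == expected[matched]:
                matched += 1
        return matched / len(expected)
    return sum(tool in called for tool in expected) / len(expected)


calls = ["search_flights", "book_flight"]
print(tool_call_score(calls, ["search_flights", "book_flight"], check_order=True))
print(tool_call_score(calls + ["charge_card"], ["search_flights"], ["charge_card"]))
```

Treating a forbidden call as an automatic zero is a design choice made for this sketch; a real metric might instead subtract a penalty.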

pytest Integration

Rubric integrates natively with pytest — write your evals as regular tests.

# test_my_llm.py
import rubriceval as rubric

# my_llm and agent are your own model function and agent object.
def test_factual_accuracy(rubric_eval):
    rubric_eval.add(
        rubric.TestCase(
            input="What is the capital of Egypt?",
            actual_output=my_llm("What is the capital of Egypt?"),
            expected_output="Cairo",
        ),
        metrics=[rubric.Contains("Cairo"), rubric.SemanticSimilarity(threshold=0.8)],
    )
    # Auto-asserts at end of test — no extra code needed

def test_agent_books_flight(rubric_eval):
    result = agent.run("Book a flight to Paris")
    rubric_eval.add(
        rubric.AgentTestCase(
            input="Book a flight to Paris",
            actual_output=result.output,
            expected_tools=["search_flights", "book_flight"],
            tool_calls=result.tool_calls,
        ),
        metrics=[rubric.ToolCallAccuracy(), rubric.TaskCompletion()],
    )

pytest tests/ -v

LLM-as-Judge

Use any LLM to evaluate response quality with custom criteria. Works with OpenAI, Anthropic, Ollama, or any callable.

import rubriceval as rubric
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def my_judge(prompt: str) -> str:
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

report = rubric.evaluate(
    test_cases=test_cases,
    metrics=[
        rubric.LLMJudge(
            criteria="Is the response accurate, concise, and helpful?",
            judge_fn=my_judge,
            threshold=0.7,
        ),
        rubric.GEval(
            name="coherence",
            criteria="The response is logically consistent and well-structured.",
            judge_fn=my_judge,
        ),
    ],
)
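
Because judge_fn is just a callable that takes a prompt string and returns a string (as my_judge does above), a canned stand-in can replace the real model for offline runs or CI jobs without an API key. The sketch below is a hypothetical helper, not part of Rubric; the reply format that LLMJudge or GEval expects is not specified here, so the reply text is a placeholder you would need to adapt.

```python
from typing import Callable, List, Tuple


def make_stub_judge(reply: str) -> Tuple[Callable[[str], str], List[str]]:
    """Return a judge that always answers `reply`, plus the prompts it has seen."""
    seen: List[str] = []

    def judge(prompt: str) -> str:
        # Record the prompt so a test can inspect what the metric asked.
        seen.append(prompt)
        return reply

    return judge, seen


# Placeholder reply: the format a real judge must return is an assumption.
stub_judge, prompts = make_stub_judge("placeholder verdict")
print(stub_judge("Is the response accurate?"))
print(len(prompts))
```

The recorded prompts make it possible to check what a metric sends to the judge without spending tokens.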

Available Metrics

String Matching (no dependencies)

Metric	Description
`ExactMatch()`	Exact string comparison (case-insensitive by default)
`Contains(substring)`	Output contains required string(s)
`NotContains(forbidden)`	Output does NOT contain forbidden strings
`RegexMatch(pattern)`	Output matches a regex pattern
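
As a rough guide to what these four checks amount to, here is a plain-Python sketch. It is not Rubric's implementation: the function names are hypothetical, and it assumes that Contains requires every listed substring and that RegexMatch searches anywhere in the output rather than matching the whole string.

```python
import re


def exact_match(actual: str, expected: str, case_sensitive: bool = False) -> bool:
    """Whole-string equality; ignores case unless case_sensitive=True."""
    if case_sensitive:
        return actual == expected
    return actual.casefold() == expected.casefold()


def contains(actual: str, *required: str) -> bool:
    """True only if every required substring appears in the output."""
    return all(s in actual for s in required)


def not_contains(actual: str, *forbidden: str) -> bool:
    """True only if none of the forbidden substrings appears in the output."""
    return not any(s in actual for s in forbidden)


def regex_match(actual: str, pattern: str) -> bool:
    """True if the pattern matches anywhere in the output."""
    return re.search(pattern, actual) is not None


answer = "The capital of France is Paris."
print(exact_match(answer, "the capital of france is paris."))
print(contains(answer, "Paris"), not_contains(answer, "London"))
print(regex_match(answer, r"capital of \w+"))
```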

Semantic (requires `pip install rubric-eval[semantic]`)

Metric	Description
`SemanticSimilarity(threshold=0.8)`	Cosine similarity via sentence-transformers
`RougeScore(rouge_type="rougeL")`	ROUGE overlap score for summarization
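
The thresholded cosine similarity behind a metric like SemanticSimilarity can be illustrated without any model. The sketch below uses small hand-written vectors in place of real sentence embeddings; the helper names are hypothetical, and how Rubric computes embeddings internally is not shown here.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)


def passes(a: Sequence[float], b: Sequence[float], threshold: float = 0.8) -> bool:
    """A test case passes when similarity reaches the threshold."""
    return cosine_similarity(a, b) >= threshold


# Toy vectors standing in for embeddings of the actual and expected outputs.
print(passes([1.0, 1.0], [1.0, 0.5]))
print(passes([1.0, 0.0], [0.0, 1.0]))
```

With real embeddings the vectors have hundreds of dimensions, but the comparison against the threshold works the same way.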

LLM Judge (requires an LLM API key)

Metric	Description
`LLMJudge(criteria=...)`	Custom LLM-based scoring
`GEval(name=..., criteria=...)`	Chain-of-thought LLM evaluation

Agent & Performance

Metric	Description
`ToolCallAccuracy()`	Were the right tools called? Were forbidden tools avoided?
`TraceQuality()`	Did the agent avoid loops and stay within step budget?
`TaskCompletion()`	Did the agent complete the task?
`LatencyMetric(max_ms=5000)`	Was the response within latency budget?
`CostMetric(max_cost_usd=0.01)`	Was the API cost within budget?
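
The loop and budget checks described in this table can be sketched in a few lines. This is an illustration under stated assumptions, not Rubric's TraceQuality or LatencyMetric code: it assumes a trace is a sequence of hashable steps such as (tool, argument) tuples, and it treats a "loop" as a step that exactly repeats the one before it.

```python
from typing import Hashable, Sequence


def count_repeated_steps(steps: Sequence[Hashable]) -> int:
    """Count steps that exactly repeat the step immediately before them."""
    return sum(1 for prev, cur in zip(steps, steps[1:]) if prev == cur)


def within_step_budget(steps: Sequence[Hashable], max_steps: int) -> bool:
    """True if the agent finished in at most max_steps steps."""
    return len(steps) <= max_steps


def within_latency_budget(latency_ms: float, max_ms: float) -> bool:
    """True if the run finished within the latency budget."""
    return latency_ms <= max_ms


# Hypothetical trace: the agent searched twice with identical arguments.
trace = [
    ("search_flights", "CAI-PAR"),
    ("search_flights", "CAI-PAR"),
    ("book_flight", "AF123"),
]
print(count_repeated_steps(trace))
print(within_step_budget(trace, max_steps=10))
print(within_latency_budget(4200, max_ms=5000))
```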

Custom Metrics

from rubriceval import BaseMetric, MetricResult

class MyCustomMetric(BaseMetric):
    name = "my_metric"
    threshold = 0.5

    def measure(self, test_case) -> MetricResult:
        score = 1.0 if "good" in test_case.actual_output else 0.0
        return MetricResult(
            metric_name=self.name,
            score=score,
            passed=score >= self.threshold,
            reason="Output contains 'good'." if score else "Output lacks 'good'.",
        )

CLI

# Run an eval file
rubric run my_evals.py

# With HTML and JSON reports
rubric run my_evals.py --output-html report.html --output-json report.json

# Check version
rubric version

CI/CD Integration

# .github/workflows/eval.yml
name: LLM Evaluation
on: [push, pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install rubric-eval
      - run: rubric run evals/regression.py --output-json report.json
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - uses: actions/upload-artifact@v4
        with:
          name: rubric-report
          path: report.json

Or fail the job from Python by passing raise_on_failure=True:

rubric.evaluate(
    test_cases=test_cases,
    metrics=[...],
    raise_on_failure=True,  # calls sys.exit(1) if any test fails
)

Roadmap

🌐 Web dashboard (local server with history)
📊 Dataset management (load from CSV/JSONL)
🔄 Regression detection (alert when pass rate drops)
🔗 LangChain / LlamaIndex / CrewAI integrations
📱 Slack/Discord notifications on eval failure
🔴 Real-time production monitoring

Contributing

Rubric is built in the open. Contributions welcome!

git clone https://github.com/Kareem-Rashed/rubric-eval
cd rubric-eval
pip install -e ".[dev]"
pytest tests/

See CONTRIBUTING.md for guidelines.

License

MIT © Kareem Rashed

Built at AUC 🏛️ · Cairo, Egypt 🇪🇬
If Rubric saves you time, consider giving it a ⭐

Project details

Release history

0.1.1 (this version): Mar 25, 2026
0.1.0: Mar 25, 2026

Download files

Source Distribution

rubric_eval-0.1.1.tar.gz (75.9 kB)

Uploaded Mar 25, 2026 Source

Built Distribution

rubric_eval-0.1.1-py3-none-any.whl (46.3 kB)

Uploaded Mar 25, 2026 Python 3

File details

Details for the file rubric_eval-0.1.1.tar.gz.

File metadata

Download URL: rubric_eval-0.1.1.tar.gz
Upload date: Mar 25, 2026
Size: 75.9 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for rubric_eval-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`f87a5a8e3ae2985ad11ef63e6676b48ccf2bac4b5759c3e8c10731ecfe32ade6`
MD5	`4ff1bdbf67130e5f66389976f535c095`
BLAKE2b-256	`0ca5ba04eebfe0ce216c19575c92ad8332c17459712e9b5d08104a84d1097ec8`

Provenance

The following attestation bundles were made for rubric_eval-0.1.1.tar.gz:

Publisher: publish.yml on Kareem-Rashed/rubric-eval

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: rubric_eval-0.1.1.tar.gz
- Subject digest: f87a5a8e3ae2985ad11ef63e6676b48ccf2bac4b5759c3e8c10731ecfe32ade6
- Sigstore transparency entry: 1179888628
- Sigstore integration time: Mar 25, 2026
Source repository:
- Permalink: Kareem-Rashed/rubric-eval@2022caa867013909394432465524377c7f909cc2
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/Kareem-Rashed
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@2022caa867013909394432465524377c7f909cc2
- Trigger Event: push

File details

Details for the file rubric_eval-0.1.1-py3-none-any.whl.

File metadata

Download URL: rubric_eval-0.1.1-py3-none-any.whl
Upload date: Mar 25, 2026
Size: 46.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for rubric_eval-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c28b96f8a530654d0541e1a38f6ee384aa47333ea5d5e2cb4e1e41b6e669ada9`
MD5	`54e69a217160d94f7aa3bc2de60ff3e5`
BLAKE2b-256	`30439c6f27b3405611b9f5f6fccbe63df2cda72c110a20d64cd1a67b4b438692`

Provenance

The following attestation bundles were made for rubric_eval-0.1.1-py3-none-any.whl:

Publisher: publish.yml on Kareem-Rashed/rubric-eval

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: rubric_eval-0.1.1-py3-none-any.whl
- Subject digest: c28b96f8a530654d0541e1a38f6ee384aa47333ea5d5e2cb4e1e41b6e669ada9
- Sigstore transparency entry: 1179888699
- Sigstore integration time: Mar 25, 2026
Source repository:
- Permalink: Kareem-Rashed/rubric-eval@2022caa867013909394432465524377c7f909cc2
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/Kareem-Rashed
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@2022caa867013909394432465524377c7f909cc2
- Trigger Event: push
