The independent LLM & AI agent evaluation framework. pytest for AI.
Project description
📐 Rubric
The independent LLM & AI agent evaluation framework.
Not owned by any AI company. Open source forever.
Now that Promptfoo has joined OpenAI, the community needs a neutral eval framework. Rubric is built by developers, for developers — no conflict of interest.
Why Rubric?
| Rubric | DeepEval | Promptfoo | |
|---|---|---|---|
| Open source | ✅ MIT | ✅ Apache | ✅ MIT (now OpenAI-owned) |
| Agent trace evaluation | ✅ First-class | ❌ Limited | ❌ No |
| Zero required dependencies | ✅ | ❌ Requires LLM API | ❌ Requires Node.js |
| Works with any LLM | ✅ Any callable | ✅ | ✅ |
| pytest integration | ✅ Native fixture | ✅ | ❌ YAML-based |
| Local HTML dashboard | ✅ Built-in | 💰 Paid cloud | ❌ No |
| Owned by AI company | ❌ Independent | ❌ Independent | ✅ OpenAI |
Install
pip install rubric-eval
# Optional extras (install what you need):
pip install "rubric-eval[semantic]" # SemanticSimilarity metric
pip install "rubric-eval[openai]" # LLM judge via OpenAI
pip install "rubric-eval[anthropic]" # LLM judge via Anthropic
pip install "rubric-eval[all]" # Everything
Quick Start
import rubriceval as rubric
# 1. Define test cases
test_cases = [
rubric.TestCase(
input="What is the capital of France?",
actual_output=my_llm("What is the capital of France?"),
expected_output="The capital of France is Paris.",
),
rubric.TestCase(
input="What is 2 + 2?",
actual_output=my_llm("What is 2 + 2?"),
expected_output="4",
),
]
# 2. Run evaluation
report = rubric.evaluate(
test_cases=test_cases,
metrics=[
rubric.ExactMatch(),
rubric.Contains("Paris"),
rubric.SemanticSimilarity(threshold=0.8),
],
output_html="report.html", # beautiful local dashboard
output_json="report.json", # for CI/CD
)
# 3. View results (evaluate() already prints a full summary when verbose=True)
# Call report.print_summary() again only if you set verbose=False above
Output:
🔍 Rubric — Running 2 test case(s) with 3 metric(s)...
[1/2] What is the capital of France?
✅ Score: 1.000
✓ exact_match: 1.000
✓ contains: 1.000
✓ semantic_similarity: 0.952
[2/2] What is 2 + 2?
✅ Score: 1.000
============================================================
RUBRIC EVALUATION REPORT
Total: 2 ✅ Passed: 2 Pass Rate: 100.0% Avg Score: 1.000
============================================================
Agent Evaluation (Rubric's Superpower)
Unlike other frameworks that only check final output, Rubric evaluates the entire agent execution — tool calls, reasoning trace, latency, and task completion.
import rubriceval as rubric
# Your agent returns tool calls and a trace
result = my_agent.run("Book a flight from Cairo to Paris")
test = rubric.AgentTestCase(
input="Book a flight from Cairo to Paris",
actual_output=result.output,
# Which tools MUST be called?
expected_tools=["search_flights", "book_flight"],
# Which tools must NOT be called? (safety guardrails)
forbidden_tools=["send_email", "charge_card"],
# Pass the actual tool calls your agent made
tool_calls=result.tool_calls,
# Pass the full reasoning trace
trace=result.trace,
latency_ms=result.latency_ms,
max_steps=10, # agent should complete in ≤ 10 steps
)
report = rubric.evaluate(
test_cases=[test],
metrics=[
rubric.ToolCallAccuracy(check_order=True), # Did it call the right tools?
rubric.TraceQuality(penalize_loops=True), # Did it avoid looping?
rubric.TaskCompletion(), # Did it actually finish?
rubric.LatencyMetric(max_ms=5000), # Was it fast enough?
rubric.CostMetric(max_cost_usd=0.05), # Was it cheap enough?
],
)
pytest Integration
Rubric integrates natively with pytest — write your evals as regular tests.
# test_my_llm.py
def test_factual_accuracy(rubric_eval):
rubric_eval.add(
rubric.TestCase(
input="What is the capital of Egypt?",
actual_output=my_llm("What is the capital of Egypt?"),
expected_output="Cairo",
),
metrics=[rubric.Contains("Cairo"), rubric.SemanticSimilarity(threshold=0.8)],
)
# Auto-asserts at end of test — no extra code needed
def test_agent_books_flight(rubric_eval):
result = agent.run("Book a flight to Paris")
rubric_eval.add(
rubric.AgentTestCase(
input="Book a flight to Paris",
actual_output=result.output,
expected_tools=["search_flights", "book_flight"],
tool_calls=result.tool_calls,
),
metrics=[rubric.ToolCallAccuracy(), rubric.TaskCompletion()],
)
pytest tests/ -v
LLM-as-Judge
Use any LLM to evaluate response quality with custom criteria. Works with OpenAI, Anthropic, Ollama, or any callable.
from openai import OpenAI
client = OpenAI()
def my_judge(prompt: str) -> str:
return client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
).choices[0].message.content
report = rubric.evaluate(
test_cases=test_cases,
metrics=[
rubric.LLMJudge(
criteria="Is the response accurate, concise, and helpful?",
judge_fn=my_judge,
threshold=0.7,
),
rubric.GEval(
name="coherence",
criteria="The response is logically consistent and well-structured.",
judge_fn=my_judge,
),
],
)
Available Metrics
String Matching (no dependencies)
| Metric | Description |
|---|---|
ExactMatch() |
Exact string comparison (case-insensitive by default) |
Contains(substring) |
Output contains required string(s) |
NotContains(forbidden) |
Output does NOT contain forbidden strings |
RegexMatch(pattern) |
Output matches a regex pattern |
Semantic (requires pip install rubric-eval[semantic])
| Metric | Description |
|---|---|
SemanticSimilarity(threshold=0.8) |
Cosine similarity via sentence-transformers |
RougeScore(rouge_type="rougeL") |
ROUGE overlap score for summarization |
LLM Judge (requires an LLM API key)
| Metric | Description |
|---|---|
LLMJudge(criteria=...) |
Custom LLM-based scoring |
GEval(name=..., criteria=...) |
Chain-of-thought LLM evaluation |
Agent & Performance
| Metric | Description |
|---|---|
ToolCallAccuracy() |
Were the right tools called? Were forbidden tools avoided? |
TraceQuality() |
Did the agent avoid loops and stay within step budget? |
TaskCompletion() |
Did the agent complete the task? |
LatencyMetric(max_ms=5000) |
Was the response within latency budget? |
CostMetric(max_cost_usd=0.01) |
Was the API cost within budget? |
Custom Metrics
from rubriceval import BaseMetric, MetricResult
class MyCustomMetric(BaseMetric):
name = "my_metric"
threshold = 0.5
def measure(self, test_case) -> MetricResult:
score = 1.0 if "good" in test_case.actual_output else 0.0
return MetricResult(
metric_name=self.name,
score=score,
passed=score >= self.threshold,
reason="Output contains 'good'." if score else "Output lacks 'good'.",
)
CLI
# Run an eval file
rubric run my_evals.py
# With HTML and JSON reports
rubric run my_evals.py --output-html report.html --output-json report.json
# Check version
rubric version
CI/CD Integration
# .github/workflows/eval.yml
name: LLM Evaluation
on: [push, pull_request]
jobs:
eval:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: "3.11"
- run: pip install rubric-eval
- run: rubric run evals/regression.py --output-json report.json
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
- uses: actions/upload-artifact@v4
with:
name: rubric-report
path: report.json
Or use raise_on_failure=True:
rubric.evaluate(
test_cases=test_cases,
metrics=[...],
raise_on_failure=True, # calls sys.exit(1) if any test fails
)
Roadmap
- 🌐 Web dashboard (local server with history)
- 📊 Dataset management (load from CSV/JSONL)
- 🔄 Regression detection (alert when pass rate drops)
- 🔗 LangChain / LlamaIndex / CrewAI integrations
- 📱 Slack/Discord notifications on eval failure
- 🔴 Real-time production monitoring
Contributing
Rubric is built in the open. Contributions welcome!
git clone https://github.com/kareemrashed/rubric-eval
cd rubric-eval
pip install -e ".[dev]"
pytest tests/
See CONTRIBUTING.md for guidelines.
License
MIT © Kareem Rashed
If Rubric saves you time, consider giving it a ⭐
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file rubric_eval-0.1.1.tar.gz.
File metadata
- Download URL: rubric_eval-0.1.1.tar.gz
- Upload date:
- Size: 75.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
f87a5a8e3ae2985ad11ef63e6676b48ccf2bac4b5759c3e8c10731ecfe32ade6
|
|
| MD5 |
4ff1bdbf67130e5f66389976f535c095
|
|
| BLAKE2b-256 |
0ca5ba04eebfe0ce216c19575c92ad8332c17459712e9b5d08104a84d1097ec8
|
Provenance
The following attestation bundles were made for rubric_eval-0.1.1.tar.gz:
Publisher:
publish.yml on Kareem-Rashed/rubric-eval
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
rubric_eval-0.1.1.tar.gz -
Subject digest:
f87a5a8e3ae2985ad11ef63e6676b48ccf2bac4b5759c3e8c10731ecfe32ade6 - Sigstore transparency entry: 1179888628
- Sigstore integration time:
-
Permalink:
Kareem-Rashed/rubric-eval@2022caa867013909394432465524377c7f909cc2 -
Branch / Tag:
refs/tags/v0.1.1 - Owner: https://github.com/Kareem-Rashed
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@2022caa867013909394432465524377c7f909cc2 -
Trigger Event:
push
-
Statement type:
File details
Details for the file rubric_eval-0.1.1-py3-none-any.whl.
File metadata
- Download URL: rubric_eval-0.1.1-py3-none-any.whl
- Upload date:
- Size: 46.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
c28b96f8a530654d0541e1a38f6ee384aa47333ea5d5e2cb4e1e41b6e669ada9
|
|
| MD5 |
54e69a217160d94f7aa3bc2de60ff3e5
|
|
| BLAKE2b-256 |
30439c6f27b3405611b9f5f6fccbe63df2cda72c110a20d64cd1a67b4b438692
|
Provenance
The following attestation bundles were made for rubric_eval-0.1.1-py3-none-any.whl:
Publisher:
publish.yml on Kareem-Rashed/rubric-eval
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
rubric_eval-0.1.1-py3-none-any.whl -
Subject digest:
c28b96f8a530654d0541e1a38f6ee384aa47333ea5d5e2cb4e1e41b6e669ada9 - Sigstore transparency entry: 1179888699
- Sigstore integration time:
-
Permalink:
Kareem-Rashed/rubric-eval@2022caa867013909394432465524377c7f909cc2 -
Branch / Tag:
refs/tags/v0.1.1 - Owner: https://github.com/Kareem-Rashed
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@2022caa867013909394432465524377c7f909cc2 -
Trigger Event:
push
-
Statement type: