Evaluation engine: RAGAS, DeepEval, LLM-as-Judge, and audit report generation
Project description
# rag-forge-evaluator

RAG pipeline evaluation engine for the RAG-Forge toolkit: RAGAS, DeepEval, LLM-as-Judge, and the RAG Maturity Model.
## Installation

```shell
pip install rag-forge-evaluator
```
## Usage

```python
from rag_forge_evaluator.assess import RMMAssessor

assessor = RMMAssessor()
result = assessor.assess(config={
    "retrieval_strategy": "hybrid",
    "input_guard_configured": True,
    "output_guard_configured": True,
})
print(result.badge)  # e.g., "RMM-3 Better Trust"
```
## Features
- RMM (RAG Maturity Model) scoring (levels 0-5)
- RAGAS, DeepEval, and LLM-as-Judge evaluators
- Golden set management with traffic sampling
- Cost estimation
- HTML and PDF report generation
## Bring your own judge provider
rag-forge-evaluator ships with Claude and OpenAI judges out of the box, but the JudgeProvider protocol is intentionally minimal so you can plug in any LLM — Gemini, Cohere, Bedrock, Ollama, vLLM, or a private model behind your own gateway. Implementing one is ~20 lines:
```python
# my_gemini_judge.py
import os

import google.generativeai as genai


class GeminiJudge:
    """Minimal judge implementation backed by Google Gemini."""

    def __init__(self, model: str = "gemini-2.5-pro", api_key: str | None = None) -> None:
        key = api_key or os.environ.get("GOOGLE_API_KEY")
        if not key:
            raise ValueError("GOOGLE_API_KEY not set")
        genai.configure(api_key=key)
        self._model_name = model
        self._client = genai.GenerativeModel(model)

    def judge(self, system_prompt: str, user_prompt: str) -> str:
        response = self._client.generate_content(
            [system_prompt, user_prompt],
            generation_config={"max_output_tokens": 4096},
        )
        return response.text or ""

    def model_name(self) -> str:
        return self._model_name
```
Wire it into an audit by passing the instance directly to LLMJudgeEvaluator:
```python
from my_gemini_judge import GeminiJudge
from rag_forge_evaluator.metrics.llm_judge import LLMJudgeEvaluator

judge = GeminiJudge(model="gemini-2.5-pro")
evaluator = LLMJudgeEvaluator(judge=judge)
result = evaluator.evaluate(samples)
```
The protocol contract:
```python
from typing import Protocol

class JudgeProvider(Protocol):
    def judge(self, system_prompt: str, user_prompt: str) -> str: ...
    def model_name(self) -> str: ...
```
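For quick experiments or unit tests, even a dependency-free stub satisfies the contract. The `StaticJudge` name below is illustrative, not part of the package:

```python
class StaticJudge:
    """Offline JudgeProvider stub that always returns a canned response."""

    def __init__(self, canned_response: str = '{"verdict": "pass", "score": 1.0}') -> None:
        self._canned_response = canned_response

    def judge(self, system_prompt: str, user_prompt: str) -> str:
        # Ignore the prompts entirely; a real judge would send them to a model.
        return self._canned_response

    def model_name(self) -> str:
        return "static-stub"
```

Passing an instance like this to `LLMJudgeEvaluator` lets you exercise metric plumbing without spending tokens.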
That's it. Anything that responds to those two methods works. Implementation hints:

- Always set `max_tokens` >= 4096 for faithfulness/hallucination metrics. Long responses produce 30-50 enumerated claims; smaller budgets truncate the JSON mid-array and the metric ends up skipped.
- Wrap your client with retry logic for transient 429/5xx errors. The Anthropic and OpenAI SDKs honor a `max_retries` constructor arg with built-in exponential backoff; most provider SDKs offer something similar.
- Return the raw response text, including any prose around the JSON. The shared response parser handles code fences, leading prose, trailing prose, and truncated output, so you don't need to clean anything up.
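For SDKs without built-in retries, the retry hint can be implemented once as a provider-agnostic wrapper. The sketch below assumes nothing beyond the two-method contract; `RetryingJudge` and its backoff parameters are illustrative, not part of the package:

```python
import time


class RetryingJudge:
    """Wraps any JudgeProvider and retries judge() on transient failures."""

    def __init__(self, inner, max_retries: int = 3, base_delay: float = 1.0) -> None:
        self._inner = inner
        self._max_retries = max_retries
        self._base_delay = base_delay

    def judge(self, system_prompt: str, user_prompt: str) -> str:
        last_error: Exception | None = None
        for attempt in range(self._max_retries + 1):
            try:
                return self._inner.judge(system_prompt, user_prompt)
            except Exception as exc:  # in practice, catch your SDK's 429/5xx error types
                last_error = exc
                if attempt < self._max_retries:
                    # Exponential backoff: base_delay * 1, 2, 4, ...
                    time.sleep(self._base_delay * (2 ** attempt))
        raise last_error

    def model_name(self) -> str:
        return self._inner.model_name()
```

Wrapping keeps retry policy out of each provider implementation, so the same backoff applies whether the inner judge talks to Gemini, Ollama, or a private gateway.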
First-party Gemini, Bedrock, and Ollama judges are tracked for v0.1.2.
## License

MIT
## File details

Details for the file `rag_forge_evaluator-0.2.1.tar.gz`.

### File metadata
- Download URL: rag_forge_evaluator-0.2.1.tar.gz
- Upload date:
- Size: 100.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `686e694cff0aedd6d88317a1bfc1aca3e5f6b18f6da13e9111e0fcbd27f5aa6f` |
| MD5 | `3b5c6c9005703997428cb6b66893a31d` |
| BLAKE2b-256 | `e0210892ce7cb93b5aa13680f8456e990e781cdaa257e1ea5bd22b98318eecdc` |
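To check a downloaded artifact against the SHA256 digest above, any standard hashing tool works; a minimal Python sketch:

```python
import hashlib


def sha256_of(path: str) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Compare against the digest published on this page:
# sha256_of("rag_forge_evaluator-0.2.1.tar.gz")
#   == "686e694cff0aedd6d88317a1bfc1aca3e5f6b18f6da13e9111e0fcbd27f5aa6f"
```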
### Provenance

The following attestation bundles were made for `rag_forge_evaluator-0.2.1.tar.gz`:

Publisher: `publish.yml` on hallengray/rag-forge

Statement:

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: rag_forge_evaluator-0.2.1.tar.gz
- Subject digest: `686e694cff0aedd6d88317a1bfc1aca3e5f6b18f6da13e9111e0fcbd27f5aa6f`
- Sigstore transparency entry: 1298639747
- Sigstore integration time:
- Permalink: hallengray/rag-forge@18947f4f7e14cfcfa5da4dd0a3066727b15b8a4d
- Branch / Tag: refs/tags/v0.2.1
- Owner: https://github.com/hallengray
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@18947f4f7e14cfcfa5da4dd0a3066727b15b8a4d
- Trigger Event: release
## File details

Details for the file `rag_forge_evaluator-0.2.1-py3-none-any.whl`.

### File metadata
- Download URL: rag_forge_evaluator-0.2.1-py3-none-any.whl
- Upload date:
- Size: 80.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `d76ac784ffe68fb02b2c591a7fb6c231caf618eec56e2aca7f3ebfab7d2aa490` |
| MD5 | `24e85037edf5ad7f28fbb2725884df47` |
| BLAKE2b-256 | `39851b83441ddbddad42ed08188124cac503c23c9a857144ac979a2365a08d3f` |
### Provenance

The following attestation bundles were made for `rag_forge_evaluator-0.2.1-py3-none-any.whl`:

Publisher: `publish.yml` on hallengray/rag-forge

Statement:

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: rag_forge_evaluator-0.2.1-py3-none-any.whl
- Subject digest: `d76ac784ffe68fb02b2c591a7fb6c231caf618eec56e2aca7f3ebfab7d2aa490`
- Sigstore transparency entry: 1298639799
- Sigstore integration time:
- Permalink: hallengray/rag-forge@18947f4f7e14cfcfa5da4dd0a3066727b15b8a4d
- Branch / Tag: refs/tags/v0.2.1
- Owner: https://github.com/hallengray
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@18947f4f7e14cfcfa5da4dd0a3066727b15b8a4d
- Trigger Event: release