Evaluation engine: RAGAS, DeepEval, LLM-as-Judge, and audit report generation

Maintainers

hgray

Project description

rag-forge-evaluator

RAG pipeline evaluation engine for the RAG-Forge toolkit: RAGAS, DeepEval, LLM-as-Judge, and the RAG Maturity Model.

Installation

pip install rag-forge-evaluator

Usage

from rag_forge_evaluator.assess import RMMAssessor

assessor = RMMAssessor()
result = assessor.assess(config={
    "retrieval_strategy": "hybrid",
    "input_guard_configured": True,
    "output_guard_configured": True,
})
print(result.badge)  # e.g., "RMM-3 Better Trust"

Features

- RMM (RAG Maturity Model) scoring (levels 0-5)
- RAGAS, DeepEval, and LLM-as-Judge evaluators
- Golden set management with traffic sampling
- Cost estimation
- HTML and PDF report generation

Bring your own judge provider

rag-forge-evaluator ships with Claude and OpenAI judges out of the box, but the JudgeProvider protocol is intentionally minimal so you can plug in any LLM — Gemini, Cohere, Bedrock, Ollama, vLLM, or a private model behind your own gateway. Implementing one is ~20 lines:

# my_gemini_judge.py
import os
import google.generativeai as genai


class GeminiJudge:
    """Minimal judge implementation backed by Google Gemini."""

    def __init__(self, model: str = "gemini-2.5-pro", api_key: str | None = None) -> None:
        key = api_key or os.environ.get("GOOGLE_API_KEY")
        if not key:
            raise ValueError("GOOGLE_API_KEY not set")
        genai.configure(api_key=key)
        self._model_name = model
        self._client = genai.GenerativeModel(model)

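    def judge(self, system_prompt: str, user_prompt: str) -> str:
        response = self._client.generate_content(
            [system_prompt, user_prompt],
            generation_config={"max_output_tokens": 4096},
        )
        try:
            return response.text or ""
        except ValueError:
            # response.text raises when the response has no text parts (e.g. a blocked prompt).
            return ""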

    def model_name(self) -> str:
        return self._model_name

Wire it into an audit by passing the instance directly to LLMJudgeEvaluator:

from my_gemini_judge import GeminiJudge
from rag_forge_evaluator.metrics.llm_judge import LLMJudgeEvaluator

judge = GeminiJudge(model="gemini-2.5-pro")
evaluator = LLMJudgeEvaluator(judge=judge)
result = evaluator.evaluate(samples)  # samples: your evaluation samples, prepared elsewhere

The protocol contract:

from typing import Protocol


class JudgeProvider(Protocol):
    def judge(self, system_prompt: str, user_prompt: str) -> str: ...
    def model_name(self) -> str: ...

That's it. Anything that responds to those two methods works. Implementation hints:

- Always set the output-token budget (max_tokens, or your provider's equivalent such as max_output_tokens for Gemini) to at least 4096 for faithfulness/hallucination metrics. Long responses produce 30-50 enumerated claims; smaller budgets truncate the JSON mid-array and the metric ends up skipped.
- Wrap your client with retry logic for transient 429/5xx errors. The Anthropic and OpenAI SDKs honor a max_retries constructor argument with built-in exponential backoff, and most provider SDKs offer something similar.
- Return the raw response text, including any prose around the JSON. The shared response parser handles code fences, leading prose, trailing prose, and truncated output, so you don't need to clean anything up.
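
For providers whose SDK has no built-in retry, the retry hint above could be sketched as a thin wrapper around any object that satisfies the JudgeProvider contract. This is only an illustration: RetryingJudge and is_transient are hypothetical names that are not part of rag-forge-evaluator, and the check assumes the SDK's exceptions expose an integer status_code attribute, which may not hold for your provider.

```python
import random
import time
from typing import Callable


def is_transient(exc: Exception) -> bool:
    """Assumption: the SDK's exceptions carry an integer `status_code` attribute."""
    status = getattr(exc, "status_code", None)
    return isinstance(status, int) and (status == 429 or 500 <= status < 600)


class RetryingJudge:
    """Hypothetical wrapper (not part of rag-forge-evaluator) that retries transient failures."""

    def __init__(
        self,
        inner,
        max_retries: int = 3,
        base_delay: float = 0.5,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._inner = inner
        self._max_retries = max_retries
        self._base_delay = base_delay
        self._sleep = sleep

    def judge(self, system_prompt: str, user_prompt: str) -> str:
        attempt = 0
        while True:
            try:
                return self._inner.judge(system_prompt, user_prompt)
            except Exception as exc:
                # Give up on non-transient errors or once the retry budget is spent.
                if attempt >= self._max_retries or not is_transient(exc):
                    raise
                # Exponential backoff with a little jitter: 0.5s, 1s, 2s, ...
                delay = self._base_delay * (2 ** attempt)
                self._sleep(delay + random.uniform(0, delay / 2))
                attempt += 1

    def model_name(self) -> str:
        return self._inner.model_name()
```

Because the wrapper exposes the same two methods, it can be passed wherever a judge is expected, for example LLMJudgeEvaluator(judge=RetryingJudge(GeminiJudge())).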

First-party Gemini, Bedrock, and Ollama judges are tracked for v0.1.2.

License

MIT

Release history

- 0.2.3: Apr 16, 2026
- 0.2.2: Apr 15, 2026 (this version)
- 0.2.1: Apr 14, 2026
- 0.1.3: Apr 13, 2026
- 0.1.2: Apr 13, 2026
- 0.1.1: Apr 13, 2026
- 0.1.0: Apr 13, 2026

Download files

Source Distribution

rag_forge_evaluator-0.2.2.tar.gz (116.1 kB)

Uploaded Apr 15, 2026 Source

Built Distribution

rag_forge_evaluator-0.2.2-py3-none-any.whl (84.5 kB)

Uploaded Apr 15, 2026 Python 3

File details

Details for the file rag_forge_evaluator-0.2.2.tar.gz.

File metadata

Download URL: rag_forge_evaluator-0.2.2.tar.gz
Upload date: Apr 15, 2026
Size: 116.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for rag_forge_evaluator-0.2.2.tar.gz
Algorithm	Hash digest
SHA256	`5cec6f0d4d69f7aefefef93444515cc09a0cfa792fec529218ea68f1f4516404`
MD5	`fe074f753143911bc48440a29c5c2ccc`
BLAKE2b-256	`67d2459f75375a7243af4baa12c807feec1e8f34fa6acc189b6c5badb4b64950`
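
A downloaded file can be checked against the SHA256 digest listed above with Python's standard hashlib module. The sketch below is one way to do it; the helper names sha256_of and verify_sha256 are illustrative, and the local path in the usage comment is an assumption about where the file was saved.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sha256(path: Path, expected_hex: str) -> bool:
    """Compare the file's digest with the published one (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()


# Usage (assumes the sdist was saved to the current directory):
# ok = verify_sha256(
#     Path("rag_forge_evaluator-0.2.2.tar.gz"),
#     "5cec6f0d4d69f7aefefef93444515cc09a0cfa792fec529218ea68f1f4516404",
# )
```

Alternatively, pip can enforce the same check at install time when the digest is pinned in a requirements file and installation runs with --require-hashes.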

Provenance

The following attestation bundles were made for rag_forge_evaluator-0.2.2.tar.gz:

Publisher: publish.yml on hallengray/rag-forge

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: rag_forge_evaluator-0.2.2.tar.gz
- Subject digest: 5cec6f0d4d69f7aefefef93444515cc09a0cfa792fec529218ea68f1f4516404
- Sigstore transparency entry: 1313174936
- Sigstore integration time: Apr 15, 2026
Source repository:
- Permalink: hallengray/rag-forge@8f45df5b96ef342caf6f39cec535193225bd4860
- Branch / Tag: refs/tags/v0.2.2
- Owner: https://github.com/hallengray
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@8f45df5b96ef342caf6f39cec535193225bd4860
- Trigger Event: release

File details

Details for the file rag_forge_evaluator-0.2.2-py3-none-any.whl.

File metadata

Download URL: rag_forge_evaluator-0.2.2-py3-none-any.whl
Upload date: Apr 15, 2026
Size: 84.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for rag_forge_evaluator-0.2.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a06e8b27764dd4b9de81859abe708d647c735bfd6148f8356bff74b21c185bb9`
MD5	`8d76f6fe8163965672ffc5684fb96f78`
BLAKE2b-256	`881036d4315446f1f30ca9d4100a57c9adbded0270b7d7bd5f1e24033b36bb37`

Provenance

The following attestation bundles were made for rag_forge_evaluator-0.2.2-py3-none-any.whl:

Publisher: publish.yml on hallengray/rag-forge

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: rag_forge_evaluator-0.2.2-py3-none-any.whl
- Subject digest: a06e8b27764dd4b9de81859abe708d647c735bfd6148f8356bff74b21c185bb9
- Sigstore transparency entry: 1313175116
- Sigstore integration time: Apr 15, 2026
Source repository:
- Permalink: hallengray/rag-forge@8f45df5b96ef342caf6f39cec535193225bd4860
- Branch / Tag: refs/tags/v0.2.2
- Owner: https://github.com/hallengray
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@8f45df5b96ef342caf6f39cec535193225bd4860
- Trigger Event: release

rag-forge-evaluator 0.2.2
