
rageval-ai

Standalone RAG evaluation library — run LLM-as-judge evaluation locally with your own API key.

Evaluate your RAG pipelines with 9+ metrics including hallucination detection, answer relevancy, faithfulness, and more. No server needed — everything runs locally.


Installation

pip install rageval-ai

Quick Start

import os
from rageval_sdk import evaluate

result = evaluate(
    question="What is the capital of France?",
    answer="The capital of France is Paris.",
    contexts=["Paris is the capital and largest city of France."],
    ground_truth="Paris",
    api_key=os.environ["OPENAI_API_KEY"],
)

print(f"Overall Score: {result['overall_score']}")
print(f"Hallucination: {result['hallucination_score']}")
print(f"Faithfulness:  {result['faithfulness']}")
print(f"Relevancy:     {result['answer_relevancy']}")
print(f"Cost:          ${result['cost_usd']}")

Environment Variables

export OPENAI_API_KEY="sk-your-openai-key"

Custom Configuration

import os
from rageval_sdk import evaluate, EvalConfig

config = EvalConfig(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://api.openai.com/v1",     # or OpenRouter, Azure, etc.
    stage_1_model="gpt-4o",                   # reasoning model
    stage_2_model="gpt-4o-mini",              # JSON conversion model
    rag_metrics_model="gpt-4o-mini",          # RAG metrics model
)

result = evaluate(
    question="What is RAG?",
    answer="RAG is Retrieval-Augmented Generation.",
    config=config,
)

Async Usage

import asyncio
from rageval_sdk import evaluate_trace, EvalConfig

async def main():
    config = EvalConfig(api_key="sk-...")
    result = await evaluate_trace(
        question="What is RAG?",
        answer="RAG is Retrieval-Augmented Generation.",
        contexts=["RAG combines retrieval with generation."],
        config=config,
    )
    print(result["overall_score"])

asyncio.run(main())

Background Evaluation (Non-blocking)

The RagEvaluator runs evaluations in background threads so your RAG pipeline is never blocked:

import os
from rageval_sdk import RagEvaluator

evaluator = RagEvaluator(api_key=os.environ["OPENAI_API_KEY"], max_workers=4)

# Your RAG pipeline runs normally — evaluation happens in background
for query in user_queries:
    answer, contexts = my_rag_pipeline(query)  # your existing code

    # Non-blocking: submits and returns immediately
    evaluator.submit(
        question=query,
        answer=answer,
        contexts=contexts,
    )

# Check how many are done
print(f"Completed: {evaluator.completed_count}, Pending: {evaluator.pending_count}")

# When ready, collect all results
results = evaluator.wait()
for r in results:
    print(f"Score: {r['overall_score']}, Hallucination: {r['hallucination_score']}")

evaluator.shutdown()
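The submit/collect flow above maps naturally onto a thread pool. As a rough illustration of the pattern (not RagEvaluator's actual internals — `judge` here is a stub standing in for a real LLM-as-judge call), the same non-blocking behavior can be sketched with only the standard library:

```python
from concurrent.futures import ThreadPoolExecutor

def judge(question, answer):
    # Stub: a real implementation would call an LLM-as-judge API here.
    return {"question": question, "overall_score": 0.9}

executor = ThreadPoolExecutor(max_workers=4)

# submit() returns a Future immediately; the judge call runs on a worker thread
futures = [executor.submit(judge, "What is RAG?", "Retrieval-Augmented Generation.")]

# ... the pipeline keeps serving requests here, unblocked ...

# Collecting results is the only blocking step
results = [f.result() for f in futures]
executor.shutdown()
print(results[0]["overall_score"])
```

The trade-off is the same as with RagEvaluator: submission is cheap, and latency is only paid when you gather results.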

Batch Evaluation

Evaluate multiple traces at once:

import os
from rageval_sdk import RagEvaluator

with RagEvaluator(api_key=os.environ["OPENAI_API_KEY"]) as evaluator:
    results = evaluator.evaluate_batch([
        {
            "question": "What is RAG?",
            "answer": "RAG is Retrieval-Augmented Generation.",
            "contexts": ["RAG combines retrieval with generation."],
        },
        {
            "question": "What is Python?",
            "answer": "Python is a programming language.",
            "contexts": ["Python was created by Guido van Rossum."],
        },
    ])

    for r in results:
        print(f"Score: {r['overall_score']}")

Features

  • Standalone — No server needed, runs entirely locally
  • Background Evaluation — Non-blocking evaluation with RagEvaluator
  • Batch Support — Evaluate multiple traces concurrently
  • 9+ Metrics — Hallucination, relevancy, faithfulness, completeness, and more
  • Parallel Pipeline — Stage 1 + Stage 2 + RAG metrics run concurrently
  • OpenAI Compatible — Works with OpenAI, OpenRouter, Azure, or any compatible API
  • Retry & Circuit Breaker — Production-grade reliability
  • Typed — Full type hints with py.typed marker
  • Lightweight — Only httpx as required dependency
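The Retry & Circuit Breaker behavior mentioned above is internal to the library. For intuition only, the retry half of that idea can be sketched as a minimal exponential-backoff wrapper (a hand-rolled illustration, not the library's actual implementation):

```python
import time

def with_retries(fn, attempts=3, base_delay=0.01):
    """Call fn, retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * 2 ** attempt)

# Simulated flaky API call: fails twice, then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

result = with_retries(flaky)
```

A circuit breaker adds one more layer on top: after repeated failures it stops calling the endpoint entirely for a cooldown period instead of retrying.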

Evaluation Metrics

Metric               Description
overall_score        Weighted combination of all metrics
hallucination_score  Detects fabricated information (claim-level)
faithfulness         Ensures answer is grounded in context
answer_relevancy     Measures answer relevance to the question
completeness         Key-point coverage verification
context_precision    Evaluates quality of retrieved contexts
context_recall       Checks if all needed facts are retrieved
citation_check       Validates source citations against contexts
clarity              Answer clarity and readability
coherence            Logical flow and consistency
helpfulness          How actionable/useful the answer is
is_off_topic         Off-topic detection
is_deflection        Deflection detection ("I don't know")
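Because each metric is a plain key in the result dictionary (as in the Quick Start above), downstream filtering is ordinary dict work. A sketch — the result dicts below are hand-made samples for illustration, not real library output:

```python
# Hand-made sample results using the metric keys from the table above.
results = [
    {"question": "What is RAG?", "overall_score": 0.92, "hallucination_score": 0.05},
    {"question": "Capital of France?", "overall_score": 0.41, "hallucination_score": 0.70},
]

# Flag any trace whose hallucination score exceeds a chosen threshold.
THRESHOLD = 0.5
flagged = [r for r in results if r["hallucination_score"] > THRESHOLD]

for r in flagged:
    print(f"Review: {r['question']} (hallucination={r['hallucination_score']})")
```

The same pattern works for any metric in the table — pick a key, pick a threshold, and route failing traces to review.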

API Reference

evaluate()

result = evaluate(
    question,                    # str — the user question
    answer,                      # str — the LLM answer
    contexts=None,               # list[str] — retrieved context passages
    ground_truth=None,           # str — expected correct answer
    api_key=None,                # str — your OpenAI API key
    config=None,                 # EvalConfig — full configuration
    **config_overrides,          # additional EvalConfig fields
)

EvalConfig

config = EvalConfig(
    api_key="sk-...",                           # Required
    base_url="https://api.openai.com/v1",       # LLM endpoint
    stage_1_model="gpt-4o",                     # Reasoning model
    stage_2_model="gpt-4o-mini",                # JSON model
    rag_metrics_model="gpt-4o-mini",            # RAG metrics model
    timeout_seconds=120.0,                      # Request timeout
)

License

MIT — see LICENSE for details.
