Local-first evaluation framework for RAG pipelines and AI agents

Evaluate, benchmark, and monitor your RAG pipelines — 100% locally, no API keys required.

PyPI · CI · Python 3.10+ · License: AGPL-3.0 · Ruff · Type Checked: mypy · Open In Colab

Problem · Solution · Quick Start · Installation · Roadmap


The Problem

Building RAG systems is easy. Knowing if they actually work is hard.

Current evaluation tools each fall short:

| Tool | Issue |
| --- | --- |
| Giskard | Heavy, slow (45-60 min scans), loses progress on crash, enterprise-focused |
| RAGAS | Requires OpenAI API keys, no local-first option |
| Manual testing | Doesn't scale, not reproducible |

You need a tool that:

  • Runs 100% locally (Ollama, local models)
  • Evaluates fast with checkpointing (no lost progress)
  • Integrates with your existing stack (LangChain, LangGraph)
  • Fits in CI/CD pipelines
  • Doesn't require a PhD to use

The Solution

ragnarok-ai is a lightweight, local-first framework to evaluate RAG pipelines.

RAGnarok-AI Overview

from ragnarok_ai import evaluate, generate_testset

# Generate test questions from your knowledge base
testset = await generate_testset(
    knowledge_base="./docs/",
    num_questions=50,
    types=["simple", "multi_hop", "adversarial"],
    llm="ollama/mistral",
    checkpoint=True,  # Resume if interrupted
)

# Evaluate your RAG pipeline
results = await evaluate(
    rag_pipeline=my_rag,
    testset=testset,
    metrics=["retrieval", "faithfulness", "relevance"],
    llm="ollama/mistral",
)

# Get actionable insights
results.summary()
# ┌─────────────────┬───────┬────────┐
# │ Metric          │ Score │ Status │
# ├─────────────────┼───────┼────────┤
# │ Retrieval P@10  │ 0.82  │ PASS   │
# │ Faithfulness    │ 0.74  │ WARN   │
# │ Relevance       │ 0.89  │ PASS   │
# │ Hallucination   │ 0.12  │ PASS   │
# └─────────────────┴───────┴────────┘

results.export("report.html")

v1.7.0 is now available! Alerting system with Webhook and Slack adapters. Install with pip install ragnarok-ai


Key Features

| Feature | Description |
| --- | --- |
| 100% Local | Runs entirely on your machine with Ollama. No OpenAI, no API keys, no data leaving your network. |
| Production Monitoring | Collect traces, export Prometheus metrics, track latency and success rates in production. |
| LLM-as-Judge | Multi-criteria evaluation with Prometheus 2: faithfulness, relevance, hallucination, completeness. |
| Cost Tracking | Track token usage and costs. Local models = $0.00; see exactly what cloud APIs cost. |
| Jupyter Integration | Rich HTML display in notebooks with metrics visualization. |
| Fast & Resilient | Built-in checkpointing — crash mid-evaluation? Resume exactly where you left off. |
| Framework Agnostic | Works with LangChain, LangGraph, LlamaIndex, or your custom RAG. |
| Comprehensive Metrics | Retrieval quality, faithfulness, relevance, hallucination detection, latency tracking. |
| Test Generation | Auto-generate diverse test sets from your knowledge base. |
| CI/CD Ready | CLI-first design, JSON output, exit codes for pipeline integration. |
| Enterprise Ready | Kubernetes Helm charts, air-gapped deployment, data sovereignty. |
| Lightweight | Minimal dependencies. No torch/transformers in core. |

Comparison

| Feature | ragnarok-ai | Giskard | RAGAS |
| --- | --- | --- | --- |
| 100% Local | Yes | Partial | No |
| Checkpointing | Yes | No | No |
| Fast evaluation | Yes | No (45-60 min) | Yes |
| CLI support | Yes | No | No |
| LangChain integration | Yes | Yes | Yes |
| Minimal deps | Yes | No | Partial |
| Free & OSS | AGPL-3.0 | Open-core | Apache-2.0 |

Performance

Benchmarked on Apple M2 16GB, Python 3.10:

Retrieval Metrics: ~24,000 queries/sec

| Queries | Time | Peak RAM |
| --- | --- | --- |
| 50 | 0.002s | 0.02 MB |
| 500 | 0.021s | 0.03 MB |
| 5000 | 0.217s | 0.17 MB |

LLM-as-Judge (Prometheus 2):

| Criterion | Avg Time |
| --- | --- |
| Faithfulness | ~25s |
| Relevance | ~22s |
| Hallucination | ~28s |

Retrieval is pure computation — instant. LLM-as-Judge is the bottleneck (~25s/eval), but runs 100% local.

Full benchmarks →


Quick Start

Try it now: Open in Google Colab

Prerequisites

  • Python 3.10+
  • Ollama running locally
  • uv (recommended) or pip

Install

pip install ragnarok-ai

With optional dependencies:

pip install ragnarok-ai[ollama,qdrant]

Run your first evaluation

# CLI demo
ragnarok evaluate --demo

# With options
ragnarok evaluate --demo --output results.json --fail-under 0.7

# Or in Python
python examples/basic_evaluation.py

Installation

Using pip

pip install ragnarok-ai

Optional dependencies

# LLM providers
pip install ragnarok-ai[ollama]      # Ollama support
pip install ragnarok-ai[openai]      # OpenAI support
pip install ragnarok-ai[anthropic]   # Anthropic support

# Vector stores
pip install ragnarok-ai[qdrant]      # Qdrant support
pip install ragnarok-ai[chroma]      # ChromaDB support
pip install ragnarok-ai[faiss]       # FAISS support

# RAG frameworks
pip install ragnarok-ai[langchain]   # LangChain/LangGraph support
pip install ragnarok-ai[llamaindex]  # LlamaIndex support
pip install ragnarok-ai[dspy]        # DSPy support

# Observability
pip install ragnarok-ai[telemetry]   # OpenTelemetry tracing

# Everything
pip install ragnarok-ai[all]

Development

git clone https://github.com/2501Pr0ject/RAGnarok-AI.git
cd RAGnarok-AI
pip install -e ".[dev]"
pre-commit install

Use Cases

Continuous RAG Testing in CI/CD

# .github/workflows/rag-tests.yml
- uses: 2501Pr0ject/ragnarok-evaluate-action@v1
  with:
    config: ragnarok.yaml
    threshold: 0.8
    # fail-on-threshold: false (default - advisory only)
    # comment-on-pr: true (default - posts PR comment)

The action posts a PR comment distinguishing deterministic retrieval metrics from advisory LLM-as-Judge scores.

Compare Embedding Models

configs = [
    {"embedder": "nomic-embed-text", "chunk_size": 512},
    {"embedder": "mxbai-embed-large", "chunk_size": 256},
]

results = await benchmark(
    rag_factory=create_rag,
    configs=configs,
    testset=testset,
)
results.compare()  # Side-by-side comparison

Monitor Production Quality

# Track quality drift over time
metrics = await evaluate(rag, production_queries)
metrics.log_to("./metrics/")  # Time-series storage

Metrics

Retrieval Metrics

  • Precision@K — Relevant docs in top K results
  • Recall@K — Coverage of relevant docs
  • MRR — Mean Reciprocal Rank
  • NDCG — Normalized Discounted Cumulative Gain
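These metrics can be sketched in a few lines of plain Python — an illustration of the underlying math under binary relevance judgments, not ragnarok-ai's actual implementation (the framework computes them for you):

```python
import math

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved doc IDs that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1/rank of the first relevant doc, 0.0 if none was retrieved."""
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1 / rank
    return 0.0

def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance NDCG: DCG of this ranking over the ideal DCG."""
    dcg = sum(1 / math.log2(i + 1)
              for i, d in enumerate(retrieved[:k], start=1) if d in relevant)
    ideal = sum(1 / math.log2(i + 1)
                for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(precision_at_k(retrieved, relevant, 4))  # 0.5
print(reciprocal_rank(retrieved, relevant))    # 0.5 (first hit at rank 2)
```

MRR is simply the mean of `reciprocal_rank` over all queries in the test set.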

Generation Metrics

  • Faithfulness — Is the answer grounded in retrieved context?
  • Relevance — Does the answer address the question?
  • Hallucination — Does the answer contain fabricated info?
  • Completeness — Are all aspects of the question covered?

LLM-as-Judge (v1.2+)

Use Prometheus 2 for comprehensive, local evaluation:

from ragnarok_ai import LLMJudge

# Initialize judge (uses Prometheus 2 by default)
judge = LLMJudge()

# Evaluate a single response
result = await judge.evaluate_all(
    context="Python was created by Guido van Rossum in 1991.",
    question="Who created Python?",
    answer="Guido van Rossum created Python.",
)

print(f"Overall: {result.overall_verdict} ({result.overall_score:.2f})")
# Overall: PASS (0.85)

print(f"Faithfulness: {result.faithfulness.verdict}")
print(f"Hallucination: {result.hallucination.verdict}")

Performance:

  • ~20-30s per evaluation on Apple M2 16GB
  • Prometheus 2 Q5_K_M: ~5GB RAM usage
  • keep_alive enabled by default (prevents model unloading between requests)

Installation:

# Install Prometheus 2 (~5GB, runs on 16GB RAM)
ollama pull hf.co/RichardErkhov/prometheus-eval_-_prometheus-7b-v2.0-gguf:Q5_K_M

Medical Mode

Reduce false positives in healthcare RAG evaluation with automatic medical abbreviation normalization.

from ragnarok_ai import LLMJudge

# Enable medical mode
judge = LLMJudge(medical_mode=True)

# "CHF" and "congestive heart failure" are now treated as equivalent
result = await judge.evaluate_faithfulness(
    context="Patient diagnosed with CHF.",
    question="What condition does the patient have?",
    answer="Patient has congestive heart failure.",
)
# Without medical_mode: may flag as unfaithful (text mismatch)
# With medical_mode: correctly identifies as faithful

Features:

  • 350+ medical abbreviations (CHF, MI, COPD, DVT...)
  • Context-aware disambiguation (MS = multiple sclerosis vs mitral stenosis)
  • Multiple formats: dotted (q.d.), slash (s/p), mixed-case (SpO2)
  • False positive filtering (OR, US, IT stay unchanged)

Also works with FaithfulnessEvaluator(llm, medical_mode=True).

Contributed by @harish1120
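The normalization idea can be sketched with a tiny expansion table — a toy illustration only, not the library's 350-term, context-aware implementation (real disambiguation of cases like MS needs surrounding context):

```python
import re

# Toy subset of a medical abbreviation table (illustrative only).
ABBREVIATIONS = {
    "CHF": "congestive heart failure",
    "MI": "myocardial infarction",
    "COPD": "chronic obstructive pulmonary disease",
}

def normalize(text: str) -> str:
    """Expand known abbreviations so judge comparisons match on meaning."""
    def expand(match: re.Match) -> str:
        # Unknown all-caps tokens (OR, US, IT...) pass through unchanged.
        return ABBREVIATIONS.get(match.group(0), match.group(0))
    # Word-boundary matching keeps longer tokens like "MIX" untouched.
    return re.sub(r"\b[A-Z]{2,}\b", expand, text)

print(normalize("Patient diagnosed with CHF."))
# Patient diagnosed with congestive heart failure.
```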

System Metrics

  • Latency — End-to-end response time
  • Token usage — Cost tracking for LLM calls
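Latency percentiles of the kind the monitor exports boil down to ranking raw timings; a minimal nearest-rank sketch (illustrative, independent of the library's monitoring code):

```python
def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample covering p% of the data."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical per-request latencies in milliseconds.
latencies_ms = [120, 95, 310, 150, 88, 2040, 132, 101]
print(percentile(latencies_ms, 50))  # typical request
print(percentile(latencies_ms, 95))  # tail latency, dominated by outliers
```

Tail percentiles (p95/p99) matter more than averages here, since a single slow LLM call can dwarf the median.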

Cost Tracking (v1.3+)

Track exactly what your evaluations cost:

results = await evaluate(rag, testset, track_cost=True)
print(results.cost)
# +--------------------+------------+----------+
# | Provider           |     Tokens |     Cost |
# +--------------------+------------+----------+
# | ollama (local)     |     45,230 |    $0.00 |
# | openai             |     12,500 |    $0.38 |
# +--------------------+------------+----------+

Local-first advantage: Ollama evaluations cost $0.00.
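The accounting is simply tokens multiplied by a per-token price; a minimal sketch with illustrative numbers (neither the prices nor the model name reflect ragnarok-ai's actual pricing table):

```python
# Illustrative $/1M-token prices (hypothetical, not the library's table).
PRICES_PER_1M = {
    "ollama": 0.0,                # local models cost nothing
    "openai/gpt-4o-mini": 0.60,   # hypothetical blended rate
}

def cost_usd(provider: str, tokens: int) -> float:
    """Dollar cost of `tokens` tokens billed at the provider's rate."""
    return tokens / 1_000_000 * PRICES_PER_1M[provider]

print(f"${cost_usd('ollama', 45_230):.2f}")             # $0.00
print(f"${cost_usd('openai/gpt-4o-mini', 12_500):.4f}")
```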

Jupyter Notebook (v1.3.1+)

Rich HTML display for evaluation results:

from ragnarok_ai.notebook import display, display_comparison

# Full dashboard with metrics, cost, latency
display(results)

# Compare multiple pipelines side-by-side
display_comparison([
    ("Baseline", baseline_results),
    ("Improved", improved_results),
])

Roadmap

Completed

v0.1 — Foundation
  • Project setup & architecture
  • Core retrieval metrics (precision, recall, MRR, NDCG)
  • Ollama adapter
  • Console reporter
  • JSON reporter
  • Basic CLI
  • CI/CD with GitHub Actions
v0.2 — Generation Metrics & Reporting
  • Qdrant adapter
  • Faithfulness evaluator
  • Relevance evaluator
  • Hallucination detection
  • HTML report with drill-down (failed questions, retrieved chunks)
  • Intelligent CI gating (stable metrics fail, LLM judgments warn)
v0.3 — Test Generation & Golden Sets
  • Synthetic question generation
  • Multi-hop question support
  • Adversarial question generation
  • Checkpointing system
  • Golden set support (human-validated, versioned question sets)
  • Baselines library (configs + expected results)
  • NovaTech example dataset for quickstart
v0.4 — Framework Adapters & Observability
  • LangChain integration
  • LangGraph integration
  • Custom RAG protocol support
  • OpenTelemetry export for tracing & debugging
v0.5 — Performance & Scale
  • Async parallelization (max_concurrency parameter)
  • Result caching (MemoryCache, DiskCache, CacheProtocol)
  • Batch processing (BatchEvaluator for 1000+ queries)
  • Progress callbacks (sync and async support)
  • Timeout and retry (timeout, max_retries, retry_delay)
  • Cache error handling (graceful degradation)
v0.6 — Cloud & Local Adapters
  • vLLM adapter (local high-performance inference)
  • OpenAI adapter (optional cloud fallback)
  • Anthropic adapter
  • ChromaDB adapter
  • FAISS adapter (pure local, no server)
v0.7 — Framework Adapters
  • LlamaIndex adapter (Retriever, QueryEngine, Index)
  • DSPy adapter (Retrieve, Module, RAG pattern)
  • Custom RAG support via RAGProtocol
  • Adapter contribution guide
v0.8 — Comparison & Benchmarking
  • Comparison mode (compare() for side-by-side evaluation)
  • Regression detection (alert on quality drop vs baseline)
  • Benchmark history tracking (time-series storage)
  • Diff reports (what changed between runs)
v0.9 — Agent Evaluation
  • AgentProtocol for agent pipelines
  • Tool-use correctness metrics (precision, recall, F1)
  • Multi-step reasoning evaluators (coherence, goal progress, efficiency)
  • ReAct/CoT pattern adapters
  • Trajectory analysis (loops, dead ends, failure detection)
  • Visualization (ASCII, Mermaid, HTML reports)
v1.0 — Production Ready
  • PyPI publish (pip install ragnarok-ai)
  • Stable public API
  • Complete README with examples
  • CHANGELOG.md (v0.1 → v1.0)
v1.1 — CLI Complete
  • ragnarok generate command (synthetic testset generation)
  • ragnarok benchmark command (history tracking, regression detection)
  • Standardized JSON envelope for --json output
  • E2E tests for CLI workflow
  • Trusted Publishing (PyPI OIDC)
v1.2 — LLM-as-Judge
  • LLMJudge class with Prometheus 2 integration
  • Multi-criteria evaluation (faithfulness, relevance, hallucination, completeness)
  • 100% local evaluation with Ollama (Q5_K_M quantization, ~5GB)
  • Rubric-based prompts with 1-5 scoring normalized to 0-1
  • Detailed explanations for each judgment
  • Batch evaluation support
  • Robust JSON parsing for LLM responses (handles incomplete JSON)
  • keep_alive support for Ollama (prevents model unloading between requests)
v1.2.5 — Plugin Architecture
  • Plugin system based on Python entry points
  • PluginRegistry singleton for adapter discovery
  • Dynamic discovery of external plugins via importlib.metadata
  • ragnarok plugins CLI command (list, info, filters)
  • Support for 4 namespaces: llm, vectorstore, framework, evaluator
  • LOCAL/CLOUD classification for all adapters
  • Plugin documentation (docs/PLUGINS.md)
  • E2E plugin test with mock package
v1.3.0 — Cost Tracking
  • Cost tracking module (ragnarok_ai.cost)
  • Pricing table for OpenAI, Anthropic, Groq, Mistral, Together AI
  • Token counting with tiktoken (fallback to estimation)
  • CostTracker class with context manager support
  • track_cost=True parameter in evaluate()
  • Formatted summary table and JSON export
  • Local providers (Ollama, vLLM) = $0.00
  • Automatic tracking in LLM adapters
v1.3.1 — Jupyter Integration
  • Jupyter notebook module (ragnarok_ai.notebook)
  • Rich HTML display for evaluation results
  • Metrics visualization with progress bars
  • Cost breakdown tables
  • Pipeline comparison display
  • Auto-detection of notebook environment
v1.4.0 — More Integrations
  • LLM Adapters: Groq, Mistral, Together AI
  • VectorStore Adapters: Pinecone, Weaviate, Milvus, pgvector
  • Framework Adapters: Haystack, Semantic Kernel
  • Medical Mode: Abbreviation normalizer with 350+ terms (contributed by @harish1120)
  • CLI: ragnarok judge command, --config ragnarok.yaml support
  • Docs: MkDocs documentation site, performance benchmarks
v1.4.1 — Dataset Versioning
  • Shared hashing utilities: Canonical JSON, SHA256, content-based keys
  • TestSet versioning: schema_version, dataset_version, created_at, author, source
  • Dataset diff: ragnarok dataset diff CLI command
  • Stable item keys: metadata.id or content hash fallback
  • Diff report: Added/removed/modified/unchanged detection
v1.5.0 — Enterprise Deployment
  • Kubernetes Helm Chart: Job and CronJob modes, ConfigMap, PVC support
  • Air-Gapped Deployment Guide: Complete documentation for offline environments
  • Docker improvements: Fixed Dockerfile for proper module installation
  • Data Sovereignty: Full support for GDPR, HIPAA, defense environments
v1.6.0 — Production Monitoring
  • MonitorClient: Instrument RAG pipelines with configurable sampling
  • Monitor Daemon: HTTP server with /ingest, /metrics, /health, /stats
  • Prometheus Export: Request counts, success rate, latency percentiles
  • SQLite Storage: 7-day trace retention, 90-day aggregate retention
  • CLI Commands: ragnarok monitor start|stop|status|stats
  • PII Safety: Query hashing (SHA256)
  • Custom Metadata: Tenant/route slicing support
v1.7.0 — Alerting
  • AlertManager: Multi-channel alert dispatch with concurrent sending
  • AlertRule: Threshold-based rules with cooldown support
  • WebhookAlertAdapter: Generic HTTP webhook notifications
  • SlackAlertAdapter: Slack notifications with Block Kit formatting
  • Alert Protocols: Alert, AlertResult, AlertSeverity (INFO, WARNING, CRITICAL)
  • Public Exports: AlertManager, AlertRule, AlertSeverity from package root

Planned

v1.8+

  • More alert adapters (Discord, Email)
  • Drift detection
  • Web UI dashboard

Future

Web UI
  • Basic Web UI (read-only dashboard)
  • Full Web UI dashboard
Developer Experience
Advanced Features
  • Streaming evaluation
  • A/B testing support
  • Dataset versioning
  • Fine-tuning recommendations
  • Multi-modal evaluation (images, audio)
Enterprise (On-Premise)
  • SSO support (SAML, OIDC)
  • Role-based access control
  • Audit logging
  • Air-gapped deployment guide
  • Docker/Kubernetes helm charts

Architecture

RAGnarok-AI Architecture

Project structure:
ragnarok-ai/
├── src/ragnarok_ai/
│   ├── core/           # Types, protocols, exceptions
│   ├── evaluators/     # Metric implementations
│   ├── generators/     # Test set generation
│   ├── adapters/       # LLM, vector store, framework adapters
│   ├── reporters/      # Output formatters (JSON, HTML, console)
│   └── cli/            # Command-line interface
├── tests/              # Test suite (pytest)
├── examples/           # Usage examples
├── benchmarks/         # Performance benchmarks
└── docs/               # Documentation

Development

# Setup
uv pip install -e ".[dev]"
pre-commit install

# Run checks
pytest                    # Tests
pytest --cov=ragnarok_ai  # With coverage
ruff check . --fix        # Lint
ruff format .             # Format
mypy src/                 # Type check

Advanced Usage

Importing Types

For advanced use cases (custom RAG implementations, type hints), import types directly from submodules:

# Core types
from ragnarok_ai.core.types import Document, Query, RAGResponse, TestSet

# Protocols (for implementing custom adapters)
from ragnarok_ai.core.protocols import RAGProtocol, LLMProtocol, VectorStoreProtocol

# Evaluators
from ragnarok_ai.evaluators import FaithfulnessEvaluator, RelevanceEvaluator

# Adapters
from ragnarok_ai.adapters.llm import OllamaLLM, OpenAILLM
from ragnarok_ai.adapters.vectorstore import ChromaVectorStore, QdrantVectorStore

Implementing a Custom RAG

from ragnarok_ai.core.protocols import RAGProtocol
from ragnarok_ai.core.types import RAGResponse, Document

class MyCustomRAG:
    """Custom RAG implementing the RAGProtocol."""

    async def query(self, question: str, k: int = 5) -> RAGResponse:
        # Your retrieval logic here
        docs = await self.retrieve(question, k)
        answer = await self.generate(question, docs)

        return RAGResponse(
            answer=answer,
            retrieved_docs=[
                Document(id=d.id, content=d.text, metadata=d.meta)
                for d in docs
            ],
        )

# Use with ragnarok-ai
from ragnarok_ai import evaluate

results = await evaluate(
    rag_pipeline=MyCustomRAG(),
    testset=testset,
    metrics=["retrieval", "faithfulness"],
)

Feedback

Your feedback helps improve RAGnarok-AI. Pick the right channel:

| Type | Link |
| --- | --- |
| Bug report | Report a bug |
| Feedback / UX | Share feedback |
| Feature request | Request a feature |
| Questions / Ideas | Discussions |

Contributing

Contributions are welcome! Please read CONTRIBUTING.md for guidelines.

Priority areas for contributions:

  • Framework adapters (Haystack, Semantic Kernel)
  • Agent evaluation features
  • Streaming evaluation support
  • Multi-modal evaluation
  • Documentation & examples

License

RAGnarok-AI is dual-licensed:

| License | Use Case |
| --- | --- |
| AGPL-3.0 | Open source projects, personal use, research |
| Commercial | Proprietary software, SaaS, organizations with AGPL restrictions |

Why dual licensing?

  • AGPL ensures improvements stay open-source
  • Commercial license enables enterprise adoption without copyleft obligations

For commercial licensing inquiries: abdel.touati@gmail.com


Acknowledgments

Built out of frustration with complex evaluation setups. We wanted something that just works — locally, fast, and without API keys.


Built with ❤️ in Lyon, France
