The Fastest RAG Audit - Generate QA datasets & evaluate RAG systems in Colab, Jupyter, or CLI. Privacy-first, lightning fast, visual reports.

Generate QA datasets & evaluate RAG systems in 2 commands

🔒 Privacy-First • ⚡ Lightning Fast • 🤖 Any LLM • 🏠 Local or Cloud • 🌍 Multilingual

English | 中文 | 日本語 | Deutsch


⚡ 2-Line RAG Evaluation

# Step 1: Generate QA pairs from your docs
ragscore generate docs/

# Step 2: Evaluate your RAG system
ragscore evaluate http://localhost:8000/query

That's it. Get accuracy scores and incorrect QA pairs instantly.

============================================================
✅ EXCELLENT: 85/100 correct (85.0%)
Average Score: 4.20/5.0
============================================================

โŒ 15 Incorrect Pairs:

  1. Q: "What is RAG?"
     Score: 2/5 - Factually incorrect

  2. Q: "How does retrieval work?"
     Score: 3/5 - Incomplete answer

🚀 Quick Start

Install

pip install ragscore              # Core (works with Ollama)
pip install "ragscore[openai]"    # + OpenAI support
pip install "ragscore[notebook]"  # + Jupyter/Colab support
pip install "ragscore[all]"       # + All providers

Option 1: Python API (Notebook-Friendly)

Perfect for Jupyter, Colab, and rapid iteration. Get instant visualizations.

from ragscore import quick_test

# 1. Audit your RAG in one line
result = quick_test(
    endpoint="http://localhost:8000/query",  # Your RAG API
    docs="docs/",                            # Your documents
    n=10,                                    # Number of test questions
)

# 1b. Tailored QA: target specific audiences
result = quick_test(
    endpoint="http://localhost:8000/query",
    docs="docs/",
    audience="developers",                   # Who asks the questions?
    purpose="api-integration",               # What's the document for?
)

# 2. See the report
result.plot()

# 3. Inspect failures
bad_rows = result.df[result.df['score'] < 3]
display(bad_rows[['question', 'rag_answer', 'reason']])

Rich Object API:

  • result.accuracy - Accuracy score
  • result.df - Pandas DataFrame of all results
  • result.plot() - 3-panel visualization (4-panel with detailed=True)
  • result.corrections - List of items to fix

Option 2: CLI (Production)

Generate QA Pairs

# Set API key (or use local Ollama - no key needed!)
export OPENAI_API_KEY="sk-..."

# Generate from any document
ragscore generate paper.pdf
ragscore generate docs/*.pdf --concurrency 10

# Tailored QA generation: target specific audiences
ragscore generate docs/ --audience developers --purpose faq
ragscore generate docs/ --audience customers --purpose "pre-sales"
ragscore generate docs/ --audience "compliance auditors" --purpose "security audit"

Evaluate Your RAG

# Point to your RAG endpoint
ragscore evaluate http://localhost:8000/query

# Custom options
ragscore evaluate http://api/ask --model gpt-4o --output results.json

🔬 Detailed Multi-Metric Evaluation

Go beyond a single score. Add detailed=True to get 5 diagnostic dimensions per answer, all from the same single LLM call.

result = quick_test(
    endpoint=my_rag,
    docs="docs/",
    n=10,
    detailed=True,  # ⭐ Enable multi-metric evaluation
)

# Inspect per-question metrics
display(result.df[[
    "question", "score", "correctness", "completeness",
    "relevance", "conciseness", "faithfulness"
]])

# Radar chart + 4-panel visualization
result.plot()

==================================================
✅ PASSED: 9/10 correct (90%)
Average Score: 4.3/5.0
Threshold: 70%
──────────────────────────────────────────────────
  Correctness: 4.5/5.0
  Completeness: 4.2/5.0
  Relevance: 4.8/5.0
  Conciseness: 4.1/5.0
  Faithfulness: 4.6/5.0
==================================================

| Metric | What it measures | Scale |
|---|---|---|
| Correctness | Semantic match to golden answer | 5 = fully correct |
| Completeness | Covers all key points | 5 = fully covered |
| Relevance | Addresses the question asked | 5 = perfectly on-topic |
| Conciseness | Focused, no filler | 5 = concise and precise |
| Faithfulness | No fabricated claims | 5 = fully faithful |
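
One way to use these per-answer dimensions is to average each one and look for the weakest. The following is a minimal sketch, assuming the rows come from something like result.df.to_dict("records"). The sample rows and the helper names are hypothetical; only the five column names come from the example above.

```python
from statistics import mean

# Column names taken from the detailed-evaluation example above.
METRICS = ["correctness", "completeness", "relevance", "conciseness", "faithfulness"]

# Hypothetical rows standing in for result.df.to_dict("records").
rows = [
    {"question": "What is RAG?", "correctness": 2, "completeness": 3,
     "relevance": 5, "conciseness": 4, "faithfulness": 2},
    {"question": "How does retrieval work?", "correctness": 4, "completeness": 4,
     "relevance": 5, "conciseness": 4, "faithfulness": 5},
]


def metric_averages(rows):
    """Return the mean of each metric across all rows."""
    return {m: mean(r[m] for r in rows) for m in METRICS}


def weakest_metric(rows):
    """Return the name of the metric with the lowest average."""
    averages = metric_averages(rows)
    return min(averages, key=averages.get)


print(metric_averages(rows))
print("Weakest dimension:", weakest_metric(rows))
```

A low average on one dimension (for example faithfulness) points at a different fix than a low average on another (for example completeness), which is the main reason to look past the single overall score.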

CLI:

ragscore evaluate http://localhost:8000/query --detailed

📓 Full demo notebook: build a mini RAG and test it with detailed metrics.

🎯 Audience & Purpose demo: generate tailored QA for developers, customers, auditors, and more.

🏠 Ollama local demo: 100% private RAG evaluation with no API keys.


🏠 100% Private with Local LLMs

# Use Ollama - no API keys, no cloud, 100% private
ollama pull llama3.1
ragscore generate confidential_docs/*.pdf
ragscore evaluate http://localhost:8000/query

Perfect for: Healthcare 🏥 • Legal ⚖️ • Finance 🏦 • Research 🔬

Ollama Model Recommendations

RAGScore generates complex structured QA pairs (question + answer + rationale + support span) in JSON format. This requires models with strong instruction-following and JSON output capabilities.

| Model | Size | Min RAM / VRAM | QA Quality | Recommended for |
|---|---|---|---|---|
| llama3.1:70b | 40GB | 48GB VRAM | Excellent | GPU server (A100, L40) |
| qwen2.5:32b | 18GB | 24GB VRAM | Excellent | GPU server (A10, L20) |
| llama3.1:8b | 4.7GB | 8GB VRAM | Good | Best local choice |
| qwen2.5:7b | 4.4GB | 8GB VRAM | Good | Good local alternative |
| mistral:7b | 4.1GB | 8GB VRAM | Good | Good local alternative |
| llama3.2:3b | 2.0GB | 4GB RAM | Fair | CPU-only / testing |
| qwen2.5:1.5b | 1.0GB | 2GB RAM | Poor | Not recommended |

Minimum recommended: 8B+ models. Smaller models (1.5B–3B) produce lower-quality support spans and may time out on longer chunks.
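
To make the structured-output requirement concrete, here is a minimal sketch of the kind of check a model's raw output has to pass. It is not RAGScore's own validation code. The function name and sample strings are hypothetical, and only the four field names come from the description above.

```python
import json

# Fields named in the description above; everything else here is illustrative.
REQUIRED_FIELDS = ("question", "answer", "rationale", "support_span")


def parse_qa(raw: str):
    """Return the parsed QA dict, or None if the model output is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # truncated or malformed JSON
    if not isinstance(data, dict):
        return None
    # Every required field must be present and a non-empty string.
    for field in REQUIRED_FIELDS:
        value = data.get(field)
        if not isinstance(value, str) or not value.strip():
            return None
    return data


good = ('{"question": "What is RAG?", "answer": "...", '
        '"rationale": "...", "support_span": "..."}')
bad = '{"question": "What is RAG?", "answer": "..."'  # truncated output

print(parse_qa(good) is not None)
print(parse_qa(bad) is not None)
```

Truncated JSON, missing keys, and empty support spans are the typical ways a weaker instruction-following model would fail a check like this.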

Ollama Performance Guide

# Recommended: 8B model with concurrency 2 for local machines
ollama pull llama3.1:8b
ragscore generate docs/ --provider ollama --model llama3.1:8b

# GPU server (A10/L20): larger model with higher concurrency
ollama pull qwen2.5:32b
ragscore generate docs/ --provider ollama --model qwen2.5:32b --concurrency 5

Expected performance (28 chunks, 5 QA pairs per chunk):

| Hardware | Model | Time | Concurrency |
|---|---|---|---|
| MacBook (CPU) | llama3.2:3b | ~45 min | 2 |
| MacBook (CPU) | llama3.1:8b | ~25 min | 2 |
| A10 (24GB) | llama3.1:8b | ~3–5 min | 5 |
| L20/L40 (48GB) | qwen2.5:32b | ~3–5 min | 5 |
| OpenAI API | gpt-4o-mini | ~2 min | 10 |

RAGScore auto-reduces concurrency to 2 for local Ollama to avoid GPU/CPU contention.
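
The figures above can be turned into rough per-chunk numbers for planning a larger run. This is back-of-the-envelope arithmetic only, and it assumes run time scales linearly with the number of chunks, which the benchmark does not claim. The helper names are hypothetical.

```python
# Figures from the benchmark above: 28 chunks, 5 QA pairs per chunk.
CHUNKS = 28
QA_PER_CHUNK = 5


def seconds_per_chunk(total_minutes: float, chunks: int = CHUNKS) -> float:
    """Average wall-clock seconds per chunk for a run of the given length."""
    return total_minutes * 60 / chunks


def estimate_minutes(n_chunks: int, reference_minutes: float) -> float:
    """Scale a benchmark time to n_chunks, ASSUMING linear scaling."""
    return reference_minutes * n_chunks / CHUNKS


print(CHUNKS * QA_PER_CHUNK)               # 140 QA pairs in the benchmark
print(round(seconds_per_chunk(25), 1))     # 53.6 s/chunk (llama3.1:8b on CPU)
print(round(seconds_per_chunk(2), 1))      # 4.3 s/chunk (gpt-4o-mini)
print(estimate_minutes(280, 25))           # 250.0 min for 10x the chunks
```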


🔌 Supported LLMs

| Provider | Setup | Notes |
|---|---|---|
| Ollama | ollama serve | Local, free, private |
| OpenAI | export OPENAI_API_KEY="sk-..." | Best quality |
| Anthropic | export ANTHROPIC_API_KEY="..." | Long context |
| DashScope | export DASHSCOPE_API_KEY="..." | Qwen models |
| vLLM | export LLM_BASE_URL="..." | Production-grade |
| Any OpenAI-compatible | export LLM_BASE_URL="..." | Groq, Together, etc. |

📊 Output Formats

Generated QA Pairs (output/generated_qas.jsonl)

{
  "id": "abc123",
  "question": "What is RAG?",
  "answer": "RAG (Retrieval-Augmented Generation) combines...",
  "rationale": "This is explicitly stated in the introduction...",
  "support_span": "RAG systems retrieve relevant documents...",
  "difficulty": "medium",
  "source_path": "docs/rag_intro.pdf"
}
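
A quick way to sanity-check a generated dataset is to count records by difficulty and by source file. The sketch below assumes the file holds one JSON object per line, as the .jsonl extension suggests (the record above is pretty-printed for readability). The second record and the "hard" difficulty value are hypothetical.

```python
import io
import json
from collections import Counter


def load_qas(lines):
    """Parse an iterable of JSONL lines into a list of dicts, skipping blanks."""
    return [json.loads(line) for line in lines if line.strip()]


# In-memory stand-in for open("output/generated_qas.jsonl", encoding="utf-8").
sample = io.StringIO(
    '{"id": "abc123", "question": "What is RAG?", '
    '"difficulty": "medium", "source_path": "docs/rag_intro.pdf"}\n'
    '{"id": "def456", "question": "How does retrieval work?", '
    '"difficulty": "hard", "source_path": "docs/rag_intro.pdf"}\n'
)

qas = load_qas(sample)
by_difficulty = Counter(qa["difficulty"] for qa in qas)
by_source = Counter(qa["source_path"] for qa in qas)

print(by_difficulty)
print(by_source)
```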

Evaluation Results (--output results.json)

{
  "summary": {
    "total": 100,
    "correct": 85,
    "incorrect": 15,
    "accuracy": 0.85,
    "avg_score": 4.2
  },
  "incorrect_pairs": [
    {
      "question": "What is RAG?",
      "golden_answer": "RAG combines retrieval with generation...",
      "rag_answer": "RAG is a database system.",
      "score": 2,
      "reason": "Factually incorrect - RAG is not a database"
    }
  ]
}
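
The results file is plain JSON, so it can be checked and triaged with the standard library. The following sketch embeds a copy of the sample above instead of reading results.json from disk. It assumes accuracy equals correct divided by total, which the sample numbers suggest but the format description does not state. The helper names are hypothetical.

```python
import json

# Inline copy of the documented sample; in practice, load results.json instead.
raw = """
{
  "summary": {"total": 100, "correct": 85, "incorrect": 15,
              "accuracy": 0.85, "avg_score": 4.2},
  "incorrect_pairs": [
    {"question": "What is RAG?",
     "golden_answer": "RAG combines retrieval with generation...",
     "rag_answer": "RAG is a database system.",
     "score": 2,
     "reason": "Factually incorrect - RAG is not a database"}
  ]
}
"""
results = json.loads(raw)
summary = results["summary"]


def check_summary(summary):
    """Check that the counts add up and accuracy matches correct / total."""
    counts_ok = summary["correct"] + summary["incorrect"] == summary["total"]
    ratio = summary["correct"] / summary["total"]
    return counts_ok and abs(summary["accuracy"] - ratio) < 1e-9


def worst_first(pairs):
    """Sort failed pairs so the lowest scores come first."""
    return sorted(pairs, key=lambda p: p["score"])


print("summary consistent:", check_summary(summary))
for pair in worst_first(results["incorrect_pairs"]):
    print(f'{pair["score"]}/5  {pair["question"]}  ({pair["reason"]})')
```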

🧪 Python API

from ragscore import run_pipeline, run_evaluation

# Generate QA pairs
run_pipeline(paths=["docs/"], concurrency=10)

# Generate tailored QA pairs for specific audiences
run_pipeline(
    paths=["docs/"],
    audience="support engineers",
    purpose="fine-tuning a support chatbot",
)

# Evaluate RAG
results = run_evaluation(
    endpoint="http://localhost:8000/query",
    model="gpt-4o",  # LLM for judging
)
print(f"Accuracy: {results.accuracy:.1%}")

🤖 AI Agent Integration

RAGScore is designed for AI agents and automation:

# Structured CLI with predictable output
ragscore generate docs/ --concurrency 5
ragscore evaluate http://api/query --output results.json

# Exit codes: 0 = success, 1 = error
# JSON output for programmatic parsing
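
An agent or CI job can rely on the exit code and the JSON file together. Below is a minimal sketch of such a wrapper. The function name run_and_load is hypothetical, and a small stand-in Python command replaces the real ragscore call so the example runs without RAGScore installed.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path


def run_and_load(cmd, output_path):
    """Run cmd; on exit code 0 return the parsed JSON file, else raise."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        raise RuntimeError(
            f"command failed with exit code {proc.returncode}: {proc.stderr.strip()}"
        )
    with open(output_path, encoding="utf-8") as fh:
        return json.load(fh)


# Stand-in for a real invocation such as:
#   ["ragscore", "evaluate", "http://api/query", "--output", "results.json"]
# It only writes a tiny JSON file to the path given as its first argument.
STAND_IN = (
    "import json, pathlib, sys; "
    "pathlib.Path(sys.argv[1]).write_text("
    "json.dumps({'summary': {'accuracy': 0.85}}))"
)

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "results.json"
    results = run_and_load([sys.executable, "-c", STAND_IN, str(out)], out)

print(results["summary"]["accuracy"])
```

Raising on a non-zero exit code keeps a failed evaluation from being mistaken for an empty result.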

CLI Reference:

| Command | Description |
|---|---|
| ragscore generate <paths> | Generate QA pairs from documents |
| ragscore generate <paths> --audience <who> | Tailored QA for specific audience |
| ragscore generate <paths> --purpose <why> | Focus QA on document purpose |
| ragscore evaluate <endpoint> | Evaluate RAG against golden QAs |
| ragscore evaluate <endpoint> --detailed | Multi-metric evaluation |
| ragscore --help | Show all commands and options |
| ragscore generate --help | Show generate options |
| ragscore evaluate --help | Show evaluate options |

โš™๏ธ Configuration

Zero config required. Optional environment variables:

export RAGSCORE_CHUNK_SIZE=512          # Chunk size for documents
export RAGSCORE_QUESTIONS_PER_CHUNK=5   # QAs per chunk
export RAGSCORE_WORK_DIR=/path/to/dir   # Working directory
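
When RAGScore is launched from a Python script instead of a shell, the same variables can be passed through the child process environment. This is a minimal sketch: the helper name ragscore_env and the values 256 and 3 are hypothetical, and only the three variable names come from the list above.

```python
import os


def ragscore_env(chunk_size=None, questions_per_chunk=None, work_dir=None, base=None):
    """Return a copy of the environment with the chosen overrides applied."""
    env = dict(os.environ if base is None else base)
    overrides = {
        "RAGSCORE_CHUNK_SIZE": chunk_size,
        "RAGSCORE_QUESTIONS_PER_CHUNK": questions_per_chunk,
        "RAGSCORE_WORK_DIR": work_dir,
    }
    for name, value in overrides.items():
        if value is not None:
            env[name] = str(value)  # environment values must be strings
    return env


# base={} keeps the demo output deterministic; omit it to inherit os.environ.
env = ragscore_env(chunk_size=256, questions_per_chunk=3, base={})
print(env)

# With the inherited environment, the result can be handed to a child process:
#   subprocess.run(["ragscore", "generate", "docs/"], env=ragscore_env(chunk_size=256))
```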

๐Ÿ” Privacy & Security

Data Cloud LLM Local LLM
Documents โœ… Local โœ… Local
Text chunks โš ๏ธ Sent to LLM โœ… Local
Generated QAs โœ… Local โœ… Local
Evaluation results โœ… Local โœ… Local

Compliance: GDPR โœ… โ€ข HIPAA โœ… (with local LLMs) โ€ข SOC 2 โœ…


🧪 Development

git clone https://github.com/HZYAI/RagScore.git
cd RagScore
pip install -e ".[dev,all]"
pytest

📡 Telemetry

RAGScore collects telemetry only in MCP server mode (ragscore serve). Standard CLI and Python API usage do not send telemetry.

We collect limited anonymous operational metrics to understand feature usage and improve reliability. No document content, prompts, QA text, model outputs, API keys, endpoint URLs, or file paths are collected.

Collected in MCP mode:

  • MCP tool invoked
  • LLM provider and model name
  • ragscore version, Python version, OS type
  • Success/failure status
  • Random anonymous installation ID

Opt out:

export RAGSCORE_NO_TELEMETRY=1


โญ Star us on GitHub if RAGScore helps you!
Made with โค๏ธ for the RAG community
