
RAG retrieval regression testing — define Golden Questions, detect lost chunks in CI


Sub-second RAG regression testing for production pipelines




Overview

"Did my last commit break retrieval?" — now you know in seconds.

LongProbe is a sub-second RAG regression harness. Define your Golden Questions once, run longprobe check on every commit, and get an exact diff of which document chunks were lost in your latest change — before your users notice.

Think pytest --watch for your RAG pipeline.

🎬 Demos

Test RAG Retrieval

Quick validation of retrieval quality with live progress tracking.


Monitor RAG Quality

Detailed quality monitoring with Python API and comprehensive results.


Detect Regressions

Baseline comparison and regression detection with deployment verdict.


Why LongProbe?

Every RAG developer faces the same silent killer: you refactor your chunking strategy, upgrade LangChain, or add a new document, and your retrieval quality silently degrades. DeepEval and RAGChecker are heavyweight evaluation frameworks built for batch analysis, not for fast regression checks in a development loop.

LongProbe gives you instant feedback:

  • ⚡ Sub-second checks on small golden sets
  • 🔍 Exact diffs showing which chunks were lost/gained
  • 📊 Recall scores with per-question breakdown
  • 💾 Baseline tracking to catch regressions over time
  • 🧪 pytest integration for existing test suites
  • 🔌 Pluggable adapters for any vector store

Part of the Long Suite

LongProbe is part of the EnDevSols Long Suite of RAG tools. Together they cover the full RAG pipeline from ingestion to production monitoring.

Features

  • ⚡ Sub-second checks on small golden sets
  • 📋 Golden Questions + Required Chunks defined in simple YAML
  • 🔍 Three match modes: exact ID, text substring, semantic similarity
  • 📊 Recall Score with per-question breakdown
  • 🔄 Regression diff: exactly which chunks were lost/gained
  • 💾 SQLite baseline store: compare against any previous run
  • 🧪 pytest plugin: integrate into existing test suites
  • 🔌 Pluggable adapters: LangChain, LlamaIndex, Chroma, Pinecone, Qdrant
  • 🖥️ Beautiful CLI with Rich tables, JSON, and GitHub Actions output
  • 👀 Watch mode: auto re-run on file changes
  • 🏗️ CI/CD ready: fails pipeline on regression

Quick Start

Installation

# Install with UV (recommended)
uv pip install longprobe

# Install with pip
pip install longprobe

# Install with optional dependencies
uv pip install longprobe[chroma]      # ChromaDB support
uv pip install longprobe[openai]      # OpenAI embeddings
uv pip install longprobe[all]         # Everything

Initialize

longprobe init

This creates:

  • .longprobe/ — directory for baseline storage
  • goldens.yaml — example golden questions
  • longprobe.yaml — configuration file

Define Golden Questions

Edit goldens.yaml with your test cases:

name: "my-rag-golden-set"
version: "1.0"

questions:
  - id: "q1"
    question: "What is the termination clause?"
    match_mode: "id"            # exact chunk ID match
    required_chunks:
      - "contracts_chunk_42"
      - "contracts_chunk_43"
    top_k: 5
    tags: ["contracts", "critical"]

  - id: "q2"
    question: "What are the payment terms?"
    match_mode: "text"          # substring match
    required_chunks:
      - "net 30 days from invoice"
    top_k: 5

  - id: "q3"
    question: "Who can sign contracts?"
    match_mode: "semantic"      # embedding similarity
    semantic_threshold: 0.80
    required_chunks:
      - "The following officers are authorized to sign"
    top_k: 10
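Each entry in goldens.yaml corresponds to one structured question record. As a rough sketch of how such an entry maps to a Python object (the `GoldenQuestion` class and its field handling here are illustrative, not LongProbe's actual internals):

```python
from dataclasses import dataclass, field

@dataclass
class GoldenQuestion:
    """Illustrative model of one goldens.yaml entry (not LongProbe's real class)."""
    id: str
    question: str
    required_chunks: list
    match_mode: str = "id"            # "id" | "text" | "semantic"
    top_k: int = 5
    semantic_threshold: float = 0.80
    tags: list = field(default_factory=list)

    @classmethod
    def from_dict(cls, raw: dict) -> "GoldenQuestion":
        # Keep only recognized keys so unknown YAML fields don't break parsing
        allowed = {"id", "question", "required_chunks", "match_mode",
                   "top_k", "semantic_threshold", "tags"}
        return cls(**{k: v for k, v in raw.items() if k in allowed})

# The "q1" entry above, as the dict a YAML parser would produce
q1 = GoldenQuestion.from_dict({
    "id": "q1",
    "question": "What is the termination clause?",
    "match_mode": "id",
    "required_chunks": ["contracts_chunk_42", "contracts_chunk_43"],
    "top_k": 5,
    "tags": ["contracts", "critical"],
})
```

Defaults such as `top_k: 5` mean the optional keys can be omitted from YAML entries that don't need them.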

Configure Your Retriever

Edit longprobe.yaml:

retriever:
  type: "chroma"
  chroma:
    persist_directory: "./chroma_db"
    collection: "my_documents"

embedder:
  provider: "local"
  model: "text-embedding-3-small"

scoring:
  recall_threshold: 0.8
  fail_on_regression: true

baseline:
  db_path: ".longprobe/baselines.db"
  auto_compare: true

Run Checks

# Run against live vector store
longprobe check --goldens goldens.yaml

# Override settings
longprobe check --threshold 0.9 --top-k 10

# JSON output for automation
longprobe check --output json

# GitHub Actions annotations
longprobe check --output github

CLI Reference

Core Commands

Command                   Description
longprobe init            Create starter configuration files
longprobe check           Run probes against the golden set
longprobe diff            Compare current results against baseline
longprobe baseline save   Save current results as baseline
longprobe baseline list   List all saved baselines
longprobe watch           Watch the golden file and re-run on changes
longprobe generate        Auto-generate Golden Questions from documents
longprobe capture         Build goldens.yaml by querying your retriever

Examples

# Initialize project
longprobe init

# Run checks with custom config
longprobe check -g goldens.yaml -c longprobe.yaml

# Save baseline for comparison
longprobe baseline save --label v1.0

# Compare against baseline
longprobe diff --baseline v1.0

# Watch mode for development
longprobe watch --interval 2

# Generate questions from documents
longprobe generate ./docs --capture --auto

Python API

Basic Usage

from longprobe import LongProbe
from longprobe.adapters import create_adapter

# Create adapter for your vector store
adapter = create_adapter(
    "chroma",
    collection_name="my_documents",
    persist_directory="./chroma_db"
)

# Create and run probe
probe = LongProbe(
    adapter=adapter,
    goldens_path="goldens.yaml",
    config_path="longprobe.yaml"
)
report = probe.run()

print(f"Overall Recall: {report.overall_recall:.2%}")
print(f"Pass Rate: {report.pass_rate:.2%}")

Baseline Management

from longprobe import LongProbe
from longprobe.adapters import create_adapter

adapter = create_adapter("chroma", collection_name="docs", persist_directory="./db")
probe = LongProbe(adapter=adapter, goldens_path="goldens.yaml")

# Run and save baseline
report = probe.run()
probe.save_baseline(label="v1.0")

# After making changes...
report2 = probe.run()

# Compare against baseline
diff = probe.diff(baseline_label="v1.0")
print(f"Regressions: {len(diff['regressions'])}")
print(f"Improvements: {len(diff['improvements'])}")

With LangChain

from longprobe import LongProbe
from longprobe.adapters import LangChainRetrieverAdapter

# Wrap your existing LangChain retriever
adapter = LangChainRetrieverAdapter(your_langchain_retriever)
probe = LongProbe(adapter=adapter, goldens_path="goldens.yaml")
report = probe.run()

assert report.overall_recall >= 0.85, f"Recall too low: {report.overall_recall}"

With LlamaIndex

from longprobe import LongProbe
from longprobe.adapters import LlamaIndexRetrieverAdapter

adapter = LlamaIndexRetrieverAdapter(your_llamaindex_retriever)
probe = LongProbe(adapter=adapter, goldens_path="goldens.yaml")
report = probe.run()

Pytest Integration

Configuration

# conftest.py
import pytest
from longprobe import LongProbe
from longprobe.adapters import create_adapter

@pytest.fixture
def probe():
    adapter = create_adapter(
        "chroma",
        collection_name="test_docs",
        persist_directory="./test_db"
    )
    return LongProbe(
        adapter=adapter,
        goldens_path="tests/goldens.yaml",
        recall_threshold=0.85
    )

Writing Tests

def test_retrieval_recall(probe):
    report = probe.run()
    assert report.overall_recall >= 0.85, (
        f"Recall dropped to {report.overall_recall:.2f}"
    )

def test_no_regression_vs_baseline(probe):
    report = probe.run()
    assert not report.regression_detected, (
        f"Regression detected! Delta: {report.recall_delta}"
    )

Retriever Adapters

LongProbe supports multiple vector stores and retrieval frameworks:

Adapter      Type           Configuration
ChromaDB     Direct         type: chroma
Pinecone     Direct         type: pinecone
Qdrant       Direct         type: qdrant
HTTP API     Direct         type: http
LangChain    Programmatic   LangChainRetrieverAdapter
LlamaIndex   Programmatic   LlamaIndexRetrieverAdapter

ChromaDB Example

retriever:
  type: chroma
  collection: my_collection
  persist_directory: ./chroma_db

HTTP API Example

retriever:
  type: http
  url: "http://localhost:8000/api/retrieve"
  method: "POST"
  body_template: '{"query": "{question}"}'
  response_mapping:
    results_path: "data.chunks"
    text_field: "content"
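The response_mapping tells LongProbe where to find chunks inside the JSON your API returns. A minimal sketch of how such a mapping can be resolved against a response body (the `get_path` and `extract_chunks` helpers are illustrative, not the library's actual code):

```python
def get_path(obj: dict, dotted: str):
    """Walk a dotted path like 'data.chunks' through nested dicts."""
    for key in dotted.split("."):
        obj = obj[key]
    return obj

def extract_chunks(response: dict, results_path: str, text_field: str) -> list:
    """Pull the retrieved chunk texts out of an HTTP retriever response."""
    return [item[text_field] for item in get_path(response, results_path)]

# A response shaped the way the mapping above expects
response = {"data": {"chunks": [{"content": "net 30 days from invoice"},
                                {"content": "termination clause text"}]}}
texts = extract_chunks(response, "data.chunks", "content")
```

With this mapping, LongProbe can treat any JSON retrieval API as a retriever without a dedicated adapter.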

GitHub Actions

name: RAG Regression Check

on: [push, pull_request]

jobs:
  rag-probe:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v4
      - run: uv pip install longprobe[chroma]
      - name: Run RAG regression check
        run: longprobe check --goldens goldens.yaml --output github
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

Match Modes

ID Match (match_mode: "id")

Exact string match on chunk/document IDs. Best when you control the IDs in your vector store.

Text Match (match_mode: "text")

Case-insensitive substring matching. Checks if the required text appears anywhere in the retrieved documents.

Semantic Match (match_mode: "semantic")

Word-frequency cosine similarity. Useful when exact text may vary but meaning should be preserved.
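The three modes can each be sketched in a few lines. The word-frequency cosine used for the semantic mode below is an assumption based on the description above, not necessarily LongProbe's exact implementation:

```python
import math
from collections import Counter

def id_match(required_id: str, retrieved_ids: list) -> bool:
    """ID mode: exact string match on chunk/document IDs."""
    return required_id in retrieved_ids

def text_match(required_text: str, retrieved_texts: list) -> bool:
    """Text mode: case-insensitive substring match against any retrieved doc."""
    needle = required_text.lower()
    return any(needle in doc.lower() for doc in retrieved_texts)

def semantic_match(required_text: str, retrieved_texts: list,
                   threshold: float = 0.80) -> bool:
    """Semantic mode: word-frequency cosine similarity against each doc."""
    def cosine(a: str, b: str) -> float:
        va, vb = Counter(a.lower().split()), Counter(b.lower().split())
        dot = sum(va[w] * vb[w] for w in va)
        norm = (math.sqrt(sum(c * c for c in va.values()))
                * math.sqrt(sum(c * c for c in vb.values())))
        return dot / norm if norm else 0.0
    return any(cosine(required_text, doc) >= threshold for doc in retrieved_texts)
```

In practice, ID matching is the most robust choice when chunk IDs are stable across re-ingestion; text and semantic matching trade precision for resilience to re-chunking.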

Development

# Install for development
git clone https://github.com/ENDEVSOLS/LongProbe.git
cd LongProbe
uv sync --dev

# Run tests
uv run pytest tests/unit/ -v
uv run pytest tests/ -v --run-integration

# Lint and format
uv run ruff check src/
uv run ruff format src/

How It Works

goldens.yaml → GoldenLoader → QueryEmbedder → RetrieverAdapter → RecallScorer
                                                                      ↓
                                                                BaselineStore → DiffReporter
  1. Define your Golden Questions + Required Fact Chunks in YAML
  2. Embed each question using your configured embedding model
  3. Retrieve from your live vector store using the pluggable adapter
  4. Score each question by checking if required chunks appear in Top-K results
  5. Compare against saved baselines to detect regressions
  6. Report a Recall Score, diff of lost chunks, and optionally fail CI/CD
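Steps 4 through 6 reduce to a simple recall computation per question. A minimal sketch, assuming each question contributes the fraction of its required chunks found in the top-K results and that the overall score is the mean across questions (LongProbe's exact aggregation may differ):

```python
def question_recall(required: list, retrieved_top_k: list) -> float:
    """Fraction of required chunks that appear in the top-K retrieved results."""
    if not required:
        return 1.0
    found = sum(1 for chunk in required if chunk in retrieved_top_k)
    return found / len(required)

def overall_recall(results: dict) -> float:
    """Mean per-question recall over {question_id: (required, retrieved)} pairs."""
    if not results:
        return 1.0
    scores = [question_recall(req, got) for req, got in results.values()]
    return sum(scores) / len(scores)

def lost_chunks(baseline: list, current: list) -> list:
    """A regression shows up as chunks matched in the baseline but missing now."""
    return [c for c in baseline if c not in current]
```

Comparing `lost_chunks` per question against the baseline run is what produces the exact diff that the CLI reports.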

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

Security

For security issues, please see SECURITY.md.

License

MIT License — see LICENSE for details.

