ragpeek

A lightweight debugger for RAG pipelines.

When a RAG pipeline returns a bad answer, the usual move is to print the retrieved chunks and squint at them. ragpeek replaces the squinting: wrap your pipeline in one decorator and it shows you, per query, what was retrieved, the score of every chunk, the exact prompt sent to the model, and a plain-English read on where things went sideways: retrieval, context ranking, or generation.

Ask a question in one command, no code required (output depends on your question and the LLM):

$ ragpeek demo
Question> How hot is Venus?

Retrieval  k=4/4
  ✓ 0.77  Venus is the hottest planet, with surface temperatures…
  ✗ 0.39  Mercury is the smallest planet and the closest to the Sun.
  ✗ 0.34  Neptune is the most distant planet from the Sun…
  ✗ 0.31  Mars hosts Olympus Mons, the tallest volcano…

  ⚠ 3 of 4 chunks sit in the lower half of this result's score range
    (top 0.77, bottom 0.31): possible low-relevance padding.
  ✓ Sharp rank-1 separation (0.77 vs 0.39): the retriever cleanly
    separates the top match, a precision signal.

Generation  model=llama3.2
  Venus's average surface temperature is around 465 °C…
  ✓ Generation looks healthy - no obvious signals.

Score convention: ragpeek assumes higher scores mean more relevant chunks. If your vector store returns distances, convert them to similarities first; see "Works with any vector store".

Install

pip install ragpeek

The default install is lightweight: its only runtime dependency is rich. For the embedding-based context analyzer (and ragpeek demo, which retrieves with real embeddings), add the semantic extra:

pip install "ragpeek[semantic]"

Requires Python 3.10+. The first semantic run downloads a small embedding model (~80 MB), once. ragpeek demo also generates an answer if a local Ollama server is running; without one it shows retrieval only.

From source:

git clone https://github.com/meutsabdahal/ragpeek
cd ragpeek
uv sync --group dev        # create the env + install dev deps
uv run pytest tests/ -v

Command line

Once installed, ragpeek is a command:

ragpeek demo                       # prompts for a question, then retrieves + answers + traces it
ragpeek demo "How hot is Venus?"   # or pass the question directly
ragpeek demo --model mistral       # choose the Ollama model (default: llama3.2)
ragpeek demo --html report.html    # also save a shareable HTML report
ragpeek path/to/trace.json         # view a saved trace (from @trace(output=...) / serialize_trace)
ragpeek                            # help

ragpeek demo retrieves over a small built-in corpus with real embeddings (needs the semantic extra) and answers via a local Ollama server if one is running. Running from a source checkout instead of an install? Prefix with uv run:

uv run ragpeek demo "How hot is Venus?"
uv run ragpeek demo --html report.html              # also save an HTML report
uv run ragpeek tests/fixtures/sample_session.json   # view a saved trace
uv run ragpeek                                      # help

Instrument your pipeline

Tracing your own pipeline is two imports and two log calls. ragpeek never monkey-patches your stack, so it works with any retriever and any model.

from ragpeek import trace, log_retrieval, log_generation

@trace
def answer_question(query: str) -> str:
    docs, scores = retriever.search(query, k=5)
    log_retrieval(query=query, chunks=docs, scores=scores)

    prompt = build_prompt(docs, query)
    response = llm.generate(prompt)
    log_generation(prompt=prompt, response=response, model="llama3.2")

    return response

Call the function exactly as before; the trace prints automatically:

answer_question("Which is the largest planet in the Solar System?")

Async pipelines work the same way; the active session follows your coroutines through every await (it rides a contextvars.ContextVar), so concurrent queries never cross-contaminate:

@trace
async def answer(query: str) -> str:
    docs, scores = await retriever.asearch(query, k=5)
    log_retrieval(query=query, chunks=docs, scores=scores)

    prompt = build_prompt(docs, query)   # build once so the logged prompt is the one sent
    response = await llm.acomplete(prompt)
    log_generation(prompt=prompt, response=response, model="llama3.2")
    return response
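
The isolation claim can be checked with the standard library alone. The sketch below is not ragpeek's code: `current_session`, `log`, and `handle` are hypothetical names, used only to show how a ContextVar keeps the logs of two concurrent coroutines apart.

```python
import asyncio
from contextvars import ContextVar
from typing import Optional

# Hypothetical stand-in for a tracer's "active session" slot.
current_session: ContextVar[Optional[list]] = ContextVar("current_session", default=None)


def log(event: str) -> None:
    """Append an event to whichever session is active in this context."""
    session = current_session.get()
    if session is not None:
        session.append(event)


async def handle(query: str) -> list:
    # Each asyncio task runs in its own copy of the context, so this
    # set() is invisible to the other task.
    session: list = []
    current_session.set(session)
    log(f"retrieval:{query}")
    await asyncio.sleep(0)  # yield so the two tasks interleave
    log(f"generation:{query}")
    return session


async def main() -> list:
    # gather() wraps each coroutine in its own task.
    return await asyncio.gather(handle("a"), handle("b"))


print(asyncio.run(main()))
```

Even though the two tasks interleave at the `await`, each returned list holds only its own query's events, because `asyncio` gives every task a separate copy of the context.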

Configuration

Pass a TracerConfig to tune thresholds, or flip decorator flags for common cases:

from ragpeek import trace, TracerConfig

config = TracerConfig(
    score_gap_threshold=0.3,     # rank-1→rank-2 gap that reads as precision
    semantic=True,               # embedding-based context analysis
    show_prompt=False,           # hide the full prompt in terminal output
    # min_score_threshold=0.6,   # opt-in absolute floor — only set once you've
    #                            # calibrated a cutoff for your own embedder
)

@trace(config=config)
def answer(query: str) -> str:
    ...

# Decorator flags for common cases. Each line is a separate example, not a stack:
@trace(semantic=False)              # skip the embedding model (faster, no download)
@trace(output="report.html")        # save a shareable HTML report
@trace(render=False)                # don't print; just populate session.analysis_report

With render=False the analyzers still run. Grab the finalized session and hand it to downstream tooling with serialize_trace(...); deserialize_trace(...) reads it back, and the CLI can display a saved trace (ragpeek trace.json).

Works with any vector store

log_retrieval takes similarity scores (higher = better). Most stores return those directly; some return distances, which you must convert first.

# ChromaDB (cosine space): distance ∈ [0, 2] → similarity = 1 - distance
results = collection.query(query_texts=[query], n_results=5)
log_retrieval(query=query,
              chunks=results["documents"][0],
              scores=[1.0 - d for d in results["distances"][0]])

# FAISS IndexFlatL2 with normalized vectors: search() returns squared L2
# distances, so similarity = 1 - d_sq / 2 (do not square again)
distances, indices = index.search(query_embedding, k=5)
log_retrieval(query=query,
              chunks=[corpus[i] for i in indices[0]],
              scores=[1.0 - d_sq / 2 for d_sq in distances[0].tolist()])

# Qdrant (cosine): .score is already a similarity — use it as-is
results = client.search("docs", query_vector=embedding, limit=5)
log_retrieval(query=query,
              chunks=[r.payload["text"] for r in results],
              scores=[r.score for r in results])

Note on scores: ragpeek assumes higher score = more relevant. There is no single distance→similarity formula; convert per metric:

Store returns	Correct conversion
Cosine distance (∈ [0, 2])	`score = 1.0 - distance` (exact)
L2 / Euclidean, normalized vectors	`score = 1.0 - distance ** 2 / 2` (exact)
L2 / Euclidean, un-normalized	`score = 1.0 / (1.0 + distance)` (monotonic squash)
Inner product / dot product	already a similarity; use as-is (negate if returned as a distance)

score = 1.0 - distance is only correct for cosine distance; using it on raw L2 distances silently produces wrong (often negative) similarities.
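
The normalized-L2 row follows from the identity ‖a − b‖² = 2 − 2·cos(a, b) for unit-length vectors, so cos = 1 − d² / 2. As a minimal sketch, the conversions can be wrapped in one helper; `to_similarity` and its metric names are hypothetical, not part of ragpeek. It assumes a true (not squared) Euclidean distance; if your store already reports squared L2, as FAISS IndexFlatL2 does, skip the squaring.

```python
import math


def to_similarity(distance: float, metric: str) -> float:
    """Convert a distance to a higher-is-better score (hypothetical helper)."""
    if metric == "cosine":          # cosine distance in [0, 2]
        return 1.0 - distance
    if metric == "l2_normalized":   # true L2 distance between unit-length vectors
        return 1.0 - distance ** 2 / 2
    if metric == "l2":              # un-normalized L2: monotonic squash only
        return 1.0 / (1.0 + distance)
    raise ValueError(f"unknown metric: {metric!r}")


# Check the normalized-L2 identity on two unit vectors 60 degrees apart.
a = (1.0, 0.0)
b = (math.cos(math.pi / 3), math.sin(math.pi / 3))
cosine = a[0] * b[0] + a[1] * b[1]   # cosine similarity of a and b
l2 = math.dist(a, b)                 # Euclidean distance between a and b

assert math.isclose(to_similarity(l2, "l2_normalized"), cosine)
assert math.isclose(to_similarity(1.0 - cosine, "cosine"), cosine)
```

Inner-product scores need no conversion, which is why the helper has no branch for them.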

Need a non-default retrieval→generation association? Keep the returned span objects and pair them explicitly:

from ragpeek import trace, log_retrieval, log_generation, link_retrieval_to_generation

@trace(render=False)
def answer(query: str) -> str:
    retrieval = log_retrieval(query=query, chunks=["chunk"], scores=[0.9])
    response = llm.complete(query)
    generation = log_generation(prompt=query, response=response, model="llama3.2")
    link_retrieval_to_generation(retrieval, generation)
    return response

What it surfaces

These are signals to calibrate, not verdicts. Scores are read within each result set, so they don't assume an absolute scale; tune thresholds to your own embedder.

Signal	What it means
Within-set padding	Most chunks fall in the lower half of this result's score range (relative, not an absolute cutoff)
Sharp rank-1 separation	The retriever cleanly separates the top match: a precision signal, not noise
Flat distribution	Scores barely differ: the retriever can't discriminate (query too vague / chunks too broad)
k mismatch	Retriever returned fewer chunks than requested
Rank disagreement	The answer aligns with a chunk the retriever didn't rank first: a reranking signal
Low context utilisation	The response is semantically dissimilar to every retrieved chunk
Hedging language	Phrase-level signal: the model may be answering from training weights, not context
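
To make the within-set signals concrete, here is a rough sketch that reproduces the numbers from the demo trace (scores 0.77, 0.39, 0.34, 0.31). The function names, the "more than half the chunks" rule, and the exact comparisons are assumptions for illustration; ragpeek's analyzers may compute these differently. The 0.3 gap is the score_gap_threshold value shown in the Configuration example.

```python
def lower_half_count(scores: list[float]) -> int:
    """Count scores below the midpoint of this result set's own range."""
    midpoint = (max(scores) + min(scores)) / 2
    return sum(1 for s in scores if s < midpoint)


def rank1_gap(scores: list[float]) -> float:
    """Gap between the best and second-best score (needs at least two scores)."""
    ranked = sorted(scores, reverse=True)
    return ranked[0] - ranked[1]


scores = [0.77, 0.39, 0.34, 0.31]   # scores from the demo trace
requested_k = 4

# Assumed rule: flag padding when more than half the chunks sit in the lower half.
padded = lower_half_count(scores) > len(scores) / 2
# Assumed rule: the rank-1 gap must reach the configured threshold.
sharp = rank1_gap(scores) >= 0.3
# Fewer chunks returned than requested.
k_mismatch = len(scores) < requested_k

print(padded, sharp, k_mismatch)
```

Because the midpoint is derived from the result set's own top and bottom scores, the padding check needs no absolute cutoff, which matches the "relative, not an absolute cutoff" note in the table.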

How it works

1. @trace wraps your function and opens a TraceSession.
2. The session id lives in a contextvars.ContextVar, so it propagates through both sync and async code without you threading anything through your call stack.
3. log_retrieval() and log_generation() read that ContextVar and append spans to the active session.
4. When your function returns, three analyzers run over the collected spans:
   - Retrieval: within-set score distribution, low-relevance padding, rank-1 precision, k mismatch.
   - Context: chunk↔response similarity and the rank-disagreement (reranking) signal.
   - Generation: hedging language and response-length anomalies.
5. The terminal renderer prints the trace; the HTML renderer saves a shareable report.

The embedding model runs entirely on your machine; your data never leaves it.

Limitations

Explicit, not magic. You call log_retrieval / log_generation yourself; ragpeek doesn't patch framework internals. That's three lines of instrumentation per pipeline, traded for working with any stack.
Signals, not truth. Retrieval signals are computed within each result set and assume higher = better, but they can't know your embedder's absolute scale. Treat every diagnosis as a prompt to calibrate, and convert distances to similarities per metric (table above) before calling log_retrieval.

Contributing

Issues and PRs welcome. If a vector-store integration doesn't work or a diagnosis looks wrong, open an issue with a minimal reproduction.

License

MIT

Project details

This version: 0.1.0 (Jun 22, 2026)
