rag-quality-kit
Heuristic quality metrics for RAG retrieval and grounded answers. Zero runtime dependencies, pure-Python.
Python port of @mukundakatta/rag-quality-kit. The JS sibling has the original heuristics; this README sticks to the Python API.
Install
pip install rag-quality-kit
Usage
from rag_quality_kit import score, missing_evidence
question = "Who wrote Hamlet and when was it first performed?"
contexts = [
{"id": "doc-1", "text": "Hamlet is a tragedy by William Shakespeare, written around 1600."},
{"id": "doc-2", "text": "Records suggest Hamlet was first performed in 1602."},
]
answer = "Hamlet was written by Shakespeare and first performed in 1602."
r = score(question, contexts, answer)
r.groundedness # 0..1 -- answer terms that appear in any context
r.context_relevance # 0..1 -- question terms covered by the contexts
r.answer_relevance # 0..1 -- question terms covered by the answer
r.conciseness # 0..1 -- 1.0 if answer is roughly question-sized, decays as it balloons
r.overall # unweighted mean of the four
missing_evidence(answer, contexts) # -> list[str] of answer terms not in any context
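To make the shape of the result concrete, here is a minimal sketch of a dataclass with the four metric fields and an `overall` computed as their unweighted mean. `QualityResultSketch` is a hypothetical stand-in written for illustration; the library's actual `QualityResult` may be defined differently (for example, it may store `overall` as a plain field).

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass(frozen=True)
class QualityResultSketch:
    """Hypothetical stand-in for the result object; not the library's class."""

    groundedness: float
    context_relevance: float
    answer_relevance: float
    conciseness: float

    @property
    def overall(self) -> float:
        # Unweighted mean of the four metrics, matching the comment on r.overall.
        return fmean(
            (
                self.groundedness,
                self.context_relevance,
                self.answer_relevance,
                self.conciseness,
            )
        )


# Illustrative values only; these are not scores produced by the library.
r = QualityResultSketch(
    groundedness=1.0,
    context_relevance=0.5,
    answer_relevance=0.75,
    conciseness=0.75,
)
print(r.overall)
```

Because the mean is unweighted, a single weak metric pulls `overall` down by at most a quarter of its shortfall, so it is worth inspecting the individual fields as well.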
Metrics
| Metric | Range | Behavior |
|---|---|---|
| `groundedness` | 0..1 | Fraction of answer terms found in any context. |
| `context_relevance` | 0..1 | Fraction of (longer) question terms covered by the contexts. Mirrors the JS `retrievalCoverage`. |
| `answer_relevance` | 0..1 | Fraction of question terms that the answer addresses. |
| `conciseness` | 0..1 | 1.0 when the answer is up to ~2x the question's term count; linearly decays to 0 at 10x. |
| `overall` | 0..1 | Unweighted mean of the four. |
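The conciseness rule can be read as a piecewise-linear function of the answer-to-question term ratio. The sketch below is one way to write that down, assuming the breakpoints are exactly 2x and 10x (the README says "~2x", so the library's real thresholds may differ) and assuming a score of 0.0 when the question has no terms. `conciseness_sketch` is a hypothetical name, not part of the package.

```python
def conciseness_sketch(question_terms: int, answer_terms: int) -> float:
    """Return 1.0 up to 2x the question's term count, decaying linearly to 0.0 at 10x."""
    if question_terms <= 0:
        # Assumption: without question terms there is no baseline to compare against.
        return 0.0
    ratio = answer_terms / question_terms
    if ratio <= 2.0:
        return 1.0
    if ratio >= 10.0:
        return 0.0
    # Linear interpolation between (2, 1.0) and (10, 0.0).
    return (10.0 - ratio) / 8.0


print(conciseness_sketch(5, 10))  # ratio 2: still on the flat part
print(conciseness_sketch(5, 30))  # ratio 6: halfway down the slope
print(conciseness_sketch(5, 50))  # ratio 10: fully decayed
```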
All metrics are heuristic and based on token overlap: fast, deterministic, and free of LLM calls. For evaluation-grade scoring, layer an LLM judge on top.
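As an illustration of what a token-overlap heuristic looks like, the sketch below computes the fraction of one text's terms that appear in a set of other texts, plus the list of terms left uncovered. It is a simplified stand-in, not the library's code: the tokenizer (lowercased alphanumeric runs, no stop-word or length filtering) and the names `terms`, `overlap_fraction`, and `uncovered_terms` are assumptions made for this example, so the numbers will not necessarily match what `score` or `missing_evidence` return.

```python
import re


def terms(text: str) -> set[str]:
    # Assumption: a "term" is any lowercased run of letters or digits.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_fraction(source: str, targets: list[str]) -> float:
    """Fraction of the terms in `source` that occur in at least one target text."""
    source_terms = terms(source)
    if not source_terms:
        return 0.0
    target_terms = set().union(*(terms(t) for t in targets))
    return len(source_terms & target_terms) / len(source_terms)


def uncovered_terms(source: str, targets: list[str]) -> list[str]:
    """Sorted terms from `source` that occur in none of the target texts."""
    target_terms = set().union(*(terms(t) for t in targets))
    return sorted(terms(source) - target_terms)


context_texts = [
    "Hamlet is a tragedy by William Shakespeare, written around 1600.",
    "Records suggest Hamlet was first performed in 1602.",
]
answer_text = "Hamlet was written by Shakespeare and first performed in 1602."

print(overlap_fraction(answer_text, context_texts))
print(uncovered_terms(answer_text, context_texts))
```

With this naive tokenizer, a function word such as "and" counts against the answer, which is one reason a real implementation might filter short or common terms.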
API differences from the JS sibling
- Python signature is `score(question, contexts, answer)` (positional) instead of `scoreRag({ query, answer, contexts })`.
- Returns a `QualityResult` dataclass instead of a plain object.
- Metric names: `context_relevance` (was `retrievalCoverage`), `groundedness` is unchanged. Adds two extra heuristics: `answer_relevance` and `conciseness`. The aggregate is `overall` (was `score`) and now averages all four.
- Drops the `citationCoverage` metric -- it's heavily citation-format dependent and best owned by the calling app. Use `missing_evidence(answer, contexts)` for an analogous signal.
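When porting call sites from the JS sibling, the main thing to watch is argument order. The helper below is a small sketch that reorders a JS-style payload into the positional order used by the Python `score`. `to_positional_args` is a hypothetical helper, not part of the package, and it assumes the payload's contexts already use the same `{"id", "text"}` shape shown in the usage example.

```python
from typing import Any


def to_positional_args(
    payload: dict[str, Any],
) -> tuple[str, list[dict[str, str]], str]:
    """Reorder a { query, answer, contexts } payload into (question, contexts, answer)."""
    return payload["query"], list(payload["contexts"]), payload["answer"]


js_style_payload = {
    "query": "Who wrote Hamlet?",
    "answer": "Shakespeare wrote Hamlet.",
    "contexts": [{"id": "doc-1", "text": "Hamlet is a tragedy by William Shakespeare."}],
}

question, contexts, answer = to_positional_args(js_style_payload)
# With the package installed, the call would then be: score(question, contexts, answer)
print(question, len(contexts), answer)
```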
See the JS sibling for the original heuristics and broader design notes.