Evaluation layer for ragradar observability system
ragradar-evaluate
Scores captured runs. Two tasks, one discovery helper:
| Task | Call | Cost |
|---|---|---|
| "Is this run healthy?" | check(run_id) | free: deterministic, no LLM, instant |
| "Score it fully" | evaluate(run_id) | free input metrics + LLM-judged output metrics |
| "What can be scored?" | available_metrics() | free |
pip install ragradar-evaluate
Is this run healthy? — check()
Call before paying for an LLM; put it in CI.
import ragradar_capture
from ragradar_evaluate import check
run_id = ragradar_capture.capture(
"what is RRF?", "RRF fuses rankings.",
chunks=[{"chunk_id": "c1", "source_doc_id": "d1",
"content": "RRF combines rankings.", "token_count": 10,
"rerank_score": 0.9}],
)
result = check(run_id)
print(result.verdict) # "ok" | "warn" | "fail"
print(result.problems) # ["duplicate chunks: ratio 0.50 exceeds 0.20", ...]
print(result.risk_score) # 0.0-1.0, None if input quality couldn't be assessed
print(result.factors) # per-factor {value, threshold, status}
print(result.thresholds) # "learned" | "policy" — which standards were applied
check() compares all free input metrics against the current
standards: once at least 10 evaluated runs exist for the pipeline it
uses thresholds learned from your own history (and says so via
thresholds == "learned"); before that it falls back to the policy
defaults. A run captured without chunks gets a warn verdict explaining
the missing data — never an exception.
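As a rough illustration of the "put it in CI" advice, the sketch below maps a check()-style result to a process exit code. The `gate` function and its `fail_on_warn` flag are hypothetical helpers, not part of ragradar-evaluate, and a `SimpleNamespace` stands in for the real result so the snippet runs without a captured run. Only the `verdict` and `problems` fields described above are assumed.

```python
import sys
from types import SimpleNamespace


def gate(result, fail_on_warn=False):
    """Map a check()-style result to an exit code (0 = pass, 1 = block)."""
    # Echo every reported problem so the CI log explains the verdict.
    for problem in result.problems:
        print(f"[{result.verdict}] {problem}", file=sys.stderr)
    if result.verdict == "fail":
        return 1
    if result.verdict == "warn" and fail_on_warn:
        return 1
    return 0


# Stand-in for the object returned by check(run_id); in a real pipeline
# this line would be `result = check(run_id)`.
stub = SimpleNamespace(
    verdict="warn",
    problems=["duplicate chunks: ratio 0.50 exceeds 0.20"],
)

lenient_code = gate(stub)                     # warnings pass
strict_code = gate(stub, fail_on_warn=True)   # warnings block
print(lenient_code, strict_code)
```

In a CI script the return value would typically be handed to `sys.exit()`. Whether a warn verdict should block a build is a project decision, which is why the sketch leaves it as a flag.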
Score it fully — evaluate()
import ragradar_capture
from ragradar_evaluate import evaluate
run_id = ragradar_capture.capture(
"what is RRF?", "RRF fuses rankings.",
chunks=[{"chunk_id": "c1", "source_doc_id": "d1",
"content": "RRF combines rankings.", "token_count": 10,
"rerank_score": 0.9}],
)
# Complete eval: every metric applicable to the record.
result = evaluate(run_id)
# One atomic metric — nothing else is computed:
result = evaluate(run_id, metrics=["duplicates"], save=False)
# A chosen subset:
result = evaluate(run_id, metrics=["relevance", "faithfulness"])
target, the first argument to evaluate(), can be an sNrN string (what ragradar_capture.capture() returns), a
committed Capture object, or a bare RunRecord. For a bare RunRecord, pass
save=False, because there is no run row to write to.
EvalResult
| Field | Meaning |
|---|---|
| metrics | per-metric results: a dict of values for input families, a float for RAGAS metrics |
| skipped | metric → reason: "not requested", "missing data: ...", or "requires ground_truth" |
| errors | metric → error string; RAGAS-not-installed and RAGAS runtime failures land here identically; evaluate() never raises for judge failures |
| policy_violations | policy thresholds breached by the computed values |
| risk_score | None when input metrics weren't computed; 0.0 only ever means "computed, no risk" |
| run_id / saved | identity and whether scores were persisted |
save=True (default) persists via the one store path; ragradar explain <run_id> then shows the scores alongside its analysis.
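One way to consume these fields is sketched below. `StubEvalResult` is a hypothetical stand-in that only mirrors the field names from the table; the `"ratio"` key inside the metrics dict and the error message are illustrative values, not documented output. The sketch shows two things: judge failures are read from `errors` rather than caught as exceptions, and `risk_score` is compared against `None` explicitly so that 0.0 is not mistaken for "not computed".

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StubEvalResult:
    """Hypothetical stand-in carrying the fields listed in the table."""
    metrics: dict = field(default_factory=dict)
    skipped: dict = field(default_factory=dict)
    errors: dict = field(default_factory=dict)
    policy_violations: list = field(default_factory=list)
    risk_score: Optional[float] = None


def summarize(result):
    """Return human-readable lines describing an EvalResult-like object."""
    lines = []
    for name, value in result.metrics.items():
        lines.append(f"scored  {name}: {value}")
    for name, reason in result.skipped.items():
        lines.append(f"skipped {name}: {reason}")
    # Judge failures are reported here instead of being raised.
    for name, error in result.errors.items():
        lines.append(f"error   {name}: {error}")
    # None and 0.0 mean different things, so test for None, not truthiness.
    if result.risk_score is None:
        lines.append("risk: not computed")
    else:
        lines.append(f"risk: {result.risk_score:.2f}")
    return lines


example = StubEvalResult(
    metrics={"duplicates": {"ratio": 0.5}},            # illustrative value
    skipped={"context_recall": "requires ground_truth"},
    errors={"faithfulness": "ragas is not installed"},  # illustrative text
    risk_score=0.0,
)
summary = summarize(example)
print("\n".join(summary))
```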
available_metrics()
| Metric | Layer | Cost | Requires |
|---|---|---|---|
| relevance | input | free | chunks |
| duplicates | input | free | chunks |
| truncation | input | free | chunks |
| token_efficiency | input | free | chunks |
| coherence | input | free | chunks |
| faithfulness | output | llm | chunks, response |
| answer_relevancy | output | llm | chunks, response |
| context_precision | output | llm | chunks, response |
| context_recall | output | llm | chunks, response, ground_truth |
Output metrics are RAGAS LLM-as-judge calls — they cost money and need a
configured judge. To stay free-only, use check(), or select input
metrics explicitly: evaluate(run_id, metrics=["relevance", "duplicates", "truncation", "token_efficiency", "coherence"]).
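Rather than typing the free-only list by hand, it can be derived from a metric catalog. The source does not say what shape available_metrics() returns, so the sketch below works from a hand-copied `CATALOG` of (name, layer, cost) tuples taken from the table; `metrics_with_cost` is a hypothetical helper.

```python
# Catalog transcribed from the metrics table as (name, layer, cost) tuples.
# This is a hand-made copy, not the return value of available_metrics().
CATALOG = [
    ("relevance", "input", "free"),
    ("duplicates", "input", "free"),
    ("truncation", "input", "free"),
    ("token_efficiency", "input", "free"),
    ("coherence", "input", "free"),
    ("faithfulness", "output", "llm"),
    ("answer_relevancy", "output", "llm"),
    ("context_precision", "output", "llm"),
    ("context_recall", "output", "llm"),
]


def metrics_with_cost(catalog, cost):
    """Return metric names whose cost column equals `cost`, in table order."""
    return [name for name, _layer, metric_cost in catalog if metric_cost == cost]


free_metrics = metrics_with_cost(CATALOG, "free")
paid_metrics = metrics_with_cost(CATALOG, "llm")
print(free_metrics)
```

The resulting `free_metrics` list is the value that would be passed as `metrics=` to keep an evaluate() call free of judge costs.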
Policy system
Human-set thresholds encoding known failure modes; active from day one
and the fallback standard for check(). Stored per pipeline.
ragradar-evaluate policy show
ragradar-evaluate policy set max_duplicate_ratio 0.1
ragradar-evaluate policy reset
Programmatic override: evaluate(run_id, policy=InputQualityPolicy(...)).
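To make the idea of a threshold breach concrete, here is a minimal sketch of comparing measured values against upper-bound policy keys. It is not the library's InputQualityPolicy: `find_violations` is a hypothetical function, only the `max_duplicate_ratio` key comes from the commands above, and the `duplicate_ratio` value name is an assumption made for illustration.

```python
def find_violations(values, policy):
    """List every measured value that exceeds its `max_*` policy limit."""
    violations = []
    for key, limit in policy.items():
        if not key.startswith("max_"):
            continue  # this sketch only understands upper bounds
        name = key[len("max_"):]
        measured = values.get(name)
        # A metric that was not measured cannot breach its limit.
        if measured is not None and measured > limit:
            violations.append(f"{name}: {measured:.2f} exceeds {limit:.2f}")
    return violations


policy = {"max_duplicate_ratio": 0.1}
violations = find_violations({"duplicate_ratio": 0.5}, policy)
print(violations)
```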
Benchmark lifecycle (CLI)
Learned thresholds accumulate as you evaluate real runs — check()
picks them up automatically at 10+ evaluated runs. The CLI exposes the
machinery for inspection:
ragradar-evaluate run s2r3 # evaluate one run (both layers)
ragradar-evaluate run s2r3 --input-only # free metrics only
ragradar-evaluate run --session s2 # evaluate a whole session
ragradar-evaluate benchmark show # current learned thresholds
ragradar-evaluate benchmark build # rebuild from evaluated history
ragradar-evaluate benchmark check s2r3 # factor-by-factor threshold check
ragradar-evaluate benchmark export # RAGAS-compatible JSONL dataset
Passing --input-only and --output-only together is an error, because the
combination would compute nothing.
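How the real CLI rejects that flag combination is not stated here. One common way to get the same behavior is an argparse mutually exclusive group, sketched below with a hypothetical `run-sketch` parser that mirrors the `run` subcommand's arguments.

```python
import argparse


def build_parser():
    """Build a parser resembling `ragradar-evaluate run` (illustrative only)."""
    parser = argparse.ArgumentParser(prog="run-sketch")
    parser.add_argument("run_id", nargs="?")
    parser.add_argument("--session")
    # argparse refuses any command line that sets both flags in this group.
    layers = parser.add_mutually_exclusive_group()
    layers.add_argument("--input-only", action="store_true")
    layers.add_argument("--output-only", action="store_true")
    return parser


def parses_ok(argv):
    """Return True if argv is accepted, False if argparse exits with an error."""
    try:
        build_parser().parse_args(argv)
    except SystemExit:
        return False
    return True


print(parses_ok(["s2r3", "--input-only"]))
print(parses_ok(["s2r3", "--input-only", "--output-only"]))
```

On a conflicting command line argparse prints a usage error to stderr and exits, which the helper converts into a boolean so the behavior is easy to observe.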