Evaluation layer for the ragradar observability system

ragradar-evaluate

Scores captured runs. Two tasks, one discovery helper:

| Task | Call | Cost |
| --- | --- | --- |
| "Is this run healthy?" | `check(run_id)` | free (deterministic, no LLM, instant) |
| "Score it fully" | `evaluate(run_id)` | free input metrics + LLM-judged output metrics |
| "What can be scored?" | `available_metrics()` | free |

pip install ragradar-evaluate

Is this run healthy? — check()

Call check() before paying for an LLM judge. It is deterministic and free, so it also fits in CI.

import ragradar_capture
from ragradar_evaluate import check

run_id = ragradar_capture.capture(
    "what is RRF?", "RRF fuses rankings.",
    chunks=[{"chunk_id": "c1", "source_doc_id": "d1",
             "content": "RRF combines rankings.", "token_count": 10,
             "rerank_score": 0.9}],
)
result = check(run_id)

print(result.verdict)      # "ok" | "warn" | "fail"
print(result.problems)     # ["duplicate chunks: ratio 0.50 exceeds 0.20", ...]
print(result.risk_score)   # 0.0-1.0, None if input quality couldn't be assessed
print(result.factors)      # per-factor {value, threshold, status}
print(result.thresholds)   # "learned" | "policy" — which standards were applied
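A minimal sketch of a CI gate built on the verdict printed above. The helper name `exit_code_for` and the `fail_on_warn` switch are hypothetical and not part of ragradar-evaluate; the stand-in object only mimics the `verdict` and `problems` attributes shown in the example.

```python
from types import SimpleNamespace


def exit_code_for(result, fail_on_warn=False):
    """Map a check()-style result to a process exit code (hypothetical helper)."""
    if result.verdict == "fail":
        return 1
    if result.verdict == "warn" and fail_on_warn:
        return 1
    return 0


# Stand-in with the same attribute names as the result printed above.
demo = SimpleNamespace(
    verdict="warn",
    problems=["duplicate chunks: ratio 0.50 exceeds 0.20"],
)

for problem in demo.problems:
    print(f"[{demo.verdict}] {problem}")
print("exit code:", exit_code_for(demo))
```

In a CI job, the same function could wrap a real call, for example `sys.exit(exit_code_for(check(run_id)))`, assuming the result exposes `verdict` as shown.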

check() compares every free input metric against the current standards. Once at least 10 evaluated runs exist for the pipeline, it uses thresholds learned from your own history and reports this via thresholds == "learned"; before that, it falls back to the policy defaults. A run captured without chunks gets a warn verdict explaining the missing data, never an exception.
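The fallback rule described above can be sketched as follows. This is an illustration of the stated behavior only, not the library's code: `Standards`, `pick_standards`, and the threshold dictionaries are hypothetical names, and only the 10-run cutoff and the "learned" / "policy" labels come from the text.

```python
from dataclasses import dataclass

# From the text: learned thresholds apply once 10 evaluated runs exist.
MIN_EVALUATED_RUNS = 10


@dataclass
class Standards:
    source: str       # "learned" or "policy"
    thresholds: dict  # threshold name -> limit


def pick_standards(evaluated_runs, learned, policy):
    """Use learned thresholds when enough history exists, else policy defaults."""
    # An empty learned dict also falls back to policy (an assumption of this sketch).
    if evaluated_runs >= MIN_EVALUATED_RUNS and learned:
        return Standards("learned", learned)
    return Standards("policy", policy)


policy = {"max_duplicate_ratio": 0.2}   # illustrative values
learned = {"max_duplicate_ratio": 0.12}

print(pick_standards(3, learned, policy).source)
print(pick_standards(10, learned, policy).source)
```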

Score it fully — evaluate()

import ragradar_capture
from ragradar_evaluate import evaluate

run_id = ragradar_capture.capture(
    "what is RRF?", "RRF fuses rankings.",
    chunks=[{"chunk_id": "c1", "source_doc_id": "d1",
             "content": "RRF combines rankings.", "token_count": 10,
             "rerank_score": 0.9}],
)

# Complete eval: every metric applicable to the record.
result = evaluate(run_id)

# One atomic metric — nothing else is computed:
result = evaluate(run_id, metrics=["duplicates"], save=False)

# A chosen subset:
result = evaluate(run_id, metrics=["relevance", "faithfulness"])

The first argument, target, can be an sNrN string such as s2r3 (what ragradar_capture.capture() returns), a committed Capture object, or a bare RunRecord. For a bare RunRecord, pass save=False, because there is no run row to write to.
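As a sketch of what an sNrN string looks like in practice, the snippet below validates and splits one. Reading the two numbers as a session index and a run index is an assumption inferred from the CLI's `--session s2` and `s2r3` examples; `RunRef` and `parse_run_id` are hypothetical helpers, not library functions.

```python
import re
from typing import NamedTuple

# Assumed shape: "s" + session number + "r" + run number, e.g. "s2r3".
RUN_ID = re.compile(r"s(\d+)r(\d+)")


class RunRef(NamedTuple):
    session: int
    run: int


def parse_run_id(text):
    """Split an sNrN string into its two numeric parts, or raise ValueError."""
    match = RUN_ID.fullmatch(text)
    if match is None:
        raise ValueError(f"not an sNrN run id: {text!r}")
    return RunRef(int(match.group(1)), int(match.group(2)))


print(parse_run_id("s2r3"))
```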

EvalResult

| Field | Meaning |
| --- | --- |
| `metrics` | per-metric results: a dict of values for input families, a float for RAGAS metrics |
| `skipped` | metric → reason: `"not requested"`, `"missing data: ..."`, or `"requires ground_truth"` |
| `errors` | metric → error string; RAGAS-not-installed and RAGAS runtime failures land here identically, and `evaluate()` never raises for judge failures |
| `policy_violations` | policy thresholds breached by the computed values |
| `risk_score` | `None` when input metrics weren't computed; `0.0` only ever means "computed, no risk" |
| `run_id` / `saved` | identity and whether scores were persisted |

save=True (default) persists the scores through the single store path; ragradar explain <run_id> then shows them alongside its analysis.
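Because `risk_score` distinguishes `None` from `0.0`, callers should test for `None` explicitly rather than rely on truthiness. The sketch below uses a stand-in dataclass, `EvalResultSketch`, that copies a few field names from the table above; it is not the library's class, and `describe_risk` and `failed_metrics` are hypothetical helpers.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EvalResultSketch:
    """Stand-in with field names taken from the EvalResult table."""
    metrics: dict = field(default_factory=dict)
    skipped: dict = field(default_factory=dict)
    errors: dict = field(default_factory=dict)
    risk_score: Optional[float] = None


def describe_risk(result):
    # `if not result.risk_score` would wrongly treat 0.0 as "not computed".
    if result.risk_score is None:
        return "risk: not computed"
    return f"risk: {result.risk_score:.2f}"


def failed_metrics(result):
    """Names of metrics whose judge call failed, in sorted order."""
    return sorted(result.errors)


clean = EvalResultSketch(metrics={"duplicates": {"ratio": 0.0}}, risk_score=0.0)
partial = EvalResultSketch(errors={"faithfulness": "judge unavailable"})

print(describe_risk(clean))
print(describe_risk(partial), failed_metrics(partial))
```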

available_metrics()

| Metric | Layer | Cost | Requires |
| --- | --- | --- | --- |
| `relevance` | input | free | chunks |
| `duplicates` | input | free | chunks |
| `truncation` | input | free | chunks |
| `token_efficiency` | input | free | chunks |
| `coherence` | input | free | chunks |
| `faithfulness` | output | llm | chunks, response |
| `answer_relevancy` | output | llm | chunks, response |
| `context_precision` | output | llm | chunks, response |
| `context_recall` | output | llm | chunks, response, ground_truth |

Output metrics are RAGAS LLM-as-judge calls: they cost money and need a configured judge. To stay free-only, use check(), or select the input metrics explicitly: evaluate(run_id, metrics=["relevance", "duplicates", "truncation", "token_efficiency", "coherence"]).
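Rather than hard-coding the free list, it can be derived from the catalog. The sketch below copies the table above into a list of dicts; that shape is an assumption made for illustration, since the return type of `available_metrics()` is not documented here, and `free_metrics` and `runnable` are hypothetical helpers.

```python
# Catalog transcribed from the table above; the list-of-dicts shape is assumed.
CATALOG = [
    {"metric": "relevance", "layer": "input", "cost": "free", "requires": ["chunks"]},
    {"metric": "duplicates", "layer": "input", "cost": "free", "requires": ["chunks"]},
    {"metric": "truncation", "layer": "input", "cost": "free", "requires": ["chunks"]},
    {"metric": "token_efficiency", "layer": "input", "cost": "free", "requires": ["chunks"]},
    {"metric": "coherence", "layer": "input", "cost": "free", "requires": ["chunks"]},
    {"metric": "faithfulness", "layer": "output", "cost": "llm", "requires": ["chunks", "response"]},
    {"metric": "answer_relevancy", "layer": "output", "cost": "llm", "requires": ["chunks", "response"]},
    {"metric": "context_precision", "layer": "output", "cost": "llm", "requires": ["chunks", "response"]},
    {"metric": "context_recall", "layer": "output", "cost": "llm",
     "requires": ["chunks", "response", "ground_truth"]},
]


def free_metrics(catalog):
    """Metric names that never call an LLM judge."""
    return [row["metric"] for row in catalog if row["cost"] == "free"]


def runnable(catalog, available_fields):
    """Metric names whose required fields are all present on the record."""
    have = set(available_fields)
    return [row["metric"] for row in catalog if set(row["requires"]) <= have]


print(free_metrics(CATALOG))
print(runnable(CATALOG, ["chunks", "response"]))
```

The first list could be passed as `metrics=` to `evaluate()` to stay on the free tier; the second shows why `context_recall` is skipped when no ground truth was captured.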

Policy system

Human-set thresholds encoding known failure modes; active from day one and the fallback standard for check(). Stored per pipeline.

ragradar-evaluate policy show
ragradar-evaluate policy set max_duplicate_ratio 0.1
ragradar-evaluate policy reset

Programmatic override: evaluate(run_id, policy=InputQualityPolicy(...)).
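To illustrate how a policy threshold turns into a violation message, here is a small sketch. `PolicySketch` is a stand-in, not `InputQualityPolicy`; only the `max_duplicate_ratio` name comes from the CLI example above, and the 0.2 default and the `duplicate_ratio` key are assumptions chosen to match the sample message shown for check().

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicySketch:
    """Stand-in policy; the default value is illustrative, not the library's."""
    max_duplicate_ratio: float = 0.2


def violations(policy, values):
    """Return human-readable messages for each breached threshold."""
    found = []
    ratio = values.get("duplicate_ratio")  # hypothetical key name
    if ratio is not None and ratio > policy.max_duplicate_ratio:
        found.append(
            f"duplicate chunks: ratio {ratio:.2f} "
            f"exceeds {policy.max_duplicate_ratio:.2f}"
        )
    return found


print(violations(PolicySketch(), {"duplicate_ratio": 0.5}))
print(violations(PolicySketch(max_duplicate_ratio=0.1), {"duplicate_ratio": 0.05}))
```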

Benchmark lifecycle (CLI)

Learned thresholds accumulate as you evaluate real runs; check() picks them up automatically once 10 or more evaluated runs exist. The CLI exposes the machinery for inspection:

ragradar-evaluate run s2r3                 # evaluate one run (both layers)
ragradar-evaluate run s2r3 --input-only    # free metrics only
ragradar-evaluate run --session s2         # evaluate a whole session
ragradar-evaluate benchmark show           # current learned thresholds
ragradar-evaluate benchmark build          # rebuild from evaluated history
ragradar-evaluate benchmark check s2r3     # factor-by-factor threshold check
ragradar-evaluate benchmark export         # RAGAS-compatible JSONL dataset

Passing --input-only and --output-only together is an error, since it would compute nothing.
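One common way to enforce this kind of flag conflict is an argparse mutually exclusive group. The sketch below is a generic illustration and makes no claim about how the ragradar-evaluate CLI is actually built; the program name `evaluate-sketch` and the `layers` helper are hypothetical.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="evaluate-sketch")
    parser.add_argument("run_id")
    # argparse rejects the command line if both flags in the group are given.
    layer = parser.add_mutually_exclusive_group()
    layer.add_argument("--input-only", action="store_true")
    layer.add_argument("--output-only", action="store_true")
    return parser


def layers(argv):
    """Return which metric layers a given command line would compute."""
    args = build_parser().parse_args(argv)
    if args.input_only:
        return ["input"]
    if args.output_only:
        return ["output"]
    return ["input", "output"]


print(layers(["s2r3"]))
print(layers(["s2r3", "--input-only"]))
```

With both flags present, `parse_args` prints a usage error and exits with a non-zero status instead of returning.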


Version 0.1.0, released Jul 4, 2026

