Evaluation layer for ragradar observability system

ragradar-evaluate

Scores captured runs. Two tasks, one discovery helper:

| Task | Call | Cost |
|---|---|---|
| "Is this run healthy?" | check(run_id) | free: deterministic, no LLM, instant |
| "Score it fully" | evaluate(run_id) | free input metrics + LLM-judged output metrics |
| "What can be scored?" | available_metrics() | free |

pip install ragradar-evaluate

Is this run healthy? — check()

Call it before paying for an LLM judge, and put it in CI.

import ragradar_capture
from ragradar_evaluate import check

run_id = ragradar_capture.capture(
    "what is RRF?", "RRF fuses rankings.",
    chunks=[{"chunk_id": "c1", "source_doc_id": "d1",
             "content": "RRF combines rankings.", "token_count": 10,
             "rerank_score": 0.9}],
)
result = check(run_id)

print(result.verdict)      # "ok" | "warn" | "fail"
print(result.problems)     # ["duplicate chunks: ratio 0.50 exceeds 0.20", ...]
print(result.risk_score)   # 0.0-1.0, None if input quality couldn't be assessed
print(result.factors)      # per-factor {value, threshold, status}
print(result.thresholds)   # "learned" | "policy" — which standards were applied
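
For the CI use case, one way to turn the verdict into a process exit code is sketched below. The exit_code_for helper and the choice to let "warn" pass by default are assumptions made for illustration, not part of ragradar-evaluate; only the three verdict strings come from the example above.

```python
# Hypothetical CI gate: maps a check() verdict to a process exit code.
# exit_code_for and fail_on_warn are illustrative names, not library API.

def exit_code_for(verdict: str, fail_on_warn: bool = False) -> int:
    """Return 0 if the CI job should pass, 1 if it should fail."""
    if verdict == "ok":
        return 0
    if verdict == "warn":
        return 1 if fail_on_warn else 0
    # "fail" and any unrecognized verdict both fail the job,
    # so CI never passes by accident.
    return 1


# In a CI script this would wrap the real call, for example:
#   result = check(run_id)
#   sys.exit(exit_code_for(result.verdict))
```

Treating unknown verdicts as failures is a conservative default; a stricter pipeline could also set fail_on_warn=True.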

check() compares all free input metrics against the current standards. Once at least 10 evaluated runs exist for the pipeline, it uses thresholds learned from your own history and reports this via result.thresholds == "learned"; before that, it falls back to the policy defaults. A run captured without chunks gets a warn verdict explaining the missing data, never an exception.
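
The fallback rule described above can be sketched as follows. This is a minimal illustration, not the library's implementation: select_thresholds, MIN_EVALUATED_RUNS, and the dict shapes are assumed names, and only the cutoff of 10 evaluated runs and the "learned" / "policy" labels come from the text.

```python
from typing import Dict, Optional, Tuple

MIN_EVALUATED_RUNS = 10  # cutoff stated in the text above


def select_thresholds(
    evaluated_runs: int,
    learned: Optional[Dict[str, float]],
    policy: Dict[str, float],
) -> Tuple[str, Dict[str, float]]:
    """Pick learned thresholds once enough history exists, else policy defaults.

    Hypothetical helper: the real check() may decide this differently.
    """
    if evaluated_runs >= MIN_EVALUATED_RUNS and learned:
        return "learned", learned
    return "policy", policy
```

The returned label plays the role of result.thresholds, so a caller can tell which standard produced a verdict.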

Score it fully — evaluate()

import ragradar_capture
from ragradar_evaluate import evaluate

run_id = ragradar_capture.capture(
    "what is RRF?", "RRF fuses rankings.",
    chunks=[{"chunk_id": "c1", "source_doc_id": "d1",
             "content": "RRF combines rankings.", "token_count": 10,
             "rerank_score": 0.9}],
)

# Complete eval: every metric applicable to the record.
result = evaluate(run_id)

# One atomic metric — nothing else is computed:
result = evaluate(run_id, metrics=["duplicates"], save=False)

# A chosen subset:
result = evaluate(run_id, metrics=["relevance", "faithfulness"])

The first argument, target, can be an sNrN string (what ragradar_capture.capture() returns), a committed Capture object, or a bare RunRecord. For a bare RunRecord, pass save=False, because there is no run row to write to.
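
The sNrN form appears to pair a session number with a run number, as in the s2r3 and --session s2 used in the CLI examples. A small validator for that shape might look like the sketch below. parse_run_id is a hypothetical helper, and the assumption that both parts are plain decimal integers is inferred from the examples, not documented.

```python
import re
from typing import Tuple

# Assumed shape: "s" + digits + "r" + digits, e.g. "s2r3".
_RUN_ID = re.compile(r"s(\d+)r(\d+)")


def parse_run_id(run_id: str) -> Tuple[int, int]:
    """Split an sNrN string into (session_number, run_number)."""
    match = _RUN_ID.fullmatch(run_id)
    if match is None:
        raise ValueError(f"not an sNrN run id: {run_id!r}")
    return int(match.group(1)), int(match.group(2))
```

Validating the string up front gives a clear error before anything is passed to evaluate().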

EvalResult

| Field | Meaning |
|---|---|
| metrics | per-metric results: a dict of values for input families, a float for RAGAS metrics |
| skipped | metric → reason: "not requested", "missing data: ...", or "requires ground_truth" |
| errors | metric → error string; RAGAS-not-installed and RAGAS runtime failures land here identically, and evaluate() never raises for judge failures |
| policy_violations | policy thresholds breached by the computed values |
| risk_score | None when input metrics weren't computed; 0.0 only ever means "computed, no risk" |
| run_id / saved | identity and whether scores were persisted |

save=True (default) persists via the one store path; ragradar explain <run_id> then shows the scores alongside its analysis.
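
Because judge failures are reported in errors and not raised, callers need to inspect the result explicitly. The sketch below uses a stand-in dataclass that mirrors only the field names documented above. FakeEvalResult and summarize are hypothetical names, and the container types are assumptions. It also shows why risk_score should be compared against None and not tested for truthiness.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class FakeEvalResult:
    """Stand-in with the documented field names; not the real EvalResult."""
    metrics: Dict[str, Any] = field(default_factory=dict)
    skipped: Dict[str, str] = field(default_factory=dict)
    errors: Dict[str, str] = field(default_factory=dict)
    policy_violations: List[str] = field(default_factory=list)  # assumed type
    risk_score: Optional[float] = None


def summarize(result) -> str:
    """One-line summary of what was scored, skipped, and errored."""
    # 0.0 is a real score ("computed, no risk"), so compare against None;
    # `if not result.risk_score` would wrongly treat 0.0 as missing.
    if result.risk_score is None:
        risk = "risk: not computed"
    else:
        risk = f"risk: {result.risk_score:.2f}"
    return (
        f"{len(result.metrics)} scored, {len(result.skipped)} skipped, "
        f"{len(result.errors)} errored, {risk}"
    )
```

The same function would accept a real result object, provided it exposes attributes with these names.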

available_metrics()

| Metric | Layer | Cost | Requires |
|---|---|---|---|
| relevance | input | free | chunks |
| duplicates | input | free | chunks |
| truncation | input | free | chunks |
| token_efficiency | input | free | chunks |
| coherence | input | free | chunks |
| faithfulness | output | llm | chunks, response |
| answer_relevancy | output | llm | chunks, response |
| context_precision | output | llm | chunks, response |
| context_recall | output | llm | chunks, response, ground_truth |

Output metrics are RAGAS LLM-as-judge calls — they cost money and need a configured judge. To stay free-only, use check(), or select input metrics explicitly: evaluate(run_id, metrics=["relevance", "duplicates", "truncation", "token_efficiency", "coherence"]).
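
The explicit list of input metrics could also be derived instead of typed out. The sketch below hard-codes rows copied from the table above, because the return shape of available_metrics() is not documented here; METRIC_TABLE and free_metrics are illustrative names.

```python
# Rows copied from the metric table. The real available_metrics() may
# return a different structure, so this list is a stand-in.
METRIC_TABLE = [
    {"metric": "relevance", "layer": "input", "cost": "free"},
    {"metric": "duplicates", "layer": "input", "cost": "free"},
    {"metric": "truncation", "layer": "input", "cost": "free"},
    {"metric": "token_efficiency", "layer": "input", "cost": "free"},
    {"metric": "coherence", "layer": "input", "cost": "free"},
    {"metric": "faithfulness", "layer": "output", "cost": "llm"},
    {"metric": "answer_relevancy", "layer": "output", "cost": "llm"},
    {"metric": "context_precision", "layer": "output", "cost": "llm"},
    {"metric": "context_recall", "layer": "output", "cost": "llm"},
]


def free_metrics(rows):
    """Return the names of metrics that need no LLM judge, in table order."""
    return [row["metric"] for row in rows if row["cost"] == "free"]


# The result could then be passed on, for example:
#   evaluate(run_id, metrics=free_metrics(METRIC_TABLE))
```

Filtering on the cost column keeps the selection correct if more free metrics are added later.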

Policy system

Policies are human-set thresholds that encode known failure modes. They are active from day one, serve as the fallback standard for check(), and are stored per pipeline.

ragradar-evaluate policy show
ragradar-evaluate policy set max_duplicate_ratio 0.1
ragradar-evaluate policy reset

Programmatic override: evaluate(run_id, policy=InputQualityPolicy(...)).
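
A policy threshold check reduces to comparing a computed value against a limit. The sketch below is a stand-in, not the library's InputQualityPolicy: SketchPolicy, duplicate_problems, and the 0.20 default are assumptions (the default is borrowed from the sample message in the check() example), and only the max_duplicate_ratio name comes from the CLI commands above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SketchPolicy:
    """Hypothetical stand-in holding a single threshold."""
    max_duplicate_ratio: float = 0.20  # assumed default, for illustration


def duplicate_problems(duplicate_ratio: float, policy: SketchPolicy) -> List[str]:
    """Return a problem message when the ratio exceeds the policy limit."""
    if duplicate_ratio > policy.max_duplicate_ratio:
        return [
            f"duplicate chunks: ratio {duplicate_ratio:.2f} "
            f"exceeds {policy.max_duplicate_ratio:.2f}"
        ]
    return []
```

Tightening the limit, as `policy set max_duplicate_ratio 0.1` does on the CLI, corresponds here to constructing SketchPolicy(max_duplicate_ratio=0.1).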

Benchmark lifecycle (CLI)

Learned thresholds accumulate as you evaluate real runs — check() picks them up automatically at 10+ evaluated runs. The CLI exposes the machinery for inspection:

ragradar-evaluate run s2r3                 # evaluate one run (both layers)
ragradar-evaluate run s2r3 --input-only    # free metrics only
ragradar-evaluate run --session s2         # evaluate a whole session
ragradar-evaluate benchmark show           # current learned thresholds
ragradar-evaluate benchmark build          # rebuild from evaluated history
ragradar-evaluate benchmark check s2r3     # factor-by-factor threshold check
ragradar-evaluate benchmark export         # RAGAS-compatible JSONL dataset

Passing --input-only and --output-only together is an error (it would compute nothing).
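
For readers building a similar CLI, this kind of flag conflict is commonly modeled with an argparse mutually exclusive group. The sketch below is an assumption about how such a rule could be expressed, not the actual ragradar-evaluate parser; build_parser and the program name are illustrative.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser that rejects --input-only with --output-only."""
    parser = argparse.ArgumentParser(prog="evaluate-sketch")
    parser.add_argument("run_id", nargs="?", help="run id such as s2r3")
    layers = parser.add_mutually_exclusive_group()
    layers.add_argument("--input-only", action="store_true",
                        help="compute free input metrics only")
    layers.add_argument("--output-only", action="store_true",
                        help="compute LLM-judged output metrics only")
    return parser
```

With this setup argparse itself reports the conflict and exits with status 2, so no extra validation code is needed.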
