Deterministic LLM-output quality scoring in milliseconds. No AI judge in the loop.

Maintainers

Sundeyp

Project description

VRTY

The deterministic, zero-dependency LLM evaluator. Sub-millisecond p99 for responses up to about 500 tokens (see Benchmarks), no API key, byte-identical across runs.

A stdlib alternative to ROUGE for no-reference scoring, and a sanity layer in front of GPT-as-judge when reproducibility matters.

VRTY scores a (prompt, response) pair on four standard, auditable dimensions and returns a single composite plus a per-dimension breakdown. Every formula is a textbook formula you can verify against a reference in five minutes. There is no LLM call anywhere in the scoring path.

What VRTY does not do. VRTY measures surface text properties — vocabulary overlap, sentence flow, term coverage, information density. It does not check whether the answer is true. A confident wrong answer that echoes the prompt's vocabulary will score higher than a correct one-word answer (see Known properties and limitations: "London is the capital of France." scores 0.879; "Paris." scores 0.350). Use VRTY to catch malformed, off-topic, or padded output; pair it with a fact-check or human review when correctness matters.

from vrty import score
result = score("What is the capital of France?", "Paris is the capital of France.")
print(result.composite)               # 0.8653358523094898
print(result.explanations["relevance"])  # Relevance: 0.83 - response strongly overlaps with the prompt's key terms.

That is the entire 60-second example. Four lines, runs as-is, returns a score. No configuration, no API key.

About that 0.865. That number is what factoid prompts look like — short prompt, short answer, heavy vocabulary overlap. Open-ended prompts (customer support, instruction-following, prose drafts) typically score 0.20 – 0.40 because the response is expected not to echo prompt vocabulary. VRTY is calibrated relative to a fixed prompt, not as an absolute quality threshold. See Calibration bands below before setting CI gates.

Install

pip install vrty

Or from source:

git clone https://github.com/sundeyp/vrty
cd vrty
pip install -e .

Determinism is guaranteed only on the pinned interpreter (Python 3.11.9) with the bundled, hash-verified IDF data. The scoring path has zero third-party runtime dependencies; everything is Python stdlib. See Determinism below.

The four dimensions

Dimension	Formula	What it measures
Relevance	TF·IDF weighted cosine similarity between prompt and response	How much the response's content overlaps the prompt's content
Coherence	Mean cosine similarity of adjacent-sentence TF·IDF vectors	How much each sentence shares with the next (topical flow)
Completeness	IDF-weighted fraction of prompt content terms that appear in the response	How many of the prompt's key terms are addressed
Conciseness	|unique content tokens| / |total tokens| (type–token ratio)	Information density of the response
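
As a rough illustration of the relevance and completeness rows above, the sketch below computes a TF·IDF cosine and an IDF-weighted coverage fraction over a toy IDF table. The tokenizer, the `TOY_IDF` values, `DEFAULT_IDF`, and the function names are assumptions made for this example, and it does not filter stopwords; it is not VRTY's internal code and its numbers will not match VRTY's output.

```python
import math
import re
from collections import Counter

# Hypothetical IDF table for illustration only; VRTY ships its own data file.
TOY_IDF = {"capital": 2.0, "france": 3.0, "paris": 3.5, "banana": 4.0}
DEFAULT_IDF = 1.0  # assumed weight for tokens missing from TOY_IDF


def tokenize(text):
    """Lowercase and split on non-alphanumerics (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf(text):
    """Map each token to term frequency times IDF."""
    counts = Counter(tokenize(text))
    return {tok: n * TOY_IDF.get(tok, DEFAULT_IDF) for tok, n in counts.items()}


def cosine(a, b):
    """Cosine similarity of two sparse vectors, reducing in sorted key order."""
    dot = sum(a[k] * b[k] for k in sorted(a.keys() & b.keys()))
    norm_a = math.sqrt(sum(a[k] ** 2 for k in sorted(a)))
    norm_b = math.sqrt(sum(b[k] ** 2 for k in sorted(b)))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return min(1.0, dot / (norm_a * norm_b))


def completeness(prompt, response):
    """IDF-weighted fraction of prompt terms that also occur in the response."""
    prompt_terms = sorted(set(tokenize(prompt)))
    response_terms = set(tokenize(response))
    total = sum(TOY_IDF.get(t, DEFAULT_IDF) for t in prompt_terms)
    if total == 0.0:
        return 0.0
    covered = sum(TOY_IDF.get(t, DEFAULT_IDF) for t in prompt_terms if t in response_terms)
    return covered / total


prompt = "What is the capital of France?"
print(cosine(tfidf(prompt), tfidf("Banana.")))  # 0.0: no shared tokens
print(completeness(prompt, "Banana."))          # 0.0: no prompt term covered
```

Both functions return 0.0 for a response that shares no token with the prompt, which is the behavior the terse answers in limitation 1 illustrate.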

Each dimension returns a value in [0.0, 1.0]. The composite is a fixed, version-locked weighted sum:

composite = 0.35 * relevance
          + 0.20 * coherence
          + 0.30 * completeness
          + 0.15 * conciseness

The weights are pinned constants, not configurable. Configurability is explicitly post-v1.0.
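
To make the weighted sum concrete, here is a minimal sketch that recombines the per-dimension values this README reports for the France example. The `composite` helper and the `WEIGHTS` name are illustrative, not part of the vrty API, and the summation order is an assumption, so the last few digits may differ from VRTY's own result.

```python
# Weights copied from the composite formula above.
WEIGHTS = {"relevance": 0.35, "coherence": 0.20, "completeness": 0.30, "conciseness": 0.15}


def composite(dimensions):
    """Weighted sum over the four dimension scores, in a fixed key order."""
    total = 0.0
    for name in sorted(WEIGHTS):
        total += WEIGHTS[name] * dimensions[name]
    return total


# Per-dimension values reported for "Paris is the capital of France."
example = {
    "relevance": 0.8295310065985426,
    "coherence": 1.0,
    "completeness": 1.0,
    "conciseness": 0.5,
}
print(round(composite(example), 3))  # 0.865
```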

What you get back

score() returns a frozen VrtyScore object with a 9-key to_dict():

{
  "composite":       0.8653358523094898,
  "relevance":       0.8295310065985426,
  "coherence":       1.0,
  "completeness":    1.0,
  "conciseness":     0.5,
  "explanations": {
    "relevance":    "Relevance: 0.83 - response strongly overlaps with the prompt's key terms.",
    "coherence":    "Coherence: 1.00 - adjacent sentences carry consistent topic.",
    "completeness": "Completeness: 1.00 - most of the prompt's key terms appear in the response.",
    "conciseness":  "Conciseness: 0.50 - response has moderate information density."
  },
  "vrty_version": "1.0.0",
  "idf_sha256":      "0e475bcaa5524d1e26cbb166bb5c138e37f87e1e47b75e6506c6460a94259fd2",
  "weights":         {"relevance": 0.35, "coherence": 0.20, "completeness": 0.30, "conciseness": 0.15}
}

vrty_version and idf_sha256 make every score reproducible — together they pin the scoring logic and the exact IDF data used.
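
One way to use these two fields is to refuse to compare scores that were produced by different builds. The helper below is a hypothetical sketch that works on plain dicts shaped like the to_dict() output above; `same_scoring_build` is not a vrty function.

```python
def same_scoring_build(a, b):
    """True when two score dicts share both the code version and the IDF data hash."""
    keys = ("vrty_version", "idf_sha256")
    return all(a.get(k) is not None and a.get(k) == b.get(k) for k in keys)


old = {"composite": 0.31, "vrty_version": "1.0.0", "idf_sha256": "0e47..."}
new = {"composite": 0.29, "vrty_version": "1.0.0", "idf_sha256": "0e47..."}
other = {"composite": 0.29, "vrty_version": "1.0.1", "idf_sha256": "0e47..."}

print(same_scoring_build(old, new))    # True: deltas are meaningful
print(same_scoring_build(old, other))  # False: versions differ
```

The hash strings here are shortened placeholders for readability, not real digests.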

CLI

vrty --prompt "What is the capital of France?" \
        --response "Paris is the capital of France."

Equivalent stdlib invocation:

python -m vrty --prompt "..." --response "..."

Accepts --prompt-file PATH / --response-file PATH for long inputs; /dev/stdin works as a file path. --pretty indents the JSON. Exit codes: 0 success, 1 I/O error, 2 argparse error.

Benchmarks

VRTY is not an embedding-based scorer; if you need semantic similarity that survives paraphrase, use BERTScore or MoverScore. VRTY is not n-gram precision against a reference; if you have reference answers, use BLEU or ROUGE. VRTY's niche is no-reference, no-model, deterministic scoring — the gap ROUGE leaves when you don't have a gold reference, and the gap GPT-as-judge leaves when you need reproducibility.

Reproducibility, cost, and latency vs ROUGE and LLM-as-judge. VRTY and ROUGE were measured on the same machine with the same 1000 synthetic (prompt, response) pairs per response-size bucket; reproduce via python tools/benchmark.py. LLM-as-judge cost and latency are intentionally not measured here — they depend on model choice and provider pricing, both of which drift; fill them in for your own model before relying on the comparison.

	VRTY	ROUGE (rouge-score 0.1.2)	LLM-as-judge
Reproducibility	Byte-identical across processes (pinned Python 3.11.9, asserted in CI on three subprocesses with adversarial `PYTHONHASHSEED` values)	Deterministic for a fixed tokenizer	Non-deterministic; varies with temperature, sampling, model version
Cost per score	$0 (no API call)	$0 (local)	$ per call × tokens; measure with your chosen model
Latency p99 — 100 tokens	0.16 ms	1.66 ms	typically 500–2000 ms (network + inference)
Latency p99 — 500 tokens	0.52 ms	6.66 ms	typically 500–2000 ms
Latency p99 — 2000 tokens	2.94 ms	25.96 ms	typically 1000–5000 ms
Network required	No	No	Yes
Reference hardware	AMD Ryzen 7 8745HS (8 cores / 16 threads), 27 GiB RAM, Ubuntu 24.04, Python 3.11.9	(same)	(varies by provider)

Latency claim (v1.0): < 3 ms p99 for responses under 2000 tokens on AMD Ryzen 7 8745HS. Reproduce: python tools/benchmark.py from a clean venv with vrty and rouge-score==0.1.2 installed.

VRTY is roughly 9–13× faster than ROUGE across the input sizes in this table (about 10.4× at 100 tokens, 12.8× at 500, and 8.8× at 2000) because the scoring path is pure stdlib with no regex-based stemmer and no sentence-pair grid construction.
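
For readers who want to take their own p99 numbers, the following is a minimal timing sketch. It is not the contents of tools/benchmark.py; `p99`, `time_calls_ms`, and `stand_in_scorer` are hypothetical names, and the nearest-rank percentile definition is an assumption about how p99 is computed.

```python
import time


def p99(samples):
    """Nearest-rank 99th percentile of a non-empty sequence of numbers."""
    ordered = sorted(samples)
    rank = -(-99 * len(ordered) // 100)  # ceil(0.99 * n) in integer arithmetic
    return ordered[max(rank, 1) - 1]


def time_calls_ms(fn, inputs):
    """Wall-clock latency of fn(*args) for each args tuple, in milliseconds."""
    latencies = []
    for args in inputs:
        start = time.perf_counter()
        fn(*args)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def stand_in_scorer(prompt, response):
    """Placeholder workload; swap in the scorer you want to measure."""
    return len(set(prompt.split()) & set(response.split()))


pairs = [("what is the capital of france", "paris is the capital of france")] * 1000
print(f"p99 = {p99(time_calls_ms(stand_in_scorer, pairs)):.4f} ms")
```

Replacing `stand_in_scorer` with a real scoring call, and `pairs` with inputs of the size you care about, gives a number comparable in spirit to the table rows, though not on the same hardware.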

Calibration bands

Expected composite ranges by prompt type, observed across realistic input. Use these to set CI gates and user-facing displays — do not assume a single threshold works across prompt types.

Prompt type	Typical composite	Use the score as
Factoid Q&A where the answer echoes prompt vocabulary (`"capital of France?"` → `"Paris is the capital of France."`)	0.70 – 0.90	Absolute threshold viable
Customer-support / instruction-following	0.20 – 0.40	Relative delta from a baseline answer on the same prompt
Open-ended prose (email drafts, summaries)	0.15 – 0.35	Relative delta only
Repetition / padding spam with OOV technical terms	can score 0.60+	Catch by pairing with a length / repetition sanity check

Practical rule. Compute a baseline composite on a known-good response to your prompt, then gate on score >= baseline * k for some k ∈ [0.7, 0.9]. Do not gate on composite > 0.8 as an absolute threshold: it will reject obviously fine open-ended responses, which typically score well below 0.8.
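
The relative gate can be sketched in a few lines. `passes_gate` is a hypothetical helper, and the baseline and candidate values below are made-up numbers in the open-ended range from the table, not measured scores.

```python
def passes_gate(score, baseline, k=0.8):
    """Relative gate: pass when score >= baseline * k, with k in [0.7, 0.9]."""
    if not 0.7 <= k <= 0.9:
        raise ValueError("k is expected to lie in [0.7, 0.9]")
    return score >= baseline * k


baseline = 0.30  # composite of a known-good answer to the same prompt (example value)

print(passes_gate(0.26, baseline))  # True: within 80% of the baseline
print(passes_gate(0.20, baseline))  # False: fell too far below the baseline
print(0.26 > 0.8)                   # False: an absolute gate would reject the good answer
```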

Determinism

Identical input returns byte-identical output. This guarantee holds under the following conditions, all of which are documented and enforced:

Pinned interpreter: Python 3.11.9 (CPython, official build or python-build-standalone). The CI matrix runs on this version. Other 3.x versions are likely to produce identical output but are not asserted.
Pinned IDF data: vrty/data/idf.json.gz ships with the package and is SHA-256-verified at import. A modified data file fails fast with VrtyDataError before any score is computed.
Zero third-party runtime dependencies: the scoring path uses only CPython stdlib (re, math, collections, json, gzip, hashlib, importlib.resources, unicodedata). No numpy, no scikit-learn, no BLAS-backed FP variance.
Sort-before-reduction: every set and dict is sorted before any floating-point accumulation, so dict-iteration order under PYTHONHASHSEED randomization cannot change the result.
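
The sort-before-reduction condition matters because floating-point addition is not associative. The sketch below is a generic illustration of the idea, not VRTY's code; `naive_sum` and `stable_sum` are hypothetical names.

```python
def naive_sum(weights):
    """Accumulate values in whatever order the dict happens to iterate."""
    total = 0.0
    for key in weights:
        total += weights[key]
    return total


def stable_sum(weights):
    """Accumulate values in sorted-key order so the result is order-independent."""
    total = 0.0
    for key in sorted(weights):
        total += weights[key]
    return total


a = {"x": 0.1, "y": 0.2, "z": 0.3}
b = {"z": 0.3, "y": 0.2, "x": 0.1}  # same items, different iteration order

print(naive_sum(a) == naive_sum(b))    # False: 0.6000000000000001 vs 0.6
print(stable_sum(a) == stable_sum(b))  # True
```

Dicts keep insertion order, but sets of strings iterate in an order that depends on PYTHONHASHSEED, which is why sorting before accumulation is needed for byte-identical results.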

The test suite asserts byte-identity on json.dumps(result.to_dict(), sort_keys=True) across three fresh OS subprocesses with PYTHONHASHSEED set to 0, 12345, and the CPython default (random).
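
A byte-identity check of this kind can be sketched with the standard library alone. The child program below is a stand-in that reduces a small set in sorted order; it does not import vrty, and `CHILD` and `run_child` are hypothetical names for this example.

```python
import os
import subprocess
import sys

# Stand-in child program: reduces a set in sorted order and prints JSON.
CHILD = r'''
import json
words = {"paris", "capital", "france", "london", "banana"}
weights = {w: len(w) / 7.0 for w in words}
total = 0.0
for w in sorted(weights):
    total += weights[w]
print(json.dumps({"total": total, "order": sorted(words)}, sort_keys=True))
'''


def run_child(hash_seed):
    """Run CHILD in a fresh interpreter; hash_seed=None leaves the default (random)."""
    env = dict(os.environ)
    env.pop("PYTHONHASHSEED", None)
    if hash_seed is not None:
        env["PYTHONHASHSEED"] = str(hash_seed)
    result = subprocess.run(
        [sys.executable, "-c", CHILD], env=env, capture_output=True, check=True
    )
    return result.stdout


outputs = [run_child(seed) for seed in (0, 12345, None)]
assert len(set(outputs)) == 1, "output bytes differ across hash seeds"
```

Comparing raw stdout bytes, rather than parsed values, is what makes the assertion a byte-identity check.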

Self-host

A one-command Docker self-host is shipped alongside the library. See the Dockerfile for the pinned image and the GitHub Actions snippet for CI/CD integration.

docker build -t vrty:1.0.0 .
docker run --rm vrty:1.0.0 \
  --prompt "What is the capital of France?" \
  --response "Paris is the capital of France."

Known properties and limitations

Read this section before integrating VRTY into anything load-bearing. It lists seven limitations of the v1.0 design.

1. VRTY scores surface properties, not factual correctness

The four dimensions measure term overlap, sentence flow, key-term coverage, and information density. They do not verify that the response is factually true. A correct answer that does not echo prompt vocabulary scores low on relevance and completeness; a confident wrong answer that echoes prompt vocabulary scores high.

Worked example, prompt = "What is the capital of France?":

Response	Correct?	Composite	Relevance	Completeness	Conciseness
`"Paris is the capital of France."`	yes	0.865	0.830	1.000	0.500
`"London is the capital of France."`	no	0.879	0.867	1.000	0.500
`"Paris."`	yes	0.350	0.000	0.000	1.000
`"London."`	no	0.350	0.000	0.000	1.000
`"Banana."`	no	0.350	0.000	0.000	1.000

The verbose incorrect answer scores higher than the verbose correct one (slight IDF asymmetry between "london" and "paris" in the bundled corpus); the three terse responses — one correct, two wrong — receive identical 0.350 scores. VRTY cannot distinguish them; an external fact-check must. Use VRTY to detect malformed, off-topic, or padded outputs; use a separate fact-check or human review to verify truth.
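
The identical 0.350 for the three terse responses follows directly from the composite weights. Relevance and completeness are 0.000 (from the table), conciseness is 1.000, and coherence is 1.000 because a one-sentence response takes the single-sentence convention described in limitation 3:

```latex
\begin{aligned}
\text{composite} &= 0.35 \cdot 0.000 + 0.20 \cdot 1.000 + 0.30 \cdot 0.000 + 0.15 \cdot 1.000 \\
                 &= 0.200 + 0.150 \\
                 &= 0.350
\end{aligned}
```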

2. Conciseness and completeness intentionally pull against each other

A response that covers every prompt term tends to be longer (lower conciseness); a terse response tends to omit prompt terms (lower completeness). This tension is correct behavior, not a bug. Always read the per-dimension breakdown — a single composite hides the trade-off.

3. Single-sentence coherence returns 1.0 by deliberate choice

When the response is one sentence (or zero — see the empty-response wrapper), there is no adjacent-sentence pair that can disagree, so coherence is set to 1.0. This is a deliberate v1.0 convention: penalizing short responses on coherence would double-count what completeness already measures via prompt-term coverage.
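
The convention can be sketched as follows. This is an illustration under stated assumptions: sentences are split on `.`, `!`, and `?`, vectors are raw term counts rather than TF·IDF, and the empty-response case (handled separately under the input contract) is not modeled. The function names are hypothetical.

```python
import math
import re
from collections import Counter


def sentence_vectors(text):
    """Split on ., !, ? and count lowercase word tokens per sentence (assumed rules)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [Counter(re.findall(r"[a-z0-9]+", s.lower())) for s in sentences]


def cosine(a, b):
    """Cosine similarity of two term-count vectors, clamped to at most 1.0."""
    dot = sum(a[k] * b[k] for k in sorted(a.keys() & b.keys()))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    if norm == 0.0:
        return 0.0
    return min(1.0, dot / norm)


def coherence(text):
    """Mean adjacent-sentence cosine; 1.0 when there are fewer than two sentences."""
    vectors = sentence_vectors(text)
    if len(vectors) < 2:
        return 1.0  # no adjacent pair exists that could disagree
    pairs = list(zip(vectors, vectors[1:]))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


print(coherence("Paris is the capital of France."))  # 1.0
print(coherence("Cats purr. Dogs bark."))             # 0.0
```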

4. OOV tokens receive maximum IDF weight by deliberate choice

Tokens not present in the bundled IDF corpus are assigned idf_oov = log(N+1) + 1, the value the smoothed IDF formula assigns to a token that appears in zero documents. This treats unseen words as maximally informative — the standard add-one (Laplace) smoothing choice — so technical jargon and proper nouns are not silently dropped to zero weight.
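
The OOV value can be checked against a smoothed IDF formula. The sketch assumes the common add-one form log((N + 1) / (df + 1)) + 1, which reduces to the stated idf_oov at df = 0; whether VRTY uses exactly this form for in-corpus tokens is an assumption. `N = 5400` is the approximate pseudo-document count quoted in limitation 7.

```python
import math


def smoothed_idf(doc_freq, n_docs):
    """Add-one smoothed IDF: log((N + 1) / (df + 1)) + 1 (assumed form)."""
    return math.log((n_docs + 1) / (doc_freq + 1)) + 1.0


def oov_idf(n_docs):
    """IDF for a token seen in zero documents: log(N + 1) + 1."""
    return math.log(n_docs + 1) + 1.0


N = 5400  # approximate size of the bundled corpus, in pseudo-documents

print(smoothed_idf(0, N) == oov_idf(N))         # True
print(smoothed_idf(N, N))                       # 1.0: a token in every document
print(smoothed_idf(0, N) > smoothed_idf(1, N))  # True: OOV gets the largest weight
```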

5. Conciseness is a type–token ratio, which is mildly length-sensitive

The conciseness measure (|unique content tokens| / |total tokens|) tends to decline for longer responses because the vocabulary saturates while the length keeps growing. This is a known property of the type–token ratio (Hess et al. 1986). Two responses of very different lengths are not directly comparable on conciseness alone; interpret the conciseness score together with the other dimensions and the response length.
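
The length sensitivity is easy to see with a small type–token ratio sketch. The tokenizer and the tiny `STOPWORDS` set are assumptions for this example, not VRTY's actual rules; under them, the two responses from limitation 1 come out at the same 0.5 and 1.0 shown in that table.

```python
import re

# Tiny stopword list assumed for this example; VRTY's actual list is not shown here.
STOPWORDS = {"a", "an", "is", "of", "the", "what"}


def conciseness(text):
    """|unique content tokens| / |total tokens|, or 0.0 when there are no tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if not tokens:
        return 0.0
    content = {t for t in tokens if t not in STOPWORDS}
    return len(content) / len(tokens)


print(conciseness("Paris is the capital of France."))  # 0.5
print(conciseness("Paris."))                           # 1.0

padded = "Paris is the capital of France. " * 5
print(conciseness(padded))                             # 0.1
```

Repeating the same sentence five times adds no new vocabulary, so the ratio drops from 0.5 to 0.1 even though the content is unchanged.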

6. Repetition can score high when prompt terms are out-of-corpus

Because OOV tokens receive maximum IDF weight (limitation 4 above) and conciseness is a type–token ratio (limitation 5), a response that repeats OOV technical terms (e.g. "multi-head multi-head attention attention attention transformer transformer transformer." against a transformer-architecture prompt) can score higher than a substantive paragraph on the same prompt. Mitigation: combine the VRTY composite with a basic length / repetition sanity check, or treat the composite as one signal among several. This is a known property of TF·IDF-family scorers, not unique to VRTY.
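
A basic repetition check of the kind suggested above might look like this. It is one possible heuristic, not something VRTY provides; the function names, the tokenizer, and the 0.3 threshold are all assumptions chosen for illustration.

```python
import re


def adjacent_repeat_fraction(text):
    """Fraction of adjacent token pairs in which the same token repeats."""
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    if len(tokens) < 2:
        return 0.0
    repeats = sum(1 for a, b in zip(tokens, tokens[1:]) if a == b)
    return repeats / (len(tokens) - 1)


def looks_like_padding(text, max_fraction=0.3):
    """Hypothetical sanity gate; the 0.3 threshold is an arbitrary example value."""
    return adjacent_repeat_fraction(text) > max_fraction


spam = "multi-head multi-head attention attention attention transformer transformer transformer."
print(looks_like_padding(spam))                               # True: 5 of 7 adjacent pairs repeat
print(looks_like_padding("Paris is the capital of France."))  # False
```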

7. The bundled IDF corpus is 19th-century English literature

IDF weights are computed from ten US-public-domain Project Gutenberg books (Austen, Melville, Shelley, Doyle, Stoker, Carroll, Wilde, Dickens, Wells, Thoreau): about 5,400 pseudo-documents of 200 tokens each, with a vocabulary of about 32,000 words. Modern technical vocabulary like "API", "endpoint", "deploy", "kubernetes", "async" is not in the corpus and falls into the OOV bucket, where it receives the maximum IDF weight (see limitation 4).

This generally helps technical text (rare jargon is correctly treated as informative) but can cause uneven weighting when one technical term is in-corpus by coincidence and a similar one is not. A domain-matched IDF corpus is explicitly post-v1.0; v1.0 disclaims this rather than fixes it. Non-English text scores as-is with no special handling and is similarly disclaimed.

Input contract

Behavior on degenerate inputs is part of the v1.0 spec, not an afterthought:

Input	Behavior
Empty response	Every dimension and the composite return `0.0`; explanations say "response contained no scorable tokens."
Empty prompt	Relevance and completeness return `0.0`; coherence and conciseness depend only on the response and score normally
Inputs above 2,048 tokens	Truncated at 2,048 tokens (the `MAX_TOKENS` constant) before scoring; truncation is deterministic
Non-English text	NFKD-normalized then ASCII-stripped; accented Latin folds to base letters; non-Latin scripts (CJK, Cyrillic, Arabic, ...) drop entirely. Quality outside English is not claimed
Response identical to prompt	Scored normally; no special case
Single word	Scored normally; no special case
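
The non-English and truncation rows can be pictured with the standard library. This is a sketch of the described behavior, not VRTY's implementation: `fold_to_ascii` and `truncate` are hypothetical names, and keeping the first 2,048 tokens (rather than some other window) is an assumption.

```python
import unicodedata

MAX_TOKENS = 2048  # constant name and value taken from the table above


def fold_to_ascii(text):
    """NFKD-normalize, then drop every non-ASCII code point."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def truncate(tokens):
    """Keep at most MAX_TOKENS tokens, always from the front (assumed)."""
    return tokens[:MAX_TOKENS]


print(fold_to_ascii("Crème brûlée"))  # Creme brulee
print(repr(fold_to_ascii("東京")))     # ''
print(len(truncate(["tok"] * 5000)))  # 2048
```

Accented Latin letters survive as their base letters because NFKD separates the letter from its combining mark; scripts with no ASCII decomposition vanish entirely, which is why quality outside English is not claimed.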

License

MIT — see LICENSE.

Versioning

vrty_version is included with every score so any historical score is traceable to the exact scoring logic that produced it. The bundled IDF data file's SHA-256 (idf_sha256) is also returned with every score so two scores from different builds can be compared at the data-pinning level, not just the code level. Changing either the scoring logic or the IDF data invalidates byte-equality guarantees and requires a version bump.

A score from vrty_version="1.0.0" will be reproducible on any future machine that installs vrty==1.0.0 on Python 3.11.9.
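
The data pinning behind idf_sha256 amounts to a hash check before the IDF file is used. The sketch below shows the general pattern with the standard library; `DataIntegrityError` stands in for an error such as VrtyDataError, `load_verified_json_gz` is a hypothetical name, and hashing the compressed bytes (rather than the decompressed JSON) is an assumption.

```python
import gzip
import hashlib
import json


class DataIntegrityError(Exception):
    """Stand-in for an error such as VrtyDataError; this class is hypothetical."""


def load_verified_json_gz(raw_bytes, expected_sha256):
    """Check the SHA-256 of the compressed bytes, then decompress and parse."""
    actual = hashlib.sha256(raw_bytes).hexdigest()
    if actual != expected_sha256:
        raise DataIntegrityError(f"hash mismatch: {actual}")
    return json.loads(gzip.decompress(raw_bytes).decode("utf-8"))


# Toy data file built in memory for the example.
blob = gzip.compress(json.dumps({"paris": 3.5}).encode("utf-8"))
pinned = hashlib.sha256(blob).hexdigest()

print(load_verified_json_gz(blob, pinned))  # {'paris': 3.5}
```

Failing before any parsing happens is what lets a modified data file stop the process before a single score is computed.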

Project details

Release history

1.0.1 (May 26, 2026)

1.0.0 (May 26, 2026), this version

Download files

Source Distribution

vrty-1.0.0.tar.gz (199.0 kB)

Uploaded May 26, 2026 Source

Built Distribution

vrty-1.0.0-py3-none-any.whl (178.1 kB)

Uploaded May 26, 2026 Python 3

File details

Details for the file vrty-1.0.0.tar.gz.

File metadata

Download URL: vrty-1.0.0.tar.gz
Upload date: May 26, 2026
Size: 199.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for vrty-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`d95865a2600c119395994f03f0bbe737ff49ce3818eec26832205d7c07ee9b55`
MD5	`8cdbdaa16d67b194083af97e84a9fe23`
BLAKE2b-256	`8b1a6b0f6cc8212496201abb014b1dc376c017e9da00719edfbc9b24b3849c9a`

Provenance

The following attestation bundles were made for vrty-1.0.0.tar.gz:

Publisher: publish.yml on Sundeyp/vrty

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: vrty-1.0.0.tar.gz
- Subject digest: d95865a2600c119395994f03f0bbe737ff49ce3818eec26832205d7c07ee9b55
- Sigstore transparency entry: 1635428272
- Sigstore integration time: May 26, 2026
Source repository:
- Permalink: Sundeyp/vrty@f10e35fc86936bda867d7a4b30d07e673315e399
- Branch / Tag: refs/tags/v1.0.0
- Owner: https://github.com/Sundeyp
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@f10e35fc86936bda867d7a4b30d07e673315e399
- Trigger Event: push

File details

Details for the file vrty-1.0.0-py3-none-any.whl.

File metadata

Download URL: vrty-1.0.0-py3-none-any.whl
Upload date: May 26, 2026
Size: 178.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for vrty-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`224f6849f364a534ae0a4bdb855f7f18352c8d4501d3505abe79c04fc0b23f99`
MD5	`3e8f13e2fe50ca51626165be0cd36434`
BLAKE2b-256	`1c888dbb29a4ba381e8a8aee3b1b8fa3c63f03bfa4d27ab8e1fed93885e6bba5`

Provenance

The following attestation bundles were made for vrty-1.0.0-py3-none-any.whl:

Publisher: publish.yml on Sundeyp/vrty

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: vrty-1.0.0-py3-none-any.whl
- Subject digest: 224f6849f364a534ae0a4bdb855f7f18352c8d4501d3505abe79c04fc0b23f99
- Sigstore transparency entry: 1635428274
- Sigstore integration time: May 26, 2026
Source repository:
- Permalink: Sundeyp/vrty@f10e35fc86936bda867d7a4b30d07e673315e399
- Branch / Tag: refs/tags/v1.0.0
- Owner: https://github.com/Sundeyp
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@f10e35fc86936bda867d7a4b30d07e673315e399
- Trigger Event: push
