# GoldenSetAuditor

Evaluation dataset quality auditor for LLM / RAG applications.
GoldenSetAuditor audits golden evaluation datasets for LLM and RAG applications before benchmark scores are trusted. It does not score model outputs — it audits the dataset being scored against.
## About
Nobody questions the golden set. It's treated as ground truth — the fixed reference that tells you whether your model improved. But golden sets are assembled by humans, often under deadline pressure, from domain knowledge that isn't always consistent or independently reviewed. The same question appears twice with different expected answers. A near-trivial question makes up a third of a category. A reference answer is one word. An ambiguous pronoun in a prompt means no single correct answer exists.
None of this is visible in the benchmark score itself. A bad golden set doesn't produce obviously wrong numbers — it produces confidently wrong ones. You don't know the score is unreliable until you dig into why a supposedly better model regressed, or why two domain experts disagree on whether an answer was correct.
The deeper problem is circular. You're using the golden set to validate the model, but nothing is validating the golden set. The tool that needs quality assurance is the one everyone assumes is already correct.
GoldenSetAuditor breaks that circularity. Feed it the DataFrame backing your evaluation suite, and it checks for conflicting labels, duplicate prompts, weak reference answers, ambiguous questions, over-easy examples, near-duplicate pairs, and category coverage gaps. The output is a per-finding audit report in JSON, Markdown, and HTML — structured, row-level, and attachable to your evaluation documentation before a single benchmark score is published.
The truth boundary is explicit on every report: this tool audits dataset quality. It does not evaluate model answers.
## Architecture
```mermaid
flowchart TD
    IN["GoldenSetAuditConfig + DataFrame\n──────────────────────────────\nquestion · expected_answer\ncategory · id · thresholds"]
    subgraph STRUCTURAL ["① Structural Integrity: can FAIL"]
        CD["Conflicting Labels\nsame prompt → different expected answers\nlabel conflict → ground truth undefined"]
        ED["Exact Duplicates\nsame prompt → same expected answer\nredundant example"]
    end
    subgraph CONTENT ["② Content Quality: WARN only"]
        WA["Weak Expected Answer\nmeaningful token count < min_expected_words"]
        AP["Ambiguous Prompt\nanaphoric ref in short prompt\nOR multiple question marks"]
        OE["Over-easy Example\nJaccard(prompt, answer) ≥ threshold\nmodel may answer by extraction"]
        ND["Near-duplicate Scan\npairwise token Jaccard O(n²)\nparaphrased prompts reduce diversity"]
    end
    subgraph COVERAGE ["③ Coverage: INSUFFICIENT_INPUT if no category_col"]
        CC["Category Coverage\nexamples per category < min_category_count\nper-group metrics unreliable"]
    end
    IN --> STRUCTURAL
    IN --> CONTENT
    IN --> COVERAGE
    STRUCTURAL & CONTENT & COVERAGE --> AGG
    AGG["FAIL › WARN › INSUFFICIENT_INPUT › PASS"]
    AGG --> OUT["GoldenSetReport\nJSON · Markdown · HTML\nexplicit truth boundary"]
```
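The structural checks in ① can be sketched in a few lines of pandas. This is a minimal illustration, not the package's implementation: the normalisation here (lowercase, collapse whitespace) is an assumption, and `structural_findings` is a hypothetical helper name.

```python
import pandas as pd

def structural_findings(df: pd.DataFrame,
                        input_col: str = "question",
                        expected_col: str = "expected_answer") -> pd.DataFrame:
    """Return one row per repeated prompt, labelled CONFLICT or DUPLICATE."""
    # Assumed normalisation: lowercase, strip, collapse internal whitespace.
    norm = (df[input_col].str.lower()
                         .str.strip()
                         .str.replace(r"\s+", " ", regex=True))
    g = df.assign(_norm=norm).groupby("_norm")[expected_col]
    stats = g.agg(rows="size", distinct_answers="nunique").reset_index()
    repeated = stats[stats["rows"] > 1].copy()
    # Same prompt, different answers -> ground truth undefined (FAIL-class).
    # Same prompt, same answer -> redundant example (WARN-class).
    repeated["finding"] = repeated["distinct_answers"].map(
        lambda n: "CONFLICT" if n > 1 else "DUPLICATE")
    return repeated
```

The key design point is grouping on the *normalised* prompt, so that trivial casing or whitespace differences don't hide a label conflict.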
## The 7 checks
| Group | Check | Method | Status |
|---|---|---|---|
| Structural | Conflicting labels | Exact match on normalised input text; answers differ | FAIL |
| Structural | Exact duplicates | Exact match on normalised input text; answers match | WARN |
| Content | Weak expected answer | Meaningful token count after stopword filter | WARN |
| Content | Ambiguous prompt | Anaphoric reference pattern + multi-question heuristic | WARN |
| Content | Over-easy example | Token Jaccard between prompt and expected answer | WARN |
| Content | Near-duplicate scan | Pairwise token Jaccard across all prompt pairs (O(n²)) | WARN or INSUFFICIENT_INPUT |
| Coverage | Category coverage | Example count per category value | WARN or INSUFFICIENT_INPUT |
Only conflicting labels can produce FAIL — the ground truth is structurally undefined. Everything else is WARN: suspicious, requiring human confirmation.
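The severity precedence implied above (FAIL over WARN over INSUFFICIENT_INPUT over PASS) can be expressed as a small fold over per-check statuses. A sketch, with `aggregate_status` as a hypothetical name for illustration:

```python
def aggregate_status(check_statuses: list[str]) -> str:
    """Collapse per-check statuses using FAIL > WARN > INSUFFICIENT_INPUT > PASS."""
    for level in ("FAIL", "WARN", "INSUFFICIENT_INPUT"):
        if level in check_statuses:
            return level
    # No check raised anything: the dataset passes the audit.
    return "PASS"
```

A single conflicting-labels finding is enough to drive the whole report to FAIL, which matches the rationale above: everything else is advisory, but an undefined ground truth invalidates the benchmark outright.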
## Truth boundary
GoldenSetAuditor does not evaluate model answers. It audits the evaluation dataset. It does not check whether expected answers are factually correct, whether the evaluation metric (exact match, ROUGE, LLM-judge) is appropriate, whether the golden set covers the production query distribution, or whether pretraining contamination has occurred.
## Install

```shell
pip install goldensetauditor
```
## Quickstart

```python
import pandas as pd

from goldensetauditor import GoldenSetAuditConfig, audit_golden_set

df = pd.read_csv("data/demo_golden_set.csv")

config = GoldenSetAuditConfig(
    input_col="question",
    expected_col="expected_answer",
    category_col="category",
    id_col="id",
)

report = audit_golden_set(df, config)
print(report.status)     # FAIL / WARN / INSUFFICIENT_INPUT / PASS
report.save("outputs/")  # writes JSON, Markdown, and HTML reports
```
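Both the over-easy check and the near-duplicate scan rest on token Jaccard overlap. A self-contained sketch of that metric follows; the tokenisation here (lowercase alphanumeric runs, no stopword filtering) is an assumption, and the package's own normalisation may differ:

```python
import re

def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two strings."""
    ta = set(re.findall(r"[a-z0-9]+", a.lower()))
    tb = set(re.findall(r"[a-z0-9]+", b.lower()))
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)
```

A prompt/answer pair like "What is the capital of France?" / "The capital of France is Paris" scores high on this metric, which is exactly the extraction-risk signal the over-easy check looks for.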
## Run the demo

```shell
git clone https://github.com/SidharthKriplani/goldensetauditor
cd goldensetauditor
pip install -e .
python scripts/generate_demo_reports.py
open outputs/goldensetauditor_report.html
```
## Resume-safe claim
Built GoldenSetAuditor, an evaluation dataset quality auditor for LLM/RAG applications that checks golden sets for conflicting expected answers, exact and near-duplicate prompts, weak reference answers, ambiguous questions, over-easy examples, and category coverage gaps, producing structured JSON/Markdown/HTML audit reports with per-finding PASS/WARN/FAIL/INSUFFICIENT_INPUT status and explicit truth boundary.
## Roadmap
- Semantic near-duplicate detection via sentence embeddings (alternative to token Jaccard)
- Pretraining contamination flag (n-gram overlap against known public corpora)
- Multi-turn / context-dependent conversation auditing
## License
MIT