
GoldenSetAuditor

Evaluation dataset quality auditor for LLM / RAG applications.


GoldenSetAuditor audits golden evaluation datasets for LLM and RAG applications before benchmark scores are trusted. It does not score model outputs — it audits the dataset being scored against.

About

Nobody questions the golden set. It's treated as ground truth — the fixed reference that tells you whether your model improved. But golden sets are assembled by humans, often under deadline pressure, from domain knowledge that isn't always consistent or independently reviewed. The same question appears twice with different expected answers. A near-trivial question makes up a third of a category. A reference answer is one word. An ambiguous pronoun in a prompt means no single correct answer exists.

None of this is visible in the benchmark score itself. A bad golden set doesn't produce obviously wrong numbers — it produces confidently wrong ones. You don't know the score is unreliable until you dig into why a supposedly better model regressed, or why two domain experts disagree on whether an answer was correct.

The deeper problem is circular. You're using the golden set to validate the model, but nothing is validating the golden set. The tool that needs quality assurance is the one everyone assumes is already correct.

GoldenSetAuditor breaks that circularity. Feed it the DataFrame backing your evaluation suite, and it checks for conflicting labels, duplicate prompts, weak reference answers, ambiguous questions, over-easy examples, near-duplicate pairs, and category coverage gaps. The output is a per-finding audit report in JSON, Markdown, and HTML — structured, row-level, and attachable to your evaluation documentation before a single benchmark score is published.

The truth boundary is explicit on every report: this tool audits dataset quality. It does not evaluate model answers.

Architecture

flowchart TD
    IN["GoldenSetAuditConfig + DataFrame\n──────────────────────────────\nquestion · expected_answer\ncategory · id · thresholds"]

    subgraph STRUCTURAL ["① Structural Integrity   can FAIL"]
        CD["Conflicting Labels\nsame prompt → different expected answers\nlabel conflict → ground truth undefined"]
        ED["Exact Duplicates\nsame prompt → same expected answer\nredundant example"]
    end

    subgraph CONTENT ["② Content Quality   WARN only"]
        WA["Weak Expected Answer\nmeaningful token count < min_expected_words"]
        AP["Ambiguous Prompt\nanaphoric ref in short prompt\nOR multiple question marks"]
        OE["Over-easy Example\nJaccard(prompt, answer) ≥ threshold\nmodel may answer by extraction"]
        ND["Near-duplicate Scan\npairwise token Jaccard O(n²)\nparaphrased prompts reduce diversity"]
    end

    subgraph COVERAGE ["③ Coverage   INSUFFICIENT_INPUT if no category_col"]
        CC["Category Coverage\nexamples per category < min_category_count\nper-group metrics unreliable"]
    end

    IN --> STRUCTURAL
    IN --> CONTENT
    IN --> COVERAGE

    STRUCTURAL & CONTENT & COVERAGE --> AGG

    AGG["FAIL › WARN › INSUFFICIENT_INPUT › PASS"]
    AGG --> OUT["GoldenSetReport\nJSON · Markdown · HTML\nexplicit truth boundary"]
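
The aggregation step can be read as a simple priority rule. Here is a minimal sketch (illustrative only, not the package's internal API), assuming each check reports one of the four statuses shown in the diagram:

# Illustrative only -- not GoldenSetAuditor's internal API.
# Rolls per-check statuses up to a single report status using the
# priority order FAIL > WARN > INSUFFICIENT_INPUT > PASS.
PRIORITY = ["FAIL", "WARN", "INSUFFICIENT_INPUT", "PASS"]

def aggregate(statuses: list[str]) -> str:
    for level in PRIORITY:
        if level in statuses:
            return level
    return "PASS"

print(aggregate(["PASS", "WARN", "PASS"]))                # WARN
print(aggregate(["PASS", "FAIL", "INSUFFICIENT_INPUT"]))  # FAIL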

The 7 checks

| Group | Check | Method | Status |
|---|---|---|---|
| Structural | Conflicting labels | Exact match on normalised input text; answers differ | FAIL |
| Structural | Exact duplicates | Exact match on normalised input text; answers match | WARN |
| Content | Weak expected answer | Meaningful token count after stopword filter | WARN |
| Content | Ambiguous prompt | Anaphoric reference pattern + multi-question heuristic | WARN |
| Content | Over-easy example | Token Jaccard between prompt and expected answer | WARN |
| Content | Near-duplicate scan | Pairwise token Jaccard across all prompt pairs (O(n²)) | WARN or INSUFFICIENT_INPUT |
| Coverage | Category coverage | Example count per category value | WARN or INSUFFICIENT_INPUT |

Only conflicting labels can produce FAIL, because a label conflict leaves the ground truth structurally undefined. Everything else surfaces as WARN (or INSUFFICIENT_INPUT when a check lacks enough input to run): suspicious, but requiring human confirmation.
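
To make the Method column concrete, here is a hedged sketch of two of the checks in plain pandas: conflicting labels via exact match on a normalised prompt, and the over-easy signal via token Jaccard between prompt and expected answer. Column names follow the Quickstart below; the normalisation, helper names, and thresholds are illustrative guesses, not the package's internals.

import re

import pandas as pd

def normalise(text: str) -> str:
    # Lowercase, strip punctuation, collapse whitespace -- a plausible
    # normalisation, not necessarily the one GoldenSetAuditor uses.
    return " ".join(re.sub(r"[^\w\s]", " ", str(text).lower()).split())

def token_jaccard(a: str, b: str) -> float:
    ta, tb = set(normalise(a).split()), set(normalise(b).split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

df = pd.DataFrame({
    "question": [
        "What is RAG?",
        "what is RAG",
        "Is a vector store a database for embeddings?",
    ],
    "expected_answer": [
        "Retrieval-augmented generation.",
        "A retrieval pipeline with a generator.",
        "A vector store is a database for embeddings.",
    ],
})

# Conflicting labels: same normalised prompt, different expected answers -> FAIL.
groups = df.groupby(df["question"].map(normalise))["expected_answer"]
conflicts = groups.apply(lambda answers: answers.map(normalise).nunique() > 1)
print(conflicts[conflicts].index.tolist())   # ['what is rag']

# Over-easy example: prompt and expected answer share most of their tokens -> WARN.
overlap = df.apply(lambda row: token_jaccard(row["question"], row["expected_answer"]), axis=1)
print(df.loc[overlap >= 0.6, "question"].tolist())   # the vector store question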

Truth boundary

GoldenSetAuditor does not evaluate model answers. It audits the evaluation dataset. It does not check whether expected answers are factually correct, whether the evaluation metric (exact match, ROUGE, LLM-judge) is appropriate, whether the golden set covers the production query distribution, or whether pretraining contamination has occurred.

Install

pip install goldensetauditor

Quickstart

import pandas as pd
from goldensetauditor import GoldenSetAuditConfig, audit_golden_set

df = pd.read_csv("data/demo_golden_set.csv")

config = GoldenSetAuditConfig(
    input_col="question",
    expected_col="expected_answer",
    category_col="category",
    id_col="id",
)

report = audit_golden_set(df, config)

print(report.status)       # FAIL / WARN / PASS
report.save("outputs/")    # writes JSON, Markdown, HTML
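
If you don't have a CSV handy, a small inline DataFrame with the same columns works too. This is a minimal sketch using only the API shown above; the rows are made up, with a deliberate label conflict planted so the audit has something to find:

import pandas as pd
from goldensetauditor import GoldenSetAuditConfig, audit_golden_set

# Made-up rows: the repeated question with two different expected answers
# should surface as a conflicting-labels finding.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "question": ["What is RAG?", "What is RAG?", "Name one vector database."],
    "expected_answer": ["Retrieval-augmented generation", "A reranking method", "FAISS"],
    "category": ["concepts", "concepts", "tools"],
})

config = GoldenSetAuditConfig(
    input_col="question",
    expected_col="expected_answer",
    category_col="category",
    id_col="id",
)

report = audit_golden_set(df, config)
print(report.status)   # the planted conflict should drive this to FAIL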

Run the demo

git clone https://github.com/SidharthKriplani/goldensetauditor
cd goldensetauditor
pip install -e .
python scripts/generate_demo_reports.py
open outputs/goldensetauditor_report.html

Resume-safe claim

Built GoldenSetAuditor, an evaluation dataset quality auditor for LLM/RAG applications that checks golden sets for conflicting expected answers, exact and near-duplicate prompts, weak reference answers, ambiguous questions, over-easy examples, and category coverage gaps, producing structured JSON/Markdown/HTML audit reports with per-finding PASS/WARN/FAIL/INSUFFICIENT_INPUT status and explicit truth boundary.

Roadmap

  • Semantic near-duplicate detection via sentence embeddings (alternative to token Jaccard; sketched after this list)
  • Pretraining contamination flag (n-gram overlap against known public corpora)
  • Multi-turn / context-dependent conversation auditing
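
For the first roadmap item, here is a rough sketch of what semantic near-duplicate detection could look like, assuming the sentence-transformers package and an off-the-shelf embedding model; this illustrates the planned approach and is not shipped functionality:

from sentence_transformers import SentenceTransformer  # assumed dependency, not part of goldensetauditor

prompts = [
    "What is retrieval-augmented generation?",
    "Explain retrieval-augmented generation.",
    "Name one vector database.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
# Unit-normalised embeddings make the dot product equal to cosine similarity.
emb = model.encode(prompts, normalize_embeddings=True)

sims = emb @ emb.T
for i in range(len(prompts)):
    for j in range(i + 1, len(prompts)):
        if sims[i, j] >= 0.85:   # illustrative threshold
            print(f"near-duplicate pair: {prompts[i]!r} / {prompts[j]!r} (cos={sims[i, j]:.2f})")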

License

MIT
