Version-controlled golden dataset management and semantic evaluation for RAG pipelines.

These details have not been verified by PyPI

Project links

Project description

Golden Dataset Studio — RAG Evaluation Pipeline

✦ Golden Dataset Studio

A production-grade RAG evaluation framework built around a curated golden dataset core.
Hand-crafted ground truth · Versioned Q&A pairs · Semantic similarity scoring · RAGAS + DeepEval integration

What is Golden Dataset Studio?

Most RAG evaluation frameworks treat the golden dataset as an afterthought — a CSV you import once and forget. Golden Dataset Studio flips that. The golden dataset is the foundation everything else is built on.

A golden dataset is a hand-curated set of question + expected answer pairs specific to your domain and your documents. Every entry is something a human has verified is correct. It is your ground truth — the one thing you trust completely when your model changes, your chunking strategy changes, or your LLM is swapped.

Without a golden dataset:     you are guessing whether your RAG works
With a golden dataset:        you know — precisely, repeatably, with a version history

Golden Dataset Studio gives you:

A versioned store for your Q&A pairs (v1.0 → v1.3 → ...)
A semantic similarity scorer that compares LLM answers against expected answers
A manifest system that tracks which version was evaluated with which model
A clean eval runner that strips noise (model tags, markdown) before scoring
Integration hooks for RAGAS and DeepEval when you need deeper diagnostics

The Three-Layer Evaluation Strategy

                    ┌─────────────────────────────────────────────────────┐
                    │             GOLDEN DATASET STUDIO                   │
                    │                                                     │
                    │  "Is the final answer CORRECT for my domain?"      │
                    │                                                     │
                    │  Hand-curated Q&A  ·  Versioned  ·  Fast  ·  Cheap │
                    └─────────────────┬───────────────────────────────────┘
                                      │ PASS
                                      ▼
                    ┌─────────────────────────────────────────────────────┐
                    │                    RAGAS                            │
                    │                                                     │
                    │  "WHY is the pipeline producing that answer?"      │
                    │                                                     │
                    │  Context Precision · Context Recall · Faithfulness  │
                    └─────────────────┬───────────────────────────────────┘
                                      │ PASS
                                      ▼
                    ┌─────────────────────────────────────────────────────┐
                    │                  DEEPEVAL                           │
                    │                                                     │
                    │  "Is the LLM safe to ship to users?"               │
                    │                                                     │
                    │  Hallucination · Toxicity · Bias · Correctness     │
                    └─────────────────────────────────────────────────────┘

If Golden Dataset fails badly — stop here. Fix the retriever or chunking first. Do not burn RAGAS/DeepEval credits diagnosing a broken pipeline.

Golden Dataset — Deep Dive

What goes into it

Each entry has three fields:

{
  "id": "b9e08070",
  "question": "How many annual leave days do employees get?",
  "answer": "Employees receive 25 days of annual leave per year."
}

The answer is what a human expert wrote after reading your source document. It is not generated by an LLM. It is not scraped. It is your authoritative ground truth.

How scoring works

LLM answer  ──► sentence-transformers ──► embedding vector ──┐
                                                              ├──► cosine similarity ──► score
Expected answer ──► sentence-transformers ──► embedding vector ──┘

score >= 0.7  →  PASS ✓
score <  0.7  →  FAIL ✗

We use sentence-transformers/all-MiniLM-L6-v2 for embeddings. Semantic similarity means paraphrases pass — the LLM doesn't need to match word-for-word.

The clean_answer step

Before scoring, LLM output is cleaned to remove noise that tanks similarity scores unfairly:

def clean_answer(answer: str) -> str:
    # Remove [Model: gpt-oss:20b] prefix injected for traceability
    answer = re.sub(r'^\[Model:[^\]]+\]\s*', '', answer.strip())
    # Remove markdown bold **...**
    answer = re.sub(r'\*{1,2}([^*]+)\*{1,2}', r'\1', answer)
    return answer.strip()

Without this, [Model: gpt-oss:20b] **25 annual leave days** scores 0.46 against the expected answer. After cleaning, the same answer scores 0.93.

Versioning

Every eval run is saved with the dataset version and model name:

evals/
  eval_v1.3_20250610_143022.json   ← which version, which timestamp
my_answers.json                    ← { "model": "gpt-oss:20b", "answers_raw": [...], "answers_clean": [...] }

This means you can re-evaluate old answers against a new dataset version without re-running the LLM.

RAGAS — Retrieval Diagnostics

Run after Golden Dataset passes. Answers where the pipeline is breaking.

Metric	Question it answers
Context Precision	Were the retrieved chunks relevant to the question?
Context Recall	Did the retriever miss any important information?
Faithfulness	Is the LLM answer grounded in the retrieved context, or hallucinated?
Answer Relevancy	Does the answer actually address what was asked?

A real debugging scenario:

LLM answers "25 days"  ✓  Golden Dataset PASSES

But Context Precision = 0.3  ✗

Meaning: the retriever fetched 5 chunks, only 1 was relevant.
The LLM got lucky. Fix: reduce chunk_size, tune similarity threshold.

DeepEval — LLM Behaviour Gate

Run before every production release. The final sign-off check.

Metric	What it catches
Hallucination	LLM fabricating facts not present in retrieved context
Toxicity	Unsafe or harmful output
Bias	Unfair or skewed responses
Answer Correctness	End-to-end factual accuracy across all questions

When to Run Each Layer

Trigger	Layers to run	Reason
Every code change / daily dev	Golden Dataset only	Fast, free, catches regressions immediately
Before merging a PR	Golden Dataset + RAGAS	Ensures retrieval quality didn't degrade
Before a production release	All three	Full sign-off — correctness, retrieval, safety

Quick Start

git clone https://github.com/your-username/golden-dataset-studio.git
cd golden-dataset-studio

pip install -r requirements.txt

# Run the golden dataset evaluation
python run_golden_eval.py --llm gpt-oss:20b

# Default model (llama3.2)
python run_golden_eval.py

Sample Output

Model: gpt-oss:20b
Running RAG pipeline against 2 golden entries (v1.3)...

[1/2] Q: How many annual leave days?
         Model:    gpt-oss:20b
         A:        [Model: gpt-oss:20b] Employees are entitled to 25 annual leave days.
         Cleaned:  Employees are entitled to 25 annual leave days.
         Expected: Employees receive 25 days of annual leave per year.
         Latency:  36.95s

==================================================
EVALUATION RESULTS (v1.3)
Model: gpt-oss:20b
==================================================
  [b9e08070] ✓ similarity=0.923  Q: How many annual leave days?
  [f8aac4c1] ✓ similarity=0.941  Q: What is the notice period?

  Avg Similarity: 0.932
  Pass (>=0.7):   ✓
  Saved to:       evals/eval_v1.3_20250610_143022.json

Project Structure

golden-dataset-studio/
├── data/
│   └── company_policy.txt          # Source document
├── golden_dataset.py               # DatasetStore, Evaluator, EvalResult, Manifest
├── run_golden_eval.py              # Eval runner — RAG + scoring + clean_answer
├── my_answers.json                 # { model, answers_raw, answers_clean }
├── evals/                          # Versioned eval results
│   └── eval_v1.3_<timestamp>.json
├── architecture.png                # Pipeline diagram
└── requirements.txt

Troubleshooting

Symptom	Root cause	Fix
Similarity score ~0.46 despite correct answer	Model tag or markdown `...` in LLM output polluting embedding	`clean_answer()` strips both before scoring
`invalid model name (status code: 400)`	Wrong Ollama model name format	Use `gpt-oss:20b` not `gpt:oss:20b`
Context precision low	Chunks too large, retriever fetching noise	Reduce `chunk_size`, tune retriever `k`
Faithfulness low	LLM hallucinating beyond retrieved context	Tighten prompt, try different model
Answer relevancy low	Prompt not focused enough	Refine the prompt template

Tech Stack

Layer	Technology
Document loading	LangChain `TextLoader`
Chunking	`RecursiveCharacterTextSplitter`
Embeddings	`sentence-transformers/all-MiniLM-L6-v2`
Vector store	FAISS
LLM inference	Ollama (local) — `gpt-oss:20b`, `llama3.2`
RAG orchestration	LangChain
Retrieval evaluation	RAGAS
LLM evaluation	DeepEval
Observability	Langfuse

Roadmap

RAGAS metrics wired into the main eval runner
DeepEval hallucination checks per golden entry
Langfuse trace per eval run
Multi-model comparison mode (--llm-compare llama3.2,gpt-oss:20b)
CI/CD gate — fail pipeline if avg similarity < 0.7
Streamlit dashboard for visualising eval history across versions

License

MIT

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.1.1

Jun 18, 2026

This version

0.1.0

Jun 18, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

golden_dataset_studio-0.1.0.tar.gz (18.4 kB view details)

Uploaded Jun 18, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

golden_dataset_studio-0.1.0-py3-none-any.whl (16.9 kB view details)

Uploaded Jun 18, 2026 Python 3

File details

Details for the file golden_dataset_studio-0.1.0.tar.gz.

File metadata

Download URL: golden_dataset_studio-0.1.0.tar.gz
Upload date: Jun 18, 2026
Size: 18.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.2

File hashes

Hashes for golden_dataset_studio-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`8f75876d7777f90bf23731b083e41f6cc20b5edcda0be257d824b66b9c403dbe`
MD5	`9ac0ef56249bd8aecace93bfb4ad97b0`
BLAKE2b-256	`caeef490a01a90dd006b7b720a671e605b15ce771ede3012eddf34248181556b`

See more details on using hashes here.

File details

Details for the file golden_dataset_studio-0.1.0-py3-none-any.whl.

File metadata

Download URL: golden_dataset_studio-0.1.0-py3-none-any.whl
Upload date: Jun 18, 2026
Size: 16.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.2

File hashes

Hashes for golden_dataset_studio-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2f9befb6dea84ffac8c480a360476e4289125bc8a8830e842394b49627a04d82`
MD5	`f634f696e1f8e3a2c5bfbb049f1bca5b`
BLAKE2b-256	`648bd08e07a604cd6568f75363a1b2f228ce2a256245613292f392dc9fbf36e9`

See more details on using hashes here.

golden-dataset-studio 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

✦ Golden Dataset Studio

What is Golden Dataset Studio?

The Three-Layer Evaluation Strategy

Golden Dataset — Deep Dive

What goes into it

How scoring works

The clean_answer step

Versioning

RAGAS — Retrieval Diagnostics

DeepEval — LLM Behaviour Gate

When to Run Each Layer

Quick Start

Sample Output

Project Structure

Troubleshooting

Tech Stack

Roadmap

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes