# QA Pairs Generator for RAG Evaluation
Automatically generate question-answer pairs from a corpus of JSON documents to evaluate Retrieval-Augmented Generation (RAG) pipelines. The tool extracts named entities from your documents, matches each entity to its most relevant documents via hybrid search (keyword + embeddings), and then prompts an LLM to produce one QA pair per (entity, document) combination.
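At a high level, the steps above can be sketched as follows. This is an illustrative outline only; the function names and signatures are hypothetical and do not correspond to the package's internal API:

```python
def generate_qa_pairs(documents, extract_entities, search, generate_pair, top_n=3):
    """Sketch of the pipeline: extract entities from the corpus, retrieve the
    top-n documents per entity, and produce one QA pair per (entity, document)."""
    corpus = [doc["text"] for doc in documents]
    entities = extract_entities(corpus)                   # LLM-based entity extraction
    pairs = []
    for entity in entities:
        for doc in search(entity, documents, top_n):      # hybrid keyword + embedding search
            pairs.append(generate_pair(entity, doc))      # one LLM call per (entity, document)
    return pairs
```

The three callables stand in for the LLM-backed stages; swapping them for stubs makes the control flow easy to test in isolation.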
## Installation
Requires Python ≥ 3.10.
### With pip

```shell
# runtime only
pip install -e .

# runtime + dev dependencies (pytest)
pip install -e ".[dev]"

# runtime + Azure Blob Storage support
pip install -e ".[azure]"
```
### With uv

```shell
uv sync                # runtime only
uv sync --extra dev    # include dev dependencies
uv sync --extra azure  # include Azure Blob Storage support
```
After installing, the `qa-generate` console script is available as an alternative to `python run_pipeline.py`.
## Input document format
Each document must be a single `.json` file inside the input directory. A document can have any top-level fields; you tell the pipeline which ones to use via `--search-fields`.
Minimal example (`docs/doc_001.json`):

```json
{
  "title": "Introduction to Transformers",
  "description": "Transformers are a type of neural network architecture introduced in the paper Attention Is All You Need.",
  "author": "Vaswani et al.",
  "year": 2017
}
```
If you run the pipeline with `--search-fields title description`, the tool concatenates the `title` and `description` fields to build the corpus text used for entity extraction and search. Fields listed in `--search-fields` that are absent from a document are silently skipped.
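The concatenation rule described above amounts to something like the following sketch (`build_corpus_text` is an illustrative helper, not the tool's actual code):

```python
def build_corpus_text(doc: dict, search_fields: list[str]) -> str:
    """Concatenate the requested fields in order; fields missing
    from the document are silently skipped."""
    parts = [str(doc[f]) for f in search_fields if f in doc]
    return "\n".join(parts)


doc = {
    "title": "Introduction to Transformers",
    "description": "Transformers are a type of neural network architecture ...",
    "year": 2017,
}
# "author" is absent from this document, so requesting it has no effect:
text = build_corpus_text(doc, ["title", "description", "author"])
```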
## Environment variables
### OpenAI (default)

| Variable | Required | Description |
|---|---|---|
| `OPENAI_API_KEY` | Yes | Your OpenAI secret key |
### Azure OpenAI (`--client azure`)

| Variable | Required | Description |
|---|---|---|
| `AZURE_OPENAI_API_KEY` | Yes | Your Azure OpenAI key |
| `AZURE_OPENAI_ENDPOINT` | Yes | Your Azure endpoint URL (e.g. `https://<resource>.openai.azure.com/`) |
| `OPENAI_API_VERSION` | Yes | API version (e.g. `2024-02-01`) |
### Azure Blob Storage (`az://` input URIs)

The following variables are resolved in priority order:

| Priority | Variable(s) | Description |
|---|---|---|
| 1 | `AZURE_STORAGE_CONNECTION_STRING` | Full connection string |
| 2 | `AZURE_STORAGE_ACCOUNT_NAME` + `AZURE_STORAGE_ACCOUNT_KEY` | Account name and key |
| 3 | `AZURE_STORAGE_ACCOUNT_NAME` (only) | Uses `DefaultAzureCredential` (managed identity, Azure CLI, workload identity, etc.) |
You can export the variables in your shell or store them in a `.env` file and load it before running the pipeline.
```shell
# Linux / macOS
export OPENAI_API_KEY="sk-..."

# Windows PowerShell
$env:OPENAI_API_KEY = "sk-..."
```
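Before launching, it can be useful to check that the variables required by the chosen provider are actually set. A minimal sketch (the `missing_vars` helper is illustrative, not part of the package; the mapping mirrors the tables above):

```python
import os

# Required environment variables per LLM provider (see the tables above).
REQUIRED = {
    "openai": ["OPENAI_API_KEY"],
    "azure": ["AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT", "OPENAI_API_VERSION"],
}


def missing_vars(client: str) -> list[str]:
    """Return the required variables that are unset or empty for the chosen client."""
    return [v for v in REQUIRED[client] if not os.environ.get(v)]
```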
## Usage
The pipeline can be invoked either via the installed script or directly:
```shell
# installed command
qa-generate --input-dir <DIR> --search-fields <FIELD ...> --output <FILE.json> [options]

# or via script
python run_pipeline.py --input-dir <DIR> --search-fields <FIELD ...> --output <FILE.json> [options]
```
| Argument | Required | Default | Description |
|---|---|---|---|
| `--input-dir` | Yes | — | Local path or remote URI (e.g. `az://container/prefix/`, `s3://bucket/prefix/`) containing `.json` input files |
| `--search-fields` | Yes | — | One or more document fields to use for entity extraction and corpus building |
| `--output` | Yes | — | Path to the output JSON file |
| `--client` | No | `openai` | LLM provider: `openai` or `azure` |
| `--model` | No | `gpt-4o-mini` | Chat model used for entity extraction and QA generation |
| `--embedding-model` | No | `text-embedding-3-small` | Embedding model used for semantic search |
| `--top-n` | No | `3` | Number of documents retrieved per entity via embedding search |
### Complete example
```shell
python run_pipeline.py \
  --input-dir ./docs \
  --search-fields title description \
  --output qa_output.json \
  --client openai \
  --model gpt-4o-mini \
  --embedding-model text-embedding-3-small \
  --top-n 3
```
### Azure OpenAI example
```shell
python run_pipeline.py \
  --input-dir ./docs \
  --search-fields title description \
  --output qa_output.json \
  --client azure \
  --model my-gpt4o-deployment \
  --embedding-model my-embedding-deployment
```
## Remote storage
`--input-dir` accepts any URI supported by fsspec. The pipeline reads `.json` files transparently from:
| Scheme | Backend | Extra install |
|---|---|---|
| `./path/` or `/abs/path/` | Local filesystem | — |
| `az://container/prefix/` | Azure Blob Storage | `pip install -e ".[azure]"` |
| `s3://bucket/prefix/` | Amazon S3 | `pip install s3fs` |
| `gcs://bucket/prefix/` | Google Cloud Storage | `pip install gcsfs` |
```shell
# read documents from Azure Blob Storage
python run_pipeline.py \
  --input-dir az://my-container/corpus/ \
  --search-fields title description \
  --output qa_output.json
```
## Output format
The output is a JSON array. Each element is a QA pair with the following fields:
```json
[
  {
    "entity": "Transformers",
    "question": "What problem do Transformers solve compared to RNNs?",
    "answer": "Transformers solve the sequential computation bottleneck of RNNs by relying entirely on self-attention mechanisms, enabling parallelisation during training.",
    "source_document": "Introduction to Transformers\nTransformers are a type of neural network architecture..."
  }
]
```
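A quick sanity check over the generated file can be sketched like this (the `validate_output` helper is illustrative, not shipped with the package; the field names are the documented ones):

```python
import json

REQUIRED_FIELDS = {"entity", "question", "answer", "source_document"}


def validate_output(pairs: list[dict]) -> None:
    """Assert every QA pair carries the four documented string fields."""
    for pair in pairs:
        missing = REQUIRED_FIELDS - set(pair)
        assert not missing, f"missing fields: {missing}"
        assert all(isinstance(pair[f], str) for f in REQUIRED_FIELDS)


pairs = json.loads(
    '[{"entity": "Transformers", "question": "Q?", '
    '"answer": "A.", "source_document": "..."}]'
)
validate_output(pairs)  # raises AssertionError on malformed pairs
```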
| Field | Type | Description |
|---|---|---|
| `entity` | string | Named entity extracted from the documents |
| `question` | string | Generated question about the entity |
| `answer` | string | Generated answer grounded in the source document |
| `source_document` | string | Concatenated text of the document used to generate the pair |