
QA Pairs Generator for RAG Evaluation

Generate QA pairs from JSON documents to evaluate RAG pipelines.

Automatically generate question-answer pairs from a corpus of JSON documents to evaluate Retrieval-Augmented Generation (RAG) pipelines. The tool extracts named entities from your documents, matches each entity to its most relevant documents via hybrid search (keyword + embeddings), and then prompts an LLM to produce one QA pair per (entity, document) combination.
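
The hybrid-search step blends an exact keyword signal with embedding similarity. A minimal, self-contained sketch of such a score follows; the 50/50 weighting and the exact formula are assumptions for illustration, not the tool's actual implementation:

import numpy as np

def hybrid_score(entity: str, doc_text: str,
                 entity_emb: np.ndarray, doc_emb: np.ndarray,
                 alpha: float = 0.5) -> float:
    # keyword signal: does the entity string occur verbatim in the document?
    keyword = float(entity.lower() in doc_text.lower())
    # semantic signal: cosine similarity between the two embeddings
    cosine = float(entity_emb @ doc_emb
                   / (np.linalg.norm(entity_emb) * np.linalg.norm(doc_emb)))
    # blend the two signals; alpha = 0.5 is an assumed weighting
    return alpha * keyword + (1 - alpha) * cosine

The top-n documents by such a score become the (entity, document) combinations sent to the LLM.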


Table of contents

  • Installation
  • Input document format
  • Environment variables
  • Usage
  • Remote storage
  • Output format

Installation

Requires Python ≥ 3.10. The editable installs (-e) below assume you are working from a clone of the repository.

With pip

# runtime only
pip install -e .

# runtime + dev dependencies (pytest)
pip install -e ".[dev]"

# runtime + Azure Blob Storage support
pip install -e ".[azure]"

With uv

uv sync                    # runtime only
uv sync --extra dev        # include dev dependencies
uv sync --extra azure      # include Azure Blob Storage support

After installing, the qa-generate console script is available as an alternative to python run_pipeline.py.


Input document format

Each document must be a single .json file inside the input directory. A document can have any top-level fields; you tell the pipeline which ones to use via --search-fields.

Minimal example (docs/doc_001.json):

{
  "title": "Introduction to Transformers",
  "description": "Transformers are a type of neural network architecture introduced in the paper Attention Is All You Need.",
  "author": "Vaswani et al.",
  "year": 2017
}

If you run the pipeline with --search-fields title description, the tool will concatenate the title and description fields to build the corpus text used for entity extraction and search. Fields listed in --search-fields that are absent from a document are silently skipped.
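
For illustration, the corpus text for the example above could be built like this. A minimal sketch: the newline join matches the source_document field shown in the output section, but the exact separator is an assumption:

import json
from pathlib import Path

def build_corpus_text(path: Path, search_fields: list[str]) -> str:
    doc = json.loads(path.read_text())
    # fields listed in --search-fields but absent from the document are skipped
    parts = [str(doc[field]) for field in search_fields if field in doc]
    return "\n".join(parts)

# build_corpus_text(Path("docs/doc_001.json"), ["title", "description"])
# -> "Introduction to Transformers\nTransformers are a type of neural network ..."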


Environment variables

OpenAI (default)

Variable        Required  Description
OPENAI_API_KEY  Yes       Your OpenAI secret key

Azure OpenAI (--client azure)

Variable               Required  Description
AZURE_OPENAI_API_KEY   Yes       Your Azure OpenAI key
AZURE_OPENAI_ENDPOINT  Yes       Your Azure endpoint URL (e.g. https://<resource>.openai.azure.com/)
OPENAI_API_VERSION     Yes       API version (e.g. 2024-02-01)
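
If the tool uses the official openai Python SDK (an assumption), these three variables are picked up automatically when the client is constructed without arguments:

from openai import AzureOpenAI

# reads AZURE_OPENAI_API_KEY, AZURE_OPENAI_ENDPOINT and
# OPENAI_API_VERSION from the environment
client = AzureOpenAI()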

Azure Blob Storage (az:// input URIs)

The following variables are resolved in priority order:

Priority  Variable(s)                                             Description
1         AZURE_STORAGE_CONNECTION_STRING                         Full connection string
2         AZURE_STORAGE_ACCOUNT_NAME + AZURE_STORAGE_ACCOUNT_KEY  Account name and key
3         AZURE_STORAGE_ACCOUNT_NAME (only)                       Uses DefaultAzureCredential (managed identity, Azure CLI, workload identity, etc.)
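
For illustration, this resolution order looks roughly like the following sketch, assuming the azure-storage-blob and azure-identity packages; it is not necessarily the tool's exact code:

import os
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

def make_blob_client() -> BlobServiceClient:
    conn = os.getenv("AZURE_STORAGE_CONNECTION_STRING")
    if conn:  # priority 1: full connection string
        return BlobServiceClient.from_connection_string(conn)
    account_url = f"https://{os.environ['AZURE_STORAGE_ACCOUNT_NAME']}.blob.core.windows.net"
    key = os.getenv("AZURE_STORAGE_ACCOUNT_KEY")
    if key:  # priority 2: account name + account key
        return BlobServiceClient(account_url=account_url, credential=key)
    # priority 3: account name only -> DefaultAzureCredential
    return BlobServiceClient(account_url=account_url, credential=DefaultAzureCredential())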

You can export the variables in your shell or store them in a .env file and load it before running the pipeline.

# Linux / macOS
export OPENAI_API_KEY="sk-..."

# Windows PowerShell
$env:OPENAI_API_KEY = "sk-..."
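
The .env file is not loaded for you; one option is to load it and launch the pipeline from Python (python-dotenv is an assumed extra dependency, not shipped with the tool):

import subprocess
from dotenv import load_dotenv

load_dotenv()  # copies KEY=value pairs from ./.env into os.environ
# the child process inherits the environment, including the API key
subprocess.run(
    ["qa-generate",
     "--input-dir", "./docs",
     "--search-fields", "title", "description",
     "--output", "qa_output.json"],
    check=True,
)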

Usage

The pipeline can be invoked either via the installed script or directly:

# installed command
qa-generate --input-dir <DIR> --search-fields <FIELD ...> --output <FILE.json> [options]

# or via script
python run_pipeline.py --input-dir <DIR> --search-fields <FIELD ...> --output <FILE.json> [options]

Argument           Required  Default                 Description
--input-dir        Yes       -                       Local path or remote URI (e.g. az://container/prefix/, s3://bucket/prefix/) containing .json input files
--search-fields    Yes       -                       One or more document fields to use for entity extraction and corpus building
--output           Yes       -                       Path to the output JSON file
--client           No        openai                  LLM provider: openai or azure
--model            No        gpt-4o-mini             Chat model used for entity extraction and QA generation
--embedding-model  No        text-embedding-3-small  Embedding model used for semantic search
--top-n            No        3                       Number of documents retrieved per entity via embedding search

Complete example

python run_pipeline.py \
    --input-dir   ./docs \
    --search-fields title description \
    --output      qa_output.json \
    --client      openai \
    --model       gpt-4o-mini \
    --embedding-model text-embedding-3-small \
    --top-n       3

Azure OpenAI example

python run_pipeline.py \
    --input-dir   ./docs \
    --search-fields title description \
    --output      qa_output.json \
    --client      azure \
    --model       my-gpt4o-deployment \
    --embedding-model my-embedding-deployment

Remote storage

--input-dir accepts any URI supported by fsspec. The pipeline reads .json files transparently from:

Scheme                  Backend               Extra install
./path/ or /abs/path/   Local filesystem      (none)
az://container/prefix/  Azure Blob Storage    pip install -e ".[azure]"
s3://bucket/prefix/     Amazon S3             pip install s3fs
gcs://bucket/prefix/    Google Cloud Storage  pip install gcsfs

# read documents from Azure Blob Storage
python run_pipeline.py \
    --input-dir   az://my-container/corpus/ \
    --search-fields title description \
    --output      qa_output.json
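
Because remote access goes through fsspec, you can sanity-check connectivity by reading the same documents yourself. A sketch, assuming adlfs is installed via the [azure] extra and picks up the storage variables above from the environment:

import json
import fsspec

docs = []
# each OpenFile opens lazily; az:// is served by the adlfs backend
for of in fsspec.open_files("az://my-container/corpus/*.json", mode="r"):
    with of as f:
        docs.append(json.load(f))
print(f"loaded {len(docs)} documents")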

Output format

The output is a JSON array. Each element is a QA pair with the following fields:

[
  {
    "entity": "Transformers",
    "question": "What problem do Transformers solve compared to RNNs?",
    "answer": "Transformers solve the sequential computation bottleneck of RNNs by relying entirely on self-attention mechanisms, enabling parallelisation during training.",
    "source_document": "Introduction to Transformers\nTransformers are a type of neural network architecture..."
  }
]

Field            Type    Description
entity           string  Named entity extracted from the documents
question         string  Generated question about the entity
answer           string  Generated answer grounded in the source document
source_document  string  Concatenated text of the document used to generate the pair
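
A typical way to consume this file is to replay each question against your RAG pipeline and check whether the generating document comes back. A sketch only: retrieve is a placeholder for your own retriever, and substring containment is a deliberately crude match:

import json

def retrieve(question: str) -> list[str]:
    """Placeholder: return the document texts your RAG pipeline retrieves."""
    raise NotImplementedError

with open("qa_output.json") as f:
    qa_pairs = json.load(f)

hits = sum(
    any(pair["source_document"] in doc for doc in retrieve(pair["question"]))
    for pair in qa_pairs
)
print(f"retrieval hit rate: {hits / len(qa_pairs):.1%}")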
