
QA Pairs Generator for RAG Evaluation

Generate QA pairs from JSON documents to evaluate RAG pipelines.

Automatically generate question-answer pairs from a corpus of JSON documents to evaluate Retrieval-Augmented Generation (RAG) pipelines. The tool extracts named entities from your documents, matches each entity to its most relevant documents via hybrid search (keyword + embeddings), and then prompts an LLM to produce one QA pair per (entity, document) combination.
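
The hybrid-search step blends an exact keyword signal with embedding similarity. A minimal, self-contained sketch of such a score follows; the 50/50 weighting and the exact formula are assumptions for illustration, not the tool's actual implementation:

import numpy as np

def hybrid_score(entity: str, doc_text: str,
                 entity_emb: np.ndarray, doc_emb: np.ndarray,
                 alpha: float = 0.5) -> float:
    # keyword signal: does the entity string occur verbatim in the document?
    keyword = float(entity.lower() in doc_text.lower())
    # semantic signal: cosine similarity between the two embeddings
    cosine = float(entity_emb @ doc_emb
                   / (np.linalg.norm(entity_emb) * np.linalg.norm(doc_emb)))
    # blend the two signals; alpha = 0.5 is an assumed weighting
    return alpha * keyword + (1 - alpha) * cosine

The top-n documents by such a score become the (entity, document) combinations sent to the LLM.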


Table of contents

  • Installation
  • Input document format
  • Environment variables
  • Usage
  • Remote storage
  • Output format

Installation

Requires Python ≥ 3.10. The editable installs (-e) below assume you are working from a clone of the repository.

With pip

# runtime only
pip install -e .

# runtime + dev dependencies (pytest)
pip install -e ".[dev]"

# runtime + Azure Blob Storage support
pip install -e ".[azure]"

With uv

uv sync                    # runtime only
uv sync --extra dev        # include dev dependencies
uv sync --extra azure      # include Azure Blob Storage support

After installing, the qa-generate console script is available as an alternative to python run_pipeline.py.


Input document format

Each document must be a single .json file inside the input directory. A document can have any top-level fields; you tell the pipeline which ones to use via --search-fields.

Minimal example (docs/doc_001.json):

{
  "title": "Introduction to Transformers",
  "description": "Transformers are a type of neural network architecture introduced in the paper Attention Is All You Need.",
  "author": "Vaswani et al.",
  "year": 2017
}

If you run the pipeline with --search-fields title description, the tool will concatenate the title and description fields to build the corpus text used for entity extraction and search. Fields listed in --search-fields that are absent from a document are silently skipped.
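
For illustration, the corpus text for the example above could be built like this. A minimal sketch: the newline join matches the source_document field shown in the output section, but the exact separator is an assumption:

import json
from pathlib import Path

def build_corpus_text(path: Path, search_fields: list[str]) -> str:
    doc = json.loads(path.read_text())
    # fields listed in --search-fields but absent from the document are skipped
    parts = [str(doc[field]) for field in search_fields if field in doc]
    return "\n".join(parts)

# build_corpus_text(Path("docs/doc_001.json"), ["title", "description"])
# -> "Introduction to Transformers\nTransformers are a type of neural network ..."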


Environment variables

OpenAI (default)

Variable        Required  Description
OPENAI_API_KEY  Yes       Your OpenAI secret key

Azure OpenAI (--client azure)

Variable               Required  Description
AZURE_OPENAI_API_KEY   Yes       Your Azure OpenAI key
AZURE_OPENAI_ENDPOINT  Yes       Your Azure endpoint URL (e.g. https://<resource>.openai.azure.com/)
OPENAI_API_VERSION     Yes       API version (e.g. 2024-02-01)
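
If the tool uses the official openai Python SDK (an assumption), these three variables are picked up automatically when the client is constructed without arguments:

from openai import AzureOpenAI

# reads AZURE_OPENAI_API_KEY, AZURE_OPENAI_ENDPOINT and
# OPENAI_API_VERSION from the environment
client = AzureOpenAI()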

Azure Blob Storage (az:// input URIs)

The following variables are resolved in priority order:

Priority  Variable(s)                                             Description
1         AZURE_STORAGE_CONNECTION_STRING                         Full connection string
2         AZURE_STORAGE_ACCOUNT_NAME + AZURE_STORAGE_ACCOUNT_KEY  Account name and key
3         AZURE_STORAGE_ACCOUNT_NAME (only)                       Uses DefaultAzureCredential (managed identity, Azure CLI, workload identity, etc.)
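
For illustration, this resolution order looks roughly like the following sketch, assuming the azure-storage-blob and azure-identity packages; it is not necessarily the tool's exact code:

import os
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

def make_blob_client() -> BlobServiceClient:
    conn = os.getenv("AZURE_STORAGE_CONNECTION_STRING")
    if conn:  # priority 1: full connection string
        return BlobServiceClient.from_connection_string(conn)
    account_url = f"https://{os.environ['AZURE_STORAGE_ACCOUNT_NAME']}.blob.core.windows.net"
    key = os.getenv("AZURE_STORAGE_ACCOUNT_KEY")
    if key:  # priority 2: account name + account key
        return BlobServiceClient(account_url=account_url, credential=key)
    # priority 3: account name only -> DefaultAzureCredential
    return BlobServiceClient(account_url=account_url, credential=DefaultAzureCredential())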

You can export the variables in your shell or store them in a .env file and load it before running the pipeline.

# Linux / macOS
export OPENAI_API_KEY="sk-..."

# Windows PowerShell
$env:OPENAI_API_KEY = "sk-..."
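
The .env file is not loaded for you; one option is to load it and launch the pipeline from Python (python-dotenv is an assumed extra dependency, not shipped with the tool):

import subprocess
from dotenv import load_dotenv

load_dotenv()  # copies KEY=value pairs from ./.env into os.environ
# the child process inherits the environment, including the API key
subprocess.run(
    ["qa-generate",
     "--input-dir", "./docs",
     "--search-fields", "title", "description",
     "--output", "qa_output.json"],
    check=True,
)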

Usage

The pipeline can be invoked either via the installed script or directly:

# installed command
qa-generate --input-dir <DIR> --search-fields <FIELD ...> --output <FILE.json> [options]

# or via script
python run_pipeline.py --input-dir <DIR> --search-fields <FIELD ...> --output <FILE.json> [options]

Argument           Required  Default                 Description
--input-dir        Yes       -                       Local path or remote URI (e.g. az://container/prefix/, s3://bucket/prefix/) containing .json input files
--search-fields    Yes       -                       One or more document fields to use for entity extraction and corpus building
--output           Yes       -                       Path to the output JSON file
--client           No        openai                  LLM provider: openai or azure
--model            No        gpt-4o-mini             Chat model used for entity extraction and QA generation
--embedding-model  No        text-embedding-3-small  Embedding model used for semantic search
--top-n            No        3                       Number of documents retrieved per entity via embedding search

Complete example

python run_pipeline.py \
    --input-dir   ./docs \
    --search-fields title description \
    --output      qa_output.json \
    --client      openai \
    --model       gpt-4o-mini \
    --embedding-model text-embedding-3-small \
    --top-n       3

Azure OpenAI example

python run_pipeline.py \
    --input-dir   ./docs \
    --search-fields title description \
    --output      qa_output.json \
    --client      azure \
    --model       my-gpt4o-deployment \
    --embedding-model my-embedding-deployment

Remote storage

--input-dir accepts any URI supported by fsspec. The pipeline reads .json files transparently from:

Scheme                  Backend               Extra install
./path/ or /abs/path/   Local filesystem      (none)
az://container/prefix/  Azure Blob Storage    pip install -e ".[azure]"
s3://bucket/prefix/     Amazon S3             pip install s3fs
gcs://bucket/prefix/    Google Cloud Storage  pip install gcsfs

# read documents from Azure Blob Storage
python run_pipeline.py \
    --input-dir   az://my-container/corpus/ \
    --search-fields title description \
    --output      qa_output.json
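
Because remote access goes through fsspec, you can sanity-check connectivity by reading the same documents yourself. A sketch, assuming adlfs is installed via the [azure] extra and picks up the storage variables above from the environment:

import json
import fsspec

docs = []
# each OpenFile opens lazily; az:// is served by the adlfs backend
for of in fsspec.open_files("az://my-container/corpus/*.json", mode="r"):
    with of as f:
        docs.append(json.load(f))
print(f"loaded {len(docs)} documents")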

Output format

The output is a JSON array. Each element is a QA pair with the following fields:

[
  {
    "entity": "Transformers",
    "question": "What problem do Transformers solve compared to RNNs?",
    "answer": "Transformers solve the sequential computation bottleneck of RNNs by relying entirely on self-attention mechanisms, enabling parallelisation during training.",
    "source_document": "Introduction to Transformers\nTransformers are a type of neural network architecture..."
  }
]

Field            Type    Description
entity           string  Named entity extracted from the documents
question         string  Generated question about the entity
answer           string  Generated answer grounded in the source document
source_document  string  Concatenated text of the document used to generate the pair
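
A typical way to consume this file is to replay each question against your RAG pipeline and check whether the generating document comes back. A sketch only: retrieve is a placeholder for your own retriever, and substring containment is a deliberately crude match:

import json

def retrieve(question: str) -> list[str]:
    """Placeholder: return the document texts your RAG pipeline retrieves."""
    raise NotImplementedError

with open("qa_output.json") as f:
    qa_pairs = json.load(f)

hits = sum(
    any(pair["source_document"] in doc for doc in retrieve(pair["question"]))
    for pair in qa_pairs
)
print(f"retrieval hit rate: {hits / len(qa_pairs):.1%}")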
