MiRAGE: A Multiagent Framework for Generating Multimodal Multihop Question-Answer Dataset for RAG Evaluation

Requires Python 3.9+. Licensed under Apache 2.0. Available on PyPI as mirage-benchmark.

MiRAGE is a multi-agent framework for generating high-quality, multimodal, multihop question-answer datasets for evaluating Retrieval-Augmented Generation (RAG) systems.

Multiagent Architecture

(Figure: MiRAGE framework architecture)

Sample QA Pair

(Figure: a sample generated QA pair)

Interactive Process Flow

Explore the step-by-step multihop QA generation process:

🔗 View Interactive Visualization

Key Features

  • Multi-hop Context Completion: Iteratively expands incomplete chunks with relevant context
  • Domain and Expert Role Detection: Automatic domain identification using BERTopic + LLM
  • Multi-stage QA Pipeline: Generate, Select, Verify, Correct for quality assurance
  • Multimodal Support: Handles text, tables, figures, and images
  • Multiple Backend Support: Gemini, OpenAI, and local Ollama models
  • Fully Parallelized: Thread and process pools for maximum throughput
  • Token Usage Tracking: Automatic tracking of input/output tokens across all LLM calls
  • Checkpoint & Resume: Interrupt and resume long-running pipelines without losing progress

Installation

From PyPI

pip install mirage-benchmark

From Source

git clone https://github.com/ChandanKSahu/MiRAGE.git
cd MiRAGE
pip install -e .

With Optional Dependencies

pip install mirage-benchmark[pdf]   # PDF processing (docling, matplotlib)
pip install mirage-benchmark[eval]  # Evaluation metrics (ragas)
pip install mirage-benchmark[all]   # All pip dependencies

GPU Support (FAISS-GPU)

For GPU-accelerated similarity search, install FAISS-GPU via conda:

# Create conda environment (recommended)
conda create -n mirage python=3.11
conda activate mirage

# Install FAISS-GPU
conda install -c pytorch faiss-gpu

# Then install MiRAGE
pip install mirage-benchmark[gpu]

Quick Start

Step 1: Set Up API Key

Choose one of the following backends:

Option A: Google Gemini (Recommended)

export GEMINI_API_KEY="your-gemini-api-key"

Option B: OpenAI

export OPENAI_API_KEY="your-openai-api-key"

Option C: Local Ollama (No API key needed)

# Install and start Ollama
ollama serve
ollama pull llama3

Step 2: Prepare Your Data

Place your documents in a folder:

mkdir -p data/my_documents
cp /path/to/your/*.pdf data/my_documents/

Step 3: Run MiRAGE

# Using Gemini (default backend) - API key from environment
export GEMINI_API_KEY="your-gemini-key"
python run_mirage.py --input data/my_documents --output output/my_dataset

# Using Gemini with API key as argument
python run_mirage.py -i data/my_documents -o output/my_dataset --backend gemini --api-key YOUR_GEMINI_KEY

# Using OpenAI
python run_mirage.py -i data/my_documents -o output/my_dataset --backend openai --api-key YOUR_OPENAI_KEY

# Using local Ollama (no API key needed)
python run_mirage.py -i data/my_documents -o output/my_dataset --backend ollama

Note: When using --api-key, always specify --backend to indicate which service the key is for.

Step 4: Check Results

ls output/my_dataset/
# qa_multihop_pass.json  - Generated QA pairs (always created)
# chunks.json            - Semantic chunks (always created)

# Optional outputs (if --deduplication and --evaluation flags used):
# qa_deduplicated.json   - Deduplicated QA pairs (with --deduplication)
# evaluation_report.json - Quality metrics (with --evaluation)

Usage

Basic Usage (QA Generation Only)

By default, MiRAGE runs the core pipeline: document processing, chunking, embedding, and QA generation/verification. Deduplication and evaluation are OFF by default.

# Default: Generates QA pairs without deduplication or evaluation
python run_mirage.py --input <INPUT_DIR> --output <OUTPUT_DIR>

With Deduplication

To merge similar QA pairs and remove duplicates:

python run_mirage.py -i data/documents -o output/results --deduplication
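MiRAGE's actual deduplication lives in src/mirage/pipeline/deduplication.py; the general idea can be illustrated with a dependency-free sketch that keeps the first question in each group of near-duplicates, here using token-overlap (Jaccard) similarity with an illustrative threshold — the function names and threshold below are not MiRAGE's API:

```python
import re

def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two questions."""
    sa, sb = set(re.findall(r"\w+", a.lower())), set(re.findall(r"\w+", b.lower()))
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def deduplicate(qa_pairs, threshold=0.8):
    """Keep the first member of each group of near-duplicate questions."""
    kept = []
    for qa in qa_pairs:
        if all(jaccard(qa["question"], k["question"]) < threshold for k in kept):
            kept.append(qa)
    return kept

pairs = [
    {"question": "What is the company's revenue growth?"},
    {"question": "What is the company's revenue growth"},   # near-duplicate
    {"question": "Who audits the financial statements?"},
]
print(len(deduplicate(pairs)))  # → 2
```

A production pipeline would typically compare embedding vectors rather than raw token sets, but the keep-first-above-threshold structure is the same.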

With Evaluation Metrics

To compute quality metrics (faithfulness, relevancy, etc.):

python run_mirage.py -i data/documents -o output/results --evaluation

Full Pipeline (Deduplication + Evaluation)

python run_mirage.py -i data/documents -o output/results --deduplication --evaluation

With All Options

python run_mirage.py \
    --input data/documents \
    --output output/results \
    --backend gemini \
    --api-key YOUR_GEMINI_KEY \
    --num-qa-pairs 100 \
    --max-workers 4 \
    --deduplication \
    --evaluation \
    --verbose

Backend Options:

  • gemini (default) - Requires GEMINI_API_KEY or --api-key
  • openai - Requires OPENAI_API_KEY or --api-key
  • ollama - No API key needed (runs locally)

Pipeline Steps:

| Step | Description | Default |
|------|-------------|---------|
| 1. Document Processing | PDF/HTML to Markdown | Mandatory |
| 2. Chunking | Semantic chunking | Mandatory |
| 3. Embedding | FAISS index creation | Mandatory |
| 4. Domain Detection | Expert persona extraction | Mandatory |
| 5. QA Generation | Multi-hop QA with verification | Mandatory |
| 6. Deduplication | Merge similar QA pairs | OFF (use --deduplication) |
| 7. Evaluation | Quality metrics | OFF (use --evaluation) |

Run Preflight Checks

Before running the full pipeline, verify your setup:

python run_mirage.py --preflight

Using Sample Dataset

A sample dataset is included for testing:

# Unzip sample data
unzip data/FinanceAnnualReports.zip -d data/sample/

# Run on sample
python run_mirage.py -i data/sample -o output/sample_results

API Keys Setup

Google Gemini

  1. Get API key from: https://makersuite.google.com/app/apikey
  2. Set environment variable:
export GEMINI_API_KEY="your-key-here"

Or create a file:

mkdir -p ~/.config/gemini
echo "your-key-here" > ~/.config/gemini/api_key.txt

OpenAI

  1. Get API key from: https://platform.openai.com/api-keys
  2. Set environment variable:
export OPENAI_API_KEY="your-key-here"

Ollama (Local - Free)

No API key needed! Just install Ollama:

# Install
curl -fsSL https://ollama.com/install.sh | sh

# Start server
ollama serve

# Pull models
ollama pull llama3      # For text
ollama pull llava       # For vision

Configuration

Using config.yaml

Copy the example config and customize:

cp config.yaml.example config.yaml

Edit config.yaml:

backend:
  active: GEMINI  # GEMINI, OPENAI, or OLLAMA
  
  gemini:
    api_key_path: ~/.config/gemini/api_key.txt
    llm_model: gemini-2.0-flash
    vlm_model: gemini-2.0-flash
    
  openai:
    api_key_path: ~/.config/openai/api_key.txt
    llm_model: gpt-4o
    vlm_model: gpt-4o
    
  ollama:
    base_url: http://localhost:11434
    llm_model: llama3
    vlm_model: llava

paths:
  input_pdf_dir: data/documents
  output_dir: output/results

qa_generation:
  target_qa_pairs: 100
  max_workers: 4

Then run:

python run_mirage.py --config config.yaml

Cost Optimization

MiRAGE uses LLM/VLM APIs extensively. Two operations consume the most tokens:

1. Document Processing (PDF/HTML → Markdown → Chunks)

Cost: High (processes every page with VLM for image/table extraction)

Recommendation:

  • Only process documents once on a curated set of relevant files
  • Use --skip-pdf-processing and --skip-chunking on subsequent runs
  • Pre-filter documents to remove irrelevant content before running MiRAGE

# First run: Process and chunk documents
python run_mirage.py -i data/documents -o output/results

# Subsequent runs: Skip processing, only generate QA
python run_mirage.py -i data/documents -o output/results --skip-pdf-processing --skip-chunking

2. Multi-hop Context Building

Cost: High (recursive LLM calls to expand context at each depth level)

Recommendation:

  • Default is now max_depth: 2 (previously 5)
  • Higher depths exponentially increase token usage with diminishing returns
  • Depth 2 captures most meaningful cross-document relationships

# config.yaml
context:
  max_depth: 2  # Recommended (current default; earlier releases defaulted to 5)
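The exponential growth is easy to see with a back-of-envelope estimate: if each expansion step retrieves roughly b additional chunks per existing chunk, the total context after depth d is on the order of 1 + b + b² + … + bᵈ. The branching factor of 3 below is an assumed value for illustration, not a MiRAGE constant:

```python
def chunks_at_depth(branching: int, depth: int) -> int:
    """Total chunks touched by depth-limited expansion: 1 + b + b^2 + ... + b^d."""
    return sum(branching ** d for d in range(depth + 1))

for depth in (2, 5):
    print(depth, chunks_at_depth(3, depth))  # depth 2 → 13 chunks, depth 5 → 364 chunks
```

With these assumptions, depth 5 touches roughly 28× as many chunks (and thus tokens) as depth 2, which is why the lower default pays off.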

Use print_token_stats() or check the pipeline summary to monitor actual token consumption.

Command Line Options

| Option | Short | Description | Default |
|--------|-------|-------------|---------|
| --input | -i | Input directory with documents | Required |
| --output | -o | Output directory for results | Required |
| --api-key | -k | API key for LLM backend | From env |
| --backend | -b | Backend: gemini, openai, ollama | gemini |
| --model | | Model name | Auto |
| --config | -c | Config file path | config.yaml |
| --num-qa-pairs | | Target QA pairs to generate | 10 |
| --max-workers | | Parallel workers | 4 |
| --preflight | | Run preflight checks only | - |
| --skip-preflight | | Skip preflight checks | - |
| --skip-pdf-processing | | Skip PDF conversion | - |
| --skip-chunking | | Skip chunking step | - |
| --verbose | -v | Verbose output | - |
| --version | | Show version | - |
| --help | -h | Show help | - |

Multihop QA Visualization

Explore an interactive visualization of the multihop QA generation process, showing how context chunks are linked through keywords to generate complex questions:

View Interactive Multihop QA Visualization

The visualization demonstrates:

  • Context chunk retrieval and keyword extraction
  • Keyword chain relationships across chunks
  • Iterative retrieval depth progression
  • Final question-answer generation with highlighted concepts

Output Format

Generated Files

output/my_dataset/
├── markdown/              # Converted markdown files
├── chunks.json            # Semantic chunks
├── qa_dataset.json        # Raw QA pairs
├── qa_deduplicated.json   # Final deduplicated QA pairs (with --deduplication)
├── evaluation_report.json # Quality metrics (with --evaluation)
└── run_config.json        # Run configuration

QA Dataset Structure

{
  "chunk_id": 1,
  "question": "What is the company's revenue growth?",
  "answer": "The company achieved 15% revenue growth...",
  "context_chunks": [...],
  "hop_count": 2,
  "relevance_score": "9",
  "difficulty_score": "7",
  "expert_persona": "Financial Analyst",
  "domain": "Finance"
}
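The generated JSON can be consumed directly with the standard library. A minimal sketch, assuming the output file contains a list of records with the fields shown above (the inline sample below stands in for reading, e.g., output/my_dataset/qa_multihop_pass.json):

```python
import json

# Inline sample standing in for json.load(open("output/my_dataset/qa_multihop_pass.json"))
raw = """[
  {"chunk_id": 1, "question": "What is the company's revenue growth?",
   "answer": "The company achieved 15% revenue growth...",
   "hop_count": 2, "domain": "Finance"},
  {"chunk_id": 2, "question": "Who audits the company?",
   "answer": "...", "hop_count": 1, "domain": "Finance"}
]"""
qa_pairs = json.loads(raw)

# Keep only genuinely multi-hop questions
multihop = [qa for qa in qa_pairs if qa.get("hop_count", 1) >= 2]
print(f"{len(multihop)}/{len(qa_pairs)} QA pairs are multi-hop")  # → 1/2
```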


Project Structure

MiRAGE/
├── src/mirage/                    # Main package
│   ├── __init__.py               # Package initialization
│   ├── main.py                   # Pipeline orchestration
│   ├── cli.py                    # Command-line interface
│   ├── core/                     # Core functionality
│   │   ├── config.py             # Configuration management
│   │   ├── llm.py                # LLM/VLM API interfaces + token tracking
│   │   └── prompts.py            # Prompt templates
│   ├── embeddings/               # Embedding models
│   │   ├── models.py             # Embedding model selection
│   │   ├── rerankers_multimodal.py  # VLM-based reranking
│   │   └── rerankers_text.py     # Text-based reranking
│   ├── pipeline/                 # Processing pipeline
│   │   ├── pdf_processor.py      # PDF to Markdown conversion
│   │   ├── chunker.py            # Semantic chunking
│   │   ├── context.py            # Multi-hop context retrieval
│   │   ├── qa_generator.py       # QA generation and verification
│   │   ├── domain.py             # Domain/expert extraction
│   │   └── deduplication.py      # QA deduplication
│   ├── evaluation/               # Evaluation metrics
│   │   ├── metrics.py            # Standard RAGAS metrics
│   │   └── metrics_optimized.py  # Optimized metrics (faster)
│   └── utils/                    # Utilities
│       ├── preflight.py          # System checks
│       ├── stats.py              # Dataset statistics
│       ├── ablation.py           # Ablation studies
│       ├── checkpoint.py         # Checkpoint/resume support
│       ├── llm_cache.py          # LLM response caching
│       ├── visualize_multihop.py # Multihop QA visualization
│       └── visualize_pipeline.py # Pipeline flow visualization
├── data/documents/               # Input documents folder
├── output/                       # Generated results
├── assets/                       # Documentation images
├── config.yaml.example           # Example configuration
├── run_mirage.py                 # Main entry point script
├── setup.py                      # Package installation
├── pyproject.toml                # Package configuration
├── requirements.txt              # Dependencies
├── README.md                     # This file
├── CONTRIBUTING.md               # Contribution guidelines
└── LICENSE                       # Apache 2.0 License

Python API

For programmatic access, you can import and use MiRAGE modules directly:

# Import the main pipeline
from mirage import run_pipeline
# Or import specific components
from mirage.core.llm import call_llm_simple, call_vlm_interweaved
from mirage.pipeline.context import build_complete_context
from mirage.pipeline.qa_generator import generate_qa, verify_qa
from mirage.pipeline.domain import fetch_domain_and_role
from mirage.embeddings.models import NomicVLEmbed, get_best_embedding_model
from mirage.utils.preflight import run_preflight_checks

# Example: Run preflight checks
success, results = run_preflight_checks()

# Example: Call LLM
response = call_llm_simple("What is 2+2?")

# Example: Use embedding model
embedder = NomicVLEmbed()
embedding = embedder.encode("Sample text")

# Example: Track token usage
from mirage.core.llm import get_token_stats, print_token_stats, reset_token_stats

# After running LLM calls, check token usage
stats = get_token_stats()
print(f"Input tokens: {stats['total_input_tokens']}")
print(f"Output tokens: {stats['total_output_tokens']}")

# Print formatted summary
print_token_stats()

# Reset counters for a new run
reset_token_stats()

See the module docstrings for detailed API documentation.

Examples

Generate QA from PDFs

# Using Gemini
export GEMINI_API_KEY="your-key"
python run_mirage.py -i data/pdfs -o output/qa_dataset

# Using OpenAI  
export OPENAI_API_KEY="your-key"
python run_mirage.py -i data/pdfs -o output/qa_dataset --backend openai

# Using Ollama (local, free)
python run_mirage.py -i data/pdfs -o output/qa_dataset --backend ollama

Generate More QA Pairs

python run_mirage.py -i data/documents -o output/large_dataset --num-qa-pairs 500

Use More Workers

python run_mirage.py -i data/documents -o output/fast_run --max-workers 8

Skip Already Processed Steps

# If you already have markdown files
python run_mirage.py -i data/documents -o output/results --skip-pdf-processing

# If you already have chunks
python run_mirage.py -i data/documents -o output/results --skip-chunking

Troubleshooting

API Key Issues

# Check if API key is set
echo $GEMINI_API_KEY

# Set it if missing
export GEMINI_API_KEY="your-key"

Import Errors

# Reinstall package
pip install -e .

Preflight Check Failures

# Run verbose preflight
python run_mirage.py --preflight --verbose

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Submit a pull request

See CONTRIBUTING.md for details.

Citation

@misc{sahu2026miragemultiagentframeworkgenerating,
      title={MiRAGE: A Multiagent Framework for Generating Multimodal Multihop Question-Answer Dataset for RAG Evaluation}, 
      author={Chandan Kumar Sahu and Premith Kumar Chilukuri and Matthew Hetrich},
      year={2026},
      eprint={2601.15487},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2601.15487}, 
}

License

Apache License 2.0 - see LICENSE
