A RAG-based benchmark for multilingual question answering.

MIRAGE-BENCH: Benchmarking LLM Generation Across Multiple Languages

This repository provides an easy way to achieve the following four objectives:

Generate RAG-based answers to multilingual questions, with support for many open-source LLMs integrated via vLLM, as well as closed-source LLMs through APIs such as Azure OpenAI, Cohere, Anthropic, etc.
Evaluate multilingual RAG answers based on a variety of heuristic features (e.g., support, fluency) or automatic evaluations using open-source LLMs supported in vLLM.
Use an LLM-as-a-Judge setup to compare multilingual RAG answers pairwise and train a Bradley-Terry model (with bootstrapping) to build an offline multilingual RAG arena.
Train a surrogate judge (linear regression) to learn from and bootstrap the expensive LLM-as-a-Judge approach using heuristic features.
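
As a rough illustration of the fourth objective, the sketch below fits a linear regression that maps heuristic feature values to judge-derived scores. It is not the package's implementation: the feature names (`support`, `fluency`), the toy numbers, the helper names, and the use of plain gradient descent are all assumptions made for this example.

```python
def fit_linear_regression(features, targets, lr=0.1, n_steps=5000):
    """Fit y ~ w.x + b by batch gradient descent on the halved mean squared error."""
    n, d = len(features), len(features[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(n_steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(features, targets):
            err = sum(w * xi for w, xi in zip(weights, x)) + bias - y
            for k in range(d):
                grad_w[k] += err * x[k] / n
            grad_b += err / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


def predict(weights, bias, x):
    """Score one model from its heuristic feature vector."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


# Hypothetical toy data: one row per model, columns = [support, fluency].
features = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
# Hypothetical judge-derived scores, generated here as 1.5*support + 0.5*fluency - 0.2.
targets = [-0.2, 1.3, 0.3, 1.8, 0.8]

weights, bias = fit_linear_regression(features, targets)
# Estimate the score of a model that was never sent to the LLM judge.
unseen_model_score = predict(weights, bias, [0.8, 0.6])
```

Once fitted, such a regressor can score a new model from its heuristic features alone, which is what makes it cheaper than running the LLM judge again.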

For more information, check out our publication:

MIRAGE-Bench: Automatic Multilingual Benchmark Arena for Retrieval-Augmented Generation Systems (accepted at the NAACL 2025 Main Conference)

Installation

We recommend Python 3.9+ and installing the latest version of vLLM.

Install with pip:

pip install -U mirage-bench

Install from sources

Alternatively, you can also clone the latest version from the repository and install it directly from the source code:

pip install -e .

Datasets

Resource	Description
:hugs: mirage-bench	All queries & input prompts available in MIRAGE-Bench
:hugs: mirage-bench-output	Pre-computed RAG answers and all feature scores for 21 models
:hugs: mirage-bench-pairwise-judgments	Pairwise judgments using GPT-4o LLM judge across all 19 models

Getting Started

Make sure you have the latest vLLM installed correctly.

1. Multilingual RAG Answer Generation

Generate RAG answers for the multilingual queries in mirage-bench using an LLM, here through the Azure OpenAI client.

Hugging Face models can be used in the same way, on a single GPU or multiple GPUs, through vLLM.

# export AZURE_OPENAI_ENDPOINT="xxxxx"
# export AZURE_OPENAI_API_KEY="xxxx"

from mirage_bench import util
from mirage_bench.generate import AzureOpenAIClient

# Many other clients also available, e.g., Cohere or Anthropic
client = AzureOpenAIClient(model_name_or_path="gpt-4o-mini")

# prompts_dict maps each query_id (key) to its prompt (value)
prompts_dict = util.load_prompts(
    dataset_name="nthakur/mirage-bench", 
    language_code="en", # or "ar", "bn" ... 18 languages supported
    split="dev" # only dev split is available in mirage-bench
) 
query_ids = list(prompts_dict.keys())
outputs = client.batch_call(
    prompts=list(prompts_dict.values()),
    temperature=0.1,
    max_new_tokens=2048,
)
# outputs contains the list of RAG outputs
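
A small sketch of how the generated answers might be tied back to their queries and serialized. It assumes that `batch_call` returns outputs in the same order as the prompts it was given; the helper names `pair_outputs` and `to_jsonl` and the record layout are hypothetical and not part of mirage-bench.

```python
import json


def pair_outputs(query_ids, outputs):
    """Map each query_id to its generated answer, assuming both lists share one order."""
    if len(query_ids) != len(outputs):
        raise ValueError("query_ids and outputs must have the same length")
    return dict(zip(query_ids, outputs))


def to_jsonl(predictions):
    """Serialize predictions as one JSON object per line."""
    # ensure_ascii=False keeps non-Latin scripts readable in the output.
    return "\n".join(
        json.dumps({"query_id": qid, "output": answer}, ensure_ascii=False)
        for qid, answer in predictions.items()
    )


# Stand-in data in place of real query ids and model outputs.
example = pair_outputs(["q1", "q2"], ["Answer one [1].", "উত্তর দুই [2]."])
jsonl_text = to_jsonl(example)
```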

2. Heuristic & Automatic RAG Evaluation

After generating RAG answers, we evaluate the quality of the response using heuristic features:

from mirage_bench import util
from mirage_bench.evaluate import RougeBleuEvaluator

evaluator = RougeBleuEvaluator(language_code="en")

# Load the documents (relevant & non-relevant)
documents = util.load_documents(
    dataset_name="nthakur/mirage-bench", 
    language_code="en", 
    split="dev"
)

# Load the multilingual RAG predictions available for 20+ models.
# In this example, we are evaluating: meta-llama/Meta-Llama-3-8B-Instruct
predictions = util.load_predictions(
    dataset_name="nthakur/mirage-bench-output",
    language_code="en",
    split="dev",
    model_name="meta-llama/Meta-Llama-3-8B-Instruct",
)

# Load the reference model's predictions, which serve as the ground truth.
# Not every heuristic feature requires this step.
reference_predictions = util.load_predictions(
    dataset_name="nthakur/mirage-bench-output",
    language_code="en",
    split="dev",
    model_name="gpt-4-azure",
)

# Evaluate the predictions
scores = evaluator.evaluate(
    predictions=predictions, 
    reference_predictions=reference_predictions, 
    documents=documents
)
# => query_id: {"answer_bleu": 0.9, "answer_rougeL": 0.75}
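
To compare models, the per-query scores usually need to be reduced to one number per feature. The sketch below averages them, assuming `scores` is a dictionary from query id to a dictionary of feature values, as the comment above suggests; the helper name `average_scores` and the sample values are made up for this example.

```python
from statistics import mean


def average_scores(scores):
    """Average every feature over all queries that report it."""
    feature_names = sorted({name for per_query in scores.values() for name in per_query})
    return {
        name: mean(per_query[name] for per_query in scores.values() if name in per_query)
        for name in feature_names
    }


# Stand-in scores in the shape assumed above.
example_scores = {
    "q1": {"answer_bleu": 0.9, "answer_rougeL": 0.75},
    "q2": {"answer_bleu": 0.5, "answer_rougeL": 0.25},
}
summary = average_scores(example_scores)
```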

3. LLM-as-a-Judge Pairwise Evaluation

After generating RAG answers, we can also use an LLM as a judge to compare two RAG outputs and decide which one is better.

from mirage_bench import util
from mirage_bench.evaluate import PairwiseLLMJudgeEvaluator

evaluator = PairwiseLLMJudgeEvaluator(
    client="azure_openai",
    model_name_or_path="gpt-4o-mini"
)

# Load the documents (relevant & non-relevant)
documents = util.load_documents(
    dataset_name="nthakur/mirage-bench", 
    language_code="en", 
    split="dev"
)
queries = util.load_queries(
    dataset_name="nthakur/mirage-bench", 
    language_code="en", 
    split="dev"
)

# In this example we will evaluate two models:
models = [
    "meta-llama/Meta-Llama-3-8B-Instruct",
    "meta-llama/Meta-Llama-3-70B-Instruct"
]

# Collect the predictions of each model, keyed by model name
predictions = {}
for model_name in models:
    predictions[model_name] = util.load_predictions(
        dataset_name="nthakur/mirage-bench-output",
        language_code="en",
        split="dev",
        model_name=model_name,
    )

scores = evaluator.evaluate(
    predictions=predictions,
    all_model_names=models, # provide all model names
    documents=documents,
    queries=queries
)
# Important: model_A and model_B are randomly swapped
# => [{"query_id": 1, 
#      "judge": "gpt-4o-mini", 
#      "model_A": "meta-llama/Meta-Llama-3-8B-Instruct", 
#      "model_B": "meta-llama/Meta-Llama-3-70B-Instruct", 
#      "output": <judge_output>,
#      "verdict": A/B/Tie.
#    }]
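
Pairwise verdicts in this shape can be turned into a ranking with a Bradley-Terry model and bootstrapping. The sketch below is one standard way to do that, not the package's own code: it assumes verdicts are the strings "A", "B", or "Tie", counts a tie as half a win for each side, fits strengths with the classic minorization-maximization update, and resamples judgments with replacement. The function names and toy judgments are hypothetical.

```python
import random
from collections import defaultdict


def fit_bradley_terry(judgments, n_iters=200):
    """Return Bradley-Terry strengths (normalized to sum to 1) for every model."""
    wins = defaultdict(float)  # (winner, loser) -> win count; ties add 0.5 each way
    models = set()
    for item in judgments:
        a, b = item["model_A"], item["model_B"]
        models.update((a, b))
        if item["verdict"] == "A":
            wins[(a, b)] += 1.0
        elif item["verdict"] == "B":
            wins[(b, a)] += 1.0
        else:  # assumed to be "Tie"
            wins[(a, b)] += 0.5
            wins[(b, a)] += 0.5

    models = sorted(models)
    strength = {m: 1.0 for m in models}
    for _ in range(n_iters):
        updated = {}
        for i in models:
            total_wins = sum(wins.get((i, j), 0.0) for j in models if j != i)
            denom = sum(
                (wins.get((i, j), 0.0) + wins.get((j, i), 0.0)) / (strength[i] + strength[j])
                for j in models if j != i
            )
            # Small floor keeps a model with zero wins from collapsing to exactly 0.
            updated[i] = max(total_wins, 1e-9) / denom if denom > 0 else strength[i]
        norm = sum(updated.values())
        strength = {m: s / norm for m, s in updated.items()}
    return strength


def bootstrap_bradley_terry(judgments, n_rounds=100, seed=0):
    """Refit on resampled judgments; return (mean, low, high) strength per model."""
    rng = random.Random(seed)
    samples = defaultdict(list)
    for _ in range(n_rounds):
        resampled = rng.choices(judgments, k=len(judgments))
        for model, value in fit_bradley_terry(resampled).items():
            samples[model].append(value)
    result = {}
    for model, values in samples.items():
        values.sort()
        low = values[int(0.025 * (len(values) - 1))]
        high = values[int(0.975 * (len(values) - 1))]
        result[model] = (sum(values) / len(values), low, high)
    return result


# Toy judgments: "strong" wins 8 of 10 comparisons, appearing on either side.
toy = (
    [{"model_A": "strong", "model_B": "weak", "verdict": "A"}] * 7
    + [{"model_A": "weak", "model_B": "strong", "verdict": "B"}] * 1
    + [{"model_A": "strong", "model_B": "weak", "verdict": "B"}] * 2
)
point_estimate = fit_bradley_terry(toy)
arena = bootstrap_bradley_terry(toy)
```

The low and high values give an approximate 95% percentile interval, which indicates how stable the ranking is under resampling of the judgments.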

Application Examples

You can use this framework for:

Citing & Authors

This work was done in collaboration between Vectara and the University of Waterloo.

If you find this repository helpful, feel free to cite our publication MIRAGE-Bench: Automatic Multilingual Benchmark Arena for Retrieval-Augmented Generation Systems:

@article{thakur-mirage-bench:2024,
  author       = {Nandan Thakur and
                  Suleman Kazi and
                  Ge Luo and
                  Jimmy Lin and
                  Amin Ahmad},
  title        = {MIRAGE-Bench: Automatic Multilingual Benchmark Arena for Retrieval-Augmented
                  Generation Systems},
  journal      = {CoRR},
  volume       = {abs/2410.13716},
  year         = {2024},
  url          = {https://doi.org/10.48550/arXiv.2410.13716},
  doi          = {10.48550/ARXIV.2410.13716},
  eprinttype    = {arXiv},
  eprint       = {2410.13716},
  timestamp    = {Wed, 27 Nov 2024 09:01:16 +0100},
  biburl       = {https://dblp.org/rec/journals/corr/abs-2410-13716.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Maintainer: Nandan Thakur, PhD Student @ University of Waterloo

Don't hesitate to open an issue if something is broken (and it shouldn't be) or if you have further questions.

This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.

Release history

0.0.4: Mar 31, 2025
0.0.3: Mar 31, 2025
0.0.2: Mar 31, 2025 (this version)
0.0.1: Mar 28, 2025

