RAGElo: A Tool for Evaluating Retrieval-Augmented Generation Models

Project description

Elo-based RAG Agent evaluator

RAGElo[^1] is a streamlined toolkit for evaluating Retrieval-Augmented Generation (RAG)-powered Large Language Model (LLM) question-answering agents using the Elo rating system.

While it has become easier to prototype generative LLMs and put them into production, evaluation is still the most challenging part of the solution. Comparing the outputs of multiple prompt and pipeline variations against a "gold standard" is not easy. Still, we can ask a powerful LLM to judge between pairs of answers to a set of questions.

This led us to develop a simple tool for tournament-style Elo ranking of LLM outputs. By comparing answers from different RAG pipelines and prompts over multiple questions, RAGElo computes a ranking of the different settings, providing a good overview of what works (and what doesn't).

⚙️ Installation

To use RAGElo as a Python library or as a CLI tool, install it with pip:

pip install ragelo

🚀 Library Quickstart

To use RAGElo as a library, all you need to do is import it, initialize an Evaluator, and call either evaluate() to evaluate a single retrieved document or LLM answer, or batch_evaluate() to evaluate multiple responses at once. For example, here is the RDNAM retrieval evaluator from Thomas et al. (2023), a paper on using GPT-4 to annotate retrieval results:

from ragelo import get_retrieval_evaluator

evaluator = get_retrieval_evaluator("RDNAM", llm_provider="openai")
raw_answer, processed_answer = evaluator.evaluate(query="What is the capital of France?", document='Lyon is the second largest city in France.')
print(processed_answer)
# Output: 1
print(raw_answer)
# Output: '"O": 1\n}'

For a more complete example, we can evaluate with a custom prompt and inject metadata into the evaluation prompt:

from ragelo import get_retrieval_evaluator

prompt = """You are a helpful assistant for evaluating the relevance of a retrieved document to a user query.
You should pay extra attention to how **recent** a document is. A document older than 5 years is considered outdated.

The answer should be evaluated according to its recency, truthfulness, and relevance to the user query.

User query: {q}

Retrieved document: {d}

The document has a date of {document_date}.
Today is {today_date}.

WRITE YOUR ANSWER ON A SINGLE LINE AS A JSON OBJECT WITH THE FOLLOWING KEYS:
- "relevance": 0 if the document is irrelevant, 1 if it is relevant.
- "recency": 0 if the document is outdated, 1 if it is recent.
- "truthfulness": 0 if the document is false, 1 if it is true.
- "reasoning": A short explanation of why you think the document is relevant or irrelevant.
"""

evaluator = get_retrieval_evaluator(
    "custom_prompt", # name of the retrieval evaluator
    llm_provider="openai", # Which LLM provider to use
    prompt=prompt, # your custom prompt
    query_placeholder="q", # the placeholder for the query in the prompt
    document_placeholder="d", # the placeholder for the document in the prompt
    scoring_keys_retrieval_evaluator=["relevance", "recency", "truthfulness", "reasoning"], # Which keys to extract from the answer
    answer_format_retrieval_evaluator="multi_field_json", # The format of the answer. In this case, a JSON object with multiple fields
)

raw_answer, answer = evaluator.evaluate(
    query="What is the capital of Brazil?", # The user query
    document="Rio de Janeiro is the capital of Brazil.", # The retrieved document
    query_metadata={"today_date": "08-04-2024"}, # Some metadata for the query
    doc_metadata={"document_date": "04-03-1950"}, # Some metadata for the document
)

print(answer)
# Output:
# {'relevance': 0,
#  'recency': 0,
#  'truthfulness': 0,
#  'reasoning': 'The document is outdated and incorrect. Rio de Janeiro was the capital of Brazil until 1960 when it was changed to Brasília.'}

Note that, in this example, we passed two dictionaries with metadata for the query and the document to the evaluate method. This metadata is injected into the prompt by matching its keys to the placeholders in the prompt.
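
Conceptually, the substitution works like Python's str.format: each metadata key must match a placeholder name used in the prompt. The snippet below is only an illustration of that idea, not RAGElo's internal code:

# Illustration only (not RAGElo's internals): metadata keys must match
# the placeholder names in the prompt, like str.format fields.
prompt_template = (
    "User query: {q}\n"
    "Retrieved document: {d}\n"
    "The document has a date of {document_date}. Today is {today_date}."
)

fields = {
    "q": "What is the capital of Brazil?",
    "d": "Rio de Janeiro is the capital of Brazil.",
    "document_date": "04-03-1950",  # from doc_metadata in the example above
    "today_date": "08-04-2024",     # from query_metadata in the example above
}

print(prompt_template.format(**fields))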

Other examples are available as notebooks in the docs/examples/notebooks folder of the repository.

🚀 CLI Quickstart

After installing RAGElo as a CLI app, you can run it with the following command:

ragelo run-all queries.csv documents.csv answers.csv --verbose --data-dir tests/data/

---------- Agent Scores by Elo ranking ----------
 agent1        : 1026.7
 agent2        : 973.3

When running as a CLI, RAGElo expects its inputs as CSV files: one with the user queries, one with the documents retrieved by the retrieval system, and one with the answers each agent produced. Here are some examples of the expected format (a sketch for generating these files programmatically follows the examples):

queries.csv:

qid,query
0, What is the capital of Brazil?
1, What is the capital of France?

documents.csv:

qid,did,document_text
0,0, Brasília is the capital of Brazil.
0,1, Rio de Janeiro used to be the capital of Brazil.
1,2, Paris is the capital of France.
1,3, Lyon is the second largest city in France.

answers.csv:

qid,agent,answer
0, agent1,"Brasília is the capital of Brazil, according to [0]."
0, agent2,"According to [1], Rio de Janeiro used to be the capital of Brazil, until the 60s."
1, agent1,"Paris is the capital of France, according to [2]."
1, agent2,"According to [3], Lyon is the second largest city in France. Meanwhile, Paris is its capital [2]."

🧩 Components

RAGElo can be used either as an end-to-end tool or by calling its individual CLI components.

📜 retrieval-evaluator

The retrieval-evaluator tool annotates retrieved documents based on their relevance to the user query, regardless of the answers provided by any agent. As an example, to call the Reasoner retrieval evaluator (which only outputs the reasoning for why a document is relevant or not), we can use:

ragelo retrieval-evaluator reasoner queries.csv documents.csv output.csv --verbose --data-dir tests/data/

The output file changes according to the evaluator used. In general, it will have one row per evaluated document, with the query id, document id, the raw LLM answer, and the parsed answer. An example of the Reasoner's output is available at tests/data/reasonings.csv.
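
To inspect the results programmatically, the output CSV can be loaded with the standard csv module; the exact column names depend on the evaluator used, so the sketch below simply prints whatever is there:

import csv

# Assumes output.csv was produced by the retrieval-evaluator command above.
with open("output.csv", newline="") as f:
    rows = list(csv.DictReader(f))

print(list(rows[0].keys()))  # available columns (vary by evaluator)
for row in rows[:3]:         # first few evaluated documents
    print(row)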

💬 answers-annotator

The answers-annotator tool annotates the answers generated by the agents, taking into account the quality of the documents retrieved by the retrieval pipeline. By default, it uses the Pairwise annotator, which generates k random pairs of answers for each query and chooses the best answer based on the relevant documents cited in each answer. It relies on the reasonings generated by the Reasoner retrieval-evaluator.

ragelo answer-evaluator pairwise-reasoning queries.csv documents.csv answers.csv --games-evaluations-file pairwise_answers_evaluations.csv --verbose --data-dir tests/data/

The pairwise_answers_evaluations.csv file is a CSV file with both the raw answer and the parsed result of each "game" played between a pair of agents. An example output file is provided at tests/data/pairwise_answers_evaluations.csv
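
For intuition, the k random pairs of answers mentioned above can be sampled per query roughly like this (illustrative only, not RAGElo's internal code):

import itertools
import random

def sample_games(agents: list[str], k: int) -> list[tuple[str, str]]:
    """Sample up to k random agent pairs ("games") for a single query."""
    pairs = list(itertools.combinations(agents, 2))
    random.shuffle(pairs)
    return pairs[:k]

print(sample_games(["agent1", "agent2", "agent3"], k=2))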

🏆 agents-ranker

Finally, the agents-ranker tool ranks the agents by simulating an Elo tournament in which the outcome of each game is given by the answers-annotator's evaluations:

ragelo agents-ranker elo pairwise_answers_evaluations.csv --agents-evaluations-file agents_ranking.csv --verbose --data-dir tests/data/

The output of this step is written to agents_ranking.csv, with columns agent and score. An example is available at tests/data/agents_ranking.csv.
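
The Elo mechanism behind the ranking is simple: every agent starts from the same rating, and each game moves the winner up and the loser down by an amount that depends on how surprising the result was. A generic sketch (the initial rating of 1000 and K-factor of 32 are illustrative assumptions, not necessarily RAGElo's defaults):

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that a player rated r_a beats one rated r_b under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def play(ratings: dict[str, float], winner: str, loser: str, k: float = 32.0) -> None:
    """Update both ratings in place after a single game."""
    e_winner = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - e_winner)  # actual score 1, expected e_winner
    ratings[loser] -= k * (1.0 - e_winner)   # symmetric update for the loser

ratings = {"agent1": 1000.0, "agent2": 1000.0}
play(ratings, winner="agent1", loser="agent2")
print(ratings)  # agent1 ends above 1000, agent2 below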

🙋 Contributing

To install the development dependencies, download the repository and run the following:

git clone https://github.com/zeta-alpha/ragelo && cd ragelo
pip install -e '.[dev]'

This will install the required dependencies in editable mode (i.e., changes to the code take effect without reinstalling). To build a new version, use the build command:

python -m build

✅ TODO

  • Add CI/CD for publishing
  • Add full documentation of all implemented Evaluators
  • Add option for few-shot examples (undocumented, for now)
  • Testing!
  • Publish on PyPi
  • Add more document evaluators (Microsoft)
  • Split Elo evaluator
  • Install as standalone CLI

[^1]: The RAGElo logo was created using Dall-E 3 and GPT-4 with the following prompt: "Vector logo design for a toolkit named 'RAGElo'. The logo should have bold, modern typography with emphasis on 'RAG' in a contrasting color. Include a minimalist icon symbolizing retrieval or ranking."

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ragelo-0.1.8.tar.gz (56.0 kB)

Uploaded Source

Built Distribution

ragelo-0.1.8-py3-none-any.whl (70.3 kB)

Uploaded Python 3

File details

Details for the file ragelo-0.1.8.tar.gz.

File metadata

  • Download URL: ragelo-0.1.8.tar.gz
  • Upload date:
  • Size: 56.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.1.0 CPython/3.12.5

File hashes

Hashes for ragelo-0.1.8.tar.gz

  • SHA256: ec89c31109614d01afa7fd73835d3f8e0acb3bf87c71cff80cdec057a1bc65ff
  • MD5: 2c9604d36c9620de2352ac39bbfa071d
  • BLAKE2b-256: 7f824b5184aa566aaf3275ed3df21e1dd8edec288eaaeb1fedde529fa88e1517

See more details on using hashes here.

File details

Details for the file ragelo-0.1.8-py3-none-any.whl.

File metadata

  • Download URL: ragelo-0.1.8-py3-none-any.whl
  • Upload date:
  • Size: 70.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.1.0 CPython/3.12.5

File hashes

Hashes for ragelo-0.1.8-py3-none-any.whl

  • SHA256: 4fd17ec79dd6f9d79a81fc84f3c3b087d83bad0ba84b6997964e5c4d6a21a26f
  • MD5: e0518d7b64e3380619a07b41ddcf018b
  • BLAKE2b-256: 5e4770892470e0d2726c156b3e58a186d27d1cc1474e264dfafee929ea25b3e4

See more details on using hashes here.
