
RAGChecker: A Fine-grained Framework For Diagnosing RAG

RAGChecker Paper    |    Tutorial (English)    |    中文教程

RAGChecker is an advanced automatic evaluation framework designed to assess and diagnose Retrieval-Augmented Generation (RAG) systems. It provides a comprehensive suite of metrics and tools for in-depth analysis of RAG performance.

Figure: RAGChecker Metrics

🌟 Highlighted Features

  • Holistic Evaluation: RAGChecker offers Overall Metrics for assessing the entire RAG pipeline.

  • Diagnostic Metrics: Diagnostic Retriever Metrics analyze the retrieval component, while Diagnostic Generator Metrics evaluate the generation component. These metrics provide valuable insights for targeted improvements.

  • Fine-grained Evaluation: Utilizes claim-level entailment operations for fine-grained evaluation.

  • Benchmark Dataset: A comprehensive RAG benchmark dataset with 4k questions covering 10 domains (upcoming).

  • Meta-Evaluation: A human-annotated preference dataset for evaluating the correlation of RAGChecker's results with human judgments.

RAGChecker empowers developers and researchers to thoroughly evaluate, diagnose, and enhance their RAG systems with precision and depth.

❤️ Citation

RAGChecker paper: https://arxiv.org/pdf/2408.08067

If you use RAGChecker in your work, please cite us:

@misc{ru2024ragcheckerfinegrainedframeworkdiagnosing,
      title={RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation}, 
      author={Dongyu Ru and Lin Qiu and Xiangkun Hu and Tianhang Zhang and Peng Shi and Shuaichen Chang and Jiayang Cheng and Cunxiang Wang and Shichao Sun and Huanyu Li and Zizhao Zhang and Binjie Wang and Jiarong Jiang and Tong He and Zhiguo Wang and Pengfei Liu and Yue Zhang and Zheng Zhang},
      year={2024},
      eprint={2408.08067},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2408.08067}, 
}

🚀 Quick Start

Setup Environment

pip install ragchecker
python -m spacy download en_core_web_sm

Run the Checking Pipeline with CLI

Please prepare your own data in the same format as examples/checking_inputs.json. The only required annotation for each query is the ground-truth answer (gt_answer).

{
  "results": [
    {
      "query_id": "<query id>", # string
      "query": "<input query>", # string
      "gt_answer": "<ground truth answer>", # string
      "response": "<response generated by the RAG generator>", # string
      "retrieved_context": [ # a list of retrieved chunks by the retriever
        {
          "doc_id": "<doc id>", # string, optional
          "text": "<content of the chunk>" # string
        },
        ...
      ]
    },
    ...
  ]
}
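The input file can be assembled with plain Python. A minimal sketch, assuming your own system's outputs live in a list like the hypothetical `rag_outputs` below; only the output keys come from the schema above:

```python
import json

# Hypothetical outputs from your own RAG system; the `rag_outputs` list
# and its "chunks" field are illustrative, not part of ragchecker's API.
rag_outputs = [
    {
        "query_id": "000",
        "query": "What metrics does RAGChecker report?",
        "gt_answer": "RAGChecker reports overall, retriever, and generator metrics.",
        "response": "RAGChecker reports overall metrics such as precision and recall.",
        "chunks": [("doc_0", "RAGChecker provides overall, retriever, and generator metrics.")],
    },
]

results = {
    "results": [
        {
            "query_id": r["query_id"],
            "query": r["query"],
            "gt_answer": r["gt_answer"],
            "response": r["response"],
            # doc_id is optional; text is required for every retrieved chunk
            "retrieved_context": [
                {"doc_id": doc_id, "text": text} for doc_id, text in r["chunks"]
            ],
        }
        for r in rag_outputs
    ]
}

with open("checking_inputs.json", "w") as fp:
    json.dump(results, fp, indent=2)
```

Note that the comments in the schema above are for documentation only; the file you write must be valid JSON, which `json.dump` guarantees here.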

If you are using the AWS Bedrock version of Llama 3 70B for both the claim extractor and the checker, run the checking pipeline with the following command. The checking results, along with intermediate results, will be saved to --output_path:

ragchecker-cli \
    --input_path=examples/checking_inputs.json \
    --output_path=examples/checking_outputs.json \
    --extractor_name=bedrock/meta.llama3-1-70b-instruct-v1:0 \
    --checker_name=bedrock/meta.llama3-1-70b-instruct-v1:0 \
    --batch_size_extractor=64 \
    --batch_size_checker=64 \
    --metrics all_metrics \
    # --disable_joint_check  # uncomment this line for one-by-one checking, slower but slightly more accurate

Please refer to RefChecker's guidance for setting up the extractor and checker models.

It will output the metric values as follows:

Results for examples/checking_outputs.json:
{
  "overall_metrics": {
    "precision": 73.3,
    "recall": 62.5,
    "f1": 67.3
  },
  "retriever_metrics": {
    "claim_recall": 61.4,
    "context_precision": 87.5
  },
  "generator_metrics": {
    "context_utilization": 87.5,
    "noise_sensitivity_in_relevant": 22.5,
    "noise_sensitivity_in_irrelevant": 0.0,
    "hallucination": 4.2,
    "self_knowledge": 25.0,
    "faithfulness": 70.8
  }
}
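The overall metrics are derived from claim-level entailment, as described in the RAGChecker paper: precision is the fraction of response claims entailed by the ground-truth answer, and recall is the fraction of ground-truth claims entailed by the response. An illustrative sketch of the arithmetic (not the library's internals):

```python
# Illustrative sketch of the overall metrics (not ragchecker's internals).
# response_claims_correct: for each claim extracted from the response,
#   True if it is entailed by the ground-truth answer.
# gt_claims_covered: for each claim extracted from the ground-truth answer,
#   True if it is entailed by the response.
def overall_metrics(response_claims_correct, gt_claims_covered):
    precision = sum(response_claims_correct) / len(response_claims_correct)
    recall = sum(gt_claims_covered) / len(gt_claims_covered)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # report as percentages, matching the CLI output above
    return {
        "precision": round(precision * 100, 1),
        "recall": round(recall * 100, 1),
        "f1": round(f1 * 100, 1),
    }

print(overall_metrics([True, True, False], [True, False]))
# -> {'precision': 66.7, 'recall': 50.0, 'f1': 57.1}
```

The retriever and generator metrics follow the same pattern, attributing each claim to the retrieved context as well as to the answers.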

Run the Checking Pipeline with Python

from ragchecker import RAGResults, RAGChecker
from ragchecker.metrics import all_metrics


# initialize RAGResults from JSON (or a dict)
with open("examples/checking_inputs.json") as fp:
    rag_results = RAGResults.from_json(fp.read())

# set up the evaluator
evaluator = RAGChecker(
    extractor_name="bedrock/meta.llama3-1-70b-instruct-v1:0",
    checker_name="bedrock/meta.llama3-1-70b-instruct-v1:0",
    batch_size_extractor=32,
    batch_size_checker=32
)

# evaluate results with selected metrics or metric groups, e.g., retriever_metrics, generator_metrics, all_metrics
evaluator.evaluate(rag_results, all_metrics)
print(rag_results)

"""Output
RAGResults(
  2 RAG results,
  Metrics:
  {
    "overall_metrics": {
      "precision": 76.4,
      "recall": 62.5,
      "f1": 68.3
    },
    "retriever_metrics": {
      "claim_recall": 61.4,
      "context_precision": 87.5
    },
    "generator_metrics": {
      "context_utilization": 87.5,
      "noise_sensitivity_in_relevant": 19.1,
      "noise_sensitivity_in_irrelevant": 0.0,
      "hallucination": 4.5,
      "self_knowledge": 27.3,
      "faithfulness": 68.2
    }
  }
)
"""

Meta-Evaluation

Please refer to data/meta_evaluation for the meta-evaluation of RAGChecker's effectiveness.

Work with LlamaIndex

RAGChecker integrates with LlamaIndex, allowing LlamaIndex users to leverage RAGChecker's comprehensive metrics to evaluate and improve their RAG applications. For detailed instructions, please refer to the LlamaIndex documentation on RAGChecker integration.

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.

