
<h1 align="center">
<img style="vertical-align:middle" height="200"
src="./docs/assets/logo.png">
</h1>
<p align="center">
<i>SOTA metrics for evaluating Retrieval Augmented Generation (RAG)</i>
</p>

<p align="center">
<a href="https://github.com/explodinggradients/ragas/releases">
<img alt="GitHub release" src="https://img.shields.io/github/release/explodinggradients/ragas.svg">
</a>
<a href="https://www.python.org/">
<img alt="Build" src="https://img.shields.io/badge/Made%20with-Python-1f425f.svg?color=purple">
</a>
<a href="https://github.com/explodinggradients/ragas/blob/master/LICENSE">
<img alt="License" src="https://img.shields.io/github/license/explodinggradients/ragas.svg?color=green">
</a>
<a href="https://colab.research.google.com/drive/1HfutiEhHMJLXiWGT8pcipxT5L2TpYEdt?usp=sharing">
<img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg">
</a>
<a href="https://github.com/explodinggradients/ragas/">
<img alt="Downloads" src="https://badges.frapsoft.com/os/v1/open-source.svg?v=103">
</a>
</p>

<h4 align="center">
<p>
<a href="#shield-installation">Installation</a> |
<a href="#fire-quickstart">Quickstart</a> |
<a href="#luggage-metrics">Metrics</a> |
<a href="https://huggingface.co/explodinggradients">Hugging Face</a>
</p>
</h4>

ragas is a framework that helps you evaluate your Retrieval Augmented Generation (RAG) pipelines. RAG denotes a class of LLM applications that use external data to augment the LLM's context. There are existing tools and frameworks that help you build these pipelines, but evaluating them and quantifying your pipeline's performance can be hard. This is where ragas (RAG Assessment) comes in.

ragas provides you with tools based on the latest research for evaluating LLM-generated text, giving you insights about your RAG pipeline. ragas can be integrated into your CI/CD pipeline to provide continuous checks and ensure performance.

## :shield: Installation

```bash
pip install ragas
```
If you want to install from source:
```bash
git clone https://github.com/explodinggradients/ragas && cd ragas
pip install -e .
```

## :fire: Quickstart

This is a small example program you can run to see ragas in action!
```python
from datasets import load_dataset
from ragas.metrics import (
    Evaluation,
    rouge1,
    bert_score,
    entailment_score,
)  # import the metrics you want to use

# load the dataset
ds = load_dataset("explodinggradients/eli5-test", split="test_eli5").select(range(100))

# init the evaluator, this takes in the metrics you want to use
# and performs the evaluation
e = Evaluation(
    metrics=[rouge1, bert_score, entailment_score],
    batched=False,
    batch_size=30,
)

# run the evaluation
results = e.eval(ds["ground_truth"], ds["generated_text"])
print(results)
```
If you want a more in-depth explanation of the core components, check out our quick-start notebook.
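
Because the evaluation is just a Python call, it is straightforward to turn into a CI gate that fails the build when quality regresses. The snippet below is a minimal sketch of such a check, continuing from the quickstart above; it assumes `results` can be indexed by metric name to get an aggregate score (an assumption about this early API), and the threshold values are placeholders you would tune for your own pipeline.

```python
import sys

# placeholder thresholds -- tune these for your own pipeline
THRESHOLDS = {"rouge1": 0.25, "bert_score": 0.80}

# assumes `results` (from the quickstart above) maps metric names to scores
failing = {
    name: results[name]
    for name, minimum in THRESHOLDS.items()
    if results[name] < minimum
}

if failing:
    print(f"Metrics below threshold: {failing}")
    sys.exit(1)  # a non-zero exit code fails the CI job
print("All metrics above their thresholds")
```
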
## :luggage: Metrics

### :3rd_place_medal: Character based

- **Levenshtein distance** is the number of single-character edits (insertions, deletions, substitutions) required to change your generated text into the ground truth text.
- **Levenshtein ratio** is obtained by dividing the Levenshtein distance by the sum of the number of characters in the generated text and the ground truth. These metrics are suitable when you work with short, precise texts (see the sketch below).
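
A minimal, dependency-free sketch of both quantities, using the classic dynamic-programming recurrence and the ratio definition given above (in practice you would likely use a library such as `python-Levenshtein` or `rapidfuzz`):

```python
def levenshtein_distance(generated: str, ground_truth: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `generated` into `ground_truth`."""
    prev = list(range(len(ground_truth) + 1))
    for i, g_char in enumerate(generated, start=1):
        curr = [i]
        for j, t_char in enumerate(ground_truth, start=1):
            cost = 0 if g_char == t_char else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_ratio(generated: str, ground_truth: str) -> float:
    """Distance normalised by the total number of characters, as described above."""
    total = len(generated) + len(ground_truth)
    return levenshtein_distance(generated, ground_truth) / total if total else 0.0


print(levenshtein_distance("kitten", "sitting"))        # 3
print(round(levenshtein_ratio("kitten", "sitting"), 3))  # 3 / 13 ≈ 0.231
```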

### :2nd_place_medal: N-Gram based

N-gram based metrics, as the name indicates, use n-grams to compare the generated answer with the ground truth. They are suitable for extractive and abstractive tasks, but have limitations with long, free-form answers because the comparison is word based.

- **ROUGE** (Recall-Oriented Understudy for Gisting Evaluation):
  - **ROUGE-N** measures the number of matching n-grams between the generated text and the ground truth. These matches do not consider the ordering of words.
  - **ROUGE-L** measures the longest common subsequence (LCS) between the generated text and the ground truth, i.e. the longest sequence of tokens shared by both.

- **BLEU** (BiLingual Evaluation Understudy)

It measures precision by comparing clipped n-grams in the generated text to the ground truth text. These matches do not consider the ordering of words (see the sketch below).
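
To make the n-gram mechanics concrete, here is a small, dependency-free sketch of ROUGE-N recall and BLEU-style clipped n-gram precision (real implementations such as `rouge-score` or `sacrebleu` add stemming, smoothing, and a brevity penalty):

```python
from collections import Counter


def ngrams(tokens: list, n: int) -> Counter:
    """Count the n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(generated: str, ground_truth: str, n: int = 1) -> float:
    """Fraction of ground-truth n-grams that also appear in the generated text."""
    gen, ref = ngrams(generated.split(), n), ngrams(ground_truth.split(), n)
    overlap = sum((gen & ref).values())  # Counter intersection takes the min count
    return overlap / max(sum(ref.values()), 1)


def clipped_ngram_precision(generated: str, ground_truth: str, n: int = 1) -> float:
    """BLEU-style precision: generated n-grams are clipped to their count in the reference."""
    gen, ref = ngrams(generated.split(), n), ngrams(ground_truth.split(), n)
    clipped = sum((gen & ref).values())
    return clipped / max(sum(gen.values()), 1)


gt = "the cat sat on the mat"
gen = "the cat the cat sat"
print(rouge_n_recall(gen, gt, n=1))           # 4/6 ≈ 0.667
print(clipped_ngram_precision(gen, gt, n=1))  # 4/5 = 0.8
```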

### :1st_place_medal: Model Based

Model-based methods use language models combined with NLP techniques to compare the generated text with the ground truth. They are well suited for both long and short free-form answers.

- **BertScore**

BertScore measures the similarity between the ground truth and the generated text using SBERT vector embeddings. The common choice of similarity measure is cosine similarity, whose values range from 0 to 1. It shows good correlation with human judgement.
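
A minimal sketch of the idea using the `sentence-transformers` package (the model name below is just a common default for illustration, not necessarily what ragas uses internally):

```python
from sentence_transformers import SentenceTransformer, util

# any SBERT checkpoint works; all-MiniLM-L6-v2 is a small, common default
model = SentenceTransformer("all-MiniLM-L6-v2")

ground_truth = "The Eiffel Tower is located in Paris, France."
generated = "The Eiffel Tower stands in Paris."

# embed both texts and take the cosine similarity of the embeddings
embeddings = model.encode([ground_truth, generated], convert_to_tensor=True)
score = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"similarity: {score:.3f}")
```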


- **EntailmentScore**

Uses textual entailment to measure the factual consistency of the generated text given the ground truth. The score ranges from 0 to 1, with the latter indicating perfect factual entailment for all samples. The entailment score is highly correlated with human judgement.
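
A rough sketch of the entailment idea with an off-the-shelf NLI model from `transformers` (ragas may use a different model and aggregation; the `roberta-large-mnli` checkpoint and label lookup below are assumptions of this example):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# an NLI model that predicts contradiction / neutral / entailment for a text pair
name = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

ground_truth = "The Eiffel Tower is located in Paris, France."
generated = "The Eiffel Tower stands in Paris."

# premise = ground truth, hypothesis = generated answer
inputs = tokenizer(ground_truth, generated, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)[0]

# look up the entailment label index from the checkpoint's own config
label2id = {label.lower(): idx for label, idx in model.config.label2id.items()}
entailment_prob = probs[label2id["entailment"]].item()
print(f"entailment probability: {entailment_prob:.3f}")
```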


- **$Q^2$**

Best used to measure factual consistency between the ground truth and the generated text. Scores range from 0 to 1, with higher scores indicating better factual consistency between the ground truth and the generated answer. It employs the QA-QG (question answering, question generation) paradigm followed by NLI to compare the ground truth and the generated answer. The $Q^2$ score is highly correlated with human judgement. :warning: this is a time- and resource-hungry metric.
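
A heavily simplified sketch of the answer-comparison step: given a question about the generated answer, re-answer it from the ground truth with an extractive QA model and compare the two answers. The real $Q^2$ pipeline also generates the questions automatically and uses NLI for the comparison; the QA model name and the token-F1 comparison below are assumptions of this example.

```python
from collections import Counter

from transformers import pipeline

# extractive QA model used to answer the same question over both texts
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

ground_truth = "The Eiffel Tower was completed in 1889 and is located in Paris."
generated = "The Eiffel Tower, finished in 1889, stands in Paris."

# in the full Q^2 pipeline this question would be generated automatically
# from the generated answer; here we assume it is given
question = "When was the Eiffel Tower completed?"

answer_from_generated = qa(question=question, context=generated)["answer"]
answer_from_ground_truth = qa(question=question, context=ground_truth)["answer"]


def token_f1(a: str, b: str) -> float:
    """SQuAD-style token-overlap F1 between two short answers."""
    a_tok, b_tok = a.lower().split(), b.lower().split()
    overlap = sum((Counter(a_tok) & Counter(b_tok)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(a_tok), overlap / len(b_tok)
    return 2 * precision * recall / (precision + recall)


print(token_f1(answer_from_generated, answer_from_ground_truth))
```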

📜 Check out [citations](./references.md) for related publications.
