
This library provides a comprehensive suite of metrics to evaluate the performance of Retrieval-Augmented Generation (RAG) systems. RAG systems, which combine information retrieval with text generation, present unique evaluation challenges beyond those found in standard language generation tasks.

Project description

Overview

RAG Evaluator is a Python library for evaluating Retrieval-Augmented Generation (RAG) systems. It provides various metrics to evaluate the quality of generated text against reference text.

Installation

You can install the library using pip:

```shell
pip install Comprehensive_RAG_Evaluation_Metrics
```

Usage

Here's how to use the RAG Evaluator library:

```python
from Comprehensive_RAG_Evaluation_Metrics import RAGEvaluator

# Initialize the evaluator
evaluator = RAGEvaluator()

# Input data
question = "What are the causes of difficulty in learning a new topic?"
response = "Difficulty in learning a new topic is often caused by a lack of understanding of the subject's structure."
reference = "Not knowing how to explain a topic to others can make it harder to learn, as it requires a deeper understanding of the subject's structure."

# Evaluate the response
metrics = evaluator.evaluate_all(question, response, reference)

# Print the results
print(metrics)
```

Streamlit Web App

To run the web app:

1. cd into the streamlit app folder.
2. Create a virtual environment.
3. Activate the virtual environment.
4. Install all dependencies.
5. Run the app: streamlit run app.py

Metrics

The RAG Evaluator provides the following metrics:

- BLEU (0-100): Measures n-gram overlap between the generated output and the reference text.
  - 0-20: Low similarity, 20-40: Medium-low, 40-60: Medium, 60-80: High, 80-100: Very high
- ROUGE-1 (0-1): Measures unigram overlap between the generated output and the reference text.
  - 0.0-0.2: Poor overlap, 0.2-0.4: Fair, 0.4-0.6: Good, 0.6-0.8: Very good, 0.8-1.0: Excellent
- BERT Score (0-1): Evaluates semantic similarity using BERT embeddings (Precision, Recall, F1).
  - 0.0-0.5: Low similarity, 0.5-0.7: Moderate, 0.7-0.8: Good, 0.8-0.9: High, 0.9-1.0: Very high
- Perplexity (1 to ∞, lower is better): Measures how well a language model predicts the text.
  - 1-10: Excellent, 10-50: Good, 50-100: Moderate, 100+: High (potentially nonsensical)
- Diversity (0-1): Measures the uniqueness of bigrams in the generated output.
  - 0.0-0.2: Very low, 0.2-0.4: Low, 0.4-0.6: Moderate, 0.6-0.8: High, 0.8-1.0: Very high
- Racial Bias (0-1): Detects the presence of biased language in the generated output.
  - 0.0-0.2: Low probability, 0.2-0.4: Moderate, 0.4-0.6: High, 0.6-0.8: Very high, 0.8-1.0: Extreme
- METEOR (0-1): Calculates semantic similarity, taking synonyms and paraphrases into account.
  - 0.0-0.2: Poor, 0.2-0.4: Fair, 0.4-0.6: Good, 0.6-0.8: Very good, 0.8-1.0: Excellent
- CHRF (0-1): Computes a character n-gram F-score for fine-grained text similarity.
  - 0.0-0.2: Low, 0.2-0.4: Moderate, 0.4-0.6: Good, 0.6-0.8: High, 0.8-1.0: Very high
- Flesch Reading Ease (0-100): Assesses text readability.
  - 0-30: Very difficult, 30-50: Difficult, 50-60: Fairly difficult, 60-70: Standard, 70-80: Fairly easy, 80-90: Easy, 90-100: Very easy
- Flesch-Kincaid Grade (0-18+): Indicates the U.S. school grade level needed to understand the text.
  - 1-6: Elementary, 7-8: Middle school, 9-12: High school, 13+: College level

Additional Evaluation Dimensions

- Semantic Similarity: Evaluates similarity in meaning between two texts using word embeddings and cosine similarity.
- Factual Consistency: Verifies factual accuracy in responses using entity recognition and knowledge-graph-based methods.
- Question Relevance: Measures how relevant a response is to the user query using keyword extraction and intent detection.
- Context Relevance: Assesses whether a response is appropriate in a given situation using topic modeling and semantic role labeling.
- Answer Relevance: Evaluates how clearly and directly a response answers the user query using named entity recognition and dependency parsing.
- Toxicity: Detects hate speech, profanity, and toxic content in responses using sentiment analysis and machine-learning-based classification.

Testing

To run the tests, use the following command:

```shell
python -m unittest discover -s rag_evaluator -p "test_*.py"
```
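Two of the simpler metrics above can be sketched independently in a few lines. The following is an illustrative re-implementation, not the library's actual code: a distinct-bigram diversity score, and a helper that maps a BLEU score to the qualitative bands listed in the Metrics section.

```python
def diversity(text: str) -> float:
    """Fraction of unique bigrams in the text (0-1).

    Rough sketch of the Diversity metric above; the library's own
    implementation may tokenize or normalize differently.
    """
    tokens = text.split()
    bigrams = list(zip(tokens, tokens[1:]))
    if not bigrams:
        return 0.0
    return len(set(bigrams)) / len(bigrams)


def bleu_band(score: float) -> str:
    """Map a BLEU score (0-100) to the qualitative bands listed above."""
    for upper, label in [(20, "Low similarity"), (40, "Medium-low"),
                         (60, "Medium"), (80, "High"), (100, "Very high")]:
        if score <= upper:
            return label
    raise ValueError("BLEU scores fall in [0, 100]")


print(diversity("again and again and again"))  # 2 unique of 4 bigrams -> 0.5
print(bleu_band(35))                           # -> Medium-low
```

Repeated bigrams pull the diversity score down, which is why degenerate, looping generations score low on this metric.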


