
This library provides a comprehensive suite of metrics to evaluate the performance of Retrieval-Augmented Generation (RAG) systems. RAG systems, which combine information retrieval with text generation, present unique evaluation challenges beyond those found in standard language generation tasks.

Project description

Overview

RAG Evaluator is a Python library for evaluating Retrieval-Augmented Generation (RAG) systems. It provides various metrics to evaluate the quality of generated text against reference text.

Installation

You can install the library using pip:

```shell
pip install Comprehensive_RAG_Evaluation_Metrics
```

Usage

Here's how to use the RAG Evaluator library:

```python
from Comprehensive_RAG_Evaluation_Metrics import RAGEvaluator

# Initialize the evaluator
evaluator = RAGEvaluator()

# Input data
question = "What are the causes of difficulty in learning a new topic?"
response = "Difficulty in learning a new topic is often caused by a lack of understanding of the subject's structure."
reference = "Not knowing how to explain a topic to others can make it harder to learn, as it requires a deeper understanding of the subject's structure."

# Evaluate the response
metrics = evaluator.evaluate_all(question, response, reference)

# Print the results
print(metrics)
```

Streamlit Web App

To run the web app:

1. cd into the streamlit app folder.
2. Create a virtual environment.
3. Activate the virtual environment.
4. Install all dependencies.
5. Run the app: streamlit run app.py

Metrics

The RAG Evaluator provides the following metrics:

- BLEU (0-100): Measures n-gram overlap between the generated output and the reference text.
  - 0-20: Low similarity, 20-40: Medium-low, 40-60: Medium, 60-80: High, 80-100: Very high
- ROUGE-1 (0-1): Measures unigram overlap between the generated output and the reference text.
  - 0.0-0.2: Poor overlap, 0.2-0.4: Fair, 0.4-0.6: Good, 0.6-0.8: Very good, 0.8-1.0: Excellent
- BERT Score (0-1): Evaluates semantic similarity using BERT embeddings (Precision, Recall, F1).
  - 0.0-0.5: Low similarity, 0.5-0.7: Moderate, 0.7-0.8: Good, 0.8-0.9: High, 0.9-1.0: Very high
- Perplexity (1 to ∞, lower is better): Measures how well a language model predicts the text.
  - 1-10: Excellent, 10-50: Good, 50-100: Moderate, 100+: High (potentially nonsensical)
- Diversity (0-1): Measures the uniqueness of bigrams in the generated output.
  - 0.0-0.2: Very low, 0.2-0.4: Low, 0.4-0.6: Moderate, 0.6-0.8: High, 0.8-1.0: Very high
- Racial Bias (0-1): Detects the presence of biased language in the generated output.
  - 0.0-0.2: Low probability, 0.2-0.4: Moderate, 0.4-0.6: High, 0.6-0.8: Very high, 0.8-1.0: Extreme
- METEOR (0-1): Calculates semantic similarity, taking synonyms and paraphrases into account.
  - 0.0-0.2: Poor, 0.2-0.4: Fair, 0.4-0.6: Good, 0.6-0.8: Very good, 0.8-1.0: Excellent
- CHRF (0-1): Computes a character n-gram F-score for fine-grained text similarity.
  - 0.0-0.2: Low, 0.2-0.4: Moderate, 0.4-0.6: Good, 0.6-0.8: High, 0.8-1.0: Very high
- Flesch Reading Ease (0-100): Assesses text readability.
  - 0-30: Very difficult, 30-50: Difficult, 50-60: Fairly difficult, 60-70: Standard, 70-80: Fairly easy, 80-90: Easy, 90-100: Very easy
- Flesch-Kincaid Grade (0-18+): Indicates the U.S. school grade level needed to understand the text.
  - 1-6: Elementary, 7-8: Middle school, 9-12: High school, 13+: College level

Additional Evaluation Dimensions

- Semantic Similarity: Evaluates similarity in meaning between two texts using word embeddings and cosine similarity.
- Factual Consistency: Verifies factual accuracy in responses using entity recognition and knowledge-graph-based methods.
- Question Relevance: Measures how relevant a response is to the user query using keyword extraction and intent detection.
- Context Relevance: Assesses whether a response is appropriate in a given situation using topic modeling and semantic role labeling.
- Answer Relevance: Evaluates how clearly and directly a response answers the user query using named entity recognition and dependency parsing.
- Toxicity: Detects hate speech, profanity, and toxic content in responses using sentiment analysis and machine-learning-based classification.

Testing

To run the tests, use the following command:

```shell
python -m unittest discover -s rag_evaluator -p "test_*.py"
```
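Two of the simpler metrics above can be sketched independently in a few lines. The following is an illustrative re-implementation, not the library's actual code: a distinct-bigram diversity score, and a helper that maps a BLEU score to the qualitative bands listed in the Metrics section.

```python
def diversity(text: str) -> float:
    """Fraction of unique bigrams in the text (0-1).

    Rough sketch of the Diversity metric above; the library's own
    implementation may tokenize or normalize differently.
    """
    tokens = text.split()
    bigrams = list(zip(tokens, tokens[1:]))
    if not bigrams:
        return 0.0
    return len(set(bigrams)) / len(bigrams)


def bleu_band(score: float) -> str:
    """Map a BLEU score (0-100) to the qualitative bands listed above."""
    for upper, label in [(20, "Low similarity"), (40, "Medium-low"),
                         (60, "Medium"), (80, "High"), (100, "Very high")]:
        if score <= upper:
            return label
    raise ValueError("BLEU scores fall in [0, 100]")


print(diversity("again and again and again"))  # 2 unique of 4 bigrams -> 0.5
print(bleu_band(35))                           # -> Medium-low
```

Repeated bigrams pull the diversity score down, which is why degenerate, looping generations score low on this metric.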


