This library provides a comprehensive suite of metrics for evaluating the performance of Retrieval-Augmented Generation (RAG) systems. RAG systems, which combine information retrieval with text generation, present unique evaluation challenges beyond those found in standard language generation tasks.
Project description
RAG Evaluator Overview
RAG Evaluator is a Python library for evaluating Retrieval-Augmented Generation (RAG) systems. It provides various metrics for evaluating the quality of generated text against a reference text.
Installation
You can install the library using pip:

pip install Comprehensive_RAG_Evaluation_Metrics

Usage
Here's how to use the RAG Evaluator library:
from Comprehensive_RAG_Evaluation_Metrics import RAGEvaluator

# Initialize the evaluator
evaluator = RAGEvaluator()

# Input data
question = "What are the causes of difficulty in learning a new topic?"
response = "Difficulty in learning a new topic is often caused by a lack of understanding of the subject's structure."
reference = "Not knowing how to explain a topic to others can make it harder to learn, as it requires a deeper understanding of the subject's structure."

# Evaluate the response
metrics = evaluator.evaluate_all(question, response, reference)

# Print the results
print(metrics)
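The structure of the returned object is not documented above, so the following is only a sketch that assumes evaluate_all returns a plain dict mapping metric names to numeric scores:

```python
# Assumption: `metrics` is a dict of metric name -> numeric score,
# e.g. {"BLEU": ..., "ROUGE-1": ..., ...}. Adapt this if the library
# returns a different structure (BERT Score, for instance, reports P/R/F1).
for name, score in sorted(metrics.items()):
    print(f"{name:>20}: {score}")

# Hypothetical quality gate: flag answers with weak unigram overlap.
ROUGE1_THRESHOLD = 0.4  # project-specific cutoff, not part of the library
if metrics.get("ROUGE-1", 0.0) < ROUGE1_THRESHOLD:
    print("Warning: low ROUGE-1 overlap with the reference.")
```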
Streamlit Web App
To run the web app:

1. cd into the streamlit app folder.
2. Create a virtual environment.
3. Activate the virtual environment.
4. Install all dependencies.
5. Run the app: streamlit run app.py

Metrics
The RAG Evaluator provides the following metrics (an illustrative Python sketch of two of them follows the list):
- BLEU (0-100): Measures the overlap between the generated output and reference text based on n-grams.
  - 0-20: Low similarity, 20-40: Medium-low, 40-60: Medium, 60-80: High, 80-100: Very high
- ROUGE-1 (0-1): Measures the overlap of unigrams between the generated output and reference text.
  - 0.0-0.2: Poor overlap, 0.2-0.4: Fair, 0.4-0.6: Good, 0.6-0.8: Very good, 0.8-1.0: Excellent
- BERT Score (0-1): Evaluates semantic similarity using BERT embeddings (Precision, Recall, F1).
  - 0.0-0.5: Low similarity, 0.5-0.7: Moderate, 0.7-0.8: Good, 0.8-0.9: High, 0.9-1.0: Very high
- Perplexity (1 to ∞, lower is better): Measures how well a language model predicts the text.
  - 1-10: Excellent, 10-50: Good, 50-100: Moderate, 100+: High (potentially nonsensical)
- Diversity (0-1): Measures the uniqueness of bigrams in the generated output.
  - 0.0-0.2: Very low, 0.2-0.4: Low, 0.4-0.6: Moderate, 0.6-0.8: High, 0.8-1.0: Very high
- Racial Bias (0-1): Detects the presence of biased language in the generated output.
  - 0.0-0.2: Low probability, 0.2-0.4: Moderate, 0.4-0.6: High, 0.6-0.8: Very high, 0.8-1.0: Extreme
- METEOR (0-1): Calculates semantic similarity considering synonyms and paraphrases.
  - 0.0-0.2: Poor, 0.2-0.4: Fair, 0.4-0.6: Good, 0.6-0.8: Very good, 0.8-1.0: Excellent
- CHRF (0-1): Computes a character n-gram F-score for fine-grained text similarity.
  - 0.0-0.2: Low, 0.2-0.4: Moderate, 0.4-0.6: Good, 0.6-0.8: High, 0.8-1.0: Very high
- Flesch Reading Ease (0-100): Assesses text readability.
  - 0-30: Very difficult, 30-50: Difficult, 50-60: Fairly difficult, 60-70: Standard, 70-80: Fairly easy, 80-90: Easy, 90-100: Very easy
- Flesch-Kincaid Grade (0-18+): Indicates the U.S. school grade level needed to understand the text.
  - 1-6: Elementary, 7-8: Middle school, 9-12: High school, 13+: College level
- Semantic Similarity: Evaluates similarity in meaning between two texts using word embeddings and cosine similarity.
- Factual Consistency: Verifies the factual accuracy of responses using entity recognition and knowledge-graph-based methods.
- Question Relevance: Measures how relevant a response is to the user query using keyword extraction and intent detection.
- Context Relevance: Assesses whether a response is appropriate to the given situation using topic modeling and semantic role labeling.
- Answer Relevance: Evaluates how clearly and directly a response answers the user query using named entity recognition and dependency parsing.
- Toxicity: Detects hate speech, profanity, and toxic content in responses using sentiment analysis and machine-learning-based classification.
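For intuition, two of the simpler metrics above can be reproduced in a few lines of plain Python. The sketch below is illustrative only and is not the library's implementation: it computes the distinct-bigram Diversity score described above and maps a BLEU score onto the qualitative bands from the list.

```python
def distinct_bigram_diversity(text: str) -> float:
    """Share of unique bigrams among all bigrams in the text (0-1)."""
    tokens = text.lower().split()
    bigrams = list(zip(tokens, tokens[1:]))
    return len(set(bigrams)) / len(bigrams) if bigrams else 0.0


def bleu_band(bleu: float) -> str:
    """Map a 0-100 BLEU score onto the qualitative bands listed above."""
    for upper, label in [(20, "Low similarity"), (40, "Medium-low"),
                         (60, "Medium"), (80, "High")]:
        if bleu <= upper:
            return label
    return "Very high"


print(distinct_bigram_diversity("the cat sat on the mat and the cat slept"))  # ~0.89
print(bleu_band(47.5))  # "Medium"
```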
Testing
To run the tests, use the following command:

python -m unittest discover -s rag_evaluator -p "test_*.py"
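The discovery pattern above expects test modules named test_*.py inside the rag_evaluator folder. A minimal test of this kind might look like the following; the file name and the assumption that evaluate_all returns a dict are illustrative, not taken from the package itself.

```python
# rag_evaluator/test_smoke.py  (hypothetical file name and layout)
import unittest

from Comprehensive_RAG_Evaluation_Metrics import RAGEvaluator


class TestEvaluateAllSmoke(unittest.TestCase):
    def test_returns_scores_for_a_simple_example(self):
        evaluator = RAGEvaluator()
        metrics = evaluator.evaluate_all(
            "What is RAG?",
            "RAG combines retrieval with text generation.",
            "Retrieval-Augmented Generation combines document retrieval with text generation.",
        )
        # Assumption: evaluate_all returns a non-empty dict keyed by metric name.
        self.assertIsInstance(metrics, dict)
        self.assertTrue(metrics)


if __name__ == "__main__":
    unittest.main()
```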
Hashes for comprehensive_rag_evaluation_metrics-0.12.0.tar.gz
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1473059df060017db1941e8f64403af8ce8950a76c7187b84bc23d7245f49eb9 |
| MD5 | 201ef2808d0050085401413c7f958b7b |
| BLAKE2b-256 | 98bafb97b395a004d6f3c02822b09172c00bf37f4bdb830e0cea9c9c85efde1e |
Hashes for Comprehensive_RAG_Evaluation_Metrics-0.12.0-py3-none-any.whl
| Algorithm | Hash digest |
|---|---|
| SHA256 | 896e49928405415b1cd223ae2b54d63315b156c213abcf6edff1205ee29f2ab2 |
| MD5 | c02a96bc68e1b421b029047470ec702e |
| BLAKE2b-256 | a8e5f5eaebf86289354669ca1e25ee322df97e6f4a525703228aa6c05c6dd981 |
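To verify a downloaded release file against the digests above, a short hashlib check is enough. The local file path below is an assumption (it presumes the source distribution sits in the current directory); the expected digest is the SHA256 value from the first table.

```python
import hashlib

# Hypothetical local path to the downloaded source distribution.
PATH = "comprehensive_rag_evaluation_metrics-0.12.0.tar.gz"
EXPECTED_SHA256 = "1473059df060017db1941e8f64403af8ce8950a76c7187b84bc23d7245f49eb9"

sha256 = hashlib.sha256()
with open(PATH, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 16), b""):
        sha256.update(chunk)

print("OK" if sha256.hexdigest() == EXPECTED_SHA256 else "Hash mismatch!")
```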