
A Python package for RAG performance evaluation

Project description

Krag

Krag is a Python package designed to evaluate RAG (Retrieval-Augmented Generation) systems. It provides various evaluation metrics, including Hit Rate, Recall, Precision, F1 Score, MRR (Mean Reciprocal Rank), MAP (Mean Average Precision), and NDCG (Normalized Discounted Cumulative Gain).

Installation

pip install krag

Key Features

1. Evaluation Metrics

  • Hit Rate: Fraction of queries for which the target documents are retrieved
  • Recall: Fraction of relevant documents found in the top-k predictions
  • Precision: Fraction of the top-k predictions that are relevant
  • F1 Score: Harmonic mean of precision and recall
  • MRR (Mean Reciprocal Rank): Mean reciprocal rank of the first relevant document
  • MAP (Mean Average Precision): Mean of per-query average precision over ranks with relevant documents
  • NDCG (Normalized Discounted Cumulative Gain): Ranking quality measured with position-discounted gains

2. Document Matching Methods

  • Text Preprocessing

    • TokenizerType (KIWI for Korean, NLTK, WHITESPACE)
    • TokenizerConfig for consistent normalization
    • Language-specific tokenization support (Korean/English)
  • Matching Methods

    • Exact text matching
    • ROUGE-based matching (rouge1, rouge2, rougeL); see the sketch after this list
    • Embedding-based similarity
      • Supports HuggingFace, OpenAI, and Ollama embeddings
      • Configurable similarity thresholds
      • Cached embeddings for efficiency
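
For illustration, ROUGE-based matching can be thought of as treating two documents as a match when their ROUGE score clears a threshold. The following is a minimal sketch of that idea using the rouge-score package; it is an assumption about the general approach, not Krag's internal implementation.

# Sketch of threshold-based ROUGE matching (pip install rouge-score).
# Illustrative only; Krag's internal matching logic may differ.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge2"], use_stemmer=True)

def is_rouge_match(actual_text: str, predicted_text: str, threshold: float = 0.8) -> bool:
    """Match if the ROUGE-2 F-measure between the two texts reaches the threshold."""
    score = scorer.score(actual_text, predicted_text)["rouge2"].fmeasure
    return score >= threshold

print(is_rouge_match("This is the second document.", "This is the second document."))  # True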

3. Configuration Options

  • Tokenizer selection and settings
  • ROUGE and embedding-similarity thresholds
  • Embedding model configuration
  • Averaging methods (micro/macro); see the sketch below
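
The averaging method controls how per-query results are combined. Below is a small sketch of the difference for recall, assuming the standard convention (micro pools documents across queries, macro averages per-query scores); it is for illustration only.

# Micro vs. macro recall for two toy queries (illustrative only).
# Query 1: 1 of 2 relevant docs retrieved; Query 2: 3 of 4 relevant docs retrieved.
retrieved_relevant = [1, 3]
total_relevant = [2, 4]

# Micro: pool documents across all queries, then divide once.
micro_recall = sum(retrieved_relevant) / sum(total_relevant)  # 4/6 ~= 0.667

# Macro: compute recall per query, then average the per-query scores.
macro_recall = sum(r / t for r, t in zip(retrieved_relevant, total_relevant)) / len(total_relevant)  # (0.5 + 0.75) / 2 = 0.625

print(micro_recall, macro_recall)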

4. Visualization

  • Comparative bar charts for metrics

Metric Details

def calculate_hit_rate(k):
    """
    Hit rate = # queries with correct documents / total queries
    - ALL: Found all target docs
    - PARTIAL: Found at least one target doc
    """

def calculate_recall(k):
    """
    Recall = # relevant docs retrieved / total relevant docs
    - micro: Document-level average
    - macro: Query-level average
    """

def calculate_precision(k):
    """
    Precision = # relevant retrieved / # retrieved
    - micro: Document-level average
    - macro: Query-level average
    """

def calculate_f1_score(k):
    """F1 = 2 * (precision * recall)/(precision + recall)"""

def calculate_mrr(k):
    """MRR = mean(1/rank of first relevant doc)"""

def calculate_map(k):
    """MAP = mean(average precision per query)"""

def calculate_ndcg(k):
    """NDCG = DCG/IDCG with position discounting"""

Usage Examples

1. Basic Evaluator

from krag.document import KragDocument as Document
from krag.evaluators import OfflineRetrievalEvaluators, AveragingMethod, MatchingCriteria

actual_docs = [
    [Document(page_content="This is the first document."),
     Document(page_content="This is the second document.")],
]

predicted_docs = [
    [Document(page_content="This is the last document."),
     Document(page_content="This is the second document.")],
]

evaluator = OfflineRetrievalEvaluators(
    actual_docs,
    predicted_docs,
    averaging_method=AveragingMethod.MICRO,
    matching_criteria=MatchingCriteria.PARTIAL
)

# Calculate individual metrics
hit_rate = evaluator.calculate_hit_rate(k=2)
mrr = evaluator.calculate_mrr(k=2)
recall = evaluator.calculate_recall(k=2)
precision = evaluator.calculate_precision(k=2)
f1_score = evaluator.calculate_f1_score(k=2)
map_score = evaluator.calculate_map(k=2)
ndcg = evaluator.calculate_ndcg(k=2)

print(f"Hit Rate @2: {hit_rate}")
print(f"MRR @2: {mrr}")
print(f"Recall @2: {recall}")
print(f"Precision @2: {precision}")
print(f"F1 Score @2: {f1_score}")
print(f"MAP @2: {map_score}")
print(f"NDCG @2: {ndcg}")

# Visualize the evaluation results
evaluator.visualize_results(k=2)

Metric Values

Hit Rate @2: {'hit_rate': 1.0}
MRR @2: {'mrr': 0.5}
Recall @2: {'micro_recall': 0.5}
Precision @2: {'micro_precision': 0.5}
F1 Score @2: {'micro_f1': 0.5}
MAP @2: {'map': 0.25}
NDCG @2: {'ndcg': 0.6309297535714575}

Visualization Result

[Image: comparative bar chart of the evaluation metrics]

2. ROUGE Evaluator

from krag.evaluators import RougeOfflineRetrievalEvaluators

# Evaluation using ROUGE matching
evaluator = RougeOfflineRetrievalEvaluators(
    actual_docs,
    predicted_docs,
    averaging_method=AveragingMethod.MICRO,
    matching_criteria=MatchingCriteria.PARTIAL,
    match_method="rouge2",  # Choose from rouge1, rouge2, rougeL
    threshold=0.8  # ROUGE score threshold
)

# Calculate individual metrics
hit_rate = evaluator.calculate_hit_rate(k=2)
mrr = evaluator.calculate_mrr(k=2)
recall = evaluator.calculate_recall(k=2)
precision = evaluator.calculate_precision(k=2)
f1_score = evaluator.calculate_f1_score(k=2)
map_score = evaluator.calculate_map(k=2)
ndcg = evaluator.calculate_ndcg(k=2)

print(f"Hit Rate @2: {hit_rate}")
print(f"MRR @2: {mrr}")
print(f"Recall @2: {recall}")
print(f"Precision @2: {precision}")
print(f"F1 Score @2: {f1_score}")
print(f"MAP @2: {map_score}")
print(f"NDCG @2: {ndcg}")

# Visualize the evaluation results
evaluator.visualize_results(k=2)

Metric Values

Hit Rate @2: {'hit_rate': 1.0}
MRR @2: {'mrr': 0.5}
Recall @2: {'micro_recall': 0.5}
Precision @2: {'micro_precision': 0.5}
F1 Score @2: {'micro_f1': 0.5}
MAP @2: {'map': 0.25}
NDCG @2: {'ndcg': 0.8651447273736845}

Visualization Result

[Image: comparative bar chart of the evaluation metrics]

3. Embedding-based ROUGE Evaluator

Performs an initial filtering pass using text-embedding similarity, followed by a detailed comparison using ROUGE scores.
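
As an illustration of this two-stage idea (not Krag's actual code), a match check might first compare embedding vectors and only apply the stricter ROUGE test when the similarity filter passes. The embed() helper below is a hypothetical stand-in for whichever embedding backend is configured.

# Illustrative two-stage matching: embedding-similarity filter, then a ROUGE-2 check.
import numpy as np
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge2"], use_stemmer=True)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_two_stage_match(actual_text: str, predicted_text: str, embed,
                       similarity_threshold: float = 0.7,
                       rouge_threshold: float = 0.8) -> bool:
    # Stage 1: embedding-similarity filter to discard clearly unrelated pairs.
    # `embed` is a hypothetical callable returning a vector for a text.
    if cosine_similarity(embed(actual_text), embed(predicted_text)) < similarity_threshold:
        return False
    # Stage 2: stricter lexical comparison with ROUGE-2.
    return scorer.score(actual_text, predicted_text)["rouge2"].fmeasure >= rouge_threshold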

from krag.evaluators import EmbeddingRougeOfflineRetrievalEvaluators

# Using HuggingFace embeddings
evaluator = EmbeddingRougeOfflineRetrievalEvaluators(
    actual_docs,
    predicted_docs,
    averaging_method=AveragingMethod.MICRO,
    matching_criteria=MatchingCriteria.PARTIAL,
    embedding_type="huggingface",
    embedding_config={
        "model_name": "jhgan/ko-sroberta-multitask",
        "model_kwargs": {'device': 'cpu'},
        "encode_kwargs": {'normalize_embeddings': False}
    },
    similarity_threshold=0.7,  # Embedding similarity threshold
    rouge_threshold=0.8  # ROUGE score threshold
)

# Using OpenAI embeddings
evaluator = EmbeddingRougeOfflineRetrievalEvaluators(
    actual_docs,
    predicted_docs,
    averaging_method=AveragingMethod.MICRO,
    matching_criteria=MatchingCriteria.PARTIAL,
    embedding_type="openai",
    embedding_config={
        "model": "text-embedding-3-small",
        "dimensions": 1024  # Optional embedding dimensions
    },
    similarity_threshold=0.7,  # Embedding similarity threshold
    rouge_threshold=0.8  # ROUGE score threshold
)

# Using Ollama embeddings
evaluator = EmbeddingRougeOfflineRetrievalEvaluators(
    actual_docs,
    predicted_docs,
    averaging_method=AveragingMethod.MICRO,
    matching_criteria=MatchingCriteria.PARTIAL,
    embedding_type="ollama",
    embedding_config={"model": "bge-m3"},
    similarity_threshold=0.8,  # Embedding similarity threshold
    rouge_threshold=0.8  # ROUGE score threshold
)

# Calculate individual metrics
hit_rate = evaluator.calculate_hit_rate(k=2)
mrr = evaluator.calculate_mrr(k=2)
recall = evaluator.calculate_recall(k=2)
precision = evaluator.calculate_precision(k=2)
f1_score = evaluator.calculate_f1_score(k=2)
map_score = evaluator.calculate_map(k=2)
ndcg = evaluator.calculate_ndcg(k=2)

print(f"Hit Rate @2: {hit_rate}")
print(f"MRR @2: {mrr}")
print(f"Recall @2: {recall}")
print(f"Precision @2: {precision}")
print(f"F1 Score @2: {f1_score}")
print(f"MAP @2: {map_score}")
print(f"NDCG @2: {ndcg}")

# Visualize the evaluation results
evaluator.visualize_results(k=2)

Metric Values

Hit Rate @2: {'hit_rate': 1.0}
MRR @2: {'mrr': 0.5}
Recall @2: {'micro_recall': 1.0}
Precision @2: {'micro_precision': 0.5}
F1 Score @2: {'micro_f1': 0.6666666666666666}
MAP @2: {'map': 0.25}
NDCG @2: {'ndcg': 0.6309297535714575}

Visualization Result

[Image: comparative bar chart of the evaluation metrics]

Important Notes

  1. Required packages for embedding models:

    • HuggingFace: pip install langchain-huggingface
    • OpenAI: pip install langchain-openai
    • Ollama: pip install langchain-ollama
  2. OpenAI embeddings require an API key:

    import os
    os.environ["OPENAI_API_KEY"] = "your-api-key"
    

License

MIT License

Contact

Questions: ontofinances@gmail.com

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

krag-0.0.29.tar.gz (12.5 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

krag-0.0.29-py3-none-any.whl (13.0 kB)

Uploaded Python 3

File details

Details for the file krag-0.0.29.tar.gz.

File metadata

  • Download URL: krag-0.0.29.tar.gz
  • Upload date:
  • Size: 12.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.2

File hashes

Hashes for krag-0.0.29.tar.gz

  • SHA256: 7819e0630db07b2db5383a14cf68db80d02694e97168e41a4adc80f4a07af986
  • MD5: 04d989d992c9e15957355b92da8e315b
  • BLAKE2b-256: 2668edc9b4540263a2cb35865d2753c9d204da9fd831627c552bcd3bb925bd96

See more details on using hashes here.

File details

Details for the file krag-0.0.29-py3-none-any.whl.

File metadata

  • Download URL: krag-0.0.29-py3-none-any.whl
  • Upload date:
  • Size: 13.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.2

File hashes

Hashes for krag-0.0.29-py3-none-any.whl

  • SHA256: 187b4eb3e16ea879e96fdd78895fb6ff64af85da1dd034e5e5b2be8f52e5f08d
  • MD5: 5f3a6ac4ae7a7c944012314f2e23d458
  • BLAKE2b-256: 03f82f0e004973e4f8a112ff4008a0a18b7ade1950cb165142181a0b2667404e

See more details on using hashes here.
