
A Python package for RAG performance evaluation


Krag

Krag is a Python package designed to evaluate RAG (Retrieval-Augmented Generation) systems. It provides various evaluation metrics including Hit Rate, Recall, Precision, MRR (Mean Reciprocal Rank), MAP (Mean Average Precision), and NDCG (Normalized Discounted Cumulative Gain).

Installation

pip install krag

Key Features

1. Evaluation Metrics

  • Hit Rate: Fraction of queries for which the target documents are retrieved
  • Recall: Fraction of relevant documents found in the top-k predictions
  • Precision: Fraction of the top-k predictions that are relevant
  • F1 Score: Harmonic mean of Precision and Recall
  • MRR (Mean Reciprocal Rank): Average inverse rank of the first relevant document
  • MAP (Mean Average Precision): Mean of per-query average precision over the ranks that hold relevant documents
  • NDCG (Normalized Discounted Cumulative Gain): Ranking quality with position-discounted gains

2. Document Matching Methods

  • Text Preprocessing

    • TokenizerType: KIWI (for Korean), NLTK, WHITESPACE
    • TokenizerConfig for consistent normalization
    • Language-specific tokenization support (Korean/English)
  • Matching Methods

    • Exact text matching
    • ROUGE-based matching (rouge1, rouge2, rougeL)
    • Embedding-based similarity
      • Supports HuggingFace, OpenAI, Ollama
      • Configurable thresholds
      • Cached embeddings for efficiency

3. Configuration Options

  • Tokenizer selection and settings
  • ROUGE and embedding similarity thresholds
  • Embedding model configuration
  • Averaging methods (micro/macro)
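
To illustrate the micro/macro distinction for recall, a minimal sketch (illustrative only, not Krag's internal code; the query counts are made up):

```python
# Micro averaging pools matches across all queries before dividing;
# macro averaging computes recall per query, then averages the per-query values.

def micro_recall(per_query):
    """per_query: list of (relevant_found, total_relevant) pairs, one per query."""
    found = sum(f for f, _ in per_query)
    total = sum(t for _, t in per_query)
    return found / total

def macro_recall(per_query):
    """Average of per-query recall values."""
    return sum(f / t for f, t in per_query) / len(per_query)

per_query = [(1, 2), (3, 3)]  # query 1: 1 of 2 found; query 2: 3 of 3 found
print(micro_recall(per_query))  # 4/5 = 0.8
print(macro_recall(per_query))  # (0.5 + 1.0) / 2 = 0.75
```

Micro weighting favors queries with many relevant documents; macro weights every query equally.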

4. Visualization

  • Comparative bar charts for metrics

Metric Details

def calculate_hit_rate(k):
    """
    Hit rate = # queries with correct documents / total queries
    - ALL: Found all target docs
    - PARTIAL: Found at least one target doc
    """

def calculate_recall(k):
    """
    Recall = # relevant docs retrieved / total relevant docs
    - micro: Document-level average
    - macro: Query-level average
    """

def calculate_precision(k):
    """
    Precision = # relevant retrieved / # retrieved
    - micro: Document-level average
    - macro: Query-level average
    """

def calculate_f1_score(k):
    """F1 = 2 * (precision * recall)/(precision + recall)"""

def calculate_mrr(k):
    """MRR = mean(1/rank of first relevant doc)"""

def calculate_map(k):
    """MAP = mean(average precision per query)"""

def calculate_ndcg(k):
    """NDCG = DCG/IDCG with position discounting"""
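
The docstrings above can be made concrete with a small hand computation (a sketch of the standard formulas, not Krag's implementation; ranks are 1-based):

```python
import math

def reciprocal_rank(relevance):
    """relevance: 1/0 flags of the ranked predictions for one query."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1 / rank
    return 0.0

def ndcg(relevance):
    """NDCG = DCG / IDCG, discounting each position by log2(rank + 1)."""
    dcg = sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevance, start=1))
    ideal = sorted(relevance, reverse=True)
    idcg = sum(rel / math.log2(rank + 1) for rank, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0

# One query whose single relevant doc is retrieved at rank 2:
print(reciprocal_rank([0, 1]))  # 0.5
print(ndcg([0, 1]))             # 1/log2(3) ≈ 0.6309
```

These are exactly the MRR (0.5) and NDCG (≈0.6309) values reported in the basic example below, where the matching document appears at rank 2.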

Usage Examples

1. Basic Evaluator

from krag.document import KragDocument as Document
from krag.evaluators import OfflineRetrievalEvaluators, AveragingMethod, MatchingCriteria

actual_docs = [
    [Document(page_content="This is the first document."),
     Document(page_content="This is the second document.")],
]

predicted_docs = [
    [Document(page_content="This is the last document."),
     Document(page_content="This is the second document.")],
]

evaluator = OfflineRetrievalEvaluators(
    actual_docs,
    predicted_docs,
    averaging_method=AveragingMethod.MICRO,
    matching_criteria=MatchingCriteria.PARTIAL
)

# Calculate individual metrics
hit_rate = evaluator.calculate_hit_rate(k=2)
mrr = evaluator.calculate_mrr(k=2)
recall = evaluator.calculate_recall(k=2)
precision = evaluator.calculate_precision(k=2)
f1_score = evaluator.calculate_f1_score(k=2)
map_score = evaluator.calculate_map(k=2)
ndcg = evaluator.calculate_ndcg(k=2)

print(f"Hit Rate @2: {hit_rate}")
print(f"MRR @2: {mrr}")
print(f"Recall @2: {recall}")
print(f"Precision @2: {precision}")
print(f"F1 Score @2: {f1_score}")
print(f"MAP @2: {map_score}")
print(f"NDCG @2: {ndcg}")

# Visualize the evaluation results
evaluator.visualize_results(k=2)

Metric Values

Hit Rate @2: {'hit_rate': 1.0}
MRR @2: {'mrr': 0.5}
Recall @2: {'micro_recall': 0.5}
Precision @2: {'micro_precision': 0.5}
F1 Score @2: {'micro_f1': 0.5}
MAP @2: {'map': 0.25}
NDCG @2: {'ndcg': 0.6309297535714575}

Visualization Result

(Image: comparative bar chart of the metrics above)

2. ROUGE Evaluator

from krag.evaluators import RougeOfflineRetrievalEvaluators

# Evaluation using ROUGE matching
evaluator = RougeOfflineRetrievalEvaluators(
    actual_docs,
    predicted_docs,
    averaging_method=AveragingMethod.MICRO,
    matching_criteria=MatchingCriteria.PARTIAL,
    match_method="rouge2",  # Choose from rouge1, rouge2, rougeL
    threshold=0.8  # ROUGE score threshold
)

# Calculate individual metrics
hit_rate = evaluator.calculate_hit_rate(k=2)
mrr = evaluator.calculate_mrr(k=2)
recall = evaluator.calculate_recall(k=2)
precision = evaluator.calculate_precision(k=2)
f1_score = evaluator.calculate_f1_score(k=2)
map_score = evaluator.calculate_map(k=2)
ndcg = evaluator.calculate_ndcg(k=2)

print(f"Hit Rate @2: {hit_rate}")
print(f"MRR @2: {mrr}")
print(f"Recall @2: {recall}")
print(f"Precision @2: {precision}")
print(f"F1 Score @2: {f1_score}")
print(f"MAP @2: {map_score}")
print(f"NDCG @2: {ndcg}")

# Visualize the evaluation results
evaluator.visualize_results(k=2)

Metric Values

Hit Rate @2: {'hit_rate': 1.0}
MRR @2: {'mrr': 0.5}
Recall @2: {'micro_recall': 0.5}
Precision @2: {'micro_precision': 0.5}
F1 Score @2: {'micro_f1': 0.5}
MAP @2: {'map': 0.25}
NDCG @2: {'ndcg': 0.8651447273736845}

Visualization Result

(Image: comparative bar chart of the metrics above)
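
For intuition, ROUGE-2 matching scores the bigram overlap between an actual and a predicted document and accepts a pair as a match when the score clears the threshold. A hand-rolled sketch (whitespace tokenization; not the scorer Krag uses internally):

```python
from collections import Counter

def rouge2_f(reference, prediction):
    """F-measure over bigram overlap between two texts."""
    def bigrams(text):
        tokens = text.lower().split()
        return Counter(zip(tokens, tokens[1:]))

    ref, pred = bigrams(reference), bigrams(prediction)
    overlap = sum((ref & pred).values())  # clipped bigram matches
    if not overlap:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# "first" vs "last" breaks 2 of the 4 bigrams:
score = rouge2_f("This is the first document.", "This is the last document.")
print(score)  # 0.5 -- below the 0.8 threshold, so this pair would not match
```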

3. Embedding-based ROUGE Evaluator

Performs initial filtering using text embeddings followed by detailed comparison using ROUGE scores.

from krag.evaluators import EmbeddingRougeOfflineRetrievalEvaluators

# Using HuggingFace embeddings
evaluator = EmbeddingRougeOfflineRetrievalEvaluators(
    actual_docs,
    predicted_docs,
    averaging_method=AveragingMethod.MICRO,
    matching_criteria=MatchingCriteria.PARTIAL,
    embedding_type="huggingface",
    embedding_config={
        "model_name": "jhgan/ko-sroberta-multitask",
        "model_kwargs": {'device': 'cpu'},
        "encode_kwargs": {'normalize_embeddings': False}
    },
    similarity_threshold=0.7,  # Embedding similarity threshold
    rouge_threshold=0.8  # ROUGE score threshold
)

# Using OpenAI embeddings
evaluator = EmbeddingRougeOfflineRetrievalEvaluators(
    actual_docs,
    predicted_docs,
    averaging_method=AveragingMethod.MICRO,
    matching_criteria=MatchingCriteria.PARTIAL,
    embedding_type="openai",
    embedding_config={
        "model": "text-embedding-3-small",
        "dimensions": 1024  # Optional embedding dimensions
    },
    similarity_threshold=0.7,  # Embedding similarity threshold
    rouge_threshold=0.8  # ROUGE score threshold
)

# Using Ollama embeddings
evaluator = EmbeddingRougeOfflineRetrievalEvaluators(
    actual_docs,
    predicted_docs,
    averaging_method=AveragingMethod.MICRO,
    matching_criteria=MatchingCriteria.PARTIAL,
    embedding_type="ollama",
    embedding_config={"model": "bge-m3"},
    similarity_threshold=0.8,  # Embedding similarity threshold
    rouge_threshold=0.8  # ROUGE score threshold
)

# Calculate individual metrics
hit_rate = evaluator.calculate_hit_rate(k=2)
mrr = evaluator.calculate_mrr(k=2)
recall = evaluator.calculate_recall(k=2)
precision = evaluator.calculate_precision(k=2)
f1_score = evaluator.calculate_f1_score(k=2)
map_score = evaluator.calculate_map(k=2)
ndcg = evaluator.calculate_ndcg(k=2)

print(f"Hit Rate @2: {hit_rate}")
print(f"MRR @2: {mrr}")
print(f"Recall @2: {recall}")
print(f"Precision @2: {precision}")
print(f"F1 Score @2: {f1_score}")
print(f"MAP @2: {map_score}")
print(f"NDCG @2: {ndcg}")

# Visualize the evaluation results
evaluator.visualize_results(k=2)

Metric Values

Hit Rate @2: {'hit_rate': 1.0}
MRR @2: {'mrr': 0.5}
Recall @2: {'micro_recall': 1.0}
Precision @2: {'micro_precision': 0.5}
F1 Score @2: {'micro_f1': 0.6666666666666666}
MAP @2: {'map': 0.25}
NDCG @2: {'ndcg': 0.6309297535714575}

Visualization Result

(Image: comparative bar chart of the metrics above)
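
The embedding pre-filter keeps only document pairs whose cosine similarity clears `similarity_threshold` before running the more expensive ROUGE comparison. A minimal sketch of that filter step (the vectors here are hand-made placeholders, not real embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def passes_filter(a, b, similarity_threshold=0.7):
    """True if the pair survives the embedding pre-filter."""
    return cosine_similarity(a, b) >= similarity_threshold

print(passes_filter([1.0, 0.0], [1.0, 0.1]))  # True  -- nearly the same direction
print(passes_filter([1.0, 0.0], [0.0, 1.0]))  # False -- orthogonal vectors
```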

Important Notes

  1. Required packages for embedding models:

    • HuggingFace: pip install langchain-huggingface
    • OpenAI: pip install langchain-openai
    • Ollama: pip install langchain-ollama
  2. OpenAI embeddings require an API key:

    import os
    os.environ["OPENAI_API_KEY"] = "your-api-key"
    

License

MIT License

Contact

Questions: ontofinances@gmail.com
