
ranx-k: Korean-optimized ranx IR Evaluation Toolkit 🇰🇷


English | 한국어

ranx-k is a Korean-optimized Information Retrieval (IR) evaluation toolkit that extends the ranx library with the Kiwi tokenizer and Korean embedding models. It provides accurate evaluation for RAG (Retrieval-Augmented Generation) systems.

🚀 Key Features

  • Korean-optimized: Accurate tokenization using Kiwi morphological analyzer
  • ranx-based: Supports proven IR evaluation metrics (Hit@K, NDCG@K, MRR, MAP@K, etc.)
  • LangChain compatible: Supports LangChain retriever interface standards
  • Multiple evaluation methods: ROUGE, embedding similarity, semantic similarity-based evaluation
  • Graded relevance support: NEW in v0.0.9 - Use similarity scores as relevance grades instead of binary 1/0
  • Configurable ROUGE types: Choose between ROUGE-1, ROUGE-2, and ROUGE-L
  • Strict threshold enforcement: Documents below similarity threshold are correctly treated as retrieval failures
  • Practical design: Supports step-by-step evaluation from prototype to production
  • High performance: 30-80% improvement in Korean evaluation accuracy over English-tokenizer baselines (see the example below)
  • Bilingual output: English and Korean output for international accessibility

📦 Installation

pip install ranx-k

Or install with development dependencies:

pip install "ranx-k[dev]"

🔗 Retriever Compatibility

ranx-k supports the LangChain retriever interface:

from typing import List

from langchain.schema import Document

# A retriever must implement an invoke() method
class YourRetriever:
    def invoke(self, query: str) -> List[Document]:
        # Return a list of Document objects (each needs a page_content attribute)
        return [Document(page_content="Text content")]

Note: LangChain is distributed under the MIT License. See documentation for details.
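For quick experiments without a real vector store, any object exposing this invoke() signature will do. The toy corpus and word-overlap scoring below are illustrative assumptions, not part of ranx-k or LangChain:

from typing import List

from langchain.schema import Document

class ToyKeywordRetriever:
    """Illustrative retriever: ranks an in-memory corpus by word overlap with the query."""

    def __init__(self, corpus: List[str]):
        self.corpus = corpus

    def invoke(self, query: str) -> List[Document]:
        query_words = set(query.split())
        # Rank documents by the number of words they share with the query
        ranked = sorted(self.corpus,
                        key=lambda text: len(query_words & set(text.split())),
                        reverse=True)
        return [Document(page_content=text) for text in ranked]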

🔧 Quick Start

Basic Usage

from ranx_k.evaluation import simple_kiwi_rouge_evaluation

# Simple Kiwi ROUGE evaluation
results = simple_kiwi_rouge_evaluation(
    retriever=your_retriever,
    questions=your_questions,
    reference_contexts=your_reference_contexts,
    k=5
)

print(f"ROUGE-1: {results['kiwi_rouge1@5']:.3f}")
print(f"ROUGE-2: {results['kiwi_rouge2@5']:.3f}")
print(f"ROUGE-L: {results['kiwi_rougeL@5']:.3f}")

Enhanced Evaluation (Rouge Score + Kiwi)

from ranx_k.evaluation import rouge_kiwi_enhanced_evaluation

# Proven rouge_score library + Kiwi tokenizer
results = rouge_kiwi_enhanced_evaluation(
    retriever=your_retriever,
    questions=your_questions,
    reference_contexts=your_reference_contexts,
    k=5,
    tokenize_method='morphs',  # 'morphs' or 'nouns'
    use_stopwords=True
)

Semantic Similarity-based ranx Evaluation

from ranx_k.evaluation import evaluate_with_ranx_similarity

# Reference-based evaluation (recommended for accurate recall)
results = evaluate_with_ranx_similarity(
    retriever=your_retriever,
    questions=your_questions,
    reference_contexts=your_reference_contexts,
    k=5,
    method='embedding',
    similarity_threshold=0.6,
    use_graded_relevance=False,        # NEW: Binary relevance (default)
    evaluation_mode='reference_based'  # Evaluates against all reference docs
)

print(f"Hit@5: {results['hit_rate@5']:.3f}")
print(f"NDCG@5: {results['ndcg@5']:.3f}")
print(f"MRR: {results['mrr']:.3f}")
print(f"MAP@5: {results['map@5']:.3f}")

# NEW: Graded relevance - uses similarity scores as relevance grades
results_graded = evaluate_with_ranx_similarity(
    retriever=your_retriever,
    questions=your_questions,
    reference_contexts=your_reference_contexts,
    k=5,
    method='embedding',
    similarity_threshold=0.6,
    use_graded_relevance=True,         # Use similarity scores as grades
    evaluation_mode='reference_based'
)

print(f"Graded NDCG@5: {results_graded['ndcg@5']:.3f}")

Using Different Embedding Models

# OpenAI embedding model (requires API key)
results = evaluate_with_ranx_similarity(
    retriever=your_retriever,
    questions=your_questions,
    reference_contexts=your_reference_contexts,
    k=5,
    method='openai',
    similarity_threshold=0.7,
    embedding_model="text-embedding-3-small"
)

# Latest BGE-M3 model (excellent for Korean)
results = evaluate_with_ranx_similarity(
    retriever=your_retriever,
    questions=your_questions,
    reference_contexts=your_reference_contexts,
    k=5,
    method='embedding',
    similarity_threshold=0.6,
    embedding_model="BAAI/bge-m3"
)

# Korean-specialized Kiwi ROUGE method with configurable ROUGE types (NEW in v0.0.9)
results = evaluate_with_ranx_similarity(
    retriever=your_retriever,
    questions=your_questions,
    reference_contexts=your_reference_contexts,
    k=5,
    method='kiwi_rouge',
    similarity_threshold=0.3,  # Lower threshold recommended for Kiwi ROUGE
    rouge_type='rougeL',      # NEW: Choose 'rouge1', 'rouge2', or 'rougeL'
    tokenize_method='morphs', # NEW: Choose 'morphs' or 'nouns'  
    use_stopwords=True        # NEW: Configure stopword filtering
)

Comprehensive Evaluation

from ranx_k.evaluation import comprehensive_evaluation_comparison

# Compare all evaluation methods
comparison = comprehensive_evaluation_comparison(
    retriever=your_retriever,
    questions=your_questions,
    reference_contexts=your_reference_contexts,
    k=5
)
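The exact shape of the returned comparison object isn't shown here; assuming it behaves like a mapping from method name to a metric dict (an assumption; check the ranx-k docs), a summary printout could look like:

# Assumption: `comparison` maps method names to {metric: score} dicts.
for method_name, metrics in comparison.items():
    print(f"[{method_name}]")
    for metric, score in metrics.items():
        print(f"  {metric}: {score:.3f}")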

📊 Evaluation Methods

1. Kiwi ROUGE Evaluation

  • Advantages: Fast speed, intuitive interpretation
  • Use case: Prototyping, quick feedback

2. Enhanced ROUGE (Rouge Score + Kiwi)

  • Advantages: Proven library, stability
  • Use case: Production environment, reliability-critical evaluation

3. Semantic Similarity-based ranx

  • Advantages: Traditional IR metrics, semantic similarity
  • Use case: Research, benchmarking, detailed analysis

🎯 Performance Improvement Examples

# Existing method (English tokenizer)
basic_rouge1 = 0.234

# ranx-k (Kiwi tokenizer)
ranxk_rouge1 = 0.421  # +79.9% improvement!
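The quoted gain follows directly from those two numbers:

basic_rouge1 = 0.234   # English-tokenizer ROUGE-1 (from above)
ranxk_rouge1 = 0.421   # Kiwi-tokenizer ROUGE-1 (from above)
improvement = (ranxk_rouge1 - basic_rouge1) / basic_rouge1 * 100
print(f"+{improvement:.1f}%")  # +79.9%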

📊 Recommended Embedding Models

Model | Use Case | Threshold | Features
paraphrase-multilingual-MiniLM-L12-v2 | Default | 0.6 | Fast, lightweight
text-embedding-3-small (OpenAI) | Accuracy | 0.7 | High accuracy, cost-effective
BAAI/bge-m3 | Korean | 0.6 | Latest, excellent multilingual
text-embedding-3-large (OpenAI) | Premium | 0.8 | Highest performance

📈 Score Interpretation Guide

Score Range | Assessment | Recommended Action
0.7+ | 🟢 Excellent | Maintain current settings
0.5~0.7 | 🟡 Good | Consider fine-tuning
0.3~0.5 | 🟠 Average | Improvement needed
below 0.3 | 🔴 Poor | Major revision required
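If you prefer the guide in code form, a simple bucketing helper (a sketch, not part of ranx-k) could be:

def interpret_score(score: float) -> str:
    """Map a metric score to the assessment bands in the table above."""
    if score >= 0.7:
        return "🟢 Excellent: maintain current settings"
    if score >= 0.5:
        return "🟡 Good: consider fine-tuning"
    if score >= 0.3:
        return "🟠 Average: improvement needed"
    return "🔴 Poor: major revision required"

print(interpret_score(0.62))  # 🟡 Good: consider fine-tuning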

🔍 Advanced Usage

Graded vs Binary Relevance Comparison

# Compare binary and graded relevance
binary_results = evaluate_with_ranx_similarity(
    retriever=your_retriever,
    questions=questions,
    reference_contexts=references,
    method='embedding',
    similarity_threshold=0.6,
    use_graded_relevance=False  # Binary: 1.0 for all relevant docs
)

graded_results = evaluate_with_ranx_similarity(
    retriever=your_retriever,
    questions=questions,
    reference_contexts=references,
    method='embedding',
    similarity_threshold=0.6,
    use_graded_relevance=True   # Graded: similarity scores as relevance grades
)

print(f"Binary NDCG@5: {binary_results['ndcg@5']:.3f}")
print(f"Graded NDCG@5: {graded_results['ndcg@5']:.3f}")

Custom Embedding Models

# Use custom embedding model
results = evaluate_with_ranx_similarity(
    retriever=your_retriever,
    questions=questions,
    reference_contexts=references,
    method='embedding',
    embedding_model="your-custom-model-name",
    similarity_threshold=0.6,
    use_graded_relevance=True
)

Configurable ROUGE Types

# Compare different ROUGE metrics with graded relevance
for rouge_type in ['rouge1', 'rouge2', 'rougeL']:
    results = evaluate_with_ranx_similarity(
        retriever=your_retriever,
        questions=questions,
        reference_contexts=references,
        method='kiwi_rouge',
        rouge_type=rouge_type,
        tokenize_method='morphs',
        similarity_threshold=0.3,
        use_graded_relevance=True  # Use ROUGE scores as relevance grades
    )
    print(f"{rouge_type.upper()}: Hit@5 = {results['hit_rate@5']:.3f}")

Threshold Sensitivity Analysis

# Analyze how different thresholds affect graded vs binary relevance
thresholds = [0.3, 0.5, 0.7]
for threshold in thresholds:
    binary = evaluate_with_ranx_similarity(
        retriever=your_retriever,
        questions=questions,
        reference_contexts=references,
        similarity_threshold=threshold,
        use_graded_relevance=False
    )
    graded = evaluate_with_ranx_similarity(
        retriever=your_retriever,
        questions=questions,
        reference_contexts=references,
        similarity_threshold=threshold,
        use_graded_relevance=True
    )
    print(f"Threshold {threshold}: Binary={binary['hit_rate@5']:.3f}, Graded={graded['hit_rate@5']:.3f}")


🤝 Contributing

Contributions are welcome! Please read our Contributing Guide for details.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.



ranx-k - Empowering Korean RAG evaluation with precision and ease!
