Korean-optimized RAG evaluation toolkit based on ranx with Kiwi tokenizer and Korean language support

ranx-k: Korean-optimized ranx IR Evaluation Toolkit 🇰🇷

ranx-k is a Korean-optimized Information Retrieval (IR) evaluation toolkit that extends the ranx library with Kiwi tokenizer and Korean embeddings. It provides accurate evaluation for RAG (Retrieval-Augmented Generation) systems.

🚀 Key Features

Korean-optimized: Accurate tokenization using Kiwi morphological analyzer
ranx-based: Supports proven IR evaluation metrics (Hit@K, NDCG@K, MRR, MAP@K, etc.)
LangChain compatible: Supports LangChain retriever interface standards
Multiple evaluation methods: ROUGE, embedding similarity, semantic similarity-based evaluation
Graded relevance support: Use similarity scores as relevance grades for NDCG calculation
Configurable ROUGE types: Choose between ROUGE-1, ROUGE-2, and ROUGE-L
Strict threshold enforcement: Documents below similarity threshold are correctly treated as retrieval failures
Practical design: Supports step-by-step evaluation from prototype to production
High performance: 30-80% improvement in Korean evaluation accuracy over existing methods that use an English tokenizer
Bilingual output: English-Korean output support for international accessibility

📦 Installation

pip install ranx-k

Or install with the development extras:

pip install "ranx-k[dev]"

🔗 Retriever Compatibility

ranx-k supports the LangChain retriever interface:

from typing import List

from langchain.schema import Document

# Retriever must implement the invoke() method
class YourRetriever:
    def invoke(self, query: str) -> List[Document]:
        # Return a list of Document objects (each requires a page_content attribute)
        raise NotImplementedError

# LangChain Document usage example
doc = Document(page_content="Text content")

Note: LangChain is distributed under the MIT License. See documentation for details.
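
For local experiments, any object with a compatible invoke() method should be usable as a retriever. The sketch below is a toy keyword-overlap retriever, not part of ranx-k: `SimpleDocument` and `KeywordRetriever` are hypothetical names introduced here, and `SimpleDocument` only stands in for LangChain's `Document` by exposing the same `page_content` attribute.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for LangChain's Document (page_content only)."""
    page_content: str


class KeywordRetriever:
    """Toy retriever: ranks documents by how many query words they contain."""

    def __init__(self, texts: List[str], k: int = 5):
        self.docs = [SimpleDocument(page_content=t) for t in texts]
        self.k = k

    def invoke(self, query: str) -> List[SimpleDocument]:
        words = set(query.split())
        scored = [(len(words & set(d.page_content.split())), d) for d in self.docs]
        # Keep only documents sharing at least one word, best match first.
        hits = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
        return [doc for _, doc in hits[: self.k]]


retriever = KeywordRetriever([
    "서울은 대한민국의 수도 입니다",
    "부산은 항구 도시 입니다",
    "김치는 한국의 전통 음식 입니다",
])
results = retriever.invoke("대한민국의 수도")
print([d.page_content for d in results])
```

Whitespace matching like this misses inflected forms (for example "한국의" versus "대한민국의"), which is the kind of gap morphological tokenization is meant to close.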

🔧 Quick Start

Basic Usage

from ranx_k.evaluation import simple_kiwi_rouge_evaluation

# Simple Kiwi ROUGE evaluation
results = simple_kiwi_rouge_evaluation(
    retriever=your_retriever,
    questions=your_questions,
    reference_contexts=your_reference_contexts,
    k=5
)

print(f"ROUGE-1: {results['kiwi_rouge1@5']:.3f}")
print(f"ROUGE-2: {results['kiwi_rouge2@5']:.3f}")
print(f"ROUGE-L: {results['kiwi_rougeL@5']:.3f}")

Enhanced Evaluation (Rouge Score + Kiwi)

from ranx_k.evaluation import rouge_kiwi_enhanced_evaluation

# Proven rouge_score library + Kiwi tokenizer
results = rouge_kiwi_enhanced_evaluation(
    retriever=your_retriever,
    questions=your_questions,
    reference_contexts=your_reference_contexts,
    k=5,
    tokenize_method='morphs',  # 'morphs' or 'nouns'
    use_stopwords=True
)
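
The `tokenize_method` and `use_stopwords` options can be pictured as a filter over analyzed morphemes. The sketch below is an assumption about the general idea, not ranx-k's implementation: the (form, tag) pairs are hand-written to imitate a morphological analyzer, and `select_tokens` and `STOPWORDS` are hypothetical names.

```python
from typing import List, Set, Tuple

# Hand-written (form, POS tag) pairs for "고양이가 소파에서 잔다".
# They imitate a morphological analyzer's output; this is not real Kiwi output.
ANALYZED: List[Tuple[str, str]] = [
    ("고양이", "NNG"), ("가", "JKS"),
    ("소파", "NNG"), ("에서", "JKB"),
    ("자", "VV"), ("ㄴ다", "EF"),
]

# Hypothetical stopword list (particles and endings).
STOPWORDS: Set[str] = {"가", "에서", "ㄴ다"}


def select_tokens(analyzed: List[Tuple[str, str]],
                  method: str = "morphs",
                  use_stopwords: bool = True) -> List[str]:
    """Pick token forms by method, then optionally drop stopwords."""
    if method == "nouns":
        # Keep only noun tags (NNG, NNP, ...).
        forms = [form for form, tag in analyzed if tag.startswith("NN")]
    elif method == "morphs":
        forms = [form for form, _ in analyzed]
    else:
        raise ValueError(f"unknown method: {method}")
    if use_stopwords:
        forms = [f for f in forms if f not in STOPWORDS]
    return forms


morph_tokens = select_tokens(ANALYZED, "morphs", use_stopwords=True)
noun_tokens = select_tokens(ANALYZED, "nouns", use_stopwords=False)
print(morph_tokens)
print(noun_tokens)
```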

Semantic Similarity-based ranx Evaluation

from ranx_k.evaluation import evaluate_with_ranx_similarity

# Reference-based evaluation (recommended for accurate recall)
results = evaluate_with_ranx_similarity(
    retriever=your_retriever,
    questions=your_questions,
    reference_contexts=your_reference_contexts,
    k=5,
    method='embedding',
    similarity_threshold=0.6,
    use_graded_relevance=False,        # Binary relevance (default)
    evaluation_mode='reference_based'  # Evaluates against all reference docs
)

print(f"Hit@5: {results['hit_rate@5']:.3f}")
print(f"NDCG@5: {results['ndcg@5']:.3f}")
print(f"MRR: {results['mrr']:.3f}")
print(f"MAP@5: {results['map@5']:.3f}")
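
To make the role of `similarity_threshold` concrete, the sketch below turns per-document similarity scores into binary relevance and computes Hit@K and MRR by hand. It is a simplified illustration with made-up scores and hypothetical helper names; ranx-k delegates the actual metric computation to ranx.

```python
from typing import List


def binary_relevance(similarities: List[float], threshold: float) -> List[int]:
    """Mark a retrieved document relevant when its similarity reaches the threshold."""
    return [1 if s >= threshold else 0 for s in similarities]


def hit_at_k(rels: List[int], k: int) -> float:
    """1.0 if any of the top-k documents is relevant, else 0.0."""
    return 1.0 if any(rels[:k]) else 0.0


def reciprocal_rank(rels: List[int]) -> float:
    """1 / rank of the first relevant document, or 0.0 if there is none."""
    for rank, rel in enumerate(rels, start=1):
        if rel:
            return 1.0 / rank
    return 0.0


# Made-up similarity scores of retrieved documents, in rank order, per query.
per_query_similarities = [
    [0.42, 0.81, 0.55],  # first relevant document at rank 2
    [0.30, 0.20, 0.58],  # nothing reaches 0.6: counted as a retrieval failure
]
threshold = 0.6

rels = [binary_relevance(sims, threshold) for sims in per_query_similarities]
hit_rate = sum(hit_at_k(r, 3) for r in rels) / len(rels)
mrr = sum(reciprocal_rank(r) for r in rels) / len(rels)
print(f"Hit@3: {hit_rate:.3f}, MRR: {mrr:.3f}")
```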

Using Different Embedding Models

# OpenAI embedding model (requires API key)
results = evaluate_with_ranx_similarity(
    retriever=your_retriever,
    questions=your_questions,
    reference_contexts=your_reference_contexts,
    k=5,
    method='openai',
    similarity_threshold=0.7,
    embedding_model="text-embedding-3-small"
)

# Latest BGE-M3 model (excellent for Korean)
results = evaluate_with_ranx_similarity(
    retriever=your_retriever,
    questions=your_questions,
    reference_contexts=your_reference_contexts,
    k=5,
    method='embedding',
    similarity_threshold=0.6,
    embedding_model="BAAI/bge-m3"
)

# Korean-specialized Kiwi ROUGE method with configurable ROUGE types
results = evaluate_with_ranx_similarity(
    retriever=your_retriever,
    questions=your_questions,
    reference_contexts=your_reference_contexts,
    k=5,
    method='kiwi_rouge',
    similarity_threshold=0.3,  # Lower threshold recommended for Kiwi ROUGE
    rouge_type='rougeL',       # Choose 'rouge1', 'rouge2', or 'rougeL'
    tokenize_method='morphs',  # Choose 'morphs' or 'nouns'
    use_stopwords=True         # Configure stopword filtering
)

Comprehensive Evaluation

from ranx_k.evaluation import comprehensive_evaluation_comparison

# Compare all evaluation methods
comparison = comprehensive_evaluation_comparison(
    retriever=your_retriever,
    questions=your_questions,
    reference_contexts=your_reference_contexts,
    k=5
)

📊 Evaluation Methods

1. Kiwi ROUGE Evaluation

Advantages: Fast, easy to interpret
Use case: Prototyping, quick feedback

2. Enhanced ROUGE (Rouge Score + Kiwi)

Advantages: Proven library, stability
Use case: Production environment, reliability-critical evaluation

3. Semantic Similarity-based ranx

Advantages: Traditional IR metrics, semantic similarity
Use case: Research, benchmarking, detailed analysis

🎯 Performance Improvement Examples

# Existing method (English tokenizer)
basic_rouge1 = 0.234

# ranx-k (Kiwi tokenizer)
ranxk_rouge1 = 0.421  # +79.9% improvement!
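
Why tokenization matters can be seen on a two-word sentence pair. The sketch below computes a plain unigram-overlap F1 (a simplified ROUGE-1) for whitespace tokens and for hand-written morpheme tokens. The morpheme lists are illustrative assumptions rather than real Kiwi output, and the numbers are unrelated to the figures above.

```python
from collections import Counter
from typing import List


def rouge1_f1(candidate: List[str], reference: List[str]) -> float:
    """Unigram-overlap F1 between two token lists."""
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


# Whitespace tokens: the particle stays glued to the noun, so "고양이가" != "고양이는".
ws_candidate = "고양이가 잔다".split()
ws_reference = "고양이는 잔다".split()

# Hand-written morpheme tokens (assumed analyzer output): the shared stem is exposed.
morph_candidate = ["고양이", "가", "자", "ㄴ다"]
morph_reference = ["고양이", "는", "자", "ㄴ다"]

ws_score = rouge1_f1(ws_candidate, ws_reference)
morph_score = rouge1_f1(morph_candidate, morph_reference)
print(f"whitespace: {ws_score:.2f}, morphemes: {morph_score:.2f}")
```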

📊 Recommended Embedding Models

Model	Use Case	Threshold	Features
`paraphrase-multilingual-MiniLM-L12-v2`	Default	0.6	Fast, lightweight
`text-embedding-3-small` (OpenAI)	Accuracy	0.7	High accuracy, cost-effective
`BAAI/bge-m3`	Korean	0.6	Latest, excellent multilingual
`text-embedding-3-large` (OpenAI)	Premium	0.8	Highest performance
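
The thresholds in the table are compared against a similarity score between embeddings. Assuming that score is cosine similarity (an assumption; the source does not name the measure), the comparison can be sketched with toy vectors:

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings"; real models produce hundreds of dimensions.
reference_vec = [1.0, 0.0, 1.0]
close_vec = [0.9, 0.1, 0.8]
far_vec = [0.0, 1.0, 0.1]

threshold = 0.6
close_sim = cosine_similarity(reference_vec, close_vec)
far_sim = cosine_similarity(reference_vec, far_vec)
print(f"close: {close_sim:.3f} relevant={close_sim >= threshold}")
print(f"far:   {far_sim:.3f} relevant={far_sim >= threshold}")
```

Because score distributions differ between models, a threshold tuned for one model generally should not be reused for another without checking.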

📈 Score Interpretation Guide

Score Range	Assessment	Recommended Action
0.7+	🟢 Excellent	Maintain current settings
0.5-0.7	🟡 Good	Consider fine-tuning
0.3-0.5	🟠 Average	Improvement needed
Below 0.3	🔴 Poor	Major revision required
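
The guide above can be encoded as a small helper for reports. `interpret_score` is a hypothetical function, and assigning each boundary value (0.7, 0.5, 0.3) to the higher band is an assumption, since the table does not say which side a boundary belongs to.

```python
def interpret_score(score: float) -> str:
    """Map a metric score to the assessment bands of the guide above."""
    if score >= 0.7:
        return "Excellent"
    if score >= 0.5:
        return "Good"
    if score >= 0.3:
        return "Average"
    return "Poor"


for value in (0.82, 0.55, 0.31, 0.12):
    print(f"{value:.2f}: {interpret_score(value)}")
```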

🔍 Advanced Usage

Graded Relevance Mode

# Graded relevance mode - uses similarity scores as relevance grades
results = evaluate_with_ranx_similarity(
    retriever=your_retriever,
    questions=questions,
    reference_contexts=references,
    k=5,
    method='embedding',
    similarity_threshold=0.6,
    use_graded_relevance=True   # Uses similarity scores as relevance grades
)

print(f"NDCG@5: {results['ndcg@5']:.3f}")

Note on Graded Relevance: The use_graded_relevance parameter primarily affects NDCG (Normalized Discounted Cumulative Gain) calculation. Other metrics like Hit@K, MRR, and MAP treat relevance as binary in the ranx library. Use graded relevance when you need to distinguish between different levels of document relevance quality.
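
A small worked example shows why only NDCG changes. The sketch below uses the linear-gain DCG formula and, as a simplification, builds the ideal ranking from the retrieved documents only, whereas a full evaluation would use all judged documents. The similarity scores are made up.

```python
import math
from typing import List


def dcg(gains: List[float]) -> float:
    """Discounted cumulative gain with linear gain: sum of gain / log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(gains: List[float]) -> float:
    """DCG normalized by the DCG of the same gains sorted best-first."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Retrieved documents in rank order; the better match is ranked second.
similarities = [0.65, 0.95]
threshold = 0.6

binary = [1.0 if s >= threshold else 0.0 for s in similarities]
graded = [s if s >= threshold else 0.0 for s in similarities]

ndcg_binary = ndcg(binary)
ndcg_graded = ndcg(graded)
print(f"binary NDCG: {ndcg_binary:.3f}, graded NDCG: {ndcg_graded:.3f}")
```

With binary relevance both documents count equally, so the ordering looks perfect; with graded relevance the ranking is penalized for placing the stronger match second.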

Custom Embedding Models

# Use custom embedding model
results = evaluate_with_ranx_similarity(
    retriever=your_retriever,
    questions=questions,
    reference_contexts=references,
    method='embedding',
    embedding_model="your-custom-model-name",
    similarity_threshold=0.6,
    use_graded_relevance=True
)

Configurable ROUGE Types

# Compare different ROUGE metrics
for rouge_type in ['rouge1', 'rouge2', 'rougeL']:
    results = evaluate_with_ranx_similarity(
        retriever=your_retriever,
        questions=questions,
        reference_contexts=references,
        k=5,
        method='kiwi_rouge',
        rouge_type=rouge_type,
        tokenize_method='morphs',
        similarity_threshold=0.3
    )
    print(f"{rouge_type.upper()}: Hit@5 = {results['hit_rate@5']:.3f}")

Threshold Sensitivity Analysis

# Analyze how different thresholds affect evaluation
thresholds = [0.3, 0.5, 0.7]
for threshold in thresholds:
    results = evaluate_with_ranx_similarity(
        retriever=your_retriever,
        questions=questions,
        reference_contexts=references,
        k=5,
        similarity_threshold=threshold
    )
    print(f"Threshold {threshold}: Hit@5={results['hit_rate@5']:.3f}, NDCG@5={results['ndcg@5']:.3f}")
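
The expected shape of such a sweep can be previewed without a retriever. In the sketch below the per-query similarity scores are made up; it only illustrates that the hit rate can never increase as the threshold rises.

```python
from typing import Dict, List

# Made-up similarity scores of the retrieved documents for three queries.
toy_similarities: List[List[float]] = [
    [0.72, 0.41],
    [0.55, 0.35],
    [0.32, 0.28],
]


def hit_rate_at_threshold(all_sims: List[List[float]], threshold: float) -> float:
    """Fraction of queries with at least one document at or above the threshold."""
    hits = sum(1 for sims in all_sims if any(s >= threshold for s in sims))
    return hits / len(all_sims)


hit_rates: Dict[float, float] = {
    t: hit_rate_at_threshold(toy_similarities, t) for t in (0.3, 0.5, 0.7)
}
for t, rate in hit_rates.items():
    print(f"Threshold {t}: hit rate = {rate:.3f}")
```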

📚 Examples

🤝 Contributing

Contributions are welcome! Please feel free to submit issues and pull requests.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Built on top of ranx by Elias Bassani
Korean morphological analysis powered by Kiwi
Embedding support via sentence-transformers

📞 Support

🐛 Issue Tracker: Please submit issues on GitHub
📧 Email: ontofinance@gmail.com

ranx-k - Empowering Korean RAG evaluation with precision and ease!

Release history

0.0.17: Aug 19, 2025
0.0.16: Aug 19, 2025
0.0.15: Aug 19, 2025
0.0.14: Aug 19, 2025
0.0.13: Aug 19, 2025 (this version)
0.0.12: Aug 19, 2025
0.0.11: Aug 19, 2025
0.0.10: Aug 19, 2025
0.0.9: Aug 19, 2025
0.0.8: Aug 4, 2025
0.0.7: Aug 4, 2025
0.0.6: Aug 4, 2025
0.0.5: Aug 4, 2025
0.0.4: Aug 4, 2025
0.0.3: Aug 4, 2025
0.0.2: Aug 4, 2025
0.0.1: Aug 4, 2025

Download files

Source Distribution

ranx_k-0.0.13.tar.gz (70.6 kB)

Uploaded Aug 19, 2025 Source

Built Distribution

ranx_k-0.0.13-py3-none-any.whl (84.8 kB)

Uploaded Aug 19, 2025 Python 3

File details

Details for the file ranx_k-0.0.13.tar.gz.

File metadata

Download URL: ranx_k-0.0.13.tar.gz
Upload date: Aug 19, 2025
Size: 70.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.2

File hashes

Hashes for ranx_k-0.0.13.tar.gz
Algorithm	Hash digest
SHA256	`31befaf0845d6df3d92bbb5616712929781efc36c1bb61faea0e1acb99ecac7f`
MD5	`0c4a2d1d994337770460f850b4166a80`
BLAKE2b-256	`08a343378b23746856244e8b3050bf2da72a7c2961f44f4dd6c27ca431f6856c`

File details

Details for the file ranx_k-0.0.13-py3-none-any.whl.

File metadata

Download URL: ranx_k-0.0.13-py3-none-any.whl
Upload date: Aug 19, 2025
Size: 84.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.2

File hashes

Hashes for ranx_k-0.0.13-py3-none-any.whl
Algorithm	Hash digest
SHA256	`88ced5b0f857006ac1d0edda3378f0bbf63d0e49d66092c31802eedde79f0081`
MD5	`7c33c96d5f0b6f16224df1007e27c36a`
BLAKE2b-256	`69a340ee9c9a01e32d46e86220504646e9042b07a5c1d2db8a07003bd14b6947`
