# ranx-k: Korean-optimized ranx IR Evaluation Toolkit 🇰🇷
ranx-k is a Korean-optimized Information Retrieval (IR) evaluation toolkit that extends the ranx library with Kiwi tokenizer and Korean embeddings. It provides accurate evaluation for RAG (Retrieval-Augmented Generation) systems.
## 🚀 Key Features
- Korean-optimized: Accurate tokenization using Kiwi morphological analyzer
- ranx-based: Supports proven IR evaluation metrics (Hit@K, NDCG@K, MRR, MAP@K, etc.)
- LangChain compatible: Supports LangChain retriever interface standards
- Multiple evaluation methods: ROUGE, embedding similarity, semantic similarity-based evaluation
- Graded relevance support: Use similarity scores as relevance grades for NDCG calculation
- Configurable ROUGE types: Choose between ROUGE-1, ROUGE-2, and ROUGE-L
- Strict threshold enforcement: Documents below the similarity threshold are correctly treated as retrieval failures (see the sketch after this list)
- Practical design: Supports step-by-step evaluation from prototype to production
- High performance: 30-80% improvement in Korean evaluation accuracy over existing methods
- Bilingual output: English-Korean output support for international accessibility
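
The graded relevance and strict threshold enforcement points above can be pictured with a minimal sketch. This is illustrative only, with invented names, not ranx-k's internal code:

```python
# Illustrative sketch (not ranx-k internals): similarities below the threshold
# contribute no relevance at all (a retrieval failure), while scores at or
# above it become either a binary label or the raw similarity grade.
def similarities_to_relevance(similarities: dict, threshold: float = 0.6,
                              graded: bool = False) -> dict:
    return {
        doc_id: (score if graded else 1.0)
        for doc_id, score in similarities.items()
        if score >= threshold
    }

# doc_2 sits below the 0.6 threshold, so it is dropped entirely.
print(similarities_to_relevance({"doc_1": 0.82, "doc_2": 0.41}, graded=True))
# -> {'doc_1': 0.82}
```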
## 📦 Installation

```bash
pip install ranx-k
```

Or install with development dependencies:

```bash
pip install "ranx-k[dev]"
```
## 🔗 Retriever Compatibility

ranx-k supports the LangChain retriever interface:

```python
from typing import List

from langchain.schema import Document

# A retriever must implement an invoke() method
class YourRetriever:
    def invoke(self, query: str) -> List[Document]:
        # Return a list of Document objects (each needs a page_content attribute)
        pass

# LangChain Document usage example
doc = Document(page_content="Text content")
```

Note: LangChain is distributed under the MIT License. See its documentation for details.
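
For a runnable starting point, here is a hypothetical in-memory retriever that satisfies this interface; the keyword-overlap scoring is purely illustrative, not a recommended retrieval strategy:

```python
from typing import List

from langchain.schema import Document

class KeywordRetriever:
    """Toy retriever: ranks documents by query-token overlap (illustrative only)."""

    def __init__(self, corpus: List[str]):
        self.docs = [Document(page_content=text) for text in corpus]

    def invoke(self, query: str) -> List[Document]:
        tokens = query.split()
        scored = [(sum(tok in doc.page_content for tok in tokens), doc)
                  for doc in self.docs]
        # Highest-overlap documents first; drop documents with no overlap.
        return [doc for score, doc in sorted(scored, key=lambda pair: -pair[0])
                if score > 0]

retriever = KeywordRetriever(["김치는 한국의 전통 발효 음식입니다.",
                              "서울은 한국의 수도입니다."])
print(retriever.invoke("한국 김치")[0].page_content)
```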
## 🔧 Quick Start

### Basic Usage

```python
from ranx_k.evaluation import simple_kiwi_rouge_evaluation

# Simple Kiwi ROUGE evaluation
results = simple_kiwi_rouge_evaluation(
    retriever=your_retriever,
    questions=your_questions,
    reference_contexts=your_reference_contexts,
    k=5
)

print(f"ROUGE-1: {results['kiwi_rouge1@5']:.3f}")
print(f"ROUGE-2: {results['kiwi_rouge2@5']:.3f}")
print(f"ROUGE-L: {results['kiwi_rougeL@5']:.3f}")
```
### Enhanced Evaluation (Rouge Score + Kiwi)

```python
from ranx_k.evaluation import rouge_kiwi_enhanced_evaluation

# Proven rouge_score library + Kiwi tokenizer
results = rouge_kiwi_enhanced_evaluation(
    retriever=your_retriever,
    questions=your_questions,
    reference_contexts=your_reference_contexts,
    k=5,
    tokenize_method='morphs',  # 'morphs' or 'nouns'
    use_stopwords=True
)
```
### Semantic Similarity-based ranx Evaluation

```python
from ranx_k.evaluation import evaluate_with_ranx_similarity

# Reference-based evaluation (recommended for accurate recall)
results = evaluate_with_ranx_similarity(
    retriever=your_retriever,
    questions=your_questions,
    reference_contexts=your_reference_contexts,
    k=5,
    method='embedding',
    similarity_threshold=0.6,
    use_graded_relevance=False,  # Binary relevance (default)
    evaluation_mode='reference_based'  # Evaluates against all reference docs
)

print(f"Hit@5: {results['hit_rate@5']:.3f}")
print(f"NDCG@5: {results['ndcg@5']:.3f}")
print(f"MRR: {results['mrr']:.3f}")
print(f"MAP@5: {results['map@5']:.3f}")
```
### Using Different Embedding Models

```python
# OpenAI embedding model (requires API key)
results = evaluate_with_ranx_similarity(
    retriever=your_retriever,
    questions=your_questions,
    reference_contexts=your_reference_contexts,
    k=5,
    method='openai',
    similarity_threshold=0.7,
    embedding_model="text-embedding-3-small"
)

# Latest BGE-M3 model (excellent for Korean)
results = evaluate_with_ranx_similarity(
    retriever=your_retriever,
    questions=your_questions,
    reference_contexts=your_reference_contexts,
    k=5,
    method='embedding',
    similarity_threshold=0.6,
    embedding_model="BAAI/bge-m3"
)

# Korean-specialized Kiwi ROUGE method with configurable ROUGE types
results = evaluate_with_ranx_similarity(
    retriever=your_retriever,
    questions=your_questions,
    reference_contexts=your_reference_contexts,
    k=5,
    method='kiwi_rouge',
    similarity_threshold=0.3,  # Lower threshold recommended for Kiwi ROUGE
    rouge_type='rougeL',  # Choose 'rouge1', 'rouge2', or 'rougeL'
    tokenize_method='morphs',  # Choose 'morphs' or 'nouns'
    use_stopwords=True  # Configure stopword filtering
)
```
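
To compare several candidate models on the same data, a plain loop over (model, threshold) pairs is enough; the pairs below mirror the recommendation table later in this README:

```python
from ranx_k.evaluation import evaluate_with_ranx_similarity

# Compare candidate embedding models on the same questions/references.
candidates = [
    ("paraphrase-multilingual-MiniLM-L12-v2", 0.6),
    ("BAAI/bge-m3", 0.6),
]

for model_name, threshold in candidates:
    results = evaluate_with_ranx_similarity(
        retriever=your_retriever,
        questions=your_questions,
        reference_contexts=your_reference_contexts,
        k=5,
        method='embedding',
        embedding_model=model_name,
        similarity_threshold=threshold
    )
    print(f"{model_name}: NDCG@5={results['ndcg@5']:.3f}")
```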
### Comprehensive Evaluation

```python
from ranx_k.evaluation import comprehensive_evaluation_comparison

# Compare all evaluation methods
comparison = comprehensive_evaluation_comparison(
    retriever=your_retriever,
    questions=your_questions,
    reference_contexts=your_reference_contexts,
    k=5
)
```
## 📊 Evaluation Methods

ranx-k ships three complementary methods; a stage-based selection helper is sketched after this list.

1. Kiwi ROUGE Evaluation
   - Advantages: fast, intuitive interpretation
   - Use case: prototyping, quick feedback
2. Enhanced ROUGE (Rouge Score + Kiwi)
   - Advantages: proven library, stability
   - Use case: production environments, reliability-critical evaluation
3. Semantic Similarity-based ranx
   - Advantages: traditional IR metrics, semantic similarity
   - Use case: research, benchmarking, detailed analysis
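
One way to operationalize this guidance is a small dispatcher keyed on development stage. `evaluate_by_stage` and its stage names are this README's own sketch, not part of the ranx-k API:

```python
from ranx_k.evaluation import (
    simple_kiwi_rouge_evaluation,
    rouge_kiwi_enhanced_evaluation,
    evaluate_with_ranx_similarity,
)

def evaluate_by_stage(stage, retriever, questions, references, k=5):
    # 'prototype' -> fast Kiwi ROUGE feedback
    if stage == 'prototype':
        return simple_kiwi_rouge_evaluation(
            retriever=retriever, questions=questions,
            reference_contexts=references, k=k)
    # 'production' -> proven rouge_score backend with Kiwi tokenization
    if stage == 'production':
        return rouge_kiwi_enhanced_evaluation(
            retriever=retriever, questions=questions,
            reference_contexts=references, k=k)
    # anything else -> full IR metrics via semantic similarity + ranx
    return evaluate_with_ranx_similarity(
        retriever=retriever, questions=questions,
        reference_contexts=references, k=k)
```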
## 🎯 Performance Improvement Examples

```python
# Existing method (English tokenizer)
basic_rouge1 = 0.234

# ranx-k (Kiwi tokenizer)
ranxk_rouge1 = 0.421  # +79.9% improvement!
```
## 📊 Recommended Embedding Models

| Model | Use Case | Threshold | Features |
|---|---|---|---|
| paraphrase-multilingual-MiniLM-L12-v2 | Default | 0.6 | Fast, lightweight |
| text-embedding-3-small (OpenAI) | Accuracy | 0.7 | High accuracy, cost-effective |
| BAAI/bge-m3 | Korean | 0.6 | Latest, excellent multilingual |
| text-embedding-3-large (OpenAI) | Premium | 0.8 | Highest performance |
## 📈 Score Interpretation Guide

| Score Range | Assessment | Recommended Action |
|---|---|---|
| 0.7 and above | 🟢 Excellent | Maintain current settings |
| 0.5–0.7 | 🟡 Good | Consider fine-tuning |
| 0.3–0.5 | 🟠 Average | Improvement needed |
| Below 0.3 | 🔴 Poor | Major revision required |
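
If you want this table applied automatically, a tiny helper (hypothetical, not part of ranx-k) can translate a metric value into the assessment above:

```python
def interpret_score(score: float) -> str:
    """Map a metric value to the assessment bands in the table above."""
    if score >= 0.7:
        return "🟢 Excellent - maintain current settings"
    if score >= 0.5:
        return "🟡 Good - consider fine-tuning"
    if score >= 0.3:
        return "🟠 Average - improvement needed"
    return "🔴 Poor - major revision required"

print(interpret_score(0.421))  # 🟠 Average - improvement needed
```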
## 🔍 Advanced Usage

### Graded Relevance Mode

```python
# Graded relevance mode - uses similarity scores as relevance grades
results = evaluate_with_ranx_similarity(
    retriever=your_retriever,
    questions=questions,
    reference_contexts=references,
    method='embedding',
    similarity_threshold=0.6,
    use_graded_relevance=True
)

print(f"NDCG@5: {results['ndcg@5']:.3f}")
```
**Note on Graded Relevance:** The `use_graded_relevance` parameter primarily affects the NDCG (Normalized Discounted Cumulative Gain) calculation; other metrics such as Hit@K, MRR, and MAP treat relevance as binary in the ranx library. Use graded relevance when you need to distinguish between different levels of document relevance quality.
### Custom Embedding Models

```python
# Use a custom embedding model
results = evaluate_with_ranx_similarity(
    retriever=your_retriever,
    questions=questions,
    reference_contexts=references,
    method='embedding',
    embedding_model="your-custom-model-name",
    similarity_threshold=0.6,
    use_graded_relevance=True
)
```
### Configurable ROUGE Types

```python
# Compare different ROUGE metrics
for rouge_type in ['rouge1', 'rouge2', 'rougeL']:
    results = evaluate_with_ranx_similarity(
        retriever=your_retriever,
        questions=questions,
        reference_contexts=references,
        method='kiwi_rouge',
        rouge_type=rouge_type,
        tokenize_method='morphs',
        similarity_threshold=0.3
    )
    print(f"{rouge_type.upper()}: Hit@5 = {results['hit_rate@5']:.3f}")
```
### Threshold Sensitivity Analysis

```python
# Analyze how different thresholds affect evaluation
thresholds = [0.3, 0.5, 0.7]
for threshold in thresholds:
    results = evaluate_with_ranx_similarity(
        retriever=your_retriever,
        questions=questions,
        reference_contexts=references,
        similarity_threshold=threshold
    )
    print(f"Threshold {threshold}: Hit@5={results['hit_rate@5']:.3f}, NDCG@5={results['ndcg@5']:.3f}")
```
## 📚 Examples
- Basic Tokenizer Example
- BGE-M3 Evaluation Example
- Embedding Models Comparison
- Comprehensive Comparison
## 🤝 Contributing
Contributions are welcome! Please feel free to submit issues and pull requests.
## 📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
## 🙏 Acknowledgments
- Built on top of ranx by Elias Bassani
- Korean morphological analysis powered by Kiwi
- Embedding support via sentence-transformers
## 📞 Support
- 🐛 Issue Tracker: Please submit issues on GitHub
- 📧 Email: ontofinance@gmail.com
ranx-k - Empowering Korean RAG evaluation with precision and ease!
## Download files
### Source Distribution: ranx_k-0.0.16.tar.gz

- Size: 71.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.2
File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | 2e788d3a8ff49cd8b2e0f9bc31b8bef72c2ff0502c2cb961f4053557231d6ab6 |
| MD5 | ff0dd38398261652d9c189d8369163e7 |
| BLAKE2b-256 | f3f3aa523da9a80888079472ee3d43fcbecc6a510d4c683a9470808383530cfa |
### Built Distribution: ranx_k-0.0.16-py3-none-any.whl

- Size: 85.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.2
File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | 10c4663f22415e986b24bcd47a82ca621f17ae6fdfc7a86fd70a87d457a363f4 |
| MD5 | 8c9811f5490de6148c0ac77dcd982764 |
| BLAKE2b-256 | 918290c2f06f77f802200398b312ae3747cc52b56a042f07209daf9f4fc5859f |