BM25s VectorStore and KeywordSearch plugin for refinire-rag
refinire-rag-bm25s-j
BM25s VectorStore plugin for refinire-rag - A fast, efficient text search solution using the BM25s ranking algorithm.
Overview
This plugin exposes a BM25s index as a VectorStore subclass for refinire-rag. BM25s ranks documents by keyword statistics rather than by vector similarity, but it follows the same usage pattern (add documents, query, receive ranked results), so it is implemented as a VectorStore subclass for seamless integration.
Key Dependencies:
- refinire-rag - RAG framework integration
- bm25s-j - Fast BM25s implementation
Features
- Fast keyword-based retrieval - No embedding computation needed
- Metadata filtering - Advanced filtering with comparison operators (BM25s-j 0.2.0+)
- Index persistence - Save/load indexes for production use
- Memory efficient - Optimized for large document collections
- Deterministic results - Explainable and reproducible search
- Production ready - Comprehensive error handling and logging
Installation
# Basic installation
uv add refinire-rag-bm25s-j
# With refinire-rag integration (quoted so the shell does not expand the brackets)
uv add "refinire-rag-bm25s-j[rag]"
# For development
uv add "refinire-rag-bm25s-j[dev]"
Quick Start
from refinire_rag_bm25s_j import BM25sStore
from refinire_rag_bm25s_j.models import BM25sConfig
# Create configuration
config = BM25sConfig(
    k1=1.2,  # Term frequency saturation
    b=0.75,  # Length normalization
    epsilon=0.25,  # IDF cutoff
    index_path="./data/bm25s_index.pkl",
)
# Initialize store
store = BM25sStore(config=config)
# Add documents
documents = [
    "Python is a powerful programming language for data science.",
    "Machine learning algorithms can be implemented efficiently.",
    "BM25 is a ranking function used in information retrieval.",
]
store.add_texts(documents)
# Search
results = store.similarity_search("Python programming", k=2)
for doc in results:
    print(f"Score: {doc.metadata['score']:.3f}")
    print(f"Content: {doc.page_content}")
Configuration
BM25s Parameters
| Parameter | Description | Default | Recommended |
|---|---|---|---|
| k1 | Term frequency saturation | 1.2 | 1.2-1.5 |
| b | Length normalization | 0.75 | 0.7-0.8 |
| epsilon | IDF cutoff parameter | 0.25 | 0.1-0.3 |
| index_path | Index save/load path | None | Set for persistence |
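To make the roles of `k1` and `b` concrete, here is a minimal, library-independent sketch of the score that BM25 assigns to one query term in one document. It uses one common IDF variant (the `log(1 + ...)` form) and is not taken from bm25s-j; the function name `bm25_term_score` and its arguments are illustrative assumptions. `epsilon` is left out because this README does not describe how bm25s-j applies it.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Score one query term against one document (illustrative, not bm25s-j code)."""
    # IDF: terms that occur in few documents (small doc_freq) get a larger weight.
    idf = math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: b=0 ignores document length, b=1 applies it fully.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # Saturation: k1 controls how quickly repeated occurrences stop adding score.
    return idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)


if __name__ == "__main__":
    # The score grows with term frequency, but each extra occurrence adds less.
    for tf in (1, 2, 5, 10):
        score = bm25_term_score(tf, doc_len=100, avg_doc_len=100, n_docs=1000, doc_freq=10)
        print(f"tf={tf:2d} score={score:.3f}")
```

A larger `k1` lets repeated terms keep contributing for longer, and a larger `b` penalizes documents that are longer than average more strongly.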
Use Case Tuning
# Technical documentation
config = BM25sConfig(k1=1.5, b=0.8, epsilon=0.1)
# General knowledge base
config = BM25sConfig(k1=1.2, b=0.75, epsilon=0.25)
# Legal/medical texts
config = BM25sConfig(k1=1.0, b=0.9, epsilon=0.1)
Advanced Features
Metadata Filtering (BM25s-j 0.2.0+)
# Add documents with metadata
documents = ["Document content..."]
metadata = [{"category": "tech", "year": 2024, "rating": 4.5}]
store.add_texts(documents, metadatas=metadata)
# Basic filtering
results = store.similarity_search(
    "search query",
    filter={"category": "tech"},
)
# Advanced filtering with operators
results = store.similarity_search(
    "search query",
    filter={
        "year": {"$gte": 2023},
        "rating": {"$gt": 4.0},
        "category": {"$in": ["tech", "science"]},
    },
)
Supported Filter Operators
| Operator | Description | Example |
|---|---|---|
| $gt | Greater than | {"score": {"$gt": 0.8}} |
| $gte | Greater than or equal | {"year": {"$gte": 2023}} |
| $lt | Less than | {"priority": {"$lt": 5}} |
| $lte | Less than or equal | {"rating": {"$lte": 3.0}} |
| $in | Value in list | {"type": {"$in": ["doc", "guide"]}} |
| $nin | Value not in list | {"status": {"$nin": ["draft"]}} |
| $ne | Not equal | {"private": {"$ne": True}} |
| $exists | Field exists | {"author": {"$exists": True}} |
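The operator semantics in the table can be sketched in plain Python. This is not the plugin's filtering code, which is not shown in this README; `matches_filter` is a hypothetical helper, and it assumes that a plain value means equality, that several conditions are combined with AND, and that a missing field fails every comparison except `$exists`.

```python
import operator
from typing import Any, Dict

# Map each operator name to a function of (metadata value, operand).
_COMPARATORS = {
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$ne": operator.ne,
    "$in": lambda value, options: value in options,
    "$nin": lambda value, options: value not in options,
}


def matches_filter(metadata: Dict[str, Any], flt: Dict[str, Any]) -> bool:
    """Return True if metadata satisfies every condition in flt (hypothetical helper)."""
    for field, condition in flt.items():
        if not isinstance(condition, dict):
            # A plain value is treated as an equality check.
            if field not in metadata or metadata[field] != condition:
                return False
            continue
        for op, operand in condition.items():
            if op == "$exists":
                if (field in metadata) != bool(operand):
                    return False
            elif op in _COMPARATORS:
                # Assumption: a missing field never satisfies a comparison.
                if field not in metadata or not _COMPARATORS[op](metadata[field], operand):
                    return False
            else:
                raise ValueError(f"Unsupported operator: {op}")
    return True


meta = {"category": "tech", "year": 2024, "rating": 4.5}
query = {"year": {"$gte": 2023}, "rating": {"$gt": 4.0}, "category": {"$in": ["tech", "science"]}}
print(matches_filter(meta, query))
```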
Index Persistence
# Save index automatically
config = BM25sConfig(index_path="./data/my_index.pkl")
store = BM25sStore(config=config)
store.add_texts(documents) # Index auto-saved
# Manual save/load
store.index_service.save_index()
store.index_service.load_index()
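For production use it can help to write index files atomically, so that a crash during a save never leaves a half-written file behind. How the plugin writes its index is not shown in this README; the sketch below is a general pattern using only the standard library, and `save_atomically` and `load_if_present` are hypothetical helpers. The use of pickle is an assumption based on the `.pkl` extension in the examples.

```python
import os
import pickle
import tempfile
from typing import Any


def save_atomically(obj: Any, path: str) -> None:
    """Pickle obj to a temporary file, then move it into place in one step."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    # The temporary file lives in the target directory so os.replace stays atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as fh:
            pickle.dump(obj, fh)
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the partial temporary file before re-raising.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_if_present(path: str, default: Any = None) -> Any:
    """Load a pickled object, or return default if the file does not exist."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

Only load pickle files that your own application produced, since unpickling runs code embedded in the file.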
Examples
Comprehensive examples are available in src/examples/:
- basic_usage.py - Simple setup and search
- integration_example.py - LangChain Document integration
- rag_pipeline_example.py - Complete RAG system
- metadata_filtering_example.py - Advanced filtering
- production_rag_example.py - Production deployment
- hybrid_search_example.py - Combine with semantic search
# Run examples
python src/examples/basic_usage.py
python src/examples/rag_pipeline_example.py
API Reference
BM25sStore
Main class providing VectorStore interface:
# Signature summary; method bodies omitted
class BM25sStore(BaseVectorStore):
    def add_texts(self, texts, metadatas=None, ids=None) -> List[str]: ...
    def add_documents(self, documents, ids=None) -> List[str]: ...
    def similarity_search(self, query, k=4, filter=None) -> List[BaseDocument]: ...
    def similarity_search_with_score(self, query, k=4, filter=None) -> List[Tuple[BaseDocument, float]]: ...
    def delete(self, ids) -> bool: ...

    @classmethod
    def from_texts(cls, texts, metadatas=None, config=None) -> "BM25sStore": ...

    @classmethod
    def from_documents(cls, documents, config=None) -> "BM25sStore": ...
BM25sConfig
Configuration model with validation:
class BM25sConfig(BaseModel):
    k1: float = 1.2  # Term frequency saturation
    b: float = 0.75  # Length normalization
    epsilon: float = 0.25  # IDF cutoff
    index_path: Optional[str] = None  # Save/load path
Performance Guidelines
When to Use BM25s
Ideal for:
- Technical documentation and API references
- FAQ systems with exact keyword matching
- Legal and medical document search
- Code snippet retrieval
- Any scenario requiring explainable results
Consider hybrid approach for:
- General knowledge questions
- Conceptual/semantic queries
- Cross-lingual search
- Synonym and paraphrase handling
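One common way to build such a hybrid is reciprocal rank fusion (RRF), which merges ranked lists using only each document's position in each list. The sketch below is independent of this plugin: `reciprocal_rank_fusion` is an illustrative name, the document IDs are made up, and `k=60` is a commonly used default rather than a value taken from this project.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of document IDs into a single ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents near the top of any list receive a larger contribution.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_ids = ["d1", "d2", "d3"]      # e.g. IDs returned by a keyword search
semantic_ids = ["d3", "d1", "d4"]  # e.g. IDs returned by an embedding search
fused = reciprocal_rank_fusion([bm25_ids, semantic_ids])
print(fused)  # ['d1', 'd3', 'd2', 'd4']
```

Because RRF uses ranks rather than raw scores, BM25 scores and embedding similarities do not need to be put on a common scale first.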
Optimization Tips
- Document chunking: 500-1500 tokens per chunk with 10-20% overlap
- Index management: Save indexes after creation, implement incremental updates
- Query preprocessing: Normalize and expand queries for better results
- Metadata strategy: Use filtering to reduce search space before ranking
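The chunking tip can be sketched as a sliding window over a token list. This is a minimal illustration rather than part of the plugin: `chunk_tokens` is a hypothetical helper, and it expects tokens that have already been produced by whatever tokenizer suits your text.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int = 1000, overlap_ratio: float = 0.15) -> List[List[str]]:
    """Split tokens into chunks of chunk_size that share overlap_ratio of their tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0.0 <= overlap_ratio < 1.0:
        raise ValueError("overlap_ratio must be in [0, 1)")
    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # always at least 1 because overlap_ratio < 1
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a chunk has reached the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(tokens):
            break
    return chunks


tokens = [str(i) for i in range(25)]
chunks = chunk_tokens(tokens, chunk_size=10, overlap_ratio=0.2)
print(len(chunks))  # 3
```

Each chunk can then be joined back into a string and passed to `store.add_texts`, with the source document recorded in the metadata.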
Testing
# Run all tests
uv run pytest
# Run with coverage
uv run pytest --cov=src --cov-report=term-missing
# Run specific test categories
uv run pytest tests/unit/
uv run pytest tests/e2e/
Development
# Install development dependencies
uv sync --group dev
# Run linting and formatting
uv run ruff check src/
uv run ruff format src/
# Type checking
uv run mypy src/
Contributing
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass
- Submit a pull request
License
[License information]
Support
- Documentation: See docs/ directory
- Examples: See src/examples/ directory
- Issues: GitHub Issues
refinire-rag-bm25s-j - Fast, efficient, and production-ready BM25s search for RAG applications.