BM25s VectorStore and KeywordSearch plugin for refinire-rag

refinire-rag-bm25s-j

BM25s VectorStore plugin for refinire-rag - A fast, efficient text search solution using the BM25s ranking algorithm.

Overview

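This plugin exposes a BM25s index as a VectorStore subclass for refinire-rag. BM25s ranks documents by keyword statistics rather than by vector similarity, so it is not technically a vector search. It is implemented as a VectorStore subclass anyway because the usage pattern is the same (add documents, then search), which lets it slot into refinire-rag wherever a VectorStore is expected.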

Key Dependencies:

refinire-rag - RAG framework integration
bm25s-j - Fast BM25s implementation

Features

Fast keyword-based retrieval - No embedding computation needed
Metadata filtering - Advanced filtering with comparison operators (BM25s-j 0.2.0+)
Index persistence - Save/load indexes for production use
Memory efficient - Optimized for large document collections
Deterministic results - Explainable and reproducible search
Production ready - Comprehensive error handling and logging

Installation

# Basic installation
uv add refinire-rag-bm25s-j

# With refinire-rag integration
# (quotes keep shells such as zsh from treating the brackets as a glob pattern)
uv add "refinire-rag-bm25s-j[rag]"

# For development
uv add "refinire-rag-bm25s-j[dev]"

Quick Start

from refinire_rag_bm25s_j import BM25sStore
from refinire_rag_bm25s_j.models import BM25sConfig

# Create configuration
config = BM25sConfig(
    k1=1.2,          # Term frequency saturation
    b=0.75,          # Length normalization  
    epsilon=0.25,    # IDF cutoff
    index_path="./data/bm25s_index.pkl"
)

# Initialize store
store = BM25sStore(config=config)

# Add documents
documents = [
    "Python is a powerful programming language for data science.",
    "Machine learning algorithms can be implemented efficiently.",
    "BM25 is a ranking function used in information retrieval."
]

store.add_texts(documents)

# Search
results = store.similarity_search("Python programming", k=2)
for doc in results:
    print(f"Score: {doc.metadata['score']:.3f}")
    print(f"Content: {doc.page_content}")

Configuration

BM25s Parameters

Parameter	Description	Default	Recommended
`k1`	Term frequency saturation	1.2	1.2-1.5
`b`	Length normalization	0.75	0.7-0.8
`epsilon`	IDF cutoff parameter	0.25	0.1-0.3
`index_path`	Index save/load path	None	Set for persistence
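To make the roles of `k1` and `b` concrete, here is a minimal pure-Python sketch of the textbook Okapi BM25 score for a single query term. It is not the bm25s-j implementation: the function name `bm25_term_score` and the Lucene-style IDF are assumptions chosen for illustration, and `epsilon` is left out because its exact handling depends on the library.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25 score of one query term in one document (illustrative only)."""
    # Lucene-style IDF, which is always non-negative (an assumption for this sketch).
    idf = math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # b controls how strongly documents longer than average are penalized.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # k1 controls how quickly repeated occurrences stop adding score.
    return idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)

# Term frequency saturates: the 2nd occurrence adds more than the 11th.
gain_low = bm25_term_score(2, 100, 100, 1000, 50) - bm25_term_score(1, 100, 100, 1000, 50)
gain_high = bm25_term_score(11, 100, 100, 1000, 50) - bm25_term_score(10, 100, 100, 1000, 50)
print(gain_low > gain_high)

# With b > 0, the same term count is worth less in a longer document.
short_doc = bm25_term_score(3, 50, 100, 1000, 50)
long_doc = bm25_term_score(3, 400, 100, 1000, 50)
print(short_doc > long_doc)
```

Raising `k1` lets repeated terms keep contributing for longer, and raising `b` toward 1.0 strengthens the length penalty. Setting `b=0` disables length normalization entirely.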

Use Case Tuning

# Technical documentation
config = BM25sConfig(k1=1.5, b=0.8, epsilon=0.1)

# General knowledge base  
config = BM25sConfig(k1=1.2, b=0.75, epsilon=0.25)

# Legal/medical texts
config = BM25sConfig(k1=1.0, b=0.9, epsilon=0.1)

Advanced Features

Metadata Filtering (BM25s-j 0.2.0+)

# Add documents with metadata
documents = ["Document content..."]
metadata = [{"category": "tech", "year": 2024, "rating": 4.5}]
store.add_texts(documents, metadatas=metadata)

# Basic filtering
results = store.similarity_search(
    "search query",
    filter={"category": "tech"}
)

# Advanced filtering with operators
results = store.similarity_search(
    "search query", 
    filter={
        "year": {"$gte": 2023},
        "rating": {"$gt": 4.0},
        "category": {"$in": ["tech", "science"]}
    }
)

Supported Filter Operators

Operator	Description	Example
`$gt`	Greater than	`{"score": {"$gt": 0.8}}`
`$gte`	Greater than or equal	`{"year": {"$gte": 2023}}`
`$lt`	Less than	`{"priority": {"$lt": 5}}`
`$lte`	Less than or equal	`{"rating": {"$lte": 3.0}}`
`$in`	Value in list	`{"type": {"$in": ["doc", "guide"]}}`
`$nin`	Value not in list	`{"status": {"$nin": ["draft"]}}`
`$ne`	Not equal	`{"private": {"$ne": True}}`
`$exists`	Field exists	`{"author": {"$exists": True}}`
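The operator semantics in the table can be sketched as a small predicate over a metadata dict. This is a hypothetical re-implementation for illustration, not the plugin's filtering code: the names `matches_filter` and `_check` are invented here, and the treatment of missing fields (`$ne` and `$nin` match when the field is absent, comparison operators do not) is an assumption.

```python
_MISSING = object()  # sentinel for "field not present in metadata"

def _check(value, op, arg):
    """Evaluate one operator against one metadata value."""
    present = value is not _MISSING
    if op == "$exists":
        return present == bool(arg)
    if op == "$ne":
        return value != arg          # assumption: a missing field counts as "not equal"
    if op == "$nin":
        return value not in arg      # assumption: a missing field counts as "not in list"
    if not present:
        return False                 # remaining operators need an actual value
    if op == "$gt":
        return value > arg
    if op == "$gte":
        return value >= arg
    if op == "$lt":
        return value < arg
    if op == "$lte":
        return value <= arg
    if op == "$in":
        return value in arg
    raise ValueError(f"unsupported operator: {op}")

def matches_filter(metadata, filter_spec):
    """Return True if metadata satisfies every condition in filter_spec."""
    for field, condition in filter_spec.items():
        value = metadata.get(field, _MISSING)
        if isinstance(condition, dict):
            if not all(_check(value, op, arg) for op, arg in condition.items()):
                return False
        elif value != condition:     # a bare value means plain equality
            return False
    return True

doc_meta = {"category": "tech", "year": 2024, "rating": 4.5}
print(matches_filter(doc_meta, {"category": "tech"}))
print(matches_filter(doc_meta, {"year": {"$gte": 2023}, "rating": {"$gt": 4.0}}))
print(matches_filter(doc_meta, {"author": {"$exists": True}}))
```

All conditions in a filter are combined with AND, which matches how the advanced filtering example above reads.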

Index Persistence

# Save index automatically
config = BM25sConfig(index_path="./data/my_index.pkl")
store = BM25sStore(config=config)
store.add_texts(documents)  # Index auto-saved

# Manual save/load
store.index_service.save_index()
store.index_service.load_index()
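The `.pkl` suffix suggests a pickle-based index file, although the on-disk format is not spelled out here. Under that assumption, a save/load round trip might look like the sketch below. `save_index` and `load_index` are hypothetical standalone helpers, not the plugin's `index_service` methods, and a plain dict stands in for real index data.

```python
import os
import pickle
import tempfile

def save_index(index, path):
    """Write the index atomically: dump to a temp file, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            pickle.dump(index, handle, protocol=pickle.HIGHEST_PROTOCOL)
        os.replace(tmp_path, path)   # atomic when both paths share a filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)      # do not leave a half-written file behind
        raise

def load_index(path):
    """Load a previously saved index, or return None if no file exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as handle:
        return pickle.load(handle)

# Round trip with a toy "index" in a throwaway directory.
with tempfile.TemporaryDirectory() as workdir:
    index_path = os.path.join(workdir, "data", "my_index.pkl")
    toy_index = {"docs": ["alpha beta", "beta gamma"], "k1": 1.2, "b": 0.75}
    missing = load_index(index_path)   # nothing saved yet
    save_index(toy_index, index_path)
    restored = load_index(index_path)
print(restored == toy_index)
```

Writing to a temporary file and renaming it avoids leaving a truncated index if the process dies mid-save. Pickle files can execute arbitrary code when loaded, so only load index files from a trusted source.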

Examples

Comprehensive examples are available in src/examples/:

basic_usage.py - Simple setup and search
integration_example.py - LangChain Document integration
rag_pipeline_example.py - Complete RAG system
metadata_filtering_example.py - Advanced filtering
production_rag_example.py - Production deployment
hybrid_search_example.py - Combine with semantic search

# Run examples
python src/examples/basic_usage.py
python src/examples/rag_pipeline_example.py

API Reference

BM25sStore

Main class providing the VectorStore interface:

# Signature summary; method bodies omitted
class BM25sStore(BaseVectorStore):
    def add_texts(self, texts, metadatas=None, ids=None) -> List[str]: ...
    def add_documents(self, documents, ids=None) -> List[str]: ...
    def similarity_search(self, query, k=4, filter=None) -> List[BaseDocument]: ...
    def similarity_search_with_score(self, query, k=4, filter=None) -> List[Tuple[BaseDocument, float]]: ...
    def delete(self, ids) -> bool: ...

    @classmethod
    def from_texts(cls, texts, metadatas=None, config=None) -> "BM25sStore": ...

    @classmethod
    def from_documents(cls, documents, config=None) -> "BM25sStore": ...

BM25sConfig

Configuration model with validation:

class BM25sConfig(BaseModel):
    k1: float = 1.2           # Term frequency saturation
    b: float = 0.75           # Length normalization  
    epsilon: float = 0.25     # IDF cutoff
    index_path: Optional[str] = None  # Save/load path

Performance Guidelines

When to Use BM25s

Ideal for:

Technical documentation and API references
FAQ systems with exact keyword matching
Legal and medical document search
Code snippet retrieval
Any scenario requiring explainable results

Consider a hybrid approach for:

General knowledge questions
Conceptual/semantic queries
Cross-lingual search
Synonym and paraphrase handling
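One common way to combine keyword and semantic results is reciprocal rank fusion (RRF), which merges ranked lists using only rank positions, so the two score scales never need to be calibrated against each other. The sketch below is a generic illustration and not necessarily what `hybrid_search_example.py` does; the function name, the document ids, and the constant `k=60` (a conventional default) are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); higher ranks contribute more.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

keyword_hits = ["doc_a", "doc_b", "doc_c"]    # e.g. ids from a BM25s search
semantic_hits = ["doc_c", "doc_a", "doc_d"]   # e.g. ids from an embedding search
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents that appear in both lists rise to the top, while documents found by only one retriever are still kept.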

Optimization Tips

Document chunking: 500-1500 tokens per chunk with 10-20% overlap
Index management: Save indexes after creation, implement incremental updates
Query preprocessing: Normalize and expand queries for better results
Metadata strategy: Use filtering to reduce search space before ranking
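The chunking tip can be sketched as a sliding window over a token list. This is a minimal illustration with a hypothetical helper `chunk_tokens`; how tokens are produced is left to whatever tokenizer the pipeline uses, and the defaults are simply taken from the middle of the ranges suggested above.

```python
def chunk_tokens(tokens, chunk_size=1000, overlap_ratio=0.15):
    """Split a token list into fixed-size chunks that overlap by overlap_ratio."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0.0 <= overlap_ratio < 1.0:
        raise ValueError("overlap_ratio must be in [0, 1)")
    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap          # always >= 1 given the checks above
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break                        # the last chunk already reaches the end
    return chunks

# Synthetic tokens stand in for the output of a real tokenizer.
tokens = [f"w{i}" for i in range(25)]
chunks = chunk_tokens(tokens, chunk_size=10, overlap_ratio=0.2)
print([len(chunk) for chunk in chunks])  # [10, 10, 9]
```

Each chunk repeats the tail of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.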

Testing

# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=src --cov-report=term-missing

# Run specific test categories
uv run pytest tests/unit/
uv run pytest tests/e2e/

Development

# Install development dependencies
uv sync --group dev

# Run linting and formatting
uv run ruff check src/
uv run ruff format src/

# Type checking
uv run mypy src/

Contributing

Fork the repository
Create a feature branch
Add tests for new functionality
Ensure all tests pass
Submit a pull request

License

[License information]

Support

Documentation: See docs/ directory
Examples: See src/examples/ directory
Issues: GitHub Issues

refinire-rag-bm25s-j - Fast, efficient, and production-ready BM25s search for RAG applications.

Release history

0.0.3 (Jun 27, 2025)
0.0.2 (Jun 26, 2025), this version
0.0.1 (Jun 11, 2025)
