Skip to main content

High-performance BM25 ranking algorithm implementation with Go core and Python bindings

Project description

BM25 Go Library

Go Version License Go Report Card

A highly optimized Go implementation of the BM25 ranking algorithm with Python bindings and PostgreSQL integration. This library provides lightning-fast text search and ranking capabilities suitable for production applications, search engines, and data analysis systems.

๐Ÿ“‹ Table of Contents

๐Ÿš€ Features

Core BM25 Implementation

  • High-Performance: Optimized Go implementation with sub-millisecond search times
  • Memory Efficient: Smart memory management with configurable capacities
  • Thread-Safe: Concurrent search operations with read-write mutexes
  • Flexible Tokenization: Pluggable tokenizer interface with built-in implementations

Smart Tokenization

  • Intelligent Stopword Removal: Curated list of essential function words only
  • Multi-Language Support: English, French, German, Spanish, Italian, Portuguese, Russian, Japanese
  • Punctuation Handling: Automatic cleaning and normalization
  • Number Preservation: Maintains numeric tokens for technical content
  • Compound Word Support: Preserves multi-word terms as single tokens while also indexing individual components

Compound Token Support

  • Multi-Word Terms: Handle technical terms, medical conditions, product names, etc.
  • Dual Indexing: Both compound and individual terms are indexed for comprehensive search
  • Flexible Configuration: Add/remove compound words dynamically
  • Domain-Specific: Configure for medical, technical, legal, or any specialized vocabulary
  • Natural Search: Queries for individual terms find documents containing compound terms

PostgreSQL Integration

  • Persistent Storage: ACID-compliant document storage and retrieval
  • Automatic Schema Management: Tables and indexes created automatically
  • High-Performance Queries: Leverages PostgreSQL's query optimizer
  • Batch Mode: In-memory processing for high-frequency searches (10-100x faster)

Python Bindings

  • Native Performance: Go-based core with Python interface
  • Easy Integration: Simple Python API for existing Python applications
  • Cross-Platform: Works on Linux, macOS, and Windows
  • bm25s Compatible: 100% drop-in replacement for popular bm25s library

๐Ÿ“Š Performance Characteristics

Dataset Size Search Time Memory Usage Throughput
1K documents < 0.1ms ~10MB 10,000+ queries/sec
10K documents < 0.5ms ~100MB 5,000+ queries/sec
100K documents < 2ms ~1GB 1,000+ queries/sec
1M documents < 10ms ~10GB 100+ queries/sec

Batch Mode Performance: 10-500x faster than database queries for high-frequency operations.

๐Ÿ† Benchmark Comparisons

vs bm25s (Pure Python)

  • Indexing Speed: 2-5x faster document processing
  • Search Performance: 3-10x faster query execution
  • Memory Efficiency: Better memory usage for large datasets
  • Production Ready: Go-based core vs Python implementation

vs rank-bm25

  • Performance: Competitive or better performance
  • Features: Additional PostgreSQL integration and batch mode
  • API: Cleaner, more intuitive interface
  • Extensibility: Pluggable tokenization and custom backends

๐Ÿ—๏ธ Architecture

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚   Python API    โ”‚    โ”‚   Go Core       โ”‚    โ”‚  PostgreSQL    โ”‚
โ”‚   (Bindings)    โ”‚โ—„โ”€โ”€โ–บโ”‚   (BM25 +       โ”‚โ—„โ”€โ”€โ–บโ”‚  (Storage +     โ”‚
โ”‚                 โ”‚    โ”‚    Index)       โ”‚    โ”‚   Batch Mode)   โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

๐Ÿš€ Quick Start

Go Usage

package main

import (
    "fmt"
    "github.com/pentney/go-bm25"
)

func main() {
    // Create index with capacity estimates
    index := bm25.NewIndex(1000, 10000)
    
    // Use SmartTokenizer for intelligent text processing
    tokenizer := bm25.NewEnglishSmartTokenizer()
    
    // Add documents
    index.AddDocument("doc1", "The quick brown fox jumps over the lazy dog", tokenizer)
    index.AddDocument("doc2", "A quick brown dog runs fast", tokenizer)
    
    // Search with BM25 ranking
    results := index.Search("quick brown fox", tokenizer, 10)
    
    for _, result := range results {
        fmt.Printf("%s: %.4f\n", result.DocID, result.Score)
    }
}

Smart Tokenization

The library includes an intelligent SmartTokenizer that provides:

// Create English tokenizer with stopword removal
tokenizer := bm25.NewEnglishSmartTokenizer()

// Multi-language support
frTokenizer := bm25.NewSmartTokenizer("fr")
deTokenizer := bm25.NewSmartTokenizer("de")

// Intelligent text processing
text := "The quick brown fox jumps over the lazy dog"
tokens := tokenizer.Tokenize(text)
// Result: ["quick", "brown", "fox", "jumps", "lazy", "dog"]
// Note: "the", "over" removed as stopwords

Features:

  • Essential Stopwords Only: Removes only the most common function words
  • Number Preservation: Keeps numeric tokens for technical content
  • Punctuation Handling: Automatic cleaning and normalization
  • Multi-Language: Support for 8+ languages

PostgreSQL Integration

// Create PostgreSQL index
config := bm25.DefaultPostgresConfig()
pgIndex, err := bm25.NewPostgresIndex(config)

// High-performance batch mode for stable indexes
batchMode, err := pgIndex.NewBatchMode()
results := batchMode.Search("query", tokenizer, 10)

Batch Mode for High-Performance Processing

For high-frequency searches on stable indexes, use Batch Mode:

// Create batch mode (loads all data into memory)
batchMode, err := pgIndex.NewBatchMode()
defer batchMode.Close()

// Sub-millisecond search performance
for i := 0; i < 1000; i++ {
    results := batchMode.Search("query", tokenizer, 10)
    // 10-500x faster than database queries!
}

// Check memory usage
estimatedBytes, docCount, termCount := batchMode.GetMemoryStats()
fmt.Printf("Memory: %s, Docs: %d, Terms: %d\n", 
    formatBytes(estimatedBytes), docCount, termCount)

// Refresh when database changes
batchMode.Refresh()

Perfect for:

  • High-frequency searches (100+ queries/second)
  • Stable indexes with infrequent updates
  • Batch processing operations
  • Real-time search applications

Python Usage

import bm25

# Create index
index = bm25.Index(1000, 10000)

# Add documents
index.add_document("doc1", "The quick brown fox jumps over the lazy dog")

# Search
results = index.search("quick brown fox", 10)
for doc_id, score in results:
    print(f"{doc_id}: {score}")

bm25s API Compatibility

This library provides true drop-in replacement compatibility with the popular bm25s library, making migration seamless:

# bm25s-style usage (exact same API!)
from bm25.bm25s_compat import BM25

# Create index (identical to bm25s.BM25())
index = BM25(documents)

# Index documents (identical to bm25s indexing)
index.index(tokenized_documents)

# Search (identical to bm25s.retrieve())
results = index.retrieve([query_tokens], k=10)

# Get scores (identical to bm25s.get_scores())
scores = index.get_scores(query_tokens)

# All other methods work identically
doc_count = index.get_document_count()
term_count = index.get_term_count()
avgdl = index.get_avgdl()

Perfect Migration Path:

# Before (bm25s)
import bm25s
index = bm25s.BM25(documents)
results = index.retrieve([query_tokens], k=10)

# After (this library) - just change the import!
from bm25.bm25s_compat import BM25
index = BM25(documents)
results = index.retrieve([query_tokens], k=10)
# Everything else stays exactly the same!

Migration Benefits:

  • 100% API Compatible: Exact same methods, signatures, and behavior
  • Better Performance: 2-10x faster than pure Python implementations
  • Enhanced Features: Smart tokenization, PostgreSQL support, batch mode
  • Production Ready: Go-based core with Python convenience
  • Zero Code Changes: Just change the import statement

Performance Comparison:

  • bm25s: Pure Python, good for prototyping
  • This Library: Go core + Python bindings, production performance

๐Ÿ”ง Installation

Go

go get github.com/pentney/go-bm25

Python

pip install bm25-go

bm25s Compatibility Layer:

# For drop-in replacement of bm25s
from bm25.bm25s_compat import BM25

# Use exactly like bm25s
index = BM25(documents)
results = index.retrieve([query_tokens], k=10)

PostgreSQL Setup

# Install PostgreSQL dependencies
go mod tidy

# Create database
createdb bm25

# Run schema setup (automatic in Go code)

๐Ÿ“š Documentation

๐ŸŽฏ Use Cases

Search Engines

  • Web Search: High-performance document ranking
  • Enterprise Search: Internal document repositories
  • E-commerce: Product search and recommendations

Data Analysis

  • Text Mining: Large document collections
  • Content Analysis: Document similarity and clustering
  • Research: Academic paper search and ranking

Production Applications

  • Content Management: Fast document retrieval
  • API Services: High-throughput search endpoints
  • Real-time Systems: Low-latency search operations

โšก Performance Optimizations

Memory Management

  • Smart Capacity Planning: Automatic size estimation with growth buffers
  • Efficient Data Structures: Optimized maps and slices for Go
  • Garbage Collection: Minimal allocation during search operations
  • Memory Pooling: Reusable score maps to reduce allocations
  • Bulk Operations: Efficient batch document processing with pre-allocated memory

Search Algorithms

  • Heap-based Top-K: Efficient selection of best results
  • Precomputed IDF: Cached inverse document frequency values
  • Lazy Statistics: On-demand computation of index statistics
  • Early Termination: Stop processing when sufficient high-scoring results are found
  • Score Thresholding: Skip documents below minimum score thresholds
  • Vectorized Scoring: Process multiple terms simultaneously for better performance
  • Term Impact Sorting: Process most impactful terms first for faster convergence

Caching & Optimization

  • Search Result Caching: LRU cache for frequently accessed queries
  • Configurable Parameters: Tunable K1, B, and Epsilon values for different use cases
  • Batch Search: Parallel processing of multiple queries
  • Concurrent Operations: Thread-safe operations with controlled concurrency
  • Performance Monitoring: Real-time memory usage and performance statistics

PostgreSQL Optimizations

  • Batch Operations: Efficient bulk document processing
  • Indexed Queries: Optimized database schema with proper indexes
  • Connection Pooling: Efficient database connection management

bm25s-Inspired Features

  • Epsilon Smoothing: Improved IDF calculation stability
  • Parameter Tuning: Easy adjustment of ranking behavior
  • Early Termination: Skip low-impact terms and documents
  • Memory Efficiency: Optimized data structures and pooling
  • Batch Processing: Efficient handling of large document collections

Performance Characteristics by Method

Search Method Use Case Performance Memory
Search() General purpose Good Standard
SearchOptimized() Top-K results Better Standard
VectorizedSearch() Complex queries Best Higher
SearchWithThreshold() Filtered results Good Standard
BatchSearch() Multiple queries Excellent Higher
SearchWithCache() Repeated queries Best Higher

Parameter Tuning Guide

Conservative Settings (K1=1.0, B=0.5, Epsilon=0.25):

  • Good for general-purpose search
  • Balanced precision and recall
  • Suitable for most applications

Default Settings (K1=1.2, B=0.75, Epsilon=0.25):

  • Standard BM25 behavior
  • Good balance of performance and accuracy
  • Recommended starting point

Aggressive Settings (K1=1.5, B=0.8, Epsilon=0.1):

  • Higher precision, lower recall
  • Better for focused searches
  • Faster early termination

Very Aggressive Settings (K1=2.0, B=0.9, Epsilon=0.05):

  • Maximum precision
  • Fastest search performance
  • Best for high-frequency queries

Memory Usage Optimization

  • Capacity Planning: Pre-allocate based on expected document count
  • Batch Processing: Use BulkAddDocuments() for large collections
  • Cache Management: Adjust cache size based on available memory
  • Memory Monitoring: Use GetPerformanceStats() to track usage
  • Garbage Collection: Automatic cleanup of temporary objects

๐Ÿงช Testing & Benchmarks

# Run all tests
go test -v

# Run optimization benchmarks
go test -bench=BenchmarkSearchMethods -v
go test -bench=BenchmarkParameterConfigurations -v
go test -bench=BenchmarkCaching -v
go test -bench=BenchmarkBatchOperations -v

# Run PostgreSQL benchmarks
go test -bench=Postgres -v

# Run batch mode benchmarks
go test -bench=BenchmarkBatchModeVsPostgres -v

๐Ÿ” API Reference

Core Types

  • Index: Main BM25 index with configurable capacities and parameters
  • BM25Params: Configurable K1, B, and Epsilon parameters
  • SearchCache: LRU cache for search results
  • SmartTokenizer: Intelligent tokenization with stopword removal
  • SearchResult: Search result with document ID and BM25 score
  • BatchSearchResult: Result from batch search operations
  • PostgresIndex: PostgreSQL-backed persistent index
  • BatchMode: High-performance in-memory processing mode

Key Methods

  • Search(): Perform BM25 search with ranking
  • SearchOptimized(): Optimized search with early termination
  • VectorizedSearch(): Vectorized scoring for complex queries
  • SearchWithThreshold(): Search with score thresholding
  • BatchSearch(): Process multiple queries concurrently
  • SearchWithCache(): Search with optional result caching
  • BulkAddDocuments(): Efficiently add multiple documents
  • GetPerformanceStats(): Get memory and performance statistics
  • SetParameters(): Update BM25 parameters dynamically

New Optimization Methods

  • NewIndexWithParams(): Create index with custom parameters
  • NewIndexWithCache(): Create index with search result caching
  • EnableCache(): Enable/disable result caching
  • SetCacheSize(): Adjust cache size dynamically
  • GetTermImpact(): Get term impact score
  • GetQueryTermsImpact(): Get terms sorted by impact

๐ŸŒŸ Why This Library?

Performance

  • Go Speed: Native Go performance with optimized algorithms
  • Memory Efficiency: Smart memory management for large datasets
  • Concurrent Access: Thread-safe operations with controlled concurrency
  • Early Termination: Stop processing when sufficient results are found
  • Vectorized Operations: Process multiple terms simultaneously

Flexibility

  • Multiple Backends: In-memory, PostgreSQL, and Python bindings
  • Custom Tokenization: Pluggable tokenizer interface
  • Configurable Parameters: Tunable K1, B, and Epsilon values
  • Caching Options: Optional search result caching
  • Batch Operations: Efficient bulk document processing

Production Ready

  • Comprehensive Testing: Extensive test coverage with benchmarks
  • Error Handling: Robust error handling and recovery
  • Documentation: Complete examples and performance guides
  • Performance Monitoring: Real-time statistics and memory tracking
  • Memory Optimization: Automatic cleanup and efficient data structures

Migration Benefits

  • bm25s Compatibility: Familiar API for existing users
  • Performance Upgrade: 2-10x faster than pure Python implementations
  • Feature Enhancement: Smart tokenization, PostgreSQL, batch mode
  • Production Scaling: Handle larger datasets with better performance
  • Advanced Optimizations: Early termination, caching, vectorized search

๐Ÿš€ Quick Start with Optimizations

Basic Usage with Custom Parameters

package main

import (
    "fmt"
    "github.com/pentney/go-bm25"
)

func main() {
    // Create index with custom parameters
    params := bm25.BM25Params{
        K1:      1.5,    // Higher term frequency saturation
        B:       0.8,    // Stronger length normalization
        Epsilon: 0.1,    // Lower threshold for better precision
    }
    
    index := bm25.NewIndexWithParams(1000, 10000, params)
    tokenizer := bm25.NewEnglishSmartTokenizer()
    
    // Add documents
    index.AddDocument("doc1", "machine learning algorithm", tokenizer)
    index.AddDocument("doc2", "database system performance", tokenizer)
    
    // Use optimized search
    results := index.SearchOptimized("machine learning", tokenizer, 5)
    
    for _, result := range results {
        fmt.Printf("%s: %.4f\n", result.DocID, result.Score)
    }
}

Advanced Usage with Caching

// Create cached index
cachedIndex := bm25.NewIndexWithCache(1000, 10000, 1000)
cachedIndex.EnableCache(true)

// Bulk add documents
documents := []bm25.Document{
    {ID: "doc1", Content: "machine learning algorithm"},
    {ID: "doc2", Content: "database system performance"},
    // ... more documents
}
cachedIndex.BulkAddDocuments(documents, tokenizer)

// Batch search with caching
queries := []string{"machine learning", "database system", "algorithm"}
batchResults := cachedIndex.BatchSearch(queries, tokenizer, 5)

// Get performance statistics
stats := cachedIndex.GetPerformanceStats()
fmt.Printf("Memory usage: %v MB\n", stats["memory_usage_mb"])

๐Ÿค Contributing

We welcome contributions! Please see our contributing guidelines:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests for new functionality
  5. Submit a pull request

๐Ÿ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

๐Ÿ™ Acknowledgments

  • BM25 Algorithm: Based on the probabilistic relevance framework
  • bm25s Library: Inspiration for parameter tuning and optimizations
  • Go Community: Built with Go's excellent standard library
  • PostgreSQL: Robust database backend for persistence
  • Python Community: Python bindings for broader adoption

Ready to build fast, scalable search? Get started with go get github.com/pentney/go-bm25 or pip install bm25-go!

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

go_bm25-0.2.9.tar.gz (71.6 kB view details)

Uploaded Source

File details

Details for the file go_bm25-0.2.9.tar.gz.

File metadata

  • Download URL: go_bm25-0.2.9.tar.gz
  • Upload date:
  • Size: 71.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.12

File hashes

Hashes for go_bm25-0.2.9.tar.gz
Algorithm Hash digest
SHA256 5eb0475771d52380db17b4340f6f178793077c156b2f4643d7886f135a495b94
MD5 dd151cbfd3ec819412026a04885b1eef
BLAKE2b-256 8540503fd27a752159c348dc80b8e5506ab2da1f816aded76e98d5fc975c7159

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page