Vajra BM25
Vajra (Sanskrit: वज्र, "thunderbolt") is a high-performance BM25 search engine. It reframes the BM25 algorithm using category-theory abstractions, providing a well-structured API with vectorized implementations. Benchmarks show Vajra is ~1.2-1.6x faster than BM25S (one of the fastest Python BM25 libraries) while achieving competitive recall across datasets.
What Makes Vajra Different
Vajra implements the standard BM25 ranking algorithm using rigorous mathematical abstractions:
- Morphisms: BM25 scoring as a mathematical arrow `(Query, Document) → ℝ`
- Coalgebras: Search as state unfolding `QueryState → List[SearchResult]`
- Functors: The `List` functor captures multiple-results semantics
Vajra BM25 computes the same underlying mathematics as any other BM25 implementation, but it uses categorical vocabulary to describe the search process, and these abstractions are more amenable to experimentation and improvement. The core BM25 formula is identical to other implementations; category theory provides the organizational structure.
Installation
```bash
# Basic installation (zero dependencies)
pip install vajra-bm25

# With optimizations (NumPy + SciPy for vectorized operations)
pip install vajra-bm25[optimized]

# With PDF support (index and search PDF documents)
pip install vajra-bm25[pdf]

# With index persistence (save/load indices)
pip install vajra-bm25[persistence]

# With vector search (semantic search with embeddings)
pip install vajra-bm25[vector]

# With vector search + Numba acceleration
pip install vajra-bm25[vector-numba]

# With CLI (interactive search)
pip install vajra-bm25[cli]

# Everything
pip install vajra-bm25[all]
```
Interactive CLI
Vajra includes a rich interactive CLI for exploring BM25, vector, and hybrid search:
```bash
# Interactive BM25 mode with BEIR SciFact dataset
vajra-search

# Single query mode
vajra-search -q "machine learning algorithms"

# Use custom JSONL corpus
vajra-search --corpus my_documents.jsonl

# Search PDF files (requires: pip install vajra-bm25[pdf])
vajra-search --corpus document.pdf -q "search query"

# Search a directory of PDFs
vajra-search --corpus ./pdf_folder/

# Vector search (requires: pip install vajra-bm25[vector])
vajra-search --mode vector
vajra-search --mode vector --model all-MiniLM-L6-v2

# Hybrid search (BM25 + Vector with RRF fusion)
vajra-search --mode hybrid
vajra-search --mode hybrid --alpha 0.5   # 50% BM25, 50% vector
vajra-search --mode hybrid --alpha 0.7   # 70% BM25, 30% vector

# Show options
vajra-search --help
```
In interactive mode, use `:help` for commands, `:stats` for index info, and `:quit` to exit.
Quick Start
The Python API is straightforward. Vajra supports JSONL and PDF document formats via the `DocumentCorpus` class.
```python
from vajra_bm25 import VajraSearch, Document, DocumentCorpus

# Create documents
documents = [
    Document(id="1", title="Category Theory", content="Functors preserve structure"),
    Document(id="2", title="Coalgebras", content="Coalgebras model dynamics"),
    Document(id="3", title="Search Algorithms", content="BFS explores level by level"),
]
corpus = DocumentCorpus(documents)

# Create search engine
engine = VajraSearch(corpus)

# Search
results = engine.search("category functors", top_k=5)
for r in results:
    print(f"{r.rank}. {r.document.title} (score: {r.score:.3f})")
```
Optimized Usage
For larger corpora (1,000+ documents), use the optimized engine, which vectorizes scoring with NumPy/SciPy and caches intermediate results.
```python
from vajra_bm25 import VajraSearchOptimized, DocumentCorpus

# Load corpus from JSONL
corpus = DocumentCorpus.load_jsonl("corpus.jsonl")

# Create optimized engine
# Automatically uses sparse matrices for >10K documents
engine = VajraSearchOptimized(corpus)

# Search (vectorized, cached)
results = engine.search("neural networks", top_k=10)
```
Parallel Batch Processing
For high-throughput scenarios, use the parallel batch engine, which answers multiple queries concurrently across worker threads. Parallelism adds per-query overhead, so `VajraSearchOptimized` is usually faster for single queries; for batches of queries, or when memory is constrained, the parallel engine can deliver higher overall throughput.
```python
from vajra_bm25 import VajraSearchParallel

engine = VajraSearchParallel(corpus, max_workers=4)

# Process multiple queries in parallel
queries = ["machine learning", "deep learning", "neural networks"]
batch_results = engine.search_batch(queries, top_k=5)
```
Vector Search
Vajra supports semantic vector search using sentence transformers and HNSW indices (requires `pip install vajra-bm25[vector]`).
```python
from vajra_bm25 import DocumentCorpus
from vajra_bm25.vector import (
    VajraVectorSearch,
    TextEmbeddingMorphism,
    NativeHNSWIndex,
)

# Load corpus
corpus = DocumentCorpus.load_jsonl("corpus.jsonl")

# Create embedding morphism
embedder = TextEmbeddingMorphism(
    model_name="all-MiniLM-L6-v2",
    normalize=True
)

# Create HNSW index (use FlatVectorIndex for small corpora)
index = NativeHNSWIndex(
    dimension=384,        # all-MiniLM-L6-v2 dimension
    metric="cosine",
    M=16,
    ef_construction=200,
    ef_search=50
)

# Build search engine
vector_engine = VajraVectorSearch(embedder, index)
vector_engine.index_documents(list(corpus.documents.values()))

# Semantic search
results = vector_engine.search("machine learning techniques", top_k=10)
```
Hybrid Search (BM25 + Vector Fusion)
Combine keyword-based BM25 with semantic vector search using Reciprocal Rank Fusion:
```python
from vajra_bm25 import VajraSearchOptimized
from vajra_bm25.vector import HybridSearchEngine

# Create BM25 engine
bm25_engine = VajraSearchOptimized(corpus)

# Create vector engine (from above)
vector_engine = VajraVectorSearch(embedder, index)
vector_engine.index_documents(list(corpus.documents.values()))

# Create hybrid engine
hybrid_engine = HybridSearchEngine(
    bm25_engine=bm25_engine,
    vector_engine=vector_engine,
    alpha=0.5,     # 0.5 = equal weight to BM25 and vector
    method="rrf"   # Reciprocal Rank Fusion (also: "linear", "rsf")
)

# Search with both keyword and semantic relevance
results = hybrid_engine.search("neural network architectures", top_k=10)

# Each result includes component scores
for r in results:
    print(f"{r.rank}. {r.document.title}")
    print(f"   Hybrid: {r.score:.3f} | BM25: {r.bm25_score:.3f} | Vector: {r.vector_score:.3f}")
```
Why Hybrid Search?
- BM25 excels at exact keyword matches and term frequency
- Vector search captures semantic similarity and synonyms
- Hybrid fusion combines the best of both approaches
- RRF method is robust and parameter-free, doesn't require score normalization
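RRF can be sketched in a few lines. This is a hypothetical standalone function (not Vajra's internal API): each engine's ranking contributes `1 / (k + rank)` per document, so documents ranked highly by either engine rise to the top without any score normalization.

```python
def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of doc ids via Reciprocal Rank Fusion.

    rankings: list of ranked lists (best first), e.g. one per engine.
    k dampens the influence of top ranks; 60 is the common default.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort doc ids by fused score, best first
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d1", "d2", "d3"]
vector_ranking = ["d3", "d1", "d4"]
fused = rrf_fuse([bm25_ranking, vector_ranking])
# "d1" and "d3" appear in both lists, so they lead the fused ranking
```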
Performance
Benchmarked against BM25 implementations across BEIR and Wikipedia datasets (January 2025):
BEIR/SciFact (5,183 docs, 300 queries)
| Engine | Latency | Recall@10 | NDCG@10 | QPS |
|---|---|---|---|---|
| vajra | 0.14ms | 78.9% | 67.0% | 7,133 |
| bm25s | 0.18ms | 77.4% | 66.2% | 5,512 |
| tantivy | 0.28ms | 72.5% | 60.0% | 3,539 |
Wikipedia/200K (200,000 docs, 500 queries)
| Engine | Latency | Recall@10 | NDCG@10 | QPS |
|---|---|---|---|---|
| vajra | 0.89ms | 44.4% | 35.1% | 1,125 |
| bm25s | 1.08ms | 44.6% | 35.2% | 925 |
| tantivy | 6.68ms | 45.6% | 36.4% | 150 |
Wikipedia/500K (500,000 docs, 500 queries)
| Engine | Latency | Recall@10 | NDCG@10 | QPS |
|---|---|---|---|---|
| vajra | 1.89ms | 49.6% | 36.7% | 529 |
| bm25s | 2.45ms | 49.8% | 37.1% | 409 |
| tantivy | 5.52ms | 51.6% | 38.3% | 181 |
Wikipedia/1M (1,000,000 docs, 500 queries)
| Engine | Build Time | Latency | Recall@10 | NDCG@10 | QPS |
|---|---|---|---|---|---|
| vajra | 17.0 min | 3.40ms | 45.6% | 36.3% | 294 |
| bm25s | 11.3 min | 5.44ms | 45.8% | 36.7% | 184 |
Key observations:
- Vajra is ~1.2-1.6x faster than BM25S on single queries across corpus sizes
- Sub-4ms latency even at 1M documents
- Competitive accuracy: within 1% NDCG of BM25S
- Optimized index building with per-document array concatenation
Caching for Production Workloads
For production systems with repeated queries, Vajra includes built-in LRU caching:
| Scenario | Typical Latency | Explanation |
|---|---|---|
| First query (cold) | 0.14-3.5ms | Full BM25 computation (scales with corpus) |
| Repeated query (cached) | ~0.001ms | LRU cache hit, near-instant |
Enable caching by setting `cache_size` (default: 1000):

```python
engine = VajraSearchOptimized(corpus, cache_size=1000)
```
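The caching pattern itself can be sketched with the standard library. This is an illustration of the idea, not Vajra's implementation; `CachedSearch` is a hypothetical wrapper name.

```python
from functools import lru_cache

class CachedSearch:
    """Wrap any search engine with an LRU cache keyed on (query, top_k)."""

    def __init__(self, engine, cache_size=1000):
        self._engine = engine
        # lru_cache requires hashable arguments and evicts least-recently-used
        self._cached = lru_cache(maxsize=cache_size)(self._search)

    def _search(self, query, top_k):
        # Results are frozen into a tuple so the cached value is immutable
        return tuple(self._engine.search(query, top_k=top_k))

    def search(self, query, top_k=10):
        # Repeated identical queries skip the BM25 computation entirely
        return self._cached(query, top_k)
```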
How Vajra Achieves Performance
- Eager Score Matrix: Pre-computes BM25 term-document scores at index time
- Sparse Matrices: Avoids computation on ~99% zeros in the term-document matrix
- Vectorized NumPy/SciPy: Uses SIMD instructions for batch scoring
- Numba JIT: Optional compiled scoring loops for additional speedup
- LRU Caching: Optional query result caching for repeated queries
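The first two techniques can be illustrated with a dependency-free sketch (the real engine uses SciPy CSR matrices and NumPy; the function names here are hypothetical). BM25 term-document scores are computed once at index time and stored sparsely, so answering a query reduces to summing a few precomputed rows.

```python
import math

def build_eager_scores(docs, k1=1.5, b=0.75):
    """docs: dict of doc_id -> token list. Returns term -> {doc_id: BM25 score}."""
    n = len(docs)
    avgdl = sum(len(toks) for toks in docs.values()) / n
    # Document frequency per term
    df = {}
    for toks in docs.values():
        for t in set(toks):
            df[t] = df.get(t, 0) + 1
    # Eagerly score every (term, doc) pair; zeros are simply never stored
    scores = {}
    for doc_id, toks in docs.items():
        dl = len(toks)
        tf = {}
        for t in toks:
            tf[t] = tf.get(t, 0) + 1
        for t, f in tf.items():
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            s = idf * f * (k1 + 1) / (f + k1 * (1 - b + b * dl / avgdl))
            scores.setdefault(t, {})[doc_id] = s
    return scores

def query_scores(scores, query_tokens):
    """Score documents by summing the precomputed rows for the query terms."""
    totals = {}
    for t in query_tokens:
        for doc_id, s in scores.get(t, {}).items():
            totals[doc_id] = totals.get(doc_id, 0.0) + s
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```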
For detailed benchmark methodology and results, see docs/benchmarks.md.
Running Benchmarks
The benchmark system includes progress display, automatic file output, and index caching to avoid expensive rebuilds.
# Install benchmark dependencies
pip install vajra-bm25[optimized] rank-bm25 bm25s beir rich tantivy
# Optional: Pyserini (requires Java 11+)
pip install pyserini
# Quick start: Run BEIR SciFact benchmark (small dataset, ~5K docs)
python benchmarks/benchmark.py --datasets beir-scifact
# Run Wikipedia benchmarks (requires downloading data first)
python benchmarks/download_wikipedia.py --max-docs 200000
python benchmarks/benchmark.py --datasets wiki-200k
# Run multiple datasets
python benchmarks/benchmark.py --datasets beir-scifact beir-nfcorpus wiki-200k wiki-500k
# All engines comparison
python benchmarks/benchmark.py --datasets beir-scifact \
--engines vajra vajra-parallel bm25s tantivy pyserini
# Force rebuild indexes (skip cache)
python benchmarks/benchmark.py --datasets wiki-200k --no-cache
Output files:
- `results/benchmark_results.json` - Structured JSON with detailed metrics
- `results/benchmark.log` - Human-readable log (appended each run)
- `.index_cache/` - Cached indexes for faster subsequent runs
Available datasets: beir-scifact, beir-nfcorpus, wiki-100k, wiki-200k, wiki-500k, wiki-1m, custom
Available engines: vajra, vajra-parallel, bm25s, bm25s-parallel, tantivy, pyserini, rank-bm25
Note: Pyserini requires Java 11+ installed. On macOS:

```bash
brew install openjdk@21
```
Supported Formats
JSONL Format
Vajra uses JSONL for corpus persistence:
```jsonl
{"id": "doc1", "title": "First Document", "content": "Content here"}
{"id": "doc2", "title": "Second Document", "content": "More content"}
```
Load and save:
```python
# Save
corpus.save_jsonl("corpus.jsonl")

# Load
corpus = DocumentCorpus.load_jsonl("corpus.jsonl")
```
PDF Support
Vajra can index and search PDF documents directly (requires `pip install vajra-bm25[pdf]`):
```python
from vajra_bm25 import DocumentCorpus, VajraSearchOptimized

# Load a single PDF
corpus = DocumentCorpus.load_pdf("research_paper.pdf")

# Load all PDFs from a directory
corpus = DocumentCorpus.load_pdf_directory("./papers/")

# Load recursively from subdirectories
corpus = DocumentCorpus.load_pdf_directory("./papers/", recursive=True)

# Auto-detect format (JSONL, PDF, or directory)
corpus = DocumentCorpus.load("./my_data")  # Works with any format

# Build index and search
engine = VajraSearchOptimized(corpus)
results = engine.search("attention mechanism transformer", top_k=10)
```
PDF metadata (title, author, page count) is automatically extracted and stored in the document's metadata field.
BM25 Parameters
```python
from vajra_bm25 import VajraSearch, BM25Parameters

# Custom BM25 parameters
params = BM25Parameters(
    k1=1.5,   # Term frequency saturation (default: 1.5)
    b=0.75    # Length normalization (default: 0.75)
)
engine = VajraSearch(corpus, params=params)
```
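The effect of `k1` and `b` on a single term's contribution follows from the standard Okapi BM25 term formula. This sketch is an illustration of that formula, not Vajra's internals:

```python
def bm25_term_weight(tf, dl, avgdl, k1=1.5, b=0.75):
    """Term-frequency component of BM25: saturating in tf, length-normalized."""
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))

# k1 controls saturation: extra occurrences yield diminishing returns
w1 = bm25_term_weight(1, dl=100, avgdl=100)
w10 = bm25_term_weight(10, dl=100, avgdl=100)

# b controls the length penalty: longer-than-average docs score less per occurrence
short = bm25_term_weight(2, dl=50, avgdl=100)
long_ = bm25_term_weight(2, dl=200, avgdl=100)
```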
Categorical Abstractions (Advanced)
For users interested in the category theory foundations:
```python
from vajra_bm25 import (
    Morphism, FunctionMorphism, IdentityMorphism,
    Coalgebra, SearchCoalgebra,
    Functor, ListFunctor,
)

# Morphism composition
f = FunctionMorphism(lambda x: x + 1)
g = FunctionMorphism(lambda x: x * 2)
h = f >> g  # h(x) = (x + 1) * 2

# Identity laws
identity = IdentityMorphism()
assert (f >> identity).apply(5) == f.apply(5)  # f . id = f
assert (identity >> f).apply(5) == f.apply(5)  # id . f = f
```
For a more rigorous treatment of these category theory concepts, see Bartosz Milewski's *Category Theory for Programmers*.
API Reference
Core Classes
- `Document(id, title, content, metadata=None)` - Immutable document
- `DocumentCorpus(documents)` - Collection of documents
- `VajraSearch(corpus, params=None)` - Base search engine
- `VajraSearchOptimized(corpus, k1=1.5, b=0.75)` - Vectorized search
- `VajraSearchParallel(corpus, max_workers=4)` - Parallel batch search
Search Results
```python
@dataclass
class SearchResult:
    document: Document   # The matched document
    score: float         # BM25 relevance score
    rank: int            # Position in results (1-indexed)
```
Why Category Theory?
Category theory did not make Vajra fast. The speed comes from NumPy, sparse matrices, and caching. But it gave the code a clean organizational structure.
How the Abstractions Help
The `categorical/` module provides base classes that derived implementations extend:

- `Morphism[A, B]`: Composable transformations with `apply()` and the `>>` operator for chaining (`preprocess >> tokenize >> score`)
- `Coalgebra[X, FX]`: The "unfolding" pattern via `structure_map()`. BM25 search extends this as `QueryState → List[SearchResult]`
- `Functor[A, FA]`: Lifts operations to containers. `ListFunctor` handles multiple results; `MaybeFunctor` handles failure
This creates clear separation: data types, transformations, and search dynamics each have their place. Adding a new scorer or search strategy means implementing a known interface.
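As a sketch of what "implementing a known interface" looks like, here is a minimal stand-in for the Morphism pattern. The class shapes are illustrative, not Vajra's actual base classes:

```python
class Morphism:
    """Composable transformation: implement apply(), chain with >>."""

    def apply(self, x):
        raise NotImplementedError

    def __rshift__(self, other):
        # f >> g composes left-to-right: (f >> g).apply(x) == g.apply(f.apply(x))
        f, g = self, other
        class Composed(Morphism):
            def apply(self, x):
                return g.apply(f.apply(x))
        return Composed()

class Lowercase(Morphism):
    def apply(self, text):
        return text.lower()

class Tokenize(Morphism):
    def apply(self, text):
        return text.split()

# A new preprocessing step slots into the pipeline by implementing apply()
pipeline = Lowercase() >> Tokenize()
```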
Where Speed Actually Comes From
| Technique | Impact |
|---|---|
| Sparse matrices (CSR) | 6-8x memory reduction |
| Vectorized NumPy | 10-50x speedup |
| Eager pre-computed scores | Sub-millisecond queries |
| LRU query caching | Near-instant repeated queries |
These are engineering techniques. The categorical structure organizes the code; NumPy and SciPy deliver the performance.
Development
# Clone repository
git clone https://github.com/aiexplorations/vajra_bm25.git
cd vajra_bm25
# Install in development mode
pip install -e ".[dev]"
# Run tests
pytest tests/ -v
# Run with coverage
pytest --cov=vajra_bm25 --cov-report=html
What's New in v0.4.0
Vector Search Module
- Native HNSW implementation with coalgebraic abstractions
- `TextEmbeddingMorphism`: Sentence transformer integration
- `NativeHNSWIndex`: Approximate nearest neighbor search
- `FlatVectorIndex`: Exact brute-force baseline
- `HybridSearchEngine`: BM25 + Vector fusion (RRF, Linear, RSF methods)
- Optional Numba acceleration for distance functions
- 40 new tests covering embeddings, HNSW, flat index, and hybrid search
Bug Fixes
- Fixed `save_index`/`load_index` eager scorer persistence
- `save_index` now correctly serializes all scorer attributes (`k1`, `b`, `use_eager`, `use_numba`, `use_maxscore`)
- `load_index` now recreates `eager_scorer`, `numba_scorer`, and `maxscore_scorer` properly
- Added comprehensive roundtrip tests for all index modes
New Dependencies
- `vajra-bm25[vector]`: numpy + sentence-transformers for semantic search
- `vajra-bm25[vector-numba]`: adds numba for distance computation acceleration
Publishing to PyPI
To build and publish a new version of Vajra BM25:
1. Install build tools:

   ```bash
   pip install build twine
   ```

2. Clean previous builds:

   ```bash
   rm -rf dist/ build/ *.egg-info
   ```

3. Build the package:

   ```bash
   python -m build
   ```

   This generates a `.whl` and a `.tar.gz` in the `dist/` directory.

4. Upload to PyPI:

   ```bash
   python -m twine upload dist/*
   ```
License
MIT License - see LICENSE for details.
Acknowledgments
- BM25 algorithm: Robertson & Zaragoza, "The Probabilistic Relevance Framework"
- Category theory foundations: Rutten, "Universal Coalgebra: A Theory of Systems"
- Built and explored in the State Dynamic Modeling project
- Inspired by Bartosz Milewski's Category Theory lectures on YouTube