Categorical BM25 search engine using pure category theory

Vajra BM25

Python 3.8+ License: MIT

Vajra (Sanskrit: वज्र, "thunderbolt") is a high-performance BM25 search engine. It uses Category Theory abstractions to reframe the BM25 algorithm, providing a well-structured API with vectorized implementations. Benchmarks show Vajra is ~1.2-1.3x faster than BM25S (one of the fastest Python BM25 libraries) while achieving competitive recall across datasets.

What Makes Vajra Different

Vajra implements the standard BM25 ranking algorithm using rigorous mathematical abstractions:

  • Morphisms: BM25 scoring as a mathematical arrow (Query, Document) → ℝ
  • Coalgebras: Search as state unfolding QueryState → List[SearchResult]
  • Functors: The List functor captures multiple-results semantics

While Vajra BM25 uses the same underlying mathematics as other BM25 implementations, it describes the search process with a different vocabulary, and its abstractions are more amenable to experimentation and improvement. The core BM25 formula is identical to other implementations; category theory provides the organizational structure.
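The "scoring as a morphism" view can be sketched as a plain arrow (query, document) → ℝ. The names below are illustrative only, not Vajra's actual classes:

```python
# Illustrative sketch of BM25-style scoring as a morphism: an arrow
# (query, document) -> float. Hypothetical names, not Vajra's actual API.
class ScoringArrow:
    def __init__(self, fn):
        self.fn = fn

    def apply(self, query: str, doc: str) -> float:
        return self.fn(query, doc)

# A trivial "scorer": count how many query terms appear in the document.
overlap = ScoringArrow(
    lambda q, d: float(sum(t in d.split() for t in q.split()))
)

print(overlap.apply("category functors", "functors preserve structure"))  # → 1.0
```

Because the scorer is just a value with an `apply` method, swapping in a different ranking function means constructing a different arrow, with no change to the surrounding search machinery.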

Installation

# Basic installation (zero dependencies)
pip install vajra-bm25

# With optimizations (NumPy + SciPy for vectorized operations)
pip install vajra-bm25[optimized]

# With PDF support (index and search PDF documents)
pip install vajra-bm25[pdf]

# With index persistence (save/load indices)
pip install vajra-bm25[persistence]

# With vector search (semantic search with embeddings)
pip install vajra-bm25[vector]

# With vector search + Numba acceleration
pip install vajra-bm25[vector-numba]

# With CLI (interactive search)
pip install vajra-bm25[cli]

# Everything
pip install vajra-bm25[all]

Interactive CLI

Vajra includes a rich interactive CLI for exploring BM25, vector, and hybrid search:

# Interactive BM25 mode with BEIR SciFact dataset
vajra-search

# Single query mode
vajra-search -q "machine learning algorithms"

# Use custom JSONL corpus
vajra-search --corpus my_documents.jsonl

# Search PDF files (requires: pip install vajra-bm25[pdf])
vajra-search --corpus document.pdf -q "search query"

# Search a directory of PDFs
vajra-search --corpus ./pdf_folder/

# Vector search (requires: pip install vajra-bm25[vector])
vajra-search --mode vector
vajra-search --mode vector --model all-MiniLM-L6-v2

# Hybrid search (BM25 + Vector with RRF fusion)
vajra-search --mode hybrid
vajra-search --mode hybrid --alpha 0.5  # 50% BM25, 50% vector
vajra-search --mode hybrid --alpha 0.7  # 70% BM25, 30% vector

# Show options
vajra-search --help

In interactive mode, use :help for commands, :stats for index info, and :quit to exit.

Quick Start

The Python API for using Vajra BM25 is quite straightforward. Vajra supports JSONL and PDF document formats via the DocumentCorpus class.

from vajra_bm25 import VajraSearch, Document, DocumentCorpus

# Create documents
documents = [
    Document(id="1", title="Category Theory", content="Functors preserve structure"),
    Document(id="2", title="Coalgebras", content="Coalgebras model dynamics"),
    Document(id="3", title="Search Algorithms", content="BFS explores level by level"),
]
corpus = DocumentCorpus(documents)

# Create search engine
engine = VajraSearch(corpus)

# Search
results = engine.search("category functors", top_k=5)

for r in results:
    print(f"{r.rank}. {r.document.title} (score: {r.score:.3f})")

Optimized Usage

For larger corpora (1000+ documents), use VajraSearchOptimized, which vectorizes scoring with NumPy/SciPy and is substantially faster than the base engine.

from vajra_bm25 import VajraSearchOptimized, DocumentCorpus

# Load corpus from JSONL
corpus = DocumentCorpus.load_jsonl("corpus.jsonl")

# Create optimized engine
# Automatically uses sparse matrices for >10K documents
engine = VajraSearchOptimized(corpus)

# Search (vectorized, cached)
results = engine.search("neural networks", top_k=10)

Parallel Batch Processing

For high-throughput scenarios, use VajraSearchParallel, which answers multiple queries concurrently. Parallelism adds coordination overhead, so single-query latency does not improve, but for batched workloads, and especially under memory constraints, it may outperform VajraSearchOptimized.

from vajra_bm25 import VajraSearchParallel

engine = VajraSearchParallel(corpus, max_workers=4)

# Process multiple queries in parallel
queries = ["machine learning", "deep learning", "neural networks"]
batch_results = engine.search_batch(queries, top_k=5)

Vector Search

Vajra supports semantic vector search using sentence transformers and HNSW indices (requires pip install vajra-bm25[vector]).

from vajra_bm25 import DocumentCorpus
from vajra_bm25.vector import (
    VajraVectorSearch,
    TextEmbeddingMorphism,
    NativeHNSWIndex,
)

# Load corpus
corpus = DocumentCorpus.load_jsonl("corpus.jsonl")

# Create embedding morphism
embedder = TextEmbeddingMorphism(
    model_name="all-MiniLM-L6-v2",
    normalize=True
)

# Create HNSW index (use FlatVectorIndex for small corpora)
index = NativeHNSWIndex(
    dimension=384,  # all-MiniLM-L6-v2 dimension
    metric="cosine",
    M=16,
    ef_construction=200,
    ef_search=50
)

# Build search engine
vector_engine = VajraVectorSearch(embedder, index)
vector_engine.index_documents(list(corpus.documents.values()))

# Semantic search
results = vector_engine.search("machine learning techniques", top_k=10)

Hybrid Search (BM25 + Vector Fusion)

Combine keyword-based BM25 with semantic vector search using Reciprocal Rank Fusion:

from vajra_bm25 import VajraSearchOptimized
from vajra_bm25.vector import HybridSearchEngine

# Create BM25 engine
bm25_engine = VajraSearchOptimized(corpus)

# Create vector engine (from above)
vector_engine = VajraVectorSearch(embedder, index)
vector_engine.index_documents(list(corpus.documents.values()))

# Create hybrid engine
hybrid_engine = HybridSearchEngine(
    bm25_engine=bm25_engine,
    vector_engine=vector_engine,
    alpha=0.5,  # 0.5 = equal weight to BM25 and vector
    method="rrf"  # Reciprocal Rank Fusion (also: "linear", "rsf")
)

# Search with both keyword and semantic relevance
results = hybrid_engine.search("neural network architectures", top_k=10)

# Each result includes component scores
for r in results:
    print(f"{r.rank}. {r.document.title}")
    print(f"   Hybrid: {r.score:.3f} | BM25: {r.bm25_score:.3f} | Vector: {r.vector_score:.3f}")

Why Hybrid Search?

  • BM25 excels at exact keyword matches and term frequency
  • Vector search captures semantic similarity and synonyms
  • Hybrid fusion combines the best of both approaches
  • RRF method is robust and parameter-free, doesn't require score normalization
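Reciprocal Rank Fusion needs only the rank of each document in each result list: every list contributes 1/(k + rank) per document, with k = 60 as the conventional constant. A standalone sketch of the idea (not Vajra's internal code):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank_d).

    rankings: list of ranked doc-id lists (best first).
    Returns doc ids sorted by fused score, best first.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d1", "d2", "d3"]     # keyword view
vector_ranking = ["d3", "d1", "d4"]   # semantic view
print(rrf_fuse([bm25_ranking, vector_ranking]))  # → ['d1', 'd3', 'd2', 'd4']
```

Because only ranks are used, BM25 scores (unbounded) and cosine similarities (in [-1, 1]) never need to be put on a common scale, which is why RRF requires no score normalization.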

Performance

Benchmarked against BM25 implementations across BEIR and Wikipedia datasets (January 2025):

BEIR/SciFact (5,183 docs, 300 queries)

Engine    Latency   Recall@10   NDCG@10   QPS
vajra     0.14ms    78.9%       67.0%     7,133
bm25s     0.18ms    77.4%       66.2%     5,512
tantivy   0.28ms    72.5%       60.0%     3,539

Wikipedia/200K (200,000 docs, 500 queries)

Engine    Latency   Recall@10   NDCG@10   QPS
vajra     0.89ms    44.4%       35.1%     1,125
bm25s     1.08ms    44.6%       35.2%     925
tantivy   6.68ms    45.6%       36.4%     150

Wikipedia/500K (500,000 docs, 500 queries)

Engine    Latency   Recall@10   NDCG@10   QPS
vajra     1.89ms    49.6%       36.7%     529
bm25s     2.45ms    49.8%       37.1%     409
tantivy   5.52ms    51.6%       38.3%     181

Wikipedia/1M (1,000,000 docs, 500 queries)

Engine    Build Time   Latency   Recall@10   NDCG@10   QPS
vajra     17.0 min     3.40ms    45.6%       36.3%     294
bm25s     11.3 min     5.44ms    45.8%       36.7%     184

Key observations:

  • Vajra is ~1.3-1.6x faster than BM25S on single queries across corpus sizes
  • Sub-4ms latency even at 1M documents
  • Competitive accuracy: within 1% NDCG of BM25S
  • Optimized index building with per-document array concatenation

Caching for Production Workloads

For production systems with repeated queries, Vajra includes built-in LRU caching:

Scenario                  Typical Latency   Explanation
First query (cold)        0.14-3.5ms        Full BM25 computation (scales with corpus)
Repeated query (cached)   ~0.001ms          LRU cache hit, near-instant

Enable caching by setting cache_size (default: 1000):

engine = VajraSearchOptimized(corpus, cache_size=1000)
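The cache behavior can be sketched with the standard library's functools.lru_cache: key on (query, top_k), return immutable results. This is an illustrative pattern, not Vajra's internal implementation:

```python
from functools import lru_cache

def make_cached_search(search_fn, cache_size=1000):
    # Wrap a search function in an LRU cache keyed on (query, top_k).
    # Illustrative pattern only, not Vajra's internal implementation.
    @lru_cache(maxsize=cache_size)
    def cached(query, top_k):
        return tuple(search_fn(query, top_k))  # tuples are hashable/immutable
    return cached

# Toy search function: return documents containing the query term.
docs = ["sparse matrices", "dense vectors", "sparse rewards"]
search = make_cached_search(lambda q, k: [d for d in docs if q in d][:k])

print(search("sparse", 10))      # cold: computed
print(search("sparse", 10))      # warm: served from the LRU cache
print(search.cache_info().hits)  # → 1
```

Once `cache_size` distinct queries have been seen, the least recently used entry is evicted, so memory stays bounded while hot queries stay near-instant.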

How Vajra Achieves Performance

  1. Eager Score Matrix: Pre-computes BM25 term-document scores at index time
  2. Sparse Matrices: Avoids computation on ~99% zeros in the term-document matrix
  3. Vectorized NumPy/SciPy: Uses SIMD instructions for batch scoring
  4. Numba JIT: Optional compiled scoring loops for additional speedup
  5. LRU Caching: Optional query result caching for repeated queries

For detailed benchmark methodology and results, see docs/benchmarks.md.

Running Benchmarks

The benchmark system includes progress display, automatic file output, and index caching to avoid expensive rebuilds.

# Install benchmark dependencies
pip install vajra-bm25[optimized] rank-bm25 bm25s beir rich tantivy

# Optional: Pyserini (requires Java 11+)
pip install pyserini

# Quick start: Run BEIR SciFact benchmark (small dataset, ~5K docs)
python benchmarks/benchmark.py --datasets beir-scifact

# Run Wikipedia benchmarks (requires downloading data first)
python benchmarks/download_wikipedia.py --max-docs 200000
python benchmarks/benchmark.py --datasets wiki-200k

# Run multiple datasets
python benchmarks/benchmark.py --datasets beir-scifact beir-nfcorpus wiki-200k wiki-500k

# All engines comparison
python benchmarks/benchmark.py --datasets beir-scifact \
    --engines vajra vajra-parallel bm25s tantivy pyserini

# Force rebuild indexes (skip cache)
python benchmarks/benchmark.py --datasets wiki-200k --no-cache

Output files:

  • results/benchmark_results.json - Structured JSON with detailed metrics
  • results/benchmark.log - Human-readable log (appended each run)
  • .index_cache/ - Cached indexes for faster subsequent runs

Available datasets: beir-scifact, beir-nfcorpus, wiki-100k, wiki-200k, wiki-500k, wiki-1m, custom

Available engines: vajra, vajra-parallel, bm25s, bm25s-parallel, tantivy, pyserini, rank-bm25

Note: Pyserini requires Java 11+ installed. On macOS: brew install openjdk@21

Supported Formats

JSONL Format

Vajra uses JSONL for corpus persistence:

{"id": "doc1", "title": "First Document", "content": "Content here"}
{"id": "doc2", "title": "Second Document", "content": "More content"}

Load and save:

# Save
corpus.save_jsonl("corpus.jsonl")

# Load
corpus = DocumentCorpus.load_jsonl("corpus.jsonl")

PDF Support

Vajra can index and search PDF documents directly (requires pip install vajra-bm25[pdf]):

from vajra_bm25 import DocumentCorpus, VajraSearchOptimized

# Load a single PDF
corpus = DocumentCorpus.load_pdf("research_paper.pdf")

# Load all PDFs from a directory
corpus = DocumentCorpus.load_pdf_directory("./papers/")

# Load recursively from subdirectories
corpus = DocumentCorpus.load_pdf_directory("./papers/", recursive=True)

# Auto-detect format (JSONL, PDF, or directory)
corpus = DocumentCorpus.load("./my_data")  # Works with any format

# Build index and search
engine = VajraSearchOptimized(corpus)
results = engine.search("attention mechanism transformer", top_k=10)

PDF metadata (title, author, page count) is automatically extracted and stored in the document's metadata field.

BM25 Parameters

from vajra_bm25 import VajraSearch, BM25Parameters

# Custom BM25 parameters
params = BM25Parameters(
    k1=1.5,  # Term frequency saturation (default: 1.5)
    b=0.75   # Length normalization (default: 0.75)
)

engine = VajraSearch(corpus, params=params)
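The roles of k1 and b can be seen in a minimal per-term scoring function using the standard Robertson/Zaragoza form of BM25 (a sketch for intuition, not Vajra's implementation):

```python
import math

def bm25_term_score(tf, df, n_docs, doc_len, avg_doc_len, k1=1.5, b=0.75):
    """Score one term in one document under standard BM25.

    k1 controls term-frequency saturation: higher k1 lets repeated
    terms keep adding score; lower k1 saturates quickly.
    b controls length normalization: b=1 fully penalizes long
    documents, b=0 ignores document length entirely.
    """
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    norm = 1 - b + b * (doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + k1 * norm)

# Doubling tf gives diminishing returns (saturation):
print(bm25_term_score(tf=1, df=10, n_docs=1000, doc_len=100, avg_doc_len=100))
print(bm25_term_score(tf=2, df=10, n_docs=1000, doc_len=100, avg_doc_len=100))
```

A document's total score is the sum of these per-term scores over the query's terms, which is exactly what the eager score matrix precomputes.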

Categorical Abstractions (Advanced)

For users interested in the category theory foundations:

from vajra_bm25 import (
    Morphism, FunctionMorphism, IdentityMorphism,
    Coalgebra, SearchCoalgebra,
    Functor, ListFunctor,
)

# Morphism composition
f = FunctionMorphism(lambda x: x + 1)
g = FunctionMorphism(lambda x: x * 2)
h = f >> g  # h(x) = (x + 1) * 2

# Identity laws
identity = IdentityMorphism()
assert (f >> identity).apply(5) == f.apply(5)  # f . id = f
assert (identity >> f).apply(5) == f.apply(5)  # id . f = f

Bartosz Milewski offers a more rigorous treatment of these category theory concepts.

API Reference

Core Classes

  • Document(id, title, content, metadata=None) - Immutable document
  • DocumentCorpus(documents) - Collection of documents
  • VajraSearch(corpus, params=None) - Base search engine
  • VajraSearchOptimized(corpus, k1=1.5, b=0.75) - Vectorized search
  • VajraSearchParallel(corpus, max_workers=4) - Parallel batch search

Search Results

@dataclass
class SearchResult:
    document: Document  # The matched document
    score: float        # BM25 relevance score
    rank: int           # Position in results (1-indexed)

Why Category Theory?

Category theory did not make Vajra fast. The speed comes from NumPy, sparse matrices, and caching. But it gave the code a clean organizational structure.

How the Abstractions Help

The categorical/ module provides base classes that derived implementations extend:

  • Morphism[A, B]: Composable transformations with apply() and >> operator for chaining (preprocess >> tokenize >> score)
  • Coalgebra[X, FX]: The "unfolding" pattern via structure_map(). BM25 search extends this as QueryState → List[SearchResult]
  • Functor[A, FA]: Lifts operations to containers. ListFunctor handles multiple results; MaybeFunctor handles failure

This creates clear separation: data types, transformations, and search dynamics each have their place. Adding a new scorer or search strategy means implementing a known interface.
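The coalgebraic "unfolding" pattern from the second bullet can be sketched generically. The names below follow the bullets but are hypothetical; Vajra's actual base classes may differ in detail:

```python
# Sketch of the coalgebra pattern: structure_map() unfolds a state into
# an observation. Illustrative names, not Vajra's exact API.
from dataclasses import dataclass
from typing import Generic, List, TypeVar

X = TypeVar("X")
FX = TypeVar("FX")

class Coalgebra(Generic[X, FX]):
    def structure_map(self, state: X) -> FX:
        raise NotImplementedError

@dataclass(frozen=True)
class QueryState:
    query: str
    top_k: int

class ToySearchCoalgebra(Coalgebra[QueryState, List[str]]):
    """Unfolds QueryState -> List[str]: the List functor captures
    the multiple-results semantics of search."""

    def __init__(self, docs: List[str]):
        self.docs = docs

    def structure_map(self, state: QueryState) -> List[str]:
        # Toy dynamics: substring match, truncated to top_k.
        hits = [d for d in self.docs if state.query in d]
        return hits[: state.top_k]

engine = ToySearchCoalgebra(
    ["functors preserve structure", "coalgebras model dynamics"]
)
print(engine.structure_map(QueryState("functor", 5)))
```

A new search strategy is just another `structure_map` over the same state type, which is the "known interface" the paragraph above refers to.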

Where Speed Actually Comes From

Technique                    Impact
Sparse matrices (CSR)        6-8x memory reduction
Vectorized NumPy             10-50x speedup
Eager pre-computed scores    Sub-millisecond queries
LRU query caching            Near-instant repeated queries

These are engineering techniques. The categorical structure organizes the code; NumPy and SciPy deliver the performance.

Development

# Clone repository
git clone https://github.com/aiexplorations/vajra_bm25.git
cd vajra_bm25

# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Run with coverage
pytest --cov=vajra_bm25 --cov-report=html

What's New in v0.4.0

Vector Search Module

  • Native HNSW implementation with coalgebraic abstractions
  • TextEmbeddingMorphism: Sentence transformer integration
  • NativeHNSWIndex: Approximate nearest neighbor search
  • FlatVectorIndex: Exact brute-force baseline
  • HybridSearchEngine: BM25 + Vector fusion (RRF, Linear, RSF methods)
  • Optional Numba acceleration for distance functions
  • 40 new tests covering embeddings, HNSW, flat index, and hybrid search

Bug Fixes

  • Fixed save_index/load_index eager scorer persistence
  • save_index now correctly serializes all scorer attributes (k1, b, use_eager, use_numba, use_maxscore)
  • load_index now recreates eager_scorer, numba_scorer, and maxscore_scorer properly
  • Added comprehensive roundtrip tests for all index modes

New Dependencies

  • vajra-bm25[vector]: numpy + sentence-transformers for semantic search
  • vajra-bm25[vector-numba]: adds numba for distance computation acceleration

Publishing to PyPI

To build and publish a new version of Vajra BM25:

  1. Install build tools:

    pip install build twine
    
  2. Clean previous builds:

    rm -rf dist/ build/ *.egg-info
    
  3. Build the package:

    python -m build
    

    This generates a .whl and a .tar.gz in the dist/ directory.

  4. Upload to PyPI:

    python -m twine upload dist/*
    

License

MIT License - see LICENSE for details.

Acknowledgments

  • BM25 algorithm: Robertson & Zaragoza, "The Probabilistic Relevance Framework"
  • Category theory foundations: Rutten, "Universal Coalgebra: A Theory of Systems"
  • Built and explored in the State Dynamic Modeling project
  • Inspired by Bartosz Milewski's category theory lectures on YouTube
