
zvec-db

zvec-db 0.5.0 · Python 3.12+ · MIT License

Sparse/dense vectorization and document reranking toolkit for zvec.


Quick Start (5 minutes)

1. Install

pip install zvec-db

2. Embed documents

from zvec_db.embedders import BM25Embedder, SentenceTransformersEmbedder

documents = [
    "Machine learning is a subset of AI",
    "Deep learning uses neural networks",
    "NLP helps computers understand text",
]

# Create embedders
bm25 = BM25Embedder(max_features=4096)
bm25.fit(documents)

dense = SentenceTransformersEmbedder(model_name="all-MiniLM-L6-v2")

# Encode documents
for doc in documents:
    sparse_vec = bm25.embed(doc)   # dict: {index: score}
    dense_vec = dense.embed(doc)   # numpy array

3. Search with hybrid + reranking

from zvec.model.doc import Doc
from zvec_db.rerankers.defaults import DefaultHybridReranker

query = "neural networks"

# Ready-to-use reranker (dense 60% + BM25 40%)
reranker = DefaultHybridReranker()

results = reranker.rerank({
    "dense": [Doc(id="0", score=0.8), Doc(id="1", score=0.9)],
    "bm25":  [Doc(id="1", score=0.7), Doc(id="2", score=0.6)],
})

print(results[0].id)  # Most relevant document
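Under the hood, the fused score is roughly a weighted sum of per-source scores. A simplified sketch with the documents above (illustrative only; it ignores the normalization the real reranker applies):

```python
# Simplified weighted fusion: 0.6 * dense + 0.4 * bm25 per document.
dense_scores = {"0": 0.8, "1": 0.9}
bm25_scores = {"1": 0.7, "2": 0.6}

fused = {}
for doc_id in set(dense_scores) | set(bm25_scores):
    fused[doc_id] = (0.6 * dense_scores.get(doc_id, 0.0)
                     + 0.4 * bm25_scores.get(doc_id, 0.0))

# Sort document ids by fused score, best first
ranked = sorted(fused, key=fused.get, reverse=True)
print(ranked[0])  # doc "1": 0.6*0.9 + 0.4*0.7 = 0.82
```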

Key Concepts

Understanding distance vs similarity metrics

Problem: Vector databases store distances (smaller = more similar), but fusion algorithms assume similarities (larger = more relevant).

The metrics parameter handles conversion automatically:

| Metric | Type       | Range    | Conversion      | Usage                                          |
|--------|------------|----------|-----------------|------------------------------------------------|
| COSINE | Distance   | [0, 2]   | (2 - score) / 2 | Normalized embeddings (Qdrant, zvec)           |
| L2     | Distance   | [0, ∞)   | -score          | Euclidean distance (zvec)                      |
| IP     | Similarity | (-∞, ∞)  | None            | Inner product, BM25 scores, already normalized |

Default: metrics=MetricType.COSINE (main use case with zvec/Qdrant).
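For intuition, the documented conversions can be sketched in plain Python (illustrative only, not the library's internal code):

```python
def cosine_distance_to_similarity(score: float) -> float:
    # COSINE distances lie in [0, 2]; (2 - score) / 2 maps them to [0, 1],
    # with 0 distance -> 1.0 similarity.
    return (2 - score) / 2

def l2_distance_to_similarity(score: float) -> float:
    # L2 distances lie in [0, inf); negating preserves the ordering
    # (smaller distance -> larger similarity).
    return -score

def ip_to_similarity(score: float) -> float:
    # IP scores (and BM25 scores) are already similarities; no conversion.
    return score
```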

Choosing a sparse embedder

| Embedder         | Use case                                      |
|------------------|-----------------------------------------------|
| BM25Embedder     | Recommended - standard lexical search         |
| TfidfEmbedder    | TF-IDF weighting with sublinear TF option     |
| CountEmbedder    | Simple term counts (binary option available)  |
| BM25LEmbedder    | Documents with highly variable lengths        |
| BM25PlusEmbedder | Avoids zero scores via delta smoothing        |
| DisMaxEmbedder   | Multi-field search (takes maximum score)      |

Complete Example: Hybrid Search Pipeline

Variant 1: Advanced with Schema Auto-Detection

A complete pipeline in which the reranker infers the correct metric conversion for each source from the zvec collection schema.

import zvec
from zvec.model.doc import Doc
from zvec_db.embedders import BM25Embedder, SentenceTransformersEmbedder
from zvec_db.rerankers import WeightedReranker

# 1. Documents to index
documents = [
    "Machine learning is a subset of AI",
    "Deep learning uses neural networks",
    "NLP helps computers understand text",
    "Computer vision interprets images",
    "Reinforcement learning trains agents",
]

# 2. Create embedders
bm25 = BM25Embedder(max_features=4096)
bm25.fit(documents)  # Fit on documents to build vocabulary

dense = SentenceTransformersEmbedder(model_name="all-MiniLM-L6-v2")

# 3. Create collection with hybrid schema
schema = zvec.CollectionSchema(
    name="docs",
    fields=[
        zvec.FieldSchema("text", zvec.DataType.STRING),
    ],
    vectors=[
        zvec.VectorSchema(
            name="sparse", 
            data_type=zvec.DataType.SPARSE_VECTOR_FP32, 
            dimension=4096
        ),
        zvec.VectorSchema(
            name="dense", 
            data_type=zvec.DataType.VECTOR_FP32, 
            dimension=384,
            index_param=zvec.FlatIndexParam(metric_type=zvec.MetricType.COSINE)
        ),
    ]
)
collection = zvec.create_and_open("./my_db", schema)

# 4. Index documents with both sparse and dense vectors
for i, doc in enumerate(documents):
    collection.insert(Doc(
        id=str(i),
        fields={"text": doc},
        vectors={
            "sparse": bm25.embed(doc),
            "dense": dense.embed(doc),
        }
    ))

# 5. Create reranker with schema auto-detection
# The schema tells the reranker:
# - SPARSE_VECTOR_FP32 field -> metrics=None (BM25 scores, not distances)
# - VECTOR_FP32 field -> metrics=MetricType.COSINE (convert distance to similarity)
#
# normalize=True uses smart defaults:
# - COSINE metric → divide by 2 (since cosine distances are in [0, 2])
# - Other metrics (IP, L2, None/BM25) → "bayes" normalization
reranker = WeightedReranker(
    topn=3,
    schema=collection.schema,  # Auto-detect metrics
    weights={"sparse": 0.4, "dense": 0.6},
    normalize=True,  # Smart default: sparse→bayes, dense→/2
)

# 6. Search with hybrid query
query = "neural networks"

results = collection.query(
    vectors=[
        zvec.VectorQuery(field_name="sparse", vector=bm25.embed(query)),
        zvec.VectorQuery(field_name="dense", vector=dense.embed(query)),
    ],
    topk=10,
    reranker=reranker,
)

# 7. Display results
print("Top results:")
for i, doc in enumerate(results[:3]):
    print(f"  {i+1}. {doc.fields['text']} (score: {doc.score:.4f})")

Output:

Top results:
  1. Deep learning uses neural networks (score: 0.8593)
  2. Machine learning is a subset of AI (score: 0.6339)
  3. Reinforcement learning trains agents (score: 0.6161)

Variant 2: Standalone Reranking (No zvec Collection)

Use rerankers independently without zvec:

from zvec.model.doc import Doc
from zvec_db.rerankers.defaults import DefaultWeightedReranker

# Pre-computed results from different sources
bm25_results = [
    Doc(id="doc1", score=0.85),
    Doc(id="doc2", score=0.72),
    Doc(id="doc3", score=0.65),
]

dense_results = [
    Doc(id="doc2", score=0.91),
    Doc(id="doc4", score=0.78),
    Doc(id="doc1", score=0.68),
]

# Fuse results with Bayesian normalization
reranker = DefaultWeightedReranker(
    weights={"bm25": 0.4, "dense": 0.6}
)

final_results = reranker.rerank({
    "bm25": bm25_results,
    "dense": dense_results,
})

print("Fused results:")
for doc in final_results[:3]:
    print(f"  {doc.id}: {doc.score:.4f}")

Variant 3: Reciprocal Rank Fusion (RRF)

Rank-based fusion without score normalization:

from zvec_db.rerankers import DefaultRrfReranker

reranker = DefaultRrfReranker(topn=10, rank_constant=60)

results = reranker.rerank({
    "bm25": bm25_results,
    "dense": dense_results,
})

Installation

# Basic install
pip install zvec-db

# With preprocessing (recommended for French/German/etc.)
pip install "zvec-db[preprocessing]"

# For development
pip install "zvec-db[dev,test,docs]"

Sparse Embedders

Sparse embedders transform text into sparse dictionaries {index: score} compatible with zvec (SPARSE_VECTOR_FP32).

BM25Embedder (recommended)

BM25 is the standard for lexical search - best choice for most use cases.

from zvec_db.embedders import BM25Embedder
from zvec_db.preprocessing import NormalizationConfig

# Preprocessing config (optional but recommended)
config = NormalizationConfig.aggressive(language="english")

bm25 = BM25Embedder(
    max_features=4096,      # Max non-zero terms
    k1=1.2,                 # Term freq saturation (default: 1.2)
    b=0.75,                 # Length normalization (default: 0.75)
    preprocessing_config=config
)

bm25.fit(documents)
vector = bm25.embed("query")  # {42: 0.523, 108: 0.312, ...}
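To see how k1 and b shape scores, here is a generic BM25 term weight in its standard textbook form (not necessarily the embedder's exact implementation):

```python
import math

def bm25_weight(tf, df, n_docs, doc_len, avg_len, k1=1.2, b=0.75):
    # Inverse document frequency: rarer terms score higher.
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    # k1 caps the gain from repeated terms (saturation);
    # b controls how strongly longer documents are penalized.
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
    return idf * norm
```

Raising k1 lets term frequency matter more before saturating; b=0 disables length normalization entirely.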

Other sparse embedders

See Choosing a sparse embedder above for guidance.

from zvec_db.embedders import TfidfEmbedder, CountEmbedder

tfidf = TfidfEmbedder(max_features=4096, sublinear_tf=True)
count = CountEmbedder(max_features=4096, binary=True)

Dense Embedders

Dense embedders transform text into dense vectors (numpy arrays).

OpenAIEmbedder (API / vLLM)

Works with OpenAI API or compatible endpoints (vLLM, local servers).

from zvec_db.embedders import OpenAIEmbedder

# OpenAI API
embedder = OpenAIEmbedder(
    model="text-embedding-3-small",
    api_key="sk-..."
)

# Local vLLM
embedder = OpenAIEmbedder(
    base_url="http://localhost:9300/v1",
    model="embedding",
    max_batch_size=32,
)

vector = embedder.embed("query")

SentenceTransformersEmbedder (local)

Run embedding models locally using sentence-transformers.

from zvec_db.embedders import SentenceTransformersEmbedder

embedder = SentenceTransformersEmbedder(
    model_name="all-MiniLM-L6-v2",  # 384 dims, fast
    device="cpu",                   # or "cuda"
    normalize=True,                 # Normalize vectors
)

# Model is automatically loaded on first embed() call
vector = embedder.embed("query")

Reranking

Reranking refines search results by combining multiple sources or applying secondary scoring.

Normalization

The normalize parameter controls how scores are normalized:

| Value                                   | Effect                                               |
|-----------------------------------------|------------------------------------------------------|
| True (default)                          | Smart default: COSINE → /2, others → "bayes"         |
| "bayes"                                 | Bayesian sigmoid calibration (robust to outliers)    |
| "minmax"                                | Min-max normalization: (x - min) / (max - min)       |
| "percentile"                            | Rank-based normalization (very robust to outliers)   |
| "cosine"                                | Divide by 2 (for COSINE distances)                   |
| {"sparse": "bayes", "dense": "cosine"}  | Per-source configuration                             |
| None or False                           | No normalization (raw scores after conversion)       |

Smart default: When normalize=True, the reranker automatically detects the metric for each source:

  • COSINE → divide by 2 (since COSINE distances are in [0, 2])
  • Others (IP, L2, None/BM25) → Bayesian normalization
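For intuition, the "minmax" and "percentile" strategies can be sketched as follows (illustrative only; the library's "bayes" calibration is not reproduced here):

```python
def minmax(scores):
    # Map scores linearly to [0, 1]: (x - min) / (max - min).
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0  # guard against all-equal scores
    return [(s - lo) / span for s in scores]

def percentile(scores):
    # Rank-based: each score becomes its rank fraction in [0, 1],
    # which makes the result insensitive to outlier magnitudes.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    out = [0.0] * len(scores)
    for rank, i in enumerate(order):
        out[i] = rank / (len(scores) - 1) if len(scores) > 1 else 1.0
    return out
```

Note how an extreme outlier compresses minmax output but leaves percentile output unchanged.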

Fusion rerankers

WeightedReranker (weighted fusion)

from zvec_db.rerankers import WeightedReranker

# Smart default: COSINE → /2, others → bayes
reranker = WeightedReranker(
    topn=10,
    weights={"source1": 0.7, "source2": 0.3},
    normalize=True,  # Smart default
)

results = reranker.rerank({
    "source1": docs1,
    "source2": docs2,
})

# Per-source configuration
reranker = WeightedReranker(
    topn=10,
    weights={"sparse": 0.4, "dense": 0.6},
    normalize={"sparse": "bayes", "dense": "cosine"},
)

# No normalization
reranker = WeightedReranker(
    topn=10,
    normalize=None,
)

Auto-detection of metrics with schema

With zvec collections, use schema to automatically infer metrics:

import zvec
from zvec_db.rerankers import WeightedReranker

collection = zvec.open("./my_collection")

# The reranker automatically detects:
# - SPARSE_VECTOR_FP32 → metrics=None (BM25 scores)
# - VECTOR_FP32 COSINE → metrics=MetricType.COSINE
reranker = WeightedReranker(
    topn=10,
    schema=collection.schema,  # Infer metrics
    weights={"sparse": 0.4, "dense": 0.6},
    normalize=True,  # Default: sparse→bayes, dense→/2
)

Important: Pass either the metrics or the schema parameter so scores are converted correctly. For example, with explicit metrics:

import zvec

reranker = WeightedReranker(
    metrics={"sparse": None, "dense": zvec.MetricType.COSINE},
    normalize=True,
)

RrfReranker (Reciprocal Rank Fusion)

from zvec_db.rerankers import RrfReranker

# Basic RRF (equal weights for all sources)
reranker = RrfReranker(topn=10, rank_constant=60)
results = reranker.rerank({"bm25": bm25_docs, "dense": dense_docs})

# Weighted RRF: favor dense over BM25
reranker = RrfReranker(
    topn=10,
    rank_constant=60,
    weights={"dense": 0.7, "bm25": 0.3}
)
results = reranker.rerank({"bm25": bm25_docs, "dense": dense_docs})

Note: The normalize parameter is accepted by RrfReranker for API consistency, but it has no effect because RRF uses document ranks, not raw scores. A warning will be emitted if you set normalize to a non-None value.
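For intuition, RRF can be sketched in a few lines: each document accumulates weight / (rank_constant + rank) across sources (illustrative only, not the library's internal code):

```python
def rrf(ranked_lists, rank_constant=60, weights=None):
    # ranked_lists: {source_name: [doc_id, ...]} ordered best-first.
    weights = weights or {name: 1.0 for name in ranked_lists}
    scores = {}
    for name, ids in ranked_lists.items():
        for rank, doc_id in enumerate(ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weights[name] / (rank_constant + rank)
    # Return document ids sorted by fused score, best first.
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked #1 in two sources beats one ranked #1 in a single source, without any score normalization.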

Ready-to-use rerankers

Avoid manual configuration with defaults:

from zvec_db.rerankers.defaults import (
    DefaultWeightedReranker,
    DefaultHybridReranker,
    DefaultRrfReranker,
)

# Weighted fusion with Bayesian normalization (default)
reranker = DefaultWeightedReranker()

# Optimized hybrid: dense (60%) + BM25 (40%)
reranker = DefaultHybridReranker()

# RRF with standard parameters
reranker = DefaultRrfReranker()

# Weighted and Hybrid support the `normalize` parameter
reranker = DefaultWeightedReranker(normalize="minmax")
reranker = DefaultHybridReranker(normalize={"dense": "cosine", "bm25": "percentile"})
# Note: DefaultRrfReranker accepts `normalize` but it has no effect (warning emitted)

results = reranker.rerank({"bm25": bm25_docs, "dense": dense_docs})

Cross-Encoder rerankers

Cross-encoders recompute scores from the query and document text jointly. They require a query parameter at initialization.

SentenceTransformerReranker (local, binary)

from zvec_db.rerankers import SentenceTransformerReranker

reranker = SentenceTransformerReranker(
    query="machine learning",
    model_name="cross-encoder/ms-marco-MiniLM-L-6-v2",
    topn=10,
)
results = reranker.rerank({"bm25": docs})

ClassificationReranker (local, multi-class)

from zvec_db.rerankers import ClassificationReranker

reranker = ClassificationReranker(
    query="machine learning",
    model_name="your-classification-model",
    num_classes=5,  # Auto-inferred if not specified
    topn=10,
)
results = reranker.rerank({"bm25": docs})

OpenAIReranker (API)

from zvec_db.rerankers import OpenAIReranker

reranker = OpenAIReranker(
    query="machine learning",
    base_url="http://localhost:9400/v1",
    model="BAAI/bge-reranker-v2-m3",
    endpoint="rerank",  # or "score"
    topn=10,
)
results = reranker.rerank({"bm25": docs})

Preprocessing

Preprocessing improves sparse embedding quality.

Automatic (recommended)

from zvec_db.embedders import BM25Embedder
from zvec_db.preprocessing import NormalizationConfig

config = NormalizationConfig.aggressive(language="english")
bm25 = BM25Embedder(max_features=4096, preprocessing_config=config)
bm25.fit(documents)
# Preprocessing is automatically applied and saved

Utility functions

from zvec_db.preprocessing import normalize_text, stem_word, remove_stopwords

# Full pipeline
normalize_text("  HELLO WORLD  ", lowercase=True, remove_accents=True, stem=True)
# -> "hello world"

# Individual functions
stem_word("running", language="english")           # -> "run"
remove_stopwords("the cat eats", language="english")  # -> "cat eats"

Install nltk:

pip install "zvec-db[preprocessing]"

Model Persistence

from zvec_db.embedders import BM25Embedder

# Save
bm25 = BM25Embedder(max_features=4096, preprocessing_config=config)
bm25.fit(documents)
bm25.save("models/bm25_model.joblib")

# Load
bm25_loaded = BM25Embedder()
bm25_loaded.load("models/bm25_model.joblib")

# Embeddings are identical (preprocessing included)
assert bm25.embed("query") == bm25_loaded.embed("query")

Evaluation

from zvec_db.evaluation import evaluate_ranking

# Evaluate ranking quality
metrics = evaluate_ranking(
    ground_truth=[["doc1", "doc2"], ["doc3"]],
    predictions=[["doc2", "doc1"], ["doc3", "doc4"]],
    metrics=["ndcg", "map", "mrr", "recall"],
)
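For reference, two of these metrics can be computed by hand (a sketch of the standard definitions, not necessarily evaluate_ranking's internals):

```python
def mrr(ground_truth, predictions):
    # Mean reciprocal rank: 1/rank of the first relevant doc, averaged.
    total = 0.0
    for relevant, ranked in zip(ground_truth, predictions):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ground_truth)

def recall_at_k(ground_truth, predictions, k):
    # Fraction of relevant docs found in the top-k, averaged over queries.
    total = 0.0
    for relevant, ranked in zip(ground_truth, predictions):
        total += len(set(ranked[:k]) & set(relevant)) / len(relevant)
    return total / len(ground_truth)
```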

Development

# Clone
git clone https://github.com/ccdv-ai/zvec-db.git
cd zvec-db

# Install with all dependencies
make install

# Run tests
make test

# Lint (black, isort, flake8, mypy)
make lint

# Build docs
make docs

License

MIT License
