zvec-db

Sparse/dense vectorization and document reranking toolkit for zvec.

Python 3.12+ · MIT License


Quick Start (5 minutes)

1. Install

pip install zvec-db

2. Index documents

from zvec_db.embedders import BM25Embedder, SentenceTransformersEmbedder

documents = [
    "Machine learning is a subset of AI",
    "Deep learning uses neural networks",
    "NLP helps computers understand text",
]

# Create embedders
bm25 = BM25Embedder(max_features=4096)
bm25.fit(documents)

dense = SentenceTransformersEmbedder(model_name="all-MiniLM-L6-v2")

# Encode documents
for doc in documents:
    sparse_vec = bm25.embed(doc)   # dict: {index: score}
    dense_vec = dense.embed(doc)   # numpy array

3. Search with hybrid + reranking

from zvec.model.doc import Doc
from zvec_db.rerankers.defaults import DefaultHybridReranker

query = "neural networks"

# Ready-to-use reranker (dense 60% + BM25 40%)
reranker = DefaultHybridReranker()

results = reranker.rerank({
    "dense": [Doc(id="0", score=0.8), Doc(id="1", score=0.9)],
    "bm25":  [Doc(id="1", score=0.7), Doc(id="2", score=0.6)],
})

print(results[0].id)  # Most relevant document

Key Concepts

Understanding distance vs similarity metrics

Problem: Vector databases store distances (smaller = more similar), but fusion algorithms assume similarities (larger = more relevant).

The metrics parameter handles conversion automatically:

| Metric | Type | Range | Conversion | Usage |
|--------|------|-------|------------|-------|
| COSINE | Distance | [0, 2] | 1.0 - score/2.0 | Normalized embeddings (Qdrant, zvec) |
| L2 | Distance | [0, ∞) | 1 - 2*atan(s)/π | Euclidean distance |
| IP | Similarity | (-∞, ∞) | None | Inner product (already a similarity) |
| None | - | - | None | BM25 scores, or scores already normalized to [0, 1] |

Default: metrics=MetricType.COSINE (main use case with zvec/Qdrant).

Choosing a sparse embedder

| Embedder | Use case |
|----------|----------|
| BM25Embedder | Recommended: standard lexical search |
| TfidfEmbedder | TF-IDF weighting with sublinear TF option |
| CountEmbedder | Simple term counts (binary option available) |
| BM25LEmbedder | Documents with highly variable lengths |
| BM25PlusEmbedder | Avoids zero scores via delta smoothing |
| DisMaxEmbedder | Multi-field search (takes the maximum score) |
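To illustrate the dis-max idea behind the last entry (an illustrative sketch, not `DisMaxEmbedder`'s actual implementation): for each term index, keep the best score produced by any field, rather than summing across fields:

```python
def dismax(field_vectors):
    """Dis-max fusion of sparse vectors: per term index, keep the max score across fields."""
    merged = {}
    for vec in field_vectors:
        for idx, score in vec.items():
            merged[idx] = max(merged.get(idx, score), score)
    return merged

print(dismax([{1: 0.5, 2: 0.2}, {2: 0.9}]))  # -> {1: 0.5, 2: 0.9}
```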

Complete Example: Hybrid Search Pipeline

Variant 1: Advanced with Schema Auto-Detection

Complete example with schema auto-detection for metrics. The reranker automatically infers the correct metric conversion from the zvec collection schema.

import zvec
from zvec.model.doc import Doc
from zvec_db.embedders import BM25Embedder, SentenceTransformersEmbedder
from zvec_db.rerankers import WeightedReranker

# 1. Documents to index
documents = [
    "Machine learning is a subset of AI",
    "Deep learning uses neural networks",
    "NLP helps computers understand text",
    "Computer vision interprets images",
    "Reinforcement learning trains agents",
]

# 2. Create embedders
bm25 = BM25Embedder(max_features=4096)
bm25.fit(documents)  # Fit on documents to build vocabulary

dense = SentenceTransformersEmbedder(model_name="all-MiniLM-L6-v2")

# 3. Create collection with hybrid schema
schema = zvec.CollectionSchema(
    name="docs",
    fields=[
        zvec.FieldSchema("text", zvec.DataType.STRING),
    ],
    vectors=[
        zvec.VectorSchema("sparse", zvec.DataType.SPARSE_VECTOR_FP32, dimension=4096),
        zvec.VectorSchema("dense", zvec.DataType.VECTOR_FP32, dimension=384),
    ]
)
collection = zvec.create_and_open("./my_db", schema)

# 4. Index documents with both sparse and dense vectors
for i, doc in enumerate(documents):
    collection.insert(Doc(
        id=str(i),
        fields={"text": doc},
        vectors={
            "sparse": bm25.embed(doc),
            "dense": dense.embed(doc),
        }
    ))

# 5. Create reranker with schema auto-detection
# The schema tells the reranker:
# - SPARSE_VECTOR_FP32 field -> metrics=None (BM25 scores, not distances)
# - VECTOR_FP32 field -> metrics=MetricType.COSINE (convert distance to similarity)
#
# normalize=True uses smart defaults:
# - COSINE metric → divide by 2 (since cosine distances are in [0, 2])
# - Other metrics (IP, L2, None/BM25) → "bayes" normalization
reranker = WeightedReranker(
    topn=3,
    schema=collection.schema,  # Auto-detect metrics
    weights={"sparse": 0.4, "dense": 0.6},
    normalize=True,  # Smart default: sparse→bayes, dense→/2
)

# 6. Search with hybrid query
query = "neural networks"

results = collection.query(
    vectors=[
        zvec.VectorQuery(field_name="sparse", vector=bm25.embed(query)),
        zvec.VectorQuery(field_name="dense", vector=dense.embed(query)),
    ],
    topk=10,
    reranker=reranker,
)

# 7. Display results
print("Top results:")
for i, doc in enumerate(results[:5]):
    print(f"  {i+1}. {doc.fields['text']} (score: {doc.score:.4f})")

Output:

Top results:
  1. Deep learning uses neural networks (score: 0.7131)
  2. Machine learning is a subset of AI (score: 0.6919)
  3. NLP helps computers understand text (score: 0.6839)

Variant 2: Standalone Reranking (No zvec Collection)

Use rerankers independently without zvec:

from zvec.model.doc import Doc
from zvec_db.rerankers.defaults import DefaultWeightedReranker

# Pre-computed results from different sources
bm25_results = [
    Doc(id="doc1", score=0.85),
    Doc(id="doc2", score=0.72),
    Doc(id="doc3", score=0.65),
]

dense_results = [
    Doc(id="doc2", score=0.91),
    Doc(id="doc4", score=0.78),
    Doc(id="doc1", score=0.68),
]

# Fuse results with Bayesian normalization
reranker = DefaultWeightedReranker(
    weights={"bm25": 0.4, "dense": 0.6}
)

final_results = reranker.rerank({
    "bm25": bm25_results,
    "dense": dense_results,
})

print("Fused results:")
for doc in final_results[:5]:
    print(f"  {doc.id}: {doc.score:.4f}")

Variant 3: Reciprocal Rank Fusion (RRF)

Rank-based fusion without score normalization:

from zvec_db.rerankers import DefaultRrfReranker

reranker = DefaultRrfReranker(topn=10, rank_constant=60)

results = reranker.rerank({
    "bm25": bm25_results,
    "dense": dense_results,
})


Installation

# Basic install
pip install zvec-db

# With preprocessing (recommended for French/German/etc.)
pip install "zvec-db[preprocessing]"

# For development
pip install "zvec-db[dev,test,docs]"

Sparse Embedders

Sparse embedders transform text into sparse dictionaries {index: score} compatible with zvec (SPARSE_VECTOR_FP32).

BM25Embedder (recommended)

BM25 is the standard for lexical search - best choice for most use cases.

from zvec_db.embedders import BM25Embedder
from zvec_db.preprocessing import NormalizationConfig

# Preprocessing config (optional but recommended)
config = NormalizationConfig.aggressive(language="english")

bm25 = BM25Embedder(
    max_features=4096,      # Max non-zero terms
    k1=1.2,                 # Term freq saturation (default: 1.2)
    b=0.75,                 # Length normalization (default: 0.75)
    preprocessing_config=config
)

bm25.fit(documents)
vector = bm25.embed("query")  # {42: 0.523, 108: 0.312, ...}
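For intuition about the k1 and b parameters, here is the classic Okapi BM25 term weight as a standalone sketch (the embedder's exact scoring may differ in details):

```python
import math

def bm25_weight(tf, df, n_docs, doc_len, avg_len, k1=1.2, b=0.75):
    """Classic Okapi BM25 weight for one term in one document.

    tf: term frequency in the document, df: number of documents containing the term,
    n_docs: corpus size, doc_len/avg_len: this document's length and the average length.
    """
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    # b controls length normalization: b=0 disables it, b=1 applies it fully.
    norm = k1 * (1 - b + b * doc_len / avg_len)
    # k1 saturates term frequency: doubling tf less than doubles the weight.
    return idf * tf * (k1 + 1) / (tf + norm)
```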

Other sparse embedders

See Choosing a sparse embedder above for guidance.

from zvec_db.embedders import TfidfEmbedder, CountEmbedder

tfidf = TfidfEmbedder(max_features=4096, sublinear_tf=True)
count = CountEmbedder(max_features=4096, binary=True)

Dense Embedders

Dense embedders transform text into dense vectors (numpy arrays).

OpenAIEmbedder (API / vLLM)

Works with OpenAI API or compatible endpoints (vLLM, local servers).

from zvec_db.embedders import OpenAIEmbedder

# OpenAI API
embedder = OpenAIEmbedder(
    model="text-embedding-3-small",
    api_key="sk-..."
)

# Local vLLM
embedder = OpenAIEmbedder(
    base_url="http://localhost:9300/v1",
    model="embedding",
    max_batch_size=32,
)

vector = embedder.embed("query")

SentenceTransformersEmbedder (local)

Run embedding models locally using sentence-transformers.

from zvec_db.embedders import SentenceTransformersEmbedder

embedder = SentenceTransformersEmbedder(
    model_name="all-MiniLM-L6-v2",  # 384 dims, fast
    device="cpu",                   # or "cuda"
    normalize=True,                 # Normalize vectors
)

vector = embedder.embed("query")

Reranking

Reranking refines search results by combining multiple sources or applying secondary scoring.

Normalization

The normalize parameter controls how scores are normalized:

| Value | Effect |
|-------|--------|
| True (default) | Smart default: COSINE → /2, others → "bayes" |
| "bayes" | Bayesian sigmoid calibration (robust to outliers) |
| "minmax" | Min-max normalization: (x - min) / (max - min) |
| "percentile" | Rank-based normalization (very robust to outliers) |
| "cosine" | Division by 2 (for COSINE distances) |
| {"sparse": "bayes", "dense": "cosine"} | Per-source configuration |
| None or False | No normalization (raw scores after metric conversion) |

Smart default: with normalize=True, the reranker automatically picks a normalization for each source based on its metric:

  • COSINE → division by 2 (COSINE distances lie in [0, 2])
  • Others (IP, L2, None/BM25) → Bayesian normalization
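The two simplest strategies can be sketched in plain Python (illustrative only; the Bayesian calibration is more involved and not shown here):

```python
def minmax(scores):
    """Min-max normalization: map scores linearly onto [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]  # degenerate case: all scores equal
    return [(s - lo) / (hi - lo) for s in scores]

def percentile(scores):
    """Rank-based normalization: a score's value is its percentile rank
    (assumes distinct scores, for brevity)."""
    order = sorted(scores)
    n = len(scores)
    return [order.index(s) / (n - 1) for s in scores]

print(minmax([2.0, 4.0, 10.0]))      # -> [0.0, 0.25, 1.0]
print(percentile([2.0, 4.0, 10.0]))  # -> [0.0, 0.5, 1.0]
```

Note how the outlier 10.0 compresses the other min-max values, while percentile normalization spaces them evenly; that is why rank-based normalization is more robust to outliers.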

Fusion rerankers

WeightedReranker (weighted fusion)

from zvec_db.rerankers import WeightedReranker

# Smart default: COSINE → /2, others → bayes
reranker = WeightedReranker(
    topn=10,
    weights={"source1": 0.7, "source2": 0.3},
    normalize=True,  # Smart default
)

results = reranker.rerank({
    "source1": docs1,
    "source2": docs2,
})

# Per-source configuration
reranker = WeightedReranker(
    topn=10,
    weights={"sparse": 0.4, "dense": 0.6},
    normalize={"sparse": "bayes", "dense": "cosine"},
)

# No normalization
reranker = WeightedReranker(
    topn=10,
    normalize=None,
)

Automatic metric detection with schema

With zvec collections, pass schema so the reranker infers the metrics automatically:

import zvec
from zvec_db.rerankers import WeightedReranker

collection = zvec.open("./my_collection")

# The reranker automatically detects:
# - SPARSE_VECTOR_FP32 → metrics=None (BM25 scores)
# - VECTOR_FP32 COSINE → metrics=MetricType.COSINE
reranker = WeightedReranker(
    topn=10,
    schema=collection.schema,  # Infer metrics automatically
    weights={"sparse": 0.4, "dense": 0.6},
    normalize=True,  # Default: sparse→bayes, dense→/2
)

Important: either the metrics or the schema parameter is required to avoid incorrect conversions. For example:

reranker = WeightedReranker(
    metrics={"sparse": None, "dense": MetricType.COSINE},
    normalize=True,
)

RrfReranker (Reciprocal Rank Fusion)

from zvec_db.rerankers import RrfReranker

# Basic RRF (equal weights for all sources)
reranker = RrfReranker(topn=10, rank_constant=60)
results = reranker.rerank({"bm25": bm25_docs, "dense": dense_docs})

# Weighted RRF: favor dense over BM25
reranker = RrfReranker(
    topn=10,
    rank_constant=60,
    weights={"dense": 0.7, "bm25": 0.3}
)
results = reranker.rerank({"bm25": bm25_docs, "dense": dense_docs})

Note: RrfReranker accepts the normalize parameter for API consistency, but it has no effect because RRF uses document ranks, not raw scores. A warning is emitted if you set normalize to a non-None value.
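The rank-based fusion that RRF performs can be sketched as follows (a simplified standalone version, assuming each source is just an ordered list of document ids, best first):

```python
def rrf(sources, rank_constant=60, weights=None):
    """Reciprocal Rank Fusion: each source contributes w / (rank_constant + rank).

    sources: {source_name: ordered list of doc ids, best first}.
    Ranks start at 1; raw scores are ignored, only ordering matters.
    """
    weights = weights or {name: 1.0 for name in sources}
    fused = {}
    for name, ranking in sources.items():
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + weights[name] / (rank_constant + rank)
    # Highest fused score first
    return sorted(fused, key=fused.get, reverse=True)

print(rrf({"bm25": ["d1", "d2"], "dense": ["d2", "d3"]}))  # -> ['d2', 'd1', 'd3']
```

Because only ranks are used, RRF needs no score normalization, which is why it is a robust default when sources produce scores on very different scales.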

Ready-to-use rerankers

Avoid manual configuration with defaults:

from zvec_db.rerankers.defaults import (
    DefaultWeightedReranker,
    DefaultHybridReranker,
    DefaultRrfReranker,
)

# Weighted fusion with Bayesian normalization (default)
reranker = DefaultWeightedReranker()

# Optimized hybrid: dense (60%) + BM25 (40%)
reranker = DefaultHybridReranker()

# RRF with standard parameters
reranker = DefaultRrfReranker()

# All support the `normalize` parameter
reranker = DefaultWeightedReranker(normalize="minmax")
reranker = DefaultHybridReranker(normalize={"dense": "cosine", "bm25": "percentile"})
# Note: DefaultRrfReranker accepts `normalize` but it has no effect (warning emitted)

results = reranker.rerank({"bm25": bm25_docs, "dense": dense_docs})

Cross-Encoder rerankers

Cross-encoders rescore each document jointly with the query, so they require a query parameter at initialization.

SentenceTransformerReranker (local, binary)

from zvec_db.rerankers import SentenceTransformerReranker

reranker = SentenceTransformerReranker(
    query="machine learning",
    model_name="cross-encoder/ms-marco-MiniLM-L-6-v2",
    topn=10,
)
results = reranker.rerank({"bm25": docs})

ClassificationReranker (local, multi-class)

from zvec_db.rerankers import ClassificationReranker

reranker = ClassificationReranker(
    query="machine learning",
    model_name="your-classification-model",
    num_classes=5,  # Auto-inferred if not specified
    topn=10,
)
results = reranker.rerank({"bm25": docs})

OpenAIReranker (API)

from zvec_db.rerankers import OpenAIReranker

reranker = OpenAIReranker(
    query="machine learning",
    base_url="http://localhost:9400/v1",
    model="BAAI/bge-reranker-v2-m3",
    endpoint="rerank",  # or "score"
    topn=10,
)
results = reranker.rerank({"bm25": docs})

Diversification

SubmodularReranker (MMR)

Maximize relevance while diversifying results.

from zvec_db.rerankers import SubmodularReranker

reranker = SubmodularReranker(
    topn=10,
    lambda_param=0.7,  # 70% relevance, 30% diversity
    vector_field="embedding",
)
results = reranker.rerank({"source": docs_with_vectors})
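The greedy selection rule behind MMR can be sketched as follows (an illustrative standalone version over raw vectors, not SubmodularReranker's actual code):

```python
import numpy as np

def mmr(query_vec, doc_vecs, lambda_param=0.7, topn=3):
    """Greedy MMR: repeatedly pick the candidate maximizing
    lambda * sim(query, d) - (1 - lambda) * max_{s in selected} sim(d, s)."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < topn:
        def mmr_score(i):
            relevance = cos(query_vec, doc_vecs[i])
            # Penalty: similarity to the most similar already-selected document
            redundancy = max((cos(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_param * relevance - (1 - lambda_param) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With lambda_param=1.0 this degenerates to pure relevance ranking; lowering it pushes near-duplicate documents down the list.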

Preprocessing

Preprocessing improves sparse embedding quality.

Automatic (recommended)

from zvec_db.embedders import BM25Embedder
from zvec_db.preprocessing import NormalizationConfig

config = NormalizationConfig.aggressive(language="english")
bm25 = BM25Embedder(max_features=4096, preprocessing_config=config)
bm25.fit(documents)
# Preprocessing is automatically applied and saved

Utility functions

from zvec_db.preprocessing import normalize_text, stem_word, remove_stopwords

# Full pipeline
normalize_text("  HELLO WORLD  ", lowercase=True, remove_accents=True, stem=True)
# -> "hello world"

# Individual functions
stem_word("running", language="english")           # -> "run"
remove_stopwords("the cat eats", language="english")  # -> "cat eats"

Install nltk:

pip install "zvec-db[preprocessing]"

Model Persistence

from zvec_db.embedders import BM25Embedder

# Save
bm25 = BM25Embedder(max_features=4096, preprocessing_config=config)
bm25.fit(documents)
bm25.save("models/bm25_model.joblib")

# Load
bm25_loaded = BM25Embedder()
bm25_loaded.load("models/bm25_model.joblib")

# Embeddings are identical (preprocessing included)
assert bm25.embed("query") == bm25_loaded.embed("query")

Evaluation

from zvec_db.evaluation import evaluate_ranking

# Evaluate ranking quality
metrics = evaluate_ranking(
    ground_truth=[["doc1", "doc2"], ["doc3"]],
    predictions=[["doc2", "doc1"], ["doc3", "doc4"]],
    metrics=["ndcg", "map", "mrr", "recall"],
)
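Two of these metrics are simple enough to sketch by hand (illustrative implementations; evaluate_ranking's exact output format is not shown here):

```python
def mrr(ground_truth, predictions):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit per query."""
    total = 0.0
    for relevant, ranked in zip(ground_truth, predictions):
        rr = 0.0
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(ground_truth)

def recall_at_k(ground_truth, predictions, k):
    """Fraction of relevant documents retrieved in the top k, averaged over queries."""
    total = 0.0
    for relevant, ranked in zip(ground_truth, predictions):
        hits = len(set(relevant) & set(ranked[:k]))
        total += hits / len(relevant)
    return total / len(ground_truth)

gt = [["doc1", "doc2"], ["doc3"]]
preds = [["doc2", "doc1"], ["doc3", "doc4"]]
print(mrr(gt, preds))            # both queries hit at rank 1 -> 1.0
print(recall_at_k(gt, preds, 2)) # all relevant docs in top 2 -> 1.0
```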

Development

# Clone
git clone https://github.com/ccdv-ai/zvec-db.git
cd zvec-db

# Install with all dependencies
make install

# Run tests
make test

# Lint (black, isort, flake8, mypy)
make lint

# Build docs
make docs

License

MIT License
