# zvec-db

Sparse/dense vectorization and document reranking toolkit for zvec.
## Quick Start (5 minutes)

### 1. Install

```bash
pip install zvec-db
```

### 2. Index documents

```python
from zvec_db.embedders import BM25Embedder, SentenceTransformersEmbedder

documents = [
    "Machine learning is a subset of AI",
    "Deep learning uses neural networks",
    "NLP helps computers understand text",
]

# Create embedders
bm25 = BM25Embedder(max_features=4096)
bm25.fit(documents)
dense = SentenceTransformersEmbedder(model_name="all-MiniLM-L6-v2")

# Encode documents
for doc in documents:
    sparse_vec = bm25.embed(doc)  # dict: {index: score}
    dense_vec = dense.embed(doc)  # numpy array
```

### 3. Search with hybrid + reranking

```python
from zvec.model.doc import Doc
from zvec_db.rerankers.defaults import DefaultHybridReranker

query = "neural networks"

# Ready-to-use reranker (dense 60% + BM25 40%)
reranker = DefaultHybridReranker()
results = reranker.rerank({
    "dense": [Doc(id="0", score=0.8), Doc(id="1", score=0.9)],
    "bm25": [Doc(id="1", score=0.7), Doc(id="2", score=0.6)],
})
print(results[0].id)  # Most relevant document
```
## Key Concepts

### Understanding distance vs similarity metrics
**Problem:** Vector databases store distances (smaller = more similar), but fusion algorithms assume similarities (larger = more relevant).

The `metrics` parameter handles conversion automatically:

| Metric | Type | Range | Conversion | Usage |
|---|---|---|---|---|
| `COSINE` | Distance | [0, 2] | `1.0 - score/2.0` | Normalized embeddings (Qdrant, zvec) |
| `L2` | Distance | [0, ∞) | `1 - 2*atan(s)/π` | Euclidean distance |
| `IP` | Similarity | (-∞, ∞) | None | Inner product (already a similarity) |
| `None` | - | - | None | BM25 scores or already normalized [0, 1] |

Default: `metrics=MetricType.COSINE` (the main use case with zvec/Qdrant).
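The conversions in the table can be sketched as plain functions. This is an illustration of the documented formulas, not the library's internal code:

```python
import math

def cosine_distance_to_similarity(d: float) -> float:
    # Cosine distance lies in [0, 2]; map linearly to a similarity in [0, 1].
    return 1.0 - d / 2.0

def l2_distance_to_similarity(d: float) -> float:
    # Euclidean distance lies in [0, inf); atan squashes it into (0, 1].
    return 1.0 - 2.0 * math.atan(d) / math.pi

print(cosine_distance_to_similarity(0.0))  # identical vectors -> 1.0
print(cosine_distance_to_similarity(2.0))  # opposite vectors -> 0.0
```

Both conversions are monotonically decreasing, so they preserve the ranking of results while flipping its direction, which is all the fusion step needs.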
### Choosing a sparse embedder

| Embedder | Use case |
|---|---|
| `BM25Embedder` | **Recommended**: standard lexical search |
| `TfidfEmbedder` | TF-IDF weighting with sublinear TF option |
| `CountEmbedder` | Simple term counts (binary option available) |
| `BM25LEmbedder` | Documents with highly variable lengths |
| `BM25PlusEmbedder` | Avoids zero scores via delta smoothing |
| `DisMaxEmbedder` | Multi-field search (takes the maximum score) |
## Complete Example: Hybrid Search Pipeline

### Variant 1: Advanced with Schema Auto-Detection

A complete example with schema auto-detection for metrics. The reranker automatically infers the correct metric conversion from the zvec collection schema.
```python
import zvec
from zvec.model.doc import Doc
from zvec_db.embedders import BM25Embedder, SentenceTransformersEmbedder
from zvec_db.rerankers import WeightedReranker

# 1. Documents to index
documents = [
    "Machine learning is a subset of AI",
    "Deep learning uses neural networks",
    "NLP helps computers understand text",
    "Computer vision interprets images",
    "Reinforcement learning trains agents",
]

# 2. Create embedders
bm25 = BM25Embedder(max_features=4096)
bm25.fit(documents)  # Fit on documents to build vocabulary
dense = SentenceTransformersEmbedder(model_name="all-MiniLM-L6-v2")

# 3. Create collection with hybrid schema
schema = zvec.CollectionSchema(
    name="docs",
    fields=[
        zvec.FieldSchema("text", zvec.DataType.STRING),
    ],
    vectors=[
        zvec.VectorSchema("sparse", zvec.DataType.SPARSE_VECTOR_FP32, dimension=4096),
        zvec.VectorSchema("dense", zvec.DataType.VECTOR_FP32, dimension=384),
    ],
)
collection = zvec.create_and_open("./my_db", schema)

# 4. Index documents with both sparse and dense vectors
for i, doc in enumerate(documents):
    collection.insert(Doc(
        id=str(i),
        fields={"text": doc},
        vectors={
            "sparse": bm25.embed(doc),
            "dense": dense.embed(doc),
        },
    ))

# 5. Create reranker with schema auto-detection
# The schema tells the reranker:
# - SPARSE_VECTOR_FP32 field -> metrics=None (BM25 scores, not distances)
# - VECTOR_FP32 field -> metrics=MetricType.COSINE (convert distance to similarity)
#
# normalize=True uses smart defaults:
# - COSINE metric -> divide by 2 (since cosine distances are in [0, 2])
# - Other metrics (IP, L2, None/BM25) -> "bayes" normalization
reranker = WeightedReranker(
    topn=3,
    schema=collection.schema,  # Auto-detect metrics
    weights={"sparse": 0.4, "dense": 0.6},
    normalize=True,  # Smart default: sparse -> bayes, dense -> /2
)

# 6. Search with hybrid query
query = "neural networks"
results = collection.query(
    vectors=[
        zvec.VectorQuery(field_name="sparse", vector=bm25.embed(query)),
        zvec.VectorQuery(field_name="dense", vector=dense.embed(query)),
    ],
    topk=10,
    reranker=reranker,
)

# 7. Display results
print("Top results:")
for i, doc in enumerate(results[:5]):
    print(f"  {i+1}. {doc.fields['text']} (score: {doc.score:.4f})")
```
Output:

```text
Top results:
  1. Deep learning uses neural networks (score: 0.7131)
  2. Machine learning is a subset of AI (score: 0.6919)
  3. NLP helps computers understand text (score: 0.6839)
```
### Variant 2: Standalone Reranking (No zvec Collection)

Use rerankers independently, without a zvec collection:

```python
from zvec.model.doc import Doc
from zvec_db.rerankers.defaults import DefaultWeightedReranker

# Pre-computed results from different sources
bm25_results = [
    Doc(id="doc1", score=0.85),
    Doc(id="doc2", score=0.72),
    Doc(id="doc3", score=0.65),
]
dense_results = [
    Doc(id="doc2", score=0.91),
    Doc(id="doc4", score=0.78),
    Doc(id="doc1", score=0.68),
]

# Fuse results with Bayesian normalization
reranker = DefaultWeightedReranker(
    weights={"bm25": 0.4, "dense": 0.6},
)
final_results = reranker.rerank({
    "bm25": bm25_results,
    "dense": dense_results,
})

print("Fused results:")
for doc in final_results[:5]:
    print(f"  {doc.id}: {doc.score:.4f}")
```
### Variant 3: Reciprocal Rank Fusion (RRF)

Rank-based fusion, with no score normalization needed:

```python
from zvec_db.rerankers import DefaultRrfReranker

reranker = DefaultRrfReranker(topn=10, rank_constant=60)
results = reranker.rerank({
    "bm25": bm25_results,
    "dense": dense_results,
})
```
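RRF scores each document by its summed reciprocal ranks across sources, `score(d) = Σ w_s / (k + rank_s(d))`, which is why no score normalization is needed. A minimal sketch of the idea (function names are illustrative, not the library's API):

```python
def rrf_fuse(rankings, k=60, weights=None):
    """rankings: {source: [doc_id, ...]} ordered best-first."""
    weights = weights or {}
    scores = {}
    for source, ranked_ids in rankings.items():
        w = weights.get(source, 1.0)
        for rank, doc_id in enumerate(ranked_ids, start=1):
            # Only the rank matters; raw scores never enter the formula
            scores[doc_id] = scores.get(doc_id, 0.0) + w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf_fuse({
    "bm25": ["doc1", "doc2", "doc3"],
    "dense": ["doc2", "doc4", "doc1"],
})
print(fused[0])  # "doc2": near the top of both lists, so it wins
```

The `k` constant (here matching `rank_constant=60`) damps the gap between adjacent ranks, so a document ranked well in several lists beats one ranked first in only one.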
## Installation

```bash
# Basic install
pip install zvec-db

# With preprocessing (recommended for French/German/etc.)
pip install "zvec-db[preprocessing]"

# For development
pip install "zvec-db[dev,test,docs]"
```
## Sparse Embedders

Sparse embedders transform text into sparse dictionaries `{index: score}` compatible with zvec (`SPARSE_VECTOR_FP32`).
### BM25Embedder (recommended)

BM25 is the standard for lexical search and the best choice for most use cases.

```python
from zvec_db.embedders import BM25Embedder
from zvec_db.preprocessing import NormalizationConfig

# Preprocessing config (optional but recommended)
config = NormalizationConfig.aggressive(language="english")

bm25 = BM25Embedder(
    max_features=4096,  # Max non-zero terms
    k1=1.2,             # Term frequency saturation (default: 1.2)
    b=0.75,             # Length normalization (default: 0.75)
    preprocessing_config=config,
)
bm25.fit(documents)
vector = bm25.embed("query")  # {42: 0.523, 108: 0.312, ...}
```
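To make the `k1` and `b` knobs concrete, here is the classic BM25 term score as a standalone sketch of the standard formula (not zvec-db's internal implementation, whose smoothing details may differ):

```python
import math

def bm25_term_score(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    # idf: rarer terms weigh more (standard BM25 idf with +0.5 smoothing)
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    # k1 caps the benefit of repeated terms; b scales the length penalty
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm

# Repetition saturates: doubling tf increases the score by far less than 2x
s1 = bm25_term_score(tf=1, df=3, n_docs=100, doc_len=50, avg_doc_len=60)
s2 = bm25_term_score(tf=2, df=3, n_docs=100, doc_len=50, avg_doc_len=60)
print(s2 > s1 and s2 / s1 < 2.0)  # True
```

Raising `k1` lets repeated terms keep earning score; raising `b` penalizes long documents more aggressively.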
### Other sparse embedders

See Choosing a sparse embedder above for guidance.

```python
from zvec_db.embedders import TfidfEmbedder, CountEmbedder

tfidf = TfidfEmbedder(max_features=4096, sublinear_tf=True)
count = CountEmbedder(max_features=4096, binary=True)
```
## Dense Embedders

Dense embedders transform text into dense vectors (numpy arrays).
### OpenAIEmbedder (API / vLLM)

Works with the OpenAI API or compatible endpoints (vLLM, local servers).

```python
from zvec_db.embedders import OpenAIEmbedder

# OpenAI API
embedder = OpenAIEmbedder(
    model="text-embedding-3-small",
    api_key="sk-...",
)

# Local vLLM
embedder = OpenAIEmbedder(
    base_url="http://localhost:9300/v1",
    model="embedding",
    max_batch_size=32,
)
vector = embedder.embed("query")
```
### SentenceTransformersEmbedder (local)

Run embedding models locally using sentence-transformers.

```python
from zvec_db.embedders import SentenceTransformersEmbedder

embedder = SentenceTransformersEmbedder(
    model_name="all-MiniLM-L6-v2",  # 384 dims, fast
    device="cpu",                   # or "cuda"
    normalize=True,                 # Normalize vectors
)
vector = embedder.embed("query")
```
## Reranking

Reranking refines search results by combining multiple sources or applying secondary scoring.

### Normalization
The `normalize` parameter controls how scores are normalized:

| Value | Effect |
|---|---|
| `True` (default) | Smart default: COSINE → /2, others → `"bayes"` |
| `"bayes"` | Bayesian sigmoid calibration (robust to outliers) |
| `"minmax"` | Min-max normalization: `(x - min) / (max - min)` |
| `"percentile"` | Rank-based normalization (very robust to outliers) |
| `"cosine"` | Division by 2 (for COSINE distances) |
| `{"sparse": "bayes", "dense": "cosine"}` | Per-source configuration |
| `None` or `False` | No normalization (raw scores after conversion) |

Smart default: when `normalize=True`, the reranker automatically detects the metric for each source:

- COSINE → division by 2 (since COSINE distances lie in [0, 2])
- Others (IP, L2, None/BM25) → Bayesian normalization
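The two simplest strategies can be sketched as follows. This is illustrative, not the library's code; the `"bayes"` variant fits a sigmoid calibration to the score distribution and is omitted here:

```python
def minmax_normalize(scores):
    # Linear rescale to [0, 1]; sensitive to outliers at either end
    lo, hi = min(scores), max(scores)
    if hi == lo:  # constant scores: map everything to 1.0
        return [1.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def percentile_normalize(scores):
    # Rank-based: each score becomes its rank fraction, ignoring magnitude,
    # which makes it very robust to outliers
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    out = [0.0] * len(scores)
    for rank, i in enumerate(order):
        out[i] = rank / (len(scores) - 1) if len(scores) > 1 else 1.0
    return out

print(minmax_normalize([0.2, 0.5, 1.0]))        # ~[0.0, 0.375, 1.0]
print(percentile_normalize([0.2, 100.0, 0.5]))  # [0.0, 1.0, 0.5]
```

Note how the `100.0` outlier only earns the top rank under `"percentile"`, whereas under `"minmax"` it would crush all other scores toward 0.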
### Fusion rerankers

#### WeightedReranker (weighted fusion)

```python
from zvec_db.rerankers import WeightedReranker

# Smart default: COSINE -> /2, others -> bayes
reranker = WeightedReranker(
    topn=10,
    weights={"source1": 0.7, "source2": 0.3},
    normalize=True,  # Smart default
)
results = reranker.rerank({
    "source1": docs1,
    "source2": docs2,
})

# Per-source configuration
reranker = WeightedReranker(
    topn=10,
    weights={"sparse": 0.4, "dense": 0.6},
    normalize={"sparse": "bayes", "dense": "cosine"},
)

# No normalization
reranker = WeightedReranker(
    topn=10,
    normalize=None,
)
```
#### Metric auto-detection with schema

With zvec collections, pass `schema` to infer the metrics automatically:

```python
import zvec
from zvec_db.rerankers import WeightedReranker

collection = zvec.open("./my_collection")

# The reranker automatically detects:
# - SPARSE_VECTOR_FP32 -> metrics=None (BM25 scores)
# - VECTOR_FP32 COSINE -> metrics=MetricType.COSINE
reranker = WeightedReranker(
    topn=10,
    schema=collection.schema,  # Infers the metrics
    weights={"sparse": 0.4, "dense": 0.6},
    normalize=True,  # Default: sparse -> bayes, dense -> /2
)
```

Important: either the `metrics` or the `schema` parameter is required, to avoid conversion errors. For example:

```python
reranker = WeightedReranker(
    metrics={"sparse": None, "dense": MetricType.COSINE},
    normalize=True,
)
```
#### RrfReranker (Reciprocal Rank Fusion)

```python
from zvec_db.rerankers import RrfReranker

# Basic RRF (equal weights for all sources)
reranker = RrfReranker(topn=10, rank_constant=60)
results = reranker.rerank({"bm25": bm25_docs, "dense": dense_docs})

# Weighted RRF: favor dense over BM25
reranker = RrfReranker(
    topn=10,
    rank_constant=60,
    weights={"dense": 0.7, "bm25": 0.3},
)
results = reranker.rerank({"bm25": bm25_docs, "dense": dense_docs})
```
Note: `RrfReranker` accepts the `normalize` parameter for API consistency, but it has no effect, because RRF uses document ranks rather than raw scores. A warning is emitted if you set `normalize` to a non-None value.
### Ready-to-use rerankers

Avoid manual configuration with the defaults:

```python
from zvec_db.rerankers.defaults import (
    DefaultWeightedReranker,
    DefaultHybridReranker,
    DefaultRrfReranker,
)

# Weighted fusion with Bayesian normalization (default)
reranker = DefaultWeightedReranker()

# Optimized hybrid: dense (60%) + BM25 (40%)
reranker = DefaultHybridReranker()

# RRF with standard parameters
reranker = DefaultRrfReranker()

# All support the `normalize` parameter
reranker = DefaultWeightedReranker(normalize="minmax")
reranker = DefaultHybridReranker(normalize={"dense": "cosine", "bm25": "percentile"})
# Note: DefaultRrfReranker accepts `normalize` but it has no effect (warning emitted)

results = reranker.rerank({"bm25": bm25_docs, "dense": dense_docs})
```
### Cross-Encoder rerankers

Cross-encoders recompute scores from the query and the document together, and therefore require a `query` parameter at initialization.
#### SentenceTransformerReranker (local, binary)

```python
from zvec_db.rerankers import SentenceTransformerReranker

reranker = SentenceTransformerReranker(
    query="machine learning",
    model_name="cross-encoder/ms-marco-MiniLM-L-6-v2",
    topn=10,
)
results = reranker.rerank({"bm25": docs})
```
#### ClassificationReranker (local, multi-class)

```python
from zvec_db.rerankers import ClassificationReranker

reranker = ClassificationReranker(
    query="machine learning",
    model_name="your-classification-model",
    num_classes=5,  # Auto-inferred if not specified
    topn=10,
)
results = reranker.rerank({"bm25": docs})
```
#### OpenAIReranker (API)

```python
from zvec_db.rerankers import OpenAIReranker

reranker = OpenAIReranker(
    query="machine learning",
    base_url="http://localhost:9400/v1",
    model="BAAI/bge-reranker-v2-m3",
    endpoint="rerank",  # or "score"
    topn=10,
)
results = reranker.rerank({"bm25": docs})
```
### Diversification

#### SubmodularReranker (MMR)

Maximizes relevance while diversifying results.

```python
from zvec_db.rerankers import SubmodularReranker

reranker = SubmodularReranker(
    topn=10,
    lambda_param=0.7,  # 70% relevance, 30% diversity
    vector_field="embedding",
)
results = reranker.rerank({"source": docs_with_vectors})
```
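MMR (Maximal Marginal Relevance) greedily picks, at each step, the candidate maximizing `λ·relevance − (1−λ)·max_similarity_to_selected`. A minimal sketch over plain vectors, where `lam` plays the role of `lambda_param` (illustrative only; zvec-db's implementation may differ):

```python
import numpy as np

def mmr_select(relevance, vectors, topn, lam=0.7):
    """relevance: list of scores; vectors: (n, d) L2-normalized embeddings."""
    selected = []
    candidates = list(range(len(relevance)))
    while candidates and len(selected) < topn:
        def mmr_score(i):
            # Penalty: highest cosine similarity to anything already picked
            redundancy = max((float(vectors[i] @ vectors[j]) for j in selected),
                             default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Two near-duplicate relevant docs (0, 1) and one distinct doc (2):
vecs = np.array([[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]])
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
order = mmr_select([0.9, 0.88, 0.5], vecs, topn=3, lam=0.7)
print(order)  # [0, 2, 1]: the near-duplicate is demoted below the distinct doc
```

With `lam=1.0` this degenerates to plain relevance ranking; lower values trade relevance for diversity.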
## Preprocessing

Preprocessing improves sparse embedding quality.

### Automatic (recommended)

```python
from zvec_db.embedders import BM25Embedder
from zvec_db.preprocessing import NormalizationConfig

config = NormalizationConfig.aggressive(language="english")
bm25 = BM25Embedder(max_features=4096, preprocessing_config=config)
bm25.fit(documents)
# Preprocessing is automatically applied and saved
```
### Utility functions

```python
from zvec_db.preprocessing import normalize_text, stem_word, remove_stopwords

# Full pipeline
normalize_text("  HELLO WORLD  ", lowercase=True, remove_accents=True, stem=True)
# -> "hello world"

# Individual functions
stem_word("running", language="english")              # -> "run"
remove_stopwords("the cat eats", language="english")  # -> "cat eats"
```
These helpers require nltk:

```bash
pip install "zvec-db[preprocessing]"
```
## Model Persistence

```python
from zvec_db.embedders import BM25Embedder

# Save
bm25 = BM25Embedder(max_features=4096, preprocessing_config=config)
bm25.fit(documents)
bm25.save("models/bm25_model.joblib")

# Load
bm25_loaded = BM25Embedder()
bm25_loaded.load("models/bm25_model.joblib")

# Embeddings are identical (preprocessing included)
assert bm25.embed("query") == bm25_loaded.embed("query")
```
## Evaluation

```python
from zvec_db.evaluation import evaluate_ranking

# Evaluate ranking quality
metrics = evaluate_ranking(
    ground_truth=[["doc1", "doc2"], ["doc3"]],
    predictions=[["doc2", "doc1"], ["doc3", "doc4"]],
    metrics=["ndcg", "map", "mrr", "recall"],
)
```
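As a reference point for these metrics, MRR averages the reciprocal rank of the first relevant hit per query. A standalone sketch (not the library's implementation):

```python
def mean_reciprocal_rank(ground_truth, predictions):
    total = 0.0
    for relevant, ranked in zip(ground_truth, predictions):
        rr = 0.0
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                rr = 1.0 / rank  # only the first relevant hit counts
                break
        total += rr
    return total / len(ground_truth)

# Same data as above: the first relevant doc is at rank 1 for both queries
print(mean_reciprocal_rank(
    [["doc1", "doc2"], ["doc3"]],
    [["doc2", "doc1"], ["doc3", "doc4"]],
))  # 1.0
```

MRR only rewards the position of the first relevant result, which makes it a good fit for question-answering-style retrieval; ndcg and map also account for the rest of the ranking.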
## Development

```bash
# Clone
git clone https://github.com/ccdv-ai/zvec-db.git
cd zvec-db

# Install with all dependencies
make install

# Run tests
make test

# Lint (black, isort, flake8, mypy)
make lint

# Build docs
make docs
```
## License

MIT License