zvec-db
Utility suite for sparse/dense vectorization and document re-ranking, designed to work with zvec.
Table of Contents
- Installation
- Quick Start
- Sparse Embedders
- Dense Embedders
- Re-ranking
- Preprocessing
- Model Persistence
- Evaluation
- Development
- License
- Complete Example: Hybrid Search Pipeline
Installation
pip install zvec-db
Optional dependencies:
# For preprocessing (stemming, stopwords)
pip install "zvec-db[preprocessing]"
# For development
pip install "zvec-db[dev,test,docs]"
Quick Start
Hybrid search with zvec (recommended)
import zvec
from zvec_db.embedders import BM25Embedder, OpenAIEmbedder
from zvec_db.rerankers import NormalizedWeightedReRanker
# 1. Create embedders
bm25 = BM25Embedder(max_features=4096)
bm25.fit(documents)
dense = OpenAIEmbedder(base_url="http://localhost:9300/v1", model="embedding")
# 2. Create collection
schema = zvec.CollectionSchema(
    name="docs",
    vectors=[
        zvec.VectorSchema("sparse", zvec.DataType.SPARSE_FP32, dimension=4096),
        zvec.VectorSchema("dense", zvec.DataType.VECTOR_FP32, dimension=1024),
    ]
)
collection = zvec.create_and_open("./my_db", schema)

# 3. Insert
for i, doc in enumerate(documents):
    collection.insert(zvec.Doc(
        id=str(i),
        fields={"text": doc},
        vectors={
            "sparse": bm25.embed(doc),
            "dense": dense.embed(doc),
        }
    ))

# 4. Search with weighted fusion
# Note: metrics=None because we mix BM25 (arbitrary scores) and dense (COSINE distances)
results = collection.query(
    vectors=[
        zvec.VectorQuery(field_name="sparse", vector=bm25.embed(query)),
        zvec.VectorQuery(field_name="dense", vector=dense.embed(query)),
    ],
    topk=10,
    reranker=NormalizedWeightedReRanker(
        metrics=None,  # No automatic conversion (mixed metrics)
        weights={"sparse": 0.4, "dense": 0.6},
        normalizer_configs={"sparse": {"method": "bayes"}},
    ),
)
Sparse Embedders
All sparse embedders return dictionaries {index: score, ...} compatible with zvec's SPARSE_FP32 format.
BM25Embedder (recommended)
Standard BM25 scoring - best for general use cases.
from zvec_db.embedders import BM25Embedder
from zvec_db.preprocessing import NormalizationConfig
# With automatic preprocessing
config = NormalizationConfig.aggressive(language="french")
bm25 = BM25Embedder(
    max_features=4096,
    k1=1.2,   # Term frequency saturation (default: 1.2)
    b=0.75,   # Length normalization (default: 0.75)
    preprocessing_config=config,
)
bm25.fit(documents)
vector = bm25.embed("search query") # {index: score, ...}
Other sparse embedders
| Embedder | Use case |
|---|---|
| `TfidfEmbedder` | TF-IDF weighting with sublinear TF option |
| `CountEmbedder` | Simple term counts (binary option available) |
| `BM25LEmbedder` | Documents with variable lengths |
| `BM25PlusEmbedder` | Avoids zero scores with delta smoothing |
| `DisMaxEmbedder` | Multi-field search (takes maximum score) |
from zvec_db.embedders import TfidfEmbedder, CountEmbedder, DisMaxEmbedder
tfidf = TfidfEmbedder(max_features=4096, sublinear_tf=True)
count = CountEmbedder(max_features=4096, binary=True)
dismax = DisMaxEmbedder(tie_breaker=0.1)
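The `DisMaxEmbedder` row deserves a note: disjunction-max scoring takes the best per-field score and adds only a fraction of the rest, so a term matching several fields barely outscores one matching the best field alone. The embedder's internals aren't shown here; this standalone sketch (with a hypothetical `dismax_score` helper) illustrates the rule the `tie_breaker` parameter controls:

```python
def dismax_score(field_scores, tie_breaker=0.1):
    """Lucene-style disjunction-max: best field score plus
    tie_breaker times the sum of the remaining field scores."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    return best + tie_breaker * (sum(field_scores) - best)

# A term scoring 3.0 in "title" and 1.5 in "body":
dismax_score([3.0, 1.5], tie_breaker=0.1)  # 3.0 + 0.1 * 1.5 = 3.15
```

With `tie_breaker=0` only the best field counts; with `tie_breaker=1` scores simply sum across fields.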
Dense Embedders
OpenAIEmbedder (API / vLLM)
Works with OpenAI API or compatible endpoints (vLLM, local servers).
from zvec_db.embedders import OpenAIEmbedder
# OpenAI API
embedder = OpenAIEmbedder(model="text-embedding-3-small", api_key="sk-...")
# Local vLLM
embedder = OpenAIEmbedder(
    base_url="http://localhost:9300/v1",
    model="embedding",
    max_batch_size=32,
)
vector = embedder.embed("search query")
SentenceTransformersEmbedder (local)
Run embedding models locally using sentence-transformers.
from zvec_db.embedders import SentenceTransformersEmbedder
embedder = SentenceTransformersEmbedder(
    model_name="all-MiniLM-L6-v2",  # 384 dims, fast
    device="cpu",
    normalize=True,
)
vector = embedder.embed("search query")
Re-ranking
Understanding distance/similarity metrics
Problem: Vector databases store distances (smaller = more similar), but fusion algorithms assume similarities (larger = more relevant).
The metrics parameter handles conversion:
| Metric | Type | Range | Conversion | Usage |
|---|---|---|---|---|
| `COSINE` | Distance | [0, 2] | `1.0 - score/2.0` | Normalized embeddings (Qdrant, zvec) |
| `L2` | Distance | [0, ∞) | `1 - 2*atan(s)/π` | Euclidean distance |
| `IP` | Similarity | (-∞, ∞) | None | Inner product (already a similarity) |
| `None` | - | - | None | BM25 scores or scores already normalized to [0, 1] |
Default: metrics=MetricType.COSINE (main use case with zvec/Qdrant).
from zvec_db.rerankers import NormalizedWeightedReRanker, MetricType
# COSINE distances from zvec/Qdrant (default)
reranker = NormalizedWeightedReRanker(topn=10)
# BM25 scores (not distances!)
reranker = NormalizedWeightedReRanker(topn=10, metrics=None)
# Hybrid: BM25 + dense with per-source normalization
reranker = NormalizedWeightedReRanker(
    metrics=None,  # No global conversion
    weights={"sparse": 0.4, "dense": 0.6},
    normalizer_configs={
        "sparse": "bayes",  # BM25: handles outliers well
        "dense": True,      # Dense: standard normalization
    },
)
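The conversions in the metrics table are simple enough to sketch in plain Python. This is a standalone illustration of the formulas, not the library's implementation:

```python
import math

def to_similarity(score, metric):
    """Apply the metric conversions from the table above: distances
    become similarities where larger means more relevant."""
    if metric == "COSINE":  # distance in [0, 2] -> similarity in [0, 1]
        return 1.0 - score / 2.0
    if metric == "L2":      # distance in [0, inf) -> similarity in (0, 1]
        return 1.0 - 2.0 * math.atan(score) / math.pi
    return score            # IP or None: already oriented as a similarity

to_similarity(0.0, "COSINE")  # 1.0: zero distance, identical vectors
to_similarity(2.0, "COSINE")  # 0.0: maximal cosine distance
```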
Fusion rerankers
Normalizer configuration
The normalizer_configs parameter controls how scores are normalized per source:
| Value | Effect |
|---|---|
| `True` | Standard normalization (scales scores to [0, 1]) |
| `"bayes"`, `"bayesian"`, `"bb25"` | Bayesian sigmoid calibration (robust to outliers); all three are aliases for the same method |
| `{"method": "bayes", "alpha": 1.0}` | Dict with custom parameters (`alpha`, `beta`) |
| `None` | Skip normalization (use raw scores after metric conversion) |
Example:
normalizer_configs={
    "sparse": "bayes",  # Bayesian: handles BM25 outliers well
    "dense": None,      # Optional: COSINE conversion already yields scores in [0, 1]
}
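The library's exact calibration parameters aren't documented here, but the two families can be sketched: standard normalization is a linear min-max rescale, while sigmoid calibration squashes scores through a logistic curve centered on a reference point. This standalone sketch uses hypothetical helper names and assumes a mean-centered logistic form:

```python
import math

def minmax_normalize(scores):
    """Standard normalization: linear rescale to [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def sigmoid_calibrate(scores, alpha=1.0, beta=None):
    """Sigmoid calibration: logistic squash centered on beta
    (defaulting to the mean score), controlled by slope alpha."""
    if beta is None:
        beta = sum(scores) / len(scores)
    return [1.0 / (1.0 + math.exp(-alpha * (s - beta))) for s in scores]

minmax_normalize([8.7, 12.3, 15.5])  # [0.0, 0.529..., 1.0]
```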
NormalizedWeightedReRanker (weighted fusion)
from zvec_db.rerankers import NormalizedWeightedReRanker
reranker = NormalizedWeightedReRanker(
    topn=10,
    weights={"source1": 0.7, "source2": 0.3},
    normalizer_configs={"source1": "bayes", "source2": True},
)
results = collection.query(vectors=[...], topk=20, reranker=reranker)
Using schema parameter (auto-detect metrics from collection)
When working with zvec collections, you can use the schema parameter to automatically infer the correct metrics for each vector field:
import zvec
from zvec_db.rerankers import NormalizedWeightedReRanker
# Open existing collection
collection = zvec.open("./my_collection")
# Reranker auto-infers metrics from schema
# - SPARSE_FP32 fields -> metrics=None (BM25 scores)
# - VECTOR_FP32 fields with COSINE -> metrics=MetricType.COSINE
reranker = NormalizedWeightedReRanker(
    topn=10,
    metrics=None,  # Will infer from schema
    schema=collection.schema,
    weights={"sparse": 0.4, "dense": 0.6},
)

# No need to manually specify metrics per source!
results = collection.query(
    vectors=[
        zvec.VectorQuery(field_name="sparse", vector=bm25.embed(query)),
        zvec.VectorQuery(field_name="dense", vector=dense.embed(query)),
    ],
    topk=20,
    reranker=reranker,
)
Manual per-source metrics (alternative):
from zvec_db.rerankers import NormalizedWeightedReRanker, MetricType
# Explicit per-source metrics
reranker = NormalizedWeightedReRanker(
    topn=10,
    metrics={
        "sparse": None,              # BM25 scores (not distances)
        "dense": MetricType.COSINE,  # Convert COSINE distance [0, 2] -> similarity
    },
    weights={"sparse": 0.4, "dense": 0.6},
)
NormalizedRrfReRanker (Reciprocal Rank Fusion)
from zvec_db.rerankers import NormalizedRrfReRanker
reranker = NormalizedRrfReRanker(topn=10, rank_constant=60)
results = reranker.rerank({"bm25": bm25_docs, "dense": dense_docs})
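RRF sidesteps score normalization entirely: each source contributes `1 / (rank_constant + rank)` for every document it returns, so only rank order matters and raw score scales never need reconciling. A standalone sketch of the formula (not the library's code):

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, rank_constant=60):
    """Reciprocal Rank Fusion: sum 1 / (rank_constant + rank)
    over every source that returned the document."""
    fused = defaultdict(float)
    for source, doc_ids in ranked_lists.items():
        for rank, doc_id in enumerate(doc_ids, start=1):
            fused[doc_id] += 1.0 / (rank_constant + rank)
    return sorted(fused, key=fused.get, reverse=True)

# "doc2" places well in both lists, so it tops the fusion:
rrf_fuse({
    "bm25": ["doc1", "doc2", "doc3"],
    "dense": ["doc2", "doc3", "doc1"],
})  # ["doc2", "doc1", "doc3"]
```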
WeightedReRanker (scores already normalized)
Use when scores are already in [0, 1] with "higher=better" orientation.
from zvec_db.rerankers import WeightedReRanker
reranker = WeightedReRanker(
    topn=10,
    weights={"source1": 0.7, "source2": 0.3},
)
Default rerankers (ready-to-use)
from zvec_db.rerankers.defaults import (
    DefaultWeightedReranker,
    DefaultHybridReranker,
    DefaultRrfReranker,
)
# Weighted fusion with Bayesian normalization
reranker = DefaultWeightedReranker()
# Optimized hybrid: dense (60%) + BM25 (40%)
reranker = DefaultHybridReranker()
# RRF with standard parameters
reranker = DefaultRrfReranker()
results = reranker.rerank({"bm25": bm25_docs, "dense": dense_docs})
Cross-Encoder rerankers
All cross-encoders require a query parameter at initialization.
SentenceTransformerReranker (local, binary)
from zvec_db.rerankers import SentenceTransformerReranker
reranker = SentenceTransformerReranker(
    query="machine learning",
    model_name="cross-encoder/ms-marco-MiniLM-L-6-v2",
    topn=10,
)
results = reranker.rerank({"bm25": docs})
ClassificationReranker (local, multi-class)
from zvec_db.rerankers import ClassificationReranker
reranker = ClassificationReranker(
    query="machine learning",
    model_name="your-multi-class-model",
    num_classes=5,  # Auto-inferred if not specified
    topn=10,
)
results = reranker.rerank({"bm25": docs})
OpenAIReranker (API)
from zvec_db.rerankers import OpenAIReranker
reranker = OpenAIReranker(
    query="machine learning",
    base_url="http://localhost:9400/v1",
    model="BAAI/bge-reranker-v2-m3",
    endpoint="rerank",  # or "score"
    topn=10,
)
results = reranker.rerank({"bm25": docs})
Diversification
SubmodularReranker (MMR)
Maximize relevance while diversifying results.
from zvec_db.rerankers import SubmodularReranker
reranker = SubmodularReranker(
    topn=10,
    lambda_param=0.7,  # 70% relevance, 30% diversity
    vector_field="embedding",
)
results = reranker.rerank({"source": docs_with_vectors})
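The trade-off that `lambda_param` controls follows the classic MMR greedy rule: at each step, pick the candidate maximizing `λ · relevance − (1 − λ) · max-similarity-to-already-selected`. A standalone sketch with hypothetical relevance and similarity inputs (not the `SubmodularReranker` API):

```python
def mmr_select(candidates, relevance, similarity, topn=10, lambda_param=0.7):
    """Greedy Maximal Marginal Relevance selection."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < topn:
        def mmr_score(c):
            # Redundancy = similarity to the closest already-selected item
            redundancy = max((similarity(c, s) for s in selected), default=0.0)
            return lambda_param * relevance[c] - (1 - lambda_param) * redundancy
        best = max(pool, key=mmr_score)
        selected.append(best)
        pool.remove(best)
    return selected

# "b" is a near-duplicate of "a", so the less relevant but novel "c" wins slot 2:
rel = {"a": 1.0, "b": 0.9, "c": 0.5}
sim = lambda x, y: 1.0 if {x, y} == {"a", "b"} else 0.0
mmr_select(["a", "b", "c"], rel, sim, topn=3)  # ["a", "c", "b"]
```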
Preprocessing
Preprocessing improves sparse embedding quality.
Automatic (recommended)
from zvec_db.embedders import BM25Embedder
from zvec_db.preprocessing import NormalizationConfig
config = NormalizationConfig.aggressive(language="french")
bm25 = BM25Embedder(max_features=4096, preprocessing_config=config)
bm25.fit(documents)
# Preprocessing is automatically applied and saved with the model
Utility functions
from zvec_db.preprocessing import normalize_text, stem_word, remove_stopwords
# Full pipeline
normalize_text(" CHAT MANGEAIT ", lowercase=True, remove_accents=True, stem=True) # "chat mang"
# Individual functions
stem_word("mangeaient", language="french") # "mang"
remove_stopwords("le chat mange", language="french") # "chat mange"
NLTK installation (required for stemming and stopwords):
pip install "zvec-db[preprocessing]"
Model Persistence
from zvec_db.embedders import BM25Embedder
# Save
bm25 = BM25Embedder(max_features=4096, preprocessing_config=config)
bm25.fit(documents)
bm25.save("models/bm25_model.joblib")
# Load
bm25_loaded = BM25Embedder()
bm25_loaded.load("models/bm25_model.joblib")
# Embeddings are identical (preprocessing included)
assert bm25.embed("query") == bm25_loaded.embed("query")
Evaluation
from zvec_db.evaluation import evaluate_ranking
# Evaluate ranking quality
metrics = evaluate_ranking(
    ground_truth=[["doc1", "doc2"], ["doc3"]],
    predictions=[["doc2", "doc1"], ["doc3", "doc4"]],
    metrics=["ndcg", "map", "mrr", "recall"],
)
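`evaluate_ranking`'s internals aren't shown here, but the metrics themselves are standard. MRR, for example, averages the reciprocal rank of the first relevant hit per query; a standalone sketch:

```python
def mean_reciprocal_rank(ground_truth, predictions):
    """MRR: mean over queries of 1 / rank of the first relevant document
    (0 for queries where no relevant document is retrieved)."""
    total = 0.0
    for relevant, ranked in zip(ground_truth, predictions):
        total += next(
            (1.0 / rank for rank, d in enumerate(ranked, start=1) if d in relevant),
            0.0,
        )
    return total / len(ground_truth)

# Same data as above: both queries hit a relevant doc at rank 1
mean_reciprocal_rank(
    ground_truth=[["doc1", "doc2"], ["doc3"]],
    predictions=[["doc2", "doc1"], ["doc3", "doc4"]],
)  # 1.0
```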
Development
# Clone
git clone https://github.com/ccdv-ai/zvec-db.git
cd zvec-db
# Install with all dependencies
make install
# Run tests
make test
# Lint
make lint
# Build docs
make docs
License
MIT License
Complete Example: Hybrid Search Pipeline
This section demonstrates a complete hybrid search pipeline with BM25 + dense embeddings and re-ranking.
Setup
import zvec
from zvec.model.doc import Doc
from zvec_db.embedders import BM25Embedder, SentenceTransformersEmbedder
from zvec_db.rerankers import NormalizedWeightedReRanker, DefaultHybridReranker
# Sample documents
documents = [
    "Machine learning is a subset of artificial intelligence",
    "Deep learning uses neural networks with many layers",
    "Natural language processing enables computers to understand text",
    "Computer vision allows machines to interpret images",
    "Reinforcement learning trains agents through rewards",
]
# Initialize embedders
bm25 = BM25Embedder(max_features=4096, k1=1.2, b=0.75)
bm25.fit(documents)
dense = SentenceTransformersEmbedder(
    model_name="all-MiniLM-L6-v2",
    device="cpu",
    normalize=True,
)
Create and populate collection
# Create zvec collection
schema = zvec.CollectionSchema(
    name="docs",
    vectors=[
        zvec.VectorSchema("sparse", zvec.DataType.SPARSE_FP32, dimension=4096),
        zvec.VectorSchema("dense", zvec.DataType.VECTOR_FP32, dimension=384),
    ]
)
collection = zvec.create_and_open("./my_db", schema)

# Index documents
for i, doc in enumerate(documents):
    collection.insert(zvec.Doc(
        id=str(i),
        fields={"text": doc},
        vectors={
            "sparse": bm25.embed(doc),
            "dense": dense.embed(doc),
        }
    ))
Hybrid search with re-ranking
query = "neural networks and deep learning"
# Method 1: Using collection.query with built-in reranker
results = collection.query(
    vectors=[
        zvec.VectorQuery(field_name="sparse", vector=bm25.embed(query)),
        zvec.VectorQuery(field_name="dense", vector=dense.embed(query)),
    ],
    topk=20,
    reranker=DefaultHybridReranker(
        weights={"sparse": 0.4, "dense": 0.6},
    ),
)

print("Top results:")
for i, doc in enumerate(results[:5]):
    print(f"  {i+1}. {doc.fields['text']} (score: {doc.score:.4f})")
Manual hybrid search (more control)
from zvec.model.doc import Doc
# 1. Separate searches
sparse_results = collection.search(
    vector_name="sparse",
    vector=bm25.embed(query),
    topk=20,
)
dense_results = collection.search(
    vector_name="dense",
    vector=dense.embed(query),
    topk=20,
)
# 2. Re-rank with schema-based auto-detection
reranker = NormalizedWeightedReRanker(
    topn=10,
    metrics=None,  # Infer from schema
    schema=collection.schema,
    weights={"sparse": 0.4, "dense": 0.6},
    normalizer_configs={
        "sparse": "bayes",  # Robust to BM25 outliers
        "dense": None,      # Optional: COSINE conversion already yields scores in [0, 1]
    },
)

# 3. Combine and re-rank
final_results = reranker.rerank({
    "sparse": sparse_results,
    "dense": dense_results,
})

print("\nFinal re-ranked results:")
for i, doc in enumerate(final_results[:5]):
    print(f"  {i+1}. {doc.fields['text']} (score: {doc.score:.4f})")
Standalone re-ranking (no zvec collection)
# If you're not using zvec, you can still use the rerankers standalone
from zvec.model.doc import Doc
from zvec_db.rerankers import NormalizedWeightedReRanker, MetricType

# Mock search results from different sources
bm25_results = [
    Doc(id="doc1", score=15.5, fields={"text": "Machine learning..."}),
    Doc(id="doc2", score=12.3, fields={"text": "Deep neural..."}),
    Doc(id="doc3", score=8.7, fields={"text": "AI systems..."}),
]
dense_results = [
    Doc(id="doc2", score=0.92, fields={"text": "Deep neural..."}),
    Doc(id="doc1", score=0.75, fields={"text": "Machine learning..."}),
    Doc(id="doc4", score=0.68, fields={"text": "Data science..."}),
]

# Re-rank with explicit metrics
reranker = NormalizedWeightedReRanker(
    topn=10,
    metrics={
        "bm25": None,                # BM25 scores
        "dense": MetricType.COSINE,  # COSINE distances [0, 2]
    },
    weights={"bm25": 0.4, "dense": 0.6},
)
final_results = reranker.rerank({
    "bm25": bm25_results,
    "dense": dense_results,
})