# zvec-db

Sparse/dense vectorization and document reranking toolkit for zvec.
## Quick Start (5 minutes)

### 1. Install

```bash
pip install zvec-db
```
### 2. Index documents

```python
from zvec_db.embedders import BM25Embedder, SentenceTransformersEmbedder

documents = [
    "Machine learning is a subset of AI",
    "Deep learning uses neural networks",
    "NLP helps computers understand text",
]

# Create embedders
bm25 = BM25Embedder(max_features=4096)
bm25.fit(documents)
dense = SentenceTransformersEmbedder(model_name="all-MiniLM-L6-v2")

# Encode documents
for doc in documents:
    sparse_vec = bm25.embed(doc)  # dict: {index: score}
    dense_vec = dense.embed(doc)  # numpy array
```
### 3. Search with hybrid + reranking

```python
from zvec.model.doc import Doc
from zvec_db.rerankers import WeightedReranker
from zvec.typing import MetricType

query = "neural networks"

# Option 1: With explicit metrics
reranker = WeightedReranker(
    weights={"dense": 0.6, "bm25": 0.4},
    metrics={"dense": MetricType.COSINE, "bm25": None},  # None = IP (no conversion)
    normalize=True,  # Smart default: COSINE -> /2, others -> bayes
)

# Option 2: With schema auto-detection (recommended with zvec)
# import zvec
# collection = zvec.open("./my_collection")
# reranker = WeightedReranker(
#     schema=collection.schema,  # Auto-detect metrics from schema
#     weights={"dense": 0.6, "bm25": 0.4},
#     normalize=True,
# )

results = reranker.rerank({
    "dense": [Doc(id="0", score=0.8), Doc(id="1", score=0.9)],
    "bm25": [Doc(id="1", score=0.7), Doc(id="2", score=0.6)],
})
print(results[0].id)  # Most relevant document
```
## Table of Contents
- Key Concepts
- Advanced Example
- Reranking
- Cross-Encoder Rerankers
- Preprocessing
- Model Persistence
- Installation
- Development
## Key Concepts

### Distance vs Similarity Metrics

Vector databases store distances (smaller = more similar), but fusion algorithms assume similarities (larger = more relevant). The `metrics` parameter handles the conversion automatically:
| Metric | Type | Range | Conversion | Usage |
|---|---|---|---|---|
| COSINE | Distance | [0, 2] | `(2 - score) / 2` | Normalized embeddings (Qdrant, zvec) |
| L2 | Distance | [0, ∞) | `-score` | Euclidean distance |
| IP | Similarity | (-∞, ∞) | None | Inner product, BM25 scores (already similarities) |

Default: `MetricType.COSINE` (the main use case with zvec/Qdrant).
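The conversions in the table can be sketched as plain functions. These are illustrative only; zvec-db applies the equivalent conversion internally based on the `metrics` parameter.

```python
def cosine_to_similarity(distance: float) -> float:
    """Map a cosine distance in [0, 2] to a similarity in [0, 1]."""
    return (2 - distance) / 2

def l2_to_similarity(distance: float) -> float:
    """Negate an L2 distance so that larger means more similar."""
    return -distance

print(cosine_to_similarity(0.0))  # identical vectors -> 1.0
print(cosine_to_similarity(2.0))  # opposite vectors -> 0.0
```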
### Choosing a Sparse Embedder

| Embedder | Use case |
|---|---|
| `BM25Embedder` | **Recommended**: standard lexical search |
| `TfidfEmbedder` | TF-IDF weighting with sublinear TF option |
| `CountEmbedder` | Simple term counts (binary option available) |
| `BM25LEmbedder` | Documents with highly variable lengths |
| `BM25PlusEmbedder` | Avoids zero scores with delta smoothing |
| `DisMaxEmbedder` | Multi-field search (takes the maximum score) |
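For intuition on what a BM25-style embedder weights, here is a sketch of the textbook Okapi BM25 term weight. The formula and the `k1`/`b` defaults are the standard ones from the literature, not taken from zvec-db's internals, which may differ.

```python
import math

def bm25_weight(tf, df, n_docs, doc_len, avg_doc_len, k1=1.5, b=0.75):
    """Okapi BM25 weight for one (term, document) pair.

    tf: term frequency in the document; df: number of documents
    containing the term; doc_len/avg_doc_len drive length normalization.
    """
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm

# A term appearing twice in an average-length doc, in 1 of 3 docs:
print(bm25_weight(tf=2, df=1, n_docs=3, doc_len=10, avg_doc_len=10))
```

Rarer terms (lower `df`) and higher term frequencies score higher, with diminishing returns controlled by `k1` and length normalization by `b`.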
## Advanced Example: Hybrid Search with zvec

```python
import zvec
from zvec.model.doc import Doc
from zvec_db.embedders import BM25Embedder, SentenceTransformersEmbedder
from zvec_db.rerankers import WeightedReranker

documents = [
    "Machine learning is a subset of AI",
    "Deep learning uses neural networks",
    "NLP helps computers understand text",
]

# Create embedders
bm25 = BM25Embedder(max_features=4096)
bm25.fit(documents)
dense = SentenceTransformersEmbedder(model_name="all-MiniLM-L6-v2")

# Create collection
schema = zvec.CollectionSchema(
    name="docs",
    fields=[zvec.FieldSchema("text", zvec.DataType.STRING)],
    vectors=[
        zvec.VectorSchema(name="sparse", data_type=zvec.DataType.SPARSE_VECTOR_FP32, dimension=4096),
        zvec.VectorSchema(name="dense", data_type=zvec.DataType.VECTOR_FP32, dimension=384,
                          index_param=zvec.FlatIndexParam(metric_type=zvec.MetricType.COSINE)),
    ],
)
collection = zvec.create_and_open("./my_db", schema)

# Index documents
for i, doc in enumerate(documents):
    collection.insert(Doc(
        id=str(i),
        fields={"text": doc},
        vectors={
            "sparse": bm25.embed(doc),
            "dense": dense.embed(doc),
        },
    ))

# Search with auto-detected metrics from schema
reranker = WeightedReranker(
    topn=3,
    schema=collection.schema,  # Auto-detect metrics: sparse -> None, dense -> COSINE
    weights={"sparse": 0.4, "dense": 0.6},
    normalize=True,  # Smart default: sparse -> bayes, dense -> /2
)
query = "neural networks"
results = collection.query(
    vectors=[
        zvec.VectorQuery(field_name="sparse", vector=bm25.embed(query)),
        zvec.VectorQuery(field_name="dense", vector=dense.embed(query)),
    ],
    topk=10,
    reranker=reranker,
)

print("Top results:")
for i, doc in enumerate(results[:3]):
    print(f"  {i+1}. {doc.fields['text']} (score: {doc.score:.4f})")
```
## Reranking

### Normalization

The `normalize` parameter controls score normalization:
| Value | Effect |
|---|---|
| `True` | Smart default: COSINE → no-op, others → `"bayes"` |
| `"bayes"` | Bayesian sigmoid calibration (robust to outliers) |
| `"minmax"` | Min-max: `(x - min) / (max - min)` |
| `"rank"` / `"percentile"` | Rank-based (very robust to outliers) |
| `"cosine"` | No-op (identity); COSINE scores are already in [0, 1] |
| `{"sparse": "bayes", "dense": "cosine"}` | Per-source configuration |
| `None` / `False` | No normalization |
Note: `normalize=True` requires `schema` or `metrics` so the metric can be auto-detected per source.

COSINE scores are already normalized to [0, 1] by the conversion formula `(2 - score) / 2`, so `normalize="cosine"` is a no-op (identity). Use it when you want to state explicitly that no additional normalization is applied.
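Two of the simpler modes can be sketched in plain Python for intuition. These are illustrative implementations; zvec-db applies its own internally, and the exact `"bayes"` calibration is library-specific and not reproduced here.

```python
def minmax(scores):
    """Min-max: map scores linearly onto [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def rank_normalize(scores):
    """Rank-based: position in sorted order, ignoring score magnitudes."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    for pos, i in enumerate(order):
        ranks[i] = pos / (len(scores) - 1)
    return ranks

print(minmax([2.0, 8.0, 5.0]))          # [0.0, 1.0, 0.5]
print(rank_normalize([2.0, 8.0, 5.0]))  # [0.0, 1.0, 0.5]
```

The two modes agree on evenly spaced scores but diverge on skewed ones: an outlier compresses everything else under min-max, while rank normalization ignores it entirely.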
### WeightedReranker

Weighted fusion of multiple sources:

```python
from zvec_db.rerankers import WeightedReranker
from zvec.typing import MetricType

# With explicit metrics
reranker = WeightedReranker(
    weights={"bm25": 0.4, "dense": 0.6},
    metrics={"bm25": MetricType.IP, "dense": MetricType.COSINE},
    normalize="bayes",
)

# With schema auto-detection (recommended)
import zvec
collection = zvec.open("./my_collection")
reranker = WeightedReranker(
    schema=collection.schema,
    weights={"sparse": 0.4, "dense": 0.6},
    normalize=True,
)

results = reranker.rerank({"bm25": bm25_docs, "dense": dense_docs})
```
### RrfReranker (Reciprocal Rank Fusion)

Rank-based fusion (robust to score scale differences):

```python
from zvec_db.rerankers import RrfReranker
import zvec

# With schema auto-detection (recommended)
collection = zvec.open("./my_collection")
reranker = RrfReranker(
    topn=10,
    rank_constant=60,
    schema=collection.schema,  # Auto-detect metrics
)
results = reranker.rerank({"bm25": bm25_docs, "dense": dense_docs})

# With custom weights
reranker = RrfReranker(
    topn=10,
    rank_constant=60,
    weights={"dense": 0.7, "bm25": 0.3},
    schema=collection.schema,
)
```

Note: RRF uses ranks, not scores, so the `normalize` parameter has no effect.
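The fusion rule behind RRF is simple enough to sketch: each document scores the sum over sources of `w_s / (rank_constant + rank)`, with 1-based ranks. The `rank_constant` and `weights` names mirror the `RrfReranker` parameters above, but this is a textbook sketch, not zvec-db's implementation.

```python
def rrf_fuse(ranked_lists, rank_constant=60, weights=None):
    """Fuse per-source ranked lists of doc ids via weighted RRF.

    ranked_lists: {source_name: [doc_id, ...] in rank order}.
    Docs absent from a source simply contribute nothing for it.
    """
    weights = weights or {s: 1.0 for s in ranked_lists}
    fused = {}
    for source, doc_ids in ranked_lists.items():
        for rank, doc_id in enumerate(doc_ids, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + weights[source] / (rank_constant + rank)
    return sorted(fused, key=fused.get, reverse=True)

# Doc "1" appears in both lists, so it wins despite never ranking first:
print(rrf_fuse({"bm25": ["1", "2"], "dense": ["0", "1"]}))  # ['1', '0', '2']
```

Because only ranks enter the formula, a BM25 score of 12.7 and a cosine similarity of 0.93 fuse cleanly without any normalization, which is why `normalize` is a no-op for RRF.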
## Cross-Encoder Rerankers

Cross-encoders recompute scores from the query and each document together, so they require a `query` parameter.

```python
from zvec_db.rerankers import SentenceTransformerReranker

reranker = SentenceTransformerReranker(
    query="machine learning",
    model_name="cross-encoder/ms-marco-MiniLM-L-6-v2",
    topn=10,
)
results = reranker.rerank({"bm25": docs})
```

Other cross-encoders: `ClassificationReranker` (multi-class), `OpenAIReranker` (API-based).
## Preprocessing

Preprocessing improves sparse embedding quality:

```python
from zvec_db.embedders import BM25Embedder
from zvec_db.preprocessing import NormalizationConfig

config = NormalizationConfig.aggressive(language="english")
bm25 = BM25Embedder(max_features=4096, preprocessing_config=config)
bm25.fit(documents)

# Utility functions
from zvec_db.preprocessing import normalize_text, stem_word, remove_stopwords

normalize_text("  HELLO WORLD  ", lowercase=True, stem=True)  # "hello world"
```

Install: `pip install "zvec-db[preprocessing]"`
## Model Persistence

```python
from zvec_db.embedders import BM25Embedder

# Save
bm25 = BM25Embedder(max_features=4096, preprocessing_config=config)
bm25.fit(documents)
bm25.save("models/bm25_model.joblib")

# Load
bm25_loaded = BM25Embedder()
bm25_loaded.load("models/bm25_model.joblib")
```
## Installation

```bash
# Basic
pip install zvec-db

# With preprocessing (French/German/etc.)
pip install "zvec-db[preprocessing]"

# Development
pip install "zvec-db[dev,test,docs]"
```
## Development

```bash
git clone https://github.com/ccdv-ai/zvec-db.git
cd zvec-db
make install  # Install with all dependencies
make test     # Run tests
make lint     # black, isort, flake8, mypy
make docs     # Build Sphinx docs
```
## License

MIT License