Utility suite for sparse vectorization and document re-ranking


zvec-db


Utility suite for sparse/dense vectorization and document re-ranking, designed to work with zvec.

Installation

pip install zvec-db

Optional dependencies:

# For preprocessing (stemming, stopwords)
pip install "zvec-db[preprocessing]"

# For development
pip install "zvec-db[dev,test,docs]"

Quick Start

Hybrid search with zvec (recommended)

import zvec
from zvec_db.embedders import BM25Embedder, OpenAIEmbedder
from zvec_db.rerankers import NormalizedWeightedReRanker

# 1. Create embedders
bm25 = BM25Embedder(max_features=4096)
bm25.fit(documents)

dense = OpenAIEmbedder(base_url="http://localhost:9300/v1", model="embedding")

# 2. Create collection
schema = zvec.CollectionSchema(
    name="docs",
    vectors=[
        zvec.VectorSchema("sparse", zvec.DataType.SPARSE_FP32, dimension=4096),
        zvec.VectorSchema("dense", zvec.DataType.VECTOR_FP32, dimension=1024),
    ]
)
collection = zvec.create_and_open("./my_db", schema)

# 3. Insert
for i, doc in enumerate(documents):
    collection.insert(zvec.Doc(
        id=str(i),
        fields={"text": doc},
        vectors={
            "sparse": bm25.embed(doc),
            "dense": dense.embed(doc),
        }
    ))

# 4. Search with weighted fusion
# Note: metrics=None because we mix BM25 (arbitrary scores) and dense (COSINE distances)
results = collection.query(
    vectors=[
        zvec.VectorQuery(field_name="sparse", vector=bm25.embed(query)),
        zvec.VectorQuery(field_name="dense", vector=dense.embed(query)),
    ],
    topk=10,
    reranker=NormalizedWeightedReRanker(
        metrics=None,  # No automatic conversion (mixed metrics)
        weights={"sparse": 0.4, "dense": 0.6},
        normalizer_configs={"sparse": {"method": "bayes"}},
    ),
)

Sparse Embedders

All sparse embedders return dictionaries {index: score, ...} compatible with zvec's SPARSE_FP32 format.

BM25Embedder (recommended)

Standard BM25 scoring - best for general use cases.

from zvec_db.embedders import BM25Embedder
from zvec_db.preprocessing import NormalizationConfig

# With automatic preprocessing
config = NormalizationConfig.aggressive(language="french")
bm25 = BM25Embedder(
    max_features=4096,
    k1=1.2,       # Term frequency saturation (default: 1.2)
    b=0.75,       # Length normalization (default: 0.75)
    preprocessing_config=config
)
bm25.fit(documents)

vector = bm25.embed("search query")  # {index: score, ...}

Other sparse embedders

| Embedder | Use case |
|---|---|
| TfidfEmbedder | TF-IDF weighting with sublinear TF option |
| CountEmbedder | Simple term counts (binary option available) |
| BM25LEmbedder | Documents with variable lengths |
| BM25PlusEmbedder | Avoids zero scores via delta smoothing |
| DisMaxEmbedder | Multi-field search (takes the maximum score) |

from zvec_db.embedders import TfidfEmbedder, CountEmbedder, DisMaxEmbedder

tfidf = TfidfEmbedder(max_features=4096, sublinear_tf=True)
count = CountEmbedder(max_features=4096, binary=True)
dismax = DisMaxEmbedder(tie_breaker=0.1)
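The DisMaxEmbedder row above refers to Lucene-style disjunction-max scoring: the best-matching field dominates, and tie_breaker lets the remaining fields contribute a little. A minimal sketch of that formula (illustrative only, not zvec_db's internal code):

```python
def dismax_score(field_scores, tie_breaker=0.1):
    """Disjunction-max: the dominant field wins; the other fields
    contribute only through the tie_breaker factor."""
    best = max(field_scores)
    return best + tie_breaker * (sum(field_scores) - best)

# Three fields scoring [3.0, 1.0, 0.5]: 3.0 + 0.1 * (1.0 + 0.5) = 3.15
score = dismax_score([3.0, 1.0, 0.5])
```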

Dense Embedders

OpenAIEmbedder (API / vLLM)

Works with OpenAI API or compatible endpoints (vLLM, local servers).

from zvec_db.embedders import OpenAIEmbedder

# OpenAI API
embedder = OpenAIEmbedder(model="text-embedding-3-small", api_key="sk-...")

# Local vLLM
embedder = OpenAIEmbedder(
    base_url="http://localhost:9300/v1",
    model="embedding",
    max_batch_size=32,
)
vector = embedder.embed("search query")

SentenceTransformersEmbedder (local)

Run embedding models locally using sentence-transformers.

from zvec_db.embedders import SentenceTransformersEmbedder

embedder = SentenceTransformersEmbedder(
    model_name="all-MiniLM-L6-v2",  # 384 dims, fast
    device="cpu",
    normalize=True,
)
vector = embedder.embed("search query")

Re-ranking

Understanding distance/similarity metrics

Problem: Vector databases store distances (smaller = more similar), but fusion algorithms assume similarities (larger = more relevant).

The metrics parameter handles conversion:

| Metric | Type | Range | Conversion | Usage |
|---|---|---|---|---|
| COSINE | Distance | [0, 2] | 1.0 - score/2.0 | Normalized embeddings (Qdrant, zvec) |
| L2 | Distance | [0, ∞) | 1 - 2*atan(s)/π | Euclidean distance |
| IP | Similarity | (-∞, ∞) | None | Inner product (already a similarity) |
| None | - | - | None | BM25 scores, or scores already normalized to [0, 1] |

Default: metrics=MetricType.COSINE (main use case with zvec/Qdrant).
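The conversions in the table can be written out as plain functions. This sketch uses string metric names for brevity; the library itself works with the MetricType enum:

```python
import math

def to_similarity(score, metric):
    """Convert a raw search score to a 'higher = better' similarity,
    following the conversion table above."""
    if metric == "COSINE":   # distance in [0, 2]
        return 1.0 - score / 2.0
    if metric == "L2":       # distance in [0, inf)
        return 1.0 - 2.0 * math.atan(score) / math.pi
    return score             # IP or None: already a similarity

print(to_similarity(0.0, "COSINE"))  # identical vectors -> 1.0
```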

from zvec_db.rerankers import NormalizedWeightedReRanker, MetricType

# COSINE distances from zvec/Qdrant (default)
reranker = NormalizedWeightedReRanker(topn=10)

# BM25 scores (not distances!)
reranker = NormalizedWeightedReRanker(topn=10, metrics=None)

# Hybrid: BM25 + dense with per-source normalization
reranker = NormalizedWeightedReRanker(
    metrics=None,  # No global conversion
    weights={"sparse": 0.4, "dense": 0.6},
    normalizer_configs={
        "sparse": "bayes",  # BM25: handles outliers well
        "dense": True,      # Dense: standard normalization
    },
)

Fusion rerankers

Normalizer configuration

The normalizer_configs parameter controls how scores are normalized per source:

| Value | Effect |
|---|---|
| True | Standard normalization (scales scores to [0, 1]) |
| "bayes", "bayesian", "bb25" | Bayesian sigmoid calibration (robust to outliers); all three are aliases for the same method |
| {"method": "bayes", "alpha": 1.0} | Dict with custom parameters (alpha, beta) |
| None | Skip normalization (use raw scores after metric conversion) |

Example:

normalizer_configs={
    "sparse": "bayes",  # Bayesian: handles BM25 outliers well
    "dense": None,      # Optional: COSINE similarities are already in [0, 1] after conversion
}
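For intuition, here is roughly what the two normalization modes do. The min-max rescaling matches the `True` option; the sigmoid function only illustrates the calibration idea behind `"bayes"` (the library's exact formula and its alpha/beta defaults may differ):

```python
import math

def minmax_normalize(scores):
    """Standard normalization: rescale to [0, 1] (the `True` option)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def sigmoid_calibrate(scores, alpha=1.0, beta=None):
    """Sigmoid-style calibration in the spirit of the "bayes" option:
    squashing scores around a reference point dampens outliers.
    alpha/beta are illustrative, not the library's actual defaults."""
    if beta is None:
        beta = sum(scores) / len(scores)  # center on the mean
    return [1.0 / (1.0 + math.exp(-alpha * (s - beta))) for s in scores]

bm25_scores = [15.5, 12.3, 8.7, 120.0]  # note the outlier
normalized = minmax_normalize(bm25_scores)
```

With min-max, the outlier compresses all other scores toward 0; the sigmoid keeps them spread out, which is why the Bayesian option is recommended for BM25.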

NormalizedWeightedReRanker (weighted fusion)

from zvec_db.rerankers import NormalizedWeightedReRanker

reranker = NormalizedWeightedReRanker(
    topn=10,
    weights={"source1": 0.7, "source2": 0.3},
    normalizer_configs={"source1": "bayes", "source2": True},
)

results = collection.query(vectors=[...], topk=20, reranker=reranker)

Using schema parameter (auto-detect metrics from collection)

When working with zvec collections, you can use the schema parameter to automatically infer the correct metrics for each vector field:

import zvec
from zvec_db.rerankers import NormalizedWeightedReRanker

# Open existing collection
collection = zvec.open("./my_collection")

# Reranker auto-infers metrics from schema
# - SPARSE_FP32 fields -> metrics=None (BM25 scores)
# - VECTOR_FP32 fields with COSINE -> metrics=MetricType.COSINE
reranker = NormalizedWeightedReRanker(
    topn=10,
    metrics=None,  # Will infer from schema
    schema=collection.schema,
    weights={"sparse": 0.4, "dense": 0.6},
)

# No need to manually specify metrics per source!
results = collection.query(
    vectors=[
        zvec.VectorQuery(field_name="sparse", vector=bm25.embed(query)),
        zvec.VectorQuery(field_name="dense", vector=dense.embed(query)),
    ],
    topk=20,
    reranker=reranker,
)

Manual per-source metrics (alternative):

from zvec_db.rerankers import NormalizedWeightedReRanker, MetricType

# Explicit per-source metrics
reranker = NormalizedWeightedReRanker(
    topn=10,
    metrics={
        "sparse": None,              # BM25 scores (not distances)
        "dense": MetricType.COSINE,  # Convert COSINE distance [0,2] -> similarity
    },
    weights={"sparse": 0.4, "dense": 0.6},
)

NormalizedRrfReRanker (Reciprocal Rank Fusion)

from zvec_db.rerankers import NormalizedRrfReRanker

reranker = NormalizedRrfReRanker(topn=10, rank_constant=60)
results = reranker.rerank({"bm25": bm25_docs, "dense": dense_docs})
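RRF combines sources using only rank positions, so raw score scales never matter. A self-contained sketch of the formula behind Reciprocal Rank Fusion (illustrative, not the library's code):

```python
def rrf_fuse(rankings, rank_constant=60):
    """Reciprocal Rank Fusion: each source contributes 1/(k + rank)
    for every doc id it returns; only rank positions matter."""
    fused = {}
    for ranked_ids in rankings.values():
        for rank, doc_id in enumerate(ranked_ids, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (rank_constant + rank)
    return sorted(fused, key=fused.get, reverse=True)

# "doc2" is ranked well by both sources, so it fuses to the top
print(rrf_fuse({"bm25": ["doc1", "doc2"], "dense": ["doc2", "doc3"]}))
# -> ['doc2', 'doc1', 'doc3']
```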

WeightedReRanker (scores already normalized)

Use when scores are already in [0, 1] with "higher=better" orientation.

from zvec_db.rerankers import WeightedReRanker

reranker = WeightedReRanker(
    topn=10,
    weights={"source1": 0.7, "source2": 0.3},
)
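The fusion itself is just a per-source weighted sum. A sketch under the assumption this class documents, i.e. scores already normalized to [0, 1]:

```python
def weighted_fuse(sources, weights):
    """Weighted fusion over already-normalized [0, 1] scores:
    combined(doc) = sum over sources of weights[s] * score_s(doc)."""
    combined = {}
    for name, docs in sources.items():
        for doc_id, score in docs.items():
            combined[doc_id] = combined.get(doc_id, 0.0) + weights[name] * score
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

# "b" appears in both sources: 0.7*0.5 + 0.3*1.0 = 0.65, still below "a" at 0.7
ranked = weighted_fuse(
    {"source1": {"a": 1.0, "b": 0.5}, "source2": {"b": 1.0}},
    {"source1": 0.7, "source2": 0.3},
)
```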

Default rerankers (ready-to-use)

from zvec_db.rerankers.defaults import (
    DefaultWeightedReranker,
    DefaultHybridReranker,
    DefaultRrfReranker,
)

# Weighted fusion with Bayesian normalization
reranker = DefaultWeightedReranker()

# Optimized hybrid: dense (60%) + BM25 (40%)
reranker = DefaultHybridReranker()

# RRF with standard parameters
reranker = DefaultRrfReranker()

results = reranker.rerank({"bm25": bm25_docs, "dense": dense_docs})

Cross-Encoder rerankers

All cross-encoders require a query parameter at initialization.

SentenceTransformerReranker (local, binary)

from zvec_db.rerankers import SentenceTransformerReranker

reranker = SentenceTransformerReranker(
    query="machine learning",
    model_name="cross-encoder/ms-marco-MiniLM-L-6-v2",
    topn=10,
)
results = reranker.rerank({"bm25": docs})

ClassificationReranker (local, multi-class)

from zvec_db.rerankers import ClassificationReranker

reranker = ClassificationReranker(
    query="machine learning",
    model_name="your-multi-class-model",
    num_classes=5,  # Auto-inferred if not specified
    topn=10,
)
results = reranker.rerank({"bm25": docs})

OpenAIReranker (API)

from zvec_db.rerankers import OpenAIReranker

reranker = OpenAIReranker(
    query="machine learning",
    base_url="http://localhost:9400/v1",
    model="BAAI/bge-reranker-v2-m3",
    endpoint="rerank",  # or "score"
    topn=10,
)
results = reranker.rerank({"bm25": docs})

Diversification

SubmodularReranker (MMR)

Maximize relevance while diversifying results.

from zvec_db.rerankers import SubmodularReranker

reranker = SubmodularReranker(
    topn=10,
    lambda_param=0.7,  # 70% relevance, 30% diversity
    vector_field="embedding",
)
results = reranker.rerank({"source": docs_with_vectors})
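MMR greedily picks the document maximizing `lambda_param * relevance - (1 - lambda_param) * (max similarity to already-selected documents)`. A hand-rolled sketch of that loop (illustrative; SubmodularReranker's actual implementation may differ):

```python
def mmr_select(candidates, topn=3, lambda_param=0.7):
    """Greedy MMR over `candidates`, a dict mapping
    doc_id -> (relevance, vector)."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = sum(a * a for a in u) ** 0.5
        nv = sum(b * b for b in v) ** 0.5
        return dot / (nu * nv) if nu and nv else 0.0

    selected = []
    remaining = dict(candidates)
    while remaining and len(selected) < topn:
        def mmr(doc_id):
            rel, vec = remaining[doc_id]
            # Penalize similarity to the closest already-selected doc
            penalty = max((cos(vec, candidates[s][1]) for s in selected), default=0.0)
            return lambda_param * rel - (1 - lambda_param) * penalty
        best = max(remaining, key=mmr)
        selected.append(best)
        del remaining[best]
    return selected

docs = {"a": (0.9, [1, 0]), "b": (0.85, [1, 0.01]), "c": (0.5, [0, 1])}
# Diversity pushes "c" above the near-duplicate "b"
print(mmr_select(docs, topn=2))  # -> ['a', 'c']
```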

Preprocessing

Preprocessing improves sparse embedding quality.

Automatic (recommended)

from zvec_db.embedders import BM25Embedder
from zvec_db.preprocessing import NormalizationConfig

config = NormalizationConfig.aggressive(language="french")
bm25 = BM25Embedder(max_features=4096, preprocessing_config=config)
bm25.fit(documents)
# Preprocessing is automatically applied and saved with the model

Utility functions

from zvec_db.preprocessing import normalize_text, stem_word, remove_stopwords

# Full pipeline
normalize_text("  CHAT MANGEAIT  ", lowercase=True, remove_accents=True, stem=True)  # "chat mang"

# Individual functions
stem_word("mangeaient", language="french")  # "mang"
remove_stopwords("le chat mange", language="french")  # "chat mange"

NLTK installation (provides the stemmers and stopword lists):

pip install "zvec-db[preprocessing]"

Model Persistence

from zvec_db.embedders import BM25Embedder

# Save
bm25 = BM25Embedder(max_features=4096, preprocessing_config=config)
bm25.fit(documents)
bm25.save("models/bm25_model.joblib")

# Load
bm25_loaded = BM25Embedder()
bm25_loaded.load("models/bm25_model.joblib")

# Embeddings are identical (preprocessing included)
assert bm25.embed("query") == bm25_loaded.embed("query")

Evaluation

from zvec_db.evaluation import evaluate_ranking

# Evaluate ranking quality
metrics = evaluate_ranking(
    ground_truth=[["doc1", "doc2"], ["doc3"]],
    predictions=[["doc2", "doc1"], ["doc3", "doc4"]],
    metrics=["ndcg", "map", "mrr", "recall"],
)
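As a reference point, MRR (one of the metrics above) is conventionally the mean of 1/rank of the first relevant document per query. A hand-rolled sketch, not evaluate_ranking's internals:

```python
def mean_reciprocal_rank(ground_truth, predictions):
    """MRR: average over queries of 1/rank of the first relevant doc."""
    total = 0.0
    for relevant, ranked in zip(ground_truth, predictions):
        rr = 0.0
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(ground_truth)

# Same data as above: both queries have a relevant doc at rank 1
print(mean_reciprocal_rank(
    [["doc1", "doc2"], ["doc3"]],
    [["doc2", "doc1"], ["doc3", "doc4"]],
))  # -> 1.0
```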

Development

# Clone
git clone https://github.com/ccdv-ai/zvec-db.git
cd zvec-db

# Install with all dependencies
make install

# Run tests
make test

# Lint
make lint

# Build docs
make docs

License

MIT License


Complete Example: Hybrid Search Pipeline

This section demonstrates a complete hybrid search pipeline with BM25 + dense embeddings and re-ranking.

Setup

import zvec
from zvec.model.doc import Doc
from zvec_db.embedders import BM25Embedder, SentenceTransformersEmbedder
from zvec_db.rerankers import NormalizedWeightedReRanker, DefaultHybridReranker

# Sample documents
documents = [
    "Machine learning is a subset of artificial intelligence",
    "Deep learning uses neural networks with many layers",
    "Natural language processing enables computers to understand text",
    "Computer vision allows machines to interpret images",
    "Reinforcement learning trains agents through rewards",
]

# Initialize embedders
bm25 = BM25Embedder(max_features=4096, k1=1.2, b=0.75)
bm25.fit(documents)

dense = SentenceTransformersEmbedder(
    model_name="all-MiniLM-L6-v2",
    device="cpu",
    normalize=True,
)

Create and populate collection

# Create zvec collection
schema = zvec.CollectionSchema(
    name="docs",
    vectors=[
        zvec.VectorSchema("sparse", zvec.DataType.SPARSE_FP32, dimension=4096),
        zvec.VectorSchema("dense", zvec.DataType.VECTOR_FP32, dimension=384),
    ]
)
collection = zvec.create_and_open("./my_db", schema)

# Index documents
for i, doc in enumerate(documents):
    collection.insert(zvec.Doc(
        id=str(i),
        fields={"text": doc},
        vectors={
            "sparse": bm25.embed(doc),
            "dense": dense.embed(doc),
        }
    ))

Hybrid search with re-ranking

query = "neural networks and deep learning"

# Method 1: Using collection.query with built-in reranker
results = collection.query(
    vectors=[
        zvec.VectorQuery(field_name="sparse", vector=bm25.embed(query)),
        zvec.VectorQuery(field_name="dense", vector=dense.embed(query)),
    ],
    topk=20,
    reranker=DefaultHybridReranker(
        weights={"sparse": 0.4, "dense": 0.6},
    ),
)

print("Top results:")
for i, doc in enumerate(results[:5]):
    print(f"  {i+1}. {doc.fields['text']} (score: {doc.score:.4f})")

Manual hybrid search (more control)

from zvec.model.doc import Doc

# 1. Separate searches
sparse_results = collection.search(
    vector_name="sparse",
    vector=bm25.embed(query),
    topk=20,
)

dense_results = collection.search(
    vector_name="dense",
    vector=dense.embed(query),
    topk=20,
)

# 2. Re-rank with schema-based auto-detection
reranker = NormalizedWeightedReRanker(
    topn=10,
    metrics=None,  # Infer from schema
    schema=collection.schema,
    weights={"sparse": 0.4, "dense": 0.6},
    normalizer_configs={
        "sparse": "bayes",  # Robust to BM25 outliers
        "dense": None,      # Optional: COSINE is already in [0, 1]
    },
)

# 3. Combine and re-rank
final_results = reranker.rerank({
    "sparse": sparse_results,
    "dense": dense_results,
})

print("\nFinal re-ranked results:")
for i, doc in enumerate(final_results[:5]):
    print(f"  {i+1}. {doc.fields['text']} (score: {doc.score:.4f})")

Standalone re-ranking (no zvec collection)

# If you're not using a zvec collection, you can still use the rerankers standalone
from zvec.model.doc import Doc
from zvec_db.rerankers import NormalizedWeightedReRanker, MetricType

# Mock search results from different sources
bm25_results = [
    Doc(id="doc1", score=15.5, fields={"text": "Machine learning..."}),
    Doc(id="doc2", score=12.3, fields={"text": "Deep neural..."}),
    Doc(id="doc3", score=8.7, fields={"text": "AI systems..."}),
]

dense_results = [
    Doc(id="doc2", score=0.92, fields={"text": "Deep neural..."}),
    Doc(id="doc1", score=0.75, fields={"text": "Machine learning..."}),
    Doc(id="doc4", score=0.68, fields={"text": "Data science..."}),
]

# Re-rank with explicit metrics
reranker = NormalizedWeightedReRanker(
    topn=10,
    metrics={
        "bm25": None,              # BM25 scores
        "dense": MetricType.COSINE,  # COSINE distances [0, 2]
    },
    weights={"bm25": 0.4, "dense": 0.6},
)

final_results = reranker.rerank({
    "bm25": bm25_results,
    "dense": dense_results,
})
