
zvec-db

Python 3.12+ · MIT License

Sparse/dense vectorization and document reranking toolkit for zvec.


Quick Start (5 minutes)

1. Install

pip install zvec-db

2. Index documents

from zvec_db.embedders import BM25Embedder, SentenceTransformersEmbedder

documents = [
    "Machine learning is a subset of AI",
    "Deep learning uses neural networks",
    "NLP helps computers understand text",
]

# Create embedders
bm25 = BM25Embedder(max_features=4096)
bm25.fit(documents)

dense = SentenceTransformersEmbedder(model_name="all-MiniLM-L6-v2")

# Encode documents
for doc in documents:
    sparse_vec = bm25.embed(doc)   # dict: {index: score}
    dense_vec = dense.embed(doc)   # numpy array
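
For intuition, a sparse vector here is just a `{feature_index: weight}` mapping, and lexical matching reduces to a dot product over shared indices. A toy illustration (not the library's scoring code):

```python
# Toy sparse vectors in the same {feature_index: weight} dict format
doc_vec   = {3: 1.2, 17: 0.8}
query_vec = {3: 0.5, 42: 1.0}

# Dot product over the indices the two vectors share
score = sum(w * query_vec[i] for i, w in doc_vec.items() if i in query_vec)
# only index 3 overlaps: 1.2 * 0.5 = 0.6
```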

3. Search with hybrid + reranking

from zvec.model.doc import Doc
from zvec_db.rerankers import WeightedReranker
from zvec.typing import MetricType

query = "neural networks"

# Option 1: With explicit metrics
reranker = WeightedReranker(
    weights={"dense": 0.6, "bm25": 0.4},
    metrics={"dense": MetricType.COSINE, "bm25": None},  # None = IP (no conversion)
    normalize=True,  # Smart default: COSINE -> no-op (already in [0, 1]), others -> "bayes"
)

# Option 2: With schema auto-detection (recommended with zvec)
# import zvec
# collection = zvec.open("./my_collection")
# reranker = WeightedReranker(
#     schema=collection.schema,  # Auto-detect metrics from schema
#     weights={"dense": 0.6, "bm25": 0.4},
#     normalize=True,
# )

results = reranker.rerank({
    "dense": [Doc(id="0", score=0.8), Doc(id="1", score=0.9)],
    "bm25":  [Doc(id="1", score=0.7), Doc(id="2", score=0.6)],
})

print(results[0].id)  # Most relevant document

Key Concepts

Distance vs Similarity Metrics

Vector databases typically return distances (smaller = more similar), while fusion algorithms expect similarities (larger = more relevant). The metrics parameter handles the conversion automatically:

| Metric | Type | Range | Conversion | Usage |
|--------|------|-------|------------|-------|
| COSINE | Distance | [0, 2] | (2 - score) / 2 | Normalized embeddings (Qdrant, zvec) |
| L2 | Distance | [0, ∞) | -score | Euclidean distance |
| IP | Similarity | (-∞, ∞) | None | Inner product, BM25 scores (already similarities) |

Default: MetricType.COSINE (main use case with zvec/Qdrant).
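
The conversions in the table can be sketched in a few lines of plain Python (illustrative only; the library applies them internally based on the metrics parameter):

```python
# Illustrative sketch of the distance -> similarity conversions above
# (not the library's implementation).
def to_similarity(score: float, metric: str) -> float:
    if metric == "COSINE":   # cosine distance in [0, 2] -> similarity in [0, 1]
        return (2 - score) / 2
    if metric == "L2":       # negate so that larger means more similar
        return -score
    return score             # IP / None: already a similarity
```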

Choosing a Sparse Embedder

| Embedder | Use case |
|----------|----------|
| BM25Embedder | Recommended: standard lexical search |
| TfidfEmbedder | TF-IDF weighting with sublinear TF option |
| CountEmbedder | Simple term counts (binary option available) |
| BM25LEmbedder | Documents with highly variable lengths |
| BM25PlusEmbedder | Avoids zero scores via delta smoothing |
| DisMaxEmbedder | Multi-field search (takes the maximum score) |
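
These rows correspond to standard lexical weighting schemes. A back-of-envelope sketch of the three main ones, using the textbook formulas (not necessarily the library's exact parameterization):

```python
import math

# Textbook lexical weighting schemes (illustrative, not the library's code)
def count_weight(tf: int) -> float:
    return float(tf)                   # CountEmbedder: raw term count

def tfidf_weight(tf: int, df: int, n_docs: int) -> float:
    return tf * math.log(n_docs / df)  # TfidfEmbedder: tf * idf

def bm25_weight(tf, df, n_docs, doc_len, avg_len, k1=1.5, b=0.75):
    # BM25Embedder: saturating tf with document-length normalization
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
```

BM25's term-frequency component saturates, so a word repeated ten times does not score ten times higher; that robustness is why it is the recommended default.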

Advanced Example: Hybrid Search with zvec

import zvec
from zvec.model.doc import Doc
from zvec_db.embedders import BM25Embedder, SentenceTransformersEmbedder
from zvec_db.rerankers import WeightedReranker

documents = [
    "Machine learning is a subset of AI",
    "Deep learning uses neural networks",
    "NLP helps computers understand text",
]

# Create embedders
bm25 = BM25Embedder(max_features=4096)
bm25.fit(documents)
dense = SentenceTransformersEmbedder(model_name="all-MiniLM-L6-v2")

# Create collection
schema = zvec.CollectionSchema(
    name="docs",
    fields=[zvec.FieldSchema("text", zvec.DataType.STRING)],
    vectors=[
        zvec.VectorSchema(name="sparse", data_type=zvec.DataType.SPARSE_VECTOR_FP32, dimension=4096),
        zvec.VectorSchema(name="dense", data_type=zvec.DataType.VECTOR_FP32, dimension=384,
                         index_param=zvec.FlatIndexParam(metric_type=zvec.MetricType.COSINE)),
    ]
)
collection = zvec.create_and_open("./my_db", schema)

# Index documents
for i, doc in enumerate(documents):
    collection.insert(Doc(
        id=str(i),
        fields={"text": doc},
        vectors={
            "sparse": bm25.embed(doc),
            "dense": dense.embed(doc),
        }
    ))

# Search with auto-detected metrics from schema
reranker = WeightedReranker(
    topn=3,
    schema=collection.schema,  # Auto-detect metrics: sparse->None, dense->COSINE
    weights={"sparse": 0.4, "dense": 0.6},
    normalize=True,  # Smart default: sparse -> "bayes", dense -> no-op (COSINE already in [0, 1])
)

query = "neural networks"
results = collection.query(
    vectors=[
        zvec.VectorQuery(field_name="sparse", vector=bm25.embed(query)),
        zvec.VectorQuery(field_name="dense", vector=dense.embed(query)),
    ],
    topk=10,
    reranker=reranker,
)

print("Top results:")
for i, doc in enumerate(results[:3]):
    print(f"  {i+1}. {doc.fields['text']} (score: {doc.score:.4f})")

Reranking

Normalization

The normalize parameter controls score normalization:

| Value | Effect |
|-------|--------|
| True | Smart default: COSINE → no-op, others → "bayes" |
| "bayes" | Bayesian sigmoid calibration (robust to outliers) |
| "minmax" | Min-max scaling: (x - min) / (max - min) |
| "rank" / "percentile" | Rank-based (very robust to outliers) |
| "cosine" | No-op (identity); COSINE scores are already in [0, 1] |
| {"sparse": "bayes", "dense": "cosine"} | Per-source configuration |
| None / False | No normalization |

Note: normalize=True requires schema or metrics to auto-detect the metric per source.

COSINE scores are already normalized to [0, 1] by the conversion formula (2 - score) / 2, so normalize="cosine" is a no-op (identity). Use it when you want to state explicitly, for API consistency, that no additional normalization is applied.
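
For intuition, the "minmax" and "rank" options can be sketched in plain Python. This is illustrative only; the library's "bayes" calibration is a sigmoid fit and is not reproduced here:

```python
def minmax(scores: list[float]) -> list[float]:
    # (x - min) / (max - min); a constant input maps to all zeros
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def rank_percentile(scores: list[float]) -> list[float]:
    # Map each score to its rank position in [0, 1]; outlier magnitudes
    # cannot distort the result because only ordering matters.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(scores)
    out = [0.0] * n
    for rank, i in enumerate(order):
        out[i] = rank / (n - 1) if n > 1 else 1.0
    return out
```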

WeightedReranker

Weighted fusion of multiple sources:

from zvec_db.rerankers import WeightedReranker
from zvec.typing import MetricType

# With explicit metrics
reranker = WeightedReranker(
    weights={"bm25": 0.4, "dense": 0.6},
    metrics={"bm25": MetricType.IP, "dense": MetricType.COSINE},
    normalize="bayes",
)

# With schema auto-detection (recommended)
import zvec
collection = zvec.open("./my_collection")
reranker = WeightedReranker(
    schema=collection.schema,
    weights={"sparse": 0.4, "dense": 0.6},
    normalize=True,
)

results = reranker.rerank({"bm25": bm25_docs, "dense": dense_docs})
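
Conceptually, weighted fusion sums weight × normalized score per document, with absent sources contributing nothing. A hand computation on the Quick Start inputs (assuming the scores are already normalized similarities):

```python
# Hand computation of weighted fusion (illustrative, not the library's code)
weights = {"dense": 0.6, "bm25": 0.4}
scores = {
    "dense": {"0": 0.8, "1": 0.9},
    "bm25":  {"1": 0.7, "2": 0.6},
}

fused: dict[str, float] = {}
for source, w in weights.items():
    for doc_id, s in scores[source].items():
        fused[doc_id] = fused.get(doc_id, 0.0) + w * s

# doc "1": 0.6 * 0.9 + 0.4 * 0.7 = 0.82 -> ranked first
ranking = sorted(fused, key=fused.get, reverse=True)
```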

RrfReranker (Reciprocal Rank Fusion)

Rank-based fusion (robust to score scale differences):

from zvec_db.rerankers import RrfReranker
import zvec

# With schema auto-detection (recommended)
collection = zvec.open("./my_collection")
reranker = RrfReranker(
    topn=10,
    rank_constant=60,
    schema=collection.schema,  # Auto-detect metrics
)
results = reranker.rerank({"bm25": bm25_docs, "dense": dense_docs})

# With custom weights
reranker = RrfReranker(
    topn=10,
    rank_constant=60,
    weights={"dense": 0.7, "bm25": 0.3},
    schema=collection.schema,
)

Note: RRF uses ranks, not scores. The normalize parameter has no effect.


Cross-Encoder Rerankers

Cross-encoders recompute relevance by scoring the query and each document jointly, so they require a query parameter.

from zvec_db.rerankers import SentenceTransformerReranker

reranker = SentenceTransformerReranker(
    query="machine learning",
    model_name="cross-encoder/ms-marco-MiniLM-L-6-v2",
    topn=10,
)
results = reranker.rerank({"bm25": docs})

Other cross-encoders: ClassificationReranker (multi-class), OpenAIReranker (API).


Preprocessing

Preprocessing improves sparse embedding quality:

from zvec_db.embedders import BM25Embedder
from zvec_db.preprocessing import NormalizationConfig

config = NormalizationConfig.aggressive(language="english")
bm25 = BM25Embedder(max_features=4096, preprocessing_config=config)
bm25.fit(documents)

# Utility functions
from zvec_db.preprocessing import normalize_text, stem_word, remove_stopwords
normalize_text("  HELLO WORLD  ", lowercase=True, stem=True)  # "hello world"

Install: pip install "zvec-db[preprocessing]"


Model Persistence

from zvec_db.embedders import BM25Embedder

# Save
bm25 = BM25Embedder(max_features=4096, preprocessing_config=config)
bm25.fit(documents)
bm25.save("models/bm25_model.joblib")

# Load
bm25_loaded = BM25Embedder()
bm25_loaded.load("models/bm25_model.joblib")

Installation

# Basic
pip install zvec-db

# With preprocessing (French/German/etc.)
pip install "zvec-db[preprocessing]"

# Development
pip install "zvec-db[dev,test,docs]"

Development

git clone https://github.com/ccdv-ai/zvec-db.git
cd zvec-db

make install   # Install with all dependencies
make test      # Run tests
make lint      # black, isort, flake8, mypy
make docs      # Build Sphinx docs

License

MIT License
