Milvus BEIR
A Python library that integrates Milvus vector database with BEIR (Benchmarking IR) for efficient information retrieval and evaluation. This library provides various search strategies including dense retrieval, sparse retrieval, and hybrid search approaches.
Features
- Multiple search strategies:
  - Dense Vector Search
  - Sparse Vector Search
  - BM25 Search
  - Hybrid Search (BM25 + Dense, Sparse + Dense)
  - Multi-Match Search
- Seamless integration with BEIR datasets and evaluation metrics
- Easy-to-use API for retrieval and evaluation
- Built-in performance measurement (QPS)
- Compatible with Milvus 2.5.x
Installation
pip install milvus-beir
Prerequisites
- Python >= 3.10
- Running Milvus instance (2.5.0 or higher)
Quick Start
The following example downloads a BEIR dataset, runs dense retrieval against Milvus, computes the standard evaluation metrics, and measures search throughput (QPS):
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
from milvus_beir.retrieval.search.dense.dense_search import MilvusDenseSearch
# Download and load BEIR dataset
dataset = "scifact"
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")
# Initialize Milvus search model
uri = "http://localhost:19530"
model = MilvusDenseSearch(
    uri,
    token=None,  # Optional authentication token
    collection_name="milvus_beir_demo",
    nq=100,  # Number of queries to process in parallel
    nb=1000,  # Number of documents to process in parallel
)
# Perform retrieval and evaluation
retriever = EvaluateRetrieval(model)
results = retriever.retrieve(corpus, queries)
# Calculate standard metrics
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
# Measure search performance (QPS)
qps = model.measure_search_qps(
    corpus,
    queries,
    top_k=1000,
    concurrency_levels=[1, 2],
    test_duration=60,
)
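To make the data shapes in the example above concrete: BEIR passes `qrels` as `{query_id: {doc_id: relevance}}` and `results` as `{query_id: {doc_id: score}}`. The sketch below computes a simplified Recall@k from dictionaries of that shape. It is an illustration only: the function `recall_at_k` and the toy data are hypothetical, and BEIR's own `evaluate` delegates to its evaluation backend rather than to code like this.

```python
def recall_at_k(qrels, results, k):
    """Mean fraction of relevant documents that appear in the top-k results."""
    per_query = []
    for qid, rels in qrels.items():
        relevant = {doc_id for doc_id, rel in rels.items() if rel > 0}
        if not relevant:
            continue  # Queries without relevant documents are skipped.
        ranked = sorted(results.get(qid, {}).items(), key=lambda kv: kv[1], reverse=True)
        top_k = {doc_id for doc_id, _ in ranked[:k]}
        per_query.append(len(relevant & top_k) / len(relevant))
    return sum(per_query) / len(per_query) if per_query else 0.0


# Toy data for illustration only (not real benchmark output).
toy_qrels = {"q1": {"d1": 1, "d2": 1}, "q2": {"d3": 1}}
toy_results = {
    "q1": {"d1": 0.9, "d4": 0.8, "d2": 0.1},
    "q2": {"d5": 0.7, "d3": 0.6},
}

print(recall_at_k(toy_qrels, toy_results, k=1))
print(recall_at_k(toy_qrels, toy_results, k=2))
```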
Search Strategies
Dense Vector Search
Uses dense embeddings for semantic search.
from milvus_beir.retrieval.search.dense.dense_search import MilvusDenseSearch
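As a rough illustration of what dense retrieval does, the sketch below ranks documents by cosine similarity between a query embedding and document embeddings. The similarity metric, the function names, and the two-dimensional vectors are assumptions made for the example; the README does not state which embedding model or metric `MilvusDenseSearch` uses.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def dense_search(query_vec, doc_vecs, top_k):
    """Return the top_k (doc_id, score) pairs, highest similarity first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda kv: kv[1], reverse=True)[:top_k]


# Hypothetical toy embeddings; real embeddings have hundreds of dimensions.
doc_vecs = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [1.0, 1.0]}
hits = dense_search([1.0, 0.1], doc_vecs, top_k=2)
print([doc_id for doc_id, _ in hits])
```

A vector database replaces the exhaustive loop in `dense_search` with an index, which is what makes this approach practical on large corpora.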
Sparse Vector Search
Implements sparse vector retrieval.
from milvus_beir.retrieval.search.sparse.sparse_search import MilvusSparseSearch
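Sparse vectors store weights only for the terms that are present, so scoring reduces to a dot product over shared keys. The sketch below shows that idea with plain dictionaries. The inner-product scoring and the names `sparse_dot` and `sparse_search` are assumptions for illustration; the README does not describe how `MilvusSparseSearch` encodes or scores documents.

```python
def sparse_dot(a, b):
    """Inner product of two sparse vectors stored as {dimension: weight}."""
    if len(a) > len(b):
        a, b = b, a  # Iterate over the smaller vector.
    return sum(weight * b.get(dim, 0.0) for dim, weight in a.items())


def sparse_search(query, docs, top_k):
    """Rank documents by inner product with the query, highest first."""
    scored = [(doc_id, sparse_dot(query, vec)) for doc_id, vec in docs.items()]
    return sorted(scored, key=lambda kv: kv[1], reverse=True)[:top_k]


# Hypothetical term-id -> weight vectors.
sparse_query = {10: 0.5, 42: 1.0}
sparse_docs = {
    "d1": {42: 2.0, 7: 1.0},
    "d2": {10: 1.0, 99: 3.0},
}
sparse_hits = sparse_search(sparse_query, sparse_docs, top_k=2)
print(sparse_hits)
```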
BM25 Search
Traditional lexical search using the BM25 algorithm.
from milvus_beir.retrieval.search.lexical.bm25_search import MilvusBM25Search
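For readers unfamiliar with BM25, the following is a minimal in-memory sketch of the scoring formula, using whitespace tokenization, the common defaults k1 = 1.2 and b = 0.75, and the non-negative IDF variant. These choices and the function name `bm25_scores` are assumptions for illustration; Milvus performs BM25 scoring server-side, and its analyzer and parameters may differ.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in docs ({doc_id: text}) against the query."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores[doc_id] = score
    return scores


# Toy corpus for illustration.
bm25_docs = {
    "d1": "milvus is a vector database",
    "d2": "bm25 is a lexical ranking function",
    "d3": "vector search with milvus and bm25",
}
bm25_result = bm25_scores("lexical ranking", bm25_docs)
print(max(bm25_result, key=bm25_result.get))
```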
Multi-Match Search
Implements a multi-match search strategy similar to Elasticsearch's multi-match with best_fields type.
from milvus_beir.retrieval.search.multi_match.multi_match_search import MilvusMultiMatchSearch
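In Elasticsearch, a best_fields multi-match scores a document by its best-matching field, optionally adding a fraction (the tie breaker) of the other fields' scores. The sketch below shows that combination rule on precomputed per-field scores. The function name and the example scores are hypothetical, and whether `MilvusMultiMatchSearch` supports a tie breaker is not stated in this README.

```python
def best_fields_score(field_scores, tie_breaker=0.0):
    """Combine per-field scores: best field plus tie_breaker times the rest."""
    if not field_scores:
        return 0.0
    best = max(field_scores.values())
    others = sum(field_scores.values()) - best
    return best + tie_breaker * others


# Hypothetical per-field scores for one document (BEIR documents carry
# "title" and "text" fields).
doc_field_scores = {"title": 2.0, "text": 1.5}

print(best_fields_score(doc_field_scores))
print(best_fields_score(doc_field_scores, tie_breaker=0.5))
```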
Hybrid Search
Combines lexical or sparse retrieval with dense retrieval. Two variants are provided: BM25 + Dense and Sparse + Dense.
from milvus_beir.retrieval.search.hybrid.bm25_hybrid_search import MilvusBM25DenseHybridSearch
from milvus_beir.retrieval.search.hybrid.sparse_hybrid_search import MilvusSparseDenseHybridSearch
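One common way to merge two ranked lists in a hybrid search is reciprocal rank fusion (RRF), sketched below. This is an illustration of the general technique, not a description of this library: the README does not say which fusion strategy the hybrid classes use, and the function name `rrf_fuse` and the constant k = 60 are assumptions.

```python
def rrf_fuse(rankings, k=60):
    """Fuse ranked doc-id lists; each list contributes 1 / (k + rank) per doc."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical result lists from two retrievers, best hit first.
bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]

fused_ranking = rrf_fuse([bm25_ranking, dense_ranking])
print([doc_id for doc_id, _ in fused_ranking])
```

Documents that rank well in both lists ("d1" and "d3" here) rise to the top, while documents found by only one retriever are kept but ranked lower.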
Command Line Interface
The package includes a command-line interface for evaluating the different search methods:
# Basic usage
milvus-beir --dataset nfcorpus --search-method sparse
# Full options
milvus-beir \
--dataset nfcorpus \
--uri "http://localhost:19530" \
--token "your_token" \
--search-method bm25 \
--collection-name "my_collection" \
--nq 100 \
--nb 1000 \
--concurrency-levels "1,2" \
--measure-qps
Available options:
- --dataset, -d: Dataset name to evaluate on (required)
  - Supported datasets: climate-fever, dbpedia-entity, fever, fiqa, hotpotqa, nfcorpus, nq, quora, scidocs, scifact, webis-touche2020, trec-covid, mmarco, cqadupstack/android, cqadupstack/english
- --uri, -u: Milvus server URI (default: http://localhost:19530)
- --token, -t: Authentication token for Milvus (optional)
- --search-method, -m: Search method to use (required)
  - Available methods: dense, sparse, sparse_hybrid, bm25_hybrid, multi_match, bm25
- --collection-name, -c: Milvus collection name (optional)
- --nq: Number of queries to process in parallel (default: 100)
- --nb: Number of documents to process in parallel (default: 1000)
- --concurrency-levels: Comma-separated list of concurrency levels for QPS measurement (default: "1,2")
- --measure-qps: Flag to enable QPS measurement (default: True)
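To compare every search method on one dataset, the CLI can be driven from a small script. The sketch below only builds the command lines from the options listed above and prints them; the helper `build_command` is hypothetical and not part of the package. Each argument list could then be passed to `subprocess.run`, assuming `milvus-beir` is on the PATH and a Milvus instance is reachable.

```python
import shlex

# Method names as listed under --search-method above.
SEARCH_METHODS = ["dense", "sparse", "sparse_hybrid", "bm25_hybrid", "multi_match", "bm25"]


def build_command(dataset, method, uri="http://localhost:19530", concurrency_levels=(1, 2)):
    """Build the argument list for one milvus-beir invocation."""
    return [
        "milvus-beir",
        "--dataset", dataset,
        "--search-method", method,
        "--uri", uri,
        "--concurrency-levels", ",".join(str(level) for level in concurrency_levels),
    ]


commands = [build_command("nfcorpus", method) for method in SEARCH_METHODS]
for command in commands:
    # Printed for inspection; run with subprocess.run(command, check=True).
    print(shlex.join(command))
```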
The CLI tool will:
- Download and load the specified BEIR dataset
- Initialize the selected search method
- Perform retrieval and evaluation
- Calculate and display standard metrics (NDCG@k, MAP@k, Recall@k, Precision@k)
- Measure and report search performance in QPS (Queries Per Second)
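The QPS step can be pictured as running searches from several concurrent workers for a fixed duration and dividing the completed query count by the elapsed time. The sketch below is one way to do that with a thread pool; it is an assumption about the general approach, not the implementation of `measure_search_qps`, and `fake_search` is a stand-in for a real Milvus query.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure_qps(search_fn, queries, concurrency, duration_s):
    """Run search_fn from `concurrency` threads for duration_s seconds."""
    start = time.perf_counter()
    deadline = start + duration_s

    def worker(worker_id):
        completed = 0
        index = worker_id
        while time.perf_counter() < deadline:
            search_fn(queries[index % len(queries)])
            completed += 1
            index += concurrency  # Spread queries across workers.
        return completed

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        counts = list(pool.map(worker, range(concurrency)))
    elapsed = time.perf_counter() - start
    return sum(counts) / elapsed


def fake_search(query):
    """Stand-in for a real search call; sleeps to simulate latency."""
    time.sleep(0.001)
    return []


qps_by_level = {
    level: measure_qps(fake_search, ["q1", "q2", "q3"], level, duration_s=0.2)
    for level in (1, 2)
}
for level, qps in qps_by_level.items():
    print(f"concurrency={level}: {qps:.0f} QPS")
```

Threads are sufficient here because a search call spends most of its time waiting on the network rather than executing Python code.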
Development
Setup Development Environment
# Clone the repository
git clone https://github.com/mmga-lab/milvus-beir.git
cd milvus-beir
# Install dependencies using PDM
pdm install
pre-commit install
Running Tests
pdm run pytest tests/
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contributing
We welcome contributions! Please feel free to submit a Pull Request.