Fast Diversification for Retrieval

Pyversity is a fast, lightweight library for diversifying retrieval results. Retrieval systems often return highly similar items. Pyversity efficiently re-ranks these results to encourage diversity, surfacing items that remain relevant but are less redundant.

It implements several popular diversification strategies (MMR, MSD, DPP, and COVER) behind a clear, unified API. See the Supported Strategies section for details on each one. The only dependency is NumPy, which keeps the package very lightweight.

Quickstart

Install pyversity with:

pip install pyversity

Diversify retrieval results:

import numpy as np
from pyversity import diversify, Strategy

# Define embeddings and scores (e.g. cosine similarities of a query result)
embeddings = np.random.randn(100, 256)
scores = np.random.rand(100)

# Diversify the result
diversified_result = diversify(
    embeddings=embeddings,
    scores=scores,
    k=10,  # Number of items to select
    strategy=Strategy.MMR,  # Diversification strategy to use
    diversity=0.5,  # Diversity parameter (higher values prioritize diversity)
)

# Get the indices of the diversified result
diversified_indices = diversified_result.indices

The returned DiversificationResult exposes the diversified indices, the selection_scores assigned by the chosen strategy, and other useful information. The strategies are fast and scalable: this example runs in milliseconds.

The diversity parameter tunes the trade-off between relevance and diversity: 0.0 focuses purely on relevance (no diversification), while 1.0 maximizes diversity, potentially at the cost of relevance.
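To make the trade-off concrete, here is a minimal NumPy sketch of a greedy MMR-style selection. It is not pyversity's implementation: the function name mmr_sketch is hypothetical, and the way diversity is mapped onto the two weights, (1 - diversity) for relevance and diversity for redundancy, is an assumption made for illustration.

```python
import numpy as np


def mmr_sketch(embeddings, scores, k, diversity=0.5):
    """Greedy MMR-style selection (illustrative sketch, not pyversity's code)."""
    embeddings = np.asarray(embeddings, dtype=float)
    scores = np.asarray(scores, dtype=float)

    # Normalize rows so that dot products are cosine similarities.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)

    n = len(scores)
    k = min(k, n)
    selected = []
    available = np.ones(n, dtype=bool)
    # Highest similarity of each candidate to any item selected so far.
    max_sim = np.full(n, -np.inf)

    for _ in range(k):
        # Nothing is selected on the first step, so there is no redundancy yet.
        redundancy = max_sim if selected else np.zeros(n)
        # Assumed weighting: relevance minus redundancy, balanced by `diversity`.
        objective = (1.0 - diversity) * scores - diversity * redundancy
        objective[~available] = -np.inf
        best = int(np.argmax(objective))
        selected.append(best)
        available[best] = False
        # Fold in similarities to the newly selected item.
        max_sim = np.maximum(max_sim, unit @ unit[best])

    return np.array(selected)


# Items 0 and 1 are exact duplicates; item 2 points in a different direction.
toy_embeddings = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
toy_scores = np.array([0.9, 0.85, 0.8])

relevance_only = mmr_sketch(toy_embeddings, toy_scores, k=2, diversity=0.0)
balanced = mmr_sketch(toy_embeddings, toy_scores, k=2, diversity=0.5)
```

In this toy setup, diversity=0.0 simply keeps the two highest-scoring items (the duplicates), while diversity=0.5 skips the duplicate in favor of the slightly less relevant but distinct item.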

Supported Strategies

The following table describes the supported strategies, how they work, their time complexity, and when to use them. The papers linked in the references section provide more in-depth information on the strengths/weaknesses of the supported strategies.

| Strategy | What It Does | Time Complexity | When to Use |
| --- | --- | --- | --- |
| MMR (Maximal Marginal Relevance) | Keeps the most relevant items while down-weighting those too similar to what’s already picked. | O(k · n · d) | Good default. Fast, simple, and works well when you just want to avoid near-duplicates. |
| MSD (Max Sum of Distances) | Prefers items that are both relevant and far from all previous selections. | O(k · n · d) | Use when you want stronger spread, i.e. results that cover a wider range of topics or styles. |
| DPP (Determinantal Point Process) | Samples diverse yet relevant items using probabilistic “repulsion.” | O(k · n · d + n · k²) | Ideal when you want to eliminate redundancy or ensure diversity is built into selection. |
| COVER (Facility-Location) | Ensures selected items collectively represent the full dataset’s structure. | O(k · n²) | Great for topic coverage or clustering scenarios, but slower for large n. |
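As a rough illustration of how MSD's "far from all previous selections" idea can be realized, the sketch below greedily adds the item with the best mix of relevance and summed cosine distance to everything already chosen. The function name msd_sketch, the use of cosine distance, and the (1 - diversity) / diversity weighting are all assumptions made for this example; pyversity's actual implementation may differ.

```python
import numpy as np


def msd_sketch(embeddings, scores, k, diversity=0.5):
    """Greedy max-sum-of-distances selection (illustrative sketch only)."""
    embeddings = np.asarray(embeddings, dtype=float)
    scores = np.asarray(scores, dtype=float)

    # Normalize rows so that 1 - dot product is the cosine distance.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)

    n = len(scores)
    k = min(k, n)
    selected = []
    available = np.ones(n, dtype=bool)
    # Running sum of distances from each candidate to all selected items.
    dist_sum = np.zeros(n)

    for _ in range(k):
        # Assumed weighting: relevance plus accumulated distance.
        objective = (1.0 - diversity) * scores + diversity * dist_sum
        objective[~available] = -np.inf
        best = int(np.argmax(objective))
        selected.append(best)
        available[best] = False
        # One matrix-vector product per step: O(n * d), so O(k * n * d) overall.
        dist_sum += 1.0 - unit @ unit[best]

    return np.array(selected)


# Items 0 and 1 are exact duplicates; item 2 points in a different direction.
toy_embeddings = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
toy_scores = np.array([0.9, 0.85, 0.8])

msd_relevance_only = msd_sketch(toy_embeddings, toy_scores, k=2, diversity=0.0)
msd_balanced = msd_sketch(toy_embeddings, toy_scores, k=2, diversity=0.5)
```

Unlike an MMR-style rule, which penalizes only the single most similar selected item, this objective accumulates distance to every selected item, which is why it tends to push results further apart.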

Motivation

Traditional retrieval systems rank results purely by relevance (how closely each item matches the query). While effective, this can lead to redundancy: top results often look nearly identical, which can create a poor user experience.

Diversification techniques like MMR, MSD, COVER, and DPP help balance relevance and variety. Each new item is chosen not only because it’s relevant, but also because it adds new information that wasn’t already covered by earlier results.

This improves exploration, user satisfaction, and coverage across many domains, for example:

  • E-commerce: Show different product styles, not multiple copies of the same black pants.
  • News search: Highlight articles from different outlets or viewpoints.
  • Academic retrieval: Surface papers from different subfields or methods.
  • RAG / LLM contexts: Avoid feeding the model near-duplicate passages.

References

The implementations in this package are based on the following research papers:

  • MMR: Carbonell, J., & Goldstein, J. (1998). The use of MMR, diversity-based reranking for reordering documents and producing summaries. Link

  • MSD: Borodin, A., Lee, H. C., & Ye, Y. (2012). Max-sum diversification, monotone submodular functions and dynamic updates. Link

  • COVER: Puthiya Parambath, S. A., Usunier, N., & Grandvalet, Y. (2016). A coverage-based approach to recommendation diversity on similarity graph. Link

  • DPP: Kulesza, A., & Taskar, B. (2012). Determinantal Point Processes for Machine Learning. Link

  • DPP (efficient greedy implementation): Chen, L., Zhang, G., & Zhou, H. (2018). Fast greedy MAP inference for determinantal point process to improve recommendation diversity. Link

Author

Thomas van Dongen
