
Project description

diversity


A Python toolkit for measuring diversity in text.




Installation

Install via pip:

pip install diversity

Or from source:

git clone https://github.com/cshaib/diversity.git
cd diversity
pip install .

Quick Start

Lexical Diversity Measures

We provide implementations for Compression Ratio, Homogenization Score, and n-gram Diversity Score:

from diversity import (
    compression_ratio,
    homogenization_score,
    ngram_diversity_score,
)

texts = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog again.",
    "Suddenly, the quick brown fox leaps swiftly over the sleeping dog."
]

# Compression ratio
cr = compression_ratio(texts, method='gzip')
print(f"Compression Ratio: {cr:.4f}")

# Homogenization score (Self-BLEU)
hs = homogenization_score(texts, method='self-bleu')
print(f"Homogenization (Self-BLEU): {hs:.4f}")

# N-gram diversity
ngd = ngram_diversity_score(texts, n=3)
print(f"3-gram Diversity: {ngd:.4f}")

compression_ratio(texts, method='gzip')

  • Parameters:
    • texts (list): List of text strings
    • method (str): Compression algorithm ('gzip', 'bz2', 'lzma')
  • Returns: Float, the ratio of original to compressed size (typically ≥ 1); higher = more repetitive
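The intuition can be sketched with the standard library's gzip module. This is a simplified illustration of the idea, not necessarily the package's exact computation (the library may join or normalize texts differently):

```python
import gzip

def gzip_compression_ratio(texts):
    """Ratio of raw byte size to gzip-compressed size of the joined corpus.
    Redundant text compresses well, so repetition pushes the ratio up."""
    raw = " ".join(texts).encode("utf-8")
    return len(raw) / len(gzip.compress(raw))

repetitive = ["The quick brown fox jumps over the lazy dog."] * 10
varied = [
    "The quick brown fox jumps over the lazy dog.",
    "A committee reviewed the budget proposal yesterday.",
    "Rain is expected across the northern valleys tonight.",
]
```

On the two corpora above, the repetitive one yields a clearly higher ratio than the varied one.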

homogenization_score(texts, method='self-bleu')

  • Parameters:
    • texts (list): List of text strings
    • method (str): Scoring method ('self-bleu', 'rouge-l')
  • Returns: Float (0-1), higher = more homogeneous
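To make the idea behind Self-BLEU concrete, here is a stripped-down homogenization sketch built on clipped n-gram precision, the core quantity in BLEU. The package uses a full BLEU implementation, so treat the function names and the single-order simplification here as illustrative only:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(hyp, refs, n):
    """Modified n-gram precision: hypothesis counts are clipped by the
    maximum count of that n-gram in any single reference."""
    hyp_counts = Counter(ngrams(hyp, n))
    if not hyp_counts:
        return 0.0
    max_ref = Counter()
    for ref in refs:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    return clipped / sum(hyp_counts.values())

def simple_self_bleu(texts, n=2):
    """Average clipped n-gram precision of each text against all the others.
    1.0 means every text's n-grams all appear in some other text."""
    tokenized = [t.lower().split() for t in texts]
    scores = []
    for i, hyp in enumerate(tokenized):
        refs = [t for j, t in enumerate(tokenized) if j != i]
        scores.append(clipped_precision(hyp, refs, n))
    return sum(scores) / len(scores)
```

Identical texts score 1.0; texts sharing no n-grams score 0.0, matching the "higher = more homogeneous" reading above.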

ngram_diversity_score(texts, n=3)

  • Parameters:
    • texts (list): List of text strings
    • n (int): N-gram size
  • Returns: Float (0-1), higher = more diverse
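The core quantity is the fraction of n-grams in the corpus that are unique. The library may additionally combine several n-gram orders up to n; this sketch shows a single order, as an illustration rather than the package's exact formula:

```python
def simple_ngram_diversity(texts, n=3):
    """Unique n-grams divided by total n-grams across the corpus;
    1.0 means no n-gram ever repeats."""
    grams = []
    for text in texts:
        tokens = text.lower().split()
        grams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(grams)) / len(grams) if grams else 0.0
```

Duplicating a corpus halves this score, since every n-gram now occurs twice.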

Syntactic Diversity Measures

We also provide functions for extracting and analyzing Part-of-Speech (POS) patterns to identify repetitive syntactic structures in your text:

from diversity import (
    extract_patterns,
    match_patterns
)

texts = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog again.",
    "Suddenly, the quick brown fox leaps swiftly over the sleeping dog."
]

# POS pattern extraction
patterns = extract_patterns(texts, n=4, top_n=5)
print("Top POS patterns:", patterns)
# Example output: [(('DT', 'JJ', 'JJ', 'NN'), 15), ...]

# Match patterns in a single text
matches = match_patterns(texts[2], patterns)
print("Patterns in 3rd sentence:", matches)
# Example output: [{'pattern': ('DT', 'JJ', 'JJ', 'NN'), 'text': 'the quick brown fox', 'position': (0, 4)}]
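Under the hood, pattern extraction amounts to counting sliding windows over POS-tag sequences. Here is a minimal sketch using hand-written tag sequences (Penn Treebank tag names); a real run would tag the text first, and the function name here is illustrative, not the package's API:

```python
from collections import Counter

# Hand-tagged sentences (illustrative, not real tagger output).
tagged = [
    ["DT", "JJ", "JJ", "NN", "VBZ", "IN", "DT", "JJ", "NN"],
    ["RB", "DT", "JJ", "JJ", "NN", "VBZ", "RB", "IN", "DT", "JJ", "NN"],
]

def top_pos_patterns(tag_seqs, n=4, top_n=5):
    """Count every length-n window of tags and return the most frequent."""
    counts = Counter()
    for seq in tag_seqs:
        for i in range(len(seq) - n + 1):
            counts[tuple(seq[i:i + n])] += 1
    return counts.most_common(top_n)
```

With these inputs, the window ('DT', 'JJ', 'JJ', 'NN') surfaces near the top because both sentences share that noun-phrase shape.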

Embedding-Based Diversity Measures

You can also measure semantic diversity using embedding-based similarity. These scores compute distances between document embeddings to quantify how spread out or clustered the texts are:

from diversity.embedding import remote_clique, chamfer_dist

texts = [
    "The quick brown fox jumps over the lazy dog.",
    "A swift auburn fox vaulted a sleeping canine.",
    "I brewed coffee and read the paper."
]

# Remote Clique Score
rc = remote_clique(texts, model="Qwen/Qwen3-Embedding-0.6B")
print(f"Remote Clique: {rc:.3f}")

# Chamfer Distance
cd = chamfer_dist(texts, model="Qwen/Qwen3-Embedding-0.6B")
print(f"Chamfer Distance: {cd:.3f}")

remote_clique(data, model='Qwen/Qwen3-Embedding-0.6B', verbose=True, batch_size=64)

  • data (list of str): Documents to score.

  • model (str): HuggingFace/Sentence-Transformers embedding model to use (default: "Qwen/Qwen3-Embedding-0.6B").

  • verbose (bool): Whether to show a progress bar during encoding (default: True).

  • batch_size (int): Batch size for embedding (default: 64).

  • Returns: float — average mean pairwise cosine distance between documents (higher = more spread out / diverse).

chamfer_dist(data, model='Qwen/Qwen3-Embedding-0.6B', verbose=True, batch_size=64)

  • data (list of str): Documents to score.

  • model (str): HuggingFace/Sentence-Transformers embedding model to use (default: "Qwen/Qwen3-Embedding-0.6B").

  • verbose (bool): Whether to show a progress bar during encoding (default: True).

  • batch_size (int): Batch size for embedding (default: 64).

  • Returns: float — average minimum pairwise cosine distance (sensitive to near-duplicates; higher = less redundancy).
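Both measures reduce to simple operations on a pairwise cosine-distance matrix. The sketch below assumes the embeddings have already been computed (e.g. by a sentence-transformers model); it mirrors the definitions above but is not necessarily the package's exact code:

```python
import numpy as np

def pairwise_cosine_dist(embeddings):
    """Cosine-distance matrix after L2-normalizing each row."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return 1.0 - unit @ unit.T

def remote_clique_score(embeddings):
    """Mean over documents of the mean distance to every other document."""
    dist = pairwise_cosine_dist(embeddings)
    n = len(dist)
    return float((dist.sum(axis=1) / (n - 1)).mean())

def chamfer_score(embeddings):
    """Mean over documents of the distance to the nearest other document."""
    dist = pairwise_cosine_dist(embeddings)
    np.fill_diagonal(dist, np.inf)  # ignore self-distances
    return float(dist.min(axis=1).mean())
```

The contrast between the two is easy to see with a near-duplicate: adding a copy of an existing document barely moves the Remote Clique score but drags the Chamfer distance toward zero.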


QUDSim (Question Under Discussion Similarity)

QUDSim aligns document segments based on Questions Under Discussion (QUDs), the implicit questions that segments of text address (QUDsim: Quantifying Discourse Similarities in LLM-Generated Text).

This function requires OpenAI API access.

from diversity import qudsim

# Two documents about the same topic
document1 = "In the heart of ancient Macedonia, Philip II ascended to the throne in 359 BC..."
document2 = "The sun beat down on the rough-hewn hills of ancient Macedonia..."

# Requires OpenAI API key
import os
key = os.environ.get('OPENAI_API_KEY')  # or your API key

# Generate QUD-based alignment
alignment = qudsim([document1, document2], key=key)

# Access alignment results
results = eval(alignment)[0]  # First document pair

# View aligned segments
for source_text, target_text in results['aligned_segment_text']:
    print(f"Source: {source_text[:100]}...")
    print(f"Target: {target_text[:100]}...")
    print("---")

# View alignment scores (harmonic mean scores matrix)
scores = results['harmonic_mean_scores']
print(f"Alignment scores shape: {len(scores)}x{len(scores[0])}")

# Other available fields:
# - results['source_qud_answers']: QUDs generated for source document
# - results['target_qud_answers']: QUDs generated for target document
# - results['aligned_segments']: Indices of aligned segments

qudsim(documents, key, model='gpt-4', threshold=0.5)

  • Parameters:
    • documents (list): List of texts to align
    • key (str): OpenAI API key for QUD generation
    • model (str): LLM model to use (default: gpt-4)
    • threshold (float): Minimum alignment score threshold (default: 0.5)
  • Returns: list of alignment results, one entry per document pair
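The harmonic-mean score matrix rewards segment pairs that score well in both directions. A toy numpy sketch of that combination step, using made-up similarity values (whether these exactly match the package's internal inputs is an assumption; the real scores come from LLM-generated QUD answers):

```python
import numpy as np

# Hypothetical directional similarities between 2 source and 2 target segments.
source_to_target = np.array([[0.8, 0.1],
                             [0.2, 0.9]])
target_to_source = np.array([[0.6, 0.3],
                             [0.1, 0.7]])

# Harmonic mean is high only when both directions agree.
harmonic = (2 * source_to_target * target_to_source
            / (source_to_target + target_to_source + 1e-12))

# Keep segment pairs above the alignment threshold.
threshold = 0.5
aligned = [(i, j)
           for i in range(harmonic.shape[0])
           for j in range(harmonic.shape[1])
           if harmonic[i, j] >= threshold]
```

Here only the diagonal pairs survive the threshold, since the off-diagonal pairs score well in at most one direction.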

Citation(s)

If you use this package, please cite:

@misc{shaib2025standardizingmeasurementtextdiversity,
  title={Standardizing the Measurement of Text Diversity: A Tool and a Comparative Analysis of Scores},
  author={Chantal Shaib and Joe Barrow and Jiuding Sun and Alexa F. Siu and Byron C. Wallace and Ani Nenkova},
  year={2025},
  eprint={2403.00553},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2403.00553},
}

If you use QUDSim, please also cite:

@inproceedings{namuduri2025qudsim,
  title={{QUD}sim: Quantifying Discourse Similarities in {LLM}-Generated Text},
  author={Ramya Namuduri and Yating Wu and Anshun Asher Zheng and Manya Wadhwa and Greg Durrett and Junyi Jessy Li},
  booktitle={Second Conference on Language Modeling},
  year={2025},
  url={https://openreview.net/forum?id=zFz1BJu211}
}

Requirements

  • Python 3.10-3.12
  • Core dependencies:
    • numpy
    • nltk
    • scikit-learn
  • For embedding-based metrics:
    • sentence-transformers
    • torch
  • For QUDSim:
    • openai
    • tqdm

License

This package is released under the Apache License 2.0.


Contributing

Contributions are welcome!
Please open an issue or submit a pull request on GitHub.


Project details


Download files


Source Distribution

diversity-0.3.0.tar.gz (24.1 kB)

Built Distribution

diversity-0.3.0-py3-none-any.whl (30.4 kB)

File details

Details for the file diversity-0.3.0.tar.gz.

File metadata

  • Download URL: diversity-0.3.0.tar.gz
  • Size: 24.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for diversity-0.3.0.tar.gz

  • SHA256: db7506d1b973b57cce85f83d0645c215448982de83b657a1b0ecaf92c26e4a60
  • MD5: fb06af68b586aab2f56336148431f878
  • BLAKE2b-256: 8c5cd0fae933443bf56f8920af981253c452613ce369e3aa422aa3909b445412


Provenance

The following attestation bundles were made for diversity-0.3.0.tar.gz:

Publisher: publish.yaml on cshaib/diversity

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file diversity-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: diversity-0.3.0-py3-none-any.whl
  • Size: 30.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for diversity-0.3.0-py3-none-any.whl

  • SHA256: 4a4b77e4ade5b11538bfb416e72c3efc3bb1e81c05552d3cb12df0a59a87cdd8
  • MD5: 8f865e2c856d422cb39082d32365e7cb
  • BLAKE2b-256: 88f3dfc9edc7e1eadb5e209500641504366c1bc593c1dbebde206263a7fb06e4


Provenance

The following attestation bundles were made for diversity-0.3.0-py3-none-any.whl:

Publisher: publish.yaml on cshaib/diversity

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
