# diversity

A Python toolkit for measuring diversity in text.
## Installation

Install via pip:

```bash
pip install diversity
```

Or from source:

```bash
git clone https://github.com/cshaib/diversity.git
cd diversity
pip install .
```
## Quick Start

### Lexical Diversity Measures

We provide implementations of the Compression Ratio, Homogenization Score, and n-gram Diversity Score:
```python
from diversity import (
    compression_ratio,
    homogenization_score,
    ngram_diversity_score,
)

texts = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog again.",
    "Suddenly, the quick brown fox leaps swiftly over the sleeping dog."
]

# Compression ratio
cr = compression_ratio(texts, method='gzip')
print(f"Compression Ratio: {cr:.4f}")

# Homogenization score (Self-BLEU)
hs = homogenization_score(texts, method='self-bleu')
print(f"Homogenization (Self-BLEU): {hs:.4f}")

# N-gram diversity
ngd = ngram_diversity_score(texts, n=3)
print(f"3-gram Diversity: {ngd:.4f}")
```
`compression_ratio(texts, method='gzip')`

- Parameters:
  - `texts` (list): List of text strings
  - `method` (str): Compression algorithm (`'gzip'`, `'bz2'`, `'lzma'`)
- Returns: Float (0-1), higher = more repetitive

`homogenization_score(texts, method='self-bleu')`

- Parameters:
  - `texts` (list): List of text strings
  - `method` (str): Scoring method (`'self-bleu'`, `'rouge-l'`)
- Returns: Float (0-1), higher = more homogeneous

`ngram_diversity_score(texts, n=3)`

- Parameters:
  - `texts` (list): List of text strings
  - `n` (int): N-gram size
- Returns: Float (0-1), higher = more diverse
### Syntactic Diversity Measures

We also provide functions for extracting and analyzing Part-of-Speech (POS) patterns to identify repetitive syntactic structures in your text:
```python
from diversity import (
    extract_patterns,
    match_patterns,
)

texts = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog again.",
    "Suddenly, the quick brown fox leaps swiftly over the sleeping dog."
]

# POS pattern extraction
patterns = extract_patterns(texts, n=4, top_n=5)
print("Top POS patterns:", patterns)
# Example output: [(('DT', 'JJ', 'JJ', 'NN'), 15), ...]

# Match patterns in a single text
matches = match_patterns(texts[2], patterns)
print("Patterns in 3rd sentence:", matches)
# Example output: [{'pattern': ('DT', 'JJ', 'JJ', 'NN'), 'text': 'the quick brown fox', 'position': (0, 4)}]
```
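As a rough illustration of what pattern extraction counts, the sketch below tallies repeated runs of POS tags. It operates on hand-tagged sequences for simplicity; `extract_patterns` itself runs a POS tagger over raw text, so the tags and counts here are hypothetical.

```python
from collections import Counter

def top_pos_patterns(tagged_docs, n=4, top_n=5):
    # Count every contiguous run of n POS tags across all documents.
    counts = Counter()
    for tags in tagged_docs:
        for i in range(len(tags) - n + 1):
            counts[tuple(tags[i:i + n])] += 1
    return counts.most_common(top_n)

# Hand-written tag sequences standing in for the example sentences.
docs = [
    ["DT", "JJ", "JJ", "NN", "VBZ", "IN", "DT", "JJ", "NN"],
    ["DT", "JJ", "JJ", "NN", "VBZ", "IN", "DT", "JJ", "NN", "RB"],
]
print(top_pos_patterns(docs, n=4, top_n=3))
# The shared ('DT', 'JJ', 'JJ', 'NN') opening appears in both documents.
```

A 4-gram of tags that recurs across documents (here, determiner + two adjectives + noun) is exactly the kind of repetitive syntactic template the real function surfaces.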
### Embedding-Based Diversity Measures

You can also measure semantic diversity using embedding-based similarity. These scores compute distances between document embeddings to quantify how spread out or clustered the texts are:
```python
from diversity.embedding import remote_clique, chamfer_dist

texts = [
    "The quick brown fox jumps over the lazy dog.",
    "A swift auburn fox vaulted a sleeping canine.",
    "I brewed coffee and read the paper."
]

# Remote Clique Score
rc = remote_clique(texts, model="Qwen/Qwen3-Embedding-0.6B")
print(f"Remote Clique: {rc:.3f}")

# Chamfer Distance
cd = chamfer_dist(texts, model="Qwen/Qwen3-Embedding-0.6B")
print(f"Chamfer Distance: {cd:.3f}")
```
`remote_clique(data, model='Qwen/Qwen3-Embedding-0.6B', verbose=True, batch_size=64)`

- Parameters:
  - `data` (list of str): Documents to score.
  - `model` (str): HuggingFace/Sentence-Transformers embedding model to use (default: `"Qwen/Qwen3-Embedding-0.6B"`).
  - `verbose` (bool): Whether to show a progress bar during encoding (default: `True`).
  - `batch_size` (int): Batch size for embedding (default: `64`).
- Returns: `float`, the average mean pairwise cosine distance between documents (higher = more spread out / diverse).

`chamfer_dist(data, model='Qwen/Qwen3-Embedding-0.6B', verbose=True, batch_size=64)`

- Parameters:
  - `data` (list of str): Documents to score.
  - `model` (str): HuggingFace/Sentence-Transformers embedding model to use (default: `"Qwen/Qwen3-Embedding-0.6B"`).
  - `verbose` (bool): Whether to show a progress bar during encoding (default: `True`).
  - `batch_size` (int): Batch size for embedding (default: `64`).
- Returns: `float`, the average minimum pairwise cosine distance (sensitive to near-duplicates; higher = less redundancy).
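To make the contrast between the two scores concrete, here is a NumPy-only sketch on hand-made 2-D "embeddings" (the real functions first embed the texts with the model above). The `toy_` names are illustrative, not part of the package API.

```python
import numpy as np

def cosine_distances(X):
    # Pairwise cosine distances between row vectors.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return 1.0 - Xn @ Xn.T

def toy_remote_clique(X):
    # Average distance from each point to every other point.
    D = cosine_distances(X)
    n = len(X)
    return D[~np.eye(n, dtype=bool)].mean()

def toy_chamfer(X):
    # Average distance from each point to its nearest neighbour:
    # near-duplicates drag this down even when outliers are present.
    D = cosine_distances(X)
    np.fill_diagonal(D, np.inf)
    return D.min(axis=1).mean()

# Two near-duplicate vectors plus one outlier.
X = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]])
print(toy_remote_clique(X))  # pulled up by the outlier
print(toy_chamfer(X))        # pulled down by the near-duplicates
```

On this toy set the Chamfer score is much lower than the Remote Clique score: the pair of near-duplicates dominates the nearest-neighbour average, while the outlier inflates the all-pairs average.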
### QUDSim (Question Under Discussion Similarity)

QUDSim aligns document segments based on Questions Under Discussion (QUDs), the implicit questions that segments of text address (QUDsim: Quantifying Discourse Similarities in LLM-Generated Text). This function requires OpenAI API access.
```python
import os

from diversity import qudsim

# Two documents about the same topic
document1 = "In the heart of ancient Macedonia, Philip II ascended to the throne in 359 BC..."
document2 = "The sun beat down on the rough-hewn hills of ancient Macedonia..."

# Requires an OpenAI API key
key = os.environ.get('OPENAI_API_KEY')  # or your API key

# Generate QUD-based alignment
alignment = qudsim([document1, document2], key=key)

# Access alignment results
results = eval(alignment)[0]  # First document pair

# View aligned segments
for source_text, target_text in results['aligned_segment_text']:
    print(f"Source: {source_text[:100]}...")
    print(f"Target: {target_text[:100]}...")
    print("---")

# View alignment scores (harmonic mean scores matrix)
scores = results['harmonic_mean_scores']
print(f"Alignment scores shape: {len(scores)}x{len(scores[0])}")

# Other available fields:
# - results['source_qud_answers']: QUDs generated for the source document
# - results['target_qud_answers']: QUDs generated for the target document
# - results['aligned_segments']: Indices of aligned segments
```
`qudsim(documents, key)`

- Parameters:
  - `documents` (list): List of texts to align
  - `key` (str): OpenAI API key for QUD generation
  - `model` (str): LLM model to use (default: `gpt-4`)
  - `threshold` (float): Minimum alignment score threshold (default: 0.5)
- Returns: list of alignment scores
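The `harmonic_mean_scores` matrix combines similarity in both alignment directions. The sketch below shows only the harmonic-mean combination itself, with the directional scores invented for illustration; it is not the package's scoring code.

```python
def harmonic_mean(a, b):
    # The harmonic mean is high only when both scores are high;
    # it collapses to 0 if either direction scores 0.
    return 2 * a * b / (a + b) if (a + b) else 0.0

# Hypothetical directional alignment scores for one segment pair.
src_to_tgt, tgt_to_src = 0.8, 0.5
print(round(harmonic_mean(src_to_tgt, tgt_to_src), 4))  # 0.6154
print(harmonic_mean(0.9, 0.0))  # 0.0
```

This asymmetry-penalizing behaviour is why a segment pair only aligns strongly when each segment's QUDs are answered well by the other.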
## Citation(s)

If you use this package, please cite:

```bibtex
@misc{shaib2025standardizingmeasurementtextdiversity,
  title={Standardizing the Measurement of Text Diversity: A Tool and a Comparative Analysis of Scores},
  author={Chantal Shaib and Joe Barrow and Jiuding Sun and Alexa F. Siu and Byron C. Wallace and Ani Nenkova},
  year={2025},
  eprint={2403.00553},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2403.00553},
}
```
If you use QUDSim, please also cite:

```bibtex
@inproceedings{namuduri2025qudsim,
  title={{QUD}sim: Quantifying Discourse Similarities in {LLM}-Generated Text},
  author={Ramya Namuduri and Yating Wu and Anshun Asher Zheng and Manya Wadhwa and Greg Durrett and Junyi Jessy Li},
  booktitle={Second Conference on Language Modeling},
  year={2025},
  url={https://openreview.net/forum?id=zFz1BJu211}
}
```
## Requirements

- Python 3.10-3.12
- Core dependencies: `numpy`, `nltk`, `scikit-learn`
- For embedding-based metrics: `sentence-transformers`, `torch`
- For QUDSim: `openai`, `tqdm`
## License

This package is released under the Apache License 2.0.

## Contributing

Contributions are welcome! Please open an issue or submit a pull request on GitHub.