# diversity

A Python toolkit for measuring diversity in text.
## Table of Contents

- Installation
- Quick Start
- Citations
- Requirements
- License
- Contributing
## Installation

Install via pip:

```bash
pip install diversity
```

Or from source:

```bash
git clone https://github.com/cshaib/diversity.git
cd diversity
pip install .
```
## Quick Start

The function `compute_all_metrics` returns a dictionary (and, optionally, a LaTeX- or Markdown-formatted table) containing the diversity metrics described individually in the sections below.

```python
from diversity import compute_all_metrics
import json

texts = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog again.",
    "Suddenly, the quick brown fox leaps swiftly over the sleeping dog."
]

# Compute all metrics
results = compute_all_metrics(corpus=texts)

# Remove the list of per-document scores for cleaner dict output
clean_results = {k: v for k, v in results.items()
                 if k != "templates_per_token_scores"}

output_content = json.dumps(clean_results, indent=2)
with open('diversity_metrics.json', 'w', encoding='utf-8') as f:
    f.write(output_content)
```
## Lexical Diversity Measures

We provide implementations of the Compression Ratio, Homogenization Score, n-gram Diversity Score, and Self-Repetition Score:

```python
from diversity import (
    compression_ratio,
    homogenization_score,
    ngram_diversity_score,
    self_repetition_score,
)

texts = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog again.",
    "Suddenly, the quick brown fox leaps swiftly over the sleeping dog."
]

# Compression ratio
cr = compression_ratio(texts, algorithm='gzip')
print(f"Compression Ratio: {cr:.4f}")

# Homogenization score (ROUGE-L)
hs = homogenization_score(texts, measure='rougel')
print(f"Homogenization (ROUGE-L): {hs:.4f}")

# N-gram diversity
ngd = ngram_diversity_score(texts, num_n=3)
print(f"3-gram Diversity: {ngd:.4f}")

# Self-repetition score
srs = self_repetition_score(texts)
print(f"Self-Repetition Score: {srs:.4f}")
```
`compression_ratio(texts, algorithm='gzip')`

- Parameters:
  - `texts` (list): List of text strings
  - `algorithm` (str): Compression algorithm (`'gzip'` or `'xz'`)
- Returns: Float; higher = more repetitive
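For intuition, a compression-based score of this kind can be sketched with the standard library's `gzip` module. This is a simplified stand-in, not the package's implementation, and `gzip_compression_ratio` is a hypothetical helper name:

```python
import gzip

def gzip_compression_ratio(texts):
    """Toy compression ratio: bytes before compression / bytes after."""
    data = " ".join(texts).encode("utf-8")
    return len(data) / len(gzip.compress(data))

varied = [
    "The quick brown fox jumps over the lazy dog.",
    "I brewed coffee and read the paper this morning.",
]
# Repetitive text compresses better, so its ratio is higher.
repetitive = ["The quick brown fox jumps over the lazy dog."] * 10

print(gzip_compression_ratio(varied))
print(gzip_compression_ratio(repetitive))
```

The second corpus yields a much larger ratio because gzip encodes the repeated sentence very cheaply.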
`homogenization_score(texts, measure='rougel')`

- Parameters:
  - `texts` (list): List of text strings
  - `measure` (str): Scoring method (`'rougel'`, `'bleu'`, or `'bertscore'`)
- Returns: Float; higher = more homogeneous
`ngram_diversity_score(texts, num_n=4)`

- Parameters:
  - `texts` (list): List of text strings
  - `num_n` (int): Max n-gram size to evaluate up to
- Returns: Float; higher = more diverse
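The underlying idea can be sketched as a distinct-n style count: the fraction of unique n-grams among all n-grams, summed over n. This is a simplified stand-in for the library's implementation, and `toy_ngram_diversity` is a hypothetical name:

```python
def toy_ngram_diversity(texts, num_n=3):
    """Sum over n = 1..num_n of (unique n-grams / total n-grams)."""
    score = 0.0
    for n in range(1, num_n + 1):
        grams = []
        for text in texts:
            tokens = text.lower().split()
            grams += [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        if grams:
            score += len(set(grams)) / len(grams)
    return score

varied = ["the cat sat", "a dog ran fast", "birds fly south"]
repeated = ["the cat sat", "the cat sat", "the cat sat"]
print(toy_ngram_diversity(varied), toy_ngram_diversity(repeated))
```

A corpus of identical sentences scores the minimum (each unique n-gram counted once against three copies), while fully distinct sentences score near the maximum of `num_n`.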
`self_repetition_score(dataset, n=4)`

- Parameters:
  - `dataset` (list): List of text strings
  - `n` (int): N-gram size
- Returns: Float; higher = more repetitive
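A simplified view of self-repetition: for each document, measure how much of it also appears elsewhere in the corpus. The sketch below uses the mean fraction of a document's n-grams found in other documents; it is a toy stand-in, and the package's exact formula may differ:

```python
def toy_self_repetition(texts, n=4):
    """Mean fraction of each document's n-grams that also occur in other documents."""
    tokenized = [t.lower().split() for t in texts]
    grams = [
        {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
        for toks in tokenized
    ]
    fractions = []
    for i, g in enumerate(grams):
        others = set().union(*(grams[:i] + grams[i + 1:]))
        if g:
            fractions.append(len(g & others) / len(g))
    return sum(fractions) / len(fractions)

texts = ["the cat sat down", "the cat sat up", "dogs run far away"]
print(toy_self_repetition(texts, n=2))
```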
## Syntactic Diversity Measures

We also provide functions for extracting and analyzing Part-of-Speech (POS) patterns to identify repetitive syntactic structures in your text:

```python
from diversity import (
    extract_patterns,
    match_patterns,
    template_rate,
    templates_per_token
)

texts = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog again.",
    "Suddenly, the quick brown fox leaps swiftly over the sleeping dog."
]

# POS pattern extraction
patterns = extract_patterns(texts, n=4, top_n=5)
print("Top POS patterns:", patterns)
# Example output: [(('DT', 'JJ', 'JJ', 'NN'), 15), ...]

# Match patterns in a single text
matches = match_patterns(texts[2], patterns)
print("Patterns in 3rd sentence:", matches)
# Example output: [('DT JJ JJ NN', 'The quick brown fox'), ...]

# Template rate (fraction of documents containing at least one template)
tr = template_rate(texts, patterns)
print("Template Rate:", tr)

# Templates per token (match count normalized by text length, per output)
tpt = templates_per_token(texts, patterns)
print("Templates per Token:", tpt)
```
`extract_patterns(text, n=5, top_n=100)`

- Parameters:
  - `text` (list of str): Documents to extract syntactic patterns from.
  - `n` (int): N-gram size for POS pattern extraction (default: 5).
  - `top_n` (int): Number of most frequent patterns to keep (default: 100).
- Returns: `dict` mapping POS patterns (e.g., `"DT JJ NN NN"`) to sets of text spans that match the patterns.
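To illustrate the extraction idea, here is a toy version that counts frequent *token* n-grams with `collections.Counter`; the library extracts part-of-speech n-grams instead, and `toy_top_patterns` is a hypothetical name:

```python
from collections import Counter

def toy_top_patterns(texts, n=3, top_n=2):
    """Return the top_n most frequent token n-grams across a corpus."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts.most_common(top_n)

texts = [
    "the quick brown fox jumps",
    "the quick brown fox sleeps",
    "a slow red fox sleeps",
]
print(toy_top_patterns(texts))
```

On this corpus, the two trigrams shared by the first two sentences surface as the most frequent patterns.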
`match_patterns(text, patterns)`

- Parameters:
  - `text` (str): Input text to search for patterns.
  - `patterns` (dict): Dictionary of patterns and their text matches, as returned by `extract_patterns`.
- Returns: `list[tuple]` of `(pattern, text)` pairs showing which syntactic patterns appear in the input and the exact spans that match.
`template_rate(data, templates=None, shard_size=500)`

- Parameters:
  - `data` (list of str): Documents to score.
  - `templates` (dict, optional): Dictionary of templates extracted from the corpus. If `None`, templates are computed using `extract_patterns`.
  - `shard_size` (int): Number of regex patterns to compile per shard (default: 500).
- Returns: `float`; fraction of documents in the corpus that contain at least one template (higher = more templated, lower = more original).
`templates_per_token(data, templates=None, shard_size=500)`

- Parameters:
  - `data` (list of str): Documents to score.
  - `templates` (dict, optional): Dictionary of templates extracted from the corpus. If `None`, templates are computed using `extract_patterns`.
  - `shard_size` (int): Number of regex patterns to compile per shard (default: 500).
- Returns: `float`; per-document ratio of template matches to tokens (higher = more templated per word, lower = more diverse writing).
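The template-rate idea reduces to a simple fraction, sketched here with plain substring templates instead of the library's POS-regex matching (`toy_template_rate` is a hypothetical helper):

```python
def toy_template_rate(docs, templates):
    """Fraction of documents containing at least one template string."""
    hits = sum(1 for doc in docs if any(t in doc for t in templates))
    return hits / len(docs) if docs else 0.0

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog again.",
    "I brewed coffee and read the paper.",
]
print(toy_template_rate(docs, ["The quick brown fox"]))
```

Two of the three documents contain the template, giving a rate of 2/3.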
## Embedding-Based Diversity Measures

You can also measure semantic diversity using embedding-based similarity. These scores compute distances between document embeddings to quantify how spread out or clustered the texts are:

```python
from diversity.embedding import remote_clique, chamfer_dist

texts = [
    "The quick brown fox jumps over the lazy dog.",
    "A swift auburn fox vaulted a sleeping canine.",
    "I brewed coffee and read the paper."
]

# Remote Clique Score
rc = remote_clique(texts, model="Qwen/Qwen3-Embedding-0.6B")
print(f"Remote Clique: {rc:.3f}")

# Chamfer Distance
cd = chamfer_dist(texts, model="Qwen/Qwen3-Embedding-0.6B")
print(f"Chamfer Distance: {cd:.3f}")
```
`remote_clique(data, model='Qwen/Qwen3-Embedding-0.6B', verbose=True, batch_size=64)`

- Parameters:
  - `data` (list of str): Documents to score.
  - `model` (str): HuggingFace/Sentence-Transformers embedding model to use (default: `"Qwen/Qwen3-Embedding-0.6B"`).
  - `verbose` (bool): Whether to show a progress bar during encoding (default: `True`).
  - `batch_size` (int): Batch size for embedding (default: 64).
- Returns: `float`; average mean pairwise cosine distance between documents (higher = more spread out / diverse).
`chamfer_dist(data, model='Qwen/Qwen3-Embedding-0.6B', verbose=True, batch_size=64)`

- Parameters:
  - `data` (list of str): Documents to score.
  - `model` (str): HuggingFace/Sentence-Transformers embedding model to use (default: `"Qwen/Qwen3-Embedding-0.6B"`).
  - `verbose` (bool): Whether to show a progress bar during encoding (default: `True`).
  - `batch_size` (int): Batch size for embedding (default: 64).
- Returns: `float`; average minimum pairwise cosine distance (sensitive to near-duplicates; higher = less redundancy).
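Both scores reduce to simple aggregations of pairwise cosine distances. Here is a minimal sketch on toy 2-D vectors (hypothetical helper names; real usage would embed the texts with a sentence-transformer model first):

```python
import math

def cosine_dist(u, v):
    """Cosine distance: 1 - cos(u, v)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def remote_clique_toy(vecs):
    """Average over docs of the MEAN cosine distance to all other docs."""
    n = len(vecs)
    return sum(
        sum(cosine_dist(vecs[i], vecs[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ) / n

def chamfer_toy(vecs):
    """Average over docs of the MINIMUM cosine distance to any other doc."""
    n = len(vecs)
    return sum(
        min(cosine_dist(vecs[i], vecs[j]) for j in range(n) if j != i)
        for i in range(n)
    ) / n

# Two near-duplicate "documents" plus one outlier
vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(remote_clique_toy(vecs), chamfer_toy(vecs))
```

The near-duplicate pair drags the Chamfer score down sharply (each member's nearest neighbor is almost identical), while the mean-based Remote Clique score stays higher, which is why Chamfer is the more duplicate-sensitive of the two.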
## QUDSim (Question Under Discussion Similarity)

QUDSim aligns document segments based on Questions Under Discussion (QUDs): the implicit questions that segments of text address (QUDsim: Quantifying Discourse Similarities in LLM-Generated Text). This function requires OpenAI API access.
```python
import json
import os

from diversity import qudsim

# Two documents about the same topic
document1 = "In the heart of ancient Macedonia, Philip II ascended to the throne in 359 BC..."
document2 = "The sun beat down on the rough-hewn hills of ancient Macedonia..."

# Requires an OpenAI API key
key = os.environ.get('OPENAI_API_KEY')  # or your API key

# Generate QUD-based alignment
alignment = qudsim([document1, document2], key=key)

# Access alignment results
results = json.loads(alignment)[0]  # First document pair

# View aligned segments
for source_text, target_text in results['aligned_segment_text']:
    print(f"Source: {source_text[:100]}...")
    print(f"Target: {target_text[:100]}...")
    print("---")

# View alignment scores (harmonic mean score matrix)
scores = results['harmonic_mean_scores']
print(f"Alignment scores shape: {len(scores)}x{len(scores[0])}")

# Other available fields:
# - results['source_qud_answers']: QUDs generated for the source document
# - results['target_qud_answers']: QUDs generated for the target document
# - results['aligned_segments']: indices of aligned segments
```
`qudsim(documents, key, config_file=None)`

- Parameters:
  - `documents` (list): List of texts to align
  - `key` (str): OpenAI API key for QUD generation
  - `config_file` (str, optional): Path to a `.yaml` config file. If omitted, the bundled `config.yaml` is used; model, threshold, and other settings are configured there.
- Returns: JSON string containing alignment results for all document pairs
## Citations

If you use this package, please cite:

```bibtex
@misc{shaib2025standardizingmeasurementtextdiversity,
  title={Standardizing the Measurement of Text Diversity: A Tool and a Comparative Analysis of Scores},
  author={Chantal Shaib and Joe Barrow and Jiuding Sun and Alexa F. Siu and Byron C. Wallace and Ani Nenkova},
  year={2025},
  eprint={2403.00553},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2403.00553},
}
```

If you use QUDSim, please also cite:

```bibtex
@inproceedings{namuduri2025qudsim,
  title={{QUD}sim: Quantifying Discourse Similarities in {LLM}-Generated Text},
  author={Ramya Namuduri and Yating Wu and Anshun Asher Zheng and Manya Wadhwa and Greg Durrett and Junyi Jessy Li},
  booktitle={Second Conference on Language Modeling},
  year={2025},
  url={https://openreview.net/forum?id=zFz1BJu211}
}
```
## Requirements

- Python 3.10-3.12
- Core dependencies: `numpy`, `nltk`, `scikit-learn`
- For embedding-based metrics: `sentence-transformers`, `torch`
- For QUDSim: `openai`, `tqdm`
## License

This package is released under the Apache License 2.0.

## Contributing

Contributions are welcome! Please open an issue or submit a pull request on GitHub.