
diversity


A Python toolkit for measuring diversity in text.


Installation

Install via pip:

pip install diversity

Or from source:

git clone https://github.com/cshaib/diversity.git
cd diversity
pip install .

Quick Start

The function compute_all_metrics returns a dictionary of all the diversity metrics described in the sections below, and can optionally format the results as a LaTeX or Markdown table.

from diversity import compute_all_metrics
import json

texts = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog again.",
    "Suddenly, the quick brown fox leaps swiftly over the sleeping dog."
]

# Compute metrics
results = compute_all_metrics(corpus=texts)

# Remove the list of per-document scores for cleaner dict output
clean_results = {k: v for k, v in results.items() 
                if k != "templates_per_token_scores"}
output_content = json.dumps(clean_results, indent=2)

with open('diversity_metrics.json', 'w', encoding='utf-8') as f:
    f.write(output_content)

Lexical Diversity Measures

We provide implementations for Compression Ratio, Homogenization Score, n-gram Diversity Score, and Self-Repetition Score:

from diversity import (
    compression_ratio,
    homogenization_score,
    ngram_diversity_score,
    self_repetition_score,
)

texts = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog again.",
    "Suddenly, the quick brown fox leaps swiftly over the sleeping dog."
]

# Compression ratio
cr = compression_ratio(texts, algorithm='gzip')
print(f"Compression Ratio: {cr:.4f}")

# Homogenization score (ROUGE-L)
hs = homogenization_score(texts, measure='rougel')
print(f"Homogenization (ROUGE-L): {hs:.4f}")

# N-gram diversity
ngd = ngram_diversity_score(texts, num_n=3)
print(f"3-gram Diversity: {ngd:.4f}")

# Self-repetition score
srs = self_repetition_score(texts)
print(f"Self-repetition score: {srs:.4f}")

compression_ratio(texts, algorithm='gzip')

  • Parameters:
    • texts (list): List of text strings
    • algorithm (str): Compression algorithm ('gzip' or 'xz')
  • Returns: Float, higher = more repetitive
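To build intuition for what the compression ratio captures, here is a minimal sketch of the idea (not the package's actual implementation) using Python's standard-library gzip: repetitive corpora compress better, so they yield a higher ratio.

```python
import gzip

def compression_ratio_sketch(texts: list[str]) -> float:
    """Toy illustration: original bytes divided by gzip-compressed bytes."""
    joined = "\n".join(texts).encode("utf-8")
    return len(joined) / len(gzip.compress(joined))

repetitive = ["the cat sat on the mat"] * 10
varied = [
    "the cat sat on the mat",
    "a storm rolled over the harbor",
    "prices fell sharply in early trading",
    "she painted the fence a pale blue",
    "the engine stalled near the bridge",
    "we counted seventeen geese at dawn",
    "his code compiled on the first try",
    "the library closes at nine tonight",
    "rain tapped gently against the glass",
    "they argued about the map for hours",
]

# Repetitive text compresses much better, giving a higher ratio
print(compression_ratio_sketch(repetitive) > compression_ratio_sketch(varied))  # True
```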

homogenization_score(texts, measure='rougel')

  • Parameters:
    • texts (list): List of text strings
    • measure (str): Scoring method ('rougel', 'bleu', or 'bertscore')
  • Returns: Float, higher = more homogeneous
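Conceptually, the homogenization score averages pairwise similarity across the corpus. The sketch below illustrates this with difflib's sequence-matching ratio as a stand-in for ROUGE-L/BLEU/BERTScore (the library's actual similarity measures):

```python
from difflib import SequenceMatcher

def homogenization_sketch(texts: list[str]) -> float:
    """Average pairwise similarity over all ordered document pairs.
    difflib's ratio is a stand-in for ROUGE-L here."""
    sims = [
        SequenceMatcher(None, a, b).ratio()
        for i, a in enumerate(texts)
        for j, b in enumerate(texts)
        if i != j
    ]
    return sum(sims) / len(sims)

near_dupes = ["the quick brown fox jumps", "the quick brown fox leaps"]
distinct = ["the quick brown fox jumps", "stocks closed lower on friday"]

# Near-duplicate texts score as more homogeneous
print(homogenization_sketch(near_dupes) > homogenization_sketch(distinct))  # True
```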

ngram_diversity_score(texts, num_n=4)

  • Parameters:
    • texts (list): List of text strings
    • num_n (int): Max n-gram size to evaluate up to
  • Returns: Float, higher = more diverse
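One common definition of n-gram diversity, sketched below, is the ratio of unique to total n-grams, summed over n = 1..num_n; the package's exact tokenization and formula may differ.

```python
def ngram_diversity_sketch(texts: list[str], num_n: int = 3) -> float:
    """Sum over n of (unique n-grams / total n-grams) across the corpus."""
    tokens_per_doc = [t.lower().split() for t in texts]
    score = 0.0
    for n in range(1, num_n + 1):
        ngrams = [
            tuple(toks[i:i + n])
            for toks in tokens_per_doc
            for i in range(len(toks) - n + 1)
        ]
        if ngrams:
            score += len(set(ngrams)) / len(ngrams)
    return score

# Every unigram and bigram is unique, so each n contributes 1.0
print(ngram_diversity_sketch(["a b c d", "e f g h"], num_n=2))  # 2.0
```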

self_repetition_score(dataset, n=4)

  • Parameters:
    • dataset (list): List of text strings
    • n (int): N-gram size
  • Returns: Float, higher = more repetitive
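A rough sketch of the idea behind self-repetition, assuming the published variant that, for each document, takes log(1 + number of its n-grams appearing in at least one other document) and averages over documents; the package's implementation may differ in detail.

```python
from math import log

def self_repetition_sketch(dataset: list[str], n: int = 4) -> float:
    """Average over docs of log(1 + count of n-grams shared with other docs)."""
    docs = [d.lower().split() for d in dataset]
    grams = [
        {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
        for toks in docs
    ]
    total = 0.0
    for i, g in enumerate(grams):
        shared = sum(
            1 for ng in g
            if any(ng in other for j, other in enumerate(grams) if j != i)
        )
        total += log(1 + shared)
    return total / len(dataset)

# Two docs sharing a 4-gram score above zero; fully distinct docs score zero
print(self_repetition_sketch(["a b c d e", "a b c d f", "x y z w v"], n=4))
```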

Syntactic Diversity Measures

We also provide functions for extracting and analyzing Part-of-Speech (POS) patterns to identify repetitive syntactic structures in your text:

from diversity import (
    extract_patterns,
    match_patterns,
    template_rate,
    templates_per_token
)

texts = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog again.",
    "Suddenly, the quick brown fox leaps swiftly over the sleeping dog."
]

# POS pattern extraction
patterns = extract_patterns(texts, n=4, top_n=5)
print("Top POS patterns:", patterns)
# Patterns map POS sequences (e.g., 'DT JJ JJ NN') to matching text spans

# Match patterns in a single text
matches = match_patterns(texts[2], patterns)
print("Patterns in 3rd sentence:", matches)
# Example output: [('DT JJ JJ NN', 'The quick brown fox'), ...]

# Template rate (fraction of documents containing at least one template)
tr = template_rate(texts, patterns)
print("Template Rate:", tr)

# Templates-per-token (normalized by text length, per output) 
tpt = templates_per_token(texts, patterns)
print("Templates per Token:", tpt)

extract_patterns(text, n=5, top_n=100)

  • text (list of str): Documents to extract syntactic patterns from.

  • n (int): N-gram size for POS pattern extraction (default: 5).

  • top_n (int): Number of most frequent patterns to keep (default: 100).

  • Returns: dict — dictionary mapping POS patterns (e.g., "DT JJ NN NN") to sets of text spans that match the patterns

match_patterns(text, patterns)

  • text (str): Input text to search for patterns.

  • patterns (dict): Dictionary of patterns and their text matches as returned by extract_patterns.

  • Returns: list[tuple] — list of (pattern, text) pairs showing which syntactic patterns appear in the input and the exact spans that match

template_rate(data, templates=None, shard_size=500)

  • data (list of str): Documents to score.

  • templates (dict, optional): Dictionary of templates extracted from the corpus. If None, templates are computed using extract_patterns.

  • shard_size (int): Number of regex patterns to compile per shard (default: 500).

  • Returns: float — fraction of documents in the corpus that contain at least one template (higher = more templated, lower = more original).

templates_per_token(data, templates=None, shard_size=500)

  • data (list of str): Documents to score.

  • templates (dict, optional): Dictionary of templates extracted from the corpus. If None, templates are computed using extract_patterns.

  • shard_size (int): Number of regex patterns to compile per shard (default: 500).

  • Returns: float — per-document ratio of template matches to tokens (higher = more templated per word, lower = more diverse writing).
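The template rate is easiest to see with a toy sketch. Here literal substrings stand in for the POS-pattern regexes the library actually compiles; the score is simply the fraction of documents that match at least one template.

```python
def template_rate_sketch(data: list[str], templates: list[str]) -> float:
    """Fraction of documents containing at least one template.
    Literal substrings stand in for POS-pattern regexes here."""
    hits = sum(1 for doc in data if any(t in doc for t in templates))
    return hits / len(data)

docs = [
    "the quick brown fox jumps",
    "a slow green turtle crawls",
    "no match here",
]

# 2 of 3 documents contain a template
print(template_rate_sketch(docs, ["quick brown fox", "green turtle"]))  # 0.666...
```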


Embedding-Based Diversity Measures

You can also measure semantic diversity using embedding-based similarity. These scores compute distances between document embeddings to quantify how spread out or clustered the texts are:

from diversity.embedding import remote_clique, chamfer_dist

texts = [
    "The quick brown fox jumps over the lazy dog.",
    "A swift auburn fox vaulted a sleeping canine.",
    "I brewed coffee and read the paper."
]

# Remote Clique Score
rc = remote_clique(texts, model="Qwen/Qwen3-Embedding-0.6B")
print(f"Remote Clique: {rc:.3f}")

# Chamfer Distance
cd = chamfer_dist(texts, model="Qwen/Qwen3-Embedding-0.6B")
print(f"Chamfer Distance: {cd:.3f}")

remote_clique(data, model='Qwen/Qwen3-Embedding-0.6B', verbose=True, batch_size=64)

  • data (list of str): Documents to score.

  • model (str): HuggingFace/Sentence-Transformers embedding model to use (default: "Qwen/Qwen3-Embedding-0.6B").

  • verbose (bool): Whether to show a progress bar during encoding (default: True).

  • batch_size (int): Batch size for embedding (default: 64).

  • Returns: float — average mean pairwise cosine distance between documents (higher = more spread out / diverse).

chamfer_dist(data, model='Qwen/Qwen3-Embedding-0.6B', verbose=True, batch_size=64)

  • data (list of str): Documents to score.

  • model (str): HuggingFace/Sentence-Transformers embedding model to use (default: "Qwen/Qwen3-Embedding-0.6B").

  • verbose (bool): Whether to show a progress bar during encoding (default: True).

  • batch_size (int): Batch size for embedding (default: 64).

  • Returns: float — average minimum pairwise cosine distance (sensitive to near-duplicates; higher = less redundancy).
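The difference between the two scores is clearest on toy vectors. The sketch below (pure Python, standing in for the library's embedding-based computation) shows that near-duplicate points drag the chamfer score down while barely affecting the remote clique score:

```python
from math import sqrt

def cosine_dist(a: list[float], b: list[float]) -> float:
    """Cosine distance between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def remote_clique_sketch(vecs: list[list[float]]) -> float:
    """Average of each vector's mean cosine distance to all others."""
    n = len(vecs)
    return sum(
        sum(cosine_dist(v, w) for j, w in enumerate(vecs) if j != i) / (n - 1)
        for i, v in enumerate(vecs)
    ) / n

def chamfer_sketch(vecs: list[list[float]]) -> float:
    """Average of each vector's cosine distance to its nearest neighbor."""
    return sum(
        min(cosine_dist(v, w) for j, w in enumerate(vecs) if j != i)
        for i, v in enumerate(vecs)
    ) / len(vecs)

embs = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]  # two near-duplicates plus an outlier

# Near-duplicates pull the chamfer score toward zero
print(chamfer_sketch(embs) < remote_clique_sketch(embs))  # True
```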


QUDSim (Question Under Discussion Similarity)

QUDSim aligns document segments based on Questions Under Discussion (QUDs), the implicit questions that segments of text address (QUDsim: Quantifying Discourse Similarities in LLM-Generated Text).

This function requires OpenAI API access.

from diversity import qudsim

# Two documents about the same topic
document1 = "In the heart of ancient Macedonia, Philip II ascended to the throne in 359 BC..."
document2 = "The sun beat down on the rough-hewn hills of ancient Macedonia..."

# Requires OpenAI API key
import os
key = os.environ.get('OPENAI_API_KEY')  # or your API key

# Generate QUD-based alignment
alignment = qudsim([document1, document2], key=key)

# Access alignment results
import json
results = json.loads(alignment)[0]  # First document pair

# View aligned segments
for source_text, target_text in results['aligned_segment_text']:
    print(f"Source: {source_text[:100]}...")
    print(f"Target: {target_text[:100]}...")
    print("---")

# View alignment scores (harmonic mean scores matrix)
scores = results['harmonic_mean_scores']
print(f"Alignment scores shape: {len(scores)}x{len(scores[0])}")

# Other available fields:
# - results['source_qud_answers']: QUDs generated for source document
# - results['target_qud_answers']: QUDs generated for target document
# - results['aligned_segments']: Indices of aligned segments

qudsim(documents, key, config_file=None)

  • Parameters:
    • documents (list): List of texts to align
    • key (str): OpenAI API key for QUD generation
    • config_file (str, optional): Path to a .yaml config file. If omitted, uses the bundled config.yaml. Model, threshold, and other settings are set there.
  • Returns: JSON string containing alignment results for all document pairs

Citation(s)

If you use this package, please cite:

@misc{shaib2025standardizingmeasurementtextdiversity,
  title={Standardizing the Measurement of Text Diversity: A Tool and a Comparative Analysis of Scores},
  author={Chantal Shaib and Joe Barrow and Jiuding Sun and Alexa F. Siu and Byron C. Wallace and Ani Nenkova},
  year={2025},
  eprint={2403.00553},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2403.00553},
}

If you use QUDSim, please also cite:

@inproceedings{
namuduri2025qudsim,
title={{QUD}sim: Quantifying Discourse Similarities in {LLM}-Generated Text},
author={Ramya Namuduri and Yating Wu and Anshun Asher Zheng and Manya Wadhwa and Greg Durrett and Junyi Jessy Li},
booktitle={Second Conference on Language Modeling},
year={2025},
url={https://openreview.net/forum?id=zFz1BJu211}
}

Requirements

  • Python 3.10-3.12
  • Core dependencies:
    • numpy
    • nltk
    • scikit-learn
  • For embedding-based metrics:
    • sentence-transformers
    • torch
  • For QUDSim:
    • openai
    • tqdm

License

This package is released under the Apache License 2.0.


Contributing

Contributions are welcome!
Please open an issue or submit a pull request on GitHub.

