
diversity


A Python toolkit for measuring diversity in text.


Installation

Install via pip:

pip install diversity

Or from source:

git clone https://github.com/cshaib/diversity.git
cd diversity
pip install .

Quick Start

The function compute_all_metrics returns a dictionary of all the diversity metrics described in the sections below, and can optionally format the results as a LaTeX or Markdown table.

from diversity import compute_all_metrics
import json

texts = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog again.",
    "Suddenly, the quick brown fox leaps swiftly over the sleeping dog."
]

# Compute metrics
results = compute_all_metrics(corpus=texts)

# Remove the list of per-document scores for cleaner dict output
clean_results = {k: v for k, v in results.items() 
                if k != "templates_per_token_scores"}
output_content = json.dumps(clean_results, indent=2)

with open('diversity_metrics.json', 'w', encoding='utf-8') as f:
    f.write(output_content)

Lexical Diversity Measures

We provide implementations for Compression Ratio, Homogenization Score, n-gram Diversity Score, and Self-Repetition Score:

from diversity import (
    compression_ratio,
    homogenization_score,
    ngram_diversity_score,
    self_repetition_score,
)

texts = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog again.",
    "Suddenly, the quick brown fox leaps swiftly over the sleeping dog."
]

# Compression ratio
cr = compression_ratio(texts, algorithm='gzip')
print(f"Compression Ratio: {cr:.4f}")

# Homogenization score (ROUGE-L)
hs = homogenization_score(texts, measure='rougel')
print(f"Homogenization (ROUGE-L): {hs:.4f}")

# N-gram diversity
ngd = ngram_diversity_score(texts, num_n=3)
print(f"3-gram Diversity: {ngd:.4f}")

# Self-repetition score
srs = self_repetition_score(texts)
print(f"Self-repetition score: {srs:.4f}")

compression_ratio(texts, algorithm='gzip')

  • Parameters:
    • texts (list): List of text strings
    • algorithm (str): Compression algorithm ('gzip' or 'xz')
  • Returns: Float, higher = more repetitive
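To build intuition for what the compression ratio captures, here is a minimal sketch of the idea (not the package's actual implementation) using Python's standard-library gzip: repetitive corpora compress better, so they yield a higher ratio.

```python
import gzip

def compression_ratio_sketch(texts: list[str]) -> float:
    """Toy illustration: original bytes divided by gzip-compressed bytes."""
    joined = "\n".join(texts).encode("utf-8")
    return len(joined) / len(gzip.compress(joined))

repetitive = ["the cat sat on the mat"] * 10
varied = [
    "the cat sat on the mat",
    "a storm rolled over the harbor",
    "prices fell sharply in early trading",
    "she painted the fence a pale blue",
    "the engine stalled near the bridge",
    "we counted seventeen geese at dawn",
    "his code compiled on the first try",
    "the library closes at nine tonight",
    "rain tapped gently against the glass",
    "they argued about the map for hours",
]

# Repetitive text compresses much better, giving a higher ratio
print(compression_ratio_sketch(repetitive) > compression_ratio_sketch(varied))  # True
```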

homogenization_score(texts, measure='rougel')

  • Parameters:
    • texts (list): List of text strings
    • measure (str): Scoring method ('rougel', 'bleu', or 'bertscore')
  • Returns: Float, higher = more homogeneous
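Conceptually, the homogenization score averages pairwise similarity across the corpus. The sketch below illustrates this with difflib's sequence-matching ratio as a stand-in for ROUGE-L/BLEU/BERTScore (the library's actual similarity measures):

```python
from difflib import SequenceMatcher

def homogenization_sketch(texts: list[str]) -> float:
    """Average pairwise similarity over all ordered document pairs.
    difflib's ratio is a stand-in for ROUGE-L here."""
    sims = [
        SequenceMatcher(None, a, b).ratio()
        for i, a in enumerate(texts)
        for j, b in enumerate(texts)
        if i != j
    ]
    return sum(sims) / len(sims)

near_dupes = ["the quick brown fox jumps", "the quick brown fox leaps"]
distinct = ["the quick brown fox jumps", "stocks closed lower on friday"]

# Near-duplicate texts score as more homogeneous
print(homogenization_sketch(near_dupes) > homogenization_sketch(distinct))  # True
```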

ngram_diversity_score(texts, num_n=4)

  • Parameters:
    • texts (list): List of text strings
    • num_n (int): Max n-gram size to evaluate up to
  • Returns: Float, higher = more diverse
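One common definition of n-gram diversity, sketched below, is the ratio of unique to total n-grams, summed over n = 1..num_n; the package's exact tokenization and formula may differ.

```python
def ngram_diversity_sketch(texts: list[str], num_n: int = 3) -> float:
    """Sum over n of (unique n-grams / total n-grams) across the corpus."""
    tokens_per_doc = [t.lower().split() for t in texts]
    score = 0.0
    for n in range(1, num_n + 1):
        ngrams = [
            tuple(toks[i:i + n])
            for toks in tokens_per_doc
            for i in range(len(toks) - n + 1)
        ]
        if ngrams:
            score += len(set(ngrams)) / len(ngrams)
    return score

# Every unigram and bigram is unique, so each n contributes 1.0
print(ngram_diversity_sketch(["a b c d", "e f g h"], num_n=2))  # 2.0
```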

self_repetition_score(dataset, n=4)

  • Parameters:
    • dataset (list): List of text strings
    • n (int): N-gram size
  • Returns: Float, higher = more repetitive
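A rough sketch of the idea behind self-repetition, assuming the published variant that, for each document, takes log(1 + number of its n-grams appearing in at least one other document) and averages over documents; the package's implementation may differ in detail.

```python
from math import log

def self_repetition_sketch(dataset: list[str], n: int = 4) -> float:
    """Average over docs of log(1 + count of n-grams shared with other docs)."""
    docs = [d.lower().split() for d in dataset]
    grams = [
        {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
        for toks in docs
    ]
    total = 0.0
    for i, g in enumerate(grams):
        shared = sum(
            1 for ng in g
            if any(ng in other for j, other in enumerate(grams) if j != i)
        )
        total += log(1 + shared)
    return total / len(dataset)

# Two docs sharing a 4-gram score above zero; fully distinct docs score zero
print(self_repetition_sketch(["a b c d e", "a b c d f", "x y z w v"], n=4))
```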

Syntactic Diversity Measures

We also provide functions for extracting and analyzing Part-of-Speech (POS) patterns to identify repetitive syntactic structures in your text:

from diversity import (
    extract_patterns,
    match_patterns,
    template_rate,
    templates_per_token
)

texts = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog again.",
    "Suddenly, the quick brown fox leaps swiftly over the sleeping dog."
]

# POS pattern extraction
patterns = extract_patterns(texts, n=4, top_n=5)
print("Top POS patterns:", patterns)
# Patterns map POS sequences (e.g., 'DT JJ JJ NN') to matching text spans

# Match patterns in a single text
matches = match_patterns(texts[2], patterns)
print("Patterns in 3rd sentence:", matches)
# Example output: [('DT JJ JJ NN', 'The quick brown fox'), ...]

# Template rate (fraction of documents containing at least one template)
tr = template_rate(texts, patterns)
print("Template Rate:", tr)

# Templates-per-token (normalized by text length, per output) 
tpt = templates_per_token(texts, patterns)
print("Templates per Token:", tpt)

extract_patterns(text, n=5, top_n=100)

  • text (list of str): Documents to extract syntactic patterns from.

  • n (int): N-gram size for POS pattern extraction (default: 5).

  • top_n (int): Number of most frequent patterns to keep (default: 100).

  • Returns: dict — dictionary mapping POS patterns (e.g., "DT JJ NN NN") to sets of text spans that match the patterns

match_patterns(text, patterns)

  • text (str): Input text to search for patterns.

  • patterns (dict): Dictionary of patterns and their text matches as returned by extract_patterns.

  • Returns: list[tuple] — list of (pattern, text) pairs showing which syntactic patterns appear in the input and the exact spans that match

template_rate(data, templates=None, shard_size=500)

  • data (list of str): Documents to score.

  • templates (dict, optional): Dictionary of templates extracted from the corpus. If None, templates are computed using extract_patterns.

  • shard_size (int): Number of regex patterns to compile per shard (default: 500).

  • Returns: float — fraction of documents in the corpus that contain at least one template (higher = more templated, lower = more original).

templates_per_token(data, templates=None, shard_size=500)

  • data (list of str): Documents to score.

  • templates (dict, optional): Dictionary of templates extracted from the corpus. If None, templates are computed using extract_patterns.

  • shard_size (int): Number of regex patterns to compile per shard (default: 500).

  • Returns: float — per-document ratio of template matches to tokens (higher = more templated per word, lower = more diverse writing).
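The template rate is easiest to see with a toy sketch. Here literal substrings stand in for the POS-pattern regexes the library actually compiles; the score is simply the fraction of documents that match at least one template.

```python
def template_rate_sketch(data: list[str], templates: list[str]) -> float:
    """Fraction of documents containing at least one template.
    Literal substrings stand in for POS-pattern regexes here."""
    hits = sum(1 for doc in data if any(t in doc for t in templates))
    return hits / len(data)

docs = [
    "the quick brown fox jumps",
    "a slow green turtle crawls",
    "no match here",
]

# 2 of 3 documents contain a template
print(template_rate_sketch(docs, ["quick brown fox", "green turtle"]))  # 0.666...
```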


Embedding-Based Diversity Measures

You can also measure semantic diversity using embedding-based similarity. These scores compute distances between document embeddings to quantify how spread out or clustered the texts are:

from diversity.embedding import remote_clique, chamfer_dist

texts = [
    "The quick brown fox jumps over the lazy dog.",
    "A swift auburn fox vaulted a sleeping canine.",
    "I brewed coffee and read the paper."
]

# Remote Clique Score
rc = remote_clique(texts, model="Qwen/Qwen3-Embedding-0.6B")
print(f"Remote Clique: {rc:.3f}")

# Chamfer Distance
cd = chamfer_dist(texts, model="Qwen/Qwen3-Embedding-0.6B")
print(f"Chamfer Distance: {cd:.3f}")

remote_clique(data, model='Qwen/Qwen3-Embedding-0.6B', verbose=True, batch_size=64)

  • data (list of str): Documents to score.

  • model (str): HuggingFace/Sentence-Transformers embedding model to use (default: "Qwen/Qwen3-Embedding-0.6B").

  • verbose (bool): Whether to show a progress bar during encoding (default: True).

  • batch_size (int): Batch size for embedding (default: 64).

  • Returns: float — average mean pairwise cosine distance between documents (higher = more spread out / diverse).

chamfer_dist(data, model='Qwen/Qwen3-Embedding-0.6B', verbose=True, batch_size=64)

  • data (list of str): Documents to score.

  • model (str): HuggingFace/Sentence-Transformers embedding model to use (default: "Qwen/Qwen3-Embedding-0.6B").

  • verbose (bool): Whether to show a progress bar during encoding (default: True).

  • batch_size (int): Batch size for embedding (default: 64).

  • Returns: float — average minimum pairwise cosine distance (sensitive to near-duplicates; higher = less redundancy).
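The difference between the two scores is clearest on toy vectors. The sketch below (pure Python, standing in for the library's embedding-based computation) shows that near-duplicate points drag the chamfer score down while barely affecting the remote clique score:

```python
from math import sqrt

def cosine_dist(a: list[float], b: list[float]) -> float:
    """Cosine distance between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def remote_clique_sketch(vecs: list[list[float]]) -> float:
    """Average of each vector's mean cosine distance to all others."""
    n = len(vecs)
    return sum(
        sum(cosine_dist(v, w) for j, w in enumerate(vecs) if j != i) / (n - 1)
        for i, v in enumerate(vecs)
    ) / n

def chamfer_sketch(vecs: list[list[float]]) -> float:
    """Average of each vector's cosine distance to its nearest neighbor."""
    return sum(
        min(cosine_dist(v, w) for j, w in enumerate(vecs) if j != i)
        for i, v in enumerate(vecs)
    ) / len(vecs)

embs = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]  # two near-duplicates plus an outlier

# Near-duplicates pull the chamfer score toward zero
print(chamfer_sketch(embs) < remote_clique_sketch(embs))  # True
```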


QUDSim (Question Under Discussion Similarity)

QUDSim aligns document segments based on Questions Under Discussion (QUDs), the implicit questions that segments of text address (QUDsim: Quantifying Discourse Similarities in LLM-Generated Text).

This function requires OpenAI API access.

from diversity import qudsim

# Two documents about the same topic
document1 = "In the heart of ancient Macedonia, Philip II ascended to the throne in 359 BC..."
document2 = "The sun beat down on the rough-hewn hills of ancient Macedonia..."

# Requires OpenAI API key
import os
key = os.environ.get('OPENAI_API_KEY')  # or your API key

# Generate QUD-based alignment
alignment = qudsim([document1, document2], key=key)

# Access alignment results
import json
results = json.loads(alignment)[0]  # First document pair

# View aligned segments
for source_text, target_text in results['aligned_segment_text']:
    print(f"Source: {source_text[:100]}...")
    print(f"Target: {target_text[:100]}...")
    print("---")

# View alignment scores (harmonic mean scores matrix)
scores = results['harmonic_mean_scores']
print(f"Alignment scores shape: {len(scores)}x{len(scores[0])}")

# Other available fields:
# - results['source_qud_answers']: QUDs generated for source document
# - results['target_qud_answers']: QUDs generated for target document
# - results['aligned_segments']: Indices of aligned segments

qudsim(documents, key, config_file=None)

  • Parameters:
    • documents (list): List of texts to align
    • key (str): OpenAI API key for QUD generation
    • config_file (str, optional): Path to a .yaml config file. If omitted, uses the bundled config.yaml. Model, threshold, and other settings are set there.
  • Returns: JSON string containing alignment results for all document pairs

Citation(s)

If you use this package, please cite:

@misc{shaib2025standardizingmeasurementtextdiversity,
  title={Standardizing the Measurement of Text Diversity: A Tool and a Comparative Analysis of Scores},
  author={Chantal Shaib and Joe Barrow and Jiuding Sun and Alexa F. Siu and Byron C. Wallace and Ani Nenkova},
  year={2025},
  eprint={2403.00553},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2403.00553},
}

If you use QUDSim, please also cite:

@inproceedings{
namuduri2025qudsim,
title={{QUD}sim: Quantifying Discourse Similarities in {LLM}-Generated Text},
author={Ramya Namuduri and Yating Wu and Anshun Asher Zheng and Manya Wadhwa and Greg Durrett and Junyi Jessy Li},
booktitle={Second Conference on Language Modeling},
year={2025},
url={https://openreview.net/forum?id=zFz1BJu211}
}

Requirements

  • Python 3.10-3.12
  • Core dependencies:
    • numpy
    • nltk
    • scikit-learn
  • For embedding-based metrics:
    • sentence-transformers
    • torch
  • For QUDSim:
    • openai
    • tqdm

License

This package is released under the Apache License 2.0.


Contributing

Contributions are welcome!
Please open an issue or submit a pull request on GitHub.

