Linguistic diversity metrics using similarity-sensitive Hill numbers

Linguistic Diversity

Python 3.9+ License: MIT

Modernized, efficient implementation of linguistic diversity metrics using similarity-sensitive Hill numbers.

This library measures various kinds of linguistic diversity using similarity-sensitive Hill numbers (SSHN). Originally adapted from the study of species diversity in ecology, SSHNs characterize the effective number of species in a population. In NLP, "species" are linguistic units (words, parse trees, etc.) and the "population" is a corpus of documents.

For example, a token semantic diversity of 9 means the corpus contains, in effect, about 9 distinct semantic concepts once the similarity between tokens is accounted for.

Features

  • Modern Python (3.9+): Type hints, dataclasses, and modern best practices
  • Performance optimized: FAISS-accelerated similarity computations, model caching, vectorized operations
  • Updated dependencies: NumPy 2.x, Pandas 2.x, latest transformers
  • Multiple diversity dimensions:
    • Semantic: Token and document-level semantic diversity using transformers
    • Syntactic: Dependency and constituency parse tree diversity
    • Morphological: Part-of-speech sequence diversity
    • Phonological: Rhythmic and phonemic pattern diversity
    • Universal: Unified metric combining all dimensions into a single score

Quick Start

Installation

pip install linguistic-diversity

For development:

git clone https://github.com/fabriceyhc/linguistic-diversity
cd linguistic-diversity
pip install -e ".[dev]"

Basic Usage

from linguistic_diversity import TokenSemantics, DocumentSemantics

# Example corpora
corpus1 = [
    'one massive earth',
    'an enormous globe',
    'the colossal world'
]  # Paraphrases: near-identical meanings, so low semantic diversity

corpus2 = [
    'basic human right',
    'you were right',
    'make a right'
]  # Lower token-level diversity: the word "right" recurs (though in different senses)

# Token-level semantic diversity
token_metric = TokenSemantics()
print(f"Corpus 1 token diversity: {token_metric(corpus1):.2f}")
print(f"Corpus 2 token diversity: {token_metric(corpus2):.2f}")

# Document-level semantic diversity
doc_metric = DocumentSemantics()
print(f"Corpus 1 document diversity: {doc_metric(corpus1):.2f}")
print(f"Corpus 2 document diversity: {doc_metric(corpus2):.2f}")

Configuration

All metrics accept configuration dictionaries:

from linguistic_diversity import TokenSemantics

# Custom configuration
config = {
    'model_name': 'roberta-base',  # Use RoBERTa instead of BERT
    'q': 2.0,                       # Diversity order (higher = less sensitive to rare species)
    'normalize': True,              # Normalize by number of species
    'batch_size': 32,               # Larger batches for faster processing
    'use_cuda': True,               # Use GPU if available
    'remove_stopwords': True,       # Filter out stopwords
    'verbose': True                 # Show progress bars
}

metric = TokenSemantics(config)
diversity = metric(corpus1)

What's New in 1.0

This is a complete modernization of the original TextDiversity library with significant improvements:

Performance Improvements

  • 3-5x faster similarity computation using optimized FAISS operations
  • Model caching: Models loaded once and reused across metric instances
  • Vectorized operations: Replaced nested Python loops with NumPy operations
  • Batch processing: Optimized batch sizes for GPU utilization
  • Lazy loading: Models and dependencies loaded only when needed
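As an illustration of the vectorized-operations point above (a sketch of the general technique, not this library's actual internals), a full pairwise cosine-similarity matrix over n embeddings can be computed with a single matrix product instead of nested Python loops:

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))  # e.g. 4 token embeddings of dimension 8

# Normalize rows to unit length; one matrix product then yields all
# n x n cosine similarities at once, with no explicit Python loops.
unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
Z = unit @ unit.T  # Z[i, j] = cosine similarity between embeddings i and j

assert np.allclose(np.diag(Z), 1.0)  # every vector matches itself exactly
assert np.allclose(Z, Z.T)           # similarity is symmetric
```

FAISS accelerates the same computation for large n by performing the inner products in optimized native code.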

Code Quality

  • Type hints throughout for better IDE support and type checking
  • Modern Python: Dataclasses, f-strings, pathlib, type annotations
  • Better error handling with informative messages
  • Comprehensive docstrings in Google style
  • PEP 561 compliance with py.typed marker

Updated Dependencies

  • NumPy 1.24+ (compatible with FAISS)
  • Pandas 2.0+ (performance improvements)
  • Latest transformers (4.35+)
  • Python 3.9+ (modern language features)

Developer Experience

  • Pre-commit hooks with Black, Ruff, MyPy
  • Comprehensive test suite with pytest
  • GitHub Actions CI/CD
  • Type checking with MyPy
  • Code coverage reporting

Available Metrics

Semantic Diversity

TokenSemantics: Diversity of contextualized token embeddings

from linguistic_diversity import TokenSemantics

metric = TokenSemantics({'model_name': 'bert-base-uncased'})
diversity = metric(corpus)

DocumentSemantics: Diversity of document-level embeddings

from linguistic_diversity import DocumentSemantics

metric = DocumentSemantics({'model_name': 'all-mpnet-base-v2'})
diversity = metric(corpus)

Syntactic Diversity

DependencyParse: Diversity of dependency parse tree structures

from linguistic_diversity import DependencyParse

# Fast: using graph embeddings
metric = DependencyParse({'similarity_type': 'ldp'})
diversity = metric(corpus)

# Exact: using tree edit distance (slow)
metric = DependencyParse({'similarity_type': 'tree_edit_distance'})
diversity = metric(corpus)

ConstituencyParse: Diversity of constituency (phrase structure) parse trees

from linguistic_diversity import ConstituencyParse

metric = ConstituencyParse({'similarity_type': 'ldp'})
diversity = metric(corpus)

Note: Constituency parsing requires benepar. Install with: pip install linguistic-diversity[syntactic]

Morphological Diversity

PartOfSpeechSequence: Diversity of POS tag sequences using biological sequence alignment

from linguistic_diversity import PartOfSpeechSequence

metric = PartOfSpeechSequence()
diversity = metric(corpus)

Phonological Diversity

Rhythmic: Diversity of rhythmic patterns (stress and syllable weight)

from linguistic_diversity import Rhythmic

metric = Rhythmic()
diversity = metric(corpus)

Phonemic: Diversity of phoneme sequences (IPA representation)

from linguistic_diversity import Phonemic

# Default: uses g2p_en (pure Python, no system dependencies)
metric = Phonemic()
diversity = metric(corpus)

# Optional: use phonemizer backend (requires espeak-ng)
metric = Phonemic({'backend': 'phonemizer'})
diversity = metric(corpus)

Note: Phonological metrics require additional dependencies. Install with: pip install linguistic-diversity[phonological]

Universal Diversity

UniversalLinguisticDiversity: Unified metric combining all dimensions

from linguistic_diversity import UniversalLinguisticDiversity

# Default balanced configuration
metric = UniversalLinguisticDiversity()
diversity = metric(corpus)

# Get detailed breakdown
detailed = metric.get_detailed_scores(corpus)
print(f"Universal: {detailed['universal']:.2f}")
print(f"By branch: {detailed['branches']}")

The universal metric combines all 7 metrics across 4 linguistic branches (semantic, syntactic, morphological, phonological) into a single comprehensive diversity score. It uses hierarchical aggregation: the geometric mean of the metrics within each branch, then a weighted combination of the branch scores.
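That aggregation scheme can be sketched as follows. The branch names, per-metric scores, and weights below are purely illustrative, and the cross-branch step is assumed here to be a weighted arithmetic mean:

```python
import numpy as np

# Hypothetical per-metric diversity scores, grouped by linguistic branch
branches = {
    "semantic":      [9.1, 7.4],   # e.g. token-level, document-level
    "syntactic":     [5.2, 6.0],   # e.g. dependency, constituency
    "morphological": [4.3],
    "phonological":  [3.8, 4.1],
}
weights = {"semantic": 0.4, "syntactic": 0.3,
           "morphological": 0.15, "phonological": 0.15}

# Geometric mean within each branch (computed as exp of the mean log)...
branch_scores = {b: float(np.exp(np.mean(np.log(s))))
                 for b, s in branches.items()}
# ...then a weighted combination across branches
universal = sum(weights[b] * branch_scores[b] for b in branches)
print(branch_scores)
print(universal)
```

The geometric mean within a branch keeps one inflated metric from dominating the branch score, while the cross-branch weights let a preset emphasize, say, semantics over phonology.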

Preset Configurations:

from linguistic_diversity import get_preset_config

# Semantic-focused (for content analysis)
config = get_preset_config("semantic_focus")
metric = UniversalLinguisticDiversity(config)

# Available presets: balanced, semantic_focus, structural_focus, minimal, conservative

See UNIVERSAL_METRIC_GUIDE.md for detailed documentation.

System Requirements

Required

  • Python 3.9 or higher
  • For GPU acceleration: CUDA-compatible GPU with appropriate drivers

Optional System Dependencies

All metrics work with pure Python packages - no system dependencies required!

However, if you want to use the phonemizer backend for Phonemic diversity (instead of the default g2p_en), you'll need espeak-ng:

Linux:

sudo apt-get install espeak-ng
pip install phonemizer  # then use Phonemic({'backend': 'phonemizer'})

macOS:

brew install espeak-ng
pip install phonemizer  # then use Phonemic({'backend': 'phonemizer'})

Windows:

  • Download from espeak-ng releases
  • Install espeak-ng-X64.msi or espeak-ng-X86.msi
  • Set environment variable: PHONEMIZER_ESPEAK_LIBRARY=C:\Program Files\eSpeak NG\libespeak-ng.dll
  • pip install phonemizer then use Phonemic({'backend': 'phonemizer'})

Note: The default g2p_en backend for Phonemic is pure Python and works everywhere without system dependencies.

Theory: Similarity-Sensitive Hill Numbers

Hill numbers provide a unified framework for measuring diversity that accounts for both:

  1. Species richness (how many different types exist)
  2. Species similarity (how similar the types are to each other)

The diversity formula is:

D = (Σ p_i (Σ Z_ij p_j)^(q-1))^(1/(1-q))

Where:

  • p: Abundance distribution over species
  • Z: Similarity matrix between species
  • q: Diversity order parameter (0 = richness, 1 = Shannon, 2 = Simpson, ∞ = Berger-Parker)

At q = 1 (the default) the formula is undefined directly and is taken as its limit, D = exp(-Σ_i p_i ln(Σ_j Z_ij p_j)), the similarity-weighted analogue of Shannon diversity: the effective number of species after accounting for their pairwise similarity.
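The formula translates directly into NumPy. This is a minimal sketch of the Hill-number computation itself, not this library's implementation; the q = 1 case is handled via its exponential limit:

```python
import numpy as np

def hill_diversity(p, Z, q=1.0):
    """Similarity-sensitive Hill number of order q (Leinster & Cobbold, 2012).

    p: abundance distribution over species (1-D, sums to 1)
    Z: similarity matrix with Z[i, j] in [0, 1] and Z[i, i] = 1
    """
    p = np.asarray(p, dtype=float)
    Zp = Z @ p  # "ordinariness": similarity-weighted abundance of each species
    if np.isclose(q, 1.0):
        # q = 1 is a limit: exp of the similarity-weighted Shannon entropy
        return float(np.exp(-np.sum(p * np.log(Zp))))
    return float(np.sum(p * Zp ** (q - 1)) ** (1.0 / (1.0 - q)))

# Three mutually dissimilar species (Z = identity), equal abundances:
p = np.ones(3) / 3
print(hill_diversity(p, np.eye(3)))  # effective number equals richness (3)

# If two species are perfectly similar, the pool behaves like a merged
# two-species distribution [2/3, 1/3], so diversity drops below 2.
Z = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
print(hill_diversity(p, Z))
```

With Z as the identity matrix this recovers the classical Hill numbers, so q = 0 counts species, q = 1 gives the exponential of Shannon entropy, and q = 2 the inverse Simpson index.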

Citation

If you use this library in your research, please cite:

@software{linguistic_diversity_2026,
  title={Linguistic Diversity: Modernized Implementation of Similarity-Sensitive Hill Numbers for NLP},
  author={Harel-Canada, Fabrice},
  year={2026},
  url={https://github.com/fabriceyhc/linguistic-diversity}
}

Original TextDiversity library:

@software{textdiversity_2022,
  title={TextDiversity: Measuring Linguistic Diversity with Similarity-Sensitive Hill Numbers},
  author={Harel-Canada, Fabrice},
  year={2022},
  url={https://github.com/fabriceyhc/TextDiversity}
}

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

License

MIT License - see LICENSE for details.

Acknowledgments

  • Original TextDiversity implementation and research
  • Ecological diversity theory from Chao et al. (2014)
  • The Hugging Face ecosystem for transformer models
