Linguistic diversity metrics using similarity-sensitive Hill numbers
Linguistic Diversity
Modernized, efficient implementation of linguistic diversity metrics using similarity-sensitive Hill numbers.
This library measures various kinds of linguistic diversity using similarity-sensitive Hill numbers (SSHN). Originally adapted from the study of species diversity in ecology, SSHNs characterize the effective number of species in a population. In NLP, "species" are linguistic units (words, parse trees, etc.) and the "population" is a corpus of documents.
For example, a token semantic diversity of 9 means the corpus behaves as if it contained about 9 equally common, fully distinct semantic concepts.
Features
- Modern Python (3.9+): Type hints, dataclasses, and modern best practices
- Performance optimized: FAISS-accelerated similarity computations, model caching, vectorized operations
- Updated dependencies: NumPy 2.x, Pandas 2.x, latest transformers
- Multiple diversity dimensions:
  - Semantic: Token and document-level semantic diversity using transformers
  - Syntactic: Dependency and constituency parse tree diversity
  - Morphological: Part-of-speech sequence diversity
  - Phonological: Rhythmic and phonemic pattern diversity
  - Universal: Unified metric combining all dimensions into a single score
Quick Start
Installation
pip install linguistic-diversity
For development:
git clone https://github.com/fabriceyhc/linguistic-diversity
cd linguistic-diversity
pip install -e ".[dev]"
Basic Usage
from linguistic_diversity import TokenSemantics, DocumentSemantics
# Example corpora
corpus1 = [
    'one massive earth',
    'an enormous globe',
    'the colossal world',
]  # Paraphrases: different words, similar semantics
corpus2 = [
    'basic human right',
    'you were right',
    'make a right',
]  # Repeats the word "right", but in three different senses
# Token-level semantic diversity
token_metric = TokenSemantics()
print(f"Corpus 1 token diversity: {token_metric(corpus1):.2f}")
print(f"Corpus 2 token diversity: {token_metric(corpus2):.2f}")
# Document-level semantic diversity
doc_metric = DocumentSemantics()
print(f"Corpus 1 document diversity: {doc_metric(corpus1):.2f}")
print(f"Corpus 2 document diversity: {doc_metric(corpus2):.2f}")
Configuration
All metrics accept configuration dictionaries:
from linguistic_diversity import TokenSemantics
# Custom configuration
config = {
    'model_name': 'roberta-base',  # Use RoBERTa instead of BERT
    'q': 2.0,                      # Diversity order (higher = less sensitive to rare species)
    'normalize': True,             # Normalize by number of species
    'batch_size': 32,              # Larger batches for faster processing
    'use_cuda': True,              # Use GPU if available
    'remove_stopwords': True,      # Filter out stopwords
    'verbose': True,               # Show progress bars
}
metric = TokenSemantics(config)
diversity = metric(corpus1)
What's New in 1.0
This is a complete modernization of the original TextDiversity library with significant improvements:
Performance Improvements
- 3-5x faster similarity computation using optimized FAISS operations
- Model caching: Models loaded once and reused across metric instances
- Vectorized operations: Replaced nested Python loops with NumPy operations
- Batch processing: Optimized batch sizes for GPU utilization
- Lazy loading: Models and dependencies loaded only when needed
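The model caching idea above can be illustrated with a short sketch. This is not the library's code: `get_model` and `load_log` are hypothetical names, and the dictionary stands in for an expensive model object. It only shows the general pattern of loading once per model name and reusing the result.

```python
from functools import lru_cache

# Records every real load so the caching effect is observable.
load_log = []


@lru_cache(maxsize=None)
def get_model(model_name):
    """Hypothetical loader: build the model once per name, then reuse it."""
    load_log.append(model_name)
    return {"name": model_name}  # stand-in for an expensive model object


first = get_model("bert-base-uncased")
second = get_model("bert-base-uncased")  # served from the cache, no reload
```

Because the cache is keyed on the model name, two metric instances that request the same name would share one object, while a different name triggers a fresh load.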
Code Quality
- Type hints throughout for better IDE support and type checking
- Modern Python: Dataclasses, f-strings, pathlib, type annotations
- Better error handling with informative messages
- Comprehensive docstrings in Google style
- PEP 561 compliance with py.typed marker
Updated Dependencies
- NumPy 1.24+ (compatible with FAISS)
- Pandas 2.0+ (performance improvements)
- Latest transformers (4.35+)
- Python 3.9+ (modern language features)
Developer Experience
- Pre-commit hooks with Black, Ruff, MyPy
- Comprehensive test suite with pytest
- GitHub Actions CI/CD
- Type checking with MyPy
- Code coverage reporting
Available Metrics
Semantic Diversity
TokenSemantics: Diversity of contextualized token embeddings
from linguistic_diversity import TokenSemantics
metric = TokenSemantics({'model_name': 'bert-base-uncased'})
diversity = metric(corpus)
DocumentSemantics: Diversity of document-level embeddings
from linguistic_diversity import DocumentSemantics
metric = DocumentSemantics({'model_name': 'all-mpnet-base-v2'})
diversity = metric(corpus)
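Both semantic metrics start from embeddings, and a similarity-sensitive Hill number needs a similarity matrix over those embeddings. The sketch below shows one common way to build such a matrix: cosine similarity with negative values clipped to 0. This is an assumption for illustration only; the library's own similarity computation may differ, and `cosine_similarity_matrix` and the toy vectors are hypothetical.

```python
import math


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity, clipped to [0, 1]. Assumes non-zero vectors."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    n = len(vectors)
    Z = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            # Clip negatives so every entry can be read as a similarity in [0, 1].
            Z[i][j] = max(0.0, dot / (norms[i] * norms[j]))
    return Z


# Toy 2-dimensional "embeddings"; real transformer embeddings have hundreds of dimensions.
toy_embeddings = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
Z = cosine_similarity_matrix(toy_embeddings)
```

The first and third vectors are orthogonal, so they count as fully distinct, while the second sits between them and overlaps with both.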
Syntactic Diversity
DependencyParse: Diversity of dependency parse tree structures
from linguistic_diversity import DependencyParse
# Fast: using graph embeddings
metric = DependencyParse({'similarity_type': 'ldp'})
diversity = metric(corpus)
# Exact: using tree edit distance (slow)
metric = DependencyParse({'similarity_type': 'tree_edit_distance'})
diversity = metric(corpus)
ConstituencyParse: Diversity of constituency (phrase structure) parse trees
from linguistic_diversity import ConstituencyParse
metric = ConstituencyParse({'similarity_type': 'ldp'})
diversity = metric(corpus)
Note: Constituency parsing requires benepar. Install with: pip install linguistic-diversity[syntactic]
Morphological Diversity
PartOfSpeechSequence: Diversity of POS tag sequences using biological sequence alignment
from linguistic_diversity import PartOfSpeechSequence
metric = PartOfSpeechSequence()
diversity = metric(corpus)
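To give a feel for what sequence alignment means for POS tags, here is a minimal sketch of global alignment (Needleman-Wunsch) turned into a similarity in [0, 1]. The algorithm choice, the scoring values (match, mismatch, gap), and the function names are assumptions for illustration; this README does not state which alignment method or scoring the library uses.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score for two tag sequences (illustrative scoring)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per tag.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two tags
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]


def alignment_similarity(a, b):
    """Scale the score by the longer length; identical sequences give 1.0."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return max(0.0, global_alignment_score(a, b) / longest)


same = alignment_similarity(["DET", "ADJ", "NOUN"], ["DET", "ADJ", "NOUN"])
partial = alignment_similarity(["DET", "ADJ", "NOUN"], ["DET", "NOUN"])
```

Computing this for every pair of sentences yields a similarity matrix that can be fed into the diversity formula in the theory section.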
Phonological Diversity
Rhythmic: Diversity of rhythmic patterns (stress and syllable weight)
from linguistic_diversity import Rhythmic
metric = Rhythmic()
diversity = metric(corpus)
Phonemic: Diversity of phoneme sequences (IPA representation)
from linguistic_diversity import Phonemic
# Default: uses g2p_en (pure Python, no system dependencies)
metric = Phonemic()
diversity = metric(corpus)
# Optional: use phonemizer backend (requires espeak-ng)
metric = Phonemic({'backend': 'phonemizer'})
diversity = metric(corpus)
Note: Phonological metrics require additional dependencies. Install with: pip install linguistic-diversity[phonological]
Universal Diversity
UniversalLinguisticDiversity: Unified metric combining all dimensions
from linguistic_diversity import UniversalLinguisticDiversity
# Default balanced configuration
metric = UniversalLinguisticDiversity()
diversity = metric(corpus)
# Get detailed breakdown
detailed = metric.get_detailed_scores(corpus)
print(f"Universal: {detailed['universal']:.2f}")
print(f"By branch: {detailed['branches']}")
The universal metric combines all 7 metrics across the 4 linguistic branches (semantic, syntactic, morphological, phonological) into a single diversity score. Aggregation is hierarchical: a geometric mean within each branch, then a weighted combination across branches.
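The hierarchical aggregation described above can be sketched as follows. The README says only "weighted combination" across branches, so the weighted arithmetic mean used here is an assumption, as are the function names and the toy scores.

```python
import math


def geometric_mean(values):
    """Geometric mean of a non-empty list of positive scores."""
    return math.prod(values) ** (1.0 / len(values))


def universal_score(branch_scores, branch_weights):
    """Geometric mean within each branch, weighted mean across branches (assumed)."""
    total_weight = sum(branch_weights[name] for name in branch_scores)
    combined = 0.0
    for name, scores in branch_scores.items():
        combined += branch_weights[name] * geometric_mean(scores)
    return combined / total_weight


# Toy per-metric scores, invented for illustration only.
example_scores = {
    "semantic": [4.0, 9.0],
    "syntactic": [2.0, 8.0],
    "morphological": [5.0],
    "phonological": [1.0, 4.0],
}
equal_weights = {name: 1.0 for name in example_scores}
score = universal_score(example_scores, equal_weights)
```

With these toy values the branch means are 6, 4, 5, and 2, so equal weights give 4.25. A geometric mean within a branch keeps one very high metric from masking a very low one.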
Preset Configurations:
from linguistic_diversity import get_preset_config
# Semantic-focused (for content analysis)
config = get_preset_config("semantic_focus")
metric = UniversalLinguisticDiversity(config)
# Available presets: balanced, semantic_focus, structural_focus, minimal, conservative
See UNIVERSAL_METRIC_GUIDE.md for detailed documentation.
System Requirements
Required
- Python 3.9 or higher
- For GPU acceleration: CUDA-compatible GPU with appropriate drivers
Optional System Dependencies
All metrics work with pure Python packages - no system dependencies required!
However, if you want to use the phonemizer backend for Phonemic diversity (instead of the default g2p_en), you'll need espeak-ng:
Linux:
sudo apt-get install espeak-ng
pip install phonemizer # then use Phonemic({'backend': 'phonemizer'})
macOS:
brew install espeak-ng
pip install phonemizer # then use Phonemic({'backend': 'phonemizer'})
Windows:
- Download from espeak-ng releases
- Install espeak-ng-X64.msi or espeak-ng-X86.msi
- Set environment variable: PHONEMIZER_ESPEAK_LIBRARY=C:\Program Files\eSpeak NG\libespeak-ng.dll
- pip install phonemizer, then use Phonemic({'backend': 'phonemizer'})
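If setting a system-wide environment variable on Windows is inconvenient, the same variable can be set from Python before the phonemizer backend is first used. This is a minimal sketch that assumes the default install path shown above; adjust `ESPEAK_DLL` (a name chosen here for illustration) if eSpeak NG is installed elsewhere.

```python
import os

# Assumed default install location from the Windows steps above.
ESPEAK_DLL = r"C:\Program Files\eSpeak NG\libespeak-ng.dll"

# setdefault keeps any value that is already configured in the environment.
os.environ.setdefault("PHONEMIZER_ESPEAK_LIBRARY", ESPEAK_DLL)
```

Run this before creating `Phonemic({'backend': 'phonemizer'})` so the variable is in place when the backend is loaded.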
Note: The default g2p_en backend for Phonemic is pure Python and works everywhere without system dependencies.
Theory: Similarity-Sensitive Hill Numbers
Hill numbers provide a unified framework for measuring diversity that accounts for both:
- Species richness (how many different types exist)
- Species similarity (how similar the types are to each other)
The diversity formula is:
D = (Σ_i p_i (Σ_j Z_ij p_j)^(q-1))^(1/(1-q))
Where:
- p: Abundance distribution over species
- Z: Similarity matrix between species
- q: Diversity order parameter (0 = richness, 1 = Shannon, 2 = Simpson, ∞ = Berger-Parker)
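At q=1 (the default) the exponent 1/(1-q) is undefined, so D is taken as the limit q → 1: D = exp(-Σ_i p_i ln(Σ_j Z_ij p_j)). This is the similarity-sensitive analogue of the exponential of Shannon entropy: the effective number of species after discounting species that are similar to one another.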
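The formula can be checked by hand with a small pure-Python sketch. This is an illustration written for this README, not the library's implementation; the function name is hypothetical, and the q=1 and q=∞ cases use the standard limits of the formula.

```python
import math


def similarity_sensitive_hill_number(p, Z, q=1.0):
    """Effective number of species for abundances p and similarity matrix Z."""
    n = len(p)
    # (Zp)_i: expected similarity between species i and a randomly drawn individual.
    zp = [sum(Z[i][j] * p[j] for j in range(n)) for i in range(n)]
    support = [i for i in range(n) if p[i] > 0]
    if math.isclose(q, 1.0):
        # Limit q -> 1: exp(-sum_i p_i * ln((Zp)_i))
        return math.exp(-sum(p[i] * math.log(zp[i]) for i in support))
    if math.isinf(q):
        # Limit q -> infinity: 1 / max_i (Zp)_i
        return 1.0 / max(zp[i] for i in support)
    total = sum(p[i] * zp[i] ** (q - 1.0) for i in support)
    return total ** (1.0 / (1.0 - q))


uniform = [0.25] * 4
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
all_similar = [[1.0] * 4 for _ in range(4)]

distinct = similarity_sensitive_hill_number(uniform, identity, q=1.0)
merged = similarity_sensitive_hill_number(uniform, all_similar, q=1.0)
```

Four equally common species that share nothing (identity matrix) give a diversity of 4, while four species that are fully similar to each other collapse to a diversity of 1. Real corpora fall between these extremes.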
Citation
If you use this library in your research, please cite:
@software{linguistic_diversity_2026,
title={Linguistic Diversity: Modernized Implementation of Similarity-Sensitive Hill Numbers for NLP},
author={Harel-Canada, Fabrice},
year={2026},
url={https://github.com/fabriceyhc/linguistic-diversity}
}
Original TextDiversity library:
@software{textdiversity_2022,
title={TextDiversity: Measuring Linguistic Diversity with Similarity-Sensitive Hill Numbers},
author={Harel-Canada, Fabrice},
year={2022},
url={https://github.com/fabriceyhc/TextDiversity}
}
Contributing
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
License
MIT License - see LICENSE for details.
Acknowledgments
- Original TextDiversity implementation and research
- Ecological diversity theory from Chao et al. (2014)
- The Hugging Face ecosystem for transformer models
Links
- Documentation: linguistic-diversity.readthedocs.io (coming soon)
- PyPI: pypi.org/project/linguistic-diversity (coming soon)
- Issues: GitHub Issues
- Original Library: TextDiversity