
entroprisal

Calculate information theoretic linguistic metrics on text using reference corpora.

Overview

entroprisal is a Python package that computes various entropy and surprisal metrics for text analysis. It provides three main calculators:

  • TokenEntropisalCalculator: Token-level n-gram entropy and surprisal
  • CharacterEntropisalCalculator: Character-level entropy and surprisal
  • RestOfWordEntropisalCalculator: Character-level rest-of-word entropy and surprisal (bidirectional: left-to-right and right-to-left word completion)

These metrics are useful for analyzing text complexity, readability, and information content.

Installation

Basic Installation

pip install entroprisal[all]

The package will automatically download reference data files from Hugging Face Hub when first used (~6GiB total).

spaCy and Hugging Face Hub are optional dependencies that add functionality. A minimal installation without them is also possible:

pip install entroprisal

Optional Dependencies (included in the all extra)

  • huggingface-hub: faster downloads with caching (recommended)
  • spacy: classifying content words vs. function words in your target text

If using spaCy, you will need to download a spaCy language model as well:

python -m spacy download en_core_web_lg
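
For example, a spaCy pipeline can be used to keep only content words before computing metrics against the content-word n-grams. A minimal sketch, assuming a simple POS-based filter (the filter below is illustrative; the package's own spaCy integration may define content words differently):

import spacy
from entroprisal import TokenEntropisalCalculator
from entroprisal.utils import load_4grams

nlp = spacy.load("en_core_web_lg")

# Illustrative content-word filter: nouns, proper nouns, verbs,
# adjectives, and adverbs
CONTENT_POS = {"NOUN", "PROPN", "VERB", "ADJ", "ADV"}

doc = nlp("The quick brown fox jumps over the lazy dog")
content_tokens = [t.text.lower() for t in doc if t.pos_ in CONTENT_POS]

calc = TokenEntropisalCalculator(load_4grams("cw"), min_frequency=100)
metrics = calc.calculate_metrics(content_tokens)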

Development Installation

# Clone the repository
git clone https://github.com/learlab/entroprisal.git
cd entroprisal

# Install in editable mode with dev dependencies
uv pip install -e .[dev]

Data Files

Reference corpus files are automatically downloaded from Hugging Face Hub on first use:

  • google-books-dictionary-words.txt - Word frequencies (included in package)
  • 4grams_aw.parquet - All-word 4-gram frequencies (~800MB)
  • 4grams_cw.parquet - Content-word 4-gram frequencies (~400MB)

Files are cached locally to avoid re-downloading. To use the faster Hugging Face Hub downloader with resume capability, install with pip install entroprisal[hf].
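
To see where the files are stored locally, use the get_data_dir utility (also listed under Utilities below):

from entroprisal.utils import get_data_dir

# Local directory where reference files are downloaded and cached
print(get_data_dir())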

Quick Start

Token-Level Entropy and Surprisal

from entroprisal import TokenEntropisalCalculator
from entroprisal.utils import load_4grams

# Load reference n-gram data
ngrams = load_4grams("aw")  # "aw" = all words, "cw" = content words

# Initialize calculator
calc = TokenEntropisalCalculator(ngrams, min_frequency=100)

# Calculate metrics for a list of tokens
tokens = ["the", "quick", "brown", "fox"]
metrics = calc.calculate_metrics(tokens)

print(metrics)
# Output includes:
# - ngram_surprisal_1, ngram_surprisal_2, ngram_surprisal_3
# - ngram_entropy_1, ngram_entropy_2, ngram_entropy_3
# - Support counts for each metric
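
Conceptually, surprisal is the negative log probability of a token given its preceding n-gram context, and entropy is the expected surprisal over the possible continuations. A toy illustration of the underlying math (for intuition only; the package's exact estimator, smoothing, and log base may differ):

import math

# Toy counts: how often each token follows the context ("quick", "brown")
continuations = {"fox": 80, "dog": 15, "bear": 5}
total = sum(continuations.values())

# Surprisal of "fox" given the context: -log2 P("fox" | context)
surprisal = -math.log2(continuations["fox"] / total)

# Entropy of the continuation distribution: -sum p * log2(p)
entropy = -sum((c / total) * math.log2(c / total)
               for c in continuations.values())

print(f"surprisal(fox) = {surprisal:.3f} bits, entropy = {entropy:.3f} bits")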

Character-Level Entropy and Surprisal

from entroprisal import CharacterEntropisalCalculator
from entroprisal.utils import load_google_books_words

# Load word frequency data
words_df = load_google_books_words()

# Initialize calculator
calc = CharacterEntropisalCalculator(words_df)

# Calculate metrics for text
text = "The quick brown fox jumps over the lazy dog"
metrics = calc.calculate_metrics(text)

print(metrics)
# Output includes:
# - char_entropy, char_surprisal: Single character transition metrics
# - bigraph_entropy, bigraph_surprisal: Two-character context metrics
# - trigraph_entropy, trigraph_surprisal: Three-character context metrics
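
Individual values can also be looked up with the accessor methods listed in the API reference below (each returns an Optional[float], presumably None for sequences absent from the reference data):

# Per-sequence lookups on the same calculator
print(calc.get_character_entropy("t"))
print(calc.get_bigraph_entropy("th"))
print(calc.get_trigraph_surprisal("the"))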

Rest-of-Word Entropy and Surprisal (Character-Level, Bidirectional)

from entroprisal import RestOfWordEntropisalCalculator
from entroprisal.utils import load_google_books_words

# Load word frequency data
words_df = load_google_books_words()

# Initialize calculator
calc = RestOfWordEntropisalCalculator(words_df)

# Calculate metrics for text
text = "The quick brown fox"
metrics = calc.calculate_metrics(text)

print(metrics)
# Output includes:
# - lr_c1_entropy, lr_c1_surprisal: Left-to-right, 1-char context
# - lr_c2_entropy, lr_c2_surprisal: Left-to-right, 2-char context
# - lr_c3_entropy, lr_c3_surprisal: Left-to-right, 3-char context
# - rl_c1_entropy, rl_c1_surprisal: Right-to-left, 1-char context
# - rl_c2_entropy, rl_c2_surprisal: Right-to-left, 2-char context
# - rl_c3_entropy, rl_c3_surprisal: Right-to-left, 3-char context
# - mean_word_length
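
Intuitively, the left-to-right metrics quantify how predictable the remainder of a word is after seeing its first characters, weighted by word frequency. A toy sketch of that idea (illustration only, not the package's implementation):

import math

# Toy word-frequency table
freqs = {"the": 500, "they": 120, "them": 100, "then": 80, "theory": 40}

prefix = "the"
candidates = {w: f for w, f in freqs.items() if w.startswith(prefix)}
total = sum(candidates.values())

# Entropy over which word the prefix will complete into
entropy = -sum((f / total) * math.log2(f / total)
               for f in candidates.values())
print(f"completion entropy after '{prefix}': {entropy:.3f} bits")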

Batch Processing

All calculators support batch processing:

# Process multiple texts at once
texts = [
    "First text sample",
    "Second text sample",
    "Third text sample"
]

# Returns a pandas DataFrame with one row per text
results_df = calc.calculate_batch(texts)
print(results_df)
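
For the token-level calculator, batches are lists of token lists, matching calculate_batch(token_lists) in the API reference:

# token_calc: a TokenEntropisalCalculator, as in the Quick Start above
token_lists = [
    ["the", "quick", "brown", "fox"],
    ["a", "lazy", "dog"],
]
token_results = token_calc.calculate_batch(token_lists)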

API Reference

TokenEntropisalCalculator

Calculate token-level entropy and surprisal metrics using n-gram frequencies.

Methods:

  • calculate_metrics(tokens: List[str]) -> Dict[str, float]: Calculate metrics for a token list
  • calculate_batch(token_lists: List[List[str]]) -> pd.DataFrame: Batch processing
  • get_detailed_ngram_analysis(tokens: List[str]) -> Dict[int, pd.DataFrame]: Detailed per-token analysis
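
For example, get_detailed_ngram_analysis returns one DataFrame per n-gram order; a usage sketch (the dictionary keys are presumably the n-gram orders, and exact column names depend on the implementation):

# calc: a TokenEntropisalCalculator, as in the Quick Start
analysis = calc.get_detailed_ngram_analysis(["the", "quick", "brown", "fox"])
print(analysis[3].head())  # per-token rows for the 3-gram context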

CharacterEntropisalCalculator

Calculate character-level transition entropy and surprisal.

Methods:

  • calculate_metrics(text: str, preprocess: bool = True) -> Dict[str, float]: Calculate metrics for text
  • calculate_batch(texts: List[str], preprocess: bool = True) -> pd.DataFrame: Batch processing
  • get_character_entropy(char: str) -> Optional[float]: Lookup entropy for specific character
  • get_character_surprisal(char: str) -> Optional[float]: Lookup surprisal for specific character
  • get_bigraph_entropy(bigraph: str) -> Optional[float]: Lookup entropy for bigraph
  • get_bigraph_surprisal(bigraph: str) -> Optional[float]: Lookup surprisal for bigraph
  • get_trigraph_entropy(trigraph: str) -> Optional[float]: Lookup entropy for trigraph
  • get_trigraph_surprisal(trigraph: str) -> Optional[float]: Lookup surprisal for trigraph

RestOfWordEntropisalCalculator

Calculate character-level rest-of-word entropy and surprisal in both directions (predicting remaining characters from left-to-right and right-to-left contexts).

Methods:

  • calculate_metrics(text: str, preprocess: bool = True) -> Dict[str, float]: Calculate metrics for text
  • calculate_batch(texts: List[str], preprocess: bool = True) -> pd.DataFrame: Batch processing
  • get_word_frequency(word: str) -> int: Get frequency of a word in reference corpus
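
For example, get_word_frequency can flag out-of-vocabulary words before interpreting their metrics (a sketch, assuming a frequency of 0 for unseen words):

# calc: a RestOfWordEntropisalCalculator, as in the Quick Start
if calc.get_word_frequency("qwxzt") == 0:
    print("out-of-vocabulary word; metrics may be unreliable")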

Utilities

from entroprisal.utils import (
    load_google_books_words,
    load_4grams,
    get_data_dir,
    preprocess_text
)

# Load reference data
words_df = load_google_books_words()
ngrams_aw = load_4grams("aw")
ngrams_cw = load_4grams("cw")

# Get data directory path
data_dir = get_data_dir()

# Preprocess text
cleaned = preprocess_text("Text with punctuation!", aggressive=True)

Examples

See examples/usage_examples.ipynb for comprehensive examples including:

  • Loading and initializing calculators
  • Processing single texts and batches
  • Combining multiple metrics
  • Visualizing results

Development

Running Tests

pytest tests/

Code Style

# Format code
black src/

# Lint code
ruff check src/

License

See LICENSE file for details.

Citation

If you use this package in your research, please cite:

@software{entroprisal,
  title = {entroprisal: Entropy-based linguistic metrics},
  author = {Langdon Holmes},
  year = {2025},
  url = {https://github.com/learlab/entroprisal}
}

Contributing

Contributions are welcome! Please open an issue or submit a pull request.

Acknowledgments

Reference data sources:

  • Google Books word frequencies: gwordlist
  • N-gram token frequencies: derived from the slimpajama test set
