Calculate entropy-based linguistic metrics on text using reference corpora

entroprisal

Calculate information theoretic linguistic metrics on text using reference corpora.

Overview

entroprisal is a Python package that computes various entropy and surprisal metrics for text analysis. It provides three main calculators:

TokenEntropisalCalculator: Token-level n-gram entropy and surprisal
CharacterEntropisalCalculator: Character-level entropy and surprisal
RestOfWordEntropisalCalculator: Character-level rest-of-word entropy and surprisal (bidirectional: left-to-right and right-to-left word completion)

These metrics are useful for analyzing text complexity, readability, and information content.
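
As a point of reference for the metrics named above, the following is a minimal sketch of surprisal and entropy on a toy next-word distribution. It does not use the package; the function names and the probabilities are made up for illustration.

```python
import math

def surprisal(p: float, base: float = 2.0) -> float:
    """Information content of an outcome with probability p (bits when base=2)."""
    return -math.log2(p) / math.log2(base)

def entropy(probs, base: float = 2.0) -> float:
    """Expected surprisal of a distribution; zero-probability outcomes are skipped."""
    return sum(p * surprisal(p, base) for p in probs if p > 0)

# Toy next-word distribution after some context (made-up numbers).
next_word = {"fox": 0.5, "dog": 0.25, "cat": 0.25}

print(surprisal(next_word["fox"]))   # 1.0
print(surprisal(next_word["dog"]))   # 2.0
print(entropy(next_word.values()))   # 1.5
```

Surprisal describes one observed outcome, while entropy describes the uncertainty of the whole distribution before the outcome is seen.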

Installation

Basic Installation

pip install entroprisal[all]

The package will automatically download reference data files from Hugging Face Hub when first used (~4GiB total).

SpaCy and Hugging Face Hub are optional dependencies for additional functionality. A minimal installation without these dependencies is also possible:

pip install entroprisal

Optional Dependencies included in `all`

huggingface-hub is used for faster downloads with caching (recommended)

spacy is used for classifying content words vs. function words in your target text and for tokenization.

If using SpaCy, you will need to download a SpaCy language model as well:

python -m spacy download en_core_web_lg

Development Installation

# Clone the repository
git clone https://github.com/learlab/entroprisal.git
cd entroprisal

# Install in editable mode with dev dependencies
uv pip install -e .[dev]

Data Files

Reference corpus files are loaded on first use. The word frequency list ships with the package; the two 4-gram files are downloaded from Hugging Face Hub:

google-books-dictionary-words.txt - Word frequencies (included in package)
4grams_aw.parquet - All-word 4-gram frequencies (~2GiB)
4grams_cw.parquet - Content-word 4-gram frequencies (~1.8GiB)

Files are cached locally to avoid re-downloading. To use the faster Hugging Face Hub downloader with resume capability, install with pip install entroprisal[hf].

Quick Start

Text Preprocessing

For best results, preprocess your text using the preprocess_text() function, which uses spaCy for tokenization. This ensures consistency with how the reference corpora were prepared.

from entroprisal import preprocess_text

# Preprocess text (requires spaCy: pip install entroprisal[spacy])
text = "The quick brown fox jumps over the lazy dog."
tokens = preprocess_text(text)
# [['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']]

# For content-word-only analysis (nouns, verbs, adjectives, adverbs)
content_tokens = preprocess_text(text, content_words_only=True)
# [['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']]
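
The content-word filter above can be pictured as a part-of-speech check. The sketch below is an illustration only: it works on hand-tagged (token, tag) pairs instead of calling spaCy, and the tag set is an assumption based on the "nouns, verbs, adjectives, adverbs" description. The package's own rule (for example, how it treats proper nouns) may differ.

```python
# Universal POS tags treated as content words in this sketch (an assumption).
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}

def filter_content(tagged_tokens):
    """Keep lowercased tokens whose POS tag is in CONTENT_POS."""
    return [tok.lower() for tok, pos in tagged_tokens if pos in CONTENT_POS]

# Hand-tagged input standing in for a tagger's output (hypothetical tags).
tagged = [
    ("The", "DET"), ("quick", "ADJ"), ("brown", "ADJ"), ("fox", "NOUN"),
    ("jumps", "VERB"), ("over", "ADP"), ("the", "DET"), ("lazy", "ADJ"),
    ("dog", "NOUN"), (".", "PUNCT"),
]

print(filter_content(tagged))
# ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```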

Token-Level Entropy and Surprisal

from entroprisal import TokenEntropisalCalculator
from entroprisal.utils import load_4grams

# Load reference n-gram data
ngrams = load_4grams("aw")  # "aw" = all words, "cw" = content words

# Initialize calculator
calc = TokenEntropisalCalculator(ngrams, min_frequency=100)

# Calculate metrics for a list of tokens
tokens = ["the", "quick", "brown", "fox"]
metrics = calc.calculate_metrics(tokens)

print(metrics)
# Output includes (per-document means over attested positions):
# - ngram_surprisal_1, ngram_surprisal_2, ngram_surprisal_3
# - ngram_entropy_1, ngram_entropy_2, ngram_entropy_3
# - entropy_reduction_2, entropy_reduction_3
# - entropy_difference_1, entropy_difference_2, entropy_difference_3
# - Support counts for each metric

Per-Position Token Metrics

In addition to per-document means, TokenEntropisalCalculator exposes per-position methods that return a pandas.DataFrame with one row per token. Throughout, n is the length of the conditioning context, matching the suffix in ngram_surprisal_n (so n=3 corresponds to the 4-gram). No backoff is applied: a context that is unattested in the reference corpus, or a position too early to have a full context, yields NaN and an availability flag of False.

tokens = ["the", "quick", "brown", "fox"]

# Surprisal: -log2 P(w_t | w_{t-3}, w_{t-2}, w_{t-1}), the information value of each token.
calc.surprisal(tokens)
# columns: position, token, surprisal, surprisal_available

# Entropy reduction (Hale-style, conditional mutual information):
#   H(W_t | w_{t-n}..w_{t-2}) - H(W_t | w_{t-n}..w_{t-1})
# How much observing the most recent context word reduced uncertainty about a fixed
# target. n=3 (default) is the 4-gram; n=2 is the trigram. Clipped at 0 by default.
calc.entropy_reduction(tokens, n=3)
calc.entropy_reduction(tokens, n=2, signed=True)  # keep negative values
# columns: position, token, entropy_reduction, available

# Entropy difference (Lowder-style): E_n[t-1] - E_n[t], the change in next-word entropy
# from one position to the next. NOTE: this differences entropies over *different* random
# variables (adjacent positions), unlike entropy_reduction's H(X) - H(X|y) over a fixed
# target. n in {1, 2, 3}; n=3 (default) reproduces the original Lowder et al. (2018)
# definition, n=1 is the simplest token-token (bigram) form. Clipped at 0 by default.
calc.entropy_difference(tokens, n=3)
# columns: position, token, entropy_difference, available

# Everything at once, at every context length (best for comparative analysis):
calc.compute_all(tokens)
# columns: position, token, surprisal,
#          entropy_reduction_2, entropy_reduction_3,
#          entropy_difference_1, entropy_difference_2, entropy_difference_3,
#          and a matching *_available flag for each metric
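
To make the NaN-and-flag behaviour concrete, here is a minimal sketch of per-position surprisal without backoff. It is not the package's implementation: the function name, the plain-dict output, and the toy 4-gram counts are all assumptions made for illustration.

```python
import math
from collections import Counter

def per_position_surprisal(tokens, ngram_counts, context_counts, n=3):
    """Return one record per token; unattested or too-early positions get NaN."""
    rows = []
    for t, token in enumerate(tokens):
        value, available = math.nan, False
        if t >= n:  # a full n-word context exists
            context = tuple(tokens[t - n:t])
            joint = ngram_counts.get(context + (token,), 0)
            if joint > 0:  # no backoff: unseen n-grams stay NaN
                value = -math.log2(joint / context_counts[context])
                available = True
        rows.append({"position": t, "token": token,
                     "surprisal": value, "surprisal_available": available})
    return rows

# Toy 4-gram counts (made-up numbers, not the package's reference data).
ngram_counts = Counter({
    ("the", "quick", "brown", "fox"): 3,
    ("the", "quick", "brown", "dog"): 1,
})
# Context counts are the 4-gram counts summed over the final word.
context_counts = Counter()
for gram, count in ngram_counts.items():
    context_counts[gram[:-1]] += count

rows = per_position_surprisal(["the", "quick", "brown", "fox"],
                              ngram_counts, context_counts)
for row in rows:
    print(row)
```

Only the last token has a full three-word context here, so the first three rows carry NaN and a False flag, and the fourth gets -log2(3/4).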

All four methods accept a base argument (default 2.0, giving bits). entropy_reduction, entropy_difference, and compute_all also accept signed (default False, which clips negative values to 0 following Hale's convention).
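
The distinction between entropy reduction and entropy difference, and the effect of signed, can be sketched on toy counts. This is an illustration under stated assumptions, not the package's code: the helper names are hypothetical, the counts are made up, and the entropy difference is shown in its n=1 form with E[t] read as the entropy of the word following position t.

```python
import math
from collections import Counter

def entropy(counts, base=2.0):
    """Shannon entropy of the distribution implied by a Counter of outcomes."""
    total = sum(counts.values())
    bits = sum((c / total) * math.log2(total / c) for c in counts.values())
    return bits / math.log2(base)

def target_counts(ngrams, pattern):
    """Counts of the final word over n-grams whose context matches pattern (None = any)."""
    out = Counter()
    for gram, c in ngrams.items():
        context = gram[:-1]
        if len(context) == len(pattern) and all(
            p is None or p == w for p, w in zip(pattern, context)
        ):
            out[gram[-1]] += c
    return out

def clip(value, signed=False):
    """Clip negatives to 0 unless signed values are requested."""
    return value if signed else max(0.0, value)

# Toy counts (made-up numbers).
trigrams = Counter({("the", "old", "man"): 2, ("the", "old", "dog"): 2,
                    ("the", "big", "dog"): 4})
bigrams = Counter({("the", "old"): 4, ("the", "big"): 4, ("old", "man"): 2,
                   ("old", "dog"): 2, ("big", "dog"): 4})

# Entropy reduction, n=2: the same target, without and with the newest context word.
h_without = entropy(target_counts(trigrams, ("the", None)))   # H(W_t | w_{t-2})
h_with = entropy(target_counts(trigrams, ("the", "old")))     # H(W_t | w_{t-2}, w_{t-1})
reduction = h_without - h_with
print(clip(reduction), clip(reduction, signed=True))  # 0.0 and roughly -0.189

# Entropy difference, n=1: next-word entropy at two adjacent positions ("the", then "big").
difference = entropy(target_counts(bigrams, ("the",))) - entropy(target_counts(bigrams, ("big",)))
print(clip(difference))  # 1.0
```

Observing "old" after "the" leaves the next word more uncertain than before, so the unclipped reduction is negative; that is the case the signed flag controls.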

Character-Level Entropy and Surprisal

from entroprisal import CharacterEntropisalCalculator, preprocess_text
from entroprisal.utils import load_google_books_words

# Load word frequency data
words_df = load_google_books_words()

# Initialize calculator
calc = CharacterEntropisalCalculator(words_df)

# Preprocess text to get tokens
text = "The quick brown fox jumps over the lazy dog"
tokens = preprocess_text(text)[0]  # Get first document's tokens

# Calculate metrics for tokens
metrics = calc.calculate_metrics(tokens)

print(metrics)
# Output includes:
# - char_entropy, char_surprisal: Single character transition metrics
# - bigraph_entropy, bigraph_surprisal: Two-character context metrics
# - trigraph_entropy, trigraph_surprisal: Three-character context metrics
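
For intuition about the character-level metrics, the sketch below derives transition entropy and surprisal from a frequency-weighted word list. It is an assumption about the general approach, not the package's code: the function names and the two-word frequency table are hypothetical, and word-boundary handling is ignored.

```python
import math
from collections import Counter, defaultdict

def transition_counts(word_freqs, context_len=1):
    """Frequency-weighted counts of the next character given the preceding context."""
    table = defaultdict(Counter)
    for word, freq in word_freqs.items():
        for i in range(context_len, len(word)):
            table[word[i - context_len:i]][word[i]] += freq
    return table

def transition_surprisal(table, context, target):
    """Surprisal in bits of target after context, or None if unattested."""
    counts = table.get(context)
    if not counts or counts[target] == 0:
        return None
    return math.log2(sum(counts.values()) / counts[target])

def context_entropy(table, context):
    """Entropy in bits of the next-character distribution, or None if unattested."""
    counts = table.get(context)
    if not counts:
        return None
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# Toy word frequencies (made-up numbers).
word_freqs = {"the": 6, "tin": 2}
table = transition_counts(word_freqs, context_len=1)

print(transition_surprisal(table, "t", "i"))  # 2.0
print(context_entropy(table, "t"))            # roughly 0.811
print(transition_surprisal(table, "q", "u"))  # None
```

Setting context_len to 2 or 3 gives the analogue of the bigraph and trigraph contexts.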

Rest-of-Word Entropy and Surprisal (Character-Level, Bidirectional)

from entroprisal import RestOfWordEntropisalCalculator, preprocess_text
from entroprisal.utils import load_google_books_words

# Load word frequency data
words_df = load_google_books_words()

# Initialize calculator
calc = RestOfWordEntropisalCalculator(words_df)

# Preprocess text to get tokens
text = "The quick brown fox"
tokens = preprocess_text(text)[0]  # Get first document's tokens

# Calculate metrics for tokens
metrics = calc.calculate_metrics(tokens)

print(metrics)
# Output includes:
# - lr_c1_entropy, lr_c1_surprisal: Left-to-right, 1-char context
# - lr_c2_entropy, lr_c2_surprisal: Left-to-right, 2-char context
# - lr_c3_entropy, lr_c3_surprisal: Left-to-right, 3-char context
# - rl_c1_entropy, rl_c1_surprisal: Right-to-left, 1-char context
# - rl_c2_entropy, rl_c2_surprisal: Right-to-left, 2-char context
# - rl_c3_entropy, rl_c3_surprisal: Right-to-left, 3-char context
# - mean_word_length
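
One way to read the left-to-right and right-to-left metrics is as uncertainty over which word completes a known prefix or suffix. The sketch below follows that reading; it is an assumption about the idea, not the package's definition, and the function name, return shape, and frequencies are hypothetical.

```python
import math

def rest_of_word(word, word_freqs, k=1, direction="lr"):
    """Entropy over words sharing a k-char context, and surprisal of `word` itself.

    direction="lr" uses the first k characters; "rl" uses the last k.
    Returns (entropy, surprisal) in bits, or None if the word is unusable.
    """
    if k < 1 or len(word) < k or word not in word_freqs:
        return None
    if direction == "lr":
        context = word[:k]
        candidates = [f for w, f in word_freqs.items() if w.startswith(context)]
    else:
        context = word[-k:]
        candidates = [f for w, f in word_freqs.items() if w.endswith(context)]
    total = sum(candidates)
    entropy = sum((f / total) * math.log2(total / f) for f in candidates)
    surprisal = math.log2(total / word_freqs[word])
    return entropy, surprisal

# Toy word frequencies (made-up numbers).
word_freqs = {"fox": 2, "fog": 2, "box": 4}

print(rest_of_word("fox", word_freqs, k=1, direction="lr"))  # (1.0, 1.0)
print(rest_of_word("fox", word_freqs, k=1, direction="rl"))  # competitors: fox, box
```

With the prefix "f" the competitors are "fox" and "fog"; with the suffix "x" they are "fox" and "box", so the two directions give different values for the same word.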

Batch Processing

All calculators support batch processing with token lists:

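from entroprisal import CharacterEntropisalCalculator, preprocess_text
from entroprisal.utils import load_google_books_words

# Any calculator works here; the character-level one is used as an example
calc = CharacterEntropisalCalculator(load_google_books_words())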

# Preprocess multiple texts at once
texts = [
    "First text sample",
    "Second text sample",
    "Third text sample"
]
token_lists = preprocess_text(texts)  # Returns list of token lists

# Returns a pandas DataFrame with one row per document
results_df = calc.calculate_batch(token_lists)
print(results_df)
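
The per-document rows combine per-position values into means over attested positions, with a support count alongside. A minimal sketch of that aggregation, using plain dicts and made-up values (the key names mean_surprisal and support are hypothetical, not the package's column names):

```python
import math

def document_row(values):
    """Mean of the non-NaN values plus a count of how many were averaged."""
    attested = [v for v in values if not math.isnan(v)]
    mean = sum(attested) / len(attested) if attested else math.nan
    return {"mean_surprisal": mean, "support": len(attested)}

# Per-position surprisal for three toy documents (NaN = unattested position).
per_position = [
    [math.nan, 2.0, 4.0],
    [math.nan, math.nan, math.nan],
    [1.0, 3.0, math.nan, 5.0],
]

rows = [document_row(values) for values in per_position]
for row in rows:
    print(row)
```

A document with no attested positions keeps a NaN mean and a support of 0, which makes it easy to filter out before comparing documents.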

API Reference

TokenEntropisalCalculator

Calculate token-level entropy and surprisal metrics using n-gram frequencies.

Methods:

calculate_metrics(tokens: List[str]) -> Dict[str, float]: Per-document mean metrics for a token list
calculate_batch(token_lists: List[List[str]]) -> pd.DataFrame: Batch processing
surprisal(tokens, *, base=2.0) -> pd.DataFrame: Per-position surprisal
entropy_reduction(tokens, *, n=3, signed=False, base=2.0) -> pd.DataFrame: Per-position entropy reduction (conditional mutual information; n in {2, 3})
entropy_difference(tokens, *, n=3, signed=False, base=2.0) -> pd.DataFrame: Per-position entropy difference (Lowder-style; n in {1, 2, 3})
compute_all(tokens, *, signed=False, base=2.0) -> pd.DataFrame: All per-position metrics at every context length
get_detailed_ngram_analysis(tokens: List[str]) -> Dict[int, pd.DataFrame]: Detailed per-token analysis

CharacterEntropisalCalculator

Calculate character-level transition entropy and surprisal.

Methods:

calculate_metrics(tokens: List[str]) -> Dict[str, float]: Calculate metrics for a token list
calculate_batch(token_lists: List[List[str]]) -> pd.DataFrame: Batch processing
get_character_entropy(char: str) -> Optional[float]: Lookup entropy for specific character
get_character_surprisal(context: str, target: str) -> Optional[float]: Lookup surprisal for character transition
get_bigraph_entropy(bigraph: str) -> Optional[float]: Lookup entropy for bigraph
get_bigraph_surprisal(bigraph: str) -> Optional[float]: Lookup surprisal for bigraph
get_trigraph_entropy(trigraph: str) -> Optional[float]: Lookup entropy for trigraph
get_trigraph_surprisal(trigraph: str) -> Optional[float]: Lookup surprisal for trigraph

RestOfWordEntropisalCalculator

Calculate character-level rest-of-word entropy and surprisal in both directions (predicting remaining characters from left-to-right and right-to-left contexts).

Methods:

calculate_metrics(tokens: List[str]) -> Dict[str, float]: Calculate metrics for a token list
calculate_batch(token_lists: List[List[str]]) -> pd.DataFrame: Batch processing
get_word_frequency(word: str) -> int: Get frequency of a word in reference corpus

Utilities

from entroprisal.utils import (
    load_google_books_words,
    load_4grams,
    get_data_dir,
    preprocess_text,
    is_content_token
)

# Load reference data
words_df = load_google_books_words()
ngrams_aw = load_4grams("aw")
ngrams_cw = load_4grams("cw")

# Get data directory path
data_dir = get_data_dir()

# Preprocess text with spaCy tokenization
# Returns list of token lists (one per document)
tokens = preprocess_text("The quick brown fox jumps over the lazy dog.")
# [['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']]

# Process multiple texts
texts = ["First sentence.", "Second sentence."]
token_lists = preprocess_text(texts)

# Extract only content words (nouns, verbs, adjectives, adverbs)
content_tokens = preprocess_text("The quick brown fox jumps.", content_words_only=True)
# [['quick', 'brown', 'fox', 'jumps']]  # 'the' filtered out

# Use a different spaCy model
tokens = preprocess_text("Some text", spacy_model_tag="en_core_web_sm")

Examples

See examples/usage_examples.ipynb for comprehensive examples including:

Loading and initializing calculators
Processing single texts and batches
Combining multiple metrics
Visualizing results

Development

Running Tests

pytest tests/

Code Style

# Format code
black src/

# Lint code
ruff check src/

License

It's MIT licensed. Do what you want with it.

Citation

On the other hand, if you are an academic, please cite the package as follows:

@software{entroprisal,
  title = {entroprisal: Entropy-based linguistic metrics},
  author = {Langdon Holmes and Scott Crossley},
  year = {2025},
  url = {https://github.com/learlab/entroprisal}
}

Holmes, L., & Crossley, S. (2025). entroprisal: Entropy-based linguistic metrics [Computer software].

Acknowledgments

Reference data sources:

Google Books word frequencies: gwordlist
N-gram token frequencies: Derived from the SlimPajama test set

Release history

0.7.0 - Jun 19, 2026
0.6.0 - Jun 5, 2026
0.5.1 - Jun 1, 2026
0.5.0 - May 29, 2026
0.4.0 - May 26, 2026
0.3.1 - May 20, 2026 (this version)
0.3.0 - May 20, 2026
0.2.3 - Dec 10, 2025
0.2.2 - Dec 10, 2025
0.2.1 - Dec 10, 2025
0.2.0 - Dec 10, 2025
0.1.0 - Dec 9, 2025

