
FreqProb


A modern, high-performance Python library for probability smoothing and frequency-based language modeling.

FreqProb provides state-of-the-art smoothing techniques for converting frequency counts into probability estimates, with applications in natural language processing, information retrieval, and statistical modeling.

Comprehensive & Accurate

  • 10+ smoothing methods: From basic Laplace to advanced Kneser-Ney and Simple Good-Turing
  • Mathematically rigorous: Implementations validated against reference sources (NLTK, SciPy)
  • Production-ready: Extensive testing with 400+ test cases and property-based validation

High Performance

  • Vectorized operations: Batch processing with NumPy acceleration
  • Memory efficient: Compressed representations and streaming algorithms
  • Lazy evaluation: Compute probabilities only when needed
  • Caching system: Intelligent memoization for expensive operations
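The vectorized idea can be sketched in plain NumPy (this is an illustration of batch scoring, not freqprob's own batch API, whose method names may differ):

```python
import numpy as np

# Plain-NumPy sketch of batch Laplace scoring: score a whole query at once
# instead of looping over words in Python.
counts = {'the': 100, 'cat': 50, 'dog': 30, 'bird': 10}
bins = 10000                     # assumed vocabulary size
total = sum(counts.values())     # 190 observed tokens

query = ['cat', 'dog', 'elephant']               # 'elephant' is unseen
c = np.array([counts.get(w, 0) for w in query])  # per-word counts: [50, 30, 0]
probs = (c + 1) / (total + bins)                 # add-one smoothing, one vector op
```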

Developer Experience

  • Type safety: Full type hints with mypy validation
  • Modern Python: Requires Python 3.10+ and uses the latest language features
  • Rich documentation: Mathematical background, tutorials, and API reference
  • Easy integration: Clean, intuitive API design

Quick Start

Installation

pip install freqprob

For additional features:

pip install freqprob[all]  # All optional dependencies

Basic Usage

import freqprob

# Create a frequency distribution
word_counts = {'the': 100, 'cat': 50, 'dog': 30, 'bird': 10}

# Basic smoothing - handles zero probabilities
laplace = freqprob.Laplace(word_counts, bins=10000)
print(f"P(cat) = {laplace('cat'):.4f}")      # 0.0053
print(f"P(elephant) = {laplace('elephant'):.6f}")  # 0.000105 (unseen word)

# Advanced smoothing for n-gram models
bigrams = {('the', 'cat'): 25, ('the', 'dog'): 20, ('a', 'cat'): 15}
kneser_ney = freqprob.KneserNey(bigrams, discount=0.75)

# Model evaluation
test_data = ['cat', 'dog', 'bird'] * 10
perplexity = freqprob.perplexity(laplace, test_data)
print(f"Perplexity: {perplexity:.2f}")
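Perplexity is the exponentiated average negative log-probability of the test data; a minimal reference computation using the textbook formula (freqprob's own helper may differ in interface details):

```python
import math

def perplexity(prob_of, tokens):
    """Textbook perplexity: exp of the mean negative log-probability."""
    log_sum = sum(math.log(prob_of(t)) for t in tokens)
    return math.exp(-log_sum / len(tokens))

# A uniform model over four outcomes has perplexity 4.
uniform = lambda w: 0.25
perplexity(uniform, ['a', 'b', 'c', 'd'])  # ≈ 4.0
```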

Smoothing Methods

Basic Methods

  • MLE (Maximum Likelihood): Unsmoothed relative frequencies
  • Laplace (Add-One): Classic add-one smoothing
  • Lidstone (Add-k): Generalized additive smoothing
  • ELE (Expected Likelihood): Lidstone with γ=0.5
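The additive methods all share one formula: with count c(w), total token count N, assumed vocabulary size B, and smoothing parameter γ, the estimate is P(w) = (c(w) + γ) / (N + γB). Laplace is γ = 1 and ELE is γ = 0.5. A minimal plain-Python sketch (freqprob's classes may normalize slightly differently):

```python
def lidstone(counts, word, bins, gamma):
    """Additive (Lidstone) smoothing: gamma=1 is Laplace, gamma=0.5 is ELE."""
    total = sum(counts.values())
    return (counts.get(word, 0) + gamma) / (total + gamma * bins)

counts = {'the': 100, 'cat': 50, 'dog': 30, 'bird': 10}
lidstone(counts, 'cat', bins=10000, gamma=1.0)       # Laplace: 51 / 10190
lidstone(counts, 'elephant', bins=10000, gamma=0.5)  # ELE estimate for an unseen word
```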

Advanced Methods

  • Simple Good-Turing: Frequency-of-frequency based smoothing
  • Kneser-Ney: State-of-the-art for n-gram language models
  • Modified Kneser-Ney: Improved version with automatic parameter estimation
  • Bayesian: Dirichlet prior-based smoothing
  • Interpolated: Linear combination of multiple models
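The interpolation idea is simple enough to sketch directly: mix a sparse higher-order model with a denser fallback (plain Python with made-up component models, not necessarily freqprob's Interpolated signature):

```python
def interpolate(p_hi, p_lo, lam):
    """Linear combination of two probability functions with weight lam."""
    return lambda w: lam * p_hi(w) + (1.0 - lam) * p_lo(w)

# Hypothetical component models: a sparse model and a uniform fallback.
p_sparse = lambda w: {'cat': 0.5, 'dog': 0.5}.get(w, 0.0)
p_uniform = lambda w: 0.25
mixed = interpolate(p_sparse, p_uniform, lam=0.7)

mixed('cat')   # 0.7 * 0.5 + 0.3 * 0.25 = 0.425
mixed('bird')  # unseen in the sparse model, still gets 0.3 * 0.25 = 0.075
```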

Specialized Features

  • Streaming algorithms: Real-time updates for large datasets
  • Memory optimization: Compressed and sparse representations
  • Performance profiling: Built-in benchmarking and validation tools
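The streaming idea, at its simplest, is that counts (and therefore probabilities) are updated incrementally as tokens arrive, without re-reading earlier data. A toy sketch with an unsmoothed MLE estimate (freqprob's streaming classes add smoothing and memory bounds on top of this):

```python
from collections import Counter

class StreamingCounts:
    """Toy streaming estimator: probabilities reflect everything seen so far."""

    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def update(self, token):
        self.counts[token] += 1
        self.total += 1

    def mle(self, token):
        return self.counts[token] / self.total if self.total else 0.0

stream = StreamingCounts()
for token in ['cat', 'cat', 'dog']:
    stream.update(token)
stream.mle('cat')  # 2/3 after three observations
```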

Use Cases

Natural Language Processing

# Language modeling (tokens: a list of word strings)
bigrams = freqprob.ngram_frequency(tokens, n=2)
lm = freqprob.KneserNey(bigrams, discount=0.75)

# Text classification with smoothed features
# (document_tokens: a tokenized document; vocab_size: total vocabulary size)
doc_features = freqprob.word_frequency(document_tokens)
classifier_probs = freqprob.Laplace(doc_features, bins=vocab_size)

Information Retrieval

import math

# Document scoring with term frequency smoothing
# (compute_term_frequencies is a user-supplied helper returning a count dict)
term_counts = compute_term_frequencies(document)
smoothed_tf = freqprob.BayesianSmoothing(term_counts, alpha=0.5)

# Query likelihood with unseen term handling: sum log-probabilities,
# i.e. the log of the product of per-term probabilities
query_logprob = sum(math.log(smoothed_tf(term)) for term in query_terms)

Data Science & Analytics

# Probability estimation for sparse categorical data
# (data: a pandas Series of categorical values)
category_counts = {cat: count for cat, count in data.value_counts().items()}
estimator = freqprob.SimpleGoodTuring(category_counts)

# Handle zero frequencies in statistical analysis
smoothed_dist = freqprob.ELE(observed_frequencies, bins=total_categories)

Quality & Reliability

Rigorous Testing

  • 400+ test cases covering edge cases and normal operations
  • Property-based testing with Hypothesis for mathematical correctness
  • Regression testing against reference implementations (NLTK, SciPy)
  • Numerical stability validation for extreme inputs

Performance Validated

  • Benchmarking framework for performance regression detection
  • Memory profiling to ensure efficient resource usage
  • Scaling analysis from small to large vocabulary sizes
  • Cross-platform testing on Linux, Windows, and macOS

Mathematical Accuracy

  • Formula verification against academic literature
  • Statistical correctness validation with known distributions
  • Precision testing for floating-point edge cases
  • Reference compatibility with established libraries

Documentation & Learning

Learn FreqProb through comprehensive, executable tutorials with visualizations, written in the Nhandu literate programming format.

  1. Basic Smoothing Methods

    • Introduction to probability smoothing
    • MLE, Laplace, Lidstone, and ELE methods
    • Model evaluation with perplexity
  2. Advanced Methods

    • Simple Good-Turing smoothing
    • Kneser-Ney and Modified Kneser-Ney
    • Bayesian and interpolated methods
  3. Efficiency & Memory

    • Vectorized batch processing
    • Streaming algorithms
    • Memory-efficient representations
  4. Real-World Applications

    • Language modeling
    • Text classification
    • Information retrieval

Citation

If you use FreqProb in academic research, please cite:

@software{tresoldi_freqprob_2025,
  author = {Tresoldi, Tiago},
  title = {FreqProb: A Python library for probability smoothing and frequency-based language modeling},
  url = {https://github.com/tresoldi/freqprob},
  version = {0.4.0},
  publisher = {Department of Linguistics and Philology, Uppsala University},
  address = {Uppsala},
  year = {2025}
}
