FreqProb
A modern, high-performance Python library for probability smoothing and frequency-based language modeling.
FreqProb provides state-of-the-art smoothing techniques for converting frequency counts into probability estimates, with applications in natural language processing, information retrieval, and statistical modeling.
Comprehensive & Accurate
- 10+ smoothing methods: From basic Laplace to advanced Kneser-Ney and Simple Good-Turing
- Mathematically rigorous: Implementations validated against reference sources (NLTK, SciPy)
- Production-ready: Extensive testing with 400+ test cases and property-based validation
High Performance
- Vectorized operations: Batch processing with NumPy acceleration
- Memory efficient: Compressed representations and streaming algorithms
- Lazy evaluation: Compute probabilities only when needed
- Caching system: Intelligent memoization for expensive operations
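Vectorized scoring is just the smoothing formula applied across an array of counts; a minimal NumPy sketch (illustrative only, not the library's internal batch API), using Laplace smoothing:

```python
import numpy as np

counts = np.array([100, 50, 30, 10])  # frequencies for a batch of words
n, bins = counts.sum(), 10000

# Add-one (Laplace) probabilities for the whole batch at once
probs = (counts + 1) / (n + bins)
```

The same expression scores a vocabulary of millions of words without a Python-level loop.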
Developer Experience
- Type safety: Full type hints with mypy validation
- Modern Python: Requires Python 3.10+, uses latest language features
- Rich documentation: Mathematical background, tutorials, and API reference
- Easy integration: Clean, intuitive API design
Quick Start
Installation
pip install freqprob
For additional features:
pip install "freqprob[all]" # All optional dependencies (quotes needed in some shells, e.g. zsh)
Basic Usage
import freqprob
# Create a frequency distribution
word_counts = {'the': 100, 'cat': 50, 'dog': 30, 'bird': 10}
# Basic smoothing - handles zero probabilities
laplace = freqprob.Laplace(word_counts, bins=10000)
print(f"P(cat) = {laplace('cat'):.4f}") # 51/10190 ≈ 0.0050
print(f"P(elephant) = {laplace('elephant'):.6f}") # 1/10190 ≈ 0.000098 (unseen word)
# Advanced smoothing for n-gram models
bigrams = {('the', 'cat'): 25, ('the', 'dog'): 20, ('a', 'cat'): 15}
kneser_ney = freqprob.KneserNey(bigrams, discount=0.75)
# Model evaluation
test_data = ['cat', 'dog', 'bird'] * 10
perplexity = freqprob.perplexity(laplace, test_data)
print(f"Perplexity: {perplexity:.2f}")
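For reference, perplexity is the exponentiated average negative log-probability of the test sample; a pure-Python sketch (independent of `freqprob.perplexity`) is:

```python
import math

def perplexity(prob, tokens):
    """exp of the average negative log-probability over the sample."""
    log_total = sum(math.log(prob(tok)) for tok in tokens)
    return math.exp(-log_total / len(tokens))

# Sanity check: a uniform model over 4 outcomes has perplexity 4
uniform = lambda tok: 0.25
perplexity(uniform, ['a', 'b', 'c', 'd'])  # ≈ 4.0
```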
Smoothing Methods
Basic Methods
- MLE (Maximum Likelihood): Unsmoothed relative frequencies
- Laplace (Add-One): Classic add-one smoothing
- Lidstone (Add-k): Generalized additive smoothing
- ELE (Expected Likelihood): Lidstone with γ=0.5
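All four are instances of additive smoothing, P(w) = (c(w) + γ) / (N + γ·B), where N is the total count and B the number of bins: γ = 0 gives MLE, γ = 1 Laplace, arbitrary γ Lidstone, and γ = 0.5 ELE. A library-independent sketch:

```python
def additive_prob(counts, word, gamma, bins):
    """P(word) = (c + gamma) / (N + gamma * bins) -- generalized additive smoothing."""
    n = sum(counts.values())
    return (counts.get(word, 0) + gamma) / (n + gamma * bins)

counts = {'the': 100, 'cat': 50, 'dog': 30, 'bird': 10}   # N = 190
additive_prob(counts, 'cat', gamma=1.0, bins=10000)       # Laplace: 51/10190
additive_prob(counts, 'cat', gamma=0.5, bins=10000)       # ELE: 50.5/5190
additive_prob(counts, 'elephant', gamma=1.0, bins=10000)  # unseen: 1/10190
```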
Advanced Methods
- Simple Good-Turing: Frequency-of-frequency based smoothing
- Kneser-Ney: State-of-the-art for n-gram language models
- Modified Kneser-Ney: Improved version with automatic parameter estimation
- Bayesian: Dirichlet prior-based smoothing
- Interpolated: Linear combination of multiple models
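To make the Kneser-Ney idea concrete, here is a textbook interpolated formulation for bigrams: each count is discounted by d, and the freed mass is redistributed according to how many distinct contexts a word continues. This is a conceptual sketch, not freqprob's implementation:

```python
from collections import defaultdict

def kneser_ney_bigram(bigram_counts, discount=0.75):
    """Interpolated Kneser-Ney probability for bigrams (textbook form)."""
    context_total = defaultdict(int)  # c(u): total count of context u
    context_types = defaultdict(int)  # number of distinct words following u
    continuation = defaultdict(int)   # number of distinct contexts preceding w
    for (u, w), c in bigram_counts.items():
        context_total[u] += c
        context_types[u] += 1
        continuation[w] += 1
    n_types = len(bigram_counts)      # total number of distinct bigrams

    def prob(u, w):
        c = bigram_counts.get((u, w), 0)
        lam = discount * context_types[u] / context_total[u]
        p_continuation = continuation[w] / n_types
        return max(c - discount, 0) / context_total[u] + lam * p_continuation

    return prob

p = kneser_ney_bigram({('the', 'cat'): 25, ('the', 'dog'): 20, ('a', 'cat'): 15})
# p('the', 'cat') + p('the', 'dog') == 1.0 here, since 'cat' and 'dog'
# are the only continuation types in this toy vocabulary
```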
Specialized Features
- Streaming algorithms: Real-time updates for large datasets
- Memory optimization: Compressed and sparse representations
- Performance profiling: Built-in benchmarking and validation tools
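The streaming idea can be illustrated without the library: keep a running `Counter` and derive smoothed probabilities from the counts seen so far (freqprob's streaming classes update their state incrementally; this sketch simply recomputes on demand):

```python
from collections import Counter

stream_counts = Counter()

def update(tokens):
    """Fold a new chunk of the stream into the running counts."""
    stream_counts.update(tokens)

def laplace_prob(word, bins=10000):
    """Add-one probability from the counts observed so far."""
    n = sum(stream_counts.values())
    return (stream_counts[word] + 1) / (n + bins)

update(['the', 'cat', 'the'])
update(['dog'])
laplace_prob('the')     # (2 + 1) / (4 + 10000)
laplace_prob('unseen')  # (0 + 1) / (4 + 10000)
```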
Use Cases
Natural Language Processing
# Language modeling
bigrams = freqprob.ngram_frequency(tokens, n=2)
lm = freqprob.KneserNey(bigrams, discount=0.75)
# Text classification with smoothed features
doc_features = freqprob.word_frequency(document_tokens)
classifier_probs = freqprob.Laplace(doc_features, bins=vocab_size)
Information Retrieval
# Document scoring with term frequency smoothing
term_counts = compute_term_frequencies(document)
smoothed_tf = freqprob.BayesianSmoothing(term_counts, alpha=0.5)
# Query likelihood with unseen term handling; a likelihood is a product
# of per-term probabilities, so accumulate with math.prod (or sum logs)
import math
query_likelihood = math.prod(smoothed_tf(term) for term in query_terms)
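Long queries multiply many small probabilities, so the likelihood is usually accumulated in log space to avoid underflow. A self-contained sketch, assuming Dirichlet-style additive smoothing (the `smoothed_term_prob` helper and its parameters are illustrative, not freqprob API):

```python
import math

def smoothed_term_prob(term_counts, term, alpha=0.5, vocab_size=10000):
    """P(term | doc) = (tf + alpha) / (|doc| + alpha * vocab_size)."""
    doc_len = sum(term_counts.values())
    return (term_counts.get(term, 0) + alpha) / (doc_len + alpha * vocab_size)

term_counts = {'smoothing': 4, 'probability': 3, 'language': 2}
query = ['probability', 'smoothing', 'model']  # 'model' is unseen

# Sum of logs instead of product of probabilities
log_likelihood = sum(
    math.log(smoothed_term_prob(term_counts, t)) for t in query
)
```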
Data Science & Analytics
# Probability estimation for sparse categorical data
category_counts = data.value_counts().to_dict()
estimator = freqprob.SimpleGoodTuring(category_counts)
# Handle zero frequencies in statistical analysis
smoothed_dist = freqprob.ELE(observed_frequencies, bins=total_categories)
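Behind Simple Good-Turing is the Turing estimate: a count r is re-estimated as r* = (r+1)·N_{r+1}/N_r, where N_r is the number of categories observed exactly r times, and the mass N_1/N is reserved for unseen categories. A toy illustration of the unsmoothed estimate (the library additionally fits a log-linear model to the N_r values):

```python
from collections import Counter

counts = {'a': 3, 'b': 3, 'c': 2, 'd': 1, 'e': 1, 'f': 1}
n_r = Counter(counts.values())  # N_1 = 3, N_2 = 1, N_3 = 2
total = sum(counts.values())    # N = 11

# Adjusted count for categories seen once: r* = (1 + 1) * N_2 / N_1
r_star_1 = 2 * n_r[2] / n_r[1]  # 2/3: singletons are discounted

# Probability mass reserved for unseen categories: N_1 / N
p_unseen = n_r[1] / total       # 3/11
```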
Quality & Reliability
Rigorous Testing
- 400+ test cases covering edge cases and normal operations
- Property-based testing with Hypothesis for mathematical correctness
- Regression testing against reference implementations (NLTK, SciPy)
- Numerical stability validation for extreme inputs
Performance Validated
- Benchmarking framework for performance regression detection
- Memory profiling to ensure efficient resource usage
- Scaling analysis from small to large vocabulary sizes
- Cross-platform testing on Linux, Windows, and macOS
Mathematical Accuracy
- Formula verification against academic literature
- Statistical correctness validation with known distributions
- Precision testing for floating-point edge cases
- Reference compatibility with established libraries
Documentation & Learning
Learn FreqProb through comprehensive, executable tutorials with visualizations. Tutorials are written using Nhandu literate programming format.
- Basic Smoothing Methods
  - Introduction to probability smoothing
  - MLE, Laplace, Lidstone, and ELE methods
  - Model evaluation with perplexity
- Advanced Smoothing Methods
  - Simple Good-Turing smoothing
  - Kneser-Ney and Modified Kneser-Ney
  - Bayesian and interpolated methods
- Efficiency & Memory
  - Vectorized batch processing
  - Streaming algorithms
  - Memory-efficient representations
- Real-World Applications
  - Language modeling
  - Text classification
  - Information retrieval
Citation
If you use FreqProb in academic research, please cite:
@software{tresoldi_freqprob_2025,
  author    = {Tresoldi, Tiago},
  title     = {FreqProb: A Python library for probability smoothing and frequency-based language modeling},
  url       = {https://github.com/tresoldi/freqprob},
  version   = {0.4.0},
  publisher = {Department of Linguistics and Philology, Uppsala University},
  address   = {Uppsala},
  year      = {2025}
}