Find the best probability distribution for your data
bestdist
bestdist is a Python package that helps you identify which probability distribution best fits your data using statistical tests and information criteria.
Features
- Automatic Distribution Fitting: Test multiple distributions at once
- Statistical Tests: Kolmogorov-Smirnov, Anderson-Darling, Chi-square
- Information Criteria: AIC and BIC for model selection
- Visualization: Built-in plotting for fit assessment
- Extensible: Easy to add custom distributions
- Pandas Integration: Works seamlessly with pandas DataFrames
- Type Hints: Full type annotation support
- Well Tested: Comprehensive test suite
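To give a feel for what a Kolmogorov-Smirnov comparison measures, here is a minimal sketch using only the Python standard library. It is not bestdist's implementation. The helper `ks_statistic` and the two candidate distributions are illustrative assumptions. The statistic is the largest gap between the empirical CDF of the sample and the CDF of a candidate distribution, so a smaller value means a closer fit.

```python
import random
from statistics import NormalDist, fmean, stdev


def ks_statistic(sample, cdf):
    """One-sample Kolmogorov-Smirnov statistic: sup |F_n(x) - F(x)|."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x, so check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


rng = random.Random(0)
sample = [rng.gauss(5.0, 2.0) for _ in range(500)]

fitted = NormalDist(fmean(sample), stdev(sample))  # candidate fitted to the sample
wrong = NormalDist(0.0, 1.0)                       # deliberately poor candidate

d_fitted = ks_statistic(sample, fitted.cdf)
d_wrong = ks_statistic(sample, wrong.cdf)
print(f"fitted: D = {d_fitted:.3f}, mismatched: D = {d_wrong:.3f}")
```

The fitted candidate should give a much smaller D than the mismatched one, which is the kind of signal a goodness-of-fit ranking relies on.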
Installation
From PyPI
pip install bestdist
From source
git clone https://github.com/Wilmar3752/pdist.git
cd pdist
pip install -e .
Development installation
pip install -e ".[dev]"
Quick Start
Continuous Data
from bestdist import DistributionFitter
import numpy as np
# Your continuous data (e.g., measurements, prices)
data = np.random.gamma(2, 2, 1000)
# Create fitter (continuous is default)
fitter = DistributionFitter(data)
results = fitter.fit()
# Get best distribution
best = fitter.get_best_distribution()
print(f"Best fit: {best['distribution']}")
print(f"Parameters: {best['parameters']}")
print(f"P-value: {best['p_value']:.4f}")
# View summary of all fits
print(fitter.summary())
# Visualize the best fit
fitter.plot_best_fit()
# Compare all distributions
fitter.compare_distributions()
Discrete Data
from bestdist import DistributionFitter
import numpy as np
# Your discrete data (e.g., count data)
data = np.random.poisson(lam=3.5, size=1000)
# Create fitter for discrete distributions
fitter = DistributionFitter(data, dist_type='discrete')
results = fitter.fit()
# Get best distribution
best = fitter.get_best_distribution()
print(f"Best fit: {best['distribution']}")
print(f"Parameters: {best['parameters']}")
Supported Distributions
Continuous Distributions (9 distributions)
- Normal (Gaussian): Symmetric, bell-shaped distribution
- Gamma: Skewed distribution for positive values
- Beta: Bounded [0, 1], flexible shapes
- Weibull: Common in reliability engineering and lifetime analysis
- Lognormal: Right-skewed, for positive data (income, prices)
- Exponential: Memoryless distribution for waiting times
- Uniform: Equal probability across a range
- Cauchy: Heavy-tailed distribution (undefined mean/variance)
- Student-t: Robust to outliers, heavier tails than Normal
Discrete Distributions (4 distributions)
- Poisson: Count data, number of events in fixed interval
- Binomial: Number of successes in fixed trials
- Negative Binomial: Overdispersed count data, failures before successes
- Geometric: Number of trials until first success
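As a rough illustration of the "overdispersed" label on the Negative Binomial entry above, the sketch below compares the variance-to-mean ratio of two simulated count samples. It uses only the standard library. The helpers `poisson_draw` and `dispersion_index` are hypothetical names for this example and are not part of bestdist. A ratio near 1 is consistent with Poisson data, while a ratio well above 1 suggests trying the Negative Binomial.

```python
import math
import random
from statistics import fmean, variance


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def dispersion_index(counts):
    """Sample variance divided by sample mean."""
    return variance(counts) / fmean(counts)


rng = random.Random(1)

# Plain Poisson counts with a fixed rate of 3.5
poisson_counts = [poisson_draw(rng, 3.5) for _ in range(2000)]

# Gamma-mixed Poisson counts: the rate itself varies (mean 2.0 * 1.75 = 3.5),
# which yields negative binomial counts with extra variance
mixed_counts = [poisson_draw(rng, rng.gammavariate(2.0, 1.75)) for _ in range(2000)]

di_poisson = dispersion_index(poisson_counts)
di_mixed = dispersion_index(mixed_counts)
print(f"Poisson: {di_poisson:.2f}, gamma-mixed: {di_mixed:.2f}")
```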
Coming Soon
- Chi-square
- F-distribution
- Pareto
Advanced Usage
Custom Distribution List
from bestdist import DistributionFitter
from bestdist.distributions.continuous import (
    Normal, Gamma, Lognormal, Exponential
)
from bestdist.distributions.discrete import (
    Poisson, Binomial, NegativeBinomial
)
# Continuous: only fit specific distributions
fitter = DistributionFitter(
    continuous_data,
    distributions=[Normal, Gamma, Lognormal, Exponential]
)
results = fitter.fit()
# Discrete: only fit specific distributions
fitter = DistributionFitter(
    count_data,
    dist_type='discrete',
    distributions=[Poisson, NegativeBinomial]
)
results = fitter.fit()
Selection Criteria
# Select best by different criteria
best_pvalue = fitter.get_best_distribution(criterion='p_value')
best_aic = fitter.get_best_distribution(criterion='aic')
best_bic = fitter.get_best_distribution(criterion='bic')
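For readers unfamiliar with the information criteria named above, the following sketch computes AIC and BIC by hand for two candidates, using the standard definitions AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L. It relies only on the standard library and closed-form maximum-likelihood estimates. How bestdist computes these values internally is not documented here, so treat the helpers `aic` and `bic` as illustrative assumptions. Lower values indicate the preferred model.

```python
import math
import random
from statistics import fmean, pvariance


def aic(loglik, k):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * k - 2 * loglik


def bic(loglik, k, n):
    """Bayesian information criterion: k ln n - 2 ln L (lower is better)."""
    return k * math.log(n) - 2 * loglik


rng = random.Random(2)
data = [rng.expovariate(0.5) for _ in range(1000)]  # exponential data, true mean 2
n = len(data)

# Maximised log-likelihoods from the closed-form MLE of each candidate
mean = fmean(data)
loglik_expon = -n * (math.log(mean) + 1)                    # k = 1 (scale)
var = pvariance(data)
loglik_norm = -0.5 * n * (math.log(2 * math.pi * var) + 1)  # k = 2 (mean, std)

scores = {
    "exponential": (aic(loglik_expon, 1), bic(loglik_expon, 1, n)),
    "normal": (aic(loglik_norm, 2), bic(loglik_norm, 2, n)),
}
best_by_aic = min(scores, key=lambda name: scores[name][0])
best_by_bic = min(scores, key=lambda name: scores[name][1])
print(f"Best by AIC: {best_by_aic}, best by BIC: {best_by_bic}")
```

BIC penalises each extra parameter by ln n rather than 2, so with large samples it leans toward simpler models more strongly than AIC does.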
Individual Distribution Usage
# CONTINUOUS DISTRIBUTIONS
from bestdist.distributions.continuous import Normal, Lognormal, Exponential
import numpy as np
# Example 1: Normal distribution
data = np.random.normal(5, 2, 1000)
dist = Normal(data)
params = dist.fit()
print(f"Mean: {dist.mean:.2f}, Std: {dist.std:.2f}")
# Example 2: Lognormal (income data)
income_data = np.random.lognormal(mean=10.5, sigma=0.8, size=1000)
lognormal = Lognormal(income_data)
lognormal.fit()
print(f"Mean income: ${lognormal.mean:,.2f}")
print(f"Median income: ${lognormal.median:,.2f}")
# DISCRETE DISTRIBUTIONS
from bestdist.distributions.discrete import Poisson, Binomial
# Example 3: Poisson (count data)
count_data = np.random.poisson(lam=3.5, size=1000)
poisson = Poisson(count_data)
poisson.fit()
print(f"Lambda (rate): {poisson.mu:.4f}")
print(f"P(X=5) = {poisson.pmf(5):.4f}")
# Example 4: Binomial (success/failure)
trials_data = np.random.binomial(n=10, p=0.3, size=1000)
binomial = Binomial(trials_data)
binomial.fit()
print(f"n (trials): {binomial.n}, p (success): {binomial.p:.4f}")
# Generate samples, evaluate PDF/CDF/PMF
samples = dist.rvs(size=100, random_state=42)
x = np.linspace(0, 10, 100)
pdf_values = dist.pdf(x) # For continuous
cdf_values = dist.cdf(x)
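The lognormal income example above prints both a mean and a median, and the two differ substantially for right-skewed data. As a worked check under the same parameters (mu = 10.5 and sigma = 0.8 for the underlying normal), the median is exp(mu) and the mean is exp(mu + sigma^2 / 2). The sketch below uses only the standard library and does not call bestdist.

```python
import math
import random
from statistics import fmean, median

mu, sigma = 10.5, 0.8  # parameters of the underlying normal, as in the example

theoretical_median = math.exp(mu)
theoretical_mean = math.exp(mu + sigma**2 / 2)

# Compare against a simulated sample
rng = random.Random(4)
draws = [rng.lognormvariate(mu, sigma) for _ in range(20000)]
sample_median = median(draws)
sample_mean = fmean(draws)

print(f"median: theory {theoretical_median:,.0f}, sample {sample_median:,.0f}")
print(f"mean:   theory {theoretical_mean:,.0f}, sample {sample_mean:,.0f}")
```

The mean exceeds the median by a factor of exp(sigma^2 / 2), which is why the median is often the more representative summary for income-like data.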
Working with Pandas
import pandas as pd
from bestdist import DistributionFitter
# Load data
df = pd.read_csv('data.csv')
# Fit distribution to a column
fitter = DistributionFitter(df['column_name'])
results = fitter.fit()
best = fitter.get_best_distribution()
# Get summary as DataFrame
summary_df = fitter.summary()
print(summary_df)
Custom Distributions
from bestdist.core.base import BaseDistribution
from scipy.stats import expon, rv_continuous
from typing import Tuple
class Exponential(BaseDistribution):
    """Custom exponential distribution."""
    def _get_scipy_dist(self) -> rv_continuous:
        return expon
    def _extract_params(self, fit_result: Tuple) -> dict:
        return {
            'loc': float(fit_result[0]),
            'scale': float(fit_result[1])
        }
# Use your custom distribution
fitter = DistributionFitter(data, distributions=[Exponential])
results = fitter.fit()
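The `_extract_params` method above maps a two-element fit result onto `loc` and `scale`, which matches the `(loc, scale)` tuple returned by `scipy.stats.expon.fit`. To show what those two numbers mean, here is a standard-library sketch of the closed-form maximum-likelihood estimates for a shifted exponential. The function name `fit_shifted_exponential` is hypothetical and is not part of bestdist or scipy.

```python
import random
from statistics import fmean


def fit_shifted_exponential(sample):
    """Closed-form MLE for f(x) = exp(-(x - loc) / scale) / scale, x >= loc."""
    loc = min(sample)             # the smallest observation estimates the shift
    scale = fmean(sample) - loc   # mean distance above the shift
    return {"loc": loc, "scale": scale}


rng = random.Random(3)
# Simulated data with true loc = 1.0 and true scale = 2.0
sample = [1.0 + rng.expovariate(1 / 2.0) for _ in range(2000)]

params = fit_shifted_exponential(sample)
print(f"loc = {params['loc']:.3f}, scale = {params['scale']:.3f}")
```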
API Reference
DistributionFitter
Main class for fitting multiple distributions.
Parameters:
- data: Array-like data to fit
- dist_type: 'continuous' (default) or 'discrete', as used in Quick Start
- distributions: List of distribution classes (default: all available)
- method: Goodness-of-fit test method ('ks', 'ad', 'chi2')
Methods:
- fit(verbose=True): Fit all distributions
- get_best_distribution(criterion='p_value'): Get best fit
- summary(top_n=None): Get summary DataFrame
- plot_best_fit(bins=30): Plot best fit distribution
- compare_distributions(): Compare all fits
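The 'chi2' method listed in the parameters refers to a chi-square goodness-of-fit test. As a rough sketch of the underlying statistic for count data (not bestdist's implementation), the code below bins simulated Poisson counts and computes Pearson's sum of (observed - expected)^2 / expected. The helpers `poisson_draw`, `poisson_pmf`, and `chi2_statistic`, and the choice of a tail bin at 9, are assumptions made for this example.

```python
import math
import random
from collections import Counter
from statistics import fmean


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def poisson_pmf(k, lam):
    return math.exp(-lam) * lam**k / math.factorial(k)


def chi2_statistic(counts, lam, max_k):
    """Pearson statistic over bins 0..max_k-1 plus one tail bin for >= max_k."""
    n = len(counts)
    observed = Counter(min(c, max_k) for c in counts)
    probs = [poisson_pmf(k, lam) for k in range(max_k)]
    probs.append(1.0 - sum(probs))  # tail bin: P(X >= max_k)
    return sum(
        (observed.get(k, 0) - n * p) ** 2 / (n * p) for k, p in enumerate(probs)
    )


rng = random.Random(5)
counts = [poisson_draw(rng, 3.5) for _ in range(1000)]

stat_fitted = chi2_statistic(counts, fmean(counts), max_k=9)  # rate fitted to data
stat_wrong = chi2_statistic(counts, 6.0, max_k=9)             # deliberately wrong rate
print(f"fitted rate: {stat_fitted:.1f}, wrong rate: {stat_wrong:.1f}")
```

A small statistic means the observed bin counts sit close to what the candidate distribution predicts. Converting it to a p-value requires the chi-square distribution with the appropriate degrees of freedom, which this sketch leaves out.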
BaseDistribution
Abstract base class for distributions.
Methods:
- fit(): Fit distribution to data
- test_goodness_of_fit(method='ks'): Perform GOF test
- pdf(x): Probability density function
- cdf(x): Cumulative distribution function
- ppf(q): Percent point function (inverse CDF)
- rvs(size, random_state): Generate random samples
- get_info(): Get distribution information
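The relationship between `cdf` and `ppf` can be illustrated with the standard library's `statistics.NormalDist`, used here as a stand-in for a fitted distribution. Its `inv_cdf` method plays the role of `ppf`. This is a sketch of the concept, not a call into bestdist.

```python
from statistics import NormalDist

dist = NormalDist(mu=5.0, sigma=2.0)

q = 0.95
x95 = dist.inv_cdf(q)      # the "ppf": value below which 95% of the mass lies
roundtrip = dist.cdf(x95)  # applying the CDF recovers q

print(f"95th percentile: {x95:.3f}, cdf(ppf(q)) = {roundtrip:.3f}")
```

Because the percent point function inverts the CDF, it is the natural tool for turning a fitted distribution into quantiles such as a median or an upper threshold.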
Testing
Run the test suite:
# Run all tests
pytest
# Run with coverage
pytest --cov=pdist --cov-report=html
# Run specific test file
pytest tests/test_distributions/test_normal.py
Development
Setup Development Environment
# Clone repository
git clone https://github.com/yourusername/pdist.git
cd pdist
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install in development mode
pip install -e ".[dev]"
# Install pre-commit hooks
pre-commit install
Code Quality
# Format code
black src tests
# Sort imports
isort src tests
# Lint
flake8 src tests
# Type checking
mypy src
Project Structure
pdist/
├── src/pdist/
│   ├── __init__.py
│   ├── core/
│   │   ├── base.py # Abstract base class
│   │   └── fitter.py # Main fitter
│   ├── distributions/
│   │   └── continuous/
│   │       ├── normal.py
│   │       ├── gamma.py
│   │       ├── beta.py
│   │       └── weibull.py
│   └── utils/
│       ├── exceptions.py
│       └── types.py
├── tests/
│   ├── test_distributions/
│   ├── test_core/
│   └── conftest.py
├── pyproject.toml
└── README.md
Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Make your changes
- Run tests (pytest)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Citation
If you use this package in your research, please cite:
@software{bestdist2024,
author = {Sepulveda, Wilmar},
title = {bestdist: Find the best probability distribution for your data},
year = {2024},
url = {https://github.com/Wilmar3752/pdist}
}
Roadmap
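- Add more distributions (lognormal and exponential are already available; see Coming Soon for the next ones)
- Support for discrete distributions (already available; see Supported Distributions)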
- Parallel fitting for large datasets
- GUI/Web interface
- Integration with scikit-learn
- Bayesian model selection
- Mixture distributions
Acknowledgments
- Built with scipy and numpy
- Inspired by the need for easy distribution fitting in data science workflows
Contact
- GitHub: @Wilmar3752
- Email: wilmar.sepulveda2@gmail.com
Made with ❤️