
distfit is a Python library for probability density fitting.

Project description


distfit is a Python package for probability density fitting of univariate distributions for random variables. The distfit library can determine the best fit across more than 90 theoretical distributions. A goodness-of-fit test scores each candidate and, for the best-fitting theoretical distribution, the loc, scale, and arg parameters are returned. It can be used for parametric, non-parametric, and discrete distributions. ⭐️Star it if you like it⭐️

Key Features

| Feature                | Description                                                                                    | Medium | Gumroad+Podcast |
| Parametric Fitting     | Fit distributions on empirical data X.                                                         | Link   | Link            |
| Non-Parametric Fitting | Fit distributions on empirical data X using non-parametric approaches (quantile, percentiles). | -      | -               |
| Multivariate Fitting   | Fit multivariate distributions on empirical data X that contains multiple columns.            | -      | -               |
| Discrete Fitting       | Fit distributions on empirical data X using binomial distribution.                             | -      | -               |
| Predict                | Compute probabilities for response variables y.                                                | -      | -               |
| Outlier Detection      | Detect anomalies using fitted distributions.                                                   | Link   | Link            |
| Synthetic Data         | Generate synthetic data.                                                                       | Link   | Link            |
| Plots                  | Various plotting functionalities.                                                              | -      | -               |

Background

  • For the parametric approach, the distfit library can determine the best fit across 89 theoretical distributions. To score the fit, one of the goodness-of-fit statistics can be used, such as RSS/SSE, Wasserstein, Kolmogorov-Smirnov (KS), or Energy. After finding the best-fitting theoretical distribution, the loc, scale, and arg parameters are returned, for example the mean and standard deviation for the normal distribution.

  • For the non-parametric approach, the distfit library contains two methods: the quantile method and the percentile method. Both methods assume that the data does not follow a specific probability distribution. The quantile method models the quantiles of the data, whereas the percentile method models the percentiles.

  • If the dataset contains discrete values, the distfit library offers discrete fitting. The best fit is then derived using the binomial distribution.
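The RSS scoring idea from the parametric approach can be illustrated with the standard library alone. The snippet below is a minimal sketch and not distfit's implementation: it builds a histogram density of the data, evaluates two candidate PDFs at the bin centers, and keeps the candidate with the lowest residual sum of squares. The helper names `histogram_density` and `rss`, the bin count, and the two candidates are assumptions made for illustration.

```python
import random
from statistics import NormalDist, fmean, pstdev


def histogram_density(data, bins):
    """Return (bin_centers, densities) for an equal-width histogram."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in data:
        # Clamp the maximum value into the last bin.
        idx = min(int((x - lo) / width), bins - 1)
        counts[idx] += 1
    centers = [lo + (i + 0.5) * width for i in range(bins)]
    densities = [c / (len(data) * width) for c in counts]
    return centers, densities


def rss(pdf, centers, densities):
    """Residual sum of squares between empirical and theoretical density."""
    return sum((d - pdf(c)) ** 2 for c, d in zip(centers, densities))


# Synthetic data drawn from a normal distribution (illustrative only).
rng = random.Random(0)
X = [rng.gauss(0, 2) for _ in range(5000)]
centers, densities = histogram_density(X, bins=50)

# Candidate 1: normal distribution with parameters estimated from the data.
normal = NormalDist(fmean(X), pstdev(X))

# Candidate 2: uniform distribution on [min(X), max(X)].
x_min, x_max = min(X), max(X)


def uniform_pdf(x):
    return 1.0 / (x_max - x_min)


scores = {
    "norm": rss(normal.pdf, centers, densities),
    "uniform": rss(uniform_pdf, centers, densities),
}
best = min(scores, key=scores.get)
print(best, scores)
```

A lower RSS means the theoretical density follows the histogram more closely, which is why the candidate with the minimum score is selected.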


Installation

Install distfit from PyPI
pip install distfit
Install from GitHub source
pip install git+https://github.com/erdogant/distfit
Import Library
import distfit
print(distfit.__version__)

# Import library
from distfit import distfit

Examples

Example: Quick start to find best fit for your input data
# [distfit] >INFO> fit
# [distfit] >INFO> transform
# [distfit] >INFO> [norm      ] [0.00 sec] [RSS: 0.00108326] [loc=-0.048 scale=1.997]
# [distfit] >INFO> [expon     ] [0.00 sec] [RSS: 0.404237] [loc=-6.897 scale=6.849]
# [distfit] >INFO> [pareto    ] [0.00 sec] [RSS: 0.404237] [loc=-536870918.897 scale=536870912.000]
# [distfit] >INFO> [dweibull  ] [0.06 sec] [RSS: 0.0115552] [loc=-0.031 scale=1.722]
# [distfit] >INFO> [t         ] [0.59 sec] [RSS: 0.00108349] [loc=-0.048 scale=1.997]
# [distfit] >INFO> [genextreme] [0.17 sec] [RSS: 0.00300806] [loc=-0.806 scale=1.979]
# [distfit] >INFO> [gamma     ] [0.05 sec] [RSS: 0.00108459] [loc=-1862.903 scale=0.002]
# [distfit] >INFO> [lognorm   ] [0.32 sec] [RSS: 0.00121597] [loc=-110.597 scale=110.530]
# [distfit] >INFO> [beta      ] [0.10 sec] [RSS: 0.00105629] [loc=-16.364 scale=32.869]
# [distfit] >INFO> [uniform   ] [0.00 sec] [RSS: 0.287339] [loc=-6.897 scale=14.437]
# [distfit] >INFO> [loggamma  ] [0.12 sec] [RSS: 0.00109042] [loc=-370.746 scale=55.722]
# [distfit] >INFO> Compute confidence intervals [parametric]
# [distfit] >INFO> Compute significance for 9 samples.
# [distfit] >INFO> Multiple test correction method applied: [fdr_bh].
# [distfit] >INFO> Create PDF plot for the parametric method.
# [distfit] >INFO> Mark 5 significant regions
# [distfit] >INFO> Estimated distribution: beta [loc:-16.364265, scale:32.868811]
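To make the selection step in the log above explicit, the following sketch copies the RSS values from the log into a plain dictionary and picks the minimum. The dictionary is transcribed by hand for illustration; it is not an object returned by distfit.

```python
# RSS scores transcribed from the log output above.
rss_scores = {
    "norm": 0.00108326,
    "expon": 0.404237,
    "pareto": 0.404237,
    "dweibull": 0.0115552,
    "t": 0.00108349,
    "genextreme": 0.00300806,
    "gamma": 0.00108459,
    "lognorm": 0.00121597,
    "beta": 0.00105629,
    "uniform": 0.287339,
    "loggamma": 0.00109042,
}

# The best fit is the distribution with the lowest RSS.
best = min(rss_scores, key=rss_scores.get)

# Rank all candidates from best to worst.
ranking = sorted(rss_scores, key=rss_scores.get)
print(best, ranking[:3])
```

The lowest score belongs to beta, which agrees with the "Estimated distribution" line in the log. Note how close norm, t, and gamma are to the winner; with scores this similar, the simpler distribution may be the more practical choice.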

Example: Fit multivariate distributions

The distfit library provides multivariate distribution fitting that enables modeling complex dependencies between multiple variables using copula-based methods.

from distfit import distfit

# Initialize with multivariate mode
dfit = distfit(multivariate=True)

# Load example data
X = dfit.import_example(data='multi_normal')
# X = dfit.import_example(data='multi_t')

# Fit model
dfit.fit_transform(X)

# Access estimated correlation matrix (Gaussian copula)
print(dfit.model.corr)

# Evaluate joint density
results = dfit.evaluate_pdf(X)
print(results['score'])
print(results['copula_density'])

# Generate synthetic samples
Xnew = dfit.generate(n=10)

# Detect multivariate outliers
bool_outliers = dfit.predict_outliers(X)
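For readers unfamiliar with Gaussian copulas, the sketch below shows one common way to estimate a copula correlation using only the standard library: rank-transform each column, map the ranks to standard-normal scores, and compute the Pearson correlation of those scores. This is an illustration of the general technique, not a description of how distfit computes its correlation matrix. The helper names `normal_scores` and `pearson` and the toy data are assumptions.

```python
from statistics import NormalDist


def normal_scores(values):
    """Map values to standard-normal scores through their ranks (no ties assumed)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0] * n
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    std_normal = NormalDist()
    # rank / (n + 1) keeps the pseudo-observations strictly inside (0, 1).
    return [std_normal.inv_cdf(r / (n + 1)) for r in ranks]


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


# Toy data: y is a monotone increasing transform of x, z is a decreasing one.
x = [0.1, 0.5, 1.2, 2.0, 3.3, 4.1]
y = [v ** 3 for v in x]
z = [-v for v in x]

rho_xy = pearson(normal_scores(x), normal_scores(y))
rho_xz = pearson(normal_scores(x), normal_scores(z))
print(rho_xy, rho_xz)
```

Because the scores depend only on ranks, the estimate is unaffected by the shape of each marginal distribution. That separation of marginals from dependence is the point of a copula-based approach.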

Example: Plot summary of the tested distributions

Once the model is fitted, the theoretical distribution can be used to make predictions. Plotting again after predicting automatically includes the predictions.
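The log in the quick-start example mentions computing significance for the samples and applying the fdr_bh correction. The sketch below illustrates that idea with the standard library only and is not distfit's implementation. It takes the loc and scale of the norm row in that log, computes a tail probability for each new value, and adjusts the probabilities with the Benjamini-Hochberg procedure. The response values in `y`, the 0.05 threshold, and the helper `benjamini_hochberg` are hypothetical choices for illustration.

```python
from statistics import NormalDist


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Normal distribution with the loc/scale reported for "norm" in the log.
fitted = NormalDist(mu=-0.048, sigma=1.997)

# Hypothetical response values to score against the fitted distribution.
y = [-8, -6, 0, 1, 2, 3, 4, 5, 6]

# Tail probability: how extreme each value is on its own side of the distribution.
pvalues = [min(fitted.cdf(v), 1.0 - fitted.cdf(v)) for v in y]

adjusted = benjamini_hochberg(pvalues)
alpha = 0.05
significant = [p <= alpha for p in adjusted]

for value, p_adj, flag in zip(y, adjusted, significant):
    print(value, round(p_adj, 5), flag)
```

Values far in either tail receive small adjusted probabilities and are flagged, while values near the center of the distribution are not.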

Example: Make predictions using the fitted distribution

Example: Test for one specific distribution

The full list of distributions is listed here: https://erdogant.github.io/distfit/pages/html/Parametric.html

Example: Test for multiple distributions

The full list of distributions is listed here: https://erdogant.github.io/distfit/pages/html/Parametric.html

Example: Fit discrete distribution
# Import the binomial distribution to generate random numbers
from scipy.stats import binom

# Set parameters for the test-case
n = 8
p = 0.5

# Draw 10000 samples from the binomial distribution with parameters (n, p)
X = binom(n, p).rvs(10000)
print(X)

# [5 1 4 5 5 6 2 4 6 5 4 4 4 7 3 4 4 2 3 3 4 4 5 1 3 2 7 4 5 2 3 4 3 3 2 3 5
#  4 6 7 6 2 4 3 3 5 3 5 3 4 4 4 7 5 4 5 3 4 3 3 4 3 3 6 3 3 5 4 4 2 3 2 5 7
#  5 4 8 3 4 3 5 4 3 5 5 2 5 6 7 4 5 5 5 4 4 3 4 5 6 2...]

# Import distfit
from distfit import distfit

# Initialize for discrete distribution fitting
dfit = distfit(method='discrete')

# Run distfit and determine whether the parameters can be recovered from the data.
dfit.fit_transform(X)

# [distfit] >fit..
# [distfit] >transform..
# [distfit] >Fit using binomial distribution..
# [distfit] >[binomial] [SSE: 7.79] [n: 8] [p: 0.499959] [chi^2: 1.11]
# [distfit] >Compute confidence interval [discrete]
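As a rough illustration of what fitting a binomial distribution involves, the sketch below scans candidate values of n, sets p from the sample mean, and keeps the pair with the lowest sum of squared errors between the empirical frequencies and the binomial PMF. It uses only the standard library and is an assumption-based sketch, not distfit's actual fitting routine. The function `fit_binomial` and its `extra_trials` search range are hypothetical.

```python
from collections import Counter
from math import comb


def binom_pmf(k, n, p):
    """Probability of k successes in n trials with success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def fit_binomial(data, extra_trials=10):
    """Scan candidate n values and return (n, p, sse) with the lowest SSE."""
    total = len(data)
    freq = Counter(data)
    mean = sum(data) / total
    k_max = max(data)
    best = None
    # n cannot be smaller than the largest observed count.
    for n in range(max(k_max, 1), k_max + extra_trials + 1):
        p = mean / n  # method-of-moments estimate for this n
        sse = sum(
            (freq.get(k, 0) / total - binom_pmf(k, n, p)) ** 2
            for k in range(n + 1)
        )
        if best is None or sse < best[2]:
            best = (n, p, sse)
    return best


# Build a sample whose empirical frequencies match Binomial(8, 0.5) exactly:
# each outcome k appears comb(8, k) times, 256 values in total.
data = [k for k in range(9) for _ in range(comb(8, k))]

n_hat, p_hat, sse = fit_binomial(data)
print(n_hat, p_hat, sse)
```

With real random samples the SSE will not reach zero, but the minimum should still sit at or near the true n when the sample is large enough.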

Example: Make predictions on unseen data for discrete distribution

Example: Generate samples based on the fitted distribution
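Generating samples from a fitted distribution amounts to drawing random numbers with the estimated parameters. The sketch below does this with the standard library for a normal distribution, reusing the loc and scale of the norm row in the quick-start log. It is an illustration under those assumptions and does not call distfit itself; the sample size and seed are arbitrary.

```python
from statistics import NormalDist, fmean, stdev

# Normal distribution with the loc/scale reported for "norm" in the quick-start log.
fitted = NormalDist(mu=-0.048, sigma=1.997)

# Draw reproducible synthetic samples from the fitted distribution.
samples = fitted.samples(1000, seed=42)

# The sample statistics should be close to the fitted parameters.
print(len(samples), fmean(samples), stdev(samples))
```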

Contributors

Thank the contributors!

Maintainer

  • Erdogan Taskesen, github: erdogant
  • Contributions are welcome.
  • Yes! This library is entirely free but it runs on coffee! :) Feel free to support with a Coffee.

Download files

Source Distribution

distfit-2.0.1.tar.gz (53.4 kB view details)

Uploaded Source

Built Distribution

distfit-2.0.1-py3-none-any.whl (51.6 kB view details)

Uploaded Python 3

File details

Details for the file distfit-2.0.1.tar.gz.

File metadata

  • Download URL: distfit-2.0.1.tar.gz
  • Upload date:
  • Size: 53.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for distfit-2.0.1.tar.gz
Algorithm Hash digest
SHA256 e2bc40d7dbe16bdbd2a8f684bca62bc4f5a3dc9d7653ae4eabeb9194f12a93e9
MD5 cbd84bd19ea88ceccf26315ac25bbfad
BLAKE2b-256 90ba317dd45bc6b1eaaa230dc765eb5454e3b42493051961fb845b569664befb

File details

Details for the file distfit-2.0.1-py3-none-any.whl.

File metadata

  • Download URL: distfit-2.0.1-py3-none-any.whl
  • Upload date:
  • Size: 51.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for distfit-2.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 72e20482b54f4ae06a6610dc191aed02da0ab2a6068b04f2ecb640859177a5d8
MD5 d1b6382a30165cf7d6bf46cf38c471e1
BLAKE2b-256 f4fba3bceb6eaa73e488c1c40dcf0897ddac6a90d757758ab7ebbaf86b95dcb8
