spark-bestfit

Modern Spark distribution fitting library with efficient parallel processing

Efficiently fit ~100 scipy.stats distributions to your data using Spark's parallel processing with optimized Pandas UDFs and broadcast variables.

Features

  • Parallel Processing: Distributes fitting across Spark executors using Pandas UDFs
  • ~100 Continuous Distributions: Access to nearly all scipy.stats continuous distributions
  • 16 Discrete Distributions: Fit count data with Poisson, negative binomial, geometric, and more
  • Histogram-Based Fitting: Efficient fitting using histogram representation
  • Multiple Metrics: Compare fits using K-S statistic, SSE, AIC, and BIC
  • Statistical Validation: Kolmogorov-Smirnov test with p-values for goodness-of-fit
  • Results API: Filter, sort, and export results easily
  • Visualization: Built-in plotting for distribution comparison and Q-Q plots
  • Flexible Configuration: Customize bins, sampling, and distribution selection
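
The histogram-based fitting listed above can be sketched in plain scipy/numpy: instead of scoring each candidate distribution against every raw data point, fits are evaluated against a binned representation of the data. The sketch below is illustrative only and does not reproduce spark-bestfit's internals; the function name and the SSE scoring are assumptions for the example.

```python
import numpy as np
from scipy import stats

def histogram_fit_sse(data, dist_name, bins=100):
    """Fit one scipy.stats distribution locally and score it against a histogram.

    Illustrative sketch: spark-bestfit parallelizes this step across
    distributions on the cluster; here we do a single local fit.
    """
    counts, edges = np.histogram(data, bins=bins, density=True)
    centers = (edges[:-1] + edges[1:]) / 2

    dist = getattr(stats, dist_name)
    params = dist.fit(data)                    # MLE parameter estimates
    pdf = dist.pdf(centers, *params)           # model PDF at bin centers
    return float(np.sum((counts - pdf) ** 2))  # sum of squared errors

rng = np.random.default_rng(0)
data = rng.normal(50, 10, size=5_000)
sse_norm = histogram_fit_sse(data, "norm")
sse_expon = histogram_fit_sse(data, "expon")
assert sse_norm < sse_expon  # the normal model should fit normal data better
```

Because the histogram has a fixed number of bins, the per-distribution scoring cost stays flat as the dataset grows, which is what makes broadcasting the binned data to executors cheap.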

Installation

pip install spark-bestfit

This installs spark-bestfit without PySpark. You are responsible for providing a compatible Spark environment (see Compatibility Matrix below).

With PySpark included (for users without a managed Spark environment):

pip install spark-bestfit[spark]

Quick Start

from spark_bestfit import DistributionFitter
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Generate sample data
data = np.random.normal(loc=50, scale=10, size=10_000)

# Create fitter
fitter = DistributionFitter(spark)
df = spark.createDataFrame([(float(x),) for x in data], ["value"])

# Fit distributions
results = fitter.fit(df, column="value")

# Get best fit (by K-S statistic, the default)
best = results.best(n=1)[0]
print(f"Best: {best.distribution} (KS={best.ks_statistic:.4f}, p={best.pvalue:.4f})")

# Plot
fitter.plot(best, df, "value", title="Best Fit Distribution")

Compatibility Matrix

| Spark Version | Python Versions | NumPy         | Pandas | PyArrow     |
|---------------|-----------------|---------------|--------|-------------|
| 3.5.x         | 3.11, 3.12      | 1.24+ (< 2.0) | 1.5+   | 12.0 - 16.x |
| 4.x           | 3.12, 3.13      | 2.0+          | 2.2+   | 17.0+       |

Note: Spark 3.5.x does not support NumPy 2.0. If using Spark 3.5 with Python 3.12, ensure setuptools is installed (provides distutils).
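
If you want to sanity-check an installed environment against the Spark 3.5.x row above, a minimal runtime check might look like the following. The version bound is taken from the table; the helper name is made up for illustration.

```python
from importlib.metadata import version, PackageNotFoundError

def check_spark35_env():
    """Rough check of the NumPy constraint from the Spark 3.5.x row."""
    try:
        numpy_ver = version("numpy")
    except PackageNotFoundError:
        return "numpy not installed"
    major = int(numpy_ver.split(".")[0])
    if major >= 2:
        return f"numpy {numpy_ver} is too new for Spark 3.5.x (needs < 2.0)"
    return f"numpy {numpy_ver} satisfies the Spark 3.5.x bound (< 2.0)"

print(check_spark35_env())
```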

API Overview

Fitting Distributions

from spark_bestfit import DistributionFitter

fitter = DistributionFitter(spark, random_seed=123)
results = fitter.fit(
    df,
    column="value",
    bins=100,                    # Number of histogram bins
    support_at_zero=True,        # Only fit non-negative distributions
    enable_sampling=True,        # Enable adaptive sampling
    sample_fraction=0.3,         # Sample 30% of data
    max_distributions=50,        # Limit distributions to fit
)

Working with Results

# Get top 5 distributions (by K-S statistic, the default)
top_5 = results.best(n=5)

# Get best by other metrics
best_sse = results.best(n=1, metric="sse")[0]
best_aic = results.best(n=1, metric="aic")[0]

# Filter by goodness-of-fit
good_fits = results.filter(ks_threshold=0.05)        # K-S statistic < 0.05
significant = results.filter(pvalue_threshold=0.05)  # p-value > 0.05

# Convert to pandas for analysis
df_pandas = results.df.toPandas()

# Use fitted distribution
samples = best.sample(size=10000)  # Generate samples
pdf_values = best.pdf(x_array)     # Evaluate PDF
cdf_values = best.cdf(x_array)     # Evaluate CDF

Custom Plotting

fitter.plot(
    best,
    df,
    "value",
    figsize=(16, 10),
    dpi=300,
    histogram_alpha=0.6,
    pdf_linewidth=3,
    title="Distribution Fit",
    xlabel="Value",
    ylabel="Density",
    save_path="output/distribution.png",
)

Q-Q Plots

# Create Q-Q plot for goodness-of-fit assessment
fitter.plot_qq(
    best,
    df,
    "value",
    max_points=1000,           # Sample size for plotting
    title="Q-Q Plot",
    save_path="output/qq_plot.png",
)
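
Under the hood, a Q-Q plot compares empirical sample quantiles against the theoretical quantiles of the fitted distribution; a good fit lies close to the 45-degree line. A minimal sketch of that computation with plain scipy, independent of spark-bestfit's `plot_qq`:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(50, 10, size=2_000)

# Fit a normal distribution, then compute matching quantile pairs
loc, scale = stats.norm.fit(sample)
probs = (np.arange(1, 1001) - 0.5) / 1000               # plotting positions
sample_q = np.quantile(sample, probs)                    # empirical quantiles
theory_q = stats.norm.ppf(probs, loc=loc, scale=scale)   # model quantiles

# For a good fit the points (theory_q, sample_q) hug the line y = x
max_dev = np.max(np.abs(sample_q - theory_q))
assert max_dev < 5 * scale  # loose sanity bound for illustration
```

Plotting `sample_q` against `theory_q` (plus the reference line) reproduces the essence of what `plot_qq` draws; `max_points` presumably bounds how many such quantile pairs are rendered.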

Discrete Distributions

For count data (integers), use DiscreteDistributionFitter:

from spark_bestfit import DiscreteDistributionFitter
import numpy as np

# Generate count data
data = np.random.poisson(lam=7, size=10_000)
df = spark.createDataFrame([(int(x),) for x in data], ["counts"])

# Fit discrete distributions
fitter = DiscreteDistributionFitter(spark)
results = fitter.fit(df, column="counts")

# Get best fit - use AIC for model selection (recommended for discrete)
best = results.best(n=1, metric="aic")[0]
print(f"Best: {best.distribution} (AIC={best.aic:.2f})")

# Plot fitted PMF
fitter.plot(best, df, "counts", title="Best Discrete Fit")

Metric Selection for Discrete Distributions:

| Metric       | Use Case                                                                |
|--------------|-------------------------------------------------------------------------|
| aic          | Recommended: proper model selection criterion with a complexity penalty |
| bic          | Similar to AIC but with a stronger penalty for complex models           |
| ks_statistic | Valid for ranking fits, but p-values are not reliable for discrete data |
| sse          | Simple comparison metric                                                |

Note: The K-S test assumes continuous distributions. For discrete data, the K-S statistic can still rank fits, but p-values are conservative and should not be used for hypothesis testing. Use AIC/BIC for proper model selection.
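
The AIC recommendation above follows the usual definition AIC = 2k - 2 ln(L), where k is the number of fitted parameters and L the maximized likelihood. A hedged sketch of how AIC comparison works for count data, using plain scipy rather than spark-bestfit's implementation (the helper names are made up for the example):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
counts = rng.poisson(lam=7, size=10_000)

def poisson_aic(data):
    """AIC for a Poisson fit: 2*k - 2*log-likelihood, with k = 1 (lambda)."""
    lam = data.mean()                         # MLE of the Poisson rate
    loglik = stats.poisson.logpmf(data, lam).sum()
    return 2 * 1 - 2 * loglik

def geom_aic(data):
    """AIC for a geometric fit, shifting counts onto support {1, 2, ...}."""
    p = 1.0 / (data.mean() + 1.0)             # MLE for geometric on data + 1
    loglik = stats.geom.logpmf(data + 1, p).sum()
    return 2 * 1 - 2 * loglik

# Poisson-generated counts should prefer the Poisson model (lower AIC)
assert poisson_aic(counts) < geom_aic(counts)
```

Because AIC is computed from the likelihood rather than a distance between CDFs, it avoids the K-S test's continuity assumption entirely, which is why it is the safer criterion for discrete fits.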

Excluding Distributions

from spark_bestfit import DistributionFitter, DEFAULT_EXCLUDED_DISTRIBUTIONS

# View default exclusions
print(DEFAULT_EXCLUDED_DISTRIBUTIONS)

# Include a specific distribution by removing it from exclusions
exclusions = tuple(d for d in DEFAULT_EXCLUDED_DISTRIBUTIONS if d != "wald")
fitter = DistributionFitter(spark, excluded_distributions=exclusions)

# Or exclude nothing (fit all distributions - may be slow)
fitter = DistributionFitter(spark, excluded_distributions=())

Documentation

Full documentation is available at spark-bestfit.readthedocs.io.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feat/amazing-feature)
  3. Commit your changes (git commit -m 'feat: add amazing feature')
  4. Push to the branch (git push origin feat/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

