spark-bestfit
Modern Spark distribution fitting library with efficient parallel processing
Efficiently fit ~100 scipy.stats distributions to your data using Spark's parallel processing with optimized Pandas UDFs and broadcast variables.
Features
- Parallel Processing: Fits distributions in parallel using Spark
- ~100 Continuous Distributions: Access to nearly all scipy.stats continuous distributions
- 16 Discrete Distributions: Fit count data with Poisson, negative binomial, geometric, and more
- Histogram-Based Fitting: Efficient fitting using histogram representation
- Multiple Metrics: Compare fits using K-S statistic, SSE, AIC, and BIC
- Statistical Validation: Kolmogorov-Smirnov test with p-values for goodness-of-fit
- Results API: Filter, sort, and export results easily
- Visualization: Built-in plotting for distribution comparison and Q-Q plots
- Flexible Configuration: Customize bins, sampling, and distribution selection
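The histogram-based approach above can be sketched with plain NumPy/SciPy, independent of the library's internals (the function and distribution names below are illustrative, not spark-bestfit's API): the data is binned once, then each candidate scipy.stats distribution is fit and scored against the same histogram.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
data = rng.normal(loc=50, scale=10, size=10_000)

# Represent the data once as a density-normalized histogram,
# so every candidate distribution is scored against the same bins.
counts, edges = np.histogram(data, bins=100, density=True)
centers = (edges[:-1] + edges[1:]) / 2

def sse_for(dist_name):
    """Fit one scipy.stats distribution and score it by SSE against the histogram."""
    dist = getattr(stats, dist_name)
    params = dist.fit(data)
    pdf = dist.pdf(centers, *params)
    return float(np.sum((counts - pdf) ** 2))

scores = {name: sse_for(name) for name in ["norm", "lognorm", "expon"]}
print(min(scores, key=scores.get))
```

spark-bestfit parallelizes this loop over candidate distributions across the Spark cluster, broadcasting the histogram so each task fits and scores one distribution.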
Installation
pip install spark-bestfit
This installs spark-bestfit without PySpark. You are responsible for providing a compatible Spark environment (see Compatibility Matrix below).
With PySpark included (for users without a managed Spark environment):
pip install spark-bestfit[spark]
Quick Start
from spark_bestfit import DistributionFitter
import numpy as np
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Generate sample data
data = np.random.normal(loc=50, scale=10, size=10_000)
# Create fitter
fitter = DistributionFitter(spark)
df = spark.createDataFrame([(float(x),) for x in data], ["value"])
# Fit distributions
results = fitter.fit(df, column="value")
# Get best fit (by K-S statistic, the default)
best = results.best(n=1)[0]
print(f"Best: {best.distribution} (KS={best.ks_statistic:.4f}, p={best.pvalue:.4f})")
# Plot
fitter.plot(best, df, "value", title="Best Fit Distribution")
Compatibility Matrix
| Spark Version | Python Versions | NumPy | Pandas | PyArrow |
|---|---|---|---|---|
| 3.5.x | 3.11, 3.12 | 1.24+ (< 2.0) | 1.5+ | 12.0 - 16.x |
| 4.x | 3.12, 3.13 | 2.0+ | 2.2+ | 17.0+ |
Note: Spark 3.5.x does not support NumPy 2.0. If using Spark 3.5 with Python 3.12, ensure setuptools is installed (it provides distutils).
API Overview
Fitting Distributions
from spark_bestfit import DistributionFitter
fitter = DistributionFitter(spark, random_seed=123)
results = fitter.fit(
df,
column="value",
bins=100, # Number of histogram bins
support_at_zero=True, # Only fit non-negative distributions
enable_sampling=True, # Enable adaptive sampling
sample_fraction=0.3, # Sample 30% of data
max_distributions=50, # Limit distributions to fit
)
Working with Results
# Get top 5 distributions (by K-S statistic, the default)
top_5 = results.best(n=5)
# Get best by other metrics
best_sse = results.best(n=1, metric="sse")[0]
best_aic = results.best(n=1, metric="aic")[0]
# Filter by goodness-of-fit
good_fits = results.filter(ks_threshold=0.05) # K-S statistic < 0.05
significant = results.filter(pvalue_threshold=0.05) # p-value > 0.05
# Convert to pandas for analysis
df_pandas = results.df.toPandas()
# Use fitted distribution
samples = best.sample(size=10000) # Generate samples
pdf_values = best.pdf(x_array) # Evaluate PDF
cdf_values = best.cdf(x_array) # Evaluate CDF
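The aic and bic metrics follow the standard definitions (AIC = 2k - 2 ln L, BIC = k ln n - 2 ln L, where k is the number of fitted parameters and L the maximized likelihood). A minimal sketch of computing them for a fitted scipy.stats distribution, not the library's internal code:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=10_000)

# Fit a candidate distribution by maximum likelihood.
params = stats.norm.fit(data)
loglik = np.sum(stats.norm.logpdf(data, *params))

k = len(params)  # number of estimated parameters (loc, scale)
n = len(data)

aic = 2 * k - 2 * loglik          # AIC = 2k - 2 ln(L)
bic = k * np.log(n) - 2 * loglik  # BIC = k ln(n) - 2 ln(L)
print(aic, bic)
```

Lower is better for both; BIC applies a stronger complexity penalty than AIC once n exceeds about 7.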
Custom Plotting
fitter.plot(
best,
df,
"value",
figsize=(16, 10),
dpi=300,
histogram_alpha=0.6,
pdf_linewidth=3,
title="Distribution Fit",
xlabel="Value",
ylabel="Density",
save_path="output/distribution.png",
)
Q-Q Plots
# Create Q-Q plot for goodness-of-fit assessment
fitter.plot_qq(
best,
df,
"value",
max_points=1000, # Sample size for plotting
title="Q-Q Plot",
save_path="output/qq_plot.png",
)
Discrete Distributions
For count data (integers), use DiscreteDistributionFitter:
from spark_bestfit import DiscreteDistributionFitter
import numpy as np
# Generate count data
data = np.random.poisson(lam=7, size=10_000)
df = spark.createDataFrame([(int(x),) for x in data], ["counts"])
# Fit discrete distributions
fitter = DiscreteDistributionFitter(spark)
results = fitter.fit(df, column="counts")
# Get best fit - use AIC for model selection (recommended for discrete)
best = results.best(n=1, metric="aic")[0]
print(f"Best: {best.distribution} (AIC={best.aic:.2f})")
# Plot fitted PMF
fitter.plot(best, df, "counts", title="Best Discrete Fit")
Metric Selection for Discrete Distributions:
| Metric | Use Case |
|---|---|
| aic | Recommended: proper model selection criterion with complexity penalty |
| bic | Similar to AIC but stronger penalty for complex models |
| ks_statistic | Valid for ranking fits, but p-values are not reliable for discrete data |
| sse | Simple comparison metric |
Note: The K-S test assumes continuous distributions. For discrete data, the K-S statistic can still rank fits, but p-values are conservative and should not be used for hypothesis testing. Use AIC/BIC for proper model selection.
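To illustrate why AIC works where discrete K-S p-values do not, here is a hedged sketch comparing two discrete candidates on Poisson-generated counts using SciPy directly (spark-bestfit computes its own metrics; the MLE formulas below are standard):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.poisson(lam=7, size=10_000)

def aic(loglik, k):
    return 2 * k - 2 * loglik

# Poisson: the MLE of the rate is the sample mean.
lam = data.mean()
aic_pois = aic(np.sum(stats.poisson.logpmf(data, lam)), k=1)

# Geometric, shifted with loc=-1 so its support is {0, 1, 2, ...}
# to match count data that may contain zeros; MLE p = 1 / (mean + 1).
p = 1.0 / (data.mean() + 1)
aic_geom = aic(np.sum(stats.geom.logpmf(data, p, loc=-1)), k=1)

print(aic_pois < aic_geom)  # Poisson should win on Poisson-generated data
```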
Excluding Distributions
from spark_bestfit import DistributionFitter, DEFAULT_EXCLUDED_DISTRIBUTIONS
# View default exclusions
print(DEFAULT_EXCLUDED_DISTRIBUTIONS)
# Include a specific distribution by removing it from exclusions
exclusions = tuple(d for d in DEFAULT_EXCLUDED_DISTRIBUTIONS if d != "wald")
fitter = DistributionFitter(spark, excluded_distributions=exclusions)
# Or exclude nothing (fit all distributions - may be slow)
fitter = DistributionFitter(spark, excluded_distributions=())
Documentation
Full documentation is available at spark-bestfit.readthedocs.io.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feat/amazing-feature)
- Commit your changes (git commit -m 'feat: add amazing feature')
- Push to the branch (git push origin feat/amazing-feature)
- Open a Pull Request
License
This project is licensed under the MIT License - see the LICENSE file for details.