spark-bestfit
Modern distribution fitting library with pluggable backends (Spark, Ray, Local)
Fit ~90 continuous and 16 discrete scipy.stats distributions to your data in parallel. Supports Apache Spark for production clusters, Ray for ML workflows, and local execution for development.
Features
- Parallel Processing: Spark, Ray, or local thread backends
- ~90 Continuous + 16 Discrete Distributions
- Multiple Metrics: K-S, A-D, SSE, AIC, BIC
- Bounded Fitting: Truncated distributions with natural bounds
- Heavy-Tail Detection: Warns when data may need special handling
- Gaussian Copula: Correlated multi-column sampling
- Model Serialization: Save/load to JSON or pickle
- FitterConfig Builder: Fluent API for complex configurations
Full feature list at spark-bestfit.readthedocs.io
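To make the metrics above concrete, here is a minimal sketch of what fitting and scoring a single candidate distribution involves, written against plain NumPy and SciPy. It is not spark-bestfit's implementation: the helper `score_fit`, the bin count, and the three candidate distributions are assumptions chosen for illustration, and the A-D statistic is left out.

```python
import numpy as np
from scipy import stats


def score_fit(data, dist, bins=50):
    """Fit one scipy.stats continuous distribution by MLE and score it.

    Hypothetical helper for illustration; not part of spark-bestfit.
    """
    params = dist.fit(data)  # maximum-likelihood parameter estimates
    k = len(params)
    n = len(data)

    # Kolmogorov-Smirnov statistic against the fitted CDF
    ks_stat = float(stats.kstest(data, dist.cdf, args=params).statistic)

    # Sum of squared errors between the histogram density and the fitted PDF
    density, edges = np.histogram(data, bins=bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    sse = float(np.sum((density - dist.pdf(centers, *params)) ** 2))

    # Information criteria from the maximised log-likelihood
    log_lik = float(np.sum(dist.logpdf(data, *params)))
    aic = 2 * k - 2 * log_lik
    bic = k * np.log(n) - 2 * log_lik
    return {"ks": ks_stat, "sse": sse, "aic": aic, "bic": bic}


rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=10, size=2_000)

candidates = {"norm": stats.norm, "expon": stats.expon, "uniform": stats.uniform}
scores = {name: score_fit(sample, dist) for name, dist in candidates.items()}

# Lower is better for every metric here; rank by K-S as the Quick Start does
best_name = min(scores, key=lambda name: scores[name]["ks"])
print(best_name, scores[best_name])
```

Each candidate is scored independently of the others, so a loop like this can be spread across workers one distribution at a time.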
Installation
pip install spark-bestfit # Core (BYO Spark)
pip install "spark-bestfit[spark]" # With PySpark
pip install "spark-bestfit[ray]" # With Ray
pip install "spark-bestfit[plotting]" # With visualization
Quick Start
from spark_bestfit import DistributionFitter
import numpy as np
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
data = np.random.normal(loc=50, scale=10, size=10_000)
df = spark.createDataFrame([(float(x),) for x in data], ["value"])
fitter = DistributionFitter(spark)
results = fitter.fit(df, column="value")
best = results.best(n=1)[0]
print(f"Best: {best.distribution} (KS={best.ks_statistic:.4f})")
Without Spark:
from spark_bestfit import DistributionFitter, LocalBackend
import numpy as np
import pandas as pd
df = pd.DataFrame({"value": np.random.normal(50, 10, 1000)})
fitter = DistributionFitter(backend=LocalBackend())
results = fitter.fit(df, column="value")
Backends
| Backend | Use Case | Install |
|---|---|---|
| SparkBackend | Production clusters, 100M+ rows | [spark] or BYO |
| LocalBackend | Development, testing | Included |
| RayBackend | Ray clusters, ML pipelines | [ray] |
See Backend Guide for configuration details.
Compatibility
| Spark | Python | NumPy |
|---|---|---|
| 3.5.x | 3.11-3.12 | < 2.0 |
| 4.x | 3.12-3.13 | 2.0+ |
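The table can be turned into a quick pre-install check. The sketch below encodes the two rows as written; `supported_stack` is a hypothetical helper, not something the package provides, and it only compares major and minor version numbers.

```python
def supported_stack(spark: str, python: str, numpy: str) -> bool:
    """Return True if the versions match a row of the compatibility table.

    Hypothetical helper for illustration; not part of spark-bestfit.
    """

    def major_minor(version: str) -> tuple[int, int]:
        parts = version.split(".")
        return int(parts[0]), int(parts[1])

    spark_v = major_minor(spark)
    py_v = major_minor(python)
    np_v = major_minor(numpy)

    if spark_v == (3, 5):
        # Spark 3.5.x: Python 3.11-3.12, NumPy < 2.0
        return py_v in {(3, 11), (3, 12)} and np_v < (2, 0)
    if spark_v[0] == 4:
        # Spark 4.x: Python 3.12-3.13, NumPy 2.0+
        return py_v in {(3, 12), (3, 13)} and np_v >= (2, 0)
    return False


print(supported_stack("3.5.1", "3.11.9", "1.26.4"))
print(supported_stack("4.0.0", "3.12.1", "1.26.4"))
```

The second call fails the check because the Spark 4.x row requires NumPy 2.0 or later.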
Documentation
Full documentation at spark-bestfit.readthedocs.io.
Contributing
Contributions welcome! See Contributing Guide.
License
MIT License - see LICENSE for details.