
spark-bestfit


Modern distribution fitting library with pluggable backends (Spark, Ray, Local)

Efficiently fit ~90 scipy.stats distributions to your data using parallel processing. Supports Apache Spark for production clusters, Ray for ML workflows, or local execution for development.

Features

  • Parallel Processing: Spark, Ray, or local thread backends
  • ~90 Continuous + 16 Discrete Distributions
  • Multiple Metrics: K-S, A-D, SSE, AIC, BIC
  • Bounded Fitting: Truncated distributions with natural bounds
  • Heavy-Tail Detection: Warns when data may need special handling
  • Gaussian Copula: Correlated multi-column sampling
  • Model Serialization: Save/load to JSON or pickle
  • FitterConfig Builder: Fluent API for complex configurations

Full feature list at spark-bestfit.readthedocs.io
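The Gaussian-copula bullet above can be illustrated with plain NumPy/SciPy: sample correlated standard normals, map them through the normal CDF to get correlated uniforms, then apply each column's marginal inverse CDF. This is the general recipe, not spark-bestfit's API; the marginals and correlation below are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Target correlation between the two columns
corr = np.array([[1.0, 0.7],
                 [0.7, 1.0]])
L = np.linalg.cholesky(corr)

# 1) correlated standard normals, 2) map to uniforms, 3) apply marginal PPFs
z = rng.standard_normal((10_000, 2)) @ L.T
u = stats.norm.cdf(z)
col_a = stats.gamma.ppf(u[:, 0], a=2.0)   # gamma marginal
col_b = stats.expon.ppf(u[:, 1])          # exponential marginal

print(np.corrcoef(col_a, col_b)[0, 1])    # positive, near the target
```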

Installation

pip install spark-bestfit              # Core (BYO Spark)
pip install spark-bestfit[spark]       # With PySpark
pip install spark-bestfit[ray]         # With Ray
pip install spark-bestfit[plotting]    # With visualization

Quick Start

from spark_bestfit import DistributionFitter
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = np.random.normal(loc=50, scale=10, size=10_000)
df = spark.createDataFrame([(float(x),) for x in data], ["value"])

fitter = DistributionFitter(spark)
results = fitter.fit(df, column="value")

best = results.best(n=1)[0]
print(f"Best: {best.distribution} (KS={best.ks_statistic:.4f})")

Without Spark:

from spark_bestfit import DistributionFitter, LocalBackend
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": np.random.normal(50, 10, 1000)})
fitter = DistributionFitter(backend=LocalBackend())
results = fitter.fit(df, column="value")
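Under the hood, scoring one candidate distribution boils down to an MLE fit plus goodness-of-fit metrics. A minimal sketch of the K-S and AIC metrics from the feature list, using scipy directly (the general recipe, not spark-bestfit's internals):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=5_000)

# MLE fit of one candidate distribution, then score it
params = stats.norm.fit(data)                     # (loc, scale)
ks_stat, ks_p = stats.kstest(data, "norm", args=params)
loglik = np.sum(stats.norm.logpdf(data, *params))
aic = 2 * len(params) - 2 * loglik                # lower is better

print(f"KS={ks_stat:.4f}, AIC={aic:.1f}")
```

Repeating this for ~90 candidates is embarrassingly parallel, which is what the backends exploit.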

Backends

| Backend      | Use Case                        | Install        |
|--------------|---------------------------------|----------------|
| SparkBackend | Production clusters, 100M+ rows | [spark] or BYO |
| LocalBackend | Development, testing            | Included       |
| RayBackend   | Ray clusters, ML pipelines      | [ray]          |

See Backend Guide for configuration details.

Compatibility

| Spark | Python    | NumPy |
|-------|-----------|-------|
| 3.5.x | 3.11-3.12 | < 2.0 |
| 4.x   | 3.12-3.13 | 2.0+  |
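A quick sanity check of an environment against this matrix (a sketch; the version pairings are taken from the table above):

```python
import sys
import numpy as np

# NumPy < 2.0 pairs with Spark 3.5.x; NumPy >= 2.0 pairs with Spark 4.x
numpy_major = int(np.__version__.split(".")[0])
compatible_spark = "3.5.x" if numpy_major < 2 else "4.x"

py = sys.version_info
print(f"Python {py.major}.{py.minor}, NumPy {np.__version__} "
      f"-> target Spark {compatible_spark}")
```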

Documentation

Full documentation at spark-bestfit.readthedocs.io.

Contributing

Contributions welcome! See Contributing Guide.

License

MIT License - see LICENSE for details.


Download files

Download the file for your platform.

Source Distribution

spark_bestfit-3.0.3.tar.gz (2.9 MB)

Built Distribution


spark_bestfit-3.0.3-py3-none-any.whl (145.1 kB)

File details

Details for the file spark_bestfit-3.0.3.tar.gz.

File metadata

  • Download URL: spark_bestfit-3.0.3.tar.gz
  • Upload date:
  • Size: 2.9 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for spark_bestfit-3.0.3.tar.gz
| Algorithm   | Hash digest |
|-------------|-------------|
| SHA256      | 8e07653666a3ffca69d00cbe4b337295b7cad332c94bd74fb656626db504b8f0 |
| MD5         | ca24b0451cad2e49947c18e98ac6b880 |
| BLAKE2b-256 | f7eb7ef1ff6fe0d1d2513dd8927a2e30e79c7e7576ae2998d481b66a742b0572 |

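To verify a downloaded artifact against the digests above, compute its SHA-256 locally with the standard library (a sketch; the comparison digest in the comment is the SHA256 value published for the sdist):

```python
import hashlib

def sha256_of(path: str) -> str:
    """Stream a file through SHA-256 in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare against the published digest before installing, e.g.:
# sha256_of("spark_bestfit-3.0.3.tar.gz") ==
#   "8e07653666a3ffca69d00cbe4b337295b7cad332c94bd74fb656626db504b8f0"
```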

Provenance

The following attestation bundles were made for spark_bestfit-3.0.3.tar.gz:

Publisher: release.yml on dwsmith1983/spark-bestfit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file spark_bestfit-3.0.3-py3-none-any.whl.

File metadata

  • Download URL: spark_bestfit-3.0.3-py3-none-any.whl
  • Upload date:
  • Size: 145.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for spark_bestfit-3.0.3-py3-none-any.whl
| Algorithm   | Hash digest |
|-------------|-------------|
| SHA256      | e712e64ab4af19d31fb773a6845375be5c5f9fbda82f2d6fb5dc5a85365b6042 |
| MD5         | bd7749d489f6647877d41a662cd7bc7b |
| BLAKE2b-256 | ce02348929032c5820bbbb034d5ba6b41a13811da13f5c4006b8b5dd64010d99 |


Provenance

The following attestation bundles were made for spark_bestfit-3.0.3-py3-none-any.whl:

Publisher: release.yml on dwsmith1983/spark-bestfit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
