
A Python wrapper that analyses DataFrames and applies optimisation techniques to maximise PySpark session performance.

Project description


sparktimise

A PySpark optimisation library that inspects DataFrames and applies targeted performance improvements with minimal user code changes.

Python 3.10–3.12 · License: MIT · Code style: ruff · PySpark

What The Package Does

sparktimise provides two ways to optimise PySpark jobs:

  1. A pipeline context-manager workflow through optimise / SparkPipelineAutoTuner.
  2. A functional workflow through analyse_* and optimise_* functions for explicit control.
| Capability | Description | Primary API |
| --- | --- | --- |
| Partition optimisation | Estimates optimal shuffle partitions and low-cardinality partition candidates | optimise_partitions |
| Skew mitigation | Detects skewed keys and applies salting columns | optimise_skew |
| Cache strategy | Recommends and applies StorageLevel persistence | optimise_cache |
| Spark session tuning | Recommends and applies Spark SQL/session settings | SparkPipelineAutoTuner / optimise_context |
| Broadcast analysis | Profiles table sizes for join strategy advice and optional hints | analyse_broadcast / apply_broadcast_hints |
| Reporting | Summarises pipeline steps and metadata in text or dict form | OptimisationReport |
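The partition row above comes down to simple arithmetic: divide an estimated input size by a target partition size to get a shuffle-partition count. A minimal sketch of that idea (the actual heuristic inside optimise_partitions is not documented here; the 134_217_728-byte target mirrors the value used in the functional example later in this README):

```python
import math


def estimate_shuffle_partitions(total_bytes: int,
                                target_partition_bytes: int = 134_217_728) -> int:
    """Rough shuffle-partition count: size / target, at least 1 partition.

    Illustrative only -- the library's real heuristic may also weigh
    column cardinality and cluster parallelism.
    """
    return max(1, math.ceil(total_bytes / target_partition_bytes))


# A ~10 GiB shuffle at a 128 MiB target partition size:
print(estimate_shuffle_partitions(10 * 1024**3))  # -> 80
```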

Installation

```shell
pip install sparktimise
```

For local development:

```shell
pip install -e .[dev]
pip install pyspark
```

Quick Start

Context-manager usage

```python
from pyspark.sql import SparkSession
from sparktimise import SparkPipelineAutoTuner

spark = SparkSession.builder.appName("orders-job").getOrCreate()


def run_pipeline():
    orders = spark.read.parquet("s3a://my-bucket/orders/")
    return orders.groupBy("customer_id").count()


with SparkPipelineAutoTuner(
    spark=spark,
    pipeline_name="orders_pipeline",
    watched_modules=["my_project.orders"],
) as tuner:
    tuner.execute("run_orders", run_pipeline)
```

Entry-point usage through optimise

```python
from pyspark.sql import SparkSession
from sparktimise import optimise

spark = SparkSession.builder.appName("orders-job").getOrCreate()

# run_pipeline is the same function defined in the context-manager example above
with optimise(
    spark,
    "orders_pipeline",
    run_type="optimise",
    watched_modules=["my_project.orders"],
) as tuner:
    tuner.execute("run_orders", run_pipeline)
```

Functional usage

```python
from sparktimise.optimisation import optimise_partitions, optimise_skew

# df: a PySpark DataFrame produced earlier in the pipeline
step1 = optimise_partitions(df, target_partition_bytes=134_217_728)
step2 = optimise_skew(step1.df, columns=["customer_id"])

optimised_df = step2.df
print(step1.transformations_applied)
print(step2.transformations_applied)
```
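The salting technique behind optimise_skew can be sketched in plain Python. This is an illustration of the general idea only, not the library's implementation (which adds a salt *column* to Spark DataFrames); the salt count of 4 is arbitrary:

```python
import random


def salt_key(key: str, num_salts: int = 4) -> str:
    """Append a random salt so one hot key fans out into num_salts sub-keys.

    Spreading a skewed key across sub-keys lets its rows land on several
    shuffle partitions instead of overloading a single one.
    """
    return f"{key}_{random.randrange(num_salts)}"


# Every salted value still starts with the original key, so the salt can be
# stripped (or aggregated in two stages) after the skewed shuffle.
salted = {salt_key("customer_42") for _ in range(1000)}
print(sorted(salted))  # at most 4 distinct sub-keys
```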

Configuration And Runtime Controls

The primary runtime configuration surface is the optimise context manager.

| Parameter | Type | Default | Effect |
| --- | --- | --- | --- |
| run_type | str | "optimise" | One of "optimise", "baseline", or "report" |
| watched_functions | list[str] \| None | None | Exact or wildcard qualified function names to auto-assess |
| watched_modules | list[str] \| None | None | Module prefixes to auto-assess |
| auto_capture | bool | True | Enables context-manager function-return capture |
| include_plan | bool | False | Stores full plan text in assessments |
| run_id | str \| None | None | Groups results under a named folder; defaults to the pipeline name |
| results_root | str \| None | None | Root folder to write sparktimise_results under |
| spark | SparkSession | Required | Session used for safe SQL/session tuning |

For full configuration details, including file-backed config loading and Spark recommendation settings, see docs/configuration.md.

Architecture And Process Flow

sparktimise follows a hybrid pattern:

  1. Functional core: analyser and optimiser functions.
  2. Imperative shell: context-manager orchestration.
  3. OOP boundaries: adapters, config, and reporting.

Detailed architecture and sequence diagrams are documented in docs/architecture.md.
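The hybrid shape above, in miniature. All names here are hypothetical; this is a sketch of the pattern itself, not of sparktimise's internals: a pure step function returns data plus metadata (functional core), while a context manager handles orchestration and reporting at the boundary (imperative shell):

```python
from contextlib import contextmanager
from dataclasses import dataclass, field


# Functional core: a pure step that returns a result object with metadata.
@dataclass
class StepResult:
    value: object
    transformations_applied: list[str] = field(default_factory=list)


def double_all(xs: list[int]) -> StepResult:
    return StepResult([x * 2 for x in xs], ["double_all"])


# Imperative shell: orchestration and reporting live at the boundary.
@contextmanager
def pipeline(name: str):
    report: list[str] = []
    yield report
    print(f"{name}: applied {report}")


with pipeline("demo") as report:
    step = double_all([1, 2, 3])
    report.extend(step.transformations_applied)
```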

Documentation Map

| Document | Purpose |
| --- | --- |
| docs/README.md | Documentation index |
| docs/usage.md | End-to-end usage patterns and examples |
| docs/configuration.md | Configuration variables and file formats |
| docs/architecture.md | Internal design and process flow |
| docs/troubleshooting.md | Common setup/runtime/CI issues and fixes |
| CHANGELOG.md | Versioned release and change history |

Development Setup

Requirements

| Dependency | Purpose |
| --- | --- |
| Python 3.10+ | Runtime and tooling |
| Java (JRE/JDK) | Required for Spark tests |
| PySpark | Runtime dependency for Spark operations |

Local setup

```shell
python -m pip install -U pip
python -m pip install -e .[dev]
python -m pip install pyspark
```

Quality checks

```shell
ruff check src/sparktimise/ tests/
ruff format --check src/sparktimise/ tests/
mypy src/sparktimise/
```

Tests

```shell
# Unit tests (default)
python -m pytest tests/unit/ --tb=short -q

# Integration tests (requires Java + Spark)
python -m pytest tests/integration/ --run-spark --spark-smoke-timeout 60 --tb=short -q
```

Build package artifacts

```shell
python -m pip install build
python -m build --sdist --wheel
```

Artifacts are created under dist/.

License

MIT

Project details


Download files

Source Distribution

sparktimise-2.0.0.tar.gz (37.1 kB)

Built Distribution

sparktimise-2.0.0-py3-none-any.whl (42.9 kB)

File details

Details for the file sparktimise-2.0.0.tar.gz.

File metadata

  • Download URL: sparktimise-2.0.0.tar.gz
  • Size: 37.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for sparktimise-2.0.0.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 7293c1ed3d85d7550130ff51fa481390f86c248d02a04e614b1b75b1308b72cb |
| MD5 | 34f3b543d6763408d460cfd1f93937a1 |
| BLAKE2b-256 | 7aafa80c1967bc6ecfb3fedf278a7f39ef4214cead9da4bb747fa97fef92f709 |


Provenance

The following attestation bundles were made for sparktimise-2.0.0.tar.gz:

Publisher: ci.yml on KeilanEvans/sparktimise

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sparktimise-2.0.0-py3-none-any.whl.

File metadata

  • Download URL: sparktimise-2.0.0-py3-none-any.whl
  • Size: 42.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for sparktimise-2.0.0-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | c40c715796570c54148340296280d5d4e8c8904126ce66e72b4655f750f9e703 |
| MD5 | b8eb26c6a957322794b8dffd5df9b2f5 |
| BLAKE2b-256 | ba1310db9323ae70d0aae558c32519e359a9c0354a71aea0319b381b166e1073 |


Provenance

The following attestation bundles were made for sparktimise-2.0.0-py3-none-any.whl:

Publisher: ci.yml on KeilanEvans/sparktimise

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
