sparktimise

A PySpark optimisation library that inspects DataFrames and applies targeted performance improvements with minimal user code changes.

Python 3.10–3.12 · License: MIT · Code style: ruff · PySpark

What The Package Does

sparktimise provides two ways to optimise PySpark jobs:

  1. A pipeline context-manager workflow through optimise / SparkPipelineAutoTuner.
  2. A functional workflow through analyse_* and optimise_* functions for explicit control.
| Capability | Description | Primary API |
| --- | --- | --- |
| Partition optimisation | Estimates optimal shuffle partitions and low-cardinality partition candidates | `optimise_partitions` |
| Skew mitigation | Detects skewed keys and applies salting columns | `optimise_skew` |
| Cache strategy | Recommends and applies `StorageLevel` persistence | `optimise_cache` |
| Spark session tuning | Recommends and applies Spark SQL/session settings | `SparkPipelineAutoTuner` / `optimise_context` |
| Broadcast analysis | Profiles table sizes for join strategy advice and optional hints | `analyse_broadcast` / `apply_broadcast_hints` |
| Reporting | Summarises pipeline steps and metadata in text or dict form | `OptimisationReport` |
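The salting behind skew mitigation is the standard technique: append a small salt to hot keys so their rows spread across multiple shuffle partitions instead of landing on one. A minimal plain-Python sketch of that idea (illustrative only, not sparktimise internals; `salt_keys` and its parameters are hypothetical):

```python
import random

def salt_keys(rows, key, n_salts=4, hot_keys=None):
    """Append a salt suffix to skewed (hot) keys so one reducer no
    longer receives every row for that key. Non-hot keys keep salt 0
    so they still group together cheaply."""
    salted = []
    for row in rows:
        k = row[key]
        # hot keys get a random salt in [0, n_salts); others get 0
        salt = random.randrange(n_salts) if hot_keys and k in hot_keys else 0
        salted.append({**row, "salted_key": f"{k}#{salt}"})
    return salted

# "c1" is heavily skewed; "c2" is not
rows = [{"customer_id": "c1"}] * 8 + [{"customer_id": "c2"}]
out = salt_keys(rows, "customer_id", n_salts=4, hot_keys={"c1"})
```

After grouping on the salted key, a second aggregation over the original key recombines the partial results.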

Installation

```shell
pip install sparktimise
```

For local development:

```shell
pip install -e .[dev]
pip install pyspark
```

Quick Start

Context-manager usage

```python
from pyspark.sql import SparkSession
from sparktimise import SparkPipelineAutoTuner

spark = SparkSession.builder.appName("orders-job").getOrCreate()


def run_pipeline():
    orders = spark.read.parquet("s3a://my-bucket/orders/")
    return orders.groupBy("customer_id").count()


with SparkPipelineAutoTuner(
    spark=spark,
    pipeline_name="orders_pipeline",
    watched_modules=["my_project.orders"],
) as tuner:
    tuner.execute("run_orders", run_pipeline)
```

Entry-point usage through optimise

```python
from pyspark.sql import SparkSession
from sparktimise import optimise

spark = SparkSession.builder.appName("orders-job").getOrCreate()

# run_pipeline is the same function defined in the context-manager example
with optimise(
    spark,
    "orders_pipeline",
    run_type="optimise",
    watched_modules=["my_project.orders"],
) as tuner:
    tuner.execute("run_orders", run_pipeline)
```

Functional usage

```python
from sparktimise.optimisation import optimise_partitions, optimise_skew

# df is any existing PySpark DataFrame
step1 = optimise_partitions(df, target_partition_bytes=134_217_728)
step2 = optimise_skew(step1.df, columns=["customer_id"])

optimised_df = step2.df
print(step1.transformations_applied)
print(step2.transformations_applied)
```
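The `target_partition_bytes` value above is 128 MiB (134,217,728 bytes). The usual estimate behind such a target, shown here as plain arithmetic (the general technique, not necessarily sparktimise's exact formula), is the total shuffle size divided by the per-partition target:

```python
import math

def estimate_shuffle_partitions(total_bytes: int,
                                target_partition_bytes: int = 134_217_728,
                                min_partitions: int = 1) -> int:
    """Classic partition-count estimate: ceil(total / target),
    floored at min_partitions. 134_217_728 bytes == 128 MiB."""
    return max(min_partitions,
               math.ceil(total_bytes / target_partition_bytes))

# a 10 GiB shuffle at 128 MiB per partition -> 80 partitions
print(estimate_shuffle_partitions(10 * 1024**3))  # 80
```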

Configuration And Runtime Controls

The primary runtime configuration surface is the optimise context manager.

| Parameter | Type | Default | Effect |
| --- | --- | --- | --- |
| `run_type` | `str` | `"optimise"` | One of `"optimise"`, `"baseline"`, or `"report"` |
| `watched_functions` | `list[str] \| None` | `None` | Exact or wildcard qualified function names to auto-assess |
| `watched_modules` | `list[str] \| None` | `None` | Module prefixes to auto-assess |
| `auto_capture` | `bool` | `True` | Enables context-manager function-return capture |
| `include_plan` | `bool` | `False` | Stores full plan text in assessments |
| `run_id` | `str \| None` | `None` | Groups results under a named folder; defaults to the pipeline name |
| `results_root` | `str \| None` | `None` | Root folder to write `sparktimise_results` under |
| `spark` | `SparkSession` | Required | Session used for safe SQL/session tuning |

For full configuration details, including file-backed config loading and Spark recommendation settings, see docs/configuration.md.
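These parameters are passed as keyword arguments to `optimise`. As an illustrative fragment (`spark` and `run_pipeline` as in Quick Start; the `run_id` and `results_root` values are placeholders, not defaults), a report-only run might look like:

```python
from sparktimise import optimise

# run_type="report" assesses the pipeline without applying changes;
# results are grouped under run_id inside results_root/sparktimise_results
with optimise(
    spark,
    "orders_pipeline",
    run_type="report",
    watched_modules=["my_project.orders"],
    include_plan=True,               # store full plan text in assessments
    run_id="orders_review",          # illustrative folder name
    results_root="/tmp/spark_runs",  # illustrative root path
) as tuner:
    tuner.execute("run_orders", run_pipeline)
```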

Architecture And Process Flow

sparktimise follows a hybrid pattern:

  1. Functional core: analyser and optimiser functions.
  2. Imperative shell: context-manager orchestration.
  3. OOP boundaries: adapters, config, and reporting.

Detailed architecture and sequence diagrams are documented in docs/architecture.md.
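The hybrid pattern can be shown in miniature. The sketch below is illustrative plain Python, not sparktimise's actual classes: a pure step function (functional core), a context manager that orchestrates steps (imperative shell), and a small result object (OOP boundary):

```python
from contextlib import contextmanager
from dataclasses import dataclass, field

@dataclass
class StepResult:
    # OOP boundary: a small value object carrying step metadata
    name: str
    transformations_applied: list[str] = field(default_factory=list)

def optimise_step(data, name):
    # functional core: data in, result object out, no side effects
    return StepResult(name, [f"{name}:applied"])

@contextmanager
def pipeline(name):
    # imperative shell: collects step results and summarises them on exit
    results: list[StepResult] = []
    try:
        yield results
    finally:
        summary = {r.name: r.transformations_applied for r in results}
        print(f"{name}: {summary}")

with pipeline("demo") as steps:
    steps.append(optimise_step(None, "partitions"))
    steps.append(optimise_step(None, "skew"))
```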

Documentation Map

| Document | Purpose |
| --- | --- |
| docs/README.md | Documentation index |
| docs/usage.md | End-to-end usage patterns and examples |
| docs/configuration.md | Configuration variables and file formats |
| docs/architecture.md | Internal design and process flow |
| docs/troubleshooting.md | Common setup/runtime/CI issues and fixes |
| CHANGELOG.md | Versioned release and change history |

Development Setup

Requirements

| Dependency | Purpose |
| --- | --- |
| Python 3.10+ | Runtime and tooling |
| Java (JRE/JDK) | Required for Spark tests |
| PySpark | Runtime dependency for Spark operations |

Local setup

```shell
python -m pip install -U pip
python -m pip install -e .[dev]
python -m pip install pyspark
```

Quality checks

```shell
ruff check src/sparktimise/ tests/
ruff format --check src/sparktimise/ tests/
mypy src/sparktimise/
```

Tests

```shell
# Unit tests (default)
python -m pytest tests/unit/ --tb=short -q

# Integration tests (requires Java + Spark)
python -m pytest tests/integration/ --run-spark --spark-smoke-timeout 60 --tb=short -q
```

Build package artifacts

```shell
python -m pip install build
python -m build --sdist --wheel
```

Artifacts are created under dist/.

License

MIT
