
A Python wrapper that analyses DataFrames and applies optimisation techniques to maximise PySpark session performance.

Project description


sparktimise

A PySpark optimisation library that inspects DataFrames and applies targeted performance improvements with minimal user code changes.

Python 3.10–3.12 · License: MIT · Code style: ruff · PySpark

What The Package Does

sparktimise provides two ways to optimise PySpark jobs:

  1. A pipeline context-manager workflow through optimise / SparkPipelineAutoTuner.
  2. A functional workflow through analyse_* and optimise_* functions for explicit control.
| Capability | Description | Primary API |
| --- | --- | --- |
| Partition optimisation | Estimates optimal shuffle partitions and low-cardinality partition candidates | optimise_partitions |
| Skew mitigation | Detects skewed keys and applies salting columns | optimise_skew |
| Cache strategy | Recommends and applies StorageLevel persistence | optimise_cache |
| Spark session tuning | Recommends and applies Spark SQL/session settings | SparkPipelineAutoTuner / optimise_context |
| Broadcast analysis | Profiles table sizes for join strategy advice and optional hints | analyse_broadcast / apply_broadcast_hints |
| Reporting | Summarises pipeline steps and metadata in text or dict form | OptimisationReport |
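The partition row above comes down to simple arithmetic: divide an estimated input size by a target partition size to get a shuffle-partition count. A minimal sketch of that idea (the actual heuristic inside optimise_partitions is not documented here; the 134_217_728-byte target mirrors the value used in the functional example later in this README):

```python
import math


def estimate_shuffle_partitions(total_bytes: int,
                                target_partition_bytes: int = 134_217_728) -> int:
    """Rough shuffle-partition count: size / target, at least 1 partition.

    Illustrative only -- the library's real heuristic may also weigh
    column cardinality and cluster parallelism.
    """
    return max(1, math.ceil(total_bytes / target_partition_bytes))


# A ~10 GiB shuffle at a 128 MiB target partition size:
print(estimate_shuffle_partitions(10 * 1024**3))  # -> 80
```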

Installation

```shell
pip install sparktimise
```

For local development:

```shell
pip install -e .[dev]
pip install pyspark
```

Quick Start

Context-manager usage

```python
from pyspark.sql import SparkSession
from sparktimise import SparkPipelineAutoTuner

spark = SparkSession.builder.appName("orders-job").getOrCreate()


def run_pipeline():
    orders = spark.read.parquet("s3a://my-bucket/orders/")
    return orders.groupBy("customer_id").count()


with SparkPipelineAutoTuner(
    spark=spark,
    pipeline_name="orders_pipeline",
    watched_modules=["my_project.orders"],
) as tuner:
    tuner.execute("run_orders", run_pipeline)
```

Entry-point usage through optimise

```python
from pyspark.sql import SparkSession
from sparktimise import optimise

spark = SparkSession.builder.appName("orders-job").getOrCreate()

# run_pipeline is the same function defined in the context-manager example above
with optimise(
    spark,
    "orders_pipeline",
    run_type="optimise",
    watched_modules=["my_project.orders"],
) as tuner:
    tuner.execute("run_orders", run_pipeline)
```

Functional usage

```python
from sparktimise.optimisation import optimise_partitions, optimise_skew

# df: a PySpark DataFrame produced earlier in the pipeline
step1 = optimise_partitions(df, target_partition_bytes=134_217_728)
step2 = optimise_skew(step1.df, columns=["customer_id"])

optimised_df = step2.df
print(step1.transformations_applied)
print(step2.transformations_applied)
```
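The salting technique behind optimise_skew can be sketched in plain Python. This is an illustration of the general idea only, not the library's implementation (which adds a salt *column* to Spark DataFrames); the salt count of 4 is arbitrary:

```python
import random


def salt_key(key: str, num_salts: int = 4) -> str:
    """Append a random salt so one hot key fans out into num_salts sub-keys.

    Spreading a skewed key across sub-keys lets its rows land on several
    shuffle partitions instead of overloading a single one.
    """
    return f"{key}_{random.randrange(num_salts)}"


# Every salted value still starts with the original key, so the salt can be
# stripped (or aggregated in two stages) after the skewed shuffle.
salted = {salt_key("customer_42") for _ in range(1000)}
print(sorted(salted))  # at most 4 distinct sub-keys
```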

Configuration And Runtime Controls

The primary runtime configuration surface is the optimise context manager.

| Parameter | Type | Default | Effect |
| --- | --- | --- | --- |
| run_type | str | "optimise" | One of "optimise", "baseline", or "report" |
| watched_functions | list[str] \| None | None | Exact or wildcard qualified function names to auto-assess |
| watched_modules | list[str] \| None | None | Module prefixes to auto-assess |
| auto_capture | bool | True | Enables context-manager function-return capture |
| include_plan | bool | False | Stores full plan text in assessments |
| run_id | str \| None | None | Groups results under a named folder; defaults to the pipeline name |
| results_root | str \| None | None | Root folder to write sparktimise_results under |
| spark | SparkSession | Required | Session used for safe SQL/session tuning |

For full configuration details, including file-backed config loading and Spark recommendation settings, see docs/configuration.md.

Architecture And Process Flow

sparktimise follows a hybrid pattern:

  1. Functional core: analyser and optimiser functions.
  2. Imperative shell: context-manager orchestration.
  3. OOP boundaries: adapters, config, and reporting.

Detailed architecture and sequence diagrams are documented in docs/architecture.md.
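The hybrid shape above, in miniature. All names here are hypothetical; this is a sketch of the pattern itself, not of sparktimise's internals: a pure step function returns data plus metadata (functional core), while a context manager handles orchestration and reporting at the boundary (imperative shell):

```python
from contextlib import contextmanager
from dataclasses import dataclass, field


# Functional core: a pure step that returns a result object with metadata.
@dataclass
class StepResult:
    value: object
    transformations_applied: list[str] = field(default_factory=list)


def double_all(xs: list[int]) -> StepResult:
    return StepResult([x * 2 for x in xs], ["double_all"])


# Imperative shell: orchestration and reporting live at the boundary.
@contextmanager
def pipeline(name: str):
    report: list[str] = []
    yield report
    print(f"{name}: applied {report}")


with pipeline("demo") as report:
    step = double_all([1, 2, 3])
    report.extend(step.transformations_applied)
```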

Documentation Map

| Document | Purpose |
| --- | --- |
| docs/README.md | Documentation index |
| docs/usage.md | End-to-end usage patterns and examples |
| docs/configuration.md | Configuration variables and file formats |
| docs/architecture.md | Internal design and process flow |
| docs/troubleshooting.md | Common setup/runtime/CI issues and fixes |
| CHANGELOG.md | Versioned release and change history |

Development Setup

Requirements

| Dependency | Purpose |
| --- | --- |
| Python 3.10+ | Runtime and tooling |
| Java (JRE/JDK) | Required for Spark tests |
| PySpark | Runtime dependency for Spark operations |

Local setup

```shell
python -m pip install -U pip
python -m pip install -e .[dev]
python -m pip install pyspark
```

Quality checks

```shell
ruff check src/sparktimise/ tests/
ruff format --check src/sparktimise/ tests/
mypy src/sparktimise/
```

Tests

```shell
# Unit tests (default)
python -m pytest tests/unit/ --tb=short -q

# Integration tests (requires Java + Spark)
python -m pytest tests/integration/ --run-spark --spark-smoke-timeout 60 --tb=short -q
```

Build package artifacts

```shell
python -m pip install build
python -m build --sdist --wheel
```

Artifacts are created under dist/.

License

MIT

Project details


Download files

Source Distribution

sparktimise-2.0.0.tar.gz (37.1 kB)

Built Distribution

sparktimise-2.0.0-py3-none-any.whl (42.9 kB)

File details

Details for the file sparktimise-2.0.0.tar.gz.

File metadata

  • Download URL: sparktimise-2.0.0.tar.gz
  • Size: 37.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for sparktimise-2.0.0.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 7293c1ed3d85d7550130ff51fa481390f86c248d02a04e614b1b75b1308b72cb |
| MD5 | 34f3b543d6763408d460cfd1f93937a1 |
| BLAKE2b-256 | 7aafa80c1967bc6ecfb3fedf278a7f39ef4214cead9da4bb747fa97fef92f709 |


Provenance

The following attestation bundles were made for sparktimise-2.0.0.tar.gz:

Publisher: ci.yml on KeilanEvans/sparktimise

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sparktimise-2.0.0-py3-none-any.whl.

File metadata

  • Download URL: sparktimise-2.0.0-py3-none-any.whl
  • Size: 42.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for sparktimise-2.0.0-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | c40c715796570c54148340296280d5d4e8c8904126ce66e72b4655f750f9e703 |
| MD5 | b8eb26c6a957322794b8dffd5df9b2f5 |
| BLAKE2b-256 | ba1310db9323ae70d0aae558c32519e359a9c0354a71aea0319b381b166e1073 |


Provenance

The following attestation bundles were made for sparktimise-2.0.0-py3-none-any.whl:

Publisher: ci.yml on KeilanEvans/sparktimise

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
