sparktimise
A PySpark optimisation library that inspects DataFrames and applies targeted performance improvements to maximise session and pipeline performance, with minimal user code changes.
What The Package Does
sparktimise provides two ways to optimise PySpark jobs:
- A pipeline context-manager workflow through `optimise` / `SparkPipelineAutoTuner`.
- A functional workflow through `analyse_*` and `optimise_*` functions for explicit control.
| Capability | Description | Primary API |
|---|---|---|
| Partition optimisation | Estimates optimal shuffle partitions and low-cardinality partition candidates | `optimise_partitions` |
| Skew mitigation | Detects skewed keys and applies salting columns | `optimise_skew` |
| Cache strategy | Recommends and applies StorageLevel persistence | `optimise_cache` |
| Spark session tuning | Recommends and applies Spark SQL/session settings | `SparkPipelineAutoTuner` / `optimise_context` |
| Broadcast analysis | Profiles table sizes for join strategy advice and optional hints | `analyse_broadcast` / `apply_broadcast_hints` |
| Reporting | Summarises pipeline steps and metadata in text or dict form | `OptimisationReport` |
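The exact heuristics behind partition optimisation are internal to `optimise_partitions`, but the common size-based approach is easy to sketch: split the data into partitions of roughly the target size (128 MiB here, matching the `target_partition_bytes` default shown later in this README). The helper name below is illustrative, not part of the library API.

```python
import math

def estimate_shuffle_partitions(total_bytes: int,
                                target_partition_bytes: int = 134_217_728) -> int:
    """Size-based heuristic: one partition per ~128 MiB of data, minimum 1."""
    return max(1, math.ceil(total_bytes / target_partition_bytes))

# A 10 GiB shuffle splits into 80 partitions of ~128 MiB each.
print(estimate_shuffle_partitions(10 * 1024**3))  # 80
```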
Installation
```shell
pip install sparktimise
```
For local development:
```shell
pip install -e .[dev]
pip install pyspark
```
Quick Start
Context-manager usage
```python
from pyspark.sql import SparkSession
from sparktimise import SparkPipelineAutoTuner

spark = SparkSession.builder.appName("orders-job").getOrCreate()

def run_pipeline():
    orders = spark.read.parquet("s3a://my-bucket/orders/")
    return orders.groupBy("customer_id").count()

with SparkPipelineAutoTuner(
    spark=spark,
    pipeline_name="orders_pipeline",
    watched_modules=["my_project.orders"],
) as tuner:
    tuner.execute("run_orders", run_pipeline)
```
Entry-point usage through optimise
```python
from pyspark.sql import SparkSession
from sparktimise import optimise

spark = SparkSession.builder.appName("orders-job").getOrCreate()

# run_pipeline is the same function defined in the context-manager example above
with optimise(
    spark,
    "orders_pipeline",
    run_type="optimise",
    watched_modules=["my_project.orders"],
) as tuner:
    tuner.execute("run_orders", run_pipeline)
```
Functional usage
```python
from sparktimise.optimisation import optimise_partitions, optimise_skew

# df is an existing PySpark DataFrame
step1 = optimise_partitions(df, target_partition_bytes=134_217_728)  # 128 MiB target
step2 = optimise_skew(step1.df, columns=["customer_id"])

optimised_df = step2.df
print(step1.transformations_applied)
print(step2.transformations_applied)
```
Configuration And Runtime Controls
The primary runtime configuration surface is the optimise context manager.
| Parameter | Type | Default | Effect |
|---|---|---|---|
| run_type | str | "optimise" | "optimise", "baseline", or "report" |
| watched_functions | list[str] | None | None | Exact or wildcard qualified function names to auto-assess |
| watched_modules | list[str] | None | None | Module prefixes to auto-assess |
| auto_capture | bool | True | Enables context-manager function-return capture |
| include_plan | bool | False | Stores full plan text in assessments |
| run_id | str | None | None | Groups results under a named folder; defaults to pipeline name |
| results_root | str | None | None | Root folder to write sparktimise_results under |
| spark | SparkSession | Required | Session used for safe SQL/session tuning |
For full configuration details, including file-backed config loading and Spark recommendation settings, see docs/configuration.md.
Architecture And Process Flow
sparktimise follows a hybrid pattern:
- Functional core: analyser and optimiser functions.
- Imperative shell: context-manager orchestration.
- OOP boundaries: adapters, config, and reporting.
Detailed architecture and sequence diagrams are documented in docs/architecture.md.
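The hybrid shape described above is a common Python pattern and can be illustrated with a stripped-down, library-agnostic sketch (all names here are illustrative, not sparktimise's actual internals): pure functions return a result plus metadata, and a thin context manager sequences them and collects a report.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field

# Functional core: pure step functions that return a result plus metadata.
@dataclass
class StepResult:
    name: str
    transformations_applied: list[str]

def repartition_step(data: list[int]) -> StepResult:
    return StepResult("partitions", ["repartition(8)"])

# Imperative shell: orchestration and reporting live at the boundary.
@dataclass
class Report:
    steps: list[StepResult] = field(default_factory=list)

@contextmanager
def pipeline(name: str):
    report = Report()
    try:
        yield report
    finally:
        # Summarise whatever the steps recorded, even on failure.
        print(f"{name}: {len(report.steps)} step(s) recorded")

with pipeline("orders_pipeline") as report:
    report.steps.append(repartition_step([1, 2, 3]))
```

Keeping the step functions pure makes them unit-testable without a Spark session, while the shell owns side effects such as logging and result persistence.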
Documentation Map
| Document | Purpose |
|---|---|
| docs/README.md | Documentation index |
| docs/usage.md | End-to-end usage patterns and examples |
| docs/configuration.md | Configuration variables and file formats |
| docs/architecture.md | Internal design and process flow |
| docs/troubleshooting.md | Common setup/runtime/CI issues and fixes |
| CHANGELOG.md | Versioned release and change history |
Development Setup
Requirements
| Dependency | Purpose |
|---|---|
| Python 3.10+ | Runtime and tooling |
| Java (JRE/JDK) | Required for Spark tests |
| PySpark | Runtime dependency for Spark operations |
Local setup
```shell
python -m pip install -U pip
python -m pip install -e .[dev]
python -m pip install pyspark
```
Quality checks
```shell
ruff check src/sparktimise/ tests/
ruff format --check src/sparktimise/ tests/
mypy src/sparktimise/
```
Tests
```shell
# Unit tests (default)
python -m pytest tests/unit/ --tb=short -q

# Integration tests (requires Java + Spark)
python -m pytest tests/integration/ --run-spark --spark-smoke-timeout 60 --tb=short -q
```
Build package artifacts
```shell
python -m pip install build
python -m build --sdist --wheel
```
Artifacts are created under dist/.
License
MIT
Project details
File details
Details for the file sparktimise-2.0.1.tar.gz.
File metadata
- Download URL: sparktimise-2.0.1.tar.gz
- Size: 38.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 180e4ac1a08e56ba1d5aab899227db2844387832e39c29b97e67b2bed4508f32 |
| MD5 | ea522bc632b80746b774d847abfc9fb4 |
| BLAKE2b-256 | 283cf0ee62151778556a4be130bd9122b281c16b99928a2e8aeb745402304cd5 |
Provenance
The following attestation bundles were made for sparktimise-2.0.1.tar.gz:
Publisher: ci.yml on KeilanEvans/sparktimise
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: sparktimise-2.0.1.tar.gz
- Subject digest: 180e4ac1a08e56ba1d5aab899227db2844387832e39c29b97e67b2bed4508f32
- Sigstore transparency entry: 1548640319
Publication details:
- Permalink: KeilanEvans/sparktimise@659e83a58d2ee27a654554ed9b9eb04c4b0766c4
- Branch / Tag: refs/heads/main
- Owner: https://github.com/KeilanEvans
- Access: private
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci.yml@659e83a58d2ee27a654554ed9b9eb04c4b0766c4
- Trigger Event: workflow_dispatch
File details
Details for the file sparktimise-2.0.1-py3-none-any.whl.
File metadata
- Download URL: sparktimise-2.0.1-py3-none-any.whl
- Size: 44.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3834378dd61872df84a73bcdf72bdfd996089bd04052e4c734004811b0b79c74 |
| MD5 | 0db36423ca99dc68d108f022b7ca429a |
| BLAKE2b-256 | 8e9ccb6b1bdb25185d08fabbfc366a24a07ba5e61026cc2d52d4d1667408252c |
Provenance
The following attestation bundles were made for sparktimise-2.0.1-py3-none-any.whl:
Publisher: ci.yml on KeilanEvans/sparktimise
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: sparktimise-2.0.1-py3-none-any.whl
- Subject digest: 3834378dd61872df84a73bcdf72bdfd996089bd04052e4c734004811b0b79c74
- Sigstore transparency entry: 1548640332
Publication details:
- Permalink: KeilanEvans/sparktimise@659e83a58d2ee27a654554ed9b9eb04c4b0766c4
- Branch / Tag: refs/heads/main
- Owner: https://github.com/KeilanEvans
- Access: private
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci.yml@659e83a58d2ee27a654554ed9b9eb04c4b0766c4
- Trigger Event: workflow_dispatch