
Create truly fresh local Spark sessions with isolated temp dirs and reliable teardown.


freshspark


Small helpers for local PySpark that start each run in a clean sandbox and tear sessions down reliably: isolated warehouse temp dirs, optional embedded Derby kept out of the working tree, in-memory catalog by default, randomized Spark UI port, and aggressive Py4J / JVM shutdown so the process can exit normally.

Use it when notebooks or scripts leave behind metastore locks, a derby.log in the wrong place, or JVMs that refuse to die after SparkSession.stop(), or when Spark UI ports collide between runs.


Requirements

  • Python: 3.9 or newer.
  • JDK: on PATH (or located via JAVA_HOME). For the Spark 3.x line, use Java 8, 11, or 17; PySpark 3.5+ is also validated against Java 21.
  • PySpark: the declared dependency is PySpark 3.5.x (pyspark>=3.5,<4) for predictable local startup. Spark 4 is not pinned here; if you override to PySpark 4.x, use Java 17 or 21 and expect to manage compatibility yourself.

Install

pip install freshspark

Development (editable install, tests, Ruff):

pip install -e ".[dev]"
ruff format --check freshspark tests
ruff check freshspark tests
mypy freshspark tests
pytest

Quick start

from freshspark import fresh_local_spark, get_fresh_local_spark

# Context manager: new session every `with` block, cleanup on exit
with fresh_local_spark(app_name="etl", preset="dev") as spark:
    spark.range(10).show()

# Manual lifecycle: always call cleanup() when finished (or use try/finally)
spark, cleanup = get_fresh_local_spark(app_name="demo", preset="fat")
try:
    spark.range(1000).summary().show()
finally:
    cleanup()

Public API

Symbol | Role
--- | ---
fresh_local_spark(...) | Context manager yielding a new SparkSession per with block (no reuse).
get_fresh_local_spark(...) | Returns (spark, cleanup). You must call cleanup() when done unless you use the context manager.
reset_active_session() | Stops the active session, closes the gateway, and clears in-process reuse cache entries that pointed at that session (or are already dead). Safe to call repeatedly.
ensure_fresh(...) | Decorator that runs the wrapped function inside fresh_local_spark and injects spark as a keyword argument. Do not pass spark= yourself (a TypeError is raised if you do).
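
The keyword-injection behaviour described for ensure_fresh can be illustrated without Spark. The following is a minimal sketch, not freshspark's implementation: fake_session and ensure_fresh_sketch are hypothetical stand-ins, and the real decorator may take arguments that this bare version does not.

```python
import functools
from contextlib import contextmanager


@contextmanager
def fake_session():
    """Hypothetical stand-in for fresh_local_spark(); yields a plain dict."""
    session = {"stopped": False}
    try:
        yield session
    finally:
        # Stands in for stopping the session and removing its temp dirs.
        session["stopped"] = True


def ensure_fresh_sketch(func):
    """Run func inside a fresh session and inject it as the `spark` keyword."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if "spark" in kwargs:
            # Mirrors the documented rule: callers must not pass spark= themselves.
            raise TypeError("spark is injected by the decorator; do not pass spark=")
        with fake_session() as spark:
            return func(*args, spark=spark, **kwargs)

    return wrapper


@ensure_fresh_sketch
def job(n, spark):
    # Return the session so the caller can inspect it after teardown.
    return n, spark


result_n, used_session = job(3)
```

After job(3) returns, the stand-in session has already been torn down, which is the property the decorator is meant to guarantee.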

Configuration highlights

Presets

preset is one of tiny, dev, or fat. They set driver memory and maxResultSize to sensible defaults. Any other string logs a warning and applies no preset keys (you can still set everything via extra_confs).

Preset | spark.driver.memory | spark.driver.maxResultSize
--- | --- | ---
tiny | 1g | 512m
dev | 2g | 1g
fat | 4g | 2g
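
The preset rules above (known presets set two keys, unknown names only warn, and extra_confs wins) can be sketched as plain dictionary merging. This is an illustrative sketch under those stated rules; resolve_confs and the logger name are hypothetical and not part of freshspark's API.

```python
import logging

logger = logging.getLogger("freshspark_sketch")  # hypothetical logger name

# Values copied from the preset table.
PRESETS = {
    "tiny": {"spark.driver.memory": "1g", "spark.driver.maxResultSize": "512m"},
    "dev": {"spark.driver.memory": "2g", "spark.driver.maxResultSize": "1g"},
    "fat": {"spark.driver.memory": "4g", "spark.driver.maxResultSize": "2g"},
}


def resolve_confs(preset, extra_confs=None):
    """Return preset keys (if the preset is known) overlaid with extra_confs."""
    confs = {}
    if preset in PRESETS:
        confs.update(PRESETS[preset])
    else:
        # Unknown preset: warn and apply no preset keys.
        logger.warning("Unknown preset %r; no preset keys applied", preset)
    # extra_confs is merged last, so it overrides preset values.
    confs.update(extra_confs or {})
    return confs


dev_confs = resolve_confs("dev")
overridden = resolve_confs("fat", {"spark.driver.memory": "6g"})
unknown = resolve_confs("huge", {"spark.driver.memory": "3g"})
```

Here overridden keeps the fat preset's maxResultSize but takes the caller's driver memory, while unknown contains only what extra_confs supplied.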

Catalog and warehouse

  • Default (hive_metastore=False): spark.sql.catalogImplementation=in-memory and an isolated spark.sql.warehouse.dir under a temp root—no embedded Derby in the default path, so you avoid the usual Derby lock files in the project directory.
  • Hive-style metastore (hive_metastore=True): warehouse and Derby home (-Dderby.system.home=...) both live under the same isolated temp tree.

If you pass extra_confs with spark.driver.extraJavaOptions while hive_metastore=True, that value is merged after the required Derby system home flag so your JVM flags do not accidentally wipe metastore configuration.
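
To make the two catalog modes concrete, here is a rough sketch of how such a conf dict could be assembled. It is an assumption-laden illustration, not the library's code: build_catalog_confs, the warehouse and derby subdirectory names, and the use of spark.sql.catalogImplementation=hive for the metastore path are all assumptions.

```python
import os
import shutil
import tempfile


def build_catalog_confs(root, hive_metastore=False, extra_confs=None):
    """Build catalog-related Spark confs rooted at an isolated temp directory."""
    extra = dict(extra_confs or {})
    confs = {"spark.sql.warehouse.dir": os.path.join(root, "warehouse")}
    if hive_metastore:
        confs["spark.sql.catalogImplementation"] = "hive"  # assumed value
        derby_flag = "-Dderby.system.home=" + os.path.join(root, "derby")
        # Put the Derby flag first and append the user's JVM flags after it,
        # so user flags extend the option string instead of replacing it.
        user_opts = extra.pop("spark.driver.extraJavaOptions", "")
        confs["spark.driver.extraJavaOptions"] = " ".join(
            part for part in (derby_flag, user_opts) if part
        )
    else:
        confs["spark.sql.catalogImplementation"] = "in-memory"
    confs.update(extra)  # remaining user confs are merged last
    return confs


root = tempfile.mkdtemp(prefix="freshspark-sketch-")
default_confs = build_catalog_confs(root)
hive_confs = build_catalog_confs(
    root,
    hive_metastore=True,
    extra_confs={"spark.driver.extraJavaOptions": "-Xss4m"},
)
shutil.rmtree(root)  # the sketch only needs the path strings
```

In hive_confs the option string begins with the Derby system home flag and ends with the caller's -Xss4m, matching the merge order described above.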

Other knobs

  • enable_ui / print_ui_url: Spark UI on a free port (spark.ui.port=0); optionally print the URL once the session is up.
  • extra_confs: flat dict[str, str] merged last so you can override presets or Spark defaults.
  • reuse_within_process=True: same Python process + same app_name returns the same (spark, cleanup) until cleanup() or reset_active_session() runs; dead cached sessions are replaced automatically on the next request.

CLI

# Python REPL with `spark` already constructed
freshspark repl --preset fat

# Stop the active SparkSession in this interpreter (also reconciles reuse cache)
freshspark reset

Command | Common flags
--- | ---
freshspark repl | --app-name, --preset (e.g. tiny or fat)
freshspark reset | (none)

Jupyter and long-running kernels

Prefer an explicit cleanup cell so temp dirs and the JVM are released even if the kernel stays alive:

from freshspark import get_fresh_local_spark

spark, cleanup = get_fresh_local_spark(app_name="nb", preset="dev")
# ... work ...
cleanup()

If another library left a sticky session in this kernel, call reset_active_session() here. The freshspark reset CLI only affects the interpreter where that command runs (for example a terminal REPL), not a separate Jupyter kernel.


Environment variables

Variable | Effect
--- | ---
FRESHSPARK_SKIP_JAVA_CHECK | If set to 1, true, or yes, an unsupported Java / Spark pairing warns instead of raising during session construction.
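
As a rough illustration of the warn-instead-of-raise switch, the sketch below checks a Java major version against the versions listed under Requirements for PySpark 3.5.x. The function names, the RuntimeError type, and the case-insensitive matching of the variable's value are assumptions, not documented behaviour.

```python
import os
import warnings

# Java versions listed under Requirements for the PySpark 3.5.x line.
SUPPORTED_JAVA = {8, 11, 17, 21}


def skip_java_check(environ=None):
    """True when FRESHSPARK_SKIP_JAVA_CHECK is set to 1, true, or yes."""
    env = os.environ if environ is None else environ
    value = env.get("FRESHSPARK_SKIP_JAVA_CHECK", "")
    # Case-insensitive matching is an assumption of this sketch.
    return value.strip().lower() in {"1", "true", "yes"}


def check_java(java_major, environ=None):
    """Raise on an unsupported Java version, or only warn if the check is skipped."""
    if java_major in SUPPORTED_JAVA:
        return
    message = f"Java {java_major} is not a supported pairing for PySpark 3.5.x"
    if skip_java_check(environ):
        warnings.warn(message)
    else:
        raise RuntimeError(message)  # exception type is an assumption


check_java(17, {})  # supported version: passes silently
```

Passing an explicit environ mapping keeps the sketch easy to exercise without modifying the real process environment.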

Why this exists

Local PySpark is great until it is not: JVMs that linger, Derby files under cwd, warehouse dirs shared across runs, and UI ports that collide. freshspark centralizes a small set of Spark configs and lifecycle rules so each run gets an isolated temp layout and a cleanup path that actually runs (including an atexit safety net, with idempotent cleanup so manual cleanup() plus process exit does not misbehave).
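
The idea of an idempotent cleanup backed by an atexit safety net can be shown with the standard library alone. This is a minimal sketch of the pattern, not freshspark's code: make_cleanup is a hypothetical helper, and a real cleanup would stop the session and gateway before removing files.

```python
import atexit
import shutil
import tempfile
from pathlib import Path


def make_cleanup(temp_root):
    """Return a cleanup function that removes temp_root at most once."""
    done = False

    def cleanup():
        nonlocal done
        if done:
            return  # already cleaned up; repeated calls are no-ops
        done = True
        shutil.rmtree(temp_root, ignore_errors=True)

    # Safety net: runs at interpreter exit even if the caller forgets.
    atexit.register(cleanup)
    return cleanup


temp_root = Path(tempfile.mkdtemp(prefix="freshspark-sketch-"))
cleanup = make_cleanup(temp_root)
cleanup()
cleanup()  # second call does nothing, as will the atexit call later
```

Because the guard flag is set before any work happens, a manual cleanup() followed by the atexit hook does not attempt the teardown twice.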


License

SPDX Apache-2.0 (full text in LICENSE).
