Run the PySpark Connect client in JupyterLite/Pyodide via a grpc-web transport (PySpark in JupyterLite).

pyspark-connect-web - PySpark in JupyterLite

Run the real PySpark Connect Python client inside a browser (JupyterLite/Pyodide), talking to an Apache Spark Connect server through a grpc-web transport. Your existing PySpark code runs unchanged - no reimplementation, no local JVM, no Python backend server.

import pyspark_connect_web as pcw
pcw.install()

from pyspark.sql import SparkSession
spark = SparkSession.builder.remote("sc://localhost:8081/;transport=grpcweb").getOrCreate()
spark.range(10).filter("id % 2 = 0").toPandas()   # runs in your browser tab

This is a thin client, not local compute. You still need a running Spark Connect server (Spark 4.x) behind an Envoy grpc-web proxy. The win is: no Python backend, the real PySpark API, anywhere a browser runs.

How it works

PySpark's Connect client is pure Python above a single gRPC stub: it builds protobuf plans and ships them to the server. We monkey-patch only that stub with a grpc-web/fetch transport, and make calls blocking via a Web Worker + Atomics/SharedArrayBuffer bridge so .collect() returns data synchronously. Everything above the stub - DataFrame, Column, functions - is untouched. We patch; we do not fork PySpark. See docs/architecture.md.
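
To make the transport swap more concrete: grpc-web carries each protobuf message in a length-prefixed frame (one flag byte, a 4-byte big-endian length, then the payload), and the trailers arrive as a final frame whose flag byte has the high bit set. The sketch below illustrates that framing only. The function names are hypothetical and are not taken from this package's source.

```python
import struct

TRAILER_FLAG = 0x80  # high bit of the flag byte marks a trailers frame


def encode_frame(payload: bytes, flags: int = 0) -> bytes:
    """Prefix a payload with the 5-byte grpc-web frame header."""
    return struct.pack(">BI", flags, len(payload)) + payload


def decode_frames(body: bytes):
    """Split a response body into (is_trailers, payload) tuples."""
    frames = []
    offset = 0
    while offset < len(body):
        if len(body) - offset < 5:
            raise ValueError("truncated frame header")
        flags, length = struct.unpack_from(">BI", body, offset)
        offset += 5
        end = offset + length
        if end > len(body):
            raise ValueError("truncated frame payload")
        frames.append((bool(flags & TRAILER_FLAG), body[offset:end]))
        offset = end
    return frames
```

A transport built this way would POST an encoded request frame with fetch and feed the response bytes through a decoder like decode_frames, handing data frames to the protobuf layer and reading the gRPC status from the trailers frame.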

Requirements

  • A browser (for the client) or Python 3.11+ (for local dev/tests).
  • In the browser: Pyodide >= 0.28 (Python 3.13), which already ships pyarrow, pandas, protobuf, and numpy. grpcio is not available in Pyodide and is never imported - all transport is grpc-web over fetch.
  • pyspark>=4.0 (pinned by install(); provided by Pyodide in the browser).
  • A running Spark Connect server (Spark 4.x) behind an Envoy grpc-web proxy - the deploy/ stack brings this up for you.
  • The JupyterLite page must be cross-origin isolated (COOP: same-origin + COEP: credentialless), which the deploy stack serves for you. Without it, SharedArrayBuffer - the backbone of the blocking bridge - is unavailable.
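
The deploy stack already serves the isolation headers, but if you host the site yourself it may help to see what they amount to. The following is a minimal sketch of a static file server that adds both headers to every response, using only the Python standard library. The class and function names are illustrative and are not part of this project.

```python
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


class IsolatedHandler(SimpleHTTPRequestHandler):
    """Static file handler that adds the cross-origin isolation headers."""

    def end_headers(self):
        # Both headers are needed before the browser exposes SharedArrayBuffer.
        self.send_header("Cross-Origin-Opener-Policy", "same-origin")
        self.send_header("Cross-Origin-Embedder-Policy", "credentialless")
        super().end_headers()


def make_server(directory: str, port: int = 8000) -> ThreadingHTTPServer:
    """Build a local server for `directory`; pass port=0 for any free port."""
    handler = partial(IsolatedHandler, directory=directory)
    return ThreadingHTTPServer(("127.0.0.1", port), handler)
```

Calling make_server("_output").serve_forever() would serve a built site locally. In the browser console, self.crossOriginIsolated should then report true.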

Installation

Use a conda environment:

conda create -n pcw python=3.11
conda activate pcw
pip install pyspark-connect-web

In the browser (JupyterLite/Pyodide), install with micropip inside the kernel:

import micropip
await micropip.install("pyspark-connect-web")

The import / package name is pyspark_connect_web (distribution name pyspark-connect-web); the repository is pyspark-client-wasm.

Running a local Spark Connect server

You need a Spark Connect server, and - for the browser - an Envoy grpc_web proxy in front of it (a browser cannot speak raw gRPC). Two options:

Recommended: the full stack (server + Envoy + site)

The deploy/ stack brings up a Spark Connect server, the Envoy grpc_web proxy, and a static host for the JupyterLite site with the mandatory cross-origin-isolation headers - everything the browser client needs:

docker compose -f deploy/compose.yaml up
# wait for the "spark-connect" container to report healthy (~60s cold start)

| URL | What |
|-----|------|
| sc://localhost:8081/;transport=grpcweb | grpc-web endpoint the browser client connects to |
| http://localhost:8000/ | JupyterLite site, served with Cross-Origin-Opener-Policy: same-origin + Cross-Origin-Embedder-Policy: credentialless (required for SharedArrayBuffer) |
| :15002 | Spark Connect raw gRPC (native clients / reference generator) |

Lightweight: just a Spark Connect server (no Docker)

For testing with a native PySpark client (or trying Spark Connect without the browser), download a Spark release (needs Java 17) and start its Connect server. Recent Spark releases bundle Spark Connect, so no --packages flag is needed:

SPARK_VERSION=4.1.2   # use the latest 4.1.x: https://spark.apache.org/downloads.html
curl -LO "https://dlcdn.apache.org/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop3.tgz"
tar xzf "spark-${SPARK_VERSION}-bin-hadoop3.tgz" && cd "spark-${SPARK_VERSION}-bin-hadoop3"
./sbin/start-connect-server.sh
# -> Spark Connect on sc://localhost:15002  (raw gRPC)

Then a native client can connect: SparkSession.builder.remote("sc://localhost:15002"). The browser client still needs Envoy in front (use the full stack above) - pcw.install() then talks to sc://localhost:8081/;transport=grpcweb.
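
Because the server takes a while to start, it can be handy to wait for the port before connecting. The helper below is a small sketch of that check using plain TCP. The function name and defaults are assumptions for illustration. An open port only shows that something is listening, not that Spark Connect has finished initializing.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0,
                  interval: float = 0.5) -> bool:
    """Return True once a TCP connection succeeds, False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For the lightweight setup, wait_for_port("localhost", 15002) would poll the raw gRPC port. For the full stack, the same call against port 8081 polls the Envoy listener.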

See deploy/README.md for ports, version pins, and CORS/header curl checks, and docs/running-locally.md for the full walkthrough.

Connecting

Point the client at the grpc-web proxy after install():

import pyspark_connect_web as pcw
pcw.install()

from pyspark.sql import SparkSession
spark = SparkSession.builder.remote("sc://localhost:8081/;transport=grpcweb").getOrCreate()

The connection string is the standard Spark Connect sc:// URI with a transport=grpcweb parameter. A plain https:// or http:// shorthand is also accepted. For anything beyond localhost, terminate TLS at the proxy and serve the page from a secure context: away from localhost, a browser requires HTTPS before it will report crossOriginIsolated.
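
To show how the pieces of such a URI fit together, here is a rough sketch that splits an sc:// string into host, port, and the semicolon-separated parameters. It is not this package's parser. The function name is hypothetical, and the fallback port of 15002 is simply the raw gRPC port used elsewhere in this README.

```python
from urllib.parse import urlsplit


def parse_connect_uri(uri: str):
    """Split an sc:// URI into (host, port, params)."""
    parts = urlsplit(uri)
    if parts.scheme != "sc":
        raise ValueError(f"expected an sc:// URI, got {uri!r}")
    # Parameters follow the path as ;key=value pairs, e.g. "/;transport=grpcweb".
    params = {}
    for segment in parts.path.split(";")[1:]:
        if not segment:
            continue
        key, _, value = segment.partition("=")
        params[key] = value
    return parts.hostname, parts.port or 15002, params
```

With the URI from the example above, this returns the host localhost, the port 8081, and a params dict containing transport=grpcweb, which is the value a transport selector would branch on.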

For TLS + auth (the hardened prod overlay), the proxy is the enforcement point: it gates on Authorization: Bearer <token> and forwards the header upstream. Bring it up with:

# provide a TLS cert (deploy/certs/), set your origins, then:
docker compose -f deploy/compose.yaml -f deploy/compose.prod.yaml up -d
# or: make up-prod

See docs/connection-patterns.md and deploy/README.md (TLS, CORS allowlist, bearer-token gate -> jwt_authn/ext_authz).
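
In the deploy stack the bearer-token gate lives in Envoy, not in Python. Purely to illustrate the logic such a gate applies, the sketch below checks an Authorization header against an expected token. The function name is hypothetical, and the header lookup here is case-sensitive for brevity, which a real proxy would not be.

```python
import hmac


def is_authorized(headers: dict, expected_token: str) -> bool:
    """Return True only for a well-formed 'Authorization: Bearer <token>' match."""
    value = headers.get("Authorization", "")
    scheme, _, token = value.partition(" ")
    if scheme.lower() != "bearer" or not token:
        return False
    # Constant-time comparison avoids leaking how much of the token matched.
    return hmac.compare_digest(token.strip().encode(), expected_token.encode())
```

A request that fails this check would be rejected at the proxy and never reach the Spark Connect server. A request that passes is forwarded with the header intact.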

Ways to use it

Pick the path that fits - all of them run the real PySpark API in the browser.

1. In JupyterLite (a notebook, nothing to install)

Build the site and bring up the stack (Spark Connect + Envoy grpc-web + the JupyterLite site, served cross-origin isolated on :8000):

make site                                  # build the JupyterLite site into _output/
docker compose -f deploy/compose.yaml up   # serves :8000 (site) + :8081 (grpc-web) + :15002 (Spark)

Open http://localhost:8000/, then in a notebook cell:

import pyspark_connect_web as pcw
pcw.install()

from pyspark.sql import SparkSession
spark = SparkSession.builder.remote("sc://localhost:8081/;transport=grpcweb").getOrCreate()
spark.range(10).filter("id % 2 = 0").toPandas()

GitHub Pages / other static hosts: see JupyterLite hosting.

2. Embed it in your own web page

The site ships a small, self-contained page that boots Pyodide in a Web Worker, micropip-installs the wheel, runs pcw.install(), binds a SparkSession, and exposes window.__pcwRunPython(src). Use pyspark_connect_web/jupyterlite/harness.html as the reference for wiring worker/worker_bootstrap.js + worker/bridge.js into your app. The page must be cross-origin isolated (COOP: same-origin, COEP: credentialless) for the SharedArrayBuffer bridge.

3. Run the end-to-end example

The browser e2e brings up the whole stack and drives the v0 matrix (range/collect, groupBy/agg Arrow parity, createDataFrame, spark.sql) in real Chromium:

make site
docker compose -f deploy/compose.yaml up -d
cd tests/e2e && npm install && npx playwright install --with-deps chromium
E2E_REQUIRE_STACK=1 npx playwright test          # full steps in tests/e2e/README.md

It also runs on every push - see .github/workflows/e2e.yml.

DataFrame API examples

Once connected it is ordinary PySpark. Runnable scripts live in examples/ (quickstart, transformations, aggregations, joins, window, sql, io); they double as plain native-PySpark scripts against any Spark Connect server.

Documentation

Full docs: https://hyukjinkwon.github.io/pyspark-client-wasm/

Compatibility

| Component | Supported |
|-----------|-----------|
| PySpark | >=4.0 (Spark Connect's wire protocol is stable across the 4.x line; install() raises below 4.0). CI exercises 4.0.0 and 4.1.2. |
| Spark Connect server | Spark 4.x (apache/spark:4.1.2 in the deploy stack; CI also runs 4.0.0) |
| Pyodide | >= 0.28 (Python 3.13) in the browser; Python 3.11+ for local dev |
| Proxy | Envoy with envoy.filters.http.grpc_web (v1.31-latest) |

The v0 target is full read-path parity - range/select/filter/groupBy/agg, toPandas, createDataFrame, and spark.sql(...) - returning results byte/row-exact versus a native Spark Connect run. See the design notes.
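
The compatibility table notes that install() raises below PySpark 4.0. How it does so is not shown here, so the following is only a guess at the shape of such a gate. The function name and the exception type are assumptions.

```python
def require_pyspark_4(version: str) -> None:
    """Raise if the given PySpark version string is older than 4.0."""
    major = int(version.split(".")[0])
    if major < 4:
        raise RuntimeError(f"pyspark>=4.0 is required, found {version}")
```

In practice the version string would come from pyspark.__version__, checked before any transport patching takes place.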

Development

conda create -n pcw python=3.11 && conda activate pcw
pip install -e ".[dev]"
pytest -q

Unit tests stub the transport: they never import grpcio and never touch a browser. grpcio is not available in Pyodide, so the package registers a lightweight gRPC shim (pyspark_connect_web/_grpc_shim.py) before PySpark is imported; CI fails if grpcio is imported anywhere under pyspark_connect_web/.
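
The contents of _grpc_shim.py are not reproduced here. As a generic illustration of the technique, the sketch below registers a stand-in module in sys.modules so that a later import of that name resolves to the stand-in without hitting the filesystem. The helper name and the attributes passed to it are hypothetical.

```python
import sys
import types


def register_shim(name: str, **attrs) -> types.ModuleType:
    """Install an in-memory module under `name` and return it."""
    shim = types.ModuleType(name)
    shim.__dict__.update(attrs)
    # Imports consult sys.modules first, so this entry wins over any lookup on disk.
    sys.modules[name] = shim
    return shim
```

The ordering matters: the shim has to be registered before the first import of the dependent library, because a failed import is not retried once the importing module has already raised.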

Build the JupyterLite site (produces _output/ served on :8000):

make site          # or: scripts/build_site.sh

Browser end-to-end tests run under Playwright against the deploy stack; see docs/running-locally.md. The contribution workflow and the lane/coordination model are described in CONTRIBUTING.md.
