Run the PySpark Connect client in JupyterLite/Pyodide via a grpc-web transport (PySpark in JupyterLite).

pyspark-connect-web - PySpark in JupyterLite

Run the real PySpark Connect Python client inside a browser (JupyterLite/Pyodide), talking to an Apache Spark Connect server through a grpc-web transport. Your existing PySpark code runs unchanged - no reimplementation, no local JVM, no Python backend server.

import pyspark_connect_web as pcw
pcw.install()

from pyspark.sql import SparkSession
spark = SparkSession.builder.remote("sc://localhost:8081/;transport=grpcweb").getOrCreate()
spark.range(10).filter("id % 2 = 0").toPandas()   # runs in your browser tab

This is a thin client, not local compute. You still need a running Spark Connect server (Spark 4.x) behind an Envoy grpc-web proxy. What you gain: no Python backend server, and the real PySpark API anywhere a browser runs.

How it works

PySpark's Connect client is pure Python above a single gRPC stub: it builds protobuf plans and ships them to the server. We monkey-patch only that stub with a grpc-web/fetch transport, and make calls blocking via a Web Worker + Atomics/SharedArrayBuffer bridge so .collect() returns data synchronously. Everything above the stub - DataFrame, Column, functions - is untouched. We patch; we do not fork PySpark. See docs/architecture.md.
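To make the wire framing concrete, here is a minimal sketch of grpc-web message framing as the public grpc-web protocol describes it: each frame is a 1-byte flag, a 4-byte big-endian length, and the payload, and a frame with the high flag bit set carries trailers such as grpc-status. This is not the package's transport code; the function names are hypothetical and the payload is a placeholder for a serialized protobuf plan.

```python
import struct

DATA_FRAME = 0x00     # flag byte for a message frame
TRAILER_FRAME = 0x80  # high bit set: the frame carries trailers


def encode_frame(payload: bytes, flags: int = DATA_FRAME) -> bytes:
    """Prefix a payload with the 5-byte header (flag + big-endian length)."""
    return struct.pack(">BI", flags, len(payload)) + payload


def decode_frames(body: bytes) -> list[tuple[int, bytes]]:
    """Split a response body into (flags, payload) frames."""
    frames = []
    offset = 0
    while offset < len(body):
        if len(body) - offset < 5:
            raise ValueError("truncated frame header")
        flags, length = struct.unpack_from(">BI", body, offset)
        offset += 5
        end = offset + length
        if end > len(body):
            raise ValueError("truncated frame payload")
        frames.append((flags, body[offset:end]))
        offset = end
    return frames


def parse_trailers(payload: bytes) -> dict[str, str]:
    """Parse a trailers frame, which is formatted like HTTP/1 header lines."""
    trailers = {}
    for line in payload.decode("ascii").split("\r\n"):
        if line:
            key, _, value = line.partition(":")
            trailers[key.strip().lower()] = value.strip()
    return trailers


# Placeholder bytes stand in for a serialized protobuf message.
body = encode_frame(b"plan") + encode_frame(b"grpc-status:0\r\n", TRAILER_FRAME)
frames = decode_frames(body)
```

In a real exchange the data frames would hold protobuf responses and the final trailers frame would report the call status.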

Requirements

A browser (for the client) or Python 3.11+ (for local dev/tests).
In the browser: Pyodide >= 0.28 (Python 3.13), which already ships pyarrow, pandas, protobuf, and numpy. grpcio is not available in Pyodide and is never imported - all transport is grpc-web over fetch.
pyspark>=4.0 (pinned by install(); provided by Pyodide in the browser).
A running Spark Connect server (Spark 4.x) behind an Envoy grpc-web proxy - the deploy/ stack brings this up for you.
The JupyterLite page must be cross-origin isolated (COOP: same-origin + COEP: credentialless), which the deploy stack serves for you. Without it, SharedArrayBuffer - the backbone of the blocking bridge - is unavailable.
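When debugging a host that is not the deploy stack, it can help to check the response headers of the page yourself. The sketch below is a hypothetical helper (not part of the package) that reports whether a set of response headers carries a COOP/COEP pair that browsers accept for cross-origin isolation. It assumes plain header values without extra parameters, and the browser remains the final authority.

```python
def has_isolation_headers(headers: dict[str, str]) -> bool:
    """Return True if the headers carry a COOP/COEP pair that enables isolation."""
    # Header names are case-insensitive, so normalize before looking them up.
    normalized = {name.lower(): value.strip().lower() for name, value in headers.items()}
    coop = normalized.get("cross-origin-opener-policy")
    coep = normalized.get("cross-origin-embedder-policy")
    # COEP "require-corp" also enables isolation; this document uses "credentialless".
    return coop == "same-origin" and coep in {"credentialless", "require-corp"}


page_headers = {
    "Cross-Origin-Opener-Policy": "same-origin",
    "Cross-Origin-Embedder-Policy": "credentialless",
}
isolated = has_isolation_headers(page_headers)
```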

Installation

For local development and tests, use a conda environment:

conda create -n pcw python=3.11
conda activate pcw
pip install pyspark-connect-web

In the browser (JupyterLite/Pyodide), install with micropip inside the kernel:

import micropip
await micropip.install("pyspark-connect-web")

The import / package name is pyspark_connect_web (distribution name pyspark-connect-web); the repository is pyspark-client-wasm.

Running a local Spark Connect server

You need a Spark Connect server, and - for the browser - an Envoy grpc_web proxy in front of it (a browser cannot speak raw gRPC). Two options:

Recommended: the full stack (server + Envoy + site)

The deploy/ stack brings up a Spark Connect server, the Envoy grpc_web proxy, and a static host for the JupyterLite site with the mandatory cross-origin-isolation headers - everything the browser client needs:

docker compose -f deploy/compose.yaml up
# wait for the "spark-connect" container to report healthy (~60s cold start)

Endpoint	Purpose
`sc://localhost:8081/;transport=grpcweb`	grpc-web endpoint the browser client connects to
`http://localhost:8000/`	JupyterLite site, served with `Cross-Origin-Opener-Policy: same-origin` + `Cross-Origin-Embedder-Policy: credentialless` (required for `SharedArrayBuffer`)
`:15002`	Spark Connect raw gRPC (native clients / reference generator)
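If you want to serve a built site for a quick local experiment without the deploy stack, the two headers can be added with Python's standard library. This is only a sketch and not how the deploy stack serves the site: `IsolatedHandler` and `make_server` are hypothetical names, and the `_output` directory in the usage comment is the build directory mentioned elsewhere in this document.

```python
import functools
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

# The header pair this document requires for SharedArrayBuffer.
ISOLATION_HEADERS = {
    "Cross-Origin-Opener-Policy": "same-origin",
    "Cross-Origin-Embedder-Policy": "credentialless",
}


class IsolatedHandler(SimpleHTTPRequestHandler):
    """Static file handler that adds the isolation headers to every response."""

    def end_headers(self):
        for name, value in ISOLATION_HEADERS.items():
            self.send_header(name, value)
        super().end_headers()


def make_server(directory: str, port: int = 8000) -> ThreadingHTTPServer:
    """Build (but do not start) a localhost-only server for a directory."""
    handler = functools.partial(IsolatedHandler, directory=directory)
    return ThreadingHTTPServer(("127.0.0.1", port), handler)


# Usage (blocks until interrupted):
#   make_server("_output").serve_forever()
```

This covers only the static site; the grpc-web endpoint still needs the Envoy proxy in front of Spark Connect.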

Lightweight: just a Spark Connect server (no Docker)

For testing with a native PySpark client (or trying Spark Connect without the browser), download a Spark release (needs Java 17) and start its Connect server. Recent Spark releases bundle Spark Connect, so no --packages is needed:

SPARK_VERSION=4.1.2   # use the latest 4.1.x: https://spark.apache.org/downloads.html
curl -LO "https://dlcdn.apache.org/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop3.tgz"
tar xzf "spark-${SPARK_VERSION}-bin-hadoop3.tgz" && cd "spark-${SPARK_VERSION}-bin-hadoop3"
./sbin/start-connect-server.sh
# -> Spark Connect on sc://localhost:15002  (raw gRPC)

Then a native client can connect: SparkSession.builder.remote("sc://localhost:15002"). The browser client still needs Envoy in front (use the full stack above) - pcw.install() then talks to sc://localhost:8081/;transport=grpcweb.

See deploy/README.md for ports, version pins, and CORS/header curl checks, and docs/running-locally.md for the full walkthrough.

Connecting

Point the client at the grpc-web proxy after install():

import pyspark_connect_web as pcw
pcw.install()

from pyspark.sql import SparkSession
spark = SparkSession.builder.remote("sc://localhost:8081/;transport=grpcweb").getOrCreate()

The connection string is the standard Spark Connect sc:// URI with a transport=grpcweb parameter. A plain https:// or http:// shorthand is also accepted. For anything beyond localhost, terminate TLS at the proxy and use a secure context - outside localhost, a browser needs HTTPS for crossOriginIsolated.
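For readers unfamiliar with the sc:// format, the sketch below splits such a URI into host, port, and its semicolon-separated parameters. It illustrates the string format only and is not the parser PySpark or this package uses; `parse_sc_uri` is a hypothetical name, and falling back to port 15002 when none is given is an assumption based on the raw gRPC port shown in this document.

```python
from urllib.parse import urlsplit

DEFAULT_PORT = 15002  # assumption: the raw Spark Connect gRPC port used above


def parse_sc_uri(uri: str) -> tuple[str, int, dict[str, str]]:
    """Split an sc:// URI into (host, port, params)."""
    parts = urlsplit(uri)
    if parts.scheme != "sc" or not parts.hostname:
        raise ValueError(f"not a Spark Connect URI: {uri!r}")
    port = parts.port or DEFAULT_PORT
    params = {}
    # The path looks like "/;key=value;key2=value2"; the first segment is "/".
    for segment in parts.path.split(";")[1:]:
        if not segment:
            continue
        key, sep, value = segment.partition("=")
        if not sep:
            raise ValueError(f"malformed parameter: {segment!r}")
        params[key] = value
    return parts.hostname, port, params


host, port, params = parse_sc_uri("sc://localhost:8081/;transport=grpcweb")
```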

For TLS + auth (the hardened prod overlay), the proxy is the enforcement point: it gates on Authorization: Bearer <token> and forwards the header upstream. Bring it up with:

# provide a TLS cert (deploy/certs/), set your origins, then:
docker compose -f deploy/compose.yaml -f deploy/compose.prod.yaml up -d
# or: make up-prod

See docs/connection-patterns.md and deploy/README.md (TLS, CORS allowlist, bearer-token gate -> jwt_authn/ext_authz).

Ways to use it

Pick the path that fits - all of them run the real PySpark API in the browser.

1. In JupyterLite (a notebook, nothing to install)

Build the site and bring up the stack (Spark Connect + Envoy grpc-web + the JupyterLite site, served cross-origin isolated on :8000):

make site                                  # build the JupyterLite site into _output/
docker compose -f deploy/compose.yaml up   # serves :8000 (site) + :8081 (grpc-web) + :15002 (Spark)

Open http://localhost:8000/, then in a notebook cell:

import pyspark_connect_web as pcw
pcw.install()

from pyspark.sql import SparkSession
spark = SparkSession.builder.remote("sc://localhost:8081/;transport=grpcweb").getOrCreate()
spark.range(10).filter("id % 2 = 0").toPandas()

GitHub Pages / other static hosts: see JupyterLite hosting.

2. Embed it in your own web page

The site ships a small, self-contained page that boots Pyodide in a Web Worker, micropip-installs the wheel, runs pcw.install(), binds a SparkSession, and exposes window.__pcwRunPython(src). Use pyspark_connect_web/jupyterlite/harness.html as the reference for wiring worker/worker_bootstrap.js + worker/bridge.js into your app. The page must be cross-origin isolated (COOP: same-origin, COEP: credentialless) for the SharedArrayBuffer bridge.

3. Run the end-to-end example

The browser e2e brings up the whole stack and drives the v0 matrix (range/collect, groupBy/agg Arrow parity, createDataFrame, spark.sql) in real Chromium:

make site
docker compose -f deploy/compose.yaml up -d
cd tests/e2e && npm install && npx playwright install --with-deps chromium
E2E_REQUIRE_STACK=1 npx playwright test          # full steps in tests/e2e/README.md

It also runs on every push - see .github/workflows/e2e.yml.

DataFrame API examples

Once connected it is ordinary PySpark. Runnable scripts live in examples/ (quickstart, transformations, aggregations, joins, window, sql, io); they double as plain native-PySpark scripts against any Spark Connect server.

Documentation

Full docs: https://hyukjinkwon.github.io/pyspark-client-wasm/

Architecture - the stub seam, the sync bridge, the wire framing.
Quickstart and Running locally.
Connection patterns - sc:// URIs, TLS, auth.
Installation and JupyterLite hosting.
Packaging & release.
Security - threat model (cross-origin isolation, CORS, auth, untrusted server, notebook XSS).

Compatibility

Component	Supported
PySpark	`>=4.0` (Spark Connect's wire protocol is stable across the 4.x line; `install()` raises below 4.0). CI exercises 4.0.0 and 4.1.2.
Spark Connect server	Spark 4.x (`apache/spark:4.1.2` in the deploy stack; CI also runs 4.0.0)
Pyodide	>= 0.28 (Python 3.13) in the browser; Python 3.11+ for local dev
Proxy	Envoy with `envoy.filters.http.grpc_web` (`v1.31-latest`)
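The version gate in the table can be pictured as a simple major-version check. The sketch below is an assumption about the shape of such a check, not the code inside install(): the function name, the exception type, and the message are all hypothetical.

```python
def require_min_major(version: str, minimum: int = 4) -> None:
    """Raise if the major component of a version string is below `minimum`."""
    major_text = version.split(".", 1)[0]
    if not major_text.isdigit():
        raise ValueError(f"unrecognized version string: {version!r}")
    if int(major_text) < minimum:
        raise RuntimeError(f"pyspark>={minimum}.0 is required, found {version}")


require_min_major("4.1.2")  # passes silently

try:
    require_min_major("3.5.1")
    rejected = False
except RuntimeError:
    rejected = True
```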

The v0 target is full read-path parity - range/select/filter/groupBy/agg, toPandas, createDataFrame, and spark.sql(...) - returning results byte/row-exact versus a native Spark Connect run. See the design notes.

Development

conda create -n pcw python=3.11 && conda activate pcw
pip install -e ".[dev]"
pytest -q

Unit tests stub the transport: they never import grpcio and never touch a browser. grpcio is not available in Pyodide, so the package registers a lightweight gRPC shim (pyspark_connect_web/_grpc_shim.py) before PySpark is imported; CI fails if grpcio is imported anywhere under pyspark_connect_web/.
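The two ideas in this paragraph, a module registered before the real import happens and a check that forbids importing grpcio, can be sketched with the standard library. This is not the content of _grpc_shim.py or of the CI job: the helper names are hypothetical, and the demo registers a placeholder module name rather than grpc (the import name of grpcio) so that it cannot disturb a real installation.

```python
import ast
import importlib
import sys
import types


def register_shim(name: str) -> types.ModuleType:
    """Place an empty stand-in module in sys.modules unless one already exists."""
    if name in sys.modules:
        return sys.modules[name]
    shim = types.ModuleType(name)
    shim.__doc__ = "Lightweight stand-in registered before the real import."
    sys.modules[name] = shim
    return shim


def imports_module(source: str, forbidden: str) -> bool:
    """Return True if the source has an absolute import of `forbidden` or a submodule."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0:
            names = [node.module or ""]
        else:
            continue
        if any(n == forbidden or n.startswith(forbidden + ".") for n in names):
            return True
    return False


# Later imports of the placeholder name resolve to the registered shim.
shim = register_shim("pcw_demo_grpc_shim")
resolved = importlib.import_module("pcw_demo_grpc_shim")
```

A CI check along these lines would read each file under the package directory and fail when `imports_module(text, "grpc")` returns True.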

Build the JupyterLite site (produces _output/ served on :8000):

make site          # or: scripts/build_site.sh

Browser end-to-end tests run under Playwright against the deploy stack; see docs/running-locally.md. For the contribution workflow and the lane/coordination model, see CONTRIBUTING.md.

Release history

0.2.0 - Jun 12, 2026 (this version)
0.1.0 - Jun 12, 2026

Download files

File	Type	Size	Uploaded
pyspark_connect_web-0.2.0.tar.gz	Source distribution	73.8 kB	Jun 12, 2026
pyspark_connect_web-0.2.0-py3-none-any.whl	Built distribution (Python 3)	48.5 kB	Jun 12, 2026

Both files were uploaded via twine/6.1.0 CPython/3.13.12, without Trusted Publishing.

File hashes

Hashes for pyspark_connect_web-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`2ec11c93c0232e02ed62d50c4892dfaaa5e308ea89ce95df729e16b960707906`
MD5	`771a7ea9f94a7e5fed6968fc0d52d192`
BLAKE2b-256	`91f2d707c07db2332028ae71713100915296df8cab3585a201fb0d443512b7e5`

Hashes for pyspark_connect_web-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`24b77ff2d06899fdaa0d592607d477b36241fed678cde0bdbcf71590a48a2214`
MD5	`c214d73a56da80ab4fdff215a448a726`
BLAKE2b-256	`cc9b4078727a325e0164cceb8c57f83fb16cb4badac0b69eb6ccdf4fbf3c86d7`
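To check a downloaded file against a published SHA256 digest, a short standard-library script is enough. This is a generic sketch with hypothetical helper names; the demo hashes a temporary file so that it runs anywhere, and for a real check you would pass the path of the downloaded archive and the digest listed for it.

```python
import hashlib
import hmac
import os
import tempfile


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 with an expected hex digest."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())


# Demo on a temporary file with known content.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example bytes")
    demo_path = tmp.name
demo_ok = matches_digest(demo_path, hashlib.sha256(b"example bytes").hexdigest())
os.remove(demo_path)
```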
