Sparglim ✨
Sparglim aims to provide a clean solution for PySpark applications in cloud-native scenarios (on K8S, Connect Server, etc.).
This is a fledgling project; PRs, feature requests, and discussions are all welcome!
🌟✨⭐ Star to support!
Quick Start
Run JupyterLab with the sparglim Docker image:
docker run \
-it \
-p 8888:8888 \
wh1isper/jupyterlab-sparglim
Access http://localhost:8888 in a browser to use JupyterLab with sparglim. Then you can try SQL Magic.
Run a Spark Connect Server as a daemon:
docker run \
-it \
-p 15002:15002 \
-p 4040:4040 \
wh1isper/sparglim-server
Access http://localhost:4040 for the Spark UI and sc://localhost:15002 for the Spark Connect Server. Use sparglim to set up a SparkSession that connects to the Spark Connect Server.
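After starting the container, the server may take a moment before port 15002 accepts connections. The snippet below is a minimal sketch of a readiness check using only the Python standard library. The helper name wait_for_port is hypothetical and not part of sparglim; it only checks that a TCP connection can be opened, not that Spark Connect itself is healthy.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    Hypothetical helper for illustration; not part of sparglim.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For the container above, this could be called as wait_for_port("localhost", 15002) before building a SparkSession.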
Install: pip install sparglim[all]
- Install only for configuring and running a Spark Connect Server as a daemon
pip install sparglim
- Install for a PySpark app
pip install sparglim[pyspark]
- Install for using the magic within IPython/Jupyter (also installs pyspark)
pip install sparglim[magic]
- Install all of the above (for example, to use the magic in JupyterLab on K8S)
pip install sparglim[all]
Features
- Configure Spark via environment variables
- %sql and %%sql magic for executing Spark SQL in IPython/Jupyter
  - SQL statements can span multiple lines; use ; to separate statements
  - Supports configuring the connect client, see Spark Connect Overview
  - TODO: Visualize the result of a SQL statement (Spark DataFrame)
- sparglim-server for running a Spark Connect Server as a daemon
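To illustrate the general idea of configuring Spark from environment variables, here is a minimal sketch in plain Python. It is not sparglim's implementation: the mapping table, the EXAMPLE_APP_NAME variable, and the choice of Spark keys are assumptions made for illustration. Only SPARGLIM_REMOTE is a name taken from this README.

```python
import os
from typing import Dict, Mapping, Optional

# Hypothetical mapping, for illustration only. SPARGLIM_REMOTE appears in this
# README; EXAMPLE_APP_NAME and the pairing with these Spark keys are assumptions.
ENV_TO_SPARK_CONF = {
    "SPARGLIM_REMOTE": "spark.remote",
    "EXAMPLE_APP_NAME": "spark.app.name",
}


def conf_from_env(environ: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Collect Spark config entries for every mapped variable that is set."""
    env = os.environ if environ is None else environ
    return {
        conf_key: env[name]
        for name, conf_key in ENV_TO_SPARK_CONF.items()
        if name in env
    }
```

Passing an explicit mapping instead of reading os.environ directly keeps the function easy to test.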
Use cases
Basic
from sparglim.config.builder import ConfigBuilder
from datetime import datetime, date
from pyspark.sql import Row
# Create a local[*] spark session with s3&kerberos config
spark = ConfigBuilder().get_or_create()
df = spark.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
    Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0))
])
df.show()
Building a PySpark App
To configure Spark on K8S for data exploration, see examples/jupyter-sparglim-on-k8s
To configure Spark for an ELT application/service, see the pyspark-sampling project
Deploy Spark Connect Server on K8S (and connect to it)
To run a Spark Connect Server as a daemon on K8S, see examples/sparglim-server
To run a Spark Connect Server as a daemon on K8S and connect to it from JupyterLab, see examples/jupyter-sparglim-sc
Connect to Spark Connect Server
The only thing you need to do is set the SPARGLIM_REMOTE environment variable; its format is sc://host:port
Example Code:
import os
# Alternatively, run `export SPARGLIM_REMOTE=sc://localhost:15002` before starting Python
os.environ["SPARGLIM_REMOTE"] = "sc://localhost:15002"
from sparglim.config.builder import ConfigBuilder
from datetime import datetime, date
from pyspark.sql import Row
c = ConfigBuilder().config_connect_client()
spark = c.get_or_create()
df = spark.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
    Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0))
])
df.show()
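A malformed SPARGLIM_REMOTE value is easier to diagnose before a session is created. The following is a minimal sketch that checks a value against the sc://host:port format described above, using only the standard library. The function name parse_connect_url is hypothetical and not part of sparglim, and requiring an explicit port is a choice made here to match the stated format.

```python
from typing import Tuple
from urllib.parse import urlsplit


def parse_connect_url(url: str) -> Tuple[str, int]:
    """Split an sc://host:port URL into (host, port); raise ValueError if malformed.

    Hypothetical helper for illustration; not part of sparglim.
    """
    parts = urlsplit(url)
    if parts.scheme != "sc":
        raise ValueError(f"expected scheme 'sc', got {parts.scheme!r}")
    if not parts.hostname:
        raise ValueError("missing host")
    # Accessing .port raises ValueError itself for a non-numeric or out-of-range port.
    port = parts.port
    if port is None:
        raise ValueError("missing port")
    return parts.hostname, port
```

For example, parse_connect_url(os.environ["SPARGLIM_REMOTE"]) could be called right after the variable is set.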
SQL Magic
Install Sparglim with
pip install sparglim["magic"]
Load magic in IPython/Jupyter
%load_ext sparglim.sql
spark # show SparkSession brief info
Create a view:
from datetime import datetime, date
from pyspark.sql import Row
df = spark.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
    Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0))
])
df.createOrReplaceTempView("tb")
Query the view with %sql:
%sql SELECT * FROM tb
The DataFrame returned by %sql can be assigned to a variable:
df = %sql SELECT * FROM tb
df
Alternatively, %%sql can be used to execute multiple statements:
%%sql SELECT
*
FROM
tb;
You can also use Spark SQL to load data from an external data source, for example:
%%sql CREATE TABLE tb_people
USING json
OPTIONS (path "/path/to/file.json");
Show tables;
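To make the ;-separated behaviour concrete, here is a minimal sketch of how a multi-statement cell could be split into individual statements. This is not sparglim's actual implementation, and split_statements is a hypothetical name. It ignores semicolons inside quoted strings, but does not handle backslash-escaped quotes or SQL comments.

```python
from typing import List


def split_statements(sql: str) -> List[str]:
    """Split SQL text on ';' outside quoted strings, dropping empty statements.

    Hypothetical helper for illustration; not part of sparglim.
    """
    statements: List[str] = []
    current: List[str] = []
    quote = None  # the quote character we are currently inside, if any

    def flush() -> None:
        # Store the accumulated statement unless it is only whitespace.
        stmt = "".join(current).strip()
        if stmt:
            statements.append(stmt)
        current.clear()

    for ch in sql:
        if quote is not None:
            current.append(ch)
            if ch == quote:
                quote = None  # closing quote reached
        elif ch in ("'", '"', "`"):
            quote = ch
            current.append(ch)
        elif ch == ";":
            flush()
        else:
            current.append(ch)
    flush()  # keep a trailing statement that has no final ';'
    return statements
```

Each returned string could then be passed to spark.sql one at a time.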
Develop
Install pre-commit before committing
pip install pre-commit
pre-commit install
Install package locally
pip install -e .[test]
Run the unit tests before opening a PR, and ensure that new features are covered by unit tests
pytest -v
(Optional, python<=3.10) Use pytype to check types
pytype ./sparglim