Sparglim

Sparglim aims to provide a clean solution for PySpark applications in cloud-native scenarios (on K8S, Connect Server, etc.).

This is a fledgling project; PRs, feature requests, and discussions are all welcome!

🌟✨⭐ Star the project to support it!

Quick Start

Run JupyterLab with the sparglim Docker image:

docker run \
-it \
-p 8888:8888 \
wh1isper/jupyterlab-sparglim

Open http://localhost:8888 in a browser to use JupyterLab with sparglim. Then you can try the SQL Magic.

Run a Spark Connect Server as a daemon:

docker run \
-it \
-p 15002:15002 \
-p 4040:4040 \
wh1isper/sparglim-server

Open http://localhost:4040 for the Spark UI, and use sc://localhost:15002 to reach the Spark Connect Server. Use sparglim to set up a SparkSession that connects to the Spark Connect Server.

Install: pip install sparglim[all]

  • Install only the config tooling and the Spark Connect Server daemon: pip install sparglim
  • Install for PySpark apps: pip install sparglim[pyspark]
  • Install for using the magic within IPython/Jupyter (this also installs pyspark): pip install sparglim[magic]
  • Install all of the above (for example, to use the magic in JupyterLab on K8S): pip install sparglim[all]

Features

  • Configure Spark via environment variables, see config spark
  • %sql and %%sql magics for executing Spark SQL in IPython/Jupyter
    • SQL statements can span multiple lines; use ; to separate statements
    • Supports configuring a Connect client, see Spark Connect Overview
    • TODO: Visualize the result of a SQL statement (Spark DataFrame)
  • sparglim-server for running Spark Connect Server as a daemon
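
As a rough illustration of the environment-variable approach (not sparglim's actual implementation), the sketch below maps a few environment variables to Spark configuration keys. The variable names DEMO_SPARK_MASTER and DEMO_SPARK_APP_NAME and the helper conf_from_env are hypothetical and exist only for this example; see config spark for the variables sparglim really reads.

```python
import os
from typing import Dict, Mapping, Optional

# Hypothetical mapping, for illustration only: env var name -> Spark conf key.
# These are NOT the variable names sparglim reads.
DEMO_ENV_TO_CONF = {
    "DEMO_SPARK_MASTER": "spark.master",
    "DEMO_SPARK_APP_NAME": "spark.app.name",
}


def conf_from_env(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Collect a Spark conf entry for every mapped variable that is set."""
    env = os.environ if env is None else env
    return {
        conf_key: env[var]
        for var, conf_key in DEMO_ENV_TO_CONF.items()
        if var in env
    }


if __name__ == "__main__":
    # Pass an explicit mapping so the demo does not depend on the real environment.
    print(conf_from_env({"DEMO_SPARK_MASTER": "local[*]"}))
```

The resulting dictionary could then be handed to whatever builds the session; keeping the mapping in one table makes it easy to see which variables are honoured.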

Use cases

PySpark App

To configure Spark on K8S for data exploration, see examples/jupyter-sparglim-on-k8s

(TODO) To configure Spark for an ELT application/service, see pyspark-sampling

Spark Connect Server on K8S

To run Spark Connect Server as a daemon on K8S, see examples/sparglim-server

To run Spark Connect Server as a daemon on K8S and connect to it from JupyterLab, see examples/jupyter-sparglim-sc

Connect to Spark Connect Server

The only thing you need to do is set the SPARGLIM_REMOTE environment variable, in the format sc://host:port.

Example Code:

import os
os.environ["SPARGLIM_REMOTE"] = "sc://localhost:15002"  # or run `export SPARGLIM_REMOTE=sc://localhost:15002` before starting Python

from sparglim.config.builder import ConfigBuilder
from datetime import datetime, date
from pyspark.sql import Row


c = ConfigBuilder().config_connect_client()
spark = c.get_or_create()

df = spark.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
    Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0))
])
df.show()
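
Because a malformed SPARGLIM_REMOTE value is an easy mistake to make, it can help to check the sc://host:port format before creating the session. The helper below is a minimal sketch using only the standard library; parse_remote is a hypothetical name and is not part of sparglim.

```python
from typing import Tuple
from urllib.parse import urlparse


def parse_remote(remote: str) -> Tuple[str, int]:
    """Split an sc://host:port string into (host, port).

    Raises ValueError if the scheme is not "sc" or if host/port is missing.
    """
    parsed = urlparse(remote)
    if parsed.scheme != "sc":
        raise ValueError(f"expected scheme 'sc', got {parsed.scheme!r}")
    # parsed.port itself raises ValueError when the port is not a valid number.
    if not parsed.hostname or parsed.port is None:
        raise ValueError(f"expected sc://host:port, got {remote!r}")
    return parsed.hostname, parsed.port


if __name__ == "__main__":
    host, port = parse_remote("sc://localhost:15002")
    print(host, port)
```

Calling parse_remote on the value before exporting it gives an early, readable error instead of a connection failure later on.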

SQL Magic

Install Sparglim with the magic extra:

pip install sparglim[magic]

Load the magic in IPython/Jupyter:

%load_ext sparglim.sql

Create a view:

from datetime import datetime, date

from pyspark.sql import Row
from sparglim.config.builder import ConfigBuilder

c: ConfigBuilder = ConfigBuilder()
spark = c.get_or_create()

df = spark.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
    Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0))
])
df.createOrReplaceTempView("tb")

Query the view with %sql:

%sql SELECT * FROM tb

The DataFrame returned by %sql can be assigned to a variable:

df = %sql SELECT * FROM tb
df

Alternatively, use %%sql to execute multiple statements:

%%sql SELECT
        *
        FROM
        tb;

You can also use Spark SQL to load data from an external data source, such as:

%%sql CREATE TABLE tb_people
USING json
OPTIONS (path "/path/to/file.json");
SHOW TABLES;
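
To make the ";"-separated behaviour of a multi-statement cell concrete, here is a minimal sketch of how such a cell could be split into individual statements. It is an assumption about the general technique, not sparglim's implementation, and split_statements is a hypothetical helper. The split is deliberately naive: it does not handle a ; inside string literals or comments.

```python
from typing import List


def split_statements(cell: str) -> List[str]:
    """Naively split a SQL cell on ';' and drop empty pieces.

    Limitation: a ';' inside a string literal or comment is also
    treated as a separator.
    """
    return [part.strip() for part in cell.split(";") if part.strip()]


if __name__ == "__main__":
    cell = """CREATE TABLE tb_people
USING json
OPTIONS (path "/path/to/file.json");
SHOW TABLES;"""
    for statement in split_statements(cell):
        print(repr(statement))
```

Each resulting statement could then be passed to spark.sql one at a time, with the result of the last one shown to the user.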

Develop

Install pre-commit before committing:

pip install pre-commit
pre-commit install

Install the package locally:

pip install -e .[test]

Run the unit tests before opening a PR, and make sure new features are covered by unit tests:

pytest -v

(Optional, python<=3.10) Use pytype to check types:

pytype ./sparglim

Download files

Source Distribution

sparglim-0.1.0.tar.gz (23.6 kB)

Uploaded Source

Built Distribution

sparglim-0.1.0-py3-none-any.whl (17.1 kB)

Uploaded Python 3

File details

Details for the file sparglim-0.1.0.tar.gz.

File metadata

  • Download URL: sparglim-0.1.0.tar.gz
  • Size: 23.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.17

File hashes

Hashes for sparglim-0.1.0.tar.gz
Algorithm Hash digest
SHA256 cafe9d8f44b6212eb4cbf13a87ce226415537b2f563c4ecc32e7ebfdd5ee5fdb
MD5 d052d5bd94a91f1d626c85bcedbda644
BLAKE2b-256 64596503f9e7b24d0a1fd6e83b26bbe81b1c5a0e0a96cde2e69d6661e628943f

File details

Details for the file sparglim-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: sparglim-0.1.0-py3-none-any.whl
  • Size: 17.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.17

File hashes

Hashes for sparglim-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 e1756cdee108910f190badd89db232a33ed069b17839a31dc2c78771c7bebe05
MD5 3e47094de2d915f322fd1c5a8acdd01c
BLAKE2b-256 55f0137fd0920e2854ed777b0c109b0142c41fc0a06f3e77c61157e0a82a93ca
