Sparglim ✨

Sparglim aims to provide a clean solution for PySpark applications in cloud-native scenarios (on K8S, Spark Connect Server, etc.).

This is a fledgling project; PRs, feature requests, and discussions are all welcome!

🌟✨⭐ Star the project to support it!

Quick Start

Run JupyterLab with the sparglim Docker image:

docker run \
-it \
-p 8888:8888 \
wh1isper/jupyterlab-sparglim

Open http://localhost:8888 in a browser to use JupyterLab with sparglim. Then you can try the SQL Magic.

Run a Spark Connect Server as a daemon:

docker run \
-it \
-p 15002:15002 \
-p 4040:4040 \
wh1isper/sparglim-server

Open http://localhost:4040 for the Spark UI and use sc://localhost:15002 for the Spark Connect Server. Use sparglim to set up a SparkSession that connects to the Spark Connect Server.
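Before creating a session it can help to confirm that the two published ports are actually reachable. The following is a minimal sketch using only the Python standard library; `port_is_open` is a hypothetical helper written for this illustration, not part of sparglim, and the ports are the ones from the docker command above.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

With the container running, `port_is_open("localhost", 15002)` and `port_is_open("localhost", 4040)` should both return True. A successful TCP connection only shows that something is listening; it does not prove the listener is a healthy Spark Connect Server.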

Install: pip install sparglim[all]

  • Install only the config tools and the Spark Connect Server daemon: pip install sparglim
  • Install for a PySpark app: pip install sparglim[pyspark]
  • Install for using the magic within IPython/Jupyter (also installs pyspark): pip install sparglim[magic]
  • Install all of the above (for example, to use the magic in JupyterLab on K8S): pip install sparglim[all]
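To see which optional dependencies are present in the current environment, one option is to probe for their top-level modules. This is a sketch, not a sparglim feature: `missing_modules` is a hypothetical helper, and the assumption that the magic extra corresponds to the `pyspark` and `IPython` modules is inferred from the list above rather than from the package metadata.

```python
from importlib.util import find_spec


def missing_modules(names: list[str]) -> list[str]:
    """Return the top-level module names from `names` that cannot be imported."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if find_spec(name) is None]


# Assumed module names behind the magic extra (not taken from package metadata).
print(missing_modules(["pyspark", "IPython"]))
```

An empty list suggests the modules needed for the SQL magic are importable; any names printed are the ones still to install.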

Features

  • Configure Spark via environment variables
  • %sql and %%sql magic for executing Spark SQL in IPython/Jupyter
    • SQL statements can span multiple lines; use ; to separate statements
    • Supports configuring the Connect client, see Spark Connect Overview
    • TODO: Visualize the result of a SQL statement (Spark DataFrame)
  • sparglim-server for running a Spark Connect Server as a daemon
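Because configuration comes from environment variables, it can be convenient in tests or notebooks to set a variable only for a limited scope. The sketch below uses a hypothetical `temporary_env` context manager (not part of sparglim) and the `SPARGLIM_REMOTE` variable described later in this document. It assumes the variables are read at the time the builder is created, which the source does not state.

```python
import os
from contextlib import contextmanager


@contextmanager
def temporary_env(**overrides: str):
    """Set environment variables inside a with-block, then restore the old values."""
    # Remember the previous value (or None if unset) for every overridden key.
    saved = {key: os.environ.get(key) for key in overrides}
    os.environ.update(overrides)
    try:
        yield
    finally:
        for key, old in saved.items():
            if old is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = old
```

Usage would look like `with temporary_env(SPARGLIM_REMOTE="sc://localhost:15002"): ...`, with the session created inside the block.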

Use cases

Basic

from sparglim.config.builder import ConfigBuilder
from datetime import datetime, date
from pyspark.sql import Row

# Create a local[*] spark session with s3&kerberos config
spark = ConfigBuilder().get_or_create()

df = spark.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
    Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0))
])
df.show()

Building a PySpark App

To configure Spark on K8S for data exploration, see examples/jupyter-sparglim-on-k8s

To configure Spark for an ELT application/service, see the pyspark-sampling project

Deploy Spark Connect Server on K8S (And Connect to it)

To run a Spark Connect Server as a daemon on K8S, see examples/sparglim-server

To run a Spark Connect Server as a daemon on K8S and connect to it from JupyterLab, see examples/jupyter-sparglim-sc

Connect to Spark Connect Server

The only thing you need to do is set the SPARGLIM_REMOTE environment variable; the format is sc://host:port

Example code:

import os
os.environ["SPARGLIM_REMOTE"] = "sc://localhost:15002"  # or run `export SPARGLIM_REMOTE=sc://localhost:15002` before starting Python

from sparglim.config.builder import ConfigBuilder
from datetime import datetime, date
from pyspark.sql import Row


c = ConfigBuilder().config_connect_client()
spark = c.get_or_create()

df = spark.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
    Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0))
])
df.show()
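A malformed SPARGLIM_REMOTE value is easy to produce by hand, so a quick format check before building the session may save a confusing connection error. The sketch below validates only the sc://host:port form stated above; `parse_remote` is a hypothetical helper written with the standard library, not a sparglim function.

```python
from urllib.parse import urlparse


def parse_remote(remote: str) -> tuple[str, int]:
    """Split an sc://host:port string into (host, port), or raise ValueError."""
    parsed = urlparse(remote)
    if parsed.scheme != "sc":
        raise ValueError(f"expected scheme 'sc', got {parsed.scheme!r}")
    # parsed.port itself raises ValueError when the port is not a valid number.
    if not parsed.hostname or parsed.port is None:
        raise ValueError("expected the form sc://host:port")
    return parsed.hostname, parsed.port
```

For example, `parse_remote("sc://localhost:15002")` returns `("localhost", 15002)`, while a value without the `sc` scheme or without a port raises `ValueError`.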

SQL Magic

Install Sparglim with:

pip install sparglim["magic"]

Load the magic in IPython/Jupyter:

%load_ext sparglim.sql
spark # show SparkSession brief info

Create a view:

from datetime import datetime, date
from pyspark.sql import Row

df = spark.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
    Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0))
])
df.createOrReplaceTempView("tb")

Query the view with %sql:

%sql SELECT * FROM tb

The DataFrame returned by %sql can be assigned to a variable:

df = %sql SELECT * FROM tb
df

Alternatively, use %%sql to execute multiple statements:

%%sql SELECT
        *
        FROM
        tb;

You can also use Spark SQL to load data from an external data source, such as:

%%sql CREATE TABLE tb_people
USING json
OPTIONS (path "/path/to/file.json");
Show tables;
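To illustrate what separating statements with ; involves, here is a small sketch of a splitter that ignores semicolons inside quoted strings. It is an illustration only and not necessarily how the sparglim magic splits its input; `split_statements` is a hypothetical name, and the sketch does not handle backslash escapes, backtick identifiers, or SQL comments.

```python
def split_statements(sql: str) -> list[str]:
    """Split SQL text on semicolons that are outside single- or double-quoted strings."""
    statements = []
    current = []
    quote = None  # the quote character we are currently inside, if any
    for ch in sql:
        if quote:
            if ch == quote:
                quote = None  # closing quote ends the string
        elif ch in ("'", '"'):
            quote = ch  # opening quote starts a string
        elif ch == ";":
            # Statement boundary: store what we have and drop the semicolon.
            statements.append("".join(current).strip())
            current = []
            continue
        current.append(ch)
    statements.append("".join(current).strip())
    # Discard empty fragments, such as the one after a trailing semicolon.
    return [s for s in statements if s]
```

Applied to the cell body above, this yields two statements: the CREATE TABLE statement (with its quoted path intact) and Show tables.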

Develop

Install pre-commit before committing:

pip install pre-commit
pre-commit install

Install the package locally:

pip install -e .[test]

Run the unit tests before opening a PR, and make sure new features are covered by unit tests:

pytest -v

(Optional, python<=3.10) Use pytype to check types:

pytype ./sparglim
