A library that provides useful extensions to Apache Spark.
Spark Extension
This project provides extensions to the Apache Spark project in Scala and Python:
Diff: A diff transformation for Datasets that computes the differences between two datasets, i.e. which rows to add, delete or change to get from one dataset to the other.
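Conceptually, such a diff classifies each row (by key) as inserted, deleted, changed, or unchanged. The sketch below illustrates that classification in plain Python; the function name, the dict-based representation, and the I/D/C/N labels are illustrative only, not the library's API or implementation.

```python
def diff(left: dict, right: dict) -> dict:
    """Classify each key: how to get from `left` to `right` (illustrative sketch)."""
    result = {}
    for key in left.keys() | right.keys():
        if key not in right:
            result[key] = "D"   # present only in left: delete
        elif key not in left:
            result[key] = "I"   # present only in right: insert
        elif left[key] != right[key]:
            result[key] = "C"   # present in both, values differ: change
        else:
            result[key] = "N"   # identical in both: no change
    return result

left = {1: "a", 2: "b", 3: "c"}
right = {1: "a", 2: "x", 4: "d"}
print(diff(left, right))  # e.g. {1: 'N', 2: 'C', 3: 'D', 4: 'I'} (key order may vary)
```

The library performs this comparison distributed over Spark Datasets; see the README for the actual diff API and options.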
Global Row Number: A withRowNumbers transformation that provides the global row number w.r.t. the current order of the Dataset, or any given order. In contrast to the existing SQL function row_number, which requires a window spec, this transformation provides the row number across the entire Dataset without scaling problems.
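A common distributed technique for global row numbers is a two-pass scheme: count the rows per partition, compute cumulative offsets, then number rows locally and add the partition's offset. The plain-Python sketch below illustrates that idea; it is an illustration of the general technique, not the library's actual implementation.

```python
# Rows split across partitions, already in the desired order.
partitions = [["a", "b"], ["c"], ["d", "e", "f"]]

# Pass 1: per-partition row counts, then each partition's starting offset.
counts = [len(p) for p in partitions]
offsets = [sum(counts[:i]) for i in range(len(counts))]  # [0, 2, 3]

# Pass 2: number rows locally (1-based) and add the partition offset.
numbered = [(off + i + 1, row)
            for p, off in zip(partitions, offsets)
            for i, row in enumerate(p)]
print(numbered)  # [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd'), (5, 'e'), (6, 'f')]
```

Only the small list of per-partition counts is shared across partitions, which is why this scales where a single global window does not.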
Inspect Parquet files: The structure of Parquet files (the metadata, not the data stored in Parquet) can be inspected, similarly to parquet-tools or parquet-cli, by reading from a simple Spark data source. This simplifies identifying why some Parquet files cannot be split by Spark into scalable partitions.
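For context on what this metadata is: every Parquet file ends with a Thrift-encoded footer (row groups, column chunks, encodings), followed by a 4-byte little-endian footer length and the magic bytes "PAR1". The sketch below parses that trailer from synthetic bytes to show the file layout; it is a standalone illustration of the Parquet format, not the library's data source, which exposes the decoded metadata as a DataFrame.

```python
import struct

def parse_trailer(data: bytes):
    """Return (footer_length, magic) from the last 8 bytes of a Parquet file."""
    length, magic = struct.unpack("<I4s", data[-8:])
    return length, magic

# Synthetic file tail: 16 bytes standing in for the Thrift footer,
# then the 4-byte footer length and the "PAR1" magic.
fake_footer = b"\x00" * 16
tail = fake_footer + struct.pack("<I", len(fake_footer)) + b"PAR1"
print(parse_trailer(tail))  # (16, b'PAR1')
```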
For details, see the README.md at the project homepage.
Using Spark Extension
PyPI package (local Spark cluster only)
You may want to install the pyspark-extension Python package from PyPI into your development environment. This gives you code completion, typing, and test capabilities during development. Running your Python application on a Spark cluster still requires one of the ways below to add the Scala package to the Spark environment.
pip install pyspark-extension==2.7.0.3.3
Note: Pick the right Spark version (here 3.3) depending on your PySpark version.
PySpark API
Start a PySpark session with the Spark Extension dependency (version ≥1.1.0) as follows:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.config("spark.jars.packages", "uk.co.gresearch.spark:spark-extension_2.12:2.7.0-3.3") \
.getOrCreate()
Note: Pick the right Scala version (here 2.12) and Spark version (here 3.3) depending on your PySpark version.
PySpark REPL
Launch the Python Spark REPL with the Spark Extension dependency (version ≥1.1.0) as follows:
pyspark --packages uk.co.gresearch.spark:spark-extension_2.12:2.7.0-3.3
Note: Pick the right Scala version (here 2.12) and Spark version (here 3.3) depending on your PySpark version.
PySpark spark-submit
Run your Python scripts that use PySpark via spark-submit:
spark-submit --packages uk.co.gresearch.spark:spark-extension_2.12:2.7.0-3.3 [script.py]
Note: Pick the right Scala version (here 2.12) and Spark version (here 3.3) depending on your Spark version.
Your favorite Data Science notebook
There are plenty of Data Science notebooks around. To use this library, add a jar dependency to your notebook using these Maven coordinates:
uk.co.gresearch.spark:spark-extension_2.12:2.7.0-3.3
Alternatively, download the jar, place it on a filesystem accessible to the notebook, and reference that jar file directly.
Check the documentation of your favorite notebook to learn how to add jars to your Spark environment.
Download files
Hashes for pyspark_extension-2.7.0.3.1-py3-none-any.whl

Algorithm | Hash digest
---|---
SHA256 | 96e1a36b5f8fa2ad86d17813db3651cee064b955704dfa7b87773e7c068f9f4d
MD5 | 23c8e0a9d55f9672c35114aab7f92ad0
BLAKE2b-256 | 0cca708045a9087dddabd567b5967ed39f6c1f5e9cf8b64c019c17bf881887db