A library that provides useful extensions to Apache Spark.

These details have not been verified by PyPI

Project links

Homepage

Project description

Spark Extension

This project provides extensions to the Apache Spark project in Scala and Python:

Diff: A diff transformation and application for Datasets that computes the differences between two datasets, i.e. which rows to add, delete or change to get from one dataset to the other.

Global Row Number: A withRowNumbers transformation that provides the global row number w.r.t. the current order of the Dataset, or any given order. In contrast to the existing SQL function row_number, which requires a window spec, this transformation provides the row number across the entire Dataset without scaling problems.

Inspect Parquet files: The structure of Parquet files (the metadata, not the data stored in Parquet) can be inspected similar to parquet-tools or parquet-cli by reading from a simple Spark data source. This simplifies identifying why some Parquet files cannot be split by Spark into scalable partitions.

.Net DateTime.Ticks: Convert .Net (C#, F#, Visual Basic) DateTime.Ticks into Spark timestamps, seconds and nanoseconds.

Available methods:

// Scala
dotNetTicksToTimestamp(Column): Column       // returns timestamp as TimestampType
dotNetTicksToUnixEpoch(Column): Column       // returns Unix epoch seconds as DecimalType
dotNetTicksToUnixEpochNanos(Column): Column  // returns Unix epoch nanoseconds as LongType

The reverse is provided by (all return LongType .Net ticks):

// Scala
timestampToDotNetTicks(Column): Column
unixEpochToDotNetTicks(Column): Column
unixEpochNanosToDotNetTicks(Column): Column

These methods are also available in Python:

# Python
dotnet_ticks_to_timestamp(column_or_name)         # returns timestamp as TimestampType
dotnet_ticks_to_unix_epoch(column_or_name)        # returns Unix epoch seconds as DecimalType
dotnet_ticks_to_unix_epoch_nanos(column_or_name)  # returns Unix epoch nanoseconds as LongType

timestamp_to_dotnet_ticks(column_or_name)
unix_epoch_to_dotnet_ticks(column_or_name)
unix_epoch_nanos_to_dotnet_ticks(column_or_name)

Spark job description: Set Spark job description for all Spark jobs within a context:

from gresearch.spark import job_description, append_job_description

with job_description("parquet file"):
    df = spark.read.parquet("data.parquet")
    with append_job_description("count"):
        count = df.count
    with append_job_description("write"):
        df.write.csv("data.csv")

For details, see the README.md at the project homepage.

Using Spark Extension

PyPi package (local Spark cluster only)

You may want to install the pyspark-extension python package from PyPi into your development environment. This provides you code completion, typing and test capabilities during your development phase.

Running your Python application on a Spark cluster will still require one of the ways below to add the Scala package to the Spark environment.

pip install pyspark-extension==2.11.0.3.4

Note: Pick the right Spark version (here 3.4) depending on your PySpark version.

PySpark API

Start a PySpark session with the Spark Extension dependency (version ≥1.1.0) as follows:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .config("spark.jars.packages", "uk.co.gresearch.spark:spark-extension_2.12:2.11.0-3.4") \
    .getOrCreate()

Note: Pick the right Scala version (here 2.12) and Spark version (here 3.4) depending on your PySpark version.

PySpark REPL

Launch the Python Spark REPL with the Spark Extension dependency (version ≥1.1.0) as follows:

pyspark --packages uk.co.gresearch.spark:spark-extension_2.12:2.11.0-3.4

Note: Pick the right Scala version (here 2.12) and Spark version (here 3.4) depending on your PySpark version.

PySpark `spark-submit`

Run your Python scripts that use PySpark via spark-submit:

spark-submit --packages uk.co.gresearch.spark:spark-extension_2.12:2.11.0-3.4 [script.py]

Note: Pick the right Scala version (here 2.12) and Spark version (here 3.4) depending on your Spark version.

Your favorite Data Science notebook

There are plenty of Data Science notebooks around. To use this library, add a jar dependency to your notebook using these Maven coordinates:

uk.co.gresearch.spark:spark-extension_2.12:2.11.0-3.4

Or download the jar and place it on a filesystem where it is accessible by the notebook, and reference that jar file directly.

Check the documentation of your favorite notebook to learn how to add jars to your Spark environment.

Project details

These details have not been verified by PyPI

Project links

Homepage

Release history Release notifications | RSS feed

2.15.0.4.1

Mar 18, 2026

2.15.0.4.0

Dec 13, 2025

2.15.0.3.5

Dec 13, 2025

2.15.0.3.4

Dec 13, 2025

2.15.0.3.3

Dec 13, 2025

2.15.0.3.2

Dec 13, 2025

2.14.2.4.0

Jul 22, 2025

2.14.2.3.5.post0

Jul 22, 2025

2.14.2.3.5

Jul 22, 2025

2.14.2.3.4

Jul 22, 2025

2.14.2.3.3

Jul 22, 2025

2.14.2.3.2

Jul 22, 2025

2.14.2.3.1

Jul 22, 2025

2.14.2.3.0

Jul 22, 2025

2.13.0.3.5

Nov 4, 2024

2.13.0.3.4

Nov 4, 2024

2.13.0.3.3

Nov 4, 2024

2.13.0.3.2

Nov 4, 2024

2.13.0.3.1

Nov 4, 2024

2.13.0.3.0

Nov 4, 2024

2.12.0.3.5

Apr 26, 2024

2.12.0.3.4

Apr 26, 2024

2.12.0.3.3

Apr 26, 2024

2.12.0.3.2

Apr 26, 2024

2.12.0.3.1

Apr 26, 2024

2.12.0.3.0

Apr 26, 2024

2.11.0.3.5

Jan 4, 2024

2.11.0.3.4

Jan 4, 2024

This version

2.11.0.3.3

Jan 4, 2024

2.11.0.3.2

Jan 4, 2024

2.11.0.3.1

Jan 4, 2024

2.11.0.3.0

Jan 4, 2024

2.10.0.3.5

Sep 28, 2023

2.10.0.3.4

Sep 28, 2023

2.10.0.3.3

Sep 28, 2023

2.10.0.3.2

Sep 28, 2023

2.10.0.3.1

Sep 28, 2023

2.10.0.3.0

Sep 28, 2023

2.9.0.3.4

Aug 23, 2023

2.9.0.3.3

Aug 23, 2023

2.9.0.3.2

Aug 23, 2023

2.9.0.3.1

Aug 23, 2023

2.9.0.3.0

Aug 23, 2023

2.8.0.3.4

May 24, 2023

2.8.0.3.3

May 24, 2023

2.8.0.3.2

May 24, 2023

2.8.0.3.1

May 24, 2023

2.8.0.3.0

May 24, 2023

2.7.0.3.4.post0

May 5, 2023

2.7.0.3.4

May 5, 2023

2.7.0.3.3.post0

May 5, 2023

2.7.0.3.3

May 5, 2023

2.7.0.3.2.post0

May 5, 2023

2.7.0.3.1.post0

May 5, 2023

2.7.0.3.1

May 5, 2023

2.7.0.3.0.post0

May 5, 2023

2.7.0.3.0

May 5, 2023

2.6.0.3.3

Apr 11, 2023

2.6.0.3.2

Apr 11, 2023

2.6.0.3.1

Apr 11, 2023

2.6.0.3.0

Apr 11, 2023

2.5.0.3.3

Mar 23, 2023

2.5.0.3.2

Mar 23, 2023

2.5.0.3.1

Mar 23, 2023

2.5.0.3.0

Mar 23, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pyspark-extension-2.11.0.3.3.tar.gz (329.3 kB view details)

Uploaded Jan 4, 2024 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

pyspark_extension-2.11.0.3.3-py3-none-any.whl (320.5 kB view details)

Uploaded Jan 4, 2024 Python 3

File details

Details for the file pyspark-extension-2.11.0.3.3.tar.gz.

File metadata

Download URL: pyspark-extension-2.11.0.3.3.tar.gz
Upload date: Jan 4, 2024
Size: 329.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.10.12

File hashes

Hashes for pyspark-extension-2.11.0.3.3.tar.gz
Algorithm	Hash digest
SHA256	`2efa72c5a058111cbb3ff242c122eb64a57d1c9b564b4a24ac379ee03c4ca9df`
MD5	`f0afcf1489668dd56c06b134b8c9e72f`
BLAKE2b-256	`bf3410894d36b1c74b232566f2e63660044e39d31e9da9c032573f13b1b2bf45`

See more details on using hashes here.

File details

Details for the file pyspark_extension-2.11.0.3.3-py3-none-any.whl.

File metadata

Download URL: pyspark_extension-2.11.0.3.3-py3-none-any.whl
Upload date: Jan 4, 2024
Size: 320.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.10.12

File hashes

Hashes for pyspark_extension-2.11.0.3.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a0f1010791fbed8637ec7a4a15569e3e3ae1875226f6cf4c45fde99f4b35e3ea`
MD5	`ecbf7d4524f1bba42395eab4e34afe3c`
BLAKE2b-256	`adb63fff67bc3c3f238cef7c0b5214502dd11e37be7f324e77d30ade0905a70f`

See more details on using hashes here.

pyspark-extension 2.11.0.3.3

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

Spark Extension

Using Spark Extension

PyPi package (local Spark cluster only)

PySpark API

PySpark REPL

PySpark `spark-submit`

Your favorite Data Science notebook

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

pyspark-extension 2.11.0.3.3

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

Spark Extension

Using Spark Extension

PyPi package (local Spark cluster only)

PySpark API

PySpark REPL

PySpark spark-submit

Your favorite Data Science notebook

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

PySpark `spark-submit`