
SparkUI enhancements with PySpark


# pyspark-sugar

Set Python tracebacks on DataFrame actions and enrich the Spark UI with the actual business-logic stages of your Spark application.


## Installation

Via pip:
```
pip install pyspark-sugar
```

Directly from github:
```
pip install git+https://github.com/sashgorokhov/pyspark-sugar.git@master
```

## Usage


Consider this synthetic example:
```python
import random

import pyspark
import pyspark_sugar
from pyspark.sql import functions as F


# Set a verbose job description through a decorator
@pyspark_sugar.job_description_decor('Get nulls after type casts')
def get_incorrect_cast_cols(sdf, cols):
    """
    Return columns with a non-zero null count across their values.

    :param cols: Subset of columns to check
    """
    cols = set(cols) & set(sdf.columns)
    if not cols:
        return {}
    null_counts = sdf.agg(*[F.sum(F.isnull(col).cast('int')).alias(col) for col in cols]).collect()[0]
    return {col: null_counts[col] for col in cols if null_counts[col]}


def main():
    sc = pyspark.SparkContext.getOrCreate()  # type: pyspark.SparkContext
    ssc = pyspark.sql.SparkSession.builder.getOrCreate()

    # Define a job group. All actions inside the job group will be grouped.
    with pyspark_sugar.job_group('ETL', None):
        sdf = ssc.createDataFrame([
            {'CONTACT_KEY': n, 'PRODUCT_KEY': random.choice(['1', '2', 'what', None])}
            for n in range(1000)
        ]).repartition(2).cache()

        sdf = sdf.withColumn('PRODUCT_KEY', F.col('PRODUCT_KEY').cast('integer'))

        # The collect action inside this function will get a nice, readable description
        print(get_incorrect_cast_cols(sdf, ['PRODUCT_KEY']))

        sdf = sdf.dropna(subset=['PRODUCT_KEY'])

    # You can define several job groups one after another.
    with pyspark_sugar.job_group('Analytics', 'Check dataframe metrics'):
        sdf.cache()

        # Set a job description for actions made inside the context manager.
        with pyspark_sugar.job_description('Get transactions count'):
            print(sdf.count())

        sdf.show()

    sc.stop()


if __name__ == '__main__':
    # Patch almost all DataFrame actions to record the stack trace in the job details description
    with pyspark_sugar.patch_dataframe_actions():
        main()
```

This is how the SparkUI pages will look:

![SparkUI jobs page screenshot](docs/screenshots/screenshot_jobs.png)
![SparkUI sql page screenshot](docs/screenshots/screenshot_sql.png)

Now the SparkUI is full of human-readable descriptions, so even your manager can read it! Sweet.
