# pyspark-sugar

Spark UI enhancements with PySpark.
Set Python tracebacks on DataFrame actions and enrich the Spark UI with the actual business-logic stages of your Spark application.
## Installation
Via pip:
```
pip install pyspark-sugar
```
Directly from GitHub:
```
pip install git+https://github.com/sashgorokhov/pyspark-sugar.git@master
```
## Usage
Consider this synthetic example:
```python
import random

import pyspark
import pyspark_sugar
from pyspark.sql import functions as F


# Set a verbose job description through a decorator
@pyspark_sugar.job_description_decor('Get nulls after type casts')
def get_incorrect_cast_cols(sdf, cols):
    """
    Return columns with a non-zero null count across their values.

    :param cols: Subset of columns to check
    """
    cols = set(cols) & set(sdf.columns)
    if not cols:
        return {}
    null_counts = sdf.agg(*[F.sum(F.isnull(col).cast('int')).alias(col) for col in cols]).collect()[0]
    return {col: null_counts[col] for col in cols if null_counts[col]}


def main():
    sc = pyspark.SparkContext.getOrCreate()  # type: pyspark.SparkContext
    ssc = pyspark.sql.SparkSession.builder.getOrCreate()

    # Define a job group. All actions inside the job group will be grouped.
    with pyspark_sugar.job_group('ETL', None):
        sdf = ssc.createDataFrame([
            {'CONTACT_KEY': n, 'PRODUCT_KEY': random.choice(['1', '2', 'what', None])}
            for n in range(1000)
        ]).repartition(2).cache()

        sdf = sdf.withColumn('PRODUCT_KEY', F.col('PRODUCT_KEY').cast('integer'))

        # The collect action inside this function will get a readable description
        print(get_incorrect_cast_cols(sdf, ['PRODUCT_KEY']))

        sdf = sdf.dropna(subset=['PRODUCT_KEY'])

    # You can define several job groups one after another.
    with pyspark_sugar.job_group('Analytics', 'Check dataframe metrics'):
        sdf.cache()

        # Set a job description for actions made inside the context manager.
        with pyspark_sugar.job_description('Get transactions count'):
            print(sdf.count())

        sdf.show()

    sc.stop()


if __name__ == '__main__':
    # Patch (almost) all dataframe actions to record the call stack in the job details
    with pyspark_sugar.patch_dataframe_actions():
        main()
```
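How does this work? PySpark itself already exposes `SparkContext.setJobGroup` and `SparkContext.setJobDescription`, and the context managers above are presumably thin wrappers around them. Below is a minimal sketch of that idea; `setJobGroup` and `setJobDescription` are real PySpark methods, while `job_group_sketch` and `job_description_sketch` are hypothetical names, not part of pyspark-sugar's API:

```python
import contextlib

import pyspark


@contextlib.contextmanager
def job_group_sketch(group_id, description):
    """Hypothetical stand-in for pyspark_sugar.job_group (assumed internals)."""
    sc = pyspark.SparkContext.getOrCreate()
    # setJobGroup is stock PySpark: every action triggered while the group is
    # set shows up under this id/description on the Spark UI jobs page.
    sc.setJobGroup(group_id, description)
    try:
        yield
    finally:
        # The real library may restore the previous group; clearing it is the
        # simplest approximation.
        sc.setJobGroup("", "")


@contextlib.contextmanager
def job_description_sketch(description):
    """Hypothetical stand-in for pyspark_sugar.job_description (assumed internals)."""
    sc = pyspark.SparkContext.getOrCreate()
    sc.setJobDescription(description)  # stock PySpark API
    try:
        yield
    finally:
        sc.setJobDescription(None)  # clears the description for later actions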
This is how the Spark UI pages will look:
![SparkUI jobs page screenshot](docs/screenshots/screenshot_jobs.png)
![SparkUI sql page screenshot](docs/screenshots/screenshot_sql.png)
Now the Spark UI is full of human-readable descriptions, so even your manager can read it! Sweet.
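As for `patch_dataframe_actions`, it presumably monkey-patches DataFrame actions (`count`, `collect`, and friends) so the Python call stack lands in the job details. Here is a minimal sketch of that idea using only public PySpark and stdlib APIs; the wrapper below is illustrative, not the library's actual implementation:

```python
import functools
import traceback

import pyspark
from pyspark.sql import DataFrame


def _with_traceback_description(action):
    """Wrap a DataFrame action so its call site is shown in the Spark UI."""
    @functools.wraps(action)
    def wrapper(self, *args, **kwargs):
        sc = pyspark.SparkContext.getOrCreate()
        # Record where in user code the action was triggered.
        sc.setJobDescription(''.join(traceback.format_stack(limit=5)))
        try:
            return action(self, *args, **kwargs)
        finally:
            sc.setJobDescription(None)
    return wrapper


# Illustrative subset; the real patcher covers (almost) all actions and
# undoes the patching when the context manager exits.
DataFrame.count = _with_traceback_description(DataFrame.count)
DataFrame.collect = _with_traceback_description(DataFrame.collect)
```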