A collection of utilities and tools for accelerating PySpark development and productivity.

Patek

A collection of reusable PySpark utility functions that help make development easier!

Installation

Patek is available on PyPI and can be installed with pip:

pip install patek

Usage


IO Helpers

Patek provides a set of IO helpers to quickly read and write data from/to various sources in PySpark.

Dynamic Delta Table Writer

The superDeltaWriter function writes a dataframe to a Delta table using Delta's merge capability, without requiring you to write out every merge condition and column update by hand. This is useful when the table has a large number of columns or a large number of update conditions.

from patek.io import superDeltaWriter

superDeltaWriter(sparkDataframe, ['key_column1'], 'delta/path', sparkSession, sparkContext, ['update_col1', 'update_col2'])

If update columns are not specified, the default is to update all non-key columns that exist in both the source and target tables. Also, if the target table does not exist, it will be created.
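
As a rough illustration of the default described above, the sketch below derives a merge condition and an update mapping from lists of column names. It is not patek's implementation: build_merge_clauses and the target/source aliases are hypothetical names chosen for this example, and the code works on plain Python lists so it runs without Spark.

```python
def build_merge_clauses(source_cols, target_cols, key_cols, update_cols=None):
    """Return (merge_condition, update_set) for a Delta-style merge.

    Hypothetical helper for illustration only; not part of patek's API.
    """
    missing = [k for k in key_cols if k not in source_cols or k not in target_cols]
    if missing:
        raise ValueError(f"key columns missing from source or target: {missing}")

    # Merge condition: every key column must match between target and source.
    condition = " AND ".join(f"target.`{k}` = source.`{k}`" for k in key_cols)

    if update_cols is None:
        # Default: all non-key columns that exist in both source and target.
        update_cols = [
            c for c in source_cols if c in target_cols and c not in key_cols
        ]

    # Mapping of target column name -> source expression to copy from.
    update_set = {c: f"source.`{c}`" for c in update_cols}
    return condition, update_set


condition, update_set = build_merge_clauses(
    source_cols=["id", "name", "amount", "extra"],
    target_cols=["id", "name", "amount"],
    key_cols=["id"],
)
print(condition)   # target.`id` = source.`id`
print(update_set)  # {'name': 'source.`name`', 'amount': 'source.`amount`'}
```

In delta-spark, values shaped like these correspond to the condition argument of DeltaTable.merge and the set argument of whenMatchedUpdate, with the target and source dataframes aliased accordingly.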

Funnel.io Schema to Spark Schema

The funnelSparkler function converts a Funnel.io schema to a Spark schema. This removes ambiguity when reading Funnel.io exports into Spark dataframes, because column types come from the schema file instead of being inferred or defined by hand.

from patek.io import funnelSparkler

dataframe = funnelSparkler('path/to/funnel_schema.json', 'path/to/funnel_export_data', sparkSession, sparkContext, data_file_type='csv')
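
To show the general idea of turning a schema description into an explicit Spark schema, here is a minimal sketch that builds a Spark DDL schema string. Everything about the input is an assumption: the JSON layout (a list of name/type entries), the type names in TYPE_MAP, and the function schema_to_ddl are made up for illustration and are not Funnel.io's actual schema format or patek's code.

```python
import json

# Hypothetical mapping; the left-hand type names are placeholders,
# not Funnel.io's documented type names.
TYPE_MAP = {
    "string": "STRING",
    "integer": "BIGINT",
    "float": "DOUBLE",
    "date": "DATE",
}


def schema_to_ddl(schema_json):
    """Convert a JSON list of {"name", "type"} entries to a Spark DDL string."""
    fields = json.loads(schema_json)
    parts = []
    for field in fields:
        # Unknown types fall back to STRING so no column is dropped.
        spark_type = TYPE_MAP.get(field["type"].lower(), "STRING")
        parts.append(f"`{field['name']}` {spark_type}")
    return ", ".join(parts)


# Toy schema in the assumed layout.
example_schema = json.dumps([
    {"name": "Date", "type": "date"},
    {"name": "Campaign", "type": "string"},
    {"name": "Clicks", "type": "integer"},
    {"name": "Cost", "type": "float"},
])

ddl = schema_to_ddl(example_schema)
print(ddl)  # `Date` DATE, `Campaign` STRING, `Clicks` BIGINT, `Cost` DOUBLE
```

A DDL string like this can be passed to a Spark reader, for example sparkSession.read.schema(ddl).csv(path, header=True), so that column types are fixed up front.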

Utility Functions

Patek provides a set of utility functions to help make development easier.

Determine Key Candidates

The determine_key_candidates function identifies columns, and combinations of columns, that could serve as a key for a given dataframe. This is useful when a dataframe has a large number of columns and you want to quickly narrow down which ones are good key candidates.

from patek.utils import determine_key_candidates

key_candidates = determine_key_candidates(sparkDataframe)
print(key_candidates)

# Output:
# a list containing single column key candidates: ['column1', 'column2', 'column3']
# a list containing composite key candidates: [['column1', 'column2'], ['column1', 'column3']]
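
One common way to test key candidacy is to check whether a column, or a combination of columns, has as many distinct values as there are rows. The sketch below applies that check to plain Python rows so it runs without Spark. It is an assumption about the general technique, not patek's implementation; find_key_candidates and max_width are hypothetical names. It also skips combinations that contain an already unique column, a simplification of its own that may differ from what patek reports.

```python
from itertools import combinations


def find_key_candidates(rows, columns, max_width=2):
    """Return (single_column_keys, composite_keys) for a list of dict rows.

    Hypothetical helper for illustration only; not part of patek's API.
    """
    total = len(rows)

    # A single column is a candidate if every row has a distinct value.
    single = [c for c in columns if len({row[c] for row in rows}) == total]

    composite = []
    for width in range(2, max_width + 1):
        for combo in combinations(columns, width):
            # Skip combinations that are trivially unique or non-minimal.
            if any(c in single for c in combo):
                continue
            if any(set(found) <= set(combo) for found in composite):
                continue
            distinct = {tuple(row[c] for c in combo) for row in rows}
            if len(distinct) == total:
                composite.append(list(combo))
    return single, composite


rows = [
    {"order_id": 1, "customer": "a", "day": "mon"},
    {"order_id": 2, "customer": "a", "day": "tue"},
    {"order_id": 3, "customer": "b", "day": "mon"},
    {"order_id": 4, "customer": "b", "day": "tue"},
]
single, composite = find_key_candidates(rows, ["order_id", "customer", "day"])
print(single)     # ['order_id']
print(composite)  # [['customer', 'day']]
```

On a Spark dataframe, the same uniqueness test can be expressed as comparing df.select(*cols).distinct().count() with df.count().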

Clean Column Names

The column_cleaner function cleans the column names of a dataframe. It removes special characters and replaces spaces with underscores; as the example below shows, hyphens are also converted to underscores.

from patek.utils import column_cleaner

# input dataframe columns: ['column?? 1', 'column: 2', 'column-3']

cleaned_dataframe = column_cleaner(sparkDataframe)

# output dataframe columns: ['column_1', 'column_2', 'column_3']
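
The renaming rule can be approximated with a few regular expressions. This is a minimal sketch that reproduces the example above, assuming whitespace and hyphens become underscores and every other non-alphanumeric character is dropped; clean_name is a hypothetical helper, and patek's actual rules may differ.

```python
import re


def clean_name(name):
    """Normalize one column name (illustrative; not patek's implementation)."""
    # Runs of whitespace or hyphens become a single underscore.
    name = re.sub(r"[\s\-]+", "_", name.strip())
    # Drop any remaining character that is not a letter, digit, or underscore.
    name = re.sub(r"[^0-9A-Za-z_]", "", name)
    # Collapse repeated underscores left behind by the previous steps.
    return re.sub(r"_+", "_", name)


columns = ["column?? 1", "column: 2", "column-3"]
cleaned = [clean_name(c) for c in columns]
print(cleaned)  # ['column_1', 'column_2', 'column_3']
```

With a Spark dataframe, names cleaned this way could be applied in one step with sparkDataframe.toDF(*cleaned).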

Source Distribution

patek-0.5.2.tar.gz (9.0 kB)