
Patek

A collection of reusable PySpark utility functions that help make development easier!

Installation

Patek is available on PyPI and can be installed with pip:

pip install patek

Usage


IO Helpers

Patek provides a set of IO helpers for quickly reading and writing data across various sources in PySpark.

Dynamic Delta Table Writer

The superDeltaWriter function writes data to a Delta table using Delta's merge capability without requiring you to spell out every update assignment and merge condition by hand. This is useful when you have a large number of columns and/or a large number of update conditions.

from patek.io import superDeltaWriter

superDeltaWriter(sparkDataframe, ['key_column1'], 'delta/path', sparkSession, sparkContext, ['update_col1', 'update_col2'])
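
For comparison, here is roughly the manual Delta merge that superDeltaWriter automates. This sketch uses the delta-spark DeltaTable API directly and is illustrative only; patek's internals may differ.

from delta.tables import DeltaTable

# Build the merge by hand: one condition per key column and one
# assignment per update column.
target = DeltaTable.forPath(sparkSession, 'delta/path')
(
    target.alias('t')
    .merge(sparkDataframe.alias('s'), 't.key_column1 = s.key_column1')
    .whenMatchedUpdate(set={'update_col1': 's.update_col1',
                            'update_col2': 's.update_col2'})
    .whenNotMatchedInsertAll()
    .execute()
)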

If update columns are not specified, all non-key columns that exist in both the source and target tables are updated by default. If the target table does not exist, it is created.
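
For example, a call that relies on those defaults would presumably just omit the update-column argument (assuming the same sparkDataframe, sparkSession, and sparkContext as above):

superDeltaWriter(sparkDataframe, ['key_column1'], 'delta/path', sparkSession, sparkContext)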

Funnel.io Schema to Spark Schema

The funnelSparkler function converts a Funnel.io schema to a Spark schema. This is useful for removing ambiguity when reading Funnel.io export data into Spark DataFrames, without having to define the schema manually.

from patek.io import funnelSparkler

dataframe = funnelSparkler('path/to/funnel_schema.json', 'path/to/funnel_export_data', sparkSession, sparkContext, data_file_type='csv')
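
The core idea is a mapping from Funnel.io field types to Spark types. A minimal sketch of that idea, assuming a hypothetical schema file shaped like [{"name": ..., "type": ...}, ...] (the real Funnel.io export format, and patek's actual mapping, may differ):

import json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, DateType

# Hypothetical mapping from Funnel.io type names to Spark types.
_TYPE_MAP = {'string': StringType(), 'number': DoubleType(), 'date': DateType()}

def build_spark_schema(schema_path):
    with open(schema_path) as f:
        fields = json.load(f)
    return StructType([
        StructField(field['name'], _TYPE_MAP.get(field['type'], StringType()), True)
        for field in fields
    ])

dataframe = sparkSession.read.csv(
    'path/to/funnel_export_data',
    schema=build_spark_schema('path/to/funnel_schema.json'),
    header=True,
)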

Utility Functions

Patek provides a set of utility functions to help make development easier.

Determine Key Candidates

The determine_key_candidates function identifies candidate key columns for a given dataframe. This is useful when a dataframe has many columns and you want to quickly see which columns, alone or in combination, could serve as a key.

from patek.utils import determine_key_candidates

key_candidates = determine_key_candidates(sparkDataframe)
print(key_candidates)

# Output (illustrative):
# single-column key candidates: ['column1', 'column2', 'column3']
# composite key candidates: [['column1', 'column2'], ['column1', 'column3']]
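
Under the hood, single-column key detection usually boils down to comparing distinct counts to the row count. A rough sketch of that idea, for illustration only (not necessarily patek's implementation):

def single_column_key_candidates(df):
    # A column is a key candidate if every row holds a unique value in it.
    total_rows = df.count()
    return [
        col for col in df.columns
        if df.select(col).distinct().count() == total_rows
    ]

Note that this runs one Spark job per column, so it can be expensive on wide dataframes.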

Clean Column Names

The column_cleaner function cleans the column names of a dataframe: it removes special characters and replaces spaces with underscores.

from patek.utils import column_cleaner

# input dataframe columns: ['column?? 1', 'column: 2', 'column-3']

cleaned_dataframe = column_cleaner(sparkDataframe)

# output dataframe columns: ['column_1', 'column_2', 'column_3']
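
Based on the example above, the cleaning rule appears to turn spaces and hyphens into underscores and drop other special characters. An illustrative stand-in for that rule (patek's exact rules may differ):

import re

def clean_name(name):
    name = re.sub(r'[\s\-]+', '_', name.strip())   # spaces/hyphens -> underscores
    return re.sub(r'[^0-9a-zA-Z_]', '', name)      # drop remaining special characters

cleaned_dataframe = sparkDataframe.toDF(*[clean_name(c) for c in sparkDataframe.columns])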
