Common PySpark utils

Project description

AI Helpers - PySpark utils

pyspark-utils is a Python module that provides a collection of utilities to simplify and enhance the use of PySpark. These utilities are designed to make working with PySpark more efficient and to reduce boilerplate code.

Installation

You can install the pyspark-utils module via pip:

pip install ai-helpers-pyspark-utils

Getting Started

First, import the module in your Python script:

import pyspark_utils as psu

Now you can use the utilities provided by pyspark-utils.

Utilities & Examples

  • get_spark_session: Retrieve (or create) an appropriate SparkSession.

    Create a Spark dataframe:

    >>> import pyspark_utils as psu
    
    >>> spark = psu.get_spark_session("example")
    >>> sdf = spark.createDataFrame(
    ...     [
    ...         [None, "a", 1, 1.0],
    ...         ["b", "b", 1, 2.0],
    ...         ["b", "b", None, 3.0],
    ...         ["c", "c", None, 2.0],
    ...         ["c", "c", 3, 4.0],
    ...         ["d", None, 4, 2.0],
    ...         ["d", None, 5, 6.0],
    ...     ],
    ...     ["col0", "col1", "col2", "col3"],
    ... )
    >>> sdf.show()
    +----+----+----+----+
    |col0|col1|col2|col3|
    +----+----+----+----+
    |NULL|   a|   1| 1.0|
    |   b|   b|   1| 2.0|
    |   b|   b|NULL| 3.0|
    |   c|   c|NULL| 2.0|
    |   c|   c|   3| 4.0|
    |   d|NULL|   4| 2.0|
    |   d|NULL|   5| 6.0|
    +----+----+----+----+ 
    
  • with_columns: Apply multiple 'withColumn' calls to a dataframe in a single command.

    >>> import pyspark_utils as psu
    >>> import pyspark.sql.functions as F
    
    >>> col4 = F.col("col3") + 2
    >>> col5 = F.lit(True)
    
    >>> transformed_sdf = psu.with_columns(
    ...     sdf,
    ...     col_func_mapping={"col4": col4, "col5": col5},
    ... )
    >>> transformed_sdf.show()
    +----+----+----+----+----+----+
    |col0|col1|col2|col3|col4|col5|
    +----+----+----+----+----+----+
    |NULL|   a|   1| 1.0| 3.0|true|
    |   b|   b|   1| 2.0| 4.0|true|
    |   b|   b|NULL| 3.0| 5.0|true|
    |   c|   c|NULL| 2.0| 4.0|true|
    |   c|   c|   3| 4.0| 6.0|true|
    |   d|NULL|   4| 2.0| 4.0|true|
    |   d|NULL|   5| 6.0| 8.0|true|
    +----+----+----+----+----+----+
    
  • keep_first_rows: Keep the first row of each group defined by partition_cols and order_cols.

    >>> transformed_sdf = psu.keep_first_rows(sdf, [F.col("col0")], [F.col("col3")])
    >>> transformed_sdf.show()
    +----+----+----+----+
    |col0|col1|col2|col3|
    +----+----+----+----+
    |NULL|   a|   1| 1.0|
    |   b|   b|   1| 2.0|
    |   c|   c|NULL| 2.0|
    |   d|NULL|   4| 2.0|
    +----+----+----+----+
    
  • assert_cols_in_df: Asserts that all specified columns are present in the specified dataframe.
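
    A minimal usage sketch (the signature shown is an assumption inferred from the description, not confirmed by the package's documentation):

    >>> # Assumed signature: assert_cols_in_df(sdf, col_names)
    >>> psu.assert_cols_in_df(sdf, ["col0", "col1"])     # passes silently
    >>> psu.assert_cols_in_df(sdf, ["col0", "missing"])  # raises AssertionError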

  • assert_df_close: Asserts that two dataframes are (almost) equal, even if the order of the columns is different.
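
    A minimal usage sketch (signature assumed from the description; any numeric tolerance parameter is not documented here):

    >>> # Same data, different column order: assumed to compare equal
    >>> sdf1 = spark.createDataFrame([[1, 2.0]], ["a", "b"])
    >>> sdf2 = sdf1.select("b", "a")
    >>> psu.assert_df_close(sdf1, sdf2)  # passes silently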

Contributing

We welcome contributions to pyspark-utils. To contribute, please follow these steps:

  1. Fork the repository.
  2. Create a new branch (git checkout -b feature-branch).
  3. Make your changes.
  4. Commit your changes (git commit -am 'Add some feature').
  5. Push to the branch (git push origin feature-branch).
  6. Create a new Pull Request.

Please ensure your code follows the project's coding standards and includes appropriate tests.

Download files

Download the file for your platform.

Source Distribution

ai_helpers_pyspark_utils-0.1.0a3.tar.gz (3.9 kB)

Built Distribution

ai_helpers_pyspark_utils-0.1.0a3-py3-none-any.whl

File details

Details for the file ai_helpers_pyspark_utils-0.1.0a3.tar.gz.

File metadata

  • Download URL: ai_helpers_pyspark_utils-0.1.0a3.tar.gz
  • Upload date:
  • Size: 3.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.1 CPython/3.9.18 Linux/6.5.0-1021-azure

File hashes

Hashes for ai_helpers_pyspark_utils-0.1.0a3.tar.gz
Algorithm    Hash digest
SHA256       34a76f6cebd2702c59a378279abb953ac7b786e481ccf6206e0a0f247366f868
MD5          72752625932b44928ff841d2105671a0
BLAKE2b-256  dc4b38552be66cbc792cf558df7828ef387d5b04d796e81076d3592679734f5d
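
To verify a downloaded archive against these digests, a minimal check with Python's standard hashlib module (the filename is assumed to match the source distribution listed above):

>>> import hashlib
>>> with open("ai_helpers_pyspark_utils-0.1.0a3.tar.gz", "rb") as f:
...     digest = hashlib.sha256(f.read()).hexdigest()
>>> digest == "34a76f6cebd2702c59a378279abb953ac7b786e481ccf6206e0a0f247366f868"
True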

File details

Details for the file ai_helpers_pyspark_utils-0.1.0a3-py3-none-any.whl.

File hashes

Hashes for ai_helpers_pyspark_utils-0.1.0a3-py3-none-any.whl
Algorithm    Hash digest
SHA256       ed17b9e20d7aa80ca655268f5997278a63df4b1d2c7abfcbcd4064f6dfd5210e
MD5          cc70c8daf32e5f67a5555e0ae5c77ae4
BLAKE2b-256  59588406418ff37a0c8b2ae6fb5b00cb883aef764b9206ea4a115a532160e32f
