A library containing various utility functions for playing with PySpark DataFrames

Spark-frame

What is it?

Spark-frame is a library that super-charges your Spark DataFrames!

It brings several utility methods and transformation functions for PySpark DataFrames. These methods were initially part of the karadoc project used at Younited, but since they were fully independent of karadoc, it made more sense to publish them as a standalone library.

Several of these methods were my initial inspiration to make the cousin project bigquery-frame, which was first made to illustrate this blog article. This is why you will find similar methods in both spark_frame and bigquery_frame, except the former runs on PySpark while the latter runs on BigQuery (obviously). I try to keep both projects consistent, and eventually port new developments made on one project to the other.

Getting Started

Visit the official Spark-frame website documentation for use-case examples and the API reference.

Installation

spark-frame is available on PyPI.

pip install spark-frame

Compatibilities and requirements

This library does not depend on any other library; PySpark must be installed separately to use it. It is compatible with the following versions:

  • Python: requires 3.8.1 or higher (tested against Python 3.9, 3.10 and 3.11)
  • pyspark: requires 3.3.0 or higher

This library is tested on Windows, macOS and Linux.

Some features require extra libraries to be installed alongside this project. We chose not to include them as direct dependencies for security and flexibility reasons; this way, users who do not need these features don't have to worry about these dependencies.

Feature                                 Method                      Required module
Generating HTML reports for data diff   DiffResult.export_to_html   jinja2
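
For example, to enable HTML diff report generation, install jinja2 alongside this library:

pip install spark-frame jinja2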

Release notes

v0.3.1

Fixes and improvements for data_diff

  • The export_html_diff_report method now accepts arguments to specify the path and encoding of the output HTML report.
  • The data-diff join now works correctly with null values.
  • Visual improvements to the HTML diff report.
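
A minimal sketch of how these new arguments might look, using DiffResult.export_to_html from the v0.2.0 example below (the keyword argument names here are assumptions; check the reference for the exact signature):

# Hypothetical sketch: the keyword argument names are assumptions, see the reference.
diff_result.export_to_html(output_file_path="reports/diff_report.html", encoding="utf-8")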

v0.3.0

Fixes and improvements for data_diff

  • Fixed incorrect diff results.
  • Column values are no longer truncated at all, as truncation was causing incorrect results. The possibility to limit the size of column values will be added back in a later version.
  • The most frequent values per column are now displayed in decreasing order of frequency.

v0.2.0

Two new exciting features: analyze and data_diff. They are still at an experimental stage and will be improved in future releases.

  • Added a new transformation spark_frame.transformations.analyze (a short sketch follows the example below).
  • Added new data_diff feature. Example:
from pyspark.sql import DataFrame
from spark_frame.data_diff import DataframeComparator
df1: DataFrame = ...
df2: DataFrame = ...
diff_result = DataframeComparator().compare_df(df1, df2) # Produces a DiffResult object
diff_result.display() # Prints a diff report in the terminal
diff_result.export_to_html() # Generates an HTML diff report file named diff_report.html
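
For the analyze transformation mentioned above, here is a minimal sketch (it assumes analyze takes a DataFrame and returns a regular DataFrame of per-column statistics; check the reference for details):

from spark_frame.transformations import analyze
# Minimal sketch: `df` is assumed to be an existing DataFrame.
# analyze is assumed here to return a DataFrame of per-column statistics.
analyze(df).show()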

v0.1.1

  • Added a new transformation spark_frame.transformations.flatten_all_arrays.
  • Added support for multi-arg transformations in nested.select and nested.with_fields. With this feature, we can now access parent fields from higher levels when applying a transformation. Example:
>>> nested.print_schema(df)
"""
root
 |-- id: integer (nullable = false)
 |-- s1!.average: integer (nullable = false)
 |-- s1!.values!: integer (nullable = false)
"""
>>> df.show(truncate=False)
+---+--------------------------------------+
|id |s1                                    |
+---+--------------------------------------+
|1  |[{2, [1, 2, 3]}, {3, [1, 2, 3, 4, 5]}]|
+---+--------------------------------------+
>>> new_df = df.transform(nested.with_fields, {
...     "s1!.values!": lambda s1, value: value - s1["average"]  # This transformation takes 2 arguments
... })
>>> new_df.show(truncate=False)
+---+-----------------------------------------+
|id |s1                                       |
+---+-----------------------------------------+
|1  |[{2, [-1, 0, 1]}, {3, [-2, -1, 0, 1, 2]}]|
+---+-----------------------------------------+

v0.1.0

  • Added a new amazing module called spark_frame.nested, which makes manipulation of nested data structures much easier! Make sure to check out the reference and the use-cases (a short sketch is also given at the end of this section).

  • Also added a new module called spark_frame.nested_functions, which contains aggregation methods for nested data structures (see the reference).

  • New transformations:

    • spark_frame.transformations.transform_all_field_names
    • spark_frame.transformations.transform_all_fields
    • spark_frame.transformations.unnest_field
    • spark_frame.transformations.unnest_all_fields
    • spark_frame.transformations.union_dataframes
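
As a quick illustration of the nested module mentioned above, here is a minimal sketch using nested.select with the same DataFrame as in the v0.1.1 example (the None-to-keep-a-field-as-is convention is an assumption here; check the reference for the exact semantics):

from spark_frame import nested
# Minimal sketch: `df` is assumed to have the nested schema from the v0.1.1 example.
new_df = df.transform(nested.select, {
    "id": None,                       # keep this field as-is (assumed convention)
    "s1!.average": None,              # keep this nested field as-is (assumed convention)
    "s1!.values!": lambda s: s * 2,   # transform each element of the nested array
})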

v0.0.3

  • New transformation: spark_frame.transformations.convert_all_maps_to_arrays.
  • New transformation: spark_frame.transformations.sort_all_arrays.
  • New transformation: spark_frame.transformations.harmonize_dataframes.
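
As a quick illustration, here is a minimal sketch applying sort_all_arrays (it assumes the function takes a DataFrame and returns a new one with every array column sorted; check the reference for details):

from spark_frame.transformations import sort_all_arrays
# Minimal sketch: `df` is assumed to be an existing DataFrame containing array columns.
sorted_df = df.transform(sort_all_arrays)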
