
extra utilities for pyspark.sql

Project description

pyspark extra utilities

SparkMetrics

Track metrics (such as the number of rows and the number of files written) when writing a DataFrame.

import pyspark.sql
from pysparkextra.metrics import SparkMetrics

spark_session: pyspark.sql.SparkSession = pyspark.sql.SparkSession.builder.getOrCreate()
df: pyspark.sql.DataFrame = spark_session.createDataFrame(
    [
        [1, 2],
        [-3, 4],
    ],
    schema=("foo", "bar")
)

with SparkMetrics(spark_session) as metrics:
    df.write.parquet("/tmp/target", mode='overwrite')
print(metrics['numOutputRows'])  # 2

with SparkMetrics(spark_session) as metrics:
    df.union(df).write.parquet("/tmp/target", mode='overwrite')
print(metrics['numOutputRows'])  # 4

print(metrics)  # {'numFiles': 5, 'numOutputBytes': 3175, 'numOutputRows': 4, 'numParts': 0}
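A natural follow-up, continuing the example above, is a post-write sanity check. This is only a sketch, assuming nothing beyond the dict-style access to SparkMetrics shown in this README:

with SparkMetrics(spark_session) as metrics:
    df.write.parquet("/tmp/target", mode='overwrite')

# Hypothetical sanity check: fail fast if fewer rows were written than expected.
expected = df.count()
if metrics['numOutputRows'] != expected:
    raise RuntimeError(f"wrote {metrics['numOutputRows']} rows, expected {expected}")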

Union an arbitrary number of DataFrames with an arbitrary number of columns

from pyspark.sql import DataFrame, SparkSession
from pysparkextra.funcs import union

spark_session: SparkSession = SparkSession.builder.getOrCreate()
df1: DataFrame = spark_session.createDataFrame(
    [
        [1, 2],
        [3, 4],
    ], schema=("foo", "bar"))
df2: DataFrame = spark_session.createDataFrame(
    [
        [10, 20, 30],
        [40, 50, 60],
    ], schema=("bar", "qux", "foo")
)
df3: DataFrame = spark_session.createDataFrame(
    [
        [100, 200],
        [300, 400],
    ], schema=("foo", "bar")
)

df: DataFrame = union(df1, df2, df3)

df.show()

# +---+---+----+
# |foo|bar| qux|
# +---+---+----+
# |  1|  2|null|
# |  3|  4|null|
# | 30| 10|  20|
# | 60| 40|  50|
# |100|200|null|
# |300|400|null|
# +---+---+----+
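For reference, a similar column alignment can be sketched in plain PySpark (3.1 or later) with unionByName and allowMissingColumns=True. This is a generic sketch of the technique, not necessarily how pysparkextra's union is implemented:

from functools import reduce
from pyspark.sql import DataFrame

def union_all(*dfs: DataFrame) -> DataFrame:
    # Align columns by name; columns missing from a frame are filled with nulls.
    # allowMissingColumns requires Spark 3.1+.
    return reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True), dfs)

Applied to df1, df2, and df3 above, this produces the same rows as the output shown.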

And more

Check out the tests, which also act as examples.

Download files

Download the file for your platform.
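In practice the package is usually installed from PyPI rather than downloaded by hand; a pinned install matching the release on this page would be:

pip install pysparkextra==0.4.2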

Source Distribution

pysparkextra-0.4.2.tar.gz (8.5 kB)

Uploaded Source

Built Distribution

pysparkextra-0.4.2-py3-none-any.whl (10.4 kB)

Uploaded Python 3

File details

Details for the file pysparkextra-0.4.2.tar.gz.

File metadata

  • Download URL: pysparkextra-0.4.2.tar.gz
  • Upload date:
  • Size: 8.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.6.1 requests/2.25.1 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.50.2 CPython/3.8.5

File hashes

Hashes for pysparkextra-0.4.2.tar.gz

  • SHA256: 07166c9c9f5ede82c89e32c7b6fcf108ba71a494d8d6ffe828beb26466a0383d
  • MD5: c86bf9a8b6a30bea0cc0720b58547200
  • BLAKE2b-256: 34f3079f7feedefbd6945aa27fc47b0ab493f8a59e3aedd3ead6dbf0a587fb7b

See the PyPI documentation for more details on using hashes.
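As a minimal sketch of how such a digest can be checked, the snippet below recomputes the SHA256 of the downloaded sdist with Python's standard hashlib; the local file path is an assumption:

import hashlib

# Expected digest, copied from the table above.
EXPECTED_SHA256 = "07166c9c9f5ede82c89e32c7b6fcf108ba71a494d8d6ffe828beb26466a0383d"

# Path to the downloaded archive (assumed to be in the current directory).
path = "pysparkextra-0.4.2.tar.gz"

digest = hashlib.sha256()
with open(path, "rb") as f:
    for chunk in iter(lambda: f.read(8192), b""):
        digest.update(chunk)

assert digest.hexdigest() == EXPECTED_SHA256, "hash mismatch"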

File details

Details for the file pysparkextra-0.4.2-py3-none-any.whl.

File metadata

  • Download URL: pysparkextra-0.4.2-py3-none-any.whl
  • Upload date:
  • Size: 10.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.6.1 requests/2.25.1 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.50.2 CPython/3.8.5

File hashes

Hashes for pysparkextra-0.4.2-py3-none-any.whl

  • SHA256: ef08746bd3d0aea0849ba2dcba540cd887b9896019446477f659f98574b1612a
  • MD5: 741847656702283a32c61334cca0eacb
  • BLAKE2b-256: 59528daad79b6598d34bcfd170b8e00e23fa899b236c31a7a6f868b74e9e1ca8

See the PyPI documentation for more details on using hashes.
