extra utilities for pyspark.sql
pyspark extra utilities
SparkMetrics
Track write metrics, such as the number of rows and the number of files written, when writing a DataFrame.
import pyspark.sql
from pysparkextra.metrics import SparkMetrics

# An existing session, e.g. pyspark.sql.SparkSession.builder.getOrCreate()
spark_session: pyspark.sql.SparkSession

df: pyspark.sql.DataFrame = spark_session.createDataFrame(
    [
        [1, 2],
        [-3, 4],
    ],
    schema=("foo", "bar")
)

# Metrics are collected for writes performed inside the `with` block
with SparkMetrics(spark_session) as metrics:
    df.write.parquet("/tmp/target", mode='overwrite')
print(metrics['numOutputRows'])  # 2

with SparkMetrics(spark_session) as metrics:
    df.union(df).write.parquet("/tmp/target", mode='overwrite')
print(metrics['numOutputRows'])  # 4
print(metrics)  # {'numFiles': 5, 'numOutputBytes': 3175, 'numOutputRows': 4, 'numParts': 0}
union an arbitrary number of DataFrames with an arbitrary number of columns
from pyspark.sql import DataFrame, SparkSession
from pysparkextra.funcs import union

# An existing session, e.g. SparkSession.builder.getOrCreate()
spark_session: SparkSession

df1: DataFrame = spark_session.createDataFrame(
    [
        [1, 2],
        [3, 4],
    ], schema=("foo", "bar")
)
# Different column order plus an extra column ("qux")
df2: DataFrame = spark_session.createDataFrame(
    [
        [10, 20, 30],
        [40, 50, 60],
    ], schema=("bar", "qux", "foo")
)
df3: DataFrame = spark_session.createDataFrame(
    [
        [100, 200],
        [300, 400],
    ], schema=("foo", "bar")
)

# Columns are matched by name; missing values are filled with null
df: DataFrame = union(df1, df2, df3)
df.show()
# +---+---+----+
# |foo|bar| qux|
# +---+---+----+
# |  1|  2|null|
# |  3|  4|null|
# | 30| 10|  20|
# | 60| 40|  50|
# |100|200|null|
# |300|400|null|
# +---+---+----+
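The output above suggests that columns are matched by name, ordered by first appearance, and padded with null where a DataFrame lacks a column. The following is a minimal pure-Python sketch of that behaviour on plain lists, not the pysparkextra implementation; `Table` and `union_by_name` are hypothetical names introduced only for this illustration.

```python
from typing import Any, List, Optional, Sequence, Tuple

# Hypothetical stand-in for a DataFrame: (column names, rows)
Table = Tuple[Sequence[str], List[Sequence[Any]]]


def union_by_name(*tables: Table) -> Tuple[List[str], List[List[Optional[Any]]]]:
    """Stack rows from all tables, aligning values by column name."""
    # Collect column names in order of first appearance across all tables.
    columns: List[str] = []
    for names, _ in tables:
        for name in names:
            if name not in columns:
                columns.append(name)

    # Re-emit every row in the unified column order; absent columns become None.
    rows: List[List[Optional[Any]]] = []
    for names, data in tables:
        for row in data:
            by_name = dict(zip(names, row))
            rows.append([by_name.get(col) for col in columns])
    return columns, rows


t1: Table = (("foo", "bar"), [[1, 2], [3, 4]])
t2: Table = (("bar", "qux", "foo"), [[10, 20, 30], [40, 50, 60]])
t3: Table = (("foo", "bar"), [[100, 200], [300, 400]])

columns, rows = union_by_name(t1, t2, t3)
print(columns)  # ['foo', 'bar', 'qux']
for row in rows:
    print(row)
```

The rows printed here line up with the `df.show()` output above, with `None` standing in for null.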
and more
Check out the tests, which also act as examples.