Productivity functions for common but painful pyspark tasks.
e2fyi-pyspark
e2fyi-pyspark is an e2fyi-namespaced Python package with a pyspark subpackage
(i.e. e2fyi.pyspark) that holds a collection of useful functions for common
but painful pyspark tasks.
API documentation can be found at https://e2fyi-pyspark.readthedocs.io/en/latest/.
Change logs are available in CHANGELOG.md.
- Python 3.6 and above
- Licensed under Apache-2.0.
Quickstart
pip install e2fyi-pyspark
Infer schema for unknown json strings inside a pyspark dataframe
e2fyi.pyspark.schema.infer_schema_from_rows is a utility function that infers
the schema of unknown json strings inside a pyspark dataframe, so that the
schema can subsequently be used to parse the json strings into a typed data
structure in the dataframe (see pyspark.sql.functions.from_json).
import pyspark
from e2fyi.pyspark.schema import infer_schema_from_rows
# get spark session
spark = pyspark.sql.SparkSession.builder.getOrCreate()
# load a parquet (assume the parquet has a column "json_str", which
# contains a json str with unknown schema)
df = spark.read.parquet("s3://some-bucket/some-file.parquet")
# sample roughly 1% of the rows (without replacement)
sample_rows = df.select("json_str").sample(False, 0.01).collect()
# infer the schema for json str in col "json_str" based on the sample rows
# NOTE: this is run locally (not in spark)
schema = infer_schema_from_rows(sample_rows, col="json_str")
# add a new column "data", which is the json string parsed with the inferred schema
df = df.withColumn("data", pyspark.sql.functions.from_json("json_str", schema))
# should have a column "data" with a proper schema
df.printSchema()
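The idea of inferring a schema from a sample of json strings can be illustrated with a small pure-Python sketch. This is not the implementation used by e2fyi-pyspark: the helper names (infer_type, merge_types, infer_schema), the type-name strings and the merge rules are all hypothetical and chosen only for illustration. Each sampled string is decoded, mapped to a simple type description, and the descriptions are merged so that fields missing from some rows still appear in the result.

```python
import json


def infer_type(value):
    """Describe a decoded json value (hypothetical type names, not Spark types)."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        element = "null"
        for item in value:
            element = merge_types(element, infer_type(item))
        return ("array", element)
    if isinstance(value, dict):
        return {key: infer_type(item) for key, item in value.items()}
    raise TypeError(f"unsupported json value: {value!r}")


def merge_types(a, b):
    """Merge two type descriptions into one that can hold both."""
    if a == b:
        return a
    if a == "null":
        return b
    if b == "null":
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        merged = dict(a)
        for key, item in b.items():
            merged[key] = merge_types(merged[key], item) if key in merged else item
        return merged
    if isinstance(a, tuple) and isinstance(b, tuple):
        return ("array", merge_types(a[1], b[1]))
    if isinstance(a, str) and isinstance(b, str) and {a, b} == {"long", "double"}:
        return "double"
    # assumption: fall back to string for any other conflict
    return "string"


def infer_schema(json_strings):
    """Fold the type descriptions of all sampled json strings into one."""
    schema = "null"
    for text in json_strings:
        schema = merge_types(schema, infer_type(json.loads(text)))
    return schema


samples = [
    '{"id": 1, "tags": ["a"]}',
    '{"id": 2.5, "user": {"name": "x"}}',
    '{"id": null, "tags": []}',
]
print(infer_schema(samples))
```

Because the schema is built only from the sampled rows, a field that never appears in the sample will be absent from the result; a larger sample fraction reduces that risk at the cost of collecting more rows to the driver.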
Source Distribution
File details
Details for the file e2fyi-pyspark-0.1.0a1.tar.gz.
File metadata
- Download URL: e2fyi-pyspark-0.1.0a1.tar.gz
- Upload date:
- Size: 8.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/42.0.2 requests-toolbelt/0.9.1 tqdm/4.41.0 CPython/3.7.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5803e03f99958c919d80296bb5924e0370dea959ac4fdfdb76d6a5d182197087 |
| MD5 | 65c80f24e5e9bfb246fbf1faf1e191e4 |
| BLAKE2b-256 | 9daf0377f641477d3cf1f259e30626e7ac05e7ddf3cef6c4490a8f82e83de205 |