This project performs PySpark operations on DataFrames. It currently flattens shallow or deeply nested JSON data.

Project description

Pyspark ETL

This project aims to solve common problems that data engineers face.

One such problem is handling deeply nested JSON data and rendering it in a clean tabular format.

This kind of semistructured data may contain a combination of different data types that need to be handled differently to flatten the data properly.
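
To make the "handled differently" point concrete, here is a minimal plain-Python sketch of the idea, working on dictionaries rather than Spark DataFrames. It is not the package's implementation: the function name flatten_record, the underscore used to join column names, and the choice to keep a row with a null for an empty array are all assumptions made for illustration. Structs (dicts) become prefixed columns, while arrays (lists) are exploded into one row per element.

```python
from itertools import product
from typing import Any, Dict, List


def flatten_record(record: Dict[str, Any], prefix: str = "") -> List[Dict[str, Any]]:
    """Flatten one nested record into a list of flat rows (illustrative only)."""
    # For every key, collect the alternative partial rows it contributes.
    parts: List[List[Dict[str, Any]]] = []
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Struct: expand its fields into columns prefixed with the parent name.
            parts.append(flatten_record(value, prefix=f"{name}_"))
        elif isinstance(value, list):
            # Array: each element yields its own row(s); recurse so that
            # arrays of structs and arrays of arrays are handled too.
            rows: List[Dict[str, Any]] = []
            for item in value:
                rows.extend(flatten_record({key: item}, prefix))
            # Assumption: an empty array keeps the row, with a null value.
            parts.append(rows or [{name: None}])
        else:
            # Scalar: a single column with a single value.
            parts.append([{name: value}])

    # Combine one partial row per key; arrays multiply the number of rows.
    flat_rows: List[Dict[str, Any]] = []
    for combo in product(*parts):
        row: Dict[str, Any] = {}
        for part in combo:
            row.update(part)
        flat_rows.append(row)
    return flat_rows


example = {
    "id": 1,
    "user": {"name": "Ana", "tags": ["a", "b"]},
    "orders": [{"sku": "x"}, {"sku": "y"}],
}
# Two tags times two orders gives four flat rows.
for flat_row in flatten_record(example):
    print(flat_row)
```

Note that independent arrays in the same record multiply the row count, which is worth keeping in mind before flattening wide, array-heavy data.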

This package does just that!

This initial version of the package contains one module named pysparketl, which has one main function, flattenDF, and two utility functions, _getArrayCols and _explodeArrayCols.
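
The utility functions are not documented here, so the following is only a sketch of how a helper that finds array columns might work. It is not the code behind _getArrayCols. The function name array_columns is hypothetical, and the input is assumed to be the dictionary form of a schema, as returned by df.schema.jsonValue() in PySpark, where primitive types appear as strings and complex types as nested dictionaries.

```python
from typing import Any, Dict, List


def array_columns(schema: Dict[str, Any]) -> List[str]:
    """Return the names of top-level fields whose type is an array.

    Hypothetical helper; `schema` is assumed to follow the layout of
    `df.schema.jsonValue()`.
    """
    names: List[str] = []
    for field in schema.get("fields", []):
        field_type = field["type"]
        # Primitive types are plain strings ("string", "long", ...);
        # complex types are dicts carrying their own "type" key.
        if isinstance(field_type, dict) and field_type.get("type") == "array":
            names.append(field["name"])
    return names


# A hand-written schema in the same layout, used here instead of a real DataFrame.
sample_schema = {
    "type": "struct",
    "fields": [
        {"name": "id", "type": "long", "nullable": True, "metadata": {}},
        {
            "name": "tags",
            "type": {"type": "array", "elementType": "string", "containsNull": True},
            "nullable": True,
            "metadata": {},
        },
        {
            "name": "user",
            "type": {"type": "struct", "fields": []},
            "nullable": True,
            "metadata": {},
        },
    ],
}
print(array_columns(sample_schema))
```

In a Spark job, columns found this way would typically be expanded with pyspark.sql.functions.explode or explode_outer, one row per array element.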

Usage

flattenDF(df), where df is a PySpark DataFrame with nested data in its columns, returns a DataFrame with a completely flat, tabular structure.

Example

Install the Package
pip install pysparketl
# OR
pip3 install pysparketl
Import and Use
# Import the library
from pysparketl.dataframes import flattenDF

# Read the input DataFrame that contains JSON data
# (assumes an active SparkSession named spark and a path in pathToFile)
rawJsonDF = spark.read.json(pathToFile)

# Pass your nested json dataframe as input to the flattenDF function
flattenedDF = flattenDF(rawJsonDF)

# flattenedDF is now a flat DataFrame with no nested columns.
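
One way to confirm that a result really is flat is to inspect its schema. The sketch below is an illustration rather than part of the package: nested_columns is a hypothetical helper that takes the dictionary returned by df.schema.jsonValue() and reports any top-level column whose type is still complex (array, struct or map).

```python
from typing import Any, Dict


def nested_columns(schema: Dict[str, Any]) -> Dict[str, str]:
    """Map each top-level column that is still nested to its complex type name.

    Hypothetical helper; `schema` is assumed to follow the layout of
    `df.schema.jsonValue()`, where only complex types are dicts.
    """
    return {
        field["name"]: field["type"]["type"]
        for field in schema.get("fields", [])
        if isinstance(field["type"], dict)
    }


# Hand-written schemas standing in for rawJsonDF and flattenedDF.
nested_schema = {
    "type": "struct",
    "fields": [
        {"name": "id", "type": "long", "nullable": True, "metadata": {}},
        {
            "name": "user",
            "type": {"type": "struct", "fields": []},
            "nullable": True,
            "metadata": {},
        },
    ],
}
flat_schema = {
    "type": "struct",
    "fields": [
        {"name": "id", "type": "long", "nullable": True, "metadata": {}},
        {"name": "user_name", "type": "string", "nullable": True, "metadata": {}},
    ],
}
print(nested_columns(nested_schema))  # the struct column is reported
print(nested_columns(flat_schema))    # nothing left to flatten
```

With a real DataFrame, the call would be nested_columns(flattenedDF.schema.jsonValue()), and an empty result would indicate a fully flat schema.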

Download files

Source Distribution

pysparketl-0.0.2.tar.gz (3.2 kB view details)

Uploaded Source

Built Distribution

pysparketl-0.0.2-py3-none-any.whl (3.5 kB view details)

Uploaded Python 3

File details

Details for the file pysparketl-0.0.2.tar.gz.

File metadata

  • Download URL: pysparketl-0.0.2.tar.gz
  • Upload date:
  • Size: 3.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.4

File hashes

Hashes for pysparketl-0.0.2.tar.gz
Algorithm Hash digest
SHA256 707de1907ba32f9ab94e20e8f94792b91e4cfdf88275f02a7cbd00024404013d
MD5 bef4ddde5e8f83ea5e5723799e4af317
BLAKE2b-256 ec2207911e725b7ab926c138da319dc878dfbe73a04da6427ec0b0065a331f02

File details

Details for the file pysparketl-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: pysparketl-0.0.2-py3-none-any.whl
  • Upload date:
  • Size: 3.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.4

File hashes

Hashes for pysparketl-0.0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 60eed4e85bdeb554e67e6a05e936ea249f80356fa469c41d402cd9d2864e6f5b
MD5 4eea2559f0975037c67f81f9b6fb5563
BLAKE2b-256 b287f729f6a4adb5e3b93b8561a38aa38db5fd3a0c113a3a9ddfb8a223426fbf
