This project performs PySpark operations on DataFrames. It currently flattens (unnests) shallow or deeply nested JSON data.

Project description

Pyspark ETL

This project aims to solve common problems that data engineers face.

One such problem is handling deeply nested JSON data and rendering it in a clean tabular format.

This kind of semi-structured data can mix different data types (for example, nested objects and arrays), each of which must be handled differently to flatten the data properly.

This package does just that!
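
To make "handled differently" concrete, here is a minimal pure-Python sketch of what flattening a single JSON record can mean. It is an illustration only, not this package's implementation: the names flatten_record and _flatten_value and the "_" separator are hypothetical. Nested objects become prefixed columns, and arrays produce one output row per element.

```python
from typing import Any, Dict, List


def _flatten_value(key: str, value: Any, sep: str) -> List[Dict[str, Any]]:
    """Return the flat rows produced by one (key, value) pair."""
    if isinstance(value, dict):
        # Nested object: combine the rows of every child, prefixing child keys.
        rows: List[Dict[str, Any]] = [{}]
        for child_key, child in value.items():
            name = f"{key}{sep}{child_key}" if key else child_key
            child_rows = _flatten_value(name, child, sep)
            rows = [{**row, **extra} for row in rows for extra in child_rows]
        return rows
    if isinstance(value, list):
        # Array: one output row per element; an empty array yields a single None.
        rows = [row for item in value for row in _flatten_value(key, item, sep)]
        return rows or [{key: None}]
    # Scalar: a single column with a single value.
    return [{key: value}]


def flatten_record(record: Dict[str, Any], sep: str = "_") -> List[Dict[str, Any]]:
    """Flatten one nested JSON-like record into a list of flat rows."""
    return _flatten_value("", record, sep)


sample = {
    "id": 1,
    "user": {"name": "Ana", "tags": ["a", "b"]},
    "orders": [{"sku": "x", "qty": 2}, {"sku": "y", "qty": 1}],
}
rows = flatten_record(sample)
for row in rows:
    print(row)
```

Note that two independent arrays in the same record multiply: the sample above has two tags and two orders, so it yields four rows. Exploding several array columns one after another in Spark behaves the same way.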

In this initial version, the pysparketl package provides one main function, flattenDF, and two utility functions, _getArrayCols and _explodeArrayCols.

Usage

Call flattenDF(df), where df is a PySpark DataFrame with nested data in its columns. It returns a DataFrame with a completely flat, tabular structure.

Example

Install the Package
pip install pysparketl
# OR
pip3 install pysparketl
Import and Use
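# Import the library
from pyspark.sql import SparkSession
from pysparketl.dataframes import flattenDF

# Get or create a Spark session (skip this if `spark` already exists, e.g. in a notebook)
spark = SparkSession.builder.getOrCreate()

# Placeholder path: replace with the location of your own JSON file
pathToFile = "path/to/nested.json"

# Read the input JSON data into a DataFrame
rawJsonDF = spark.read.json(pathToFile)

# Pass the nested JSON DataFrame to the flattenDF function
flattenedDF = flattenDF(rawJsonDF)

# flattenedDF is now a flat DataFrame with no nested columns.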

Download files

Source Distribution

pysparketl-0.0.3.tar.gz (3.3 kB)

Uploaded Source

Built Distribution

pysparketl-0.0.3-py3-none-any.whl (3.5 kB)

Uploaded Python 3

File details

Details for the file pysparketl-0.0.3.tar.gz.

File metadata

  • Download URL: pysparketl-0.0.3.tar.gz
  • Upload date:
  • Size: 3.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.4

File hashes

Hashes for pysparketl-0.0.3.tar.gz
Algorithm Hash digest
SHA256 7be1e057d99043a1f689f566e057f0559ba0353c7ac59ffe5a772e2ed4ca5a6f
MD5 a7963a44a5ffdb76485b24a6de716bd4
BLAKE2b-256 35d8d10d3d5552ed310ff45b999825bdbcf0d35687469946fbe5a6ce5065fcbe

File details

Details for the file pysparketl-0.0.3-py3-none-any.whl.

File metadata

  • Download URL: pysparketl-0.0.3-py3-none-any.whl
  • Upload date:
  • Size: 3.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.4

File hashes

Hashes for pysparketl-0.0.3-py3-none-any.whl
Algorithm Hash digest
SHA256 15693a4e60a232cf9564a01799402f198f52d2264107311dfa8735c54041740e
MD5 4ad1c0694fc61e296535cbcc7f49a7af
BLAKE2b-256 603ac98bc7f1344c42a795bc2c6bcf79e7f0ebf48c6ff9e1dcef32bdb4c4323f
