
Generic ETL Pipeline Framework for Apache Spark


See https://stonezhong.github.io/spark_etl/ for more information.

Overview

Goal

Many public clouds provide managed Apache Spark as a service, such as Databricks, AWS EMR, and Oracle OCI Data Flow; see the list of supported platforms below.

However, each platform has its own way of launching Spark jobs, and these launch mechanisms are not compatible with each other.

spark-etl is a Python package that simplifies Spark application management across platforms with three uniform steps:

  • Build your spark application
  • Deploy your spark application
  • Run your spark application

Benefit

An application built with spark-etl is Spark-provider agnostic. For example, you can move your application from Azure HDInsight to AWS EMR without changing your application's code.

Since PySpark is a supported Spark platform, you can also run a down-scaled version of your data lake with PySpark on a laptop. This lets you validate your Spark application locally instead of running it in the cloud, saving cost.

Supported platforms

  • You set up your own Apache Spark cluster.
  • You use the PySpark package; it is fully compatible with the other Spark platforms and lets you test your pipeline on a single computer.
  • You host your Spark cluster in Databricks.
  • You host your Spark cluster in Amazon AWS EMR.
  • You host your Spark cluster in Google Cloud.
  • You host your Spark cluster in Microsoft Azure HDInsight.
  • You host your Spark cluster in Oracle Cloud Infrastructure, Data Flow Service.
  • You host your Spark cluster in IBM Cloud.

Deploy and run application

Please see the Demos

APIs

pydocs for APIs

Application

An application is a PySpark application; so far only PySpark is supported, and Java and Scala support will be added later. An application contains:

  • A main.py file, which contains the application entry point.
  • A manifest.json file, which specifies the metadata of the application.
  • A requirements.txt file, which specifies the application's dependencies.
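As a sketch, the layout above can be created like this. The manifest's version field is referenced later by the build step; the exact file contents here are illustrative assumptions, not mandated by spark-etl:

```python
# Sketch: create a minimal spark-etl application directory.
# File contents are illustrative; only the three file names are required.
import json
from pathlib import Path

app_dir = Path("myapp")
app_dir.mkdir(exist_ok=True)

# main.py: the application entry point
(app_dir / "main.py").write_text(
    "def main(spark, input_args, sysops={}):\n"
    "    # your code here\n"
    "    return {}\n"
)

# manifest.json: application metadata ("version" is used by the build step)
(app_dir / "manifest.json").write_text(json.dumps({"version": "0.0.1"}))

# requirements.txt: the application's dependencies (empty for this sketch)
(app_dir / "requirements.txt").write_text("")
```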

Application class:

  • You can create an application via Application(app_location)
  • You can build an application via app.build(destination_location)

Application entry signature

In your application's main.py, you should have a main function with the following signature:

def main(spark, input_args, sysops={}):
    # your code here

  • spark is the Spark session object.
  • input_args is a dict holding the arguments the user specified when running the job.
  • sysops holds the system options passed in; it is platform specific, and the job submitter may inject platform-specific objects into it.
  • Your main function's return value is returned by the job submitter to the caller.

Here is an example.
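As an illustrative sketch, a main with this signature could be as simple as the following. The body and the input_args keys are made up for illustration; only the signature comes from this document:

```python
# A minimal main() matching the documented signature.
# The body is illustrative; a real job would use the `spark` session
# (e.g. spark.read, spark.sql) to do actual work.
def main(spark, input_args, sysops={}):
    names = input_args.get("names", [])
    # The return value is passed back to the caller by the job submitter.
    return {"count": len(names)}

# Local illustration only: pass None for the session, since this
# sketch does not touch any Spark APIs.
result = main(None, {"names": ["alice", "bob"]})
```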

Job Deployer

For job deployers, please check the wiki.

Job Submitter

For job submitters, please check the wiki.

Tool: etl.py

Build an application

To build an application, run

./etl.py -a build --app-dir <app-dir> --build-dir <build-dir>
  • <app-dir> is the directory where your application is located.

  • <build-dir> is the directory where you want the build output to be placed.

    • Your build is actually located at <build-dir>/<version>, where <version> is specified in the application's manifest file.
  • The build is mostly platform independent. You can put platform-related packages in the file common_requirements.txt.
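The <build-dir>/<version> convention described above can be sketched as follows. This only mimics the documented output layout; it is not the actual etl.py implementation:

```python
# Sketch of where build output lands: <build-dir>/<version>,
# where <version> comes from the application's manifest.json.
import json
from pathlib import Path

def build_output_dir(app_dir, build_dir):
    """Return the directory where a build would be placed."""
    manifest = json.loads((Path(app_dir) / "manifest.json").read_text())
    return Path(build_dir) / manifest["version"]

# Example: an app whose manifest declares version 1.2.3
Path("demo_app").mkdir(exist_ok=True)
(Path("demo_app") / "manifest.json").write_text(json.dumps({"version": "1.2.3"}))
out = build_output_dir("demo_app", ".builds")
```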
