Generic ETL Pipeline Framework for Apache Spark

Overview

Goal

This is a cross-platform tool that lets you build, deploy, and run your ETL jobs. It supports native Apache Spark clusters, Amazon EMR, and Oracle DataFlow, which means:

You can use this library if you run your own Apache Spark cluster.
You can use this library if you use Amazon EMR.
You can use this library if you use Oracle DataFlow.

Application

What is an application?

An application is the code for a Spark job. It contains:

A main.py file, which contains the application entry point.
A manifest.json file, which specifies the application's metadata, such as the current version (see here for an example).
A requirements.txt file, which specifies the application's dependencies.
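
The layout described above can be sketched with a small helper that creates the three files. This is an illustration only: the helper name scaffold_application is hypothetical, and the "version" key in manifest.json is an assumption based on the description above, not a confirmed schema.

```python
import json
from pathlib import Path


def scaffold_application(app_dir, version="0.0.1"):
    """Create a minimal application directory (hypothetical helper).

    The manifest key "version" is an assumption; check the library's own
    example for the real manifest schema.
    """
    app_path = Path(app_dir)
    app_path.mkdir(parents=True, exist_ok=True)

    # Entry point stub: the framework is expected to call main().
    (app_path / "main.py").write_text(
        "def main(spark, input_args):\n    pass\n"
    )
    # Application metadata, for example the current version.
    (app_path / "manifest.json").write_text(
        json.dumps({"version": version}, indent=4) + "\n"
    )
    # No third-party dependencies yet, so the file starts empty.
    (app_path / "requirements.txt").write_text("")

    # Return the file names so the caller can check the layout.
    return sorted(p.name for p in app_path.iterdir())
```

Calling scaffold_application("path_to_application_dir") would leave a directory that matches the three-file layout listed above.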

How to build an application

Building an application generates the application artifacts, which are needed when you deploy the application.

Here is sample code to build an application:

from spark_etl import Application
...

app = Application("path_to_application_dir")
app.build("path_to_artifact_directory")

# Application(...) loads the application from path_to_application_dir
# build(...) generates the artifacts in path_to_artifact_directory

Application entry signature

In your application's main.py, you should have a function called main with the following signature:

def main(spark, input_args):
    # your code here

The argument spark is the Spark session object passed to your application.
The argument input_args is a dict that holds the arguments supplied when the application is invoked. By default it is an empty dict.

See here for an example.
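
As a minimal sketch of an entry point with this signature, the function below reads one optional argument and counts the rows of a generated DataFrame. The argument name row_count and the returned dict are assumptions made for illustration; the text above does not say whether the framework uses the return value. Only SparkSession.range and DataFrame.count are used from Spark.

```python
def main(spark, input_args):
    # input_args is an empty dict by default, so read keys defensively.
    # "row_count" is a hypothetical argument name used for this sketch.
    row_count = int(input_args.get("row_count", 10))

    # SparkSession.range(n) builds a single-column DataFrame with ids 0..n-1.
    df = spark.range(row_count)

    # Returning a result is an assumption; the framework may ignore it.
    return {"row_count": df.count()}
```

Because the arguments are read with dict.get, the same function works both when the application is invoked with no arguments and when row_count is supplied.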

What is a deployer?

A deployer is an object that knows how to deploy your ETL job on a particular platform.

What is a submitter?

A submitter is an object that knows how to submit your ETL job on a particular platform.
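
To illustrate the split between the two roles, here is a conceptual sketch. Every class and method name in it (Deployer, Submitter, LocalDirDeployer, DryRunSubmitter, deploy, submit) is hypothetical and is not the spark_etl API. It only shows the idea that each platform supplies its own pair of objects behind a common interface.

```python
import shutil
from abc import ABC, abstractmethod
from pathlib import Path


class Deployer(ABC):
    """Hypothetical interface: puts build artifacts where a platform can run them."""

    @abstractmethod
    def deploy(self, artifact_dir, destination):
        """Copy artifacts to destination and return the deployed location."""


class Submitter(ABC):
    """Hypothetical interface: starts a deployed job on a platform."""

    @abstractmethod
    def submit(self, deployed_location, input_args=None):
        """Start the job and return a description of the submission."""


class LocalDirDeployer(Deployer):
    """Stand-in 'platform': deployment is a plain directory copy."""

    def deploy(self, artifact_dir, destination):
        target = Path(destination)
        # dirs_exist_ok requires Python 3.8 or later.
        shutil.copytree(artifact_dir, target, dirs_exist_ok=True)
        return str(target)


class DryRunSubmitter(Submitter):
    """Stand-in submitter: records what would be run instead of running it."""

    def submit(self, deployed_location, input_args=None):
        return {"location": deployed_location, "args": dict(input_args or {})}
```

In this sketch, supporting another platform would mean adding another Deployer and Submitter pair while the application code stays unchanged.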

Deploy and submit a job with native Apache Spark

Deploy and submit a job with Oracle DataFlow

Download files

Source Distribution

spark-etl-0.0.7.tar.gz (10.5 kB)

Uploaded Nov 6, 2020 Source

Built Distribution

spark_etl-0.0.7-py3-none-any.whl (17.7 kB)

Uploaded Nov 6, 2020 Python 3

Hashes for spark-etl-0.0.7.tar.gz
Algorithm	Hash digest
SHA256	`3d4c6ffdd8937ca70e8345a1ee30a6963d2c9192028c4b558506d535bc20f7ed`
MD5	`22be74789a0a59a0e9aec2fac09a3dcc`
BLAKE2b-256	`82cb635aeb1069ebcca5dd233fc7496b6bfa9bcf19864b7d723f5497afe970b8`

Hashes for spark_etl-0.0.7-py3-none-any.whl
Algorithm	Hash digest
SHA256	`464ddcad1d6d3d0ba58345de315e28882ed5a6a964ad62585cee48af006c35ef`
MD5	`b6f468eb26ba4c356c3cc59f2967c2cc`
BLAKE2b-256	`05e0b373ab9ab7ba4aa6fa80cd2ed219073064de9d0646da06711a9d18ca2f69`