
Generic ETL Pipeline Framework for Apache Spark


Overview

Goal

Many public clouds provide managed Apache Spark as a service, such as Databricks, AWS EMR, and Oracle OCI Data Flow; see the table below for a detailed list.

However, the ways to deploy and launch a Spark application are incompatible across these cloud Spark platforms.

With spark-etl, you can deploy and launch your Spark application in a standard way on any of them.

Benefit

An application built with spark-etl can be deployed and launched on different Spark providers without changing the source code. Please check out the demos in the tables below.

Application

An application is a Python program. It contains:

  • A main.py file, which contains the application entry point.
  • A manifest.json file, which specifies the metadata of the application.
  • A requirements.txt file, which specifies the application's dependencies.
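
As a sketch, the application directory's manifest.json might look like the snippet below. The fields shown (display_name and version) are hypothetical placeholders, not the confirmed spark-etl schema; consult the examples below for the real format.

```json
{
    "display_name": "myapp",
    "version": "0.0.1"
}
```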

Application entry signature

In your application's main.py, you should have a main function with the following signature:

def main(spark, input_args, sysops={}):
    # your code here

  • spark is the Spark session object.
  • input_args is a dict of the arguments the user specified when running the application.
  • sysops is the set of system options passed in; it is platform specific, and the job submitter may inject platform-specific objects into it.
  • Your main function's return value should be a JSON object; it is returned from the job submitter to the caller.
Here is an application example.
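
A minimal sketch of such an entry point is shown below. To keep it runnable without a cluster it only echoes the run arguments back; the commented-out read and the input_path argument name are illustrative assumptions, not part of the spark-etl API.

```python
import json


def main(spark, input_args, sysops={}):
    # `spark` is the Spark session handed in by the platform. A real job
    # would use it to read and transform data, for example:
    #   df = spark.read.parquet(input_args["input_path"])  # hypothetical arg
    # This sketch only echoes the argument names back to the caller.
    result = {"status": "ok", "args_received": sorted(input_args)}
    # The return value must be a JSON object; serializing it here fails
    # fast if anything non-serializable slipped in.
    json.dumps(result)
    return result
```

Because the return value travels back through the job submitter, keeping it a small JSON-serializable dict (rather than, say, a DataFrame) is the safe choice.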

Build your application

etl -a build -c <config-filename> -p <application-name>

For details, please check out the examples below.

Deploy your application

etl -a deploy -c <config-filename> -p <application-name> -f <profile-name>

For details, please check out the examples below.

Run your application

etl -a run -c <config-filename> -p <application-name> -f <profile-name> --run-args <input-filename>

For details, please check out the examples below.
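
Putting the three commands together, a typical workflow might look like the sketch below. All names (config.json, myapp, the profile name local, input.json) are hypothetical placeholders; substitute your own config file, application name, profile, and run-arguments file.

```shell
# Build the application package from its source directory
etl -a build  -c config.json -p myapp

# Deploy the built artifact to the platform described by the "local" profile
etl -a deploy -c config.json -p myapp -f local

# Launch the deployed application, passing run arguments from input.json
etl -a run    -c config.json -p myapp -f local --run-args input.json
```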

Supported platforms

  • You set up your own Apache Spark cluster.
  • You use the PySpark package; fully compatible with other Spark platforms, it lets you test your pipeline on a single computer.
    • Demo: Access Data on local filesystem
    • Demo: Access Data on AWS S3
  • You host your Spark cluster in Databricks.
  • You host your Spark cluster in Amazon AWS EMR.
    • Demo: Access Data on AWS S3
  • You host your Spark cluster in Google Cloud.
  • You host your Spark cluster in Microsoft Azure HDInsight.
  • You host your Spark cluster in Oracle Cloud Infrastructure, Data Flow Service.
  • You host your Spark cluster in IBM Cloud.

APIs

pydocs for APIs

Job Deployer

For job deployers, please check the wiki.

Job Submitter

For job submitters, please check the wiki.

