Generic ETL Pipeline Framework for Apache Spark
Overview
Goal
Many public clouds provide managed Apache Spark as a service, such as Databricks, AWS EMR, and Oracle OCI Data Flow; see the table below for a detailed list.
However, the ways to deploy and launch a Spark application are incompatible across these cloud Spark platforms.
With spark-etl, you can deploy and launch your Spark application in a standard way.
Benefit
An application that uses spark-etl can be deployed and launched on different Spark providers without changing its source code. Please check out the demos in the tables below.
Application
An application is a Python program. It contains:
- A main.py file, which contains the application entry point.
- A manifest.json file, which specifies the metadata of the application.
- A requirements.txt file, which specifies the application's dependencies.
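The layout above can be checked mechanically before building. The following is a minimal sketch and not part of spark-etl: the helper name missing_app_files is hypothetical, and it only verifies that the three files listed above exist in an application directory. It does not inspect their contents.

```python
from pathlib import Path

# The three files that the list above says every application contains.
REQUIRED_FILES = ("main.py", "manifest.json", "requirements.txt")


def missing_app_files(app_dir):
    """Return the required files absent from app_dir (hypothetical helper)."""
    root = Path(app_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

An empty list means the directory has all three files; otherwise the list names what is missing, in the order shown above.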
Application entry signature
In your application's main.py, you should have a main function with the following signature:
def main(spark, input_args, sysops={}):
    # your code here
    return {}  # the return value must be a JSON object
- spark is the Spark session object.
- input_args, a dict, holds the arguments the user specified when running this application.
- sysops is the system options passed; it is platform specific. The job submitter may inject platform-specific objects into sysops.
- Your main function's return value should be a JSON object; it will be returned from the job submitter to the caller.
Here is an application example.
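As a further illustration of the entry signature, here is a minimal sketch of a main function. It is not taken from the project's examples: the row_count argument name is hypothetical, and the body only assumes the standard SparkSession.range and DataFrame.count calls.

```python
import json


def main(spark, input_args, sysops={}):
    # sysops mirrors the documented signature; it is not used or mutated here.
    # "row_count" is a hypothetical argument name chosen for this sketch.
    row_count = int(input_args.get("row_count", 10))

    # spark.range(n) builds a single-column DataFrame holding the ids 0..n-1.
    df = spark.range(row_count)

    # Return plain Python types so the result can be serialized as JSON.
    result = {"rows": df.count()}
    json.dumps(result)  # raises TypeError if the result is not JSON-serializable
    return result
```

The json.dumps call is only a guard: it fails early if something non-serializable, such as a DataFrame, slips into the return value.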
Build your application
etl -a build -c <config-filename> -p <application-name>
For details, please check out the examples below.
Deploy your application
etl -a deploy -c <config-filename> -p <application-name> -f <profile-name>
For details, please check out the examples below.
Run your application
etl -a run -c <config-filename> -p <application-name> -f <profile-name> --run-args <input-filename>
For details, please check out the examples below.
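The build, deploy, and run steps share most of their flags, so they can be scripted together. The sketch below only assembles the command lines shown above and prints them; it does not call etl. The helper name etl_command and the values config.json, myapp, main, and input.json are placeholders invented for this example, not names defined by spark-etl.

```python
import shlex


def etl_command(action, config, app, profile=None, run_args=None):
    """Build the argument list for one etl invocation (hypothetical helper)."""
    cmd = ["etl", "-a", action, "-c", config, "-p", app]
    if profile is not None:
        cmd += ["-f", profile]
    if run_args is not None:
        cmd += ["--run-args", run_args]
    return cmd


# Placeholder names; substitute your own config file, application, and profile.
steps = [
    etl_command("build", "config.json", "myapp"),
    etl_command("deploy", "config.json", "myapp", profile="main"),
    etl_command("run", "config.json", "myapp", profile="main", run_args="input.json"),
]
for step in steps:
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(step))
```

Keeping each command as a list rather than a string means the same values can later be handed to subprocess.run without extra quoting.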
Supported platforms
You set up your own Apache Spark cluster. |
Use the PySpark package, which is fully compatible with the other Spark platforms and lets you test your pipeline on a single computer. |
You host your Spark cluster in Databricks. |
You host your Spark cluster in Amazon AWS EMR. |
You host your Spark cluster in Google Cloud. |
You host your Spark cluster in Microsoft Azure HDInsight. |
You host your Spark cluster in Oracle Cloud Infrastructure, Data Flow Service. |
You host your Spark cluster in IBM Cloud. |
APIs
Job Deployer
For job deployers, please check the wiki.
Job Submitter
For job submitters, please check the wiki.
Hashes for spark_etl-0.0.107-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 037bc87015f2786b352df85e9ae812a1075cbbcfb945877445957b777077b106
MD5 | 24adf2d1ace1b3d9437661e672b81dcb
BLAKE2b-256 | 11967f0f23b90e5625175e99ecd38cdf54ef4b152c98fe01fb14cea1d1cdac47