Generic ETL Pipeline Framework for Apache Spark
Overview
Goal
Many public clouds offer managed Apache Spark as a service, such as Databricks, AWS EMR, and Oracle OCI Data Flow; see the table below for a detailed list.
However, the ways to deploy and launch a Spark application are incompatible across the different cloud Spark platforms.
spark-etl is a Python package that provides a standard way to build, deploy, and run your Spark application on various cloud Spark platforms.
Benefit
An application that uses spark-etl can be deployed and launched on different cloud Spark platforms without changing its source code.
Application
An application is a Python program. It contains:
- A main.py file, which contains the application entry point.
- A manifest.json file, which specifies the metadata of the application.
- A requirements.txt file, which specifies the application's dependencies.
Application entry signature
In your application's main.py, you should have a main function with the following signature:
def main(spark, input_args, sysops={}):
    # your code here
    return {}  # the return value should be a JSON object
- spark is the Spark session object.
- input_args is a dict holding the arguments the user specified when running this application.
- sysops holds the system options passed in; it is platform specific. The job submitter may inject platform-specific objects into sysops.
- Your main function's return value should be a JSON object; it is returned from the job submitter to the caller.
Here is an application example.
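For illustration only, a minimal main.py that follows the entry signature described above might look like the sketch below. The "rows" key in input_args and the "row_count" key in the result are hypothetical names chosen for this example; they are not part of spark-etl. The function relies only on the session object passed in, so it does not import pyspark itself.

```python
def main(spark, input_args, sysops={}):
    # input_args is the dict supplied by the caller at run time.
    # "rows" is a hypothetical key used only for this illustration.
    rows = int(input_args.get("rows", 10))

    # spark.range(n) creates a DataFrame with a single "id" column
    # holding the values 0 .. n-1.
    df = spark.range(rows)

    # Return a JSON-serializable object; the job submitter hands it
    # back to the caller.
    return {"row_count": df.count()}
```

Because the session is injected, the same function can be exercised in a unit test with a small stand-in object before it is deployed to a real cluster.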
Build your application
etl -a build -c <config-filename> -p <application-name>
Deploy your application
etl -a deploy -c <config-filename> -p <application-name> -f <profile-name>
Run your application
etl -a run -c <config-filename> -p <application-name> -f <profile-name> --run-args <input-filename>
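The three commands above can be scripted. The sketch below only assembles and prints the command lines; it uses the flags shown above and nothing else. The config filename, application name, and profile name are hypothetical placeholders, and it assumes (without confirmation from this page) that the file passed to --run-args is a JSON document whose content becomes input_args.

```python
import json
import shlex
import tempfile
from pathlib import Path


def etl_command(action, config, app, profile=None, run_args=None):
    """Build the argv list for one etl invocation using the flags shown above."""
    argv = ["etl", "-a", action, "-c", config, "-p", app]
    if profile is not None:
        argv += ["-f", profile]
    if run_args is not None:
        argv += ["--run-args", run_args]
    return argv


# Hypothetical names, used only for illustration.
config = "config.json"
app = "myapp"
profile = "main"

# Assumption: the --run-args file is JSON that becomes input_args.
run_args_path = Path(tempfile.mkdtemp()) / "input.json"
run_args_path.write_text(json.dumps({"rows": 5}))

commands = [
    etl_command("build", config, app),
    etl_command("deploy", config, app, profile=profile),
    etl_command("run", config, app, profile=profile, run_args=str(run_args_path)),
]

for argv in commands:
    # Printed rather than executed; to run for real, pass argv to
    # subprocess.run(argv, check=True).
    print(shlex.join(argv))
```

Building each command as a list rather than a single string avoids shell-quoting problems if a filename contains spaces.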
Supported platforms
| Platform | Description |
|---|---|
| Self-managed Apache Spark | You set up your own Apache Spark cluster. |
| Local PySpark | Uses the PySpark package; fully compatible with the other Spark platforms, so you can test your pipeline on a single computer. |
| Databricks | You host your Spark cluster in Databricks. |
| Amazon AWS EMR | You host your Spark cluster in Amazon AWS EMR. |
| Google Cloud | You host your Spark cluster in Google Cloud. |
| Microsoft Azure HDInsight | You host your Spark cluster in Microsoft Azure HDInsight. |
| Oracle Cloud Infrastructure Data Flow | You host your Spark cluster in the Oracle Cloud Infrastructure Data Flow service. |
| IBM Cloud | You host your Spark cluster in IBM Cloud. |
Demos
- Using local pyspark, access data on local disk
- Using local pyspark, access data on AWS S3
- Using on-premise spark, access data on HDFS
- Using on-premise spark, access data on AWS S3
- Using AWS EMR's spark, access data on AWS S3
- Using Oracle OCI's Data Flow with API key, access data on Object Storage
- Using Oracle OCI's Data Flow with instance principal, access data on Object Storage
APIs
Job Deployer
For job deployers, please check the wiki.
Job Submitter
For job submitters, please check the wiki.