Generic ETL Pipeline Framework for Apache Spark

Overview

Goal

Many public clouds offer managed Apache Spark as a service, such as Databricks, AWS EMR, and Oracle OCI Data Flow; see Supported platforms below for a detailed list.

However, the ways to deploy and launch a Spark application are incompatible across these cloud Spark platforms.

spark-etl is a Python package that provides a standard way to build, deploy, and run your Spark application across various cloud Spark platforms.

Benefit

An application built with spark-etl can be deployed and launched on different cloud Spark platforms without changing its source code.

Application

An application is a Python program. It contains:

  • A main.py file, which contains the application entry point.
  • A manifest.json file, which specifies the application's metadata.
  • A requirements.txt file, which specifies the application's dependencies.
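
As a rough illustration of the layout above, the following sketch creates an application directory with the three files. The helper name scaffold_app, the directory name, and the single "version" field written to manifest.json are assumptions made for this example; the actual manifest schema is not described here.

```python
import json
import tempfile
from pathlib import Path

# Placeholder entry point using the signature this README describes.
MAIN_PY = '''\
def main(spark, input_args, sysops={}):
    # your code here
    return {"status": "ok"}
'''


def scaffold_app(root, name):
    """Create <root>/<name> containing main.py, manifest.json and requirements.txt."""
    app_dir = Path(root) / name
    app_dir.mkdir(parents=True, exist_ok=True)
    (app_dir / "main.py").write_text(MAIN_PY)
    # Hypothetical metadata: "version" is a placeholder field, not a documented schema.
    (app_dir / "manifest.json").write_text(json.dumps({"version": "1.0.0"}, indent=2))
    # Dependencies go here, one per line in pip requirements format (empty in this sketch).
    (app_dir / "requirements.txt").write_text("")
    return app_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        app = scaffold_app(tmp, "myapp")
        print(sorted(p.name for p in app.iterdir()))
```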

Application entry signature

In your application's main.py, you should have a main function with the following signature:

def main(spark, input_args, sysops={}):
    # your code here

  • spark is the Spark session object.
  • input_args is a dict holding the arguments the user specified when running this application.
  • sysops holds the system options passed in; it is platform specific. The job submitter may inject platform-specific objects into sysops.
  • The return value of main should be a JSON object; the job submitter returns it to the caller.

Here is an application example.
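
For illustration only (this is not the project's own example), a minimal main.py might look like the sketch below. The "rows" key in input_args and the keys of the returned dict are hypothetical; the only Spark calls used are SparkSession.range and DataFrame.count.

```python
def main(spark, input_args, sysops={}):
    # input_args is the dict supplied by the caller; "rows" is a hypothetical key.
    rows = int(input_args.get("rows", 10))
    # spark.range(n) builds a single-column DataFrame with ids 0..n-1.
    counted = spark.range(rows).count()
    # Return only plain dicts, lists, numbers and strings so the result is JSON-serializable.
    return {"rows_requested": rows, "rows_counted": counted}
```

With PySpark installed, this function could be tried locally by passing it a session created with SparkSession.builder.getOrCreate() and a small dict such as {"rows": 5}.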

Build your application

etl -a build -c <config-filename> -p <application-name>

Deploy your application

etl -a deploy -c <config-filename> -p <application-name> -f <profile-name>

Run your application

etl -a run -c <config-filename> -p <application-name> -f <profile-name> --run-args <input-filename>
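
The three commands above differ only in the action and in which optional flags are present, so a script that drives them could assemble the argument lists as sketched here. The helper etl_command and the file, application, and profile names are hypothetical; only the flags shown in this README (-a, -c, -p, -f, --run-args) are used.

```python
import shlex


def etl_command(action, config, app, profile=None, run_args=None):
    """Build the argument list for the etl CLI from the flags shown in this README."""
    cmd = ["etl", "-a", action, "-c", config, "-p", app]
    if profile is not None:
        cmd += ["-f", profile]
    if run_args is not None:
        cmd += ["--run-args", run_args]
    return cmd


if __name__ == "__main__":
    # Placeholder names; replace with your own config file, application and profile.
    commands = [
        etl_command("build", "config.json", "myapp"),
        etl_command("deploy", "config.json", "myapp", profile="main"),
        etl_command("run", "config.json", "myapp", profile="main", run_args="input.json"),
    ]
    for cmd in commands:
        # Printed rather than executed; subprocess.run(cmd, check=True) would run it.
        print(shlex.join(cmd))
```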

Supported platforms

  • You set up your own Apache Spark cluster.
  • You use the PySpark package, which is fully compatible with the other Spark platforms and lets you test your pipeline on a single computer.
  • You host your Spark cluster in Databricks.
  • You host your Spark cluster in Amazon AWS EMR.
  • You host your Spark cluster in Google Cloud.
  • You host your Spark cluster in Microsoft Azure HDInsight.
  • You host your Spark cluster in Oracle Cloud Infrastructure, Data Flow service.
  • You host your Spark cluster in IBM Cloud.

Demos

APIs

pydocs for APIs

Job Deployer

For job deployers, please check the wiki.

Job Submitter

For job submitters, please check the wiki.
