Generic ETL Pipeline Framework for Apache Spark

Overview

Goal

This is a cross-platform tool that lets you build, deploy, and run your ETL jobs. It supports native Apache Spark clusters, Amazon EMR, and Oracle DataFlow, which means:

  • You can use this library if you build your own Apache Spark cluster
  • You can use this library if you use Amazon EMR
  • You can use this library if you use Oracle DataFlow

Application

What is an application?

An application is the code for a Spark job. It contains:

  • A main.py file, which contains the application entry point
  • A manifest.json file, which specifies the metadata of the application, for example the current version (see here for an example)
  • A requirements.txt file, which specifies the application's dependencies
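
As a rough illustration of the layout described above, the following sketch creates an application directory containing the three files. The helper name create_app_skeleton is hypothetical and is not part of spark_etl, and the "version" key in manifest.json is an assumption based on the description above rather than a confirmed schema.

```python
import json
from pathlib import Path


def create_app_skeleton(app_dir, version="0.0.1"):
    """Create a minimal application directory (hypothetical helper, not part of spark_etl)."""
    app_path = Path(app_dir)
    app_path.mkdir(parents=True, exist_ok=True)

    # main.py holds the application entry point
    (app_path / "main.py").write_text(
        "def main(spark, input_args):\n"
        "    pass  # your code here\n"
    )

    # manifest.json holds application metadata; the "version" key is an assumption
    (app_path / "manifest.json").write_text(
        json.dumps({"version": version}, indent=2) + "\n"
    )

    # requirements.txt lists dependencies, one per line; empty here
    (app_path / "requirements.txt").write_text("")

    return app_path
```

Calling create_app_skeleton("my_app") would leave a my_app directory that can then be adjusted by hand before building.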

How to build an application

Building an application generates the application artifacts, which are needed when you deploy the application.

Here is sample code to build an application:

from spark_etl import Application
...

app = Application("path_to_application_dir")
app.build("path_to_artifact_directory")

# Application(...) loads the application from path_to_application_dir
# build(...) generates the artifacts in path_to_artifact_directory

Application entry signature

In your application's main.py, you should have a function called main with the following signature:

def main(spark, input_args):
    # your code here

  • The argument spark is the Spark session object passed to your function
  • The argument input_args is a dict holding the arguments supplied when the application is invoked; by default it is an empty dict

See here for example.
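
To make the signature concrete, here is a minimal sketch of what a main function might look like. The argument name "row_count" is hypothetical, and returning a value from main is an assumption; the description above does not say whether the framework uses the return value.

```python
def main(spark, input_args):
    # input_args may be an empty dict, so read optional values with .get()
    # "row_count" is a hypothetical argument name used only for illustration
    row_count = int(input_args.get("row_count", 10))

    # spark.range(n) builds a single-column DataFrame with n rows
    df = spark.range(row_count)
    total = df.count()

    print(f"generated {total} rows")

    # Returning a result is an assumption, not a documented requirement
    return {"rows": total}
```

Because the Spark session is passed in, main.py does not need to create one itself, which also makes the function easy to exercise with a stand-in object in unit tests.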

What is a deployer?

A deployer is an object that knows how to deploy your ETL job on a particular platform.

What is a submitter?

A submitter is an object that knows how to submit your ETL job on a particular platform.
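
The split between the two roles can be sketched conceptually as follows. Every class and method name here (Deployer, Submitter, LocalDirDeployer, DescribeOnlySubmitter, deploy, submit) is hypothetical and chosen only for illustration; this is not spark_etl's actual API, and the submitter below does not launch Spark at all.

```python
import shutil
from abc import ABC, abstractmethod
from pathlib import Path


class Deployer(ABC):
    """Hypothetical interface: puts built artifacts where a platform can find them."""

    @abstractmethod
    def deploy(self, artifact_dir, destination):
        ...


class Submitter(ABC):
    """Hypothetical interface: asks a platform to run a deployed job."""

    @abstractmethod
    def submit(self, deployment_location, input_args=None):
        ...


class LocalDirDeployer(Deployer):
    """Illustrative deployer that copies artifacts into a local directory."""

    def deploy(self, artifact_dir, destination):
        dest = Path(destination)
        # dirs_exist_ok requires Python 3.8 or newer
        shutil.copytree(artifact_dir, dest, dirs_exist_ok=True)
        return str(dest)


class DescribeOnlySubmitter(Submitter):
    """Illustrative submitter that only describes the job instead of running it."""

    def submit(self, deployment_location, input_args=None):
        return {
            "location": deployment_location,
            "input_args": dict(input_args or {}),
        }
```

Keeping the two roles behind separate interfaces is one way a tool could support several platforms with the same application code: only the deployer and submitter would change per platform.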

Deploy and Submit job with native Apache Spark

Deploy and Submit job with Oracle DataFlow

Download files

Files for spark-etl, version 0.0.6
  • spark_etl-0.0.6-py3-none-any.whl (15.3 kB), file type: Wheel, Python version: py3
  • spark-etl-0.0.6.tar.gz (9.6 kB), file type: Source, Python version: None
