
Generic ETL Pipeline Framework for Apache Spark


See https://stonezhong.github.io/spark_etl/ for more information.

Overview

Goal

Many public clouds provide managed Apache Spark as a service, such as Databricks, AWS EMR, and Oracle OCI Data Flow; see the list of supported platforms below.

However, each platform has its own way of launching Spark jobs, and these launch mechanisms are not compatible with each other.

spark-etl is a Python package that simplifies Spark application management across platforms with three uniform steps:

  • Build your spark application
  • Deploy your spark application
  • Run your spark application

Benefit

An application built with spark-etl is Spark-provider agnostic. For example, you can move your application from Azure HDInsight to AWS EMR without changing your application's code.

Since PySpark is a supported Spark platform, you can also run a down-scaled version of your data lake with PySpark on a laptop. This lets you validate your Spark application locally instead of running it in the cloud, saving cost.

Supported platforms

  • You set up your own Apache Spark cluster.
  • You use the PySpark package; it is fully compatible with the other Spark platforms and lets you test your pipeline on a single computer.
  • You host your Spark cluster in Databricks.
  • You host your Spark cluster in Amazon AWS EMR.
  • You host your Spark cluster in Google Cloud.
  • You host your Spark cluster in Microsoft Azure HDInsight.
  • You host your Spark cluster in Oracle Cloud Infrastructure, Data Flow Service.
  • You host your Spark cluster in IBM Cloud.

Deploy and run application

Please see the Demos

APIs

pydocs for APIs

Application

An application is a PySpark application; so far only PySpark is supported, and Java and Scala support will be added later. An application contains:

  • A main.py file, which contains the application entry point.
  • A manifest.json file, which specifies the metadata of the application.
  • A requirements.txt file, which specifies the application's dependencies.
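As a sketch, the layout above can be created like this. The manifest's version field is referenced later by the build step; the exact file contents here are illustrative assumptions, not mandated by spark-etl:

```python
# Sketch: create a minimal spark-etl application directory.
# File contents are illustrative; only the three file names are required.
import json
from pathlib import Path

app_dir = Path("myapp")
app_dir.mkdir(exist_ok=True)

# main.py: the application entry point
(app_dir / "main.py").write_text(
    "def main(spark, input_args, sysops={}):\n"
    "    # your code here\n"
    "    return {}\n"
)

# manifest.json: application metadata ("version" is used by the build step)
(app_dir / "manifest.json").write_text(json.dumps({"version": "0.0.1"}))

# requirements.txt: the application's dependencies (empty for this sketch)
(app_dir / "requirements.txt").write_text("")
```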

Application class:

  • You can create an application via Application(app_location)
  • You can build an application via app.build(destination_location)

Application entry signature

In your application's main.py, you should have a main function with the following signature:

def main(spark, input_args, sysops={}):
    # your code here

  • spark is the Spark session object.
  • input_args is a dict holding the arguments the user specified when running the job.
  • sysops holds the system options passed in; it is platform specific, and the job submitter may inject platform-specific objects into it.
  • Your main function's return value is returned by the job submitter to the caller.

Here is an example.
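As an illustrative sketch, a main with this signature could be as simple as the following. The body and the input_args keys are made up for illustration; only the signature comes from this document:

```python
# A minimal main() matching the documented signature.
# The body is illustrative; a real job would use the `spark` session
# (e.g. spark.read, spark.sql) to do actual work.
def main(spark, input_args, sysops={}):
    names = input_args.get("names", [])
    # The return value is passed back to the caller by the job submitter.
    return {"count": len(names)}

# Local illustration only: pass None for the session, since this
# sketch does not touch any Spark APIs.
result = main(None, {"names": ["alice", "bob"]})
```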

Job Deployer

For job deployers, please check the wiki.

Job Submitter

For job submitters, please check the wiki.

Tool: etl.py

Build an application

To build an application, run

./etl.py -a build --app-dir <app-dir> --build-dir <build-dir>
  • <app-dir> is the directory where your application is located.

  • <build-dir> is the directory where you want the build output to be placed.

    • Your build is actually located at <build-dir>/<version>, where <version> is specified in the application's manifest file.
  • The build is mostly platform independent. You can put platform-related packages in the file common_requirements.txt.
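The <build-dir>/<version> convention described above can be sketched as follows. This only mimics the documented output layout; it is not the actual etl.py implementation:

```python
# Sketch of where build output lands: <build-dir>/<version>,
# where <version> comes from the application's manifest.json.
import json
from pathlib import Path

def build_output_dir(app_dir, build_dir):
    """Return the directory where a build would be placed."""
    manifest = json.loads((Path(app_dir) / "manifest.json").read_text())
    return Path(build_dir) / manifest["version"]

# Example: an app whose manifest declares version 1.2.3
Path("demo_app").mkdir(exist_ok=True)
(Path("demo_app") / "manifest.json").write_text(json.dumps({"version": "1.2.3"}))
out = build_output_dir("demo_app", ".builds")
```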
