
Generic ETL Pipeline Framework for Apache Spark


Overview

Goal

Many public clouds provide managed Apache Spark as a service, such as Databricks, AWS EMR, and Oracle OCI Data Flow; see the table below for a detailed list.

However, the ways to deploy and launch a Spark application are incompatible across these cloud Spark platforms.

With spark-etl, you can deploy and launch your Spark application in a standard way on any of them.

Benefit

An application built with spark-etl can be deployed and launched on different Spark providers without changing the source code. Please check out the demos in the tables below.

Application

An application is a Python program. It contains:

  • A main.py file, which contains the application entry point.
  • A manifest.json file, which specifies the metadata of the application.
  • A requirements.txt file, which specifies the application's dependencies.
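
As a sketch, the application directory's manifest.json might look like the snippet below. The fields shown (display_name and version) are hypothetical placeholders, not the confirmed spark-etl schema; consult the examples below for the real format.

```json
{
    "display_name": "myapp",
    "version": "0.0.1"
}
```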

Application entry signature

In your application's main.py, you should have a main function with the following signature:

def main(spark, input_args, sysops={}):
    # your code here

  • spark is the Spark session object.
  • input_args is a dict of the arguments the user specified when running the application.
  • sysops is the set of system options passed in; it is platform specific, and the job submitter may inject platform-specific objects into it.
  • Your main function's return value should be a JSON object; it is returned from the job submitter to the caller.
Here is an application example.
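
A minimal sketch of such an entry point is shown below. To keep it runnable without a cluster it only echoes the run arguments back; the commented-out read and the input_path argument name are illustrative assumptions, not part of the spark-etl API.

```python
import json


def main(spark, input_args, sysops={}):
    # `spark` is the Spark session handed in by the platform. A real job
    # would use it to read and transform data, for example:
    #   df = spark.read.parquet(input_args["input_path"])  # hypothetical arg
    # This sketch only echoes the argument names back to the caller.
    result = {"status": "ok", "args_received": sorted(input_args)}
    # The return value must be a JSON object; serializing it here fails
    # fast if anything non-serializable slipped in.
    json.dumps(result)
    return result
```

Because the return value travels back through the job submitter, keeping it a small JSON-serializable dict (rather than, say, a DataFrame) is the safe choice.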

Build your application

etl -a build -c <config-filename> -p <application-name>

For details, please check out the examples below.

Deploy your application

etl -a deploy -c <config-filename> -p <application-name> -f <profile-name>

For details, please check out the examples below.

Run your application

etl -a run -c <config-filename> -p <application-name> -f <profile-name> --run-args <input-filename>

For details, please check out the examples below.
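
Putting the three commands together, a typical workflow might look like the sketch below. All names (config.json, myapp, the profile name local, input.json) are hypothetical placeholders; substitute your own config file, application name, profile, and run-arguments file.

```shell
# Build the application package from its source directory
etl -a build  -c config.json -p myapp

# Deploy the built artifact to the platform described by the "local" profile
etl -a deploy -c config.json -p myapp -f local

# Launch the deployed application, passing run arguments from input.json
etl -a run    -c config.json -p myapp -f local --run-args input.json
```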

Supported platforms

  • You set up your own Apache Spark cluster.
  • You use the PySpark package; fully compatible with other Spark platforms, it lets you test your pipeline on a single computer.
    • Demo: Access Data on local filesystem
    • Demo: Access Data on AWS S3
  • You host your Spark cluster in Databricks.
  • You host your Spark cluster in Amazon AWS EMR.
    • Demo: Access Data on AWS S3
  • You host your Spark cluster in Google Cloud.
  • You host your Spark cluster in Microsoft Azure HDInsight.
  • You host your Spark cluster in Oracle Cloud Infrastructure, Data Flow Service.
  • You host your Spark cluster in IBM Cloud.

APIs

pydocs for APIs

Job Deployer

For job deployers, please check the wiki.

Job Submitter

For job submitters, please check the wiki.

