An opinionated framework for ETL built on top of Airflow

gusty

gusty makes it easier to manage your Airflow DAGs and tasks. Instead of writing your DAGs, tasks, and dependencies in a .py file, you specify DAGs and tasks in .yml files and declare each task's dependencies within that task's own .yml file.

In addition to parsing .yml files, gusty also parses YAML front matter in .ipynb and .Rmd files, allowing you to include Python and R notebook formats in your data pipeline.

Hello World

Tasks

Instead of importing and calling a BashOperator directly, you can specify the operator and its bash_command parameter (a required field for Airflow's BashOperator) in a .yml file:

operator: BashOperator
bash_command: echo hello world

gusty takes the above .yml and turns it into a task named after the file. If this file were called hello_world.yml, the resulting task would show up in your DAG as hello_world.

You can also set dependencies between tasks in .yml. Here is another task, goodbye_world, that depends on hello_world:

operator: BashOperator
bash_command: echo goodbye world
dependencies:
  - hello_world

This will automatically set the goodbye_world task downstream of the hello_world task.
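The two rules above (task name from file name, upstream tasks from the dependencies list) can be sketched in plain Python. This is only an illustration, not gusty's actual implementation: the task_files dictionary stands in for what a YAML loader would return, and the helper names task_id_from_file and dependency_edges are hypothetical.

```python
from pathlib import Path

# Hypothetical parsed contents of the two task files shown above
# (what a YAML loader would return); not gusty's internal representation.
task_files = {
    "hello_world.yml": {
        "operator": "BashOperator",
        "bash_command": "echo hello world",
    },
    "goodbye_world.yml": {
        "operator": "BashOperator",
        "bash_command": "echo goodbye world",
        "dependencies": ["hello_world"],
    },
}


def task_id_from_file(file_name: str) -> str:
    """Derive the task name from the file name by dropping the extension."""
    return Path(file_name).stem


def dependency_edges(files: dict) -> list:
    """Return (upstream, downstream) pairs implied by each `dependencies` list."""
    edges = []
    for file_name, spec in files.items():
        downstream = task_id_from_file(file_name)
        for upstream in spec.get("dependencies", []):
            edges.append((upstream, downstream))
    return edges


edges = dependency_edges(task_files)
print(edges)  # [('hello_world', 'goodbye_world')]
```

Each pair reads as "the first task must finish before the second starts", which is the relationship Airflow expresses as upstream and downstream.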

External dependencies can also be set using the format:

external_dependencies:
  - dag: task

To wait for an entire external DAG to run successfully, use dag: all instead.
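One way to read the external_dependencies format is as a list of single-key mappings from a DAG name to either a task name or the keyword all. The sketch below normalizes such a list into pairs. It is an assumption about how the entries could be interpreted, not gusty's code; the function name and the DAG and task names in the example are hypothetical.

```python
def parse_external_dependencies(entries):
    """Turn [{dag: task}, ...] into (dag_id, task_id) pairs.

    A task_id of None means "wait for the whole external DAG",
    which is how this sketch represents `dag: all`.
    """
    pairs = []
    for entry in entries:
        for dag_id, task_id in entry.items():
            pairs.append((dag_id, None if task_id == "all" else task_id))
    return pairs


# Hypothetical DAG and task names, as a YAML loader would return them.
external = [{"upstream_dag": "load_table"}, {"other_dag": "all"}]
print(parse_external_dependencies(external))
# [('upstream_dag', 'load_table'), ('other_dag', None)]
```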

DAGs

Your DAGs can also be represented as .yml files. Specifically, DAGs should be represented in a file called METADATA.yml. Similar to the basic Airflow tutorial, our DAG might look something like this:

description: "A Gusty version of the DAG described by this Airflow tutorial: https://airflow.apache.org/docs/stable/tutorial.html"
schedule_interval: "1 0 * * *"
default_args:
    owner: airflow
    depends_on_past: False
    start_date: !days_ago 1
    email: airflow@example.com
    email_on_failure: False
    email_on_retry: False
    retries: 1
    retry_delay: !timedelta 'minutes: 5'
#   queue: bash_queue
#   pool: backfill
#   priority_weight: 10
#   end_date: !datetime [2016, 1, 1]
#   wait_for_downstream: false
#   sla: !timedelta 'hours: 2'
#   trigger_rule: all_success

By default, gusty will create a latest_only DAG, in which every task runs only for the most recent run date, even when a backfill is requested. You can disable this behavior by adding latest_only: False to the default_args block above.
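For readers more used to the .py form of the Airflow tutorial, the default_args block above presumably corresponds to a dictionary like the one below. This is a standard-library sketch under two stated assumptions: that !days_ago n means midnight UTC n days before today, and that !timedelta 'minutes: 5' means timedelta(minutes=5). The helpers days_ago and parse_timedelta are written here for illustration and are not gusty's tag handlers.

```python
from datetime import datetime, timedelta, timezone


def days_ago(n: int) -> datetime:
    """Midnight UTC, n days before today (assumed meaning of `!days_ago n`)."""
    today = datetime.now(timezone.utc).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return today - timedelta(days=n)


def parse_timedelta(text: str) -> timedelta:
    """Convert a string such as 'minutes: 5' into timedelta(minutes=5)."""
    unit, value = text.split(":")
    return timedelta(**{unit.strip(): int(value)})


# Assumed plain-Python reading of the uncommented default_args entries.
default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": days_ago(1),
    "email": "airflow@example.com",
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 1,
    "retry_delay": parse_timedelta("minutes: 5"),
}

print(default_args["retry_delay"])  # 0:05:00
```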

GustyDAG

To have gusty generate a DAG, use the GustyDAG class, which only needs an absolute path to a directory containing a METADATA.yml for the DAG and .yml files for the tasks. You must also import airflow. The entire .py file that generates your DAG looks like this:

import airflow
from gusty import GustyDAG

dag = GustyDAG('/usr/local/airflow/dags/hello_world')

The resulting DAG will be named after the directory, in this case, hello_world.
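To make the expected directory layout concrete, the sketch below builds a throwaway hello_world directory and reports the DAG name and task names that the naming rules described here would imply. It does not call gusty or Airflow; describe_dag_directory is a hypothetical helper, and treating every .yml file other than METADATA.yml as a task is an assumption based on the description above.

```python
import tempfile
from pathlib import Path


def describe_dag_directory(dag_dir):
    """Return (dag_name, sorted task names) for a directory of .yml files."""
    dag_dir = Path(dag_dir)
    task_names = sorted(
        p.stem for p in dag_dir.glob("*.yml") if p.name != "METADATA.yml"
    )
    # The DAG is named after the directory itself.
    return dag_dir.name, task_names


with tempfile.TemporaryDirectory() as tmp:
    dag_dir = Path(tmp) / "hello_world"
    dag_dir.mkdir()
    (dag_dir / "METADATA.yml").write_text('schedule_interval: "1 0 * * *"\n')
    (dag_dir / "hello_world.yml").write_text(
        "operator: BashOperator\nbash_command: echo hello world\n"
    )
    (dag_dir / "goodbye_world.yml").write_text(
        "operator: BashOperator\nbash_command: echo goodbye world\n"
        "dependencies:\n  - hello_world\n"
    )
    dag_name, task_names = describe_dag_directory(dag_dir)

print(dag_name, task_names)  # hello_world ['goodbye_world', 'hello_world']
```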

Operators

Airflow Operators

gusty accepts a parameterized .yml file for any operator located in airflow.operators or airflow.contrib.operators. In theory, if an operator is available in these modules, you can define it with a .yml file.

Custom Operators

gusty will also work with any of your custom operators, so long as those operators are located in an operators directory in your designated AIRFLOW_HOME.

Demo

You can try a containerized demo of gusty and Airflow at gusty-demo.
