An opinionated framework for ETL built on top of Airflow

gusty

gusty allows you to manage your Airflow DAGs and tasks with greater ease. Instead of writing your DAGs, tasks, and dependencies in a .py file, you specify DAGs and tasks in .yml files, and you declare each task's dependencies inside that task's own .yml file.

In addition to parsing .yml files, gusty also parses YAML front matter in .ipynb and .Rmd files, allowing you to include Python and R notebook formats in your data pipeline.
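
As a rough illustration of what "front matter" means here, the sketch below splits a text file into its leading `---` delimited block and the remaining body. It is not gusty's parser: the function name `split_front_matter` and the operator name `RmdOperator` are hypothetical, and it only covers text formats such as .Rmd (an .ipynb file is JSON, so its front matter would have to be read from a cell instead).

```python
def split_front_matter(text):
    """Return (front_matter, body) for text that starts with a '---' block.

    Returns (None, text) when no complete front matter block is found.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None, text  # no front matter present
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            front = "\n".join(lines[1:i])
            body = "\n".join(lines[i + 1:])
            return front, body
    return None, text  # opening delimiter without a closing one


# Hypothetical .Rmd content; "RmdOperator" is an assumed operator name.
rmd = """---
operator: RmdOperator
dependencies:
  - hello_world
---
Some R Markdown content.
"""

front, body = split_front_matter(rmd)
print(front)
```

The extracted `front` string holds the same kind of YAML a standalone task .yml file would contain, so it could then be handed to any YAML loader.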

Hello World

Tasks

Instead of importing and calling a BashOperator directly, you can specify the operator and its bash_command parameter (a required field for Airflow's BashOperator) in a .yml file:

operator: BashOperator
bash_command: echo hello world

gusty takes the above .yml and turns it into a task named after the file. If this file were called hello_world.yml, the resulting task would show up in your DAG as hello_world.
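
The naming rule can be pictured as taking the file name without its extension. This is a minimal sketch of that idea, not gusty's own code; `task_id_from_file` is a hypothetical helper.

```python
from pathlib import Path


def task_id_from_file(path):
    """Derive a task id from a task file path by dropping the directory and extension."""
    return Path(path).stem


print(task_id_from_file("/usr/local/airflow/dags/hello_world/hello_world.yml"))
# prints "hello_world"
```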

You can also set dependencies between tasks in .yml. Here is another task, goodbye_world, that depends on hello_world:

operator: BashOperator
bash_command: echo goodbye world
dependencies:
  - hello_world

This will automatically set the goodbye_world task downstream of the hello_world task.
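
To make the relationship concrete, the sketch below turns the two task definitions above (written as plain Python dicts instead of parsed YAML) into a list of (upstream, downstream) pairs. It illustrates the idea only and is not how gusty is implemented; `dependency_edges` is a hypothetical helper. In Airflow terms, each pair corresponds to `upstream >> downstream`.

```python
# The two task definitions from above, as the dicts a YAML loader would produce.
task_specs = {
    "hello_world": {
        "operator": "BashOperator",
        "bash_command": "echo hello world",
    },
    "goodbye_world": {
        "operator": "BashOperator",
        "bash_command": "echo goodbye world",
        "dependencies": ["hello_world"],
    },
}


def dependency_edges(specs):
    """Return (upstream, downstream) pairs implied by each task's 'dependencies' list."""
    edges = []
    for task_id, spec in specs.items():
        for upstream in spec.get("dependencies", []):
            if upstream not in specs:
                raise KeyError(f"{task_id} depends on unknown task {upstream}")
            edges.append((upstream, task_id))
    return edges


edges = dependency_edges(task_specs)
print(edges)  # [('hello_world', 'goodbye_world')]
```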

Dependencies on tasks in other DAGs (external dependencies) can also be set, using one dag: task pair per entry:

external_dependencies:
  - dag: task

To wait for an entire external DAG to run successfully, use all in place of the task name, that is, dag: all.
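
One way to read that format is as a list of single-key mappings from a DAG id to either a task id or the keyword all. The sketch below normalizes such a list into (dag_id, task_id) pairs, using None for "the whole DAG". The DAG and task names are made up, `parse_external_dependencies` is a hypothetical helper, and whether gusty represents these entries this way internally is an assumption. The None convention is borrowed from Airflow's ExternalTaskSensor, where an `external_task_id` of None means waiting on the entire DAG.

```python
def parse_external_dependencies(entries):
    """Normalize [{dag_id: task_id_or_all}, ...] into (dag_id, task_id_or_None) pairs."""
    parsed = []
    for entry in entries:
        for dag_id, task_id in entry.items():
            # "all" means: wait for the whole external DAG, not a single task.
            parsed.append((dag_id, None if task_id == "all" else task_id))
    return parsed


# Hypothetical entries, shaped like a parsed external_dependencies block.
external_dependencies = [
    {"upstream_dag": "load_table"},
    {"other_dag": "all"},
]

parsed = parse_external_dependencies(external_dependencies)
print(parsed)  # [('upstream_dag', 'load_table'), ('other_dag', None)]
```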

DAGs

Your DAGs can also be represented as .yml files. Specifically, DAGs should be represented in a file called METADATA.yml. Similar to the basic Airflow tutorial, our DAG might look something like this:

description: "A Gusty version of the DAG described by this Airflow tutorial: https://airflow.apache.org/docs/stable/tutorial.html"
schedule_interval: "1 0 * * *"
default_args:
    owner: airflow
    depends_on_past: False
    start_date: !days_ago 1
    email: airflow@example.com
    email_on_failure: False
    email_on_retry: False
    retries: 1
    retry_delay: !timedelta 'minutes: 5'
#   queue: bash_queue
#   pool: backfill
#   priority_weight: 10
#   end_date: !datetime [2016, 1, 1]
#   wait_for_downstream: false
#   sla: !timedelta 'hours: 2'
#   trigger_rule: all_success

By default, gusty will create a latest_only DAG, in which every task runs only for the most recent run date, even when a backfill is requested. You can disable this behavior by adding latest_only: False to the default_args block above.
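
The METADATA.yml above relies on custom YAML tags such as !timedelta and !days_ago. The sketch below shows, in plain Python, the values those tags plausibly stand for. It is a guess at the semantics and not gusty's tag implementation: `parse_timedelta` and `days_ago` are hypothetical helpers, the comma-separated multi-unit form is an assumption, and `days_ago` is modeled on the Airflow utility of the same name (midnight UTC, n days back).

```python
from datetime import datetime, timedelta, timezone


def parse_timedelta(arg):
    """Turn a string like 'minutes: 5' into timedelta(minutes=5).

    Assumption: several units may be given, separated by commas,
    e.g. 'hours: 2, minutes: 30'.
    """
    kwargs = {}
    for part in arg.split(","):
        unit, value = part.split(":")
        kwargs[unit.strip()] = float(value)
    return timedelta(**kwargs)


def days_ago(n):
    """Return midnight UTC, n days before today."""
    today = datetime.now(timezone.utc).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return today - timedelta(days=n)


retry_delay = parse_timedelta("minutes: 5")  # retry_delay: !timedelta 'minutes: 5'
start_date = days_ago(1)                     # start_date: !days_ago 1
print(retry_delay)  # 0:05:00
```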

GustyDAG

To have gusty generate a DAG, use the GustyDAG class. It takes the absolute path to a directory that contains a METADATA.yml for the DAG and the .yml files for its tasks. You must also import airflow. An example of the entire .py file that generates your DAG looks like this:

import airflow
from gusty import GustyDAG

dag = GustyDAG('/usr/local/airflow/dags/hello_world')

The resulting DAG will be named after the directory, in this case, hello_world.
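
The directory layout this implies can be checked with a few lines of standard-library Python. The sketch below builds a throwaway hello_world directory, confirms it contains a METADATA.yml, and reports the DAG name and task names that would follow from the naming rules described here. It does not call gusty; `describe_dag_dir` and `TASK_SUFFIXES` are hypothetical names.

```python
import tempfile
from pathlib import Path

# File types this README says gusty parses for task definitions.
TASK_SUFFIXES = {".yml", ".ipynb", ".Rmd"}


def describe_dag_dir(dag_dir):
    """Return (dag_id, task_ids) implied by a DAG directory's name and contents."""
    dag_dir = Path(dag_dir)
    if not (dag_dir / "METADATA.yml").is_file():
        raise FileNotFoundError(f"no METADATA.yml in {dag_dir}")
    task_ids = sorted(
        p.stem
        for p in dag_dir.iterdir()
        if p.is_file() and p.suffix in TASK_SUFFIXES and p.name != "METADATA.yml"
    )
    return dag_dir.name, task_ids


# Build a temporary directory that mimics the hello_world example.
with tempfile.TemporaryDirectory() as tmp:
    example_dir = Path(tmp) / "hello_world"
    example_dir.mkdir()
    for name in ("METADATA.yml", "hello_world.yml", "goodbye_world.yml"):
        (example_dir / name).touch()
    dag_id, task_ids = describe_dag_dir(example_dir)

print(dag_id, task_ids)  # hello_world ['goodbye_world', 'hello_world']
```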

Operators

Airflow Operators

gusty accepts a parameterized .yml for any operator located in airflow.operators or airflow.contrib.operators. In theory, if an operator is available in one of these modules, you can define a task for it in a .yml file.

Custom Operators

gusty will also work with any of your custom operators, so long as those operators are located in an operators directory in your designated AIRFLOW_HOME.
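
Resolving an operator name from a .yml file to a class amounts to searching a list of modules for an attribute with that name. The sketch below shows that general pattern with `importlib`, demonstrated on standard-library modules so it runs without Airflow installed. It is an illustration of the lookup idea only; `find_class` is a hypothetical helper and gusty's actual search logic may differ.

```python
import importlib


def find_class(name, module_names):
    """Return the first class called `name` found in the given modules, in order."""
    for module_name in module_names:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # skip modules that are not installed
        candidate = getattr(module, name, None)
        if isinstance(candidate, type):
            return candidate
    raise LookupError(f"no class named {name!r} in {module_names}")


# Stand-in demo: with Airflow installed, the module list would instead name
# operator modules, and `name` would be the operator from the task's .yml file.
found = find_class("timedelta", ["not_a_real_module_xyz", "json", "datetime"])
print(found.__name__)  # timedelta
```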

Demo

You can try a containerized demo of gusty and Airflow over at gusty-demo.
