
An opinionated framework for ETL built on top of Airflow

Project description

gusty

gusty allows you to manage your Airflow DAGs and tasks with greater ease. Instead of writing your DAGs, tasks, and dependencies in a .py file, you can specify DAGs and tasks in .yml files, and designate a task's dependencies within its .yml file as well.

In addition to parsing .yml files, gusty also parses YAML front matter in .ipynb and .Rmd files, allowing you to include Python and R notebook formats in your data pipeline.
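As a hedged sketch of what this might look like, the YAML front matter of an .Rmd notebook could carry the same fields as a task .yml (the operator name local.RmdOperator here is a placeholder for whatever local operator renders your notebooks, not a guaranteed part of gusty's API):

```yaml
---
operator: local.RmdOperator
dependencies:
  - hello_world
---
```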

gusty works with both Airflow 1.x and Airflow 2.x.

Hello World

Tasks

Instead of importing and calling a BashOperator directly, you can specify the operator and its bash_command parameter (a required field for Airflow's BashOperator) in a .yml:

operator: airflow.operators.bash.BashOperator
bash_command: echo hello world

gusty takes the above .yml and turns it into a task based on its file name. If this file was called hello_world.yml, the resulting task would show up in your DAG as hello_world.

You can also set dependencies between tasks in .yml. Here is another task, goodbye_world, which depends on hello_world:

operator: airflow.operators.bash.BashOperator
bash_command: echo goodbye world
dependencies:
  - hello_world

This will automatically set the goodbye_world task downstream of the hello_world task.

External dependencies on tasks in other DAGs can also be set using the format:

external_dependencies:
  - dag: task

To wait for an entire external DAG to run successfully, use dag: all instead.
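For instance, a task could wait on a specific task in one DAG and on another DAG in its entirety (the DAG and task names below are placeholders):

```yaml
external_dependencies:
  - upstream_dag: extract_task
  - other_dag: all
```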

DAGs

Your DAGs can also be represented as .yml files. Specifically, each DAG is described by a file called METADATA.yml. Similar to the basic Airflow tutorial, our DAG might look something like this:

description: "A Gusty version of the DAG described by this Airflow tutorial: https://airflow.apache.org/docs/stable/tutorial.html"
schedule_interval: "1 0 * * *"
default_args:
    owner: airflow
    depends_on_past: False
    start_date: !days_ago 1
    email: airflow@example.com
    email_on_failure: False
    email_on_retry: False
    retries: 1
    retry_delay: !timedelta 'minutes: 5'
#   queue: bash_queue
#   pool: backfill
#   priority_weight: 10
#   end_date: !datetime [2016, 1, 1]
#   wait_for_downstream: false
#   sla: !timedelta 'hours: 2'
#   trigger_rule: all_success

By default, gusty will create a latest_only DAG, where every task in the DAG will only run for the most recent run date, regardless of whether a backfill is called. You can disable this behavior by adding latest_only: False to the default_args block above.
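Following that note, disabling the latest_only behavior means adding one line to the default_args block of METADATA.yml:

```yaml
default_args:
    latest_only: False
```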

GustyDAG

To have gusty generate a DAG, use the GustyDAG class, which needs only an absolute path to a directory containing a METADATA.yml for the DAG and .yml files for its tasks. You must also import airflow. The entire .py file that generates your DAG looks like this:

import airflow
from gusty import GustyDAG

dag = GustyDAG('/usr/local/airflow/dags/hello_world')

The resulting DAG will be named after the directory, in this case, hello_world.

Operators

Airflow Operators

gusty will take a parameterized .yml for any operator located in airflow.operators or airflow.contrib.operators. In theory, if an operator is available in these modules, you can define it with a .yml.

Custom Operators

gusty will also work with any of your custom operators, so long as those operators are located in an operators directory in your designated AIRFLOW_HOME.

In order for your local operators to import properly, they must follow the convention of a snake_case file name containing a CamelCase operator class; for example, an operator called YourOperator must live in a file called your_operator.py.
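This file-naming convention can be sketched with a small stdlib helper (illustrative only; gusty does not expose such a function):

```python
import re

def operator_filename(class_name: str) -> str:
    """Map a CamelCase operator class name to its expected snake_case filename."""
    snake = re.sub(r"(?<!^)(?=[A-Z])", "_", class_name).lower()
    return snake + ".py"

print(operator_filename("YourOperator"))  # your_operator.py
```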

Just as the BashOperator above was accessed via its full module path, airflow.operators.bash.BashOperator, your local operators are accessed via the local keyword, e.g. local.YourOperator.
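A task .yml for a local operator would then look like the following, where your_param stands in for whatever parameters YourOperator actually accepts:

```yaml
operator: local.YourOperator
your_param: some_value
dependencies:
  - hello_world
```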

Demo

You can use a containerized demo of gusty and Airflow over at the gusty-demo, and see all of the above in practice.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

gusty-0.2.0.tar.gz (6.8 kB)

Uploaded Source

Built Distribution

gusty-0.2.0-py3-none-any.whl (12.7 kB)

Uploaded Python 3

File details

Details for the file gusty-0.2.0.tar.gz.

File metadata

  • Download URL: gusty-0.2.0.tar.gz
  • Upload date:
  • Size: 6.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.22.0 setuptools/45.2.0 requests-toolbelt/0.9.1 tqdm/4.49.0 CPython/3.8.5

File hashes

Hashes for gusty-0.2.0.tar.gz

  • SHA256: 52d3535c788231fffaee139802c43a7225ce8402952675d9caea43cd3edb3e37
  • MD5: 9addfd9d51de5eacbe741d1f6d4c6d37
  • BLAKE2b-256: f2068e78827fbb12ab528b08371cd13d6f61e6ce7b0c1ca63234654e2b53f4ae

See more details on using hashes here.

File details

Details for the file gusty-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: gusty-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 12.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.22.0 setuptools/45.2.0 requests-toolbelt/0.9.1 tqdm/4.49.0 CPython/3.8.5

File hashes

Hashes for gusty-0.2.0-py3-none-any.whl

  • SHA256: 1bb390abd1895bfd4490ac39e397346c23732fdf2783fb373cfdd297f82264d9
  • MD5: 938503933b99d8aad906aa0cf70b7a57
  • BLAKE2b-256: 9aa9dacf031b39ccbf611d04d6faef0c8d00062f25703ae2f74f3c07d9cb0a78

See more details on using hashes here.
