
Marquez integration with Airflow

Project description

marquez-airflow


A library that integrates Airflow DAGs with Marquez for automatic metadata collection.

Features

Metadata

  • Task lifecycle
  • Task parameters
  • Task runs linked to versioned code
  • Task inputs / outputs

Lineage

  • Track inter-DAG dependencies

Built-in

  • SQL parser
  • Link to code builder (ex: GitHub)
  • Metadata extractors

Status

This library is under active development and its API is evolving rapidly; we'd love your help!

Requirements

Installation

$ pip3 install marquez-airflow

Note: You can also add marquez-airflow to your requirements.txt for Airflow.
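For example, a requirements.txt entry pinned to the release documented here might look like this (the exact pin is up to you):

marquez-airflow==0.3.1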

To install from source, run:

$ python3 setup.py install

Configuration

The library depends on a backend. The backend is configurable and tells the library where to write dataset and job metadata.

Backends

  • http: Write metadata to Marquez
  • file: Write metadata to a file (as json) under /tmp/marquez

By default, the http backend will be used (see next section). To override the default backend and write metadata to a file, use MARQUEZ_BACKEND:

MARQUEZ_BACKEND=file

Note: Metadata will be written to /tmp/marquez/client.requests.log, but can be overridden with MARQUEZ_FILE.
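For example, to use the file backend and write metadata to a custom location (the path below is purely illustrative):

MARQUEZ_BACKEND=file
MARQUEZ_FILE=/var/log/airflow/marquez-metadata.log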

Pointing to your Marquez service

marquez-airflow needs to know where to reach the Marquez server API. You can configure this with environment variables read by your Airflow service.

You will also need to set the namespace if you are using something other than the default namespace.

MARQUEZ_BACKEND=http
MARQUEZ_URL=http://my_hosted_marquez.example.com:5000
MARQUEZ_NAMESPACE=my_special_ns

Extractors: Sending the correct data from your DAGs

If you do nothing, Marquez will receive the Job and the Run from your DAGs, but sources and datasets will not be sent.

marquez-airflow lets you go further by building "Extractors". The extractor API is still evolving, but an extractor essentially takes a task and extracts:

  1. Name : The name of the task
  2. Location : Location of the code for the task
  3. Inputs : List of input datasets
  4. Outputs : List of output datasets
  5. Context : The Airflow context for the task

It's important to understand that the inputs and outputs are lists and relate directly to the Dataset object in Marquez. Datasets also include a source, which relates directly to the Source object in Marquez.

A PostgresExtractor is currently in progress. When merged, it will serve as a good example of how to write custom extractors.
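Until then, here is a rough sketch of the shape an extractor takes. The class, method, and dataset names below are hypothetical and only mirror the five items listed above; consult the library source for the actual extractor interface.

# Hypothetical sketch only: the class name, method name, and dataset names are
# illustrative and do not reflect the library's actual extractor interface.
class MyTaskExtractor:
    def __init__(self, task):
        self.task = task  # the Airflow operator instance to inspect

    def extract(self):
        # Return the five pieces of metadata described above for this task.
        return {
            'name': self.task.task_id,                          # 1. Name of the task
            'location': 'https://github.com/my-org/my-dags',    # 2. Link to the task's code (illustrative URL)
            'inputs': ['food_delivery_db.top_delivery_times'],  # 3. Input datasets
            'outputs': ['food_delivery_db.popular_orders_day_of_week'],  # 4. Output datasets
            'context': {'sql': self.task.sql},                  # 5. Airflow context for the task
        }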

Usage

To begin collecting Airflow DAG metadata with Marquez, use:

- from airflow import DAG
+ from marquez_airflow import DAG

When enabled, the library will:

  1. On DAG start, collect metadata for each task using an Extractor (falling back to the library's default extractor if no task-specific one is defined)
  2. Collect task input / output metadata (source, schema, etc)
  3. Collect task run-level metadata (execution time, state, parameters, etc)
  4. On DAG complete, also mark the task as complete in Marquez

To enable logging, set the environment variable MARQUEZ_LOG_LEVEL to DEBUG, INFO, or ERROR:

$ export MARQUEZ_LOG_LEVEL='INFO'
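Putting the configuration together, the environment for your Airflow scheduler and webserver might look like this (the URL and namespace reuse the illustrative values above):

$ export MARQUEZ_BACKEND=http
$ export MARQUEZ_URL=http://my_hosted_marquez.example.com:5000
$ export MARQUEZ_NAMESPACE=my_special_ns
$ export MARQUEZ_LOG_LEVEL='INFO'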

Example

from datetime import datetime
from marquez_airflow import DAG
from airflow.operators.postgres_operator import PostgresOperator
from airflow.utils.dates import days_ago

default_args = {
    'owner': 'datascience',
    'depends_on_past': False,
    'start_date': days_ago(1),
    'email_on_failure': False,
    'email_on_retry': False,
    'email': ['datascience@datakin.com']
}

dag = DAG(
    'orders_popular_day_of_week',
    schedule_interval='@weekly',
    default_args=default_args,
    description='Determines the popular day of week orders are placed.'
)

t1 = PostgresOperator(
    task_id='if_not_exists',
    postgres_conn_id='food_delivery_db',
    sql='''
    CREATE TABLE IF NOT EXISTS popular_orders_day_of_week (
      order_day_of_week VARCHAR(64) NOT NULL,
      order_placed_on   TIMESTAMP NOT NULL,
      orders_placed     INTEGER NOT NULL
    );''',
    dag=dag
)

t2 = PostgresOperator(
    task_id='insert',
    postgres_conn_id='food_delivery_db',
    sql='''
    INSERT INTO popular_orders_day_of_week (order_day_of_week, order_placed_on, orders_placed)
      SELECT EXTRACT(ISODOW FROM order_placed_on) AS order_day_of_week,
             order_placed_on,
             COUNT(*) AS orders_placed
        FROM top_delivery_times
       GROUP BY order_placed_on;
    ''',
    dag=dag
)

t1 >> t2

Contributing

See CONTRIBUTING.md for more details about how to contribute.



Download files

Download the file for your platform.

Source Distribution

marquez-airflow-0.3.1.tar.gz (16.0 kB)

File details

Details for the file marquez-airflow-0.3.1.tar.gz.

File metadata

  • Download URL: marquez-airflow-0.3.1.tar.gz
  • Upload date:
  • Size: 16.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/50.3.0 requests-toolbelt/0.8.0 tqdm/4.49.0 CPython/3.6.12

File hashes

Hashes for marquez-airflow-0.3.1.tar.gz

  • SHA256: 94a378aef6dfe0ac5a3c9544f1bd3d48debb2de5da03519c577b3a42c3745417
  • MD5: 69147b5af80e3f19dd7984af53918d4f
  • BLAKE2b-256: 4c62e068a7af7a90571f13f21b2e2b7390a45f4e12d8ce4954a8c8aeec1c2815

