
Marquez integration with Airflow

marquez-airflow

A library that integrates Airflow DAGs with Marquez for automatic metadata collection.

Features

Metadata

  • Task lifecycle
  • Task parameters
  • Task runs linked to versioned code
  • Task inputs / outputs

Lineage

  • Track inter-DAG dependencies

Built-in

  • SQL parser
  • Link to code builder (ex: GitHub)
  • Metadata extractors

Requirements

Installation

$ pip3 install marquez-airflow

Note: You can also add marquez-airflow to your requirements.txt for Airflow.

To install from source, run:

$ python3 setup.py install

Configuration

The library writes metadata through a configurable backend. The backend determines where dataset, job, and run metadata is written.

Backends

  • HTTP: Write metadata to Marquez
  • FILE: Write metadata to a file (as JSON) under /tmp/marquez
  • LOG: Log metadata to the console

By default, the HTTP backend is used (see the following sections for its configuration). To override the default backend and write metadata to a file, set MARQUEZ_BACKEND:

MARQUEZ_BACKEND=FILE

Note: Metadata will be written to /tmp/marquez/client.requests.log, but the location can be overridden with MARQUEZ_FILE.
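The FILE backend behaviour described above can be approximated in a few lines. This is a minimal sketch, not the library's implementation: the helper name `write_metadata` and the one-JSON-object-per-line layout are assumptions made for illustration. Only the default path and the MARQUEZ_FILE override come from the text above.

```python
import json
import os

# Default location documented above; MARQUEZ_FILE overrides it.
DEFAULT_FILE = "/tmp/marquez/client.requests.log"


def write_metadata(record, env=None):
    """Append one metadata record as a JSON line (hypothetical helper).

    `env` defaults to os.environ; passing a dict makes the function easy to test.
    Returns the path that was written to.
    """
    env = os.environ if env is None else env
    path = env.get("MARQUEZ_FILE", DEFAULT_FILE)
    # Create the parent directory on first use.
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")
    return path
```

Calling `write_metadata({"job": "example"})` with no override would append to the default file, while setting MARQUEZ_FILE redirects the output elsewhere.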

HTTP Backend Authentication

The HTTP backend supports using API keys to authenticate requests via Bearer auth. To include a key when making an API request, use MARQUEZ_API_KEY:

MARQUEZ_BACKEND=HTTP
MARQUEZ_API_KEY=[YOUR_API_KEY]
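Bearer auth means the key travels in the `Authorization` header of each request. The sketch below shows that header being attached with the Python standard library. It is not the library's own client code: `build_request`, the example URL, and the example key are placeholders chosen for illustration.

```python
import urllib.request


def build_request(url, api_key=None):
    """Build (but do not send) a GET request, adding Bearer auth if a key is given."""
    req = urllib.request.Request(url)
    if api_key:
        # Bearer scheme: "Authorization: Bearer <key>"
        req.add_header("Authorization", f"Bearer {api_key}")
    return req


# Placeholder URL and key, for illustration only.
req = build_request("http://localhost:5000/api/v1/namespaces", api_key="example-key")
print(req.get_header("Authorization"))  # Bearer example-key
```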

HTTP Backend Environment Variables

marquez-airflow needs to know where to reach the Marquez server API. Set the server URL with an environment variable that your Airflow service reads.

If you are using a namespace other than the default, set the namespace as well.

MARQUEZ_BACKEND=HTTP
MARQUEZ_URL=http://my_hosted_marquez.example.com:5000
MARQUEZ_NAMESPACE=my_special_ns
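To make the interplay of these variables concrete, here is a minimal sketch of how they could be read into one settings object. It is an illustration under stated assumptions, not the library's code: `MarquezSettings` and `load_settings` are hypothetical names, and the literal namespace name `default` is an assumption. The HTTP default for the backend is taken from the Backends section.

```python
import os
from typing import Mapping, NamedTuple, Optional


class MarquezSettings(NamedTuple):
    """Hypothetical container for the environment variables described above."""
    backend: str
    url: Optional[str]
    namespace: str
    api_key: Optional[str]


def load_settings(env: Optional[Mapping[str, str]] = None) -> MarquezSettings:
    """Read the MARQUEZ_* variables, applying the documented defaults."""
    env = os.environ if env is None else env
    return MarquezSettings(
        # HTTP is the documented default backend.
        backend=env.get("MARQUEZ_BACKEND", "HTTP").upper(),
        # No default URL is assumed here; None means "not configured".
        url=env.get("MARQUEZ_URL"),
        # Assumption: the default namespace is literally named "default".
        namespace=env.get("MARQUEZ_NAMESPACE", "default"),
        api_key=env.get("MARQUEZ_API_KEY"),
    )
```

With the three variables from the example above set, `load_settings()` would return a settings object whose `backend` is `HTTP`, whose `url` is the hosted Marquez address, and whose `namespace` is `my_special_ns`.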

Extractors: Sending the correct data from your DAGs

If you do nothing, Marquez will receive the Job and the Run from your DAGs, but sources and datasets will not be sent.

marquez-airflow lets you send more than that by building "Extractors". The extractor interface is still changing, but an extractor takes a task and extracts:

  1. Name: The name of the task
  2. Location: Location of the code for the task
  3. Inputs: List of input datasets
  4. Outputs: List of output datasets
  5. Context: The Airflow context for the task

Inputs and outputs are lists, and each item corresponds directly to the Dataset object in Marquez. Each dataset also includes a source, which corresponds directly to the Source object in Marquez.
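The shape of what an extractor produces can be pictured as a small set of data classes. The sketch below is purely illustrative and assumes Python 3.7 or later: the class names `Source`, `Dataset`, and `TaskMetadata` and all of their field names are hypothetical, chosen to mirror the five items listed above, and are not the library's actual extractor API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Source:
    """Hypothetical stand-in for the Source object in Marquez."""
    name: str


@dataclass
class Dataset:
    """Hypothetical stand-in for the Dataset object; every dataset has a source."""
    name: str
    source: Source


@dataclass
class TaskMetadata:
    """Hypothetical result of running an extractor on one task."""
    name: str                                                 # 1. name of the task
    location: Optional[str] = None                            # 2. location of the task's code
    inputs: List[Dataset] = field(default_factory=list)       # 3. input datasets
    outputs: List[Dataset] = field(default_factory=list)      # 4. output datasets
    context: Dict[str, Any] = field(default_factory=dict)     # 5. Airflow context


# Example: a task that reads one table and writes another in the same source.
warehouse = Source(name="warehouse")
meta = TaskMetadata(
    name="orders_daily.aggregate",
    inputs=[Dataset(name="public.orders", source=warehouse)],
    outputs=[Dataset(name="public.orders_daily", source=warehouse)],
)
print([d.name for d in meta.inputs], [d.name for d in meta.outputs])
```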

Usage

To begin collecting Airflow DAG metadata with Marquez, use:

- from airflow import DAG
+ from marquez_airflow import DAG

When enabled, the library will:

  1. On DAG start, collect metadata for each task using an Extractor (the library falls back to a default extractor when no other applies)
  2. Collect task input / output metadata (source, schema, etc.)
  3. Collect task run-level metadata (execution time, state, parameters, etc.)
  4. On DAG completion, mark the task as complete in Marquez

To enable logging, set the environment variable MARQUEZ_LOG_LEVEL to DEBUG, INFO, or ERROR:

$ export MARQUEZ_LOG_LEVEL=INFO

Development

To install all dependencies for local development:

$ pip3 install -e .[dev]

To run the entire test suite, you'll first want to initialize the Airflow database:

$ airflow initdb

Then, run the test suite with:

$ pytest
