Marquez integration with Airflow

marquez-airflow

A library that integrates Airflow DAGs with Marquez for automatic metadata collection.

Features

Metadata

  • Task lifecycle
  • Task parameters
  • Task runs linked to versioned code
  • Task inputs / outputs

Lineage

  • Track inter-DAG dependencies

Built-in

  • SQL parser
  • Link to code builder (ex: GitHub)
  • Metadata extractors

Requirements

Installation

$ pip3 install marquez-airflow

Note: You can also add marquez-airflow to your requirements.txt for Airflow.

To install from source, run:

$ python3 setup.py install

Configuration

The library requires a backend, which determines where dataset, job, and run metadata is written. The backend is configurable.

Backends

  • HTTP: Write metadata to Marquez
  • FILE: Write metadata to a file (as JSON) under /tmp/marquez
  • LOG: Log metadata to the console

The HTTP backend is used by default (see the following sections for its configuration). To override the default backend and write metadata to a file, use MARQUEZ_BACKEND:

MARQUEZ_BACKEND=FILE

Note: Metadata will be written to /tmp/marquez/client.requests.log, but the location can be overridden with MARQUEZ_FILE.
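The selection rules above can be sketched in a few lines of Python. This is only an illustration of the documented behaviour (HTTP by default, FILE writing to /tmp/marquez/client.requests.log unless MARQUEZ_FILE is set), not the library's own code; the function name resolve_backend and the upper-casing of the value are assumptions made for this example.

```python
import os

# Backend names and the default file path are taken from this README.
SUPPORTED_BACKENDS = ("HTTP", "FILE", "LOG")
DEFAULT_FILE = "/tmp/marquez/client.requests.log"


def resolve_backend(env=None):
    """Return (backend, file_path); file_path is None unless backend is FILE.

    Hypothetical helper for illustration; not part of marquez-airflow.
    """
    env = os.environ if env is None else env
    # Upper-casing is this sketch's choice, so "file" and "FILE" both work here.
    backend = env.get("MARQUEZ_BACKEND", "HTTP").upper()
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(f"Unsupported MARQUEZ_BACKEND: {backend!r}")
    if backend == "FILE":
        return backend, env.get("MARQUEZ_FILE", DEFAULT_FILE)
    return backend, None


if __name__ == "__main__":
    print(resolve_backend({"MARQUEZ_BACKEND": "FILE"}))
```

Passing a plain dict instead of os.environ keeps the sketch easy to try without changing the shell environment.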

HTTP Backend Authentication

The HTTP backend supports using API keys to authenticate requests via Bearer auth. To include a key when making an API request, use MARQUEZ_API_KEY:

MARQUEZ_BACKEND=HTTP
MARQUEZ_API_KEY=[YOUR_API_KEY]
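For readers unfamiliar with Bearer auth, the sketch below shows the request header that such a key typically produces. The helper name auth_headers is hypothetical and the code is not taken from the library; it only illustrates the standard "Authorization: Bearer <key>" header format.

```python
import os


def auth_headers(env=None):
    """Build HTTP headers carrying the API key as a Bearer token.

    Hypothetical helper for illustration; returns no headers when
    MARQUEZ_API_KEY is unset or empty.
    """
    env = os.environ if env is None else env
    api_key = env.get("MARQUEZ_API_KEY")
    if not api_key:
        return {}
    return {"Authorization": f"Bearer {api_key}"}


if __name__ == "__main__":
    # Placeholder key, used only to show the header shape.
    print(auth_headers({"MARQUEZ_API_KEY": "example-key"}))
```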

HTTP Backend Environment Variables

marquez-airflow needs to know where to reach the Marquez server API. Set the URL with an environment variable that your Airflow service can read.

If you use a namespace other than the default one, set the namespace as well.

MARQUEZ_BACKEND=HTTP
MARQUEZ_URL=http://my_hosted_marquez.example.com:5000
MARQUEZ_NAMESPACE=my_special_ns
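A quick way to catch a mistyped URL before starting Airflow is to validate these variables up front. The sketch below is an assumption-laden example, not library code: HttpSettings and load_http_settings are hypothetical names, the fallback namespace name "default" is assumed, and the sketch insists on MARQUEZ_URL being set even though the library may supply its own default.

```python
import os
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class HttpSettings:
    """Hypothetical container for the two HTTP backend settings."""
    url: str
    namespace: str


def load_http_settings(env=None, default_namespace="default"):
    """Read and validate MARQUEZ_URL and MARQUEZ_NAMESPACE.

    default_namespace is an assumed value for this sketch.
    """
    env = os.environ if env is None else env
    url = env.get("MARQUEZ_URL", "")
    parsed = urlparse(url)
    # Require an absolute http(s) URL with a host part.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"MARQUEZ_URL is not a valid http(s) URL: {url!r}")
    namespace = env.get("MARQUEZ_NAMESPACE", default_namespace)
    return HttpSettings(url=url.rstrip("/"), namespace=namespace)


if __name__ == "__main__":
    print(load_http_settings({
        "MARQUEZ_URL": "http://my_hosted_marquez.example.com:5000",
        "MARQUEZ_NAMESPACE": "my_special_ns",
    }))
```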

Extractors: Sending the correct data from your DAGs

Without any extra setup, Marquez receives the Job and the Run from your DAGs, but no sources or datasets.

marquez-airflow lets you send more than that by building "Extractors". The extractor interface is still changing, but in essence an extractor takes a task and extracts:

  1. Name: The name of the task
  2. Location: Location of the code for the task
  3. Inputs: List of input datasets
  4. Outputs: List of output datasets
  5. Context: The Airflow context for the task

Inputs and outputs are lists, and each item corresponds directly to a Dataset object in Marquez. Each dataset also includes a source, which corresponds directly to a Source object in Marquez.
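To make the shape of that data concrete, here is a minimal sketch of the five extracted fields as plain Python dataclasses. The class names (Source, Dataset, TaskMetadata), their attributes, and the sample values are all hypothetical and chosen for this example; they are not the library's extractor API, which the text above notes is still changing.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Source:
    """Stand-in for a Marquez Source (attributes are assumptions)."""
    name: str
    connection_url: str


@dataclass
class Dataset:
    """Stand-in for a Marquez Dataset; every dataset carries its source."""
    name: str
    source: Source


@dataclass
class TaskMetadata:
    """The five pieces of information an extractor produces for a task."""
    name: str                                               # task name
    location: str                                           # where the task's code lives
    inputs: List[Dataset] = field(default_factory=list)     # input datasets
    outputs: List[Dataset] = field(default_factory=list)    # output datasets
    context: Dict[str, Any] = field(default_factory=dict)   # Airflow context


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    warehouse = Source(name="warehouse", connection_url="postgresql://db.example.com/analytics")
    meta = TaskMetadata(
        name="daily_orders.load",
        location="https://github.com/example/dags/blob/main/daily_orders.py",
        inputs=[Dataset("public.orders_raw", warehouse)],
        outputs=[Dataset("public.orders_daily", warehouse)],
    )
    print([d.name for d in meta.inputs], [d.name for d in meta.outputs])
```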

Usage

To begin collecting Airflow DAG metadata with Marquez, replace the Airflow DAG import in your DAG files with the one from marquez_airflow:

- from airflow import DAG
+ from marquez_airflow import DAG

When enabled, the library will:

  1. On DAG start, collect metadata for each task using an Extractor (the library falls back to its default extractor when none is provided)
  2. Collect task input / output metadata (source, schema, etc.)
  3. Collect task run-level metadata (execution time, state, parameters, etc.)
  4. On DAG completion, mark the task as complete in Marquez

To enable logging, set the environment variable MARQUEZ_LOG_LEVEL to DEBUG, INFO, or ERROR:

$ export MARQUEZ_LOG_LEVEL=INFO
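If you want your own DAG helper code to follow the same setting, the variable can be mapped onto Python's standard logging levels. This is a sketch under assumptions: configure_logging and the logger name are hypothetical, and leaving the logger untouched when the variable is unset is this example's choice, not documented library behaviour.

```python
import logging
import os

# The three values this README lists for MARQUEZ_LOG_LEVEL.
ALLOWED_LEVELS = ("DEBUG", "INFO", "ERROR")


def configure_logging(env=None, logger_name="marquez_airflow_example"):
    """Apply MARQUEZ_LOG_LEVEL to a named logger and return it.

    Hypothetical helper; the logger is returned unchanged when the
    variable is not set.
    """
    env = os.environ if env is None else env
    logger = logging.getLogger(logger_name)
    level_name = env.get("MARQUEZ_LOG_LEVEL")
    if level_name is None:
        return logger
    if level_name not in ALLOWED_LEVELS:
        raise ValueError(f"MARQUEZ_LOG_LEVEL must be one of {ALLOWED_LEVELS}")
    # logging.DEBUG, logging.INFO and logging.ERROR are module-level constants.
    logger.setLevel(getattr(logging, level_name))
    return logger


if __name__ == "__main__":
    log = configure_logging({"MARQUEZ_LOG_LEVEL": "INFO"})
    print(logging.getLevelName(log.level))
```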

Development

To install all dependencies for local development:

$ pip3 install -e ".[dev]"

To run the entire test suite, you'll first want to initialize the Airflow database:

$ airflow initdb

Then, run the test suite with:

$ pytest
