New maintainers needed: please open an issue to get started.

OpenLineage Dagster Integration

A library that integrates Dagster with OpenLineage for automatic metadata collection. It provides an OpenLineage sensor: a Dagster sensor that tails Dagster event logs to collect metadata. On each sensor evaluation, the sensor processes a batch of event log records, converts the Dagster events into OpenLineage events, and emits them to an OpenLineage backend.
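Conceptually, each sensor evaluation looks something like the sketch below. The function and record field names here are illustrative only, not the library's actual API:

```python
# Illustrative sketch of one sensor evaluation, NOT the library's actual API:
# process a batch of event log records in storage-id order, convert each to an
# OpenLineage-style event, emit it, and advance the cursor.

def to_openlineage_event(record):
    # Map a Dagster event log record to a minimal OpenLineage-style event dict
    return {
        "eventType": record["dagster_event_type"],
        "job": {"name": record["job_name"]},
    }

def process_batch(event_records, emit):
    last_storage_id = None
    for record in sorted(event_records, key=lambda r: r["storage_id"]):
        emit(to_openlineage_event(record))
        last_storage_id = record["storage_id"]  # the cursor tracks the last processed id
    return last_storage_id
```

The returned storage ID plays the role of the sensor cursor described below: the next evaluation resumes after it.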

Features

Metadata

  • Dagster job & op lifecycle

Requirements

Installation

$ python -m pip install openlineage-dagster

Usage

OpenLineage Sensor & Event Log Storage Requirements

Single OpenLineage sensor per Dagster instance
Because the sensor processes all event logs for a given Dagster instance, define and enable only a single OpenLineage sensor per instance. Running multiple sensors would emit duplicate OpenLineage job runs for the same Dagster steps, each with a different OpenLineage run ID, since run IDs are generated dynamically during sensor evaluations.

Non-sharded Event Log Storage
For the sensor to handle all event logs across runs, use non-sharded event log storage. If event log storage sharded by run (e.g., the default SqliteEventLogStorage) is used, the cursor that tracks the last processed event storage ID may not update properly.
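For example, Dagster's Postgres-backed event log storage is non-sharded. A dagster.yaml fragment along these lines (the connection details are placeholders for your own environment) satisfies this requirement:

```yaml
# dagster.yaml -- non-sharded event log storage backed by Postgres
event_log_storage:
  module: dagster_postgres.event_log
  class: PostgresEventLogStorage
  config:
    postgres_db:
      username: { env: DAGSTER_PG_USERNAME }
      password: { env: DAGSTER_PG_PASSWORD }
      hostname: { env: DAGSTER_PG_HOSTNAME }
      db_name: dagster
```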

OpenLineage Sensor Setup

Get the OpenLineage sensor definition from the openlineage_sensor() factory function and add it to your Dagster repository.

from dagster import repository
from openlineage.dagster.sensor import openlineage_sensor


@repository
def my_repository():
    openlineage_sensor_def = openlineage_sensor()
    # other_defs: your repository's existing jobs, schedules, and sensors
    return other_defs + [openlineage_sensor_def]

Since parallel sensor runs are not supported at the time of writing, some tuning may be necessary to avoid affecting the performance of other sensors.

See Dagster's documentation on Evaluation Interval for more detail on minimum_interval_seconds, which defaults to 30 seconds. record_filter_limit is the maximum number of event log records to process on each sensor evaluation; it defaults to 30 records per evaluation. The default values can be overridden as below.

@repository
def my_repository():
    openlineage_sensor_def = openlineage_sensor(
        minimum_interval_seconds=60,
        record_filter_limit=60,
    )
    return other_defs + [openlineage_sensor_def]

The OpenLineage sensor handles event logs in ascending order of storage ID and, by default, starts with the first log. Optionally, after_storage_id can be specified to customize the starting point. This is only applicable when the cursor is undefined or has been deleted.

@repository
def my_repository():
    openlineage_sensor_def = openlineage_sensor(
        after_storage_id=100
    )
    return other_defs + [openlineage_sensor_def]

OpenLineage Adapter & Client Configuration

The sensor uses the OpenLineage adapter and client to convert data and push it to an OpenLineage backend; both depend on the following environment variables.

If using User Repository Deployments, add the variables to the repository deployment where the sensor is defined. Otherwise, add them to the Dagster Daemon.

  • OPENLINEAGE_URL - the URL of the service that will consume OpenLineage events
  • OPENLINEAGE_API_KEY - set if the consumer of OpenLineage events requires a Bearer authentication key
  • OPENLINEAGE_NAMESPACE - set to override the default namespace used when the Dagster repository is undefined
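For example, the variables can be set as below; the URL, key, and namespace are placeholders for your own backend's values:

```shell
# Placeholder values -- substitute your own backend endpoint and credentials
export OPENLINEAGE_URL="http://localhost:5000"       # e.g. a Marquez instance
export OPENLINEAGE_API_KEY="my-api-key"              # only if Bearer auth is required
export OPENLINEAGE_NAMESPACE="my-default-namespace"  # fallback namespace
```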

OpenLineage Namespace & Dagster Repository

For Dagster jobs organized in repositories, Dagster keeps track of the repository name for each pipeline run. When the repository name is present, it is always used as the OpenLineage namespace. The OPENLINEAGE_NAMESPACE variable provides a static fallback value for runs without a repository name.
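The resolution order can be summarized with a small sketch; resolve_namespace is a hypothetical helper for illustration, not part of the library:

```python
import os

def resolve_namespace(repository_name=None):
    # A repository name attached to the pipeline run always wins
    if repository_name:
        return repository_name
    # Otherwise fall back to OPENLINEAGE_NAMESPACE, then a generic default
    return os.getenv("OPENLINEAGE_NAMESPACE", "default")
```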

Development

To install all dependencies for local development:

$ python -m pip install -e .[dev]  # or python -m pip install -e .\[dev\] in zsh 

To run test suite:

$ pytest
