DataHub Airflow plugin to capture executions and send them to DataHub

DataHub Airflow Plugin

See the DataHub Airflow docs for details.

Version Compatibility

The plugin supports Apache Airflow versions 2.7+ and 3.1+.

| Airflow Version | Extra to Install | Status | Notes |
| --- | --- | --- | --- |
| 2.7-2.10 | [airflow2] | ✅ Fully Supported | |
| 3.0.x | [airflow3] | ⚠️ Requires manual fix | Needs pydantic>=2.11.8 upgrade |
| 3.1+ | [airflow3] | ✅ Fully Supported | |

Note on Airflow 3.0.x: Airflow 3.0.6 pins pydantic==2.11.7, which contains a bug that prevents the DataHub plugin from importing correctly. The issue is resolved in Airflow 3.1.0+, which uses pydantic>=2.11.8. If you must stay on Airflow 3.0.6, you can manually upgrade pydantic to >=2.11.8, but this may conflict with Airflow's dependency constraints. We recommend upgrading to Airflow 3.1.0 or later.

Related issue: https://github.com/pydantic/pydantic/issues/10963
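
The compatibility table can also be read as a small lookup. The following is a minimal sketch, not part of the plugin: the names InstallAdvice and install_advice are hypothetical, and the function only restates the rows above (it treats "2.7+" as any 2.x minor version of 7 or higher and expects at least a "major.minor" version string).

```python
from typing import NamedTuple, Optional


class InstallAdvice(NamedTuple):
    """Hypothetical container: which extra to install and a short status note."""
    extra: Optional[str]  # extra name from the table, or None if not covered
    note: str


def parse_major_minor(version: str) -> tuple[int, int]:
    """Return (major, minor) from a version string such as '2.10.4'."""
    major, minor, *_ = version.split(".")
    return int(major), int(minor)


def install_advice(airflow_version: str) -> InstallAdvice:
    """Map an Airflow version onto the rows of the compatibility table."""
    major, minor = parse_major_minor(airflow_version)
    if major == 2 and minor >= 7:
        return InstallAdvice("airflow2", "fully supported")
    if major == 3 and minor == 0:
        # Airflow 3.0.x needs the manual pydantic upgrade described above.
        return InstallAdvice("airflow3", "requires pydantic>=2.11.8")
    if major == 3 and minor >= 1:
        return InstallAdvice("airflow3", "fully supported")
    return InstallAdvice(None, "not covered by the compatibility table")


if __name__ == "__main__":
    for v in ("2.9.3", "3.0.6", "3.1.0"):
        print(v, install_advice(v))
```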

Installation

The installation command varies depending on your Airflow version due to different OpenLineage dependencies.

For Airflow 2.x (2.7+)

pip install 'acryl-datahub-airflow-plugin[airflow2]'

This installs the plugin with Legacy OpenLineage (openlineage-airflow>=1.2.0), which is required for Airflow 2.x lineage extraction.

Alternative: Using Native OpenLineage Provider on Airflow 2.7+

If your Airflow 2.7+ environment rejects the Legacy OpenLineage package (e.g., due to dependency conflicts), you can use the native OpenLineage provider instead:

# Install the native Airflow provider first
pip install 'apache-airflow-providers-openlineage>=1.0.0'

# Then install the DataHub plugin without OpenLineage extras
pip install acryl-datahub-airflow-plugin

The plugin will automatically detect and use apache-airflow-providers-openlineage when available, providing the same functionality.

For Airflow 3.x (3.1+)

pip install 'acryl-datahub-airflow-plugin[airflow3]'

This installs the plugin with apache-airflow-providers-openlineage>=1.0.0, which is the native OpenLineage provider for Airflow 3.x.

Note: If using Airflow 3.0.x (3.0.6 specifically), you'll need to manually upgrade pydantic:

pip install 'acryl-datahub-airflow-plugin[airflow3]' 'pydantic>=2.11.8'

We recommend using Airflow 3.1.0+, which resolves this issue. See the Version Compatibility section above for details.

What Gets Installed

Base Installation (No Extras)

When you install without any extras:

pip install acryl-datahub-airflow-plugin

You get:

  • acryl-datahub[sql-parser,datahub-rest] - DataHub SDK with SQL parsing and REST emitter
  • pydantic>=2.4.0 - Required for data validation
  • apache-airflow>=2.5.0,<4.0.0 - Airflow itself
  • No OpenLineage package - You'll need to provide your own or use one of the extras below

With [airflow2] Extra

pip install 'acryl-datahub-airflow-plugin[airflow2]'

Adds:

  • openlineage-airflow>=1.2.0 - Standalone OpenLineage package for Airflow 2.x

With [airflow3] Extra

pip install 'acryl-datahub-airflow-plugin[airflow3]'

Adds:

  • apache-airflow-providers-openlineage>=1.0.0 - Native OpenLineage provider for Airflow 3.x

Additional Extras

You can combine multiple extras if needed:

# For Airflow 3.x with Kafka emitter support
pip install 'acryl-datahub-airflow-plugin[airflow3,datahub-kafka]'

# For Airflow 2.x with file emitter support
pip install 'acryl-datahub-airflow-plugin[airflow2,datahub-file]'

Available extras:

  • airflow2: OpenLineage support for Airflow 2.x (adds openlineage-airflow>=1.2.0)
  • airflow3: OpenLineage support for Airflow 3.x (adds apache-airflow-providers-openlineage>=1.0.0)
  • datahub-kafka: Kafka-based metadata emission (adds acryl-datahub[datahub-kafka])
  • datahub-file: File-based metadata emission (adds acryl-datahub[sync-file-emitter]) - useful for testing
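
To check which of the packages named in this section actually ended up in an environment, a short script along these lines may help. It is a sketch that uses only the standard library's importlib.metadata; the PACKAGES list and the installed_versions helper are illustrative names, not something the plugin ships.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

# Distribution names taken from the lists above.
PACKAGES = [
    "acryl-datahub-airflow-plugin",
    "acryl-datahub",
    "pydantic",
    "apache-airflow",
    "openlineage-airflow",
    "apache-airflow-providers-openlineage",
]


def installed_versions(names: list[str] = PACKAGES) -> dict[str, Optional[str]]:
    """Return {distribution name: installed version, or None if absent}."""
    found: dict[str, Optional[str]] = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

Run it with the same Python interpreter that runs Airflow, so that it inspects the right environment.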

Why Different Extras?

Airflow 2.x and 3.x have different OpenLineage integrations:

  • Airflow 2.x (2.5-2.6) typically uses Legacy OpenLineage (openlineage-airflow package)
  • Airflow 2.x (2.7+) can use either Legacy OpenLineage or native OpenLineage Provider (apache-airflow-providers-openlineage)
  • Airflow 3.x uses native OpenLineage Provider (apache-airflow-providers-openlineage)

The plugin automatically detects which OpenLineage variant is installed and uses it accordingly. This means:

  1. With extras ([airflow2] or [airflow3]): The appropriate OpenLineage dependency is installed automatically
  2. Without extras: You provide your own OpenLineage installation, and the plugin auto-detects it

This flexibility allows you to adapt to different Airflow environments and dependency constraints.
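
The auto-detection described above can be approximated from the outside. The sketch below is not the plugin's actual detection code: it assumes the native provider is importable as airflow.providers.openlineage and the legacy package as openlineage.airflow, and it assumes the provider is preferred when both are present. The function name and the injectable find_spec parameter are hypothetical conveniences for testing.

```python
import importlib.util
from typing import Callable, Optional

# Import paths of the two OpenLineage variants (assumed, see lead-in).
NATIVE_PROVIDER_MODULE = "airflow.providers.openlineage"
LEGACY_MODULE = "openlineage.airflow"


def _is_importable(module_name: str, find_spec: Callable) -> bool:
    """True if the module can be located, False if it or a parent is missing."""
    try:
        # Note: locating a dotted name imports its parent packages.
        return find_spec(module_name) is not None
    except (ImportError, ValueError):
        return False


def detect_openlineage_variant(
    find_spec: Callable = importlib.util.find_spec,
) -> Optional[str]:
    """Return 'provider', 'legacy', or None if neither variant is installed."""
    if _is_importable(NATIVE_PROVIDER_MODULE, find_spec):
        return "provider"  # assumed preference when both are installed
    if _is_importable(LEGACY_MODULE, find_spec):
        return "legacy"
    return None


if __name__ == "__main__":
    print("OpenLineage variant:", detect_openlineage_variant())
```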

Configuration

The plugin can be configured via airflow.cfg under the [datahub] section. Below are the key configuration options:

Extractor Patching (OpenLineage Enhancements)

When enable_extractors=True (default), the DataHub plugin enhances OpenLineage extractors to provide better lineage. You can fine-tune these enhancements:

[datahub]
# Enable/disable all OpenLineage extractors
enable_extractors = True  # Default: True

# Fine-grained control over DataHub's OpenLineage enhancements

# --- SQL Parsing Configuration ---

# Enable multi-statement SQL parsing (resolves temp tables, merges lineage)
enable_multi_statement_sql_parsing = False  # Default: False

# --- Patches (work with both Legacy OpenLineage and OpenLineage Provider) ---

# Patch SqlExtractor to use DataHub's advanced SQL parser (enables column-level lineage)
patch_sql_parser = True  # Default: True

# Patch SnowflakeExtractor to fix default schema detection
patch_snowflake_schema = True  # Default: True

# --- Custom Extractors (Legacy OpenLineage extractors on Airflow 2; operator patches on Airflow 3) ---

# Use DataHub's custom AthenaOperatorExtractor (better Athena lineage)
extract_athena_operator = True  # Default: True

# Use DataHub's custom BigQueryInsertJobOperatorExtractor (handles BQ job configuration)
extract_bigquery_insert_job_operator = True  # Default: True

Multi-Statement SQL Parsing:

When enable_multi_statement_sql_parsing=True, if a task executes multiple SQL statements (e.g., CREATE TEMP TABLE ...; INSERT ... FROM temp_table;), DataHub parses all statements together and resolves temporary table dependencies within that task. By default (False), only the first statement is parsed.
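
The difference between the two modes can be illustrated with a toy model. This is not DataHub's SQL parser: StatementLineage and task_lineage are hypothetical names, the per-statement inputs and outputs are supplied by hand rather than parsed from SQL, and how the real parser reports a temp table in first-statement-only mode is an assumption of this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class StatementLineage:
    """Hand-written lineage for one SQL statement (toy model)."""
    inputs: set[str]
    outputs: set[str]
    temp_tables: set[str] = field(default_factory=set)  # temp tables it creates


def task_lineage(
    statements: list[StatementLineage], multi_statement: bool
) -> tuple[set[str], set[str]]:
    """Return (upstream tables, downstream tables) for a whole task."""
    if not statements:
        return set(), set()
    if not multi_statement:
        # Default behaviour in this sketch: only the first statement counts.
        first = statements[0]
        return set(first.inputs), set(first.outputs)
    temps: set[str] = set()
    inputs: set[str] = set()
    outputs: set[str] = set()
    for stmt in statements:
        temps |= stmt.temp_tables
        inputs |= stmt.inputs
        outputs |= stmt.outputs
    # Temp tables are internal to the task, so drop them from both sides.
    return inputs - temps, outputs - temps


# CREATE TEMP TABLE tmp_orders AS SELECT ... FROM raw.orders;
create_temp = StatementLineage({"raw.orders"}, {"tmp_orders"}, {"tmp_orders"})
# INSERT INTO analytics.daily_orders SELECT ... FROM tmp_orders;
insert_final = StatementLineage({"tmp_orders"}, {"analytics.daily_orders"})

if __name__ == "__main__":
    print(task_lineage([create_temp, insert_final], multi_statement=True))
    print(task_lineage([create_temp, insert_final], multi_statement=False))
```

With multi-statement mode on, the toy model links raw.orders directly to analytics.daily_orders; with it off, the lineage stops at the temp table.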

How it works:

Patches (apply to both Legacy OpenLineage and OpenLineage Provider):

  • Apply monkey-patching to OpenLineage extractor/operator classes at runtime
  • Work on both Airflow 2.x and Airflow 3.x
  • When patch_sql_parser=True:
    • Airflow 2: Patches SqlExtractor.extract() method
    • Airflow 3: Patches SQLParser.generate_openlineage_metadata_from_sql() method
    • Provides: More accurate lineage extraction, column-level lineage (CLL), better SQL dialect support
  • When patch_snowflake_schema=True:
    • Airflow 2: Patches SnowflakeExtractor.default_schema property
    • Airflow 3: Currently not needed (handled by Airflow's native support)
    • Fixes Snowflake schema detection issues

Custom Extractors/Operator Patches:

  • Register DataHub's custom implementations for specific operators
  • Work on both Airflow 2.x and Airflow 3.x
  • extract_athena_operator:
    • Airflow 2 (Legacy OpenLineage only): Registers AthenaOperatorExtractor
    • Airflow 3: Patches AthenaOperator.get_openlineage_facets_on_complete()
    • Uses DataHub's SQL parser for better Athena lineage
  • extract_bigquery_insert_job_operator:
    • Airflow 2 (Legacy OpenLineage only): Registers BigQueryInsertJobOperatorExtractor
    • Airflow 3: Patches BigQueryInsertJobOperator.get_openlineage_facets_on_complete()
    • Handles BigQuery job configuration and destination tables
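
For readers unfamiliar with runtime monkey-patching, the general mechanism looks roughly like the sketch below. It uses a stand-in class rather than any real OpenLineage or Airflow class; SqlExtractorStandIn, apply_sql_parser_patch, and the returned dictionary keys are all hypothetical, and the plugin's real patches are more involved.

```python
import functools


class SqlExtractorStandIn:
    """Hypothetical stand-in for an extractor class that gets patched."""

    def extract(self, sql: str) -> dict:
        return {"parser": "native", "sql": sql}


def apply_sql_parser_patch(cls: type, enabled: bool) -> None:
    """Replace cls.extract with an enhanced wrapper when the flag is on."""
    if not enabled:
        return  # flag off: leave the class untouched
    original = cls.extract
    if getattr(original, "_patched", False):
        return  # already patched: keep the operation idempotent

    @functools.wraps(original)
    def patched_extract(self, sql: str) -> dict:
        result = original(self, sql)
        # Stand-in for "use a better parser and add column-level lineage".
        result["parser"] = "enhanced"
        result["column_lineage"] = True
        return result

    patched_extract._patched = True
    cls.extract = patched_extract


if __name__ == "__main__":
    apply_sql_parser_patch(SqlExtractorStandIn, enabled=True)
    print(SqlExtractorStandIn().extract("SELECT 1"))
```

Because the patch replaces a method on the class itself, every instance created before or after the call picks up the new behaviour, which is why such patches are applied once at plugin load time.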

Example use cases:

Disable DataHub's SQL parser to use OpenLineage's native parsing:

[datahub]
enable_extractors = True
patch_sql_parser = False  # Use OpenLineage's native SQL parser
patch_snowflake_schema = True  # Still fix Snowflake schema detection

Disable custom Athena extractor (only relevant for Legacy OpenLineage):

[datahub]
enable_extractors = True
extract_athena_operator = False  # Use OpenLineage's default Athena extractor
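
To see the effective value of every flag for a snippet like the ones above, the settings can be parsed offline. This is a sketch using Python's standard configparser rather than Airflow's own configuration loader; the DEFAULTS mapping restates the defaults listed in this section, and read_datahub_flags is a hypothetical helper name.

```python
import configparser

# Defaults as documented in the configuration listing above.
DEFAULTS = {
    "enable_extractors": True,
    "enable_multi_statement_sql_parsing": False,
    "patch_sql_parser": True,
    "patch_snowflake_schema": True,
    "extract_athena_operator": True,
    "extract_bigquery_insert_job_operator": True,
}


def read_datahub_flags(cfg_text: str) -> dict[str, bool]:
    """Parse the [datahub] section and fill in defaults for missing keys."""
    # inline_comment_prefixes lets '# ...' after a value be ignored.
    parser = configparser.ConfigParser(inline_comment_prefixes=("#",))
    parser.read_string(cfg_text)
    return {
        key: parser.getboolean("datahub", key, fallback=default)
        for key, default in DEFAULTS.items()
    }


EXAMPLE = """
[datahub]
enable_extractors = True
patch_sql_parser = False  # Use OpenLineage's native SQL parser
patch_snowflake_schema = True  # Still fix Snowflake schema detection
"""

if __name__ == "__main__":
    for key, value in read_datahub_flags(EXAMPLE).items():
        print(f"{key} = {value}")
```

Inside a running Airflow process, the equivalent lookup would normally go through Airflow's own configuration object, for example conf.getboolean("datahub", "patch_sql_parser", fallback=True) from airflow.configuration.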

Other Configuration Options

For a complete list of configuration options, see the DataHub Airflow documentation.
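
Airflow in general lets any airflow.cfg option be overridden by an environment variable named AIRFLOW__{SECTION}__{KEY}. Assuming the plugin reads its [datahub] options through Airflow's standard configuration mechanism (this README does not say so explicitly), the variable names for the options above can be derived as in this sketch; airflow_env_var is a hypothetical helper name.

```python
def airflow_env_var(section: str, key: str) -> str:
    """Build the environment variable name Airflow uses for a config option."""
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


# Option names from the [datahub] section described above.
DATAHUB_OPTIONS = [
    "enable_extractors",
    "enable_multi_statement_sql_parsing",
    "patch_sql_parser",
    "patch_snowflake_schema",
    "extract_athena_operator",
    "extract_bigquery_insert_job_operator",
]

if __name__ == "__main__":
    for option in DATAHUB_OPTIONS:
        print(airflow_env_var("datahub", option))
```

This can be convenient in containerized deployments where editing airflow.cfg is impractical.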

Developing

See the developing docs.
