
DataHub Airflow plugin to capture executions and send to DataHub

Project description

DataHub Airflow Plugin

See the DataHub Airflow docs for details.

Version Compatibility

The plugin supports Apache Airflow versions 2.7+ and 3.1+.

| Airflow Version | Extra to Install | Status | Notes |
|-----------------|------------------|--------|-------|
| 2.7-2.10 | [airflow2] | ✅ Fully Supported | |
| 3.0.x | [airflow3] | ⚠️ Requires manual fix | Needs pydantic>=2.11.8 upgrade |
| 3.1+ | [airflow3] | ✅ Fully Supported | |

Note on Airflow 3.0.x: Airflow 3.0.6 pins pydantic==2.11.7, which contains a bug that prevents the DataHub plugin from importing correctly. This issue is resolved in Airflow 3.1.0+, which uses pydantic>=2.11.8. If you must use Airflow 3.0.6, you can manually upgrade pydantic to >=2.11.8, though this may conflict with Airflow's dependency constraints. We recommend upgrading to Airflow 3.1.0 or later.

Related issue: https://github.com/pydantic/pydantic/issues/10963
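If you are unsure whether your environment is affected, a quick runtime check of the installed pydantic version can tell you. This is a minimal illustrative sketch, not part of the plugin; the simplified comparison ignores pre-release suffixes like `rc1`:

```python
# Sanity check (illustrative): is the installed pydantic new enough for
# the DataHub plugin on Airflow 3.0.x?
from importlib.metadata import version, PackageNotFoundError

def pydantic_is_compatible(installed: str, minimum: str = "2.11.8") -> bool:
    """Compare dotted version strings numerically (simplified; no pre-releases)."""
    to_tuple = lambda v: tuple(int(p) for p in v.split("."))
    return to_tuple(installed) >= to_tuple(minimum)

try:
    print(pydantic_is_compatible(version("pydantic")))
except PackageNotFoundError:
    print("pydantic is not installed")
```

If this prints False on Airflow 3.0.6, apply the manual pydantic upgrade described above or move to Airflow 3.1.0+.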

Installation

The installation command varies depending on your Airflow version due to different OpenLineage dependencies.

For Airflow 2.x (2.7+)

pip install 'acryl-datahub-airflow-plugin[airflow2]'

This installs the plugin with Legacy OpenLineage (openlineage-airflow>=1.2.0), which is required for Airflow 2.x lineage extraction.

Alternative: Using Native OpenLineage Provider on Airflow 2.7+

If your Airflow 2.7+ environment rejects the Legacy OpenLineage package (e.g., due to dependency conflicts), you can use the native OpenLineage provider instead:

# Install the native Airflow provider first
pip install 'apache-airflow-providers-openlineage>=1.0.0'

# Then install the DataHub plugin without OpenLineage extras
pip install acryl-datahub-airflow-plugin

The plugin will automatically detect and use apache-airflow-providers-openlineage when available, providing the same functionality.

For Airflow 3.x (3.1+)

pip install 'acryl-datahub-airflow-plugin[airflow3]'

This installs the plugin with apache-airflow-providers-openlineage>=1.0.0, which is the native OpenLineage provider for Airflow 3.x.

Note: If using Airflow 3.0.x (3.0.6 specifically), you'll need to manually upgrade pydantic:

pip install 'acryl-datahub-airflow-plugin[airflow3]' 'pydantic>=2.11.8'

We recommend using Airflow 3.1.0+ which resolves this issue. See the Version Compatibility section above for details.

What Gets Installed

Base Installation (No Extras)

When you install without any extras:

pip install acryl-datahub-airflow-plugin

You get:

  • acryl-datahub[sql-parser,datahub-rest] - DataHub SDK with SQL parsing and REST emitter
  • pydantic>=2.4.0 - Required for data validation
  • apache-airflow>=2.5.0,<4.0.0 - Airflow itself
  • No OpenLineage package - You'll need to provide your own or use one of the extras below

With [airflow2] Extra

pip install 'acryl-datahub-airflow-plugin[airflow2]'

Adds:

  • openlineage-airflow>=1.2.0 - Standalone OpenLineage package for Airflow 2.x

With [airflow3] Extra

pip install 'acryl-datahub-airflow-plugin[airflow3]'

Adds:

  • apache-airflow-providers-openlineage>=1.0.0 - Native OpenLineage provider for Airflow 3.x

Additional Extras

You can combine multiple extras if needed:

# For Airflow 3.x with Kafka emitter support
pip install 'acryl-datahub-airflow-plugin[airflow3,datahub-kafka]'

# For Airflow 2.x with file emitter support
pip install 'acryl-datahub-airflow-plugin[airflow2,datahub-file]'

Available extras:

  • airflow2: OpenLineage support for Airflow 2.x (adds openlineage-airflow>=1.2.0)
  • airflow3: OpenLineage support for Airflow 3.x (adds apache-airflow-providers-openlineage>=1.0.0)
  • datahub-kafka: Kafka-based metadata emission (adds acryl-datahub[datahub-kafka])
  • datahub-file: File-based metadata emission (adds acryl-datahub[sync-file-emitter]) - useful for testing

Why Different Extras?

Airflow 2.x and 3.x have different OpenLineage integrations:

  • Airflow 2.x (2.5-2.6) typically uses Legacy OpenLineage (openlineage-airflow package)
  • Airflow 2.x (2.7+) can use either Legacy OpenLineage or native OpenLineage Provider (apache-airflow-providers-openlineage)
  • Airflow 3.x uses native OpenLineage Provider (apache-airflow-providers-openlineage)

The plugin automatically detects which OpenLineage variant is installed and uses it accordingly. This means:

  1. With extras ([airflow2] or [airflow3]): The appropriate OpenLineage dependency is installed automatically
  2. Without extras: You provide your own OpenLineage installation, and the plugin auto-detects it

This flexibility allows you to adapt to different Airflow environments and dependency constraints.

Configuration

The plugin can be configured via airflow.cfg under the [datahub] section. Below are the key configuration options:

Extractor Patching (OpenLineage Enhancements)

When enable_extractors=True (default), the DataHub plugin enhances OpenLineage extractors to provide better lineage. You can fine-tune these enhancements:

[datahub]
# Enable/disable all OpenLineage extractors
enable_extractors = True  # Default: True

# Fine-grained control over DataHub's OpenLineage enhancements

# --- SQL Parsing Configuration ---

# Enable multi-statement SQL parsing (resolves temp tables, merges lineage)
enable_multi_statement_sql_parsing = False  # Default: False

# --- Patches (work with both Legacy OpenLineage and OpenLineage Provider) ---

# Patch SqlExtractor to use DataHub's advanced SQL parser (enables column-level lineage)
patch_sql_parser = True  # Default: True

# Patch SnowflakeExtractor to fix default schema detection
patch_snowflake_schema = True  # Default: True

# --- Custom Extractors (only apply to Legacy OpenLineage) ---

# Use DataHub's custom AthenaOperatorExtractor (better Athena lineage)
extract_athena_operator = True  # Default: True

# Use DataHub's custom BigQueryInsertJobOperatorExtractor (handles BQ job configuration)
extract_bigquery_insert_job_operator = True  # Default: True

Multi-Statement SQL Parsing:

When enable_multi_statement_sql_parsing=True, if a task executes multiple SQL statements (e.g., CREATE TEMP TABLE ...; INSERT ... FROM temp_table;), DataHub parses all statements together and resolves temporary table dependencies within that task. By default (False), only the first statement is parsed.
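The difference can be shown with a toy example. This is only an illustration of statement splitting (DataHub's real SQL parser is far more sophisticated), and the table names are hypothetical:

```python
# Toy illustration of multi-statement SQL handling -- NOT DataHub's parser.
# Table names (raw.orders, analytics.daily_totals) are hypothetical.
sql = """
CREATE TEMP TABLE staging AS SELECT id, amount FROM raw.orders;
INSERT INTO analytics.daily_totals SELECT id, SUM(amount) FROM staging GROUP BY id;
"""

statements = [s.strip() for s in sql.split(";") if s.strip()]

# Default (enable_multi_statement_sql_parsing=False): only statements[0]
# is parsed, so lineage stops at the temp table.
# With the flag on: both statements are parsed together and `staging` is
# resolved, yielding raw.orders -> analytics.daily_totals lineage.
print(len(statements))
```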

How it works:

Patches (apply to both Legacy OpenLineage and OpenLineage Provider):

  • Apply monkey-patching to OpenLineage extractor/operator classes at runtime
  • Work on both Airflow 2.x and Airflow 3.x
  • When patch_sql_parser=True:
    • Airflow 2: Patches SqlExtractor.extract() method
    • Airflow 3: Patches SQLParser.generate_openlineage_metadata_from_sql() method
    • Provides: More accurate lineage extraction, column-level lineage (CLL), better SQL dialect support
  • When patch_snowflake_schema=True:
    • Airflow 2: Patches SnowflakeExtractor.default_schema property
    • Airflow 3: Currently not needed (handled by Airflow's native support)
    • Fixes Snowflake schema detection issues

Custom Extractors/Operator Patches:

  • Register DataHub's custom implementations for specific operators
  • Work on both Airflow 2.x and Airflow 3.x
  • extract_athena_operator:
    • Airflow 2 (Legacy OpenLineage only): Registers AthenaOperatorExtractor
    • Airflow 3: Patches AthenaOperator.get_openlineage_facets_on_complete()
    • Uses DataHub's SQL parser for better Athena lineage
  • extract_bigquery_insert_job_operator:
    • Airflow 2 (Legacy OpenLineage only): Registers BigQueryInsertJobOperatorExtractor
    • Airflow 3: Patches BigQueryInsertJobOperator.get_openlineage_facets_on_complete()
    • Handles BigQuery job configuration and destination tables
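The patching mechanism itself is ordinary Python monkey-patching: the original method is saved, wrapped, and replaced on the class at runtime. A minimal sketch of the pattern, using a dummy stand-in class rather than the real OpenLineage extractor:

```python
# Generic monkey-patching pattern, as used conceptually by the plugin.
# `SqlExtractor` here is a dummy stand-in, not the real OpenLineage class.
import functools

class SqlExtractor:                      # hypothetical stand-in
    def extract(self):
        return {"inputs": ["raw.orders"], "column_lineage": None}

def patch_extract(cls):
    original = cls.extract               # keep a reference to the original

    @functools.wraps(original)
    def patched(self):
        metadata = original(self)        # run the original extractor first
        metadata["column_lineage"] = []  # then enrich the result
        return metadata

    cls.extract = patched                # replace the method at runtime

patch_extract(SqlExtractor)
print(SqlExtractor().extract())
```

Because the original method is still called, the patch enriches rather than replaces the upstream behavior, which is why these patches compose with both OpenLineage variants.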

Example use cases:

Disable DataHub's SQL parser to use OpenLineage's native parsing:

[datahub]
enable_extractors = True
patch_sql_parser = False  # Use OpenLineage's native SQL parser
patch_snowflake_schema = True  # Still fix Snowflake schema detection

Disable custom Athena extractor (only relevant for Legacy OpenLineage):

[datahub]
enable_extractors = True
extract_athena_operator = False  # Use OpenLineage's default Athena extractor

Other Configuration Options

For a complete list of configuration options, see the DataHub Airflow documentation.

Developing

See the developing docs.


Download files

Source Distribution

acryl_datahub_airflow_plugin-1.4.0.5.tar.gz (80.2 kB)

Built Distribution

acryl_datahub_airflow_plugin-1.4.0.5-py3-none-any.whl (113.3 kB)

File details

Hashes for acryl_datahub_airflow_plugin-1.4.0.5.tar.gz:

| Algorithm | Hash digest |
|-----------|-------------|
| SHA256 | d68669a01f0f1d9d1896ac6f85eca5c54f0f3fe8e0389d6b2c1104bc8fb61e89 |
| MD5 | 48b92b368a35e28840a08c7ec4ab3bc6 |
| BLAKE2b-256 | ae0c09051f1af99d7cfac63e09b95cb3debfe9a6d874255b617c346bb963f173 |

Hashes for acryl_datahub_airflow_plugin-1.4.0.5-py3-none-any.whl:

| Algorithm | Hash digest |
|-----------|-------------|
| SHA256 | 287b81cd43588b15cc4f8251213fa3c98c0e6967802131324e4d451c2fda2f17 |
| MD5 | 517cbf5135f0b71a1059d3de96814b07 |
| BLAKE2b-256 | cd1a7f279d8e47ff9fbf3cdb481828bd664f0424ace6aa8809a9064b5a586ae2 |
