Skip to main content

A package for extracting dbt column lineage

Project description

DBT Column Lineage Extractor

DISCLAIMER

WARNING: This tool is currently in beta and has only been tested on a limited number of dbt projects using the snowflake dialect. It might not perform as expected in every situation. Please report any issues or suggestions in the Repository

Overview

The DBT Column Lineage Extractor is a lightweight Python-based tool for extracting and analyzing data column lineage for dbt projects. This tool utilizes the sqlglot library to parse and analyze SQL queries defined in your dbt models and maps their column lineage relationships.

GitHub Repository

dbt Column Lineage Extractor

Features

  • Extract column level lineage for specified model columns, including direct and recursive relationships.
  • Output results in a human-readable JSON format, which can be programmatically integrated for use cases such as data impact analysis, data tagging, etc.; or visualized with other tools.

Installation

pip installation

pip install dbt-column-lineage-extractor==0.1.2b2

Required Input Files

To run the DBT Column Lineage Extractor, you need the following files:

  • catalog.json: Provides the schema of the models, including names and types of the columns.
  • manifest.json: Offers model-level lineage information.

These files are generated by executing the command:

dbt docs generate

Important Notes

  • The dbt docs generate command does not parse your SQL syntax. Instead, it connects to the data warehouse to retrieve schema information.
  • Ensure that the relevant models are materialized in your dbt project as either tables or views for accurate schema information.
  • If the models aren't materialized in your development environment, you might use the --target flag to specify an alternative target environment with all models materialized (e.g., --target prod), given you have access to it.
  • After modifying the schemas, update the materialized models in your warehouse before running the dbt docs generate command.

Example Usage and Customization

The DBT Column Lineage Extractor can be used in two ways: via the command line interface or by integrating the Python scripts into your codebase.

cd examples

Option 1 - Command Line Interface

First, generate column lineage relationships to model's direct parents and children using the dbt_column_lineage_direct command, e.g.:

dbt_column_lineage_direct --manifest ./inputs/manifest.json --catalog ./inputs/catalog.json

Then analyze recursive column lineage relationships for a specific model and column using the dbt_column_lineage_recursive command, e.g.:

dbt_column_lineage_recursive --model model.jaffle_shop.stg_orders --column order_id

See more usage guides using dbt_column_lineage_direct -h and dbt_column_lineage_recursive -h.

Option 2 - Python Scripts

See the readme file in the examples directory for more detailed instructions on how to integrate the DBT Column Lineage Extractor into your python scripts.

Example Outputs
  • seed.jaffle_shop.raw_orders -- id

    • Structured Ancestors:
      {}
      
    • Structured Descendants:
      {
        "model.jaffle_shop.stg_orders": {
           "order_id": {
                 "+": {
                    "model.jaffle_shop.customers": {
                       "number_of_orders": {
                             "+": {}
                       }
                    },
                    "model.jaffle_shop.orders": {
                       "order_id": {
                             "+": {}
                       }
                    }
                 }
           }
        }
      }
      
  • model.jaffle_shop.stg_orders -- order_id

    • Structured Ancestors:
      {
        "seed.jaffle_shop.raw_orders": {
           "id": {
                 "+": {}
           }
        }
      }
      
    • Structured Descendants:
      {
        "model.jaffle_shop.customers": {
           "number_of_orders": {
                 "+": {}
           }
        },
        "model.jaffle_shop.orders": {
           "order_id": {
                 "+": {}
           }
        }
      }
      

Example Visualization

The structured JSON outputs can be used programmatically, or loaded into visualization tools like jsoncrack.com to visualize the column lineage relationships and dependencies. visualize

Limitations

  • Doesn’t support parse certain syntax, e.g. lateral flatten
  • Doesn’t support dbt python models
  • Only tested with snowflake dialect so far

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dbt_column_lineage_extractor-0.1.2b2.tar.gz (8.8 kB view details)

Uploaded Source

Built Distribution

File details

Details for the file dbt_column_lineage_extractor-0.1.2b2.tar.gz.

File metadata

File hashes

Hashes for dbt_column_lineage_extractor-0.1.2b2.tar.gz
Algorithm Hash digest
SHA256 5d6141b59fd464479959858908a247293475dfba9ca43257f9cfde762544f21d
MD5 d0d47df0640a7d3c93d8fd761a9464dd
BLAKE2b-256 4f9f240ff9ef6c9d6d613703749e7d5a696bc90d77f900ad9619d4aafe21a2b3

See more details on using hashes here.

File details

Details for the file dbt_column_lineage_extractor-0.1.2b2-py3-none-any.whl.

File metadata

File hashes

Hashes for dbt_column_lineage_extractor-0.1.2b2-py3-none-any.whl
Algorithm Hash digest
SHA256 2ec69e2245e7160125117fd8bff224c0e34c53dfa08c1d48ad9c3e14f5210b8c
MD5 c2a4ce39bcbab71db0044702a576f3f2
BLAKE2b-256 b2b93d37d17c0ba3bd4eec072269ec9a172a280c71eed19e153facc87a233bd5

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page