Skip to main content

A project to build and visualize data lineage from SQL written in python. It supports column level lineage and can combine with metadata retriever for better result.

Project description


SQL Data Lineage

Introducing SQL Data Lineage, a powerful package designed to simplify SQL query analysis. This versatile tool parses data lineage from individual SQL queries or builds comprehensive lineage from multiple queries. It offers both an interactive command-line interface and programmatic integration, making it easy to incorporate into your Python projects.

SQL Data Lineage performs detailed column-level analysis, tracing data flows step-by-step through tables, CTEs, subqueries, and more. It generates user-friendly lineage graphs that clearly show how columns move and transform across SQL components.

You can easily enhance your lineage insights by retrieving and customizing metadata to fit your specific requirements.

We welcome and encourage contributions to the SQL Data Lineage project!

Installation

pip install sql-datalineage

Usage

CLI usage

Show help of CLI commands.

datalineage --help

Generate data lineage of a sql file, output type is mermaid.

$> datalineage -i docs/example/test-query.sql --schema-path docs/example/test-schema.json -r mermaid

%%{init: {"flowchart": {"defaultRenderer": "elk"}} }%%
graph LR
subgraph 2420861448752 ["Table: catalog.schema1.customer AS customer"]
2420861447168["id"]
2420861446976["name"]
2420861448464["phone"]
2420860590112["address"]
2420861446304["location"]
end

subgraph 2420861448224 ["CTE: cte1 AS t1"]
2420861448848["id"]
2420861449040["name"]
2420861448272["phone"]
2420861449184["address"]
end
2420861447168 --> 2420861448848
2420861446976 --> 2420861449040
2420861448464 --> 2420861448272
2420860590112 --> 2420861449184

....

[!TIP] The output of the above command is truncated. You can optionally save the command's result to a file using the -o option.

datalineage -i /docs/example/test-query.sql --schema-path docs/example/test-schema.json -o docs/example/output.mermaid -r mermaid

[!NOTE] Currently, the results are generated dynamically, so each run will produce different IDs. However, rest assured that the lineage remains deterministic.

You can preview the above result using Mermaid Live Editor, here is the result:

readme_example

Interactive usage

You can import datalineage into your project and generate the lineage tree directly.

>>> from datalineage.lineage import lineage
>>> sql = """select  
  id, name, phone, address
  from (
    select id, name, phone, address,
    row_number() over(partition by phone order by name) as rn    
    from `catalog.schema1.customer`) data
  where data.rn = 1
"""
>>> schema = None # we will infer the schema of table when no schema are provided
>>> dialect = "bigquery"
>>> tree = lineage(sql, dialect, schema)
>>> tree
Node<{"name": "myroot", "expression": "ANCHOR",...

You can traversal and print out the lineage tree in this way:

>>> def print_node(node):
>>>     print("Node:", node.name)
>>>     list(map(lambda c: print("Column:", c.name), node.children))
>>> for node in tree.walk():
...     print_node(node)
... 
Node: myroot
Node: _output_
Column: id
Column: name
Column: phone
Column: address
Node: data
Column: id
Column: name
Column: phone
Column: address
Column: rn
Node: "catalog"."schema1"."customer" AS "customer"
Column: id
Column: name
Column: phone
Column: address

Or you can render the tree to a format you like, for example, mermaid.

>>> from datalineage.renderer import MermaidRenderer
>>> renderer = MermaidRenderer()
>>> print(renderer.render(tree))
%%{init: {"flowchart": {"defaultRenderer": "elk"}} }%%
graph LR
subgraph 1434247920720 ["Table: catalog.schema1.customer AS customer"]
1434247920624["id"]
1434247921104["name"]
1434247919568["phone"]
1434247921200["address"]
end

subgraph 1434247919280 ["Subquery: data"]
1434247920816["id"]
1434247919856["name"]
1434247917696["phone"]
1434247917744["address"]
1434247918224["rn"]
end
1434247920624 --> 1434247920816
1434247921104 --> 1434247919856
1434247919568 --> 1434247917696
1434247921200 --> 1434247917744
1434247919568 --> 1434247918224
1434247921104 --> 1434247918224

subgraph 1434247918032 ["Select: _output_"]
1434247921392["id"]
1434247921344["name"]
1434247921152["phone"]
1434247920912["address"]
end
1434247920816 --> 1434247921392
1434247919856 --> 1434247921344
1434247917696 --> 1434247921152
1434247917744 --> 1434247920912
>>>

[!TIP] You can render to json format using datalineage.renderer.JsonRenderer class, or customize your own renderer.

Here is the flow-chart of the above result:

%%{init: {"flowchart": {"defaultRenderer": "elk"}} }%%
graph LR
subgraph 1434247920720 ["Table: catalog.schema1.customer AS customer"]
1434247920624["id"]
1434247921104["name"]
1434247919568["phone"]
1434247921200["address"]
end

subgraph 1434247919280 ["Subquery: data"]
1434247920816["id"]
1434247919856["name"]
1434247917696["phone"]
1434247917744["address"]
1434247918224["rn"]
end
1434247920624 --> 1434247920816
1434247921104 --> 1434247919856
1434247919568 --> 1434247917696
1434247921200 --> 1434247917744
1434247919568 --> 1434247918224
1434247921104 --> 1434247918224

subgraph 1434247918032 ["Select: _output_"]
1434247921392["id"]
1434247921344["name"]
1434247921152["phone"]
1434247920912["address"]
end
1434247920816 --> 1434247921392
1434247919856 --> 1434247921344
1434247917696 --> 1434247921152
1434247917744 --> 1434247920912

Contribution

Setup Environment

make install-dev

Run Lint

make style

Run Tests

make test

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sql_datalineage-0.0.13.tar.gz (210.9 kB view details)

Uploaded Source

Built Distribution

sql_datalineage-0.0.13-py3-none-any.whl (16.4 kB view details)

Uploaded Python 3

File details

Details for the file sql_datalineage-0.0.13.tar.gz.

File metadata

  • Download URL: sql_datalineage-0.0.13.tar.gz
  • Upload date:
  • Size: 210.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.9.20

File hashes

Hashes for sql_datalineage-0.0.13.tar.gz
Algorithm Hash digest
SHA256 3abc0459cf89330e496d508a3311a676c7e384a2366f358f07796bee8fe7b66e
MD5 6f696e7ddf3b1d4803dc535ab750cdbc
BLAKE2b-256 6d359135117f375459811ab3d2a283a4f6cfd99e96f0cffe3910cf939fa1f366

See more details on using hashes here.

File details

Details for the file sql_datalineage-0.0.13-py3-none-any.whl.

File metadata

File hashes

Hashes for sql_datalineage-0.0.13-py3-none-any.whl
Algorithm Hash digest
SHA256 533adcc4449d1e71b4c51c008118a517a4b6e4872c3871a4fe421bffb37ed646
MD5 2d86aed56c5129076499cfddf648c70f
BLAKE2b-256 26cd1d086c8b08bf54606675e721ca55a11b8a6638e476a1171e168c4de4db77

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page