SQLLineage

SQL Lineage Analysis Tool powered by Python

Never got the hang of a SQL parser? SQLLineage comes to the rescue. Given a SQL command, SQLLineage will tell you its source and target tables, without you having to worry about tokens, keywords, identifiers and all the jargon used by SQL parsers.

Behind the scenes, SQLLineage leverages pluggable parser libraries (sqlfluff and sqlparse) to parse the SQL command, analyzes the AST, stores the lineage information in a graph (using the graph library networkx), and brings you a human-readable result with ease.

Demo & Documentation

Talk is cheap, show me a demo.

Documentation is hosted online by readthedocs, and you can check the release notes there.

Quick Start

Install sqllineage via PyPI:

$ pip install sqllineage

Use the sqllineage command to parse a quoted query string:

$ sqllineage -e "insert into db1.table1 select * from db2.table2"
Statements(#): 1
Source Tables:
    db2.table2
Target Tables:
    db1.table1

Or you can parse a SQL file with the -f option:

$ sqllineage -f foo.sql
Statements(#): 1
Source Tables:
    db1.table_foo
    db1.table_bar
Target Tables:
    db2.table_baz
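If you call the command from a script, the summary text shown above can be turned into a dictionary with a few lines of Python. This is only a sketch based on the output layout in these examples (section headers ending in a colon, table names indented beneath them); the helper name parse_summary is hypothetical and is not part of sqllineage.

```python
from typing import Dict, List, Optional


def parse_summary(output: str) -> Dict[str, List[str]]:
    """Group indented table names under their section header.

    Hypothetical helper: it assumes the layout shown in the examples above.
    """
    sections: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in output.splitlines():
        if not line.strip():
            continue
        if line.startswith(" "):
            # Indented line: a table name belonging to the current section.
            if current is not None:
                sections[current].append(line.strip())
        elif line.rstrip().endswith(":"):
            # Header line such as "Source Tables:".
            current = line.rstrip()[:-1]
            sections[current] = []
        else:
            # Any other line, e.g. "Statements(#): 1", closes the section.
            current = None
    return sections


sample = (
    "Statements(#): 1\n"
    "Source Tables:\n"
    "    db1.table_foo\n"
    "    db1.table_bar\n"
    "Target Tables:\n"
    "    db2.table_baz\n"
)
parsed = parse_summary(sample)
print(parsed)
```
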

Advanced Usage

Multiple SQL Statements

Lineage is combined from multiple SQL statements, with intermediate tables identified:

$ sqllineage -e "insert into db1.table1 select * from db2.table2; insert into db3.table3 select * from db1.table1;"
Statements(#): 2
Source Tables:
    db2.table2
Target Tables:
    db3.table3
Intermediate Tables:
    db1.table1
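The way per-statement reads and writes roll up into source, target and intermediate tables can be sketched with plain set arithmetic. This is a minimal illustration, not sqllineage's actual implementation (which stores lineage in a networkx graph); the function name summarize_lineage and the input shape are assumptions made for the example.

```python
from typing import Iterable, Set, Tuple


def summarize_lineage(
    statements: Iterable[Tuple[Set[str], Set[str]]]
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Combine (reads, writes) pairs into source, target and intermediate sets.

    Hypothetical helper for illustration only.
    """
    all_reads: Set[str] = set()
    all_writes: Set[str] = set()
    for reads, writes in statements:
        all_reads |= reads
        all_writes |= writes
    intermediate = all_reads & all_writes  # both written and read somewhere
    sources = all_reads - all_writes       # only ever read
    targets = all_writes - all_reads       # only ever written
    return sources, targets, intermediate


statements = [
    ({"db2.table2"}, {"db1.table1"}),  # insert into db1.table1 select * from db2.table2
    ({"db1.table1"}, {"db3.table3"}),  # insert into db3.table3 select * from db1.table1
]
sources, targets, intermediate = summarize_lineage(statements)
print(sources, targets, intermediate)
```

This simplification ignores statement order and treats a table that a single statement both reads and writes as intermediate, which a graph-based analysis may handle differently.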

Verbose Lineage Result

If you want to see the lineage for each SQL statement, just toggle the verbose option:

$ sqllineage -v -e "insert into db1.table1 select * from db2.table2; insert into db3.table3 select * from db1.table1;"
Statement #1: insert into db1.table1 select * from db2.table2;
    table read: [Table: db2.table2]
    table write: [Table: db1.table1]
    table cte: []
    table rename: []
    table drop: []
Statement #2: insert into db3.table3 select * from db1.table1;
    table read: [Table: db1.table1]
    table write: [Table: db3.table3]
    table cte: []
    table rename: []
    table drop: []
==========
Summary:
Statements(#): 2
Source Tables:
    db2.table2
Target Tables:
    db3.table3
Intermediate Tables:
    db1.table1

Dialect-Awareness Lineage

By default, sqllineage uses the ansi dialect to parse and validate your SQL. However, some SQL syntax you take for granted in daily life might not be in the ANSI standard. In addition, different SQL dialects have different sets of SQL keywords, which further weakens sqllineage's capabilities when a keyword is used as a table name or column name. To get the most out of sqllineage, we strongly encourage you to pass the dialect to assist the lineage analysis.

Take the example below: the INSERT OVERWRITE statement is only supported by big data solutions like Hive/SparkSQL, and MAP is a reserved keyword in Hive, so it cannot be used as a table name there, while it is not reserved in SparkSQL. Both the ansi and hive dialects report a syntax error, and sparksql gives the correct result:

$ sqllineage -e "INSERT OVERWRITE TABLE map SELECT * FROM foo"
...
sqllineage.exceptions.InvalidSyntaxException: This SQL statement is unparsable, please check potential syntax error for SQL

$ sqllineage -e "INSERT OVERWRITE TABLE map SELECT * FROM foo" --dialect=hive
...
sqllineage.exceptions.InvalidSyntaxException: This SQL statement is unparsable, please check potential syntax error for SQL

$ sqllineage -e "INSERT OVERWRITE TABLE map SELECT * FROM foo" --dialect=sparksql
Statements(#): 1
Source Tables:
    <default>.foo
Target Tables:
    <default>.map

Use sqllineage --dialects to see all available dialects.

Column-Level Lineage

Column-level lineage is also supported in the command line interface: set the level option to column and all column lineage paths will be printed.

INSERT INTO foo
SELECT a.col1,
       b.col1     AS col2,
       c.col3_sum AS col3,
       col4,
       d.*
FROM bar a
         JOIN baz b
              ON a.id = b.bar_id
         LEFT JOIN (SELECT bar_id, sum(col3) AS col3_sum
                    FROM qux
                    GROUP BY bar_id) c
                   ON a.id = c.bar_id
         CROSS JOIN quux d;

INSERT INTO corge
SELECT a.col1,
       a.col2 + b.col2 AS col2
FROM foo a
         LEFT JOIN grault b
              ON a.col1 = b.col1;

Suppose this SQL is stored in a file called test.sql:

$ sqllineage -f test.sql -l column
<default>.corge.col1 <- <default>.foo.col1 <- <default>.bar.col1
<default>.corge.col2 <- <default>.foo.col2 <- <default>.baz.col1
<default>.corge.col2 <- <default>.grault.col2
<default>.foo.* <- <default>.quux.*
<default>.foo.col3 <- c.col3_sum <- <default>.qux.col3
<default>.foo.col4 <- col4
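Each printed line above is one path through the column lineage graph, from a final target column back to an original source column. The sketch below shows how such paths can be enumerated from direct column-to-column edges. It uses a plain dictionary rather than sqllineage's internal graph, covers only a subset of the example, omits the schema prefix for brevity, and assumes the edges contain no cycles; the names upstream and lineage_paths are made up for illustration.

```python
from typing import Dict, List

# Direct upstream columns for each column (a subset of the example above).
upstream: Dict[str, List[str]] = {
    "corge.col1": ["foo.col1"],
    "corge.col2": ["foo.col2", "grault.col2"],
    "foo.col1": ["bar.col1"],
    "foo.col2": ["baz.col1"],
}


def lineage_paths(column: str) -> List[List[str]]:
    """Return every path from `column` back to a column with no upstream."""
    parents = upstream.get(column, [])
    if not parents:
        return [[column]]  # reached an original source column
    return [
        [column] + path
        for parent in parents
        for path in lineage_paths(parent)
    ]


for target in ("corge.col1", "corge.col2"):
    for path in lineage_paths(target):
        print(" <- ".join(path))
```
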

MetaData-Awareness Lineage

Looking at the column lineage generated in the previous step, you may notice that:

  1. <default>.foo.* <- <default>.quux.*: the wildcard is not expanded.
  2. <default>.foo.col4 <- col4: col4 is not assigned with source table.

It's not perfect because we don't know the columns encoded in * of table quux. Likewise, given the context, col4 could be coming from bar, baz or quux. Without metadata, this is the best sqllineage can do.

Users can optionally provide metadata to sqllineage to improve the lineage result.

Suppose all the tables are created in a SQLite database stored in a file called db.db. In particular, table quux has columns col5 and col6, and baz has column col4.

sqlite3 db.db 'CREATE TABLE IF NOT EXISTS baz (bar_id int, col1 int, col4 int)';
sqlite3 db.db 'CREATE TABLE IF NOT EXISTS quux (quux_id int, col5 int, col6 int)';

Now given the same SQL, column lineage is fully resolved.

$ SQLLINEAGE_DEFAULT_SCHEMA=main sqllineage -f test.sql -l column --sqlalchemy_url=sqlite:///db.db
main.corge.col1 <- main.foo.col1 <- main.bar.col1
main.corge.col2 <- main.foo.col2 <- main.baz.col1
main.corge.col2 <- main.grault.col2
main.foo.col3 <- c.col3_sum <- main.qux.col3
main.foo.col4 <- main.baz.col4
main.foo.col5 <- main.quux.col5
main.foo.col6 <- main.quux.col6

The default schema name in SQLite is main. We have to specify it here because the tables in the SQL file are unqualified.

SQLLineage leverages sqlalchemy to retrieve metadata from different SQL databases. See SQLLineage MetaData for more details.
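To see why metadata helps, the sketch below reads column names straight from SQLite and uses them to expand a wildcard and to find which table defines an unqualified column. It uses only Python's built-in sqlite3 module and an in-memory database as a stand-in for db.db; it does not go through sqlalchemy or sqllineage, and the helper name table_columns is hypothetical.

```python
import sqlite3
from typing import Dict, List

conn = sqlite3.connect(":memory:")  # in-memory stand-in for db.db
conn.execute("CREATE TABLE IF NOT EXISTS baz (bar_id int, col1 int, col4 int)")
conn.execute("CREATE TABLE IF NOT EXISTS quux (quux_id int, col5 int, col6 int)")


def table_columns(connection: sqlite3.Connection, table: str) -> List[str]:
    """List the column names of `table` via SQLite's table_info pragma."""
    # The table name is interpolated directly, so only pass trusted names.
    rows = connection.execute(f"PRAGMA table_info({table})").fetchall()
    return [row[1] for row in rows]  # index 1 holds the column name


metadata: Dict[str, List[str]] = {
    table: table_columns(conn, table) for table in ("baz", "quux")
}

# With the columns known, expanding quux.* is a simple lookup.
print(metadata["quux"])

# An unqualified column can be matched to the table(s) that define it.
owners = [table for table, cols in metadata.items() if "col4" in cols]
print(owners)
```
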

Lineage Visualization

One more cool feature: if you want a graph visualization of the lineage result, toggle the graph-visualization option.

Still using the above SQL file:

$ sqllineage -g -f foo.sql

A webserver will be started, showing a DAG representation of the lineage result in the browser:

  • Table-Level Lineage
  • Column-Level Lineage
