Skip to main content

Table-level data lineage tool

Project description

Stairlight

Stairlight

PyPi Version PyPi License PyPi Python Versions Code style: black CI

A table-level data lineage tool, detects table dependencies by SELECT queries.

Queries can be read from following systems.

Installation

This package is distributed on PyPI.

$ pip install stairlight

(v0.4+) The base package is for Local file system only. Please set extras when reading from other data sources.

$ pip install "stairlight[gcs,redash]"

Getting Started

There are 3 steps to use.

# Step 1: Initialize and set data location settings
$ stairlight init
'./stairlight.yaml' has created.
Please edit it to set your data sources.

# Step 2: Map SQL queries and tables, and add metadata
$ stairlight map
'./mapping_yyyyMMddhhmmss.yaml' has created.
Please map undefined tables and parameters, and append to your latest configuration file.

# Step 3: Get a table dependency map
$ stairlight

Description

Input

  • SQL SELECT queries
  • Configuration files (YAML)
    • stairlight.yaml: SQL query locations and include/exclude conditions.
    • mapping.yaml: Mapping SQL queries and tables.

Output

  • Dependency map (JSON)

    Example
    {
        "PROJECT_d.DATASET_e.TABLE_f": {
            "PROJECT_j.DATASET_k.TABLE_l": {
                "TemplateSourceType": "File",
                "Key": "tests/sql/main/one_line_2.sql",
                "Uri": "/foo/bar/stairlight/tests/sql/main/one_line_2.sql",
                "Lines": [
                    {
                        "LineNumber": 1,
                        "LineString": "SELECT * FROM PROJECT_j.DATASET_k.TABLE_l WHERE 1 = 1"
                    }
                ]
            },
            "PROJECT_C.DATASET_C.TABLE_C": {
                "TemplateSourceType": "GCS",
                "Key": "sql/cte/cte_multi_line.sql",
                "Uri": "gs://stairlight/sql/cte/cte_multi_line.sql",
                "Lines": [
                    {
                        "LineNumber": 6,
                        "LineString": "        PROJECT_C.DATASET_C.TABLE_C"
                    }
                ],
                "BucketName": "stairlight",
                "Labels": {
                    "Source": "gcs",
                    "Test": "b"
                }
            }
        },
        "AggregateSales": {
            "PROJECT_e.DATASET_e.TABLE_e": {
                "TemplateSourceType": "Redash",
                "Key": 5,
                "Uri": "AggregateSales",
                "Lines": [
                    {
                        "LineNumber": 1,
                        "LineString": "SELECT service, SUM(total_amount) FROM PROJECT_e.DATASET_e.TABLE_e GROUP BY service"
                    }
                ],
                "DataSourceName": "BigQuery",
                "Labels": {
                    "Category": "Sales"
                }
            }
        },
    }
    

Configuration

Configuration files can be found here, used for unit test in CI.

stairlight.yaml

'stairlight.yaml' is for setting up Stairlight itself.

It is responsible for specifying the destination of SQL queries to be read, and for specifying data sources.

Include:
  - TemplateSourceType: File
    FileSystemPath: "./tests/sql"
    Regex: ".*/*.sql$"
    DefaultTablePrefix: "PROJECT_A"
  - TemplateSourceType: GCS
    ProjectId: null
    BucketName: stairlight
    Regex: "^sql/.*/*.sql$"
    DefaultTablePrefix: "PROJECT_A"
  - TemplateSourceType: Redash
    DatabaseUrlEnvironmentVariable: REDASH_DATABASE_URL
    DataSourceName: BigQuery
    QueryIds:
      - 1
      - 3
      - 5
Exclude:
  - TemplateSourceType: File
    Regex: "main/exclude.sql$"
Settings:
  MappingPrefix: "mapping"

mapping.yaml

'mapping.yaml' is used to define relationships between input queries and tables.

A template of this file can be created by map command, based on the configuration of 'stairlight.yaml'.

Global:
  Parameters:
    DESTINATION_PROJECT: stairlight
    params:
      PROJECT: 1234567890
      DATASET: public
      TABLE: taxirides
Mapping:
  - TemplateSourceType: File
    FileSuffix: "tests/sql/main/union_same_table.sql"
    Tables:
      - TableName: "test_project.beam_streaming.taxirides_aggregation"
  - TemplateSourceType: GCS
    Uri: "gs://stairlight/sql/one_line/one_line.sql"
    Tables:
      - TableName: "PROJECT_a.DATASET_b.TABLE_c"
  - TemplateSourceType: Redash
    QueryId: 5
    DataSourceName: metadata
    Tables:
      - TableName: Copy of (#4) New Query
        Parameters:
          table: dashboards
        Labels:
          Category: Redash test
Metadata:
  - TableName: "PROJECT_A.DATASET_A.TABLE_A"
    Labels:
      Source: Null
      Test: a

Global Section

This section is for global configurations.

Parameters attribute is used to set common parameters. If conflicts has occurred with Parameters attributes in mapping section, mapping section's parameters will be used in preference to global.

Mapping Section

Mapping section is used to define relationships between queries and tables that created as a result of query execution.

Parameters attribute allows you to reflect settings in jinja template variables embedded in queries. If multiple settings are applied to a query using jinja template, the query will be read as if there were the same number of queries as the number of settings.

Metadata Section

This section is mainly used to set metadata to tables appears only in queries.

Command and Option

$ stairlight --help
usage: stairlight [-h] [-c CONFIG] [--save SAVE] [--load LOAD] {init,check,up,down} ...

A table-level data lineage tool, detects table dependencies by SELECT queries.
Without positional arguments, return a table dependency map as JSON format.

positional arguments:
  {init,map,check,up,down}
    init                create new Stairlight configuration file
    map (check)         create new configuration file about undefined mappings
    up                  return upstairs ( table | SQL file ) list
    down                return downstairs ( table | SQL file ) list

optional arguments:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG
                        set Stairlight configuration directory
  -q, --quiet           keep silence
  --save SAVE           file path where results will be saved(File system or GCS)
  --load LOAD           file path in which results are saved(File system or GCS), can be specified multiple times

init

init creates a new Stairlight configuration file.

$ stairlight init --help
usage: stairlight init [-h] [-c CONFIG]

optional arguments:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG
                        set Stairlight configuration directory.
  -q, --quiet           keep silence

map(check)

map creates new configuration file about undefined mappings.check is an alias. The option specification is the same as init.

up

up outputs a list of tables or SQL files located upstream from the specified table.

  • Use table(-t, --table) or label(-l, --label) option to specify tables to search.
  • Recursive option(-r, --recursive) is set, Stairlight will find tables recursively and output as a list.
  • Verbose option(-v, --verbose) is set, Stairlight will add detailed information and output it as a dict.
$ stairlight up --help
usage: stairlight up [-h] [-c CONFIG] [--save SAVE] [--load LOAD] (-t TABLE | -l LABEL) [-o {table,file}]
                     [-v] [-r]

optional arguments:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG
                        set Stairlight configuration directory
  -q, --quiet           keep silence
  --save SAVE           file path where results will be saved(File system or GCS)
  --load LOAD           file path in which results are saved(File system or GCS), can be specified multiple times
  -t TABLE, --table TABLE
                        table names that Stairlight searches for, can be specified
                        multiple times. e.g. -t PROJECT_a.DATASET_b.TABLE_c -t
                        PROJECT_d.DATASET_e.TABLE_f
  -l LABEL, --label LABEL
                        labels set for the table in mapping configuration, can be
                        specified multiple times. The separator between key and value
                        should be a colon(:). e.g. -l key_1:value_1 -l key_2:value_2
  -o {table,file}, --output {table,file}
                        output type
  -v, --verbose         return verbose results
  -r, --recursive       search recursively

down

down outputs a list of tables or SQL files located downstream from the specified table. The option specification is the same as up.

Use as a library

Stairlight can also be used as a library.

tosh2230/stairlight-app is a sample web application rendering table dependency graph with Stairlight, using Graphviz, Streamlit and Google Cloud Run.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

stairlight-0.4.1.tar.gz (23.2 kB view hashes)

Uploaded Source

Built Distribution

stairlight-0.4.1-py3-none-any.whl (25.8 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page