Open Source Data Lineage Tool For AWS and GCP

Data Lineage for Databases and Data Lakes

data-lineage is an open source application to query and visualize data lineage in databases, data warehouses and data lakes in AWS and GCP.

data-lineage aims to be fast, simple to set up, and useful for analyzing lineage. To achieve these goals, data-lineage has the following features:

  1. Generate data lineage from query history. Most databases maintain query history for a few days, so the cost of setting up infrastructure to capture and store metadata is minimal.
  2. Use the NetworkX graph library to create a DAG of the lineage. NetworkX graphs give programmatic access to the lineage, which makes it possible to analyze the lineage in code.
  3. Integrate with Jupyter Notebooks. Jupyter Notebooks provide an excellent IDE to generate, manipulate and analyze data lineage graphs.
  4. Use Plotly to visualize the graph with rich annotations. Plotly supports tooltips, color coding and weights based on different attributes of the graph.
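
To make the DAG idea concrete, here is a minimal sketch that uses NetworkX directly rather than the data-lineage API. The table names and the `build_lineage_graph` helper are hypothetical, and the (source, target) pairs are assumed to have already been extracted from statements in the query history.

```python
import networkx as nx

# Hypothetical (source_table, target_table) pairs, as might be extracted from
# INSERT ... SELECT or CREATE TABLE AS statements found in a query history.
edges = [
    ("raw.orders", "staging.orders"),
    ("raw.customers", "staging.customers"),
    ("staging.orders", "analytics.revenue"),
    ("staging.customers", "analytics.revenue"),
]


def build_lineage_graph(pairs):
    """Build a directed graph of table dependencies and check it is a DAG."""
    graph = nx.DiGraph()
    graph.add_edges_from(pairs)
    if not nx.is_directed_acyclic_graph(graph):
        raise ValueError("lineage contains a cycle")
    return graph


lineage = build_lineage_graph(edges)

# Every table that analytics.revenue is derived from, directly or indirectly.
upstream = sorted(nx.ancestors(lineage, "analytics.revenue"))
print(upstream)
```

Because the result is an ordinary `networkx.DiGraph`, any NetworkX algorithm can be applied to it, for example in a Jupyter Notebook.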

Check out an example data lineage notebook.

Use Cases

Data Lineage enables the following use cases:

  • Business Rules Verification
  • Change Impact Analysis
  • Data Quality Verification
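
As an illustration of change impact analysis, the sketch below lists the tables affected by a change to one table, in the order they would need to be rebuilt. It uses plain NetworkX calls, not the data-lineage API. The `impacted_tables` helper and the table names are hypothetical.

```python
import networkx as nx


def impacted_tables(graph, changed):
    """Return the tables downstream of `changed`, in dependency order."""
    downstream = nx.descendants(graph, changed)
    # Walk the whole graph in topological order and keep the affected tables.
    return [node for node in nx.topological_sort(graph) if node in downstream]


graph = nx.DiGraph()
graph.add_edges_from([
    ("raw.orders", "staging.orders"),
    ("staging.orders", "analytics.revenue"),
    ("analytics.revenue", "reports.weekly"),
    ("raw.customers", "staging.customers"),
])

impact = impacted_tables(graph, "raw.orders")
print(impact)
```

Tables that do not depend on the changed table, such as `staging.customers` here, are left out of the result.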

Check out the post on using data lineage for cost control for an example of how data lineage can be used in production.

Quick Start

# Install packages
pip install data-lineage
pip install jupyter

jupyter notebook

# Check out the example notebook: http://tokern.io/docs/data-lineage/example/

Supported Technologies

  • Postgres
  • AWS Redshift
  • Snowflake

Coming Soon

  • MySQL
  • SparkSQL
  • Presto

Documentation

For advanced usage, please refer to the data-lineage documentation.

Survey

Please take this survey if you are a user or considering using data-lineage. Responses will help us prioritize features better.

Developer Setup

# Install dependencies
pipenv install --dev

# Set up pre-commit and pre-push hooks
pipenv run pre-commit install -t pre-commit
pipenv run pre-commit install -t pre-push
