
Data Lineage Tracing Library


An open source project from the Data to AI Lab at MIT.


DataTracer


Overview

DataTracer is a Python library for solving Data Lineage problems using statistical methods, machine learning techniques, and hand-crafted heuristics.

Currently, the DataTracer library implements discovery of the following properties:

  • Primary Key: Identify which column is the primary key in each table.
  • Foreign Key: Find which relationships exist between the tables.
  • Column Mapping: Given a field in a table, deduce which other fields, from the same table or from other tables, are most closely related to it or contributed the most to generating it.

REST API

The DataTracer library also includes a REST API that lets you interact with the DataTracer solvers over HTTP. See the DataTracer REST API documentation for details.

Install

Requirements

DataTracer has been developed and tested on Python 3.5, 3.6, and 3.7.

Although it is not strictly required, using a virtualenv is highly recommended to avoid interfering with other software installed on the system where DataTracer runs.

Install with pip

The easiest and recommended way to install DataTracer is using pip:

pip install datatracer

This will pull and install the latest stable release from PyPI.

If you want to install from source or contribute to the project, please read the Contributing Guide.

Data Format: Datasets and Metadata

The DataTracer library works with datasets. Each dataset consists of a collection of tables loaded as pandas.DataFrames and a MetaData JSON that describes the dataset structure.

You can find more information about the MetaData format in the MetaData repository.

DataTracer also includes a few demo datasets, which you can download to your computer using the datatracer.get_demo_data function:

from datatracer import get_demo_data

get_demo_data()

This will create a folder called datatracer_demo in your working directory with a few datasets ready to use inside it.
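To check what was downloaded, you can list the contents of that folder. The snippet below is a small sketch that uses only the standard library; list_demo_datasets is a hypothetical helper, not part of DataTracer, and it assumes each demo dataset lives in its own subfolder (as the datatracer_demo/classicmodels path used later in this README suggests).

```python
from pathlib import Path


def list_demo_datasets(demo_dir="datatracer_demo"):
    """Return the sorted names of the dataset folders inside ``demo_dir``.

    Hypothetical helper: assumes one subfolder per dataset.
    """
    root = Path(demo_dir)
    if not root.is_dir():
        # get_demo_data() has probably not been run in this directory yet.
        return []
    return sorted(p.name for p in root.iterdir() if p.is_dir())


print(list_demo_datasets())
```

Any of the returned names can then be appended to the folder path and passed to load_dataset.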

Quickstart

In this short tutorial we will guide you through a series of steps that will help you get started with DataTracer.

Load data

The first step will be to load the data in the format expected by DataTracer.

To do so, we can use the datatracer.load_dataset function, passing it the path to the dataset folder.

For example, if we want to use the classicmodels dataset included in the demo folder that we just created we can load it using:

from datatracer import load_dataset

metadata, tables = load_dataset('datatracer_demo/classicmodels')

This will return a tuple which contains:

  • A MetaData instance with details about the dataset.
  • A dict containing all the tables of the dataset, each loaded as a pandas.DataFrame.
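Before running a solver, it can help to look at what was loaded. The following is a minimal sketch, assuming tables is the dict returned by load_dataset; summarize_tables is a hypothetical helper, not part of DataTracer, and it relies only on the standard shape and columns attributes of a pandas.DataFrame.

```python
def summarize_tables(tables):
    """Map each table name to its (rows, columns) shape and its column names.

    Hypothetical helper: ``tables`` is expected to be a dict of
    pandas.DataFrame objects, such as the one returned by load_dataset.
    """
    return {
        name: {"shape": tuple(table.shape), "columns": list(table.columns)}
        for name, table in tables.items()
    }


# Example usage, once the dataset has been loaded:
#   metadata, tables = load_dataset('datatracer_demo/classicmodels')
#   for name, info in summarize_tables(tables).items():
#       print(name, info["shape"], info["columns"])
```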

Select a Solver

In the DataTracer project, the different Data Lineage problems are solved using what we call solvers.

We can see the list of available solvers using the get_solvers function:

from datatracer import get_solvers

get_solvers()

which will return a list with their names:

['datatracer.column_map',
 'datatracer.foreign_key.basic',
 'datatracer.foreign_key.standard',
 'datatracer.primary_key.basic']
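The names appear to follow a datatracer.<problem>.<variant> pattern. As a rough sketch, and assuming that pattern holds (it is inferred from the list above, not a documented guarantee), the solvers can be grouped by the problem they address. group_solvers is a hypothetical helper, not part of DataTracer.

```python
from collections import defaultdict


def group_solvers(solver_names):
    """Group solver names by the second component of their dotted name."""
    groups = defaultdict(list)
    for name in solver_names:
        parts = name.split(".")
        # Fall back to the full name if it has no dotted components.
        problem = parts[1] if len(parts) > 1 else name
        groups[problem].append(name)
    return dict(groups)


# The list returned by get_solvers() in the example above.
solvers = [
    "datatracer.column_map",
    "datatracer.foreign_key.basic",
    "datatracer.foreign_key.standard",
    "datatracer.primary_key.basic",
]

for problem, names in group_solvers(solvers).items():
    print(problem, names)
```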

Use a DataTracer instance to find table relationships

In order to use the selected solver you will need to load it using the DataTracer class.

In this example, we will try to figure out the relationships between the tables in our dataset by using the solver datatracer.foreign_key.standard.

from datatracer import DataTracer

# Load the Solver
solver = DataTracer.load('datatracer.foreign_key.standard')

# Solve the Data Lineage problem
foreign_keys = solver.solve(tables)

The result will be a list of dictionaries, one for each foreign key candidate:

[{'table': 'products',
  'field': 'productLine',
  'ref_table': 'productlines',
  'ref_field': 'productLine'},
 {'table': 'payments',
  'field': 'customerNumber',
  'ref_table': 'customers',
  'ref_field': 'customerNumber'},
 {'table': 'orders',
  'field': 'customerNumber',
  'ref_table': 'customers',
  'ref_field': 'customerNumber'},
 {'table': 'orderdetails',
  'field': 'productCode',
  'ref_table': 'products',
  'ref_field': 'productCode'},
 {'table': 'orderdetails',
  'field': 'orderNumber',
  'ref_table': 'orders',
  'ref_field': 'orderNumber'},
 {'table': 'employees',
  'field': 'officeCode',
  'ref_table': 'offices',
  'ref_field': 'officeCode'}]
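Since these are candidates, you may want to sanity-check them against the data. The sketch below is not part of DataTracer: fk_coverage and describe_fk are hypothetical helpers, and the toy tables are made up for illustration. The coverage function computes the fraction of distinct values in the child column that also appear in the referenced column, and it works with any table that can be indexed by column name, including a pandas.DataFrame.

```python
def describe_fk(fk):
    """Render a foreign key candidate as 'table.field -> ref_table.ref_field'."""
    return "{table}.{field} -> {ref_table}.{ref_field}".format(**fk)


def fk_coverage(tables, fk):
    """Fraction of distinct child values that appear in the referenced column."""
    child = set(tables[fk["table"]][fk["field"]])
    parent = set(tables[fk["ref_table"]][fk["ref_field"]])
    if not child:
        return 1.0  # an empty child column violates nothing
    return len(child & parent) / len(child)


# Toy stand-in for the real tables (illustrative values only).
toy_tables = {
    "customers": {"customerNumber": [1, 2, 3]},
    "orders": {"customerNumber": [1, 1, 3, 4]},
}
candidate = {
    "table": "orders",
    "field": "customerNumber",
    "ref_table": "customers",
    "ref_field": "customerNumber",
}

print(describe_fk(candidate), round(fk_coverage(toy_tables, candidate), 2))
```

A coverage of 1.0 means every child value has a match in the referenced column. With real DataFrames, missing values in the child column would need to be dropped first, since they would otherwise count as unmatched.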

What's next?

You can learn more about the DataTracer features in the notebook tutorials.

Also don't forget to have a look at the DataTracer REST API.

History

0.0.6 - 2020-06-19

  • Add update_metadata primitives and pipelines.
  • Upgrade to MetaData v0.0.2

0.0.5 - 2020-06-12

  • Add new update_metadata endpoint to the REST API.
  • New demo dataset and new tutorial.

0.0.4 - 2020-06-05

  • Add initial version of pretrained solvers
  • Reorganize ColumnMapSolver code tree
  • Add REST API to access DataTracer solvers via HTTP

0.0.3 - 2020-05-28

  • Finish Column Mapping and add tutorial
  • Minor refactoring and adding docstrings
  • Fix testing config

0.0.2 - 2020-05-26

  • Curate configuration and dependencies

0.0.1 - 2020-05-22

First release.

Features:

  • Primary Key Detection
  • Foreign Key Detection
  • Column Mapping

