google-datacatalog-rdbms-connector

Common resources for Data Catalog RDBMS connectors.

Disclaimer: This is not an officially supported Google product.

1. Installation

Install this library in a virtualenv using pip. virtualenv is a tool to create isolated Python environments. The basic problem it addresses is one of dependencies and versions, and indirectly permissions.

With virtualenv, it's possible to install this library without needing system install permissions, and without clashing with the installed system dependencies. Make sure you use Python 3.6+.

1.1. Mac/Linux

pip3 install virtualenv
virtualenv --python python3.6 <your-env>
source <your-env>/bin/activate
<your-env>/bin/pip install google-datacatalog-rdbms-connector

1.2. Windows

pip3 install virtualenv
virtualenv --python python3.6 <your-env>
<your-env>\Scripts\activate
<your-env>\Scripts\pip.exe install google-datacatalog-rdbms-connector

2. Install from source

2.1. Get the code

git clone https://github.com/GoogleCloudPlatform/datacatalog-connectors-rdbms/
cd datacatalog-connectors-rdbms/google-datacatalog-rdbms-connector

2.2. Virtualenv

Using virtualenv is optional, but strongly recommended.

2.2.1. Install Python 3.6

2.2.2. Create and activate a virtualenv

pip3 install virtualenv
virtualenv --python python3.6 <your-env>
source <your-env>/bin/activate

2.2.3. Install

pip install .

3. Developer environment

3.1. Install and run YAPF formatter

pip install --upgrade yapf

# Auto update files
yapf --in-place --recursive src tests

# Show diff
yapf --diff --recursive src tests

# Set up pre-commit hook
# From the root of your git project.
curl -o pre-commit.sh https://raw.githubusercontent.com/google/yapf/master/plugins/pre-commit.sh
chmod a+x pre-commit.sh
mv pre-commit.sh .git/hooks/pre-commit

3.2. Install and run Flake8 linter

pip install --upgrade flake8
flake8 src tests

3.3. Install the package in editable mode (i.e. setuptools “develop mode”)

pip install --editable .

3.4. Run the unit tests

python setup.py test

4. Setting up the RDBMS on a new connector

To set up the RDBMS connector to work with a relational database, three pieces are needed:

  • metadata_definition.json
  • metadata_query.sql
  • An extension of the metadata_scraper class that implements your RDBMS connection method, _create_rdbms_connection (see the sketch after this list)
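
A minimal sketch of that extension, assuming the google.datacatalog_connectors.rdbms.scrape import path used by the published connectors and psycopg2 as an illustrative driver; swap in your database's DB-API driver, and treat the connection_args keys as assumptions:

import psycopg2  # illustrative driver choice

from google.datacatalog_connectors.rdbms.scrape import metadata_scraper


class MetadataScraper(metadata_scraper.MetadataScraper):

    # Called by the common scraping logic; must return an open
    # DB-API connection to the target database.
    def _create_rdbms_connection(self, connection_args):
        # The keys below are hypothetical; use whatever your
        # connector passes in its connection_args dict.
        return psycopg2.connect(database=connection_args['database'],
                                host=connection_args['host'],
                                user=connection_args['user'],
                                password=connection_args['pass'])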

For the metadata_definition file, you have fields available at three levels:

  • table_container_def
  • table_def
  • column_def

If you want working examples, please take a look at the already implemented connectors for Oracle, Teradata, MySQL, PostgreSQL, Greenplum, Redshift, and SQLServer.

For the metadata_definition target fields, you have the following options:

| Level | Target | Description | Mandatory |
| --- | --- | --- | --- |
| table_container_def | creator | Creator of the Table Container. | N |
| table_container_def | owner | Owner of the Table Container. | N |
| table_container_def | update_user | Last user that updated the Table Container. | N |
| table_container_def | desc | Table Container Description. | N |
| table_def | num_rows | Number of rows contained in the Table. | N |
| table_def | creator | Creator of the Table. | N |
| table_def | owner | Owner of the Table. | N |
| table_def | update_user | Last user that updated the Table. | N |
| table_def | desc | Table Description. | N |
| table_def | table_size_MB | Table size, in MB. | N |

If those fields are configured, they will be used to create Tags.

For columns, two target fields are used to create the Data Catalog Entry schema:

| Level | Target | Description | Mandatory |
| --- | --- | --- | --- |
| column_def | type | Column type. | Y |
| column_def | desc | Column description. | N |
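
Putting the three levels together, here is a minimal metadata_definition.json sketch. The overall shape follows the published connectors, but treat the key/type/name attributes and all source names (e.g. schema_name, column_type) as assumptions: each source name must match a column returned by your metadata_query.sql.

{
  "table_container_def": {
    "key": "schemas",
    "type": "schema",
    "name": "schema_name",
    "fields": [
      {"source": "schema_name", "target": "name"},
      {"source": "schema_owner", "target": "owner"}
    ]
  },
  "table_def": {
    "key": "tables",
    "type": "table",
    "name": "table_name",
    "fields": [
      {"source": "table_name", "target": "name"},
      {"source": "table_rows", "target": "num_rows"}
    ]
  },
  "column_def": {
    "key": "columns",
    "type": "column",
    "name": "column_name",
    "fields": [
      {"source": "column_name", "target": "name"},
      {"source": "column_type", "target": "type"}
    ]
  }
}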

4.1 Add support for optional queries

You can use a user configuration file and execute optional queries to scrape additional metadata from a database. You need to do the following (see the sketch after this list):

  • Create additional SQL queries. At the moment, support for two optional queries is implemented in this common RDBMS package:
    • A statement for refreshing metadata (e.g. ANALYZE)
    • A query to scrape the number of rows in each table
  • Extend the query_assembler class and implement the methods that return the optional queries: _get_refresh_statement and _get_path_to_num_rows_query.
  • Implement _get_query_assembler and _execute_refresh_query on your extension of the metadata_scraper class.
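
A hedged sketch of those hooks, modeled on the PostgreSQL connector; the import paths are the ones used by the published connectors, but the exact method signatures and the num_rows_query.sql location are assumptions, so confirm them against the sample connectors:

from os import path

from google.datacatalog_connectors.rdbms.scrape import metadata_scraper
from google.datacatalog_connectors.rdbms.scrape import query_assembler


class QueryAssembler(query_assembler.QueryAssembler):

    # Statement used to refresh metadata for a given table
    # (PostgreSQL-style; adapt to your database).
    def _get_refresh_statement(self, tbl_name):
        return 'ANALYZE {};'.format(tbl_name)

    # Path to the optional query that scrapes the number of rows
    # in each table (the file location is an assumption).
    def _get_path_to_num_rows_query(self):
        return path.join(path.dirname(path.abspath(__file__)),
                         'config', 'num_rows_query.sql')


class MetadataScraper(metadata_scraper.MetadataScraper):

    def _get_query_assembler(self):
        return QueryAssembler()

    # Signature is illustrative: run the refresh statement
    # through the open database cursor.
    def _execute_refresh_query(self, cursor, refresh_statement):
        cursor.execute(refresh_statement)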

You can see working examples of implementing optional queries in the code of the PostgreSQL and MySQL connectors. Please also refer to these samples to see what the user configuration file ingest_cfg.yaml should look like.
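
For instance, a minimal ingest_cfg.yaml enabling both optional behaviors might look like this; the flag names below are assumptions drawn from the sample connectors, so verify them against the samples above:

refresh_metadata_tables: true
scrape_optional_metadata: true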

4.2 Add support for SQL Objects

You can use the SQL Objects mechanism from the SQL connector to ingest SQL objects such as Functions, Stored Procedures, Views, and Materialized Views. This mechanism uses a naming convention to locate a SQL query and a metadata definition file.

You can see working examples of implementing SQL Objects in the test sources. To enable the mechanism, you need three configuration files:

  1. ingest_cfg.yaml

    sql_objects:
      - name: 'functions'
        enabled: True
      - name: 'stored_procedures'
        enabled: True
    

    The ingest_cfg.yaml file must be located in the connector execution directory. You can find a sample in the SAP HANA connector.

    You can specify a list of SQL Objects, each with an enabled flag; the mechanism only considers objects whose enabled flag is set to True.

    The related query and metadata definition files are located by the sql_objects.name value. From the above sample, the functions SQL Object requires two files: query_functions_sql_object.sql and metadata_definition_functions_sql_object.json.

    And the stored_procedures SQL Object requires query_stored_procedures_sql_object.sql and metadata_definition_stored_procedures_sql_object.json.

  2. query_functions_sql_object.sql
    Implement a query file that matches the SQL Object name, following the pattern query_{name}_sql_object.sql. This file must be located in the connector config directory. You can find a sample in the SAP HANA connector.

  3. metadata_definition_functions_sql_object.json
    Implement a metadata definition file that matches the SQL Object name, following the pattern metadata_definition_{sql_objects.name}_sql_object.json. This file must be located in the connector config directory. You can find a sample in the SAP HANA connector.

    These are the required attributes of the metadata definition file:

    | Field | Description |
    | --- | --- |
    | key | Key used to look up the config files |
    | type | SQL Object type |
    | name | Name of the SQL Object |
    | fields | Definition of the fields scraped by the query |

    The fields attribute needs to implement the source and target definitions:

    | Field | Description |
    | --- | --- |
    | fields.source | Name of the field scraped by the query |
    | fields.target | Object with the target field attributes |
    | fields.target.field_name | Name of the field when translated to Data Catalog attributes |
    | fields.target.model | Data Catalog model the field maps to (tag or entry) |
    | fields.target.type | Type of the field to be translated to Data Catalog attributes |
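
    A minimal sketch of such a file for the functions object from the sample above; all values shown are illustrative, and the source names (e.g. function_name) must match the columns returned by query_functions_sql_object.sql:

    {
      "key": "functions",
      "type": "function",
      "name": "functions",
      "fields": [
        {
          "source": "function_name",
          "target": {
            "field_name": "name",
            "model": "entry",
            "type": "string"
          }
        },
        {
          "source": "function_definition",
          "target": {
            "field_name": "definition",
            "model": "tag",
            "type": "string"
          }
        }
      ]
    }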

**The SQL Object will be skipped if any of the three configuration files is missing.**
