Commons library for ingesting RDBMS metadata into Google Cloud Data Catalog
google-datacatalog-rdbms-connector
Common resources for Data Catalog RDBMS connectors.
Disclaimer: This is not an officially supported Google product.
Table of Contents
- 1. Installation
- 2. Install from source
- 3. Developer environment
- 4. Setting up a new RDBMS connector
1. Installation
Install this library in a virtualenv using pip. virtualenv is a tool to create isolated Python environments. The basic problem it addresses is one of dependencies and versions, and indirectly permissions.
With virtualenv, it's possible to install this library without needing system install permissions, and without clashing with the installed system dependencies. Make sure you use Python 3.6+.
1.1. Mac/Linux
pip3 install virtualenv
virtualenv --python python3.6 <your-env>
source <your-env>/bin/activate
<your-env>/bin/pip install google-datacatalog-rdbms-connector
1.2. Windows
pip3 install virtualenv
virtualenv --python python3.6 <your-env>
<your-env>\Scripts\activate
<your-env>\Scripts\pip.exe install google-datacatalog-rdbms-connector
2. Install from source
2.1. Get the code
git clone https://github.com/GoogleCloudPlatform/datacatalog-connectors-rdbms/
cd datacatalog-connectors-rdbms/google-datacatalog-rdbms-connector
2.2. Virtualenv
Using virtualenv is optional, but strongly recommended.
2.2.1. Install Python 3.6
2.2.2. Create and activate a virtualenv
pip3 install virtualenv
virtualenv --python python3.6 <your-env>
source <your-env>/bin/activate
2.2.3. Install
pip install .
3. Developer environment
3.1. Install and run YAPF formatter
pip install --upgrade yapf
# Auto update files
yapf --in-place --recursive src tests
# Show diff
yapf --diff --recursive src tests
# Set up pre-commit hook
# From the root of your git project.
curl -o pre-commit.sh https://raw.githubusercontent.com/google/yapf/master/plugins/pre-commit.sh
chmod a+x pre-commit.sh
mv pre-commit.sh .git/hooks/pre-commit
3.2. Install and run Flake8 linter
pip install --upgrade flake8
flake8 src tests
3.3. Install the package in editable mode (i.e. setuptools “develop mode”)
pip install --editable .
3.4. Run the unit tests
python setup.py test
4. Setting up a new RDBMS connector
To set up the RDBMS connector to work with a relational database, 3 files are needed:
- metadata_definition.json
- metadata_query.sql
- An extension of the metadata_scraper class that implements your RDBMS connection method, _create_rdbms_connection (see the sketch below)
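For illustration, here is a minimal sketch of that extension. The import path, base class, and _create_rdbms_connection signature are assumptions based on the already implemented connectors (check one of them, e.g. PostgreSQL, for the exact contract), and psycopg2 merely stands in for whatever database driver your RDBMS uses.

```python
import psycopg2  # placeholder driver; swap in the client library for your RDBMS

from google.datacatalog_connectors.rdbms.scrape import metadata_scraper


class MyDatabaseMetadataScraper(metadata_scraper.MetadataScraper):
    """Hypothetical scraper extension for a new RDBMS connector."""

    def _create_rdbms_connection(self, connection_args):
        # connection_args is assumed to carry the credentials your connector
        # entry point collects (host, user, password, database); adapt the
        # keys to whatever your connector actually passes in.
        return psycopg2.connect(
            host=connection_args['host'],
            user=connection_args['user'],
            password=connection_args['pass'],
            database=connection_args['database'])
```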
In the metadata_definition file you have fields available for 3 levels:
- table_container_def
- table_def
- column_def
If you want working examples, please take a look at the already implemented connectors for Oracle, Teradata, MySQL, PostgreSQL, Greenplum, Redshift and SQLServer.
For the metadata_definition target fields, you have the following options as targets:
Level | Target | Description | Mandatory |
---|---|---|---|
table_container_def | creator | Creator of the Table Container. | N |
table_container_def | owner | Owner of the Table Container. | N |
table_container_def | update_user | Last user that updated the Table Container. | N |
table_container_def | desc | Table Container Description. | N |
table_def | num_rows | Number of rows contained in the Table. | N |
table_def | creator | Creator of the Table. | N |
table_def | owner | Owner of the Table. | N |
table_def | update_user | Last user that updated the Table. | N |
table_def | desc | Table Description. | N |
table_def | table_size_MB | Table size, in MB. | N |
If those fields are configured, they will be used to create Tags.
For columns, the fields are used to create the Data Catalog Entry schema; two target fields are used:
Level | Target | Description | Mandatory |
---|---|---|---|
column_def | type | Column type. | Y |
column_def | desc | Column description. | N |
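To make the three levels concrete, below is a hypothetical sketch of a metadata definition, written as a Python dict as it might be serialized to metadata_definition.json. The surrounding attributes (key, type, name, fields) and the source column names are illustrative assumptions only; copy the real schema from one of the already implemented connectors.

```python
import json

# Hypothetical metadata_definition content. The overall shape and the
# source column names are assumptions; the target values come from the
# tables above.
metadata_definition = {
    'table_container_def': {
        'key': 'schemas',
        'type': 'schema',
        'name': 'schema_name',
        'fields': [
            {'source': 'schema_owner', 'target': 'owner'},
            {'source': 'schema_description', 'target': 'desc'},
        ],
    },
    'table_def': {
        'key': 'tables',
        'type': 'table',
        'name': 'table_name',
        'fields': [
            {'source': 'table_rows', 'target': 'num_rows'},
            {'source': 'table_description', 'target': 'desc'},
        ],
    },
    'column_def': {
        'key': 'columns',
        'type': 'column',
        'name': 'column_name',
        'fields': [
            {'source': 'data_type', 'target': 'type'},  # mandatory target
            {'source': 'column_description', 'target': 'desc'},
        ],
    },
}

with open('metadata_definition.json', 'w') as f:
    json.dump(metadata_definition, f, indent=2)
```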
4.1 Add support for optional queries
You can use a user configuration file and execute optional queries to scrape additional metadata from a database. You would need to do the following:
- Create additional SQL queries. At the moment, support for two optional queries is implemented in this common RDBMS package:
  - a statement for refreshing metadata (e.g. ANALYZE)
  - a query to scrape the number of rows in each table
- Extend the query_assembler class and implement the methods for getting the optional queries: _get_refresh_statement and _get_path_to_num_rows_query.
- Implement _get_query_assembler and _execute_refresh_query on the extension of the metadata_scraper class (see the sketch after this list).
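A minimal sketch of both extensions is shown below. It assumes the base classes live in google.datacatalog_connectors.rdbms.scrape and that the method signatures match the PostgreSQL/MySQL connectors; the num_rows_query.sql file name and its location under config are also assumptions.

```python
import os

from google.datacatalog_connectors.rdbms.scrape import (metadata_scraper,
                                                         query_assembler)


class MyDatabaseQueryAssembler(query_assembler.QueryAssembler):
    """Hypothetical assembler providing the two optional queries."""

    def _get_refresh_statement(self, table_name):
        # Statement used to refresh metadata before scraping (e.g. ANALYZE).
        return 'ANALYZE {};'.format(table_name)

    def _get_path_to_num_rows_query(self):
        # Query used to scrape the number of rows in each table; the file
        # name and its location under config/ are assumptions.
        return os.path.join(
            os.path.dirname(os.path.abspath(__file__)),
            'config', 'num_rows_query.sql')


class MyDatabaseMetadataScraper(metadata_scraper.MetadataScraper):
    # This class would also implement _create_rdbms_connection,
    # as sketched earlier in this section.

    def _get_query_assembler(self):
        return MyDatabaseQueryAssembler()

    def _execute_refresh_query(self, cursor, refresh_statement):
        # How the refresh statement is executed is database specific;
        # a plain cursor.execute is assumed here.
        cursor.execute(refresh_statement)
```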
You can see working examples of implementing optional queries in the connectors code for PostgreSQL and MySQL. Please also refer to these samples to see what the user configuration file ingest_cfg.yaml should look like.
4.2 Add support for SQL Objects
You can use the SQL Objects mechanism from the SQL connector to ingest SQL objects such as Functions, Stored Procedures, Views, Materialized Views, and so on. This mechanism uses a naming convention to locate a SQL query and a metadata definition file.
You can see working examples of implementing SQL Objects in the test sources. To enable it, you need 3 configuration files:
- ingest_cfg.yaml

      sql_objects:
        - name: 'functions'
          enabled: True
        - name: 'stored_procedures'
          enabled: True

  The ingest_cfg.yaml file must be located at the connector execution directory. You can find a sample in the SAP HANA connector. You can specify a list of SQL Objects with a flag to enable/disable the SQL Objects mechanism; it will only be considered for objects whose enabled flag is set to True. The related query and metadata_definition files are located by the sql_objects.name value. From the above sample, the functions SQL Object requires two files: query_functions_sql_object.sql and metadata_definition_functions_sql_object.json, and the stored_procedures SQL Object requires query_stored_procedures_sql_object.sql and metadata_definition_stored_procedures_sql_object.json.

- query_functions_sql_object.sql

  Implement a query file that matches the SQL Object name, following the pattern query_{name}_sql_object.sql. This file must be located at the connector config directory. You can find a sample in the SAP HANA connector.

- metadata_definition_functions_sql_object.json

  Implement a metadata definition file that matches the SQL Object name, following the pattern metadata_definition_{sql_objects.name}_sql_object.json. This file must be located at the connector config directory. You can find a sample in the SAP HANA connector.

  These are the required attributes for the metadata_definition file:

  Field | Description |
  ---|---|
  key | Key used to look up the config files. |
  type | SQL Object type. |
  name | Name of the SQL Object. |
  fields | Fields definition scraped from the query. |

  The fields attribute needs to implement the source and target definitions:

  Field | Description |
  ---|---|
  fields.source | Name of the field scraped by the query. |
  fields.target | Object with target field attributes. |
  fields.target.field_name | Name of the field when translated to Data Catalog attributes. |
  fields.target.model | Type of Data Catalog model; can be tag or entry. |
  fields.target.type | Type of the field to be translated to Data Catalog attributes. |
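To illustrate how those attributes fit together, here is a hypothetical content for metadata_definition_functions_sql_object.json, shown as a Python dict. The attribute names follow the tables above, but the concrete key, source column names, and field types are made up for illustration; use the SAP HANA connector and the test sources as the authoritative reference.

```python
import json

# Hypothetical content for metadata_definition_functions_sql_object.json.
# Attribute names follow the tables above; concrete values are assumptions.
functions_sql_object_definition = {
    'key': 'functions',    # key used to look up the config files
    'type': 'function',    # SQL Object type
    'name': 'functions',   # name of the SQL Object
    'fields': [
        {
            'source': 'function_name',     # field scraped by the query
            'target': {
                'field_name': 'name',      # name in Data Catalog
                'model': 'entry',          # Data Catalog model: tag or entry
                'type': 'string',          # type translated to Data Catalog
            },
        },
        {
            'source': 'definition',
            'target': {
                'field_name': 'definition',
                'model': 'tag',
                'type': 'string',
            },
        },
    ],
}

with open('metadata_definition_functions_sql_object.json', 'w') as f:
    json.dump(functions_sql_object_definition, f, indent=2)
```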
The SQL Object will be skipped if any of the 3 configuration files are missing.