Commons library for ingesting RDBMS metadata into Google Cloud Data Catalog
google-datacatalog-rdbms-connector
Common resources for Data Catalog RDBMS connectors.
Disclaimer: This is not an officially supported Google product.
Table of Contents
- 1. Installation
- 2. Install from source
- 3. Developer environment
- 4. Setting up the RDBMS on a new connector
1. Installation
Install this library in a virtualenv using pip. virtualenv is a tool to create isolated Python environments. The basic problem it addresses is one of dependencies and versions, and indirectly permissions.
With virtualenv, it's possible to install this library without needing system install permissions, and without clashing with the installed system dependencies. Make sure you use Python 3.6+.
1.1. Mac/Linux
pip3 install virtualenv
virtualenv --python python3.6 <your-env>
source <your-env>/bin/activate
<your-env>/bin/pip install google-datacatalog-rdbms-connector
1.2. Windows
pip3 install virtualenv
virtualenv --python python3.6 <your-env>
<your-env>\Scripts\activate
<your-env>\Scripts\pip.exe install google-datacatalog-rdbms-connector
2. Install from source
2.1. Get the code
git clone https://github.com/GoogleCloudPlatform/datacatalog-connectors-rdbms/
cd datacatalog-connectors-rdbms/google-datacatalog-rdbms-connector
2.2. Virtualenv
Using virtualenv is optional, but strongly recommended.
2.2.1. Install Python 3.6
2.2.2. Create and activate a virtualenv
pip3 install virtualenv
virtualenv --python python3.6 <your-env>
source <your-env>/bin/activate
2.2.3. Install
pip install .
3. Developer environment
3.1. Install and run YAPF formatter
pip install --upgrade yapf
# Auto update files
yapf --in-place --recursive src tests
# Show diff
yapf --diff --recursive src tests
# Set up pre-commit hook
# From the root of your git project.
curl -o pre-commit.sh https://raw.githubusercontent.com/google/yapf/master/plugins/pre-commit.sh
chmod a+x pre-commit.sh
mv pre-commit.sh .git/hooks/pre-commit
3.2. Install and run Flake8 linter
pip install --upgrade flake8
flake8 src tests
3.3. Install the package in editable mode (i.e. setuptools “develop mode”)
pip install --editable .
3.4. Run the unit tests
python setup.py test
4. Setting up the RDBMS on a new connector
To set up the RDBMS connector to work with a relational database, three files are needed:
- metadata_definition.json
- metadata_query.sql
- A module that extends the metadata_scraper class and implements your RDBMS connection method, _create_rdbms_connection
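Extending the scraper class and implementing _create_rdbms_connection, as mentioned above, can be sketched roughly as follows. This is a minimal illustration, not the library's actual code: MetadataScraperBase is a hypothetical stand-in for the real metadata_scraper class (its import path and constructor are not shown in this document), the run_query helper and the shape of connection_args (a dict with a database key) are assumptions, and sqlite3 is used only so the sketch runs without an external database driver.

```python
import sqlite3
from abc import ABC, abstractmethod


class MetadataScraperBase(ABC):
    """Hypothetical stand-in for the library's metadata_scraper class."""

    @abstractmethod
    def _create_rdbms_connection(self, connection_args):
        """Return an open DB-API connection for the target database."""

    def run_query(self, connection_args, query):
        # Hypothetical helper: open a connection through the hook,
        # run the query, and always close the connection afterwards.
        con = self._create_rdbms_connection(connection_args)
        try:
            cur = con.cursor()
            cur.execute(query)
            return cur.fetchall()
        finally:
            con.close()


class SQLiteMetadataScraper(MetadataScraperBase):
    """Example subclass; a real connector would use its own driver here."""

    def _create_rdbms_connection(self, connection_args):
        # Assumption: connection_args is a dict of connection settings.
        return sqlite3.connect(connection_args['database'])


rows = SQLiteMetadataScraper().run_query({'database': ':memory:'}, 'SELECT 1')
print(rows)
```

In a real connector the subclass would derive from the library's own class and return a connection created by the database's driver; the existing connectors are the authoritative reference for the exact signature.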
For the metadata_definition file, fields are available at three levels:
- table_container_def
- table_def
- column_def
If you want working examples please take a look at the already implemented connectors for: Oracle, Teradata, MySQL, PostgreSQL, Greenplum, Redshift and SQLServer.
For the metadata_definition target fields, the following targets are available:
| Level | Target | Description | Mandatory |
|---|---|---|---|
| table_container_def | creator | Creator of the Table Container. | N |
| table_container_def | owner | Owner of the Table Container. | N |
| table_container_def | update_user | Last user that updated the Table Container. | N |
| table_container_def | desc | Table Container Description. | N |
| table_def | num_rows | Number of rows contained in the Table. | N |
| table_def | creator | Creator of the Table. | N |
| table_def | owner | Owner of the Table. | N |
| table_def | update_user | Last user that updated the Table. | N |
| table_def | desc | Table Description. | N |
| table_def | table_size_MB | Table size, in MB. | N |
If these fields are configured, they are used to create Tags.
For columns, the fields are used to create the Data Catalog Entry schema. Two target fields are available:
| Level | Target | Description | Mandatory |
|---|---|---|---|
| column_def | type | Column type. | Y |
| column_def | desc | Column description. | N |
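The targets listed in the two tables above can be captured in a small lookup and used to sanity-check a definition before running a connector. The sketch below is only an illustration: the level and target names come from the tables, but the check_targets function and the simplified {level: [targets]} input shape are hypothetical and do not reflect the actual layout of metadata_definition.json.

```python
# Targets per level, taken from the tables above: {target: is_mandatory}
ALLOWED_TARGETS = {
    'table_container_def': {
        'creator': False, 'owner': False, 'update_user': False, 'desc': False,
    },
    'table_def': {
        'num_rows': False, 'creator': False, 'owner': False,
        'update_user': False, 'desc': False, 'table_size_MB': False,
    },
    'column_def': {'type': True, 'desc': False},
}


def check_targets(configured):
    """Return a list of problems for a {level: [target, ...]} mapping."""
    problems = []
    # Flag levels or targets that the tables do not list.
    for level, targets in configured.items():
        allowed = ALLOWED_TARGETS.get(level)
        if allowed is None:
            problems.append(f'unknown level: {level}')
            continue
        for target in targets:
            if target not in allowed:
                problems.append(f'{level}: unknown target {target}')
    # Flag mandatory targets that were not configured.
    for level, allowed in ALLOWED_TARGETS.items():
        present = configured.get(level, [])
        for target, mandatory in allowed.items():
            if mandatory and target not in present:
                problems.append(f'{level}: missing mandatory target {target}')
    return problems


ok = check_targets({'table_def': ['num_rows', 'owner'],
                    'column_def': ['type', 'desc']})
bad = check_targets({'table_def': ['rows'], 'column_def': ['desc']})
print(ok)
print(bad)
```

Here the second call reports both a target that is not in the table_def list and the missing mandatory column type.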
4.1. Add support for optional queries
You can use a user configuration file to execute optional queries that scrape additional metadata from a database. To do so:
- Create the additional SQL queries. At the moment, this common RDBMS package supports two optional queries:
  - Statement for refreshing metadata (e.g. ANALYZE)
  - Query to scrape the number of rows in each table
- Extend the query_assembler class and implement the methods for getting optional queries: _get_refresh_statement and _get_path_to_num_rows_query.
- Implement _get_query_assembler and _execute_refresh_query on the extension of the metadata_scraper class.
You can see working examples of implementing optional queries in the connectors code for PostgreSQL and MySQL.
Please also refer to these samples to see what the user configuration file ingest_cfg.yaml should look like.
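As a rough sketch of the query assembler extension described in this section, the example below implements the two optional-query methods on a subclass. QueryAssemblerBase is a hypothetical stand-in for the library's query_assembler class, and the method parameters, the get_refresh_statements helper, and the SQL file location are assumptions made for illustration; the PostgreSQL and MySQL connectors remain the reference for the real signatures.

```python
import os


class QueryAssemblerBase:
    """Hypothetical stand-in for the library's query_assembler class."""

    def get_refresh_statements(self, table_names):
        # Hypothetical helper: one refresh statement per table, via the hook.
        return [self._get_refresh_statement(name) for name in table_names]

    def _get_refresh_statement(self, table_name):
        raise NotImplementedError

    def _get_path_to_num_rows_query(self):
        raise NotImplementedError


class ExampleQueryAssembler(QueryAssemblerBase):
    def _get_refresh_statement(self, table_name):
        # ANALYZE is the refresh example named in the text;
        # the exact syntax varies by database.
        return 'ANALYZE {}'.format(table_name)

    def _get_path_to_num_rows_query(self):
        # Hypothetical location of the SQL file with the row-count query.
        return os.path.join('config', 'num_rows_query.sql')


assembler = ExampleQueryAssembler()
statements = assembler.get_refresh_statements(['public.orders', 'public.users'])
num_rows_path = assembler._get_path_to_num_rows_query()
print(statements)
print(num_rows_path)
```

The scraper extension would then return such an assembler from _get_query_assembler and run the refresh statements in _execute_refresh_query.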