Library for ingesting Hive metadata into Google Cloud Data Catalog

google-datacatalog-hive-connector

Library for ingesting Hive metadata into Google Cloud Data Catalog. You can either connect directly to your Hive Metastore or consume message events using Cloud Run.

This connector is designed to work with Hive Metastore version 2.3.0, backed by a PostgreSQL or MySQL database.

Disclaimer: This is not an officially supported Google product.

1. Installation

Install this library in a virtualenv using pip. virtualenv is a tool to create isolated Python environments. The basic problem it addresses is one of dependencies and versions, and indirectly permissions.

With virtualenv, it's possible to install this library without needing system install permissions, and without clashing with the installed system dependencies. Make sure you use Python 3.7+.

1.1. Mac/Linux

pip3 install virtualenv
virtualenv --python python3.7 <your-env>
source <your-env>/bin/activate
<your-env>/bin/pip install google-datacatalog-hive-connector

1.2. Windows

pip3 install virtualenv
virtualenv --python python3.7 <your-env>
<your-env>\Scripts\activate
<your-env>\Scripts\pip.exe install google-datacatalog-hive-connector

1.3. Install from source

1.3.1. Get the code

git clone https://github.com/GoogleCloudPlatform/datacatalog-connectors-hive/
cd datacatalog-connectors-hive/google-datacatalog-hive-connector

1.3.2. Create and activate a virtualenv

pip3 install virtualenv
virtualenv --python python3.7 <your-env> 
source <your-env>/bin/activate

2. Environment setup

2.1. Auth credentials

2.1.1. Create a service account and grant it the following role

  • Data Catalog Admin

2.1.2. Download a JSON key and save it as

  • <YOUR-CREDENTIALS_FILES_FOLDER>/hive2dc-datacatalog-credentials.json

Note that this folder and file are required in the next steps.
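
Before moving on, it can help to confirm that the downloaded key looks like a service account key. The following is a minimal sketch, not part of the connector: check_service_account_key is a hypothetical helper name, and the field list reflects what a service account JSON key normally contains.

```python
import json
from pathlib import Path

# Fields that a Google Cloud service account JSON key normally contains.
REQUIRED_KEY_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_service_account_key(path):
    """Return a list of problems found in the key file (empty if none).

    Hypothetical helper for illustration; it only inspects the file
    locally and never contacts Google Cloud.
    """
    key_path = Path(path)
    if not key_path.is_file():
        return [f"file not found: {key_path}"]
    try:
        key = json.loads(key_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err}"]
    if not isinstance(key, dict):
        return ["top-level JSON value is not an object"]
    problems = [
        f"missing field: {name}"
        for name in REQUIRED_KEY_FIELDS
        if name not in key
    ]
    # A key downloaded for a service account has type "service_account".
    if "type" in key and key["type"] != "service_account":
        problems.append(f"unexpected type: {key['type']!r}")
    return problems
```

Calling check_service_account_key with the path to hive2dc-datacatalog-credentials.json should return an empty list for a well-formed key.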

2.2. Set environment variables to connect to your Hive Metastore

Replace the values below according to your environment:

export GOOGLE_APPLICATION_CREDENTIALS=data_catalog_credentials_file

export HIVE2DC_DATACATALOG_PROJECT_ID=google_cloud_project_id
export HIVE2DC_DATACATALOG_LOCATION_ID=us-google_cloud_location_id
export HIVE2DC_HIVE_METASTORE_DB_HOST=hive_metastore_db_server
export HIVE2DC_HIVE_METASTORE_DB_USER=hive_metastore_db_user
export HIVE2DC_HIVE_METASTORE_DB_PASS=hive_metastore_db_pass
export HIVE2DC_HIVE_METASTORE_DB_NAME=hive_metastore_db_name
export HIVE2DC_HIVE_METASTORE_DB_TYPE=mysql  # or postgres

Set HIVE2DC_HIVE_METASTORE_DB_TYPE to mysql if you are connecting to a MySQL-backed Hive Metastore, or to postgres if it is PostgreSQL-backed.
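
A quick preflight check can catch a missing or mistyped variable before the connector runs. This is a minimal sketch under stated assumptions: find_env_problems is a hypothetical helper name, and the variable list simply mirrors the exports above.

```python
import os

# Variables exported in this section.
REQUIRED_VARS = (
    "GOOGLE_APPLICATION_CREDENTIALS",
    "HIVE2DC_DATACATALOG_PROJECT_ID",
    "HIVE2DC_DATACATALOG_LOCATION_ID",
    "HIVE2DC_HIVE_METASTORE_DB_HOST",
    "HIVE2DC_HIVE_METASTORE_DB_USER",
    "HIVE2DC_HIVE_METASTORE_DB_PASS",
    "HIVE2DC_HIVE_METASTORE_DB_NAME",
    "HIVE2DC_HIVE_METASTORE_DB_TYPE",
)
SUPPORTED_DB_TYPES = ("mysql", "postgres")


def find_env_problems(env=None):
    """Return a list of human-readable problems (empty if none).

    Hypothetical helper for illustration; pass a dict to check values
    other than the current process environment.
    """
    env = os.environ if env is None else env
    problems = [
        f"{name} is not set" for name in REQUIRED_VARS if not env.get(name)
    ]
    db_type = env.get("HIVE2DC_HIVE_METASTORE_DB_TYPE")
    if db_type and db_type not in SUPPORTED_DB_TYPES:
        problems.append(
            "HIVE2DC_HIVE_METASTORE_DB_TYPE must be mysql or postgres, "
            f"got {db_type!r}"
        )
    return problems
```

An empty list means every variable is set and the database type is one of the two supported values.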

3. Run entry point

3.1. Run Python entry point

  • Virtualenv
google-datacatalog-hive-connector \
--datacatalog-project-id=$HIVE2DC_DATACATALOG_PROJECT_ID \
--datacatalog-location-id=$HIVE2DC_DATACATALOG_LOCATION_ID \
--hive-metastore-db-host=$HIVE2DC_HIVE_METASTORE_DB_HOST \
--hive-metastore-db-user=$HIVE2DC_HIVE_METASTORE_DB_USER \
--hive-metastore-db-pass=$HIVE2DC_HIVE_METASTORE_DB_PASS \
--hive-metastore-db-name=$HIVE2DC_HIVE_METASTORE_DB_NAME \
--hive-metastore-db-type=$HIVE2DC_HIVE_METASTORE_DB_TYPE    
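
For scheduled runs, the same invocation can be assembled from Python. A minimal sketch, assuming the environment variables from section 2.2 are set and the google-datacatalog-hive-connector command is on the PATH; build_connector_command and run_connector are hypothetical helper names, not part of the library.

```python
import os
import subprocess

# Command-line flags shown above, mapped to the environment variables
# that hold their values.
FLAG_TO_ENV = {
    "--datacatalog-project-id": "HIVE2DC_DATACATALOG_PROJECT_ID",
    "--datacatalog-location-id": "HIVE2DC_DATACATALOG_LOCATION_ID",
    "--hive-metastore-db-host": "HIVE2DC_HIVE_METASTORE_DB_HOST",
    "--hive-metastore-db-user": "HIVE2DC_HIVE_METASTORE_DB_USER",
    "--hive-metastore-db-pass": "HIVE2DC_HIVE_METASTORE_DB_PASS",
    "--hive-metastore-db-name": "HIVE2DC_HIVE_METASTORE_DB_NAME",
    "--hive-metastore-db-type": "HIVE2DC_HIVE_METASTORE_DB_TYPE",
}


def build_connector_command(env=None):
    """Build the argument list; raises KeyError if a variable is missing."""
    env = os.environ if env is None else env
    command = ["google-datacatalog-hive-connector"]
    for flag, var in FLAG_TO_ENV.items():
        command.append(f"{flag}={env[var]}")
    return command


def run_connector(env=None):
    """Run the connector and raise CalledProcessError on a non-zero exit."""
    return subprocess.run(build_connector_command(env), check=True)
```

Passing the arguments as a list avoids shell quoting problems when the database password contains special characters.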

3.2. Run Docker entry point

If your Hive Metastore database is running on localhost, pass --network="host":

docker build -t hive2datacatalog .
docker run --network="host" --rm --tty -v data:/data hive2datacatalog \
--datacatalog-project-id=$HIVE2DC_DATACATALOG_PROJECT_ID \
--datacatalog-location-id=$HIVE2DC_DATACATALOG_LOCATION_ID \
--hive-metastore-db-host=$HIVE2DC_HIVE_METASTORE_DB_HOST \
--hive-metastore-db-user=$HIVE2DC_HIVE_METASTORE_DB_USER \
--hive-metastore-db-pass=$HIVE2DC_HIVE_METASTORE_DB_PASS \
--hive-metastore-db-name=$HIVE2DC_HIVE_METASTORE_DB_NAME \
--hive-metastore-db-type=$HIVE2DC_HIVE_METASTORE_DB_TYPE

4. Deploy Message Event Consumer on Cloud Run (Optional)

4.1. Set environment variables to deploy to Cloud Run

Replace the values below according to your environment:

export GOOGLE_APPLICATION_CREDENTIALS=data_catalog_credentials_file

export HIVE2DC_DATACATALOG_PROJECT_ID=google_cloud_project_id
export HIVE2DC_DATACATALOG_LOCATION_ID=us-google_cloud_location_id

4.2. Execute the deploy script

source deploy.sh

If the deploy succeeds, you will be shown the Cloud Run endpoint, for example: https://hive-sync-example-uc.a.run.app

Save the endpoint; it is needed in the next step.

4.3. Create your Pub/Sub topic and subscription

4.3.1 Set additional environment variables

Replace the values below with your Pub/Sub topic ID and your Cloud Run endpoint:

export HIVE2DC_DATACATALOG_TOPIC_ID=google_cloud_topic_id
export HIVE2DC_DATACATALOG_APP_ENDPOINT=https://hive-sync-example-uc.a.run.app

4.3.2 Execute the Pub/Sub config script

source tools/create_pub_sub_run_invoker.sh

4.3.3 Send a message to your Pub/Sub topic to test

You can find examples of valid message events in: tools/resources/*.json
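
One way to send a test message is with the Pub/Sub Python client. The following is a minimal sketch under several assumptions: the google-cloud-pubsub package is installed separately, the topic lives in the project you pass in, and the consumer accepts the JSON file content as the message body. load_event_payload and publish_event are hypothetical helper names.

```python
import json
from pathlib import Path


def load_event_payload(path):
    """Read a JSON message event file and return it as UTF-8 bytes.

    Parsing and re-serializing confirms the file holds valid JSON
    before anything is sent.
    """
    event = json.loads(Path(path).read_text(encoding="utf-8"))
    return json.dumps(event).encode("utf-8")


def publish_event(project_id, topic_id, payload):
    """Publish the payload and return the Pub/Sub message ID."""
    # Imported here so load_event_payload works even when the
    # google-cloud-pubsub package is not installed.
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, topic_id)
    future = publisher.publish(topic_path, payload)
    return future.result()
```

A typical call would load one of the files under tools/resources and pass the result to publish_event together with your project ID and the value of HIVE2DC_DATACATALOG_TOPIC_ID.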

5. Tools (Optional)

5.1. Clean up all entries in Data Catalog from the hive entry group

Run:

python tools/cleanup_datacatalog.py

5.2. Sample of Hive2Datacatalog Library usage

Run:

python tools/hive2datacatalog_client_sample.py

6. Developer environment

6.1. Install and run Yapf formatter

pip install --upgrade yapf

# Auto update files
yapf --in-place --recursive src tests

# Show diff
yapf --diff --recursive src tests

# Set up pre-commit hook
# From the root of your git project.
curl -o pre-commit.sh https://raw.githubusercontent.com/google/yapf/master/plugins/pre-commit.sh
chmod a+x pre-commit.sh
mv pre-commit.sh .git/hooks/pre-commit

6.2. Install and run Flake8 linter

pip install --upgrade flake8
flake8 src tests

6.3. Run Tests

python setup.py test

7. Metrics

Metrics README.md

8. Connector Architecture

Architecture README.md

9. Troubleshooting

If you receive the error:

OSError: mysql_config not found

or

sqlalchemy.exc.NoSuchModuleError: Can't load plugin: sqlalchemy.dialects:MySQL

This means some system libraries or the MySQL driver were not found on the machine. On Linux, try installing them by running:

sudo apt-get install libmysqlclient-dev python-dev

If the package libmysqlclient-dev is not available, use default-libmysqlclient-dev:

sudo apt-get install default-libmysqlclient-dev python-dev

If a connector execution hits the Data Catalog quota limit, an error is raised and logged with details like the following, depending on the operation performed (READ/WRITE/SEARCH):

status = StatusCode.RESOURCE_EXHAUSTED
details = "Quota exceeded for quota metric 'Read requests' and limit 'Read requests per minute' of service 'datacatalog.googleapis.com' for consumer 'project_number:1111111111111'."
debug_error_string = 
"{"created":"@1587396969.506556000", "description":"Error received from peer ipv4:172.217.29.42:443","file":"src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Quota exceeded for quota metric 'Read requests' and limit 'Read requests per minute' of service 'datacatalog.googleapis.com' for consumer 'project_number:1111111111111'.","grpc_status":8}"

For more information about Data Catalog quotas, see the Data Catalog quota docs.
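
When scanning logs, it can be useful to pull out which quota was hit. A minimal sketch, assuming the details string follows the wording of the example above; parse_quota_error is a hypothetical helper name and the pattern may need adjusting if the service changes its message format.

```python
import re

# Matches the "details" text shown in the example error above.
QUOTA_PATTERN = re.compile(
    r"Quota exceeded for quota metric '(?P<metric>[^']+)' "
    r"and limit '(?P<limit>[^']+)' "
    r"of service '(?P<service>[^']+)'"
)


def parse_quota_error(details):
    """Return metric, limit and service as a dict, or None if no match."""
    match = QUOTA_PATTERN.search(details)
    return match.groupdict() if match else None
```

For the example error above, the metric is 'Read requests', which indicates that a READ operation hit the per-minute limit.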

