

Project description

Datacatalog Fileset Processor


A package to manage Google Cloud Data Catalog Fileset scripts.

Disclaimer: This is not an officially supported Google product.


Executing in Cloud Shell

# Set your SERVICE ACCOUNT, for instructions go to 1.3. Auth credentials
# This name is just a suggestion, feel free to name it following your naming conventions
export GOOGLE_APPLICATION_CREDENTIALS=~/datacatalog-fileset-processor-sa.json

# Install datacatalog-fileset-processor
pip3 install datacatalog-fileset-processor --user

# Add to your PATH
export PATH=~/.local/bin:$PATH

# Look for available commands
datacatalog-fileset-processor --help

1. Environment setup

1.1. Python + virtualenv

Using virtualenv is optional, but strongly recommended unless you use Docker.

1.1.1. Install Python 3.6+

1.1.2. Get the source code

git clone https://github.com/mesmacosta/datacatalog-fileset-processor
cd ./datacatalog-fileset-processor

All paths starting with ./ in the next steps are relative to the datacatalog-fileset-processor folder.

1.1.3. Create and activate an isolated Python environment

pip install --upgrade virtualenv
python3 -m virtualenv --python python3 env
source ./env/bin/activate

1.1.4. Install the package

pip install --upgrade .

1.2. Docker

Docker may be used as an alternative to run the script. In this case, please disregard the Virtualenv setup instructions.
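As a hedged sketch of the Docker workflow (the image tag, mount path, and file names below are illustrative, not fixed by the package; check the repository's Dockerfile for the actual entrypoint):

```shell
# Build the image from the repository root (assumes the repo ships a
# Dockerfile whose entrypoint is the datacatalog-fileset-processor CLI).
docker build --rm --tag datacatalog-fileset-processor .

# Run the container, mounting a local folder that holds both the service
# account key and the CSV file (folder and file names are illustrative).
docker run --rm --tty \
  -v "$(pwd)/data:/data" \
  -e GOOGLE_APPLICATION_CREDENTIALS=/data/datacatalog-fileset-processor-sa.json \
  datacatalog-fileset-processor filesets create --csv-file /data/filesets.csv
```

Because the credentials are passed via the mounted volume and the GOOGLE_APPLICATION_CREDENTIALS variable, the host-side export in section 1.3.3 is not needed when running through Docker.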

1.3. Auth credentials

1.3.1. Create a service account and grant it the roles below

  • Data Catalog Admin

1.3.2. Download a JSON key and save it as

This name is just a suggestion; feel free to follow your own naming conventions.

  • ./credentials/datacatalog-fileset-processor-sa.json

1.3.3. Set the environment variables

This step may be skipped if you're using Docker.

export GOOGLE_APPLICATION_CREDENTIALS=./credentials/datacatalog-fileset-processor-sa.json

2. Create Filesets from CSV file

2.1. Create a CSV file representing the Entry Groups and Entries to be created

Each Fileset spans as many CSV lines as needed to represent all of its fields (for example, one line per schema column). The columns are described as follows:

Column                     Description                 Mandatory
entry_group_name           Entry Group name.           Y
entry_group_display_name   Entry Group display name.   N
entry_group_description    Entry Group description.    N
entry_id                   Entry ID.                   Y
entry_display_name         Entry display name.         Y
entry_description          Entry description.          N
entry_file_patterns        Entry file patterns.        Y
schema_column_name         Schema column name.         N
schema_column_type         Schema column type.         N
schema_column_description  Schema column description.  N
schema_column_mode         Schema column mode.         N

Please note that schema_column_type is an open string field and accepts any value. If you want to use your Fileset with Dataflow SQL, follow the data types listed in the official docs.
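As an illustration, the snippet below writes a minimal CSV defining one Entry with a two-column schema. All project, bucket, and entry names are hypothetical, and the convention of leaving non-schema fields blank on continuation rows is an assumption; check the sample files in the repository for the exact multi-row layout.

```shell
# Write a sample CSV (hypothetical names; second data row adds another
# schema column to the same entry, repeating only entry_id).
cat > sample-filesets.csv <<'EOF'
entry_group_name,entry_group_display_name,entry_group_description,entry_id,entry_display_name,entry_description,entry_file_patterns,schema_column_name,schema_column_type,schema_column_description,schema_column_mode
projects/my-project/locations/us-central1/entryGroups/my_entry_group,My Entry Group,Demo group,my_fileset,My Fileset,Demo fileset,gs://my_bucket/*.csv,first_name,STRING,First name,REQUIRED
,,,my_fileset,,,,last_name,STRING,Last name,NULLABLE
EOF
```

A file like this would then be passed to the script via --csv-file sample-filesets.csv, as shown in the next steps.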

2.2. Run the datacatalog-fileset-processor script to create the Fileset Entry Groups and Entries

  • Python + virtualenv
datacatalog-fileset-processor filesets create --csv-file CSV_FILE_PATH

2.3. Run the datacatalog-fileset-processor script to delete the Fileset Entry Groups and Entries

  • Python + virtualenv
datacatalog-fileset-processor filesets delete --csv-file CSV_FILE_PATH

History

0.1.0 (2020-04-24)

  • First release on PyPI.


Download files

Download the file for your platform.

Source Distribution

datacatalog-fileset-processor-0.1.5.tar.gz (16.7 kB)

Built Distribution

datacatalog_fileset_processor-0.1.5-py2.py3-none-any.whl (12.9 kB)

File details

Details for the file datacatalog-fileset-processor-0.1.5.tar.gz.

File metadata

  • Download URL: datacatalog-fileset-processor-0.1.5.tar.gz
  • Size: 16.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.25.0 setuptools/50.3.2 requests-toolbelt/0.9.1 tqdm/4.54.0 CPython/3.7.0

File hashes

Hashes for datacatalog-fileset-processor-0.1.5.tar.gz
Algorithm    Hash digest
SHA256       6b6abcbc05b1aedfc3b5c2c6f01109f27d0d1dbf422d5854fe6efb3f80d1b92b
MD5          983e17c6a192c309621f0aba73a184a4
BLAKE2b-256  d683cbcaabbc26860a2a2e9a61dc51fafcc086ff7afef297eec012f40410a07e


File details

Details for the file datacatalog_fileset_processor-0.1.5-py2.py3-none-any.whl.

File metadata

  • Download URL: datacatalog_fileset_processor-0.1.5-py2.py3-none-any.whl
  • Size: 12.9 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.25.0 setuptools/50.3.2 requests-toolbelt/0.9.1 tqdm/4.54.0 CPython/3.7.0

File hashes

Hashes for datacatalog_fileset_processor-0.1.5-py2.py3-none-any.whl
Algorithm    Hash digest
SHA256       18887a5fcf66d77a5aa62d00793ab181c5641a1db5752d25b5b11d551e569d49
MD5          24b187a26c59ab67f6e2d406a7b1341d
BLAKE2b-256  a1bfeb4b05f0bc5b4373907eab61d14052ad24598912ee0b606cff0d9c03d3a5

