Datacatalog Fileset Processor

A package to manage Google Cloud Data Catalog Fileset scripts.

Disclaimer: This is not an officially supported Google product.
Table of Contents
- Executing in Cloud Shell
- 1. Environment setup
- 2. Create Filesets from CSV file
Executing in Cloud Shell

```bash
# Set your SERVICE ACCOUNT; for instructions see 1.3. Auth credentials.
# This name is just a suggestion; feel free to follow your own naming conventions.
export GOOGLE_APPLICATION_CREDENTIALS=~/datacatalog-fileset-processor-sa.json

# Install datacatalog-fileset-processor.
pip3 install datacatalog-fileset-processor --user

# Add it to your PATH.
export PATH=~/.local/bin:$PATH

# List the available commands.
datacatalog-fileset-processor --help
```
1. Environment setup
1.1. Python + virtualenv
Using virtualenv is optional, but strongly recommended unless you use Docker.
1.1.1. Install Python 3.6+
1.1.2. Get the source code
```bash
git clone https://github.com/mesmacosta/datacatalog-fileset-processor
cd ./datacatalog-fileset-processor
```
All paths starting with `./` in the next steps are relative to the datacatalog-fileset-processor folder.
1.1.3. Create and activate an isolated Python environment
```bash
pip install --upgrade virtualenv
python3 -m virtualenv --python python3 env
source ./env/bin/activate
```
1.1.4. Install the package
```bash
pip install --upgrade .
```
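To confirm the install succeeded, you can query the package metadata from Python. This is a minimal sketch (not part of the package, and it assumes Python 3.8+ for `importlib.metadata`); it returns the installed version string, or `None` if the distribution is not found.

```python
from importlib import metadata

def installed_version(dist="datacatalog-fileset-processor"):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist)
    except metadata.PackageNotFoundError:
        return None
```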
1.2. Docker
Docker may be used as an alternative to run the script. In this case, please disregard the Virtualenv setup instructions.
1.3. Auth credentials
1.3.1. Create a service account and grant it the role below
- Data Catalog Admin
1.3.2. Download a JSON key and save it as `./credentials/datacatalog-fileset-processor-sa.json`
This name is just a suggestion; feel free to follow your own naming conventions.
1.3.3. Set the environment variables
This step may be skipped if you're using Docker.
```bash
export GOOGLE_APPLICATION_CREDENTIALS=~/credentials/datacatalog-fileset-processor-sa.json
```
2. Create Filesets from CSV file
2.1. Create a CSV file representing the Entry Groups and Entries to be created
Each fileset spans as many CSV lines as needed to represent all of its fields (for example, one line per schema column). The columns are described as follows:
| Column | Description | Mandatory |
| --- | --- | --- |
| `entry_group_name` | Entry Group Name. | Y |
| `entry_group_display_name` | Entry Group Display Name. | N |
| `entry_group_description` | Entry Group Description. | N |
| `entry_id` | Entry ID. | Y |
| `entry_display_name` | Entry Display Name. | Y |
| `entry_description` | Entry Description. | N |
| `entry_file_patterns` | Entry File Patterns. | Y |
| `schema_column_name` | Schema column name. | N |
| `schema_column_type` | Schema column type. | N |
| `schema_column_description` | Schema column description. | N |
| `schema_column_mode` | Schema column mode. | N |
Please note that `schema_column_type` is an open string field and accepts any value. If you want to use your fileset with Dataflow SQL, follow the data types in the official docs.
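The columns above can be assembled into a valid input file with the standard `csv` module. The snippet below is an illustrative sketch: the column names follow the table, but the entry group, entry, bucket, and schema values are made up.

```python
import csv
import io

# Column names as documented in the table above.
COLUMNS = [
    "entry_group_name", "entry_group_display_name", "entry_group_description",
    "entry_id", "entry_display_name", "entry_description",
    "entry_file_patterns",
    "schema_column_name", "schema_column_type",
    "schema_column_description", "schema_column_mode",
]

def build_fileset_csv(rows):
    """Render fileset rows (dicts keyed by COLUMNS) as CSV text; missing keys become empty cells."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return buf.getvalue()

# Hypothetical single-entry fileset with one schema column.
sample = build_fileset_csv([{
    "entry_group_name": "my_entry_group",
    "entry_id": "my_fileset",
    "entry_display_name": "My Fileset",
    "entry_file_patterns": "gs://my-bucket/*.csv",
    "schema_column_name": "first_name",
    "schema_column_type": "STRING",
}])
```

Write the result to a file and pass its path to the CLI via `--csv-file`.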
2.2. Run the datacatalog-fileset-processor script - Create the Filesets Entry Groups and Entries
- Python + virtualenv
```bash
datacatalog-fileset-processor filesets create --csv-file CSV_FILE_PATH
```
2.3. Run the datacatalog-fileset-processor script - Delete the Filesets Entry Groups and Entries
- Python + virtualenv
```bash
datacatalog-fileset-processor filesets delete --csv-file CSV_FILE_PATH
```
TIPS

- See sample-input/create-filesets for reference;
- If you want to create filesets without a schema, see sample-input/create-filesets/fileset-entry-opt-1-all-metadata-no-schema.csv for reference.
History
0.1.0 (2020-04-24)
- First release on PyPI.
Hashes for datacatalog-fileset-processor-0.1.5.tar.gz

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 6b6abcbc05b1aedfc3b5c2c6f01109f27d0d1dbf422d5854fe6efb3f80d1b92b |
| MD5 | 983e17c6a192c309621f0aba73a184a4 |
| BLAKE2b-256 | d683cbcaabbc26860a2a2e9a61dc51fafcc086ff7afef297eec012f40410a07e |
Hashes for datacatalog_fileset_processor-0.1.5-py2.py3-none-any.whl

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 18887a5fcf66d77a5aa62d00793ab181c5641a1db5752d25b5b11d551e569d49 |
| MD5 | 24b187a26c59ab67f6e2d406a7b1341d |
| BLAKE2b-256 | a1bfeb4b05f0bc5b4373907eab61d14052ad24598912ee0b606cff0d9c03d3a5 |