HiDALGO Data Transfer CLI provides commands to transfer data between different data providers and consumers using NIFI pipelines
Project description
Hidalgo2 Data Transfer Tool
This repository contains the implementation of the Hidalgo2 data transfer tool. It uses Apache NIFI to transfer data from different data sources to specified targets
Features
This tool is planned to support the following features:
- transfer datasets from Cloud Providers to HDFS
- transfer datasets from Cloud Providers to CKAN
- transfer datasets between Hadoop HDFS and HPC, in either direction
- transfer datasets between Hadoop HDFS and CKAN, in either direction
- transfer datasets between CKAN and HPC, in either direction
- transfer datasets between the local filesystem and CKAN, in either direction
Current Version
The current version supports the following features:
- transfer datasets between Hadoop HDFS and HPC, in either direction
- transfer datasets between Hadoop HDFS and CKAN, in either direction
- transfer datasets between CKAN and HPC, in either direction
- transfer datasets between the local filesystem and CKAN, in either direction
Implementation
The current implementation is based on Python. It is implemented as a CLI that executes a transfer command by creating a NIFI process group out of the workflow definition registered in the NIFI registry. It uses the parameters given in the CLI command invocation to populate a NIFI parameter context that is associated with the created process group. Then the process group's processors are executed once (or until their incoming flowfile queues are empty), one after another, following the group's sequence flow, until the flow is completed. To check the status of a transfer command, the CLI offers a check-status command. The Data Transfer CLI tool sends requests to NIFI through its REST API.
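As a rough illustration of the kind of REST requests involved (a sketch against the standard NIFI REST API, not the tool's exact internal calls; host, credentials and process group id are placeholders, and a Keycloak-integrated NIFI may handle login differently):
# Obtain a NIFI access token:
TOKEN=$(curl -sk -X POST "https://<nifi_host>:8443/nifi-api/access/token" \
  --data "username=<user>&password=<password>")
# Query the status of a process group:
curl -sk -H "Authorization: Bearer $TOKEN" \
  "https://<nifi_host>:8443/nifi-api/flow/process-groups/<process_group_id>/status"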
Requirements
To use the Data Transfer CLI tool, the following are required:
- Python3 execution environment
- Poetry python package management tool (optional)
- NIFI instance, with an SSH account on the NIFI server (for key transfers)
- Keycloak instance, with a Keycloak user account
- HDFS instance, with a Kerberos principal for the user
- CKAN instance, with a user API key
Python3 and Poetry (optional, only for installation from the GitLab repository) should be installed on the computer where the Data Transfer CLI tool will be used. To install Poetry, follow these instructions.
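For example, Poetry can be installed with its official installer:
curl -sSL https://install.python-poetry.org | python3 -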
For a quick download, setup, configuration and execution of the DTCLI, go to the section Quick Deployment, setup, configuration and execution.
CLI configuration
Configuration file
Before using the Data Transfer CLI tool, you should configure it to point at the target NIFI. The configuration file is located at ~/dtcli/dtcli.cfg in the user's home directory. This file optionally overrides and completes the default tool configuration.
The default tool configuration is:
[Nifi]
nifi_endpoint=http://localhost:8443
nifi_upload_folder=/opt/nifi/data/upload
nifi_download_folder=/opt/nifi/data/download
nifi_secure_connection=True
[Keycloak]
keycloak_endpoint=https://idm.hidalgo2.eu
keycloak_client_id=nifi
[Logging]
logging_level=INFO
[Network]
check_status_sleep_lapse=5
Under the Nifi section,
- we define the URL of the NIFI service (nifi_endpoint),
- we specify a folder (nifi_upload_folder) on the NIFI server where files are uploaded,
- and another folder (nifi_download_folder) from which files are downloaded. These folders must be accessible by the NIFI service (ask the NIFI administrator for details).
- Additionally, you can set whether the NIFI server listens on a secure HTTPS connection (nifi_secure_connection=True) or on a non-secure HTTP one (nifi_secure_connection=False).
Under the Keycloak section, you can configure the Keycloak integrated with NIFI, specifying:
- The Keycloak service endpoint (keycloak_endpoint)
- The NIFI client id in Keycloak (keycloak_client_id)
Under the Logging section, you can configure the logging level. The logfile dtcli.log is located in the working directory of the process that executes the library.
Under the Network section, you can configure the lapse (in seconds) between checks of each NIFI pipeline processor for completion. Most users should leave the default value.
This default configuration is set up to work with the HiDALGO2 NIFI and Keycloak services and does not need to be overridden by the user. In the context of HiDALGO2, only the Logging and Network settings may need to be overridden.
This default configuration must be complemented with sensitive, user-specific configuration in the file ~/dtcli/dtcli.cfg. In particular, contact the Keycloak administrator for the keycloak_client_secret, which needs to be set.
Other user account settings are the following:
User's accounts
User accounts are specified in the user-specific configuration file ~/.dtcli/dtcli.cfg:
[Nifi]
nifi_server_username=<user_name>
nifi_server_private_key=<path/to/private/key>
[Keycloak]
keycloak_login=<user_name>
keycloak_password=<password>
keycloak_client_secret=<keycloak_nifi_client_secret>
[Logging]
logging_level=DEBUG
[Network]
check_status_sleep_lapse=2
Under the Nifi section, you must specify a user account (username, private key) that grants permission to upload/download files to/from the NIFI server (needed to upload temporary HPC keys or to support local file transfers). This account is provided by the Hidalgo2 infrastructure provider and is specific to each user or service.
Under the Keycloak section, you must specify your Keycloak account (username and password). This account grants access to the NIFI service.
For HiDALGO2 developers, NIFI (Service, Server) and Keycloak accounts are provided by the HiDALGO2 administrator.
The example of ~/.dtcli/dtcli.cfg above also shows how to specify the required keycloak_client_secret and how to override default values for the logging level and the sleep lapse used when checking processor status on the NIFI pipeline.
Quick Deployment, setup, configuration and execution
From GitLab repository (requires Poetry)
- Clone this Data Transfer CLI repository.
- Setup the data-transfer-cli project with poetry.
Go to folder hid-data-management/data-transfer/nifi/data-transfer-cli.
On the prompt, run:
./setup.sh
- Configure your NIFI and Keycloak services by modifying the user's DT CLI configuration located at ~/dtcli/dtcli.cfg. Provide your accounts for Keycloak (including the keycloak_client_secret for the nifi client) and the NIFI server. Contact the HiDALGO2 administrator to request them.
- Add the hid-data-management/data-transfer/nifi/data-transfer-cli folder to your PATH, as shown in the sketch after this list.
- Run Data Transfer CLI tool. In this example, we ask it for help:
dtcli -h
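A minimal sketch of the PATH step above, assuming the repository was cloned under $HOME (an illustrative location):
export PATH="$PATH:$HOME/hid-data-management/data-transfer/nifi/data-transfer-cli"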
From PyPI installation
- Install data_transfer_cli with:
pip install data_transfer_cli
- Configure your NIFI and Keycloak services by modifying the user's DT CLI configuration located at ~/dtcli/dtcli.cfg. Provide your accounts for Keycloak (including the keycloak_client_secret for the nifi client) and the NIFI server. Contact the HiDALGO2 administrator to request them.
- Run Data Transfer CLI tool. In this example, we ask it for help:
dtcli -h
Usage
The Data Transfer CLI tool can be executed by invoking the command dtcli. Add this command's location to your PATH: either the data_transfer_cli folder (when cloned from GitLab) or its installed location (when installed with pip from PyPI):
./dtcli command <arguments>
To get help execute:
./dtcli -h
obtaining:
usage: ['-h'] [-h]
{check-status,hdfs2hpc,hpc2hdfs,ckan2hdfs,hdfs2ckan,ckan2hpc,hpc2ckan,local2ckan,ckan2local}
...
positional arguments:
{check-status,hdfs2hpc,hpc2hdfs,ckan2hdfs,hdfs2ckan,ckan2hpc,hpc2ckan,local2ckan,ckan2local}
supported commands to transfer data
check-status check the status of a command
hdfs2hpc transfer data from HDFS to target HPC
hpc2hdfs transfer data from HPC to target HDFS
ckan2hdfs transfer data from CKAN to target HDFS
hdfs2ckan transfer data from HDFS to a target CKAN
ckan2hpc transfer data from CKAN to target HPC
hpc2ckan transfer data from HPC to a target CKAN
local2ckan transfer data from a local filesystem to a target CKAN
ckan2local transfer data from CKAN to a local filesystem
options:
-h, --help show this help message and exit
To get help of a particular command:
./dtcli hdfs2hpc -h
obtaining:
usage: ['hdfs2hpc', '-h'] hdfs2hpc [-h] -s DATA_SOURCE [-t DATA_TARGET] [-kpr KERBEROS_PRINCIPAL] [-kp KERBEROS_PASSWORD] -H HPC_HOST [-z HPC_PORT] -u HPC_USERNAME [-p HPC_PASSWORD] [-k HPC_SECRET_KEY] [-P HPC_SECRET_KEY_PASSWORD]
options:
-h, --help show this help message and exit
-s DATA_SOURCE, --data-source DATA_SOURCE
HDFS file path
-t DATA_TARGET, --data-target DATA_TARGET
[Optional] HPC folder
-kpr KERBEROS_PRINCIPAL, --kerberos-principal KERBEROS_PRINCIPAL
[Optional] Kerberos principal (mandatory for a Kerberized HDFS)
-kp KERBEROS_PASSWORD, --kerberos-password KERBEROS_PASSWORD
[Optional] Kerberos principal password (mandatory for a Kerberized HDFS)
-H HPC_HOST, --hpc-host HPC_HOST
Target HPC ssh host
-z HPC_PORT, --hpc-port HPC_PORT
[Optional] Target HPC ssh port
-u HPC_USERNAME, --hpc-username HPC_USERNAME
Username for HPC account
-p HPC_PASSWORD, --hpc-password HPC_PASSWORD
[Optional] Password for HPC account. Either password or secret key is required
-k HPC_SECRET_KEY, --hpc-secret-key HPC_SECRET_KEY
[Optional] Path to HPC secret key. Either password or secret key is required
-P HPC_SECRET_KEY_PASSWORD, --hpc-secret-key-password HPC_SECRET_KEY_PASSWORD
[Optional] Password for HPC secret key
-2fa, --two-factor-authentication
[Optional] HPC requires 2FA authentication
-acct, --accounting [Optional] Enable returning accounting information of data transfer
-ct CONCURRENT_TASKS, --concurrent-tasks CONCURRENT_TASKS
[Optional] set the number of concurrent tasks for parallel data transfer
-R, --recursive [Optional] if True the data-source subdirectories will be transferred as well, otherwise only the root data-source folder
A common command flow (e.g. transfer data from HDFS to HPC) would be like this:
- execute the hdfs2hpc CLI command to transfer data from an HDFS location (e.g. /users/yosu/data/genome-tags.csv) to a remote HPC (e.g. LUMI, at the $HOME/data folder), as shown in the example below
- check the status of the hdfs2hpc transfer (and possible warnings/errors) with the check-status CLI command
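For instance, the flow above could look as follows; the HPC host and credential values are illustrative, and the exact arguments of check-status can be listed with its own help:
# Transfer a file from HDFS to the HPC $HOME/data folder (quoted so it expands remotely):
./dtcli hdfs2hpc -s /users/yosu/data/genome-tags.csv -t '$HOME/data' \
  -H <lumi_ssh_host> -u <hpc_username> -k ~/.ssh/<secret_key>
# Inspect the check-status command's arguments, then check the transfer:
./dtcli check-status -h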
If the accounting report is enabled, the output of the command will include some transfer statistics:
Data transfer report:
Transfer time: 21 s
Transfer size: 12.86 MB
Transfer rate: 0.61 MB/s
Number of transferred files: 1
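A report like the one above is produced, for example, by adding the -acct flag to the transfer command (illustrative host and credentials):
./dtcli hdfs2hpc -s /users/yosu/data/genome-tags.csv -H <hpc_host> -u <hpc_username> -k ~/.ssh/<secret_key> -acct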
Support for HPC clusters that require a 2FA token
The Data Transfer CLI tool's commands support transferring data to/from HPC clusters that require a 2FA token. These commands offer an optional flag -2fa (--two-factor-authentication). If set by the user, the command prompts the user (on standard input) for the token when required.
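For example (illustrative host and credentials; the command prompts for the token on standard input when the HPC requires it):
./dtcli hdfs2hpc -s /users/yosu/data/genome-tags.csv -H <2fa_hpc_host> -u <hpc_username> -k ~/.ssh/<secret_key> -2fa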
Predefined profiles for data hosts
To avoid feeding the Data Transfer CLI tool with many inputs describing the hosts of the source and target data providers/consumers, the user can define them in the ~/dtcli/server_config YAML file, as shown in the following YAML code snippet:
# Meluxina
login.lxp.lu:
username: u102309
port: 8822
secret-key: ~/.ssh/<secret_key>
secret-key-password: <password>
# CKAN
ckan.hidalgo2.eu:
api-key: <api-key>
organization: atos
dataset: test-dataset
where details for the Meluxina HPC and CKAN are given. For an HPC cluster, provide the HPC host as key, followed by a colon, and below it, indented, any of the hpc parameters described in the Data Transfer CLI tool help, without the hpc_ prefix. For instance, if the Data Transfer CLI tool help mentions:
-u HPC_USERNAME, --hpc-username HPC_USERNAME
Username for HPC account
that is, --hpc-username as parameter, use username as a nested property in the HPC profile's description in the YAML config file, as shown in the example above. Proceed similarly for the other HPC parameters, such as port, password, secret-key, etc. The same procedure can be adopted to describe the CKAN host's parameters.
Note: the Hidalgo2 HPDA configuration is included in the Data Transfer CLI tool implementation and does not need to be included in this config file.
Then, when you launch a Data Transfer CLI tool command, any parameter not included in the command line will be retrieved from the config file if the corresponding host entry exists. If the command line is then complete (i.e. all required parameters are provided), the command is executed; otherwise the corresponding error is raised.
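For example, with the login.lxp.lu profile shown above in ~/dtcli/server_config, the username, port and key parameters can be omitted from the command line (the HDFS path is illustrative):
./dtcli hdfs2hpc -s /users/yosu/data/genome-tags.csv -H login.lxp.lu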
Data transfer optimization
You can improve the data transfer rate by setting the optional parameter -ct|--concurrent-tasks (integer) to the number of concurrent tasks that will be used in the NIFI pipeline (the default is 1). The maximum number of tasks that improves the transfer throughput depends on the physical resources of the NIFI server (consult its administrator). Parallel transfer is currently supported to/from HPC and HDFS data servers, but not yet to/from CKAN (under development).
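For example, a transfer using four concurrent tasks (illustrative host and credentials):
./dtcli hdfs2hpc -s /users/yosu/data/genome-tags.csv -H <hpc_host> -u <hpc_username> -k ~/.ssh/<secret_key> -ct 4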
Project details
Download files
File details
Details for the file data_transfer_cli-0.3.9.tar.gz.
File metadata
- Download URL: data_transfer_cli-0.3.9.tar.gz
- Upload date:
- Size: 15.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.1 CPython/3.12.9 Linux/6.18.0-1-default
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2845e43cea278347bc9adf3ba990db1b328872f2a1d877031b19f904271a18b7 |
| MD5 | 22a2b1a061ef8bab921da0f065d8313d |
| BLAKE2b-256 | 7fc2b7d654b31b321ee158d9429e27f1ffe7a302d38bd5d792ff65a808af5cf2 |
File details
Details for the file data_transfer_cli-0.3.9-py3-none-any.whl.
File metadata
- Download URL: data_transfer_cli-0.3.9-py3-none-any.whl
- Upload date:
- Size: 14.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.1 CPython/3.12.9 Linux/6.18.0-1-default
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ba88e0b05975d4fd5c22cfee057164780da6999e934bc4f0e9ef293392d6ac1a |
| MD5 | 6ab5f24f4e346aac50ecbedfd2c347de |
| BLAKE2b-256 | 5f62de5db184d89dc31fc1ca93a72d4973016a86cb3da7c2d23c4a3bac844459 |