The HiDALGO Data Transfer library provides methods to transfer data between different data providers and consumers using NIFI pipelines.
Hidalgo2 Data Transfer Tool
This repository contains the implementation of the Hidalgo2 data transfer library. It uses Apache NIFI to transfer data from different data sources to specified targets.
Features
The library is planned to support the following features:
- transfer datasets from Cloud Providers to HDFS
- transfer datasets from Cloud Providers to CKAN
- transfer datasets from/to Hadoop HDFS to/from HPC
- transfer datasets from/to a CKAN to/from HPC
- transfer datasets from/to local filesystem to/from HPC
- transfer datasets from/to local filesystem to/from CKAN
Prototype
The current prototype of the library supports the following features:
- transfer datasets from/to Hadoop HDFS (without Kerberos security) to/from HPC (PoC)
- transfer datasets from/to a CKAN to/from HPC
- transfer datasets from/to local filesystem to/from CKAN
Implementation
This is a Python library that offers specialized API methods to transfer data from data sources to targets. Each API method launches a NIFI pipeline by instantiating a NIFI process group from its workflow definition registered in the NIFI registry. It uses the parameters given in the library method invocation to populate a NIFI parameter context that is associated with the process group. Then, the processors in the process group are executed once (or until the processor's incoming flowfile queue is empty), one after another, following the group's sequence flow, until the flow is completed. A processor is executed only after the previous one has terminated. To check the status of a transfer command, the library offers a separate check-status command. Upon termination, the NIFI environment is cleaned up by removing the created entities (i.e. the process group and its parameter context). The Data Transfer library sends requests to NIFI through its REST API.
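The lifecycle described above (instantiate a process group, populate its parameter context, run the processors in sequence, clean up) can be sketched as follows. This is only an illustration and not the library's code: `FakeNifi`, its methods, the entity ids, and the processor names are all hypothetical stand-ins for calls that the real library sends to the NIFI REST API.

```python
from dataclasses import dataclass, field


@dataclass
class FakeNifi:
    """Hypothetical in-memory stand-in for a NIFI server; it only records calls."""
    log: list = field(default_factory=list)

    def instantiate_process_group(self, workflow):
        self.log.append(("create_group", workflow))
        return "pg-1"

    def create_parameter_context(self, group_id, params):
        self.log.append(("create_context", group_id, dict(params)))
        return "ctx-1"

    def run_once(self, group_id, processor):
        self.log.append(("run", group_id, processor))

    def delete(self, entity_id):
        self.log.append(("delete", entity_id))


def run_transfer(nifi, workflow, processors, params):
    """Run each processor once, in order, and always remove the created entities."""
    group_id = nifi.instantiate_process_group(workflow)
    context_id = None
    try:
        context_id = nifi.create_parameter_context(group_id, params)
        for processor in processors:
            # A processor starts only after the previous call has returned.
            nifi.run_once(group_id, processor)
    finally:
        # Clean-up: remove the parameter context, then the process group.
        if context_id is not None:
            nifi.delete(context_id)
        nifi.delete(group_id)


nifi = FakeNifi()
# Workflow and processor names below are made up for the example.
run_transfer(nifi, "ckan2hpc", ["GetCKAN", "PutSFTP"], {"hpc_host": "hpc.example.org"})
steps = [entry[0] for entry in nifi.log]
print(steps)
```

Placing the clean-up in a `finally` clause is a design choice of this sketch: it removes the created entities even when a processor fails midway.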
Requirements
To use the Data Transfer library, the following are required:
- Python3 execution environment
- Poetry Python package management tool (optional)
- NIFI instance, with either a NIFI or a Keycloak user account, and a NIFI server SSH account
- HDFS instance
- CKAN instance, with a user API key
Python3 should be installed on the computer where the Data Transfer CLI will be used. To install Poetry, follow its official installation instructions.
Data Transfer lib configuration
Configuration file
Before using the Data Transfer library, you should configure it to point at the target NIFI. By default, the configuration file is located at data_transfer_cli/conf/hid_dt.cfg. Alternatively, its location can be specified in the environment variable HID_DT_CONFIG_FILE.
[Nifi]
nifi_endpoint=https://nifi.hidalgo2.eu:9443
nifi_upload_folder=/opt/nifi/data/upload
nifi_download_folder=/opt/nifi/data/download
nifi_secure_connection=True
[Keycloak]
keycloak_endpoint=https://idm.hidalgo2.eu
keycloak_client_id=nifi
keycloak_client_secret=<keycloak_nifi_client_secret>
Under the Nifi section:
- We define the URL of the NIFI service (nifi_endpoint).
- We specify a folder on the NIFI server to which files are uploaded (nifi_upload_folder).
- We specify another folder from which files are downloaded (nifi_download_folder). These folders must be accessible by the NIFI service (ask the NIFI administrator for details).
- Additionally, you can set whether the NIFI server listens on a secure HTTPS connection (nifi_secure_connection=True) or on a non-secure HTTP connection (nifi_secure_connection=False).
Under the Keycloak section, you can configure the Keycloak integrated with NIFI, specifying:
- The Keycloak service endpoint (keycloak_endpoint)
- The NIFI client in Keycloak (keycloak_client_id)
- The NIFI secret in Keycloak (keycloak_client_secret)
HiDALGO2 developers can contact the Keycloak administrator for the keycloak_client_secret
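As a rough illustration of the configuration described above, the following sketch resolves the file location and reads the settings with Python's standard `configparser` module. It is not how the library itself loads its configuration: the helper names `resolve_config_path` and `load_settings` are hypothetical, and the sample text uses a dummy client secret.

```python
import configparser
import os

# Sample content mirroring the keys documented above; the secret is a dummy value.
SAMPLE_CFG = """
[Nifi]
nifi_endpoint=https://nifi.hidalgo2.eu:9443
nifi_upload_folder=/opt/nifi/data/upload
nifi_download_folder=/opt/nifi/data/download
nifi_secure_connection=True

[Keycloak]
keycloak_endpoint=https://idm.hidalgo2.eu
keycloak_client_id=nifi
keycloak_client_secret=changeme
"""

DEFAULT_CFG_PATH = "data_transfer_cli/conf/hid_dt.cfg"


def resolve_config_path(env=None):
    """Return the path in HID_DT_CONFIG_FILE if set, else the default location."""
    env = os.environ if env is None else env
    return env.get("HID_DT_CONFIG_FILE", DEFAULT_CFG_PATH)


def load_settings(text):
    """Parse the configuration text into a plain dictionary."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    nifi = parser["Nifi"]
    keycloak = parser["Keycloak"]
    return {
        "endpoint": nifi["nifi_endpoint"],
        "upload_folder": nifi["nifi_upload_folder"],
        "download_folder": nifi["nifi_download_folder"],
        # getboolean accepts True/False, yes/no, on/off and 1/0.
        "secure": nifi.getboolean("nifi_secure_connection"),
        "keycloak_endpoint": keycloak["keycloak_endpoint"],
        "keycloak_client_id": keycloak["keycloak_client_id"],
    }


settings = load_settings(SAMPLE_CFG)
print(settings["endpoint"], settings["secure"])
```

Reading from a string keeps the example self-contained; with a real file, `parser.read(resolve_config_path())` would take the place of `read_string`.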
User's accounts in environment variables
You must also specify a user account (username, private_key) that is allowed to upload files to and download files from the NIFI server (as required to upload temporary HPC keys or to support local file transfers). This user account is provided by the Hidalgo2 infrastructure provider and is specific to a user or service. The account is set up in the following environment variables:
- NIFI_SERVER_USERNAME:
export NIFI_SERVER_USERNAME=<nifi_server_username>
- NIFI_SERVER_PRIVATE_KEY:
export NIFI_SERVER_PRIVATE_KEY=<path_to_private_key>
Additionally, a user account with access to the NIFI service must be specified, using one of the following options:
A) NIFI User Account
The NIFI account must be configured in the following environment variables:
- NIFI_LOGIN:
export NIFI_LOGIN=<nifi_login>
- NIFI_PASSWORD:
export NIFI_PASSWORD=<nifi_password>
This NIFI account is provided by the NIFI administrator.
B) Keycloak Account with access to NIFI
The Keycloak account must be configured in the following environment variables:
- KEYCLOAK_LOGIN:
export KEYCLOAK_LOGIN=<keycloak_login>
- KEYCLOAK_PASSWORD:
export KEYCLOAK_PASSWORD=<keycloak_password>
For HiDALGO2 developers, NIFI (Service, Server) and Keycloak accounts are provided by the HiDALGO2 administrator.
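Before invoking the library, it can help to verify that the variables listed above are set. The sketch below is a hypothetical helper, not part of the library: `detect_account` is an invented name, and preferring the Keycloak account when both account types are present is an assumption of this example.

```python
import os

SERVER_VARS = ("NIFI_SERVER_USERNAME", "NIFI_SERVER_PRIVATE_KEY")
NIFI_VARS = ("NIFI_LOGIN", "NIFI_PASSWORD")
KEYCLOAK_VARS = ("KEYCLOAK_LOGIN", "KEYCLOAK_PASSWORD")


def detect_account(env=None):
    """Return 'keycloak' or 'nifi' depending on which service account is configured.

    Raises ValueError if the NIFI server account or both service accounts are missing.
    """
    env = os.environ if env is None else env
    missing = [name for name in SERVER_VARS if not env.get(name)]
    if missing:
        raise ValueError("missing NIFI server variables: " + ", ".join(missing))
    # Assumption of this sketch: Keycloak takes precedence when both are set.
    if all(env.get(name) for name in KEYCLOAK_VARS):
        return "keycloak"
    if all(env.get(name) for name in NIFI_VARS):
        return "nifi"
    raise ValueError(
        "set either NIFI_LOGIN/NIFI_PASSWORD or KEYCLOAK_LOGIN/KEYCLOAK_PASSWORD"
    )


# Example with an explicit mapping instead of the real process environment.
sample_env = {
    "NIFI_SERVER_USERNAME": "alice",
    "NIFI_SERVER_PRIVATE_KEY": "/home/alice/.ssh/id_rsa",
    "NIFI_LOGIN": "alice",
    "NIFI_PASSWORD": "not-a-real-password",
}
account_kind = detect_account(sample_env)
print(account_kind)
```

Calling `detect_account()` without arguments checks the real process environment instead of the sample mapping.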
Usage
The data transfer library can be invoked following this procedure:
- Provide NIFI server and Keycloak accounts in environment variables
NIFI_SERVER_USERNAME=<nifi_server_username>
NIFI_SERVER_PRIVATE_KEY=<path_to_nifi_server_user_private_key>
KEYCLOAK_LOGIN=<keycloak_username>
KEYCLOAK_PASSWORD=<keycloak_password>
- Customize the hid_dt.cfg file described above and specify its path in the environment variable
HID_DT_CONFIG_FILE=<path_to_data_transfer_configuration_file>
- In your Python code, instantiate a HidDataTransferConfiguration object and a HIDDataTransfer object. The HIDDataTransfer object can be created, by default, using the Keycloak account provided in the environment variables, or by providing a dictionary with the Keycloak token, the refresh token, and the expiration time
from hid_data_transfer_lib.hid_dt_lib import HIDDataTransfer
from hid_data_transfer_lib.conf.hid_dt_configuration import (
    HidDataTransferConfiguration
)
config = HidDataTransferConfiguration()
# Create a HIDDataTransfer object that uses the Keycloak account provided in the environment variables
dt_client = HIDDataTransfer(conf=config, secure=True)
# OR
# Create a HIDDataTransfer object that uses the provided Keycloak token dictionary
keycloak_token = {
    "username": <keycloak_username>,
    "token": <keycloak_token>,
    "expires_in": <keycloak_token_expires_in>,
    "refresh_token": <keycloak_refresh_token>
}
dt_client = HIDDataTransfer(
    conf=config,
    secure=True,
    keycloak_token=keycloak_token
)
- Invoke any data transfer library method on the created object to transfer data
dt_client.ckan2hpc(
    ckan_host=<ckan_endpoint>,
    ckan_api_key=<ckan_apikey>,
    ckan_organization=<ckan_organization>,
    ckan_dataset=<ckan_dataset>,
    ckan_resource=<ckan_resource>,
    hpc_host=<hpc_endpoint>,
    hpc_username=<hpc_username>,
    hpc_secret_key_path=<hpc_secret_key>,
    data_target=<hpc_target_folder>,
)
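When the Keycloak token is obtained outside the library, the token dictionary from the procedure above has to be assembled by hand. The sketch below shows one possible way, assuming the token was returned by a standard OpenID Connect token endpoint, whose JSON response carries the `access_token`, `expires_in` and `refresh_token` fields. The helper `build_keycloak_token` is hypothetical, and passing `expires_in` through unchanged (a lifetime in seconds) is an assumption about what the library expects.

```python
def build_keycloak_token(username, token_response):
    """Map an OpenID Connect token response onto the dictionary keys shown above."""
    required = ("access_token", "expires_in", "refresh_token")
    missing = [key for key in required if key not in token_response]
    if missing:
        raise KeyError("token response lacks: " + ", ".join(missing))
    return {
        "username": username,
        "token": token_response["access_token"],
        # Assumption: the library accepts the lifetime in seconds as issued.
        "expires_in": token_response["expires_in"],
        "refresh_token": token_response["refresh_token"],
    }


# Dummy response for illustration; real values come from the Keycloak server.
response = {
    "access_token": "dummy-access-token",
    "expires_in": 300,
    "refresh_token": "dummy-refresh-token",
    "token_type": "Bearer",
}
keycloak_token = build_keycloak_token("alice", response)
print(sorted(keycloak_token))
```

The resulting dictionary can then be passed as the `keycloak_token` argument when creating the HIDDataTransfer object.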