
The HiDALGO2 Data Transfer library provides methods to transfer data between different data providers and consumers using NIFI pipelines.


Hidalgo2 Data Transfer Tool

This repository contains the implementation of the Hidalgo2 data transfer library, which uses Apache NIFI to transfer data from different data sources to specified targets.

Features

This library plans to support the following features:

  • transfer datasets from Cloud Providers to HDFS
  • transfer datasets from Cloud Providers to CKAN
  • transfer datasets between Hadoop HDFS and HPC
  • transfer datasets between CKAN and HPC
  • transfer datasets between the local filesystem and HPC
  • transfer datasets between the local filesystem and CKAN

Prototype

The current prototype of the library supports the following features:

  • transfer datasets between Hadoop HDFS (without Kerberos security) and HPC (PoC)
  • transfer datasets between CKAN and HPC
  • transfer datasets between the local filesystem and CKAN

Implementation

This is a Python library that offers specialized API methods to transfer data from data sources to targets. Each API method launches a NIFI pipeline by instantiating a NIFI process group from its workflow definition registered in the NIFI Registry. The parameters given in the library method invocation are used to populate a NIFI parameter context associated with the process group. The processors in the process group are then executed once (or until the incoming processor's flowfile queue gets empty), one after another, following the group's sequence flow, until the flow is completed; each processor is executed after the previous one has terminated. To check the status of a transfer command, the library offers a separate check-status command. Upon termination, the NIFI environment is cleaned up by removing the created entities (i.e. the process group and its parameter context). The Data Transfer library sends all requests to NIFI through its REST API.
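As an illustration of this sequence, the sketch below drives the same steps directly against the NIFI REST API using the Python requests package. It is not the library's internal code: the payloads are simplified, binding the parameter context to the group and scheduling the processors are additional PUT calls omitted here, and all angle-bracket values are placeholders.

import requests

NIFI_API = "https://nifi.hidalgo2.eu:9443/nifi-api"   # nifi_endpoint + /nifi-api
HEADERS = {"Authorization": "Bearer <access_token>"}  # NIFI or Keycloak bearer token

# 1. Create a parameter context populated from the library method arguments
ctx = requests.post(
    f"{NIFI_API}/parameter-contexts", headers=HEADERS,
    json={"revision": {"version": 0},
          "component": {"name": "transfer-params",
                        "parameters": [{"parameter": {"name": "ckan_host",
                                                      "value": "<ckan_endpoint>"}}]}},
).json()

# 2. Instantiate a process group from its versioned flow in the NIFI Registry
pg = requests.post(
    f"{NIFI_API}/process-groups/root/process-groups", headers=HEADERS,
    json={"revision": {"version": 0},
          "component": {"position": {"x": 0.0, "y": 0.0},
                        "versionControlInformation": {"registryId": "<registry_id>",
                                                      "bucketId": "<bucket_id>",
                                                      "flowId": "<flow_id>",
                                                      "version": 1}}},
).json()
pg_id = pg["id"]

# 3. Poll the process group status until the transfer flow has completed
status = requests.get(f"{NIFI_API}/flow/process-groups/{pg_id}/status",
                      headers=HEADERS).json()

# 4. Clean up: delete the process group (and, analogously, the parameter context)
requests.delete(f"{NIFI_API}/process-groups/{pg_id}", headers=HEADERS,
                params={"version": pg["revision"]["version"]})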

Requirements

To use the Data Transfer library, the following are required:

  • a Python3 execution environment
  • the Poetry Python package management tool (optional)
  • a NIFI instance, plus either a NIFI or a Keycloak user account, and a NIFI server SSH account
  • an HDFS instance
  • a CKAN instance, with a user API key

Python3 must be installed on the computer where the Data Transfer library will be used. To install Poetry, follow the official Poetry installation instructions.
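For reference, a typical installation, assuming the library is consumed from PyPI under its distribution name hid_data_transfer_lib and that Poetry is installed with its official installer, could look as follows:

# Install the Data Transfer library from PyPI
pip install hid_data_transfer_lib

# Optionally, install Poetry and add the library to a Poetry-managed project
curl -sSL https://install.python-poetry.org | python3 -
poetry add hid_data_transfer_lib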

Data Transfer lib configuration

Configuration file

Before using the Data Transfer library, you should configure it to point at the target NIFI instance. By default, the configuration file is located at data_transfer_cli/conf/hid_dt.cfg; alternatively, its location can be specified in the environment variable HID_DT_CONFIG_FILE.

[Nifi]
nifi_endpoint=https://nifi.hidalgo2.eu:9443
nifi_upload_folder=/opt/nifi/data/upload
nifi_download_folder=/opt/nifi/data/download
nifi_secure_connection=True

[Keycloak]
keycloak_endpoint=https://idm.hidalgo2.eu
keycloak_client_id=nifi
keycloak_client_secret=<keycloak_nifi_client_secret>

Under the [Nifi] section, we define:

  • the URL of the NIFI service (nifi_endpoint),
  • a folder on the NIFI server (nifi_upload_folder) to which files are uploaded,
  • another folder (nifi_download_folder) from which files are downloaded; both folders must be accessible by the NIFI service (ask the NIFI administrator for details),
  • whether the NIFI server listens on a secure HTTPS connection (nifi_secure_connection=True) or on non-secure HTTP (nifi_secure_connection=False).

Under the [Keycloak] section, you can configure the Keycloak instance integrated with NIFI, specifying:

  • the Keycloak service endpoint (keycloak_endpoint)
  • the NIFI client id in Keycloak (keycloak_client_id)
  • the NIFI client secret in Keycloak (keycloak_client_secret)

HiDALGO2 developers can contact the Keycloak administrator for the keycloak_client_secret.
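As a minimal sketch, assuming the library reads HID_DT_CONFIG_FILE when the configuration object is constructed (as described above), a custom configuration file can be selected from Python as follows:

import os

from hid_data_transfer_lib.conf.hid_dt_configuration import (
    HidDataTransferConfiguration
)

# Point the library at a custom configuration file before loading it
os.environ["HID_DT_CONFIG_FILE"] = "/path/to/hid_dt.cfg"
config = HidDataTransferConfiguration()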

User accounts in environment variables

You must also specify a user account (username, private key) that is allowed to upload/download files to/from the NIFI server (required, for example, to upload temporary HPC keys or to support local file transfers). This account is provided by the Hidalgo2 infrastructure provider and is specific to each user or service. It is set up in the following environment variables:

  • NIFI_SERVER_USERNAME: export NIFI_SERVER_USERNAME=<nifi_server_username>
  • NIFI_SERVER_PRIVATE_KEY: export NIFI_SERVER_PRIVATE_KEY=<path_to_private_key>

Additionally, a user account granted access to the NIFI service must be specified, either:

A) NIFI User Account

The NIFI account must be configured in the following environment variables:

  • NIFI_LOGIN: export NIFI_LOGIN=<nifi_login>
  • NIFI_PASSWORD: export NIFI_PASSWORD=<nifi_password>

This NIFI account is provided by the NIFI administrator.

B) Keycloak Account with access to NIFI

The Keycloak account must be configured in the following environment variables:

  • KEYCLOAK_LOGIN: export KEYCLOAK_LOGIN=<keycloak_login>
  • KEYCLOAK_PASSWORD: export KEYCLOAK_PASSWORD=<keycloak_password>

For HiDALGO2 developers, NIFI (Service, Server) and Keycloak accounts are provided by the HiDALGO2 administrator.

Usage

The data transfer library can be invoked following this procedure:

  • Provide NIFI server and Keycloak accounts in environment variables
NIFI_SERVER_USERNAME=<nifi_server_username>
NIFI_SERVER_PRIVATE_KEY=<path_to_nifi_server_user_private_key>
KEYCLOAK_LOGIN=<keycloak_username>
KEYCLOAK_PASSWORD=<keycloak_password>
  • Customize the hid_dt.cfg file shown above and specify its path in the environment variable HID_DT_CONFIG_FILE=<path_to_data_transfer_configuration_file>

  • In your Python code, instantiate a HidDataTransferConfiguration object and an HIDDataTransfer object. The HIDDataTransfer object can be created, by default, using the Keycloak account provided in the environment variables, or by providing a dictionary with the Keycloak token, the refresh token, and the expiration time (a sketch of how such a token dictionary can be obtained from Keycloak is shown after this procedure)

from hid_data_transfer_lib.hid_dt_lib import HIDDataTransfer
from hid_data_transfer_lib.conf.hid_dt_configuration import (
    HidDataTransferConfiguration
)

config = HidDataTransferConfiguration()
# Create a HIDDataTransfer object that uses the Keycloak account provided in the environment variables
dt_client = HIDDataTransfer(conf=config, secure=True)

# OR

# Create a HIDDataTransfer object that uses the provided Keycloak token dictionary
keycloak_token = {
  "username": <keycloak_username>,
  "token": <keycloak_token>,
  "expires_in": <keycloak_token_expires_in>,
  "refresh_token": <keycloak_refresh_token>
}
dt_client = HIDDataTransfer(
  conf=config,
  secure=True,
  keycloak_token=keycloak_token
)
  • Invoke any data transfer library method using the created object to transfer data
dt_client.ckan2hpc(
  ckan_host=<ckan_endpoint>,
  ckan_api_key=<ckan_apikey>,
  ckan_organization=<ckan_organization>,
  ckan_dataset=<ckan_dataset>,
  ckan_resource=<ckan_resource>,
  hpc_host=<hpc_endpoint>,
  hpc_username=<hpc_username>,
  hpc_secret_key_path=<hpc_secret_key>,
  data_target=<hpc_target_folder>,
)
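
For completeness, the Keycloak token dictionary shown in the previous step could be obtained from Keycloak's standard OpenID Connect token endpoint using a password grant. The helper below is an illustrative sketch, not part of the library API; the realm name (hidalgo2) and the mapping of access_token onto the "token" field are assumptions.

import requests

# Hypothetical helper, not part of the library: obtain a Keycloak token
# dictionary via the OpenID Connect password grant.
# Note: older Keycloak versions prefix the token path with /auth.
def fetch_keycloak_token(keycloak_endpoint, client_id, client_secret,
                         username, password, realm="hidalgo2"):
    response = requests.post(
        f"{keycloak_endpoint}/realms/{realm}/protocol/openid-connect/token",
        data={
            "grant_type": "password",
            "client_id": client_id,
            "client_secret": client_secret,
            "username": username,
            "password": password,
        },
        timeout=30,
    )
    response.raise_for_status()
    token = response.json()
    return {
        "username": username,
        "token": token["access_token"],        # assumed field mapping
        "expires_in": token["expires_in"],
        "refresh_token": token["refresh_token"],
    }

keycloak_token = fetch_keycloak_token(
    "https://idm.hidalgo2.eu", "nifi", "<keycloak_nifi_client_secret>",
    "<keycloak_username>", "<keycloak_password>",
)
dt_client = HIDDataTransfer(conf=config, secure=True, keycloak_token=keycloak_token)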

