
HDInsight provider for Airflow


airflow-hdinsight


A set of airflow hooks, operators and sensors to allow airflow DAGs to operate with the Azure HDInsight platform, for cluster creation and monitoring as well as job submission and monitoring. Also included are some enhanced Azure Blob and Data Lake sensors.

This project is both an amalgamation and an enhancement of existing open source airflow extensions, plus new extensions written to cover what those did not.

Installation

pip install airflow-hdinsight

Extensions

airflowhdi

Type | Name | What it does
Hook | AzureHDInsightHook | Uses the HDInsightManagementClient from the HDInsight SDK for Python to expose several operations on an HDInsight cluster: get cluster state, create, delete.
Operator | AzureHDInsightCreateClusterOperator | Uses the AzureHDInsightHook to create a cluster.
Operator | AzureHDInsightDeleteClusterOperator | Uses the AzureHDInsightHook to delete a cluster.
Operator | ConnectedAzureHDInsightCreateClusterOperator | Extends the AzureHDInsightCreateClusterOperator to fetch the security credentials and cluster creation spec from an airflow connection.
Operator | AzureHDInsightSshOperator | Uses the AzureHDInsightHook and SSHHook to run an SSH command on the master node of the given HDInsight cluster.
Sensor | AzureHDInsightClusterSensor | Monitors either the provisioning state or the running state of a given HDInsight cluster (the mode is switchable). Uses the AzureHDInsightHook.
Sensor | WasbWildcardPrefixSensor | An enhancement to the WasbPrefixSensor that supports sensing on a wildcard prefix.
Sensor | AzureDataLakeStorageGen1WebHdfsSensor | Uses airflow's AzureDataLakeHook to sense a glob path (which implicitly supports wildcards) on ADLS Gen 1. ADLS Gen 2 is not yet supported in airflow.
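
To illustrate what sensing on a wildcard prefix means, here is a minimal sketch using only the Python standard library. It is not the WasbWildcardPrefixSensor implementation; the function names and the sample blob names are hypothetical, and the matching rules (fnmatch-style wildcards, with a trailing `*` implied) are an assumption.

```python
import fnmatch
from typing import Iterable, List


def literal_prefix(pattern: str) -> str:
    """Return the part of the pattern that precedes the first wildcard character."""
    for index, char in enumerate(pattern):
        if char in "*?[":
            return pattern[:index]
    return pattern


def match_wildcard_prefix(blob_names: Iterable[str], pattern: str) -> List[str]:
    """Return the blob names whose beginning matches the wildcard pattern."""
    # Appending "*" makes the pattern match as a prefix rather than a full name.
    prefix_pattern = pattern if pattern.endswith("*") else pattern + "*"
    return [name for name in blob_names if fnmatch.fnmatchcase(name, prefix_pattern)]


# Hypothetical blob listing used only for this illustration.
blobs = [
    "data/2020-06-01/part-0.csv",
    "data/2020-06-02/part-0.csv",
    "logs/2020-06-01/app.log",
]
matches = match_wildcard_prefix(blobs, "data/2020-06-*/part")
```

A sensor built this way could use the literal prefix to narrow the blob listing, then apply the wildcard match to the returned names and succeed once at least one name matches.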

airflowlivy

Type | Name | What it does
Hook | LivyBatchHook | Uses the Apache Livy Batch API to submit spark jobs to a livy server, get batch state, verify batch state by querying either the spark history server or the yarn resource manager, spill the logs of the spark job after completion, etc.
Operator | LivyBatchOperator | Uses the LivyBatchHook to submit a spark job to a livy server.
Sensor | LivyBatchSensor | Uses the LivyBatchHook to sense termination, verify completion and spill the logs of a spark job submitted earlier to a livy server.
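
For orientation, the Livy Batch API takes a JSON body on `POST /batches` and reports state on `GET /batches/{id}/state`. The sketch below builds such a request body and state URL with the standard library only. It is not the LivyBatchHook code; the helper names, the server URL and the job values are hypothetical, and only a few of the request fields Livy accepts are shown.

```python
import json
from typing import Any, Dict, Mapping, Optional, Sequence


def build_batch_payload(
    file: str,
    class_name: Optional[str] = None,
    args: Optional[Sequence[str]] = None,
    conf: Optional[Mapping[str, str]] = None,
) -> Dict[str, Any]:
    """Build the JSON body for POST /batches, leaving out fields that are unset."""
    payload: Dict[str, Any] = {"file": file}
    if class_name:
        payload["className"] = class_name
    if args:
        payload["args"] = list(args)
    if conf:
        payload["conf"] = dict(conf)
    return payload


def batch_state_url(livy_url: str, batch_id: int) -> str:
    """Return the URL that reports the state of one batch."""
    return f"{livy_url.rstrip('/')}/batches/{batch_id}/state"


# Hypothetical job description and server address, for illustration only.
payload = build_batch_payload(
    file="wasb:///jobs/example.jar",
    class_name="com.example.Main",
    args=["2020-06-01"],
)
body = json.dumps(payload)
state_url = batch_state_url("https://livy.example.com/", 7)
```

Sending `body` to the server and reading the returned batch `id` is left out here so that the sketch runs without a livy server.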

Origins of the HDInsight operator work

The HDInsight operator work is loosely inspired by alikemalocalan/airflow-hdinsight-operators. That project has a large number of defects, which is why it was never accepted for merging into airflow. This project fixes those issues and more, and is effectively a full rewrite.

Origins of the livy work

The livy batch operator is based on panovvv's project airflow-livy-operators. It makes several necessary changes:

  • Separates the operator into a hook (LivyBatchHook), an operator (LivyBatchOperator) and a sensor (LivyBatchSensor)
  • Adds additional verification and log spilling to the sensor (the original sensor has neither)
  • Removes additional verification and log spilling from the operator, allowing an async pattern akin to the EMR add step operator and step sensor
  • Creates livy, spark and YARN airflow connections dynamically from an Azure HDInsight connection
  • Returns the batch ID from the operator so that a sensor can use it after being passed through XCom
  • Changes logging to LoggingMixin calls
  • Allows templatization of fields
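
The submit-then-sense pattern described in the list above can be sketched in plain Python. This is an illustration under stated assumptions, not the project's code: `FakeLivyClient` is a hypothetical in-memory stand-in for a livy server, a dict stands in for XCom, and the set of terminal states is an assumption based on the states Livy reports for batches.

```python
from typing import Any, Dict, List

# Assumed terminal batch states; anything else means the job is still in flight.
TERMINAL_STATES = {"success", "dead", "killed", "error"}


class FakeLivyClient:
    """Hypothetical in-memory stand-in for a livy server."""

    def __init__(self, state_sequence: List[str]) -> None:
        self._states = list(state_sequence)
        self._next_id = 0
        self._polls: Dict[int, int] = {}

    def submit(self, payload: Dict[str, Any]) -> int:
        """Register a batch and return its ID without waiting for it to finish."""
        batch_id = self._next_id
        self._next_id += 1
        self._polls[batch_id] = 0
        return batch_id

    def get_state(self, batch_id: int) -> str:
        """Return the next state in the scripted sequence, repeating the last one."""
        index = min(self._polls[batch_id], len(self._states) - 1)
        self._polls[batch_id] += 1
        return self._states[index]


def submit_batch(client: FakeLivyClient, payload: Dict[str, Any], xcom: Dict[str, int]) -> int:
    """Operator side: submit the job and publish the batch ID, then return at once."""
    batch_id = client.submit(payload)
    xcom["batch_id"] = batch_id
    return batch_id


def poke(client: FakeLivyClient, xcom: Dict[str, int]) -> bool:
    """Sensor side: one check of the batch whose ID was published by the operator."""
    batch_id = xcom["batch_id"]
    state = client.get_state(batch_id)
    if state not in TERMINAL_STATES:
        return False  # still running; the scheduler would call poke again later
    if state != "success":
        raise RuntimeError(f"Batch {batch_id} ended in state '{state}'")
    return True


client = FakeLivyClient(["starting", "running", "success"])
xcom: Dict[str, int] = {}
submit_batch(client, {"file": "example.py"}, xcom)
results = [poke(client, xcom) for _ in range(3)]
```

Because the submit step returns as soon as it has a batch ID, the worker slot is free while the job runs, and verification and log spilling can happen in the sensor once a terminal state is seen.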

State of airflow livy operators in the wild

As it stands today (June of 2020), there are multiple airflow livy operator projects out there:


