Skip to main content

TODO(crezvoy): add description

Project description

Airflow connector for Ocean Apache Spark

An Airflow plugin and provider to launch and monitor Spark applications on the Ocean for Spark.

Compatibility

ocean-spark-airflow-provider is compatible with both Airflow 1 and Airflow 2. it is detected as an Airflow plugin by Airflow 1 and up, and as a provider by Airflow 2.

Installation

pip install ocean-spark-airflow-provider

Usage

For general usage of Ocean for Spark, refer to the official documentation.

Setting up the connection

In the connection menu, register a new connection of type Ocean For Spark. The default connection name is ocean_spark_default. You will need to have:

  • The Ocean Spark cluster ID of the cluster you just created (of the format osc-e4089a00). You can find this in the console in the list of clusters, or by using the Get Cluster List in the API.
  • A Spot token to interact with Spot API.

connection setup dialog

The Ocean For Spark connection type is not available for Airflow 1, instead create an HTTP connection and fill your cluster id as host your API token as password.

You will need to create a separate connection for every Ocean Spark cluster that you plan to use with Airflow. In the OceanSparkOperator, you can select which Ocean Spark connection to use with the connection_name argument (defaults to ocean_spark_default).

Using the operator

from airflow import __version__ as airflow_version
if airflow_version.starts_with("1."):
    # Airflow 1, import as plugin
    from airflow.operators.ocean_spark import OceanSparkOperator
else:
    # Airflow 2
    from ocean_spark.operators import OceanSparkOperator
    
# DAG creation
    
spark_pi_task = OceanSparkOperator(
    job_id="spark-pi",
    task_id="compute-pi",
    dag=dag,
    config_overrides={
        "type": "Scala",
        "sparkVersion": "3.2.0",
        "image": "gcr.io/datamechanics/spark:platform-3.2-latest",
        "imagePullPolicy": "IfNotPresent",
        "mainClass": "org.apache.spark.examples.SparkPi",
        "mainApplicationFile": "local:///opt/spark/examples/jars/examples.jar",
        "arguments": ["10000"],
        "driver": {
            "cores": 1,
        },
        "executor": {
            "cores": 2,
            "instances": 1,
        },
    },
)

more examples are available for Airflow 1 and Airflow 2.

Test locally

You can test the plugin locally using the docker compose setup in this repository. Run make serve_airflow2 at the root of the repository to launch an instance of Airflow 2 with the provider already installed.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ocean-spark-airflow-provider-0.1.0.tar.gz (8.8 kB view hashes)

Uploaded Source

Built Distribution

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page