Launches a Dask Gateway cluster in K8s and joins HTCondor workers to it

Project description

NCSA HTCdaskGateway

Subclasses the Dask Gateway client to launch dask clusters in Kubernetes, but with HTCondor workers. This is a fork of the ingenious original idea by Maria Acosta at Fermilab as part of their Elastic Analysis Facility project.

ICRN Quick Start

As a user of the Illinois Computes Research Notebooks environment, you will use conda to set up the Condor tools and install this library. Create the following conda.yaml file:

name: dask-gateway
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.11
  - htcondor
  - pip
  - pip:
      - ncsa-htcdaskgateway>=1.0.4
      - dask==2025.2.0
      - distributed==2025.2.0
      - tornado==6.4.2

From a Jupyter terminal window create the conda environment with:

conda env create -f conda.yaml
conda activate dask-gateway

Note: Depending on your conda setup, the conda activate command may not be available; in that case, activate the environment with source activate dask-gateway.

Now you can use the setup_condor script to set up the HTCondor tools. This will request your Illinois password and attempt to log into the HTCondor login node and execute a command that generates a token file. This token file is used by the HTCondor tools to authenticate with the HTCondor cluster. The script will put the token in your ~/.condor/tokens.d directory.

It will also write appropriate condor_config settings to the conda environment's condor directory.
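Conceptually, the token step boils down to something like the following sketch (the hostname, username, and token file name are all illustrative; the script's actual commands may differ):

```
# Log into the HTCondor login node, fetch an IDTOKEN (prompts for your
# Illinois password), and store it where the local condor tools will look.
mkdir -p ~/.condor/tokens.d
ssh myuser@condor-login.example.edu condor_token_fetch > ~/.condor/tokens.d/icrn
chmod 600 ~/.condor/tokens.d/icrn
```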

When complete, you should be able to view the condor queue from an ICRN terminal with

condor_q

Use in Jupyter Notebook

In your Jupyter notebook, the first thing you need to do is activate the conda environment:

!source activate dask-gateway

Now you can pip install any additional dependencies. Any package whose objects are sent to the Dask workers or returned as results must be installed at exactly the same version on the client and the workers.

! python -m pip install numpy==2.2.4
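Distributed can check version agreement for you: client.get_versions(check=True) raises an error if the scheduler or workers disagree with the client. As a self-contained illustration of what that comparison does, here is a small sketch (the version dictionaries below are made up):

```python
def find_version_mismatches(client_versions, worker_versions):
    """Return packages whose versions differ between client and workers.

    Both arguments map package name -> version string, similar in spirit
    to the per-package info reported by distributed's Client.get_versions().
    """
    return {
        pkg: (client_versions[pkg], worker_versions.get(pkg))
        for pkg in client_versions
        if worker_versions.get(pkg) != client_versions[pkg]
    }

# numpy differs between client and worker here; dask matches
mismatches = find_version_mismatches(
    {"numpy": "2.2.4", "dask": "2025.2.0"},
    {"numpy": "2.2.3", "dask": "2025.2.0"},
)
```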

Providing Path to Condor Tools

There are some interesting interactions between conda and Jupyter. Conda installed the condor binaries, but the notebook kernel's PATH does not include them. We use an environment variable to tell the htcdaskgateway client where to find the binaries.

In a terminal window:

source activate dask-gateway
which condor_q

Back in your notebook:

import os

os.environ["CONDOR_BIN_DIR"] = "/home/myhome/.conda/envs/dask-gateway/bin"

Setting up a dotenv file

It is good practice to keep passwords out of your notebooks. Create a .env file that contains an entry for DASK_GATEWAY_PASSWORD
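A minimal .env file (kept out of version control) looks like this; the value shown is a placeholder:

```
DASK_GATEWAY_PASSWORD=replace-with-your-gateway-password
```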

Add python-dotenv to your pip installed dependencies and add this line to your notebook:

from dotenv import load_dotenv

load_dotenv()  # take environment variables from .env.

Connecting to the Gateway and Scaling up Cluster

Now we can finally start up a cluster!

from htcdaskgateway import HTCGateway
from dask_gateway.auth import BasicAuth
import os

gateway = HTCGateway(
    address="https://dask.software-dev.ncsa.illinois.edu",
    proxy_address=8786,
    auth=BasicAuth(username=None, password=os.environ["DASK_GATEWAY_PASSWORD"]),
)

cluster = gateway.new_cluster(
    image="ncsa/dask-public-health:latest",
    container_image="/u/bengal1/condor/PublicHealth.sif",
)
cluster.scale(2)
client = cluster.get_client()
client

This will display the URL for accessing the cluster dashboard.

How it Works

This is a drop-in replacement for the official Dask Gateway client. It keeps the same authentication and interaction with the gateway server (which is assumed to be running in a Kubernetes cluster). When the user requests a new cluster, this client communicates with the gateway server and instructs it to launch a cluster. The cluster runs a modified Docker image that only launches the scheduler and assumes that HTCondor workers will eventually join.

The client then uses the user's credentials to build an HTCondor job file and submits it to the cluster. These jobs run the dask worker and carry the certificates needed to present themselves to the scheduler.

The scheduler then accepts them into the cluster and we are ready to compute.
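Schematically, the generated submit file looks something like the following; every name, path, and value here is illustrative rather than the library's actual output:

```
universe             = vanilla
executable           = start_dask_worker.sh
arguments            = tls://scheduler.example.edu:8786
transfer_input_files = dask.crt, dask.pem, dask-ca.pem
container_image      = /u/bengal1/condor/PublicHealth.sif
request_cpus         = 1
request_memory       = 4GB
queue 1
```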

Preparing The Ground

There are a number of configuration steps required for this setup to work. Here are the main ones:

  1. The workers communicate with the scheduler over a TLS connection on port 8786. The Kubernetes traefik ingress needs to know about this port and route its traffic.
  2. The scheduler needs to be able to communicate with the workers, and the workers need to communicate with each other. This happens over a range of ports that must be opened.
  3. The client library submits jobs to the HTCondor cluster, so the user environment must be configured to submit and manage HTCondor jobs.
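For the port range in step 2, one common approach is to pin the dask worker to a fixed window so the firewall rules only need to cover a known range. With the dask CLI that looks roughly like the following (the scheduler address and the specific ranges are assumptions; pick ranges that match your site policy):

```
dask worker tls://scheduler.example.edu:8786 \
    --worker-port 10000:10100 \
    --nanny-port 10101:10200
```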

HTCondor Integration

The client library creates a job file and a shell script to run the dask worker. These files need to be in a location readable by the HTCondor worker nodes.

The condor tools (condor_submit, condor_q, and condor_rm) require some configuration and a token file in order to operate with the cluster.
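A minimal client-side condor_config for this token-based setup looks roughly like the following (the hostname is a placeholder; the real values are written for you by the setup_condor script):

```
CONDOR_HOST = condor-login.example.edu
SEC_CLIENT_AUTHENTICATION_METHODS = TOKEN
SEC_TOKEN_DIRECTORY = $ENV(HOME)/.condor/tokens.d
```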

Usage:

At a minimum, the client environment will need to install:

  1. This library: ncsa-htcdaskgateway
  2. dask

Connect to the gateway and create a cluster:

from htcdaskgateway import HTCGateway
from dask_gateway.auth import BasicAuth

gateway = HTCGateway(
    address="https://dask.software-dev.ncsa.illinois.edu",
    proxy_address=8786,
    auth=BasicAuth(username=None, password="____________"),
)

cluster = gateway.new_cluster(
    image="ncsa/dask-public-health:latest",
    container_image="/u/bengal1/condor/PublicHealth.sif",
)
cluster.scale(4)

client = cluster.get_client()

Hopefully your environment will have a more secure auth model than this.

The image argument is the docker image the scheduler will run in. The container_image argument is the path to an Apptainer image the HTCondor worker will run in.

In order for the image argument to work, you need to deploy the gateway with image customization enabled:

gateway:
  # The gateway server log level
  loglevel: INFO

  extraConfig:
    # Enable scheduler image name to be provided by the client cluster constructor
    image-customization: |
      from dask_gateway_server.options import Options, String

      def option_handler(options):
          return {
              "image": options.image,
              # Add other options as needed
          }

      c.Backend.cluster_options = Options(
          String("image", default="daskgateway/dask-gateway:latest", label="Image"),
          # Add other option parameters as needed
          handler=option_handler,
      )

Notes from FNAL Implementation

  • A Dask Gateway client extension for heterogeneous cluster mode combining the Kubernetes backend for pain-free scheduler networking, with COFFEA-powered HTCondor workers and/or OKD [coming soon].
  • Latest PyPI version is installed by default and deployed to the COFFEA-DASK notebook on EAF (https://analytics-hub.fnal.gov). A few lines will get you going!
  • The current image for workers/schedulers is: coffeateam/coffea-dask-cc7-gateway:0.7.12-fastjet-3.3.4.0rc9-g8a990fa

Basic usage @ Fermilab EAF

  • Make sure the notebook launched supports this functionality (COFFEA-DASK notebook)
from htcdaskgateway import HTCGateway

gateway = HTCGateway()
cluster = gateway.new_cluster()
cluster

# Scale my cluster to 5 HTCondor workers
cluster.scale(5)

# Obtain a client for connecting to your cluster scheduler
# Your cluster should be ready to take requests
client = cluster.get_client()
client

# When computations are finished, shutdown the cluster
cluster.shutdown()

Other functions worth checking out

  • This is a multi-tenant environment, and you are authenticated via JupyterHub OAuth, which means that you can create as many clusters as you wish
  • To list your clusters:
# Verify that the gateway is responding to requests by asking to list all its clusters
clusters = gateway.list_clusters()
clusters
  • To connect to a specific cluster from the list:
cluster = gateway.connect(cluster_name)
cluster
cluster.shutdown()
  • To gracefully close the cluster and remove the HTCondor worker jobs associated with it:
cluster.shutdown()
  • Dask Gateway also implements widgets. Give them a try from your EAF COFFEA notebook: execute the client and cluster objects (after properly initializing them) in a cell, like:
-------------
cluster = gateway.new_cluster()
cluster
<Widget will appear after this step>
-------------
client = cluster.get_client()
client
<Widget will appear after this step>
-------------
cluster
<Widget will appear after this step>


