JuQueue

Computation and workflow management system for time-constrained cluster environments. It is aimed at compute clusters where users are billed for the runtime of an entire node or of a minimum resource allocation unit (e.g. at the Jülich Supercomputing Centre (JSC)).

Work in progress and potentially unstable. The wiki provides further documentation.

Concept

  • Runs
    • Defines the command and its corresponding parameters.
    • Defines an Executor, which determines environment variables, virtual environments, etc.
    • Commands should be robust to termination, i.e.
      • Should resume from previous computation if terminated.
        • If the Node shuts down/fails, the Run will be requeued.
      • Upon failure, must return a non-zero status code. [will not be requeued]
      • Must return status code 0 if completed. [will not be requeued]
  • Experiment
    • A logical group of Runs.
  • Clusters
    • Each Cluster (currently local and slurm) defines a group of nodes.
    • A ClusterManager manages NodeManagers on computation nodes (e.g. via SLURM jobs).
      • Each NodeManager specifies a certain number of Slots and manages the execution of Runs in Python subprocesses.
      • As Runs are queued to or removed from the Cluster, or complete or fail, the number of nodes is rescaled as necessary.
    • For now, the system is aggressive in minimizing the number of nodes, e.g.
      • Assume 4 nodes (each with 4 slots), each executing a single Run.
      • Then 3 nodes are cancelled (along with their Runs), and those Runs are rescheduled to the remaining node.
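
The requirements on Run commands listed above (resume after termination, status code 0 on completion, non-zero on failure) can be sketched as a small checkpointing script. This is only an illustration under stated assumptions: the checkpoint file format and the names load_next_step, save_next_step and run are hypothetical and not part of JuQueue.

```python
import json
import sys
from pathlib import Path
from typing import Callable


def load_next_step(checkpoint: Path) -> int:
    """Return the next step to execute, or 0 if no checkpoint exists yet."""
    if checkpoint.exists():
        return int(json.loads(checkpoint.read_text())["next_step"])
    return 0


def save_next_step(checkpoint: Path, next_step: int) -> None:
    """Write to a temporary file, then rename it over the checkpoint,
    so a kill in the middle of a write cannot leave a half-written file."""
    tmp = checkpoint.with_name(checkpoint.name + ".tmp")
    tmp.write_text(json.dumps({"next_step": next_step}))
    tmp.replace(checkpoint)


def run(checkpoint: Path, total_steps: int, work: Callable[[int], None]) -> int:
    """Execute the remaining steps and return the process exit code.

    0 means completed, 1 means failed. If the process is killed instead
    (e.g. the node shuts down), nothing is returned and the next start
    resumes from the last saved step.
    """
    try:
        for step in range(load_next_step(checkpoint), total_steps):
            work(step)                            # hypothetical unit of work
            save_next_step(checkpoint, step + 1)  # progress survives termination
    except Exception as exc:
        print(f"run failed: {exc}", file=sys.stderr)
        return 1
    return 0
```

A script built this way would end with something like sys.exit(run(Path("checkpoint.json"), 10, do_step)), where do_step stands for the actual computation, so that the exit code reflects completion or failure.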

Installation

From source

git clone https://github.com/tran-khoa/JuQueue juqueue
cd juqueue
pip install -e .

# (optional) Start with example definitions
cp -r example_defs ~/defs

Via pip

pip install juqueue

Usage

juqueue --def-dir [PATH] --work-dir [PATH]

A minimal user interface is offered at localhost:51234. For more advanced usage, JuQueue can be controlled via FastAPI's interactive docs available at localhost:51234/docs.
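
To get an overview of the API without opening a browser, the OpenAPI schema behind the interactive docs can be read from a script. The sketch below assumes that JuQueue keeps FastAPI's default schema path /openapi.json and that the server is running on localhost:51234; the helper names fetch_schema and list_endpoints are hypothetical.

```python
import json
from urllib.request import urlopen

BASE_URL = "http://localhost:51234"  # address given in the Usage section

HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def fetch_schema(base_url: str = BASE_URL) -> dict:
    """Download the OpenAPI schema (assumes FastAPI's default /openapi.json path)."""
    with urlopen(f"{base_url}/openapi.json", timeout=5) as response:
        return json.load(response)


def list_endpoints(schema: dict) -> list:
    """Return sorted (METHOD, path) pairs found in an OpenAPI schema dict."""
    pairs = []
    for path, operations in schema.get("paths", {}).items():
        for method in operations:
            # Skip non-operation keys such as "parameters" or "summary".
            if method in HTTP_METHODS:
                pairs.append((method.upper(), path))
    return sorted(pairs)
```

With a running instance, list_endpoints(fetch_schema()) returns whatever routes the installed version actually exposes; no endpoint names are assumed here.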

Documentation

For now, refer to the examples in example_defs/ and FastAPI's docs, available at localhost:51234/docs or localhost:51234/redoc.
