JuQueue

Computation and workflow management system for time-constrained cluster environments. It is aimed at compute clusters where users are billed for the runtime of an entire node or of a minimum resource allocation unit (e.g. at the Jülich Supercomputing Centre (JSC)).

Work in progress and potentially unstable. The wiki provides further documentation.

Concept

  • Runs
    • Defines the command and its corresponding parameters.
    • Defines an Executor, which determines environment variables, virtual environments, etc.
    • Commands should be robust to termination, i.e.
      • Should resume from the previous computation if terminated.
        • If the Node shuts down or fails, the Run is requeued.
      • Must return a non-zero status code upon failure. [will not be requeued]
      • Must return status code 0 upon completion. [will not be requeued]
  • Experiment
    • A logical group of Runs.
  • Clusters
    • Each Cluster (currently of type local or slurm) defines a group of nodes.
    • A ClusterManager manages NodeManagers on computation nodes (e.g. via SLURM jobs).
      • Each NodeManager specifies a certain number of Slots and manages the execution of Runs in Python subprocesses.
      • As Runs are queued to or removed from the Cluster, or complete or fail, the number of nodes is rescaled as necessary.
    • For now, the system aggressively minimizes the number of nodes, e.g.
      • Assume 4 nodes (each with 4 slots), each executing a single Run.
      • Since all 4 Runs fit into the 4 slots of a single node, 3 nodes are cancelled (along with their Runs) and those Runs are rescheduled onto the remaining node.
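
The requirements on Run commands listed above (resume after termination, status code 0 on completion, non-zero on failure) can be illustrated with a minimal sketch. This is not part of JuQueue; the function name `run`, the JSON checkpoint layout and the placeholder computation are assumptions made purely for illustration.

```python
import json
import sys
from pathlib import Path


def run(checkpoint: Path, total_steps: int) -> int:
    """Do `total_steps` units of work, resuming from `checkpoint` if it exists.

    Returns the exit code the command should report:
    0 on completion, 1 on failure.
    """
    try:
        # Resume from the last saved step, or start from scratch.
        if checkpoint.exists():
            state = json.loads(checkpoint.read_text())
        else:
            state = {"step": 0, "total": 0}

        for step in range(state["step"], total_steps):
            state["total"] += step  # placeholder for the real computation
            state["step"] = step + 1
            # Write to a temporary file and rename it, so that a termination
            # mid-write does not leave a partially written checkpoint behind.
            tmp = checkpoint.with_suffix(".tmp")
            tmp.write_text(json.dumps(state))
            tmp.replace(checkpoint)
        return 0
    except Exception as exc:
        # Any error is reported as a non-zero status code.
        print(f"run failed: {exc}", file=sys.stderr)
        return 1
```

A script built on this sketch would end with something like `sys.exit(run(Path("checkpoint.json"), total_steps=100))`. If the process is killed and the command is started again, the loop continues from the last checkpointed step instead of starting over.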

Installation

From source

git clone https://github.com/tran-khoa/JuQueue juqueue
cd juqueue
pip install -e .

# (optional) Start with example definitions
cp -r example_defs ~/defs

Via pip

pip install juqueue

Usage

juqueue --def-dir [PATH] --work-dir [PATH]

A minimal user interface is offered at localhost:51234. For more advanced usage, JuQueue can be controlled via FastAPI's interactive docs available at localhost:51234/docs.
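
As a rough sketch of how the API could be explored from a script instead of the browser: FastAPI applications serve their OpenAPI schema at /openapi.json by default, so the available routes can be listed from it. That JuQueue keeps this default route is an assumption, and the helper names `list_endpoints` and `fetch_schema` are hypothetical.

```python
import json
from urllib.request import urlopen


def list_endpoints(schema: dict) -> list[tuple[str, str]]:
    """Return sorted (METHOD, path) pairs from an OpenAPI schema dict."""
    http_methods = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}
    return sorted(
        (method.upper(), path)
        for path, operations in schema.get("paths", {}).items()
        for method in operations
        if method in http_methods  # skip non-operation keys such as "parameters"
    )


def fetch_schema(base_url: str = "http://localhost:51234") -> dict:
    """Download the OpenAPI schema from a running server."""
    # Assumption: the server keeps FastAPI's default schema route /openapi.json.
    with urlopen(f"{base_url}/openapi.json", timeout=5) as response:
        return json.load(response)
```

With JuQueue running, `for method, path in list_endpoints(fetch_schema()): print(method, path)` would print one line per route. The actual routes depend on the installed version and are not listed here.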

Documentation

For now, refer to the examples in example_defs/ and FastAPI's docs, available at localhost:51234/docs or localhost:51234/redoc.
