Computation and work management system for time-constrained cluster environments.

JuQueue

Computation and workflow management system for time-constrained cluster environments. It is aimed at compute clusters where users are billed for the runtime of an entire node or of a minimum resource allocation unit, for example at the Jülich Supercomputing Centre (JSC).

JuQueue is a work in progress and potentially unstable. The wiki provides further documentation.

Concept

  • Run
    • Defines the command and its corresponding parameters.
    • Defines an Executor, which determines environment variables, virtual environments, etc.
    • Commands should be robust to termination, i.e. they
      • Should resume from the previous computation if terminated.
        • If the Node shuts down or fails, the Run will be requeued.
      • Must return a non-zero status code upon failure. [will not be requeued]
      • Must return status code 0 upon completion. [will not be requeued]
  • Experiment
    • A logical group of Runs.
  • Clusters
    • Each Cluster (currently of type local or slurm) defines a group of nodes.
    • A ClusterManager manages NodeManagers on computation nodes (e.g. via SLURM jobs).
      • Each NodeManager specifies a certain number of Slots and manages the execution of Runs in Python subprocesses.
      • As Runs are queued to or removed from the Cluster, or as they complete or fail, the number of nodes is rescaled as necessary.
    • For now, the system aggressively minimizes the number of nodes, e.g.
      • Assume 4 nodes (each with 4 slots), each executing a single Run.
      • Then 3 nodes are cancelled (along with their Runs), and those Runs are rescheduled to the remaining node.
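
Because Runs can be cancelled and rescheduled at any time, the termination contract listed for Runs above matters in practice. The following is a minimal sketch of a command that follows it, not part of JuQueue itself: it checkpoints after every step, resumes from the checkpoint when restarted, and maps success and failure to status codes 0 and 1. The names load_state, save_state and main, the JSON checkpoint layout, and the placeholder work function are all assumptions made for illustration.

```python
import json
import os
import sys
import tempfile
from pathlib import Path
from typing import Callable


def load_state(path: Path) -> dict:
    """Return the saved progress, or a fresh state if no checkpoint exists yet."""
    if path.exists():
        return json.loads(path.read_text())
    return {"next_step": 0, "total": 0}


def save_state(path: Path, state: dict) -> None:
    """Write the checkpoint atomically so a kill mid-write cannot corrupt it."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(state, fh)
    os.replace(tmp, path)  # atomic rename over the previous checkpoint


def main(checkpoint: Path, steps: int,
         work: Callable[[int], int] = lambda step: step * step) -> int:
    """Run the remaining steps; return 0 on completion and 1 on failure."""
    try:
        state = load_state(checkpoint)
        # Start at the first step that has not been checkpointed yet.
        for step in range(state["next_step"], steps):
            state["total"] += work(step)  # hypothetical unit of work
            state["next_step"] = step + 1
            save_state(checkpoint, state)
    except Exception as exc:
        # A genuine error: report it and signal failure (non-zero status).
        print(f"run failed: {exc}", file=sys.stderr)
        return 1
    return 0
```

To use this as a command, a script would end with something like `raise SystemExit(main(Path("state.json"), 100))`. If the process is killed between steps, no status code is produced at all, and the next invocation continues from the last saved step.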

Installation

From source

git clone https://github.com/tran-khoa/JuQueue juqueue
cd juqueue
pip install -e .

# (optional) Start with example definitions
cp -r example_defs ~/defs

Via pip

pip install juqueue

Usage

juqueue --def-dir [PATH] --work-dir [PATH]

A minimal user interface is offered at localhost:51234. For more advanced usage, JuQueue can be controlled via FastAPI's interactive docs available at localhost:51234/docs.
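
As a complement to the interactive docs, the available API operations can also be listed from a script. The sketch below assumes that JuQueue keeps FastAPI's default schema location, /openapi.json, which the source does not confirm; the helper names list_operations and fetch_schema are made up for this example, and no specific JuQueue endpoint is assumed.

```python
import json
from urllib.request import urlopen

BASE_URL = "http://localhost:51234"  # address given in the usage section

HTTP_METHODS = {"get", "put", "post", "delete", "patch", "options", "head", "trace"}


def list_operations(schema: dict) -> list[tuple[str, str, str]]:
    """Return sorted (METHOD, path, summary) triples from an OpenAPI schema."""
    operations = []
    for path, item in schema.get("paths", {}).items():
        for method, operation in item.items():
            # Path items may also hold non-operation keys such as "parameters".
            if method in HTTP_METHODS:
                operations.append((method.upper(), path, operation.get("summary", "")))
    return sorted(operations)


def fetch_schema(base_url: str = BASE_URL) -> dict:
    """Download the OpenAPI schema from a running server."""
    # Assumption: the schema is served at FastAPI's default /openapi.json path.
    with urlopen(f"{base_url}/openapi.json", timeout=5) as response:
        return json.load(response)
```

With a JuQueue instance running, `for method, path, summary in list_operations(fetch_schema()): print(method, path, summary)` prints one line per operation; the actual routes are best confirmed against localhost:51234/docs.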

Documentation

For now, refer to the examples in example_defs/ and FastAPI's docs, available at localhost:51234/docs or localhost:51234/redoc.
