JuQueue

Computation and workflow management system for time-constrained cluster environments. It targets compute clusters on which users are billed for the runtime of an entire node rather than for the resources requested by a single job (e.g. JURECA).

Work in progress and potentially unstable.

Concept

  • Runs
    • Defines the command and its corresponding parameters.
    • Defines an Executor, which determines environment variables, virtual environments, etc.
    • Commands should be robust to termination, i.e.
      • Should resume from previous computation if terminated.
        • If the Node shuts down or fails, the Run will be requeued.
      • Must return a non-zero status code upon failure. [will not be requeued]
      • Must return status code 0 upon completion. [will not be requeued]
  • Experiment
    • A logical group of Runs.
  • Clusters
    • Each Cluster (currently local and slurm) defines a group of nodes.
    • A ClusterManager manages NodeManagers on computation nodes (e.g. via SLURM jobs).
      • Each NodeManager provides a certain number of Slots and manages the execution of Runs in Python subprocesses.
      • As Runs are queued to or removed from the Cluster, or complete or fail, the number of nodes is rescaled as necessary.
    • For now, the system aggressively minimizes the number of nodes, e.g.
      • Assume 4 nodes (each with 4 slots), each executing a single Run.
      • Then 3 nodes are cancelled (along with their Runs), and those Runs are rescheduled to the remaining node, filling its 4 slots.
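
Because a Run may be cancelled and requeued at any time, the command it wraps needs to checkpoint its progress and follow the exit-code contract listed under Runs. The following is a minimal sketch of such a command, not part of JuQueue itself: the function name run, the checkpoint file layout and the step counter are all hypothetical and stand in for real work.

```python
import json
import sys
from pathlib import Path


def run(checkpoint: Path, total_steps: int) -> int:
    """Resume from `checkpoint` if it exists and return the process exit code."""
    try:
        # Pick up where a previously terminated attempt left off (hypothetical format).
        step = json.loads(checkpoint.read_text())["step"] if checkpoint.exists() else 0

        while step < total_steps:
            # ... one unit of real work would go here ...
            step += 1

            # Write to a temporary file first, then swap it in, so that a
            # termination in the middle of a write cannot corrupt the checkpoint.
            tmp = checkpoint.with_suffix(".tmp")
            tmp.write_text(json.dumps({"step": step}))
            tmp.replace(checkpoint)

        return 0  # completed: status code 0
    except Exception as exc:
        print(f"run failed: {exc}", file=sys.stderr)
        return 1  # failure: non-zero status code


# As a script entry point, one would call, for example:
#     sys.exit(run(Path("checkpoint.json"), total_steps=100))
```

If the process is killed mid-loop, the checkpoint written after the last finished step stays on disk, so a requeued Run continues from there instead of starting over.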

Installation

From source

git clone https://github.com/tran-khoa/JuQueue juqueue
cd juqueue
pip install -e .

# (optional) Start with example definitions
cp -r example_defs ~/defs

Via pip

pip install juqueue

Usage

juqueue --def-dir [PATH] --work-dir [PATH]

A minimal user interface is offered at localhost:51234. For more advanced usage, JuQueue can be controlled via FastAPI's interactive docs available at localhost:51234/docs.
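
To script against the API rather than click through the interactive docs, one option is to read the OpenAPI schema that FastAPI applications serve by default at /openapi.json. The sketch below assumes JuQueue keeps that default path; the helper names fetch_schema and list_endpoints are illustrative and not part of JuQueue.

```python
import json
from urllib.request import urlopen

HTTP_METHODS = {"get", "post", "put", "patch", "delete"}


def fetch_schema(base_url: str = "http://localhost:51234") -> dict:
    """Download the OpenAPI schema (assumes FastAPI's default /openapi.json path)."""
    with urlopen(f"{base_url}/openapi.json", timeout=5) as response:
        return json.load(response)


def list_endpoints(schema: dict) -> list[tuple[str, str]]:
    """Return sorted (METHOD, path) pairs found in an OpenAPI schema."""
    endpoints = []
    for path, operations in schema.get("paths", {}).items():
        for method in operations:
            # Skip non-operation keys such as "parameters" or "summary".
            if method.lower() in HTTP_METHODS:
                endpoints.append((method.upper(), path))
    return sorted(endpoints)
```

With a running instance, list_endpoints(fetch_schema()) lists the routes that are available; the actual route names are best taken from localhost:51234/docs.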

Documentation

For now, refer to the examples in example_defs/ and FastAPI's docs, available at localhost:51234/docs or localhost:51234/redoc.
