Computation and work management system for time-constrained cluster environments.

JuQueue

Computation and workflow management system for time-constrained cluster environments. It targets compute clusters on which users are charged for the runtime of an entire node, rather than for the resources requested by a single job (e.g. JURECA).

Work in progress and potentially unstable.

Concept

  • Runs
    • Defines the command and its corresponding parameters.
    • Defines an Executor, which determines environment variables, virtual environments, etc.
    • Commands should be robust to termination, i.e. they
      • Should resume from the previous computation if terminated.
        • If the Node shuts down or fails, the Run will be requeued.
      • Must return a non-zero status code upon failure. [The Run will not be requeued.]
      • Must return status code 0 upon completion. [The Run will not be requeued.]
  • Experiment
    • A logical group of Runs.
  • Clusters
    • Each Cluster (currently local and slurm) defines a group of nodes.
    • A ClusterManager manages NodeManagers on computation nodes (e.g. via SLURM jobs).
      • Each NodeManager specifies a certain number of Slots and manages the execution of Runs in Python subprocesses.
      • As Runs are queued to or removed from the Cluster, or as they complete or fail, the number of nodes is rescaled as necessary.
    • For now, the system aggressively minimizes the number of nodes. For example:
      • Assume 4 nodes (each with 4 slots), each executing a single Run.
      • Then 3 nodes are cancelled (along with their Runs), and those Runs are rescheduled to the remaining node.
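
The requirements on Run commands listed above (resume after termination, exit with 0 on completion, exit with a non-zero code on failure) could be met by a script along the following lines. This is only a minimal sketch, not part of JuQueue: the function name run_steps, the JSON checkpoint format, and the step counter are all hypothetical choices made for illustration.

```python
import json
import sys
from pathlib import Path


def run_steps(checkpoint: Path, total_steps: int) -> int:
    """Run up to total_steps, resuming from the checkpoint if one exists.

    Returns the exit code the command should use:
    0 on completion, 1 on failure.
    (Hypothetical helper, not a JuQueue API.)
    """
    try:
        # Resume from the last saved step, or start from scratch.
        if checkpoint.exists():
            start = json.loads(checkpoint.read_text())["step"]
        else:
            start = 0

        for step in range(start, total_steps):
            # ... one unit of real work would go here ...

            # Write to a temporary file first, then rename it, so that a
            # termination mid-write does not leave a corrupt checkpoint.
            tmp = checkpoint.with_suffix(".tmp")
            tmp.write_text(json.dumps({"step": step + 1}))
            tmp.replace(checkpoint)
        return 0
    except Exception as exc:
        # Any error is reported and turned into a non-zero status code.
        print(f"run failed: {exc}", file=sys.stderr)
        return 1
```

A command script would then end with something like sys.exit(run_steps(Path("ckpt.json"), 1000)), so that the status code reflects the outcome. If the node is shut down mid-run, the requeued Run picks up at the last checkpointed step instead of starting over.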

Installation

From source

git clone https://github.com/tran-khoa/JuQueue juqueue
cd juqueue
pip install -e .

# (optional) Start with example definitions
cp -r example_defs ~/defs

Via pip

pip install juqueue

Usage

juqueue --def-dir [PATH] --work-dir [PATH]

A minimal user interface is offered at localhost:51234. For more advanced usage, JuQueue can be controlled via FastAPI's interactive docs available at localhost:51234/docs.

Documentation

For now, refer to the examples in example_defs/ and FastAPI's docs, available at localhost:51234/docs or localhost:51234/redoc.
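
Since the API is served by FastAPI, the available endpoints can also be listed programmatically from the OpenAPI schema. The sketch below assumes that JuQueue keeps FastAPI's default schema path (/openapi.json) and that the server is running on localhost:51234. The helper names list_endpoints and fetch_schema are hypothetical and not part of JuQueue.

```python
import json
from urllib.request import urlopen

# Assumption: the schema is served at FastAPI's default path, /openapi.json.
SCHEMA_URL = "http://localhost:51234/openapi.json"

HTTP_METHODS = {"get", "put", "post", "delete", "patch", "options", "head", "trace"}


def list_endpoints(schema: dict) -> list[str]:
    """Return 'METHOD /path' strings for every operation in an OpenAPI schema."""
    endpoints = []
    for path, operations in sorted(schema.get("paths", {}).items()):
        for method in sorted(operations):
            # Path items may also hold non-operation keys such as "parameters".
            if method in HTTP_METHODS:
                endpoints.append(f"{method.upper()} {path}")
    return endpoints


def fetch_schema(url: str = SCHEMA_URL) -> dict:
    """Download and parse the OpenAPI schema from a running server."""
    with urlopen(url, timeout=5) as response:
        return json.load(response)
```

With JuQueue running, print("\n".join(list_endpoints(fetch_schema()))) would print one line per endpoint. The same information is shown interactively at localhost:51234/docs.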
