Computation and work management system for time-constrained cluster environments.

JuQueue

Computation and workflow management system for time-constrained cluster environments. It targets compute clusters on which users are billed for the runtime of an entire node or of a minimum resource allocation unit (e.g. at the Jülich Supercomputing Centre (JSC)).

Work in progress and potentially unstable. The wiki provides further documentation.

Concept

  • Runs
    • Defines the command and its corresponding parameters.
    • Defines an Executor, which determines environment variables, virtual environments, etc.
    • Commands should be robust to termination, i.e.
      • Should resume from previous computation if terminated.
        • If the Node shuts down/fails, the Run will be requeued.
      • Upon failure, must return a non-zero status code. [will not be requeued]
      • Must return status code 0 if completed. [will not be requeued]
  • Experiment
    • A logical group of Runs.
  • Clusters
    • Each Cluster (currently local and slurm) defines a group of nodes.
    • A ClusterManager manages NodeManagers on computation nodes (e.g. via SLURM jobs).
      • Each NodeManager specifies a certain number of Slots and manages the execution of Runs in Python subprocesses.
      • As Runs are queued to or removed from the Cluster, or complete or fail, the number of nodes is rescaled as necessary.
    • For now, the system is aggressive in minimizing the number of nodes, e.g.
      • Assume 4 nodes (each with 4 slots), each executing a single Run.
      • Then 3 nodes are cancelled (along with their Runs), and those Runs are rescheduled to the remaining node.
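
The requirements on commands listed under Runs (resume after termination, exit with status 0 on completion, non-zero on failure) can be met with a small checkpointing pattern. The following is a minimal sketch, not part of JuQueue itself: the checkpoint file format and the function names `load_step`, `save_step` and `run` are assumptions made for illustration.

```python
import json
import sys
from pathlib import Path


def load_step(checkpoint: Path) -> int:
    """Return the next step to run, or 0 if no checkpoint exists yet."""
    if checkpoint.exists():
        return json.loads(checkpoint.read_text())["next_step"]
    return 0


def save_step(checkpoint: Path, next_step: int) -> None:
    """Write to a temporary file, then rename it over the checkpoint,
    so a kill in the middle of writing cannot leave a corrupt file."""
    tmp = checkpoint.with_name(checkpoint.name + ".tmp")
    tmp.write_text(json.dumps({"next_step": next_step}))
    tmp.replace(checkpoint)


def run(checkpoint: Path, total_steps: int) -> int:
    """Do the remaining work and return the process exit status."""
    try:
        for step in range(load_step(checkpoint), total_steps):
            # ... one unit of real work goes here (hypothetical) ...
            save_step(checkpoint, step + 1)
    except Exception as exc:
        print(f"run failed: {exc}", file=sys.stderr)
        return 1  # non-zero status: the Run failed
    return 0  # status 0: the Run completed
```

A script used as a Run's command would end with something like `sys.exit(run(Path("checkpoint.json"), total_steps=100))`. If the node is shut down halfway through and the Run is requeued, the next invocation continues from the last saved step instead of starting over.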

Installation

From source

git clone https://github.com/tran-khoa/JuQueue juqueue
cd juqueue
pip install -e .

# (optional) Start with example definitions
cp -r example_defs ~/defs

Via pip

pip install juqueue

Usage

juqueue --def-dir [PATH] --work-dir [PATH]

A minimal user interface is offered at localhost:51234. For more advanced usage, JuQueue can be controlled via FastAPI's interactive docs available at localhost:51234/docs.
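
Because the API is served by FastAPI, the available endpoints can also be discovered from a script. The sketch below assumes that JuQueue keeps FastAPI's default schema location (`/openapi.json`) and that the server is running on the address given above; the helper names `fetch_schema` and `list_endpoints` are made up for this example, and no specific JuQueue endpoint is assumed.

```python
import json
from urllib.request import urlopen

BASE_URL = "http://localhost:51234"  # address from the Usage section


def fetch_schema(base_url: str = BASE_URL) -> dict:
    """Download the OpenAPI schema (assumes FastAPI's default path)."""
    with urlopen(f"{base_url}/openapi.json", timeout=5) as response:
        return json.load(response)


def list_endpoints(schema: dict) -> list[tuple[str, str]]:
    """Return sorted (METHOD, path) pairs found in an OpenAPI schema."""
    http_methods = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}
    return sorted(
        (method.upper(), path)
        for path, operations in schema.get("paths", {}).items()
        for method in operations
        if method in http_methods  # skip non-operation keys such as "parameters"
    )
```

With a running instance, `for method, path in list_endpoints(fetch_schema()): print(method, path)` prints one line per operation; the interactive docs remain the authoritative reference for parameters and request bodies.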

Documentation

For now, refer to the examples in example_defs/ and FastAPI's docs, available at localhost:51234/docs or localhost:51234/redoc.
