The "Slurm" HQS Tasks Backend
This backend is experimental, and its documentation is not yet complete.
When using HQS Tasks with the Slurm backend, all tasks will be submitted to a Slurm cluster.
To do so, the client connects via SSH to a "login node" on which the Slurm CLI commands used for the communication, in particular `sbatch` and `scontrol`, are expected to be available.
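As an illustration of this communication pattern (the host name, path, and job ID below are placeholders, and the real backend does not necessarily issue exactly these calls), each Slurm interaction is essentially a Slurm CLI command wrapped in an SSH invocation:

import subprocess

# Illustrative sketch only: submit a job script through the login node ...
subprocess.run(
    ["ssh", "login.example.org", "sbatch", "/mnt/shared/hqs_tasks/io/submit.sh"],
    check=True,
)
# ... and later query the state of the resulting job.
subprocess.run(
    ["ssh", "login.example.org", "scontrol", "show", "job", "12345"],
    check=True,
)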
Configuration
To use this backend, add the following basic configuration to your client script:
from hqs_tasks_execution.config import global_config, BackendConfigurationSlurm

global_config.backend = BackendConfigurationSlurm(
    login_host="...",
    slurm_tmp_io="/mnt/shared/hqs_tasks/io",
)
This tells the HQS Tasks execution client to use the Slurm backend. The following sub-parameters are required and must be configured specifically for this backend (see the example after the list):
- `login_host` (`str`): The hostname to which the client opens the SSH connection and on which the Slurm CLI commands can be issued (`sbatch` etc.).
- `slurm_tmp_io` (`str`): A directory path which is accessible on the login node and on any compute node on which the Slurm jobs will be running.
- `slurm_tmp_io_local` (`str | None`): The local directory path of the `slurm_tmp_io` folder if it is accessible on the local machine (where the client script is running). If set, normal file access is used to transfer files between the local machine and the Slurm network; otherwise SSH is used to transfer them via the login node.
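As a sketch (the host name and paths are placeholders, not backend defaults), a setup where the shared directory is also mounted on the local machine could look like this:

from hqs_tasks_execution.config import global_config, BackendConfigurationSlurm

# All values below are example placeholders; adjust them to your cluster.
global_config.backend = BackendConfigurationSlurm(
    login_host="login.example.org",
    # Path as seen from the login node and the compute nodes:
    slurm_tmp_io="/mnt/shared/hqs_tasks/io",
    # The same share happens to be mounted at the same path locally, so files
    # are copied via normal file access instead of SSH:
    slurm_tmp_io_local="/mnt/shared/hqs_tasks/io",
)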
Besides these, the following optional sub-parameters can be configured as needed (see the example after the list):
- `login_user` (`str`): The username for the SSH connection to the login node.
- `ssh_extra` (`list[str]`): Extra parameters to be appended to the command line when issuing SSH commands (e.g., for specifying authentication parameters).
- `sbatch_extra` (`list[str]`): Extra parameters to be appended to the `sbatch` CLI command (e.g., for specifying Slurm-specific provisioning options).
- `slurm_script_shell` (`str`): The login shell used for the automatically created Slurm submit scripts; defaults to `/bin/bash`.
- `preparation_script_before` (`list[str]`): A list of commands added to the Slurm submit script before the task-specific preparation script.
- `preparation_script_after` (`list[str]`): A list of commands added to the Slurm submit script after the task-specific preparation script (but still before running the task CLI).
- `debug_slurm_script` (`bool`): When set to `True`, the script used when submitting the Slurm job is dumped, which helps with debugging the preparation script and the backend itself; defaults to `False`.
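For example, several of these options could be combined as in the following sketch; the user name, SSH key, and Slurm partition are placeholders that depend entirely on your cluster setup:

from hqs_tasks_execution.config import global_config, BackendConfigurationSlurm

# Example values only; adjust user, key path, and partition to your cluster.
global_config.backend = BackendConfigurationSlurm(
    login_host="login.example.org",
    slurm_tmp_io="/mnt/shared/hqs_tasks/io",
    login_user="alice",
    # Use a dedicated SSH key for the cluster:
    ssh_extra=["-i", "~/.ssh/id_ed25519_cluster"],
    # Pass provisioning options through to sbatch:
    sbatch_extra=["--partition=compute", "--time=01:00:00"],
    # Dump the generated submit script to help with debugging:
    debug_slurm_script=True,
)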
Example Configuration (using Conda / Micromamba)
Note that to use this backend, the task needs to be made available on the Slurm nodes in some way, i.e., an environment has to be prepared for the task execution. This must be done manually, and there are different ways to achieve it.
Here we demonstrate one option to do so, and show how to configure the client in the user script to run tasks in that environment.
Preparation: Create the Environment
We assume conda / micromamba is installed on the Slurm nodes. We create a new environment on the shared network drive which is mounted on the Slurm nodes; this guide assumes it to be mounted at `/mnt/shared` (as shown above). The location of the environment can be chosen freely, and it may be placed in any sub-folder of the network drive.
The following command creates such an environment and initially installs Python in it (please adjust the path accordingly):
micromamba create -p /mnt/shared/hqs_tasks/envs/hqs_task_example python=3.13
Then, activate the environment:
micromamba activate /mnt/shared/hqs_tasks/envs/hqs_task_example
Finally, install the software needed to run the task:
- the task implementation itself, for example `hqs_task_example`
- a general execution wrapper script called `hqs_task_execute`

Simply install these using `hqstage`:
hqstage install hqs_task_example hqs_task_execute
Then, make sure that `jq` and `curl` are available on the Slurm nodes, since these are non-Python dependencies required to run the task. If they are missing, a small helper script that is shipped and available in the `PATH` can be invoked to install them:
hqs-task-execute-install-requirements
Configuration
Now the only thing left to do is to tell the client to use (activate) this environment in the script sent to the Slurm node where the task will be running.
This is done by specifying the `preparation_script_before` option in the backend configuration. Two commands are needed: the first initializes conda / micromamba, which usually happens automatically for interactive shell sessions but not for a Slurm submit script; the second activates the environment, just as we did above when installing the software.
from hqs_tasks_execution.config import global_config, BackendConfigurationSlurm

global_config.backend = BackendConfigurationSlurm(
    login_host="...",
    slurm_tmp_io="/mnt/shared/hqs_tasks/io",
    # Script needed to activate the environment:
    preparation_script_before=[
        'eval "$(micromamba shell hook --shell bash)"',
        "micromamba activate /mnt/shared/hqs_tasks/envs/hqs_task_example",
    ],
)
Current Limitations / Caveats
Since this backend is currently experimental, it is not yet feature-complete. The following is a summary of known issues and current limitations:
- The login node is put under high load when many users execute client scripts simultaneously, because the current implementation opens a new SSH connection for every CLI command. You can mitigate this externally by configuring SSH to reuse a persistent connection to the login node via SSH multiplexing (the `ControlMaster`, `ControlPath`, and `ControlPersist` options in your SSH configuration).
- Environments to run tasks in have to be configured manually (conda / module), and proper task versioning is practically impossible right now. (In contrast to the REST backend, automatic task deployment is currently not implemented.)
- Tasks are not containerized even if Docker is available (both containerized and non-containerized setups may be supported in the future).
- Provisioning defaults of tasks are not respected (per-execution provisioning options are, however).