To reserve the node for jobs that, for example, require a GPU, you can create a separate partition for GPU jobs only and assign the node to that partition. Here's an example configuration for your case:
# Define compute node
NodeName=rtx Gres=gpu:1 CPUs=16 Sockets=1 CoresPerSocket=8 ThreadsPerCore=2 CPUSpecList=0,1 State=UNKNOWN Weight=80
# Define GPU partition
PartitionName=gpu Nodes=rtx State=UP
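Note that Gres=gpu:1 on the node line usually also requires the GPU resource type to be declared in slurm.conf and the device to be described in gres.conf on the node. A minimal sketch, assuming a single NVIDIA GPU exposed as /dev/nvidia0 (the device path is an assumption and is not part of the configuration above; adjust it to your hardware):

```
# slurm.conf: declare the generic resource type used in Gres=gpu:1
GresTypes=gpu

# gres.conf on the rtx node: map the gpu resource to its device file
# (assumption: the GPU is /dev/nvidia0)
NodeName=rtx Name=gpu File=/dev/nvidia0
```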
After adding this configuration to your slurm.conf file and restarting the slurmctld and slurmd services, you can submit jobs to the gpu partition using the --partition option, like this:
srun --partition=gpu echo hi
This reserves the rtx node for jobs submitted to the gpu partition: jobs submitted to any other partition will not be scheduled on that node.
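For batch jobs, the same partition can be selected with #SBATCH directives. The following is only a sketch: the job name and output file are hypothetical, and the --gres=gpu:1 request is an addition that is not part of the srun example above (it asks Slurm to allocate the GPU itself, not just a node in the gpu partition).

```shell
#!/bin/bash
#SBATCH --job-name=gpu-test        # hypothetical job name
#SBATCH --partition=gpu            # the GPU-only partition from the configuration
#SBATCH --gres=gpu:1               # assumption: also request the GPU resource itself
#SBATCH --output=gpu-test-%j.out   # hypothetical output file; %j expands to the job ID

# Slurm sets SLURMD_NODENAME inside a job; fall back to uname outside Slurm.
node="${SLURMD_NODENAME:-$(uname -n)}"
# CUDA_VISIBLE_DEVICES is normally set by Slurm when a GPU is allocated via GRES.
gpus="${CUDA_VISIBLE_DEVICES:-none}"

echo "Running on node: ${node}"
echo "Visible GPUs: ${gpus}"
```

Saved as, say, gpu-test.sh (a hypothetical file name), it would be submitted with sbatch gpu-test.sh.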
You will also need to add a default partition for the other jobs:
PartitionName=general Nodes=def Default=YES MaxTime=INFINITE State=UP
All other jobs will then run on this default partition. The gpu partition is used only when explicitly requested, even if the default partition has no available nodes, as the following example shows:
> sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
gpu          up   infinite      1   idle rtx
general*     up   infinite      1   unk* def
> srun --partition=gpu echo hi
hi
> srun echo hi
srun: Required node not available (down, drained or reserved)
srun: job 11 queued and waiting for resources
Asked: 2023-04-05 12:01:24 +0000
Last updated: Apr 08