Slurm on Ubuntu 18.04

How to install Slurm on Ubuntu 18.04

I needed to install slurm on a workstation. These are my notes.

I mostly followed this guide at The Weekend Writeup blog from the start, and consulted the instructions here, here and here.

Installation

Install slurm, munge and the slurm client (I will also submit jobs from the same workstation), pretty much following the instructions from the source.

Munge

sudo apt install munge
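The Ubuntu package should generate a key and start the service on install; a quick sanity check that munge works (the output should report STATUS: Success):

$ munge -n | unmunge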

Mariadb

(Final notes, after finishing: I am not sure if installing and configuring mariadb is necessary for a single node with logging to a file, but I am keeping the instructions here anyway. The config would probably work without the database as well.)

$ sudo apt install mariadb-server
$ sudo mysql -u root
create database slurm_acct_db;
create user 'slurm'@'localhost';
set password for 'slurm'@'localhost' = password('slurmdbpass');
grant usage on *.* to 'slurm'@'localhost';
grant all privileges on slurm_acct_db.* to 'slurm'@'localhost';
flush privileges;
exit
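Optionally, a quick check that the slurm user can reach the database (the password is the one set above):

$ mysql -u slurm -pslurmdbpass -e 'show databases;'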

Slurm

$ sudo apt install slurmd slurm-client slurmctld

Also install the doc and torque packages

$ sudo apt install slurm-wlm-doc slurm-wlm-torque

Then generate the configuration file by opening the file /usr/share/doc/slurm-wlm/html/configurator.html in your browser. If you are installing on a remote machine, you can use a Python HTTP server and SSH port forwarding to achieve this:

$ cd /usr/share/doc/slurm-wlm/html/
$ python3 -m http.server

and then on local machine

$ ssh -L 8000:127.0.0.1:8000 user@slurm-server

and use the configurator.html in your local browser (localhost:8000).

The hostname is given by $ hostname, and the CPU configuration (sockets, CPUs, …) can be obtained from $ lscpu; the commands below show one way to pull out the relevant values. Then paste the generated configuration into /etc/slurm-llnl/slurm.conf
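For example (the exact lscpu labels may differ slightly between versions; free gives the value for RealMemory):

$ hostname
$ lscpu | grep -E '^(CPU\(s\)|Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core)'
$ free -m | grep Mem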

Make sure the PID file paths in slurm.conf are the same as in the systemd unit files, and make sure they are accessible by the user running slurm.
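On Ubuntu 18.04 the packaged unit files should live under /lib/systemd/system (an assumption worth verifying on other setups); the PID file paths they expect can be checked with:

$ grep PIDFile /lib/systemd/system/slurmctld.service /lib/systemd/system/slurmd.service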

I made a dedicated folder in /run, which is a tmpfs filesystem; to do that, create a file /etc/tmpfiles.d/slurm.conf with

d /run/slurm 0770 root slurm -

And then put the PID files in that folder (/run/slurm/slurmctld.pid, /run/slurm/slurmd.pid). Make sure the user slurm exists, so we may run slurm as a different user later:

$ id slurm
uid=64030(slurm) gid=64030(slurm) groups=64030(slurm)
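To create the folder right away without rebooting, something like this should work:

$ sudo systemd-tmpfiles --create /etc/tmpfiles.d/slurm.conf

and then point slurm at the new paths in /etc/slurm-llnl/slurm.conf (if the systemd unit files also set PIDFile=, update them to match):

SlurmctldPidFile=/run/slurm/slurmctld.pid
SlurmdPidFile=/run/slurm/slurmd.pid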

Create /etc/slurm-llnl/cgroup.conf with

CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
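For these constraints to actually be enforced, slurm.conf also has to select the cgroup plugins; as far as I understand, the relevant lines are:

ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup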

And restart the daemons

sudo systemctl restart slurmctld
sudo systemctl restart slurmd

Check that sinfo gives no errors

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1   idle a715

Submit and run a test job. Create a test job script

#!/bin/bash
set -x
echo "My id is $SLURM_JOB_ID" >>  ~/slurm-test.log

and submit it with

$ qsub -q debug -l "nodes=1:ppn=8:walltime=3600,mem=4000MB" test.sh

If all goes well, the qsub will print the job id, and the test script job id will also be in ~/slurm-test.log.
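The same request can also be expressed with sbatch directly (this is my rough translation of the torque options above):

$ sbatch --partition=debug --nodes=1 --ntasks=8 --time=60 --mem=4000 test.sh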

At this point there is no usage accounting, and there is only one job queue (debug).

Add job accounting and a second job queue

I am using slurm on the same workstation on which I also do development, so it is useful to leave some RAM/CPU capacity for the development tools and not allocate all resources to the default queue. For the tasks I wish to run almost immediately, or for when I am not using the workstation, I set up additional queues.

To enable job accounting and the additional queues, modify /etc/slurm-llnl/slurm.conf to include the following:

# Job accounting: using a file on disk
AccountingStorageType=accounting_storage/filetxt
JobCompType=jobcomp/filetxt
AccountingStorageLoc=/var/log/slurm-llnl/accounting
JobCompLoc=/var/log/slurm-llnl/job_completions

JobAcctGatherType=jobacct_gather/linux

AccountingStorageEnforce=limits,qos

# COMPUTE NODES
NodeName=mars Procs=16 Sockets=1 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=40000 State=UNKNOWN

# PARTITIONS
PartitionName=main Nodes=mars MaxTime=INFINITE State=UP Shared=YES MaxCPUsPerNode=12 Default=YES
PartitionName=single Nodes=mars MaxTime=INFINITE State=UP Shared=YES MaxCPUsPerNode=1
PartitionName=two Nodes=mars MaxTime=INFINITE State=DOWN Shared=YES MaxCPUsPerNode=2

# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory

The configuration describes one node (hostname mars) with 16 processors and 40GB of memory.

Job info is stored to a file.
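Once a few jobs have run, the recorded accounting data can be inspected with sacct, for example:

$ sacct --format=JobID,JobName,Partition,AllocCPUS,State,Elapsed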

The jobs can be run through three queues: the main queue is limited to 12 CPUs per node, the single queue to 1 CPU, and the two queue to 2 CPUs.

The idea is that when I am away, I can utilize the whole workstation by filling the main and two queues; when I am working, the main queue continues to run while I still have enough CPU and memory for development; and if I need to run a few jobs quickly while working, I can use the single queue. The partition two is set to the DOWN state by default.

Scheduling is set to adhere to limits in resources (CPUs and memory) and schedule jobs with backfill.

Useful debug commands:

$ scontrol show nodes
$ scontrol show partitions
$ sinfo

These show detailed info on nodes (useful to check whether limits are properly enforced), the same kind of info for partitions, and a short summary of partitions and their states.

Submit jobs

To submit jobs to a queue, use the sbatch command. For me, qsub gave wrong results with regard to the requested number of CPUs and memory, and did not work well in combination with enforcing limits.

Here is a short summary of options to pass to sbatch (a full example follows the list):

  • --dependency=afterok:job_id[:jobid...]
  • --nodes=<minnodes-maxnodes>
  • --ntasks=<num> Number of tasks the job needs, by default (see --cpus-per-task) this is the same as number of cpus.
  • --time=<time> Limit on the total run time. A time limit of zero requests that no time limit be imposed. Acceptable time formats include “minutes”, “minutes:seconds”, “hours:minutes:seconds”, “days-hours”, “days-hours:minutes” and “days-hours:minutes:seconds”.
  • --partition=<partition names> select the partitions on which the job can run
  • --mem=<total memory per node in MB>
  • --mem-per-cpu=<memory per cpu in MB>
  • --cpus-per-task=<num. of cpus per each task, defaults to 1>
  • --test-only Validate the batch script and return an estimate of when a job would be scheduled to run given the current job queue and all the other arguments specifying the job requirements. No job is actually submitted.
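For example, a job asking for 4 CPUs and 8 GB of memory for two hours on the main partition (the script name and numbers are just placeholders):

$ sbatch --partition=main --ntasks=1 --cpus-per-task=4 --mem=8000 --time=02:00:00 my_job.sh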

To switch partition two up or down use:

$ sudo scontrol update Partition=two State=drain
$ sudo scontrol update Partition=two State=down
$ sudo scontrol update Partition=two State=resume
$ sudo scontrol update Partition=two State=up

The states mean:

  • drain: no further jobs will be scheduled, active ones will be kept alive
  • down: active jobs will also be killed
  • resume: start scheduling on the partition again
  • up: turn the partition up, if you turned it down before
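The change can be verified with, for example:

$ sinfo --partition=two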

To get info on the job queue, use qstat.