Slurm on Ubuntu 18.04
How to install Slurm on Ubuntu 18.04
I needed to install slurm on a workstation. These are my notes.
I mostly followed this guide at The Weekend Writeup blog, and also consulted instructions here, here and here.
Installation
Install slurm, munge and the slurm client (I will also submit jobs from the same workstation), pretty much following the instructions from the source.
Munge
sudo apt install munge
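To check that munge works, you can encode and decode a credential locally (a quick sanity check, not from the original guide):
$ munge -n | unmunge
This should report STATUS: Success (0). Also make sure the munge service is enabled and running, e.g. with sudo systemctl enable --now munge.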
Mariadb
(Final notes, after finishing: I am not sure if installing and configuring mariadb is necessary for a single node with logging to a file, but I am keeping the instructions here anyway. The config would probably work without the database as well.)
$ sudo apt install mariadb-server
$ sudo mysql -u root
create database slurm_acct_db;
create user 'slurm'@'localhost';
set password for 'slurm'@'localhost' = password('slurmdbpass');
grant usage on *.* to 'slurm'@'localhost';
grant all privileges on slurm_acct_db.* to 'slurm'@'localhost';
flush privileges;
exit
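If you do go the database route for accounting (I did not test this; the rest of these notes use plain-text files instead), slurmdbd (a separate package) would need its own config. A minimal sketch of /etc/slurm-llnl/slurmdbd.conf, assuming the database and user created above, could look like:
AuthType=auth/munge
DbdHost=localhost
SlurmUser=slurm
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageUser=slurm
StoragePass=slurmdbpass
StorageLoc=slurm_acct_db
slurm.conf would then point at it with AccountingStorageType=accounting_storage/slurmdbd instead of the filetxt backend used below.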
Slurm
$ sudo apt install slurmd slurm-client slurmctld
Also install the doc and torque packages
$ sudo apt install slurm-wlm-doc slurm-wlm-torque
Then generate the configuration file by opening /usr/share/doc/slurm-wlm/html/configurator.html in your browser.
In case you are installing on a remote machine, you can use a Python HTTP server and SSH port forwarding to achieve this:
$ cd /usr/share/doc/slurm-wlm/html/
$ python3 -m http.server
and then on local machine
$ ssh -L 8000:127.0.0.1:8000 user@slurm-server
and open configurator.html in your local browser (localhost:8000).
The hostname is given by $ hostname, and the CPU configuration (sockets, CPUs, …) can be obtained from $ lscpu.
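As a cross-check for the values you enter into the configurator, slurmd can print a node definition line based on the hardware it detects:
$ slurmd -C
The NodeName=... line it prints (CPUs, sockets, cores, threads, real memory) can be compared against, or pasted into, the generated slurm.conf.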
Now paste the generated configuration into /etc/slurm-llnl/slurm.conf.
Make sure the PID file paths are the same as in the systemd unit files, and that they are accessible by the user running slurm.
I made a dedicated folder in /run, which is a tmpfs filesystem. To have it created at boot, add a file /etc/tmpfiles.d/slurm.conf with
d /run/slurm 0770 root slurm -
and then put the PID files in that folder (/run/slurm/slurmctld.pid, /run/slurm/slurmd.pid). Make sure the user slurm exists, so we may run slurm as a different user later:
$ id slurm
uid=64030(slurm) gid=64030(slurm) groups=64030(slurm)
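For reference, the slurm.conf lines that have to match the folder created above are (the paths are my choice, not the package defaults):
SlurmctldPidFile=/run/slurm/slurmctld.pid
SlurmdPidFile=/run/slurm/slurmd.pid
If the packaged systemd units point elsewhere, they can be overridden with sudo systemctl edit slurmctld (and likewise for slurmd), adding for example:
[Service]
PIDFile=/run/slurm/slurmctld.pid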
Create /etc/slurm-llnl/cgroup.conf:
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
And restart the daemons:
sudo systemctl restart slurmctld
sudo systemctl restart slurmd
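If either daemon fails to come up, the journal usually points at the offending line in slurm.conf:
$ sudo systemctl status slurmctld slurmd
$ sudo journalctl -u slurmctld -u slurmd -e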
Check that sinfo gives no errors:
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 1 idle a715
Submit and run a test job. Create a test job script
#!/bin/bash
set -x
echo "My id is $SLURM_JOB_ID" >> ~/slurm-test.log
and submit it with
$ qsub -q debug -l "nodes=1:ppn=8:walltime=3600,mem=4000MB" test.sh
If all goes well, qsub will print the job id, and the job id echoed by the test script will also show up in ~/slurm-test.log.
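The qsub command comes from the slurm-wlm-torque compatibility package; the same test job can also be submitted with the native sbatch, roughly equivalent to the qsub line above (and, as noted below, sbatch is what I ended up using):
$ sbatch --partition=debug --nodes=1 --ntasks=8 --time=01:00:00 --mem=4000M test.sh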
At this point there is no usage accounting, and there is only one job queue (debug).
Add job accounting and a second job queue
I am using slurm on the same workstation on which I also do development, so it is useful to leave some RAM/CPU power for the development tools and not allocate all resources to the default queue. For the tasks I wish to run almost immediately, or when I am not using the workstation, I set up additional queues.
To enable job accounting, modify /etc/slurm-llnl/slurm.conf to contain the following:
# Job accounting: using a file on disk
AccountingStorageType=accounting_storage/filetxt
JobCompType=jobcomp/filetxt
AccountingStorageLoc=/var/log/slurm-llnl/accounting
JobCompLoc=/var/log/slurm-llnl/job_completions
JobAcctGatherType=jobacct_gather/linux
AccountingStorageEnforce=limits,qos

# COMPUTE NODES
NodeName=mars Procs=16 Sockets=1 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=40000 State=UNKNOWN

# PARTITIONS
PartitionName=main Nodes=mars MaxTime=INFINITE State=UP Shared=YES MaxCPUsPerNode=12 Default=YES
PartitionName=single Nodes=mars MaxTime=INFINITE State=UP Shared=YES MaxCPUsPerNode=1
PartitionName=two Nodes=mars MaxTime=INFINITE State=DOWN Shared=YES MaxCPUsPerNode=2

#scheduling
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
The configuration describes one node (hostname mars) with 16 processors and 40 GB of memory.
Job info is stored to a file.
Jobs can be run through three queues: the main queue is limited to 12 CPUs per node, the single queue to 1 CPU, and the two queue to 2 CPUs.
The idea is that when I am away, I can utilize the whole workstation by filling the main and two queues; when working, the main queue continues to run and I still have enough CPU and memory for development; and if I need to run a few jobs quickly while working, I can use the single queue. The partition two is set to the DOWN state by default.
Scheduling is set to adhere to the resource limits (CPUs and memory) and to schedule jobs with backfill.
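One practical detail with the filetxt backends (my own note, not something from the guide): the accounting and job completion files have to be writable by the user slurmctld runs as (SlurmUser, in my case slurm), so it does not hurt to create them with the right ownership up front:
$ sudo mkdir -p /var/log/slurm-llnl
$ sudo touch /var/log/slurm-llnl/accounting /var/log/slurm-llnl/job_completions
$ sudo chown slurm:slurm /var/log/slurm-llnl /var/log/slurm-llnl/accounting /var/log/slurm-llnl/job_completions
Restart slurmctld and slurmd after changing slurm.conf.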
Useful debug commands:
$ scontrol show nodes
$ scontrol show partitions
$ sinfo
These show the info on nodes (useful to check if limits are properly enforced), the info on partitions (a similarly detailed dump), and a short overview of the partitions and their states.
Submit jobs
To submit jobs to a queue, use the sbatch command.
For me, qsub gave wrong results with regard to the requested number of CPUs and memory, and did not work well in combination with enforced limits.
Here is a short summary of options to pass to sbatch (a complete example script using some of them follows the list):
- --dependency=afterok:job_id[:jobid...] : defer the start of this job until the listed jobs have finished successfully.
- --nodes=<minnodes-maxnodes> : number of nodes to allocate to the job.
- --ntasks=<num> : number of tasks the job needs; by default (see --cpus-per-task) this is the same as the number of CPUs.
- --time=<time> : limit on the total run time. A time limit of zero requests that no time limit be imposed. Acceptable time formats include “minutes”, “minutes:seconds”, “hours:minutes:seconds”, “days-hours”, “days-hours:minutes” and “days-hours:minutes:seconds”.
- --partition=<partition names> : select the partitions on which the job can run.
- --mem=<total memory per node in MB>
- --mem-per-cpu=<memory per cpu in MB>
- --cpus-per-task=<num. of cpus per each task, defaults to 1>
- --test-only : validate the batch script and return an estimate of when the job would be scheduled to run given the current job queue and all the other arguments specifying the job requirements. No job is actually submitted.
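Putting some of these together, a job script for the main partition could look like the following (the resource values are just illustrative):
#!/bin/bash
#SBATCH --partition=main
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8000
#SBATCH --time=02:00:00
set -x
echo "My id is $SLURM_JOB_ID" >> ~/slurm-test.log
Submit it with $ sbatch job.sh, or chain it after another job with $ sbatch --dependency=afterok:<jobid> job.sh.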
To switch partition two up or down use:
$ sudo scontrol update Partition=two State=drain
$ sudo scontrol update Partition=two State=down
$ sudo scontrol update Partition=two State=resume
$ sudo scontrol update Partition=two State=up
The states mean:
- drain: no further jobs will be scheduled, active ones will be kept alive
- down: active jobs will also be killed
- resume: start scheduling on the partition again
- up: turn the partition up, if you turned it down before
To get info on the job queue, use qstat.
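The native equivalent is squeue, which also shows the partition and state of each pending or running job:
$ squeue
$ squeue -u $USER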