Slurm Installation Guide

This page will show you how to set up and configure a Slurm queueing system. Useful link: https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/

Pre-installation

Create global user account

The slurm and munge users need to have consistent UIDs/GIDs across all nodes in the cluster. Creating these global user accounts must be done before installing the RPMs. It can be done via LDAPAdmin or any service that you use to manage users. If you don't have access to those services, please contact your system administrators.
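
If you manage accounts locally instead, the sketch below creates the two users with fixed UIDs/GIDs. The values 991 and 992 are placeholders; pick IDs that are unused on every node and repeat the same commands on each node so the IDs stay identical cluster-wide.

export MUNGEUSER=991
groupadd -g $MUNGEUSER munge
useradd -m -d /var/lib/munge -u $MUNGEUSER -g munge -s /sbin/nologin munge   # MUNGE runs as a no-login system user
export SLURMUSER=992
groupadd -g $SLURMUSER slurm
useradd -m -d /var/lib/slurm -u $SLURMUSER -g slurm -s /bin/bash slurm       # the Slurm daemons run as this user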

Install the latest epel-release

[Centos/RHEL]

CentOS8: dnf install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
CentOS7: yum install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
RHEL7:   yum install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm

[Ubuntu]

apt-get update

Install MUNGE

MUNGE is an authentication service that Slurm uses to validate users' credentials.

sudo yum install munge munge-libs munge-devel

(master node only) Create secret key

dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
chown munge:munge /etc/munge/munge.key
chmod 400 /etc/munge/munge.key

For worker nodes, scp the munge.key from the master node and set the correct ownership and permissions:

scp -p /etc/munge/munge.key hostXXX:/etc/munge/munge.key

Set ownership and permissions on the following directories

chown -R munge: /etc/munge/ /var/log/munge/ /var/lib/munge /run/munge
chmod 0700 /etc/munge/ /var/log/munge/

Start and enable MUNGE daemon at boot time

systemctl enable munge
systemctl start  munge
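
To confirm MUNGE works, round-trip a credential; munge and unmunge ship with the munge package, and hostXXX is a worker-node placeholder as above.

munge -n | unmunge              # local check: should report STATUS: Success (0)
munge -n | ssh hostXXX unmunge  # from the master: validates the shared key on a worker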

Increase the number of MUNGE threads on the master node (optional but recommended on a busy server)

cp /usr/lib/systemd/system/munge.service /etc/systemd/system/munge.service
vim /etc/systemd/system/munge.service
Edit the ExecStart line so it reads: ExecStart=/usr/sbin/munged --num-threads 10
Then reload systemd and restart munge:
systemctl daemon-reload
systemctl restart munge
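
Alternatively, a drop-in override keeps the change separate from the packaged unit file. This is a sketch relying on standard systemd behavior: the empty ExecStart= clears the packaged command before the new one is set.

systemctl edit munge
# in the editor that opens, add:
#   [Service]
#   ExecStart=
#   ExecStart=/usr/sbin/munged --num-threads 10
systemctl restart munge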

Install Slurm

Although Slurm is available on EPEL, it is better to build the RPMs ourselves to ensure we have the latest version.

This section also shows you how to set up Slurm with accounting (slurmdbd, using MariaDB as the database). Accounting is optional and can be skipped, but it is useful for keeping records of jobs and managing resources.

Install prerequisite packages

yum install rpm-build gcc python3 openssl openssl-devel pam-devel numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel rrdtool-devel ncurses-devel gtk2-devel libibmad libibumad perl-Switch perl-ExtUtils-MakeMaker xorg-x11-xauth http-parser-devel json-c-devel mysql-devel libssh2-devel man2html

If you see the error below, a conflicting httpd package is already installed on the server:

Error: httpd24u-filesystem conflicts with httpd-filesystem-2.4.35-5.el7.noarch
Error: httpd24u-tools conflicts with httpd-tools-2.4.35-5.el7.x86_64

The solution is to uninstall those packages and run the yum command above again; it will then install the correct packages.

yum remove httpd
yum remove httpd-tools
yum remove httpd-filesystem

If you are setting up slurmdbd, you will also need

yum install mariadb-server mariadb-devel

[Ubuntu] Download packages

apt-get update
apt-get install slurm slurmd slurm-client slurmrestd slurmdbd slurmctld

[Centos/RHEL] Build RPMS

Check for the latest version in https://download.schedmd.com/slurm/

The current version at the time of this tutorial is 22.05.5

cd /root
wget https://download.schedmd.com/slurm/slurm-22.05.5.tar.bz2
export VER=22.05.5; rpmbuild -ta slurm-$VER.tar.bz2 --with mysql --with slurmrestd
     # Includes accounting support with the slurm-slurmdbd package
cd /root/rpmbuild/RPMS/x86_64

For Master Node

yum install slurm-$VER*rpm slurm-devel-$VER*rpm slurm-perlapi-$VER*rpm slurm-torque-$VER*rpm slurm-example-configs-$VER*rpm slurm-slurmctld-$VER*rpm slurm-slurmd-$VER*rpm slurm-libpmi-$VER*rpm slurm-slurmdbd-$VER*rpm slurm-slurmrestd-$VER*rpm

For Worker Nodes

yum install slurm-$VER*rpm slurm-devel-$VER*rpm slurm-perlapi-$VER*rpm slurm-torque-$VER*rpm slurm-example-configs-$VER*rpm slurm-slurmd-$VER*rpm slurm-libpmi-$VER*rpm slurm-slurmrestd-$VER*rpm

Slurmctld

Slurm.conf
cd /etc/slurm
cp slurm.conf.example slurm.conf
Edit the slurm.conf file; see gimel2's slurm.conf for an example.
If there is no slurm.conf.example, you can get one here https://github.com/SchedMD/slurm/blob/master/etc/slurm.conf.example
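
As a minimal sketch of the fields you will typically change, the snippet below uses placeholder names (mycluster, master01) and the slurm-llnl paths created in the next step; gimel2's slurm.conf remains the authoritative reference.

ClusterName=mycluster                  # placeholder cluster name
SlurmctldHost=master01                 # placeholder: master node hostname
SlurmctldPort=6817
SlurmdPort=6818
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm-llnl
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
ProctrackType=proctrack/cgroup
# only if you are running slurmdbd accounting:
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=master01         # placeholder
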
Other directories
mkdir -p /var/spool/slurm-llnl /var/run/slurm-llnl /var/log/slurm-llnl
chown -R slurm:slurm /var/spool/slurm-llnl /var/run/slurm-llnl /var/log/slurm-llnl /etc/slurm
Open Firewall
firewall-cmd --permanent --zone=public --add-port=6817/tcp #slurmctld
firewall-cmd --permanent --zone=public --add-port=6818/tcp #slurmd
firewall-cmd --reload
Start daemon
systemctl enable slurmctld
systemctl start slurmctld
systemctl enable slurmdbd
systemctl start slurmdbd 
Verify Setup
sinfo -lNe # list nodes and partitions
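
If sinfo cannot reach the controller, check whether slurmctld itself responds:

scontrol ping   # reports whether the primary controller is UP or DOWN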

Slurmdbd

This will show you how to set up slurmdbd's accounting storage in MariaDB. Slurmdbd doesn't need to be installed on the master node.

Make sure that the mariadb packages were installed before building the Slurm RPMs:

rpm -q mariadb-server mariadb-devel
rpm -ql slurm-slurmdbd | grep accounting_storage_mysql.so     # Must show location of this file

Start daemon
systemctl start mariadb
systemctl enable mariadb
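
SchedMD's accounting documentation also recommends raising MariaDB's InnoDB limits before creating the database. A sketch with placeholder sizes (tune innodb_buffer_pool_size to your server's RAM), for example in /etc/my.cnf.d/innodb.cnf:

[mysqld]
innodb_buffer_pool_size=1024M
innodb_log_file_size=64M
innodb_lock_wait_timeout=900

Restart mariadb after editing the file so the settings take effect.
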
Configure db

1. Set up db's root password

/usr/bin/mysql_secure_installation

2. Create db

mysql -p
> grant all on slurm_acct_db.* TO 'slurm'@'localhost' identified by 'some_pass' with grant option;  ### WARNING: replace "some_pass" with the password you want the slurm db user to use (it becomes StoragePass in slurmdbd.conf below).
> SHOW VARIABLES LIKE 'have_innodb';
> create database slurm_acct_db;
> quit;
To verify the db grants for the slurm user
mysql -p -u slurm
> show grants;
> quit;
Slurmdbd.conf
LogFile=/var/log/slurm/slurmdbd.log
DbdHost=XXXX    # Replace with the slurmdbd server hostname (for example, slurmdbd.my.domain)
DbdPort=6819    # The default value
SlurmUser=slurm
StorageHost=localhost
StoragePass=some_pass    # The above defined database password
StorageLoc=slurm_acct_db
.
.
LogFile=/var/log/slurm-llnl/slurmdbd.log
PidFile=/var/run/slurm-llnl/slurmdbd.pid
# Add these variables below so that the db doesn't get too big
PurgeEventAfter=3months
PurgeJobAfter=3months
PurgeResvAfter=3months
PurgeStepAfter=2months
PurgeSuspendAfter=1month
PurgeTXNAfter=3months
PurgeUsageAfter=3months

Change ownership and permission

chown slurm: /etc/slurm/slurmdbd.conf
chmod 600 /etc/slurm/slurmdbd.conf
systemctl restart slurmdbd
Firewall
firewall-cmd --permanent --zone=public --add-port=6819/tcp #slurmdbd
firewall-cmd --reload
Verify Setup
scontrol show config | grep AccountingStorageHost
sacct -a
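
If sacct complains that the cluster is not registered in the accounting database, add it with sacctmgr (a sketch; the name must match ClusterName in slurm.conf):

sacctmgr add cluster mycluster   # mycluster is a placeholder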

Slurmd

Follow the Build RPMS instructions above for Worker Nodes.

Slurm.conf

You will need to edit the slurm.conf on the master node and copy it to the worker node. Figure out what CPU/memory resources you have on this worker node:

cat /proc/cpuinfo
or
slurmd -C

Append the following line:

NodeName=[Node_Name] NodeAddr=XX.XX.XX.XXX CPUs=[no_cpu] State=UNKNOWN
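A concrete NodeName line might look like this (placeholder node name, address, and resources; RealMemory is in MB and optional, but it lets Slurm schedule on memory):
NodeName=node01 NodeAddr=10.0.0.11 CPUs=16 RealMemory=64000 State=UNKNOWN   # placeholder values; take them from slurmd -C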
You can choose to add the new node to a specific partition, or include the line below, which automatically adds every node to the partition:
PartitionName=gimel2.cpu Nodes=ALL Default=YES MaxTime=INFINITE State=UP

Setup directories

mkdir -p /var/spool/slurm-llnl /var/run/slurm-llnl /var/log/slurm-llnl
chown -R slurm:slurm /etc/slurm /var/spool/slurm-llnl /var/run/slurm-llnl /var/log/slurm-llnl

Cgroup

cd /etc/slurm
cp cgroup.conf.example cgroup.conf
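
If you want slurmd to actually enforce the CPU and memory limits that jobs request, a minimal cgroup.conf sketch looks like this (both are standard cgroup.conf parameters; enforcement also requires TaskPlugin=task/cgroup in slurm.conf):

ConstrainCores=yes       # confine jobs to their allocated cores
ConstrainRAMSpace=yes    # confine jobs to their allocated memory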

Firewall

firewall-cmd --permanent --zone=public --add-port=6818/tcp #slurmd
firewall-cmd --reload

Restart daemon

systemctl enable slurmd
systemctl start slurmd
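
Finally, a quick sanity check; node01 is a placeholder for the new worker's NodeName:

systemctl status slurmd                         # on the worker: the daemon should be active
sinfo -lNe                                      # on the master: the new node should appear
scontrol update NodeName=node01 State=RESUME    # only if the node comes up DOWN or DRAINED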