Latest revision as of 20:26, 28 May 2024
Slurm user-guide
Submit Jobs with Slurm
SBATCH-MR (beta)
sbatch-mr is a Slurm version of qsub-mr for submitting jobs to the Slurm queueing system. Note: it has not been extensively tested yet. Please contact me if the script does not work. We are hoping to migrate fully from the outdated SGE system to Slurm, so any error report is helpful. - Khanh
New slurm scripts are located in /nfs/soft/tools/utils/sbatch-slice
Simply replace /nfs/soft/tools/utils/qsub-slice/qsub-mr with /nfs/soft/tools/utils/sbatch-slice/sbatch-mr in your script.
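The path swap can also be done in bulk with sed. This is only a minimal sketch, assuming your submission script is a plain text file; the helper name swap_submitter and the file name my_submit.sh are hypothetical.

```shell
# Paths taken from the text above.
OLD=/nfs/soft/tools/utils/qsub-slice/qsub-mr
NEW=/nfs/soft/tools/utils/sbatch-slice/sbatch-mr

# swap_submitter (hypothetical helper): reads a script on stdin and
# writes it to stdout with every qsub-mr path replaced by sbatch-mr.
swap_submitter() {
    sed "s|${OLD}|${NEW}|g"
}

# Usage (my_submit.sh is a placeholder for your own script):
#   swap_submitter < my_submit.sh > my_submit.slurm.sh
```

Writing to a new file rather than editing in place keeps the SGE version around while sbatch-mr is still in beta.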
To check the status of your job:
By username
$ squeue -u <username>
By jobid
$ squeue -j <job_id>
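For scripts that need to wait until a job has finished, squeue can be polled. The following is a rough sketch rather than a tested site tool; job_state and wait_for_job are hypothetical helper names, and the 30-second default interval is an arbitrary choice.

```shell
# job_state (hypothetical helper): print the state of one job,
# e.g. PENDING or RUNNING. Prints nothing once the job has left the queue.
#   -h       suppress the header line
#   -j ID    select by job ID
#   -o "%T"  print only the job state in long form
job_state() {
    squeue -h -j "$1" -o "%T" 2>/dev/null
}

# wait_for_job (hypothetical helper): poll until the job is gone.
# $1 = job ID, $2 = seconds between polls (default 30).
wait_for_job() {
    while [ -n "$(job_state "$1")" ]; do
        sleep "${2:-30}"
    done
}

# Usage:
#   wait_for_job <job_id> 60
```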
Submit load2d Jobs
$ cd <catalog_shortname>
$ source /nfs/exa/work/khtang/ZINC21_load2d/loadenv_zinc21.sh
(development) $ sh /nfs/exa/work/khtang/submit_scripts/sbatch_slice/batch_zinc21.slurm <catalog_shortname>.ism
Submit DOCK Jobs
- ANACONDA Installation (Python 2.7)
Each user is welcome to download anaconda and install into his/her own folder
https://www.anaconda.com/distribution/
wget https://repo.anaconda.com/archive/Anaconda2-2019.10-Linux-x86_64.sh
NB: Anaconda is also available for Python 3, which we expect to move to in the near future.
To install, run /bin/sh Anaconda2-2019.10-Linux-x86_64.sh
After the installation is completed, you need to install a few packages:
conda install -c free bsddb
conda install -c rdkit rdkit
conda install numpy
Running DOCK-3.7 with Slurm
Here is a “guinea pig project”, which has been done with DOCK-3.7 locally.
GPR40 example: /mnt/nfs/home/dudenko/TEST_DOCKING_PROJECT
ChEMBL ligands: /mnt/nfs/home/dudenko/CHEMBL4422_active_ligands
This test calculation should run smoothly; if it does not, there is a problem.
Ultimately, you may need to compare your results with the reference run:
- CHEMBL4422_active_ligands.mol2 - TOP500 scoring poses
- extract_all.sort.uniq.txt - a print-out of scoring details
The Slurm queue manager is installed locally on gimel. Use it to run this test (and all your future jobs) in parallel.
Do not forget to set the DOCKBASE variable: export DOCKBASE=/nfs/soft/dock/versions/dock37/DOCK-3.7.3rc1/
Useful DOCK commands to remember:
$DOCKBASE/docking/setup/setup_db2_zinc15_file_number.py ./ CHEMBL4422_active_ligands_ CHEMBL4422_active_ligands.sdi 100 count
$DOCKBASE/analysis/extract_all.py -s -20
$DOCKBASE/analysis/getposes.py -l 500 -o CHEMBL4422_active_ligands.mol2
Useful slurm commands (see https://slurm.schedmd.com/quickstart.html):
- to see what machine resources are offered by the cluster, do sinfo -lNe
- to submit a DOCK-3.7 job, run $DOCKBASE/docking/submit/submit_slurm_array.csh
- to see what is happening in the queue, run squeue
- to see detailed info for a specific job, run scontrol show jobid=_JOBID_
- to delete a job from the queue, run scancel _JOBID_
If Slurm is running correctly, type squeue and you should see something like this:
JOBID            PARTITION  NAME      USER     ST  TIME  NODES  NODELIST(REASON)
4187_[637-2091]  gimel      array_jo  dudenko  PD  0:00  1      (Resources)
4187_629         gimel      array_jo  dudenko  R   0:00  1      n-1-20
4187_630         gimel      array_jo  dudenko  R   0:00  1      n-5-34
4187_631         gimel      array_jo  dudenko  R   0:00  1      n-1-21
4187_632         gimel      array_jo  dudenko  R   0:00  1      n-5-34
4187_633         gimel      array_jo  dudenko  R   0:00  1      n-1-21
4187_634         gimel      array_jo  dudenko  R   0:00  1      n-5-34
4187_635         gimel      array_jo  dudenko  R   0:00  1      n-5-35
4187_636         gimel      array_jo  dudenko  R   0:00  1      n-5-34
4187_622         gimel      array_jo  dudenko  R   0:01  1      n-5-34
4187_623         gimel      array_jo  dudenko  R   0:01  1      n-5-34
4187_624         gimel      array_jo  dudenko  R   0:01  1      n-5-35
4187_625         gimel      array_jo  dudenko  R   0:01  1      n-1-17
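The 4187_NNN entries in a listing like the one above are tasks of a single Slurm job array. For orientation, here is a minimal sketch of an array batch script. It is not the contents of submit_slurm_array.csh; the job name, the task range, the run_task helper and the echo payload are all placeholders, and only the partition name gimel comes from the listing.

```shell
#!/bin/sh
#SBATCH --job-name=array_demo      # placeholder name
#SBATCH --partition=gimel          # partition shown in the listing above
#SBATCH --array=1-10               # placeholder range: tasks 1..10
#SBATCH --output=slurm-%A_%a.out   # %A = array job ID, %a = task index

# run_task (hypothetical helper): placeholder work for one array task.
# A real script would process chunk number $1 here.
run_task() {
    echo "processing chunk $1"
}

# Slurm sets SLURM_ARRAY_TASK_ID for each task; default to 1 so the
# script can also be run by hand outside the queue.
run_task "${SLURM_ARRAY_TASK_ID:-1}"
```

Saved as, say, array_demo.sh, it would be submitted with sbatch array_demo.sh, and each task would then show up in squeue as <job_id>_<task_index>.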
As root at gimel, it is possible to modify a particular job, e.g., scontrol update jobid=635 TimeLimit=7-00:00:00
Slurm Installation Guide
Detailed step-by-step installation instructions (for sysadmins only)
Setup Node master
TBA
Setup Compute Nodes
Useful links:
https://slurm.schedmd.com/quickstart_admin.html
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation
node n-1-17 (Installation of the latest slurm version (20.02.04)? see below in "Migrating to gimel5" section)
- make sure the node is running CentOS 7: cat /etc/redhat-release
- wget https://download.schedmd.com/slurm/slurm-17.02.11.tar.bz2
- yum install readline-devel perl-ExtUtils-MakeMaker.noarch munge-devel pam-devel openssl-devel
- export VER=17.02.11; rpmbuild -ta slurm-$VER.tar.bz2 --without mysql; mv /root/rpmbuild .
installing built packages from rpmbuild:
- yum install rpmbuild/RPMS/x86_64/slurm-plugins-17.02.11-1.el7.x86_64.rpm
- yum install rpmbuild/RPMS/x86_64/slurm-17.02.11-1.el7.x86_64.rpm
- yum install rpmbuild/RPMS/x86_64/slurm-munge-17.02.11-1.el7.x86_64.rpm
setting up munge:
Copy /etc/munge/munge.key from gimel into the local /etc/munge directory. The key must be identical on all nodes.
Munge is a daemon responsible for secure data exchange between nodes.
Set permissions accordingly: chown munge:munge /etc/munge/munge.key; chmod 400 /etc/munge/munge.key
starting munge: systemctl enable munge; systemctl start munge
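To confirm that a node really has the same key as gimel without printing the key itself, the checksums can be compared. A small sketch, assuming sha256sum is available (it is part of coreutils on CentOS); key_fingerprint is a hypothetical helper name.

```shell
# key_fingerprint (hypothetical helper): print the SHA-256 checksum of a
# key file. Run it on each node (as root, since munge.key is mode 400)
# and compare the printed values by eye or in a script.
key_fingerprint() {
    sha256sum "$1" | cut -d' ' -f1
}

# Usage on a node:
#   key_fingerprint /etc/munge/munge.key
#
# munge's own round-trip checks, as described in the MUNGE documentation:
#   munge -n | unmunge              # encode and decode locally
#   munge -n | ssh gimel unmunge    # encode here, decode on gimel
```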
setting up slurm:
- create a user slurm: adduser slurm.
- the UID and GID of the slurm user must be identical on all nodes.
Otherwise, one needs to specify a mapping scheme for translating UIDs/GIDs between nodes.
To edit the slurm UID/GID, run vipw and replace the slurm line with slurm:x:XXXXX:YYYYY::/nonexistent:/bin/false
XXXXX and YYYYY for the slurm user can be found on gimel in /etc/passwd
NB: don't forget to edit /etc/group as well.
- copy /etc/slurm/slurm.conf from gimel into the local /etc/slurm directory.
- figure out what CPU/Memory resources you have at n-1-17 (see /proc/cpuinfo) and append the following line:
NodeName=n-1-17 NodeAddr=10.20.1.17 CPUs=24 State=UNKNOWN
- append n-1-17 to the partition: PartitionName=gimel Nodes=gimel,n-5-34,n-5-35,n-1-17 Default=YES MaxTime=INFINITE State=UP
- create the following folders:
mkdir -p /var/spool/slurm-llnl /var/run/slurm-llnl /var/log/slurm-llnl
chown -R slurm:slurm /var/spool/slurm-llnl /var/run/slurm-llnl /var/log/slurm-llnl
- restarting slurm master node at gimel (Centos 6): /etc/init.d/slurm restart
- enabling and starting slurm computing nodes (Centos 7): systemctl enable slurmd; systemctl start slurmd
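Two of the steps above lend themselves to small helpers: checking the slurm user's UID/GID and composing the NodeName line. This is a sketch under the assumption that the slurm account appears in passwd format; passwd_ids and node_line are hypothetical names, not existing tools.

```shell
# passwd_ids (hypothetical helper): read passwd-format lines on stdin and
# print "UID:GID" for the slurm user. Fields 3 and 4 of a passwd entry
# are the numeric UID and GID.
passwd_ids() {
    awk -F: '$1 == "slurm" { print $3 ":" $4 }'
}

# node_line (hypothetical helper): build the NodeName line for slurm.conf.
# $1 = node name, $2 = node address, $3 = CPU count.
node_line() {
    echo "NodeName=$1 NodeAddr=$2 CPUs=$3 State=UNKNOWN"
}

# Usage on the new node:
#   getent passwd slurm | passwd_ids         # compare with the value on gimel
#   node_line n-1-17 10.20.1.17 "$(nproc)"   # nproc prints the available CPU count
```

Once slurmd is installed, slurmd -C should also print the hardware it detects in slurm.conf format, which is a useful cross-check against a hand-written line.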
Finally, configure the firewall to allow communication between the master node and compute node n-1-17:
- firewall-cmd --permanent --zone=public --add-port=6817/tcp #slurmctld
- firewall-cmd --permanent --zone=public --add-port=6818/tcp #slurmd
- firewall-cmd --reload
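After the reload it is worth confirming that both ports are actually listed. A minimal sketch; ports_open is a hypothetical helper that only inspects a port list passed to it as a string.

```shell
# ports_open (hypothetical helper): succeed only if both Slurm ports appear
# in a space-separated port list such as the output of
# "firewall-cmd --zone=public --list-ports".
ports_open() {
    for port in 6817/tcp 6818/tcp; do
        case " $1 " in
            *" ${port} "*) ;;                         # present, keep checking
            *) echo "missing ${port}"; return 1 ;;    # report the first gap
        esac
    done
    echo "ok"
}

# Usage (as root, after the reload):
#   ports_open "$(firewall-cmd --zone=public --list-ports)"
```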
To see the current state of the nodes, run sinfo -lNe and you will see:
Wed May 27 09:49:54 2020
NODELIST  NODES  PARTITION  STATE    CPUS  S:C:T   MEMORY  TMP_DISK  WEIGHT  AVAIL_FE  REASON
gimel     1      gimel*     drained  24    4:6:1   1       0         1       (null)    none
n-1-17    1      gimel*     idle     24    24:1:1  1       0         1       (null)    none
n-5-34    1      gimel*     idle     80    80:1:1  1       0         1       (null)    none
n-5-35    1      gimel*     idle     80    80:1:1  1       0         1       (null)    none
To disable a specific node, do scontrol update NodeName=n-1-17 State=DRAIN Reason=DRAINED
To return it to service, do scontrol update NodeName=n-1-17 State=RESUME
p.s. Some users/scripts may require csh/tcsh.
sudo yum install csh tcsh
Node down after reboot
On gimel (master node)
sudo scontrol update NodeName=<node_name> State=RESUME
On GPUs
sudo nvidia-smi -c 3 (to wake up the GPUs and set them in exclusive mode)
sudo scontrol update NodeName=<node_name> State=RESUME
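When several nodes come back down after a reboot, the RESUME command above can be generated for each of them. The sketch below only prints the commands for review instead of running them; list_down_nodes and resume_down_nodes are hypothetical helper names.

```shell
# list_down_nodes (hypothetical helper): one node name per line for every
# node Slurm currently reports as down.
#   -h  no header, -N  one line per node, -t down  filter by state,
#   -o "%N"  print only the node name
list_down_nodes() {
    sinfo -h -N -t down -o "%N" | sort -u
}

# resume_down_nodes (hypothetical helper): print, but do not execute,
# one RESUME command per down node.
resume_down_nodes() {
    for node in $(list_down_nodes); do
        echo "sudo scontrol update NodeName=${node} State=RESUME"
    done
}

# Usage on gimel:
#   resume_down_nodes          # review the list
#   resume_down_nodes | sh     # then run it
```

Running sinfo -R first shows the reason Slurm recorded for each down or drained node, which helps to avoid resuming a node that was drained on purpose.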
Useful links:
https://support.ceci-hpc.be/doc/_contents/QuickStart/SubmittingJobs/SlurmTutorial.html
Migrating to gimel5
yum install rpm-build gcc python3 openssl openssl-devel pam-devel numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel rrdtool-devel ncurses-devel gtk2-devel libibmad libibumad perl-Switch perl-ExtUtils-MakeMaker xorg-x11-xauth http-parser-devel json-c-devel mysql-devel libssh2-devel man2html munge munge-devel munge-libs -y
wget https://download.schedmd.com/slurm/slurm-22.05.5.tar.bz2
export VER=22.05.5; rpmbuild -ta slurm-$VER.tar.bz2 --with mysql --with slurmrestd
cd /root/rpmbuild/RPMS/x86_64
yum install slurm-$VER*rpm slurm-devel-$VER*rpm slurm-perlapi-$VER*rpm slurm-torque-$VER*rpm slurm-example-configs-$VER*rpm slurm-slurmd-$VER*rpm slurm-libpmi-$VER*rpm slurm-slurmrestd-$VER*rpm -y
scp <user>@gimel2:/etc/munge/munge.key <user>@gimel2:/etc/slurm/slurm.conf /tmp/
mv /tmp/munge.key /etc/munge/
mv /tmp/slurm.conf /etc/slurm/
systemctl enable munge slurmd
systemctl start munge slurmd
systemctl status munge slurmd
Installation for a Backup Controller
Currently (04/05/2021): gimel4
- In gimel5's /etc/slurm/slurm.conf, find "BackupController="
- Set the value to gimel4
- Copy the conf file to gimel4
- yum install rpmbuild/RPMS/x86_64/slurm-20.02.4-1.el7.x86_64.rpm
- yum install rpmbuild/RPMS/x86_64/slurm-slurmd-20.02.4-1.el7.x86_64.rpm
- yum install rpmbuild/RPMS/x86_64/slurm-slurmctld-20.02.4-1.el7.x86_64.rpm
- systemctl enable slurmctld.service
- systemctl start slurmctld.service
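Since both controllers must agree on the configuration, it can help to print just the controller-related lines on each host and compare them. A sketch only: controller_lines is a hypothetical helper, and ControlMachine and SlurmctldHost are included on the assumption that, depending on the Slurm version, the controllers may be declared under those keys rather than BackupController.

```shell
# controller_lines (hypothetical helper): print the controller-related
# settings from a slurm.conf file given as $1.
controller_lines() {
    grep -E '^(ControlMachine|BackupController|SlurmctldHost)=' "$1"
}

# Usage on gimel5 and on gimel4 (the output should be identical):
#   controller_lines /etc/slurm/slurm.conf
```

With both slurmctld daemons running, scontrol ping reports whether the primary and the backup controller respond.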
GPUs specification
- 32-core:
+ n-9-34 (GTX 1080 Ti)
+ n-9-36 (GTX 1080 Ti)
+ n-1-126 (GTX 980)
+ n-1-141 (GTX 980)
- 40-core:
+ n-1-28 (RTX 2080 Super)
+ n-1-38 (RTX 2080 Super)
+ n-1-101 (RTX 2080 Super)
+ n-1-105 (RTX 2080 Super)
+ n-1-124 (RTX 2080 Super)
Log Rotation
If logrotate is not installed yet:
$ yum install logrotate
Here is a sample logrotate configuration. Make appropriate site modifications and save as /etc/logrotate.d/slurm on all nodes.
##
# Slurm Logrotate Configuration
##
/var/log/slurm-llnl/*log {
    compress
    missingok
    nocopytruncate
    nocreate
    nodelaycompress
    nomail
    notifempty
    noolddir
    rotate 5
    sharedscripts
    size=500M
    create 640 slurm root
    postrotate
        /etc/init.d/slurm reconfig
    endscript
}
Slurm Admin Notes
Add user on gimel2
sudo sacctmgr add user jji account=bks
Troubleshooting
Zero Bytes were transmitted or received
Error
slurm_load_partitions: Zero Bytes were transmitted or received
This could mean that the clock on the worker node is out of sync with the master node. To set the time on the worker node:
timedatectl set-time [HH:MM:ss]
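Before resetting the clock by hand, it is easy to measure how far apart the two machines are. A minimal sketch, assuming passwordless ssh from the master to the worker; clock_skew is a hypothetical helper name.

```shell
# clock_skew (hypothetical helper): absolute difference, in seconds,
# between two Unix timestamps given as $1 and $2.
clock_skew() {
    diff=$(( $1 - $2 ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( -diff ))
    fi
    echo "$diff"
}

# Usage on the master node (<node_name> is the worker to check):
#   clock_skew "$(date +%s)" "$(ssh <node_name> date +%s)"
```

If an NTP service such as chronyd is installed on the worker, timedatectl set-ntp true keeps the clock synchronized automatically, which avoids repeating the manual fix.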
Back to DOCK_3.7