Slurm
Latest revision as of 09:55, 17 September 2020

Slurm user-guide

Submit Jobs with Slurm

SBATCH-MR (beta)

sbatch-mr is a Slurm version of qsub-mr for submitting jobs to the Slurm queueing system. Note: it has not been extensively tested yet. Please contact me if the script is not working. We are hoping to fully migrate from the outdated SGE system to Slurm. Any error report would be helpful. - Khanh

The new Slurm scripts are located in /nfs/soft/tools/utils/sbatch-slice.

Simply replace /nfs/soft/tools/utils/qsub-slice/qsub-mr with /nfs/soft/tools/utils/sbatch-slice/sbatch-mr in your script.

To check the status of your job:

By username
 $ squeue -u <username>
By jobid
 $ squeue -j <job_id>
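Beyond the two basic forms above, squeue accepts standard filtering and formatting options. A sketch (the column format string is just one example):

```shell
# Show only your running (or pending) jobs
squeue -u $USER -t RUNNING
squeue -u $USER -t PENDING

# Customize the columns: job ID, partition, name, user, state, time, nodelist
squeue -u $USER -o "%.18i %.9P %.8j %.8u %.2t %.10M %R"
```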

Submit load2d Jobs

$ cd <catalog_shortname>
$ source /nfs/exa/work/khtang/ZINC21_load2d/loadenv_zinc21.sh
(development) $ sh /nfs/exa/work/khtang/submit_scripts/sbatch_slice/batch_zinc21.slurm <catalog_shortname>.ism

Submit DOCK Jobs

  • ANACONDA Installation (Python 2.7)

Each user is welcome to download Anaconda and install it into their own folder
https://www.anaconda.com/distribution/
wget https://repo.anaconda.com/archive/Anaconda2-2019.10-Linux-x86_64.sh
NB: It is also available for Python 3, to which we will migrate in the near future

Simple installation via /bin/sh Anaconda2-2019.10-Linux-x86_64.sh

After the installation is completed, you need to install a few packages:

conda install -c free bsddb
conda install -c rdkit rdkit
conda install numpy
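If you prefer to keep the base Anaconda installation clean, the same packages can also go into a dedicated conda environment. This is only a sketch; the environment name dock37 is an arbitrary example, not an official name:

```shell
# Create and activate a dedicated Python 2.7 environment (name is arbitrary)
conda create -n dock37 python=2.7
conda activate dock37    # or: source activate dock37 (older conda versions)

# Install the packages listed above into the environment
conda install -c free bsddb
conda install -c rdkit rdkit
conda install numpy
```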


Running DOCK-3.7 with Slurm

Here is a “guinea pig” project that has been run with DOCK-3.7 locally.
GPR40 example: /mnt/nfs/home/dudenko/TEST_DOCKING_PROJECT
ChEMBL ligands: /mnt/nfs/home/dudenko/CHEMBL4422_active_ligands
This test calculation should run smoothly; if it does not, there is a problem.
Finally, you can compare your results with the reference run:

  • CHEMBL4422_active_ligands.mol2 - TOP500 scoring poses
  • extract_all.sort.uniq.txt - a print-out of scoring details


The Slurm queue manager is installed locally on gimel; use it to run this test (and all your future jobs) in parallel.
Do not forget to set the DOCKBASE variable: export DOCKBASE=/nfs/soft/dock/versions/dock37/DOCK-3.7.3rc1/

Useful docking commands to remember:

$DOCKBASE/docking/setup/setup_db2_zinc15_file_number.py ./ CHEMBL4422_active_ligands_ CHEMBL4422_active_ligands.sdi 100 count
$DOCKBASE/analysis/extract_all.py -s -20
$DOCKBASE/analysis/getposes.py -l 500 -o CHEMBL4422_active_ligands.mol2


Useful slurm commands (see https://slurm.schedmd.com/quickstart.html):

To see what machine resources the cluster offers, run sinfo -lNe
To submit a DOCK-3.7 job, run $DOCKBASE/docking/submit/submit_slurm_array.csh
To see what is happening in the queue, run squeue
To see detailed info for a specific job, run scontrol show jobid=_JOBID_
To delete a job from the queue, run scancel _JOBID_
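submit_slurm_array.csh takes care of submission for DOCK-3.7, but for reference, a minimal Slurm array job script looks like the following. All names and values here are illustrative, not the contents of the DOCK submit script:

```shell
#!/bin/bash
#SBATCH --job-name=array_job
#SBATCH --partition=gimel            # pick a partition listed by sinfo -lNe
#SBATCH --array=1-100                # run 100 independent tasks
#SBATCH --output=slurm-%A_%a.out     # %A = array job ID, %a = task index

# Each task sees its own index in SLURM_ARRAY_TASK_ID
echo "task $SLURM_ARRAY_TASK_ID running on $(hostname)"
```

Submit it with sbatch, and each array index shows up as a separate entry in squeue.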

If Slurm is running correctly, type squeue and you should see something like this:

            JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  4187_[637-2091]     gimel array_jo  dudenko PD       0:00      1 (Resources)
         4187_629     gimel array_jo  dudenko  R       0:00      1 n-1-20
         4187_630     gimel array_jo  dudenko  R       0:00      1 n-5-34
         4187_631     gimel array_jo  dudenko  R       0:00      1 n-1-21
         4187_632     gimel array_jo  dudenko  R       0:00      1 n-5-34
         4187_633     gimel array_jo  dudenko  R       0:00      1 n-1-21
         4187_634     gimel array_jo  dudenko  R       0:00      1 n-5-34
         4187_635     gimel array_jo  dudenko  R       0:00      1 n-5-35
         4187_636     gimel array_jo  dudenko  R       0:00      1 n-5-34
         4187_622     gimel array_jo  dudenko  R       0:01      1 n-5-34
         4187_623     gimel array_jo  dudenko  R       0:01      1 n-5-34
         4187_624     gimel array_jo  dudenko  R       0:01      1 n-5-35
         4187_625     gimel array_jo  dudenko  R       0:01      1 n-1-17


As root on gimel, it is possible to modify a particular job, e.g., scontrol update jobid=635 TimeLimit=7-00:00:00

Slurm Installation Guide

Detailed step-by-step installation instruction (for sysadmins only)

Setup Master Node

TBA

Setup Compute Nodes

Useful links:

https://slurm.schedmd.com/quickstart_admin.html
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation

node n-1-17 (for installing the latest Slurm version (20.02.4), see the "Migrating to gimel5" section below)

  • make sure the node runs CentOS 7: cat /etc/redhat-release
  • wget https://download.schedmd.com/slurm/slurm-17.02.11.tar.bz2
  • yum install readline-devel perl-ExtUtils-MakeMaker.noarch munge-devel pam-devel openssl-devel
  • export VER=17.02.11; rpmbuild -ta slurm-$VER.tar.bz2 --without mysql; mv /root/rpmbuild .

installing built packages from rpmbuild:

  • yum install rpmbuild/RPMS/x86_64/slurm-plugins-17.02.11-1.el7.x86_64.rpm
  • yum install rpmbuild/RPMS/x86_64/slurm-17.02.11-1.el7.x86_64.rpm
  • yum install rpmbuild/RPMS/x86_64/slurm-munge-17.02.11-1.el7.x86_64.rpm

setting up munge: copy /etc/munge/munge.key from gimel and place it locally in /etc/munge. The key must be identical on all nodes.
Munge is a daemon responsible for securely authenticating communication between nodes.
Set permissions accordingly: chown munge:munge /etc/munge/munge.key; chmod 400 /etc/munge/munge.key

starting munge: systemctl enable munge; systemctl start munge
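Before moving on, it is worth verifying munge with its standard self-test; the cross-node check assumes ssh access to the new node:

```shell
# Local check: encode and decode a credential on the same node
munge -n | unmunge            # should report STATUS: Success (0)

# Cross-node check: a credential created here must decode on n-1-17
munge -n | ssh n-1-17 unmunge
```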

setting up slurm:

  • create a user slurm: adduser slurm.
  • the UID/GID of the slurm user must be identical on all nodes.
 Otherwise, one needs to specify a mapping scheme for translating UIDs/GIDs between nodes.
To edit the slurm UID/GID, run vipw and replace the "slurm" line with slurm:x:XXXXX:YYYYY::/nonexistent:/bin/false
XXXXX and YYYYY for the slurm user can be found on gimel in /etc/passwd
NB: don't forget to edit /etc/group as well.
  • copy /etc/slurm/slurm.conf from gimel and put locally to /etc/slurm.
  • figure out what CPU/Memory resources you have at n-1-17 (see /proc/cpuinfo) and append the following line:
 NodeName=n-1-17 NodeAddr=10.20.1.17 CPUs=24 State=UNKNOWN
  • append n-1-17 to the partition: PartitionName=gimel Nodes=gimel,n-5-34,n-5-35,n-1-17 Default=YES MaxTime=INFINITE State=UP
  • create the following folders:
 mkdir -p /var/spool/slurm-llnl /var/run/slurm-llnl /var/log/slurm-llnl
 chown -R slurm:slurm /var/spool/slurm-llnl /var/run/slurm-llnl /var/log/slurm-llnl
  • restarting slurm master node at gimel (Centos 6): /etc/init.d/slurm restart
  • enabling and starting slurm computing nodes (Centos 7): systemctl enable slurmd; systemctl start slurmd
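Instead of reading /proc/cpuinfo by hand, slurmd can print a ready-made NodeName line for the local machine, which you can paste into slurm.conf (adjusting NodeAddr as needed):

```shell
# Print this node's detected hardware as a slurm.conf NodeName line
slurmd -C
```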

And last but not least, ask the firewall to allow communication between the master node and compute node n-1-17:

  • firewall-cmd --permanent --zone=public --add-port=6817/tcp #slurmctld
  • firewall-cmd --permanent --zone=public --add-port=6818/tcp #slurmd
  • firewall-cmd --reload
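A quick way to confirm the firewall change took effect (a sketch; exact output will vary):

```shell
# The permanent rules should now include both Slurm ports
firewall-cmd --zone=public --list-ports    # expect 6817/tcp 6818/tcp

# And slurmd should be listening on its port
ss -tlnp | grep 6818
```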


To see the current state of the queue, run sinfo -lNe and you will see:

Wed May 27 09:49:54 2020
NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON              
gimel          1    gimel*     drained   24    4:6:1      1        0      1   (null) none                
n-1-17         1    gimel*        idle   24   24:1:1      1        0      1   (null) none                
n-5-34         1    gimel*        idle   80   80:1:1      1        0      1   (null) none                
n-5-35         1    gimel*        idle   80   80:1:1      1        0      1   (null) none


To disable a specific node, run scontrol update NodeName=n-1-17 State=DRAIN Reason=DRAINED
To return it to service, run scontrol update NodeName=n-1-17 State=IDLE


p.s. Some users/scripts may require csh/tcsh.
sudo yum install csh tcsh

Node down after reboot

On gimel (master node):

sudo scontrol update NodeName=<node_name> State=RESUME
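Before resuming, sinfo can show why the node went down in the first place:

```shell
# List down/drained nodes together with the recorded reason
sinfo -R

# Full state of a single node (State=, Reason=, resources)
scontrol show node <node_name>
```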


Useful links:

https://support.ceci-hpc.be/doc/_contents/QuickStart/SubmittingJobs/SlurmTutorial.html


Migrating to gimel5

  • make sure the node runs CentOS 7: cat /etc/redhat-release
  • wget https://download.schedmd.com/slurm/slurm-20.02.4.tar.bz2
  • yum install rpm-build gcc openssl openssl-devel libssh2-devel pam-devel numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel rrdtool-devel ncurses-devel gtk2-devel libssh2-devel libibmad libibumad perl-Switch perl-ExtUtils-MakeMaker mysql-devel
  • export VER=20.02.4; rpmbuild -ta slurm-$VER.tar.bz2 --with mysql; mv /root/rpmbuild .

Installation on a compute node:

  • yum install rpmbuild/RPMS/x86_64/slurm-20.02.4-1.el7.x86_64.rpm
  • yum install rpmbuild/RPMS/x86_64/slurm-slurmd-20.02.4-1.el7.x86_64.rpm
  • systemctl enable slurmd
  • systemctl start slurmd

GPU specification

  • 32-core:
     n-9-34 (GTX 1080 Ti)
     n-9-36 (GTX 1080 Ti)
     n-1-126 (GTX 980)
     n-1-141 (GTX 980)
  • 40-core:
     n-1-28 (RTX 2080 Super)
     n-1-38 (RTX 2080 Super)
     n-1-101 (RTX 2080 Super)
     n-1-105 (RTX 2080 Super)
     n-1-124 (RTX 2080 Super)

Back to DOCK_3.7