
Slurm userguide


Useful libraries and utilities on master node (gimel)


ANACONDA Installation (Python 2.7)

Each user is welcome to download Anaconda and install it into their own folder
https://www.anaconda.com/distribution/
wget https://repo.anaconda.com/archive/Anaconda2-2019.10-Linux-x86_64.sh
NB: It is also available for Python 3, to which we will be moving in the near future

Installation is simple: /bin/sh Anaconda2-2019.10-Linux-x86_64.sh

You may need to install a few packages:

conda install -c free bsddb
conda install -c rdkit rdkit
conda install numpy
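
If you prefer a fully unattended setup, the whole procedure can be scripted roughly as below. This is a sketch, not the official recipe: the installer's -b (batch) and -p (prefix) flags and the $HOME/anaconda2 install location are assumptions you may want to adjust.

# download the installer and run it non-interactively into your own folder
wget https://repo.anaconda.com/archive/Anaconda2-2019.10-Linux-x86_64.sh
sh Anaconda2-2019.10-Linux-x86_64.sh -b -p $HOME/anaconda2
# put this Anaconda first on your PATH for future shells
echo 'export PATH=$HOME/anaconda2/bin:$PATH' >> ~/.bashrc
source ~/.bashrc
# packages typically needed by the DOCK-3.7 scripts
conda install -c free bsddb
conda install -c rdkit rdkit
conda install numpy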


sinfo -lNe
scancel
squeue
export DOCKBASE=/home/docker/DOCK-3.7.3rc1
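
Besides the DOCK-specific submit script described below, ordinary work can be submitted to the same queue with sbatch. The following is a generic, minimal sketch (the job name, resource numbers and the script body are placeholders; the gimel partition name is taken from the cluster configuration further down this page):

#!/bin/bash
# minimal Slurm batch script; adjust the values to your job
#SBATCH --job-name=test_job
#SBATCH --partition=gimel
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
# the actual work goes here
hostname

Save it as, e.g., test_job.sh, submit it with sbatch test_job.sh, and monitor it with squeue / scancel as described below.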


Running DOCK-3.7 with Slurm

Here is a “guinea pig project”, which has been run with DOCK-3.7 locally.
GPR40 example: /home/docker/Desktop/EXAMPLE_PROJECT_GPR40/TEST
This test calculation should run smoothly; if it does not, then there is a problem.
One needs to copy over this test folder and try to reproduce CHEMBL4422_active_ligands.sdi

A Slurm queue is installed locally; use it to run this test (and all your future jobs) in parallel. Do not forget to set DOCKBASE: export DOCKBASE=/home/docker/DOCK-3.7.3rc1

Useful commands to remember:

# copy the project folder over to the cluster
scp -C -r -P 3333 -i ~/.ssh/id_rsa_SHUO ProjectX_FOLDER shuo@pdl-station.spdns.org:~/
# build the .sdi database index used for docking
$DOCKBASE/docking/setup/setup_db2_zinc15_file_number.py ./ CHEMBL4422_active_ligands_ CHEMBL4422_active_ligands.sdi 100 count
# analysis: collect the scores, extract the top poses, and compute/plot enrichment
$DOCKBASE/analysis/extract_all.py -s -10
$DOCKBASE/analysis/getposes.py -l 500 -o CHEMBL4422_active_ligands.mol2
$DOCKBASE/analysis/enrich.py -i . -l ligands.names.txt -d decoys.names.txt
$DOCKBASE/analysis/plots.py -i . -l ligands.names.txt -d decoys.names.txt

Useful Slurm commands (see https://slurm.schedmd.com/quickstart.html):

To see what machine resources are offered by the cluster, do “sinfo -lNe”.
To submit a DOCK-3.7 job, run $DOCKBASE/docking/submit/submit_slurm_array.csh
To see what is happening in the queue, run “squeue”.
To delete a job from the queue, run “scancel _JOBID_”.
Should your Slurm run correctly, type “squeue” and you should see something like this:

# BASH command line output to console
            JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
      217_[9-100] pdl-stati array_jo   docker PD       0:00      1 (Resources)
            217_8 pdl-stati array_jo   docker  R       0:00      1 pdl-station
            217_5 pdl-stati array_jo   docker  R       0:08      1 pdl-station

To delete this whole job from the queue, run “scancel 217”.
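
scancel also accepts a single array-task ID, so one task can be cancelled without killing the rest of the array (217 is just the job ID from the example output above):

# cancel only array task 8 of job 217; the other tasks keep running
scancel 217_8
# cancel the entire job array
scancel 217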




To inspect a job in detail (here job 635): scontrol show jobid=635

To extend its time limit, as root at gimel: scontrol update jobid=635 TimeLimit=7-00:00:00
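
To confirm that the new limit took effect (assuming the job is still pending or running), filtering the scontrol output is enough:

# TimeLimit is printed on the same line as RunTime
scontrol show jobid=635 | grep TimeLimit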


Detailed step-by-step installation instructions


Useful link: https://slurm.schedmd.com/quickstart_admin.html


On the new compute node n-1-17:

  • make sure CentOS 7 is installed there: cat /etc/redhat-release
  • wget https://download.schedmd.com/slurm/slurm-17.02.11.tar.bz2
  • yum install readline-devel perl-ExtUtils-MakeMaker.noarch munge-devel pam-devel
  • export VER=17.02.11; rpmbuild -ta slurm-$VER.tar.bz2 --without mysql; mv /root/rpmbuild .

Installing the built packages from rpmbuild:

  • yum install rpmbuild/RPMS/x86_64/slurm-plugins-17.02.11-1.el7.x86_64.rpm
  • yum install rpmbuild/RPMS/x86_64/slurm-17.02.11-1.el7.x86_64.rpm
  • yum install rpmbuild/RPMS/x86_64/slurm-munge-17.02.11-1.el7.x86_64.rpm
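
To double-check that all three packages actually landed on the node, querying the RPM database is enough:

# should list slurm, slurm-plugins and slurm-munge at 17.02.11-1.el7
rpm -qa | grep -i slurm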


setting up munge: copy over /etc/munge/munge.key from gimel and put it locally in /etc/munge. The key must be identical across all the nodes.
Munge is the daemon responsible for authenticating communication between the nodes.
Set permissions accordingly: chown munge:munge /etc/munge/munge.key; chmod 400 /etc/munge/munge.key

starting munge: systemctl enable munge; systemctl start munge
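
Before moving on, it is worth verifying that munge credentials are accepted both locally and across the network; the standard test (assuming password-less ssh between the nodes) is:

# local check: create and decode a credential on the node itself
munge -n | unmunge
# cross-node check: a credential created on n-1-17 must be accepted by gimel
munge -n | ssh gimel unmunge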

setting up slurm:

  • create a user slurm: adduser slurm.
  • the UID and GID of the slurm user should be identical across all the nodes.
 Otherwise, one needs to specify a mapping scheme for translating the UID/GID between nodes.
To edit the slurm UID/GID, do vipw and replace the "slurm line" with slurm:x:XXXXX:YYYYY::/nonexistent:/bin/false
XXXXX and YYYYY for the slurm user can be found at gimel in /etc/passwd
NB: don't forget to edit /etc/group as well.
  • copy /etc/slurm/slurm.conf from gimel and put locally to /etc/slurm.
  • figure out what CPU/memory resources you have at n-1-17 (see /proc/cpuinfo, or the slurmd -C sketch after this list) and append the following line:
 NodeName=n-1-17 NodeAddr=10.20.1.17 CPUs=24 State=UNKNOWN
  • append n-1-17 to the partition: PartitionName=gimel Nodes=gimel,n-5-34,n-5-35,n-1-17 Default=YES MaxTime=INFINITE State=UP
  • create the following folders:
 mkdir -p /var/spool/slurm-llnl /var/run/slurm-llnl /var/log/slurm-llnl
 chown -R slurm:slurm /var/spool/slurm-llnl /var/run/slurm-llnl /var/log/slurm-llnl
  • restarting the Slurm master daemon at gimel (CentOS 6): /etc/init.d/slurm restart
  • restarting slurmd on the compute nodes (CentOS 7): systemctl restart slurmd
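
Two quick checks once slurmd is up, sketched below: slurmd -C, run on the compute node, prints the hardware line that slurmd itself detects (handy for the NodeName entry above), and sinfo/scontrol on the master confirm that the node has joined the partition.

# on n-1-17: print the NodeName=... CPUs=... line as detected by slurmd
slurmd -C
# on gimel: the new node should show up as idle in the partition
sinfo -lNe
scontrol show node n-1-17
# run a trivial job pinned to the new node to confirm scheduling works
srun -w n-1-17 hostname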

And last but not least, ask the firewall to allow communication between the master node and the compute node n-1-17:

  • firewall-cmd --permanent --zone=public --add-port=6818/tcp
  • firewall-cmd --reload
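
To verify that the port is actually open after the reload (6818/tcp is slurmd's default port):

firewall-cmd --zone=public --list-ports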

To disable a specific node, do scontrol update NodeName=n-1-17 State=DRAIN Reason=DRAINED
To return it to service, do scontrol update NodeName=n-1-17 State=IDLE

To see the current state of the queue, do sinfo -lNe and you will see:

Wed May 27 09:49:54 2020
NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON              
gimel          1    gimel*     drained   24    4:6:1      1        0      1   (null) none                
n-1-17         1    gimel*        idle   24   24:1:1      1        0      1   (null) none                
n-5-34         1    gimel*        idle   80   80:1:1      1        0      1   (null) none                
n-5-35         1    gimel*        idle   80   80:1:1      1        0      1   (null) none


p.s. Some users/scripts may require csh/tcsh.
sudo yum install csh tcsh


Back to DOCK_3.7