AWS Auto Scaling

2020-06-02T13:57:23Z

Dudenko:

AWS Auto Scaling

2020-06-02T13:51:51Z

Dudenko:

AWS Auto Scaling

2020-06-02T13:46:27Z

Dudenko:

AWS Auto Scaling

2020-06-02T13:42:54Z

Dudenko:

2020-05-28T14:46:10Z

Dudenko:

'''Slurm userguide'''

'''Useful libraries and utilities on master node (gimel)'''

* Anaconda installation (Python 2.7)

Each user is welcome to download Anaconda and install it into their own directory: 
https://www.anaconda.com/distribution/ 
''wget https://repo.anaconda.com/archive/Anaconda2-2019.10-Linux-x86_64.sh'' 
NB: Anaconda is also available for Python 3, which we expect to move to in the near future. 

Install it with ''/bin/sh Anaconda2-2019.10-Linux-x86_64.sh''

You may need to install a few packages:
* ''conda install -c free bsddb''
* ''conda install -c rdkit rdkit''
* ''conda install numpy''

'''Running DOCK-3.7 with Slurm'''

Here is a “guinea pig” project that has already been run with DOCK-3.7 locally. 
GPR40 example: /mnt/nfs/home/dudenko/TEST_DOCKING_PROJECT 
ChEMBL ligands: /mnt/nfs/home/dudenko/CHEMBL4422_active_ligands

This test calculation should run without errors; if it does not, something is wrong with the setup.

Slurm is installed locally. Use it to run this test (and all your future jobs) in parallel.
Do not forget to set the DOCKBASE variable: ''export DOCKBASE=/nfs/soft/dock/versions/dock37/DOCK-3.7.3rc1/''
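
All the DOCK scripts on this page are addressed through $DOCKBASE, so it can help to check the variable before submitting anything. A minimal sketch, assuming a POSIX shell; the function name check_dockbase is hypothetical and not part of DOCK:

```shell
# Hypothetical helper (not part of DOCK): verify DOCKBASE before submitting jobs.
check_dockbase() {
    # Fail if the variable is unset or empty.
    if [ -z "${DOCKBASE:-}" ]; then
        echo "DOCKBASE is not set" >&2
        return 1
    fi
    # Fail if it does not point to an existing directory.
    if [ ! -d "$DOCKBASE" ]; then
        echo "DOCKBASE does not point to a directory: $DOCKBASE" >&2
        return 1
    fi
    echo "DOCKBASE OK: $DOCKBASE"
}
```

It could be used as ''check_dockbase && $DOCKBASE/docking/submit/submit_slurm_array.csh''.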

'''Useful commands to remember:'''
$DOCKBASE/docking/setup/setup_db2_zinc15_file_number.py ./ CHEMBL4422_active_ligands_ CHEMBL4422_active_ligands.sdi 100 count
$DOCKBASE/analysis/extract_all.py -s -10
$DOCKBASE/analysis/getposes.py -l 500 -o CHEMBL4422_active_ligands.mol2

'''Useful Slurm commands (see https://slurm.schedmd.com/quickstart.html):'''
* to see what resources the cluster offers, run ''sinfo -lNe''
* to submit a DOCK-3.7 job, run ''$DOCKBASE/docking/submit/submit_slurm_array.csh''
* to see what is happening in the queue, run ''squeue''
* to see detailed information for a specific job, run ''scontrol show jobid=_JOBID_''
* to delete a job from the queue, run ''scancel _JOBID_''

If Slurm is running correctly, ''squeue'' should print something like this:

#### BASH command line output to console
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
217_[9-100] pdl-stati array_jo docker PD 0:00 1 (Resources)
217_8 pdl-stati array_jo docker R 0:00 1 pdl-station
217_5 pdl-stati array_jo docker R 0:08 1 pdl-station
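
For a long queue, a per-state summary is often easier to read than the full listing. The sketch below counts lines per job state from default ''squeue'' output read on stdin. It assumes the column layout shown above (state in column 5) and job names without spaces; the function name count_job_states is hypothetical.

```shell
# Hypothetical helper: count squeue lines per job state (PD, R, ...).
# Reads default squeue output on stdin; assumes the state is in column 5
# and that job names contain no spaces.
count_job_states() {
    # Skip the header line, tally column 5, then sort for stable output.
    awk 'NR > 1 && NF >= 5 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}
```

Usage: ''squeue | count_job_states''. Note that the pending part of a job array appears as a single line (e.g. 217_[9-100]), so it is counted once.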

As root on gimel, you can modify a particular job, e.g. set its time limit to 7 days: ''scontrol update jobid=635 TimeLimit=7-00:00:00''

'''Detailed step-by-step installation instructions'''

Useful link: https://slurm.schedmd.com/quickstart_admin.html

'''node n-1-17'''

* make sure the node runs CentOS 7: ''cat /etc/redhat-release''
* ''wget https://download.schedmd.com/slurm/slurm-17.02.11.tar.bz2''
* ''yum install readline-devel perl-ExtUtils-MakeMaker.noarch munge-devel pam-devel''
* ''export VER=17.02.11; rpmbuild -ta slurm-$VER.tar.bz2 --without mysql; mv /root/rpmbuild .''

Install the packages built by rpmbuild:
* ''yum install rpmbuild/RPMS/x86_64/slurm-plugins-17.02.11-1.el7.x86_64.rpm''
* ''yum install rpmbuild/RPMS/x86_64/slurm-17.02.11-1.el7.x86_64.rpm''
* ''yum install rpmbuild/RPMS/x86_64/slurm-munge-17.02.11-1.el7.x86_64.rpm''

'''setting up munge''':
Munge is the authentication service that Slurm uses to validate credentials exchanged between nodes. 
Copy /etc/munge/munge.key from gimel into the local /etc/munge. The key must be identical on all nodes. 
Set ownership and permissions accordingly: ''chown munge:munge /etc/munge/munge.key; chmod 400 /etc/munge/munge.key'' 

'''starting munge''': ''systemctl enable munge; systemctl start munge''
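
Because the key must be identical on every node, it is safer to compare digests than to print the key itself. A minimal sketch using sha256sum from coreutils; the helper names key_digest and same_digest are hypothetical:

```shell
# Hypothetical helper: print only the SHA-256 digest of a file.
key_digest() {
    sha256sum "$1" | awk '{ print $1 }'
}

# Hypothetical helper: succeed if two digests are equal, fail otherwise.
same_digest() {
    if [ "$1" = "$2" ]; then
        echo "keys match"
    else
        echo "keys differ" >&2
        return 1
    fi
}
```

Assuming root can ssh from the node to gimel, a check could look like ''same_digest "$(key_digest /etc/munge/munge.key)" "$(ssh gimel sha256sum /etc/munge/munge.key | awk '{ print $1 }')"''. Once munge is running, ''munge -n | unmunge'' tests a credential locally, and ''munge -n | ssh gimel unmunge'' tests it across nodes.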

'''setting up slurm''':
* create a user slurm: ''adduser slurm''
* the UID and GID of the slurm user must be identical on all nodes. 
Otherwise, one needs to specify a mapping scheme for translating UIDs/GIDs between nodes. 
To edit the slurm UID/GID, run ''vipw'' and replace the slurm line with slurm:x:XXXXX:YYYYY::/nonexistent:/bin/false 
The values XXXXX (UID) and YYYYY (GID) for the slurm user can be found on gimel in /etc/passwd. 
NB: don't forget to edit /etc/group as well. 
* copy /etc/slurm/slurm.conf from gimel into the local /etc/slurm.
* find out what CPU/memory resources n-1-17 has (see /proc/cpuinfo) and append the following line to slurm.conf:
NodeName=n-1-17 NodeAddr=10.20.1.17 CPUs=24 State=UNKNOWN
* append n-1-17 to the partition: PartitionName=gimel Nodes=gimel,n-5-34,n-5-35,n-1-17 Default=YES MaxTime=INFINITE State=UP
* create the following directories:
''mkdir -p /var/spool/slurm-llnl /var/run/slurm-llnl /var/log/slurm-llnl''
''chown -R slurm:slurm /var/spool/slurm-llnl /var/run/slurm-llnl /var/log/slurm-llnl''
* restart the Slurm master node on gimel (CentOS 6): ''/etc/init.d/slurm restart''
* restart the Slurm compute nodes (CentOS 7): ''systemctl restart slurmd''
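
The requirement that the slurm user has the same UID and GID on every node can be checked by extracting both numbers from passwd-format text and comparing them between nodes. A minimal sketch; the function name passwd_ids is hypothetical:

```shell
# Hypothetical helper: print "UID:GID" for the given user.
# Reads passwd-format lines on stdin; exits non-zero if the user is absent.
passwd_ids() {
    awk -F: -v user="$1" '
        $1 == user { print $3 ":" $4; found = 1 }
        END { exit !found }
    '
}
```

On each node, ''getent passwd slurm | passwd_ids slurm'' prints the pair; the output on n-1-17 should be the same as on gimel.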

Finally, configure the firewall to allow communication between the master node and compute node n-1-17 (6818/tcp is the default slurmd port):
* ''firewall-cmd --permanent --zone=public --add-port=6818/tcp''
* ''firewall-cmd --reload''

To disable a specific node, run ''scontrol update NodeName=n-1-17 State=DRAIN Reason=DRAINED''
To return it to service, run ''scontrol update NodeName=n-1-17 State=IDLE''

To see the current state of the nodes, run ''sinfo -lNe''; the output looks like this:
Wed May 27 09:49:54 2020
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
gimel 1 gimel* drained 24 4:6:1 1 0 1 (null) none
n-1-17 1 gimel* idle 24 24:1:1 1 0 1 (null) none
n-5-34 1 gimel* idle 80 80:1:1 1 0 1 (null) none
n-5-35 1 gimel* idle 80 80:1:1 1 0 1 (null) none
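
To spot nodes that are out of service without reading the whole table, the listing can be filtered on the STATE column. The sketch below assumes the ''sinfo -lNe'' layout shown above (node name in column 1, state in column 4); the function name nodes_out_of_service is hypothetical.

```shell
# Hypothetical helper: list nodes whose state starts with "drain" or "down".
# Reads `sinfo -lNe` output on stdin; assumes node name in column 1
# and state in column 4, as in the listing above.
nodes_out_of_service() {
    # NF >= 11 skips the date line; the header never matches the state pattern.
    awk 'NF >= 11 && $4 ~ /^(drain|down)/ { print $1, $4 }'
}
```

Usage: ''sinfo -lNe | nodes_out_of_service''. For the listing above this reports gimel as drained.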

P.S. Some users and scripts may require csh/tcsh: 
''sudo yum install csh tcsh''

Back to [[DOCK_3.7]]

Slurm

2020-05-28T14:45:08Z

Dudenko:
