Slurm

Detailed step-by-step instructions:

node n-1-17

  • make sure the node runs CentOS 7: cat /etc/redhat-release
  • wget https://download.schedmd.com/slurm/slurm-17.02.11.tar.bz2
  • yum install readline-devel perl-ExtUtils-MakeMaker.noarch munge-devel pam-devel
  • export VER=17.02.11; rpmbuild -ta slurm-$VER.tar.bz2 --without mysql; mv /root/rpmbuild .
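
Before installing, it is worth checking that the build actually produced the packages used in the next step (a minimal sanity check; the exact set of RPMs depends on the build options):

    # list the binary RPMs produced by rpmbuild
    ls rpmbuild/RPMS/x86_64/
    # you should see at least the slurm-, slurm-munge- and slurm-plugins-*.rpm packages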

installing built packages from rpmbuild:

  • yum install rpmbuild/RPMS/x86_64/slurm-plugins-17.02.11-1.el7.x86_64.rpm
  • yum install rpmbuild/RPMS/x86_64/slurm-17.02.11-1.el7.x86_64.rpm
  • yum install rpmbuild/RPMS/x86_64/slurm-munge-17.02.11-1.el7.x86_64.rpm
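
A quick way to confirm the installation went through (a sketch; rpm is the standard query tool on CentOS):

    # list the slurm packages known to the rpm database
    rpm -qa | grep slurm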


setting up munge: copy /etc/munge/munge.key over from gimel and place it locally in /etc/munge. The key must be identical across all the nodes.
Munge is the daemon that authenticates communication between the nodes.
Set permissions accordingly: chown munge:munge /etc/munge/munge.key; chmod 400 /etc/munge/munge.key
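
One way to verify the key really is identical (a sketch, assuming root ssh access from gimel to the node):

    # compare the key checksums on gimel and on n-1-17
    md5sum /etc/munge/munge.key
    ssh n-1-17 md5sum /etc/munge/munge.key
    # the two sums must match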

starting munge: systemctl enable munge; systemctl start munge
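
Once munge is running on both ends, a credential minted on one node should decode on the other (the standard munge self-test; assumes munge is already running on n-1-17):

    munge -n | unmunge                # local round-trip
    munge -n | ssh n-1-17 unmunge     # cross-node test: should report STATUS: Success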

setting up slurm:

  • create a user slurm: adduser slurm.
  • the UID/GID of the slurm user should be identical across all the nodes (see the checks after this list).
    Otherwise, one needs to specify a mapping scheme for translating each UID/GID between nodes.
    To edit the slurm UID/GID, run vipw and replace the slurm line with slurm:x:XXXXX:YYYYY::/nonexistent:/bin/false
    XXXXX and YYYYY for the slurm user can be found on gimel in /etc/passwd.
    NB: don't forget to edit /etc/group as well.
  • copy /etc/slurm/slurm.conf from gimel and place it locally in /etc/slurm.
  • figure out what CPU/memory resources you have on n-1-17 (see /proc/cpuinfo, or the slurmd -C check after this list) and append the following line:
    NodeName=n-1-17 NodeAddr=10.20.1.17 CPUs=24 State=UNKNOWN
  • append n-1-17 to the partition: PartitionName=gimel Nodes=gimel,n-5-34,n-5-35,n-1-17 Default=YES MaxTime=INFINITE State=UP
  • create the following folders:
    mkdir -p /var/spool/slurm-llnl /var/run/slurm-llnl /var/log/slurm-llnl
    chown -R slurm:slurm /var/spool/slurm-llnl /var/run/slurm-llnl /var/log/slurm-llnl
  • restart the slurm master node on gimel (CentOS 6): /etc/init.d/slurm restart
  • restart slurm on the compute nodes (CentOS 7): systemctl restart slurmd
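
A few checks for the list above (a sketch; slurmd -C and scontrol are standard Slurm tools, but their exact output depends on the build):

    # the UID/GID of the slurm user must agree between gimel and the new node
    id slurm
    ssh n-1-17 id slurm
    # print the hardware slurmd detects on n-1-17, already in slurm.conf
    # NodeName=... syntax - handy for filling in the NodeName line above
    ssh n-1-17 slurmd -C
    # after the restart, the master should know about the new node
    scontrol show node n-1-17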

And last but not least, ask the firewall to allow communication between the master node and compute node n-1-17:

  • firewall-cmd --permanent --zone=public --add-port=6818/tcp
  • firewall-cmd --reload
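
6818/tcp is the default port slurmd listens on (SlurmdPort in slurm.conf). To double-check that the rule took effect (a minimal check, assuming the public zone as above):

    firewall-cmd --zone=public --list-ports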

To disable a specific node, do scontrol update NodeName=n-1-17 State=DRAIN Reason=DRAINED. To return it to service, do scontrol update NodeName=n-1-17 State=IDLE.

To see the current state of the nodes in the queue, run sinfo -lNe and you will see:

Wed May 27 09:49:54 2020
NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON              
gimel          1    gimel*     drained   24    4:6:1      1        0      1   (null) none                
n-1-17         1    gimel*        idle   24   24:1:1      1        0      1   (null) none                
n-5-34         1    gimel*        idle   80   80:1:1      1        0      1   (null) none                
n-5-35         1    gimel*        idle   80   80:1:1      1        0      1   (null) none
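
As a final end-to-end test (a sketch; assumes the gimel partition is the default, as configured above), a trivial job pinned to the new node should run and print its hostname:

    srun -w n-1-17 hostname    # should print: n-1-17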