Slurm
Revision as of 16:55, 27 May 2020
Detailed step-by-step instructions:
node n-1-17
- make sure CentOS 7 is installed there: cat /etc/redhat-release
- wget https://download.schedmd.com/slurm/slurm-17.02.11.tar.bz2
- yum install readline-devel perl-ExtUtils-MakeMaker.noarch munge-devel pam-devel
- export VER=17.02.11; rpmbuild -ta slurm-$VER.tar.bz2 --without mysql; mv /root/rpmbuild .
installing built packages from rpmbuild:
- yum install rpmbuild/RPMS/x86_64/slurm-plugins-17.02.11-1.el7.x86_64.rpm
- yum install rpmbuild/RPMS/x86_64/slurm-17.02.11-1.el7.x86_64.rpm
- yum install rpmbuild/RPMS/x86_64/slurm-munge-17.02.11-1.el7.x86_64.rpm
setting up munge:
copy /etc/munge/munge.key from gimel into the local /etc/munge directory. The key must be identical on all nodes.
Munge is the daemon responsible for secure authentication between nodes.
Set permissions accordingly: chown munge:munge /etc/munge/munge.key; chmod 400 /etc/munge/munge.key
starting munge: systemctl enable munge; systemctl start munge
setting up slurm:
- create a user slurm: adduser slurm.
- the slurm user's UID/GID must be identical on all nodes.
Otherwise, one needs a mapping scheme to translate UIDs/GIDs between nodes.
To edit the slurm UID/GID, run vipw and replace the slurm line with slurm:x:XXXXX:YYYYY::/nonexistent:/bin/false
XXXXX and YYYYY for the slurm user can be found on gimel in /etc/passwd
NB: don't forget to edit /etc/group as well.
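As a quick consistency check, the UID/GID pair can be extracted from a passwd-format line and compared across nodes. A minimal sketch — the passwd_uid_gid helper and the sample slurm entry are made up for illustration; the real values come from gimel's /etc/passwd:

```shell
#!/bin/sh
# Hypothetical helper: print "UID GID" from a single /etc/passwd-format line.
passwd_uid_gid() {
    echo "$1" | awk -F: '{print $3, $4}'
}

# Example with a made-up slurm entry:
passwd_uid_gid 'slurm:x:64030:64030::/nonexistent:/bin/false'   # prints: 64030 64030

# On the live cluster one could then compare nodes, e.g.:
#   for h in gimel n-5-34 n-5-35 n-1-17; do ssh "$h" grep '^slurm:' /etc/passwd; done
```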
- copy /etc/slurm/slurm.conf from gimel to the local /etc/slurm directory.
- figure out what CPU/memory resources n-1-17 has (see /proc/cpuinfo) and append the following line:
NodeName=n-1-17 NodeAddr=10.20.1.17 CPUs=24 State=UNKNOWN
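The CPUs value can be read straight from /proc/cpuinfo instead of counted by hand. A sketch, assuming the node name and address from the line above; the RealMemory field is optional and shown only as a possible addition:

```shell
#!/bin/sh
# Count logical CPUs and total memory (MB) on the node being added.
cpus=$(grep -c '^processor' /proc/cpuinfo)
mem_mb=$(awk '/^MemTotal/ {print int($2/1024)}' /proc/meminfo)

# Emit a slurm.conf node line; RealMemory may be omitted as in the line above.
echo "NodeName=n-1-17 NodeAddr=10.20.1.17 CPUs=${cpus} RealMemory=${mem_mb} State=UNKNOWN"
```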
- append n-1-17 to the partition: PartitionName=gimel Nodes=gimel,n-5-34,n-5-35,n-1-17 Default=YES MaxTime=INFINITE State=UP
- create the following folders:
mkdir -p /var/spool/slurm-llnl /var/run/slurm-llnl /var/log/slurm-llnl
chown -R slurm:slurm /var/spool/slurm-llnl /var/run/slurm-llnl /var/log/slurm-llnl
- restart the slurm master node on gimel (CentOS 6): /etc/init.d/slurm restart
- restart slurm on the compute nodes (CentOS 7): systemctl restart slurmd
And last but not least, ask the firewall to allow communication between the master node and compute node n-1-17:
- firewall-cmd --permanent --zone=public --add-port=6818/tcp
- firewall-cmd --reload
To disable a specific node, do scontrol update NodeName=n-1-17 State=DRAIN Reason=DRAINED
To return it to service, do scontrol update NodeName=n-1-17 State=IDLE
To see the current state of the queue, do sinfo -lNe and you will see:
Wed May 27 09:49:54 2020
NODELIST  NODES  PARTITION  STATE    CPUS  S:C:T   MEMORY  TMP_DISK  WEIGHT  AVAIL_FE  REASON
gimel     1      gimel*     drained  24    4:6:1   1       0         1       (null)    none
n-1-17    1      gimel*     idle     24    24:1:1  1       0         1       (null)    none
n-5-34    1      gimel*     idle     80    80:1:1  1       0         1       (null)    none
n-5-35    1      gimel*     idle     80    80:1:1  1       0         1       (null)    none
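The sinfo listing can also be summarized per state with a little awk. A sketch — count_state is a made-up helper, and the sample lines are trimmed (first five columns) from the output above:

```shell
#!/bin/sh
# Hypothetical helper: count node lines whose 4th field (STATE) matches $1.
count_state() {
    awk -v s="$1" '$4 == s {n++} END {print n+0}'
}

# Trimmed sample of the sinfo -lNe output above:
sinfo_sample='gimel 1 gimel* drained 24
n-1-17 1 gimel* idle 24
n-5-34 1 gimel* idle 80
n-5-35 1 gimel* idle 80'

echo "$sinfo_sample" | count_state idle      # prints: 3
echo "$sinfo_sample" | count_state drained   # prints: 1
```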