SGE notes
ALL ABOUT SGE (SUN GRID ENGINE)
Note: these instructions still need editing. "domain" is a placeholder and must be replaced by the real domain throughout.
To add an exec node:

 yum -y install gridengine gridengine-execd
 export SGE_ROOT=/usr/share/gridengine
 export SGE_CELL=bkslab
 cp -v /nfs/init/gridengine/install.conf /tmp/gridengine-install.conf
 vim /tmp/gridengine-install.conf    # change EXEC_HOST_LIST=" " to EXEC_HOST_LIST="$HOSTNAME"
 cd /usr/share/gridengine/
 ./inst_sge -x -s -auto /tmp/gridengine-install.conf > /tmp/gridengine.log
 cat /tmp/gridengine.log | tee -a /root/gridengine-install.log
 if [ -e ${SGE_CELL} ]; then mv -v ${SGE_CELL} ${SGE_CELL}.local; fi
 ln -vs /nfs/gridengine/${SGE_CELL} /usr/share/gridengine/${SGE_CELL}
 rm -vf /etc/sysconfig/gridengine
 echo "SGE_ROOT=${SGE_ROOT}" >> /etc/sysconfig/gridengine
 echo "SGE_CELL=${SGE_CELL}" >> /etc/sysconfig/gridengine
 mkdir -pv /var/spool/gridengine/`hostname -s`
 chown -Rv sgeadmin:sgeadmin /var/spool/gridengine
 chkconfig --levels=345 sge_execd on

Then go to the sgemaster and do this:

 qconf -ae                      # change the hostname from "template" to hostname_of_new_exec
 qconf -as hostname_of_new_exec # add it as a submit host

HOW TO EDIT THE NUMBER OF SLOTS FOR AN EXEC_HOST:

 qconf -mattr exechost complex_values slots=32 raiders.c.domain

which reports:

 "complex_values" of "exechost" is empty - Adding new element(s).
 root@pan.slot-27.rack-1.pharmacy.cluster.domain modified "raiders.c.domain" in exechost list

HOW TO ADD A HOSTGROUP:

 qconf -ahgrp @custom

ADD THE EXECHOST TO A HOSTGROUP:

 qconf -mhgrp @custom
 service sgemaster restart

Then back on the exec_host:

 service sge_execd start

To suspend a job:

 qmod -sj job_number

To delete nodes I did the following:

 qconf -shgrpl                  # to see a list of host groups
 qconf -shgrp @HOST_GROUP_NAME  # for each host group, to see if the nodes you want to delete are listed

If a node is listed, then:

 qconf -mhgrp @HOST_GROUP_NAME  # modify this file (delete the line with the node you want to delete)
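To double-check which host groups still mention a node before moving on, a loop like the following works (a sketch, not from the original notes; NODE is a placeholder for the node's full SGE hostname):

 # Sketch: report every host group whose hostlist still mentions $NODE.
 NODE=node-1.rack-3.pharmacy.cluster.domain
 for HG in $(qconf -shgrpl); do
     if qconf -shgrp "$HG" | grep -q "$NODE"; then
         echo "$NODE is still listed in $HG"
     fi
 done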
Once you've deleted the node from all of the hostgroups:

 qconf -de node_you_want_to_delete >/dev/null
 qmod -de node_you_want_to_delete

A more formal node removal pipeline (as bash):

 for HG in $(qconf -shgrpl); do
     qconf -dattr hostgroup hostlist NODE_NAME_HERE $HG
 done
 qconf -purge queue slots '*.q@NODE_NAME_HERE'    # or all.q@NODE_NAME_HERE
 qconf -ds NODE_NAME_HERE
 qconf -dconf NODE_NAME_HERE
 qconf -de NODE_NAME_HERE

To alter the priority on all the jobs for a user:

 qstat -u user | cut -d ' ' -f2 >> some_file

Edit some_file and delete the first couple of lines (the header lines), then:

 for OUTPUT in $(cat some_file); do qalter -p 1022 $OUTPUT; done

Priorities range from -1023 to 1024.

DEBUGGING SGE:

 qstat -explain a
 for HOSTGROUP in `qconf -shgrpl`; do for HOSTLIST in `qconf -shgrp $HOSTGROUP`; do echo $HOSTLIST; done; done | grep node-1.slot-27.rack-2.pharmacy.cluster.domain

Look at the logs for both master and exec (raiders:/var/spool/gridengine/raiders/messages and pan:/var/spool/gridengine/bkslab/qmaster/messages).

Make sure resolv.conf looks like this:

 nameserver 142.150.250.10
 nameserver 10.10.16.64
 search cluster.domain
 domain bkslab.org

A session listing all the host groups (note the stray $ in front of the backquotes, which is what produced the bogus "$@24-core" name in the first line of output; the loop should read for X in `qconf -shgrpl`):

 [root@pan ~]# for X in $`qconf -shgrpl`; do qconf -shgrp $X; done;
 Host group "$@24-core" does not exist
 group_name @64-core
 hostlist node-26.rack-2.pharmacy.cluster.domain
 group_name @8-core
 hostlist node-2.slot-27.rack-1.pharmacy.cluster.domain \
          node-1.slot-27.rack-1.pharmacy.cluster.domain
 group_name @allhosts
 hostlist @physical @virtual
 group_name @physical
 hostlist node-26.rack-2.pharmacy.cluster.domain
 group_name @virtual
 hostlist node-2.slot-27.rack-1.pharmacy.cluster.domain \
          node-1.slot-27.rack-1.pharmacy.cluster.domain

What I tried while debugging:

1) In one screen I ran strace qstat -f, and in another screen I ran ps -ax | grep qstat to get the pid.
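(As an aside, a sketch of getting that pid without ps | grep, assuming pgrep is available, which it normally is on CentOS:)

 # Sketch: grab the pid of the running qstat directly.
 QSTAT_PID=$(pgrep -n qstat)   # -n = newest matching process
 echo "$QSTAT_PID"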
Then I ran ls -l /proc/<pid>/fd/. I did this because every time I typed strace qstat -f it would get stuck, saying this:

 poll([{fd=3, events=POLLIN|POLLPRI}], 1, 1000) = 0 (Timeout)
 gettimeofday({1390262563, 742705}, NULL) = 0
 gettimeofday({1390262563, 742741}, NULL) = 0
 gettimeofday({1390262563, 742771}, NULL) = 0
 gettimeofday({1390262563, 742801}, NULL) = 0
 gettimeofday({1390262563, 742828}, NULL) = 0
 gettimeofday({1390262563, 742855}, NULL) = 0
 gettimeofday({1390262563, 742881}, NULL) = 0
 gettimeofday({1390262563, 742909}, NULL) = 0

and then eventually it would say this:

 poll([{fd=3, events=POLLIN|POLLPRI}], 1, 1000) = 1 ([{fd=3, revents=POLLIN}])
 gettimeofday({1390262563, 960292}, NULL) = 0
 gettimeofday({1390262563, 960321}, NULL) = 0
 gettimeofday({1390262563, 960349}, NULL) = 0
 read(3, "<gmsh><dl>99</dl></gms", 22) = 22
 read(3, "h", 1) = 1
 read(3, ">", 1) = 1
 read(3, "<mih version=\"0.1\"><mid>2</mid><"..., 99) = 99
 read(3, "<ccrm version=\"0.1\"></ccrm>", 27) = 27
 gettimeofday({1390262563, 960547}, NULL) = 0
 gettimeofday({1390262563, 960681}, NULL) = 0
 gettimeofday({1390262563, 960709}, NULL) = 0
 gettimeofday({1390262563, 960741}, NULL) = 0
 gettimeofday({1390262563, 960769}, NULL) = 0
 gettimeofday({1390262563, 960797}, NULL) = 0
 gettimeofday({1390262563, 960823}, NULL) = 0
 shutdown(3, 2 /* send and receive */) = 0
 close(3)                                = 0
 gettimeofday({1390262563, 961009}, NULL) = 0
 gettimeofday({1390262563, 961036}, NULL) = 0
 gettimeofday({1390262563, 961064}, NULL) = 0
 gettimeofday({1390262563, 961093}, NULL) = 0
 gettimeofday({1390262563, 961120}, NULL) = 0
 gettimeofday({1390262563, 961148}, NULL) = 0

The weird thing is that when I typed ls -l /proc/<pid>/fd/ there was never a file descriptor "3".

2) I tried to delete the nodes that we moved to SF by doing the following:

 qconf -dattr @physical "node-1.rack-3.pharmacy.cluster.domain node-10.rack-3.pharmacy.cluster.domain node-11.rack-3.pharmacy.cluster.domain node-12.rack-3.pharmacy.cluster.domain node-13.rack-3.pharmacy.cluster.domain node-14.rack-3.pharmacy.cluster.domain node-15.rack-3.pharmacy.cluster.domain node-2.rack-3.pharmacy.cluster.domain node-26.rack-3.pharmacy.cluster.domain node-27.rack-3.pharmacy.cluster.domain node-29.rack-3.pharmacy.cluster.domain node-3.rack-3.pharmacy.cluster.domain node-4.rack-3.pharmacy.cluster.domain node-5.rack-3.pharmacy.cluster.domain node-6.rack-3.pharmacy.cluster.domain node-7.rack-3.pharmacy.cluster.domain node-8.rack-3.pharmacy.cluster.domain node-9.rack-3.pharmacy.cluster.domain" node-1.rack-3.pharmacy.cluster.domain @physical > /dev/null

and I would get the error:

 Modification of object "@physical" not supported

(The command is missing the object type and attribute name, so qconf parsed "@physical" as the object type; the working form is qconf -dattr hostgroup hostlist <node> @physical, as in the removal pipeline above.)

3) I tried to see the queue's complex attributes by typing qconf -sc and saw this:

 #name   shortcut  type  relop  requestable  consumable  default  urgency
 slots   s         INT   <=     YES          YES         1        1000

I am not quite sure what urgency = 1000 means; all the other names had "0" under urgency. (The urgency column is the resource's contribution to the urgency part of job priority: a job's resource urgency is this value times the amount of the resource it requests, so a job asking for more slots gets a proportionally higher urgency.)

4) I tried qmod -cq '*' to clear the error state of all the queues.
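(If only a few queue instances are in an error state, the same thing can be done per instance instead of with the wildcard; a sketch, with the node name just an example:)

 # Sketch: clear the error state of a single queue instance rather than every queue.
 qmod -cq all.q@node-1.rack-3.pharmacy.cluster.domain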
For every queue instance, the qmod -cq '*' run reported:

 Queue instance "all.q@node-1.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
 Queue instance "all.q@node-1.slot-27.rack-1.pharmacy.cluster.domain" is already in the specified state: no error
 Queue instance "all.q@node-1.slot-27.rack-2.pharmacy.cluster.domain" is already in the specified state: no error
 Queue instance "all.q@node-10.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
 Queue instance "all.q@node-11.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
 Queue instance "all.q@node-12.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
 Queue instance "all.q@node-13.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
 Queue instance "all.q@node-14.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
 Queue instance "all.q@node-15.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
 Queue instance "all.q@node-2.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
 Queue instance "all.q@node-2.slot-27.rack-1.pharmacy.cluster.domain" is already in the specified state: no error
 Queue instance "all.q@node-2.slot-27.rack-2.pharmacy.cluster.domain" is already in the specified state: no error
 Queue instance "all.q@node-26.rack-2.pharmacy.cluster.domain" is already in the specified state: no error
 Queue instance "all.q@node-26.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
 Queue instance "all.q@node-27.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
 Queue instance "all.q@node-29.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
 Queue instance "all.q@node-3.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
 Queue instance "all.q@node-3.slot-27.rack-2.pharmacy.cluster.domain" is already in the specified state: no error
 Queue instance "all.q@node-4.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
 Queue instance "all.q@node-4.slot-27.rack-2.pharmacy.cluster.domain" is already in the specified state: no error
 Queue instance "all.q@node-5.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
 Queue instance "all.q@node-5.slot-27.rack-2.pharmacy.cluster.domain" is already in the specified state: no error
 Queue instance "all.q@node-6.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
 Queue instance "all.q@node-6.slot-27.rack-2.pharmacy.cluster.domain" is already in the specified state: no error
 Queue instance "all.q@node-7.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
 Queue instance "all.q@node-7.slot-27.rack-2.pharmacy.cluster.domain" is already in the specified state: no error
 Queue instance "all.q@node-8.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
 Queue instance "all.q@node-9.rack-3.pharmacy.cluster.domain" is already in the specified state: no error

5) I tried deleting a node like this instead:

 qconf -ds node-1.rack-3.pharmacy.cluster.domain

But when I typed qconf -sel it was still there. (qconf -ds removes a submit host; the execution host list that qconf -sel shows is only changed by qconf -de.)

6) I tried to see what the hostlist for @physical was by typing qconf -ahgrp @physical. It said:

 group_name @physical
 hostlist NONE

(qconf -ahgrp opens an empty template for adding a new host group, which is why the hostlist shows NONE; qconf -shgrp @physical is what shows the existing group.)

Then I typed qconf -shgrpl to see a list of all the hostgroups and tried qconf -ahgrp on each of them.
All of them said the hostlist was NONE, but when I tried qconf -ahgrp @allhosts I got this message:

 denied: "root" must be manager for this operation
 error: commlib error: got select error (Connection reset by peer)

7) I looked at the messages in the file /var/spool/gridengine/bkslab/qmaster/messages and it said this (over and over again):

 01/20/2014 19:41:35|listen|pan|E|commlib error: got read error (closing "pan.slot-27.rack-1.pharmacy.cluster.domain/qconf/2")
 01/20/2014 19:43:24| main|pan|W|local configuration pan.slot-27.rack-1.pharmacy.cluster.domain not defined - using global configuration
 01/20/2014 19:43:24| main|pan|W|can't resolve host name "node-3-3.rack-3.pharmacy.cluster.domain": undefined commlib error code
 01/20/2014 19:43:24| main|pan|W|can't resolve host name "node-3-4.rack-3.pharmacy.cluster.domain": undefined commlib error code
 01/20/2014 19:43:53| main|pan|I|read job database with 468604 entries in 29 seconds
 01/20/2014 19:43:55| main|pan|I|qmaster hard descriptor limit is set to 8192
 01/20/2014 19:43:55| main|pan|I|qmaster soft descriptor limit is set to 8192
 01/20/2014 19:43:55| main|pan|I|qmaster will use max. 8172 file descriptors for communication
 01/20/2014 19:43:55| main|pan|I|qmaster will accept max. 99 dynamic event clients
 01/20/2014 19:43:55| main|pan|I|starting up GE 6.2u5p3 (lx26-amd64)

(The "can't resolve host name" warnings are worth chasing down; see the sketch at the end of these notes.)

8) Periodically I would get this error:

 ERROR: failed receiving gdi request response for mid=3 (got no message).

9) I also tried deleting the pid in the file /var/spool/gridengine/bkslab/qmaster/qmaster.pid. That didn't do anything; it eventually just got replaced with a different number. It's weird because it's not even the right pid. For example, the real pid was 8286 and the pid in the file was 8203:

 [root@pan qmaster]# service sgemaster start
 Starting sge_qmaster:                                      [  OK  ]
 [root@pan qmaster]# ps -ax |grep sge
 Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.8/FAQ
  8286 ?        Rl     0:03 /usr/bin/sge_qmaster
  8301 pts/0    S+     0:00 grep sge
 [root@pan qmaster]# cat qmaster.pid
 8203

10) When I typed tail /var/log/messages after restarting the machine, I saw this:

 Jan 20 14:25:05 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)
 Jan 20 14:27:05 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)
 Jan 20 14:29:05 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)
 Jan 20 14:31:05 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)
 Jan 20 14:33:06 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)
 Jan 20 14:35:06 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)
 Jan 20 14:36:29 pan kernel: Registering the id_resolver key type
 Jan 20 14:36:29 pan kernel: FS-Cache: Netfs 'nfs' registered for caching
 Jan 20 14:36:29 pan nfsidmap[2536]: nss_getpwnam: name 'root@rack-1.pharmacy.cluster.domain' does not map into domain 'domain'
 Jan 20 14:37:06 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)
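Since several of the problems above came down to host names the qmaster could not resolve, here is a small check (a sketch, not from the original notes; it assumes qconf and getent are on the PATH) that compares every execution host known to SGE against DNS/hosts on the current machine:

 # Sketch: flag execution hosts in SGE's list that do not resolve from this machine.
 for HOST in $(qconf -sel); do
     if ! getent hosts "$HOST" > /dev/null; then
         echo "cannot resolve: $HOST"
     fi
 done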