SGE notes


ALL ABOUT SGE (SUN GRID ENGINE)

Note: this still needs editing; "domain" is a placeholder and must be replaced by the real domain throughout.

To add an exec node:
  yum -y install gridengine gridengine-execd
  export SGE_ROOT=/usr/share/gridengine
  export SGE_CELL=bkslab
  cp -v /nfs/init/gridengine/install.conf /tmp/gridengine-install.conf
  vim /tmp/gridengine-install.conf   -> CHANGE EXEC_HOST_LIST=" " TO EXEC_HOST_LIST="$HOSTNAME"
  cd /usr/share/gridengine/
  ./inst_sge -x -s -auto /tmp/gridengine-install.conf > /tmp/gridengine.log
  cat /tmp/gridengine.log | tee -a /root/gridengine-install.log
  if [ -e ${SGE_CELL} ]; then mv -v ${SGE_CELL} ${SGE_CELL}.local; fi
  ln -vs /nfs/gridengine/${SGE_CELL} /usr/share/gridengine/${SGE_CELL}
  rm -vf /etc/sysconfig/gridengine
  echo "SGE_ROOT=${SGE_ROOT}" >> /etc/sysconfig/gridengine
  echo "SGE_CELL=${SGE_CELL}" >> /etc/sysconfig/gridengine
  mkdir -pv /var/spool/gridengine/`hostname -s`
  chown -Rv sgeadmin:sgeadmin /var/spool/gridengine
  chkconfig --level 345 sge_execd on

  Go to the sgemaster and do this:
  qconf -ae   -> CHANGE THE HOSTNAME FROM "template" TO hostname_of_new_exec
  qconf -as hostname_of_new_exec
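
  To sanity-check the new node from the qmaster (a sketch; hostname_of_new_exec is a placeholder):
  qconf -sel | grep hostname_of_new_exec     # exec host list
  qconf -ss  | grep hostname_of_new_exec     # submit host list
  qhost -h hostname_of_new_exec              # execd should be reporting load
  qstat -f | grep hostname_of_new_exec       # queue instance(s) on the node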

HOW TO EDIT THE NUMBER OF SLOTS FOR AN EXEC_HOST:
 qconf -mattr exechost complex_values slots=32 raiders.c.domain

 The output looks like this:
 "complex_values" of "exechost" is empty - Adding new element(s).
 root@pan.slot-27.rack-1.pharmacy.cluster.domain modified "raiders.c.domain" in exechost list
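
 To confirm the change took (raiders.c.domain is the example host from above):
 qconf -se raiders.c.domain | grep complex_values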

  HOW TO ADD A HOSTGROUP:
  qconf -ahgrp @custom 

  ADD THE EXECHOST TO A HOSTGROUP:
  qconf -mhgrp @custom

  service sgemaster restart
 
  Then back on the exec_host:
  
  service sge_execd start
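
  If you would rather not edit the hostgroup interactively, the hostlist can be appended from the command line instead (a sketch; @custom and the hostname are placeholders):
  qconf -aattr hostgroup hostlist hostname_of_new_exec @custom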


To suspend jobs you do:

qmod -sj job_number
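
To suspend (and later unsuspend) every job belonging to one user, a loop like this works (a sketch, assuming the job ID is in the first column of qstat output; qmod -usj is the unsuspend counterpart):

for JOB in $(qstat -u some_user | awk 'NR>2 {print $1}'); do
    qmod -sj $JOB       # suspend; use qmod -usj $JOB to unsuspend
done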

To delete nodes I did the following:

qconf -shgrpl  -> To see a list of host groups
qconf -shgrp @HOST_GROUP_NAME  -> For each host group, to see if the node you want to delete is listed
If it is listed, then:
qconf -mhgrp @HOST_GROUP_NAME  -> Modify this file (delete the line with the node you want to delete).
Once you've deleted the node from all the hostgroups:
qconf -de node_you_want_to_delete >/dev/null
qmod -de node_you_want_to_delete


A more formal node removal pipeline (in bash):

    for HG in $( qconf -shgrpl ) ; do
        qconf -dattr hostgroup hostlist NODE_NAME_HERE $HG
    done
    qconf -purge queue slots "*.q@NODE_NAME_HERE"   # or just all.q@NODE_NAME_HERE
    qconf -ds NODE_NAME_HERE
    qconf -dconf NODE_NAME_HERE
    qconf -de NODE_NAME_HERE
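
Wrapped up as a small helper (a sketch under the same assumptions as the pipeline above; adjust the queue name if it is not all.q):

    remove_node () {
        local NODE=$1
        # drop the node from every hostgroup that references it
        for HG in $( qconf -shgrpl ) ; do
            qconf -dattr hostgroup hostlist "$NODE" "$HG"
        done
        # remove the queue instance, then the submit host, local conf and exec host entries
        qconf -purge queue slots "all.q@$NODE"
        qconf -ds "$NODE"
        qconf -dconf "$NODE"
        qconf -de "$NODE"
    }
    # remove_node node-1.rack-3.pharmacy.cluster.domain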

To alter the priority of all the jobs for a user:
qstat -u user | cut -d ' ' -f2 >> some_file
Edit some_file and delete the first couple of lines (the header lines).
for OUTPUT in $(cat some_file); do qalter -p 1022 $OUTPUT; done
Priorities range from -1024 to 1023.
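
The temp file can be skipped entirely (a sketch, assuming the job ID is in the first column of qstat output):
qstat -u user | awk 'NR>2 {print $1}' | xargs -r -n1 qalter -p 1022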

DEBUGGING SGE:

qstat -explain a

for HOSTGROUP in `qconf -shgrpl`; do for HOSTLIST in `qconf -shgrp $HOSTGROUP`; do  echo $HOSTLIST; done; done | grep node-1.slot-27.rack-2.pharmacy.cluster.domain
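
A more direct way to find which hostgroups reference a node (a sketch; substitute the node name):

for HG in `qconf -shgrpl`; do
    qconf -shgrp $HG | grep -q node-1.slot-27.rack-2.pharmacy.cluster.domain && echo $HG
done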

Look at the logs for both master and exec 
(raiders:/var/spool/gridengine/raiders/messages and pan:/var/spool/gridengine/bkslab/qmaster/messages)

Make sure resolv.conf looks like this:
nameserver 142.150.250.10
nameserver 10.10.16.64
search cluster.domain domain bkslab.org
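
Since several of the errors below boil down to name resolution, it is worth checking forward and reverse lookups on both the master and the exec host (a sketch; substitute a real node name):

hostname -f                                                   # FQDN the daemons will use
getent hosts node-1.slot-27.rack-2.pharmacy.cluster.domain    # forward lookup
getent hosts $(getent hosts node-1.slot-27.rack-2.pharmacy.cluster.domain | awk '{print $1}')   # reverse lookup should return the same FQDN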

(Note the stray $ in front of the backticks in the loop below; that is why the first hostgroup comes out as "$@24-core".)
[root@pan ~]# for X in $`qconf -shgrpl`; do qconf -shgrp $X; done;
Host group "$@24-core" does not exist
group_name @64-core
hostlist node-26.rack-2.pharmacy.cluster.domain
group_name @8-core
hostlist node-2.slot-27.rack-1.pharmacy.cluster.domain \
         node-1.slot-27.rack-1.pharmacy.cluster.domain
group_name @allhosts
hostlist @physical @virtual
group_name @physical
hostlist node-26.rack-2.pharmacy.cluster.domain
group_name @virtual
hostlist node-2.slot-27.rack-1.pharmacy.cluster.domain \
         node-1.slot-27.rack-1.pharmacy.cluster.domain

1) In one screen I would run strace qstat -f, and in the other screen I would run ps -ax | grep qstat to get the PID, then ls -l /proc/<pid>/fd/.
I did this because every time I typed strace qstat -f it would get stuck, saying this:
poll([{fd=3, events=POLLIN|POLLPRI}], 1, 1000) = 0 (Timeout)
gettimeofday({1390262563, 742705}, NULL) = 0
gettimeofday({1390262563, 742741}, NULL) = 0
gettimeofday({1390262563, 742771}, NULL) = 0
gettimeofday({1390262563, 742801}, NULL) = 0
gettimeofday({1390262563, 742828}, NULL) = 0
gettimeofday({1390262563, 742855}, NULL) = 0
gettimeofday({1390262563, 742881}, NULL) = 0
gettimeofday({1390262563, 742909}, NULL) = 0

and then eventually it would say this:
poll([{fd=3, events=POLLIN|POLLPRI}], 1, 1000) = 1 ([{fd=3, revents=POLLIN}])
gettimeofday({1390262563, 960292}, NULL) = 0
gettimeofday({1390262563, 960321}, NULL) = 0
gettimeofday({1390262563, 960349}, NULL) = 0
read(3, "<gmsh><dl>99</dl></gms", 22)   = 22
read(3, "h", 1)                     	= 1
read(3, ">", 1)                     	= 1
read(3, "<mih version=\"0.1\"><mid>2</mid><"..., 99) = 99
read(3, "<ccrm version=\"0.1\"></ccrm>", 27) = 27
gettimeofday({1390262563, 960547}, NULL) = 0
gettimeofday({1390262563, 960681}, NULL) = 0
gettimeofday({1390262563, 960709}, NULL) = 0
gettimeofday({1390262563, 960741}, NULL) = 0
gettimeofday({1390262563, 960769}, NULL) = 0
gettimeofday({1390262563, 960797}, NULL) = 0
gettimeofday({1390262563, 960823}, NULL) = 0
shutdown(3, 2 /* send and receive */)   = 0
close(3)                            	= 0
gettimeofday({1390262563, 961009}, NULL) = 0
gettimeofday({1390262563, 961036}, NULL) = 0
gettimeofday({1390262563, 961064}, NULL) = 0
gettimeofday({1390262563, 961093}, NULL) = 0
gettimeofday({1390262563, 961120}, NULL) = 0
gettimeofday({1390262563, 961148}, NULL) = 0

The weird thing about this is that when I typed ls -l /proc/<pid>/fd/ there was never a file descriptor 3.
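
The two-screen approach can be collapsed into one shell (a sketch):

qstat -f &
QPID=$!
ls -l /proc/$QPID/fd/       # list the descriptors qstat has open
strace -p $QPID             # attach to watch the poll/read loop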

2) I tried to delete the nodes that we moved to SF by doing the following:
qconf -dattr @physical "node-1.rack-3.pharmacy.cluster.domain node-10.rack-3.pharmacy.cluster.domain node-11.rack-3.pharmacy.cluster.domain node-12.rack-3.pharmacy.cluster.domain node-13.rack-3.pharmacy.cluster.domain node-14.rack-3.pharmacy.cluster.domain node-15.rack-3.pharmacy.cluster.domain node-2.rack-3.pharmacy.cluster.domain node-26.rack-3.pharmacy.cluster.domain node-27.rack-3.pharmacy.cluster.domain node-29.rack-3.pharmacy.cluster.domain node-3.rack-3.pharmacy.cluster.domain node-4.rack-3.pharmacy.cluster.domain node-5.rack-3.pharmacy.cluster.domain node-6.rack-3.pharmacy.cluster.domain node-7.rack-3.pharmacy.cluster.domain node-8.rack-3.pharmacy.cluster.domain node-9.rack-3.pharmacy.cluster.domain" node-1.rack-3.pharmacy.cluster.domain @physical > /dev/null

I would get the error: Modification of object "@physical" not supported
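
The likely fix is that -dattr wants the object type first, i.e. qconf -dattr hostgroup hostlist <host> <hostgroup>, one host at a time (a sketch covering the same rack-3 nodes as above):

for N in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 26 27 29; do
    qconf -dattr hostgroup hostlist node-$N.rack-3.pharmacy.cluster.domain @physical
done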

3) I tried to see the queue's complex attributes by typing qconf -sc and saw this:

#name    shortcut   type   relop   requestable   consumable   default   urgency
slots    s          INT    <=      YES           YES          1         1000

I am not quite sure what urgency = 1000 means; all the other names had 0 under urgency.
(The urgency column is the resource's per-unit contribution to a job's resource urgency, so jobs requesting more slots accrue more priority; 1000 is the default SGE ships for slots.)

4) I tried qmod -cq '*'  to clear the error state of all the queues.  
It would tell me this:

Queue instance "all.q@node-1.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-1.slot-27.rack-1.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-1.slot-27.rack-2.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-10.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-11.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-12.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-13.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-14.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-15.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-2.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-2.slot-27.rack-1.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-2.slot-27.rack-2.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-26.rack-2.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-26.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-27.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-29.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-3.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-3.slot-27.rack-2.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-4.rack-3.pharmacy.cluster.domain is already in the specified state: no error
Queue instance "all.q@node-4.slot-27.rack-2.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-5.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-5.slot-27.rack-2.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-6.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-6.slot-27.rack-2.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-7.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-7.slot-27.rack-2.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-8.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-9.rack-3.pharmacy.cluster.domain" is already in the specified state: no error


5) I tried deleting a node like this instead:
qconf -ds node-1.rack-3.pharmacy.cluster.domain
But when I typed qconf -sel it was still there.
(qconf -ds only removes the host from the submit host list; the exec host entry that qconf -sel shows is removed with qconf -de.)

6) I tried to see what the hostlist for @physical was by typing qconf -ahgrp @physical. It said: group_name @physical, hostlist NONE.
   Then I typed qconf -shgrpl to see a list of all hostgroups and tried qconf -ahgrp on each of them. All of them said the hostlist was NONE,
   but when I tried qconf -ahgrp @allhosts I got this message:
   denied: "root" must be manager for this operation
   error: commlib error: got select error (Connection reset by peer)
   (Note that qconf -ahgrp adds a new hostgroup from an empty template, which is why the hostlist showed as NONE; qconf -shgrp @physical is what shows the existing hostlist.)

7) I looked at the messages in the file: /var/spool/gridengine/bkslab/qmaster/messages and it said this (over and over again):

01/20/2014 19:41:35|listen|pan|E|commlib error: got read error (closing "pan.slot-27.rack-1.pharmacy.cluster.domain/qconf/2")
01/20/2014 19:43:24|  main|pan|W|local configuration pan.slot-27.rack-1.pharmacy.cluster.domain not defined - using global configuration
01/20/2014 19:43:24|  main|pan|W|can't resolve host name "node-3-3.rack-3.pharmacy.cluster.domain": undefined commlib error code
01/20/2014 19:43:24|  main|pan|W|can't resolve host name "node-3-4.rack-3.pharmacy.cluster.domain": undefined commlib error code
01/20/2014 19:43:53|  main|pan|I|read job database with 468604 entries in 29 seconds
01/20/2014 19:43:55|  main|pan|I|qmaster hard descriptor limit is set to 8192
01/20/2014 19:43:55|  main|pan|I|qmaster soft descriptor limit is set to 8192
01/20/2014 19:43:55|  main|pan|I|qmaster will use max. 8172 file descriptors for communication
01/20/2014 19:43:55|  main|pan|I|qmaster will accept max. 99 dynamic event clients
01/20/2014 19:43:55|  main|pan|I|starting up GE 6.2u5p3 (lx26-amd64)

8) Periodically I would get this error:  ERROR: failed receiving gdi request response for mid=3 (got no message).

9) I also tried deleting the PID in the file /var/spool/gridengine/bkslab/qmaster/qmaster.pid.
  That didn't do anything; it eventually just got replaced with a different number.

 It's weird because it's not even the right PID. For example, the real PID was 8286 and the PID in the file was 8203:

  [root@pan qmaster]# service sgemaster start
Starting sge_qmaster:                                  	[  OK  ]
[root@pan qmaster]# ps -ax |grep sge
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.8/FAQ
 8286 ?    	Rl 	0:03 /usr/bin/sge_qmaster
 8301 pts/0	S+ 	0:00 grep sge
[root@pan qmaster]# cat qmaster.pid 
8203

10)   When I typed tail /var/log/messages I saw this:

Jan 20 14:25:05 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)
Jan 20 14:27:05 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)
Jan 20 14:29:05 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)
Jan 20 14:31:05 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)
Jan 20 14:33:06 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)
Jan 20 14:35:06 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)
Jan 20 14:36:29 pan kernel: Registering the id_resolver key type
Jan 20 14:36:29 pan kernel: FS-Cache: Netfs 'nfs' registered for caching
Jan 20 14:36:29 pan nfsidmap[2536]: nss_getpwnam: name 'root@rack-1.pharmacy.cluster.domain' does not map into domain 'domain'
Jan 20 14:37:06 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)
This was what happened when I restarted the machine.