SGE notes
ALL ABOUT SGE (SUN GRID ENGINE)
Note: these notes still need editing. Replace domain throughout with the actual domain name.
To add an exec node:

yum -y install gridengine gridengine-execd
export SGE_ROOT=/usr/share/gridengine
export SGE_CELL=bkslab
cp -v /nfs/init/gridengine/install.conf /tmp/gridengine-install.conf
vim /tmp/gridengine-install.conf   -> CHANGE EXEC_HOST_LIST=" " TO EXEC_HOST_LIST="$HOSTNAME"
cd /usr/share/gridengine/
./inst_sge -x -s -auto /tmp/gridengine-install.conf > /tmp/gridengine.log
cat /tmp/gridengine.log | tee -a /root/gridengine-install.log
if [ -e ${SGE_CELL} ]; then mv -v ${SGE_CELL} ${SGE_CELL}.local; fi
ln -vs /nfs/gridengine/${SGE_CELL} /usr/share/gridengine/${SGE_CELL}
rm -vf /etc/sysconfig/gridengine
echo "SGE_ROOT=${SGE_ROOT}" >> /etc/sysconfig/gridengine
echo "SGE_CELL=${SGE_CELL}" >> /etc/sysconfig/gridengine
mkdir -pv /var/spool/gridengine/`hostname -s`
chown -Rv sgeadmin:sgeadmin /var/spool/gridengine
chkconfig --level 345 sge_execd on

Go to the sgemaster and do this:
qconf -ae   --> CHANGE THE HOSTNAME FROM "template" TO hostname_of_new_exec
qconf -as hostname

HOW TO EDIT THE NUMBER OF SLOTS FOR AN EXEC_HOST:
qconf -mattr exechost complex_values slots=32 raiders.c.domain
"complex_values" of "exechost" is empty - Adding new element(s).
root@pan.slot-27.rack-1.pharmacy.cluster.domain modified "raiders.c.domain" in exechost list

HOW TO ADD A HOSTGROUP:
qconf -ahgrp @custom

ADD THE EXECHOST TO A HOSTGROUP:
qconf -mhgrp @custom
service sgemaster restart

Then back on the exec_host:
service sge_execd start

To suspend a job:
qmod -sj job_number

To delete nodes I did the following:
qconf -shgrpl                  -> to see the list of host groups
qconf -shgrp @HOST_GROUP_NAME  -> for each host group, to see if the node you want to delete is listed
If it is listed:
qconf -mhgrp @HOST_GROUP_NAME  -> modify this file (delete the line with the node you want to delete).

Once you've deleted the node from all of the hostgroups:
qconf -de node_you_want_to_delete >/dev/null
qmod -de node_you_want_to_delete
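Before running qconf -de it is worth double-checking that the node really is gone from every hostgroup, since qconf -de can refuse to remove a host that is still referenced. A minimal bash sketch of that check (NODE is a placeholder; substitute the real exec host name):

#!/bin/bash
# Report any hostgroup that still references the node we are about to delete.
NODE="node_you_want_to_delete"   # placeholder - replace with the real hostname
for HG in $(qconf -shgrpl); do
    # qconf -shgrp prints the group_name and hostlist lines for one hostgroup
    if qconf -shgrp "$HG" | grep -Fqw "$NODE"; then
        echo "$NODE is still listed in $HG"
    fi
done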
A more formal node removal pipeline (as bash):

for HG in $( qconf -shgrpl ) ; do
    qconf -dattr hostgroup hostlist NODE_NAME_HERE $HG
done
qconf -purge queue slots "*.q@NODE_NAME_HERE"   (or all.q@NODE_NAME_HERE)
qconf -ds NODE_NAME_HERE
qconf -dconf NODE_NAME_HERE
qconf -de NODE_NAME_HERE

To alter the priority on all the jobs for a user:

qstat -u user | cut -d ' ' -f2 >> some_file
Edit some_file and delete the first couple of lines (the header lines), then:
for OUTPUT in $(cat some_file); do qalter -p 1022 $OUTPUT; done
Priorities range from -1023 to 1024.

DEBUGGING SGE:

qstat -explain a

for HOSTGROUP in `qconf -shgrpl`; do
    for HOSTLIST in `qconf -shgrp $HOSTGROUP`; do
        echo $HOSTLIST
    done
done | grep node-1.slot-27.rack-2.pharmacy.cluster.domain

Look at the logs for both master and exec (raiders:/var/spool/gridengine/raiders/messages and pan:/var/spool/gridengine/bkslab/qmaster/messages).

Make sure resolv.conf looks like this:
nameserver 142.150.250.10
nameserver 10.10.16.64
search cluster.domain domain bkslab.org

[root@pan ~]# for X in $`qconf -shgrpl`; do qconf -shgrp $X; done;
Host group "$@24-core" does not exist
group_name @64-core
hostlist node-26.rack-2.pharmacy.cluster.domain
group_name @8-core
hostlist node-2.slot-27.rack-1.pharmacy.cluster.domain \
         node-1.slot-27.rack-1.pharmacy.cluster.domain
group_name @allhosts
hostlist @physical @virtual
group_name @physical
hostlist node-26.rack-2.pharmacy.cluster.domain
group_name @virtual
hostlist node-2.slot-27.rack-1.pharmacy.cluster.domain \
         node-1.slot-27.rack-1.pharmacy.cluster.domain
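The Host group "$@24-core" does not exist line above seems to come from the stray $ in front of the backticks: bash keeps it as a literal $ and glues it onto the first hostgroup name before qconf -shgrp ever sees it. The same loop without that problem (a sketch, to be run on the qmaster):

# Print the definition of every hostgroup known to the qmaster.
for X in $(qconf -shgrpl); do
    qconf -shgrp "$X"
done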
1) In one screen I typed strace qstat -f, and in the other screen I typed ps -ax | grep qstat to get the pid, then ran ls -l /proc/pid/fd/. I did this because every time I typed strace qstat -f it would get stuck saying this:

poll([{fd=3, events=POLLIN|POLLPRI}], 1, 1000) = 0 (Timeout)
gettimeofday({1390262563, 742705}, NULL) = 0
gettimeofday({1390262563, 742741}, NULL) = 0
gettimeofday({1390262563, 742771}, NULL) = 0
gettimeofday({1390262563, 742801}, NULL) = 0
gettimeofday({1390262563, 742828}, NULL) = 0
gettimeofday({1390262563, 742855}, NULL) = 0
gettimeofday({1390262563, 742881}, NULL) = 0
gettimeofday({1390262563, 742909}, NULL) = 0

and then eventually it would say this:

poll([{fd=3, events=POLLIN|POLLPRI}], 1, 1000) = 1 ([{fd=3, revents=POLLIN}])
gettimeofday({1390262563, 960292}, NULL) = 0
gettimeofday({1390262563, 960321}, NULL) = 0
gettimeofday({1390262563, 960349}, NULL) = 0
read(3, "<gmsh><dl>99</dl></gms", 22) = 22
read(3, "h", 1) = 1
read(3, ">", 1) = 1
read(3, "<mih version=\"0.1\"><mid>2</mid><"..., 99) = 99
read(3, "<ccrm version=\"0.1\"></ccrm>", 27) = 27
gettimeofday({1390262563, 960547}, NULL) = 0
gettimeofday({1390262563, 960681}, NULL) = 0
gettimeofday({1390262563, 960709}, NULL) = 0
gettimeofday({1390262563, 960741}, NULL) = 0
gettimeofday({1390262563, 960769}, NULL) = 0
gettimeofday({1390262563, 960797}, NULL) = 0
gettimeofday({1390262563, 960823}, NULL) = 0
shutdown(3, 2 /* send and receive */) = 0
close(3) = 0
gettimeofday({1390262563, 961009}, NULL) = 0
gettimeofday({1390262563, 961036}, NULL) = 0
gettimeofday({1390262563, 961064}, NULL) = 0
gettimeofday({1390262563, 961093}, NULL) = 0
gettimeofday({1390262563, 961120}, NULL) = 0
gettimeofday({1390262563, 961148}, NULL) = 0

The weird thing about this is that when I typed ls -l /proc/pid/fd/ there was never a file descriptor "3".

2) I tried to delete the nodes that we moved to SF by doing the following:

qconf -dattr @physical "node-1.rack-3.pharmacy.cluster.domain node-10.rack-3.pharmacy.cluster.domain node-11.rack-3.pharmacy.cluster.domain node-12.rack-3.pharmacy.cluster.domain node-13.rack-3.pharmacy.cluster.domain node-14.rack-3.pharmacy.cluster.domain node-15.rack-3.pharmacy.cluster.domain node-2.rack-3.pharmacy.cluster.domain node-26.rack-3.pharmacy.cluster.domain node-27.rack-3.pharmacy.cluster.domain node-29.rack-3.pharmacy.cluster.domain node-3.rack-3.pharmacy.cluster.domain node-4.rack-3.pharmacy.cluster.domain node-5.rack-3.pharmacy.cluster.domain node-6.rack-3.pharmacy.cluster.domain node-7.rack-3.pharmacy.cluster.domain node-8.rack-3.pharmacy.cluster.domain node-9.rack-3.pharmacy.cluster.domain" node-1.rack-3.pharmacy.cluster.domain @physical > /dev/null

I would get the error:
Modification of object "@physical" not supported
(qconf -dattr expects the object type and attribute name first, e.g. qconf -dattr hostgroup hostlist <node> @physical as in the removal pipeline above, which is presumably why this form was rejected.)

3) I tried to see the queue's complex attributes by typing qconf -sc and saw this:

#name   shortcut  type  relop  requestable  consumable  default  urgency
slots   s         INT   <=     YES          YES         1        1000

I am not quite sure what urgency = 1000 means; all of the other names had "0" under urgency. (As far as I can tell, the urgency column is the weight a resource contributes to the resource-urgency part of job priority, so jobs requesting more slots accumulate proportionally more urgency.)
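To see only the complexes that carry a non-zero urgency, rather than reading the whole table, something like the following should work; it assumes urgency is the 8th column, as in the qconf -sc header shown above:

# List the name and urgency of every complex whose urgency is not 0.
# Header/comment lines in qconf -sc output start with '#', so skip those.
qconf -sc | awk '$1 !~ /^#/ && $8 != 0 { print $1, $8 }'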
4) I tried qmod -cq '*' to clear the error state of all the queues. It would tell me this:

Queue instance "all.q@node-1.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-1.slot-27.rack-1.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-1.slot-27.rack-2.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-10.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-11.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-12.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-13.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-14.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-15.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-2.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-2.slot-27.rack-1.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-2.slot-27.rack-2.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-26.rack-2.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-26.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-27.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-29.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-3.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-3.slot-27.rack-2.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-4.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-4.slot-27.rack-2.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-5.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-5.slot-27.rack-2.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-6.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-6.slot-27.rack-2.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-7.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-7.slot-27.rack-2.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-8.rack-3.pharmacy.cluster.domain" is already in the specified state: no error
Queue instance "all.q@node-9.rack-3.pharmacy.cluster.domain" is already in the specified state: no error

5) I tried deleting a node like this instead:
qconf -ds node-1.rack-3.pharmacy.cluster.domain
But when I typed qconf -sel it was still there. (I believe qconf -ds only removes the host from the submit host list; the exec host entry that qconf -sel shows has to be removed with qconf -de, as in the pipeline above.)

6) I tried to see what the hostlist for @physical was by typing qconf -ahgrp @physical. It said:
group_name @physical
hostlist NONE
Then I typed qconf -shgrpl to see a list of all hostgroups and tried qconf -ahgrp on each of them. All of them said the hostlist was NONE, but when I tried qconf -ahgrp @allhosts I got this message:
denied: "root" must be manager for this operation
error: commlib error: got select error (Connection reset by peer)
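For reference, qconf -ahgrp adds a new hostgroup (it opens an editor pre-filled with hostlist NONE and needs manager rights), which would explain both the empty hostlists and the denied message above; qconf -shgrp is the read-only command that shows an existing group. A quick side-by-side (sketch; @newgroup is just an example name):

# Show the existing definition of a hostgroup (read-only, no manager rights needed):
qconf -shgrp @physical

# Add a brand-new hostgroup (opens an editor on a template whose hostlist is NONE;
# must be run as an SGE manager on the qmaster):
qconf -ahgrp @newgroup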
7) I looked at the messages in the file /var/spool/gridengine/bkslab/qmaster/messages and it said this (over and over again):

01/20/2014 19:41:35|listen|pan|E|commlib error: got read error (closing "pan.slot-27.rack-1.pharmacy.cluster.domain/qconf/2")
01/20/2014 19:43:24| main|pan|W|local configuration pan.slot-27.rack-1.pharmacy.cluster.domain not defined - using global configuration
01/20/2014 19:43:24| main|pan|W|can't resolve host name "node-3-3.rack-3.pharmacy.cluster.domain": undefined commlib error code
01/20/2014 19:43:24| main|pan|W|can't resolve host name "node-3-4.rack-3.pharmacy.cluster.domain": undefined commlib error code
01/20/2014 19:43:53| main|pan|I|read job database with 468604 entries in 29 seconds
01/20/2014 19:43:55| main|pan|I|qmaster hard descriptor limit is set to 8192
01/20/2014 19:43:55| main|pan|I|qmaster soft descriptor limit is set to 8192
01/20/2014 19:43:55| main|pan|I|qmaster will use max. 8172 file descriptors for communication
01/20/2014 19:43:55| main|pan|I|qmaster will accept max. 99 dynamic event clients
01/20/2014 19:43:55| main|pan|I|starting up GE 6.2u5p3 (lx26-amd64)

8) Periodically I would get this error:
ERROR: failed receiving gdi request response for mid=3 (got no message).

9) I also tried deleting the pid in the file /var/spool/gridengine/bkslab/qmaster/qmaster.pid. That didn't do anything; it eventually just got replaced with a different number. What is weird is that it is not even the right pid. For example, the real pid was 8286 and the pid in the file was 8203:

[root@pan qmaster]# service sgemaster start
Starting sge_qmaster:                                      [  OK  ]
[root@pan qmaster]# ps -ax | grep sge
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.8/FAQ
 8286 ?        Rl     0:03 /usr/bin/sge_qmaster
 8301 pts/0    S+     0:00 grep sge
[root@pan qmaster]# cat qmaster.pid
8203

10) When I typed tail /var/log/messages I saw this:

Jan 20 14:25:05 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)
Jan 20 14:27:05 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)
Jan 20 14:29:05 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)
Jan 20 14:31:05 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)
Jan 20 14:33:06 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)
Jan 20 14:35:06 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)
Jan 20 14:36:29 pan kernel: Registering the id_resolver key type
Jan 20 14:36:29 pan kernel: FS-Cache: Netfs 'nfs' registered for caching
Jan 20 14:36:29 pan nfsidmap[2536]: nss_getpwnam: name 'root@rack-1.pharmacy.cluster.domain' does not map into domain 'domain'
Jan 20 14:37:06 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)

This was what happened when I restarted the machine.
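Given the repeated "can't resolve host name" warnings in the qmaster messages and the earlier note about resolv.conf, one quick check after a restart is to confirm that every exec host the qmaster knows about still resolves from the master. A rough sketch (it only relies on qconf -sel and getent; adjust as needed):

# Check that every registered exec host resolves through the system resolver.
for HOST in $(qconf -sel); do
    if getent hosts "$HOST" > /dev/null; then
        echo "resolves:         $HOST"
    else
        echo "DOES NOT RESOLVE: $HOST"
    fi
done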