Sun Grid Engine (SGE)

[[Category: Sysadmin]]
Revision as of 18:03, 20 September 2016

ALL ABOUT SGE (SUN GRID ENGINE)

To add an exec node:

 yum -y install gridengine gridengine-execd
 export SGE_ROOT=/usr/share/gridengine
 export SGE_CELL=bkslab
 cp -v /nfs/init/gridengine/install.conf /tmp/gridengine-install.conf
+++++++++++++++++++++++++++++++++++++++++++++++++++++
#-------------------------------------------------
# SGE default configuration file
#-------------------------------------------------
# Always use fully qualified pathnames, please

# SGE_ROOT path, this is basic information
# (mandatory for qmaster and execd installation)
SGE_ROOT="/usr/share/gridengine"

# SGE_QMASTER_PORT is used by qmaster for communication
# Please enter the port in this way: 1300
# Please do not enter it like this: 1300/tcp
# (mandatory for qmaster installation)
SGE_QMASTER_PORT=6444

# SGE_EXECD_PORT is used by execd for communication
# Please enter the port in this way: 1300
# Please do not enter it like this: 1300/tcp
# (mandatory for qmaster installation)
SGE_EXECD_PORT=6445

# SGE_ENABLE_SMF
# If set to false, SMF will not control SGE services
SGE_ENABLE_SMF="false"

# SGE_ENABLE_ST
# If set to false, Sun Service Tags will not be used
SGE_ENABLE_ST="true"

# SGE_CLUSTER_NAME
# Name of this cluster (used by SMF as a service instance name)
SGE_CLUSTER_NAME="bkslab"

# SGE_JMX_PORT is used by qmaster's JMX MBean server
# Mandatory if install_qmaster -jmx -auto <cfgfile>
# Range: 1024-65500
SGE_JMX_PORT="6446"

# SGE_JMX_SSL is used by qmaster's JMX MBean server
# If SGE_JMX_SSL=true, the MBean server connection uses SSL authentication
SGE_JMX_SSL="true"

# SGE_JMX_SSL_CLIENT is used by qmaster's JMX MBean server
# If SGE_JMX_SSL_CLIENT=true, the MBean server connection additionally uses
# SSL authentication of the client
SGE_JMX_SSL_CLIENT="true"

# SGE_JMX_SSL_KEYSTORE is used by qmaster's JMX MBean server
# If SGE_JMX_SSL=true, the server keystore found here is used
# e.g. /var/sgeCA/port<sge_qmaster_port>/<sge_cell>/private/keystore
SGE_JMX_SSL_KEYSTORE="/var/sgeCA/sge_qmaster/bkslab/private/keystore"

# SGE_JMX_SSL_KEYSTORE_PW is used by qmaster's JMX MBean server
# Password for the SGE_JMX_SSL_KEYSTORE file
SGE_JMX_SSL_KEYSTORE_PW="secret"

# SGE_JVM_LIB_PATH is used by qmaster's JVM thread
# Path to libjvm.so
# If the value is missing or set to "none", the JMX thread will not be installed
# If the value is empty or the path does not exist on the system, Grid Engine
# will try to find a correct value; if it cannot, the value is set to
# "jvmlib_missing" and the JMX thread will be configured but will fail to start
SGE_JVM_LIB_PATH="none"

# SGE_ADDITIONAL_JVM_ARGS is used by qmaster's JVM thread
# JVM-specific arguments such as -verbose:jni etc.
# Optional, can be empty
SGE_ADDITIONAL_JVM_ARGS="-Xmx256m"

# CELL_NAME will be a dir in SGE_ROOT containing the common dir
# Please enter only the name of the cell, no path
# (mandatory for qmaster and execd installation)
CELL_NAME="bkslab"

# ADMIN_USER: if you want to use a different admin user than the owner
# of SGE_ROOT, enter the user name here
# If left blank, the owner of the SGE_ROOT dir will be used as admin user
ADMIN_USER=""

# The dir where qmaster spools the parts which are not spooled by DB
# (mandatory for qmaster installation)
QMASTER_SPOOL_DIR="/var/spool/gridengine/bkslab/qmaster"

# The dir where the execd spools (active jobs)
# This entry is needed even if you are going to use Berkeley DB spooling;
# only cluster configuration and jobs will be spooled in the database.
# The execution daemon still needs a spool directory.
# (mandatory for qmaster installation)
EXECD_SPOOL_DIR="/var/spool/gridengine"

# For monitoring and accounting of jobs, every job will get a unique GID,
# so you have to enter a free GID range which is assigned to each job
# running on a machine.
# If you want to run 100 jobs at the same time on one host, enter a
# GID range like: 16000-16100
# (mandatory for qmaster installation)
GID_RANGE="16000-16100"

# If SGE is compiled with -spool-dynamic, you have to enter here which
# spooling method should be used (classic or berkeleydb)
# (mandatory for qmaster installation)
SPOOLING_METHOD="berkeleydb"

# Name of the server where the spooling DB is running
# If the spooling method is berkeleydb, it must be "none" when using no
# spooling server, and must contain the server name if a server should
# be used. In case of "classic" spooling it can be left out.
DB_SPOOLING_SERVER="none"

# The dir where the DB spools
# If Berkeley DB spooling is used, it must contain the full path to the
# spooling DB (e.g. /tmp/data/spooldb).
# Remember, this directory must be local on the qmaster host or on the
# Berkeley DB server host. No NFS mount, please.
DB_SPOOLING_DIR="/var/spool/gridengine/bkslab/spooldb"

# This parameter sets the number of parallel installation processes.
# To prevent a system overload, or exceeding the number of open file
# descriptors, the user can limit the number of parallel install processes.
# e.g. with PAR_EXECD_INST_COUNT="20", a maximum of 20 execds are installed in parallel.
PAR_EXECD_INST_COUNT="20"

# A list of hosts which should become admin hosts
# If you do not enter any host here, you have to add all of your hosts
# by hand after the installation. The auto-installation works without any entry.
ADMIN_HOST_LIST=""

# A list of hosts which should become submit hosts
# If you do not enter any host here, you have to add all of your hosts
# by hand after the installation. The auto-installation works without any entry.
SUBMIT_HOST_LIST=""

# A list of hosts which should become exec hosts
# If you do not enter any host here, you have to add all of your hosts
# by hand after the installation. The auto-installation works without any entry.
# (mandatory for execution host installation)
EXEC_HOST_LIST=""

# The dir where the execd spools (local configuration)
# If you want to configure your execution daemons to spool in a local
# directory, enter that directory here.
# If you do not want a local execution host spool directory, leave this empty.
EXECD_SPOOL_DIR_LOCAL="/var/spool/gridengine"

# If true, domain names will be ignored during hostname resolving;
# if false, the fully qualified domain name will be used for name resolving.
HOSTNAME_RESOLVING="false"

# Shell which should be used for remote installation (rsh/ssh)
# This is only supported if your hosts and rshd/sshd are configured
# not to ask for a password or prompt any message.
SHELL_NAME="ssh"

# This remote copy command is used for CSP installation.
# The script needs the remote copy command for distributing the CSP
# certificates. Using SSL, the command scp has to be entered; using the
# less secure rsh, the command rcp has to be entered. Both need a
# passwordless ssh/rsh connection to the hosts which should be connected to.
# (mandatory for CSP installation mode)
COPY_COMMAND="scp"

# Enter your default domain if you are using /etc/hosts or NIS configuration
DEFAULT_DOMAIN="none"

# If a job stops, fails, or finishes, a mail can be sent to this address
ADMIN_MAIL="none"

# If true, the rc scripts (sgemaster, sgeexecd, sgebdb) will be added
# to start automatically during boot time
ADD_TO_RC="true"

# If this is "true", the file permissions of executables will be set to 755
# and of ordinary files to 644.
SET_FILE_PERMS="true"

# This option is not implemented yet.
# When an exec host is uninstalled, the running jobs will be rescheduled.
RESCHEDULE_JOBS="wait"

# Enter one of the three distributed scheduler tuning configuration sets
# (1=normal, 2=high, 3=max)
SCHEDD_CONF="1"

# The name of the shadow host. This host must have read/write permission
# to the qmaster spool directory.
# If you want to set up a shadow host, you must enter the server name.
# (mandatory for shadow host installation)
SHADOW_HOST=""

# Remove these execution hosts in automatic mode
# (mandatory for uninstallation of execution hosts)
EXEC_HOST_LIST_RM=""

# This option is used for startup script removal.
# If true, all rc startup scripts will be removed during automatic
# deinstallation. If false, the scripts won't be touched.
# (mandatory for uninstallation of execution/qmaster hosts)
REMOVE_RC="true"

# This is a Windows-specific part of the auto-installation template.
# If you are going to install Windows execution hosts, you have to enable
# Windows support by setting the WINDOWS_SUPPORT variable to "true"
# ("false" is disabled).
# (mandatory for qmaster installation; WINDOWS_SUPPORT is disabled by default)
WINDOWS_SUPPORT="false"

# Enabling WINDOWS_SUPPORT requires the following parameter.
# WIN_ADMIN_NAME will be added to the list of SGE managers.
# Without adding WIN_ADMIN_NAME, the execution host installation
# won't install correctly.
# WIN_ADMIN_NAME is set to "Administrator", which is the default on most
# Windows systems. In some cases WIN_ADMIN_NAME can be prefixed with
# the Windows domain name (e.g. DOMAIN+Administrator).
# (mandatory for qmaster installation if Windows hosts should be installed)
WIN_ADMIN_NAME="Administrator"

# This parameter is used to switch between a local ADMINUSER and a Windows
# domain admin user. If WIN_DOMAIN_ACCESS is set to true, the admin user
# will be a Windows domain user. It is recommended that a Windows domain
# server is configured and the Windows domain user is created beforehand.
# If set to false, the local admin user will be used as ADMINUSER.
# The install script tries to create this user account, but we recommend
# creating this user before running the installation, as it is safer.
# (mandatory for qmaster installation if Windows hosts should be installed)
WIN_DOMAIN_ACCESS="false"

# This section is used for CSP installation mode.
# CSP_RECREATE recreates the certs on each installation if true.
# If false, the certs will be created only if they do not exist;
# existing certs won't be overwritten.
# (mandatory for CSP install)
CSP_RECREATE="true"

# The created certs won't be copied if this option is set to false.
# If true, the script tries to copy the generated certs. This requires
# passwordless ssh/rsh access for user root to the execution hosts.
CSP_COPY_CERTS="false"

# CSP information: your country code (only 2 characters)
# (mandatory for CSP install)
CSP_COUNTRY_CODE="CA"

# Your state (mandatory for CSP install)
CSP_STATE="Ontario"

# Your location, e.g. the building (mandatory for CSP install)
CSP_LOCATION="Faculty of Pharmacy"

# Your organisation (mandatory for CSP install)
CSP_ORGA="University of Toronto"

# Your organisation unit (mandatory for CSP install)
CSP_ORGA_UNIT="Shoichet Lab"

# Your email (mandatory for CSP install)
CSP_MAIL_ADDRESS="admin@bkslab.org"
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

 vim /tmp/gridengine-install.conf    # change EXEC_HOST_LIST="" to EXEC_HOST_LIST="$HOSTNAME"
 cd /usr/share/gridengine/
 ./inst_sge -x -s -auto /tmp/gridengine-install.conf > /tmp/gridengine.log
 cat /tmp/gridengine.log | tee -a /root/gridengine-install.log
 if [ -e ${SGE_CELL} ]; then mv -v ${SGE_CELL} ${SGE_CELL}.local; fi
 ln -vs /nfs/gridengine/${SGE_CELL} /usr/share/gridengine/${SGE_CELL}
 rm -vf /etc/sysconfig/gridengine
 echo "SGE_ROOT=${SGE_ROOT}" >> /etc/sysconfig/gridengine
 echo "SGE_CELL=${SGE_CELL}" >> /etc/sysconfig/gridengine
 mkdir -pv /var/spool/gridengine/`hostname -s`
 chown -Rv sgeadmin:sgeadmin /var/spool/gridengine
 chkconfig --levels=345 sge_execd on
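With SGE_ROOT and SGE_CELL exported as above, the two echo lines leave /etc/sysconfig/gridengine containing:

 SGE_ROOT=/usr/share/gridengine
 SGE_CELL=bkslab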
Then go to the sgemaster host and do this:

 qconf -ae     # in the editor, change the hostname from "template" to the name of the new exec host
 qconf -as <hostname>
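For reference, the template that qconf -ae opens looks roughly like this in GE 6.2 (the field values below are illustrative defaults, not taken from this cluster):

 hostname              template
 load_scaling          NONE
 complex_values        NONE
 user_lists            NONE
 xuser_lists           NONE
 projects              NONE
 xprojects             NONE
 usage_scaling         NONE
 report_variables      NONE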

HOW TO EDIT THE NUMBER OF SLOTS FOR AN EXEC_HOST:

qconf -mattr exechost complex_values slots=32 raiders.c.uoft.bkslab.org
"complex_values" of "exechost" is empty - Adding new element(s). 
root@pan.slot-27.rack-1.pharmacy.cluster.uoft.bkslab.org modified "raiders.c.uoft.bkslab.org" in exechost list

HOW TO ADD A HOSTGROUP:

 qconf -ahgrp @custom 

ADD THE EXECHOST TO A HOSTGROUP:

 qconf -mhgrp @custom
 service sgemaster restart
 # Then back on the exec_host:
 service sge_execd start
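The qconf -mhgrp editor presents a two-field definition (the same format qconf -shgrp prints); a sketch of what a populated @custom group looks like, with illustrative hostnames:

 group_name @custom
 hostlist node-1.slot-27.rack-1.pharmacy.cluster.uoft.bkslab.org \
          node-2.slot-27.rack-1.pharmacy.cluster.uoft.bkslab.org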

To suspend a job:

qmod -sj job_number

To unsuspend it:

qmod -usj job_number

To delete nodes I did the following:

qconf -shgrpl  -> To see a list of host groups
qconf -shgrp @HOST_GROUP_NAME  -> For each host group to see if the nodes you want to delete are listed

If it is listed then:

qconf -mhgrp @HOST_GROUP_NAME -> Modify this file (delete the line with the node you want to delete).

Once you've deleted the node you want to delete from all the hostgroups:

qconf -de node_you_want_to_delete >/dev/null
qmod -de node_you_want_to_delete
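The "check every hostgroup" step above can be scripted. A sketch of the membership test: sample_shgrp is a canned stand-in for `qconf -shgrp @physical` output (hostnames illustrative), so the grep logic can run without a live qmaster; in real use, replace it with the qconf call inside a loop over `qconf -shgrpl`.

```shell
# Stand-in for `qconf -shgrp @physical`; replace with the real qconf call.
sample_shgrp() {
cat <<'EOF'
group_name @physical
hostlist node-26.rack-2.pharmacy.cluster.uoft.bkslab.org \
        node-1.rack-3.pharmacy.cluster.uoft.bkslab.org
EOF
}

NODE="node-1.rack-3.pharmacy.cluster.uoft.bkslab.org"
# grep -q sets the exit status without printing; the if reports membership.
if sample_shgrp | grep -q "$NODE"; then
    echo "found"
else
    echo "not found"
fi
```

This prints "found" for the sample group, meaning the node must be removed from that group before it can be deleted.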

To alter the priority on all the jobs for a user:

qstat -u user | cut -d ' ' -f2 >> some_file

Edit some_file and delete the first couple lines (the header lines)

for OUTPUT in `cat some_file`; do qalter -p 1022 $OUTPUT; done
Priorities range from -1023 to 1024.
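The header-trimming step can also be done inline with awk instead of hand-editing a file. A sketch: sample_qstat is a canned stand-in for `qstat -u <user>` output (job IDs and fields are illustrative), and the qalter call is echoed rather than run so the loop is demonstrable without a cluster.

```shell
# Stand-in for `qstat -u <user>`; replace with the real qstat call.
sample_qstat() {
cat <<'EOF'
job-ID  prior   name     user   state submit/start at     queue slots
---------------------------------------------------------------------
    101 0.55500 run_dock alice  r     01/20/2014 10:00:00 all.q 1
    102 0.55500 run_dock alice  qw    01/20/2014 10:05:00       1
EOF
}

# NR > 2 skips qstat's two header lines; $1 is the job-ID column.
# In real use, drop the echo so qalter actually runs.
sample_qstat | awk 'NR > 2 {print $1}' | while read -r JOB; do
    echo qalter -p 1022 "$JOB"
done
```

For the sample above this prints one qalter line per job (IDs 101 and 102).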

DEBUGGING SGE:

qstat -explain a
for HOSTGROUP in `qconf -shgrpl`; do for HOSTLIST in `qconf -shgrp $HOSTGROUP`; do  echo $HOSTLIST; done; done | grep node-1.slot-27.rack-2.pharmacy.cluster.uoft.bkslab.org

Look at the logs for both master and exec (raiders:/var/spool/gridengine/raiders/messages and pan:/var/spool/gridengine/bkslab/qmaster/messages)

Make sure resolv.conf looks like this:

nameserver 142.150.250.10
nameserver 10.10.16.64
search cluster.uoft.bkslab.org uoft.bkslab.org bkslab.org                                                  
[root@pan ~]# for X in $`qconf -shgrpl`; do qconf -shgrp $X; done;
Host group "$@24-core" does not exist

(The stray $ before the backticks is what produced the bogus "$@24-core" group name; the loop should read for X in `qconf -shgrpl`; do qconf -shgrp $X; done)
group_name @64-core
hostlist node-26.rack-2.pharmacy.cluster.uoft.bkslab.org
group_name @8-core
hostlist node-2.slot-27.rack-1.pharmacy.cluster.uoft.bkslab.org \
        node-1.slot-27.rack-1.pharmacy.cluster.uoft.bkslab.org
group_name @allhosts
hostlist @physical @virtual
group_name @physical
hostlist node-26.rack-2.pharmacy.cluster.uoft.bkslab.org
group_name @virtual
hostlist node-2.slot-27.rack-1.pharmacy.cluster.uoft.bkslab.org \
        node-1.slot-27.rack-1.pharmacy.cluster.uoft.bkslab.org

1) In one screen I would run strace qstat -f, and in the other screen ps -ax | grep qstat to get the PID, then ls -l /proc/<pid>/fd/.
I did this because every time I ran strace qstat -f it would get stuck saying this:

poll([{fd=3, events=POLLIN|POLLPRI}], 1, 1000) = 0 (Timeout)
gettimeofday({1390262563, 742705}, NULL) = 0
gettimeofday({1390262563, 742741}, NULL) = 0
gettimeofday({1390262563, 742771}, NULL) = 0
gettimeofday({1390262563, 742801}, NULL) = 0
gettimeofday({1390262563, 742828}, NULL) = 0
gettimeofday({1390262563, 742855}, NULL) = 0
gettimeofday({1390262563, 742881}, NULL) = 0
gettimeofday({1390262563, 742909}, NULL) = 0

and then eventually it would say this:

poll([{fd=3, events=POLLIN|POLLPRI}], 1, 1000) = 1 ([{fd=3, revents=POLLIN}])
gettimeofday({1390262563, 960292}, NULL) = 0
gettimeofday({1390262563, 960321}, NULL) = 0
gettimeofday({1390262563, 960349}, NULL) = 0
read(3, "<gmsh><dl>99</dl></gms", 22)   = 22
read(3, "h", 1)                         = 1
read(3, ">", 1)                         = 1
read(3, "<mih version=\"0.1\"><mid>2</mid><"..., 99) = 99
read(3, "<ccrm version=\"0.1\"></ccrm>", 27) = 27
gettimeofday({1390262563, 960547}, NULL) = 0
gettimeofday({1390262563, 960681}, NULL) = 0
gettimeofday({1390262563, 960709}, NULL) = 0
gettimeofday({1390262563, 960741}, NULL) = 0
gettimeofday({1390262563, 960769}, NULL) = 0
gettimeofday({1390262563, 960797}, NULL) = 0
gettimeofday({1390262563, 960823}, NULL) = 0
shutdown(3, 2 /* send and receive */)   = 0
close(3)                                = 0
gettimeofday({1390262563, 961009}, NULL) = 0
gettimeofday({1390262563, 961036}, NULL) = 0
gettimeofday({1390262563, 961064}, NULL) = 0
gettimeofday({1390262563, 961093}, NULL) = 0
gettimeofday({1390262563, 961120}, NULL) = 0
gettimeofday({1390262563, 961148}, NULL) = 0 

The weird thing about this is that when I ran ls -l /proc/<pid>/fd/ there was never a file descriptor "3".

2) I tried to delete the nodes that we moved to SF by doing the following:

qconf -dattr @physical "node-1.rack-3.pharmacy.cluster.uoft.bkslab.org node-10.rack-3.pharmacy.cluster.uoft.bkslab.org node-11.rack-3.pharmacy.cluster.uoft.bkslab.org node-12.rack-3.pharmacy.cluster.uoft.bkslab.org  node-13.rack-3.pharmacy.cluster.uoft.bkslab.org node-14.rack-3.pharmacy.cluster.uoft.bkslab.org node-15.rack-3.pharmacy.cluster.uoft.bkslab.org node-2.rack-3.pharmacy.cluster.uoft.bkslab.org node-26.rack-3.pharmacy.cluster.uoft.bkslab.org node-27.rack-3.pharmacy.cluster.uoft.bkslab.org node-29.rack-3.pharmacy.cluster.uoft.bkslab.org node-3.rack-3.pharmacy.cluster.uoft.bkslab.org node-4.rack-3.pharmacy.cluster.uoft.bkslab.org node-5.rack-3.pharmacy.cluster.uoft.bkslab.org node-6.rack-3.pharmacy.cluster.uoft.bkslab.org node-7.rack-3.pharmacy.cluster.uoft.bkslab.org node-8.rack-3.pharmacy.cluster.uoft.bkslab.org node-9.rack-3.pharmacy.cluster.uoft.bkslab.org" node-1.rack-3.pharmacy.cluster.uoft.bkslab.org @physical > /dev/null

I would get the error:

Modification of object "@physical" not supported

(qconf -dattr expects the object type as its first argument, so the expected form is qconf -dattr hostgroup hostlist <hostname> @physical.)

3) I tried to see the queue's complex attributes by typing qconf -sc and saw this:

#name   shortcut  type  relop  requestable  consumable  default  urgency
slots   s         INT   <=     YES          YES         1        1000

I am not quite sure what urgency = 1000 means; all other names had "0" under urgency. (In SGE's urgency policy, a resource's urgency value is multiplied by the amount requested and added to the job's dispatch urgency, so slots at 1000 makes multi-slot jobs schedule sooner.)

4) I tried qmod -cq '*' to clear the error state of all the queues. It would tell me this:

Queue instance "all.q@node-1.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-1.slot-27.rack-1.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-1.slot-27.rack-2.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-10.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-11.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-12.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-13.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-14.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-15.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-2.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-2.slot-27.rack-1.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-2.slot-27.rack-2.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-26.rack-2.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-26.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-27.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-29.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-3.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-3.slot-27.rack-2.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-4.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-4.slot-27.rack-2.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-5.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-5.slot-27.rack-2.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-6.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-6.slot-27.rack-2.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-7.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-7.slot-27.rack-2.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-8.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-9.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error


5) I tried deleting a node like this instead:

qconf -ds node-1.rack-3.pharmacy.cluster.uoft.bkslab.org

But when I typed qconf -sel it was still there. (qconf -ds removes a submit host; execution hosts, which qconf -sel lists, are deleted with qconf -de.)

6) I tried to see what the hostlist for @physical was by typing qconf -ahgrp @physical. It said: group_name @physical, hostlist NONE. Then I typed qconf -shgrpl to see a list of all hostgroups and tried qconf -ahgrp on each of them; all said the hostlist was NONE. (That is expected: qconf -ahgrp adds a new hostgroup, so it opens a blank template rather than showing the existing one; qconf -shgrp @physical shows the real hostlist.) But when I tried qconf -ahgrp @allhosts I got this message:

denied: "root" must be manager for this operation
error: commlib error: got select error (Connection reset by peer)

7) I looked at the messages in the file: /var/spool/gridengine/bkslab/qmaster/messages and it said this (over and over again):

01/20/2014 19:41:35|listen|pan|E|commlib error: got read error (closing "pan.slot-27.rack-1.pharmacy.cluster.uoft.bkslab.org/qconf/2")
01/20/2014 19:43:24|  main|pan|W|local configuration pan.slot-27.rack-1.pharmacy.cluster.uoft.bkslab.org not defined - using global configuration
01/20/2014 19:43:24|  main|pan|W|can't resolve host name "node-3-3.rack-3.pharmacy.cluster.uoft.bkslab.org": undefined commlib error code
01/20/2014 19:43:24|  main|pan|W|can't resolve host name "node-3-4.rack-3.pharmacy.cluster.uoft.bkslab.org": undefined commlib error code
01/20/2014 19:43:53|  main|pan|I|read job database with 468604 entries in 29 seconds
01/20/2014 19:43:55|  main|pan|I|qmaster hard descriptor limit is set to 8192
01/20/2014 19:43:55|  main|pan|I|qmaster soft descriptor limit is set to 8192
01/20/2014 19:43:55|  main|pan|I|qmaster will use max. 8172 file descriptors for communication
01/20/2014 19:43:55|  main|pan|I|qmaster will accept max. 99 dynamic event clients
01/20/2014 19:43:55|  main|pan|I|starting up GE 6.2u5p3 (lx26-amd64)

8) Periodically I would get this error:

ERROR: failed receiving gdi request response for mid=3 (got no message).

9) I also tried deleting the PID in the file /var/spool/gridengine/bkslab/qmaster/qmaster.pid. That didn't do anything; it eventually just got replaced with a different number. It's weird because it's not even the right PID. For example, the real PID was 8286 while the PID in the file was 8203:

[root@pan qmaster]# service sgemaster start
Starting sge_qmaster:                                      [  OK  ]
[root@pan qmaster]# ps -ax |grep sge
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.8/FAQ
8286 ?        Rl     0:03 /usr/bin/sge_qmaster
8301 pts/0    S+     0:00 grep sge
[root@pan qmaster]# cat qmaster.pid 
8203

10) When I typed tail /var/log/messages I saw this:

Jan 20 14:25:05 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)
Jan 20 14:27:05 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)
Jan 20 14:29:05 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)
Jan 20 14:31:05 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)
Jan 20 14:33:06 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)
Jan 20 14:35:06 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)
Jan 20 14:36:29 pan kernel: Registering the id_resolver key type
Jan 20 14:36:29 pan kernel: FS-Cache: Netfs 'nfs' registered for caching
Jan 20 14:36:29 pan nfsidmap[2536]: nss_getpwnam: name 'root@rack-1.pharmacy.cluster.uoft.bkslab.org' does not map into domain 'uoft.bkslab.org'
Jan 20 14:37:06 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)

This was what happened when I restarted the machine.

Sun Grid Engine Commands

To disable all queue instances on a host:

qmod -d '*@<hostname>'

To view jobs running on a host:

qhost -h <hostname> -j