Sun Grid Engine (SGE)

Latest revision as of 20:56, 23 January 2017

ALL ABOUT SGE (SUN GRID ENGINE)

To add an exec node:

 yum -y install gridengine gridengine-execd
 export SGE_ROOT=/usr/share/gridengine
 export SGE_CELL=bkslab
 cp -v /nfs/init/gridengine/install.conf /tmp/gridengine-install.conf
+++++++++++++++++++++++++++++++++++++++++++++++++++++
#-------------------------------------------------
# SGE default configuration file
#------------------------------------------------- 
# Use always fully qualified pathnames, please 
# SGE_ROOT Path, this is basic information
# (mandatory for qmaster and execd installation)
SGE_ROOT="/usr/share/gridengine"

# SGE_QMASTER_PORT is used by qmaster for communication
# Please enter the port in this way: 1300
# Please do not this: 1300/tcp
# (mandatory for qmaster installation)
SGE_QMASTER_PORT=6444

# SGE_EXECD_PORT is used by execd for communication
# Please enter the port in this way: 1300
# Please do not this: 1300/tcp
# (mandatory for qmaster installation)
SGE_EXECD_PORT=6445

# SGE_ENABLE_SMF
# if set to false SMF will not control SGE services
SGE_ENABLE_SMF="false"

# SGE_ENABLE_ST
# if set to false Sun Service Tags will not be used
SGE_ENABLE_ST="true"

# SGE_CLUSTER_NAME
# Name of this cluster (used by SMF as an service instance name)
SGE_CLUSTER_NAME="bkslab"

# SGE_JMX_PORT is used by qmasters JMX MBean server
# mandatory if install_qmaster -jmx -auto <cfgfile>
# range: 1024-65500
SGE_JMX_PORT="6446"

# SGE_JMX_SSL is used by qmasters JMX MBean server
# if SGE_JMX_SSL=true, the mbean server connection uses
# SSL authentication
SGE_JMX_SSL="true"

# SGE_JMX_SSL_CLIENT is used by qmasters JMX MBean server
# if SGE_JMX_SSL_CLIENT=true, the mbean server connection uses
# SSL authentication of the client in addition
SGE_JMX_SSL_CLIENT="true"

# SGE_JMX_SSL_KEYSTORE is used by qmasters JMX MBean server
# if SGE_JMX_SSL=true the server keystore found here is used
# e.g. /var/sgeCA/port<sge_qmaster_port>/<sge_cell>/private/keystore
SGE_JMX_SSL_KEYSTORE="/var/sgeCA/sge_qmaster/bkslab/private/keystore"

# SGE_JMX_SSL_KEYSTORE_PW is used by qmasters JMX MBean server
# password for the SGE_JMX_SSL_KEYSTORE file
SGE_JMX_SSL_KEYSTORE_PW="secret"

# SGE_JVM_LIB_PATH is used by qmasters jvm thread
# path to libjvm.so
# if value is missing or set to "none" JMX thread will not be installed
# when the value is empty or path does not exit on the system, Grid Engine
# will try to find a correct value, if it cannot do so, value is set to
# "jvmlib_missing" and JMX thread will be configured but will fail to start
SGE_JVM_LIB_PATH="none"

# SGE_ADDITIONAL_JVM_ARGS is used by qmasters jvm thread
# jvm specific arguments as -verbose:jni etc.
# optional, can be empty
SGE_ADDITIONAL_JVM_ARGS="-Xmx256m"

# CELL_NAME, will be a dir in SGE_ROOT, contains the common dir
# Please enter only the name of the cell. No path, please
# (mandatory for qmaster and execd installation)
CELL_NAME="bkslab"

# ADMIN_USER, if you want to use a different admin user than the owner
# of SGE_ROOT, you have to enter the user name here
# Leaving this blank, the owner of the SGE_ROOT dir will be used as admin user
ADMIN_USER=""

# The dir, where qmaster spools this parts, which are not spooled by DB
# (mandatory for qmaster installation)
QMASTER_SPOOL_DIR="/var/spool/gridengine/bkslab/qmaster"

# The dir, where the execd spools (active jobs)
# This entry is needed, even if your are going to use
# berkeley db spooling. Only cluster configuration and jobs will
# be spooled in the database. The execution daemon still needs a spool
# directory
# (mandatory for qmaster installation)
EXECD_SPOOL_DIR="/var/spool/gridengine"

# For monitoring and accounting of jobs, every job will get
# unique GID. So you have to enter a free GID Range, which
# is assigned to each job running on a machine.
# If you want to run 100 Jobs at the same time on one host you
# have to enter a GID-Range like that: 16000-16100
# (mandatory for qmaster installation)
GID_RANGE="16000-16100"

# If SGE is compiled with -spool-dynamic, you have to enter here, which
# spooling method should be used. (classic or berkeleydb)
# (mandatory for qmaster installation)
SPOOLING_METHOD="berkeleydb"

# Name of the Server, where the Spooling DB is running on
# if spooling methode is berkeleydb, it must be "none", when
# using no spooling server and it must contain the servername
# if a server should be used. In case of "classic" spooling,
# can be left out
DB_SPOOLING_SERVER="none"

# The dir, where the DB spools
# If berkeley db spooling is used, it must contain the path to
# the spooling db. Please enter the full path. (eg. /tmp/data/spooldb)
# Remember, this directory must be local on the qmaster host or on the
# Berkeley DB Server host. No NFS mount, please
DB_SPOOLING_DIR="/var/spool/gridengine/bkslab/spooldb"

# This parameter set the number of parallel installation processes.
# The prevent a system overload, or exeeding the number of open file
# descriptors the user can limit the number of parallel install processes.
# eg. set PAR_EXECD_INST_COUNT="20", maximum 20 parallel execd are installed.
PAR_EXECD_INST_COUNT="20"

# A List of Host which should become admin hosts
# If you do not enter any host here, you have to add all of your hosts
# by hand, after the installation. The autoinstallation works without
# any entry
ADMIN_HOST_LIST=""

# A List of Host which should become submit hosts
# If you do not enter any host here, you have to add all of your hosts
# by hand, after the installation. The autoinstallation works without
# any entry
SUBMIT_HOST_LIST=""

# A List of Host which should become exec hosts
# If you do not enter any host here, you have to add all of your hosts
# by hand, after the installation. The autoinstallation works without
# any entry
# (mandatory for execution host installation)
EXEC_HOST_LIST=""

# The dir, where the execd spools (local configuration)
# If you want configure your execution daemons to spool in
# a local directory, you have to enter this directory here.
# If you do not want to configure a local execution host spool directory
# please leave this empty
EXECD_SPOOL_DIR_LOCAL="/var/spool/gridengine"

# If true, the domainnames will be ignored, during the hostname resolving
# if false, the fully qualified domain name will be used for name resolving
HOSTNAME_RESOLVING="false"

# Shell, which should be used for remote installation (rsh/ssh)
# This is only supported, if your hosts and rshd/sshd is configured,
# not to ask for a password, or promting any message.
SHELL_NAME="ssh"

# This remote copy command is used for csp installation.
# The script needs the remote copy command for distributing
# the csp certificates. Using ssl the command scp has to be entered,
# using the not so secure rsh the command rcp has to be entered.
# Both need a passwordless ssh/rsh connection to the hosts, which
# should be connected to. (mandatory for csp installation mode)
COPY_COMMAND="scp"

# Enter your default domain, if you are using /etc/hosts or NIS configuration
DEFAULT_DOMAIN="none"

# If a job stops, fails, finish, you can send a mail to this adress
ADMIN_MAIL="none"

# If true, the rc scripts (sgemaster, sgeexecd, sgebdb) will be added,
# to start automatically during boottime
ADD_TO_RC="true"

# If this is "true" the file permissions of executables will be set to 755
# and of ordenary file to 644.
SET_FILE_PERMS="true"

# This option is not implemented, yet.
# When a exechost should be uninstalled, the running jobs will be rescheduled
RESCHEDULE_JOBS="wait"

# Enter a one of the three distributed scheduler tuning configuration sets
# (1=normal, 2=high, 3=max)
SCHEDD_CONF="1"

# The name of the shadow host. This host must have read/write permission
# to the qmaster spool directory
# If you want to setup a shadow host, you must enter the servername
# (mandatory for shadowhost installation)
SHADOW_HOST=""

# Remove this execution hosts in automatic mode
# (mandatory for unistallation of execution hosts)
EXEC_HOST_LIST_RM=""

# This option is used for startup script removing.
# If true, all rc startup scripts will be removed during
# automatic deinstallation. If false, the scripts won't
# be touched.
# (mandatory for unistallation of execution/qmaster hosts)
REMOVE_RC="true"

# This is a Windows specific part of the auto isntallation template
# If you going to install windows executions hosts, you have to enable the
# windows support. To do this, please set the WINDOWS_SUPPORT variable
# to "true". ("false" is disabled)
# (mandatory for qmaster installation, by default WINDOWS_SUPPORT is
# disabled)
WINDOWS_SUPPORT="false"

# Enabling the WINDOWS_SUPPORT, recommends the following parameter.
# The WIN_ADMIN_NAME will be added to the list of SGE managers.
# Without adding the WIN_ADMIN_NAME the execution host installation
# won't install correctly.
# WIN_ADMIN_NAME is set to "Administrator" which is default on most
# Windows systems. In some cases the WIN_ADMIN_NAME can be prefixed with
# the windows domain name (eg. DOMAIN+Administrator)
# (mandatory for qmaster installation, if windows hosts should be installed)
WIN_ADMIN_NAME="Administrator"

# This parameter is used to switch between local ADMINUSER and Windows
# Domain Adminuser. Setting the WIN_DOMAIN_ACCESS variable to true, the
# Adminuser will be a Windows Domain User. It is recommended that
# a Windows Domain Server is configured and the Windows Domain User is
# created. Setting this variable to false, the local Adminuser will be
# used as ADMINUSER. The install script tries to create this user account
# but we recommend, because it will be saver, to create this user,
# before running the installation.
# (mandatory for qmaster installation, if windows hosts should be installed)
WIN_DOMAIN_ACCESS="false"

# This section is used for csp installation mode.
# CSP_RECREATE recreates the certs on each installtion, if true.
# In case of false, the certs will be created, if not existing.
# Existing certs won't be overwritten. (mandatory for csp install)
CSP_RECREATE="true"

# The created certs won't be copied, if this option is set to false
# If true, the script tries to copy the generated certs. This
# requires passwordless ssh/rsh access for user root to the
# execution hosts
CSP_COPY_CERTS="false"

# csp information, your country code (only 2 characters)
# (mandatory for csp install)
CSP_COUNTRY_CODE="CA"

# your state (mandatory for csp install)
CSP_STATE="Ontario"

# your location, eg. the building (mandatory for csp install)
CSP_LOCATION="Faculty of Pharmacy"

# your arganisation (mandatory for csp install)
CSP_ORGA="University of Toronto"

# your organisation unit (mandatory for csp install)
CSP_ORGA_UNIT="Shoichet Lab"

# your email (mandatory for csp install)
CSP_MAIL_ADDRESS="admin@bkslab.org"
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 vim /tmp/gridengine-install.conf   # change EXEC_HOST_LIST="" to EXEC_HOST_LIST="$HOSTNAME"
 cd /usr/share/gridengine/
 ./inst_sge -x -s -auto /tmp/gridengine-install.conf > /tmp/gridengine.log
 cat /tmp/gridengine.log | tee -a /root/gridengine-install.log
 if [ -e ${SGE_CELL} ]; then mv -v ${SGE_CELL} ${SGE_CELL}.local; fi
 ln -vs /nfs/gridengine/${SGE_CELL} /usr/share/gridengine/${SGE_CELL}
 rm -vf /etc/sysconfig/gridengine
 echo "SGE_ROOT=${SGE_ROOT}" >> /etc/sysconfig/gridengine
 echo "SGE_CELL=${SGE_CELL}" >> /etc/sysconfig/gridengine
 mkdir -pv /var/spool/gridengine/`hostname -s`
 chown -Rv sgeadmin:sgeadmin /var/spool/gridengine
 chkconfig --levels=345 sge_execd on
On the SGE master, register the new host:

 qconf -ae                        # in the editor, change the hostname from "template" to the hostname of the new exec host
 qconf -as <hostname_of_new_exec>
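A quick sanity check after registering the host (standard SGE query commands; substitute the new host's actual name):

```
qconf -sel            # list execution hosts; the new one should appear
qconf -ss             # list submit hosts
qhost -h <hostname>   # qmaster should now report load and slots for it
```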

HOW TO EDIT THE NUMBER OF SLOTS FOR AN EXEC HOST:

qconf -mattr exechost complex_values slots=32 raiders.c.uoft.bkslab.org
"complex_values" of "exechost" is empty - Adding new element(s). 
root@pan.slot-27.rack-1.pharmacy.cluster.uoft.bkslab.org modified "raiders.c.uoft.bkslab.org" in exechost list

HOW TO ADD A HOSTGROUP:

 qconf -ahgrp @custom 

ADD THE EXECHOST TO A HOSTGROUP:

 qconf -mhgrp @custom
 service sgemaster restart
 # Then back on the exec_host:
 service sge_execd start

To suspend a job:

qmod -sj job_number
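The matching unsuspend flag, for when the job should continue (standard qmod usage):

```
qmod -usj job_number
```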

To delete nodes I did the following:

qconf -shgrpl  -> To see a list of host groups
qconf -shgrp @HOST_GROUP_NAME  -> For each host group to see if the nodes you want to delete are listed

If it is listed then:

qconf -mhgrp @HOST_GROUP_NAME -> Modify this file (delete the line with the node you want to delete).

Once you've deleted the node you want to delete from all the hostgroups:

qconf -de node_you_want_to_delete >/dev/null
qmod -de node_you_want_to_delete
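The hostgroup check above can be wrapped in one loop. This is a sketch: qconf is stubbed out below so the loop can be tried anywhere, and the hostnames are invented; on the qmaster, delete the stub so the real qconf is called.

```shell
# Stub standing in for SGE's qconf; remove this on a real cluster.
qconf() {
    case "$1" in
        -shgrpl) printf '@physical\n@virtual\n' ;;
        -shgrp)  printf 'group_name %s\nhostlist node-1.example.org\n' "$2" ;;
    esac
}

# Report every hostgroup that still references the node.
NODE=node-1.example.org
for HG in $(qconf -shgrpl); do
    if qconf -shgrp "$HG" | grep -q "$NODE"; then
        echo "$NODE is still listed in $HG"
    fi
done
```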

To alter the priority on all the jobs for a user:

qstat -u user | cut -d ' ' -f2 >> some_file

Edit some_file and delete the first couple lines (the header lines)

for OUTPUT in `cat some_file`; do qalter -p 1022 $OUTPUT; done
Priorities range from -1023 to 1024.
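The same loop can be written without the temporary file, using awk to skip the two header lines (cut -d ' ' is fragile because qstat pads its columns with a variable number of spaces). This is a sketch: qstat_sample and the output inside it are invented stand-ins for a real `qstat -u some_user`, and the echo keeps it from changing anything.

```shell
# Stand-in for `qstat -u some_user`; replace with the real command on the cluster.
qstat_sample() {
cat <<'EOF'
job-ID  prior   name     user      state submit/start at     queue  slots
--------------------------------------------------------------------------
 101001 0.55500 dock_run some_user r     01/20/2014 19:43:55 all.q  1
 101002 0.55500 dock_run some_user qw    01/20/2014 19:44:01        1
EOF
}

# NR>2 skips the two header lines; $1 is the job-ID column.
for JOB in $(qstat_sample | awk 'NR>2 {print $1}'); do
    echo qalter -p 1022 "$JOB"    # drop the echo to actually change priorities
done
```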

DEBUGGING SGE:

qstat -f -explain a  -> Show why queue instances are in an alarm state
for HOSTGROUP in `qconf -shgrpl`; do for HOSTLIST in `qconf -shgrp $HOSTGROUP`; do  echo $HOSTLIST; done; done | grep node-1.slot-27.rack-2.pharmacy.cluster.uoft.bkslab.org

Look at the logs for both master and exec (raiders:/var/spool/gridengine/raiders/messages and pan:/var/spool/gridengine/bkslab/qmaster/messages)

Make sure resolv.conf looks like this:

nameserver 142.150.250.10
nameserver 10.10.16.64
search cluster.uoft.bkslab.org uoft.bkslab.org bkslab.org                                                  
[root@pan ~]# for X in $`qconf -shgrpl`; do qconf -shgrp $X; done;
Host group "$@24-core" does not exist
group_name @64-core
hostlist node-26.rack-2.pharmacy.cluster.uoft.bkslab.org
group_name @8-core
hostlist node-2.slot-27.rack-1.pharmacy.cluster.uoft.bkslab.org \
        node-1.slot-27.rack-1.pharmacy.cluster.uoft.bkslab.org
group_name @allhosts
hostlist @physical @virtual
group_name @physical
hostlist node-26.rack-2.pharmacy.cluster.uoft.bkslab.org
group_name @virtual
hostlist node-2.slot-27.rack-1.pharmacy.cluster.uoft.bkslab.org \
        node-1.slot-27.rack-1.pharmacy.cluster.uoft.bkslab.org
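The "$@24-core" error in that transcript is a quoting bug, not a missing hostgroup: the stray $ before the backtick is kept as a literal dollar sign and glued onto the first word of the command's output. Reduced to plain shell (groups_list is a made-up stand-in for `qconf -shgrpl`):

```shell
# Stand-in for `qconf -shgrpl`.
groups_list() { printf '@24-core\n@64-core\n'; }

# The bug: "$" before the backtick survives literally, so the first
# word the loop sees is "$@24-core" -- hence the qconf error.
for X in $`groups_list`; do echo "buggy loop sees: $X"; done

# Corrected form, no stray dollar sign:
for X in `groups_list`; do echo "ok: $X"; done
```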

1) In one screen I ran strace qstat -f, and in another screen I ran ps -ax | grep qstat to get the PID, then ls -l /proc/<pid>/fd/.
I did this because every time I ran strace qstat -f it would get stuck, printing:

poll([{fd=3, events=POLLIN|POLLPRI}], 1, 1000) = 0 (Timeout)
gettimeofday({1390262563, 742705}, NULL) = 0
gettimeofday({1390262563, 742741}, NULL) = 0
gettimeofday({1390262563, 742771}, NULL) = 0
gettimeofday({1390262563, 742801}, NULL) = 0
gettimeofday({1390262563, 742828}, NULL) = 0
gettimeofday({1390262563, 742855}, NULL) = 0
gettimeofday({1390262563, 742881}, NULL) = 0
gettimeofday({1390262563, 742909}, NULL) = 0

and then eventually it would say this:

poll([{fd=3, events=POLLIN|POLLPRI}], 1, 1000) = 1 ([{fd=3, revents=POLLIN}])
gettimeofday({1390262563, 960292}, NULL) = 0
gettimeofday({1390262563, 960321}, NULL) = 0
gettimeofday({1390262563, 960349}, NULL) = 0
read(3, "<gmsh><dl>99</dl></gms", 22)   = 22
read(3, "h", 1)                         = 1
read(3, ">", 1)                         = 1
read(3, "<mih version=\"0.1\"><mid>2</mid><"..., 99) = 99
read(3, "<ccrm version=\"0.1\"></ccrm>", 27) = 27
gettimeofday({1390262563, 960547}, NULL) = 0
gettimeofday({1390262563, 960681}, NULL) = 0
gettimeofday({1390262563, 960709}, NULL) = 0
gettimeofday({1390262563, 960741}, NULL) = 0
gettimeofday({1390262563, 960769}, NULL) = 0
gettimeofday({1390262563, 960797}, NULL) = 0
gettimeofday({1390262563, 960823}, NULL) = 0
shutdown(3, 2 /* send and receive */)   = 0
close(3)                                = 0
gettimeofday({1390262563, 961009}, NULL) = 0
gettimeofday({1390262563, 961036}, NULL) = 0
gettimeofday({1390262563, 961064}, NULL) = 0
gettimeofday({1390262563, 961093}, NULL) = 0
gettimeofday({1390262563, 961120}, NULL) = 0
gettimeofday({1390262563, 961148}, NULL) = 0 

The weird thing about this is that when I ran ls -l /proc/<pid>/fd/ there was never a file descriptor "3".

2) I tried to delete the nodes that we moved to SF by doing the following:

qconf -dattr @physical "node-1.rack-3.pharmacy.cluster.uoft.bkslab.org node-10.rack-3.pharmacy.cluster.uoft.bkslab.org node-11.rack-3.pharmacy.cluster.uoft.bkslab.org node-12.rack-3.pharmacy.cluster.uoft.bkslab.org  node-13.rack-3.pharmacy.cluster.uoft.bkslab.org node-14.rack-3.pharmacy.cluster.uoft.bkslab.org node-15.rack-3.pharmacy.cluster.uoft.bkslab.org node-2.rack-3.pharmacy.cluster.uoft.bkslab.org node-26.rack-3.pharmacy.cluster.uoft.bkslab.org node-27.rack-3.pharmacy.cluster.uoft.bkslab.org node-29.rack-3.pharmacy.cluster.uoft.bkslab.org node-3.rack-3.pharmacy.cluster.uoft.bkslab.org node-4.rack-3.pharmacy.cluster.uoft.bkslab.org node-5.rack-3.pharmacy.cluster.uoft.bkslab.org node-6.rack-3.pharmacy.cluster.uoft.bkslab.org node-7.rack-3.pharmacy.cluster.uoft.bkslab.org node-8.rack-3.pharmacy.cluster.uoft.bkslab.org node-9.rack-3.pharmacy.cluster.uoft.bkslab.org" node-1.rack-3.pharmacy.cluster.uoft.bkslab.org @physical > /dev/null

I would get the error:

Modification of object "@physical" not supported
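The error is because the first argument to -dattr must be an object type, not the hostgroup's name. Going by the qconf man page, the form below (object type, attribute, value, then the object to modify) should remove one host from the group; untested here:

```
qconf -dattr hostgroup hostlist node-1.rack-3.pharmacy.cluster.uoft.bkslab.org @physical
```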

3) I tried to see the queue's complex attributes by typing qconf -sc and saw this:

#name    shortcut  type  relop  requestable  consumable  default  urgency
slots    s         INT   <=     YES          YES         1        1000

I am not quite sure what urgency = 1000 means. All other names had "0" under urgency.
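For what it's worth, the urgency column feeds the scheduler's resource-urgency term: a job's urgency contribution is the complex's urgency value times the amount of that resource the job requests, so slots at 1000 (with everything else at 0) is the stock default and simply makes jobs requesting more slots more urgent. The values can be edited via the complex configuration:

```
qconf -mc    # opens the complex (attribute) configuration in an editor
```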

4) I tried qmod -cq '*' to clear the error state of all the queues. It would tell me this:

Queue instance "all.q@node-1.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-1.slot-27.rack-1.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-1.slot-27.rack-2.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-10.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-11.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-12.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-13.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-14.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-15.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-2.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-2.slot-27.rack-1.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-2.slot-27.rack-2.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-26.rack-2.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-26.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-27.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-29.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-3.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-3.slot-27.rack-2.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-4.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-4.slot-27.rack-2.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-5.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-5.slot-27.rack-2.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-6.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-6.slot-27.rack-2.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-7.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-7.slot-27.rack-2.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-8.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-9.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error


5) I tried deleting a node like this instead:

qconf -ds node-1.rack-3.pharmacy.cluster.uoft.bkslab.org

But when I typed qconf -sel it was still there.

6) I tried to see what the hostlist for @physical was by typing qconf -ahgrp @physical. It said: group_name @physical, hostlist NONE. Then I typed qconf -shgrpl to see a list of all hostgroups and tried qconf -ahgrp on each; all of them said the hostlist was NONE. (In hindsight, -ahgrp adds a new hostgroup and opens a blank template, which is why every hostlist showed NONE; -shgrp is the flag that shows an existing group's hostlist.) But when I tried qconf -ahgrp @allhosts I got this message:

denied: "root" must be manager for this operation
error: commlib error: got select error (Connection reset by peer)

7) I looked at the messages in the file: /var/spool/gridengine/bkslab/qmaster/messages and it said this (over and over again):

01/20/2014 19:41:35|listen|pan|E|commlib error: got read error (closing "pan.slot-27.rack-1.pharmacy.cluster.uoft.bkslab.org/qconf/2")
01/20/2014 19:43:24|  main|pan|W|local configuration pan.slot-27.rack-1.pharmacy.cluster.uoft.bkslab.org not defined - using global configuration
01/20/2014 19:43:24|  main|pan|W|can't resolve host name "node-3-3.rack-3.pharmacy.cluster.uoft.bkslab.org": undefined commlib error code
01/20/2014 19:43:24|  main|pan|W|can't resolve host name "node-3-4.rack-3.pharmacy.cluster.uoft.bkslab.org": undefined commlib error code
01/20/2014 19:43:53|  main|pan|I|read job database with 468604 entries in 29 seconds
01/20/2014 19:43:55|  main|pan|I|qmaster hard descriptor limit is set to 8192
01/20/2014 19:43:55|  main|pan|I|qmaster soft descriptor limit is set to 8192
01/20/2014 19:43:55|  main|pan|I|qmaster will use max. 8172 file descriptors for communication
01/20/2014 19:43:55|  main|pan|I|qmaster will accept max. 99 dynamic event clients
01/20/2014 19:43:55|  main|pan|I|starting up GE 6.2u5p3 (lx26-amd64)

8) Periodically I would get this error:

ERROR: failed receiving gdi request response for mid=3 (got no message).

9) I also tried deleting the PID in the file /var/spool/gridengine/bkslab/qmaster/qmaster.pid. That didn't do anything; the file was eventually just rewritten with a different number. Strangely, it was not even the right PID: the real PID was 8286 while the file contained 8203:

[root@pan qmaster]# service sgemaster start
Starting sge_qmaster:                                      [  OK  ]
[root@pan qmaster]# ps -ax |grep sge
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.8/FAQ
8286 ?        Rl     0:03 /usr/bin/sge_qmaster
8301 pts/0    S+     0:00 grep sge
[root@pan qmaster]# cat qmaster.pid 
8203

10) When I typed tail /var/log/messages I saw this:

Jan 20 14:25:05 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)
Jan 20 14:27:05 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)
Jan 20 14:29:05 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)
Jan 20 14:31:05 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)
Jan 20 14:33:06 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)
Jan 20 14:35:06 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)
Jan 20 14:36:29 pan kernel: Registering the id_resolver key type
Jan 20 14:36:29 pan kernel: FS-Cache: Netfs 'nfs' registered for caching
Jan 20 14:36:29 pan nfsidmap[2536]: nss_getpwnam: name 'root@rack-1.pharmacy.cluster.uoft.bkslab.org' does not map into domain 'uoft.bkslab.org'
Jan 20 14:37:06 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)

This was what happened when I restarted the machine.

Sun Grid Engine Commands

To disable a host from queue:

qmod -d '*@<hostname>'

To view jobs running on host queue:

qhost -h <hostname> -j 

External Links

Add/Remove Administrative, Execution, Submit Hosts: http://gridscheduler.sourceforge.net/howto/commontasks.html