Sun Grid Engine (SGE)

Latest revision as of 20:56, 23 January 2017

ALL ABOUT SGE (SUN GRID ENGINE)

To add an exec node:

 yum -y install gridengine gridengine-execd
 export SGE_ROOT=/usr/share/gridengine
 export SGE_CELL=bkslab
 cp -v /nfs/init/gridengine/install.conf /tmp/gridengine-install.conf
+++++++++++++++++++++++++++++++++++++++++++++++++++++
#-------------------------------------------------
# SGE default configuration file
#------------------------------------------------- 
# Use always fully qualified pathnames, please 
# SGE_ROOT Path, this is basic information
# (mandatory for qmaster and execd installation)
SGE_ROOT="/usr/share/gridengine"

# SGE_QMASTER_PORT is used by qmaster for communication
# Please enter the port in this way: 1300
# Please do not this: 1300/tcp
# (mandatory for qmaster installation)
SGE_QMASTER_PORT=6444

# SGE_EXECD_PORT is used by execd for communication
# Please enter the port in this way: 1300
# Please do not this: 1300/tcp
# (mandatory for qmaster installation)
SGE_EXECD_PORT=6445

# SGE_ENABLE_SMF
# if set to false SMF will not control SGE services
SGE_ENABLE_SMF="false"

# SGE_ENABLE_ST
# if set to false Sun Service Tags will not be used
SGE_ENABLE_ST="true"

# SGE_CLUSTER_NAME
# Name of this cluster (used by SMF as an service instance name)
SGE_CLUSTER_NAME="bkslab"

# SGE_JMX_PORT is used by qmasters JMX MBean server
# mandatory if install_qmaster -jmx -auto <cfgfile>
# range: 1024-65500
SGE_JMX_PORT="6446"

# SGE_JMX_SSL is used by qmasters JMX MBean server
# if SGE_JMX_SSL=true, the mbean server connection uses
# SSL authentication
SGE_JMX_SSL="true"

# SGE_JMX_SSL_CLIENT is used by qmasters JMX MBean server
# if SGE_JMX_SSL_CLIENT=true, the mbean server connection uses
# SSL authentication of the client in addition
SGE_JMX_SSL_CLIENT="true"

# SGE_JMX_SSL_KEYSTORE is used by qmasters JMX MBean server
# if SGE_JMX_SSL=true the server keystore found here is used
# e.g. /var/sgeCA/port<sge_qmaster_port>/<sge_cell>/private/keystore
SGE_JMX_SSL_KEYSTORE="/var/sgeCA/sge_qmaster/bkslab/private/keystore"

# SGE_JMX_SSL_KEYSTORE_PW is used by qmasters JMX MBean server
# password for the SGE_JMX_SSL_KEYSTORE file
SGE_JMX_SSL_KEYSTORE_PW="secret"

# SGE_JVM_LIB_PATH is used by qmasters jvm thread
# path to libjvm.so
# if value is missing or set to "none" JMX thread will not be installed
# when the value is empty or path does not exit on the system, Grid Engine
# will try to find a correct value, if it cannot do so, value is set to
# "jvmlib_missing" and JMX thread will be configured but will fail to start
SGE_JVM_LIB_PATH="none"

# SGE_ADDITIONAL_JVM_ARGS is used by qmasters jvm thread
# jvm specific arguments as -verbose:jni etc.
# optional, can be empty
SGE_ADDITIONAL_JVM_ARGS="-Xmx256m"

# CELL_NAME, will be a dir in SGE_ROOT, contains the common dir
# Please enter only the name of the cell. No path, please
# (mandatory for qmaster and execd installation)
CELL_NAME="bkslab"

# ADMIN_USER, if you want to use a different admin user than the owner
# of SGE_ROOT, you have to enter the user name here
# Leaving this blank, the owner of the SGE_ROOT dir will be used as admin user
ADMIN_USER=""

# The dir, where qmaster spools this parts, which are not spooled by DB
# (mandatory for qmaster installation)
QMASTER_SPOOL_DIR="/var/spool/gridengine/bkslab/qmaster"

# The dir, where the execd spools (active jobs)
# This entry is needed, even if your are going to use
# berkeley db spooling. Only cluster configuration and jobs will
# be spooled in the database. The execution daemon still needs a spool
# directory
# (mandatory for qmaster installation)
EXECD_SPOOL_DIR="/var/spool/gridengine"

# For monitoring and accounting of jobs, every job will get
# unique GID. So you have to enter a free GID Range, which
# is assigned to each job running on a machine.
# If you want to run 100 Jobs at the same time on one host you
# have to enter a GID-Range like that: 16000-16100
# (mandatory for qmaster installation)
GID_RANGE="16000-16100"

# If SGE is compiled with -spool-dynamic, you have to enter here, which
# spooling method should be used. (classic or berkeleydb)
# (mandatory for qmaster installation)
SPOOLING_METHOD="berkeleydb"

# Name of the Server, where the Spooling DB is running on
# if spooling methode is berkeleydb, it must be "none", when
# using no spooling server and it must contain the servername
# if a server should be used. In case of "classic" spooling,
# can be left out
DB_SPOOLING_SERVER="none"

# The dir, where the DB spools
# If berkeley db spooling is used, it must contain the path to
# the spooling db. Please enter the full path. (eg. /tmp/data/spooldb)
# Remember, this directory must be local on the qmaster host or on the
# Berkeley DB Server host. No NFS mount, please
DB_SPOOLING_DIR="/var/spool/gridengine/bkslab/spooldb"

# This parameter set the number of parallel installation processes.
# The prevent a system overload, or exeeding the number of open file
# descriptors the user can limit the number of parallel install processes.
# eg. set PAR_EXECD_INST_COUNT="20", maximum 20 parallel execd are installed.
PAR_EXECD_INST_COUNT="20"

# A List of Host which should become admin hosts
# If you do not enter any host here, you have to add all of your hosts
# by hand, after the installation. The autoinstallation works without
# any entry
ADMIN_HOST_LIST=""

# A List of Host which should become submit hosts
# If you do not enter any host here, you have to add all of your hosts
# by hand, after the installation. The autoinstallation works without
# any entry
SUBMIT_HOST_LIST=""

# A List of Host which should become exec hosts
# If you do not enter any host here, you have to add all of your hosts
# by hand, after the installation. The autoinstallation works without
# any entry
# (mandatory for execution host installation)
EXEC_HOST_LIST=""

# The dir, where the execd spools (local configuration)
# If you want configure your execution daemons to spool in
# a local directory, you have to enter this directory here.
# If you do not want to configure a local execution host spool directory
# please leave this empty
EXECD_SPOOL_DIR_LOCAL="/var/spool/gridengine"

# If true, the domainnames will be ignored, during the hostname resolving
# if false, the fully qualified domain name will be used for name resolving
HOSTNAME_RESOLVING="false"

# Shell, which should be used for remote installation (rsh/ssh)
# This is only supported, if your hosts and rshd/sshd is configured,
# not to ask for a password, or promting any message.
SHELL_NAME="ssh"

# This remote copy command is used for csp installation.
# The script needs the remote copy command for distributing
# the csp certificates. Using ssl the command scp has to be entered,
# using the not so secure rsh the command rcp has to be entered.
# Both need a passwordless ssh/rsh connection to the hosts, which
# should be connected to. (mandatory for csp installation mode)
COPY_COMMAND="scp"

# Enter your default domain, if you are using /etc/hosts or NIS configuration
DEFAULT_DOMAIN="none"

# If a job stops, fails, finish, you can send a mail to this adress
ADMIN_MAIL="none"

# If true, the rc scripts (sgemaster, sgeexecd, sgebdb) will be added,
# to start automatically during boottime
ADD_TO_RC="true"

# If this is "true" the file permissions of executables will be set to 755
# and of ordenary file to 644.
SET_FILE_PERMS="true"

# This option is not implemented, yet.
# When a exechost should be uninstalled, the running jobs will be rescheduled
RESCHEDULE_JOBS="wait"

# Enter a one of the three distributed scheduler tuning configuration sets
# (1=normal, 2=high, 3=max)
SCHEDD_CONF="1"

# The name of the shadow host. This host must have read/write permission
# to the qmaster spool directory
# If you want to setup a shadow host, you must enter the servername
# (mandatory for shadowhost installation)
SHADOW_HOST=""

# Remove this execution hosts in automatic mode
# (mandatory for unistallation of execution hosts)
EXEC_HOST_LIST_RM=""

# This option is used for startup script removing.
# If true, all rc startup scripts will be removed during
# automatic deinstallation. If false, the scripts won't
# be touched.
# (mandatory for unistallation of execution/qmaster hosts)
REMOVE_RC="true"

# This is a Windows specific part of the auto isntallation template
# If you going to install windows executions hosts, you have to enable the
# windows support. To do this, please set the WINDOWS_SUPPORT variable
# to "true". ("false" is disabled)
# (mandatory for qmaster installation, by default WINDOWS_SUPPORT is
# disabled)
WINDOWS_SUPPORT="false"

# Enabling the WINDOWS_SUPPORT, recommends the following parameter.
# The WIN_ADMIN_NAME will be added to the list of SGE managers.
# Without adding the WIN_ADMIN_NAME the execution host installation
# won't install correctly.
# WIN_ADMIN_NAME is set to "Administrator" which is default on most
# Windows systems. In some cases the WIN_ADMIN_NAME can be prefixed with
# the windows domain name (eg. DOMAIN+Administrator)
# (mandatory for qmaster installation, if windows hosts should be installed)
WIN_ADMIN_NAME="Administrator"

# This parameter is used to switch between local ADMINUSER and Windows
# Domain Adminuser. Setting the WIN_DOMAIN_ACCESS variable to true, the
# Adminuser will be a Windows Domain User. It is recommended that
# a Windows Domain Server is configured and the Windows Domain User is
# created. Setting this variable to false, the local Adminuser will be
# used as ADMINUSER. The install script tries to create this user account
# but we recommend, because it will be saver, to create this user,
# before running the installation.
# (mandatory for qmaster installation, if windows hosts should be installed)
WIN_DOMAIN_ACCESS="false"

# This section is used for csp installation mode.
# CSP_RECREATE recreates the certs on each installtion, if true.
# In case of false, the certs will be created, if not existing.
# Existing certs won't be overwritten. (mandatory for csp install)
CSP_RECREATE="true"

# The created certs won't be copied, if this option is set to false
# If true, the script tries to copy the generated certs. This
# requires passwordless ssh/rsh access for user root to the
# execution hosts
CSP_COPY_CERTS="false"

# csp information, your country code (only 2 characters)
# (mandatory for csp install)
CSP_COUNTRY_CODE="CA"

# your state (mandatory for csp install)
CSP_STATE="Ontario"

# your location, eg. the building (mandatory for csp install)
CSP_LOCATION="Faculty of Pharmacy"

# your arganisation (mandatory for csp install)
CSP_ORGA="University of Toronto"

# your organisation unit (mandatory for csp install)
CSP_ORGA_UNIT="Shoichet Lab"

# your email (mandatory for csp install)
CSP_MAIL_ADDRESS="admin@bkslab.org"
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 vim /tmp/gridengine-install.conf   # change EXEC_HOST_LIST="" to EXEC_HOST_LIST="$HOSTNAME"
 cd /usr/share/gridengine/
 ./inst_sge -x -s -auto /tmp/gridengine-install.conf > /tmp/gridengine.log
 cat /tmp/gridengine.log | tee -a /root/gridengine-install.log
 if [ -e ${SGE_CELL} ]; then mv -v ${SGE_CELL} ${SGE_CELL}.local; fi
 ln -vs /nfs/gridengine/${SGE_CELL} /usr/share/gridengine/${SGE_CELL}
 rm -vf /etc/sysconfig/gridengine
 echo "SGE_ROOT=${SGE_ROOT}" >> /etc/sysconfig/gridengine
 echo "SGE_CELL=${SGE_CELL}" >> /etc/sysconfig/gridengine
 mkdir -pv /var/spool/gridengine/`hostname -s`
 chown -Rv sgeadmin:sgeadmin /var/spool/gridengine
 chkconfig --levels=345 sge_execd on
On the SGE master, register the new host:

 qconf -ae                        # in the editor, change the hostname from "template" to the hostname of the new exec host
 qconf -as <hostname_of_new_exec>
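A quick sanity check after registering the host (standard SGE query commands; substitute the new host's actual name):

```
qconf -sel            # list execution hosts; the new one should appear
qconf -ss             # list submit hosts
qhost -h <hostname>   # qmaster should now report load and slots for it
```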

HOW TO EDIT THE NUMBER OF SLOTS FOR AN EXEC HOST:

qconf -mattr exechost complex_values slots=32 raiders.c.uoft.bkslab.org
"complex_values" of "exechost" is empty - Adding new element(s). 
root@pan.slot-27.rack-1.pharmacy.cluster.uoft.bkslab.org modified "raiders.c.uoft.bkslab.org" in exechost list

HOW TO ADD A HOSTGROUP:

 qconf -ahgrp @custom 

ADD THE EXECHOST TO A HOSTGROUP:

 qconf -mhgrp @custom
 service sgemaster restart
 # Then back on the exec_host:
 service sge_execd start

To suspend a job:

qmod -sj job_number
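The matching unsuspend flag, for when the job should continue (standard qmod usage):

```
qmod -usj job_number
```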

To delete nodes I did the following:

qconf -shgrpl  -> To see a list of host groups
qconf -shgrp @HOST_GROUP_NAME  -> For each host group to see if the nodes you want to delete are listed

If it is listed then:

qconf -mhgrp @HOST_GROUP_NAME -> Modify this file (delete the line with the node you want to delete).

Once you've deleted the node you want to delete from all the hostgroups:

qconf -de node_you_want_to_delete >/dev/null
qmod -de node_you_want_to_delete
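The hostgroup check above can be wrapped in one loop. This is a sketch: qconf is stubbed out below so the loop can be tried anywhere, and the hostnames are invented; on the qmaster, delete the stub so the real qconf is called.

```shell
# Stub standing in for SGE's qconf; remove this on a real cluster.
qconf() {
    case "$1" in
        -shgrpl) printf '@physical\n@virtual\n' ;;
        -shgrp)  printf 'group_name %s\nhostlist node-1.example.org\n' "$2" ;;
    esac
}

# Report every hostgroup that still references the node.
NODE=node-1.example.org
for HG in $(qconf -shgrpl); do
    if qconf -shgrp "$HG" | grep -q "$NODE"; then
        echo "$NODE is still listed in $HG"
    fi
done
```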

To alter the priority on all the jobs for a user:

qstat -u user | cut -d ' ' -f2 >> some_file

Edit some_file and delete the first couple lines (the header lines)

for OUTPUT in `cat some_file`; do qalter -p 1022 $OUTPUT; done
Priorities range from -1023 to 1024.
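The same loop can be written without the temporary file, using awk to skip the two header lines (cut -d ' ' is fragile because qstat pads its columns with a variable number of spaces). This is a sketch: qstat_sample and the output inside it are invented stand-ins for a real `qstat -u some_user`, and the echo keeps it from changing anything.

```shell
# Stand-in for `qstat -u some_user`; replace with the real command on the cluster.
qstat_sample() {
cat <<'EOF'
job-ID  prior   name     user      state submit/start at     queue  slots
--------------------------------------------------------------------------
 101001 0.55500 dock_run some_user r     01/20/2014 19:43:55 all.q  1
 101002 0.55500 dock_run some_user qw    01/20/2014 19:44:01        1
EOF
}

# NR>2 skips the two header lines; $1 is the job-ID column.
for JOB in $(qstat_sample | awk 'NR>2 {print $1}'); do
    echo qalter -p 1022 "$JOB"    # drop the echo to actually change priorities
done
```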

DEBUGGING SGE:

qstat -f -explain a  -> Show why queue instances are in an alarm state
for HOSTGROUP in `qconf -shgrpl`; do for HOSTLIST in `qconf -shgrp $HOSTGROUP`; do  echo $HOSTLIST; done; done | grep node-1.slot-27.rack-2.pharmacy.cluster.uoft.bkslab.org

Look at the logs for both master and exec (raiders:/var/spool/gridengine/raiders/messages and pan:/var/spool/gridengine/bkslab/qmaster/messages)

Make sure resolv.conf looks like this:

nameserver 142.150.250.10
nameserver 10.10.16.64
search cluster.uoft.bkslab.org uoft.bkslab.org bkslab.org                                                  
[root@pan ~]# for X in $`qconf -shgrpl`; do qconf -shgrp $X; done;
Host group "$@24-core" does not exist
group_name @64-core
hostlist node-26.rack-2.pharmacy.cluster.uoft.bkslab.org
group_name @8-core
hostlist node-2.slot-27.rack-1.pharmacy.cluster.uoft.bkslab.org \
        node-1.slot-27.rack-1.pharmacy.cluster.uoft.bkslab.org
group_name @allhosts
hostlist @physical @virtual
group_name @physical
hostlist node-26.rack-2.pharmacy.cluster.uoft.bkslab.org
group_name @virtual
hostlist node-2.slot-27.rack-1.pharmacy.cluster.uoft.bkslab.org \
        node-1.slot-27.rack-1.pharmacy.cluster.uoft.bkslab.org
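The "$@24-core" error in that transcript is a quoting bug, not a missing hostgroup: the stray $ before the backtick is kept as a literal dollar sign and glued onto the first word of the command's output. Reduced to plain shell (groups_list is a made-up stand-in for `qconf -shgrpl`):

```shell
# Stand-in for `qconf -shgrpl`.
groups_list() { printf '@24-core\n@64-core\n'; }

# The bug: "$" before the backtick survives literally, so the first
# word the loop sees is "$@24-core" -- hence the qconf error.
for X in $`groups_list`; do echo "buggy loop sees: $X"; done

# Corrected form, no stray dollar sign:
for X in `groups_list`; do echo "ok: $X"; done
```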

1) In one screen I ran strace qstat -f, and in another screen I ran ps -ax | grep qstat to get the PID, then ls -l /proc/<pid>/fd/.
I did this because every time I ran strace qstat -f it would get stuck, printing:

poll([{fd=3, events=POLLIN|POLLPRI}], 1, 1000) = 0 (Timeout)
gettimeofday({1390262563, 742705}, NULL) = 0
gettimeofday({1390262563, 742741}, NULL) = 0
gettimeofday({1390262563, 742771}, NULL) = 0
gettimeofday({1390262563, 742801}, NULL) = 0
gettimeofday({1390262563, 742828}, NULL) = 0
gettimeofday({1390262563, 742855}, NULL) = 0
gettimeofday({1390262563, 742881}, NULL) = 0
gettimeofday({1390262563, 742909}, NULL) = 0

and then eventually it would say this:

poll([{fd=3, events=POLLIN|POLLPRI}], 1, 1000) = 1 ([{fd=3, revents=POLLIN}])
gettimeofday({1390262563, 960292}, NULL) = 0
gettimeofday({1390262563, 960321}, NULL) = 0
gettimeofday({1390262563, 960349}, NULL) = 0
read(3, "<gmsh><dl>99</dl></gms", 22)   = 22
read(3, "h", 1)                         = 1
read(3, ">", 1)                         = 1
read(3, "<mih version=\"0.1\"><mid>2</mid><"..., 99) = 99
read(3, "<ccrm version=\"0.1\"></ccrm>", 27) = 27
gettimeofday({1390262563, 960547}, NULL) = 0
gettimeofday({1390262563, 960681}, NULL) = 0
gettimeofday({1390262563, 960709}, NULL) = 0
gettimeofday({1390262563, 960741}, NULL) = 0
gettimeofday({1390262563, 960769}, NULL) = 0
gettimeofday({1390262563, 960797}, NULL) = 0
gettimeofday({1390262563, 960823}, NULL) = 0
shutdown(3, 2 /* send and receive */)   = 0
close(3)                                = 0
gettimeofday({1390262563, 961009}, NULL) = 0
gettimeofday({1390262563, 961036}, NULL) = 0
gettimeofday({1390262563, 961064}, NULL) = 0
gettimeofday({1390262563, 961093}, NULL) = 0
gettimeofday({1390262563, 961120}, NULL) = 0
gettimeofday({1390262563, 961148}, NULL) = 0 

The weird thing about this is that when I ran ls -l /proc/<pid>/fd/ there was never a file descriptor "3".

2) I tried to delete the nodes that we moved to SF by doing the following:

qconf -dattr @physical "node-1.rack-3.pharmacy.cluster.uoft.bkslab.org node-10.rack-3.pharmacy.cluster.uoft.bkslab.org node-11.rack-3.pharmacy.cluster.uoft.bkslab.org node-12.rack-3.pharmacy.cluster.uoft.bkslab.org  node-13.rack-3.pharmacy.cluster.uoft.bkslab.org node-14.rack-3.pharmacy.cluster.uoft.bkslab.org node-15.rack-3.pharmacy.cluster.uoft.bkslab.org node-2.rack-3.pharmacy.cluster.uoft.bkslab.org node-26.rack-3.pharmacy.cluster.uoft.bkslab.org node-27.rack-3.pharmacy.cluster.uoft.bkslab.org node-29.rack-3.pharmacy.cluster.uoft.bkslab.org node-3.rack-3.pharmacy.cluster.uoft.bkslab.org node-4.rack-3.pharmacy.cluster.uoft.bkslab.org node-5.rack-3.pharmacy.cluster.uoft.bkslab.org node-6.rack-3.pharmacy.cluster.uoft.bkslab.org node-7.rack-3.pharmacy.cluster.uoft.bkslab.org node-8.rack-3.pharmacy.cluster.uoft.bkslab.org node-9.rack-3.pharmacy.cluster.uoft.bkslab.org" node-1.rack-3.pharmacy.cluster.uoft.bkslab.org @physical > /dev/null

I would get the error:

Modification of object "@physical" not supported
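The error is because the first argument to -dattr must be an object type, not the hostgroup's name. Going by the qconf man page, the form below (object type, attribute, value, then the object to modify) should remove one host from the group; untested here:

```
qconf -dattr hostgroup hostlist node-1.rack-3.pharmacy.cluster.uoft.bkslab.org @physical
```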

3) I tried to see the queue's complex attributes by typing qconf -sc and saw this:

#name    shortcut  type  relop  requestable  consumable  default  urgency
slots    s         INT   <=     YES          YES         1        1000

I am not quite sure what urgency = 1000 means. All other names had "0" under urgency.
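For what it's worth, the urgency column feeds the scheduler's resource-urgency term: a job's urgency contribution is the complex's urgency value times the amount of that resource the job requests, so slots at 1000 (with everything else at 0) is the stock default and simply makes jobs requesting more slots more urgent. The values can be edited via the complex configuration:

```
qconf -mc    # opens the complex (attribute) configuration in an editor
```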

4) I tried qmod -cq '*' to clear the error state of all the queues. It would tell me this:

Queue instance "all.q@node-1.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-1.slot-27.rack-1.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-1.slot-27.rack-2.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-10.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-11.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-12.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-13.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-14.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-15.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-2.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-2.slot-27.rack-1.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-2.slot-27.rack-2.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-26.rack-2.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-26.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-27.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-29.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-3.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-3.slot-27.rack-2.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-4.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-4.slot-27.rack-2.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-5.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-5.slot-27.rack-2.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-6.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-6.slot-27.rack-2.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-7.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-7.slot-27.rack-2.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-8.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error
Queue instance "all.q@node-9.rack-3.pharmacy.cluster.uoft.bkslab.org" is already in the specified state: no error


5) I tried deleting a node like this instead:

qconf -ds node-1.rack-3.pharmacy.cluster.uoft.bkslab.org

But when I typed qconf -sel it was still there.

6) I tried to see what the hostlist for @physical was by typing qconf -ahgrp @physical. It said: group_name @physical, hostlist NONE. Then I typed qconf -shgrpl to see a list of all hostgroups and tried qconf -ahgrp on each; all of them said the hostlist was NONE. (In hindsight, -ahgrp adds a new hostgroup and opens a blank template, which is why every hostlist showed NONE; -shgrp is the flag that shows an existing group's hostlist.) But when I tried qconf -ahgrp @allhosts I got this message:

denied: "root" must be manager for this operation
error: commlib error: got select error (Connection reset by peer)

7) I looked at the messages in the file: /var/spool/gridengine/bkslab/qmaster/messages and it said this (over and over again):

01/20/2014 19:41:35|listen|pan|E|commlib error: got read error (closing "pan.slot-27.rack-1.pharmacy.cluster.uoft.bkslab.org/qconf/2")
01/20/2014 19:43:24|  main|pan|W|local configuration pan.slot-27.rack-1.pharmacy.cluster.uoft.bkslab.org not defined - using global configuration
01/20/2014 19:43:24|  main|pan|W|can't resolve host name "node-3-3.rack-3.pharmacy.cluster.uoft.bkslab.org": undefined commlib error code
01/20/2014 19:43:24|  main|pan|W|can't resolve host name "node-3-4.rack-3.pharmacy.cluster.uoft.bkslab.org": undefined commlib error code
01/20/2014 19:43:53|  main|pan|I|read job database with 468604 entries in 29 seconds
01/20/2014 19:43:55|  main|pan|I|qmaster hard descriptor limit is set to 8192
01/20/2014 19:43:55|  main|pan|I|qmaster soft descriptor limit is set to 8192
01/20/2014 19:43:55|  main|pan|I|qmaster will use max. 8172 file descriptors for communication
01/20/2014 19:43:55|  main|pan|I|qmaster will accept max. 99 dynamic event clients
01/20/2014 19:43:55|  main|pan|I|starting up GE 6.2u5p3 (lx26-amd64)

8) Periodically I would get this error:

ERROR: failed receiving gdi request response for mid=3 (got no message).

9) I also tried deleting the PID in the file /var/spool/gridengine/bkslab/qmaster/qmaster.pid. That didn't do anything; the file was eventually just rewritten with a different number. Strangely, it was not even the right PID: the real PID was 8286 while the file contained 8203:

[root@pan qmaster]# service sgemaster start
Starting sge_qmaster:                                      [  OK  ]
[root@pan qmaster]# ps -ax |grep sge
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.8/FAQ
8286 ?        Rl     0:03 /usr/bin/sge_qmaster
8301 pts/0    S+     0:00 grep sge
[root@pan qmaster]# cat qmaster.pid 
8203

10) When I typed tail /var/log/messages I saw this:

Jan 20 14:25:05 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)
Jan 20 14:27:05 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)
Jan 20 14:29:05 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)
Jan 20 14:31:05 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)
Jan 20 14:33:06 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)
Jan 20 14:35:06 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)
Jan 20 14:36:29 pan kernel: Registering the id_resolver key type
Jan 20 14:36:29 pan kernel: FS-Cache: Netfs 'nfs' registered for caching
Jan 20 14:36:29 pan nfsidmap[2536]: nss_getpwnam: name 'root@rack-1.pharmacy.cluster.uoft.bkslab.org' does not map into domain 'uoft.bkslab.org'
Jan 20 14:37:06 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)

This was what happened when I restarted the machine.

Sun Grid Engine Commands

To disable a host from queue:

qmod -d '*@<hostname>'

To view jobs running on host queue:

qhost -h <hostname> -j 

External Links

Add/Remove Administrative, Execution, Submit Hosts: http://gridscheduler.sourceforge.net/howto/commontasks.html