<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>http://wiki.docking.org/index.php?action=history&amp;feed=atom&amp;title=SGE_notes</id>
	<title>SGE notes - Revision history</title>
	<link rel="self" type="application/atom+xml" href="http://wiki.docking.org/index.php?action=history&amp;feed=atom&amp;title=SGE_notes"/>
	<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=SGE_notes&amp;action=history"/>
	<updated>2026-04-06T07:42:24Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.39.1</generator>
	<entry>
		<id>http://wiki.docking.org/index.php?title=SGE_notes&amp;diff=9386&amp;oldid=prev</id>
		<title>Teague Sterling at 22:02, 16 May 2016</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=SGE_notes&amp;diff=9386&amp;oldid=prev"/>
		<updated>2016-05-16T22:02:55Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 22:02, 16 May 2016&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l59&quot;&gt;Line 59:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 59:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;qconf -de node_you_want_to_delete &amp;gt;/dev/null&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;qconf -de node_you_want_to_delete &amp;gt;/dev/null&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;qmod -de node_you_want_to_delete&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;qmod -de node_you_want_to_delete&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;A more formal node removal pipeline (as BASH):&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;    for HG in $( qconf -shgrpl ) ; do&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;        qconf -dattr hostgroup hostlist NODE_NAME_HERE $HG&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;    done&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;    qconf -purge queue slots *.q@NODE_NAME_HERE (or all.q)&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;    qconf -ds NODE_NAME_HERE&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;    qconf -dconf NODE_NAME_HERE&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;    qconf -de NODE_NAME_HERE&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;To alter the priority on all the jobs for a user:&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;To alter the priority on all the jobs for a user:&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;!-- diff cache key wikidb:diff::1.12:old-7359:rev-9386 --&gt;
&lt;/table&gt;</summary>
		<author><name>Teague Sterling</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=SGE_notes&amp;diff=7359&amp;oldid=prev</id>
		<title>Frodo: Created page with &quot;ALL ABOUT SGE (SUN GRID ENGINE)  obviously this needs to be edited.... &#039;&#039;&#039;domain&#039;&#039;&#039; must be replaced by the domain throughout...  &lt;pre&gt; To add an exec node:   yum -y install g...&quot;</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=SGE_notes&amp;diff=7359&amp;oldid=prev"/>
		<updated>2014-03-19T18:19:35Z</updated>

		<summary type="html">&lt;p&gt;Created page with &amp;quot;ALL ABOUT SGE (SUN GRID ENGINE)  obviously this needs to be edited.... &amp;#039;&amp;#039;&amp;#039;domain&amp;#039;&amp;#039;&amp;#039; must be replaced by the domain throughout...  &amp;lt;pre&amp;gt; To add an exec node:   yum -y install g...&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;ALL ABOUT SGE (SUN GRID ENGINE)&lt;br /&gt;
&lt;br /&gt;
obviously this needs to be edited....&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;domain&amp;#039;&amp;#039;&amp;#039; must be replaced by the domain throughout...&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
To add an exec node:&lt;br /&gt;
  yum -y install gridengine gridengine-execd&lt;br /&gt;
  export SGE_ROOT=/usr/share/gridengine&lt;br /&gt;
  export SGE_CELL=bkslab&lt;br /&gt;
  cp -v /nfs/init/gridengine/install.conf /tmp/gridengine-install.conf&lt;br /&gt;
  vim /tmp/gridengine-install.conf   -&amp;gt; CHANGE EXEC_HOST_LIST=&amp;quot; &amp;quot; TO EXEC_HOST_LIST=&amp;quot;$HOSTNAME&amp;quot;&lt;br /&gt;
  cd /usr/share/gridengine/&lt;br /&gt;
  ./inst_sge -x -s -auto /tmp/gridengine-install.conf &amp;gt; /tmp/gridengine.log&lt;br /&gt;
  cat /tmp/gridengine.log | tee -a /root/gridengine-install.log&lt;br /&gt;
  if [ -e ${SGE_CELL} ]; then mv -v ${SGE_CELL} ${SGE_CELL}.local; fi&lt;br /&gt;
  ln -vs /nfs/gridengine/${SGE_CELL} /usr/share/gridengine/${SGE_CELL}&lt;br /&gt;
  rm -vf /etc/sysconfig/gridengine&lt;br /&gt;
  echo &amp;quot;SGE_ROOT=${SGE_ROOT}&amp;quot; &amp;gt;&amp;gt; /etc/sysconfig/gridengine&lt;br /&gt;
  echo &amp;quot;SGE_CELL=${SGE_CELL}&amp;quot; &amp;gt;&amp;gt; /etc/sysconfig/gridengine&lt;br /&gt;
  mkdir -pv /var/spool/gridengine/`hostname -s`&lt;br /&gt;
  chown -Rv sgeadmin:sgeadmin /var/spool/gridengine&lt;br /&gt;
  chkconfig --levels=345 sge_execd on&lt;br /&gt;
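&lt;br /&gt;
The manual vim edit of EXEC_HOST_LIST in the steps above could also be done non-interactively. A minimal sketch, assuming GNU sed and a single EXEC_HOST_LIST line in the file; set_exec_host_list is a hypothetical helper name, not part of gridengine:&lt;br /&gt;

```shell
#!/bin/sh
# Hypothetical helper: rewrite the EXEC_HOST_LIST line of an install config.
# Assumptions: GNU sed (for -i), and a host name that contains no "/" characters.
#   $1 = path to the config file, $2 = host name to set
set_exec_host_list() {
    sed -i "s/^EXEC_HOST_LIST=.*/EXEC_HOST_LIST=\"$2\"/" "$1"
}

# Intended use on the new exec node (not run here):
#   set_exec_host_list /tmp/gridengine-install.conf "$HOSTNAME"
```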
&lt;br /&gt;
  Go to sgemaster and do this:&lt;br /&gt;
  qconf -ae --&amp;gt; CHANGE THE HOSTNAME FROM &amp;quot;template&amp;quot; to hostname_of_new_exec&lt;br /&gt;
  qconf -as hostname&lt;br /&gt;
&lt;br /&gt;
HOW TO EDIT THE NUMBER OF SLOTS FOR AN EXEC_HOST:&lt;br /&gt;
 qconf -mattr exechost complex_values slots=32 raiders.c.domain&lt;br /&gt;
&amp;quot;complex_values&amp;quot; of &amp;quot;exechost&amp;quot; is empty - Adding new element(s).&lt;br /&gt;
&lt;br /&gt;
root@pan.slot-27.rack-1.pharmacy.cluster.domain modified &amp;quot;raiders.c.domain&amp;quot; in exechost list&lt;br /&gt;
&lt;br /&gt;
  HOW TO ADD A HOSTGROUP:&lt;br /&gt;
  qconf -ahgrp @custom &lt;br /&gt;
&lt;br /&gt;
  ADD THE EXECHOST TO A HOSTGROUP:&lt;br /&gt;
  qconf -mhgrp @custom&lt;br /&gt;
&lt;br /&gt;
  service sgemaster restart&lt;br /&gt;
 &lt;br /&gt;
  Then back on the exec_host:&lt;br /&gt;
  &lt;br /&gt;
  service sge_execd start&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
To suspend jobs you do:&lt;br /&gt;
&lt;br /&gt;
qmod -sj job_number&lt;br /&gt;
&lt;br /&gt;
To delete nodes I did the following:&lt;br /&gt;
&lt;br /&gt;
qconf -shgrpl  -&amp;gt; To see a list of host groups&lt;br /&gt;
qconf -shgrp @HOST_GROUP_NAME  -&amp;gt; For each host group to see if the nodes you want to delete are listed&lt;br /&gt;
If it is listed then:&lt;br /&gt;
qconf -mhgrp @HOST_GROUP_NAME -&amp;gt; Modify this file (delete the line with the node you want to delete).&lt;br /&gt;
Once you&amp;#039;ve deleted the node you want to delete from all the hostgroups:&lt;br /&gt;
qconf -de node_you_want_to_delete &amp;gt;/dev/null&lt;br /&gt;
qmod -de node_you_want_to_delete&lt;br /&gt;
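&lt;br /&gt;
The removal steps above could be printed as a dry run before anything is changed on the cluster. A minimal sketch, assuming the node is taken out of each host group with qconf -dattr rather than by editing each group by hand; node_removal_commands is a hypothetical helper that only prints commands and never runs them:&lt;br /&gt;

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints the qconf commands, it does not execute them.
# Usage: node_removal_commands NODE HOSTGROUP...
node_removal_commands() {
    node=$1
    shift
    # One command per host group to drop the node from that group's hostlist.
    for hg in "$@"; do
        echo "qconf -dattr hostgroup hostlist $node $hg"
    done
    # Finally remove the execution host entry itself.
    echo "qconf -de $node"
}

# Intended use on the master (not run here):
#   node_removal_commands NODE_NAME_HERE $(qconf -shgrpl)
```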
&lt;br /&gt;
To alter the priority on all the jobs for a user:&lt;br /&gt;
qstat -u user | cut -d &amp;#039; &amp;#039; -f2 &amp;gt;&amp;gt; some_file&lt;br /&gt;
Edit some_file and delete the first couple lines (the header lines)&lt;br /&gt;
for OUTPUT in $(cat some_file); do qalter -p 1022 $OUTPUT; done;&lt;br /&gt;
Priorities are -1023 to 1024&lt;br /&gt;
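&lt;br /&gt;
The header-stripping step above could be scripted instead of done by hand. A minimal sketch, assuming qstat prints two header lines before the job rows; job_ids_from_qstat is a hypothetical helper name, not an SGE command:&lt;br /&gt;

```shell
#!/bin/sh
# Hypothetical helper: reads qstat-style text on stdin, prints one job ID per line.
# Assumption: the first two lines are the column header and the dashed separator.
# Rows whose first field is not purely numeric are skipped as well.
job_ids_from_qstat() {
    awk 'NR > 2 { if ($1 ~ /^[0-9]+$/) print $1 }'
}

# Intended use on a live cluster (not run here):
#   qstat -u user | job_ids_from_qstat | while read -r JOB; do qalter -p 1022 "$JOB"; done
```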
&lt;br /&gt;
DEBUGGING SGE:&lt;br /&gt;
&lt;br /&gt;
qstat -explain a&lt;br /&gt;
&lt;br /&gt;
for HOSTGROUP in `qconf -shgrpl`; do for HOSTLIST in `qconf -shgrp $HOSTGROUP`; do  echo $HOSTLIST; done; done | grep node-1.slot-27.rack-2.pharmacy.cluster.domain&lt;br /&gt;
&lt;br /&gt;
Look at the logs for both master and exec &lt;br /&gt;
(raiders:/var/spool/gridengine/raiders/messages and pan:/var/spool/gridengine/bkslab/qmaster/messages)&lt;br /&gt;
&lt;br /&gt;
Make sure resolv.conf looks like this:&lt;br /&gt;
nameserver 142.150.250.10&lt;br /&gt;
nameserver 10.10.16.64&lt;br /&gt;
search cluster.domain domain bkslab.org                                              	&lt;br /&gt;
&lt;br /&gt;
[root@pan ~]# for X in $`qconf -shgrpl`; do qconf -shgrp $X; done;&lt;br /&gt;
Host group &amp;quot;$@24-core&amp;quot; does not exist&lt;br /&gt;
group_name @64-core&lt;br /&gt;
hostlist node-26.rack-2.pharmacy.cluster.domain&lt;br /&gt;
group_name @8-core&lt;br /&gt;
hostlist node-2.slot-27.rack-1.pharmacy.cluster.domain \&lt;br /&gt;
         node-1.slot-27.rack-1.pharmacy.cluster.domain&lt;br /&gt;
group_name @allhosts&lt;br /&gt;
hostlist @physical @virtual&lt;br /&gt;
group_name @physical&lt;br /&gt;
hostlist node-26.rack-2.pharmacy.cluster.domain&lt;br /&gt;
group_name @virtual&lt;br /&gt;
hostlist node-2.slot-27.rack-1.pharmacy.cluster.domain \&lt;br /&gt;
         node-1.slot-27.rack-1.pharmacy.cluster.domain&lt;br /&gt;
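&lt;br /&gt;
Output in the shape above can be searched for a single host, reporting which groups list it. A minimal sketch, assuming the group_name / hostlist layout shown, including backslash continuation lines; groups_containing_host is a hypothetical helper that reads qconf -shgrp output on stdin:&lt;br /&gt;

```shell
#!/bin/sh
# Hypothetical helper: prints every host group whose hostlist names the given host.
# Assumption: input is concatenated "qconf -shgrp" output (group_name / hostlist lines).
#   $1 = fully qualified host name to look for
groups_containing_host() {
    awk -v host="$1" '
        # Remember the group currently being described.
        $1 == "group_name" { group = $2; next }
        # Any other line is a hostlist line or its continuation: compare each field.
        { for (i = NF; i > 0; i--) if ($i == host) print group }
    '
}

# Intended use on the master (not run here):
#   for HG in $(qconf -shgrpl); do qconf -shgrp "$HG"; done | groups_containing_host NODE_NAME_HERE
```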
&lt;br /&gt;
1)  In one screen I would type strace qstat -f and then in the other screen I would type ps -ax | grep qstat to get the pid.  Then ls -l /proc/pid/fd/&lt;br /&gt;
I did this because every time I typed strace qstat -f it would get stuck saying this:&lt;br /&gt;
poll([{fd=3, events=POLLIN|POLLPRI}], 1, 1000) = 0 (Timeout)&lt;br /&gt;
gettimeofday({1390262563, 742705}, NULL) = 0&lt;br /&gt;
gettimeofday({1390262563, 742741}, NULL) = 0&lt;br /&gt;
gettimeofday({1390262563, 742771}, NULL) = 0&lt;br /&gt;
gettimeofday({1390262563, 742801}, NULL) = 0&lt;br /&gt;
gettimeofday({1390262563, 742828}, NULL) = 0&lt;br /&gt;
gettimeofday({1390262563, 742855}, NULL) = 0&lt;br /&gt;
gettimeofday({1390262563, 742881}, NULL) = 0&lt;br /&gt;
gettimeofday({1390262563, 742909}, NULL) = 0&lt;br /&gt;
&lt;br /&gt;
and then eventually it would say this:&lt;br /&gt;
poll([{fd=3, events=POLLIN|POLLPRI}], 1, 1000) = 1 ([{fd=3, revents=POLLIN}])&lt;br /&gt;
gettimeofday({1390262563, 960292}, NULL) = 0&lt;br /&gt;
gettimeofday({1390262563, 960321}, NULL) = 0&lt;br /&gt;
gettimeofday({1390262563, 960349}, NULL) = 0&lt;br /&gt;
read(3, &amp;quot;&amp;lt;gmsh&amp;gt;&amp;lt;dl&amp;gt;99&amp;lt;/dl&amp;gt;&amp;lt;/gms&amp;quot;, 22)   = 22&lt;br /&gt;
read(3, &amp;quot;h&amp;quot;, 1)                     	= 1&lt;br /&gt;
read(3, &amp;quot;&amp;gt;&amp;quot;, 1)                     	= 1&lt;br /&gt;
read(3, &amp;quot;&amp;lt;mih version=\&amp;quot;0.1\&amp;quot;&amp;gt;&amp;lt;mid&amp;gt;2&amp;lt;/mid&amp;gt;&amp;lt;&amp;quot;..., 99) = 99&lt;br /&gt;
read(3, &amp;quot;&amp;lt;ccrm version=\&amp;quot;0.1\&amp;quot;&amp;gt;&amp;lt;/ccrm&amp;gt;&amp;quot;, 27) = 27&lt;br /&gt;
gettimeofday({1390262563, 960547}, NULL) = 0&lt;br /&gt;
gettimeofday({1390262563, 960681}, NULL) = 0&lt;br /&gt;
gettimeofday({1390262563, 960709}, NULL) = 0&lt;br /&gt;
gettimeofday({1390262563, 960741}, NULL) = 0&lt;br /&gt;
gettimeofday({1390262563, 960769}, NULL) = 0&lt;br /&gt;
gettimeofday({1390262563, 960797}, NULL) = 0&lt;br /&gt;
gettimeofday({1390262563, 960823}, NULL) = 0&lt;br /&gt;
shutdown(3, 2 /* send and receive */)   = 0&lt;br /&gt;
close(3)                            	= 0&lt;br /&gt;
gettimeofday({1390262563, 961009}, NULL) = 0&lt;br /&gt;
gettimeofday({1390262563, 961036}, NULL) = 0&lt;br /&gt;
gettimeofday({1390262563, 961064}, NULL) = 0&lt;br /&gt;
gettimeofday({1390262563, 961093}, NULL) = 0&lt;br /&gt;
gettimeofday({1390262563, 961120}, NULL) = 0&lt;br /&gt;
gettimeofday({1390262563, 961148}, NULL) = 0&lt;br /&gt;
&lt;br /&gt;
The weird thing about this is that when I typed ls -l /proc/pid/fd/ there was never a file descriptor &amp;quot;3&amp;quot;&lt;br /&gt;
&lt;br /&gt;
2) I tried to delete the nodes that we moved to SF by doing the following:&lt;br /&gt;
qconf -dattr @physical &amp;quot;node-1.rack-3.pharmacy.cluster.domain node-10.rack-3.pharmacy.cluster.domain node-11.rack-3.pharmacy.cluster.domain node-12.rack-3.pharmacy.cluster.domain node-13.rack-3.pharmacy.cluster.domain node-14.rack-3.pharmacy.cluster.domain node-15.rack-3.pharmacy.cluster.domain node-2.rack-3.pharmacy.cluster.domain node-26.rack-3.pharmacy.cluster.domain node-27.rack-3.pharmacy.cluster.domain node-29.rack-3.pharmacy.cluster.domain node-3.rack-3.pharmacy.cluster.domain node-4.rack-3.pharmacy.cluster.domain node-5.rack-3.pharmacy.cluster.domain node-6.rack-3.pharmacy.cluster.domain node-7.rack-3.pharmacy.cluster.domain node-8.rack-3.pharmacy.cluster.domain node-9.rack-3.pharmacy.cluster.domain&amp;quot; node-1.rack-3.pharmacy.cluster.domain @physical &amp;gt; /dev/null&lt;br /&gt;
&lt;br /&gt;
I would get the error: Modification of object &amp;quot;@physical&amp;quot; not supported&lt;br /&gt;
&lt;br /&gt;
3) I tried to see the queues complex attributes by typing qconf -sc and saw this:&lt;br /&gt;
&lt;br /&gt;
#name   	shortcut   type    	relop requestable consumable default  urgency &lt;br /&gt;
&lt;br /&gt;
slots           	s      	INT     	&amp;lt;=        YES     	YES        	1    	1000&lt;br /&gt;
&lt;br /&gt;
I am not quite sure what urgency = 1000 means.&lt;br /&gt;
All other names had &amp;quot;0&amp;quot; under urgency.&lt;br /&gt;
&lt;br /&gt;
4) I tried qmod -cq &amp;#039;*&amp;#039;  to clear the error state of all the queues.  &lt;br /&gt;
It would tell me this:&lt;br /&gt;
&lt;br /&gt;
Queue instance &amp;quot;all.q@node-1.rack-3.pharmacy.cluster.domain&amp;quot; is already in the specified state: no error&lt;br /&gt;
Queue instance &amp;quot;all.q@node-1.slot-27.rack-1.pharmacy.cluster.domain&amp;quot; is already in the specified state: no error&lt;br /&gt;
Queue instance &amp;quot;all.q@node-1.slot-27.rack-2.pharmacy.cluster.domain&amp;quot; is already in the specified state: no error&lt;br /&gt;
Queue instance &amp;quot;all.q@node-10.rack-3.pharmacy.cluster.domain&amp;quot; is already in the specified state: no error&lt;br /&gt;
Queue instance &amp;quot;all.q@node-11.rack-3.pharmacy.cluster.domain&amp;quot; is already in the specified state: no error&lt;br /&gt;
Queue instance &amp;quot;all.q@node-12.rack-3.pharmacy.cluster.domain&amp;quot; is already in the specified state: no error&lt;br /&gt;
Queue instance &amp;quot;all.q@node-13.rack-3.pharmacy.cluster.domain&amp;quot; is already in the specified state: no error&lt;br /&gt;
Queue instance &amp;quot;all.q@node-14.rack-3.pharmacy.cluster.domain&amp;quot; is already in the specified state: no error&lt;br /&gt;
Queue instance &amp;quot;all.q@node-15.rack-3.pharmacy.cluster.domain&amp;quot; is already in the specified state: no error&lt;br /&gt;
Queue instance &amp;quot;all.q@node-2.rack-3.pharmacy.cluster.domain&amp;quot; is already in the specified state: no error&lt;br /&gt;
Queue instance &amp;quot;all.q@node-2.slot-27.rack-1.pharmacy.cluster.domain&amp;quot; is already in the specified state: no error&lt;br /&gt;
Queue instance &amp;quot;all.q@node-2.slot-27.rack-2.pharmacy.cluster.domain&amp;quot; is already in the specified state: no error&lt;br /&gt;
Queue instance &amp;quot;all.q@node-26.rack-2.pharmacy.cluster.domain&amp;quot; is already in the specified state: no error&lt;br /&gt;
Queue instance &amp;quot;all.q@node-26.rack-3.pharmacy.cluster.domain&amp;quot; is already in the specified state: no error&lt;br /&gt;
Queue instance &amp;quot;all.q@node-27.rack-3.pharmacy.cluster.domain&amp;quot; is already in the specified state: no error&lt;br /&gt;
Queue instance &amp;quot;all.q@node-29.rack-3.pharmacy.cluster.domain&amp;quot; is already in the specified state: no error&lt;br /&gt;
Queue instance &amp;quot;all.q@node-3.rack-3.pharmacy.cluster.domain&amp;quot; is already in the specified state: no error&lt;br /&gt;
Queue instance &amp;quot;all.q@node-3.slot-27.rack-2.pharmacy.cluster.domain&amp;quot; is already in the specified state: no error&lt;br /&gt;
Queue instance &amp;quot;all.q@node-4.rack-3.pharmacy.cluster.domain&amp;quot; is already in the specified state: no error&lt;br /&gt;
Queue instance &amp;quot;all.q@node-4.slot-27.rack-2.pharmacy.cluster.domain&amp;quot; is already in the specified state: no error&lt;br /&gt;
Queue instance &amp;quot;all.q@node-5.rack-3.pharmacy.cluster.domain&amp;quot; is already in the specified state: no error&lt;br /&gt;
Queue instance &amp;quot;all.q@node-5.slot-27.rack-2.pharmacy.cluster.domain&amp;quot; is already in the specified state: no error&lt;br /&gt;
Queue instance &amp;quot;all.q@node-6.rack-3.pharmacy.cluster.domain&amp;quot; is already in the specified state: no error&lt;br /&gt;
Queue instance &amp;quot;all.q@node-6.slot-27.rack-2.pharmacy.cluster.domain&amp;quot; is already in the specified state: no error&lt;br /&gt;
Queue instance &amp;quot;all.q@node-7.rack-3.pharmacy.cluster.domain&amp;quot; is already in the specified state: no error&lt;br /&gt;
Queue instance &amp;quot;all.q@node-7.slot-27.rack-2.pharmacy.cluster.domain&amp;quot; is already in the specified state: no error&lt;br /&gt;
Queue instance &amp;quot;all.q@node-8.rack-3.pharmacy.cluster.domain&amp;quot; is already in the specified state: no error&lt;br /&gt;
Queue instance &amp;quot;all.q@node-9.rack-3.pharmacy.cluster.domain&amp;quot; is already in the specified state: no error&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
5) I tried deleting a node like this instead:&lt;br /&gt;
qconf -ds node-1.rack-3.pharmacy.cluster.domain&lt;br /&gt;
But when I typed qconf -sel it was still there.&lt;br /&gt;
&lt;br /&gt;
6)  I tried to see what the hostlist for @physical was by typing qconf -ahgrp @physical.  It said: group_name @physical, hostlist NONE&lt;br /&gt;
	Then I typed qconf -shgrpl to see a list of all hostgroups and tried typing qconf -ahgrp.  All of them said the hostlist was NONE, &lt;br /&gt;
   but when I tried to type qconf -ahgrp @allhosts I got this message:&lt;br /&gt;
   denied: &amp;quot;root&amp;quot; must be manager for this operation&lt;br /&gt;
   error: commlib error: got select error (Connection reset by peer)&lt;br /&gt;
&lt;br /&gt;
7) I looked at the messages in the file: /var/spool/gridengine/bkslab/qmaster/messages and it said this (over and over again):&lt;br /&gt;
&lt;br /&gt;
01/20/2014 19:41:35|listen|pan|E|commlib error: got read error (closing &amp;quot;pan.slot-27.rack-1.pharmacy.cluster.domain/qconf/2&amp;quot;)&lt;br /&gt;
01/20/2014 19:43:24|  main|pan|W|local configuration pan.slot-27.rack-1.pharmacy.cluster.domain not defined - using global configuration&lt;br /&gt;
01/20/2014 19:43:24|  main|pan|W|can&amp;#039;t resolve host name &amp;quot;node-3-3.rack-3.pharmacy.cluster.domain&amp;quot;: undefined commlib error code&lt;br /&gt;
01/20/2014 19:43:24|  main|pan|W|can&amp;#039;t resolve host name &amp;quot;node-3-4.rack-3.pharmacy.cluster.domain&amp;quot;: undefined commlib error code&lt;br /&gt;
01/20/2014 19:43:53|  main|pan|I|read job database with 468604 entries in 29 seconds&lt;br /&gt;
01/20/2014 19:43:55|  main|pan|I|qmaster hard descriptor limit is set to 8192&lt;br /&gt;
01/20/2014 19:43:55|  main|pan|I|qmaster soft descriptor limit is set to 8192&lt;br /&gt;
01/20/2014 19:43:55|  main|pan|I|qmaster will use max. 8172 file descriptors for communication&lt;br /&gt;
01/20/2014 19:43:55|  main|pan|I|qmaster will accept max. 99 dynamic event clients&lt;br /&gt;
01/20/2014 19:43:55|  main|pan|I|starting up GE 6.2u5p3 (lx26-amd64)&lt;br /&gt;
&lt;br /&gt;
8)  Periodically I would get this error:  ERROR: failed receiving gdi request response for mid=3 (got no message).&lt;br /&gt;
&lt;br /&gt;
9)  I also tried deleting the pid in the file: /var/spool/gridengine/bkslab/qmaster/qmaster.pid&lt;br /&gt;
  That didn&amp;#039;t do anything.  It eventually just replaced it with a different number. &lt;br /&gt;
&lt;br /&gt;
 It&amp;#039;s weird because it&amp;#039;s not even the right pid.  For example the real pid was 8286 and the pid in the file was 8203:&lt;br /&gt;
&lt;br /&gt;
  [root@pan qmaster]# service sgemaster start&lt;br /&gt;
Starting sge_qmaster:                                  	[  OK  ]&lt;br /&gt;
[root@pan qmaster]# ps -ax |grep sge&lt;br /&gt;
Warning: bad syntax, perhaps a bogus &amp;#039;-&amp;#039;? See /usr/share/doc/procps-3.2.8/FAQ&lt;br /&gt;
 8286 ?    	Rl 	0:03 /usr/bin/sge_qmaster&lt;br /&gt;
 8301 pts/0	S+ 	0:00 grep sge&lt;br /&gt;
[root@pan qmaster]# cat qmaster.pid &lt;br /&gt;
8203&lt;br /&gt;
&lt;br /&gt;
10)   When I typed tail /var/log/messages I saw this:&lt;br /&gt;
&lt;br /&gt;
Jan 20 14:25:05 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)&lt;br /&gt;
Jan 20 14:27:05 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)&lt;br /&gt;
Jan 20 14:29:05 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)&lt;br /&gt;
Jan 20 14:31:05 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)&lt;br /&gt;
Jan 20 14:33:06 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)&lt;br /&gt;
Jan 20 14:35:06 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)&lt;br /&gt;
Jan 20 14:36:29 pan kernel: Registering the id_resolver key type&lt;br /&gt;
Jan 20 14:36:29 pan kernel: FS-Cache: Netfs &amp;#039;nfs&amp;#039; registered for caching&lt;br /&gt;
Jan 20 14:36:29 pan nfsidmap[2536]: nss_getpwnam: name &amp;#039;root@rack-1.pharmacy.cluster.domain&amp;#039; does not map into domain &amp;#039;domain&amp;#039;&lt;br /&gt;
Jan 20 14:37:06 pan puppet-agent[2021]: Could not request certificate: Connection refused - connect(2)&lt;br /&gt;
This was what happened when I restarted the machine.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Category:Sysadmin]]&lt;/div&gt;</summary>
		<author><name>Frodo</name></author>
	</entry>
</feed>