Here is a cookbook of things you can do with SGE.
Explain why a queue is in the error state
qstat -f -explain E
Investigate the error: is the disk full, or does the machine need a reboot? Once the cause is fixed, clear the error on all the queues on that machine
qmod -c '*@<machine-name>*'
Who is running jobs on the GPUs?
qstat -q gpu.q -f -u '*'
Find jobs in the Eqw state.
qstat -u '*' | grep Eqw
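To get only the job IDs of the Eqw jobs (for example, to inspect them one at a time), the listing can be filtered on the state column. This is a minimal sketch that assumes the default qstat column layout, where the state is the fifth field; the function name eqw_job_ids is made up for this example.

```shell
# Print the job IDs of all jobs whose state column is exactly "Eqw".
# Reads `qstat -u '*'` output on stdin. Assumes the default column layout
# (job-ID, prior, name, user, state, ...), so the state is field 5.
eqw_job_ids() {
    awk '$5 == "Eqw" { print $1 }'
}

# Usage on a live cluster (not run here):
#   qstat -u '*' | eqw_job_ids
# The reason for the error is usually shown by:
#   qstat -j <jobid> | grep -i 'error reason'
```

Matching on the field rather than grepping the whole line avoids false hits when a job or user name happens to contain "Eqw".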
Investigate the job. Was its working directory removed, or is there an authentication problem? We know about this issue. For now, just clear the error on a job in the Eqw state
qmod -cj <jobid>
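When several jobs are stuck, the same command can be run in a loop. The following is only a sketch: clear_jobs and the QMOD variable are hypothetical names introduced here so that the loop can be dry-run with echo before it touches the cluster.

```shell
# Clear the error state on every job ID read from stdin, one ID per line.
# QMOD is a hypothetical override: set QMOD=echo to print the commands
# instead of running them; when unset, the real qmod is used.
clear_jobs() {
    while read -r jobid; do
        # Skip blank lines.
        [ -n "$jobid" ] || continue
        ${QMOD:-qmod} -cj "$jobid"
    done
}

# Dry run (prints "-cj 101" and "-cj 102" without calling qmod):
#   printf '101\n102\n' | QMOD=echo clear_jobs
```

Only clear jobs whose underlying problem has been investigated; a job whose directory is still missing will likely return to the Eqw state.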