31 August 2012

225. Sun GridEngine: commlib error: got select error (Connection refused)


The issue:
On doing qhost I get
error: commlib error: got select error (Connection refused)
error: unable to send message to qmaster using port 6444 on host "beryllium": got send error
Beryllium is my hostnode.

Same thing happens with qstat and any other imaginable SGE command.

The solution:
It's an obvious one -- just restart the services. I mean, it took me twenty minutes to re-remember that, but it should have been obvious. Most of the time, services are managed using scripts in /etc/init.d/ and that's the case here too. So, hanging my head in shame, here's the solution:

ls /etc/init.d/grid*
/etc/init.d/gridengine-exec  /etc/init.d/gridengine-master

sudo service gridengine-master restart

qhost should now slowly be populated

Done.

How I got there:


 ps aux|grep sge
sgeadmin  3173  0.0  0.0  56844  3428 ?        Sl   Aug20   6:29 /usr/lib/gridengine/sge_execd

 tree /var/spool/gridengine -L 4 -d
/var/spool/gridengine
|-- execd
|   `-- beryllium
|       |-- active_jobs
|       |-- jobs
|       `-- job_scripts
|-- qmaster
|   `-- job_scripts
`-- spooldb

Looking at /var/spool/gridengine/execd/beryllium/messages

08/20/2012 10:47:17|  main|beryllium|I|starting up GE 6.2u5 (lx26-amd64)
08/30/2012 15:06:57|  main|beryllium|E|commlib error: got read error (closing "beryllium/qmaster/1")
08/30/2012 15:06:58|  main|beryllium|W|can't register at qmaster "beryllium": abort qmaster registration due to communication errors
less /var/lib/gridengine/rupert/common/act_qmaster
beryllium

So looks ok.

Oddly, there's nothing funny in /tmp -- no execd_messages.* files.

 ps aux|grep sge
sgeadmin  3173  0.0  0.0  56844  3428 ?        Sl   Aug20   6:29 /usr/lib/gridengine/sge_execd
sudo kill 3173


start-stop-daemon --exec /usr/sbin/sge_execd --start --user sgeadmin
Which didn't seem to do anything.

start-stop-daemon --exec /usr/sbin/sge_qmaster --start --user sgeadmin
which doesn't seem to do anything either.

/usr/lib/gridengine/gethostname -aname
critical error: Please set the environment variable SGE_ROOT.
export SGE_ROOT=/var/lib/gridengine
/usr/lib/gridengine/gethostname -aname
beryllium

 service gridengine-master restart
Restarting Sun Grid Engine Master Scheduler: sge_qmasterrm: cannot remove `/var/run/gridengine/qmaster.pid': Permission denied
.
cat /var/run/gridengine/qmaster.pid
3198
ps aux|grep 3198
yields nothing
sudo rm  /var/run/gridengine/qmaster.pid

sudo service gridengine-master restart
Restarting Sun Grid Engine Master Scheduler: sge_qmaster.
ps aux|grep sge
sgeadmin 32178  2.5  0.0  69004  6112 ?        Sl   09:40   0:00 /usr/lib/gridengine/sge_qmaster
qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
    715 0.75000 submit__la me         r     08/22/2012 08:10:32 six.q@boron                        6        
    720 0.25194 submit__63 me         r     08/22/2012 11:15:02 four.q@tantalum                    4        
    716 0.74817 submit__la me         qw    08/22/2012 08:11:28                                    6        
    719 0.70429 submit__la me         qw    08/22/2012 08:38:17                                    6        
    721 0.25071 submit__63 me         qw    08/22/2012 11:15:35                                    4        
    722 0.25000 submit__32 me         qw    08/22/2012 11:16:01                                    4    

 qhost
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
beryllium               lx26-amd64      3     -    7.8G       -   14.9G       -
boron                   lx26-amd64      6  6.10    7.6G    1.4G   14.9G  240.8M
tantalum                lx26-amd64      4  4.01    7.7G    1.6G   14.9G     0.0

sudo service gridengine-exec restart
Restarting Sun Grid Engine Execution Daemon: sge_execd.

qhost
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
beryllium               lx26-amd64      3  0.22    7.8G    3.0G   14.9G  141.3M
boron                   lx26-amd64      6  6.09    7.6G    1.4G   14.9G  240.8M
tantalum                lx26-amd64      4  4.01    7.7G    1.6G   14.9G     0.0




6 comments:

  1. Hi, I can't fix this error. I dont know what to do. I've done step by step your tips, but the issue just does not get solved.

    Could you help me?

    ReplyDelete
    Replies
    1. Difficult to troubleshoot without more information, and I'm not much of an expert on SGE. What does
      ps aux|grep sge
      give?

      Delete
  2. Thanks for answer! I've restart the services, but it does not work (I've done some tries after restart, ps aux|grep sge shows sge_execd and qmaster, but still the problem)

    ps aux|grep sge
    sgeadmin 1651 0.0 0.1 53072 1660 ? Sl 11:45 0:00 /usr/lib/gridengine/sge_execd

    /etc/hosts.conf
    127.0.0.1 localhost
    172.25.80.144 prueba.borja #qmaster
    172.25.80.140 clienteprueba1 #qclient

    ReplyDelete
    Replies
    1. You say you get both sge_qmaster and sge_qmaster but only show sge_execd?

      ps aux|grep sge
      sgeadmin 3125 0.3 0.0 137024 4972 ? Sl Nov27 50:39 /usr/lib/gridengine/sge_qmaster
      sgeadmin 3169 0.0 0.0 54796 1560 ? Sl Nov27 3:51 /usr/lib/gridengine/sge_execd

      Last question: is this an issue that has suddenly appeared, or is have you never managed to get SGE working? I'm asking since it could also be due to the way you set it up -- SGE was very temperamental during set-up, in particular when it comes to hostnames.

      Delete
    2. It has been solved! The problem were in the hostnames: SGE does not like . or numbers! I changed the hostname prueba.borja to pruebaborja; clienteprueba1 to clienteprueba and the issue dissapeared.

      Thanks! Now Im following your guide for setting up three nodes.
      Congrats for your blog

      Delete
    3. Thanks for reporting back!

      Delete