Tuesday, August 9, 2011

Don't panic!!!

This morning, I noticed that one of our servers had trouble with an agent that seemed to be out of control. It was listed as 'fatal' in the server-monitoring view of Lotus Domino Administrator. I tried to stop the zombie agent using tell amgr cancel "dbpath" 'agent', but the agent was still shown as running. I then tried stopping the agent manager task itself, but that didn't help either. It even made the situation worse, since the agent manager could no longer be started.

Now the server repeatedly printed thousands of "error connecting to ..." messages to the console, so I decided to restart the whole Domino server. Unfortunately, the server didn't respond to the shutdown request, so I had to use "nsd -kill". I rebooted the whole system and executed the Domino startup script. And guess what: the next problem I encountered looked even more challenging:

ti="00237CC4-C12578E7" sq="00000031" THREAD [08129:00002-4107441872] WAITING FOR WRITE LOCK ON RWSEM 0x02A2 Semaphore controlling per-process init/termination in NSF (@F452A3B8) (R=0,W=1,WRITER=08127:4107818704,OWNER=08127:4107818704) FOR 30000 ms
ti="0023887C-C12578E7" sq="00000032" THREAD [08129:00002-4107441872] WAITING FOR WRITE LOCK ON RWSEM 0x02A2 Semaphore controlling per-process init/termination in NSF (@F452A3B8) (R=0,W=1,WRITER=08127:4107818704,OWNER=08127:4107818704) FOR 30000 ms
ti="00239434-C12578E7" sq="00000033" THREAD [08129:00002-4107441872] WAITING FOR WRITE LOCK ON RWSEM 0x02A2 Semaphore controlling per-process init/termination in NSF (@F452A3B8) (R=0,W=1,WRITER=08127:4107818704,OWNER=08127:4107818704) FOR 30000 ms
ti="00239FEC-C12578E7" sq="00000034" THREAD [08129:00002-4107441872] WAITING FOR WRITE LOCK ON RWSEM 0x02A2 Semaphore controlling per-process init/termination in NSF (@F452A3B8) (R=0,W=1,WRITER=08127:4107818704,OWNER=08127:4107818704) FOR 30000 ms
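
If you want to see at a glance which process is holding the lock while such messages scroll by, a small script can pull the fields out of each line. The following is only a minimal sketch in Python: the regular expression is derived solely from the four lines shown above, the function names are made up for illustration, and reading the OWNER value as "process:thread" is my assumption rather than anything taken from IBM documentation.

```python
import re
from collections import Counter

# Pattern inferred from the sample console lines above (an assumption,
# not an official description of the Domino log format).
SEM_WAIT = re.compile(
    r'THREAD \[(?P<waiter>[^\]]+)\]\s+'
    r'WAITING FOR (?P<mode>\w+) LOCK ON RWSEM (?P<sem>0x[0-9A-Fa-f]+)\s+'
    r'(?P<desc>.*?)\s+\(@[0-9A-Fa-f]+\)\s+'
    r'\(R=(?P<readers>\d+),W=(?P<writers>\d+),'
    r'WRITER=(?P<writer>[^,]+),OWNER=(?P<owner>[^)]+)\)\s+'
    r'FOR (?P<timeout_ms>\d+) ms'
)


def parse_semaphore_wait(line):
    """Return the fields of one semaphore-wait line as a dict, or None."""
    match = SEM_WAIT.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the numeric fields so they can be compared or summed.
    for key in ("readers", "writers", "timeout_ms"):
        fields[key] = int(fields[key])
    return fields


def count_waits_by_owner(lines):
    """Count wait messages per (semaphore id, owner) pair."""
    counts = Counter()
    for line in lines:
        fields = parse_semaphore_wait(line)
        if fields is not None:
            counts[(fields["sem"], fields["owner"])] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python semwait.py < console.log
    for (sem, owner), n in count_waits_by_owner(sys.stdin).most_common():
        print(f"{n:6d} waits on {sem}, held by {owner}")
```

Fed with the console output above, this reports a single semaphore (0x02A2) with a single owner, which at least confirms that everything is queued up behind one writer rather than several.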

Those messages seemed to repeat forever, and the server was not accessible in that state. Now I got a little nervous and wondered what could be causing this. Google found two articles that matched the error message, but the workarounds mentioned there didn't seem to apply to our system. By then, more than half an hour had passed since I had restarted the server, and I was about to ask IBM for help. I had just finished writing the problem description on the service request page and hit the send button, when suddenly the server continued its operation as if nothing had happened... So - just don't panic ;-)

Comments:

  1. Take 3 NSDs around the time of the hang. Also, if the agent is Java, you can run the following at the Domino console:

    tell http xsp javadump
    tell http xsp heapdump

    The IBM_TECHNICAL_SUPPORT folder will then give more information on the issue.

  2. So - just don't panic ;-) ... or maybe not.

    We had the same problem and it slowed down our machine for hours until it became totally unresponsive one day.

    We figured out that it was related to transaction logging and got a hotfix (HF) from IBM. No problems since then.

  3. I forgot to mention: we're running 8.5.2 FP3 on our mail servers. The problematic agent had a do-while loop whose exit condition was never met, so it looped forever. I still wonder why this agent couldn't be cancelled through the corresponding amgr command...
    @ShadowBJ21:
    In our case, the semaphore messages showed up for about half an hour, immediately after restarting the server (following the "media recovery replay" message). As soon as those messages disappeared, the server became available again. Did this happen to you as well?
