Tuesday, August 9, 2011

Don't panic!!!

This morning I noticed that one of our servers was having trouble with an agent that seemed to be out of control. It was listed as 'fatal' in the server-monitoring view of Lotus Domino Administrator. I tried to stop the zombie agent with tell amgr cancel "dbpath" 'agent', but that had no effect: the agent still showed as running. Stopping the agent manager task itself didn't help either. It actually made things worse, because the agent manager could no longer be started.

By now the server was printing thousands of "error connecting to ..." messages to the console, so I decided to restart the whole Domino server. Unfortunately, the server didn't respond to the shutdown request, so I had to use "nsd -kill". I rebooted the whole system and ran the Domino startup script. And guess what: the next problem I ran into looked even more challenging:

ti="00237CC4-C12578E7" sq="00000031" THREAD [08129:00002-4107441872] WAITING FOR WRITE LOCK ON RWSEM 0x02A2 Semaphore controlling per-process init/termination in NSF (@F452A3B8) (R=0,W=1,WRITER=08127:4107818704,OWNER=08127:4107818704) FOR 30000 ms
ti="0023887C-C12578E7" sq="00000032" THREAD [08129:00002-4107441872] WAITING FOR WRITE LOCK ON RWSEM 0x02A2 Semaphore controlling per-process init/termination in NSF (@F452A3B8) (R=0,W=1,WRITER=08127:4107818704,OWNER=08127:4107818704) FOR 30000 ms
ti="00239434-C12578E7" sq="00000033" THREAD [08129:00002-4107441872] WAITING FOR WRITE LOCK ON RWSEM 0x02A2 Semaphore controlling per-process init/termination in NSF (@F452A3B8) (R=0,W=1,WRITER=08127:4107818704,OWNER=08127:4107818704) FOR 30000 ms
ti="00239FEC-C12578E7" sq="00000034" THREAD [08129:00002-4107441872] WAITING FOR WRITE LOCK ON RWSEM 0x02A2 Semaphore controlling per-process init/termination in NSF (@F452A3B8) (R=0,W=1,WRITER=08127:4107818704,OWNER=08127:4107818704) FOR 30000 ms

Those messages seemed to repeat forever, and the server was not accessible in that state. Now I got a little nervous and wondered what could be causing this. Google turned up two articles that matched the error message, but the workarounds mentioned there didn't seem to apply to our system. More than half an hour had passed since I had restarted the server, and I was about to ask IBM for help. I had just finished writing the problem description on the service request page and hit the send button when suddenly the server continued its operation as if nothing had happened... So - just don't panic ;-)
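
To see at a glance who is waiting on whom in console output like the lines above, the fields can be pulled apart with a few lines of script. What follows is only a minimal sketch in Python, meant to be run outside Domino against a saved console log; it is not an IBM tool. The field labels (waiter_pid, owner_pid and so on) are names I made up, and reading the bracketed values as process ID, virtual thread and native thread ID is an assumption based on how the lines look.

```python
import re
from collections import Counter

# Two of the console lines from this post, copied verbatim.
SAMPLE = [
    'ti="00237CC4-C12578E7" sq="00000031" THREAD [08129:00002-4107441872] '
    'WAITING FOR WRITE LOCK ON RWSEM 0x02A2 Semaphore controlling per-process '
    'init/termination in NSF (@F452A3B8) '
    '(R=0,W=1,WRITER=08127:4107818704,OWNER=08127:4107818704) FOR 30000 ms',
    'ti="0023887C-C12578E7" sq="00000032" THREAD [08129:00002-4107441872] '
    'WAITING FOR WRITE LOCK ON RWSEM 0x02A2 Semaphore controlling per-process '
    'init/termination in NSF (@F452A3B8) '
    '(R=0,W=1,WRITER=08127:4107818704,OWNER=08127:4107818704) FOR 30000 ms',
]

# Group names are hypothetical labels, not official Domino terminology.
LINE_RE = re.compile(
    r'THREAD \[(?P<waiter_pid>\d+):(?P<waiter_vthread>\d+)-(?P<waiter_tid>\d+)\]\s+'
    r'WAITING FOR (?P<mode>\w+) LOCK ON RWSEM (?P<sem>0x[0-9A-Fa-f]+)\s+'
    r'(?P<desc>.*?)\s+\(@[0-9A-Fa-f]+\)\s+'
    r'\(R=(?P<readers>\d+),W=(?P<writers>\d+),'
    r'WRITER=(?P<writer_pid>\d+):\d+,'
    r'OWNER=(?P<owner_pid>\d+):(?P<owner_tid>\d+)\)\s+'
    r'FOR (?P<timeout_ms>\d+) ms'
)


def parse_wait(line):
    """Return the fields of one semaphore wait line, or None if it does not match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["timeout_ms"] = int(fields["timeout_ms"])
    return fields


def summarize(lines):
    """Count wait messages per (waiting process, owning process, semaphore)."""
    counts = Counter()
    for line in lines:
        wait = parse_wait(line)
        if wait is not None:
            counts[(wait["waiter_pid"], wait["owner_pid"], wait["sem"])] += 1
    return counts


for (waiter, owner, sem), n in summarize(SAMPLE).items():
    print(f"process {waiter} waited {n}x on RWSEM {sem} held by process {owner}")
```

In my case every line named the same waiter and the same owner, which at least shows that a single process was holding things up rather than several processes blocking each other.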

Synchronizing Database Quotas

Since we rely heavily on database quotas, I find it quite annoying that there's no built-in mechanism to synchronize the quota values (and many other database properties) between different servers in a cluster. Because of that, I wrote a small LotusScript agent to accomplish that task. Here's the code:

 Option Public
 Option Declare

 Sub Initialize
  ' mail01/srv is the source server, mail02/srv the cluster mate to update
  Dim dbdir As New NotesDbDirectory("mail01/srv")
  Dim s As New NotesSession
  Dim db As NotesDatabase
  Dim db2 As NotesDatabase

  Set db = dbdir.GetFirstDatabase(DATABASE)
  While Not (db Is Nothing)
   Print "Processing " & db.FilePath
   ' Only databases that have a quota on the source server are synchronized
   If db.SizeQuota <> 0 Then
    ' False: return Nothing if the database cannot be opened on the other server
    Set db2 = s.GetDatabase("mail02/srv", db.FilePath, False)
    If Not (db2 Is Nothing) Then
     If db2.SizeQuota <> db.SizeQuota Then
      db2.SizeQuota = db.SizeQuota
      db2.SizeWarning = db.SizeWarning
      ' SizeQuota and SizeWarning are expressed in kilobytes
      Print "Quota for " & db2.FilePath & " changed to " & db.SizeQuota & " KB"
     End If
     Set db2 = Nothing
    End If
   End If
   Set db = dbdir.GetNextDatabase
  Wend
 End Sub

Just save the agent code in a database. Schedule the agent to run, for example, once a night or once a week, and sign it with an ID that has the right to access all the databases on both servers. The rest should be self-explanatory.