Operations Manager 2007 Agent Failover

I have been in the sandbox today, playing with failover. In this scenario I had four machines:

  • COOM01, root management server with SQL
  • COOM02, management server
  • CODC01, domain controller and DNS, also agent-managed by Operations Manager
  • COXP01, XP workstation for Operations Manager administrators

I deployed an agent to the domain controller (CODC01) from COOM01 and assigned COOM01 as its management server. During agent deployment you can set up failover for the agent. If you use Active Directory-integrated settings for your agents, you can specify the failover settings there. With default settings, the agent will automatically fail over to any other management server in the management group.
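To make the default behaviour concrete, here is a minimal Python sketch of how an agent could choose which management server to report to. It is only a toy model of what I observed in the sandbox, not Operations Manager code; the `Agent` class and the `pick_server` method are hypothetical names made up for this illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Agent:
    """Toy model of an agent with one primary and a list of failover servers."""
    name: str
    primary: str
    failover: List[str] = field(default_factory=list)

    def pick_server(self, online: Set[str]) -> Optional[str]:
        """Return the server the agent reports to, or None if none is reachable."""
        # Prefer the primary whenever it is reachable, so the agent
        # moves back to it once it returns.
        if self.primary in online:
            return self.primary
        # Otherwise use the first failover server that is reachable.
        for server in self.failover:
            if server in online:
                return server
        return None


# The sandbox from this post: CODC01 reports to COOM01, with COOM02 as failover.
agent = Agent(name="CODC01", primary="COOM01", failover=["COOM02"])
print(agent.pick_server({"COOM01", "COOM02"}))  # both servers up
print(agent.pick_server({"COOM02"}))            # primary is down
```

With both servers up the sketch picks COOM01; with COOM01 down it picks COOM02, which matches what the agent did in the test.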

When an Operations Manager administrator starts the console on his workstation, COXP01, it connects to COOM01, because that is the root management server (RMS).

I then turned off the root management server. On the agent, CODC01, I first got an event with ID 21006, telling me that the agent could not connect to COOM01 and that I should check the network. Two minutes later I got an event with ID 21015, telling me that the agent will resume communication when COOM01 is back online. Also two minutes after the first event, I got an event telling me that the agent had successfully failed over to COOM02.

I then turned the root management server back on, and after a couple of minutes the agent generated an event with ID 21019.
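The event sequence from this test can be turned into a small check. The Python sketch below is only an illustration under my own assumptions: the descriptions are paraphrased from what I saw in the sandbox, not the official event text, and `primary_status` is a hypothetical helper name, not part of Operations Manager.

```python
# Event IDs as observed in this test; descriptions are paraphrased.
OBSERVED_EVENTS = {
    21006: "agent could not connect to its management server",
    21015: "agent will resume communication when the primary is back online",
    21019: "logged a couple of minutes after the primary came back online",
}


def primary_status(event_ids):
    """Walk an ordered list of event IDs and report the last known state."""
    status = "primary reachable"  # assumed starting state
    for event_id in event_ids:
        if event_id in (21006, 21015):
            status = "primary unreachable"
        elif event_id == 21019:
            status = "primary reachable"
        # Any other event ID is ignored in this sketch.
    return status


print(primary_status([21006, 21015]))         # while COOM01 was off
print(primary_status([21006, 21015, 21019]))  # after COOM01 came back
```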

 

Everything was back to normal. When I opened the console I saw no alerts at all, at least no new ones. This was most likely because the root management server was down and could not update the database. Then I shut down the root management server again, this time with the console open on COXP01. I got the events on CODC01 telling me that the management server COOM01 was not available and that the agent had failed over to COOM02. I did not get any error messages in the console on COXP01, but when I tried to change the view in the console I did get an error. When I tried to open the console on COOM02, it could not connect either.

After the root management server went down, the administrator at the workstation could not see anything in his console. The SDK service is responsible for authentication to the database, and it runs on the root management server. This means that if no root management server is running the SDK service, nothing can access the database.

If this had been a real environment and the root management server had gone down and could not be recovered, the next step would be to move the root management server role to another management server. You can read how to do that here. But as you can read in the guide, one of the steps is to back up the root management server key and use it when you promote another server. This means that you must have the key to recover, so do not forget to back it up. If you don't have the key, there is no way to promote another server to root management server. The root management server supports clustering, so it does not need to be a single point of failure. Remember this when you design your Operations Manager 2007 environment.
