Fault-tolerance in Opalis policies

Lately I have been working with Opalis, especially around fault tolerance, both within a policy and in the Opalis infrastructure. I think these are two areas that need to be combined to get a truly fault-tolerant Opalis implementation. In this blog post I want to show you some things to think about when building your Opalis policies.

Opalis is an automation platform for orchestrating and integrating IT tools to drive down the cost of datacenter operations, while improving the reliability of IT processes. It enables IT organizations to automate best practices, such as those found in Microsoft Operations Framework (MOF) and Information Technology Infrastructure Library (ITIL). This is achieved through workflow processes that coordinate System Center and other management tools to automate incident response, change and compliance, and service-lifecycle management processes. More info at source.

When you install your first action server, it becomes your default (primary) action server. You can then install more action servers that work as standby servers. If the Opalis Management Server doesn't receive heartbeats from the primary action server, it can fail over all policies to the secondary server. You can also install another action server that acts as standby for the first standby server. A nice thing is that you can design some of your policies to run on the standby machine even when the primary action server is alive. That can be useful for resource-intensive policies or for policies that require special software. To mention one more Opalis failover feature: you can place the database on a cluster, which is also something to think about when building your Opalis infrastructure.
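To make the failover chain a bit more concrete, here is a small conceptual sketch in Python. It is not how Opalis is implemented internally and it does not call any Opalis API; the function name, the server names and the 60-second timeout are all assumptions made up for illustration. It only shows the idea of "use the first server in priority order that still sends heartbeats".

```python
from typing import Dict, List, Optional

# Assumed value, for illustration only; not an Opalis setting.
HEARTBEAT_TIMEOUT_SECONDS = 60.0


def pick_action_server(
    servers_in_priority_order: List[str],
    last_heartbeat: Dict[str, float],
    now: float,
    timeout: float = HEARTBEAT_TIMEOUT_SECONDS,
) -> Optional[str]:
    """Return the first server whose last heartbeat is recent enough.

    Servers are tried in order: primary, first standby, second standby, ...
    Returns None if no server has sent a heartbeat within the timeout.
    """
    for server in servers_in_priority_order:
        seen = last_heartbeat.get(server)
        if seen is not None and now - seen <= timeout:
            return server
    return None


if __name__ == "__main__":
    servers = ["primary", "standby1", "standby2"]  # hypothetical names
    # The primary was last seen 120 seconds ago, standby1 20 seconds ago.
    heartbeats = {"primary": 0.0, "standby1": 100.0}
    print(pick_action_server(servers, heartbeats, now=120.0))
```

In this model a standby for the standby is simply one more entry at the end of the list.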

One important note about failover to another action server is that your policy will start again at the beginning of the workflow. In other words, if the policy was running when the primary action server went offline, it will start over from the beginning. That is not always a bad thing, but it is something you need to think about when building your policies. This leads us to the second area of fault tolerance in Opalis: fault tolerance within the policy itself.

When you design your policy you should consider fault tolerance, and not just build the workflow as one long straight line. For example, suppose you build a policy that creates a new user account, creates a mailbox, adds the account to a couple of security groups, and creates a network folder, with each object simply following the previous one.

What if the file server is offline or the user account already exists? The first question is how you will notice it. Will you watch the Opalis Operator Console, or monitor it with Operations Manager? Your policy will fail, and you will have a user in Active Directory without a mailbox or network folder. Most likely you will need to open your MMC consoles and create the mailbox and folder manually. Suppose you instead add a couple of extra objects to the policy.

Your policy will now start by checking whether the user is already in AD, or for example whether the username is already in use. It will then check that both the mail server and the file server are up and running. If not, it will stop and write this to log files, which you can easily monitor and use for troubleshooting.
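Outside of the Opalis designer, the same "check first, log, then stop" idea can be sketched in a few lines of Python. This is only an illustration of the pattern, not something Opalis runs: the function `run_prechecks`, the logger name and the check names are hypothetical, and the lambdas stand in for real lookups against AD, the mail server and the file server.

```python
import logging
from typing import Callable, Dict, List

log = logging.getLogger("provisioning")  # hypothetical logger name


def run_prechecks(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every pre-check and return the names of the ones that failed.

    A check that raises an exception is treated as failed. Each failure
    is written to the log so it can be monitored and used for
    troubleshooting.
    """
    failed = []
    for name, check in checks.items():
        try:
            passed = check()
        except Exception as exc:
            log.error("Pre-check '%s' raised an error: %s", name, exc)
            passed = False
        else:
            if not passed:
                log.error("Pre-check '%s' failed", name)
        if not passed:
            failed.append(name)
    return failed


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # Placeholder checks; real ones would query AD and ping the servers.
    failed = run_prechecks({
        "username is free": lambda: True,
        "mail server is up": lambda: True,
        "file server is up": lambda: False,
    })
    if failed:
        log.error("Stopping before anything is created: %s", failed)
```

The sketch runs all checks and reports every failure in one go instead of stopping at the first one. That is a design choice; it means the log shows the whole picture after a single run.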

Add some more objects and you get a third version of the policy. Building on the two other versions, this version checks whether the username already exists, and if it does, the policy continues along the orange path. The orange path generates a different sAMAccountName and different user attributes than the green path. Both paths also send an e-mail when the account is created. If the create mailbox object fails, the policy will delete the account, or more generally clean up whatever it has created so far.
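The two ideas in this version, picking another account name when the first one is taken and cleaning up when a later step fails, can be sketched in Python as follows. Everything here is an assumption for illustration: the numeric-suffix naming scheme, the function names and the step tuples are hypothetical, and the post does not say how the orange path actually builds the new name.

```python
from typing import Callable, List, Set, Tuple


def next_free_name(base: str, taken: Set[str], max_tries: int = 99) -> str:
    """Return base if it is free, otherwise base2, base3, ... (assumed scheme)."""
    if base not in taken:
        return base
    for n in range(2, max_tries + 2):
        candidate = f"{base}{n}"
        if candidate not in taken:
            return candidate
    raise RuntimeError(f"no free name found for {base!r}")


# A step is (name, do, undo); undo removes whatever do created.
CleanupStep = Tuple[str, Callable[[], None], Callable[[], None]]


def run_with_cleanup(steps: List[CleanupStep]) -> List[str]:
    """Run steps in order; if one fails, undo the completed ones in
    reverse order and re-raise the error."""
    done: List[CleanupStep] = []
    try:
        for step in steps:
            step[1]()          # run the "do" part
            done.append(step)
    except Exception:
        for _name, _do, undo in reversed(done):
            undo()
        raise
    return [name for name, _do, _undo in done]


if __name__ == "__main__":
    print(next_free_name("jdoe", {"jdoe", "jdoe2"}))
```

A policy built this way never leaves a half-created user behind: either every step succeeded, or the account created earlier has been removed again.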

You can of course come up with a lot of other scenarios where different parts of the policy fail and you need to take action based on that. One example: if the action server fails over to the secondary server, you need a policy that knows where to start. We might already have the account in AD and want to continue with the mailbox, not create a new account along the orange path. There is no single easy answer to how you should build your policy to cover every scenario, but it is something you should think about when designing your policies. And then, of course, use the failover mechanisms in Opalis to get a fault-tolerant infrastructure.
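One way to give a policy that "knows where to start" behaviour is to let every step first check whether its result is already in place, and skip the step if so. The Python sketch below shows the pattern under stated assumptions: `run_resumable`, the step tuples and the in-memory `state` dictionary are hypothetical stand-ins for real lookups in AD, Exchange and the file server.

```python
from typing import Callable, List, Tuple

# A step is (name, is_done, action). is_done looks at the real system
# and reports whether the result of this step already exists.
ResumableStep = Tuple[str, Callable[[], bool], Callable[[], None]]


def run_resumable(steps: List[ResumableStep]) -> List[str]:
    """Run only the steps whose result is not already in place.

    Returns the names of the steps that were actually executed, so a
    restarted run continues where the previous one stopped.
    """
    executed = []
    for name, is_done, action in steps:
        if is_done():
            continue  # already handled by an earlier, interrupted run
        action()
        executed.append(name)
    return executed


if __name__ == "__main__":
    # Simulated system state: the account exists, the rest does not.
    state = {"account": True, "mailbox": False, "folder": False}

    def make_step(key: str) -> ResumableStep:
        return (key, lambda: state[key], lambda: state.__setitem__(key, True))

    steps = [make_step("account"), make_step("mailbox"), make_step("folder")]
    print(run_resumable(steps))  # continues with the mailbox and the folder
```

With steps written like this, a policy that restarts from the beginning after a failover does no harm: the steps that already completed are skipped, and the account is not created a second time.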