State tracking and failover in SMA

As you might know, in Orchestrator a runbook restarts at the first activity when it fails over to a secondary runbook server. The failover mechanism is automatic in Orchestrator, but restarting at the first activity is a challenge. Read more about state tracking in Orchestrator here.

SMA does this a bit differently. The great news for all Orchestrator administrators is that you can create checkpoints in your runbooks. A checkpoint is a snapshot of a running runbook or workflow, including variable values, output and everything done up to that point. You can place a checkpoint anywhere you want in a runbook, but each checkpoint costs a bit of performance and storage. A best practice is to add a checkpoint after each important part of the runbook that you don’t want to re-run if the runbook is restarted. For example, if you have a runbook that builds a virtual machine and then configures it, you might want to add a checkpoint after the virtual machine is created. If the runbook is interrupted, it will continue with the configuration when it is resumed.

As checkpoints are stored in the SMA database they are worker independent, meaning that one worker can start the workflow and, if it is interrupted, another worker can pick it up at the latest checkpoint. Another nice benefit of checkpoints is that you can suspend a running runbook. For example, if you need to do maintenance on the worker, or if the runbook is interrupted due to a network error, you can suspend it, fix the problem and resume the runbook.
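The virtual machine scenario above could look roughly like the following. This is only a minimal sketch: the workflow name New-VmWithCheckpoint and its VMName parameter are made up for illustration, and the two InlineScript blocks are placeholders for whatever build and configuration logic you actually use. Checkpoint-Workflow is the only part that matters here.

```powershell
workflow New-VmWithCheckpoint
{
    param([string]$VMName)

    # Step 1: create the virtual machine.
    # Placeholder only; the real build logic is not part of this sketch.
    InlineScript {
        $name = $using:VMName
        Write-Output "Creating virtual machine $name"
    }

    # Persist the workflow state. If the job is interrupted after this
    # point, it resumes here instead of creating the machine again.
    Checkpoint-Workflow

    # Step 2: configure the virtual machine (placeholder as well).
    InlineScript {
        $name = $using:VMName
        Write-Output "Configuring virtual machine $name"
    }
}
```

If the job is interrupted during step 2 and later resumed, only the configuration step runs again, because the state saved by Checkpoint-Workflow already includes the completed step 1.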

When a worker (Runbook Service) starts, it decides which queue “slots” to process. If you have two worker servers, each picks up half of the “slots”. If one worker goes offline, you need to update the runbook worker deployment settings to include only the remaining worker. Otherwise there is a risk that jobs will not be picked up and suspended jobs will not be resumed.

This reconfiguration is done in PowerShell with the New-SmaRunbookWorkerDeployment cmdlet. Because the worker service configures its queue “slots” at startup, it is important to stop all workers before running New-SmaRunbookWorkerDeployment; otherwise there is a risk that jobs are picked up multiple times or become corrupt. Note that New-SmaRunbookWorkerDeployment replaces the existing settings: to add a worker, you must pass both the old and the new workers. The following script adds a worker named SMA02 to the existing configuration. WAP01 is the machine running the SMA web service.

$webService = "https://wap01"
$workers = (Get-SmaRunbookWorkerDeployment -WebServiceEndpoint $webService).ComputerName
# A single worker is returned as a plain string, so wrap it in an array before appending
if ($workers -isnot [System.Array]) { $workers = @($workers) }
$workers += "SMA02"
# Replace the deployment with the old workers plus SMA02
New-SmaRunbookWorkerDeployment -WebServiceEndpoint $webService -ComputerName $workers

If you have two workers, SMA01 and SMA02, and SMA01 goes offline, you can run the following command to remove SMA01 and configure SMA02 as the only worker:

New-SmaRunbookWorkerDeployment -WebServiceEndpoint https://wap01 -ComputerName SMA02

Once you have run the New-SmaRunbookWorkerDeployment cmdlet you can start the worker services again.
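The whole stop, reconfigure and start sequence could be wrapped in one function, roughly as sketched below. Treat it as an illustration under a few assumptions that are not from this post: the function name Set-SmaWorkerList is made up, PowerShell remoting is assumed to be enabled on every worker, and the service is located by its display name “Runbook Service”.

```powershell
# Minimal sketch: replace the SMA worker deployment with a new worker list.
# Assumptions: PowerShell remoting is enabled on all workers and the worker
# service has the display name "Runbook Service".
function Set-SmaWorkerList
{
    param(
        [Parameter(Mandatory)] [string]   $WebServiceEndpoint,
        [Parameter(Mandatory)] [string[]] $NewWorkers
    )

    # Workers in the current deployment; @() also handles a single worker.
    $currentWorkers = @((Get-SmaRunbookWorkerDeployment -WebServiceEndpoint $WebServiceEndpoint).ComputerName)

    # Stop the Runbook Service on every current worker first.
    # An offline worker only produces an error here; the loop continues.
    foreach ($worker in $currentWorkers) {
        Invoke-Command -ComputerName $worker -ScriptBlock {
            Stop-Service -DisplayName 'Runbook Service'
        }
    }

    # Replace (not append to) the existing deployment settings.
    New-SmaRunbookWorkerDeployment -WebServiceEndpoint $WebServiceEndpoint -ComputerName $NewWorkers

    # Start the Runbook Service on the workers that remain in the deployment.
    foreach ($worker in $NewWorkers) {
        Invoke-Command -ComputerName $worker -ScriptBlock {
            Start-Service -DisplayName 'Runbook Service'
        }
    }
}
```

With SMA01 offline, the call would be Set-SmaWorkerList -WebServiceEndpoint "https://wap01" -NewWorkers "SMA02". Keep in mind that stopping the service can take as long as the configured drain time.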

Quick summary

  • Make sure to include checkpoints in your runbooks
  • Make sure to have multiple workers
  • When one worker goes offline, or you need to do maintenance, stop all running workers (Runbook Service) and then use New-SmaRunbookWorkerDeployment to remove/add workers
  • Start all workers (Runbook Service) again

Common questions

Q: What if I need to run New-SmaRunbookWorkerDeployment when I have running runbooks?
A: It is possible to configure drain time. Drain time can be up to 20 minutes and is started when you stop the runbook service. During drain time the worker will not pick up any new jobs and running jobs will suspend if possible (if they have a checkpoint). Runbooks without a checkpoint will continue to run and might be interrupted when the runbook service stops after drain time.

Q: How many checkpoints can I create in a runbook?
A: You can create as many as you like, but SMA only uses the latest one, so a runbook can only be resumed from its latest checkpoint.

Q: What will happen if a runbook has no checkpoints and it is moved to another worker?
A: If there are no checkpoints in the runbook and it is transferred to another worker, the runbook will restart from the beginning on the second worker.

Q: How do I create a checkpoint?
A: Include “Checkpoint-Workflow” in your runbook.

Q: I see “Queued” as job status after I reconfigured workers, and it seems like no jobs are running?
A: This can happen if no worker is running. Make sure at least one of the workers is running.

Q: When I try to start the Runbook Service on a worker I get an error in the System log saying “The Runbook Service service terminated with the following error: Incorrect function”
A: Most likely you are trying to start a worker that is not in the runbook worker deployment settings. Reconfigure with New-SmaRunbookWorkerDeployment.

Q: Is it possible to automate the “New-SmaRunbookWorkerDeployment” part?
A: Yes! See this blog post for example

Example

In this example I have two runbook workers, SMA02 and WAP01. I have a runbook named test_failover. This runbook runs 20 loops. For each loop it writes a timestamp to a SQL table, creates a checkpoint and waits two minutes.
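A runbook along these lines could be sketched as follows. This is not the exact runbook used in the test: the connection string, the dbo.FailoverLog table and its columns are hypothetical, and the SQL insert is done inline here as a stand-in for the separate “WriteLog” runbook, whose content is not shown in this post.

```powershell
workflow test_failover
{
    # Hypothetical connection string; adjust for your environment.
    $connectionString = 'Server=SQL01;Database=SmaDemo;Integrated Security=True'

    foreach ($loop in 1..20)
    {
        # Write the loop number and a timestamp to SQL. .NET method calls are
        # not allowed directly in a workflow, so they run inside InlineScript.
        InlineScript {
            $loopNumber = $using:loop
            $connection = New-Object System.Data.SqlClient.SqlConnection
            $connection.ConnectionString = $using:connectionString
            try {
                $connection.Open()
                $command = $connection.CreateCommand()
                # dbo.FailoverLog and its columns are hypothetical names.
                $command.CommandText = 'INSERT INTO dbo.FailoverLog (LoopNumber, LoggedAt) VALUES (@loop, SYSDATETIME())'
                [void]$command.Parameters.AddWithValue('@loop', $loopNumber)
                [void]$command.ExecuteNonQuery()
            }
            finally {
                $connection.Dispose()
            }
        }

        # Save state so that another worker can resume after this loop.
        Checkpoint-Workflow

        # Wait two minutes before the next loop.
        Start-Sleep -Seconds 120
    }
}
```

Because the checkpoint is taken right after each write, a resumed job continues with the wait and the next loop rather than starting over at loop 1.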

In the SQL table I can see the test_failover runbook writing rows, one for each loop.

If I now stop the Runbook Service on SMA02, I can see that the service is “Stopping” for 10 minutes, which is the drain time in my environment. The test_failover runbook stopped at loop number 4.

I stop the Runbook Service on WAP01 and run “New-SmaRunbookWorkerDeployment -WebServiceEndpoint https://wap01 -ComputerName wap01” to remove SMA02 from my configuration. Then I start the Runbook Service on WAP01 again. I can see that the runbook is now running on WAP01. As you can see in the figure, loops 4 and 3 were written in the same second. When I stopped the Runbook Service, loop 3 was just about to be written by the “WriteLog” runbook, but it was queued until WAP01 resumed the jobs.

Note that this is provided “AS-IS” with no warranties at all. This is not a production ready solution, just an idea and an example.

