I ran into a interesting scenario some time ago. A customer have first line operators online 24/7. During none business hours they receive all alerts and needs to call the on-call engineer if needed. But first line don’t have deep knowledge about the environment so sometimes the alerts from Operations Manager is a bit complicated to connect to a service, for example if the alert only tells you that database Y has a problem, and also to understand how critical the alerts are. For example if only one IIS in the IIS farm goes offline, they should not call the on-call engineer in the middle of the night.
We had for example a service including two Windows services. As long as one of them are running, there should not be an alert, and if there is an alert, it should include a simple non-technical description. First we needed to create a distributed application with the two services. We used the Configure Health Rollup feature to configure rollup algorithm to “best health state”. As long as any service is health, the component box will be healthy.
When one of the services are stopped, you will receive an alert telling you for example “the print spooler service on computer X has stopped running”. If you don’t need it you can override the monitor and configure it not to generate alerts. When booth services are down the distributed application will switch to critical status. But you will not receive an alert, only for the two services included in the distributed application.
If you need an alert when both services are offline, when the component box switch state, you can override the aggregate rollup monitor in the distributed application. Override it to both configure the alert description and also rename the alert to get a better alert name in the console. In this scenario I override the aggregate monitor on top of my two Availability monitors.
Now when both services are offline I get one alert, saying that first line should contact the on-call engineer.