Managed Availability is one of Exchange 2013's interesting new features. It is essentially Exchange's built-in monitoring and remediation platform.
It's the latter part -- remediation -- that continues to confuse Exchange admins. Because Managed Availability is tightly integrated into Exchange 2013, it will take mitigating actions whenever it discovers an issue. Actions depend on the issue, but rebooting the server -- referred to as a "bugcheck" -- is one example.
Before diving into the nuts and bolts of Managed Availability, let's have a look at how the feature works and how it's designed to do its job.
How Managed Availability maintains a server's health
Managed Availability uses a multi-layered approach when it comes to evaluating and maintaining a server's health. At the foundation, a series of tests or polls (counters, for example) are executed; these tests are referred to as probes. Each of the probes yields a result, which is in turn inspected by the second layer in Managed Availability: monitors.
As the name implies, monitors are the logic that interprets the results that the different probes yield. How a monitor interprets the result of a probe is programmatically determined; it's based on Microsoft's experience with the product and is likely influenced by what the company sees happening in Office 365.
Depending on the results of the probes, monitors will either do nothing or kick off so-called responders. Responders are responsible for executing specific actions to try and remediate the potential issue one of the probes discovered. An example of a responder action is restarting a service.
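The probe, monitor and responder layers described above can be pictured with a small Python sketch. This is a toy model for illustration only, not how Exchange implements Managed Availability: the class names, the `failure_threshold` parameter and the "count the failed probes" rule are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ProbeResult:
    """Outcome of a single probe run (hypothetical structure)."""
    probe_name: str
    succeeded: bool


def run_probe(name: str, check: Callable[[], bool]) -> ProbeResult:
    """Layer 1: execute a test and record whether it passed."""
    return ProbeResult(name, bool(check()))


class Monitor:
    """Layer 2: interprets probe results and decides whether to act."""

    def __init__(self, name: str, failure_threshold: int,
                 responder: Callable[[str], None]) -> None:
        self.name = name
        # Assumed rule: this many failed results mark the component unhealthy.
        self.failure_threshold = failure_threshold
        self.responder = responder

    def evaluate(self, results: List[ProbeResult]) -> bool:
        """Return True if healthy; otherwise kick off the responder."""
        failures = sum(1 for r in results if not r.succeeded)
        if failures >= self.failure_threshold:
            # Layer 3: the responder tries to remediate the issue.
            self.responder(self.name)
            return False
        return True
```

In this sketch the monitor does nothing while the probes succeed and calls its responder (for example, a function that restarts a service) once enough probe results come back negative.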
When the service restarts, a probe might be able to receive different -- and better -- results the next time it runs. The monitor might then determine the service is healthy again. However, sometimes the results are still negative, which could occur when the service stops again or because it's not performing well.
When results remain negative, many scenarios are possible. If the monitor determines the service is still unhealthy, it might use a different responder, which might then take a different action (for instance, restarting the server). The actions taken and their order depend on the definition of the monitor and on the component involved. For some components, multiple responders will be tried before the issue escalates, so you don't have to worry that a low-impact service might cause a server reboot. Additionally, responders are throttled, so certain actions can only be taken once or twice a day.
If none of the actions solves the issue, Managed Availability will escalate it: it raises an alert in the Event Logs to notify an Exchange admin that an issue exists and should be examined. This event can then be picked up by whatever monitoring option you have, which in turn can alert the admin. Escalation is itself carried out by a responder, one that is programmed to create the alert (Figure 1).
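The sequence of trying responders in order, throttling them and finally escalating can be sketched as follows. Again, this is only an illustrative Python model under stated assumptions: `ThrottledResponder`, `remediate`, the `max_runs` limit and the 24-hour window are hypothetical names and values chosen for the example, not Exchange's actual settings.

```python
import time
from typing import Callable, List


class ThrottledResponder:
    """A recovery action that may only run a limited number of times per window."""

    def __init__(self, name: str, action: Callable[[], None], max_runs: int,
                 window_seconds: float = 86400.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.name = name
        self.action = action
        self.max_runs = max_runs              # assumed limit, e.g. 1 or 2 per day
        self.window_seconds = window_seconds  # assumed 24-hour throttling window
        self.clock = clock
        self._runs: List[float] = []

    def try_run(self) -> bool:
        """Run the action unless throttled; return True if it actually ran."""
        now = self.clock()
        # Forget runs that have aged out of the throttling window.
        self._runs = [t for t in self._runs if now - t < self.window_seconds]
        if len(self._runs) >= self.max_runs:
            return False
        self._runs.append(now)
        self.action()
        return True


def remediate(responders: List[ThrottledResponder],
              is_healthy: Callable[[], bool],
              escalate: Callable[[], None]) -> str:
    """Try each responder in order, re-checking health before each step."""
    for responder in responders:
        if is_healthy():
            return "healthy"
        responder.try_run()
    if is_healthy():
        return "healthy"
    # Nothing helped: hand the issue to a human, e.g. by writing an event.
    escalate()
    return "escalated"
```

Ordering the list from low-impact to high-impact actions (service restart first, server reboot later) mirrors the idea that an issue only escalates after the milder options have been tried.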
Health Manager Service and the worker process
Managed Availability is composed of two different services or processes, much like the new Exchange 2013 store service. The Health Manager Service is the parent process that controls the Health Manager Worker, the child process. The worker process is responsible for executing the different tasks Managed Availability has to perform.
The process hierarchy is clearly visible in a tool such as Process Explorer (Figure 2).
The Health Manager Service isn't only responsible for starting or stopping the worker process -- it also ensures the worker process works correctly. If it finds that the process hangs (or that it didn't start), it will restart the process as needed.
In a multi-site environment, the Health Manager Service itself is also being watched, though not by a process on the same server: the Health Manager Service on another Exchange server does so.
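The local parent/child relationship between the service and its worker resembles a common supervisor pattern, which can be sketched in Python. This is a generic illustration, not the Health Manager's actual code: `WorkerSupervisor` and its placeholder worker command are hypothetical, and the sketch only detects a worker that has exited, not one that hangs.

```python
import subprocess
import sys
from typing import List, Optional


class WorkerSupervisor:
    """Parent that starts a child process and restarts it if it has exited."""

    def __init__(self, command: Optional[List[str]] = None) -> None:
        # Placeholder worker: a Python child that simply sleeps for a while.
        self.command = command or [sys.executable, "-c",
                                   "import time; time.sleep(30)"]
        self.process: Optional[subprocess.Popen] = None
        self.restarts = 0

    def start(self) -> None:
        self.process = subprocess.Popen(self.command)

    def ensure_running(self) -> bool:
        """Start the worker if it is not running; return True if a start was needed."""
        if self.process is None:
            self.start()
            return True
        if self.process.poll() is not None:  # the child has exited
            self.restarts += 1
            self.start()
            return True
        return False

    def stop(self) -> None:
        if self.process is not None and self.process.poll() is None:
            self.process.terminate()
            self.process.wait()
```

A real supervisor would call `ensure_running()` on a timer and would also need some form of heartbeat to notice a worker that is still alive but no longer responding.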
About the author:
Michael Van Horenbeeck is a technology consultant, Microsoft Certified Trainer and Exchange MVP from Belgium, mainly working with Exchange Server, Office 365, Active Directory and a bit of Lync. He has been active in the industry for 12 years and is a frequent blogger, a member of the Belgian Unified Communications User Group Pro-Exchange and a regular contributor to The UC Architects podcast.
This is part one of a series on Managed Availability.
Part two covers the locations where Managed Availability logs its activity and how to leverage that information to make sense of what's happening in your environment.
Stay tuned for part three, which discusses responders and explains how to retrieve what actions Managed Availability took.