Monday, July 9, 2012

RHQ's New Availability Checking

The latest RHQ release has introduced a few new things related to its availability scheduling and checking. (side note: you'll notice that link points to RHQ's new documentation home! The wiki moved to jboss.org)

It used to be that RHQ's availability scanning was pretty rigid. RHQ scanned every resource on a fixed 5 minute schedule. Though this was configurable, it was only configurable on an agent-wide scale (meaning, you could not say that you want to check resource A every 5 minutes, but resource B every 1 minute).

In addition, when the agent would go down (for whatever reason), it took a long time (around 15 minutes by default) for the RHQ Server to consider the agent to be "missing in action" and mark all of the resources that agent was managing as "down". Prior to those 15 minutes (what RHQ calls the "Agent Max Quiet Time Allowed"), you really had no indication there was a problem and some users found this to be too long before alerts started firing. And even though this Agent Max Quiet Time Allowed setting is configurable, under many situations, setting it to something appreciably lower than 15 minutes would cause the RHQ Server to think agents were down when they weren't, triggering false alerts.

RHQ has since addressed these issues. Now, you will notice all server and service resources have a common "Availability" metric in their Monitoring>Schedules subtab in the GUI. This works just like other metric schedules - you can individually configure these as you wish just like normal metrics. By default, all server resources will have their availability checked every 1 minutes (for example, see the server resource snapshot in figure 1). By default, all service resources will have their availability checked every 10 minutes (for example, see the service resource snapshot in figure 2).

Figure 1 - A server resource and its Availability metric, with default interval of 1 minute
Figure 2 - A service resource and its Availability metric, with default interval of 10 minutes
The reason for the difference in defaults is this - we felt most times it didn't pay to scan all services at the same rate as servers. That just added more load on the agent and more load on the managed resources for no appreciable gain in the default case. Because normally, if a service is down, we could have already detected that by seeing that its server is down. And if a server is up, generally speaking, all of its services are normally up as well. So we found checking the server resources more frequently, combined with checking all of their services less frequently, helps detect the common failures (those being, crashed servers) more quickly while lowering the total amount of work performed. Obviously, these defaults are geared toward the general case; if these assumptions don't match your expectations, you are free to alter your own resources' Availability schedules to fit your own needs. But we felt that, out-of-box, we wanted to reduce the load the agent puts on the machine and the resources it is managing while at the same time we wanted to be able to detect downed servers more quickly. And this goes a good job of that.

Another change that was made is that when an agent gracefully shuts down, it sends one final message to the RHQ Server to indicate it is shutting down and at that point the RHQ Server immediately marks that agent's platform as "down" rather than wait for the agent to go quiet for a period of 15 minutes (the Agent Max Quiet Time Allowed period) before being marked "down". Additionally, although the platform is marked "down", that platform's child resources (all of its servers and services) are now marked as "unknown". This is because, technically, RHQ doesn't really know what's happening on that box now that the agent is no longer reporting in. Those managed resources that the agent was monitoring (the servers and services) could be down, but RHQ can't make that determination - they could very well still be up. Therefore, the RHQ Server places all the servers and services in the new availability state known as "unknown" as indicated by the question-mark-within-a-gray-circle icon. See figure 3 as an example.

Figure 3 - Availability data for a service that has been flagged as being in an "unknown" state
As part of these infrastructure changes, the Agent Max Quiet Time Allowed default setting has been lowered to 5 minutes. This means that, in the rare case that the agent actually crashes and was unable to gracefully notify the RHQ Server, the RHQ Server will consider the agent "missing in action" faster than before (now 5 minutes instead of 15 minutes). Thus, alerts can be fired in a more timely manner when agents go down unexpectedly.

No comments:

Post a Comment