Friday, July 13, 2012

Co-Located Management Client For JBossAS 7

Back in the "olden days" (pre-JBossAS 7 :-), the JBoss Application Server provided management capabilities through MBeans hosted in its MBeanServer. Your own code could remotely manage the app server by connecting to its MBeanServer through remote connectors, but, if your code was running co-located in the same app server that you wanted to manage, your code could talk directly to the MBeanServer through the normal JMX API.

JBossAS 7 has a new management subsystem and its management API has been completely redesigned and is no longer dependent on JMX. The new AS7 management subsystem is exposed in several ways; in fact, as I see it, there are five different ways to manage AS7 - two of which are programmatic:

  1. The HTTP management endpoint which uses JSON and a de-typed RPC style API.
  2. "ModelControllerClient" (found in the jboss-as-controller-client library) which provides a factory that allows you to remotely connect to an AS7 instance and utilizes jboss-dmr as its de-typed API.
  3. The administration console (aka "the web interface") which is a GWT application that remotely manages AS7 via the previously mentioned HTTP management endpoint.
  4. The command line interface (aka "the CLI") which is a console-based and Swing GUI-based user interface that remotely manages AS7 via the previously mentioned ModelControllerClient.
  5. Manual editing of XML configuration files.
You can use (1) and (2) programmatically to build management tools. And, in fact, you can see this is what the AS7 folks did when they built AS7's own management tools (the web interface (3) utilizes the HTTP endpoint and the CLI (4) utilizes the ModelControllerClient).
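
To make option (1) a little more concrete, here is a minimal sketch of what a de-typed request to the HTTP management endpoint might look like using only the JDK. Treat the details as assumptions on my part: the class and method names are made up for illustration, the endpoint URL you pass in (something like http://localhost:9990/management on a default install) depends on your configuration, and the sketch ignores authentication entirely, which a secured AS7 instance will require.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; not part of AS7.
public class HttpManagementSketch {

    // Builds the JSON body for a de-typed "read-attribute" operation on the root resource.
    // Assumes attributeName contains no characters that need JSON escaping.
    public static String readAttributeRequest(String attributeName) {
        return "{\"operation\":\"read-attribute\",\"name\":\"" + attributeName + "\"}";
    }

    // POSTs the JSON body to the given management endpoint and returns the HTTP status code.
    // No credentials are sent, so a secured endpoint is expected to reject the request.
    public static int post(String endpoint, String jsonBody) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(endpoint).toURL().openConnection();
        try {
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "application/json");
            conn.setDoOutput(true);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(jsonBody.getBytes(StandardCharsets.UTF_8));
            }
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }
}
```

Note that even if this code runs inside the very AS7 instance it is asking about, the request still travels over a socket back to the same JVM.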

However, as you may have noticed, one thing is conspicuously absent - that being a way to manage an application server which is co-located with the management tool itself without going over some sort of remote connector or endpoint. In the previous JMX-based way of doing JBossAS management, if your management code was running in the app server itself, you could obtain the app server's MBeanServer without going through some sort of remote connector thereby allowing you to call the management API by directly accessing that MBeanServer. However, in AS7, there is nothing readily available that gives you local, intra-VM, access to the underlying management capabilities. Even though your code may be running in the same AS7 instance that you want to manage, you still have to make a "remote" round-trip connection back to yourself (using either an HTTP connection or the standard ModelControllerClient factory).

In this blog I propose* a sixth mechanism to manage AS7 - that being, a local mechanism that circumvents any need for a remote connection and allows you to directly access the management API.
(* full credit to this idea actually belongs to Kabir Khan who kindly gave me the original prototype implementation - thanks Kabir!)

Obviously, this sixth mechanism is only useful if your management code is co-located with the AS7 being managed. But when you are in this situation, it does seem that this need has, up to now, been left unfulfilled. This mechanism uses the same ModelControllerClient and jboss-dmr API that the CLI uses; however, it does not require that you connect the client to the app server over a remote connector. Rather, it activates a service within AS7. That service gets injected with AS7's internal ModelController object and then exposes that object via a ModelControllerClient. With that client, any code running inside the AS7 instance can talk directly to AS7's ModelController, giving you direct access to the underlying management functionality without going through a remote connector.

Here's one way you can do this. In this example, let's assume there is a WAR which will contain an AS7 MSC service (I'll just refer to it as a "service"). This AS7 service will expose the local ModelControllerClient to the WAR via a very simple singleton pattern. I could conceive of other ways in which you could expose the local client - you could even do so through JMX! Just put the ModelControllerClient in an attribute on some MBean that the AS7 service registers in a local MBeanServer. Your local JMX clients could then obtain the ModelControllerClient via JMX. But I digress.
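
Since I brought up the JMX idea, here is a rough sketch of what it could look like, using only the JDK's own JMX classes. Everything named here is an assumption for illustration: the class names, the ObjectName, and the use of the platform MBeanServer are all made up, and the attribute is typed as a plain Object standing in for the ModelControllerClient so the sketch compiles without any AS7 libraries.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.management.StandardMBean;

// Hypothetical sketch: exposes an in-VM object as a read-only attribute on an MBean.
public class JmxClientHolderSketch {

    // Management interface with a single read-only attribute named "Client".
    public interface ClientHolderMBean {
        Object getClient();
    }

    // Simple implementation that just hands back the object it was given.
    public static class ClientHolder implements ClientHolderMBean {
        private final Object client;

        public ClientHolder(Object client) {
            this.client = client;
        }

        @Override
        public Object getClient() {
            return client;
        }
    }

    // Made-up name; pick whatever domain and keys suit your application.
    public static final String NAME = "example.management:type=ClientHolder";

    // Registers the holder in the platform MBeanServer (what the service's start() could do).
    public static ObjectName register(Object client) {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            ObjectName name = new ObjectName(NAME);
            server.registerMBean(new StandardMBean(new ClientHolder(client), ClientHolderMBean.class), name);
            return name;
        } catch (Exception e) {
            throw new IllegalStateException("Could not register the client holder MBean", e);
        }
    }

    // What a local JMX client would do to get the object back.
    public static Object lookup(ObjectName name) {
        try {
            return ManagementFactory.getPlatformMBeanServer().getAttribute(name, "Client");
        } catch (Exception e) {
            throw new IllegalStateException("Could not read the Client attribute", e);
        }
    }

    // Removes the MBean again (what the service's stop() could do).
    public static void unregister(ObjectName name) {
        try {
            ManagementFactory.getPlatformMBeanServer().unregisterMBean(name);
        } catch (Exception e) {
            throw new IllegalStateException("Could not unregister the client holder MBean", e);
        }
    }
}
```

This only makes sense for callers in the same VM, since the attribute value is handed over by reference and is not meant to be serialized to a remote JMX client.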

First, you need to declare that your WAR deployment has a ServiceActivator. You do this very simply - by placing a file called "org.jboss.msc.service.ServiceActivator" in your WAR's META-INF/services directory (note: there is currently a bug that requires META-INF/services/org.jboss.msc.service.ServiceActivator to actually be placed under the WAR's WEB-INF/classes directory. Refer to that bug report for more information). The content of this file is the fully qualified class name of a ServiceActivator implementation class that can be found in the WAR (in my example, this would be the full class name of the ManagementService class that I will describe below).

Second, because this mechanism requires that our service utilize some AS7 specific libraries, you need to add some dependencies to the WAR so it can access those AS7 libraries. The dependencies are identified by module names (these are the names that AS7 uses to identify the libraries). Some of the libraries that the WAR needs access to are identified by module names such as org.jboss.msc, org.jboss.as.controller-client, et al. You can specify these additional dependencies by simply adding a Dependencies entry in the WAR's META-INF/MANIFEST.MF file like this:
Dependencies: org.jboss.msc,org.jboss.as.controller-client,org.jboss.as.controller,org.jboss.as.server
Third, you need to actually implement this ServiceActivator class and place it in the WAR (e.g. it can go in WEB-INF/classes). For this example, I will call it "ManagementService" because it is a service that simply exposes the local management client to the WAR. Here is the code for this service:

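import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;

import org.jboss.as.controller.ModelController;
import org.jboss.as.controller.client.ModelControllerClient;
import org.jboss.as.server.Services;
import org.jboss.msc.service.Service;
import org.jboss.msc.service.ServiceActivator;
import org.jboss.msc.service.ServiceActivatorContext;
import org.jboss.msc.service.ServiceName;
import org.jboss.msc.service.ServiceRegistryException;
import org.jboss.msc.service.StartContext;
import org.jboss.msc.service.StartException;
import org.jboss.msc.service.StopContext;
import org.jboss.msc.value.InjectedValue;

public class ManagementService implements ServiceActivator {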
   private static volatile ModelController controller;
   private static volatile ExecutorService executor;


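   public static ModelControllerClient getClient() {
      // Copy the volatile fields once so a concurrent stop() cannot null them between the check and the use.
      final ModelController c = controller;
      final ExecutorService e = executor;
      if (c == null || e == null) {
         throw new IllegalStateException("ManagementService has not been started");
      }
      return c.createClient(e);
   }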


   @Override
   public void activate(ServiceActivatorContext context) throws ServiceRegistryException {
      final GetModelControllerService service = new GetModelControllerService();
      context
          .getServiceTarget()
          .addService(ServiceName.of("management", "client", "getter"), service)
          .addDependency(Services.JBOSS_SERVER_CONTROLLER, ModelController.class, service.modelControllerValue)
          .install();
   }


   private class GetModelControllerService implements Service<Void> {
      private InjectedValue<ModelController> modelControllerValue = new InjectedValue<ModelController>();


      @Override
      public Void getValue() throws IllegalStateException, IllegalArgumentException {
         return null;
      }


      @Override
      public void start(StartContext context) throws StartException {
         ManagementService.executor = Executors.newFixedThreadPool(5, new ThreadFactory() {
             @Override
             public Thread newThread(Runnable r) {
                 Thread t = new Thread(r);
                 t.setDaemon(true);
                 t.setName("ManagementServiceModelControllerClientThread");
                 return t;
             }
         });
         ManagementService.controller = modelControllerValue.getValue();
      }


      @Override
      public void stop(StopContext context) {
         try {
            ManagementService.executor.shutdownNow();
         } finally {
            ManagementService.executor = null;
            ManagementService.controller = null;
         }
      }
   }
}

You'll notice that its activate() method simply injects the existing AS7 management controller (whose name is defined by Services.JBOSS_SERVER_CONTROLLER) into the service being activated. This AS7 management controller is the "secret sauce" - it is the object that interacts with AS7's management subsystem. The actual service is defined via a small private class (GetModelControllerService) whose main job is simply to take the injected ModelController and store it in the ManagementService.controller static field. It also creates the small daemon thread pool that the client will use, and tears both down again when the service stops.

Once this service is activated (which happens when the WAR is deployed), classes deployed in the WAR itself can access the management API by obtaining a ModelControllerClient via ManagementService.getClient(). This wraps the ModelController in a client object which interacts with the ModelController thus allowing any class in the WAR to start managing the AS7 instance via the client. But unlike the typical usage of ModelControllerClient (like what you see in the CLI) it doesn't go through a remote connector. It literally talks directly to the underlying AS7 ModelController object giving you a direct channel to the AS7 management subsystem.

Monday, July 9, 2012

RHQ's New Availability Checking

The latest RHQ release has introduced a few new things related to its availability scheduling and checking. (side note: you'll notice that link points to RHQ's new documentation home! The wiki moved to jboss.org)

It used to be that RHQ's availability scanning was pretty rigid. RHQ scanned every resource on a fixed 5 minute schedule. Though this was configurable, it was only configurable on an agent-wide scale (meaning, you could not say that you want to check resource A every 5 minutes, but resource B every 1 minute).

In addition, when the agent would go down (for whatever reason), it took a long time (around 15 minutes by default) for the RHQ Server to consider the agent to be "missing in action" and mark all of the resources that agent was managing as "down". Prior to those 15 minutes (what RHQ calls the "Agent Max Quiet Time Allowed"), you really had no indication there was a problem and some users found this to be too long before alerts started firing. And even though this Agent Max Quiet Time Allowed setting is configurable, under many situations, setting it to something appreciably lower than 15 minutes would cause the RHQ Server to think agents were down when they weren't, triggering false alerts.

RHQ has since addressed these issues. Now, you will notice all server and service resources have a common "Availability" metric in their Monitoring>Schedules subtab in the GUI. This works just like other metric schedules - you can individually configure these as you wish just like normal metrics. By default, all server resources will have their availability checked every 1 minute (for example, see the server resource snapshot in figure 1). By default, all service resources will have their availability checked every 10 minutes (for example, see the service resource snapshot in figure 2).

Figure 1 - A server resource and its Availability metric, with default interval of 1 minute
Figure 2 - A service resource and its Availability metric, with default interval of 10 minutes
The reason for the difference in defaults is this - we felt most times it didn't pay to scan all services at the same rate as servers. That just added more load on the agent and more load on the managed resources for no appreciable gain in the default case, because normally, if a service is down, we could have already detected that by seeing that its server is down. And if a server is up, generally speaking, all of its services are normally up as well. So we found checking the server resources more frequently, combined with checking all of their services less frequently, helps detect the common failures (those being, crashed servers) more quickly while lowering the total amount of work performed. Obviously, these defaults are geared toward the general case; if these assumptions don't match your expectations, you are free to alter your own resources' Availability schedules to fit your own needs. But we felt that, out-of-box, we wanted to reduce the load the agent puts on the machine and the resources it is managing while at the same time we wanted to be able to detect downed servers more quickly. And this does a good job of that.
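
To put some rough numbers on the "less total work" claim, here is a back-of-the-envelope sketch. The class and method names and the inventory sizes are my own assumptions, and it only counts scheduled checks for servers and services (it ignores platforms and anything else the agent may do to avoid redundant checks), so take it as an illustration rather than a measurement.

```java
// Hypothetical arithmetic helper; not part of RHQ.
public class AvailabilityCheckLoad {

    // Number of checks per hour for one resource that is scanned every intervalMinutes.
    public static int checksPerHour(int intervalMinutes) {
        return 60 / intervalMinutes;
    }

    // Old behavior: every resource on the same fixed 5 minute schedule.
    public static int oldChecksPerHour(int servers, int services) {
        return (servers + services) * checksPerHour(5);
    }

    // New defaults: servers every 1 minute, services every 10 minutes.
    public static int newChecksPerHour(int servers, int services) {
        return servers * checksPerHour(1) + services * checksPerHour(10);
    }
}
```

For an assumed inventory of 10 servers and 200 services, oldChecksPerHour(10, 200) comes to 2520 checks per hour while newChecksPerHour(10, 200) comes to 1800. By this simple count the new defaults only win once services outnumber servers by more than 8 to 1, which is another reason to tune the schedules if your inventory looks different.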

Another change is that when an agent gracefully shuts down, it sends one final message to the RHQ Server to indicate it is shutting down. At that point the RHQ Server immediately marks that agent's platform as "down" rather than waiting for the agent to go quiet for a period of 15 minutes (the Agent Max Quiet Time Allowed period). Additionally, although the platform is marked "down", that platform's child resources (all of its servers and services) are now marked as "unknown". This is because, technically, RHQ doesn't really know what's happening on that box now that the agent is no longer reporting in. Those managed resources that the agent was monitoring (the servers and services) could be down, but RHQ can't make that determination - they could very well still be up. Therefore, the RHQ Server places all the servers and services in the new availability state known as "unknown" as indicated by the question-mark-within-a-gray-circle icon. See figure 3 as an example.

Figure 3 - Availability data for a service that has been flagged as being in an "unknown" state
As part of these infrastructure changes, the Agent Max Quiet Time Allowed default setting has been lowered to 5 minutes. This means that, in the rare case that the agent actually crashes and was unable to gracefully notify the RHQ Server, the RHQ Server will consider the agent "missing in action" faster than before (now 5 minutes instead of 15 minutes). Thus, alerts can be fired in a more timely manner when agents go down unexpectedly.
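
The rules described above can be modeled in a few lines. This is only a sketch of the behavior as I described it, not RHQ's actual server code: the class, enum, and method names are invented, and it assumes the server treats a crashed (quiet) agent the same way as one that shut down gracefully when it comes to marking child resources.

```java
// Hypothetical model of the availability rules; not RHQ source code.
public class AgentAvailabilitySketch {

    public enum Availability { UP, DOWN, UNKNOWN }

    // New default for "Agent Max Quiet Time Allowed": 5 minutes, in milliseconds.
    public static final long DEFAULT_MAX_QUIET_MILLIS = 5 * 60 * 1000L;

    // The agent is considered gone if it announced a graceful shutdown,
    // or if it has been quiet for longer than the allowed quiet time.
    public static boolean isAgentGone(long lastReportMillis, long nowMillis,
                                      boolean gracefulShutdown, long maxQuietMillis) {
        return gracefulShutdown || (nowMillis - lastReportMillis) > maxQuietMillis;
    }

    // The platform is marked DOWN once its agent is gone.
    public static Availability platformAvailability(boolean agentGone, Availability lastReported) {
        return agentGone ? Availability.DOWN : lastReported;
    }

    // Servers and services become UNKNOWN: nobody is reporting on them any more,
    // so the server cannot tell whether they are really up or down.
    public static Availability childAvailability(boolean agentGone, Availability lastReported) {
        return agentGone ? Availability.UNKNOWN : lastReported;
    }
}
```

With these defaults, an agent that last reported six minutes ago is treated as gone, while one that reported four minutes ago is not; a graceful shutdown message short-circuits the wait entirely.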