Thoughts From A Management Platform Developer: 2012

Monday, November 26, 2012

Monitoring IP Endpoints Via RHQ

RHQ has a little-known plugin called the "netservices" plugin. Someone on the #rhq freenode chat room was asking how RHQ can monitor HTTP endpoints - you use this netservices plugin to do this.

There are actually two different resource types that plugin provides - one allows you to monitor an HTTP URL endpoint (e.g. so you can monitor for different HTTP status codes). The other is a very basic resource type that just verifies that you can ping a specific IP or hostname.
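To give a feel for what the first kind of check boils down to, here is a minimal sketch in plain Java. This is not the plugin's code - the HttpEndpointCheck class, its statusOf method and the timeout parameter are hypothetical names made up for illustration, and it assumes a plain http:// or https:// URL.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper for illustration only; not part of the netservices plugin.
final class HttpEndpointCheck {

    /** Returns the HTTP status code of a GET request, or -1 if the request could not be made. */
    static int statusOf(String url, int timeoutMillis) {
        HttpURLConnection conn = null;
        try {
            // Assumes an http:// or https:// URL, so the cast below is valid.
            conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod("GET");
            conn.setConnectTimeout(timeoutMillis);
            conn.setReadTimeout(timeoutMillis);
            return conn.getResponseCode();
        } catch (IOException e) {
            // Malformed URL, connection refused, timeout, and so on.
            return -1;
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }
}
```

A monitoring loop could call statusOf on a schedule and treat anything other than the expected status code as a problem worth alerting on.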

Here's the RHQ 4.5.1 version of that plugin and its source:

http://mirrors.ibiblio.org/maven2/org/rhq/rhq-netservices-plugin/4.5.1/

(note: if anyone in the community is interested in doing some work on RHQ and looking for something simple to do - here's a perfect opportunity. It would be nice to have some documentation written for that plugin, but more importantly, I think we could beef up this plugin. For example, it would be nice to have another resource type (similar to that basic PingService) that is able to confirm that a particular port on a particular IP/host can be connected to. We also could use some testing done on this plugin to make sure it still works as expected.)
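For anyone tempted to pick this up, the core of such a port check could be as small as the sketch below. It is only a starting point under stated assumptions - PortCheck and canConnect are hypothetical names, and none of the RHQ plugin wiring (resource component, availability reporting, plugin descriptor) is shown.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper for illustration only; not part of the netservices plugin.
final class PortCheck {

    /** Returns true if a TCP connection to host:port succeeds within the timeout. */
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Unknown host, connection refused, or timed out.
            return false;
        }
    }
}
```

Using an explicit connect timeout matters here; without one, a firewalled port could block the check for a long time.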

Thursday, August 2, 2012

Fedorahosted GIT links have changed (again)

Well, fedorahosted did it again.

All of my links to RHQ source code in all of my blog entries (and everywhere else I used them - bugzilla entries, wiki pages, etc) will be invalid now, because fedorahosted once again changed their git URLs and AFAICS the old URLs no longer work and do not redirect to the new URLs.

This is at least the second time that I know of that they did this - it may have happened more. Previously, I went through all of my blog entries and updated my source links, but I am not going to do it this time. It is too time consuming and I am not convinced this won't happen again.

So, if you click a link to source code in my blogs and get an error message saying the link is invalid, you know why. :-/

(Heiko is right, we need to move to github :-)

Friday, July 13, 2012

Co-Located Management Client For JBossAS 7

Back in the "olden days" (pre-JBossAS 7 :-), the JBoss Application Server provided management capabilities through MBeans hosted in its MBeanServer. Your own code could remotely manage the app server by connecting to its MBeanServer through remote connectors, but, if your code was running co-located in the same app server that you wanted to manage, your code could talk directly to the MBeanServer through the normal JMX API.

JBossAS 7 has a new management subsystem and its management API has been completely redesigned and is no longer dependent on JMX. The new AS7 management subsystem is exposed in several ways; in fact, as I see it, there are five different ways to manage AS7 - two of which are programmatic:

1. The HTTP management endpoint which uses JSON and a de-typed RPC style API.
2. "ModelControllerClient" (found in the jboss-as-controller-client library) which provides a factory that allows you to remotely connect to an AS7 instance and utilizes jboss-dmr as its de-typed API.
3. The administration console (aka "the web interface") which is a GWT application that remotely manages AS7 via the previously mentioned HTTP management endpoint.
4. The command line interface (aka "the CLI") which is a console-based and Swing GUI-based user interface that remotely manages AS7 via the previously mentioned ModelControllerClient.
5. Manual editing of XML configuration files.

You can use (1) and (2) programmatically to build management tools. And, in fact, you can see this is what the AS7 folks did when they built AS7's own management tools (the web interface (3) utilizes the HTTP endpoint and the CLI (4) utilizes the ModelControllerClient).

However, as you may have noticed, one thing is conspicuously absent - that being a way to manage an application server which is co-located with the management tool itself without going over some sort of remote connector or endpoint. In the previous JMX-based way of doing JBossAS management, if your management code was running in the app server itself, you could obtain the app server's MBeanServer without going through some sort of remote connector thereby allowing you to call the management API by directly accessing that MBeanServer. However, in AS7, there is nothing readily available that gives you local, intra-VM, access to the underlying management capabilities. Even though your code may be running in the same AS7 instance that you want to manage, you still have to make a "remote" round-trip connection back to yourself (using either an HTTP connection or the standard ModelControllerClient factory).

In this blog I propose* a sixth mechanism to manage AS7 - that being, a local mechanism that circumvents any need for a remote connection and allows you to directly access the management API.
(* full credit to this idea actually belongs to Kabir Khan who kindly gave me the original prototype implementation - thanks Kabir!)

Obviously, this sixth mechanism is only useful if your management code is co-located with the AS7 being managed. But, when you have this situation, it does seem that this need has been, up to now, left unfulfilled. This mechanism actually will utilize the same ModelControllerClient and jboss-dmr API as the CLI uses, however, it does not require that you connect the client to the app server over a remote connector. Rather, what this will do is activate a service within AS7 - that service will get injected with AS7's internal ModelController object and then it will expose that object via a ModelControllerClient. With that client, any code running inside the AS7 instance can talk directly to AS7's ModelController thus giving you direct access to the underlying management functionality without the need for going through a remote connector.

Here's one way you can do this. In this example, let's assume there is a WAR which will contain an AS7 MSC service (I'll just refer to it as a "service"). This AS7 service will expose the local ModelControllerClient to the WAR via a very simple singleton pattern. I could conceive of other ways in which you could expose the local client - you could even do so through JMX! Just put the ModelControllerClient in an attribute on some MBean that the AS7 service registers in a local MBeanServer. Your local JMX clients could then obtain the ModelControllerClient via JMX. But I digress.

First, you need to declare that your WAR deployment has a ServiceActivator. You do this very simply - by placing a file called "org.jboss.msc.service.ServiceActivator" in your WAR's META-INF/services directory (note: there is currently a bug that requires META-INF/services/org.jboss.msc.service.ServiceActivator to actually be placed under the WAR's WEB-INF/classes directory. Refer to that bug report for more information). The content of this file is the fully qualified class name of a ServiceActivator implementation class that can be found in the WAR (in my example, this would be the full class name of the ManagementService class that I will describe below).
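As a sketch, assuming the ManagementService class lived in a hypothetical package named com.example.management (the package name is made up for illustration), the entire content of that org.jboss.msc.service.ServiceActivator file would be this single line:

```text
com.example.management.ManagementService
```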

Second, because this mechanism requires that our service utilize some AS7 specific libraries, you need to add some dependencies to the WAR so it can access those AS7 libraries. The dependencies are identified by module names (these are the names that AS7 uses to identify the libraries). Some of the libraries that the WAR needs access to are identified by module names such as org.jboss.msc, org.jboss.as.controller-client, et al. You can specify these additional dependencies by simply adding a Dependencies entry in the WAR's META-INF/MANIFEST.MF file like this:


Dependencies: org.jboss.msc,org.jboss.as.controller-client,org.jboss.a
 s.controller,org.jboss.as.server

Third, you need to actually implement this ServiceActivator class and place it in the WAR (e.g. it can go in WEB-INF/classes). For this example, I will call it "ManagementService" because it is a service that simply exposes the local management client to the WAR. Here is the code for this service:

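import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;

import org.jboss.as.controller.ModelController;
import org.jboss.as.controller.client.ModelControllerClient;
import org.jboss.as.server.Services;
import org.jboss.msc.service.Service;
import org.jboss.msc.service.ServiceActivator;
import org.jboss.msc.service.ServiceActivatorContext;
import org.jboss.msc.service.ServiceName;
import org.jboss.msc.service.ServiceRegistryException;
import org.jboss.msc.service.StartContext;
import org.jboss.msc.service.StartException;
import org.jboss.msc.service.StopContext;
import org.jboss.msc.value.InjectedValue;

public class ManagementService implements ServiceActivator {
    // Set when the inner service starts, cleared when it stops.
    private static volatile ModelController controller;
    private static volatile ExecutorService executor;

    // Returns a client that talks directly to the local ModelController.
    public static ModelControllerClient getClient() {
        final ModelController c = controller;
        final ExecutorService e = executor;
        if (c == null || e == null) {
            throw new IllegalStateException("ManagementService has not been started");
        }
        return c.createClient(e);
    }

    @Override
    public void activate(ServiceActivatorContext context) throws ServiceRegistryException {
        final GetModelControllerService service = new GetModelControllerService();
        context
            .getServiceTarget()
            .addService(ServiceName.of("management", "client", "getter"), service)
            .addDependency(Services.JBOSS_SERVER_CONTROLLER, ModelController.class, service.modelControllerValue)
            .install();
    }

    // The actual MSC service; AS7 injects its ModelController into modelControllerValue.
    private class GetModelControllerService implements Service<Void> {
        private InjectedValue<ModelController> modelControllerValue = new InjectedValue<ModelController>();

        @Override
        public Void getValue() throws IllegalStateException, IllegalArgumentException {
            return null;
        }

        @Override
        public void start(StartContext context) throws StartException {
            // Daemon threads so the pool never keeps the JVM alive on its own.
            ManagementService.executor = Executors.newFixedThreadPool(5, new ThreadFactory() {
                @Override
                public Thread newThread(Runnable r) {
                    Thread t = new Thread(r);
                    t.setDaemon(true);
                    t.setName("ManagementServiceModelControllerClientThread");
                    return t;
                }
            });
            ManagementService.controller = modelControllerValue.getValue();
        }

        @Override
        public void stop(StopContext context) {
            try {
                ManagementService.executor.shutdownNow();
            } finally {
                ManagementService.executor = null;
                ManagementService.controller = null;
            }
        }
    }
}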

You'll notice that its activate() method simply injects the existing AS7 management controller (whose name is defined by Services.JBOSS_SERVER_CONTROLLER) into the service being activated. This AS7 management controller is the "secret sauce" - it is the object that interacts with the AS7's management subsystem. The actual service is defined via a small private class (GetModelControllerService) whose main job is simply to take the injected ModelController and store it in the ManagementService.controller static field.

Once this service is activated (which happens when the WAR is deployed), classes deployed in the WAR itself can access the management API by obtaining a ModelControllerClient via ManagementService.getClient(). This wraps the ModelController in a client object which interacts with the ModelController thus allowing any class in the WAR to start managing the AS7 instance via the client. But unlike the typical usage of ModelControllerClient (like what you see in the CLI) it doesn't go through a remote connector. It literally talks directly to the underlying AS7 ModelController object giving you a direct channel to the AS7 management subsystem.

Monday, July 9, 2012

RHQ's New Availability Checking

The latest RHQ release has introduced a few new things related to its availability scheduling and checking. (side note: you'll notice that link points to RHQ's new documentation home! The wiki moved to jboss.org)

It used to be that RHQ's availability scanning was pretty rigid. RHQ scanned every resource on a fixed 5 minute schedule. Though this was configurable, it was only configurable on an agent-wide scale (meaning, you could not say that you want to check resource A every 5 minutes, but resource B every 1 minute).

In addition, when the agent would go down (for whatever reason), it took a long time (around 15 minutes by default) for the RHQ Server to consider the agent to be "missing in action" and mark all of the resources that agent was managing as "down". Prior to those 15 minutes (what RHQ calls the "Agent Max Quiet Time Allowed"), you really had no indication there was a problem and some users found this to be too long before alerts started firing. And even though this Agent Max Quiet Time Allowed setting is configurable, under many situations, setting it to something appreciably lower than 15 minutes would cause the RHQ Server to think agents were down when they weren't, triggering false alerts.

RHQ has since addressed these issues. Now, you will notice all server and service resources have a common "Availability" metric in their Monitoring>Schedules subtab in the GUI. This works just like other metric schedules - you can individually configure these as you wish just like normal metrics. By default, all server resources will have their availability checked every 1 minute (for example, see the server resource snapshot in figure 1). By default, all service resources will have their availability checked every 10 minutes (for example, see the service resource snapshot in figure 2).

Figure 1 - A server resource and its Availability metric, with default interval of 1 minute

Figure 2 - A service resource and its Availability metric, with default interval of 10 minutes

The reason for the difference in defaults is this - we felt most times it didn't pay to scan all services at the same rate as servers. That just added more load on the agent and more load on the managed resources for no appreciable gain in the default case. Because normally, if a service is down, we could have already detected that by seeing that its server is down. And if a server is up, generally speaking, all of its services are normally up as well. So we found checking the server resources more frequently, combined with checking all of their services less frequently, helps detect the common failures (those being, crashed servers) more quickly while lowering the total amount of work performed. Obviously, these defaults are geared toward the general case; if these assumptions don't match your expectations, you are free to alter your own resources' Availability schedules to fit your own needs. But we felt that, out-of-box, we wanted to reduce the load the agent puts on the machine and the resources it is managing while at the same time we wanted to be able to detect downed servers more quickly. And this does a good job of that.

Another change that was made is that when an agent gracefully shuts down, it sends one final message to the RHQ Server to indicate it is shutting down and at that point the RHQ Server immediately marks that agent's platform as "down" rather than wait for the agent to go quiet for a period of 15 minutes (the Agent Max Quiet Time Allowed period) before being marked "down". Additionally, although the platform is marked "down", that platform's child resources (all of its servers and services) are now marked as "unknown". This is because, technically, RHQ doesn't really know what's happening on that box now that the agent is no longer reporting in. Those managed resources that the agent was monitoring (the servers and services) could be down, but RHQ can't make that determination - they could very well still be up. Therefore, the RHQ Server places all the servers and services in the new availability state known as "unknown" as indicated by the question-mark-within-a-gray-circle icon. See figure 3 as an example.

Figure 3 - Availability data for a service that has been flagged as being in an "unknown" state

As part of these infrastructure changes, the Agent Max Quiet Time Allowed default setting has been lowered to 5 minutes. This means that, in the rare case that the agent actually crashes and was unable to gracefully notify the RHQ Server, the RHQ Server will consider the agent "missing in action" faster than before (now 5 minutes instead of 15 minutes). Thus, alerts can be fired in a more timely manner when agents go down unexpectedly.

Wednesday, June 27, 2012

Simple Tools To Analyze Thread Dumps

When debugging problems in, or analyzing the performance of, your Java applications, you sometimes have to sift through one or more Java thread dumps. Depending on the size and complexity of your Java applications, these thread dumps can be lengthy. It is hard to use a text editor to scan them for things like long running threads, blocked threads and deadlocked threads.

Clearly, you can use sophisticated Java analysis tools (such as JProfiler) for such a job. However, there are instances where you simply don't have access to such tools. For example, you may be supporting a remote customer, and all you can do is ask the customer to use "jstack" or to send a SIGQUIT signal to the Java app process. You usually cannot ask a customer to stop an app and restart it with the appropriate debugging system properties and options and then connect to it with a tool like JProfiler (which also assumes they even have such a tool installed and available).

So, what you normally have to fall back on is obtaining a thread dump (which is nothing more than normal text). Once that thread dump is captured, you can save it as a .txt file, send it around via email, post it via pastebin or even send it over IM (if it's small enough).

A simple way you can get a thread dump is via the "jstack" utility that ships with the JDK. Once you know the pid of your Java JVM process (to get this information, you can use the standard "ps" utility or you can use "jps", which is another JDK utility), you simply tell "jstack" to output a thread dump associated with that process which you redirect to a file:

# find the process IDs of all of my running Java applications
$ jps
21147 Jps
16640 Main

# take an initial thread dump snapshot of my "Main" Java application
$ jstack 16640 > /tmp/my-thread-dump.txt

# take a second thread dump snapshot and append it to the original
$ jstack 16640 >> /tmp/my-thread-dump.txt

The question then is - how do you analyze it? I recently came across a couple of small, free, easy-to-install, easy-to-run GUI tools that help in analyzing thread dumps. The first is Samurai and the second is TDA (Thread Dump Analyzer). They are easy to download and install:

Samurai is downloaded as a simple samurai.jar file that you start via "java -jar samurai.jar"
TDA is downloaded as a zip file that you unzip which again gives you a simple tda.jar that you start via "java -jar tda.jar"

I like the extremely simple way to install and run these.

Each tool has basically the same way to start - you just tell it to open up a text file that contains one or more thread dumps (such as the "my-thread-dump.txt" file that my example above captured).

Each tool has its own way of displaying the threads:

Samurai Main Screen That Shows Threads and Their States


TDA Main Screen That Shows Threads and Their States

It's fairly obvious how to use these tools, so I won't bore you with details. Just click around, and you'll get it fairly quickly. They both have relatively intuitive user interfaces. There isn't much to them - don't expect any artificial intelligence to analyze your threads - but you can use them to navigate through the stack traces of all the threads more easily than scrolling up and down the base text file.

Notice also that both tools can analyze multiple thread dumps if they detect more than one in your thread dump text file (in my snapshots above, I had three thread dumps). It is excruciating trying to do this by hand (scrolling through multiple thread dumps in a single file) so this is when these tools are really of great help. I also like how Samurai, in one view, shows the states of the threads that are common in the multiple thread dumps. So you can see which threads were running, blocked, etc and which ones changed states between the different thread dumps. However, Samurai loses points because, if thread names are really long, you are forced to scroll horizontally to see the names of the shorter-named threads and to see the states.

One final thing I wanted to show is how they can quickly point out the threads that are deadlocked. This is where manually scanning thread dumps with your eyes can be slow, so these tools can definitely save some time. I took some deadlock code I found online and ran it. Using "jstack", I captured a thread dump and used both tools to see how they report the deadlock. Here's what to expect if you analyze thread dumps that have deadlocks in them:

Samurai Table of Threads Showing The Two That Are Deadlocked

Samurai Showing In Red The Stacktraces of the Deadlocked Threads

TDA Showing The Deadlocked Threads and Their Stacktraces

That's all I wanted to show. It looks like either of these tools can be useful to do simple analysis of thread dumps and to help quickly determine if a thread dump has one or more deadlocks.
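If you want to try the deadlock experiment yourself, a program along the following lines should do. This is not the code used for the screenshots above - it is a minimal sketch, and DeadlockDemo, startDeadlock and deadlockedThreadCount are hypothetical names. It starts two daemon threads that grab two locks in opposite order, and it also shows that the JVM itself can report deadlocked threads through the standard ThreadMXBean API.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

// Hypothetical demo class for illustration only.
final class DeadlockDemo {
    private static final Object LOCK_A = new Object();
    private static final Object LOCK_B = new Object();

    /** Starts two daemon threads that deadlock on LOCK_A and LOCK_B. */
    static void startDeadlock() {
        CountDownLatch bothHoldFirstLock = new CountDownLatch(2);
        spawn("deadlock-demo-1", LOCK_A, LOCK_B, bothHoldFirstLock);
        spawn("deadlock-demo-2", LOCK_B, LOCK_A, bothHoldFirstLock);
    }

    private static void spawn(String name, Object first, Object second, CountDownLatch latch) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                latch.countDown();
                try {
                    // Wait until the other thread also holds its first lock.
                    latch.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                synchronized (second) {
                    // Never reached: each thread waits for the lock the other one holds.
                }
            }
        }, name);
        t.setDaemon(true);
        t.start();
    }

    /** Returns how many threads the JVM currently reports as deadlocked. */
    static int deadlockedThreadCount() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] ids = threads.findDeadlockedThreads();
        return ids == null ? 0 : ids.length;
    }
}
```

Call startDeadlock() from a main method that then sleeps for a while, and you have a process you can point "jstack" at to capture a thread dump containing a deadlock.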

Monday, June 18, 2012

EJB Calltime Monitoring

I wanted to give a brief illustration of RHQ's calltime monitoring feature by showing how it is used to collect EJB calltime statistics. The idea here is that I have an EJB application (in my example, I am using the RHQ Server itself as the monitored application!) and I want to see calltime statistics for the EJB method calls being made. For example, in my EJB application being monitored, I have many EJB Stateless Session beans (SLSBs). I want to see how my EJB SLSBs are behaving by looking at how many times my EJB SLSB methods are called and how efficient they are (that is, I want to see the maximum, minimum and average time each EJB SLSB method took).

So, first what I do is go to my EJB SLSB resources that are in inventory and I enable the "Method Invocation Time" calltime metric. You can do this on an individual EJB resource or you can do a bulk change of all your EJB's schedules by navigating to your EJB autogroup and changing the schedule in the autogroup's Schedules subtab:

At this point, the RHQ Agent will collect the information and send it up to the RHQ Server just like any other metric collection. Over time, I can now examine my EJB SLSB calltime statistics by traversing to my EJB resource's Calltime subtab:

I can even look at an aggregate view of all my EJBs via the EJB autogroup. In the RHQ GUI, you will see in the left hand tree all of my EJB SLSBs are children of the autogroup node "EJB3 Session Beans". If I select that autogroup node and navigate to its Calltime subtab, I can see all of the measurements for all of my EJB SLSBs:

Note that this calltime measurement collection feature is not specific to EJBs, or the JBossAS 4 plugin. This is a generic subsystem supported by any plugin that wants to enable this feature. If you want to write your own custom plugin that wants to monitor calltime-like statistics (say, for HTTP URL endpoints or any other type of infrastructure that has a "duration" type metric that can be collected) you can utilize the calltime subsystem to collect your information and let the RHQ Server store it and display it in this Calltime subtab for your resource.
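To make the "count, minimum, maximum and average per method" idea concrete, here is a small stand-alone sketch of that kind of aggregation. It does not use RHQ's calltime API at all - CallTimeStats, record and statsFor are hypothetical names, and the durations are assumed to be in milliseconds.

```java
import java.util.LongSummaryStatistics;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical aggregator for illustration only; not RHQ's calltime subsystem.
final class CallTimeStats {
    // One running summary (count, min, max, average) per method or URL name.
    private final Map<String, LongSummaryStatistics> byMethod = new TreeMap<>();

    /** Records one call of the given method that took durationMillis. */
    void record(String method, long durationMillis) {
        byMethod.computeIfAbsent(method, k -> new LongSummaryStatistics())
                .accept(durationMillis);
    }

    /** Returns the summary for a method, or an empty summary if it was never called. */
    LongSummaryStatistics statsFor(String method) {
        return byMethod.getOrDefault(method, new LongSummaryStatistics());
    }
}
```

The key here is just a string, so the same shape works whether the thing being timed is an EJB method, an HTTP URL or anything else with a duration.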

Thursday, March 8, 2012

Monitoring Custom Data from DB Queries

There is an interesting little feature in RHQ that I thought I would quickly mention here. Specifically, it's a feature of the Postgres plugin that lets you track metrics from any generic query you specify.

Suppose you have data in your database and you want to expose that data as a metric. For example, suppose you want to track the total number of users that are currently logged into your application and that information is tucked away in some database table that you can query.

Import your Postgres database into RHQ and manually add a "Query" resource under your Postgres Database Server resource (see the image below where the "Import" menu provides you with the names of the resource types you can manually add as a child to the database server resource - in this case, the only option is the Query resource type).

When you "import" this Query resource through the manual add feature, you will be asked for, among other things, the query that you want to execute that extracts your metric data.

Once you do, you'll have a new Query resource in your RHQ inventory that is now tracking your metric value like any other metric (e.g. you will be able to see the historical values of your data in the graph on the Monitoring tab; you'll be able to alert on those values; etc.)

The one quirky thing about this is the query needs to return a single row of two columns - the first column must have a value of "metricColumn" (literally) and the second column must be a numeric value. To follow the earlier example (tracking the number of users currently logged in), it could be something like:

SELECT 'metricColumn', count(id) FROM my_application_user WHERE is_logged_in = true

That's it. A pretty simple feature, but it seems like this could have a wide range of uses. Hopefully, this little tidbit can spark an idea in your head about how you can use this feature while monitoring your systems.

Friday, January 27, 2012

Sending RHQ Alerts over XMPP

Rafael has created a very cool server plugin to allow RHQ to send alerts to his Google account over XMPP. Not only that, but he was able to use that same XMPP channel to send commands to the RHQ Server, like another kind of CLI.

Watch his demo to see how it works. Very awesome!

This is exactly the kind of innovation we envisioned the community being able to do via the server plugins, as I mentioned in my earlier blog entry titled "RHQ Server Plugins - Innovation Made Easy"

Nice job, Rafael!
