Sunday, January 25, 2009

Classloaders Keeping Jar Files Open

If you write code that creates classloaders, you need to know about this bug:

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=5041014

It is very insidious, and I just ran into it myself in some code.

You normally only have to worry about this if you are writing code that creates and destroys classloaders. For example, you might have a pluggable architecture where each pluggable component, packaged in a jar file, gets its own classloader, and you want that component to be hot-deployable (that is, you want to be able to overwrite or modify its jar file with updated code). In Jopr's case, this happens on the agent: each "product plugin" (e.g. the JBossAS plugin, the Postgres plugin, etc.) has its own classloader, managed separately and kept independent of other plugin classloaders (there is a dependency model in place, but ignore that for this discussion).

Well, this VM bug is so bad that it seems any time a classloader loads a jar file, that jar file's file descriptor remains open for the lifetime of the VM (in other words, the classloader never calls JarFile.close() on the jar files it previously streamed content from). At least that's what the bug report implies, and what I saw while debugging this. There is a nifty tool from Timothy Quinn that he used to track issues in Glassfish, but it is useful for tracking this kind of problem in any application, not just Glassfish; in fact, I used it to debug the issue in the Jopr agent. The bug manifested itself in the Jopr agent when hot-deploying agent plugins on Windows (Windows has the "feature" of not letting you manipulate files that are locked by others). I suspect similar issues will occur on UNIX because, even though UNIX doesn't lock files the way Windows does, the file descriptors are still open, and copying a file with the same name over the opened file will probably just result in a second file descriptor.

The worst part about this is that there is no real workaround. The Jopr agent has its own classloader implementation; it is very basic and extends java.net.URLClassLoader to reuse most of its functionality. But URLClassLoader has no public, protected or package-scoped method or data field that you can override or access to help work around the problem.

The actual fix is simple: when you know you are done with a classloader, have that classloader close all the .jar files it previously opened. Alas, there is no "close" method on the classloader object; there is absolutely no way to tell a classloader "I am done with you, clean up any resources you have open".
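For readers on newer Java releases: Java 7, which came out after this post was written, made java.net.URLClassLoader implement java.io.Closeable and added a public close() method that closes the jar files the loader opened. A minimal sketch, assuming Java 7 or later; the class CloseableLoaderDemo and its jarContains() helper are hypothetical names used only for illustration:

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Path;

// Hypothetical demo class; needs Java 7+, where URLClassLoader implements Closeable.
class CloseableLoaderDemo {

    /**
     * Creates a short-lived classloader over a single jar, checks whether the
     * jar contains the named resource, then closes the loader so the jar file
     * it opened is released before this method returns.
     */
    static boolean jarContains(Path jar, String resourceName) throws IOException {
        URL[] urls = { jar.toUri().toURL() };
        // try-with-resources calls loader.close() on exit, even if an exception is thrown
        try (URLClassLoader loader = new URLClassLoader(urls, null)) {
            // findResource() searches only this loader's own URLs, not its parent
            return loader.findResource(resourceName) != null;
        }
    }
}
```

Once close() has returned, the jar should be free to be overwritten or deleted. Classes already loaded through the closed loader keep working, but the loader cannot load anything new.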

Once a classloader opens a jar file, that jar file's file descriptor stays open for the lifetime of the VM. I find this completely unacceptable; this is clearly a design flaw that slipped through the cracks when the Java API was conceived and implemented. To support hot-deployable Java code, you need to destroy and recreate classloaders, and the current Java implementation does not make that easy (requiring people to write their own classloader implementations from scratch hardly meets the definition of "easy to do", and doesn't that defeat the purpose of OO and code reuse anyway?).

So, how do you support hot-deployable code and not see this bug? There are two main ways to do this as I see it:

1) write your own classloader implementation that allows you to close the open file descriptors when the classloader is no longer needed
2) copy the jar files that a classloader needs to a temporary location and give the classloader the temporary jars (NOT the original jar files). When you need to hot-deploy an updated jar file, simply copy the new jar to a new temporary location, throw away the old classloader (which still has a file descriptor open, but it's on the old temporary jar file) and create a new classloader that opens the new temporary jar file. This sucks because if you hot-deploy frequently, you may run into your limit on the number of open file descriptors (along with the problem Windows presents: you can't delete the old temporary jar files until your VM exits).
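Method #2 can be sketched roughly as follows. This is a minimal sketch, not the Jopr agent's actual code; the class TempCopyDeployer, its deploy() method and the "plugin-" file name prefix are hypothetical:

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/**
 * Hypothetical sketch of workaround #2: never hand the original jar to a
 * classloader. Each (re)deployment copies the jar to a fresh temporary file
 * and builds a new classloader over that copy, so the original jar is never
 * held open and can be overwritten with an updated version.
 */
class TempCopyDeployer {
    private final Path tempDir;

    TempCopyDeployer(Path tempDir) {
        this.tempDir = tempDir;
    }

    /** Copies pluginJar to a new, uniquely named temp file and returns a loader over the copy. */
    URLClassLoader deploy(Path pluginJar, ClassLoader parent) throws IOException {
        Path copy = Files.createTempFile(tempDir, "plugin-", ".jar");
        Files.copy(pluginJar, copy, StandardCopyOption.REPLACE_EXISTING);
        // Ask for the copy to be removed at VM exit; on Windows this may not
        // succeed if a discarded loader still holds the file open.
        copy.toFile().deleteOnExit();
        return new URLClassLoader(new URL[] { copy.toUri().toURL() }, parent);
    }
}
```

To hot-deploy, call deploy() again with the updated jar, drop every reference to the previous loader, and start using the new one. The trade-off is the one described above: each discarded loader may keep its temporary copy open.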

Anyway, here is some code you can use to "work around" this issue. It is a major hack: it only works if you are running in a SUN VM, and because it relies on the implementation of internal SUN classes, it may break in the future should SUN decide to change how these classes are implemented (the good thing about this code, however, is that it has no compile-time dependencies on any SUN-specific classes). I tested this code on SUN's Java 6 JRE.

This method needs to be placed in your classloader that extends URLClassLoader. It uses reflection to walk the jar loaders held in a private data member (URLClassLoader.ucp.loaders) of the classloader you want to discard, and closes the JarFile each of them has open. After running this code, I verified that no jar files were left open.


public void close() {
    try {
        // Sun JDK internals: URLClassLoader.ucp is a sun.misc.URLClassPath whose
        // private "loaders" list holds one loader per classpath entry
        Class<?> clazz = java.net.URLClassLoader.class;
        java.lang.reflect.Field ucp = clazz.getDeclaredField("ucp");
        ucp.setAccessible(true);
        Object sun_misc_URLClassPath = ucp.get(this);
        java.lang.reflect.Field loaders =
            sun_misc_URLClassPath.getClass().getDeclaredField("loaders");
        loaders.setAccessible(true);
        Object java_util_Collection = loaders.get(sun_misc_URLClassPath);
        // iterate over a snapshot array rather than the live collection
        for (Object sun_misc_URLClassPath_JarLoader :
                ((java.util.Collection<?>) java_util_Collection).toArray()) {
            try {
                // a JarLoader keeps its open JarFile in a private field named "jar"
                java.lang.reflect.Field loader =
                    sun_misc_URLClassPath_JarLoader.getClass().getDeclaredField("jar");
                loader.setAccessible(true);
                Object java_util_jar_JarFile =
                    loader.get(sun_misc_URLClassPath_JarLoader);
                ((java.util.jar.JarFile) java_util_jar_JarFile).close();
            } catch (Throwable t) {
                // if we got this far, this is probably not a JAR loader so skip it
            }
        }
    } catch (Throwable t) {
        // probably not a SUN VM, or its internals have changed
    }
}



If you happen to be using JNI (native libraries), you might also have to play games like the above to unload the native libraries the classloader loaded (the same caveats about relying on SUN implementation code apply). You can add this code inside the outer try block of the close() method above:



    // now do native libraries
    clazz = ClassLoader.class;
    java.lang.reflect.Field nativeLibraries = clazz.getDeclaredField("nativeLibraries");
    nativeLibraries.setAccessible(true);
    java.util.Vector<?> java_lang_ClassLoader_NativeLibrary =
        (java.util.Vector<?>) nativeLibraries.get(this);
    for (Object lib : java_lang_ClassLoader_NativeLibrary) {
        // in Sun's Java 6 implementation, NativeLibrary.finalize() unloads the library
        java.lang.reflect.Method finalize =
            lib.getClass().getDeclaredMethod("finalize", new Class[0]);
        finalize.setAccessible(true);
        finalize.invoke(lib, new Object[0]);
    }



But even if you do this, I'm still not sure everything will work, due to yet more SUN VM bugs (I think these are all basically the same bug):

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4299094
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4642062
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4286309

In the end, the Jopr agent didn't really need to do the above. I found that the Jopr agent code was creating temporary classloaders unnecessarily, and those were locking the plugin jars. Once I stopped creating those unnecessary classloaders, agent hot-deployment worked just fine, since the plugin jars no longer got locked. For the record, the Jopr agent uses method #2 described above for its hot-deployment.

Saturday, January 10, 2009

Jopr Agent Auto Update Complete

I have completed the agent auto-update functionality. It gives a Jopr agent the ability to automatically detect that it needs to be updated and to update itself without the need for manual intervention.

The cool thing about this is that it is completely cross-platform! I've tested it on Windows and Linux, and I see no reason why it wouldn't work on other UNIX flavors such as HP-UX, AIX and MacOS.

Here's the basics of how it works:

When a Jopr Agent tries to connect or register with a Jopr Server, that server verifies the version of that agent. If the agent is not a compatible version, the server will forbid that agent from connecting/registering and will tell the agent it needs to update itself.

At this point, the agent will shut down all of its internals, download the latest agent update binary (either from the server or from some other download location previously configured in the agent), and fork another Java VM that will unpackage the new agent binary and update the old agent with it. The old agent will shut down its VM and the new agent VM will be started.
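The "fork another Java VM" step can be sketched with java.lang.ProcessBuilder. This is a minimal sketch, not the actual RHQ/Jopr code: the class UpdaterLauncher, the assumption that the update binary is an executable jar launched with -jar, and the updater arguments are all hypothetical:

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch: start the updater in a separate VM that uses the same
 * Java installation as the running agent. The updater jar and its arguments
 * are assumptions for illustration only.
 */
class UpdaterLauncher {

    /** Builds: <java.home>/bin/java -jar <updateJar> <updaterArgs...> */
    static List<String> buildCommand(Path updateJar, List<String> updaterArgs) {
        String exe = System.getProperty("os.name").startsWith("Windows") ? "java.exe" : "java";
        String javaExe = System.getProperty("java.home") + File.separator + "bin" + File.separator + exe;
        List<String> cmd = new ArrayList<>();
        cmd.add(javaExe);
        cmd.add("-jar");
        cmd.add(updateJar.toAbsolutePath().toString());
        cmd.addAll(updaterArgs);
        return cmd;
    }

    /** Starts the updater VM; the caller is then expected to shut down its own VM. */
    static Process launch(Path updateJar, List<String> updaterArgs, File logFile) throws IOException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(updateJar, updaterArgs));
        // Merge stderr into stdout and append both to a log file, so the child's
        // output does not depend on the parent agent VM reading it after it exits.
        pb.redirectErrorStream(true);
        pb.redirectOutput(ProcessBuilder.Redirect.appendTo(logFile));
        return pb.start();
    }
}
```

Building the command from java.home means the updater runs on the same JVM the agent was using, without relying on whatever "java" happens to be on the PATH.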

From an administrator's point of view, this all happens under the covers and automatically and the agent just looks like it goes offline for a minute or two before coming back online.

Related to this is the addition of several features to the agent plugin. The agent resource metadata now includes several more child services that let you configure your agent without having to manually log onto the agent box (i.e. we are using Jopr to manage Jopr!). You can now even change the agent's JVM settings and restart the agent with those new settings (in case you need to change an -Xmx option, for example). You can read this wiki page to learn about these new plugin features.

All of this code will be forthcoming in our next RHQ/Jopr release.

Here's some additional documentation you can read if you are curious:

http://www.rhq-project.org/display/JOPR2/RHQ+Agent+Installation#RHQAgentInstallation-PreparingYourAgentToBeAutoUpdatable
http://www.rhq-project.org/display/RHQ/Design-AgentAutoUpdate