AppDomains Won't Protect Host From a Failing Plugin

Several of my recent consulting projects dealt with composite applications, specifically desktop composite applications. A composite application consists of a host (shell) and a number of plugins, often developed by different teams of programmers.

In this scenario it is usually desirable to isolate the host from plugin failures. I often used AppDomains for this purpose. Eventually, I came to the conclusion that AppDomains are not very good isolators, for two main reasons:

Error handling is very difficult to do right.
Unloading plugins is not guaranteed.

Thus, meeting even basic robustness requirements for the application is difficult or even impossible.

This does not mean that AppDomains are useless. They still provide a convenient partitioning mechanism, especially if one team controls all moving parts. The shortcomings outlined in this article may or may not be important for a particular project. If the project is able to tolerate a certain degree of failure, AppDomains may still be a viable isolation solution for it.

The Idea Behind AppDomains

In a nutshell, AppDomains were invented for efficient isolation of third-party code (plugins, components, web applications).

A host process, such as the ASP.NET server, needs to load plugins (web applications) securely and efficiently. Of course, Win32 processes already provide such isolation, but they were deemed too heavyweight for the job, as described in this blog entry by Chris Brumme from Microsoft.

The main difference between an AppDomain and a process is that processes have their own threads, and AppDomains don't. To visualize that, let's say that if threads are like cars, then AppDomains are like countries and processes are like continents. While you drive your Chevrolet Thread in the AppDomain of USA, you can see only American data. You can drive it to the AppDomain of Canada, but the moment you cross the border, you are cut off from American data and can now see only Canadian data. Then, after you're finished with your Canadian plans, you can return your thread to the AppDomain of USA, and even cross into Mexico if required. Your Chevrolet Thread is not pinned to a particular AppDomain. However, no matter what road it takes, it cannot leave the continent of North America (we deliberately ignore the Panama land bridge in order to keep our analogy simple).

Similarly, someone driving a BMW Thread in the process of Europe may cross from the AppDomain of France to the AppDomain of Spain, but they can never reach North America and access American or Canadian data.
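
The border crossing can be observed directly. The following is a minimal sketch, assuming the .NET Framework (AppDomain.CreateDomain is not supported on .NET Core and later); the class name, method name, and the "Canada" domain name are made up for illustration. It asks a second AppDomain to run a callback and prints where the calling thread currently is. The managed thread ID should stay the same while the AppDomain name changes.

```csharp
using System;
using System.Threading;

public static class BorderCrossing
{
    // Prints which thread is running and which AppDomain it is currently in.
    public static void PrintLocation()
    {
        Console.WriteLine("Thread {0} is in AppDomain '{1}'",
            Thread.CurrentThread.ManagedThreadId,
            AppDomain.CurrentDomain.FriendlyName);
    }

    public static void Main()
    {
        // Hypothetical second "country" inside the same process.
        AppDomain canada = AppDomain.CreateDomain("Canada");

        PrintLocation();                  // in the default AppDomain
        canada.DoCallBack(PrintLocation); // same thread, now inside "Canada"
        PrintLocation();                  // back in the default AppDomain

        AppDomain.Unload(canada);
    }
}
```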

Isolation Requirements

To run a reliable, secure, and efficient host, our isolation mechanism should have the following properties:

1. We must be able to load and execute plugins, with restricted security if necessary.
2. Plugins should not be able to corrupt host data.
3. If a plugin fails, the host must be able to detect this and unload the failing plugin.
4. It must be possible to unload plugins on demand.
5. Unloading a plugin should clean up any resources allocated for that plugin. If it does not, the host process will accumulate waste and will eventually fail.
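
As an illustration of the first requirement, here is a rough sketch of how a host might create a restricted AppDomain for a plugin on the .NET Framework. The helper name CreateSandbox and the pluginDirectory parameter are hypothetical, and the permission set shown (execution only) is just one possible choice, not a recommendation.

```csharp
using System;
using System.Security;
using System.Security.Permissions;

public static class SandboxDemo
{
    // Hypothetical helper: creates a partially trusted AppDomain for one plugin.
    public static AppDomain CreateSandbox(string name, string pluginDirectory)
    {
        // The sandboxed domain resolves assemblies relative to this directory.
        var setup = new AppDomainSetup { ApplicationBase = pluginDirectory };

        // Start from an empty grant set and allow only code execution.
        var permissions = new PermissionSet(PermissionState.None);
        permissions.AddPermission(
            new SecurityPermission(SecurityPermissionFlag.Execution));

        // No evidence and no additional full-trust assemblies in this sketch.
        return AppDomain.CreateDomain(name, null, setup, permissions);
    }
}
```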

Operating system processes satisfy all these requirements. Achieving restricted security for a child process may be tricky, but it is typically possible.

Unfortunately, AppDomains do not fare very well with these requirements. They do an excellent job with #1 and #2. One can easily restrict the security of the plugin, and host data is protected. However, we run into major difficulties with #3 and #4. The sad reality is this:

There is no way to reliably detect a failure in an AppDomain.
Even if we could detect it, there is no way to reliably unload a failing AppDomain.

Also, there are some issues with #5. According to Chris Brumme, there is a small memory leak on each AppDomain unload. More importantly, there is no way to unload domain-neutral assemblies: once loaded into the process, they are there to stay. This, however, looks minor compared to the problems we have with exception handling.

Legacy vs. Default Exception Handling

Default Exception Handling

By default, an unhandled exception in any thread terminates the application unconditionally. This is bad news for runtime hosts. If a plugin creates a thread and that thread causes an unhandled exception, the whole host process dies. We can do last-ditch error handling in an AppDomain.UnhandledException handler, but termination of the process cannot be stopped.
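
A small sketch of this behavior, assuming the default policy is in effect. The class name and the exception message are invented for illustration; the worker thread stands in for a thread spawned by a plugin.

```csharp
using System;
using System.Threading;

public static class UnhandledDemo
{
    public static void Main()
    {
        // Last-chance handler: it can log, but it cannot keep the process alive.
        AppDomain.CurrentDomain.UnhandledException += (sender, e) =>
        {
            Console.Error.WriteLine("Unhandled: {0} (terminating: {1})",
                e.ExceptionObject, e.IsTerminating);
        };

        // Stand-in for a worker thread created by a plugin.
        var worker = new Thread(() =>
        {
            throw new InvalidOperationException("plugin worker failed");
        });
        worker.Start();
        worker.Join();

        // Under the default policy this line is not expected to be reached.
        Console.WriteLine("Host is still alive");
    }
}
```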

In WPF and Windows Forms applications, UI threads can be protected from unhandled exceptions, because they have a built-in try/catch block supplied by the UI framework. However, worker threads lack such protection. In a desktop application it is considered best practice to perform long operations on a worker thread. So, the scenario where a plugin spawns a worker thread and that thread causes an unhandled exception is very real. This makes the default exception handling policy a bad choice for a host-plugin architecture.

Legacy Exception Handling

Fortunately, default exception handling is not the only option. Prior to .NET 2.0, unhandled exceptions in worker threads did not automatically kill the process. To revert to this legacy behavior, we can add the following snippet to the application configuration file:

<configuration>
   <runtime>
      <legacyUnhandledExceptionPolicy enabled="1"/>
   </runtime>
</configuration>

Unfortunately, this still does not buy us full protection from plugin failures - read on.

Exception! Whose Fault Is That?

To effectively unload the crashing plugin we must first detect which plugin has crashed. Frankly, even with legacy exception handling this is virtually impossible.

When an unhandled exception occurs, the framework raises the AppDomain.UnhandledException event. Each AppDomain may have its own UnhandledException handler. In a typical scenario, UnhandledException will first be raised in the failing AppDomain and then again in the main AppDomain. This works reasonably well if the exception type is [Serializable]. But if it's not, by the time execution reaches the main AppDomain things become muddy:

The original exception is replaced with a SerializationException.
Information about the AppDomain that caused the exception is lost.
A parasitic SerializationException will be thrown in the main AppDomain.

SerializationException contains surprisingly little information about what happened. At this point it is indistinguishable from a genuine unhandled SerializationException that could have occurred in the host itself.

The original idea of the AppDomain.UnhandledException design was perhaps to allow the main AppDomain to process all unhandled exceptions regardless of origin. In practice that goal was not achieved. It is also worth noting that most user-defined exception classes will not be marked as [Serializable], simply because application programmers don't see a need to do that.

The host may try to pass exception information from the plugin's AppDomain using some custom method. For example, the UnhandledException handler in the plugin's AppDomain can explicitly call a centralized exception monitor object located in the main AppDomain, passing it only serializable objects such as the plugin's AppDomain name and the exception string. This scheme, however, would still be prone to failure, because the plugin's AppDomain may be in an unknown state after an unhandled exception, and successful communication with the host's exception monitor cannot be guaranteed. A mechanism supported by the framework is required for reliable operation, but such a mechanism does not exist.
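
The custom scheme described above might look roughly like the following sketch. All type and member names here (ExceptionMonitor, PluginBootstrap, Report, Attach, "PluginA") are hypothetical, and the code assumes the .NET Framework. As noted, nothing guarantees that the call to the monitor succeeds once the plugin's AppDomain is in an unknown state, so the handler simply swallows any failure.

```csharp
using System;

// Lives in the main AppDomain; plugin domains reach it through a remoting proxy.
public sealed class ExceptionMonitor : MarshalByRefObject
{
    public void Report(string domainName, string exceptionText)
    {
        Console.Error.WriteLine("Unhandled exception in '{0}': {1}",
            domainName, exceptionText);
    }

    // Returning null keeps the remoting lease from expiring.
    public override object InitializeLifetimeService() { return null; }
}

// Instantiated inside the plugin's AppDomain by the host.
public sealed class PluginBootstrap : MarshalByRefObject
{
    private ExceptionMonitor _monitor;

    public void Attach(ExceptionMonitor monitor)
    {
        _monitor = monitor;
        AppDomain.CurrentDomain.UnhandledException += OnUnhandled;
    }

    private void OnUnhandled(object sender, UnhandledExceptionEventArgs e)
    {
        try
        {
            // Pass only strings, which are always serializable.
            _monitor.Report(AppDomain.CurrentDomain.FriendlyName,
                e.ExceptionObject.ToString());
        }
        catch
        {
            // The domain may be too damaged to reach the host; nothing more to do.
        }
    }
}

public static class Host
{
    public static void Main()
    {
        var monitor = new ExceptionMonitor();
        AppDomain pluginDomain = AppDomain.CreateDomain("PluginA");

        var bootstrap = (PluginBootstrap)pluginDomain.CreateInstanceAndUnwrap(
            typeof(PluginBootstrap).Assembly.FullName,
            typeof(PluginBootstrap).FullName);
        bootstrap.Attach(monitor);

        // ... load and run the plugin inside pluginDomain ...
    }
}
```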

Unloading a Failing Plugin

Even if we managed to figure out which plugin is causing trouble, this is not the end of the story. There is no way to gracefully unload a plugin that is in an unknown state.

If a plugin is executing native code that cannot be interrupted (e.g., file I/O), it will not be unloaded at all. AppDomain.Unload() will fail with an exception similar to this:

System.CannotUnloadAppDomainException: Error while unloading appdomain. (Exception from HRESULT: 0x80131015)
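
A host can at least guard against this failure. The sketch below uses a hypothetical helper, TryUnload, that reports whether the unload succeeded. What to do when it returns false (retry later, restart the process, and so on) is a policy decision this sketch does not attempt to make.

```csharp
using System;

public static class PluginUnloader
{
    // Hypothetical helper: returns false if the runtime refuses to unload the domain.
    public static bool TryUnload(AppDomain pluginDomain)
    {
        try
        {
            AppDomain.Unload(pluginDomain);
            return true;
        }
        catch (CannotUnloadAppDomainException ex)
        {
            // Typically a thread in the domain could not be interrupted.
            Console.Error.WriteLine("Could not unload '{0}': {1}",
                pluginDomain.FriendlyName, ex.Message);
            return false;
        }
    }
}
```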

If a plugin is running background threads, they will be aborted with ThreadAbortException. In default exception handling mode this exception will then be quietly swallowed by the framework. However, in legacy exception handling mode it will raise AppDomain.UnhandledException in the main AppDomain with AppDomainUnloadedException.

Again, AppDomainUnloadedException carries surprisingly little information. In particular, it does not say which AppDomain was unloaded. Therefore, it is impossible to figure out whether this is an expected exception from dying background threads of a plugin that is being unloaded, or some other peculiar error.

ASP.NET uses AppDomains. How Does It Survive?

Experiments show that ASP.NET takes a hands-off approach to reliability. Each application pool runs a worker process (w3wp.exe). Each web application in the pool runs in an AppDomain. When an application causes an exception on a worker thread, the whole process dies, taking down all the other, perfectly good applications with it. If those applications were processing web requests, these requests will be remembered. ASP.NET will then create a new worker process and pass it the cached requests (if any) for handling.

This approach works relatively well, mostly because the Web is stateless. Any state passed between requests, such as cookies, is small and well-defined. The demise and resurrection of the ASP.NET worker process remains invisible to the user or the application programmer, unless they take special steps to detect it.

Obviously, such a hands-off approach is not viable for a desktop application: restarting the whole application and losing unsaved data when a single plugin fails would not be welcomed by users.

Conclusion

AppDomains provide a certain degree of isolation between parts of the application, but this isolation is limited. A number of design decisions and features of the .NET Framework make proper error handling very difficult. Exceptions pop up in unexpected places, and exception objects carry very little context with them.

Unloading plugins is not guaranteed. This is hardly the framework designers' fault: Windows threads were not designed to be gracefully interruptible, but this is little consolation to application authors.

Depending on the requirements, AppDomains can still be very useful, especially if efficiency is more important than absolute reliability, as in the case of ASP.NET.

However, for a truly isolated application one may want to consider using processes instead of AppDomains, as in Baktun Shell. Unfortunately, this is not a panacea either: multi-process desktop applications are not mainstream, and many unexpected pitfalls may arise, especially when using third-party libraries.

For better or for worse, such is the nature of software development: there is no easy way out; it is all about tradeoffs.

Ivan Krivyakov