Error Kernels

In any computer system, we can split the system into two parts: the part which must be correct no matter what happens, and the part where we don’t need the system to be correct all the time.

A typical example of this is the operating system kernel. We trust the operating system kernel to be correct, but we don’t a priori trust the userland applications. If one of those fails, the kernel can take over, reap the process by killing it, and reclaim the resources the process used. This works because of memory protection and the fact that the kernel controls the memory space.

For programs, there is often a similar piece of the system which has to be correct at all times. But there are also computations whose individual success we don’t care about. The only thing we care about is that if the system transitions from a state S1 and reaches a state S2, then this new state is consistent. And if we fail partway through the transition, we can clean up by removing the failing computation from the system.
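To make the S1-to-S2 idea concrete, here is a minimal sketch in Python. It is only an illustration: the `Kernel` class, its `transition` method, and the two step functions are hypothetical names invented for this example. The step runs on a private copy of the state, and the kernel commits the result only if the step finishes without raising.

```python
import copy


class Kernel:
    """Hypothetical error kernel: the only code allowed to commit state."""

    def __init__(self, state):
        self._state = state

    @property
    def state(self):
        return self._state

    def transition(self, step):
        # Run the untrusted step on a private copy, so a failure cannot
        # leave the committed state half-updated.
        scratch = copy.deepcopy(self._state)
        try:
            candidate = step(scratch)
        except Exception:
            # The failing computation is simply dropped; S1 stays in place.
            return False
        # S1 -> S2 happens in a single assignment.
        self._state = candidate
        return True


def deposit(state):
    state["balance"] += 10
    return state


def buggy_transfer(state):
    state["balance"] -= 100     # partial update on the scratch copy...
    raise RuntimeError("boom")  # ...and then the computation fails


kernel = Kernel({"balance": 50})
ok_first = kernel.transition(deposit)          # commits the new state
ok_second = kernel.transition(buggy_transfer)  # discarded, state untouched
```

Note how little code has to be right here: only `transition` needs care, while the step functions can be as sloppy as `buggy_transfer` without corrupting the committed state.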

Smart programmers will identify the Error Kernel of their computer system. This is the part of the code base which MUST NOT fail in any way. And then they will seek to make that kernel as small as possible. This corresponds to the concept of limiting the trusted computing base of a system.

Once you have identified the Error Kernel, you can design your system in a way such that you protect this part of the system the most. If you do it correctly, you should have very few lines of code which need the utmost correctness. And better yet, you can often keep performance-critical and complex code outside of the kernel.

In most languages, the way to control the Error Kernel is by the use of an exception. The idea is that code proceeds along a given path. If something goes wrong along that path, then the code raises an exception which is then handled elsewhere to clean up afterwards.
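As a small illustration of that pattern, consider the following Python sketch. All names in it (`Connection`, `parse_port`, `handle_request`) are made up for the example. The parsing code walks the happy path and raises when something is off; the caller is the place where the error is handled and the cleanup happens.

```python
class Connection:
    """Hypothetical resource that must be released on every path."""

    def __init__(self):
        self.is_open = True

    def close(self):
        self.is_open = False


def parse_port(text):
    # The happy path: no local recovery, just raise if something is wrong.
    port = int(text)  # raises ValueError on non-numeric input
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port


def handle_request(text, conn):
    try:
        return parse_port(text)
    except ValueError:
        # The error is handled here, away from where it was raised.
        return None
    finally:
        # Cleanup runs whether or not the path succeeded.
        conn.close()


good_conn, bad_conn = Connection(), Connection()
good = handle_request("8080", good_conn)
bad = handle_request("eighty", bad_conn)
```

The catch is visible even in this toy: `handle_request` only copes with `ValueError` because somebody foresaw it.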

The key in exception handling code is that you must foresee the possibility of an error. Some languages use checked exceptions in order to force you to handle all possible exceptions. Others use the type system by turning the effect into a monad and then force handling through the type system. And yet others rely on human oversight to make sure that all exceptional paths are handled correctly.

Erlang uses a different mechanism, albeit one slightly related to the exception mechanism of most languages. In Erlang, an error crashes the process, and you let another process handle the error. Done correctly, this has a number of interesting consequences:

Your program will automatically have stop-gap measures which limit faults whenever you have a concurrent activity. And you can use processes to build compartments which encapsulate errors in the system. The key tools here are persistence and isolation of individual Erlang processes.
Even if you forget to handle an error in an individual process, you can rely on the default behavior, where the process crashes and another process picks up the pieces, to protect the system from total failure. This is the primary reason why Erlang copes well with unforeseen errors in the system.
You can omit writing defensive code in a lot of places. Any process which operates outside the error kernel does not need to protect itself by being defensive. This does not remove the need to check that operations succeed, but it removes the code path that tries to cope with the error locally. Rather, if something goes wrong, you “nuke it from orbit” and start all over again later or with a different premise.
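
The crash-and-restart stance can be approximated outside Erlang as well. The sketch below uses Python and ordinary operating system processes as a rough stand-in for Erlang processes; it is an analogy under that assumption, not how Erlang itself is implemented. The names `WORKER`, `run_worker`, and `supervise` are hypothetical. The worker contains no defensive code at all, and the supervising function just notes the crash and tries again with a different premise.

```python
import subprocess
import sys

# The worker is deliberately non-defensive: bad input simply crashes it.
WORKER = "import sys; print(100 // int(sys.argv[1]))"


def run_worker(arg):
    """Run the worker in its own OS process; return (exit_code, output)."""
    proc = subprocess.run(
        [sys.executable, "-c", WORKER, arg],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stdout.strip()


def supervise(premises):
    """Try each premise in turn until one worker finishes cleanly."""
    for arg in premises:
        code, out = run_worker(arg)
        if code == 0:
            return out
        # The worker crashed. There is nothing to repair locally: the dead
        # process took its state with it, so move on to the next premise.
    return None


# The first two workers crash (division by zero, unparsable input);
# the third one succeeds.
result = supervise(["0", "not-a-number", "4"])
```

Because each worker lives in its own process, a crash cannot damage the supervising code, which is the only part that has to stay correct in this sketch.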

The notion of the Error Kernel is not confined to Erlang. It is present in any program you will write. Thinking about it tends to yield more robust system designs, and I encourage any programmer to identify it for their programs.