
On Observability

The hard part is not debugging the code. The hard part is figuring out where the bug is. This is what observability is.

— Charity Majors

I really like the notion of “observability”, which stems from Control Theory. The idea is that we have a system with inputs, some internal state, and outputs. A system is observable if we can determine its internal state solely from a finite set of outputs (in finite time).

By contrast, a non-observable system has internal state we cannot infer just by looking at its outputs over some time frame.
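As an aside, for linear time-invariant systems this notion has a precise, classical characterization: with state evolution \(\dot{x} = Ax + Bu\), output \(y = Cx + Du\) and \(n\) state variables, the internal state can be reconstructed from the outputs exactly when the observability matrix has full rank:

    \mathcal{O} =
    \begin{bmatrix} C \\ CA \\ CA^{2} \\ \vdots \\ CA^{n-1} \end{bmatrix},
    \qquad
    \text{observable} \iff \operatorname{rank}(\mathcal{O}) = n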

Credit should be given to Charity Majors for transplanting this idea into tech (thank you!), and also creating a whole company around the idea (see http://honeycomb.io if you are interested).

When we look at program output, we need to take a (w)holistic approach. There are the immediate replies to user requests; but there are also log lines, trace probes, metrics and so on. Everything can be considered an output, even if it is a side-effect of the primary computation. And that set is what we look at when we try to determine what is happening inside a running system.

Most computer systems are strictly non-observable. We don’t log the relevant information, we don’t have the relevant metrics, and we don’t have the ability to trace arbitrarily. As a result, we have no chance when a system misbehaves. In many cases we won’t even know that it misbehaved at all.

Logs, Metrics and Tracing are necessary, but certainly not sufficient, for achieving observability in a system. People will apply all kinds of aggregations, filters, reductions and so on to their data in order to cut down on its volume, and each of these throws away information we may later need to reconstruct the internal state.

Logs tend to have all kinds of problems associated with them. The good old syslog system has the problem that it treats each line as a separate event, and it imposes no structure on the log lines. Asking questions of such logs is hell. A little better is when the log lines have structure, even if that structure is flat. But the best situation is if you just log an Erlang term, an S-expression or the like, so you have all relevant information in the log line.
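As a rough sketch of the latter approach, using OTP's logger (the module and field names below are made up for illustration), a single structured event can carry all of its context as one Erlang term:

    -module(structured_log).          %% hypothetical module name
    -export([payment_failed/3]).

    %% Log one event as a single structured report (a map), rather than
    %% several free-form text lines. All relevant context travels with
    %% the event and can be queried later.
    payment_failed(UserId, OrderId, Reason) ->
        logger:error(#{event => payment_failed,
                       user_id => UserId,
                       order_id => OrderId,
                       reason => Reason}).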

Metrics tend to aggregate. You know you had an error, but you don’t know any context. You also tend to have a temporal problem, in which you take 30 seconds of errors and store them in one value. This makes spikes impossible to detect: a one-second burst of 300 errors and a steady 10 errors per second look identical once rolled up into a single 30-second value.

When people say tracing, they often mean “We added this static set of probe points to our code base, and we want to output all of those when we increase the log level.” People don’t enable this because it kills their production servers with on-disk log writes.

How we transplant the notion from control theory is a bit vague:

  • We could treat the notion as a continuous axis, where systems can converge toward being (fully) observable.

  • We could treat the notion as a discrete property. Either you have a system which is, or you have a system which isn’t.

  • We could accept a discrete notion, but with partiality. In some cases our system is observable, but not in others. Then define a fully observable system as one which is observable in all cases.

Personally, I lean toward the last of these. Suppose you have a fault in your software. If the fault can be found and fixed by looking at your system’s output alone, then the system was observable in this case. If, on the other hand, you need to reproduce the error in a development environment, attach a debugger, step through the program and scratch your head for several hours, the system was not observable.

I stress this is on a case-by-case basis. A system is 80% fault-observable if 80% of all faults are observable according to the above notion.

Crash Logs

In Erlang systems, faults generate crashes. A crash log contains:

  • The current stack trace of the failing process

  • The last event the process received

  • The state of the process before processing said event

My experience is that often, this is enough to provide observability in a fault scenario. You can work from the state and figure out how the event might have led to the stack trace. In particular, you can often figure out what code path was taken and how that would lead to the faulty situation.
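As a small illustration (the module and message names are made up), consider a gen_server that crashes on a bad request. The crash report it produces contains exactly the three items above: the last message received, the state at the time, and the stack trace of the failing call.

    -module(div_server).              %% hypothetical example module
    -behaviour(gen_server).
    -export([start_link/0, divide/2]).
    -export([init/1, handle_call/3, handle_cast/2]).

    start_link() -> gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).
    divide(X, Y)  -> gen_server:call(?MODULE, {divide, X, Y}).

    init([]) -> {ok, #{requests => 0}}.

    %% divide(1, 0) raises badarith here; the resulting crash report
    %% shows the last message {divide, 1, 0}, the state map, and the
    %% stack trace, which together pinpoint the faulty code path.
    handle_call({divide, X, Y}, _From, State = #{requests := N}) ->
        {reply, X div Y, State#{requests := N + 1}}.

    handle_cast(_Msg, State) -> {noreply, State}.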

Most other systems have a state space that is several gigabytes in size. Small dumps like these are then impossible, since we cannot find the relevant piece of information in that much state. In contrast, process isolation in Erlang can often limit us to the core state around the fault.

What you can and should do, however, is to take all core dumps and move them into persistent storage. If your core dump dies with your container, you have no post-mortem debugging and you will be none the wiser. In some situations, if you can detect the fault, you can force the core dump so you have a state you can inspect.

Assertions

If you assert your invariants in your code, then Erlang systems will crash whenever an invariant is broken, which leads to crash logs for the faulty process. Erlang systems often assert their own processing as they execute. This vastly increases the number of cases in which the system is observable. As an example, suppose you open a file on disk. You assert that you successfully opened the file. If any error occurs, this produces output which allows you to observe the fault.

The key insight is that you don’t know in advance which kind of fault will occur. It could be that the file does not exist. Or that you don’t have access to it. So by asserting on any non-successful return, you get to learn important information necessary for reconstructing the internal state. This information is added to the context of the crash log. In Erlang systems you often see {ok, _} = file:open(FName, [read]), which asserts the intended operation.
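A minimal sketch of this style (the module name is hypothetical): the pattern match is the assertion, and a failed open crashes the process with the actual error term carried into the crash log.

    -module(config_reader).           %% hypothetical module name
    -export([open/1]).

    %% Assert the intended operation. Any non-successful return, e.g.
    %% {error, enoent} or {error, eacces}, raises a badmatch carrying
    %% the actual reason, which ends up in the crash log.
    open(FName) ->
        {ok, Fd} = file:open(FName, [read, binary]),
        Fd.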

Dynamic Tracing

Tools such as DTrace, eBPF, and the built-in Erlang tracer can make a system observable. If a fault is detected, you can trace it in detail and capture enough information about the fault that it becomes observable.

Note that the system doesn’t start out observable. You often encounter the fault and scratch your head. Then you add tracing specific to a user id, or to a type of request. This trace is what changes the system, dynamically, from a non-observable one to one that is. Once the fault has been dealt with, you can go back and disable tracing.

Tracing has to be dynamic, though. You cannot just add some extra lines to the code and then redeploy to capture the problem. You need to be able to change the system while it is in operation, and without having an impact on it. The reason this is important is that the system might reset itself under a redeploy. Suppose you have a data structure which sometimes ends up in a pathological state, making your system slow. If you redeploy, you reset this data structure, so now you cannot figure out why it is slow. This is why you need to be able to query the system dynamically.

Also, the impact of adding tracing must be proportional to the needle-in-a-haystack problem you have. If you add tracing for a specific customer, it cannot be allowed to affect any other customer in the system. It might take weeks before we hit the fault again, so we need to have this enabled for a while. If tracing impacts the system’s efficiency, people won’t enable it in production. And all the really interesting errors occur in production.
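With the built-in Erlang tracer this can be done from a shell attached to the running node. A hedged sketch (the module, function and user id below are made up): trace only calls whose first argument is the customer we care about, and turn tracing off again afterwards.

    %% From a shell on the live node:
    dbg:tracer(),                          %% start the default tracer
    dbg:p(all, call),                      %% enable call tracing in all processes
    dbg:tpl(my_handler, handle_request,    %% set a local trace pattern
            [{[<<"user-123">>, '_'],       %% only calls where the first argument
              [],                          %% is this specific user id
              [{return_trace}]}]),         %% and also trace the return value
    %% ... wait for the fault to show up, read the trace messages ...
    dbg:stop_clear().                      %% turn tracing off and clear the patterns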

Cardinality

Consider the following:

  • I know a web request happened

  • I know a web request happened, and I know if it was 2xx, 4xx, 5xx or something else

  • I know a web request happened, and I know the exact status code

The last of these can be used to derive the other two, but not vice versa. The last case also needs to store more information, because it has to discriminate between the exact status codes. Now consider:

  • I know the user id of the web request

If we have a million users, the space of possible values has a cardinality of a million. Storing this efficiently is non-trivial in most current systems. However, it is paramount if we want to end up with a system we can observe.
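A small sketch of the status-code case (the module and table names are made up): count requests per exact status code, and derive the coarser 2xx/4xx/5xx view from the exact counts when needed. Going the other way is not possible.

    -module(status_counts).               %% hypothetical module name
    -export([new/0, bump/1, class_total/1]).

    %% One counter per exact status code, kept in an ETS table.
    new() ->
        ets:new(?MODULE, [named_table, public, set]).

    bump(StatusCode) ->
        ets:update_counter(?MODULE, StatusCode, 1, {StatusCode, 0}).

    %% Derive the 2xx/4xx/5xx view from the exact counts, e.g.
    %% class_total(5) sums every 5xx code. The exact codes cannot be
    %% recovered from the class totals, so store the high-cardinality
    %% value and aggregate afterwards.
    class_total(Class) when Class >= 1, Class =< 5 ->
        ets:foldl(fun({Code, N}, Acc) when Code div 100 =:= Class -> Acc + N;
                     (_, Acc) -> Acc
                  end, 0, ?MODULE).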