jlouis' Ramblings

Musings on tech, software, and other things

Systems in Production

You embark upon the journey. You build the system. You test the system. The system seems to work. You spend time deploying the system into production. And then what?

The maintenance period of a project is often measured in years, whereas the time to build the project is measured in months. Thus, the maintenance period by far dwarfs the initial construction period. And modern software is dynamically updated all the time: new customers, new uses of the software nobody thought of, operational changes around the software, and new patches adding or fixing functionality.

Production software also has the bar set high from the start: tools such as testing and continuous integration help eliminate faults early in the software cycle. So the deployed production software is less likely to have grave errors, unless it was rushed into production by an overeager project manager.

These characteristics lead to a specific set of constraints on production bugs, because the high bar eliminates the simple errors:

  • The fault will not occur anywhere else but in the production deploy.
  • The fault will not be caught by the type system, and the type modeling will be too weak to capture the fault.
  • The fault will have no logging which could expose it.
  • The fault is often a collaboration of many minor faults which together topple the system.
  • The fault will not bring the system to a halt but degrade it to the point of being unusable.
  • The fault will be outside of the system specification.
  • The fault will be rather obvious in hindsight, yet nobody had in their wildest imagination thought it could happen.
  • The fault will be a nontrivial complex interaction between several parts of the system.
  • The fault will be a result of using the system in a way it was not designed for.

The reasoning should be fairly straightforward: simple bugs are caught early in the development cycle and eliminated by the programmers. The bugs that do slip through and wreak havoc in production are exactly those not caught. This amplifies the severity of production bugs in practice.

There are two major ways in which one can try to combat the production errors. One is to be proactive, and devise better methods for catching bugs early on. Use a language with a better type system, such as OCaml, Elm, Haskell, or Purescript and eliminate errors at development time. Write extensive stress tests for the system and put it under pressure in a controlled environment. Take the beast out for a walk outside the lab and see if it survives. Add more logging. Improve the requirements handling and the initial specification of the software. Add more testing. Employ Property Based Testing. Gradually roll out the software. Iterate on the software in smaller steps and only add functionality in a controlled fashion. Fire the project manager who destroys the stability of the software. Build your system as a flat set of modules, avoiding hierarchy at all costs. Simplify the responsibility of each part and decouple the software.
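
To make the property-based testing suggestion concrete, here is a minimal sketch using Go's standard testing/quick package. The Reverse function and the round-trip property are hypothetical illustrations, not code from any system discussed here; the point is that you state an invariant and let the framework generate the inputs.

    // A property-based test sketch using Go's standard testing/quick package.
    // This would live in a *_test.go file of a hypothetical "reverse" package.
    package reverse

    import (
        "testing"
        "testing/quick"
    )

    // Reverse returns a new slice with the elements of xs in reverse order.
    func Reverse(xs []int) []int {
        out := make([]int, len(xs))
        for i, x := range xs {
            out[len(xs)-1-i] = x
        }
        return out
    }

    // TestReverseRoundTrip states a property rather than single examples:
    // reversing twice must give back the original slice. testing/quick
    // generates random inputs and checks the property for each of them.
    func TestReverseRoundTrip(t *testing.T) {
        prop := func(xs []int) bool {
            rev := Reverse(Reverse(xs))
            if len(rev) != len(xs) {
                return false
            }
            for i := range xs {
                if rev[i] != xs[i] {
                    return false
                }
            }
            return true
        }
        if err := quick.Check(prop, nil); err != nil {
            t.Error(err)
        }
    }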

Yet, if you are only ever proactive, you do nothing but amplify the remaining bugs. The bugs you don't catch in development are now hideous abominations, which will utterly destroy the system when they are uncovered.

This leads to the tenets of the person who has run systems in production:

In a production system you must be able to query its state in an ad-hoc fashion. And you must be able to carry out post-mortem debugging. You must also have metrics.

Observations about reactive fault handling

First of all, since the fault you have is in production, there are a number of things which are hard to do. You usually cannot attach a debugger to the system, because it would halt the system in production while you are debugging it. You may not even have the ability to attach a debugger to the production environment. And even if you could attach a debugger, it would lead to timeouts in parts of the system over which you have no control. Modern systems are distributed and rarely run in a single location anymore. So there is no way in which you can halt the system as a whole.

Your ad-hoc queries into the system state must have no impact on the system when they are not enabled, and you need to be able to query any aspect of the system. The experienced developer is willing to sacrifice lots of speed in normal operation for the ability to run ad-hoc queries against the system. Knowing why something is going wrong is far more important than the raw execution speed of the system. Performance problems are often due to misunderstandings in operation anyway.

You must be able to query any aspect of the system, be it kernel, userland, garbage collection metrics, or mutex contention time. The fault will usually occur between log statements and hide there, so normal logging usually won't cut it. Also, outputting debug logging can be expensive in tight loops, so it is rarely something you can enable there.
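
As one concrete instance of such an always-available, low-overhead query facility (an illustration, not a prescription from this post), a Go service can expose its runtime state on a side port via net/http/pprof; the port number and the mutex sampling rate below are arbitrary choices.

    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers /debug/pprof/ handlers on the default mux
        "runtime"
    )

    func main() {
        // Sample a fraction of mutex contention events so lock wait time
        // can be queried later; the overhead is negligible in normal operation.
        runtime.SetMutexProfileFraction(100)

        // Serve the debug endpoints on a side port. Hitting
        // /debug/pprof/goroutine, /debug/pprof/heap or /debug/pprof/mutex
        // queries live goroutine, heap/GC and lock-contention state
        // without stopping the process.
        go func() {
            log.Println(http.ListenAndServe("localhost:6060", nil))
        }()

        // ... the actual service would run here ...
        select {}
    }

If the package is never imported and the side port never opened, none of this is active, which keeps the cost at zero when the facility is disabled; when it is enabled, an operator can point go tool pprof at, for example, http://localhost:6060/debug/pprof/mutex on the live process.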

Your ad-hoc queries must be safe. If you enable ad-hoc query and tracing on the system and then disable it again, there must be no segfaults, no kernel crashes, and no long-term impact. Going back and forth between on-line tracing and query mode should have no stability impact on the system.

When a fault occurs in production, you immediately have higher stress levels than you would normally have. And the onus is on you to fix the mistake. Often, people reboot the system to get it out of the bad state. But this also means they throw away the information which led to the fault. The only way to fix this is to employ some form of post-mortem debugging: snapshot the system in its bad state, reboot the service, and then look at the fault in a less stressful environment. The mistake is most easily fixed the next morning, after you've had a good night's sleep and some coffee. Not at 3 in the morning with alarms going off everywhere. So you need to snapshot the bad state for later, when you have time to figure out what went wrong. Post-mortem analysis often leads to deep insights into the incorrectness of your software and how to fix it in the long run.
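
A minimal sketch of this snapshot-then-restart idea, assuming a Go service (the signal choice and the file paths are arbitrary): dump the live goroutine and heap state to disk on demand, restart the service, and inspect the files the next morning.

    package main

    import (
        "log"
        "os"
        "os/signal"
        "runtime/pprof"
        "syscall"
    )

    // snapshotOnSignal writes goroutine and heap snapshots to disk when the
    // process receives SIGUSR1, so the bad state survives a restart and can
    // be examined later.
    func snapshotOnSignal() {
        ch := make(chan os.Signal, 1)
        signal.Notify(ch, syscall.SIGUSR1)
        go func() {
            for range ch {
                if f, err := os.Create("/var/tmp/goroutines.txt"); err == nil {
                    // Debug level 2 prints every goroutine with a full stack trace.
                    pprof.Lookup("goroutine").WriteTo(f, 2)
                    f.Close()
                }
                if f, err := os.Create("/var/tmp/heap.pprof"); err == nil {
                    pprof.WriteHeapProfile(f)
                    f.Close()
                }
                log.Println("snapshot written; the service can now be restarted")
            }
        }()
    }

    func main() {
        snapshotOnSignal()
        // ... the actual service would run here ...
        select {}
    }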

The necessity of metrics becomes apparent once you hit the first couple of degradation faults. You must establish a baseline of how the system operates normally before you can say, statistically, whether any behavior is inconsistent with that baseline. A good system is transparent and discoverable. The system will tell you, through its metrics, how it is currently operating. Experienced programmers are willing to sacrifice quite a lot of execution performance for knowing how the system behaves at any moment. By having a profile of the system, it is often possible to massively improve its performance, and having a profile available all the time tends to be more important than a couple of microseconds here and there.
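
To illustrate what "statistically inconsistent with the baseline" can mean in the simplest case (an illustrative sketch, not a method prescribed here), one can compare an observed value against the mean and standard deviation recorded during normal operation; the Baseline type and the k-sigma rule below are hypothetical.

    package metrics

    import "math"

    // Baseline captures how a metric behaved during normal operation.
    // In practice the numbers would be computed from historical measurements.
    type Baseline struct {
        Mean   float64
        StdDev float64
    }

    // Anomalous reports whether an observed value deviates from the baseline
    // by more than k standard deviations (a simple k-sigma rule).
    func (b Baseline) Anomalous(observed, k float64) bool {
        return math.Abs(observed-b.Mean) > k*b.StdDev
    }

For example, with a latency baseline of Mean: 12 and StdDev: 3 (milliseconds), Anomalous(35, 3) reports true, while Anomalous(15, 3) does not.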

About Tools

Several tools support the above observations:

  • UNIX in general has a lot of coarse-grained inspection commands which let you probe into the overall behavior of an application. Look up Brendan Gregg's USE method, for instance.

  • DTrace is the Swiss army knife when hunting for bugs in UNIX systems. The cool thing about DTrace on Illumos is that it is safe and you don't have to worry about it. I've had DTrace crashes on FreeBSD, which you don't want in production systems. Also, Linux currently lacks a universally accepted tracing tool at that level.

  • Core dumps are extremely valuable and should be used for post-mortem debugging. Illumos' modular debugger deserves special mention, but even a simple gdb(1) session can often tell you a lot about what went wrong in a system.

  • Erlang has some nice tooling built in for tracing. Post-mortem debugging is a staple, using the crash dumps or the crash log. Fred Hebert's recon tool deserves special mention, as it improves the introspection capabilities of the base system quite a lot. Also, the ability to snapshot the state of any running process is invaluable for debugging purposes.

  • Metrics usually need a two-pronged attack: first collect metric data inside the application and store it in volatile memory. Then ship the collected data periodically to a persistent store outside of the system. Graph the metrics in a dedicated tool which spans your whole platform, rather than building a tool into each small part. A minimal sketch of this split follows the list.
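
As promised above, here is a minimal sketch of the collect-in-memory, ship-periodically split, assuming a Go service. The pushToStore transport is a placeholder for whatever external store the platform actually uses, and the metric name and intervals are arbitrary.

    package main

    import (
        "expvar"
        "log"
        "time"
    )

    // requestCount lives in process memory: the volatile, in-application side.
    var requestCount = expvar.NewInt("requests_total")

    // handleRequest stands in for the hot path; updating the counter is a
    // cheap in-memory operation.
    func handleRequest() {
        requestCount.Add(1)
        // ... actual request handling ...
    }

    // pushToStore is a placeholder for the transport to an external,
    // persistent metrics store; here it just logs the value.
    func pushToStore(name string, value int64) error {
        log.Printf("ship %s=%d", name, value)
        return nil
    }

    // shipMetrics periodically pushes the collected values out of the process.
    func shipMetrics(interval time.Duration) {
        for range time.Tick(interval) {
            if err := pushToStore("requests_total", requestCount.Value()); err != nil {
                log.Printf("metrics ship failed: %v", err)
            }
        }
    }

    func main() {
        go shipMetrics(10 * time.Second)
        for {
            handleRequest()
            time.Sleep(100 * time.Millisecond) // simulate incoming requests
        }
    }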

Sadly, though, these kinds of observations are largely absent in many other environments. Academic languages, where the prototype is all that is ever produced, have no need for the above. And building the tooling produces neither code nor papers, so there is little focus on it. Immature platforms often add these capabilities as an afterthought, after many years of operation.

And if your platform does not focus on production tooling first, it is not fit for production, and it won't be until that tooling is added. Most popular platforms of today utterly fail in this regard, and this is a good way to analyze a system for its production readiness.