jlouis' Ramblings

Musings on tech, software, and other things

The Erlang Shell

(Front Line Assembly: Civilization, Eastern Sun: In Emptiness)

As an Erlang programmer I often claim that “You can’t pick parts of Erlang and then claim you have the same functionality. It all comes together as a whole”. This is true for many programming environments where the main components are built to be orthogonal to each other and the parts form a cohesive whole. Go is another good example of this approach.

A compelling way of deploying software is one which supposedly originated with FLEX (Alan Kay). The program, the system, and its data are all kept inside an image which can be persisted to disk and restarted later. In essence, we specify which world we operate in by giving an image. Many Smalltalk systems utilize this notion of images. So do Common Lisp systems. And they even understand how to reconnect to networks and reopen files.

Erlang provides its own, weaker, mechanism for assembling software called a release. A release consists of the runtime together with a set of Erlang applications. They are started as a whole—in a specific order. The same release is usually booted across several machines if we want resilience against hardware faults. The big shift compared to images is that there is no on-disk persistence. The ideology is different: the system should never stop, so even if one node() in the cluster is stopped, the data lives on other nodes as well. Erlang systems also allow for seamless upgrades from one release to another while they are running.
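As a rough, hedged sketch, a release can be described by a .rel file which lists the runtime and the applications to boot; the names and version numbers below (myrel, myapp, and friends) are made up for illustration:

%% myrel.rel -- a hypothetical release specification
{release,
 {"myrel", "1.0.0"},          %% release name and version
 {erts, "13.2"},              %% the runtime system it boots on
 [{kernel, "8.5"},            %% the applications included in the release
  {stdlib, "4.2"},
  {sasl, "4.2"},
  {myapp, "1.0.0"}]}.

From such a file, systools:make_script/1 and systools:make_tar/1 can produce a boot script and a packaged release, respectively.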

But there is some resemblance between Common Lisp / Smalltalk images and Erlang releases. While they don’t persist the data, Erlang releases do define a separate, enclosed system with no link to the original system.


The strength of these persistent models becomes apparent late in the development cycle. Software usually goes through several phases:

\[\text{Analysis} \rightarrow \text{Design} \rightarrow \text{Implementation} \rightarrow \text{Test} \rightarrow \text{Deploy} \rightarrow \text{Maintenance}\]

It is important to stress that development of software is a dynamic activity. We repeatedly change the software in production by layering more and more complexity/features on top of the system. We also dynamically fix bugs in the software while it is in production.

In recent years, development has tended toward so-called Agile methods—where many small, dynamic iterations of the software construction process run at the same time. We have social tooling in place which tries to achieve this (Scrum, Kanban, …), and we have technical tooling in place to reach the goal (git, Mercurial, …).

The “Maintenance” part is very expensive. Maintaining running software has periodic costs associated with it. In a world where everything is a service, we have to pay for operators, developers, hardware resources, and so on.

When we program, we try to remove errors early. We employ static type systems, we do extensive testing, we use static analysis. Perhaps we even use probabilistic property checkers like QuickCheck, exhaustive model checkers like SPIN, or prove our software in Coq. We know, inherently, that eradicating bugs early in the software life cycle means less work in the maintenance phase.
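To make the testing tools concrete, here is a minimal sketch of a QuickCheck-style property in Erlang, written against the PropEr library; the module and property names are made up:

%% list_props.erl -- a hypothetical property module
-module(list_props).
-include_lib("proper/include/proper.hrl").

%% Reversing a list twice should give back the original list.
prop_reverse_twice() ->
    ?FORALL(L, list(integer()),
            lists:reverse(lists:reverse(L)) =:= L).

Calling proper:quickcheck(list_props:prop_reverse_twice()) then checks the property against a batch of randomly generated lists.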

But interestingly, all this only raises the bar for errors. When we have done all our hard work, the errors that do remain are all of the subtle kind. These are errors which were not caught by our initial guardian systems. Most static type systems won’t capture the class of faults which has to do with slow algorithms or excessive memory consumption for instance. A proper benchmark suite will—but only if we can envision the failure case up front.

The class of faults that tends to be interesting is the class that can survive a static type check. The mere fact that we could not capture it by static analysis in the compile phase makes the error much more subtle. It also often means such errors are much harder to trigger in production systems. If the fault furthermore survives the test suite, it becomes even more interesting. Like a viral strain, its DNA has mutated just enough to get past two barriers of correctness checks. Now it becomes a latent bug in your software.

Aside: I tend to absolutely love static type systems. I enjoy them a lot when I program in Go, Standard ML, OCaml or Haskell. I am all for the richer description that comes with having a static type system.

There is a great power in being able to say \(v \colon \tau\) rather than just \(v\)—exactly because the former representation is richer in structure. Richer structure helps documentation, makes it possible to pick better in-memory representations, makes the programs go faster and forces a more coherent programming model.

Yet, I also recognize that most of the errors caught by static type systems are not interesting. They are of the kind where a simple run of the program will find them instantly.

End of Aside.

Concurrency and Distribution Failures

When systems have faults due to concurrency and distribution, debuggers will not work. The problem is that you can’t stop the world and then go inspect it. A foreign system will expect an answer in time, or it will time out. Many modern systems have large parts over which you have no direct control anymore. Such is life in the Post-1991 era of computing, where the internet defines the interface to your program and its components. An Erlang system with two nodes is enough to be problematic: even if I could snapshot one node, the other node will carry on.

The same is true for concurrency errors. They often involve race conditions which must trigger before the fault appears. Attaching a debugger alters the execution schedule, making the race condition disappear in the process. The only way to debug such systems is by analysing post-mortem traces of what went wrong—or by inspecting the systems online while they are running.

To make matters worse, a lot of races only occur when data sizes increase to production-scale batches. Say you have a small write conflict in the data store due to inappropriate transaction serialization and isolation. If your test system has few users, this conflict will never show up. And if it does, you will disregard it as a one-time fluke that will never happen again. Yet—on the production system, as you increase capacity, this problem will start to occur. The statistical “Birthday Paradox” will rear its ugly head and you will hit the conflict more and more often. Up until the point where it occurs multiple times a day.

In conclusion, capturing these kinds of bugs up front is deceptively hard.

The Erlang Shell

The Erlang shell is a necessary tool for producing correct software. Its usefulness is mostly targeted at the maintenance phase, but it is also useful in the initial phases of development. An Erlang system can be connected to while it is running:

(qlglicko@127.0.0.1)3>

This provides a REPL so you can work with the software. But note that this is a REPL on the running production system. If I run commands on the system:

(qlglicko@127.0.0.1)3> qlg_db:players_to_refresh(1000).
{ok,[]}
(qlglicko@127.0.0.1)4>

I hook into running processes. In this case it is qlg_db, which does connection pooling towards the Postgres database. This allows me to probe the system while it is running, to check for its correct operation. Any exported functionality can be probed from the shell.
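One common way to get such a shell is to attach a remote shell from a second, throwaway node. A hedged sketch of the invocation, where the debug node name and the cookie are made up:

$ erl -name debug@127.0.0.1 -setcookie my_secret_cookie -remsh qlglicko@127.0.0.1

The -remsh flag makes everything you type evaluate on the remote node, which is why the prompt carries the qlglicko node name.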

I often keep a module around, named z.erl, which I can compile and inject into the running system:

(qlglicko@127.0.0.1)6> c("../../z.erl").
{ok,z}
(qlglicko@127.0.0.1)7>

This dynamically compiles and loads the z module into the running system. It makes the functions of the module available for system introspection and manipulation. When debugging hard-to-find bugs on systems, you need this functionality.
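What z.erl contains varies with the bug being chased, but as a hedged sketch it might hold small introspection helpers along these lines; the function names are made up:

%% z.erl -- a hypothetical grab-bag of debugging helpers
-module(z).
-export([mailbox_len/1, state/1]).

%% How many messages are queued in a process' inbox?
mailbox_len(Pid) ->
    {message_queue_len, N} = erlang:process_info(Pid, message_queue_len),
    N.

%% The internal state of a gen_server/gen_statem process.
state(Pid) ->
    sys:get_state(Pid).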

And yes, if you want, Erlang nodes contain the compiler application, so they can compile modules.


In Erlang, linking is deliberately done “as late as possible”. This means you can change software in the system while it is running. There is no linker phase up front at compile time; linkage happens when you call another module. Yes, this costs performance. But on the other hand, it means you can always rely on the system calling the newest loaded version of a module. The ability to hot-patch a system while it is running can help a lot. You don’t have to interrupt the system for small fixes, for instance. If you know that you only changed a single module in your test build, you can opt to push just that compiled byte code to production and inject it into the running system. As long as you systematically add the change to your standard deployment pipeline, this works.
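A hedged sketch of what a small hot-patch looks like from the shell, assuming the new my_module.beam is already in the node’s code path (the module name is made up):

1> code:purge(my_module).      %% drop any old code lingering from a previous load
2> code:load_file(my_module).  %% make the freshly compiled version the current one

From this point on, every fully qualified call my_module:f(...) runs the newly loaded version.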


The shell also provides a lot of nice tooling to help you when you are looking for problems in a system:

  • There is built-in job-control in the sense of the sh(1) shell. You can have several shells open at the same time. You can reconnect to shells, either locally or remotely. And you can kill shells which have hung for one reason or another.

  • Erlang has built-in trace capabilities. These provide DTrace-like behaviour on the system directly, without effort. Enabling tracing only impacts the traced modules and it is generally non-intrusive (unless you make a mistake when setting trace patterns, heh). You can mask events: only when this process calls. And only these two functions. And only when the 3rd passed parameter is 37. The Erlang shell makes this all possible dynamically on the running system (there is a sketch after this list).

  • Want to know what state a given process has? Fear not, you have online introspection via the shell.

  • Want to know how many messages there are in the inbox of a process? Fear not…

  • Want to insert a new log statement? Recompile the code and hot-deploy it via the shell. Fear not…
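The tracing sketch promised above uses the dbg application; my_mod:my_fun/3 is a made-up function, standing in for whatever you are chasing:

1> dbg:tracer().                 %% start the default tracer
2> dbg:p(all, c).                %% trace function calls in all processes
3> dbg:tpl(my_mod, my_fun, 3,    %% ...but only calls to my_mod:my_fun/3...
           dbg:fun2ms(fun([_, _, 37]) -> true end)).  %% ...where the 3rd argument is 37

When you have seen enough, dbg:stop() turns the tracer off again.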

And all this without service interruption. And you get all this for free, just because you picked Erlang as the implementation language.


Here is the thing: the first time you use the Erlang shell in production to fix a hard-to-debug problem, it becomes very, very hard to live without it. I’d willingly give up static typing for the ability to look at the running system. Problems that survive past the tests and into production tend to be sinister and evil. And subtly elusive. You need a system where you can go and inspect it, while the error is occurring in production.

These are the same traits that made UNIX a success (and what makes Plan9 alluring and appealing). Your system can be inspected and manipulated while it is being developed and changed dynamically. Except that in Erlang, we have much finer-grained control over the running UNIX process, since we can go inside it and inspect the running processes inside the node.