Full Queues and Their Woes
Suppose you have a bounded queue of size 10. In a normal setting, you system may load this queue with up to 5 elements. Under load, the queue might increase to say 7 or 8 elements, but you are not going to reach the bound of the queue. Typically the queue is empty, some elements gets added to the queue and then drained again by processing.
In contrast, suppose the queue is full at 10 elements, and that the load increases on the queue. The additional load has to be dropped, and this is a nice feedback mechanism which tells the caller that you have an overload situation. But also note, that you losing information: namely how much additional load that were added on the system. Unless you add some kind of ghost-counter to your queue, so you know how much data you dropped recently, you have no idea of the load situation apart from the fact that the system is over its load capacity.
There are many situations in software engineering where this basic principle applies to your organization. Your queues can’t run at the full capacity because then you lose the ability to see what fluctuations are happening. A solution is to track “ghost entries” of your queues. That is, track elements beyond the queue limit with a count only and a timestamp, not storing data. When consuming a ghost-entry, it is consumed immediately, its processing time not counted against the normal processing time. Above a certain amount of ghost entries, you stop caring about them, but it allows you to answer the question of what would happen should you increase the queue a bit in size.
I claim full queues tend to be a pattern to avoid in software engineering. When you reach the bound, something is amiss in your internals, and you better engineer yourself around that than letting the queue bound be the controller of the flow. When a value is at its maximum, you can’t measure what lies beyond. Interestingly, there are many similar situations, where the concept of a“full” resource blinds you to information that would otherwise have been useful.
One example is the recently developed BBR congestion control mechanism for TCP. Standing network buffers as queues has been a bane of modern networks for a while now (see BufferBloat). The problem is that in order to stop packet loss, manufacturers add yet more buffers to their equipment and the result is that the TCP algorithm fills up these queues. When this happens, we see increased latency and jitter in our TCP connections, which makes the internet unsuitable for low-latency operations. Luckily BBR is a promising fix for these problems.
In the BBR-algorithm, it is noted that if a TCP link runs at the full bandwidth utilization, it is impossible to detect a standing queue in the network. In order to optimize latency, the algorithm periodically lowers the bandwidth in order to detect if there is a queue which should be drained in the network. This is exactly the problem of an overflowing queue as above. At the maximal bandwidth, data enters the queue at exactly the same rate as it leaves. So we can’t detect if there is a standing queue in the system. But if we lower the bandwidth utilization, we run the risk of less efficiency in the link. So hence the periodic probing.
Another example is to run your development team at 100% utilization. If something unforeseen emerges, there is no slack utilization to use, and hence someone has to work overtime to handle the situation or you have to move a deadline. You are essentially in an overflow situation, and thus you can’t detect and react to events beyond the barrier. Had you dropped utilization to, say, 60–80%, then emergencies can be handled as they appear. This is also an example of a “full queue” which permits no flexibility. What usually happens is that the deadline gets moved. And now everyone is unhappy because the deadline was moved.
Worse, good ideas doesn’t happen at 100% utilization. They happen when you have some slack in the schedule and can spend some extra time tuning parts of the system as it goes along. In any kind of process planning in a factory, people know the price of running the plant at 100% utilization: it gets sensitive to small errors. So nobody does that.
The same applies to consultancy: run your consultants at 100% utilization, and there will be no way your consultants can stay ahead of the curve by gaining new knowledge. Worse, it is often hard to spread information between consultants when they are all working at 100%. This means knowledge doesn’t get anchored in the business and you lose valuable information you learned on a project.
A full queue in a Go program means that you have a critical path blocking the speed of the program. But it has no dualized shadow-pricing: if the impeding path is unblocked or removed, you don’t know where the next bottleneck is going to be. If the queue is not full, you can measure the relative change in the queue which can tell you where to worry next.
If a system is at 70% CPU load, you know that it can handle more work before succumbing to increased processing latency. But at 100% CPU load, you don’t know if your system is overloaded and thus queuing work, or if it is an Erlang system that is just using the additional resources for background processing work. The former is probably bad.
Another applicability of the idea is that of Netflix“Chaos Monkey”. A system with 100% uptime over a period will not probe any kind of recovery scenario. Thus, your environment is blind to partial failure of subsystems. The Chaos Monkey introduces errors in the system which then works as “checks & balances” making sure other subsystems can cope with partial failure. Amazon’s AWS also seems to be using fault injection methods to make sure systems running on their infrastructure can cope with transient errors. Suppose that the normal injection rate is 1 in a 1,000,000. Then if the error rate increases to 1 in a 1,000 suddenly, the systems will still cope.
As already written, this insight is well-known. Don Reinertsen has addressed this over the course of many years. At high levels of use, the system can’t cope with more load and it becomes more sensitive. In turn, productivity of the system as a whole tend to falter when this happens. The principles, as this post hopes to show, are widely applicable to more than a single field. The whole concept of “overflow” is dangerous in systems. It is also a good reason for being wary about “bounded queues”. It is better to build a load regulation system on top of an unbounded queue than it is to hard limit the queue bound. It solves the problem, but it pries you from valuable information. At the very least, use something like a ghost-list or provide a warning whenever a queue grows full.
The ghost-list idea on queues are taken from the ARC cache of ZFS. Starting from the well-known LRU cache, where we knock out the least recently used element from a full cache, we start with an observation: entries which are only referenced once in the LRU cache are wasted space. The 2Q cache solves this by having two caches. One cache runs as an LRU cache, but elements are promoted to the second cache if they have been referenced more than once. It only takes a single bit to handle this, and it wipes out the lone stragglers in the LRU cache.
ARC improves on the 2Q scheme by observing that the sizes of the 2 caches are static in 2Q. A ghost entry is an entry in the cache referenced by key only. It doesn’t take up much space. Whenever we boot out an element, we keep a ghost key for that element. By keeping ghost-entries for each cache, it is possible for ARC to answer the question “if we had a bit more cache space, would we have been able to successfully hit this element?” And that can guide automatic sizing of the two caches in a 2Q cache (this is a much condensed description, papers have the details).
 Cardwell, Cheng, Gunn, Yeganeh, and Jacobson http://queue.acm.org/detail.cfm?id=3022184
 In linear and mixed-integer programming, this is often used in order to build “what-if” scenarios.
 This is based on intuition on my part. Their network engineers are way better than the fault rate we are seeing, so we can only assume they have built the system to inject faults.