jlouis' Ramblings

Musings on tech, software, and other things

Erlang String Handling

The cat is out of the bag. Now that WhatsApp has had success using Erlang, more people will gravitate toward the platform. One of the common things that will happen as a result is that people who do not understand the string handling capabilities of the Erlang BEAM VM will fail to use it correctly. This wastes a lot of clock cycles, and this post aims to prevent that.

First of all, since this is rather VM specific, it also applies to Elixir users. The VM basically handles strings the same way, so you should be thinking along the same general lines as I do here for Erlang. Note, though, that Elixir, to the best of my knowledge, maps its string type onto binaries.

The first thing to understand is that Erlang is a functional language first and foremost. This means, among other things, that data is immutable by construction and there is no way to circumvent that. It also means certain tricks pertaining to ephemeral mutability are out the window. And it would be outright against the spirit of Erlang to have mutable data in any form.

When you should avoid using Erlang in the first place

Erlang as a language fares badly with programs where you have a small computational kernel which has to be extremely fast. This is the case for a lot of heavyweight processing tasks. It is often better to take your kernel and move it into another language, or in some cases move the kernel onto an FPGA or into hardware, depending on your needs.

Depending on the problem, it may not be efficient to spend your time implementing the code in Erlang. Unless you plan to utilise some of the other Erlang strengths, like robustness, you may be better off just implementing your system in Go or Clojure. Alternatively, you can write a NIF and handle the core kernel in C. Don’t use Erlang when another tool solves the job in a much better and much faster way. A rough estimate is that C code is 20-50 times faster for tight computational kernels.

The Erlang language is built for handling more complex interactions where raw execution speed matters less and the program structure matters more. If your system has no specific computational kernel doing heavyweight processing, then Erlang might suit your needs. If your system is bound on memory or I/O, then Erlang may be very suitable.

Your goal is to get most of the processing out of the emulator loop and into the BEAM VM core. The runtime core is very efficient at handling large amounts of data, and many structures are built so that they can sustain heavy load. A good way to look at this is with the Linux perf(1) tool. Run perf on the beam.smp emulator; if too much time is spent in beam_emu.c, chances are your program is processing-heavy. Even processing-heavy applications I have written tend to spend only about 20% of their time in the emulator loop. The consequence is that you can’t win much by replacing the interpreter with a JIT or a native-code compiler. You can only, at best, squeeze that 20%, so it is limited how much faster you can make the code this way.

Aside: a JIT would still be useful so that more of the Erlang system could be written in Erlang rather than C. But for most code, it is not a given, a priori, that it would yield a speedup.

What is a string()?

The Erlang string type is implemented as a singly linked list of Unicode code points. That is, if we write "Hello" in the language, it is represented as [$H, $e, $l, $l, $o]. The overhead of this representation is massive. Each cons cell uses 8 bytes for the code point and 8 bytes for the pointer to the next value. This means that the 5-byte ASCII representation of "Hello" becomes 5*16 = 80 bytes in the Erlang representation.
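A quick shell session makes the representation concrete (the prompt numbers are just for illustration):

1> "Hello" =:= [$H, $e, $l, $l, $o].
true
2> "Hello" =:= [72, 101, 108, 108, 111].
true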

Strings are shared. If we write L = "World", followed by A = "Hello" ++ L, then A and L share the tail, due to persistence and immutability. In some cases this can be exploited to limit the amount of data present in a system and somewhat mitigate the massive data blowup you would otherwise see.

Erlang is not unique in having the string-as-integer-list representation. It is the same representation which is used by Haskell. But since no type information is present, you can’t discriminate between a list of integers and a string. You will have to know from the code at hand.

There are a lot of disadvantages to this representation. It results in memory blowup, more cache misses, more data bandwidth use, and so on. On the other hand, all the normal list-processing functions are present and can be used on strings as well. So to manipulate strings, you have access to four modules: [string, lists, unicode, re].

The module string contains valuable helper functions for manipulation of strings. Most importantly:

string:tokens/2

which can be used to split a string into parts, and

string:join/2

which can join strings together with a separator.
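As a quick illustration of the two (the prompt numbers and data are just for illustration):

1> string:tokens("a,b,,c", ",").
["a","b","c"]
2> string:join(["a","b","c"], "-").
"a-b-c"

Note that string:tokens/2 treats runs of separators as one, so the empty field between the two commas disappears.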

I normally only use plain strings like these when the data sizes are rather small and we are not on a critical code path. Otherwise, you need to do something else in order to make things work quickly and efficiently.

The lists module contains many functions which are useful for working with strings. As an example, lists:prefix/2 can be used on strings as well.
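For example:

1> lists:prefix("foo", "foobar").
true
2> lists:reverse("abc").
"cba"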

The re module implements regular expressions. Here is the basic rule:

Do NOT use the re module

Usually, there are better ways to handle data than to throw a regular expression at it. Mind you, I utterly love using regexes in editors and I also like using them in lexers. But for most normal data processing, regular expressions are overkill. Also, the regular expression engine in Erlang has been altered so it preempts, which prevents a long-running regex from destroying the VM’s scheduling mechanism.

Typical code will have maybe one or two calls to the re module per 5000 lines of code. If you have more than that, chances are you are trying to program Perl in Erlang. And that is a bad idea.

However, the re module is way faster than the older regexp module which was deprecated.

The binary() type

Erlang, like Haskell, saw the problem with a memory-heavy string type. So both implement a type which is more efficient at handling large amounts of data. In Erlang this is called a binary, and in Haskell it is called a ByteString. Binaries are arrays of binary data. They are immutable, which means they can be shared. Also, references to subparts of a binary can be shared by having the system store three values, called a sub-binary:

  • A pointer to the original binary

  • An offset

  • A size

Note that sub-binaries work a bit like Go’s slice construction in this way. The VM is built such that passing around binaries and sub-binaries is always efficient. The trick is immutability, which allows the system to pass pointers to binaries rather than passing the binary value itself.

Binaries, like slices in Go, also have extra capacity, in the sense that in some cases a binary can be appended to efficiently without having to copy the binary data to a new place. The system will automatically extend binaries with extra capacity when they are copied, ensuring efficient append.

When programming Erlang, the compiler and VM will automatically generate binaries and sub-binaries for you. Write your code in a straightforward and readable manner first. Then compile your program with +bin_opt_info to have the compiler report which binaries were not optimised in code that the program traverses heavily.
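For example, from the command line or the shell (the module name is hypothetical):

$ erlc +bin_opt_info my_module.erl

or

1> c(my_module, [bin_opt_info]).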

Binaries can be pattern matched. This is extremely efficient, but sometimes you can’t write a matching rule, since matches essentially always work from the beginning of the binary. You can’t “search” through a binary for something that matches with a single pattern match. The way to handle this problem is to use the binary module:

  • binary:split/3 is extremely powerful. It is the binary variant of string:tokens/2, but it returns shared data and so produces only a small amount of garbage. The split pieces simply refer into the original binary through sub-binaries. Be very aware of the option [global], which allows you to split the binary into more than two parts.

  • binary:match/3 is your generic search routine for picking out parts deeply in binary data.

  • binary:compile_pattern/1 allows you to precompile simple search patterns, a weaker (but much faster) alternative to a regular expression search.

  • binary:copy/1 forces a copy of a binary. This is useful if you have a 1 megabyte binary and you have found a 45 byte sequence in it, and you only want that sequence. Then you can copy the sequence, which means you no longer hold on to the 1 megabyte binary, in turn allowing the garbage collector to collect it. This is extremely useful if you are cutting input into pieces (with split/3) and storing it at rest for a long time, for instance in ETS. The shell sketch after this list shows these functions in action.
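A small shell sketch of these functions (the data is made up for illustration):

1> Bin = <<"key=value&other=data">>.
<<"key=value&other=data">>
2> binary:split(Bin, <<"&">>, [global]).
[<<"key=value">>,<<"other=data">>]
3> binary:match(Bin, <<"other">>).
{10,5}
4> binary:copy(binary:part(Bin, 4, 5)).
<<"value">>

The result of match/3 is the {Offset, Length} of the first hit, and copy/1 in the last line gives you a fresh <<"value">> that no longer refers into Bin.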

The iodata() type

There is another quite important data type which I want to describe, called iodata() or iolist(). The rules are the following:

  • A list of integers in the range 0..255 is IOData.

  • A binary is IOData.

  • A list of IOData is IOData.

In particular, this means you can form IOData by collecting IOData in lists, which makes string concatenation in the language O(1):

p(IOData) -> ["<p>", IOData, "</p>"].

The p/1 function given here will wrap IOData in a standard paragraph section in HTML, but it will not reallocate any data, nor will it generate any garbage. The sections “<p>” and “</p>” are constants, so the only allocation that will happen will be for the front of the list, two list elements.

Most IO functions in Erlang understand IOData directly. This means you can avoid flattening data and just send the IOData out over a socket or the like. It avoids costly allocations and copies in the IO path of your program. Highly recommended!
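A minimal sketch of how this composes, reusing the p/1 function above (the socket is assumed to be an already-open gen_tcp socket, and the surrounding HTML is made up):

send_page(Sock, Body) ->
    %% gen_tcp:send/2 accepts IOData directly, so nothing is flattened here
    gen_tcp:send(Sock, ["<html><body>", p(Body), "</body></html>"]).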

A good way to gauge how well thought out a library is, is to look at how well it handles IOData. If it doesn’t, and requires you to explicitly call iolist_to_binary/1, then chances are the library is not that well written.

Handling unicode() data

Unicode data in Erlang is handled by the unicode module, which can convert between representations of Unicode. My recommendation, however, is to keep most Unicode data as UTF-8 strings in binaries. You can match on Unicode code points:

-module(z).
-export([len/1]).

len(B) when is_binary(B) ->
    len(B, 0).

len(<<>>, K) -> K;
len(<<_/utf8, Rest/binary>>, K) -> len(Rest, K+1).

This is useful together with the ability to input character strings as UTF-8:

Erlang R16B03-1 (erts-5.10.4) [source-ce3d6e8] [64-bit] [smp:8:8] [async-threads:10] [hipe] [kernel-poll:false] [dtrace]

Eshell V5.10.4  (abort with ^G)
2> z:len(<<"Rødgrød med fløde"/utf8>>).
17
3>

In Release 17.0, the default source encoding for Erlang files will be UTF-8. This will probably have some deep effects on Erlang source code, but it will ultimately help get Unicode into Erlang. Release 18.0 is planned to accept Unicode atoms as well, to open up the design space.

Attack Vectors

The basic rule of all Erlang string handling is this:

Never never NEVER work on stringly typed data

When data is “stringly” typed, it means that your data has no structure and everything is represented as strings. This is fairly expensive for a system to work on, since you are limited to using the string-handling code. Parsing is always expensive, and this hurts your processing speed. Some languages, like awk or Perl, are built to process this kind of thing. But you would rather not do that processing in Erlang.

The way you avoid stringly typed data is to take the input and transform it into structured data as early as possible in your code. That way, you avoid working on the string data and only need it in a few places. The more structure you can attach to the data, the better.

The primary problem is when you have to work with a bad data format. Again, the trick is to turn the bad format into something sensible quickly and then process it as sensible data.

Erlang is designed to work with data that has structure. With structure, you can pattern match, which is fast. Without structure, you have to rely on standard techniques, and this is almost always going to be a pain in the language. So don’t do it. Convert data into a structured format and then proceed by handling the structure with pattern matches, as in the sketch below.
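As a sketch of the idea, suppose you receive lines like <<"user=alice;age=30">> (the format and function name are made up for illustration). Turn them into structured terms at the edge of the system and pattern match on the result from then on:

parse_kv(Line) when is_binary(Line) ->
    Pairs = binary:split(Line, <<";">>, [global]),
    [begin
         [K, V] = binary:split(P, <<"=">>),
         {K, V}
     end || P <- Pairs].

%% parse_kv(<<"user=alice;age=30">>) yields [{<<"user">>,<<"alice">>},{<<"age">>,<<"30">>}]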

Immutability

Erlang takes a stance. All data is immutable. In particular, strings are immutable and binaries are immutable. There is an overhead to this stance. If you can’t accept it, you must pick another language. That said, the advantages of immutability far outweigh its costs.

Erlang is immutable because it eliminates a large source of programming errors and mistakes. After all, the value of an incorrect program is lower than that of a correct one. This stance is highly unlikely to change, since the safety guarantee provided by immutability is part of the Erlang DNA.

When you have control of data

In some situations you control the format of the data you are going to use. This is an excellent opportunity to pick some clever ways of working with the data, in particular to enforce structure on the data by default. If you are communicating between Erlang systems, you can use term_to_binary/1 and binary_to_term/2 and just put data at rest in the standard Erlang external term format. If the foreign system also supports this format, it is an excellent way to interchange data. The encoder/decoder for this format is written in C, and it handles very large terms with grace: the running process is preempted while producing the binary.
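A minimal sketch of putting a term at rest in this format (the file name is arbitrary):

store(Term) ->
    Bin = term_to_binary(Term),
    file:write_file("term.bin", Bin).

load() ->
    {ok, Bin} = file:read_file("term.bin"),
    %% the safe option refuses to create new atoms from untrusted input
    binary_to_term(Bin, [safe]).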

The man page for inet (erl -man inet, then search for setopts) describes common socket options you can set on a socket. By setting the packet option you can control how the system decodes inbound data. Most interestingly, you can get ASN.1 BER decoding or line-wise decoding. If you value speed, you should consider an efficient binary format.
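For example, a hedged sketch of line-wise decoding on a listening socket (the port number is arbitrary):

{ok, LSock} = gen_tcp:listen(8080, [binary, {packet, line}, {active, false}]),
{ok, Sock} = gen_tcp:accept(LSock),
%% with {packet, line} the driver hands back one complete line per recv
{ok, Line} = gen_tcp:recv(Sock, 0).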

JSON

The ubiquitous format you need to handle today is JSON. I don’t particularly like JSON as a data exchange format, since it is very weak in what types it can encode. I’d much rather have a format like Joe Armstrong’s UBF or the Clojure edn & data.fressian encodings. But JSON it is. There are two good JSON libraries for Erlang: jsx and jiffy.

They differ in how they are implemented. The jsx library is implemented in pure Erlang and is the slower of the two. The jiffy library uses a C NIF (Natively Implemented Function) to run the encoder and decoder and is about 10 times faster.

Beware: the jiffy library can’t be used to decode large JSON data sets. The decoder is not a well-behaved NIF, and as such it can mess up the schedulers if it is used to decode large data strings. If the JSON takes more than 1 ms to decode, you should probably avoid decoding it with jiffy. In Release 17.0, we get so-called dirty schedulers, which can be used to work around this problem.

The other problem with JSON data is the internal Erlang representation. Right now, there is no isomorphic mapping of JSON objects/dictionaries onto Erlang terms. This will be fixed in Release 17.0 as well, since it includes maps, so you can obtain a native mapping of objects/dictionaries into Erlang data. I also have a side project underway to properly handle JSON-to-record encoding in Erlang, but this is still ongoing work, and it will take some time before it is fully implemented.

On the other hand, note that JSON will never be a fast interchange format. If you use JSON to move large amounts of data, you are screwed, plain and simple. Your best bet is then to hope the data is sent as many small pieces so you can use jiffy on them, or to wait for 17.0 so you can get jiffy in a dirty-scheduler variant.

Files

You should study the section “PERFORMANCE” of the man page (erl -man file). Note that in order to have fast file I/O you need to open the file in “raw” mode and use binaries, and you can usually also benefit from following the other advice in that section. The general rule of IOData applies: if you supply IOData, the underlying file driver is able to map it onto a Unix pwrite(2) call, which is highly efficient.
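A minimal sketch of the fast path (the file name is arbitrary):

{ok, Fd} = file:open("data.log", [raw, binary, append]),
%% the IOData below is written without being flattened first
ok = file:write(Fd, [<<"line 1\n">>, <<"line 2\n">>]),
ok = file:close(Fd).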

Not opening in raw mode does have its benefits, though, because you then get the IO subsystem to work for you. This subsystem allows you to open a file on a foreign file system (on another node) and then operate on that file. If you don’t need the high speed, this is desirable in many situations, should your system span multiple nodes.

Closing off

It is a myth that Erlang strings are slow. You will have to think a bit more about what you do in order for the system to be fast. But chances are that string processing won’t be your limit. It is much more likely that your bottleneck will be a lock or the wrong structure of processes than slow strings.