jlouis' Ramblings

Musings on tech, software, and other things

ProgLang design with evidence

Let us assume the programming language market is effective and free. In this case, the best programming languages are the most popular ones: PHP, Javascript, Java, C#, C, C++ and so on. By definition, fringe languages can’t be the best languages. New fringe languages, like Go and Rust, still have a chance in this world order, but long-time languages like Common Lisp, Haskell and OCaml are all dead. They have tried to show their worth for real-world development, but have been forever displaced into the dark corners of academia.

The problem, by far, is that we don’t design languages based on evidence, but based on whim. New programming languages are designed by looking at existing languages and re-hashing their ideas. Only rarely does a language introduce a genuinely new concept from research. There has been some initial work by Andreas Stefik et al. on the language Quorum, where language constructs are designed based on evidence. This means that until a construct has passed a statistically significant test, it is not included.

But most languages are designed at random based on other constructions:

  • Standard ML has a formal specification, which is as close to being mechanized (in Twelf) as you can get. This means the language bases itself on the assumption that “logic is a good guide for language design”. In the world of proof assistants and very formal settings, this is almost invariably true. There is no way you can get a system like Coq, Twelf, or Agda to work without using the knowledge of logic. Otherwise, encoding logic and mathematics would be almost impossible.

  • Haskell and OCaml are functional languages, picking typical functional features. Both languages have a tiny kernel doing the base computation, yet both grow bigger with every new release. Learning all parts of these languages takes time.

Essentially, this is the school of language academia. By alluding to logic and math, the hope is to get good, elegant and productive programming languages. The problem, however, is that we have no evidence. There are relatively few controlled experiments, and those which exist have several shortcomings.

On the contrary, most industrial languages are rehashes of older languages. You can trace most modern languages back to C[0]. And a lot of the language designs are influenced by a few ideas: imperative execution, and object-oriented programming. Even modern variants of the strain, Go and Rust, take inspiration from the world of the well-known rather than inventing something truly new. The traits of these languages are:

  • Each new generation is a rehash of an older generation with ideas systematically cherry-picked from academia: Garbage collection, lambda-expressions, parametric polymorphism (generics), structural subtyping to name a few examples.

  • A distinct industry focus: large and comprehensive standard libraries with support for the data format of the decade: XML, JSON, or ASN.1. Large IDEs that support the development effort. Eco-systems for today’s technology. Debuggers, profilers, linters and static analysis tools to make up for the shortcomings of the language.

  • Large development teams, spending efforts on maximizing the language performance, often paid for by the industry.

  • Development effort does not focus on research and features, but on stability and robustness.

  • The programmer is expendable to a certain extent. It is more important to get 100 people to work together than to get each one to perform at their optimum.

To a certain extent, the industry will find an optimal language for programmers. The inputs deciding the language will be many, some of which will be highly doubtful: programmer availability, programmer expendability, safety in numbers, nobody ever got fired for choosing IBM.

But such an optimum could be a local one. There may be other languages out there which are far better for the industry, but the random process by which we arrived here may have made us settle on a weaker language.

Worse, in industry, you may have a selection bias against the best language. Managers measure their power in the size of their staff. A better language means you need a smaller staff to carry out the task, which is in opposition to gaining power within the corporate structure. Consultants have a harder time working on projects with fewer opportunities for fixing errors. Employees lose their sense of value if they can’t fix the bugs the programming language inadvertently introduces for them. Certain programmers take pride in solving deep, complex shared-state bugs by poring over a debugging screen for days.

In the following I revisit old behemoths of discussion. My purpose is to make it obvious that there are many good questions to ask in the design of programming languages. I am not viewing language design from a purely theoretic perspective, nor am I viewing it entirely as a question of practicality. That is, language design doesn’t stop at the construction of an operational semantics specification. And likewise, you can’t define a language solely by its one implementation and let that implementation determine how it operates.

Types

With respect to types, there are two major claims:

  • Static typing makes the programmers more effective and productive.
  • Dynamic typing makes the programmers more effective and productive.

The experiment here has a null hypothesis:

There is no measurable difference in programmer productivity in a controlled experiment where we evaluate a dynamic vs. a static typing discipline.

If there is a difference, it doesn’t matter much to which side it falls in the first place. Just that there is a measurable difference would in itself be interesting.

It is not a priori obvious what works best. There are good arguments for and against both sides of the static/dynamic typing question. What makes it hard to answer decisively is that the experiences of different programmers vary to a large degree.

As an example, proponents of static typing often cite that it works well for large programs, since it captures bugs in the large. Yet the key to software modularity is to split programs into small modules which can then work independently. And modularity works in a dynamic typing discipline as well, which weakens the claim considerably.

On the other hand, proponents of dynamic typing often cite the added tediousness of adding types to code as a slowdown in productivity. Yet type inference automatically discovers types, and constructions like “open types” and “polymorphic variants” in OCaml can simplify many situations where static typing would normally require a lot of ascription work relating types to values.
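
To make that concrete, here is a minimal OCaml sketch (the function names are my own, hypothetical ones):

    (* Type inference removes the ascription burden: no type is
       written anywhere, yet OCaml infers string list -> int. *)
    let sum_lengths xs =
      List.fold_left (fun acc s -> acc + String.length s) 0 xs

    (* Polymorphic variants need no prior type declaration; the
       inferred type lists exactly the tags being handled. *)
    let colour_name = function
      | `Red -> "red"
      | `Green -> "green"
      | `Blue -> "blue"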

Dan Luu has done a magnificent job. His post “The Empirical Evidence That Types Affect Productivity and Correctness” summarizes a large set of papers and goes through each one in order to describe what it is trying to measure and how well it does it. In almost every paper, he has valid critique of the experiment, methodology and approach.

The point is that while these papers show almost no effect in either direction, they are often used to “justify your view” in one direction or the other. At best, many of these studies are inconclusive. And unfortunately, few people read the underlying papers, which means more misinformation.

There are other interesting questions to ask inside the umbrella of types. For instance:

  • The language will automatically convert values of differing types under certain operators. A good example is promotion of integers to floats in C, integer size conversions in C, or the automatic conversion of integers to strings in string concatenation in Javascript. Commonly this is called weak typing, but other names exist for the concept.

  • The language disallows automatic conversion and forces the programmer to convert between types. This is true in Python, Ruby, Go, OCaml, Haskell, Erlang and a whole slew of other languages. Often this is called strong typing.

Again, it is a toss-up which is the most effective. Certainly, weak typing is beneficial when the programmer gets to type less and the conversions work like they should. Given this view, strong typing feels like a nuisance and a source of irritation. On the other hand, automatic conversion sometimes introduces subtle bugs which will not occur in a language with manual type conversion.
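
A small sketch of the strong-typing side in OCaml, where no conversion ever happens behind your back:

    (* OCaml never converts automatically; mixing int and float is a
       compile-time type error: *)
    (* let bad = 1 + 2.5        <- rejected: 2.5 is not an int *)

    (* The programmer must convert explicitly instead: *)
    let ok = float_of_int 1 +. 2.5           (* evaluates to 3.5 *)
    let s = "count: " ^ string_of_int 42     (* no implicit int-to-string *)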

NULL

Many languages include a “NULL” or “nil” value, often represented as a pointer to address zero. On the Java platform, the dreaded NullPointerException rears its ugly head whenever you try to dereference such a value.

Many academic languages, and a few industrial ones (Erlang for example), have no concept of a nil value. Instead, you have to explicitly nominate when a value can be invalid. This default is akin to defining all your database columns as NOT NULL.
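
A minimal OCaml sketch of the NOT NULL default (the names are hypothetical): absence must be declared in the type via option, and the compiler forces the caller to handle it.

    (* Absence is explicit in the type: string option, not string. *)
    let find_user (db : (int * string) list) (id : int) : string option =
      List.assoc_opt id db

    (* The compiler rejects a match that forgets the None case, so
       there is no NullPointerException to trip over at runtime. *)
    let greet db id =
      match find_user db id with
      | Some name -> "hello, " ^ name
      | None -> "unknown user"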

Again, we have a relevant question: is there a statistically significant difference in the error rate of languages with a NULL default compared to a NOT NULL default?

Garbage collection

Does a garbage collector improve productivity? And for what kinds of programs? Certainly, not having to manage memory manually is helpful in many situations. But it is also true that a garbage collector doesn’t automatically remove memory leaks. Bad programming can still make a program use more memory than it should, and a logical memory leak is still possible. Haskell, for instance, with its lazy evaluation model, is prone to leaking computation into memory in what could be called a thunk leak.
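
As a sketch of a logical leak no collector can fix, consider this hypothetical OCaml memoisation cache:

    (* The cache is reachable for the lifetime of the program, so the
       GC can never reclaim its entries; without eviction it grows
       without bound. *)
    let cache : (int, string) Hashtbl.t = Hashtbl.create 16

    let memoised key compute =
      match Hashtbl.find_opt cache key with
      | Some v -> v
      | None ->
          let v = compute key in
          Hashtbl.add cache key v;   (* nothing ever evicts old entries *)
          v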

The benefit of faster development due to garbage collection must be weighed against the time spent tuning the GC of the program running in production. Increased service latency due to long GC pause times is a very real problem, and it is not a priori clear that including garbage collection by default is a step in the right direction.

Persistent/ephemeral data

In programs, data is either persistent (immutable) or ephemeral (mutable). Functional programming languages usually default to persistent data structures, eschewing the ephemeral ones in the process. It is not that you can’t get access to mutable structures when you need them; they are simply not the default mode of operation. In such languages a question arises:

Null hypothesis: there is no measurable difference between programs written with mutable data structures and programs written with immutable data structures.

Another worthy hypothesis is whether there is a measurable difference in the efficiency of the executing software depending on the choice of data structure. The answer is usually yes, but a more interesting question is whether, in practice, the choice between an O(n) ephemeral structure and an O(n lg n) persistent counterpart matters much. And what robustness guarantees are obtained by picking one over the other.
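
To make the persistent/ephemeral distinction concrete, a small OCaml sketch:

    (* Ephemeral: the update happens in place; the old value is gone. *)
    let arr = [| 1; 2; 3 |]
    let () = arr.(0) <- 42             (* arr is now [| 42; 2; 3 |] *)

    (* Persistent: the "update" builds a new value that shares the
       unchanged tail; the old list is left intact. *)
    let xs = [ 1; 2; 3 ]
    let ys = 42 :: List.tl xs          (* ys = [42; 2; 3]; xs unchanged *)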

Other questions

While I have only picked a few questions so far, here are some other interesting ones I like:

  • Do exceptions help or hinder the programmer in producing correct programs?

  • Are generics valuable in an OO-style language, or do they just add so much complexity that it hurts the programmer?

  • Does syntax matter? And how much?

  • Many operating systems provide memory protection between executing programs by means of using the hardware MMU for the purpose. To what extent would it help to add such a construction to a language design?

  • Do size limits on integers like int32 and int64 by default lead to bugs? How much slower are arbitrarily sized integers by default in large programs?

  • To what extent can a well-designed standard library and 3rd party library availability cover for a badly designed language? (Note: this is a fun experiment. Kill npm and have Node.js people implement everything from scratch. Compare to Python. Compare to Go. Compare to OCaml)

  • Does compilation speed affect programmer productivity?

  • Is there any measurable advantage in having a REPL for the language?

  • Will full, global, ubiquitous introspection capability help the programmer during development? In production?

  • If we measure how well a program performs over its lifetime (analysis, design, development, deployment and, most importantly, maintenance in production), is there any difference between languages?

  • Is there any correlation between language choice and development time?

  • Is there any correlation between language choice and error rate?

  • Is there any correlation between language choice and execution efficiency?

  • Does language choice affect time-to-alteration for an existing program of a certain size?

A critique of current experiments

Currently, our experiments are weak and use dubious methodology when studying programming languages.

It is important to stress that programming is an activity involving human beings. That is, the programming language is the ultimate test of user experience (UX) for a computer. The programming-language interface to the computer is the universal one which lets you write anything you want. In contrast to normal programs, which limit the experience to a few well-defined things you can do, programming languages fall in the opposite category, where the goal is to be able to extend the existing machine. And to do so with maximal efficiency and productivity.

Because it is a human activity, it has to involve human beings. This means we have to select a sample of programmers to test our hypothesis on.

It is here we see what I deem an almost universal weakness: testing on freshmen. The problem with undergraduates in their 1st year is variance. Some will see a programming language for the first time when they enter the study, while others will have been programming for 5–10 years in advance. This diversity hurts your statistical models, since you need far larger groups to show a difference. In fact, you may be able to avoid a lot of the problems by only selecting students with little to no prior programming experience.

The other weakness of freshmen is their lack of experience. Some people claim that while dynamic typing is easier to learn, static typing is only appreciated once you have written code for some time. If true, this will affect your experiments.

Another weakness is failing to control for language differences. Compare Standard ML to Python, for instance:

  • Python has a vast standard library. Standard ML does not. You will have to cripple Python’s stdlib so it is on par with SML, or extend the SML library to the same extent as Python’s.

  • Python has built-in hash tables in the form of “dictionaries”. You will have to provide the same structure to users of Standard ML. Otherwise, you risk the availability of dictionaries in Python becoming a confounding variable in your experiment.

  • How do people interact with Python? If they have access to an IDE providing automatic help for programming, it will confound the experiment.

While it may seem unfair to artificially cripple one language to match the other, it is important to ask: what are you measuring? Any experiment is about controlling for weaknesses in the method used, and ignorance is the fastest way to a wrong conclusion.

The question of general programmer variance is also still open. There is an oft-cited paper, “EXPLORATORY EXPERIMENTAL STUDIES COMPARING ONLINE AND OFFLINE PROGRAMING PERFORMANCE”, from 1966. The paper is often cited in the scope of the 10x-programmer myth: “some programmers are 10 times as effective as their colleagues”. In reality the paper is sound statistical work which provided two major insights at the time:

  • Programmers would either write their programs on punched cards with no help from a computer, or they would develop the program “online”, directly on the computer. The former, offline, method was argued by some to be better because it forced the programmer to think before writing code. The study found that online editing is significantly better. At the time, this was a trade-off, since time on the computer was expensive and limited. Today, this limitation is non-existent.

  • Programmers writing in a higher-level language (JTS, the JOVIAL Time-Sharing System) were significantly faster at solving the task than programmers writing in a low-level language.

In passing, the paper mentions the large variance between programmers and invites more research in that area. It would seem this kind of research is still up for grabs for any interested party.

Furthermore, it would be interesting to see if programmer variance changes with experience. Clearly, two freshmen, one with 2 months of programming experience and the other with 10 years, would likely differ: the programmer with 10 years of experience is more skilled, even when the language they are writing in is new to both. Writing code in different languages has a peculiar overlap, and learning a new language becomes easier the more languages you happen to know.

But take the same two people as graduates with 3–4 years of experience in industry. Is the variance still large, or has it changed in any way?

The kinds of tasks in the tests also need some work. A large chunk of modern programming is not about building something genuinely new, but rather about gluing together existing systems. A lot of programmer productivity today can be measured by how well code fits together. Interestingly, a language like Go seems to focus much more on seamless, implicit gluing of code than many other languages. I’d love to see an experiment where one is to integrate with existing code.

Another experiment I would like to see is the solution of a typical industrial problem, with both industrial and academic languages in the experiment. Implementing a spelling checker, while interesting, is not the typical task of a modern programmer.

It is my hope we will see more falsifying experiments. We need experiments that soundly attempt to disprove certain commonly held beliefs. It would seem we are at a point where we have lots of questions and relatively few sound answers. Starting in the small by destroying some common myths would be a good way to move us forward.

The purpose of science is surprise. We want studies which surprise us and show us something deeper about the world that we did not know. I’d love repeated experiments which fail to reject the null hypothesis that static and dynamic languages are equally productive. That would argue the discussion is pointless. Or perhaps there is a significant difference, which also makes the discussion pointless. Either way it would move us forward, and we could start looking at other merits of static/dynamic typing.

Language design is not additive. Like genes, changing one aspect of a language may affect other aspects. Rob Pike put it succinctly in “Less is exponentially more”, where he makes the argument that it is the sum and not the individual parts which makes up a language design.

In this light, I also hope for surprising outcomes:

  • The execution efficiency of programs doesn’t matter as much as proper software design. Efficient parallelism is more important than single-core execution speed.

  • Logic programming in Prolog or Mercury yields significantly fewer program errors than writing in Haskell. The programs also run faster.

  • Dependent types are found to slow down the programmer too much. Hindley-Milner inference turns out to be the sweet spot, providing the best of all worlds.

  • There is no measurable advantage of writing unit tests for a statically typed language.

  • In practice, QuickCheck finds all the bugs which were found by formal verification in Coq, and does so with 10 times less development effort.

  • Modern C static checkers (FlexeLint, Coverity, address sanitization, and Valgrind) can find any bug in practice, so it doesn’t matter that C has undefined behaviour.

  • Functional programming in the large is more memory efficient than imperative programming and thus executes faster on modern memory-constrained machines.

  • Using formal parser and grammar theory can eliminate all security bugs w.r.t. input parsing.

  • DSLs are easier to write in homoiconic languages, especially Clojure. It is found that they are far harder to write in Haskell.

  • Experiments show message passing is more correct and more efficient than shared-memory methods such as software transactional memory, mutexes, and lock-free data structures.

  • The myth of garbage collection pauses is dispelled, and garbage collection can be used for hard real-time systems.

  • Over a large experiment, it is shown that preference is frequency-dependent: most people prefer imperative languages, but a small part of the population thrives on functional programming. There is no hope in forcing one group to use the tools of the other group.

  • An experiment connects aptitude for natural language with aptitude for programming languages. As a corollary, women, who usually outpace men in language-based tests, happen to be the better programmers on average.

  • It is shown that the 10x-programmers outperform others not due to their ability to program, but because they have a much better understanding of human sociology.

[0] The right root is perhaps ALGOL 68, but certainly the syntax and semantics of C permeate almost all modern programming languages.