As a historical note, the Mercury period-vs-comma¹ bug probably wasn't a typo as such. In that era, programmers didn't type their programs; they wrote them on paper, using special printed forms² so that their intent would be clear to the professional typists who entered them. To prevent typos, everything would be typed at least twice; in the days before diff, there was special-purpose hardware for this.³
I can think of three possibilities for the programmer making such an error: actually writing it wrong (which seems unlikely to me because the mental context of writing a loop range doesn't lend itself to writing a single number); a skipping pen; or a locale issue — i.e. a programmer of European origin mixing up the characters.
I cannot find how they controlled for the wordiness of the different languages. They changed one token in each file, but the number of tokens per file might be different. For example, Python likely will be shorter than Java due to its significant whitespace.
Also, the 'replace a single character in a token by noise' change may have hugely different effects, not only because of differences in keywords (begin…end vs {…}) but also, and probably more so, because of average variable and function name length. For the languages tested, this is a cultural issue, but it would not surprise me if the effect were large. You won't find 'FooFactory' in a Perl program.
Using my complete lack of statistical knowledge, I multiplied the wrong-output rate by the total lines of code in the examples from the original paper (http://www.spinellis.gr/pubs/conf/2012-PLATEAU-Fuzzer/pub/ht...) to get a very bad approximation of fat-fingering adjusted for program length. You'd expect more typos in a longer program; the original experiment always introduced 1 typo per run regardless of program length.
You guys enjoy while I prepare for the lynch mob of Statisticians :-)
Lang      Err %    LOC    LOC-adjusted Err %
Ruby      0.17     159    27.03
Python    0.15     161    24.15
Perl      0.22     156    34.32
PHP       0.36     224    80.64
JS        0.18     102    18.36
Java      0.1      331    33.1
Haskell   0.15     114    17.1
C#        0.095    389    36.955
C++       0.08     461    36.88
C         0.1      458    45.8
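For anyone who wants to redo the arithmetic, here is a minimal Python sketch of the same crude adjustment. The figures are copied from the table above; the names DATA, loc_adjusted and ranked are made up for illustration, and multiplying by LOC is only this comment's rough proxy, not anything the paper does.

```python
# (wrong-output rate as given in the table, lines of code) per language.
DATA = {
    "Ruby": (0.17, 159),
    "Python": (0.15, 161),
    "Perl": (0.22, 156),
    "PHP": (0.36, 224),
    "JS": (0.18, 102),
    "Java": (0.1, 331),
    "Haskell": (0.15, 114),
    "C#": (0.095, 389),
    "C++": (0.08, 461),
    "C": (0.1, 458),
}


def loc_adjusted(err, loc):
    """Crude adjustment: wrong-output rate multiplied by program length."""
    return err * loc


def ranked(data):
    """Return (language, adjusted score) pairs, lowest score first."""
    rows = [(lang, round(loc_adjusted(err, loc), 3))
            for lang, (err, loc) in data.items()]
    return sorted(rows, key=lambda row: row[1])


if __name__ == "__main__":
    for lang, score in ranked(DATA):
        print(f"{lang:8} {score:8.3f}")
```

Sorting makes it easier to see which languages cluster together under this adjustment.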
Looks like an improvement to my (completely unbiased, of course) eyes. Haskell moves away from C++/Java, C moves away from them in the reverse direction, and PHP moves into its own league.
The surprises, IMO, are JavaScript (I would place it close to PHP) and Perl (apparently, it is easy to come up with character sequences that are not valid Perl :-))
Thinking of ways to get a perfect language according to this metric: the way to get there is to introduce lots of redundancies in the grammar. For example, if one requires two exact copies of the same source before code compiles, any single change will give compilation errors. However, programmers would build tools to defeat such strategies.
Maybe, one should scale for actual content, e.g. by weighing against the size of gzipped source code?
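A rough Python sketch of what scaling by gzipped size could look like. The helper names compressed_size and size_adjusted and the sample snippet are assumptions for illustration only; the paper does not do this.

```python
import gzip


def compressed_size(source: str) -> int:
    """Size in bytes of the gzip-compressed source, a rough proxy for content."""
    return len(gzip.compress(source.encode("utf-8")))


def size_adjusted(err_rate: float, source: str) -> float:
    """Hypothetical adjustment: weight an error rate by compressed size."""
    return err_rate * compressed_size(source)


if __name__ == "__main__":
    # Made-up snippet standing in for a real program.
    sample = "for i in range(10):\n    total += values[i] * weights[i]\n"
    doubled = sample * 2  # the 'two exact copies' trick
    print(len(sample), compressed_size(sample))
    print(len(doubled), compressed_size(doubled))
```

The point of the comparison is that an exact second copy of the source adds very little to the compressed size, so a language that demanded such redundancy would not get credit for it under this weighting.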
Yeah, look at the JavaScript LOC! Who wrote the rosetta code for those, Brendan Eich?!
This hints at another way to optimize for this metric: make the language as expressive as possible. Fewer characters should translate into fewer typos. Paul Graham strikes again! (http://www.paulgraham.com/power.html)
As to your point about redundancy, I think the researchers are in agreement with you on that one if you consider unit tests to be a sort of redundancy, expressing the same concept in two different ways. They bring this up repeatedly in their report.
Obligatory Perl jab: It surprised me that any of the Perl solutions used more than one line. :-P
> Thinking of ways to get a perfect language according to
> this metric: the way to get there is to introduce lots of
> redundancies in the grammar. For example, if one requires
> two exact copies of the same source before code compiles
So, assume that every inserted typo is in fact a serious error in the program, and that the output comparison they used plays the role of a unit test. Then an interesting number to look at is:
Errors remaining = Successful run - Faults caught in unit test
In that regard most languages in the study are quite equal. Although with the static languages you most likely catch the errors much earlier and you probably get a much better hint of where the error is, instead of just an assertion that you have an error. And of course this assumes that you do have unit tests.
Java: lots of syntax to fuzz, and all of it has to be right.
Haskell: few tokens, stronger checking (i.e. no coercions), though Rosetta Code is not as type-heavy as real Haskell code.
I have a conjecture that Haskell examples designed by experts, with types in mind as we do in production systems, would have lower compile rates than the Rosetta examples, which are written mostly by non-experts without regard to maintainability.
In production Haskell code, I will usually wrap Double values with newtypes, e.g. to distinguish currency amounts, percent data and ratios from each other, specifically to guard against typos where I accidentally pass Double parameters in the wrong order. Designing code with an intent to make it less vulnerable to fat fingers is certainly possible.
Features such as automatic creation of mentioned variables and dynamic typing allow a mistake in code to change a correct program into a syntactically valid program that does the wrong thing.
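A small made-up Python illustration of the kind of thing I mean (the function names are hypothetical): a one-character slip in a variable name creates a new variable instead of raising an error, so the program still runs and just returns the wrong answer.

```python
def total_price(prices):
    total = 0
    for p in prices:
        # Typo: "totl" instead of "total". Python silently creates a new
        # local variable, so "total" is never updated.
        totl = total + p
    return total


def total_price_fixed(prices):
    total = 0
    for p in prices:
        total = total + p
    return total


if __name__ == "__main__":
    print(total_price([1, 2, 3]))        # prints 0
    print(total_price_fixed([1, 2, 3]))  # prints 6
```

In a language that requires declarations, the misspelled name would be rejected at compile time.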
I know my evidence is just anecdotal, but I see it a lot. Man, do I miss my C++ and my Haskell when I have to use Python and JavaScript at work.
Most people seem to disagree with that happening "a lot", so maybe I'm working with bad codebases or just being unlucky. Indeed, a controlled study would yield more reliable information.
It is a comparison of several programming languages regarding how likely it is that a typo leaves the program compiling and running but producing wrong output. If a typo slips past the programmer, you would really want the parser to catch it, or at least to get obvious misbehaviour at runtime (e.g. a crash), instead of code that silently runs and produces the wrong results.
>> I think that the most significant outcome of our study is the demonstration of the potential of comparative language fuzz testing for evaluating programming language designs.
¹ http://catless.ncl.ac.uk/Risks/9.54.html#subj1.1
² http://en.wikipedia.org/wiki/File:FortranCodingForm.png
³ http://en.wikipedia.org/wiki/Keypunch#IBM_056_Card_Verifier