Sign In

Communications of the ACM

[email protected]

Empirical Answers to Important Software Engineering Questions (Part 2 of 2)


Bertrand Meyer

In part 1 of this article (please read it first, the present discussion will not make sense otherwise), I praised the development of empirical software engineering, gave examples of interesting results, and mentioned that, inevitably, the field started with properties that can be measuredparticularly in the product rather than process areaat the possible expense of what should be measured because it is of interest to software engineers.

The field has matured enough that we should now, in my view, change the focus: start from the problems of utmost importance to practicing developers and managers.

Indeed, this is what we are entitled to expect from empirical studies: guidance. The slogan of empirical software engineering is that software is worthy of study just like geological strata, photons, and lilies-of-the-valley; OK, sure, but we are talking about human artifacts rather than wonders of the natural world, and the idea should be to help us produce better software and produce software better.

Whenever we call for guidance from empirical studies, we should immediately include a caveat: every empirical study has its limitations (politely called "threats to validity") and one must be careful about any generalization. The following horror story serves as caution [1]. The fashion today in programming language design is to use the semicolon not as separator in the Algol tradition (instruction1 ; instruction2) but as a terminator in the C tradition (instruction1; instruction2;). The original justification, particularly in the case of Ada [2], is an empirical paper by Gannon and Horning [3], which purported to show that the terminator convention led to fewer errors. (The authors themselves not only give their experimental results but, departing from the experimenter's reserve, explicitly jump to the conclusion that terminators are better.) This view defies reason: witness, among others, the ever-recommenced tragedy of if c then a; else; b where the semicolon after else is an error (a natural one, since one gets into the habit of adding semicolons just in case) but the code compiles, with the result that b will be executed in all cases rather than (as intended) just when c is false [4]. How in the world could an empirical study come up with such a bizarre conclusion?

Go back to the original Gannon-Horning paper and the explanation becomes clear: the experiments used subjects who were familiar with the PL/I programming language, where semicolons are used generously and an extra semicolon is harmless, as it is in all practical languages (two successive semicolons being simply interpreted as the insertion of an empty instruction, causing no harm); but the experimental separator-based language and compiler used to the experiment treated an extra semicolon as an error! As if this were not enough, checking the details of the article reveals that the terminator language is terminator-based for both declarations and instructions, whereas the example delimiter language is only delimiter-based for instructions, but terminator-based for declarations. Talk about a biased experiment! The experiment was bogus and so are the results.

One should not be too harsh about a paper from 1975, when the very idea of systematic experimental studies of programming was novel, and some of its other results are worthy of consideration. But the sad terminator story, even though it only affected a syntax property, should serve as a reminder that we should not accept a view blindly just because someone invokes some empirical study to justify it. We should assess the study itself, its methods and its credibility.

With this warning in mind, we should still expect empirical software engineering to help us practitioners. It should help address important software engineering problems.

Ideally, I should now list the open issues of software engineering, but I am in no position even to start such a list. All I can do is to give a few examples. They may not be important to you, but they give an idea:

  • What are the respective values of upfront design and refactoring? How best can we combine these approaches?
  • Specification and testing are complementary techniques. Specifications are in principles superior to testing in general, but testing remains necessary. What combination of specification and testing works best?
  • What is the best commit/release technique, and in particular should we use RTC (Review Then Commit, as with Apache originally then Google) or CTR (Commit To Review, as Apache later) [5]?
  • What measure of code properties best correlates with effort? Many fancy metrics have appeared in the literature over the years, but there is still a nagging feeling among many of us that for all its obvious limitations the vulgar SLOC metrics (Source Lines Of Code) still remains the least bad.
  • When can a manager decide to stop testing? We did some work on the topic [6], but it is only a start.
  • Is test coverage a good measure of test quality [7] (spoiler: it is not, but again we need more studies)?

And so on. These examples may not be the cases that you consider most important; indeed what we need is input from many software engineers to help steer empirical software engineering towards the topics that truly matter to the community.

To provide a venue for that discussion, a workshop will take place 10-12 September 2018 (provisional dates) in the Toulouse area, involving many of the tenors in empirical software engineering, with the same title as these two articles: Empirical Answers to Important Software Engineering Questions. The key idea is to start not from the solutions side (the lamppost) but from the actual challenges facing software engineers. It will not just be a traditional publication-oriented meeting but will also include ample time for discussions and joint work.

If you would like to contribute your example "important questions", please use any appropriate support (responses to this blog, email to me, Facebook, LinkedIn, anything as long as one can find it). Suggestions will be taken into consideration for the workshop. Empirical software engineering has already established itself as a core area of research; it is time feed that research with problems that actually matter to software developers, managers and users.

Notes

[1] This matter is analyzed in more detail in section 26.5 of my book Object-Oriented Software Construction, 2nd edition, Prentice Hall. No offense to the memory of Jim Horning, a great computer scientist and a great colleague. Even great computer scientists can be wrong once in a while.

[2] I know this from the source: Jean Ichbiah, the original designer of Ada, told me explicitly that this was the reason for his choice of  the terminator convention for semicolons, a significant decision since it was expected that the language syntax would be based on Pascal, a delimiter language.

[3] Gannon & Horning, Language Design for Programming Reliability, IEEE Transactions on Software Engineering, vol. SE-1, no. 2, June 1975, pages 179-191, see here.

[4] This quirk of C and similar languages is not unlike the source of the Apple SSL/TLS bug discussed earlier in this blog under the title Those Who Say Code Does Not Matter.

[5] Peter C. Rigby, Daniel M. German, Margaret-Anne Storey: Open Source Software Peer Review Practices: a Case study of the Apache Server, in ICSE (International Conference on Software Engineering) 2008, pages 541-550, see here.

[6] Carlo A. Furia, Bertrand Meyer, Manuel Oriol, Andrey Tikhomirov and  Yi Wei:The Search for the Laws of Automatic Random Testing, in Proceedings of the 28th ACM Symposium on Applied Computing (SAC 2013), Coimbra (Portugal), ACM Press, 2013, see here.

[7] Yi Wei, Bertrand Meyer and Manuel Oriol: Is Coverage a Good Measure of Testing Effectiveness?, in Empirical Software Engineering and Verification (LASER 2008-2010), eds. Bertrand Meyer and Martin Nordio, Lecture Notes in Computer Science 7007, Springer, February 2012, see here.


Comments


Bonita Sharif

We found your two-part blog article very relevant to our Dagstuhl seminar titled Evidence About Programmers for Programming Language Design taking place Feb 4 9 2018. We find many of the arguments relate directly to programming language design that needs to be informed through programmer-based studies. Our goal is to initiate collaborations between PL designers, empirical researchers, educators, and biometric experts in order to determine a priority list for such studies. Please refer to http://www.dagstuhl.de/18061
- A. Stefik, S. Hanenberg, B. Myers, B. Sharif


Displaying 1 comment

Sign In for Full Access
» Forgot Password? » Create an ACM Web Account
ACM Resources