
Technical Perspective: Software Is Natural


To view the accompanying paper, visit

In just half of the average 150-year life span of a Galápagos tortoise, high-level programming languages have come into existence and undergone a remarkable transformation. The earliest higher-level languages made it easier to state mathematical expressions than was possible in machine code, but it was not long before the need was recognized for languages that go beyond mathematical computation and express solutions to business problems. One of the languages designed in response to this need was Cobol, with a stated desire to maximize the use of simple English phrases. From this point on, it became more common for programming languages to include a mix of keywords and constructs based on natural language phrases. Together with the ability of programmers to choose semantically meaningful variable names and to embed comments that explain the code in natural language, these constructs continue to ease the expression of complex ideas in software.

The ability to express programs more easily has enabled the creation of software that tackles complex problems. This software is often large and complex itself. To understand, extend, and fix problems in this intricate software, software engineers need tools. Many tools have been created, from ones that indicate where the software violates the syntactic rules of the language to defect predictors that suggest where bugs may be lurking and where effort might be directed to eradicate them before they affect users. The vast majority of these tools rely on deep knowledge of a programming language, often requiring specific work for each language and extensive development effort.

The following work by Hindle et al. takes an entirely new approach to providing tools to help build software. The authors demonstrate that the phrases used to express software are often repetitive and predictable, much as natural languages like English are in everyday use. Given these similarities, the corpus-based, statistically rigorous methods now being applied to natural language may also be of use in understanding and building software.


In taking this approach, the authors demonstrate some fascinating characteristics of software. For example, the ways in which humans express software are more regular than the ways in which they express text in English. At first glance, this result does not seem so surprising; after all, programming languages like Java require programmers to express the desired software within a framework of many keywords and syntactic constructs. However, the authors show that this regularity is not the result of the syntax; rather, it is due to the project and application domain for which the program is written. The authors also show that this regularity extends beyond Java to other programming languages.
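To make the notion of regularity concrete, here is a minimal sketch of one common way to quantify it: train a simple statistical language model on a token stream and measure its cross-entropy (average bits per token) on new text. Lower values mean the text is more predictable. The bigram model, the add-one smoothing, the toy token sequences, and the function names `train_bigram` and `cross_entropy` are all illustrative assumptions, not the authors' actual setup or data.

```python
import math
from collections import Counter


def train_bigram(tokens):
    """Count unigrams and adjacent token pairs in a token sequence."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def cross_entropy(model, tokens):
    """Average bits per token under an add-one-smoothed bigram model."""
    unigrams, bigrams = model
    vocab_size = len(unigrams) + 1  # one extra slot for unseen tokens
    pairs = list(zip(tokens, tokens[1:]))
    total_bits = 0.0
    for prev, cur in pairs:
        # Add-one smoothing keeps unseen pairs from getting probability zero.
        prob = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        total_bits -= math.log2(prob)
    return total_bits / len(pairs)


# Toy training "corpus": a single, hypothetical Java-like loop.
training = "for ( int i = 0 ; i < n ; i ++ ) { sum += a [ i ] ; }".split()
model = train_bigram(training)

# A sequence that repeats patterns seen in training, and one that does not.
familiar = "for ( int i = 0 ; i < n ; i ++ )".split()
unfamiliar = "while x not in seen : seen . add ( x )".split()

familiar_bits = cross_entropy(model, familiar)
unfamiliar_bits = cross_entropy(model, unfamiliar)
print(f"familiar:   {familiar_bits:.2f} bits/token")
print(f"unfamiliar: {unfamiliar_bits:.2f} bits/token")
```

The familiar sequence scores fewer bits per token because every adjacent pair in it was already seen during training. A toy corpus like this says nothing about real code or real English; it only shows the shape of the measurement.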

The treatment of software by statistical approaches built for natural language opens up many possibilities for building flexible tools to help develop software. For example, other researchers have now used statistical language modeling to develop translation tools from an API in one programming language to an API in another programming language, building on existing techniques used by Google for its translation service.
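As a small illustration of the kind of tool that statistical modeling makes possible, the sketch below ranks likely next tokens by how often they followed the previous token in a corpus. It is a deliberately simplified assumption-laden example: the bigram counts, the toy corpus, and the helper names `build_successors` and `suggest` are hypothetical and do not reproduce any tool described in the paper or the translation work mentioned above.

```python
from collections import Counter, defaultdict


def build_successors(tokens):
    """Map each token to a Counter of the tokens that followed it."""
    successors = defaultdict(Counter)
    for prev, cur in zip(tokens, tokens[1:]):
        successors[prev][cur] += 1
    return successors


def suggest(successors, prev, k=3):
    """Return up to k tokens that most often followed `prev`."""
    counts = successors.get(prev, Counter())
    return [token for token, _ in counts.most_common(k)]


# Toy corpus of loop-like token sequences (hypothetical data).
corpus = "i = 0 ; i < n ; i ++ ; j = 0 ; j < n ; j ++ ;".split()
successors = build_successors(corpus)

print(suggest(successors, "<"))        # tokens seen after "<"
print(suggest(successors, ";", k=2))   # two most frequent tokens after ";"
print(suggest(successors, "unknown"))  # no data for this token
```

Nothing in this sketch depends on the grammar of a particular programming language, which is the source of the flexibility: the same counting works on any token stream.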

One of the exciting aspects of this paper is the range of potential applications of statistical language models to software. For example, the approach can apply seamlessly both to code and to the natural language documents associated with it, whether they are bug reports, commit messages, design documents, or Web-based question-and-answer sites. This seamless application makes it possible to associate a range of documentation with code, enabling tools that suggest code for a given natural language description or, conversely, suggest documentation for a given piece of code. Even more provocative, the authors describe how the overall approach might be used to create scalable approximations for determining program properties that are usually costly to compute, such as a mined rule about a protocol or a fact about how or which storage might be accessed when a program is executed.

It is not often that research changes the course of a field. The authors' demonstration that software is natural, and that statistical language models apply to it, fundamentally opens up new approaches to creating scalable, useful software development tools.

Communications of the ACM (CACM) is now a fully Open Access publication.