
Software Synthesis Learns By Example

DeepCoder can take requirements from a developer, search a database of code snippets, and deliver working code in seconds.
A Microsoft Research team developed an algorithm, labeled DeepCoder, that can analyze and learn from existing programs, and use that knowledge to create new software.

The vast body of code produced by humans is now beginning to teach software how to create programs on demand. Rather than replacing programmers, however, the results will be used to help humans program and make the code that machines generate easier for people to comprehend.

Late last year, a team from Microsoft Research demonstrated that neural networks could analyze existing programs and use that knowledge to create new software, rather than relying on hard-coded heuristics and rules.

Alexander Gaunt, a member of the Microsoft Research team that developed the DeepCoder algorithm, said, "The idea is for the system to be able to learn from a history of observed programs over time. Whereas most program synthesis techniques take a problem and solve it from scratch, this system learns from the programs it sees."

DeepCoder's neural network does not generate programs directly, but uses the knowledge it gleans from existing software, written in a domain-specific language (DSL) created for the experiment, to predict which functions the target program is likely to need. This dramatically reduces the search time of the synthesis engine that constructs the code.
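A minimal sketch of this idea, under stated assumptions: the five-function list DSL, the input-output examples, and the `PREDICTED` scores below are all hypothetical stand-ins (the scores play the role of a trained network's output), and the brute-force enumeration is a simplification, not the synthesis engine the team used. It shows only how predicted function scores can reorder an enumerative search so that likely candidates are tried first.

```python
from itertools import product

# Hypothetical toy DSL of list-to-list functions (not DeepCoder's actual DSL).
DSL = {
    "increment": lambda xs: [x + 1 for x in xs],
    "sort": lambda xs: sorted(xs),
    "double": lambda xs: [2 * x for x in xs],
    "reverse": lambda xs: xs[::-1],
    "drop_negatives": lambda xs: [x for x in xs if x >= 0],
}


def run(program, xs):
    """Apply each named DSL function in turn to the input list."""
    for name in program:
        xs = DSL[name](xs)
    return xs


def synthesize(examples, scores=None, max_len=3):
    """Enumerate programs shortest-first; return (program, candidates_tried).

    `scores` maps function names to a predicted likelihood of being needed.
    Higher-scoring functions are tried first within each program length.
    """
    scores = scores or {}
    # Stable sort: functions with equal scores keep their DSL declaration order.
    ordered = sorted(DSL, key=lambda name: -scores.get(name, 0.0))
    tried = 0
    for length in range(1, max_len + 1):
        for program in product(ordered, repeat=length):
            tried += 1
            if all(run(program, inp) == out for inp, out in examples):
                return program, tried
    return None, tried


# Hypothetical specification given as input-output examples.
EXAMPLES = [([3, -1, 2], [4, 6]), ([-5, 0, 1], [2, 0])]

# Assumed stand-in for the neural network's per-function predictions.
PREDICTED = {"drop_negatives": 0.9, "double": 0.8, "reverse": 0.7,
             "sort": 0.2, "increment": 0.1}

guided_program, guided_tried = synthesize(EXAMPLES, PREDICTED)
plain_program, plain_tried = synthesize(EXAMPLES)
print("guided:  ", guided_program, "after", guided_tried, "candidates")
print("unguided:", plain_program, "after", plain_tried, "candidates")
```

In this sketch the scores change only the order in which candidates are tried, not the set of programs the search can reach, so a poor prediction slows the search down without making the target program unreachable.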

For the experiment, the team did not have a large-enough collection of existing programs that were suitable for training, "so we enumerated possible programs in the DSL and pruned away those that made no sense," said Marc Brockschmidt, a researcher at Microsoft Research.
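The article does not say how the team decided which programs "made no sense." As an illustration only, the sketch below assumes one plausible criterion: a program is dropped if, on a few probe inputs, it behaves exactly like doing nothing or like a program already kept. The three-function DSL and the `PROBES` inputs are hypothetical.

```python
from itertools import product

# Hypothetical toy DSL (not the DSL used in the experiment).
DSL = {
    "reverse": lambda xs: xs[::-1],
    "sort": lambda xs: sorted(xs),
    "double": lambda xs: [2 * x for x in xs],
}

# Assumed probe inputs used to compare the behavior of programs.
PROBES = [[3, -1, 2], [0, 5, -4, 1]]


def run(program, xs):
    """Apply each named DSL function in turn to the input list."""
    for name in program:
        xs = DSL[name](xs)
    return xs


def behavior(program):
    """Outputs of the program on every probe, as a hashable signature."""
    return tuple(tuple(run(program, xs)) for xs in PROBES)


def enumerate_pruned(max_len=2):
    """Enumerate programs shortest-first, keeping one per observed behavior."""
    seen = {behavior(())}  # the empty program: output equals input
    kept = []
    for length in range(1, max_len + 1):
        for program in product(DSL, repeat=length):
            signature = behavior(program)
            if signature in seen:
                continue  # redundant: a no-op or a duplicate of a kept program
            seen.add(signature)
            kept.append(program)
    return kept


programs = enumerate_pruned()
print(len(programs), "programs kept out of", 3 + 3 * 3)
```

Here `("reverse", "reverse")` is discarded because it leaves every probe unchanged, and `("sort", "sort")` is discarded because it behaves like the shorter `("sort",)`. Equivalence on a handful of probes is only an approximation of true equivalence.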

The DeepCoder system uses short lists of input and output data to figure out what kind of algorithms the synthesized program should use. The same type of programming-by-example method is used by the Flash Fill synthesizer found in Microsoft's Excel product to automate operations for end-users. As with Flash Fill, the aim was for a system that "can target scenarios that end-users encounter," Brockschmidt said. "Things like automating actions in your file or Web browser seem realistic."

In separate work, Martin Vechev and colleagues at ETH Zurich have developed systems that can mine information from existing code repositories such as GitHub. "I believe such programming tools, based on big data, can have a tremendous impact on many areas associated with software and artificial intelligence," Vechev said.

One result of the work was the JS Nice online service hosted by the Software Reliability Lab at ETH Zurich. The service helps de-obfuscate JavaScript programs that have been put through a 'minifying' process used to reduce code size.

JS Nice generates meaningful names and type annotations for variables that are often reduced to single-letter names in minified code. Languages like JavaScript make de-obfuscation more difficult because they are dynamically typed, so the code carries no declared types to recover. Probabilistic models built from the analysis of many programs stored on GitHub help generate accurate data-type information.

The system has attracted some 200,000 users since its launch in June 2014, some of them using the tool for security checks, such as detecting malware on websites.

The 'big code' approach used at ETH Zurich has led to several other systems based on machine learning, including the DeepSyn and Deep3 program-synthesis engines, and to a spinout company, DeepCode, founded by Vechev and recent Ph.D. graduate Veselin Raychev, who worked on these systems as part of his thesis.

The probabilistic models that the engines learn can help to complete partially written programs by suggesting valid library calls and functions, or repair code after problems have been detected.
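As a rough illustration of completion by suggestion, the sketch below uses a simple bigram model over call sequences. This is a deliberately simplified stand-in, not the model behind DeepSyn or Deep3, and the `CORPUS` of call sequences is hypothetical rather than mined from real repositories.

```python
from collections import Counter, defaultdict

# Hypothetical training data: call sequences as they might be mined from programs.
CORPUS = [
    ["open", "read", "close"],
    ["open", "read", "close"],
    ["open", "write", "close"],
    ["connect", "send", "recv", "close"],
    ["connect", "send", "close"],
]


def train_bigrams(corpus):
    """Count how often each call is immediately followed by each other call."""
    counts = defaultdict(Counter)
    for calls in corpus:
        for prev, nxt in zip(calls, calls[1:]):
            counts[prev][nxt] += 1
    return counts


def suggest(counts, prev_call, k=2):
    """Return up to k (call, probability) pairs likely to follow prev_call."""
    following = counts.get(prev_call)
    if not following:
        return []  # nothing observed after this call in the corpus
    total = sum(following.values())
    return [(name, n / total) for name, n in following.most_common(k)]


model = train_bigrams(CORPUS)
suggestions = suggest(model, "open")
for name, prob in suggestions:
    print(f"after open(): {name}() with estimated probability {prob:.2f}")
```

A bigram model conditions only on the single preceding call; it is used here because it fits in a few lines, not because it reflects how the ETH Zurich engines condition their predictions.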

A key problem with machine learning systems that make predictions about code is that humans may be unable to interpret the results. One motivation for the DeepCoder project was to generate code that looks familiar to programmers, rather than train a neural network to generate a numerical matrix that operates on the data directly.

"It's important you communicate what you are doing, or the user has no trust that it works. You could have it simply output a matrix; that's useful, but not readable. Hopefully, if it's readable, you can interpret it," Brockschmidt said.

Vechev sees the goal of systems such as Deep3 as being more ambitious. They not only create intelligible programs, but also are able to explain the prediction through decision trees expressed in a human-readable DSL. "In fact, the probabilistic model behind Deep3 is not only applicable to program generation, but also to other domains, such as natural language," Vechev claimed.

At the upcoming International Conference on Learning Representations (ICLR) in Toulon, France, in late April, the ETH Zurich researchers will describe a system that shows how the approach behind Deep3 can be used to predict some types of natural-language statements. The same conference will also feature the Microsoft team's work on DeepCoder and machine learning.

Chris Edwards is a Surrey, U.K.-based writer who reports on electronics, IT, and synthetic biology.
