Who Wrote This?

An international team of computer scientists and security engineers have developed a de-anonymization technique that can help identify the coder(s) behind malware.

Attributing cyberattacks to specific computer programmers is a difficult and contentious task. Apportioning blame is often left to informed guesswork, with digital forensics teams assessing technical, operational, and geopolitical strategic factors to infer who a coder is most likely to be, and with which hacker team they might be associated.

Such guesswork can leave different analysts fingering different perpetrators. For instance, an FBI analysis pointed to North Korean attackers being responsible for the commercially devastating hack on the movie studio Sony Pictures Entertainment in 2014. Other security experts, however, say the evidence pointed to a long-term inside job by disgruntled former Sony employees.

To reduce the scope for error, better attribution tools are needed—especially with ongoing investigations into the Russian hacking of the 2016 U.S. presidential election, and into China allegedly implanting malware to acquire Western intellectual property, such as draft patents. This is why organizations like the U.S. Defense Advanced Research Projects Agency are funding research to create "new science around the ability to quickly, objectively and positively identify" the perpetrators of cyberattacks.

While that new science develops, however, recent research is attacking a major plank of the attribution problem: the programmer obfuscation caused by the very act of compiling software. In compilation, the high-level language (HLL) source code that a programmer writes is translated into the digital 0s and 1s that percolate through the logic gates inside microprocessors when the code actually runs. This compiled code is referred to as a binary.

The problem is that binaries—random-looking jumbles of 0s and 1s—lose all the HLL's distinguishing features (variable names, commands, function calls, and the program flow structure) that demonstrate the unique way in which every coder writes software. A coder's narrative structure is as distinctive as a novelist's writing style, and so-called "stylometric" analyses of source code can attribute it to an author with great accuracy.

Because malware, ransomware, and other cyberattack tools are normally recovered as binaries rather than source code, source-code-style attribution has not been possible for them. It is conventional wisdom in computer security circles that once code is converted to binary form, all attribution bets are off.

However, that is no longer the case. An international team of computer scientists and security engineers led by Aylin Caliskan and Arvind Narayanan at Princeton University in New Jersey, and including Konrad Rieck at the Technical University of Braunschweig in Germany, have developed a de-anonymization technique that reassembles a form of high-level language from the binary, and then uses a machine learning classifier to identify the author with great accuracy.

Unveiled at February's Network and Distributed System Security Symposium (NDSS) in San Diego, CA, this de-anonymizing technique reverse-engineers an HLL version of the software source code from the binary, a version good enough for a classifier to compare with programming profiles of suspects.

This de-anonymization involves four steps. First, in a stage called disassembly, features in the target binary file based on machine code instructions, data strings, and symbols are pulled out, allowing a rough machine code equivalent of the program to be drawn up.
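To make the idea of pulling raw features out of a binary concrete, here is a minimal sketch in Python. It is not the researchers' disassembler: it only recovers printable data strings and counts adjacent byte pairs as a crude stand-in for instruction patterns. The function names and the sample bytes are hypothetical and chosen purely for illustration.

```python
import re
from collections import Counter


def printable_strings(blob: bytes, min_len: int = 4) -> list[str]:
    """Return runs of printable ASCII that are at least min_len bytes long."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [match.decode("ascii") for match in re.findall(pattern, blob)]


def byte_bigrams(blob: bytes) -> Counter:
    """Count adjacent byte pairs, a rough proxy for instruction patterns."""
    return Counter(zip(blob, blob[1:]))


# Hypothetical bytes standing in for a fragment of a compiled program.
sample = b"\x55\x48\x89\xe5\x00usage: %s <file>\x00\x90\x90\xc3\x01ok\x00fopen\x00"

print(printable_strings(sample))
print(byte_bigrams(sample).most_common(3))
```

A real disassembler decodes the instruction stream properly; this sketch only shows that a seemingly random jumble of bytes still carries countable, comparable features.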

Second, that listing is translated, in a process called decompilation, into a human-readable version of the C programming language from which features of the program's structure (such as loops, branches, and flow syntax) are extracted.
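As a rough illustration of what extracting structural features from decompiled C might look like, the Python sketch below counts control-flow keywords and measures brace nesting depth. The keyword list, the feature names, and the decompiler-style sample (including the name sub_401000) are assumptions made for this example, not the feature set used in the research.

```python
import re
from collections import Counter

# Assumed set of control-flow keywords; the real feature set is far richer.
CONTROL_KEYWORDS = ("for", "while", "do", "if", "else", "switch", "goto", "return")


def structure_features(c_source: str) -> dict[str, int]:
    """Count control-flow keywords and the deepest brace nesting level."""
    counts = Counter(re.findall(r"[A-Za-z_]\w*", c_source))
    features = {f"kw_{kw}": counts[kw] for kw in CONTROL_KEYWORDS}
    depth = max_depth = 0
    for ch in c_source:
        if ch == "{":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == "}":
            depth -= 1
    features["max_brace_depth"] = max_depth
    return features


# Hypothetical decompiler output; identifiers are machine-generated placeholders.
decompiled = """
int sub_401000(int a1) {
    int v1 = 0;
    for (int i = 0; i < a1; i++) {
        if (i % 2 == 0) {
            v1 += i;
        } else {
            v1 -= 1;
        }
    }
    return v1;
}
"""

print(structure_features(decompiled))
```

Note that the original variable names are gone from the decompiled text, yet choices such as loop type and nesting depth survive and can be counted.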

The third stage is reductive, cutting the number of programming features to be considered from many thousands to the 53 deeply technical ones that Caliskan's team believe are most informative of a programmer's style. "These features represent things such as arithmetic, logic, stack operations, file input and output operations, variable declarations, and initializations," she says.
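The article does not say how the team decided which features to keep. One common criterion, used here purely as an assumption, is information gain: how much knowing whether a feature is present reduces uncertainty about the author. The Python sketch below ranks a handful of made-up features for two hypothetical authors and keeps the top k.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of author labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy reduction from splitting samples on feature presence (value > 0)."""
    present = [lab for v, lab in zip(values, labels) if v > 0]
    absent = [lab for v, lab in zip(values, labels) if v <= 0]
    remainder = sum(
        len(part) / len(labels) * entropy(part) for part in (present, absent) if part
    )
    return entropy(labels) - remainder


def select_top_features(rows, labels, k):
    """Return the k feature names with the highest information gain."""
    names = list(rows[0].keys())
    scores = {n: information_gain([row[n] for row in rows], labels) for n in names}
    return sorted(names, key=lambda n: scores[n], reverse=True)[:k]


# Made-up feature counts for four binaries by two hypothetical authors.
rows = [
    {"kw_if": 3, "kw_do": 1, "kw_goto": 2},
    {"kw_if": 2, "kw_do": 0, "kw_goto": 1},
    {"kw_if": 4, "kw_do": 1, "kw_goto": 0},
    {"kw_if": 1, "kw_do": 0, "kw_goto": 0},
]
labels = ["alice", "alice", "bob", "bob"]

print(select_top_features(rows, labels, k=1))
```

In this toy data only one feature separates the two authors, so it is the one kept. The same ranking idea scales to pruning thousands of candidates down to a few dozen.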

The final stage is AI-based classification: using their technique on binaries stored in the public repository for the Google Code Jam algorithm writing competition, a machine learning classifier is trained to recognize first 100, and then 600 programmers' styles.
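The article says only that a machine learning classifier is trained on each programmer's known binaries. As a deliberately simple stand-in, the Python sketch below uses a nearest-centroid classifier: it averages each author's feature vectors and attributes an unknown binary to the closest average. The authors, the feature vectors, and the choice of classifier are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def train_centroids(vectors, authors):
    """Average each author's feature vectors into a single style profile."""
    grouped = defaultdict(list)
    for vec, author in zip(vectors, authors):
        grouped[author].append(vec)
    return {
        author: [sum(col) / len(vecs) for col in zip(*vecs)]
        for author, vecs in grouped.items()
    }


def predict(centroids, vec):
    """Attribute a feature vector to the author with the nearest profile."""
    return min(centroids, key=lambda author: math.dist(centroids[author], vec))


# Made-up feature vectors (for example, counts of three selected features).
training_vectors = [[5, 1, 0], [4, 2, 0], [1, 4, 3], [0, 5, 2]]
training_authors = ["alice", "alice", "bob", "bob"]

profiles = train_centroids(training_vectors, training_authors)
print(predict(profiles, [4, 1, 1]))
print(predict(profiles, [1, 5, 2]))
```

The key requirement is the same as in the research: labeled training samples must exist for every suspect, because the classifier can only choose among authors it has already seen.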

The results have been startlingly accurate. The best previous attempt to de-anonymize coders from binaries, in 2011, achieved a programmer identification accuracy of 51% (just one percentage point better than random chance).

The Princeton-led team's technique offered 96% programmer identification accuracy on 100 candidates, and 83% on 600 suspects. "We can now identify programmers, as long as we can obtain historical training data for a set of programmers suspected to have written the code of a binary executable," says Caliskan.
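To put those figures in context, it helps to compare them with blind guessing. Assuming every candidate is equally likely to be the author (an assumption of this sketch, not a claim from the researchers), a random guess among N candidates is right with probability 1/N. The short Python calculation below applies that to the two reported results.

```python
def chance_accuracy(num_candidates: int) -> float:
    """Probability of a correct uniform random guess among the candidates."""
    return 1 / num_candidates


# Accuracies reported in the article, keyed by number of candidate programmers.
reported = {100: 0.96, 600: 0.83}

for candidates, accuracy in reported.items():
    baseline = chance_accuracy(candidates)
    print(
        f"{candidates} candidates: chance {baseline:.2%}, "
        f"reported {accuracy:.0%}, about {accuracy / baseline:.0f}x chance"
    )
```

Under that assumption, the 600-candidate result is the more striking of the two relative to chance, even though its raw accuracy is lower.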

Computer scientists and cybersecurity experts are impressed at the new de-anonymization technique's seeming efficacy. "This is a surprising result," says Christian Skalka, a computer scientist specializing in innovative programming methods for cybersecurity applications at the University of Vermont. "Under realistic circumstances, the author of a program can be determined even from a compiled binary version of it.

"The key takeaway is that even after compiler transformations, a programmer's style is still tenaciously reflected in code characteristics that can be recognized using statistical machine learning techniques."

In the U.K., Jay Abbott, a Peterborough, Cambridgeshire-based consultant who works in the gaming, banking, and government cybersecurity sectors, says the technique "could be a significant game changer for the future" in helping law enforcement attribute malware and cyberattacks to alleged perpetrators.

"If it could be executed at scale, and automated," Abbott says, "this technique has some interesting potential as a solution to malware. I don't say that lightly: if you can make all code 100% attributable to an individual no one would write malicious code, as you would be caught as soon as your code was released. The reality is, though, that we are a long way away from that utopian future."

Peter Ladkin, an engineer specializing in computer networks and distributed systems at the University of Bielefeld in Germany, cautions that shiny new solutions tend to produce better countermeasures. "These researchers can attribute authorship quite well with hundreds of candidates. That's impressive and pretty sophisticated stuff," he says.

"But code-obfuscation techniques can also be improved as recognition techniques become more sophisticated, and my guess is that both will continue to advance."

Caliskan and her team are aware their technique could also infringe on the privacy of coders who need to remain anonymous for very good reasons. They suggest that coders who want to remain anonymous vary their coding style, use multiple obfuscation techniques, and avoid public repositories such as GitHub.

Skalka agrees. "One can no longer assume that a binary-only distribution will preserve programmer anonymity, which has ramifications for both privacy and secrecy. For example, the author of a program supporting information-sharing in an oppressive regime may wish to remain anonymous, but that author will not be protected merely by hiding the program source code, since the binary version can yield their secret," he says.

Regardless, Skalka says, "This de-anonymization work makes it clear that if the author of a program wishes to remain anonymous, they would be well-advised to not assume that standard compilation will preserve their anonymity, and instead apply aggressive obfuscation techniques."

Paul Marks is a technology journalist, writer, and editor based in London, U.K.