Malevolent Machine Learning

At the start of the decade, deep learning restored the reputation of artificial intelligence (AI) following years stuck in a technological winter. Within a few years of becoming computationally feasible, systems trained on thousands of labeled examples began to exceed the performance of humans on specific tasks. One was able to decode road signs that had been rendered almost completely unreadable by the bleaching action of the sun, for example.

It became apparent just as quickly, however, that the same systems could easily be misled.

In 2013, Christian Szegedy and colleagues working at Google Brain found that subtle pixel-level changes spread across an image, imperceptible to a human, would lead a deep neural network (DNN) to classify a bright yellow U.S. school bus as an ostrich.
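To see how changes too small to notice can add up, consider a deliberately simplified sketch. It uses a linear scorer with random weights as a hypothetical stand-in (not the network or the optimization procedure Szegedy's team used) and nudges every pixel by at most 1% of its range in the direction that lowers the score. The names `perturb_linear`, `w`, `b`, and `eps` are assumptions made for illustration.

```python
import numpy as np

def perturb_linear(x, w, eps):
    """Move each pixel by at most eps in the direction that lowers the score w.x + b."""
    return np.clip(x - eps * np.sign(w), 0.0, 1.0)

rng = np.random.default_rng(0)
n_pixels = 32 * 32 * 3                      # hypothetical small RGB image, flattened
w = rng.normal(size=n_pixels)               # hypothetical classifier weights
x = rng.uniform(0.2, 0.8, size=n_pixels)    # hypothetical image, pixel values in [0, 1]
b = 1.0 - w @ x                             # bias chosen so the clean score is +1

eps = 0.01                                  # at most 1% of the pixel range per pixel
x_adv = perturb_linear(x, w, eps)

clean_score = w @ x + b                     # positive: the original class
adv_score = w @ x_adv + b                   # negative: the decision has flipped
max_change = np.max(np.abs(x_adv - x))      # largest change made to any single pixel
print(clean_score > 0, adv_score < 0, max_change)
```

Because thousands of tiny nudges all push the score the same way, their combined effect can dwarf the original margin even though no single pixel changes visibly.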

Figure. High-resolution images of fake “celebrities” generated by a Generative Adversarial Network using the CelebA-HQ training dataset.

Two years later, Anh Nguyen, then a Ph.D. student at the University of Wyoming, and colleagues developed what they referred to as “evolved images.” Some were regular patterns with added noise; others looked like the static from an analog TV broadcast. Both were just abstract images to humans, but these evolved images would be classified by DNNs trained on conventional photographs as cheetahs, armadillos, motorcycles, and whatever else the system had been trained to recognize.

A 2017 attack centered on road-sign recognition demonstrated the potential vulnerability of self-driving cars and other robotic systems to attacks that took advantage of this unexpected property. In an experiment, a team from the University of Michigan, Ann Arbor demonstrated that brightly colored stickers attached to a stop sign could make a DNN register it as a 45 mph speed-limit sign. This and similar attacks encouraged the U.S. Defense Advanced Research Projects Agency (DARPA) to launch a project at the beginning of this year to try to develop practical defenses against these attacks.

The key issue that has troubled deep learning researchers is why deep learning models seem to be fooled by what appears to humans to be noise. Although experiments by James DiCarlo, a professor in neuroscience working at the Massachusetts Institute of Technology (MIT), and others have shown similarities between the gross structure of the visual cortexes of primates and DNNs, it has become clear the machine learning models make decisions based on information the brain either does not perceive, or simply ignores.

In work published earlier this year, the student-run Labsix group based at MIT found features recognized by DNNs can be classified into groups they call robust and non-robust.

Andrew Ilyas, Ph.D. student and Labsix member, says robust features are those that continue to deliver the correct results when the pixels they cover are changed by small amounts, as in Szegedy’s experiments. “For instance, even if you perturb a ‘floppy ear’ by a small pixel-wise perturbation, it is still indicative of the ‘dog’ class.”

Non-robust features, on the other hand, may be textures or fine details that can be disguised by lots of tiny changes to pixel intensity or color. “Imagine that there is a pattern that gives away the true class, but is very faint,” Ilyas suggests. It does not take much to hide it or change it to resemble a non-robust feature from a completely different class.

In work similar to that of the Labsix group, Haohan Wang and colleagues at Carnegie Mellon University found that filtering out high-frequency information from images worsened the performance of DNNs they tested. Ilyas stresses that the work his group performed demonstrated that the subtle features are useful and representative, but they are easy to subvert, underlining, he says, “a fundamental misalignment between humans and machine-learning models.”
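A minimal sketch of what filtering out high-frequency information can mean in practice, assuming a grayscale image and a circular cutoff in the Fourier domain. The function name `low_pass` and the `radius` parameter are illustrative assumptions; this is not necessarily the pipeline Wang and colleagues used.

```python
import numpy as np

def low_pass(image, radius):
    """Keep only spatial frequencies within `radius` of the zero frequency."""
    h, w = image.shape
    # Move the zero frequency to the center of the spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    yy, xx = np.ogrid[:h, :w]
    mask = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2
    # Zero everything outside the circle and transform back.
    return np.real(np.fft.ifft2(np.fft.ifftshift(spectrum * mask)))

n = 64
yy, xx = np.mgrid[:n, :n]
coarse = np.sin(2 * np.pi * xx / n)      # one slow cycle across the image
fine = 0.1 * (-1.0) ** (xx + yy)         # faint pixel-level checkerboard texture
image = coarse + fine

filtered = low_pass(image, radius=8)
print(np.allclose(filtered, coarse))     # the faint texture is gone, the shape remains
```

The faint checkerboard plays the role of a fine texture a human would barely notice; after filtering, only the coarse pattern survives, so a model that had relied on the texture would lose that signal.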

Researchers have proposed a battery of methods to try to defend against adversarial examples. Many have focused on the tendency of DNNs to home in on the more noise-like, non-robust features. However, attacks are not limited to those features, as various attempts at countermeasures have shown. In one case, a team of researchers working at the University of Maryland used a generative adversarial network (GAN) similar to those used to synthesize convincing pictures of celebrities. This GAN, called Defense-GAN, rebuilt source images without the high-frequency noise associated with most adversarial examples and, for a while, proved hard to fool. But eventually another team warped images using larger-scale changes to create adversarial examples that did beat Defense-GAN.

The most resilient approach so far is that of adversarial training. This technique provides the DNN with a series of examples during the training phase that try to force the model to ignore features that are shown to be vulnerable. It is a technique that comes with a cost: experiments have revealed that such training can just as easily hurt the performance of the DNN on normal test images; networks begin to lose their ability to generalize and classify new images correctly. They start overfitting to the training data.
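The trade-off can be illustrated with a toy sketch, assuming a two-feature logistic regression rather than a DNN. Feature 0 is large but agrees with the label only 80% of the time (robust); feature 1 always agrees but is faint enough for the attacker to flip (non-robust). The data, the helper names `worst_case`, `train`, and `accuracy`, and the budget `eps` are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def worst_case(X, s, w, eps):
    """Worst-case L-infinity perturbation for a linear model: shift every feature
    by eps in the direction that reduces the margin of the true label s (+1/-1)."""
    return X - eps * s[:, None] * np.sign(w)[None, :]

def train(X, y, eps, steps=5000, lr=0.1):
    """Logistic regression by gradient descent; eps=0 gives standard training."""
    w, b = np.zeros(X.shape[1]), 0.0
    s = 2.0 * y - 1.0
    for _ in range(steps):
        X_adv = worst_case(X, s, w, eps)     # attack the current model
        p = sigmoid(X_adv @ w + b)
        w -= lr * X_adv.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def accuracy(X, y, w, b, eps=0.0):
    """Accuracy on inputs perturbed with budget eps (eps=0 means clean inputs)."""
    s = 2.0 * y - 1.0
    scores = worst_case(X, s, w, eps) @ w + b
    return float(np.mean((scores > 0) == (y > 0.5)))

s = np.array([1, 1, 1, 1, 1, -1, -1, -1, -1, -1], dtype=float)
agree = np.array([1, 1, 1, 1, -1, 1, 1, 1, 1, -1], dtype=float)
X = np.column_stack([s * agree, 0.3 * s])    # [robust feature, non-robust feature]
y = (s > 0).astype(float)
eps = 0.5                                    # attacker's per-feature budget

w_std, b_std = train(X, y, eps=0.0)
w_adv, b_adv = train(X, y, eps=eps)
std_clean, std_robust = accuracy(X, y, w_std, b_std), accuracy(X, y, w_std, b_std, eps)
adv_clean, adv_robust = accuracy(X, y, w_adv, b_adv), accuracy(X, y, w_adv, b_adv, eps)
print("standard:   ", std_clean, std_robust)
print("adversarial:", adv_clean, adv_robust)
```

In this toy setting the standard model leans on the faint feature, which serves it well on clean inputs but leaves it exposed to the attacker, while the adversarially trained model falls back on the robust feature and gives up clean accuracy in exchange.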

“When training our model with adversarial training, we explicitly discourage it from relying on non-robust features. Thus, we are forcing it to ignore information in the input that would be useful for classification,” Ilyas notes. “One could argue, however, that the loss in accuracy is not necessarily a bad thing.”

Ilyas points out that the lower accuracy of robust models is probably a more realistic estimate of a machine learning model’s performance if we are expecting DNNs to recognize images in the same way humans do. Ilyas says one aim of the Labsix work is to close the gap between human and machine by forcing the DNNs to home in on larger features. This will have the effect of making it easier for humans to interpret why the models make the mistakes they do.

However, with conventional DNN architectures, there is still some way to go to close the gap with humans, even if non-robust features are removed from the process. A team led by Jörn-Henrik Jacobsen, a post-doctoral researcher at the Vector Institute in Toronto, Canada, found it is possible for completely different images to lead to the same prediction. Not only that, adversarially trained DNNs that focus on robust features seem to be more susceptible to this problem.

A statistical analysis performed by Oscar Deniz, associate professor at the Universidad de Castilla-La Mancha in Spain, and his colleagues suggests a deeper issue with machine learning models as they exist today that may call for architectural enhancements. Deniz says the presence of adversarial examples is a side-effect of a long-standing trade-off between accuracy and generalization: “From my point of view, the problem is not in the data, but in the current forms of machine learning.”

A different approach to combating adversarial examples that does not rely on changes to the learned models themselves is to find ways to determine whether a machine learning model has gone further than its training should allow. A major problem with DNNs in the way they are constructed today is that they are overly confident in the decisions they make, whether rightly or wrongly. The chief culprit is the “softmax” layer used by most DNN implementations to determine the probability of the image being in any of the categories on which it was trained.

Nicolas Papernot, a research scientist at Google Brain, explains, “The softmax layer is a great tool for training the model because it creates a nice optimization landscape, but it is not a suitable model for making predictions. A softmax layer does not allow the model to refuse to make a prediction. It is not surprising, then, that once presented with an input that it should not classify, a neural network equipped with a softmax outputs an incorrect prediction.”
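A short sketch of the behavior Papernot describes, using a standard softmax over made-up logits. The values in `weak_logits` are hypothetical and stand for an input that offers almost no evidence for any class.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax: exponentiate shifted logits and normalize."""
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

weak_logits = np.array([0.2, 0.1, 0.0])       # hypothetical: barely any evidence
probs_weak = softmax(weak_logits)             # still a full probability distribution
probs_scaled = softmax(100.0 * weak_logits)   # same ranking, much larger logits

print(probs_weak, probs_weak.sum())           # probabilities always sum to 1
print(int(np.argmax(probs_weak)))             # a class is always chosen
print(probs_scaled[0])                        # near-certainty from the same ranking
```

The output always sums to one, so there is no way for the layer to express "none of the above," and simply scaling up the same logits turns a near coin toss into near-certainty.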

Originally developed by Papernot, while he was a Ph.D. student, together with Patrick McDaniel, professor of information and communications science at Pennsylvania State University, the Deep k-Nearest Neighbors (DkNN) technique performs a layer-by-layer analysis of the decisions made by the machine learning model during classification to construct a “credibility score.” Adversarial examples tend to lead to results that are not consistent with a single class, but with multiple different classes. It is only toward the end of the process that the softmax layer raises the probability of an incorrect result to a high-enough level to push the result off-target.

“The DkNN addresses the uncertainty that stems from learning from limited data, which is inevitable,” Papernot says. The idea behind using DkNN to detect adversarial examples is to ensure the model makes a prediction only when it has enough training data to call upon to be able to generate a high-enough credibility score; otherwise, it will say it does not know, and a system relying on that DNN would either need to seek a second opinion, or to try to obtain more data.
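The layer-by-layer idea can be sketched in a heavily simplified form. The code below looks up the nearest training points in each representation of a hypothetical two-stage model and reports what fraction of them share the majority label. This plain agreement fraction is a stand-in assumed for illustration; the scoring in the published DkNN method is more involved, and the function names, the weight matrix `W`, and the data are all made up.

```python
import numpy as np

def representations(X, W):
    """Raw input plus one hidden layer of a hypothetical two-stage model."""
    return [X, np.tanh(X @ W)]

def neighbor_agreement(x, train_X, train_y, W, k=3):
    """Collect the labels of the k nearest training points in every representation,
    then return the majority label and the fraction of neighbors that share it."""
    labels = []
    for train_h, test_h in zip(representations(train_X, W),
                               representations(x[None, :], W)):
        dist = np.linalg.norm(train_h - test_h, axis=1)
        labels.extend(train_y[np.argsort(dist)[:k]])
    counts = np.bincount(np.array(labels), minlength=int(train_y.max()) + 1)
    prediction = int(np.argmax(counts))
    return prediction, counts[prediction] / len(labels)

offsets = 0.1 * np.arange(5)
class0 = np.column_stack([1.0 + offsets, np.full(5, 0.01)])
class1 = -class0                              # mirror image of class 0
train_X = np.vstack([class0, class1])
train_y = np.array([0] * 5 + [1] * 5)
W = np.diag([1.0, 200.0])                     # hidden layer amplifies the faint feature

clean = np.array([1.05, 0.01])
tampered = np.array([1.05, -0.01])            # faint feature flipped by a tiny change

pred_clean, score_clean = neighbor_agreement(clean, train_X, train_y, W)
pred_tampered, score_tampered = neighbor_agreement(tampered, train_X, train_y, W)
print(pred_clean, score_clean)
print(pred_tampered, score_tampered)
```

For the clean point, the neighbors agree in both representations; for the tampered point, the input-space neighbors and the hidden-layer neighbors belong to different classes, so the agreement drops and a system could treat the prediction as one it should not trust.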

Having developed attacks on DkNN together with his supervisor David Wagner, a professor of computer science at the University of California, Berkeley, Ph.D. student Chawin Sitawarin says an issue with the current approach is that it tends to suffer from false positives: correct classifications that have unusually low credibility scores. Sitawarin says improvements to the way the score is calculated could increase reliability, and that DkNN-like techniques represent a promising direction for detecting adversarial examples.

As work continues on multiple fronts, it seems likely that defense against these attacks will go hand-in-hand with greater understanding of how and why DNNs learn what they do.

Further Reading

Ilyas, A., Santurkar, S., Tsipras, D., Engstrom, L., Tran, B., and Madry, A.
Adversarial Examples Are Not Bugs, They Are Features. ArXiv preprint (2019): https://arxiv.org/abs/1905.02175

Wang, H., Wu, X., Yin, P., and Xing, E.P.
High Frequency Component Helps Explain the Generalization of Convolutional Neural Networks. ArXiv preprint (2019): https://arxiv.org/abs/1905.13545

Papernot, N., and McDaniel, P.
Deep k-Nearest Neighbors: Towards Confident, Interpretable and Robust Deep Learning. ArXiv preprint (2018): https://arxiv.org/abs/1803.04765

Jacobsen, J.H., Behrmann, J., Carlini, N., Tramer, F., and Papernot, N.
Exploiting Excessive Invariance caused by Norm-Bounded Adversarial Robustness. ICLR 2019 Workshop on Safe ML, New Orleans, Louisiana. https://arxiv.org/abs/1903.10484

December 2019 Issue

Communications of the ACM (CACM) is now a fully Open Access publication.
