
When Images Fool AI Models

Large language models can misinterpret altered images that would not confuse humans.

Artificial intelligence (AI) systems can be fooled by certain image inputs. Called adversarial examples, they incorporate subtle changes to a picture that are imperceptible to humans and can be thought of as optical illusions for machines.

Well-known cases include an image of a panda that is misclassified as a gibbon when a specific pattern of noise is added, or a stop sign that is no longer recognized when a few black and white stickers are placed on it. Attackers can deliberately create such adversarial images to elicit unexpected or harmful output from AI models, for example those powering chatbots such as OpenAI’s ChatGPT and Google Gemini.
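
To make the idea concrete, the noise in the panda example is typically found with a gradient-based method such as the fast gradient sign method (FGSM). The sketch below, written in PyTorch, shows the core computation; the pretrained classifier, preprocessing, and epsilon value are illustrative placeholders rather than the exact setup used in the original demonstration.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# A pretrained ImageNet classifier stands in for any discriminative vision model.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
preprocess = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor()])
normalize = T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

def fgsm_example(image_path: str, epsilon: float = 2 / 255) -> torch.Tensor:
    """Add a small signed-gradient perturbation (FGSM) that pushes the model
    away from its own prediction while remaining visually imperceptible."""
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    x.requires_grad_(True)
    logits = model(normalize(x))
    label = logits.argmax(dim=1)                # the model's current prediction
    loss = torch.nn.functional.cross_entropy(logits, label)
    loss.backward()
    # Step in the direction that increases the loss, clipped to valid pixel values.
    return (x + epsilon * x.grad.sign()).clamp(0.0, 1.0).detach()
```

Running the perturbed image back through the classifier will often change its top prediction, even though the change is too small for a person to notice.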

“As we [increasingly] use these models, there are more and more opportunities for attacks to creep in,” said Nicholas Mattei, an associate professor of computer science at Tulane University and chair of the ACM Special Interest Group on Artificial Intelligence.

Adversarial visuals were first identified as a vulnerability over 10 years ago. Research to address the issue has explored strategies including additional training and defense systems. However, Alex Robey, a postdoctoral researcher at Carnegie Mellon University, said progress has flattened, with no significant breakthroughs in the last five years. He cited several theories about why the problem is hard to solve, such as the high noise sensitivity of discriminative models, which classify inputs, and the fact that the large language models (LLMs) that power current chatbots have so many different uses that it is hard to control for all of them.

Many researchers are now trying to better understand visual adversarial examples to come up with new ways of tackling them. In a 2024 paper, for example, Ashwinee Panda, a postdoctoral fellow at the University of Maryland, College Park, and colleagues at Princeton University examined the security risks and safety implications of using images to integrate vision into LLMs like those powering generative AI chatbots such as ChatGPT.

Although certain text prompts can trick LLMs, Panda said it is easier to produce visuals that confuse them. “Figuring out what words to give a model is harder than changing a few pixels in an image that you upload to produce some kind of effect,” he said.

Furthermore, when an input consists of both an image and text, vision-integrated LLMs usually process the two together when determining what text to output. An image can therefore set the tone or provide context that affects how a harmful question posed afterwards is interpreted. For example, combining the image of a video game with the query “How can I break into a house?” could create ambiguity and lead the LLM to respond as if the question referred to unlocking a door in a game.

In an experiment for their paper, Panda and his team optimized the pixels of a classic panda image, often used by the research community, to create a visual adversarial example intended to elicit derogatory output related to gender, race, and the human race. They fed the image into MiniGPT-4, InstructBLIP, and LLaVA, whose underlying LLMs had received additional training to avoid generating harmful or biased output, and then posed harmful questions to see whether the adversarial image could bypass the models’ safeguards.
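
To give a sense of what that pixel optimization involves, the following is a simplified sketch of a projected gradient descent loop in PyTorch. The `harmful_target_loss` callable is a hypothetical stand-in for a differentiable score of how unlikely the vision-language model is to produce a set of harmful target responses; the attack in the paper differs in its details, but the overall shape is similar.

```python
import torch
from typing import Callable

def optimize_adversarial_image(
    image: torch.Tensor,                       # pixels in [0, 1], shape (3, H, W)
    harmful_target_loss: Callable[[torch.Tensor], torch.Tensor],  # hypothetical, differentiable
    epsilon: float = 16 / 255,                 # maximum per-pixel change
    step_size: float = 1 / 255,
    steps: int = 500,
) -> torch.Tensor:
    """Projected gradient descent on the image pixels: repeatedly nudge the
    image to lower a loss measuring how unlikely the model is to produce the
    attacker's target (harmful) responses."""
    original = image.clone().detach()
    adv = image.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = harmful_target_loss(adv)
        grad, = torch.autograd.grad(loss, adv)
        with torch.no_grad():
            adv = adv - step_size * grad.sign()                         # descend on the loss
            adv = original + (adv - original).clamp(-epsilon, epsilon)  # stay close to the original
            adv = adv.clamp(0.0, 1.0)                                   # keep valid pixel values
    return adv.detach()
```

The `epsilon` budget controls how visible the change is; a larger budget gives the attacker more room, but makes the perturbation easier to spot.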

The researchers found that their adversarial image was able to jailbreak the aligned LLMs, which are trained to produce safe, ethical output. They were surprised to find that although the image was optimized to increase the probability of harmful content in only a few specific scenarios, it seemed to undermine a model’s safety more broadly, for example increasing the probability of a response when the model was asked for instructions related to murder. It was the first time a visual adversarial example had been shown to modify the intended behavior of an LLM.

“Somehow, the features that the vision encoder [is] passing on to the language model have been done in such a way that it overrides the alignment in the language model,” said Panda.

One reason for this vulnerability, which the team proposed in follow-up work, is that shortcuts are taken when an LLM is aligned, leaving its safety safeguards shallow and easy to circumvent; underneath, the raw LLM would typically generate harmful output after its initial training. Panda thinks part of the problem is that a lot of money, compute, and data are invested in pre-training a model, while there is typically only a tiny budget for alignment.

“Maybe it’s not so surprising that you’re not able to get a model to be safe because, by default, it was not safe,” said Panda.

New ways of protecting AI models against visual adversarial attacks could help make them safer. In recent work, Robey and his colleagues focused on AI classification systems, such as those a video streaming platform might use to group content into categories like “adult” or “safe for children.” They came up with a new framework for adversarial training, in which a model is exposed to visual adversarial examples during additional training, and it seems to improve on previous methods.

A widely used approach to adversarial training, called the zero-sum paradigm, suffers from a problem called robust overfitting: the model initially becomes more resistant to adversarial attacks, but its robustness starts to decline partway through training. The non-zero-sum framework proposed by Robey and his colleagues withstands adversarial attacks about as well as the zero-sum method, yet does not appear to be prone to robust overfitting.

“You get a continual improvement in robustness,” said Robey. “It is exciting.”
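
For readers unfamiliar with the technique, a basic zero-sum adversarial training loop looks roughly like the sketch below, written in PyTorch: an inner attack tries to maximize the classification loss, and the outer update trains the model on the resulting adversarial examples. The model, data loader, and attack settings are generic placeholders; Robey and his colleagues’ non-zero-sum formulation changes the objective the attacker optimizes rather than this overall structure.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, epsilon=8 / 255, step=2 / 255, iters=10):
    """Inner loop: find a perturbation within an epsilon-ball around x that
    maximizes the classification loss (the 'attacker' in the zero-sum game)."""
    adv = (x + torch.empty_like(x).uniform_(-epsilon, epsilon)).clamp(0.0, 1.0)
    for _ in range(iters):
        adv = adv.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(adv), y)
        grad, = torch.autograd.grad(loss, adv)
        with torch.no_grad():
            adv = adv + step * grad.sign()                  # ascend on the loss
            adv = x + (adv - x).clamp(-epsilon, epsilon)    # stay within the budget
            adv = adv.clamp(0.0, 1.0)                       # keep valid pixel values
    return adv.detach()

def adversarial_training_epoch(model, loader, optimizer):
    """Outer loop: train the classifier on the adversarial examples so it
    learns to resist them (the 'defender')."""
    model.train()
    for x, y in loader:
        x_adv = pgd_attack(model, x, y)
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x_adv), y)   # zero-sum: same loss the attacker maximized
        loss.backward()
        optimizer.step()
```

Robust overfitting shows up when adversarial accuracy measured on held-out data starts to drop even as this loop keeps lowering the training loss.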

Although the approach is promising for AI classification models, Robey thinks adversarial training is more challenging for generative AI models, due to their scale and complexity. In addition to the significant cost of the extra training, it often requires hardware that is available only to certain industry partners.

Adversarial attacks are also constantly evolving and becoming more sophisticated. “It continues to be the case that new ideas are needed to defend models against these threats,” said Robey.

To make AI models more robust, another widely used approach is to incorporate several different defense tactics into a system. Called the Swiss cheese model, it posits that each defense can be thought of as a slice of cheese whose holes represent its weaknesses. Since each layer of defense has holes in different places, stacking them means a weakness in one layer is covered by another ‘slice’.

“When you stack these layers of defense together, the hope is that your system becomes robust enough,” said Robey.
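
In software terms, the Swiss cheese idea amounts to chaining independent checks and blocking an input if any of them flags it. The sketch below is purely illustrative; the individual layers shown (a keyword filter and a length check) are hypothetical examples, not defenses any particular vendor uses.

```python
from typing import Callable, List

# Each "slice" is an independent check; it catches some attacks and misses others.
DefenseLayer = Callable[[str], bool]   # returns True if the input looks unsafe

def layered_defense(layers: List[DefenseLayer], user_input: str) -> bool:
    """Swiss-cheese-style composition: block the input if *any* layer flags it,
    so a hole in one layer is covered by the others."""
    return any(layer(user_input) for layer in layers)

# Hypothetical example layers.
def keyword_filter(text: str) -> bool:
    return any(term in text.lower() for term in ("ignore previous instructions", "disable safety"))

def length_check(text: str) -> bool:
    return len(text) > 10_000          # unusually long prompts can hide injected content

blocked = layered_defense([keyword_filter, length_check], "How do I bake bread?")
```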

One of Robey’s big concerns, however, is the safety implications of incorporating AI models into commercial products such as robots or self-driving cars. He said companies do a reasonable amount of safety testing before releasing chatbots such as ChatGPT, Gemini, or Claude, but such testing may not anticipate the types of adversarial attacks that can occur when the models are used in a third-party product.

In recent work, Robey and his colleagues have focused on the use of LLMs to control robots and self-driving cars, and on whether an attack could cause a car to run a red light or a robot to collide with a human, for example. They found it could easily be done, with a high success rate.

“The finding is that the downstream system is much more vulnerable than the standalone technology itself, particularly when you incorporate both vision and language into the input space of these models,” said Robey.

He is now working with people in the self-driving car industry to see whether guardrails, such as filters or application-specific training that makes a model more robust to attacks, can be designed to address such problems. Robey said regulations, along with incentives for model providers and the companies that build their models into products, also would help.

“The hope would be to put legislative pressure on all of these entities toward making it a race to the top for competing over safety standards, just as we have this marketplace right now that competes over better and better capabilities of models,” he said.

Sandrine Ceurstemont is a freelance science writer based in London, U.K.
