The ability to peer into computing devices and spot malware has become nothing less than critical. Every day, in every corner of the world, cybersecurity software from an array of vendors scans systems in search of tiny pieces of code that could do damage—and, in a worst-case scenario, destroy an entire business.
Most of these programs work in a fairly predictable way. They look for code signatures—think of it as the DNA of malware—and when they find a match, they attempt to isolate or delete the malware. In most cases, the approach works, and the software keeps smartphones, personal computers, and networks reasonably secure.
What if cybercriminals could hide pernicious payloads in places where commercial cybersecurity software was unable to detect them? Unfortunately, this approach is both possible and increasingly viable. Over the last few years, researchers have found that it is possible to infect audio and video files, documents, Internet of Things devices, and even deep learning models. Just a few kilobytes of code can fly below the radar of today's malware scanners.
That is both frightening and dangerous. "These techniques represent a substantial risk. They will likely change the way we approach and manage cybersecurity," says Tao Liu, an assistant professor in the Department of Mathematics and Computer Science at Lawrence Technological University in Southfield, MI. Adds Wujie Wen, an assistant professor in the Department of Electrical and Computer Engineering at Lehigh University in Bethlehem, PA, "Hidden malware opens up all sorts of attack methods and vectors."
So far, there has been no indication that cybercriminals have begun implanting hidden malware in files, including deep learning systems used by individuals, businesses, or governments. Yet the time bomb is ticking, researchers say. A growing body of research demonstrates the approach is already viable, and there's almost no way to stop it.
It is no secret that cybersecurity has become far more complex in recent years. Polymorphic malware, which changes its shape and form to evade detection, has become commonplace, while more elaborate and sophisticated types of malware have surfaced. As a result, it is increasingly difficult for conventional malware scanners to detect and halt intrusions, including worms, Trojans, keystroke loggers, and ransomware.
Hidden malware complicates things further. Conventional cybersecurity software simply is not equipped to seek and destroy these payloads. Liu likens hidden malware to a small scratch on a CD or DVD; the fault-tolerant nature of the playback system can ignore the problem and continue streaming a song or movie without interruption. Likewise, a limited number of malevolent pixels in a photo or video are not visible to the naked eye when a video is playing. Our brains are unable to spot the proverbial needle in the haystack.
In other words, the attack method works because not all of the data contained in a file or machine learning (ML) model is necessary for it to execute and function correctly. Simply put, it is possible to replace some of the binary data with malware or a hidden message. An attacker can implant the malware manually or use software, including AI, to make it even more difficult to detect. "It represents a highly effective way to embed malware in legitimate files, including multimedia files, and avoid detection," says Raj Chaganti, a cybersecurity engineer at Toyota Research Institute who has studied the space.
Not surprisingly, the problem escalates rapidly as the size of the model increases. For example, in 2021 a group of researchers from China reported that 36.9 megabytes of malware can be embedded in a 178-megabyte convolutional neural-network model with only 1% accuracy loss. This is below the threshold that activates a typical antivirus engine, points out Zhi Wang, a researcher in the Institute of Information Engineering at the Chinese Academy of Sciences and a co-author of the academic paper EvilModel: Hiding Malware Inside of Neural Network Models.
Using several technical methods and applying them to several models, including AlexNet, VGG, ResNet, Inception, and MobileNet, the researchers found ways to deconstruct malware into smaller chunks (in some cases as small as 3 bytes) and then construct multiple layers of artificial neurons. These, in turn, could be interconnected across the deep learning model so that the malicious payload would operate stealthily and have no appreciable impact on the model's performance. Ultimately, "The ratio of the embedded malware can be half of the model size with no model performance loss," Wang explains.
The technique works because there are often millions of parameters scattered across an ML model. Frameworks such as PyTorch and TensorFlow, for instance, rely on 4-byte (32-bit) floating-point numbers to store parameter values. The highly distributed threat simply flies below anyone's radar. In fact, Wang and his fellow researchers engineered malware-infected .zip files that could evade 58 different antivirus engines. This makes it possible to implant botnets, keystroke loggers, ransomware, and other malware, and have them communicate with a remote server.
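To make the point about 4-byte parameters concrete, the following is a minimal sketch, not the EvilModel authors' method. It assumes NumPy, a random array standing in for a model layer, and a harmless text string as the hidden data; the function names `overwrite_low_bytes` and `read_low_bytes` are hypothetical. It overwrites only the two lowest-order bytes of each 32-bit value (a more conservative choice than the 3-byte chunks mentioned above) and measures how little the stored numbers move.

```python
import numpy as np

def overwrite_low_bytes(weights, data):
    """Return a float32 copy of `weights` whose two lowest-order bytes carry `data`."""
    flat = np.array(weights, dtype="<f4").ravel()   # private little-endian copy
    raw = flat.view(np.uint8).reshape(-1, 4)        # one row of 4 bytes per parameter
    payload = np.frombuffer(data, dtype=np.uint8)
    if payload.size > 2 * flat.size:
        raise ValueError("data does not fit in the low-order bytes")
    # Pad to an even length so the bytes split cleanly into 2-byte chunks.
    padded = np.zeros(2 * ((payload.size + 1) // 2), dtype=np.uint8)
    padded[:payload.size] = payload
    # Columns 0 and 1 hold the least significant mantissa bits.
    raw[: padded.size // 2, :2] = padded.reshape(-1, 2)
    return flat.reshape(np.shape(weights))

def read_low_bytes(weights, length):
    """Read `length` bytes back out of the low-order byte positions."""
    raw = np.ascontiguousarray(weights, dtype="<f4").ravel().view(np.uint8).reshape(-1, 4)
    return raw[:, :2].reshape(-1)[:length].tobytes()

# Illustrative data only: random values stand in for a trained layer.
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.05, size=(64, 64)).astype(np.float32)
message = b"harmless placeholder text"

carrier = overwrite_low_bytes(weights, message)
relative_change = np.abs(carrier.astype(np.float64) - weights) / np.abs(weights)

print(read_low_bytes(carrier, len(message)))
print(relative_change.max())
```

Because the two overwritten bytes hold only the 16 least significant mantissa bits, each affected value shifts by less than 1 part in 128, which suggests why a model's accuracy can survive this kind of tampering and why a scanner looking for contiguous signatures sees nothing unusual.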
As deep neural networks (DNNs) become the de facto technique for deploying AI, the stakes are growing. A separate group, including Liu and Wen, also studied the problem in 2020. They too found that it is possible to exploit DNNs using a type of self-contained malware called stegomalware. They noted that several payload injection techniques are possible; the high self-repair capability and complexity of ML models make them a target for hiding malware. In addition, the group found there were numerous ways to trigger an attack from the model through diverse physical-world events. These include techniques such as Logits Triggers, Rank Triggers, and Fine-Tune Triggers.
Regardless of the exact approach, the potential fallout is alarming—and it does not bode well for cybersecurity. "The machine learning or AI model essentially becomes a matrix that stores the malware," Liu explains. In fact, it is possible to hide large amounts of destructive data, dynamically change the structure of the embedded code, and use polymorphic techniques to make the malware virtually undetectable and untraceable using current methods. Once inside a system, the payload could spawn additional malware.
The growing complexity of AI models, particularly when combined with open-source components and code-hosting platforms such as GitHub, ratchets up the risks further. For example, researchers warn that supply-chain, value-chain, or third-party attacks are possible by front-loading malware into automatic updates, in much the same way the SolarWinds attack occurred. "It's possible to insert malware into the model during the development process. There's a possibility that malware could be stored in hardware, memory, or IoT components. This means that it's incredibly difficult to ensure that any CNN or DNN model is totally secure," Liu explains.
The concept is also deeply unsettling because the computing resources and tools required to infect files and ML models are increasingly cheap and available, Chaganti points out. "Today's cloud computing resources are widely available at a minimal cost-point. There are few barriers to entry for cybercriminals." In addition, the relatively immature state of Internet of Things (IoT) security, along with the introduction of distributed AI frameworks such as decentralized AI (which handles part of the data and AI processing load on numerous systems), raises the risks even further.
Identifying and stamping out malware hidden deep in files and ML models will not be easy. It will require new thinking and entirely different techniques. "Conventional approaches to cybersecurity that rely on identifying signatures and matching them with known malware won't stand up to these methods," Chaganti says. "Right now, there are a limited number of ways to address this problem."
One method that could reduce risk is a verification framework for audio and video files, Liu says. Those who generate these files (news organizations, media companies, open-source libraries, and others) could certify that a file is authentic and free of malware before it is released. Such a system would work like a digital watermark. Moreover, additional software on a user's system could verify the authenticity, and safety, of the file before a PC or smartphone executes it. This approach would offer the added bonus of making it more difficult to spread false videos and deepfakes on social media and elsewhere.
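A verification step of this kind could look roughly like the sketch below. It is an assumption-laden illustration, not a framework described by Liu: it uses Python's standard `hmac` and `hashlib` modules with a shared secret for brevity, whereas a real deployment would more plausibly rely on public-key signatures. The names `publish_tag` and `is_authentic`, the key, and the sample bytes are all hypothetical.

```python
import hashlib
import hmac

def publish_tag(file_bytes, signing_key):
    """Publisher side: compute a keyed SHA-256 tag to distribute with the file."""
    return hmac.new(signing_key, file_bytes, hashlib.sha256).hexdigest()

def is_authentic(file_bytes, published_tag, signing_key):
    """User side: recompute the tag and compare it in constant time."""
    expected = hmac.new(signing_key, file_bytes, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, published_tag)

# Illustrative values only; a real key would never be hard-coded.
signing_key = b"hypothetical-shared-secret"
original = b"bytes of a released video file"
tag = publish_tag(original, signing_key)

# Flipping a single byte after release invalidates the tag.
tampered = original[:-1] + b"X"

print(is_authentic(original, tag, signing_key))
print(is_authentic(tampered, tag, signing_key))
```

A check like this only shows that a file has not changed since it was certified. It says nothing about whether the certified original was clean in the first place.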
This would not necessarily fix the problem of manipulated and infected ML models, however. It also could not stop malware that was inserted into software or systems prior to the software certification process. The EvilModel researchers pointed out that another countermeasure is to destroy the embedded malware by retraining and fine-tuning a downloaded ML model. Yet, this comes with limitations too. For instance, non-technical users are not likely to have the tools to manage the cleansing process.
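The intuition behind the retraining countermeasure can be sketched as follows. This is a simplified illustration under stated assumptions, not the researchers' procedure: NumPy random values stand in for a model layer, a single random update stands in for a fine-tuning step, and the helper `low_bytes` is hypothetical. The sketch checks how many parameters keep the same low-order bytes after the update, since any data stored in those bytes would need them to stay intact.

```python
import numpy as np

def low_bytes(weights):
    """Return the two lowest-order bytes of each float32 parameter."""
    raw = np.ascontiguousarray(weights, dtype="<f4").ravel().view(np.uint8).reshape(-1, 4)
    return raw[:, :2].copy()

rng = np.random.default_rng(1)
weights = rng.normal(0.0, 0.05, size=4096).astype(np.float32)
before = low_bytes(weights)

# Stand-in for one fine-tuning step: a small gradient-style update.
gradient = rng.normal(0.0, 1.0, size=weights.shape).astype(np.float32)
updated = weights - np.float32(1e-3) * gradient
after = low_bytes(updated)

# Fraction of parameters whose low-order bytes no longer match.
changed = np.any(before != after, axis=1).mean()
print(changed)
```

Even a tiny update rewrites the least significant bits of nearly every parameter, which is consistent with the idea that fine-tuning scrambles anything hidden there.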
Still another possibility is more advanced detection and eradication through AI. Already, some cybersecurity software products can analyze power consumption and packet behavior in systems—and even packet patterns across a network—and either flag or thwart questionable activity. For example, if ransomware begins encrypting files, it's possible to detect an immediate yet slight uptick in power consumption. "If you have a good sample of what constitutes baseline and normal network behavior, you can detect minute changes caused by the malware," Chaganti explains.
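The baseline comparison Chaganti describes can be illustrated with a very simple statistical check. The sketch below is an assumption, not a description of any commercial product: it uses Python's standard `statistics` module, made-up power readings, and a hypothetical `is_anomalous` helper that flags a reading more than a chosen number of standard deviations from the baseline mean.

```python
from statistics import mean, stdev

def is_anomalous(baseline, observation, threshold=3.0):
    """Flag an observation that deviates from the baseline mean by more than
    `threshold` sample standard deviations."""
    center = mean(baseline)
    spread = stdev(baseline)
    return abs(observation - center) > threshold * spread

# Hypothetical power readings (watts) collected during normal operation.
baseline_watts = [50.1, 49.8, 50.3, 50.0, 49.9, 50.2]

print(is_anomalous(baseline_watts, 50.2))  # within the normal range
print(is_anomalous(baseline_watts, 50.9))  # slight but unusual uptick
```

Production systems would likely model many signals at once and learn the baseline continuously, but the principle is the same: the better the picture of normal behavior, the smaller the deviation that can be caught.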
Other ideas also exist. For instance, Wen says cybersecurity software might be made to incorporate an AI kill-chain that can detect when malware is being assembled, and shut the process down. To achieve this level of traceability, a deeper understanding of what constitutes "normal" behavior is necessary. This may require a combination of databases and machine learning to understand malware behavioral patterns at a deeper level. "There are certain types of behaviors that are detectable. It's a matter of identifying them," he says.
Fortunately, the problem has not yet leaped from the research lab to the wild. Experts, however, say that hidden malware in files and ML models is inevitable, and the technique will almost certainly appear within the next few years. Making matters worse, the technique makes it possible to attack and alter functionality in ML models used for various purposes and across numerous industries and situations. "Hidden malware and AI model manipulation are things that we must begin to examine and act on," Wen concludes. "The risk is significant."
Wang, Z., Liu, C., and Cui, X.
EvilModel: Hiding Malware Inside of Neural Network Models, 2021 IEEE Symposium on Computers and Communications (ISCC), September 5–8, 2021. https://ieeexplore.ieee.org/abstract/document/9631425
Liu, T., Liu, Z., Liu, Q., Wen, W., Xu, W., and Li, M.
StegoNet: Turn Deep Neural Network into a Stegomalware. ACSAC '20: Annual Computer Security Applications Conference, December 2020, Pages 928–938. https://dl.acm.org/doi/10.1145/3427228.3427268
Chaganti, R., Vinayakumar, R., Alazab, M., and Pham, T.D.
Stegomalware: A Systematic Survey of Malware Hiding and Detection in Images, Machine Learning Models and Research Challenges, Cornell University, October 6, 2021. https://arxiv.org/abs/2110.02504
O'Kane, P., Sezer, S., and McLaughlin, K.
Obfuscation: The Hidden Malware, IEEE Security & Privacy, Volume: 9, Issue: 5, Sept.-Oct. 2011, Pages 41–47. https://ieeexplore.ieee.org/abstract/document/5975134
©2022 ACM 0001-0782/22/10