Error-prone large language models (LLMs) should not be allowed to make decisions based on sensor data fed to them from smartphones, medical devices, and the wider Internet of Things, experts in machine intelligence are warning. They say emerging research on a concept called “penetrative AI,” designed to allow LLMs to extend their reach from their Web-scraped learning datasets into probing, and acting upon, the data that people generate on their own devices, holds a raft of dangers.
At issue is the fact that, despite their propensity for serving up the wrong answers to users’ questions, and the outstanding lawsuits over where many of them obtained their copyrighted training data, LLMs nevertheless have managed to find a broad array of applications, spanning translation, transcription, story creation, copywriting, and the generation of program code, as examples.
For some AI researchers, these use cases are not nearly enough – and some now want to extend the range of applications that LLMs undertake well beyond these prompt-driven, text-based operations.
The notion of LLMs tapping into live data from the physical world was unveiled in late February, at what is, rather quietly, often one of the most fascinating tech events on the calendar: the ACM HotMobile 2024 conference in San Diego, CA, an event that welcomes research on “new directions or controversial approaches” to mobile-related technologies.
One research project revealed there, led by Huatao Xu of Nanyang Technological University in Singapore, working with Mani Srivastava at the University of California at Los Angeles and Mo Li at Hong Kong University, certainly fit the event’s billing for something different. Their paper’s title: “Penetrative AI: Making LLMs Comprehend the Physical World.”
Says Xu, “The primary goal of our penetrative AI initiative is to explore and extend the capabilities of LLMs, such as ChatGPT, beyond the confines of digital text-based tasks. By integrating real-world data from physical signals into an LLM, we aim to bridge the gap between digital knowledge and physical reality.”
This, he adds, will for the first time allow deep learning models to “directly interact with, interpret, and respond to the data sensed from the real physical world around us.”
However, given their aforementioned inclination toward gross errors and hallucinations (thought to stem in part from fictional works in their training datasets), can LLMs actually take in sensor data, reason over it using a smart, detailed prompt and the world knowledge embedded in them during training, and act as a deterministic analyst of incoming signals, producing output that is reliable, repeatable, and therefore as safe as possible?
To see if it could, the researchers undertook two experiments. First, they decided to try tracking human activity by feeding two ChatGPT variants (ChatGPT-3.5 and ChatGPT-4) data from a user’s smartphone sensors. Then, they attempted human heartbeat detection and monitoring using the same pair of LLMs.
To track user activity, the team used a Samsung Galaxy 8 Android smartphone, with the aim of monitoring (a) triaxial accelerometer data, to determine how the user was moving their body (running, walking, sitting, or lying down, for example); (b) the phone’s Wi-Fi access point (AP) identifiers and signal strengths, to establish at a granular level where the user was within a built environment; and (c) Android’s count of accessible GPS satellites, to work out whether the user was indoors (zero satellites accessible) or, if outdoors, where.
Yet because LLMs accept only text as input, a pre-processing step had to be written into the phone’s software to convert the numerical sensor readings into words ChatGPT could process. A data point might become, for instance, text that reads “step count: 10/min” or “Total Wi-Fi APs scanned: 21.” An LLM prompt, replete with expert knowledge and reasoning guidelines, could then use those readings to assess a user’s motion and likely surroundings.
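The conversion step might look something like the minimal Python sketch below. The field names, example values, rule-of-thumb hints, and prompt wording are illustrative assumptions, not the researchers’ actual pipeline; the resulting string is what would be sent to the LLM.

```python
# Hypothetical sketch: turning numerical smartphone sensor readings into a
# text prompt an LLM can reason over. Field names and wording are assumptions.

def sensor_readings_to_text(readings: dict) -> str:
    """Convert numerical sensor readings into plain-English lines."""
    return "\n".join([
        f"step count: {readings['steps_per_min']}/min",
        f"Total Wi-Fi APs scanned: {readings['wifi_ap_count']}",
        f"strongest Wi-Fi AP signal: {readings['wifi_best_rssi']} dBm",
        f"GPS satellites visible: {readings['gps_satellites']}",
    ])

def build_prompt(readings: dict) -> str:
    """Wrap the textual readings in a prompt that supplies reasoning guidelines."""
    return (
        "You are analyzing smartphone sensor data.\n"
        "Rules of thumb: zero visible GPS satellites usually means indoors; "
        "a high step count suggests walking or running; many scanned Wi-Fi APs "
        "suggest a dense built environment.\n\n"
        f"Sensor readings:\n{sensor_readings_to_text(readings)}\n\n"
        "State the user's likely motion (e.g., walking, running, still) and "
        "whether they are indoors or outdoors, with a one-line reason."
    )

if __name__ == "__main__":
    example = {
        "steps_per_min": 10,
        "wifi_ap_count": 21,
        "wifi_best_rssi": -48,
        "gps_satellites": 0,
    }
    print(build_prompt(example))  # This string would be sent to the LLM.
```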
Did it work? The answer is a qualified ‘yes’. The researchers say they proved LLMs can be “highly effective” at such an activity-monitoring task, but there were significant caveats. With ChatGPT-3.5, they found the system sometimes got confused, outputting unknown states for motion detection, while indoor and outdoor environment classification fared better with the ChatGPT-4 variant, “achieving above 90% accuracy with the best prompt.”
“We’ve shown LLMs like ChatGPT can analyze data collected from your phone’s sensors—things like accelerometer, satellite, and Wi-Fi signals around you—to figure out your activity and where you are,” says Xu.
For the medical device, the team decided to try skipping the conversion of sensor data to text, to see if penetrative AI could be performed with greater simplicity. They injected a filtered version of the device’s electrocardiogram (ECG) signal directly into the LLM, to see whether it could work out the subject’s pulse rate itself by identifying and counting certain waveform maxima (R-peaks), as a natural-language prompt directed it to do.
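A rough idea of how a filtered ECG segment could be serialized straight into such a prompt is sketched below. The sampling rate, rounding, synthetic test signal, and prompt wording are assumptions for illustration, not the paper’s exact setup.

```python
# Hypothetical sketch: placing digitized ECG samples directly into an LLM
# prompt and asking it to count R-peaks. Rates and wording are assumptions.

import numpy as np

SAMPLE_RATE_HZ = 50  # assumed downsampled rate, to keep the prompt short

def ecg_to_prompt(filtered_ecg: np.ndarray) -> str:
    """Serialize a filtered ECG segment and ask the LLM to count R-peaks."""
    samples = " ".join(f"{v:.1f}" for v in filtered_ecg)  # 1 decimal saves tokens
    duration_s = len(filtered_ecg) / SAMPLE_RATE_HZ
    return (
        f"The numbers below are a filtered ECG signal sampled at {SAMPLE_RATE_HZ} Hz "
        f"over {duration_s:.0f} seconds. R-peaks are the sharp local maxima that "
        "stand well above the surrounding samples.\n\n"
        f"Signal: {samples}\n\n"
        "List the sample indices of the R-peaks, count them, and estimate the "
        "heart rate in beats per minute (peaks / duration * 60)."
    )

if __name__ == "__main__":
    # Synthetic 10-second segment: gentle baseline with one spike per second (~60 bpm).
    t = np.arange(10 * SAMPLE_RATE_HZ)
    ecg = 0.05 * np.sin(2 * np.pi * t / SAMPLE_RATE_HZ)
    ecg[::SAMPLE_RATE_HZ] += 1.0  # one R-peak-like spike per second
    print(ecg_to_prompt(ecg)[:400], "...")  # prompt text that would go to the LLM
```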
The results were mixed, and LLM-dependent: “While ChatGPT-3.5 often generates prolonged sequences of R-peaks, resulting in substantial errors, ChatGPT-4 conversely demonstrated enhanced stability and precision in identifying R-peaks in the majority of cases,” the researchers reported in their paper.
“It’s fascinating because it shows that these models can not just read and understand text, but can also make sense of more complex data, like the heartbeat patterns recorded in an ECG,” says Xu.
He adds, “A future doctor may easily have dialogue with an LLM-based agent to extract important medical features from raw medical readings, rather than engaging heavy signal processing systems or machine learning models trained with data from millions of patients.”
So even though the researchers concede they did not achieve perfect accuracy in their human activity and heartbeat analyses with LLMs, they believe the “surprisingly encouraging” results they achieved “present an enticing opportunity to leverage LLMs’ world knowledge” in future ChatGPT-4 penetrative AI applications.
That was the case not only in the domestic Internet-of-Things home automation and health applications they have attempted, but also, when LLM error-avoidance mechanisms are better understood, in much more complex industrial “cyber-physical” systems, where Xu believes penetrative AI’s “capacity to learn and adapt to complex industrial environments” could “reduce downtime and improve safety, heralding a new era of manufacturing efficiency.”
However, such optimism for penetrative AI’s prospects is not shared by AI safety and ethics specialists.
“LLMs as we know them now should not be used without humans in the loop,” says Gary Marcus, a former machine learning entrepreneur and emeritus professor in cognitive neuroscience at New York University. “They continue to make things up, or ‘hallucinate’, and are known to be unreliable, particularly when confronted with unexpected circumstances. Lab data that shows LLMs working under one set of normal circumstances doesn’t guarantee that they would work in the full range of real-world circumstances.”
Kartik Talamadupula, Applied AI Officer of ACM’s Special Interest Group on AI (SIGAI), and director of AI research at Symbl.ai in Seattle, WA, which builds generative AI that attempts to understand human conversations, says of the penetrative AI work: “I think such efforts are misguided at best, and even dangerous, for two main reasons.
“The first is the Clever Hans effect: that is, only the results that make sense to the authors are presented in support of whatever conclusion they seek to support. To my knowledge, there is no systematic study or analysis of how many times the LLM generated such responses versus nonsensical responses. My guess is it would not be a very favorable ratio.
“The second reason that I do not find this study or the report very convincing is the lack of discussion about the kind of overlap between the data that the authors provide to the model (ChatGPT) and the data used to train that model.”
Talamadupula says there are “very likely various subjective biases and gaps in the kinds of ‘sensor data’ that the model has been trained on. These are only more likely to produce hallucinations as the model tries to respond to the input prompt. And in a physically deployed setting such as this, hallucinations can be even more dangerous than in mere chatbots.”
On the societal risks front, Joanna Bryson, a professor of ethics and technology at the Hertie School of Governance in Berlin, Germany, says penetrative AI’s attempts to use error-prone LLMs to interpret sensor data could lead to biases, unfairness, and unjust, discriminatory outcomes. “For someone to replace humans with a similar system, it would have to be both more resilient and at least as innovative,” Bryson says, and going on current LLM performance, that’s “very unlikely” to be the case, she adds.
“It sounds like something that would just absolutely break justice.”
In the computer security arena, however, Bruce Schneier, a cryptographer and electronic privacy engineer at the Berkman Klein Center for Internet & Society at Harvard University, and a board member at the Electronic Frontier Foundation, thinks it’s a bit too early to judge penetrative AI just yet.
“There’s always a worry when you have a new automatic thing, but don’t forget people get a lot wrong, too. Like everything else in tech, it’ll be mediocre to start, and it’ll get better over the years. And eventually it’ll be good. Would I hook it up to a police drone today? Probably not. But they’re not doing that. They’re basically testing, and it seems like a really clever, interesting idea,” Schneier says.
“In any case, you can’t just put an LLM in front of a heart monitor and give it to a patient. There’s going to be years of certification and testing, in enormous FDA approval processes, before anything like that can happen.”
Paul Marks is a technology journalist, writer, and editor based in London, U.K.