News
Artificial Intelligence and Machine Learning

Multimodal AI Connects the Digital Dots

Illustration of dots suggesting a diverse crowd and media. Credit: Getty Images

Advances in artificial intelligence (AI) have arrived at breathtaking speed over the last few years. Computer vision has come into focus, robotics has marched forward, and generative AI has moved squarely into the mainstream of society.

Yet, for all the progress, an unsettling truth stands out: most of today’s AI frameworks remain relatively disconnected. They function as separate islands of AI automation, lacking key capabilities because they cannot share data and operate synergistically with other AI components.

Engineers, researchers, and others are taking note. They are actively exploring ways to construct advanced multimodal systems. By connecting separate AI components and data streams, it is possible to build smarter systems that more closely align with how humans perceive and interact with the world.

“The goal is to tap a variety of systems and data sources to enable more advanced functionality,” says Martial Hebert, professor and dean of the School of Computer Science at Carnegie Mellon University.

Adds Yoon Kim, an assistant professor in the electrical engineering and computer science department at the Massachusetts Institute of Technology (MIT), “Humans are already multimodal. Our ability to build embodied AI models that can see, hear, understand language, and handle other sensory tasks is crucial for developing far more sophisticated technology.”

Getting to Smarter AI

What makes multimodal AI so alluring is the ability to tap complementary but detached data channels, combine and decipher them, and spot signals and patterns that would otherwise fly under the radar. Legacy databases, large language models (LLMs), IoT sensors, software applications, and various devices can all serve as fuel for multimodal AI.

For instance, a service robot that incorporates multimodal AI can process images, sound, touch, and other sensory inputs in a more human-like way and respond accordingly. A medical diagnostics app can combine images, clinical text, and other data en route to a more accurate outcome. Multimodal AI also makes it possible for people, including doctors, lawyers, scientists, and business analysts, to converse and interact with data more intuitively through an app.
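
As a rough illustration of how such a diagnostics app might combine modalities, the sketch below uses a simple “late fusion” design: an image embedding and a clinical-text embedding, produced by separate encoders, are projected into a shared space and concatenated before a small classifier makes a prediction. The module names and dimensions are illustrative assumptions, not any vendor’s actual pipeline.

```python
# A minimal late-fusion sketch: two modality embeddings are projected into a
# shared space, concatenated, and classified. Sizes are illustrative only.
import torch
import torch.nn as nn

class FusionDiagnostics(nn.Module):
    def __init__(self, img_dim=512, txt_dim=768, hidden=256, n_classes=2):
        super().__init__()
        # Stand-ins for features from pretrained image and clinical-text encoders.
        self.project_img = nn.Linear(img_dim, hidden)
        self.project_txt = nn.Linear(txt_dim, hidden)
        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Linear(hidden * 2, n_classes),  # the two modalities, concatenated
        )

    def forward(self, img_embedding, txt_embedding):
        fused = torch.cat(
            [self.project_img(img_embedding), self.project_txt(txt_embedding)],
            dim=-1,
        )
        return self.classifier(fused)

# Usage with random stand-in embeddings:
model = FusionDiagnostics()
logits = model(torch.randn(1, 512), torch.randn(1, 768))
```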

Multimodal AI is advancing rapidly. OpenAI’s GPT-4-based ChatGPT recently gained the ability to see, hear, and speak: users of the widely deployed generative AI system can upload images directly and receive spoken or written replies. Individuals with vision impairments or other disabilities can use an iOS and Android app called Be My Eyes to better navigate their surroundings.

Meanwhile, Microsoft is integrating its Copilot framework across the company’s broad set of tools and applications, including business intelligence and data analytics. Copilot is designed to build presentations on the fly, provide quick summaries of topics, coordinate scheduling and other administrative tasks, and use generative AI with voice or text to produce and share text, audio, images, and video across various apps.

Access to enormous volumes of data in the cloud is fueling multimodal AI, says Hoifung Poon, general manager for Microsoft Health Futures. Today, Poon says, “Large swaths of digitalized human knowledge and data can be easily collected and used to train large multimodal models (LMMs) for a wide range of applications.” The common denominator is text, “which captures the bulk of human knowledge and can potentially serve as ‘the interlingua’ of all modalities,” he says.
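
Poon’s “interlingua” point can be pictured with a toy pipeline: each modality is first summarized as text, and the fragments are merged into a single prompt a language model could reason over. Both converter functions below are hypothetical placeholders for real captioning and speech-recognition models.

```python
# A toy sketch of the "text as interlingua" idea: every modality is first
# summarized as text, then merged into one prompt for a language model.
def describe_image(image_path: str) -> str:
    # In a real system this would call an image-captioning model.
    return f"Image {image_path}: a delivery robot stopped at a crosswalk."

def transcribe_audio(audio_path: str) -> str:
    # In a real system this would call a speech-recognition model.
    return f"Audio {audio_path}: 'The package goes to the side entrance.'"

def build_prompt(image_path: str, audio_path: str, question: str) -> str:
    # Text is the common denominator that ties the modalities together.
    return "\n".join([describe_image(image_path),
                      transcribe_audio(audio_path),
                      f"Question: {question}"])

print(build_prompt("frame_0042.png", "clip_0007.wav",
                   "Where should the robot deliver the package?"))
```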

The Transformer deep learning architecture has pushed multimodal AI forward. It excels at spotting patterns and relationships mathematically across all modalities, from text and speech to images and molecules. This, coupled with the fact that it scales computation efficiently on GPUs, has allowed the Transformer to supersede convolutional and recurrent neural networks (CNNs and RNNs). Today, large Transformer-based models can develop an understanding of content and conduct reasoning and conversations.
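
A minimal sketch of why the architecture suits multimodal work: once image patches and text tokens are projected into the same embedding space, a single Transformer encoder can attend across both in one sequence. The dimensions and toy vocabulary below are illustrative assumptions.

```python
# One encoder, one attention mechanism, two modalities in a shared sequence.
import torch
import torch.nn as nn

d_model = 256
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=2,
)

# Hypothetical inputs: 16 image patches (768-dim features each) and
# 10 text tokens drawn from a 1,000-word toy vocabulary.
patch_proj = nn.Linear(768, d_model)       # image patches -> shared space
token_embed = nn.Embedding(1000, d_model)  # text tokens   -> shared space

patches = patch_proj(torch.randn(1, 16, 768))
tokens = token_embed(torch.randint(0, 1000, (1, 10)))

# Concatenate both modalities and let self-attention relate them.
fused = encoder(torch.cat([patches, tokens], dim=1))  # shape: (1, 26, 256)
```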

“In practical terms, this means that multimodal AI systems are far better equipped to juggle different forms of output, such as text, images and audio,” Kim says. He predicts multimodal systems will advance by an order of magnitude over the next few years to include the ability to answer arbitrary and somewhat abstract questions, generate complex images and presentations, and support advanced sensing and control systems for machines, such as robots.

Hard Coding Progress

Despite enormous progress in the field, engineering highly advanced multimodal systems will require further breakthroughs. For now, one obstacle is low-quality or poorly curated training datasets that deliver fuzzy, biased, or sometimes wildly inaccurate results. A system could misinterpret tone or intonation, for example. In a worst-case scenario, it could lead to an incorrect medical diagnosis or an autonomous vehicle that misreads critical data.

Linking and unifying separate AI systems will require fundamental changes to software. “It’s important to design frameworks that allow models to interact with multiple modalities in a coherent way,” Kim says. This includes bridging models and data trained on different modalities so that they can be “combined” to become multimodal models. These models must be capable of generating snippets of software code that can be executed to affect the real world.
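
One common way to “combine” models trained on different modalities, sketched below under assumed dimensions, is to keep a vision encoder and a language model frozen and train only a small adapter that maps image features into the language model’s token-embedding space. The class and parameter names here are hypothetical, not any framework’s actual API.

```python
# A sketch of bridging separately trained models: a trainable adapter turns
# frozen vision features into a short "prefix" of pseudo-tokens that a frozen
# language model can consume alongside ordinary text embeddings.
import torch
import torch.nn as nn

class VisionToLLMAdapter(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096, n_prefix_tokens=8):
        super().__init__()
        self.llm_dim = llm_dim
        self.n_prefix_tokens = n_prefix_tokens
        # Projects one image feature vector to n_prefix_tokens LLM-sized vectors.
        self.proj = nn.Linear(vision_dim, llm_dim * n_prefix_tokens)

    def forward(self, image_features):            # (batch, vision_dim)
        prefix = self.proj(image_features)
        return prefix.view(-1, self.n_prefix_tokens, self.llm_dim)

adapter = VisionToLLMAdapter()
image_tokens = adapter(torch.randn(2, 1024))      # (2, 8, 4096), ready to be
                                                  # prepended to text embeddings
```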

As a result, researchers are now exploring ways to develop sophisticated orchestration frameworks, such as Microsoft’s AutoGen, to address the challenge. For example, AutoGen is designed to manage intermodal communications and interactions—including actions that take place across virtual software agents that tie into physical components in robots, autonomous vehicles, and other machines.
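
The snippet below is not AutoGen’s actual API; it is a stripped-down illustration of the orchestration pattern such frameworks implement, in which a central hub routes each incoming observation to the software agent registered for that modality. The agent names and handlers are hypothetical.

```python
# A toy orchestrator: agents register per modality, and the hub dispatches
# messages to the right one and collects the replies.
from typing import Callable, Dict

class Orchestrator:
    def __init__(self):
        self.agents: Dict[str, Callable[[str], str]] = {}

    def register(self, modality: str, handler: Callable[[str], str]) -> None:
        self.agents[modality] = handler

    def dispatch(self, modality: str, message: str) -> str:
        if modality not in self.agents:
            raise ValueError(f"no agent registered for modality: {modality}")
        return self.agents[modality](message)

# Hypothetical agents for a service robot:
hub = Orchestrator()
hub.register("vision", lambda msg: f"vision agent parsed: {msg}")
hub.register("speech", lambda msg: f"speech agent transcribed: {msg}")

print(hub.dispatch("vision", "camera frame 0042"))
print(hub.dispatch("speech", "audio clip 0007"))
```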

Yet, even with orchestration tools in place, experts say advanced multimodal systems may require a human to manually supervise data, relabel it, and more directly oversee discrete processes. In fact, some wonder whether fully automated multimodal AI is out of reach, at least for the foreseeable future. “Without the right controls over multiple sources and streams of data, things can go very wrong,” Hebert warns.

Conflicting data or objectives can completely undermine multimodal AI, says Poon, who is actively researching self-verification methods for generative AI. For example, he says, “Teaching LLMs to avoid potentially harmful behaviors can result in a so-called ‘alignment tax’ that diminishes overall performance.” Likewise, combining data from different sources can lead to “batch effects” or confounders that distort findings and undermine outcomes, he adds.

Getting to a broad and highly synchronized multimodal AI framework will be difficult, though Hebert and others believe it is possible. “Data accuracy and availability isn’t a big problem within a single AI channel,” he explains. “But coordinating multiple channels and data streams—particularly when touch, speech, text, and vision must work harmoniously in real time—can be extraordinarily difficult.”

Samuel Greengard is an author and journalist based in West Linn, OR, USA.
