Multimodal AI

1.1.2 Sub-Field

AI research has traditionally focussed on solving discrete problems that involve a single type of data, such as images, text or audio. This has led to superhuman capabilities in narrow areas like object recognition, speech recognition and game-playing. But humans and animals use multiple senses to navigate the world around them, and there is growing recognition that for AI to become more flexible it will need to work with multiple data modalities at once.

Future Horizons:


5-year horizon

The knowledge industry is entirely disrupted

Multimodal AI works with both text and image data to automate a wide range of tasks in knowledge industry jobs. “Generative” AI can now produce art, images and long-form videos indistinguishable from human-made ones, disrupting the creative industries and stoking fears about misinformation. Limited data availability stymies efforts to expand into new modalities, prompting growing focus on using AI models to create synthetic data to train on.

10-year horizon

AI works with more modalities

More efficient algorithms and a concerted effort to collect data expand the modalities that AI can work with. This leads to breakthroughs in precision medicine. It also helps AI systems to develop deeper knowledge of the world around them, and a grasp of physical concepts and social dynamics. This allows AI to work more seamlessly and safely alongside humans, boosting the use of the technology in less structured settings like retail, care and education. However, the expanded modalities also create the possibility for emergent characteristics such as in-context learning and episodic memory to develop, opening up a path to artificial general intelligence.

25-year horizon

AI understands the world through multiple data streams

What happens on this timescale will have a sensitive dependence on the outcomes of the next few years of AI development. However, we can predict that more general advanced AI systems will use multiple data streams to understand the world around them, in much the same way as humans. These are not limited to the five senses: at 25 years, specialist AI systems use a different set of modalities, depending on their task. Multimodal deep learning also accelerates scientific inquiry by allowing the simultaneous analysis of vastly different kinds of data.

Multimodal AI is already used in autonomous vehicles to fuse input from cameras, radar and LIDAR, and shows promise in healthcare, where a wide range of physiological signals need to be considered.19,20 More recently, large language models repurposed to work with multiple data modalities have shown considerable promise. Photorealistic imagery and video clips can now be generated from simple text descriptions, the latest AI-powered chatbots can perform complex image analysis, and robots can combine visual input, motion data and natural-language commands to carry out complex tasks.21,22,23
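The sensor fusion described above can be sketched in miniature. The code below is a hypothetical illustration, not any vehicle's actual pipeline: each modality is mapped by its own encoder into a shared embedding space, and the embeddings are concatenated ("late fusion") for a downstream head to consume. The encoders here are stubbed as random projections; in a real system they would be trained networks (e.g. a CNN for camera frames, a point-cloud model for LIDAR).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality encoders, stubbed as fixed random projections
# into a shared 8-dimensional embedding space. Real encoders would be
# trained networks specific to each sensor.
def make_encoder(input_dim, embed_dim=8):
    w = rng.normal(size=(input_dim, embed_dim))
    return lambda x: np.tanh(x @ w)

camera_enc = make_encoder(32)  # e.g. flattened image features
radar_enc = make_encoder(16)   # e.g. range/velocity returns
lidar_enc = make_encoder(24)   # e.g. point-cloud statistics

def fuse(camera, radar, lidar):
    # Late fusion: concatenate the per-modality embeddings into one
    # vector for a downstream head (classifier, planner) to consume.
    return np.concatenate([camera_enc(camera), radar_enc(radar), lidar_enc(lidar)])

fused = fuse(rng.normal(size=32), rng.normal(size=16), rng.normal(size=24))
print(fused.shape)  # (24,) — three 8-d embeddings concatenated
```

Concatenation is the simplest fusion strategy; attention-based fusion, where one modality queries another, is common in current systems but follows the same pattern of projecting each sensor stream into a shared representation.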

The ability to draw correlations between different data sources can accelerate learning and help ground the knowledge encoded by language models in the realities of the physical world.24,25 It could also transform human-computer interaction by allowing people to communicate with machines in a wide range of mediums. These approaches are data-intensive, though, and while the internet is a goldmine of text and images, finding sufficient training material for other modalities could be a challenge. Robotics offers a potential model for overcoming this barrier, with research groups pooling resources to create massive open datasets of multimodal training data.26,27
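One common way such cross-modal correlations are learned is contrastive alignment, in which paired examples (say, an image and its caption) are pushed together in a shared embedding space while mismatched pairs are pushed apart. The sketch below is a toy, self-contained illustration of this idea with random stand-in embeddings; the names and the specific loss form are assumptions for demonstration, not any particular model's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def normalize(v):
    # Project embeddings onto the unit sphere so dot products are cosines.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Toy batch of 4 paired "image" and "text" embeddings (stand-ins for
# encoder outputs); each text vector is a noisy copy of its image pair.
images = normalize(rng.normal(size=(4, 8)))
texts = normalize(images + 0.1 * rng.normal(size=(4, 8)))

def contrastive_loss(img, txt, temperature=0.1):
    # InfoNCE-style objective: each image should score highest against
    # its own caption (the matching pairs lie on the diagonal).
    logits = img @ txt.T / temperature
    labels = np.arange(len(img))
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[labels, labels].mean()

print(round(contrastive_loss(images, texts), 3))
```

Because the toy pairs are nearly aligned by construction, the loss is small; training real encoders drives this quantity down across millions of pairs, which is what grounds textual concepts in visual (or other sensory) data.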

Multimodal AI - Anticipation Scores

The Anticipation Potential of a research field is determined by the capacity for impactful action in the present, considering possible transformative breakthroughs in the field over a 25-year outlook. A field with high Anticipation Potential therefore combines a wide range of future transformative possibilities with ample opportunities for action in the present. We asked researchers in the field to anticipate:

  1. The uncertainty related to future science breakthroughs in the field
  2. The transformative effect anticipated breakthroughs may have on research and society
  3. The scope for action in the present in relation to anticipated breakthroughs

This chart summarises their responses to each of these elements, which, when combined, provide the Anticipation Potential for the topic. See methodology for more information.
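The report does not specify how the three survey dimensions are combined, so the sketch below is purely a hypothetical illustration: it assumes each dimension is normalised to [0, 1] and aggregated by a simple mean. The function name and the aggregation rule are both assumptions, not the report's actual methodology.

```python
# Hypothetical aggregation of the three survey dimensions into one score.
# A plain mean of normalised values is assumed for illustration only;
# the report's real method is described in its methodology section.
def anticipation_potential(uncertainty, transformative_effect, scope_for_action):
    scores = (uncertainty, transformative_effect, scope_for_action)
    if not all(0.0 <= s <= 1.0 for s in scores):
        raise ValueError("each dimension should be normalised to [0, 1]")
    return sum(scores) / len(scores)

print(anticipation_potential(0.6, 0.9, 0.75))
```

A weighted mean would be the natural extension if, say, scope for present action should count for more than breakthrough uncertainty.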