Multimodal AI is already used in autonomous vehicles to fuse input from cameras, radar and lidar, and shows promise in healthcare, where a wide range of physiological signals need to be considered.15,16 More recently, large language models repurposed to work with multiple data modalities have shown considerable progress. Photorealistic imagery can now be generated from simple text descriptions, the latest AI-powered chatbots can perform complex image analysis, and robots can combine visual input and natural language commands to carry out complex tasks.17,18,19
The ability to draw correlations between different data sources can accelerate learning and help ground the knowledge encoded by language models in the realities of the physical world.20 These approaches are data-intensive, though, and while the internet is a goldmine of text and images, finding sufficient training material for other modalities could be a barrier.
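The idea of drawing correlations between modalities can be made concrete with a toy sketch. Systems such as CLIP-style models map each modality into a shared embedding space, where paired items (a caption and its image, say) score high under cosine similarity. The encoders, embeddings and labels below are entirely hypothetical, hand-made for illustration; a real system would learn the embeddings with trained neural encoders.

```python
import numpy as np

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    return v / np.linalg.norm(v)

# Toy embeddings: rows are items, assumed already projected into a shared
# 4-dimensional space by (hypothetical) per-modality encoders.
text_embeddings = np.array([
    [1.0, 0.0, 0.2, 0.0],   # caption: "a dog"
    [0.0, 1.0, 0.0, 0.3],   # caption: "a car"
])
image_embeddings = np.array([
    [0.9, 0.1, 0.1, 0.0],   # photo of a dog
    [0.1, 0.9, 0.0, 0.4],   # photo of a car
])

text_norm = np.array([normalize(v) for v in text_embeddings])
image_norm = np.array([normalize(v) for v in image_embeddings])

# Cosine-similarity matrix: entry [i, j] correlates caption i with image j.
similarity = text_norm @ image_norm.T

# If the shared space is well aligned, each caption's best match is its
# paired image -- the cross-modal "correlation" the text describes.
best_match = similarity.argmax(axis=1)
print(best_match)  # index of the best-matching image for each caption
```

In training, this similarity matrix drives a contrastive loss that pulls matched text-image pairs together and pushes mismatched pairs apart, which is one way language can be grounded in visual data.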