DVPS: Rethinking Multimodal AI through Direct Interaction with the Real World

TL;DR: Translated, an Italian company specializing in AI language solutions, will lead the European research project DVPS, funded with 29 million euros under Horizon Europe. The project's goal is to explore a new learning path for multimodal AI based on direct interaction with the physical world, combining language, spatial perception, sensory signals, and vision.

Translated, a Rome-based company specializing in linguistic solutions and AI-assisted translation, will lead the European research project DVPS, set to launch on July 1st. This ambitious program, supported with 29 million euros under Horizon Europe, brings together 20 partners from 9 countries around a common vision: exploring a new learning path for multimodal AI, based on direct interaction with the physical world.

Advancing the Science and Engineering of Multimodal Foundation Models

The project's name, DVPS, stands for "Diversibus viis plurima solvo" ("Through different paths, I solve multiple problems") and reflects this ambition. While current models rely on static data such as texts, images, or videos, which are themselves representations of the world, DVPS aims to go a step further. By combining language, spatial perception, sensory signals, and vision, the project seeks to bring AI closer to a form of understanding more firmly rooted in reality.
Marco Trombetti, co-founder and CEO of Translated, highlights:
"Large language models have marked a breakthrough, but their limitations are evident: they are based on a fixed architecture and learn only from static content created by humans in the digital world. To go further, AI must interact with the real world, in real-time. With DVPS, we give machines the ability to grow through direct experience, and to instantly share what they learn with each other."
The multimodal foundation models (MMFM) developed within the project will introduce three methodological breakthroughs:
  • Labeling Efficiency: thanks to transfer learning and few-shot adaptation, models can be trained with minimal annotated data, thus reducing reliance on manually labeled datasets;
  • Computation Reuse: by leveraging large-scale pre-training, they will reduce the computational cost of downstream applications, paving the way for more sustainable development;
  • Engineering Efficiency: automating model design will reduce the need for highly specialized expertise for each new task or domain.

Three Initial Fields of Application: Linguistics, Cardiology, and Geo-Intelligence

One of the challenges the project aims to tackle is real-time contextual understanding during simultaneous translation involving multiple speakers in noisy or unstructured environments.
In such settings, humans naturally draw on a range of non-verbal cues: gaze direction, the spatial location of voices, and body orientation. Current systems struggle to reconstruct this context. By combining computer vision, spatial audio analysis, and gesture interpretation, the models developed by DVPS could pave the way for language assistants that adapt better to real-world situations.
In healthcare, the project aims to contribute to the early detection of cardiovascular risk through 3D heart modeling based on advanced medical imaging. In environmental management, its goal is to improve the response to natural disasters, for example by aggregating satellite and ground data to anticipate floods.

A Project Structured Around Key Tools

The ultimate goal is to establish a solid scientific foundation for the European research community. To support this vision, DVPS will develop three core components:
  • AutoDVPS: an open-source toolkit for designing and extending MMFMs. It will be tested in the three initial application domains and in two additional domains yet to be defined, a strategy intended to assess the models' ability to generalize beyond their design assumptions;
  • DVPSBench: a benchmarking suite for evaluating the robustness, performance, and ethical aspects of these models;
  • DVPS-FM: a foundation model trained on a large-scale dataset spanning diverse modalities.
The project also plans to publish the "Principles and Practices of MMFM" manual, accompanied by a MOOC aiming to train over 1,500 learners. To stimulate innovation and synergies, 15 collaborations are planned with other European AI initiatives, along with the creation of a co-innovation lab bringing together academics and industry professionals.

A Collective Effort Serving European Technological Sovereignty

The DVPS founding team brings together 70 leading European AI scientists from the following partners:
  • Academic Research: University of Oxford, Alan Turing Institute, École polytechnique fédérale de Lausanne, ETH Zurich, Imperial College London, Fondazione Bruno Kessler, Karlsruhe Institute of Technology, University of Barcelona, and Vlaamse Instelling voor Technologisch Onderzoek
  • Specialized Partners: University Hospital Heidelberg, Vall d'Hebron Institut de Recerca, Amsterdam University Medical Centers, Deepset, Sistema, MEEO, Lynkeus, Data Valley, and Pi School of AI
  • High-Performance Computing: Cyfronet, the Polish national high-performance computing center