ML Drift: Facilitating Local Inference

TLDR: A team of researchers from Google and Meta has developed ML Drift, a framework for running generative AI inference efficiently on-device despite the diversity of GPU architectures. Thanks to innovations such as tensor virtualization, ML Drift delivers significant performance gains and broad compatibility across mobile and desktop platforms.

Most artificial intelligence models run inference (that is, they are "executed") on servers. Moving inference locally, directly onto the device, would accelerate the spread of artificial intelligence, notably by easing server load and strengthening privacy.

Deploying generative AI models across many types of GPUs, however, presents notable challenges: the diversity of GPU architectures, ranging from proprietary solutions to open platforms, complicates the task, since each GPU family has its own characteristics and limitations.

Facing a growing risk of hardware dependency, optimizing performance on heterogeneous platforms becomes imperative to ensure smooth and efficient execution of generative models.

To address these challenges, a team of researchers from Google and Meta, including Jiuqiang Tang, Raman Sarokin, and Ekaterina Ignasheva, developed ML Drift, an inference engine designed to run across a wide range of platforms. Their expertise lies in optimizing GPU inference engines for the efficient execution of generative AI workloads. ML Drift stands out for its ability to overcome the technical obstacles of developing against multiple GPU APIs, thus ensuring broad compatibility across mobile and desktop platforms.

Methodological Approach and Technical Innovations

ML Drift introduces several technical innovations, including tensor virtualization and optimized memory management. Tensor virtualization decouples a tensor's logical indices from the physical indices of GPU memory, giving the engine greater flexibility in memory layout and kernel optimization. In addition, memory management and optimization strategies reduce the memory footprint and improve performance.
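To make the idea concrete, here is a minimal sketch of tensor virtualization (hypothetical names and layouts, not ML Drift's actual code): kernels address a tensor through logical coordinates, and a small mapping layer translates them into physical offsets for whichever layout the target GPU prefers.

```cpp
#include <array>
#include <cstddef>
#include <cstdio>

// Hypothetical sketch of tensor virtualization (illustrative names only):
// kernels address tensors through logical (batch, height, width, channel)
// coordinates, and a mapping layer translates them into physical offsets
// for the layout the target GPU prefers.
enum class Layout { kNHWC, kNCHW };

struct VirtualTensor {
  std::array<std::size_t, 4> shape;  // logical shape {N, H, W, C}
  Layout layout;                     // physical layout chosen per backend

  // Translate a logical coordinate into a physical flat offset.
  std::size_t PhysicalOffset(std::size_t n, std::size_t h,
                             std::size_t w, std::size_t c) const {
    const auto [N, H, W, C] = shape;
    switch (layout) {
      case Layout::kNHWC: return ((n * H + h) * W + w) * C + c;
      case Layout::kNCHW: return ((n * C + c) * H + h) * W + w;
    }
    return 0;  // unreachable
  }
};

int main() {
  VirtualTensor t{{1, 8, 8, 32}, Layout::kNHWC};
  // The same logical coordinate lands at different physical offsets
  // depending on the layout the backend selected.
  std::printf("NHWC offset: %zu\n", t.PhysicalOffset(0, 2, 3, 5));
  t.layout = Layout::kNCHW;
  std::printf("NCHW offset: %zu\n", t.PhysicalOffset(0, 2, 3, 5));
  return 0;
}
```

Because kernels only ever see logical coordinates, switching the physical layout to suit a particular GPU does not require rewriting the kernel logic.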

Results and Future Perspectives

Performance evaluations of ML Drift show significant improvements over existing open-source solutions: substantial performance gains, along with support for models with 10 to 100 times more parameters. These promising results pave the way for future applications and improvements, notably the integration of advanced quantization techniques and the exploration of instructions specialized for ML workloads. Going forward, the team plans to extend ML Drift's capabilities to newer diffusion models and transformer-based architectures, while exploring efficient interoperability with heterogeneous processors.

 

Publication reference: arXiv:2505.00232v1

 

To better understand

What is tensor virtualization and why is it important for inference on varied GPUs?

Tensor virtualization decouples a tensor's logical indices from the physical indices of GPU memory, allowing greater flexibility in how memory is laid out and managed. This is crucial for optimizing inference on GPUs with heterogeneous architectures, as it enables better use of each device's resources.
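One concrete payoff of this decoupling is buffer reuse: once tensors no longer dictate their own physical storage, the engine can map short-lived intermediate tensors onto shared physical buffers. The sketch below shows a greedy, lifetime-based planner of the kind commonly used in inference engines; it illustrates the general technique, not ML Drift's actual planner, and all names are hypothetical.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical sketch: greedy buffer reuse based on tensor lifetimes,
// a common way inference engines shrink peak memory usage.
struct TensorSpec {
  std::size_t bytes;  // size of the tensor
  int first_use;      // index of the op that produces it
  int last_use;       // index of the last op that reads it
};

struct Slot {
  std::size_t bytes;  // capacity of this physical buffer
  int free_after;     // op index after which the slot is free again
};

// Assign each tensor to a physical slot, reusing slots whose previous
// occupants are already dead. Returns one slot index per tensor.
std::vector<int> PlanMemory(const std::vector<TensorSpec>& tensors,
                            std::vector<Slot>& slots) {
  std::vector<int> assignment(tensors.size(), -1);
  for (std::size_t i = 0; i < tensors.size(); ++i) {
    const TensorSpec& t = tensors[i];
    int chosen = -1;
    for (std::size_t s = 0; s < slots.size(); ++s) {
      // Reuse a slot only if its previous tensor is no longer live.
      if (slots[s].free_after < t.first_use && slots[s].bytes >= t.bytes) {
        chosen = static_cast<int>(s);
        break;
      }
    }
    if (chosen < 0) {  // no reusable slot: allocate a new one
      slots.push_back({t.bytes, t.last_use});
      chosen = static_cast<int>(slots.size()) - 1;
    } else {
      slots[chosen].free_after = t.last_use;
    }
    assignment[i] = chosen;
  }
  return assignment;
}

int main() {
  // Three intermediate tensors; tensor 0 dies before tensor 2 is
  // created, so tensor 2 can reuse tensor 0's buffer.
  std::vector<TensorSpec> tensors = {{1024, 0, 1}, {2048, 1, 2}, {1024, 2, 3}};
  std::vector<Slot> slots;
  std::vector<int> plan = PlanMemory(tensors, slots);
  for (std::size_t i = 0; i < plan.size(); ++i)
    std::printf("tensor %zu -> slot %d\n", i, plan[i]);
  std::printf("physical buffers: %zu (vs %zu tensors)\n", slots.size(),
              tensors.size());
  return 0;
}
```

In this toy example, the third tensor reuses the first tensor's buffer because their lifetimes do not overlap, so peak memory covers two physical buffers instead of three.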