ML Drift: Facilitating Local Inference

Most artificial intelligence models run inference (that is, they are executed) on servers. Running inference locally, directly on the device, would accelerate the spread of artificial intelligence, notably by easing server constraints and improving privacy.

Deploying generative AI models on different types of GPUs nevertheless presents notable challenges. GPU architectures are diverse, ranging from proprietary solutions to open platforms, and each type of GPU has its own characteristics and limitations.

As the risk of hardware dependency grows, optimizing performance across heterogeneous platforms becomes essential to ensure that generative models run smoothly and efficiently.

To address these challenges, a team of researchers from Google and Meta, including Jiuqiang Tang, Raman Sarokin, and Ekaterina Ignasheva, developed ML Drift, a solution intended for inference on a variety of platforms. Their expertise lies in optimizing GPU inference engines so that generative AI workloads execute efficiently. ML Drift stands out for its ability to overcome the technical obstacles of developing against multiple GPU APIs, which gives it broad compatibility across mobile and desktop platforms.

Methodological Approach and Technical Innovations

ML Drift introduces several technical innovations, including tensor virtualization and optimized memory management. Tensor virtualization allows the decoupling of logical indices from the physical indices of the GPU, offering increased flexibility in memory layout and kernel optimization. Additionally, memory management and optimization strategies reduce memory footprint and improve performance.
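To make the idea of tensor virtualization more concrete, here is a minimal Python sketch of decoupling logical indices from physical storage. It is an illustration under stated assumptions, not ML Drift's actual API: the names `VirtualTensor`, `hwc_offset`, `packed4_offset`, `make_tensor`, and `scale` are hypothetical, and the two layouts (plain row-major, and channels packed in groups of four) are chosen only as plausible examples of GPU-friendly layouts.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Shape = Tuple[int, int, int]  # logical shape: (height, width, channels)


def hwc_offset(shape: Shape, h: int, w: int, c: int) -> int:
    """Row-major layout: all channels of a pixel are contiguous."""
    _, width, channels = shape
    return (h * width + w) * channels + c


def hwc_size(shape: Shape) -> int:
    height, width, channels = shape
    return height * width * channels


def packed4_offset(shape: Shape, h: int, w: int, c: int) -> int:
    """Channels grouped in slices of 4, stored one slice plane after another."""
    height, width, _ = shape
    slice_idx, lane = divmod(c, 4)
    return ((slice_idx * height + h) * width + w) * 4 + lane


def packed4_size(shape: Shape) -> int:
    height, width, channels = shape
    return ((channels + 3) // 4) * height * width * 4  # padded to a multiple of 4


# Hypothetical layout registry: name -> (offset function, storage size function).
LAYOUTS: Dict[str, Tuple[Callable[[Shape, int, int, int], int],
                         Callable[[Shape], int]]] = {
    "hwc": (hwc_offset, hwc_size),
    "packed4": (packed4_offset, packed4_size),
}


@dataclass
class VirtualTensor:
    """Logical (h, w, c) view over a flat physical buffer."""
    shape: Shape
    offset_fn: Callable[[Shape, int, int, int], int]
    storage: List[float]

    def read(self, h: int, w: int, c: int) -> float:
        return self.storage[self.offset_fn(self.shape, h, w, c)]

    def write(self, h: int, w: int, c: int, value: float) -> None:
        self.storage[self.offset_fn(self.shape, h, w, c)] = value


def make_tensor(shape: Shape, layout: str) -> VirtualTensor:
    offset_fn, size_fn = LAYOUTS[layout]
    return VirtualTensor(shape, offset_fn, [0.0] * size_fn(shape))


def scale(src: VirtualTensor, dst: VirtualTensor, factor: float) -> None:
    """A 'kernel' written only in logical coordinates; it never sees the layout."""
    height, width, channels = src.shape
    for h in range(height):
        for w in range(width):
            for c in range(channels):
                dst.write(h, w, c, factor * src.read(h, w, c))
```

Because `scale` only uses logical coordinates, the same kernel runs unchanged on either layout, and the physical arrangement can be picked per device. That is the flexibility the paragraph above describes, shown here in a deliberately simplified form.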

Results and Future Perspectives

Performance evaluations of ML Drift show significant improvements over existing open-source solutions, including support for models with 10 to 100 times more parameters. These promising results pave the way for future applications and improvements, notably the integration of advanced quantization techniques and the exploration of specialized instructions for ML workloads. The team also plans to extend ML Drift's capabilities to newer diffusion models and transformer-based architectures while exploring effective interoperability with heterogeneous processors.

Publication reference: arXiv:2505.00232v1

Stephane Nachez

ActuIA editorial team — news, data and analysis on artificial intelligence for decision-makers.
