On June 3rd, 2025, Hugging Face introduced SmolVLA, an open-source Vision-Language-Action (VLA) robotics model. This compact model, with only 450 million parameters, can run on consumer hardware such as a MacBook or a single standard GPU, while offering performance comparable to much larger models.
AI applied to robotics is experiencing a boom thanks to advances in computer vision, natural language processing, and reinforcement learning. This progress has intensified with VLA models, which can analyze their surroundings, understand human instructions, and act autonomously in complex environments.
However, this technical promise faces several limitations. On one hand, most existing VLA models are extremely large, often counting several billion parameters, which leads to prohibitive training costs and limits their adoption in real-world conditions. On the other hand, recent advances remain largely proprietary: weights are sometimes published, but training details and essential methodological components often remain out of reach.
SmolVLA positions itself as a response to these constraints: a lightweight, open, and reproducible alternative that does not compromise on performance.
Architecture and Design
SmolVLA was trained exclusively on datasets collected by the community, via the LeRobot platform hosted on Hugging Face. It is based on a modular architecture comprising two main components:
- SmolVLM2, a lightweight and efficient vision-language model optimized for multi-image and video processing. It combines two complementary blocks, the SigLIP visual encoder and the SmolLM2 language decoder, allowing the system to interpret the robot's visual environment in light of natural-language instructions;
- the Action Expert, a 100-million-parameter transformer that predicts the actions the robot should take from the features supplied by the VLM (a minimal sketch follows this list).
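To make the division of labor between the two blocks concrete, here is a minimal PyTorch sketch of an action expert: a small transformer that cross-attends to the features produced by the VLM and outputs a chunk of future actions. The class name, dimensions, chunk size, and action dimension are illustrative assumptions, not the actual LeRobot implementation (which additionally relies on flow matching and interleaved attention layers).

```python
import torch
import torch.nn as nn

class ActionExpert(nn.Module):
    """Hypothetical action head: a small transformer decoder that cross-attends
    to VLM features and predicts a chunk of future robot actions."""
    def __init__(self, vlm_dim=960, d_model=720, n_layers=8, n_heads=12,
                 chunk_size=50, action_dim=6):
        super().__init__()
        # One learned query per future action step in the chunk.
        self.action_queries = nn.Parameter(torch.randn(chunk_size, d_model))
        # Project VLM features (vision + language tokens) into the expert's width.
        self.proj_vlm = nn.Linear(vlm_dim, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.to_action = nn.Linear(d_model, action_dim)

    def forward(self, vlm_features):
        # vlm_features: (batch, seq_len, vlm_dim), e.g. the VLM's hidden states
        memory = self.proj_vlm(vlm_features)
        queries = self.action_queries.unsqueeze(0).expand(memory.shape[0], -1, -1)
        decoded = self.decoder(tgt=queries, memory=memory)
        return self.to_action(decoded)  # (batch, chunk_size, action_dim)

# Dummy features standing in for the VLM's output: 256 tokens of width 960.
features = torch.randn(1, 256, 960)
actions = ActionExpert()(features)
print(actions.shape)  # torch.Size([1, 50, 6])
```

The key design point is that the expensive vision-language backbone runs once per observation, while the much smaller expert turns its features into a whole chunk of low-level actions.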
Targeted design choices contribute to the model's efficiency:
- reducing the number of visual tokens speeds up inference without compromising quality;
- layer skipping speeds up execution by taking features from an intermediate VLM layer rather than running every layer;
- interleaved attention, alternating cross-attention and self-attention layers, optimizes information flow between modalities;
- asynchronous inference lets the model predict the next chunk of actions while the previous one is still being executed (sketched below).
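The last point deserves an illustration. In a conventional control loop, the robot idles between action chunks while the model computes; with asynchronous inference, the next prediction overlaps with the execution of the current chunk. The sketch below assumes a hypothetical `policy.predict_chunk()` and `robot` interface rather than LeRobot's actual asynchronous stack.

```python
from concurrent.futures import ThreadPoolExecutor

def run_async_control_loop(policy, robot, n_steps=1000):
    """Sketch of asynchronous inference: while the robot executes the current
    chunk of actions, the policy already computes the next chunk from a fresh
    observation, so the arm never waits for the model.
    `policy.predict_chunk(obs)` and the `robot` API are hypothetical."""
    executor = ThreadPoolExecutor(max_workers=1)
    # First chunk is computed synchronously to bootstrap the loop.
    chunk = policy.predict_chunk(robot.get_observation())
    steps = 0
    while steps < n_steps:
        # Start predicting the next chunk in the background.
        future = executor.submit(policy.predict_chunk, robot.get_observation())
        # Meanwhile, execute the current chunk on the robot.
        for action in chunk:
            robot.send_action(action)
            steps += 1
        # By now the next chunk is (ideally) ready; otherwise this blocks.
        chunk = future.result()
    executor.shutdown()
```

The benefit depends on each chunk being long enough to cover the model's inference latency; if it is not, the loop still blocks on `future.result()` and the speedup disappears.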
These levers improve performance while keeping the computational load under control. By open-sourcing the model, its codebase, training datasets, and robot hardware designs, and by providing detailed instructions for complete reproducibility, Hugging Face aims to democratize access to VLA models and accelerate research on generalist robotic agents.