TL;DR: A study by METR found that experienced developers were 19% slower when using AI tools, despite expecting to be faster. Several factors contribute to this, including imperfect tool usage and the high quality standards of the projects studied.
What if AI has not yet delivered on its promises of productivity gains? An experimental study conducted by the independent research organization METR (Model Evaluation & Threat Research) in the first half of 2025 challenges these certainties. Tested in real conditions on open-source repositories they know well, seasoned developers were on average 19% slower when using generative AI tools than when they worked without them. This unexpected underperformance is all the more striking because the participants themselves mistakenly believed that AI had helped them work faster.
Traditional benchmarks such as SWE-Bench Verified or RE-Bench have been used to measure the rapid progress of AI models. However, these evaluations are often decontextualized: they focus on isolated, standardized tasks graded automatically, far from the practical requirements and implicit constraints of everyday software work.
To draw reliable conclusions about the real impact of generative AI on software development, METR recruited 16 developers, each with tens to hundreds of hours of experience with language models and an average of five years of contributions to mature open-source projects (22,000 GitHub stars, over a million lines of code).
The methodology of this randomized controlled trial (RCT) rests on a simple principle: randomly allow or forbid the use of AI (mainly Cursor Pro with Claude 3.5/3.7) on a set of 246 technical tasks drawn from the developers' day-to-day work (bugs, refactorings, feature enhancements), then measure the time to resolution.
Work times were self-reported and supplemented by screen recordings. The deliverables were qualitatively evaluated through the submitted Pull Requests.
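As an illustration of the design, here is a minimal sketch of how a slowdown estimate could be derived from such per-task data, assuming a simple comparison of geometric-mean completion times between the two conditions. The data, variable names, and analysis below are hypothetical simplifications, not METR's actual pipeline.

```python
import math
import random
from statistics import mean

# Hypothetical per-task records: (ai_allowed, completion_time_minutes).
# The real study randomly assigned 246 issues to AI-allowed or
# AI-disallowed conditions; the times below are simulated placeholders.
random.seed(0)
tasks = [(random.random() < 0.5, random.uniform(20, 180)) for _ in range(246)]

def time_ratio(records):
    """Ratio of geometric-mean completion times, AI-allowed vs. AI-disallowed.

    A value of 1.19 would correspond to a 19% slowdown with AI.
    """
    with_ai = [math.log(t) for allowed, t in records if allowed]
    without_ai = [math.log(t) for allowed, t in records if not allowed]
    return math.exp(mean(with_ai) - mean(without_ai))

print(f"AI / no-AI time ratio: {time_ratio(tasks):.2f}")
```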
A Discrepancy Between Perception and Reality
Despite their experience, the developers anticipated a 24% speedup from AI. At the end of the experiment, this estimate decreased slightly but remained optimistic: a perceived time saving of 20%.
In reality, tasks performed with the help of AI took on average 19% more time than those performed without it, a gap that contradicted not only the participants' expectations but also the forecasts of external experts in economics (who predicted a 39% speedup) and machine learning (38%).
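To make the gap concrete, here is the arithmetic for a hypothetical task that takes 60 minutes without AI (the baseline duration is purely illustrative, not a figure from the study):

```python
# Hypothetical illustration of the 19% measured slowdown vs. the 20% perceived gain.
baseline = 60                              # hypothetical task time without AI, in minutes
measured_with_ai = baseline * 1.19         # observed outcome: 19% slower
perceived_with_ai = baseline * (1 - 0.20)  # what participants believed: 20% faster
print(f"measured with AI:  {measured_with_ai:.1f} min")   # 71.4 min
print(f"perceived with AI: {perceived_with_ai:.1f} min")  # 48.0 min
```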
Multiple Explanations
METR identifies five main factors likely to explain this slowdown:
- Imperfect use of tools, notably overly simple prompts;
- Limited familiarity with AI interfaces like Cursor;
- High quality standards in the studied projects, sometimes incompatible with generated suggestions;
- Insufficient coverage of complex cases by the models;
- A form of cognitive distraction linked to experimenting with AI.
Other hypotheses, such as measurement errors or methodological flaws, have been ruled out by the analysis.
Far from concluding that AI harms the performance of all developers in all contexts, the study mainly highlights that productivity gains are neither immediate nor automatic: they depend on a careful fit between tool, task, and professional context.