The confidence expressed by GPT-4o, ChatGPT, and GPT-o3 exceeds their actual accuracy, and the gap widens on difficult tasks; on easy tasks, conversely, the models underestimate themselves. This hard-easy effect, quantified in a preprint under ACL review and posted on arXiv on April 3, 2026, bears directly on the human oversight required by Article 14(4)(b) of the AI Act: the confidence signal produced by the model is least reliable where the supervisor would need it the most. The authors (Noam Michael, Daniel BenShushan, Jacob Bien, and Don A. Moore, of the USC Marshall School of Business and the UC Berkeley Haas School of Business) report a preregistered protocol, with hypotheses and methodology declared before data collection, which strengthens the empirical weight of the result within the tested perimeter (GPT-4o, ChatGPT, and GPT-o3).
Hard-easy effect measured on GPT-4o, ChatGPT, and GPT-o3 via LifeEval - preregistered protocol, arXiv:2605.23909, v1 of April 3, 2026
The LifeEval Benchmark and Quantified Hard-Easy Effect
To produce this result, the authors built their own benchmark, LifeEval, presented (loosely translated) as a test designed to assess the calibration of models across different levels of difficulty. Across the entire set, the mean achievable score (reported as Mean Accuracy Score) is 56.80%. Four metrics are reported: Mean Score, Expected Calibration Error (ECE), Mean Confidence, and a regression coefficient linking difficulty and overconfidence. This last coefficient carries the empirical signature of the hard-easy effect: overconfidence is strongest on difficult tests, while easy tests produce substantial underconfidence. The psychological grounding comes from co-author Don A. Moore, professor at the Haas School of Business, holder of the Lorraine Tyson Mitchell Chair in Leadership and Communication, and a reference author on the subject ("The Trouble With Overconfidence", Psychological Review, 2008). A methodological caveat remains: the comparison with human bias, summed up by the phrase "like people" in the abstract, rests on an analogy whose comparative methodology is not explicitly detailed at this stage. Whether the human hard-easy effect transfers to LLMs remains debated: Juslin, Winman, and Olsson (Psychological Review, 2000) showed that the effect almost entirely disappears in humans once item selection artifacts are controlled, and it remains an open question whether the mechanism observed in models is analogous or has other causes.
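As an illustration of two of the metrics named above, the following Python sketch computes an Expected Calibration Error with equal-width confidence bins, and a least-squares slope of overconfidence on difficulty. It is not the authors' code: the binning scheme, the definition of difficulty as one minus per-test accuracy, the function names, and the toy data are all assumptions made for this example.

```python
from statistics import mean


def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin-size-weighted mean of |accuracy - mean confidence| over equal-width bins.

    Assumption: confidences are probabilities in [0, 1]; correct holds 0/1 outcomes.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # a confidence of 1.0 goes in the last bin
        bins[idx].append((conf, hit))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = mean([conf for conf, _ in items])
        accuracy = mean([hit for _, hit in items])
        ece += len(items) / n * abs(accuracy - avg_conf)
    return ece


def hard_easy_slope(tests):
    """Ordinary least-squares slope of overconfidence on difficulty, one point per test.

    Assumption (hypothetical definitions for this sketch):
      difficulty     = 1 - accuracy on the test
      overconfidence = mean confidence - accuracy on the test
    """
    xs, ys = [], []
    for confidences, correct in tests:
        accuracy = mean(correct)
        xs.append(1 - accuracy)
        ys.append(mean(confidences) - accuracy)
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy data, invented for illustration: (confidences, outcomes) for three tests.
toy_tests = [
    ([0.85] * 5, [1, 1, 1, 1, 1]),  # easy test: the model is underconfident
    ([0.75] * 4, [1, 1, 0, 0]),     # medium test
    ([0.65] * 4, [1, 0, 0, 0]),     # hard test: the model is overconfident
]

all_confidences = [c for confs, _ in toy_tests for c in confs]
all_outcomes = [o for _, outs in toy_tests for o in outs]

print("ECE:", expected_calibration_error(all_confidences, all_outcomes))
print("hard-easy slope:", hard_easy_slope(toy_tests))
```

In this sketch, a positive slope means that overconfidence grows with difficulty, which is the pattern the article describes. Because accuracy enters both variables, such a slope calls for careful interpretation, in line with the methodological caveat above.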
Scope Not to Be Extrapolated
The paper (arXiv:2605.23909) is a preprint under ACL review (v1, April 3, 2026): its results have not yet been validated by a review committee. LifeEval covers GPT-4o, ChatGPT, and GPT-o3; the conclusions do not carry over mechanically to other model families. The preregistered protocol strengthens internal validity but does not broaden external coverage. In addition, the rapid pace at which new model versions are deployed means the finding should be read as a snapshot of the versions tested.
A Convergent Set of 2026 Results
The USC/Berkeley paper does not arrive in isolation. Three other recent works document the same miscalibration across distinct scopes. In February 2026, Sudipta Ghosh and Mrityunjoy Panday (Cognizant) published an empirical study of the "Dunning-Kruger effect" in LLMs covering 24,000 trials on four models. Kimi K2 shows an Expected Calibration Error of 0.726 for an accuracy of only 23.3%, while Claude Haiku 4.5 achieves the best measured calibration (ECE 0.122) at 75.4% accuracy. The worst-performing models are the most overconfident. In the medical field, npj Gut and Liver, a journal in the Nature portfolio, published on February 5, 2026 an evaluation of 48 LLMs tested on 300 gastroenterology questions: regardless of accuracy level, all models estimate their own certainty poorly. A Johns Hopkins / MIT / Microsoft Healthcare team extends this finding to medical visual question answering (VQA) (arXiv:2604.02543): models maintain high confidence even when producing hallucinations. The pattern is now documented by four methodologically independent lines of work.
Articulation with Article 14(4)(b) of the AI Act
The European timeline gives this set of results a dated operational scope. Article 14 of the AI Act, initially scheduled to apply from August 2, 2026, has its date of application postponed to December 2, 2027 by the provisional political agreement on the Digital Omnibus on AI of May 7, 2026, subject to formal adoption by the co-legislators. Its paragraph (4)(b) requires that individuals responsible for the human oversight of a high-risk AI system remain aware of the tendency to rely or over-rely automatically on the system's output ("automation bias"), especially for systems used to provide information or recommendations for decisions made by individuals (loosely translated). The link with the hard-easy effect is direct: model confidence is most inflated precisely on the cases where the models are most often wrong, the zone where the human supervisor has the least reliable signal to detect an error. A technical solution is documented: THERMOMETER (Shen et al., MIT/IBM, ICML 2024) proposes a post-hoc multi-task calibration. The obligation of Article 14(4)(b) nevertheless remains an organizational requirement weighing on the deployer, independent of calibration progress on the model side. For a European B2B buyer using an LLM for medical decision support, recruitment, or credit rating (uses covered by Annex III of the regulation), the selection criterion shifts: it is no longer enough to compare displayed accuracies; the system and its interface must also allow the human supervisor to weigh, and where necessary discount, the confidence produced by the model.
