As part of our dossier "Trusted Artificial Intelligence: From Critical Systems to the Common Good", published in issue 3 of the magazine ActuIA (currently on newsstands and available in our online shop), we spoke with Guillaume Avrin, Head of the "Artificial Intelligence Evaluation" department at the Laboratoire National de Métrologie et d'Essais (LNE). In February 2021, LNE, which has for several years established itself as a trusted third-party assessor of artificial intelligence, announced that it had obtained government funding to create LEIA, the world's first generic platform dedicated to the evaluation of artificial intelligence.
In this interview, Guillaume Avrin explains LNE's missions in the field of artificial intelligence, what testing artificial intelligence algorithms involves, what the "black box" effect is, and the challenges of certification.
ActuIA: Could you briefly introduce LNE?
Guillaume Avrin: The Laboratoire national de métrologie et d'essais (LNE) is a public establishment of an industrial and commercial nature (EPIC) attached to the Ministry of Industry. It is the national reference body for testing, evaluation and metrology. In support of public policies, its action aims to assess and structure the supply of new products, seeking both to protect and meet the needs of consumers and to develop and promote the competitiveness of national industry (cf. Article L823-1 of the Consumer Code). It also assists manufacturers in their efforts to innovate and become more competitive in many fields of expertise and sectors of activity. LNE carries out work on the characterisation, qualification and certification of systems and technologies to support all breakthrough innovations (artificial intelligence, nanotechnologies, additive manufacturing, radioactivity measurement, hydrogen storage, etc.) for the benefit of the scientific, regulatory and industrial community.
In the emerging field of artificial intelligence, LNE has metrological and methodological expertise in evaluation that has no equivalent at the European level. It has carried out more than 950 evaluations of AI systems since 2008, notably in language processing (translation, transcription, speaker recognition, etc.), image processing (person recognition, object recognition, etc.) and robotics (autonomous vehicles, service robots, agricultural robots, collaborative robots, intelligent medical devices, etc.). It is involved in all the major cross-cutting challenges of AI. In parallel, it is putting in place a system for qualifying AI solutions, based on an efficient and partly internalised national network. This network is mobilised particularly in the upstream stages of deployment and for developing the metrological, methodological and instrumental models to be implemented.
Are tests of artificial intelligence algorithms similar to the tests you usually carry out in other specialties, or have they required the development of new skills and protocols? How do they differ? How many artificial intelligence experts work for LNE? Do they all work at LNE, or do you also call on independent experts?
The tests we conduct in AI help to estimate, fairly and in absolute terms (for the purposes of development, performance characterisation, benchmarking, certification, etc.), the use value, performance, hazards and environmental impact of a system, whether positive or negative. That includes its impact, even indirect, on societies and individual lifestyles: socio-economic consequences and ethical, legal and sociological questions.
This is a substantially new issue, linked to AI's strong power of professional and social substitution, and it also has a metrological specificity: the capability of an intelligent system must be measured mainly at the functional level and lies above all in its adaptability (a conventional system, by contrast, is judged quantitatively on its performance within a framework of use that is fully defined at the design stage). It is therefore not only a question of objectively quantifying functions and performance, but also of validating and characterising operating environments (perimeters of use), which are by nature variable, often highly so, particularly in so-called "open" environments. It is this variability of the situations to be handled, characteristic of the domain of the human mind, that confers the quality of intelligence on the system and even measures its degree.
This extensive field of use and experimentation, together with the at least partially autonomous and often non-convex, non-linear and non-deterministic behaviour of AI systems, requires the development of sui generis protocols and measurement instruments:
- The measurement of intelligent systems is based on a so-called "soft" metrology that is more functional than quantitative and more attentive to robustness than to performance. It uses composite, multidimensional metrics capable of accurately and precisely reflecting the extent and sensitivities of the operating environment of the component under test.
- The essential point is to cover a field of use with a sampling resolution that is necessarily limited but fine enough to guarantee the absence of aberrant reactions. The test scenarios to be presented to the system under evaluation are therefore potentially very numerous, and some of them may lead to accident or near-accident situations and can only be generated by simulation. Since simulation implies a necessarily reductive modelling of reality, compromises must be found between the needs for completeness and realism.
- If the application concerns very rare data (for example, medical data on orphan pathologies), the AI developer is generally the only party, or one of very few in the world, to hold data on the subject. The trusted third-party evaluator will therefore need access to the developer's corpora that were not used during training, which it can "augment" (noise, disturbances, transformations, etc.) for use in evaluations, in order to limit bias, especially overfitting.
- If the application concerns data that is common, reusable across applications and/or inexpensive to produce (recognition of everyday subjects such as people, animals, road signs, office equipment, etc.), then it is relevant to set up reference databases (standards for AI) independently of the developer.
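To make the idea of "augmenting" a held-out corpus more concrete, here is a minimal Python sketch of how an evaluator might measure robustness rather than raw performance. It is an illustration only, not LNE's protocol: the function names (add_gaussian_noise, accuracy, robustness_profile), the choice of Gaussian noise as the perturbation, the toy classifier and the four-sample dataset are all assumptions introduced for the example.

```python
import random
from statistics import mean


def add_gaussian_noise(sample, sigma, rng):
    """Return a copy of a feature vector with zero-mean Gaussian noise added."""
    return [x + rng.gauss(0.0, sigma) for x in sample]


def accuracy(model, samples, labels):
    """Fraction of samples for which the model predicts the expected label."""
    return mean(1.0 if model(s) == y else 0.0 for s, y in zip(samples, labels))


def robustness_profile(model, samples, labels, sigmas, seed=0):
    """Accuracy of the model on the held-out set at each perturbation level."""
    rng = random.Random(seed)  # fixed seed so the evaluation is repeatable
    profile = {}
    for sigma in sigmas:
        noisy = [add_gaussian_noise(s, sigma, rng) for s in samples]
        profile[sigma] = accuracy(model, noisy, labels)
    return profile


# Hypothetical stand-in for the system under test: a fixed threshold rule.
def toy_model(sample):
    return 1 if sum(sample) > 0 else 0


# Hypothetical held-out data, assumed never to have been used for training.
held_out = [[1.0, 1.0], [-1.0, -1.0], [2.0, 0.5], [-0.5, -2.0]]
labels = [1, 0, 1, 0]

profile = robustness_profile(toy_model, held_out, labels, sigmas=[0.0, 0.5, 2.0])
for sigma, acc in profile.items():
    print(f"sigma={sigma}: accuracy={acc:.2f}")
```

Reporting accuracy as a function of the perturbation level, rather than as a single score, is one simple way to express the robustness-oriented, multidimensional view of measurement described above. A real evaluation would use perturbations chosen to match the intended operating environment.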