Delivery of the largest open science multilingual language model ever

While they regularly produce fascinating results, large artificial intelligence models are generally black boxes: we do not know exactly how they arrive at their answers, and many of their components are never made public. The BigScience project, which brought together a thousand researchers in a participatory, open-science approach, is changing this with “Bloom”.

It is the largest multilingual language model ever trained in a completely open and transparent way. This type of artificial intelligence simultaneously learns a text-generation model and a text-representation model by repeatedly performing one basic task: predicting the next word in a text whose beginning is known, in the manner of “intelligent” keyboards. Beyond handling 46 languages, from English to Basque, its open-science nature will help scientists from every field explore how language models work and how to improve them. The BigScience project, initiated by the company Hugging Face, was supported by the CNRS, GENCI and the French Ministry of Higher Education and Research, which enabled Bloom to be trained on the “Jean Zay” machine, one of the most powerful supercomputers in Europe. Philippe Lavocat, President and CEO of GENCI, states:

“BigScience initiates a world first and paves the way for other scientific breakthroughs. It has benefited from the resources of the Jean Zay converged supercomputer, one of the most powerful in Europe, which was commissioned in 2019 in the wake of the AI for Humanity plan. Today, more than 1,000 research projects are using its resources. A key factor in this success is the Jean Zay extension deployed at the beginning of the year, the result of joint work between the Ministry of Higher Education and Research, the CNRS through the Institute for Development and Resources in Scientific Computing (Idris), and GENCI.”

Language models are artificial intelligences whose first applications concern natural-language text: question answering, automatic sentence generation, sentiment detection, automatic summarization and simplification, and machine translation. Most existing models have been trained only on texts written in English, following principles and methods that are difficult to reproduce in all their details. For example, when a model answers a question, it is impossible to know whether the answer was actually computed or was already present in its training data.

The BigScience project was initiated in the spring of 2021 by the Franco-American artificial intelligence start-up Hugging Face to remedy these problems by training a new model: Bloom. It learns from large corpora of texts using a simple principle: completing sentences word by word by predicting the next word. Each prediction is compared with the correct word, and the model's internal parameters are adjusted accordingly. In Bloom's case, training involved evaluating trillions of words and produced a model with 176 billion parameters. This training took several months and required hundreds of graphics processing units (GPUs) running in parallel, the equivalent of 5 million hours of computation. Such computing power can only be obtained on supercomputers like the Jean Zay machine. Thomas Wolf, co-founder and chief scientific officer of the start-up Hugging Face, states:

“The creation of the Bloom model and the success of the BigScience research collaboration show that another way of creating, studying and sharing innovations in AI is possible, bringing together industry, academia and non-profit organisations around an international, multidisciplinary, open-access project. I am delighted that Hugging Face was able to find the support it needed in France for this approach, unprecedented on a global scale.”
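
To make the training principle described above more concrete, here is a deliberately simplified sketch in PyTorch of a next-word-prediction training loop. The tiny recurrent model, the toy vocabulary and the random token sequences are assumptions chosen purely for illustration; they bear no resemblance to Bloom's actual transformer architecture, corpus or scale.

```python
# Simplified sketch of next-word-prediction training (not Bloom's actual
# architecture or data): the model guesses each next word, the guess is
# compared with the correct word, and the parameters are adjusted.
import torch
import torch.nn as nn

vocab_size, embed_dim = 1000, 64            # toy sizes, assumptions for illustration

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, embed_dim, batch_first=True)
        self.head = nn.Linear(embed_dim, vocab_size)

    def forward(self, tokens):
        hidden, _ = self.rnn(self.embed(tokens))
        return self.head(hidden)            # one score per vocabulary word, per position

model = TinyLM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Random token ids standing in for text whose beginning is known.
tokens = torch.randint(0, vocab_size, (8, 32))
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # the target is the next word at each position

for step in range(100):
    logits = model(inputs)                                               # predict the next words
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))  # compare with the correct words
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                                     # adjust the internal parameters
```

At Bloom's scale, the same loop is distributed over hundreds of GPUs and run over trillions of words, which is why a supercomputer such as Jean Zay is required.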

Bloom differs from other language models in that it is trained simultaneously on 46 languages, drawn from sources as varied as literature, scientific articles and sports news reports, and including many languages that are rarely taken into account, notably some 20 African languages. The training corpus even contains computer code! The whole is equivalent to several million books. The more diverse the sources, the wider the range of tasks the model can perform. Moreover, the data was not separated by language because, paradoxically, Bloom learns better that way: aggregating content in various languages makes it possible to learn models that are robust and efficient for all the languages considered, and often better than monolingual models. Another special feature: Bloom's architecture, the list of data used and its training logs will be entirely available in open science, to facilitate research on language models. Finally, Bloom is freely distributed under a responsible-use license, which explicitly prohibits malicious uses of the model.

Languages used for Bloom's training. The “Indic family” covers about fifteen languages from the Indian subcontinent (Hindi, Tamil, Urdu, …) and the “Niger-Congo family” about twenty languages from sub-Saharan Africa (Swahili, Yoruba, Wolof, …). 10.8% of the data was computer code, across 13 programming languages.
Source: Hugging Face
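
Because the trained model is openly released, it can be loaded directly with the Hugging Face transformers library. The sketch below is illustrative rather than official usage guidance: it assumes the smaller “bigscience/bloom-560m” checkpoint published alongside the full 176-billion-parameter “bigscience/bloom” model, since the full model is far too large to run on an ordinary machine.

```python
# Illustrative sketch: loading an openly released Bloom checkpoint with the
# Hugging Face transformers library and generating a continuation of a prompt.
# "bigscience/bloom-560m" is a small published variant assumed here for
# practicality; the full model is "bigscience/bloom" (176 billion parameters).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Le calcul intensif permet"   # prompts may be written in any of the 46 training languages
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```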

Antoine Petit, President and CEO of the CNRS, adds:

“We are delighted with this original public-private partnership, which shows just how essential complementary skills and resources, such as the power of the Jean Zay supercomputer, are in meeting a challenge as important and topical as research into artificial intelligence. Behind the scientific breakthrough, we salute the involvement of the Idris staff who made this training on the supercomputer possible, and we welcome the essential role played by the CNRS through the mobilization of the entire natural language processing community.”

Translated from Livraison du plus grand modèle de langue multilingue « open science » jamais entraîné