Focus on “Big Science”, the collaborative project for the development of an efficient open source language model

In order to develop an efficient open source multilingual language model in one year, several laboratories, large groups and start-ups have come together. They will use the French supercomputer Jean Zay to carry out the “Big Science” project. The main objective is to design a giant neural network capable of “speaking” eight languages, including French, English and several African languages. The kick-off workshop took place at the end of April, and we offer you a closer look at this very interesting participative project.

A project involving a hundred institutions

The “Summer of Language Models 21”, or “Big Science”, is a year-long research project focused on the language models used and studied in the field of natural language processing (NLP). More than 250 researchers from about 100 institutions, such as CNRS, Inria, Airbus, Ubisoft, Facebook, Systran and OVH, as well as several French and foreign universities, are contributing to the project.

The project was born out of discussions initiated in early 2021 between Thomas Wolf (Hugging Face), Stéphane Requena and Pierre-François Lavallée (from GENCI and IDRIS respectively). Very quickly, several experts from the Hugging Face scientific team (including Victor Sanh and Yacine Jernite), as well as members of the French academic and industrial research community in AI and NLP, joined the discussions to further develop the project.

Big Science is defined as a one-year research workshop during which a set of collaborative tasks will be carried out around the creation of a large dataset covering a wide variety of languages and of an efficient multilingual language model.
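To make the idea of a multilingual language model more concrete, here is a minimal sketch using Hugging Face’s transformers library with XLM-RoBERTa, an existing multilingual model chosen purely for illustration (it is not the model Big Science plans to build): the same network fills in a masked word in English and in French.

```python
# Illustrative sketch: XLM-RoBERTa is an existing multilingual model,
# not the one the Big Science project plans to train.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="xlm-roberta-base")

# The same model handles several languages without any language switch.
for prompt in [
    "Paris is the capital of <mask>.",      # English
    "Paris est la capitale de la <mask>.",  # French
]:
    best = fill_mask(prompt)[0]  # top prediction for the masked token
    print(f"{prompt} -> {best['token_str']} (score {best['score']:.2f})")
```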

Using the French supercomputer Jean Zay in a collaborative project

GENCI and IDRIS wanted to take part in the project by offering the use of the Jean Zay supercomputer, located in Orsay. The two institutions have made available 5 million hours of computing time, which corresponds to a quarter of the machine’s capacity (about 208 days of continuous use of that share).
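As a back-of-the-envelope check of those figures, the arithmetic below assumes the hours are counted per GPU and that a quarter of Jean Zay amounts to roughly 1,000 GPUs (a round, illustrative number, not an official specification):

```python
# Rough sanity check of the allocation figures (assumptions noted above).
gpu_hours = 5_000_000   # the grant from GENCI and IDRIS
assumed_gpus = 1_000    # hypothetical size of a quarter of Jean Zay

wall_clock_days = gpu_hours / assumed_gpus / 24
print(f"~{wall_clock_days:.0f} days")  # ~208 days of continuous use
```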

In parallel, an online public workshop will be held on May 21 and 22, with collaborative tasks to create, share and evaluate a huge multilingual database as a starting point for designing the model. Discussions will be held to identify the challenges posed by large language models and to better understand how they work.
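For a feel of what creating and evaluating such a database involves, here is a hedged sketch using Hugging Face’s datasets library to stream a slice of OSCAR, an existing multilingual web corpus used here only as an example of the kind of data concerned:

```python
# Illustrative sketch: OSCAR stands in for the multilingual data the
# workshop tasks will create, share and evaluate.
from datasets import load_dataset

# Stream the French portion so nothing huge is downloaded up front.
oscar_fr = load_dataset(
    "oscar", "unshuffled_deduplicated_fr", split="train", streaming=True
)

# A toy evaluation pass: look at a few documents and their sizes.
for i, doc in enumerate(oscar_fr):
    print(len(doc["text"]), doc["text"][:80].replace("\n", " "))
    if i == 4:
        break
```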

If successful, this workshop may be repeated and updated as the project progresses, as it is intended to be participatory.

How the “Big Science” project works

This research program will consist of:

  • A steering committee that will provide scientific and general advice.
  • An organizing committee, divided into several working groups that will be in charge of determining and carrying out collaborative tasks, as well as organizing workshops and other events to create the NLP tool.

Four roles are defined in the project. The first three are reserved for researchers and experts, while the last one involves public participation:

  • A role as scientific advisor and functional organizer: a task requiring a light commitment, i.e. reading a newsletter every two weeks and offering comments to a working group.
  • A role as an active member of one of the project’s working groups: designing and implementing collaborative tasks and organizing live events.
  • A role as chair or co-chair of a working group: requiring a much greater commitment, coordinating efforts and organizing the working group’s decision-making process.
  • A role as a participant in a workshop or public event: helping to complete a collective task by following the guidelines set by the working groups.

The model developed in this project aims to be more accomplished and less “biased” than those developed by OpenAI and Google. OpenAI’s GPT-3 generates 4.5 billion words per day for about 300 clients; it was trained on 570 GB of text (745 GB for Switch-C, Google’s model) and has 175 billion parameters (about 10 times more for Google’s model).
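Putting the quoted figures side by side; the exact Switch-C parameter count (about 1.6 trillion, i.e. roughly nine times GPT-3’s) and the 2-bytes-per-parameter storage assumption are added here for context and are not from the original article:

```python
# The figures quoted above, laid out for comparison.
gpt3_params   = 175e9     # GPT-3 (OpenAI)
switch_params = 1.571e12  # Switch-C (Google), ~1.6 trillion parameters

print(f"parameter ratio: {switch_params / gpt3_params:.1f}x")  # ~9x ("10 times more")

# Rough weight-storage footprint, assuming 2 bytes per parameter (fp16):
print(f"GPT-3 weights: ~{gpt3_params * 2 / 1e9:.0f} GB")  # ~350 GB
```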

Translated from Focus sur “Big Science”, le projet collaboratif pour le développement d’un modèle de langues open source efficace