Facebook updates Dynabench, its NLP model evaluation platform, with Dynaboard

This week, Facebook introduced a new feature called Dynaboard, which enables comprehensive, standardized evaluation of natural language processing (NLP) models. The tool extends Dynabench, the company's AI benchmarking platform already specialized in NLP models.

Dynabench, the platform for benchmarking AI systems
Dynabench is a tool for benchmarking AI models, and NLP models in particular. It builds challenging test sets and submits them to different systems using a technique called "dynamic adversarial data collection". With this platform, users can track the quality of an AI model as it evolves.
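Concretely, the collection loop pairs human annotators with the model under test: annotators try to write examples that fool the model, and only the model-fooling examples are kept for the next round's test set. The sketch below is a minimal, hypothetical illustration of that loop; `model_predict` and `ask_annotator` stand in for the real model API and the human annotation interface that Dynabench provides on its platform.

```python
# Minimal sketch of one round of dynamic adversarial data collection.
# `model_predict` and `ask_annotator` are hypothetical stand-ins for
# the model under test and the human annotator interface.

def collect_adversarial_examples(model_predict, ask_annotator, n_attempts=100):
    """Keep only the human-written examples that the current model gets wrong."""
    adversarial_set = []
    for _ in range(n_attempts):
        # The annotator writes an input and the label they believe is correct.
        text, gold_label = ask_annotator()
        predicted_label = model_predict(text)
        if predicted_label != gold_label:
            # The model was fooled: this example goes into the new test set.
            adversarial_set.append({"text": text, "label": gold_label})
    return adversarial_set
```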

According to Facebook, Dynabench provides better indicators of an algorithm's reliability than other benchmarking techniques. Currently, the platform is available to AI experts and researchers who want to test their systems, for example in a lab setting.

To meet this demand, Facebook has developed Dynaboard, a dashboard that enriches Dynabench.

The main contribution of Dynaboard: the Dynascore
Dynaboard introduces a brand-new metric called the Dynascore, which combines several evaluation axes into a single result that reflects the overall quality of an NLP model.

Here are the evaluation axes taken into account by Dynascore:

  • The accuracy of the model: Dynaboard measures how well the model solves the task it is asked to perform.
  • The computational efficiency of the model: Dynascore takes into account the number of examples the model can process per second in the cloud.
  • Memory usage: the amount of memory a model requires, measured in gigabytes of total usage and averaged over a defined period of several seconds.
  • Robustness: the stability of the model's predictions when the input is perturbed, for example with typographical errors or paraphrases.
  • Fairness: a test can swap the gender in a sentence, i.e. change from feminine to masculine and vice versa, or replace a person's name with one from another culture. The model is considered "fair" if its predictions remain stable after these changes (see the sketch after this list).
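As an illustration of the last two axes, one can apply a small perturbation to an input (for example, swapping gendered words) and check whether the model's prediction changes. The snippet below is a minimal sketch of that idea, not Dynaboard's actual implementation; `classify` is a placeholder for whichever NLP model is being evaluated, and the word list is deliberately naive.

```python
# Simplified sketch of a fairness/robustness stability check:
# perturb the input and verify the prediction does not change.
# `classify` is a placeholder for the model under evaluation.

GENDER_SWAPS = {"he": "she", "she": "he", "his": "her", "her": "his",
                "him": "her", "man": "woman", "woman": "man"}

def swap_gender(sentence: str) -> str:
    """Replace gendered words with their counterparts (naive word-level swap)."""
    return " ".join(GENDER_SWAPS.get(w.lower(), w) for w in sentence.split())

def is_stable(classify, sentence: str) -> bool:
    """A model is 'fair' on this example if the label survives the swap."""
    return classify(sentence) == classify(swap_gender(sentence))
```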

All of these axes can be adjusted in Dynaboard, which provides a complete table of the data collected during the evaluation of a model, as shown in the image below:

[Figure: Dynaboard leaderboard showing NLP models scored on accuracy, fairness, robustness, memory, and compute]
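Since the weights on these axes are adjustable, the overall score can be thought of as a weighted aggregate of the per-axis measurements. The sketch below illustrates that general idea only; the axis names, normalizations, and default weights are illustrative assumptions, not Facebook's actual Dynascore formula.

```python
# Illustrative sketch of combining several evaluation axes into one score
# with user-adjustable weights. Names and weights are examples, not the
# actual Dynascore formula.

DEFAULT_WEIGHTS = {
    "accuracy": 0.5,       # fraction of examples solved correctly
    "throughput": 0.125,   # examples processed per second (normalized to 0-1)
    "memory": 0.125,       # inverse of average memory use (normalized to 0-1)
    "robustness": 0.125,   # stability under typos / paraphrases
    "fairness": 0.125,     # stability under gender / name swaps
}

def aggregate_score(metrics: dict, weights: dict = DEFAULT_WEIGHTS) -> float:
    """Weighted average of normalized (0-1) per-axis metrics."""
    total_weight = sum(weights.values())
    return sum(weights[axis] * metrics[axis] for axis in weights) / total_weight

# Example: re-weighting the axes changes how models trade off against each other.
model_metrics = {"accuracy": 0.91, "throughput": 0.60, "memory": 0.70,
                 "robustness": 0.85, "fairness": 0.80}
print(aggregate_score(model_metrics))
```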

How was Dynaboard tested?
Dynaboard improves the conditions under which an NLP model is evaluated. In turn, the feature was itself put through experiments to see whether it was really capable of performing this task. To do so, Facebook used the platform to rank well-known NLP models considered among the most effective, such as BERT, RoBERTa, ALBERT, T5 and DeBERTa. These systems are generally the top five models on another benchmark, GLUE.

After calculating the Dynascore for each model with Dynaboard, the researchers found that the GLUE ranking was largely preserved, with a few differences. These tests used only four of the five evaluation axes that make up the Dynascore, but even when the fifth is added, DeBERTa remains the best-ranked model.

In the future, Facebook wants to open Dynabench to all programmers, professional or amateur, so that anyone can submit their own models for evaluation. The company hopes this will contribute to a general improvement in natural language processing models.

Translated from Facebook met à jour Dynabench, sa plateforme d’évaluation de modèles NLP, avec Dynaboard