(De)Toxigen and AdaTest, the new Microsoft tools for more reliable language models

Large language models (LLMs), in addition to being very energy intensive, can reproduce the biases and stereotypes acquired during their training. Microsoft researchers have developed open source tools and datasets to test content moderation systems: (De)ToxiGen and AdaTest. These could lead to more reliable LLMs, models such as OpenAI’s GPT-3 that can parse and generate text with human-like sophistication. The work was presented at the 60th annual meeting of the Association for Computational Linguistics (ACL 2022).

Large language models (LLMs), while adaptable to a wide variety of applications, carry risks because they are trained on vast amounts of human-written text from the Internet. As a result, they can generate inappropriate and harmful language that reproduces the stereotypes conveyed by the authors of those texts. Content moderation tools have been designed to flag or filter such language in certain contexts, but the datasets available to train these tools often fail to capture the complexities of potentially inappropriate and toxic language, particularly hate speech.

(De)ToxiGen: Leveraging large language models to create more robust hate speech detection tools

In an effort to solve this toxicity problem, a team of researchers from Microsoft, MIT, the Allen Institute for AI, Carnegie Mellon University, and the University of Washington developed ToxiGen, a dataset for training content moderation tools to flag malicious language, and published their study, “ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection,” on arXiv.

Toxic language detection systems often incorrectly flag text that merely mentions minority groups as toxic, because these groups are frequent targets of online hate. “Such over-reliance on spurious correlations also causes systems to struggle to detect implicitly toxic language,” according to the researchers. To help mitigate these problems, they created ToxiGen, a new large-scale, machine-generated dataset of 274,000 toxic and benign statements about 13 minority groups.

According to Microsoft, ToxiGen is one of the largest publicly available hate speech datasets.
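
For readers who want to examine the data themselves, the sketch below shows one way to load the statements with the Hugging Face datasets library. The hub ID and config name are assumptions based on the public release, so the repository’s own instructions take precedence.

```python
# A minimal sketch of loading the ToxiGen statements for inspection.
# The hub ID and config name are assumptions based on the public release;
# the dataset may be gated and require accepting its terms of use first.
from datasets import load_dataset

toxigen = load_dataset("toxigen/toxigen-data", name="train")  # hypothetical ID/config

# Each record pairs a machine-generated statement with its target group
# and a toxic/benign label (exact column names may differ).
print(toxigen["train"][0])
```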

Ece Kamar, Partner Research Area Manager at Microsoft Research and project manager for AdaTest and (De)ToxiGen, told TechCrunch:

“We recognize that any content moderation system will have flaws, and these models need to be constantly improved. The goal of (De)ToxiGen is to enable AI system developers to more effectively find risks or problems in any existing content moderation technology. Our experiments show that the tool can be used to test many existing systems, and we look forward to learning from the community about new environments that would benefit from this tool.”

To generate the samples, the researchers fed an LLM examples of neutral and hate speech targeting 13 minority groups, including Black people, Muslims, Asians, Latinos, Native Americans, people with physical and cognitive disabilities, and LGBTQ people. The statements were drawn from existing datasets as well as from news articles, opinion pieces, podcast transcripts, and similar public text sources.

The team demonstrated the limits of AI-based toxicity detection: using statements generated with (De)ToxiGen, they fooled a number of AI-powered content moderation tools, including the content filter used by OpenAI in its Open API (which provides access to GPT-3).
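
As an illustration of this kind of probing, the hedged sketch below runs an off-the-shelf toxicity classifier over neutral statements that merely mention a minority group, the pattern of false positive the researchers highlight. The checkpoint name is a placeholder, not the specific filter tested in the study.

```python
# A hedged sketch of probing a moderation model with (De)ToxiGen-style inputs.
# The checkpoint name is a placeholder, not the filter evaluated in the paper.
from transformers import pipeline

moderator = pipeline("text-classification", model="your-org/toxicity-classifier")  # hypothetical model

# In the study, probes come from (De)ToxiGen generations; neutral sentences
# that simply mention a group are a common source of false positives.
probes = [
    "Many of my neighbors are Muslim, and we often share meals together.",
    "The community center offers sign-language classes for deaf residents.",
]

for text in probes:
    print(text, "->", moderator(text)[0])  # inspect label and score for false positives
```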

The team explained that the statement-creation process for ToxiGen, called (De)ToxiGen, was designed to uncover weaknesses in certain moderation tools by guiding an LLM to produce statements those tools were likely to misidentify. In an evaluation on three human-written toxicity datasets, the team found that starting with a tool and refining it with ToxiGen could “significantly” improve the tool’s performance.
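
What “refining a tool with ToxiGen” can look like in practice is sketched below: continuing the training of an existing classifier on the machine-generated statements. This is not the authors’ exact recipe; the dataset ID, config, and column names are assumptions.

```python
# A minimal sketch (not the authors' training recipe) of refining a toxicity
# classifier with ToxiGen data. Dataset ID, config and column names are assumed.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

base_model = "roberta-base"  # stand-in for whatever moderation model you start from
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=2)

data = load_dataset("toxigen/toxigen-data", name="annotated")  # hypothetical config

def preprocess(batch):
    enc = tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)
    enc["labels"] = batch["label"]  # assumed binary toxic/benign label column
    return enc

tokenized = data.map(preprocess, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="toxigen-refined",
                           num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=tokenized["train"],  # assumed split name
)
trainer.train()
```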

AdaTest: an adaptive testing and debugging process for NLP models inspired by the test debugging cycle in traditional software engineering

The paper “Adaptive Testing and Debugging of NLP Models” was published by Scott Lundberg and Marco Tulio Ribeiro, both principal researchers at Microsoft Research. An adaptive testing and debugging process for NLP models inspired by the test-debug cycle of traditional software engineering, AdaTest promotes a partnership between the user and a large language model (LLM): the LLM proposes tests that the user validates and organizes, and the user in turn provides feedback and guides the LLM toward better tests.

AdaTest, short for “human-AI team approach for adaptive testing and debugging,” debugs a model by having the LLM generate a large number of tests, while a person steers the process by running the valid tests and selecting and organizing them into semantically related topics, as sketched below. The goal is to direct the model toward specific areas of interest and use the tests to fix bugs and retest the model. This last step of the debugging loop is critical, because once the tests are used to repair the model they are no longer test data but training data.
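
The loop can be pictured roughly as follows. This is an illustrative sketch of the workflow described above, not the API of the released adatest package; the functions passed in are stand-ins for the LLM, the human reviewer, and the repair step.

```python
# Illustrative sketch of the AdaTest-style loop (not the adatest package API).
# llm_propose_tests, human_review and repair are stand-ins supplied by the caller.
def adaptive_test_debug_loop(model, llm_propose_tests, human_review, repair,
                             seed_tests, rounds=3):
    test_tree = {"seed": list(seed_tests)}  # tests organized into human-named topics
    for _ in range(rounds):
        # 1. The LLM expands each topic with new candidate tests.
        candidates = {topic: llm_propose_tests(tests) for topic, tests in test_tree.items()}
        # 2. The person validates the candidates and files them under topics.
        test_tree = human_review(test_tree, candidates)
        # 3. Failing tests become training data used to repair the model...
        failures = [t for tests in test_tree.values()
                    for t in tests if not model.passes(t)]
        model = repair(model, failures)
        # 4. ...so they can no longer serve as test data; the next round
        #    generates fresh tests against the repaired model.
    return model, test_tree
```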

Ece Kamar explains:

“AdaTest is a tool that leverages the existing capabilities of large language models to add diversity to human-created seed tests. In particular, AdaTest puts people at the center to initiate and guide test case generation. We use unit tests as a language for expressing appropriate or desired behavior for different inputs. A person can thus create unit tests to express the desired behavior, using different inputs and pronouns… Since current large-scale models vary in their ability to add diversity to all unit tests, there may be cases where automatically generated unit tests need to be reviewed or corrected by people. This is where we benefit from the fact that AdaTest is not an automation tool, but a tool that helps people investigate and identify problems.”
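
What such a unit test might look like is sketched below: one expectation, that a neutral statement should not be flagged, expressed over varied pronouns and groups. The classify() function is a hypothetical stand-in for the system under test.

```python
# A small sketch of "unit tests as a language for desired behavior":
# one expectation checked across varied inputs and pronouns.
# classify() is a hypothetical stand-in for the moderation system under test.
import itertools

TEMPLATE = "I met a {group} doctor yesterday and thanked {pronoun} for the help."
PRONOUNS = ["him", "her", "them"]
GROUPS = ["Muslim", "Latino", "deaf"]

def test_neutral_statements_not_flagged(classify):
    for pronoun, group in itertools.product(PRONOUNS, GROUPS):
        text = TEMPLATE.format(pronoun=pronoun, group=group)
        assert classify(text) == "not_toxic", f"False positive on: {text}"
```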

The research team conducted an experiment to see whether AdaTest helped both experts, with training in ML and NLP, and non-experts write better tests and find more bugs in models. The results showed that experts using AdaTest found on average five times more model errors per minute, while non-experts, who had no programming background, were ten times more successful at finding bugs in a given content moderation model (Perspective API).

ToxiGen and AdaTest, together with their dependencies and source code, are available on GitHub.

Article Sources:
TOXIGEN: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection
Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, Ece Kamar.

ADATEST: Adaptive Testing and Debugging of NLP Models
Scott Lundberg, Marco Tulio Ribeiro, Ece Kamar.

Translated from (De)Toxigen et AdaTest, les nouveaux outils de Microsoft pour des modèles de langage plus fiables