Only about 20 percent of the biographies on the English-language Wikipedia site, one of the world’s most visited websites, are of women, according to the Wikimedia Foundation. As part of her doctoral project in computer science at the University of Lorraine, within the National Institute for Research in Digital Science and Technology (INRIA), Angela Fan worked with her thesis supervisor, Claire Gardent, to develop a new solution that would address this imbalance through artificial intelligence.
Gender bias is one of the most widespread and insidious forms of inequality. For example, the English-language Wikipedia contains over 1.5 million biographies of notable writers, inventors, and scholars, but fewer than 19% of these biographies are about women. Even so, a quarter of the biographies submitted for deletion each month are about women. Although women have had a considerable impact throughout history in science, business, politics, and all other areas of our society, they are either overlooked or underrepresented.
Angela Fan, a researcher at Meta AI, has open-sourced an end-to-end AI model that automatically creates high-quality biographical articles on prominent public figures.
The AI model that generates biographies
Angela Fan and Claire Gardent begin the process of generating a biography with a retrieval-augmented generation architecture, which relies on large-scale pre-training and teaches the model to identify only useful information, such as the subject’s birthplace or where the person went to school, as it builds the biography.
The model first retrieves relevant information from the Internet to introduce the subject. Next, the generation module creates the text. In the third step, the citation module builds the bibliography, referring to the sources that were used. The process then repeats, with each section predicting the next, covering all the elements that make up a robust Wikipedia biography, including the subject’s early life, education and career.
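The three steps described above can be illustrated with a minimal Python sketch. It is not the authors’ implementation: the corpus, the URLs, the word-overlap ranking and the function names `retrieve`, `generate`, `cite` and `write_biography` are all hypothetical stand-ins, and the list of section headings is fixed in advance rather than predicted by the model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Section:
    heading: str
    text: str
    citations: List[str] = field(default_factory=list)


# Hypothetical toy corpus standing in for web search results.
TOY_CORPUS: Dict[str, str] = {
    "https://example.org/a": "Jane Doe was born in Lyon and studied physics in Paris.",
    "https://example.org/b": "Jane Doe led a research laboratory and published work on optics.",
    "https://example.org/c": "The city of Lyon hosts an annual festival of lights.",
}


def retrieve(subject: str, heading: str, corpus: Dict[str, str],
             top_k: int = 2) -> List[Tuple[str, str]]:
    """Step 1: keep the top_k documents sharing the most words with the query."""
    query = set(f"{subject} {heading}".lower().split())
    scored = []
    for url, text in corpus.items():
        words = set(text.lower().replace(".", "").split())
        overlap = len(query & words)
        if overlap:  # drop documents with no shared words at all
            scored.append((overlap, url, text))
    # Highest overlap first; ties broken by URL so the order is deterministic.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(url, text) for _, url, text in scored[:top_k]]


def generate(heading: str, evidence: List[Tuple[str, str]]) -> str:
    """Step 2: placeholder generator that simply joins the evidence.
    A real system would call a pre-trained sequence-to-sequence model here."""
    return " ".join(text for _, text in evidence)


def cite(evidence: List[Tuple[str, str]]) -> List[str]:
    """Step 3: list the sources the section was built from."""
    return [url for url, _ in evidence]


def write_biography(subject: str, headings: List[str],
                    corpus: Dict[str, str]) -> List[Section]:
    """Repeat retrieve -> generate -> cite once per section."""
    sections = []
    for heading in headings:
        evidence = retrieve(subject, heading, corpus)
        sections.append(Section(heading, generate(heading, evidence), cite(evidence)))
    return sections


if __name__ == "__main__":
    for section in write_biography("Jane Doe", ["Early life", "Career"], TOY_CORPUS):
        print(section.heading, "|", section.text, "|", section.citations)
```

Keeping the citation list tied to the evidence that was actually passed to the generator is what makes each section traceable to its sources in this sketch.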
Information is generated section by section, using a caching mechanism similar to Transformer-XL, to refer to existing sections and achieve a higher degree of contextualization at the document level. Caching is essential because it allows the model to keep better track of what it has already produced.
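The idea of carrying earlier sections forward can be sketched in Python as follows. This is only a text-level analogy under stated assumptions: Transformer-XL caches hidden states from previous segments, whereas this toy version stores raw tokens, and the names `SectionCache`, `build_input`, the `[SEP]` marker and the cache size are hypothetical.

```python
from collections import deque
from typing import List


class SectionCache:
    """Bounded memory of the most recent tokens from sections already written."""

    def __init__(self, max_tokens: int = 8) -> None:
        # A deque with maxlen silently discards the oldest tokens when full.
        self.tokens = deque(maxlen=max_tokens)

    def update(self, section_text: str) -> None:
        """Add a finished section to the memory."""
        self.tokens.extend(section_text.split())

    def context(self) -> List[str]:
        """Return the cached tokens, oldest first."""
        return list(self.tokens)


def build_input(cache: SectionCache, evidence: str) -> List[str]:
    """Prepend cached context to the new evidence (assumed input layout)."""
    return cache.context() + ["[SEP]"] + evidence.split()


if __name__ == "__main__":
    cache = SectionCache(max_tokens=5)
    cache.update("Jane Doe was born in Lyon")
    print(build_input(cache, "She studied physics"))
```

Bounding the cache keeps the input length fixed no matter how many sections have been produced, at the cost of forgetting the oldest text.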
Evaluation teams found that 68% of the text generated in biographies was not in the reference corpus and was only partially verifiable. A major problem was the lack of training data, that is, of existing biographical articles on women. Moreover, articles about women, especially those from marginalized groups, are significantly shorter than the average article about men, are less detailed and use different language. For example, they refer to a “female scientist” instead of simply saying “scientist.” The models internalized this bias in the training data. In addition, Wikipedia articles are written from factual sources, often found on the web, rather than from verified sources.
Diversifying representation on Wikipedia
According to Angela Fan, this model only partially solves a multidimensional problem and there are still other areas where new technologies should be explored.
Furthermore, some sources have a bias that needs to be taken into account. Biographies about women often include details about their personal lives, such as being divorced, that are irrelevant and distract from the achievements that should be highlighted.
Meta points out:
“There is still work to be done for other marginalized and intersectional groups around the world and in all languages. Our assessment and dataset focus on women, which excludes many other groups, including non-binary people.”
Angela Fan concludes:
“We are driven by the desire to share this important area of research with the broader community of researchers in the field of AI generation. We hope that our techniques can be used as a starting point for people enriching Wikipedia content with their articles, and that they will improve the equity of online information available to students writing biographies, and many others.”