Lanfrica, NLP applied to African languages - Interview with Bonaventure Dossou and Chris Emezue

Breaking down language barriers with data and AI and strengthening access to information in low-resource languages are two central issues for our societies. At VivaTech last June, UNESCO organized a startup competition on AI for human rights, in which one of the challenges addressed precisely these issues. The winner was the startup Lanfrica, created by Bonaventure Dossou and Chris Emezue, both students in Germany and research interns at Mila. Lanfrica is an online platform that catalogues existing research (completed and ongoing), results, benchmarks and projects on African languages and presents them in a user-friendly way. It aims to cover all African languages through an open source online database that provides quick and easy access to existing research and results in natural language processing and machine translation for these languages. The database is populated through a participatory, community-driven approach.

How did you meet? What is your common background before starting the Lanfrica project?

We met during our undergraduate studies in Russia, where we both earned our B.A. in mathematics. We soon discovered a shared passion for new technologies, especially artificial intelligence.

Our goal was already anchored in our minds: to actively contribute to the development of Africa. While we were still students, we started to publish articles, mainly on NLP and African languages, and to participate in international conferences in the field of AI. The idea for the Lanfrica project initially came from Chris, while we were in our room thinking about issues related to "AfricaNLP", that is, NLP for African languages and dialects.

What made you want to launch the Lanfrica project and what is its objective?

It's quite simple: we want to unify and connect all the resources of all African languages. Why? Because when we got into NLP research and started working to put Africa on the global NLP map, we discovered that the first major obstacle to this goal is access to data and resources. You simply can't do anything without them. The second major obstacle is the lack of documentation and of existing work. This matters because building on prior work is what drives scientific research forward: developing better theories than the previous ones and obtaining better results based on earlier experimental results. For example, for the Fon-French translation engine we designed, we unfortunately did not find any previous work that we could build on.

Lanfrica aims to be a solution that brings together all the advances in research involving African languages, allowing easy access to datasets and articles for as many African languages as possible, if not all. We have proposed a clear and simple procedure in section 2 of our paper on Lanfrica.

How is the Lanfrica project structured? What techniques did you use to retrieve the data needed to complete it?

We started by organizing, structuring and using, in an automated way, the large amount of information available on the internet. This gave us a base to which we could add more data. One of the techniques we use to add more data is a system in which users can recommend experts (researchers, scientists, etc.) who have written a publication or worked on a project involving one or more African languages. We believe that this process, which is unique to Lanfrica, will involve both researchers and users in the cataloguing and allow the platform to remain self-sustaining.

Apart from Lanfrica, we have many other projects: the Fon-French translation engine, the Fon and Igbo automatic speech recognition systems, the multilingual automatic translator for African languages, etc. These projects are of course complementary, but they are distinct and have different levels of complexity.

Will other researchers be able to use these datasets in their projects?

All the datasets we use in our projects are open source (unless licensing constraints prevent us from making them available to the greatest number of people). So, as far as Lanfrica is concerned, any researcher can use these datasets or any other resource related to the project. The only thing we ask, and which we believe is very important in scientific disciplines, is that our publications be cited when Lanfrica is used.

Do you have any other projects related to linguistics or more generally to AI?

Yes, many. At the moment, as we mentioned earlier, we are working on the Fon-French translation engine and on the Fon and Igbo automatic speech recognition systems, for which we recently launched the "Okwugbé" Python library, which will allow anyone to train their own ASR system on any language they want.

We are also working on Named Entity Recognition (NER) for African languages with an incredible community of researchers and the Masakhane family, and on the Multilingual Automatic Translator for African languages, focusing on six African languages (Fon, Igbo, Yoruba, Swahili, Xhosa and Kinyarwanda). We have many other projects in mind, but nothing concrete yet.

What are your ambitions, your short, medium and long term research objectives?

Bonaventure Dossou: In the future, I want to build more tools for African languages and for Africa more generally, in as many fields as possible. I am increasingly interested in machine learning, deep learning and reinforcement learning. I wish to better understand the mathematical concepts behind learning methods, with the aim of improving them and, why not, eventually inventing an innovative learning method.

More generally, my ambition is to be at the top, to be a pioneer in the development of technology and AI in Africa, and this involves pursuing a PhD in the field. I am aware that there is still a long way to go, but I am sure that with willpower, combined with the desire to learn, this path is within reach.

On a more personal level, a goal that is close to my heart is to build a digital assistant for my family and my children so that they can continue to learn and interact in their mother tongue, Fon. As for future generations, I wish to be an essential example, a solid reference that can uplift them, motivate them and inspire them to pursue their dreams.

Chris Emezue: In terms of my short-term goals, I want to improve my abilities in the field of machine learning and more generally in AI. I want to understand the mathematical concepts behind it. I believe that a good understanding of the basics of machine learning, NLP, and programming will be very useful in my future career.

As for my mid-term goals, I plan to get my PhD after my master's degree and gain industry experience in machine learning. I would also like to work on machine learning and its applications in health.

In the long term, I want to be an entrepreneur in the field, in addition to being a researcher in machine learning. I would like to build tools and useful products for Africa, but also for the rest of the world.

Has a recent research project, a start-up project or an AI breakthrough caught your attention?

Bonaventure Dossou: For me, healthcare coupled with machine learning is simply the future. I don't know how to tell you this, but machine learning coupled with biology sounds so beautiful (laughs). Having said that, many startups are created every day in the AI field. Undeniably, each one is unique and seems to be doing well, although I think a collaboration or some sort of larger coalition could help improve the industry and make these startups grow faster.

I am particularly interested in drug discovery and health more broadly. There are many startups and companies that specialize in this sector and are leveraging the capabilities of AI to improve methods and processes. Some examples that come to mind are Mila, Roche, Modelis or Speeqo.

Chris Emezue: In terms of research areas and projects, I'm particularly interested in work that blends artificial intelligence and healthcare. I think healthcare needs to find ways to improve the health of patients, because that's a very important issue.

At the same time, I'm very interested in research related to reinforcement learning and the causal situations it can create. I really believe that understanding these causal situations and finding ways to model them could help solve some of the problems that the current machine learning setup is facing. I'm thinking, for example, of out-of-distribution generalization in ML.

Translated from Lanfrica, le TAL appliqué aux langues africaines - Entretien avec Bonaventure Dossou et Chris Emezue