The arXiv article corpus is now available on Kaggle


For nearly 30 years, arXiv has been providing the research community and the public with access to scientific articles in a wide variety of fields, including computer science, artificial intelligence, physics, mathematics, statistics, electrical engineering, quantitative biology and economics. The arXiv corpus is now also available on Kaggle, as arXiv announced on its blog.

The sheer number of arXiv research papers is both a benefit and a challenge. Whether for a graduate student looking to deepen her knowledge of her field, an established professor exploring adjacent fields, or researchers seeking a global overview, this rich body of information offers significant, yet sometimes overwhelming, depth.

To help make arXiv more accessible, the organization presented an open and free pipeline on Kaggle to the machine-readable arXiv dataset: a repository of 1.7 million articles, with relevant features such as article titles, authors, categories, abstracts, full-text PDFs, and more.
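As a rough illustration of what "machine-readable" can mean in practice, the sketch below parses article records stored as one JSON object per line. The layout and the field names (id, title, authors, categories, abstract) are assumptions based on the features listed above, not a confirmed description of the Kaggle file, and the two sample records are made up.

```python
import io
import json

# Two made-up records standing in for the real file. The field names
# (id, title, authors, categories, abstract) are assumptions.
SAMPLE = io.StringIO(
    '{"id": "0000.00001", "title": "An example\\n  title", '
    '"authors": "A. Author, B. Author", "categories": "cs.LG stat.ML", '
    '"abstract": "A placeholder abstract."}\n'
    '{"id": "0000.00002", "title": "Another example", '
    '"authors": "C. Author", "categories": "math.CO", '
    '"abstract": "Another placeholder."}\n'
)


def iter_records(lines):
    """Yield one dict per non-empty line of a JSON-lines stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def summarize(record):
    """Keep a few fields, normalize title whitespace, split the categories."""
    return {
        "id": record["id"],
        "title": " ".join(record["title"].split()),
        "categories": record["categories"].split(),
    }


records = [summarize(r) for r in iter_records(SAMPLE)]
for rec in records:
    print(rec["id"], rec["categories"], rec["title"])
```

With a real download, an open file handle would take the place of SAMPLE. Reading line by line keeps memory use low, which matters for a corpus of this size.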

“Having the entire arXiv corpus on Kaggle greatly increases the potential of arXiv articles,” said Eleonora Presani, Executive Director of arXiv. “By offering the Kaggle dataset, we go beyond what humans can learn from reading all these articles and make the data and information behind arXiv available to the public in a machine-readable format.”

Kaggle is a destination for data scientists and machine learning engineers looking for interesting data sets, public notebooks, etc. Researchers can use Kaggle’s extensive data mining tools and easily share their relevant scripts and results with others.

“arXiv is more than a repository of articles; it is a knowledge-sharing platform,” said Eleonora Presani. “This requires constant innovation in the way we present and interpret the knowledge we make available. Kaggle users can help push the boundaries of this innovation, and it can be a new way for our community to collaborate.”

“With large datasets, it is generally expected that some discoveries, connections, tools or insights have been overlooked. Uncovering them can lead to additional information, not only on the original subject but also in other fields of study, allowing even more discovery and innovation,” said Steinn Sigurdsson, Scientific Director of arXiv.

arXiv’s Call to action

Our hope is to enable new use cases that can lead to the exploration of richer machine learning techniques that combine multimodal features, with applications such as trend analysis, article recommendation engines, category prediction, co-citation networks, knowledge graph construction and semantic search interfaces.

An example of such a semantic search application built on a specific corpus is Google’s COVID-19 Research Explorer, a tool that helps researchers browse the CORD-19 dataset, a repository of over 190,000 scientific papers on COVID-19. Interfaces built on datasets such as this one use advanced natural language understanding (NLU) techniques to infer the intent behind a user’s query. Ultimately, this can enable more efficient searching by bringing to light data and evidence relevant to complex scientific questions. We hope that the publication of the arXiv machine-readable dataset will inspire the creation of similar NLU tools on this new corpus.

Alex Alemi, a senior researcher at Google, has also pursued exciting ML applications using arXiv. As described in the article on using arXiv as a dataset, Alex and his colleagues have sought to establish arXiv as a benchmark for large-scale multi-relational tasks such as graph neural networks. “I am delighted to see the research community take up the challenge of a rich and multifaceted dataset with such real-world practicality, and the new questions it will raise,” says Alex.
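Of the use cases named above, trend analysis is probably the simplest to sketch. The example below counts articles per year and primary category. It assumes each record carries an update_date field in YYYY-MM-DD form and a space-separated categories field with the primary category first. Both field names are assumptions, the records are made up, and an update date is not necessarily the original submission date.

```python
from collections import Counter

# Made-up records. "update_date" (YYYY-MM-DD) and "categories"
# (space-separated, primary category first) are assumed field names.
sample_records = [
    {"update_date": "2018-03-01", "categories": "cs.LG stat.ML"},
    {"update_date": "2019-07-15", "categories": "cs.LG"},
    {"update_date": "2019-11-02", "categories": "cs.LG cs.CV"},
    {"update_date": "2019-05-20", "categories": "math.CO"},
]


def yearly_counts(records):
    """Count records per (year, primary category) pair."""
    counts = Counter()
    for rec in records:
        year = int(rec["update_date"][:4])
        primary = rec["categories"].split()[0]
        counts[(year, primary)] += 1
    return counts


counts = yearly_counts(sample_records)
for (year, category), n in sorted(counts.items()):
    print(year, category, n)
```

The same counting approach extends to other groupings, for example per month or per full category list, once the actual field layout of the dataset has been checked.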

Access

The dataset is now available on Kaggle and will be updated weekly.

Translated from Le corpus d’articles arXiv est désormais disponible sur Kaggle