Human languages are so complex (vocabulary, grammar, syntax…) that linguists long thought no machine would ever be able to analyze the sounds of a language and the structure of its words the way linguists do. Yet a team of researchers from Cornell University, MIT, and McGill University has developed an AI system that can learn the rules and patterns of human languages on its own. The study, “Synthesizing theories of human language with Bayesian program induction,” was published in Nature Communications.
The researchers set out to explore AI-driven theory discovery, and they chose human language as their test field. They focused on how linguists construct language-specific theories and synthesize abstract cross-linguistic meta-theories, while proposing links to language acquisition in children. The cognitive sciences of language have indeed drawn an explicit analogy between the scientist constructing grammars of particular languages and the child learning those languages.
Kevin Ellis, Assistant Professor of Computer Science at Cornell University and lead author of the paper, explains:
“One of the motivations for this work was our desire to study systems that learn patterns of data sets that are represented in a way that humans can understand. Instead of learning weights, can the model learn expressions or rules? And we wanted to see if we could build this system to learn about a whole battery of interdependent data sets, so that the system would learn a little bit about how to best model each of them.”
The choice of human language
Natural language is an ideal domain in which to study theory discovery, for several reasons:
- Decades of work in linguistics, psycholinguistics, and the other cognitive sciences of language provide diverse raw material for developing and testing models of automated theory discovery. Corpora, datasets, and grammars are available for a wide variety of typologically distinct languages, providing a rich and varied testbed for comparing theory-induction algorithms;
- Children acquire language from modest amounts of data compared with today’s AI systems, and field linguists likewise develop grammars from very small amounts of elicited data. These facts suggest that the child-as-linguist analogy is productive and that inducing theories of language from sparse data is tractable given the right inductive biases;
- Finally, theories of language representation and learning are formulated in computational terms, exposing a set of formalisms ready to be deployed by AI researchers.
These three characteristics of human language (the availability of many highly diverse empirical targets, interfaces with cognitive development, and computational formalisms within linguistics) led the researchers to choose language as a target for automated theory-induction research.
A Bayesian program learning model
Linguistics aims to understand the general representations, processes, and mechanisms by which people learn and use language, not just to catalog and describe particular languages. To capture this framing of the theory-induction problem, the researchers adopted the Bayesian program learning (BPL) paradigm. They built the model using Sketch, a program synthesizer developed at MIT by Armando Solar-Lezama.
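The core idea behind BPL can be caricatured as follows: each candidate theory is a small program, and the best theory is the one that maximizes a posterior combining a simplicity prior with a likelihood over the data. Below is a minimal, hypothetical sketch of that trade-off; the rule space, data, and scoring are invented for illustration and bear no relation to the paper’s actual system or to Sketch.

```python
import math

# Caricature of Bayesian program learning: each "theory" is a tiny program
# (here, a single plural-suffixation rule), scored by a description-length
# prior plus a log-likelihood rewarding rules that explain the observed data.
# Data and candidate rules are hypothetical illustrations.

data = [("dog", "dogs"), ("cat", "cats"), ("map", "maps")]

def prior_logp(rule: str) -> float:
    # Shorter rules get higher prior probability (a description-length prior).
    return -len(rule) * math.log(27)

def likelihood_logp(rule: str, pairs) -> float:
    # Deterministic rule: log-probability 0 if it predicts every pair, else -inf.
    ok = all(stem + rule == form for stem, form in pairs)
    return 0.0 if ok else float("-inf")

candidates = ["s", "es", "z", "ses"]
best = max(candidates, key=lambda r: prior_logp(r) + likelihood_logp(r, data))
print(best)  # "s": the only candidate consistent with all pairs
```

In a real BPL system the search is over far richer program spaces and the likelihood is probabilistic rather than all-or-nothing, but the prior-versus-fit structure is the same.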
They focused on theories of natural language morphophonology, the area of language governing the interaction of word formation and sound structure.
The team evaluated the BPL model on 70 datasets covering the morphophonology of 58 languages. The datasets come from phonology textbooks: although linguistically diverse, they are much simpler than full language learning, each containing no more than about 100 words and isolating only a handful of grammatical phenomena. Given words and examples of how those words change to express different grammatical functions (such as tense, case, or gender), the model proposes rules that explain why the forms of those words change.
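To make the task concrete, here is a toy version of this kind of problem: given singular/plural pairs, search a tiny space of sound-conditioned suffix rules for one that explains all the data. The language data and rule space are invented for illustration; the actual model searches a vastly larger space of phonological rules.

```python
# Toy illustration of the task: find a context-conditioned suffix rule
# that accounts for every observed singular/plural pair.
# All data and candidate rules here are hypothetical simplifications.

pairs = [("bus", "buses"), ("dish", "dishes"), ("cat", "cats"), ("dog", "dogs")]

def make_rule(sibilant_suffix: str, default_suffix: str):
    """Build a rule: one suffix after sibilant-final stems, another otherwise."""
    def rule(stem: str) -> str:
        sibilant_final = stem[-1] in "sxz" or stem.endswith(("sh", "ch"))
        return stem + (sibilant_suffix if sibilant_final else default_suffix)
    return rule

suffixes = ["s", "es", "z"]
for sib in suffixes:
    for dflt in suffixes:
        rule = make_rule(sib, dflt)
        if all(rule(stem) == plural for stem, plural in pairs):
            print(f"plural = stem + '{sib}' after sibilants, else stem + '{dflt}'")
```

Only the rule “add ‑es after sibilants, otherwise add ‑s” survives the data, which is the flavor of conditioned alternation that morphophonology studies.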
The conclusions of the study
The model was able to propose a correct set of rules describing these form changes for 60% of the problems. It could be used to study linguistic hypotheses and to investigate similarities in how different languages transform words.
According to the researchers, humans deploy their theories far more richly than their model does: they propose new experiments to test theoretical predictions, design new tools based on a theory’s conclusions, and distill higher-level knowledge that goes well beyond what their “Fragment-Grammar” approximation can do. Continuing to push theory induction along these many dimensions remains a prime target for future research.
“Synthesizing theories of human language with Bayesian program induction”, Nature Communications, https://doi.org/10.1038/s41467-022-32012-w
- Kevin Ellis, Assistant Professor of Computer Science at Cornell University;
- Adam Albright, Professor of Linguistics, MIT;
- Armando Solar-Lezama, Professor and Associate Director of the Computer Science and Artificial Intelligence Laboratory (CSAIL), MIT;
- Joshua B. Tenenbaum, Paul E. Newton Career Development Professor of Cognitive Science and Computation in the Department of Brain and Cognitive Sciences and member of CSAIL, MIT;
- Timothy J. O’Donnell, Assistant Professor in the Department of Linguistics at McGill University and Canada CIFAR AI Chair at Mila – Quebec Artificial Intelligence Institute.