A research team at Google Brain has revisited multilayer perceptrons (MLPs) by designing MLP-Mixer. This no-frills model approaches state-of-the-art performance on ImageNet classification, with results comparable to systems such as ViT (Vision Transformer), BiT (Big Transfer), HaloNet and NF-Net. In the future, it is quite possible that the simplest multilayer neural networks could rival the most advanced current architectures.
A study to exploit multilayer perceptrons for image classification and computer vision
Convolutional neural networks currently excel in image processing and computer vision because they are designed to discern spatial relationships: pixels that are close together in an image tend to be more related than pixels that are far apart. MLPs do not have this bias, so they tend to pick up interpixel relationships that exist but are not needed for image processing. Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov and Lucas Beyer, along with colleagues at Google Brain, came up with the idea of modifying MLPs so that they process and compare images patch by patch rather than analyzing each pixel individually. They designed MLP-Mixer to let MLPs work this way. Their work was published in a paper entitled "MLP-Mixer: An all-MLP Architecture for Vision". MLPs are the simplest "building blocks" of deep learning, and this work shows that they can compete with the most powerful architectures for image classification.
How was MLP-Mixer designed and how does it work?
The authors pre-trained MLP-Mixer for image classification on ImageNet-21k, which contains 21,000 classes, and fine-tuned it on ImageNet, which has 1,000 classes. Given an image divided into patches, MLP-Mixer uses an initial linear layer to generate 1,024 representations of each patch. It stacks the representations in a matrix so that each row contains all the representations of one patch and each column contains one representation of every patch. The model then applies a series of mixer layers. Each mixer layer contains two MLPs, and each MLP consists of two fully connected layers. Given a matrix, a mixer layer uses one MLP to mix the representations in the columns (which the authors call token mixing) and another to mix the representations in the rows (which the authors call channel mixing). This produces a new matrix that is passed to the next mixer layer. Finally, a softmax layer produces the classification. [caption id="attachment_31422" align="aligncenter" width="770"]
MLP-Mixer consists of per-patch linear embeddings, several mixer layers and a classifier head. Each mixer layer contains one MLP that mixes tokens and another that mixes channels, and each of these MLPs is composed of two fully connected layers.[/caption]
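The pipeline described above (patches, linear embedding, token mixing over columns, channel mixing over rows, softmax) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: all function names are hypothetical, the weights are random and untrained, and the sizes are tiny so that it runs instantly. The GELU nonlinearity, the layer normalization, the skip connections and the averaging over patches before the classifier are assumptions added to make the sketch complete; the article does not mention them.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU (assumed nonlinearity; the article names none).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-6):
    # Normalize each row to zero mean and unit variance (assumption, no learned scale).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def mlp(x, w1, b1, w2, b2):
    # An MLP made of two fully connected layers, as described in the article.
    return gelu(x @ w1 + b1) @ w2 + b2

def init_mlp(rng, dim, hidden):
    # Small random weights; the 0.02 scale is an arbitrary choice for the sketch.
    return (rng.normal(0.0, 0.02, (dim, hidden)), np.zeros(hidden),
            rng.normal(0.0, 0.02, (hidden, dim)), np.zeros(dim))

def image_to_patches(image, patch):
    # Split an (H, W, C) image into non-overlapping patches, one flattened patch per row.
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    x = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)

def mixer_layer(x, token_params, channel_params):
    # x has shape (patches, channels): one row per patch, one column per channel.
    # Token mixing: the MLP runs along each column, i.e. across patches.
    x = x + mlp(layer_norm(x).T, *token_params).T
    # Channel mixing: the MLP runs along each row, i.e. across channels.
    x = x + mlp(layer_norm(x), *channel_params)
    return x

def init_params(rng, n_patches, patch_dim, channels, n_layers, n_classes, hidden):
    return {
        "embed_w": rng.normal(0.0, 0.02, (patch_dim, channels)),
        "embed_b": np.zeros(channels),
        "layers": [(init_mlp(rng, n_patches, hidden), init_mlp(rng, channels, hidden))
                   for _ in range(n_layers)],
        "head_w": rng.normal(0.0, 0.02, (channels, n_classes)),
        "head_b": np.zeros(n_classes),
    }

def mixer_forward(image, params, patch):
    # Initial linear layer: one row of channel values per patch.
    x = image_to_patches(image, patch) @ params["embed_w"] + params["embed_b"]
    for token_params, channel_params in params["layers"]:
        x = mixer_layer(x, token_params, channel_params)
    pooled = x.mean(axis=0)  # average over patches (assumption)
    logits = pooled @ params["head_w"] + params["head_b"]
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

# Toy run: a random 32x32 RGB "image", 8x8 patches (16 patches), 64 channels.
rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
params = init_params(rng, n_patches=16, patch_dim=8 * 8 * 3, channels=64,
                     n_layers=2, n_classes=10, hidden=128)
probs = mixer_forward(image, params, patch=8)
print(probs.shape, float(probs.sum()))
```

The transpose in `mixer_layer` is the whole trick: the same kind of two-layer MLP is applied once along the patch axis and once along the channel axis. With the sizes quoted in the article, `channels` would be 1,024 rather than 64.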