Meta unveils MultiRay, the platform for optimizing large-scale AI models

MultiRay is currently used in more than 125 use cases at Meta and supports up to 20 million requests per second, or 800 billion per day. Its first model, TextRay, supports text analytics applications that remove misleading content and improve the search experience on Meta's platforms.

In today's AI systems for processing text, images, and other modalities, the best results come from a large model trained on a huge amount of data and then specialized for a very specific task (for example, identifying hate speech). The result is a high-quality solution, but a particularly expensive and limited one: the more problems there are to solve, the more models are needed, and those models can quickly turn into money pits. In practice, this means the latest large-scale models are rarely used in production, in favor of smaller, simpler ones.

MultiRay's universal models are trained to perform well across a wide variety of domains and tasks. These general-purpose models deliver better results than the task-specific models teams used previously. With MultiRay, Meta's teams can quickly refine machine learning models and reuse them for different purposes, such as thematic tagging of posts or hate speech detection, with far less duplicated effort than if each team had to develop large end-to-end models from scratch.

TextRay, MultiRay's very first model, has been in development since 2020. Among other things, it supports text analytics applications that remove misleading content and improve the search experience for users on Meta's platforms.

More modalities equals more problems

That's a good start, but the real world is much more complex, because it mixes many modalities. A Facebook post, for example, may contain text, images, and video. To fully understand a post, a system must analyze each of these elements both separately and together. Doing so means combining several models into a single larger, more advanced one, and the resulting increase in compute and energy consumption slows our efforts to deploy the most advanced machine learning models in our products and services.

MultiRay's second model, PostRay, combines text and image understanding in a single model. Since most posts on Facebook and Instagram contain both text and images, PostRay reduces the need for each team to build its own text and image understanding. It has a variety of use cases at Meta, including topic classification for Reels.

PostRay models are more difficult to train, deploy, and maintain because they combine state-of-the-art research from several domains at once. With MultiRay, this work only needs to be done once for the entire company to benefit. This centralized, general-purpose system also lets us work directly with leading research teams and put their work into production soon after it is published.

How MultiRay works

The primary objective of MultiRay is to make large foundation models easily accessible across Meta. It achieves this by centralizing execution on accelerators such as GPUs and by using a cache to avoid as much recomputation as possible. MultiRay currently serves over 125 use cases at Meta and supports up to 20 million requests per second, or 800 billion per day.

What are embeddings?

MultiRay uses large foundation models that return a point in a high-dimensional vector space representing the input. This point, called an embedding, is a version of the original input that is much better suited to machine learning. Instead of processing the raw input (such as text or images), task-specific models can reuse MultiRay's embedding, which is far easier to work with. The foundation models deployed in MultiRay are designed to perform well across different types of tasks, such as similarity and classification. This universality makes our embeddings relatively large (many kilobytes) so that they can convey more information.
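
As a rough illustration of this pattern, the sketch below shows a small task-specific model reusing a shared embedding instead of the raw input. The client class and its embed method are hypothetical stand-ins, not MultiRay's actual API:

```python
import numpy as np

# Hypothetical stand-in for the centralized embedding service; the real
# MultiRay API is internal to Meta and not published.
class MultiRayClient:
    def embed(self, text: str) -> np.ndarray:
        # Fake a deterministic high-dimensional embedding for the demo.
        rng = np.random.default_rng(abs(hash(text)) % (2**32))
        return rng.standard_normal(2048).astype(np.float32)  # many KB per input

client = MultiRayClient()
emb = client.embed("example post text")

# A small task-specific head consumes the embedding, not the raw text:
# here, an (untrained) linear classifier with two output classes.
weights = np.zeros((2, emb.shape[0]), dtype=np.float32)
logits = weights @ emb

# The same embedding also serves similarity-style tasks.
other = client.embed("another post")
cosine = float(emb @ other / (np.linalg.norm(emb) * np.linalg.norm(other)))
print(logits.shape, round(cosine, 3))
```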

Why centralize large models?

Amortization across teams

Large-scale models and latency constraints call for execution on accelerators such as GPUs. Even though this specialized hardware is in high demand within Meta, training and hosting state-of-the-art models still consumes a great deal of power. Because the same hardware and the same computations can be reused over and over, MultiRay's client teams share the cost of training and hosting these models, which gives each team access to far more and far better resources than any of them could afford individually. The whole is worth more than the sum of its parts.

Simplified development and operations

At Meta, teams are typically responsible for their own models, their maintenance, and their infrastructure. As models grow, so does the operational burden on each team to train and serve them, and it becomes harder to apply sophisticated optimization techniques to models scattered across many teams. MultiRay concentrates this work in just a few large, centralized models, allowing a single team to handle most of the operations and optimization. Client teams keep smaller, specialized models that are easy to manage. This lets many teams that lack the resources to train, deploy, and manage cutting-edge AI systems use this technology all the same.

Accelerated production

With more than 125 clients using the centralized MultiRay service, every improvement benefits everyone. The service has become a sandbox for our machine learning and systems specialists, who make key optimizations that benefit the entire accelerator and PyTorch ecosystem. For example, MultiRay was the first large-scale production deployment of PyTorch's BetterTransformer at Meta. The result? Significant savings with no impact on quality.
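
For context, BetterTransformer is the fastpath PyTorch added in version 1.12 for its built-in Transformer modules: in eval mode, under inference mode, the stock encoder transparently dispatches to fused kernels and can skip computation on padded tokens. The minimal sketch below shows the general mechanism, not Meta's actual deployment:

```python
import torch

# Minimal sketch of PyTorch's BetterTransformer fastpath (PyTorch >= 1.12).
layer = torch.nn.TransformerEncoderLayer(
    d_model=512, nhead=8, batch_first=True  # batch_first is required for the fastpath
)
encoder = torch.nn.TransformerEncoder(layer, num_layers=6).eval()

tokens = torch.rand(32, 128, 512)                      # (batch, sequence, features)
padding_mask = torch.zeros(32, 128, dtype=torch.bool)  # True marks padded positions

with torch.inference_mode():
    # With a padding mask, the fastpath can additionally skip padded
    # tokens via nested tensors instead of computing and discarding them.
    out = encoder(tokens, src_key_padding_mask=padding_mask)
print(out.shape)  # torch.Size([32, 128, 512])
```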

Increased efficiency on accelerators: cross-request batching

Accelerator hardware is most efficient when it processes multiple requests in parallel, that is, in batches. The ideal batch maximizes throughput without introducing delays, and finding it can be a real burden for our internal users, especially since the optimal batch changes whenever new hardware or new models are introduced.

To spare them that work, MultiRay's external API accepts only one request at a time; internally, MultiRay then uses batching logic to group simultaneous requests from different users into a single batch. We only need to write this logic once and tune it to produce the right batch size for each model and hardware generation. Clients are never aware of this batching, even when we make significant performance changes, such as increasing the batch size when migrating to next-generation GPU accelerators.
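
The general shape of such a batcher can be sketched in a few lines; everything below (names, batch size, wait time, the stand-in model call) is illustrative, since the real batching logic is internal to MultiRay:

```python
import asyncio

MAX_BATCH = 32       # tuned per model and hardware generation
MAX_WAIT_S = 0.005   # brief wait so concurrent requests can accumulate

def run_model(texts):
    # Stand-in for a single batched forward pass on the accelerator.
    return [f"embedding({t})" for t in texts]

async def embed(queue, text):
    """External API: each caller submits exactly one request."""
    fut = asyncio.get_running_loop().create_future()
    await queue.put((text, fut))
    return await fut   # the caller never sees the batching below

async def batcher(queue):
    loop = asyncio.get_running_loop()
    while True:
        batch = [await queue.get()]            # block until a first request arrives
        deadline = loop.time() + MAX_WAIT_S
        while len(batch) < MAX_BATCH and loop.time() < deadline:
            try:
                batch.append(await asyncio.wait_for(
                    queue.get(), deadline - loop.time()))
            except asyncio.TimeoutError:
                break
        for (_, fut), result in zip(batch, run_model([t for t, _ in batch])):
            fut.set_result(result)             # fan results back out to callers

async def main():
    queue = asyncio.Queue()
    task = asyncio.create_task(batcher(queue))
    print(await asyncio.gather(*(embed(queue, t) for t in ["a", "b", "c"])))
    task.cancel()

asyncio.run(main())
```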

Cache: Striking the right balance between compute and storage

MultiRay uses a cache to avoid as much recomputation as possible. This multi-tier cache reduces both cost and latency, with each successive tier serving more results, but more slowly and therefore at lower cost. The tiers range from a fast but small per-host cache in each MultiRay server's RAM to a slower but far larger distributed cache in flash memory.
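
A toy read-through version of this tiering might look as follows; the two dictionaries stand in for the RAM and flash tiers, and the promotion policy is an assumption rather than a documented MultiRay detail:

```python
# Toy two-tier read-through cache; dicts stand in for real cache backends.
local_ram = {}    # tier 1: fast, small, private to each host
dist_flash = {}   # tier 2: slower, much larger, shared across hosts

def get_embedding(key, compute):
    if key in local_ram:              # cheapest: per-host RAM hit
        return local_ram[key]
    if key in dist_flash:             # slower: distributed flash hit
        value = dist_flash[key]
        local_ram[key] = value        # promote so the next local read is fast
        return value
    value = compute(key)              # miss: recompute on the accelerator
    dist_flash[key] = value
    local_ram[key] = value
    return value

emb = get_embedding("post-123", compute=lambda k: f"embedding({k})")
```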

MultiRay's models are large, and they produce correspondingly large embeddings (many kilobytes) to preserve their universality. For text analysis, these embeddings are often larger than the inputs themselves. And while serving an embedding from the cache takes less energy than recomputing it, it still takes some; since available cache space is limited, results cannot be kept there indefinitely.

MultiRay measures the request patterns of different clients to determine the best cache settings (size, time-to-live, update strategy) and reduce the total cost of the service. For example, we use this measured data to simulate the energy required under different cache time-to-live settings, which lets us strike the right balance between recomputing a request on the accelerators and fetching it from the cache. This feedback loop has allowed us to keep improving MultiRay's efficiency even as user behavior constantly changes.
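
One way to picture this simulation is to replay a measured request log against candidate time-to-live (TTL) values and count how many accelerator recomputations each would cause. The log and numbers below are made up for illustration:

```python
def recompute_count(log, ttl_s):
    """Replay (timestamp, key) requests; count cache misses for a given TTL."""
    last_seen = {}
    misses = 0
    for ts, key in sorted(log):
        if key not in last_seen or ts - last_seen[key] > ttl_s:
            misses += 1          # entry expired (or never cached): recompute
        last_seen[key] = ts      # every access refreshes the entry
    return misses

# Fabricated request log, as if measured from real client traffic.
request_log = [(0, "post-1"), (30, "post-1"), (95, "post-2"), (100, "post-1")]

for ttl in (10, 60, 300):
    print(f"TTL={ttl:>3}s -> {recompute_count(request_log, ttl)} recomputations")
```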

The downside: the challenges of a centralized service

Any centralized service at Meta comes with its own challenges. Some of them, such as client management, quotas, and cost attribution, were considered solved for large-scale systems like databases, but had to be adapted to AI workloads: because both query size and cache hit rate affect the energy a query consumes, quotas become considerably more complex. Furthermore, amortizing the cost of MultiRay's higher-quality but more expensive models only works if they are used at scale, which means they must perform extremely well across many use cases. This moving target has led us to invest heavily in model lifecycle management (versioning, rolling out new versions, retiring obsolete ones), as well as in new architectures and training workflows that shorten the path to production and keep delivering state-of-the-art technology to MultiRay's users.
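
To make the quota point concrete, here is a toy cost model (entirely illustrative, not Meta's accounting) in which only cache misses reach the accelerator; two clients with identical request counts can have wildly different real costs:

```python
# Illustrative only: effective accelerator cost depends on query size and
# cache hit rate, not just on the number of requests.
def accelerator_cost(requests, avg_tokens, cache_hit_rate, cost_per_token=1.0):
    misses = requests * (1.0 - cache_hit_rate)
    return misses * avg_tokens * cost_per_token   # only misses hit the GPU

# Same request volume, very different real cost:
print(accelerator_cost(1_000_000, avg_tokens=32,  cache_hit_rate=0.9))  # 3.2e6
print(accelerator_cost(1_000_000, avg_tokens=512, cache_hit_rate=0.2))  # 4.096e8
```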

Learn more about MultiRay

If you are interested in learning more about MultiRay, we invite you to explore the research conducted by Meta’s Foundational AI Research (FAIR) team that inspired it:

  • Unsupervised cross-lingual representation learning at scale, where researchers demonstrated for the first time that cross-lingual modeling is possible without sacrificing per-language performance.
  • General purpose text embeddings from pre-trained language models for scalable inference, where researchers studied an NLP system in which multiple tasks are performed on the same text using large-scale pre-trained models, at a significantly reduced compute cost.

  • Multiscale vision transformers and Masked autoencoders as spatiotemporal learners, earlier research that hints at how MultiRay could be applied to video tasks in the future.

Translated from Meta dévoile MultiRay, la plateforme d’optimisation des modèles d’IA à grande échelle