Google introduced ‘ViT-22B’ by scaling vision transformers to 22 billion parameters, 5.5x larger than the previous largest vision backbone, ViT-e, which had 4 billion parameters.
Google incorporated scaling methods from text models such as PaLM to make this possible. Owing to its modified architecture, efficient sharding recipe, and implementation, ViT-22B could be trained on Cloud TPUs with high hardware utilisation.
ViT-22B advances the state of the art on a variety of vision tasks, whether used with frozen representations or with complete fine-tuning. The model was also effectively applied in PaLM-E, which shows how combining ViT-22B with a language model can considerably advance the state of the art in robotics tasks.
Instead of the standard Transformer’s sequential execution of the attention and MLP blocks, Google applied the parallel-layers approach, also used in PaLM, which saved 15% of training time.
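For illustration, here is a minimal Flax/JAX sketch of the parallel-layers idea, using hypothetical module sizes rather than the actual ViT-22B configuration: the attention branch and the MLP branch both read the same normalised input, and their outputs are added back to the residual in one step instead of the MLP consuming the attention block’s output.

```python
import jax
import jax.numpy as jnp
import flax.linen as nn

class ParallelBlock(nn.Module):
    """Transformer block whose attention and MLP branches run in parallel (illustrative sizes)."""
    dim: int = 512
    heads: int = 8

    @nn.compact
    def __call__(self, x):
        # Both branches read the same normalised input instead of being chained.
        h = nn.LayerNorm()(x)
        attn_out = nn.MultiHeadDotProductAttention(num_heads=self.heads)(h, h)
        mlp_out = nn.Dense(4 * self.dim)(h)
        mlp_out = nn.gelu(mlp_out)
        mlp_out = nn.Dense(self.dim)(mlp_out)
        # A single residual sum; the two branch matmuls can be fused and scheduled together.
        return x + attn_out + mlp_out

# Example usage with a dummy token sequence.
block = ParallelBlock()
x = jnp.zeros((1, 16, 512))                      # (batch, tokens, dim)
params = block.init(jax.random.PRNGKey(0), x)
y = block.apply(params, x)
```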
Hardware utilisation was also improved by 3% by omitting the bias terms in the QKV projections, part of the self-attention mechanism, and in the LayerNorms.
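A minimal sketch, again with illustrative names and sizes rather than ViT-22B’s own code, of what dropping those bias terms looks like in Flax: the LayerNorm and the query/key/value projections are simply constructed with use_bias=False, so the corresponding bias parameters and additions disappear.

```python
import flax.linen as nn

class BiasFreeQKV(nn.Module):
    """LayerNorm and Q/K/V projections without bias terms (illustrative, not ViT-22B's code)."""
    dim: int = 512

    @nn.compact
    def __call__(self, x):
        # LayerNorm keeps its learned scale but drops the learned offset (bias).
        h = nn.LayerNorm(use_bias=False)(x)
        # Bias-free linear projections for queries, keys and values.
        q = nn.Dense(self.dim, use_bias=False, name="query")(h)
        k = nn.Dense(self.dim, use_bias=False, name="key")(h)
        v = nn.Dense(self.dim, use_bias=False, name="value")(h)
        return q, k, v
```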
‘Sharding’, the process of distributing the model parameters across different compute devices, was necessary for a model of this size. In addition, Google also shards the activations.
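As a rough sketch of the idea, the snippet below uses JAX’s jax.sharding API to place a hypothetical weight matrix and a batch of activations across a one-dimensional device mesh; the mesh layout and partitioning here are illustrative assumptions, not the ViT-22B recipe itself.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# One-dimensional mesh over all available devices (TPU cores in Google's setup).
mesh = Mesh(np.array(jax.devices()), axis_names=("model",))

# Hypothetical weight and activation tensors; the sizes are illustrative only.
w = jnp.zeros((1024, 4096))        # a parameter matrix
x = jnp.zeros((32, 1024))          # a batch of activations

# Shard the parameter column-wise and the activations by batch across the
# "model" axis; each device then holds only its own slice of each array.
w_sharded = jax.device_put(w, NamedSharding(mesh, P(None, "model")))
x_sharded = jax.device_put(x, NamedSharding(mesh, P("model", None)))

# XLA inserts the collectives needed to compute with the distributed operands.
y = jnp.dot(x_sharded, w_sharded)
print(y.sharding)
```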
Communication of activations and weights between devices happens at the same time as computation in the matrix-multiply unit, thanks to an approach called ‘asynchronous parallel linear operations’. This overlap reduces the time devices spend waiting for incoming communication, thereby enhancing device efficiency.
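The sketch below illustrates the general overlap pattern rather than the ViT-22B implementation: each device multiplies the activation chunk it currently holds by its local weight block while the next chunk is passed around a ring with ppermute, so the compiler can schedule the communication alongside the matrix multiplications. The mesh axis name, shapes, and shard_map usage are assumptions made for this example.

```python
from functools import partial
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, PartitionSpec as P
from jax.experimental.shard_map import shard_map

# One mesh axis over all local devices; on a single device this degenerates
# to an ordinary matmul.
mesh = Mesh(np.array(jax.devices()), axis_names=("x",))
n_dev = mesh.shape["x"]

@partial(shard_map, mesh=mesh,
         in_specs=(P("x", None), P(None, "x")),
         out_specs=P(None, "x"))
def overlapped_matmul(x_block, w_block):
    # x is sharded along its rows, w along its columns. At each step, a device
    # multiplies the row chunk it currently holds by its own weight block while
    # the next chunk travels around the ring, letting the compiler overlap the
    # ppermute (communication) with the local matmul (computation).
    idx = jax.lax.axis_index("x")
    rows = x_block.shape[0]
    out = jnp.zeros((rows * n_dev, w_block.shape[1]), x_block.dtype)
    chunk = x_block
    for k in range(n_dev):
        row_block = (idx + k) % n_dev
        out = jax.lax.dynamic_update_slice(out, chunk @ w_block,
                                           (row_block * rows, 0))
        chunk = jax.lax.ppermute(chunk, "x",
                                 [(i, (i - 1) % n_dev) for i in range(n_dev)])
    return out

# Example usage; shapes are chosen to divide evenly by the device count.
x = jnp.ones((8 * n_dev, 128))
w = jnp.ones((128, 16 * n_dev))
y = jax.jit(overlapped_matmul)(x, w)
assert jnp.allclose(y, x @ w)
```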
Google claims that ViT-22B offers increased fairness and robustness compared to existing models. The tech giant also asserts that its shape and texture bias is more closely aligned with human visual perception.