Google Scales Vision Transformers to 22 Billion Parameters

Google incorporated scaling methods from text models like PaLM to make the scaling possible.

Google introduced ‘ViT-22B’ by scaling vision transformers to 22 billion parameters, 5.5x larger than the previous largest vision backbone, ViT-e, which had 4 billion parameters.

Google incorporated scaling methods from text models like PaLM to make this possible. Owing to its modified architecture, efficient sharding recipe, and implementation, ViT-22B could be trained on Cloud TPUs with high hardware utilisation.

Whether used through frozen representations or full fine-tuning, ViT-22B advances the state of the art on a variety of vision tasks. Additionally, the model was effectively applied in PaLM-E, showing how combining ViT-22B with a language model can considerably advance the state of the art for robotics tasks.

Instead of the typical Transformer’s sequential execution of the attention and MLP blocks, Google implemented the ‘parallel layers’ approach, which was also used in PaLM and saved 15% of training time.
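
To illustrate the idea, here is a minimal sketch contrasting a standard sequential Transformer block with a parallel-layers block; the function names and structure are illustrative and are not taken from Google’s implementation.

```python
def sequential_block(x, attn_fn, mlp_fn, ln1, ln2):
    # Standard Transformer block: the MLP only runs after attention finishes.
    x = x + attn_fn(ln1(x))
    return x + mlp_fn(ln2(x))

def parallel_block(x, attn_fn, mlp_fn, ln):
    # Parallel layers (as in PaLM): attention and MLP read the same normalised
    # input, so their matrix multiplications can be executed concurrently and
    # their outputs are summed into a single residual update.
    h = ln(x)
    return x + attn_fn(h) + mlp_fn(h)
```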

The model’s hardware utilisation was also improved by 3% by omitting biases in the QKV projections, which are part of the self-attention mechanism, and in the LayerNorms.
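
Below is a rough sketch of what dropping these bias terms can look like in JAX; the parameter names are assumptions for illustration. The QKV projection becomes a pure matrix multiply, and the LayerNorm keeps a learned scale but no learned offset.

```python
import jax.numpy as jnp

def qkv_projection(x, w_qkv):
    # Bias-free projection: y = x @ W, with no '+ b' term to store or fuse.
    return x @ w_qkv

def layernorm_without_bias(x, scale, eps=1e-6):
    # LayerNorm with a learned scale but without a learned bias/offset.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return scale * (x - mean) / jnp.sqrt(var + eps)
```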


‘Sharding’, the process of distributing the model parameters across different compute devices, was needed for a model of this size. In addition, Google also shards the activations.
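
As a simplified illustration of sharding both parameters and activations with JAX’s sharding API (the mesh shape, axis names and array sizes here are assumptions, not the exact recipe used for ViT-22B):

```python
import numpy as np
import jax
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Hypothetical device mesh with a "data" axis and a "model" axis.
devices = np.array(jax.devices()).reshape(1, -1)
mesh = Mesh(devices, axis_names=("data", "model"))

# Parameters: split a weight matrix along its output dimension across "model".
w = jax.device_put(np.zeros((1024, 4096), np.float32),
                   NamedSharding(mesh, P(None, "model")))

# Activations: split along the batch dimension across "data".
x = jax.device_put(np.zeros((8, 1024), np.float32),
                   NamedSharding(mesh, P("data", None)))

# A jit-compiled computation on these arrays then runs with both the weights
# and the activations distributed across the devices in the mesh.
y = jax.jit(lambda a, b: a @ b)(x, w)
```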

Communication of activations and weights between devices overlaps with computation in the matrix multiply unit, thanks to an approach called ‘asynchronous parallel linear operations’. This asynchronous approach reduces the time spent waiting for incoming communication, thereby improving device efficiency.
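
The sketch below shows the general pattern of overlapping a ring transfer of activation shards with the matrix multiplies that consume them, written with JAX collectives. It is a simplified illustration of communication/compute overlap under assumed names and layout, not Google’s actual implementation.

```python
import jax
import jax.numpy as jnp

def overlapped_matmul(x_local, w_local, n_devices, axis_name="x"):
    # Computes all_gather(x) @ w_local, but instead of gathering every
    # activation shard up front, each shard is multiplied as soon as it is
    # resident while the ring transfer of the next shard is in flight.
    rows = x_local.shape[0]
    out = jnp.zeros((rows * n_devices, w_local.shape[1]), x_local.dtype)
    idx = jax.lax.axis_index(axis_name)
    perm = [(i, (i + 1) % n_devices) for i in range(n_devices)]
    for step in range(n_devices):
        src = (idx - step) % n_devices  # which device this shard started on
        out = jax.lax.dynamic_update_slice(out, x_local @ w_local,
                                           (src * rows, 0))
        if step < n_devices - 1:
            # Hand the shard to the next device; the compiler is free to
            # overlap this transfer with the matmul issued above.
            x_local = jax.lax.ppermute(x_local, axis_name, perm)
    return out

# Intended to run under pmap/shard_map, for example:
# jax.pmap(lambda x, w: overlapped_matmul(x, w, jax.device_count()),
#          axis_name="x")(x_shards, w_shards)
```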

Google claims that ViT-22B offers improved fairness and robustness compared to existing models. The tech giant also asserts that the model aligns more closely with human visual perception in terms of shape and texture bias.

