ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision

Tram Ho

General introduction

Vision-and-Language Pre-training (VLP) models have proven effective at improving downstream tasks that combine linguistic and visual information. Before entering a VLP model, the image pixels need to be embedded alongside the language tokens, and for this image-embedding step we can use popular CNN networks.

To date, most VLP research has focused on improving performance by improving the image-embedding step. However, this step has practical limitations: the embedding models are often very heavy, resulting in slow speeds on real-world queries. The authors therefore focused on designing a fast and lightweight visual embedding. Recent studies demonstrate that a simple linear projection of each patch is effective for embedding image pixels before they enter the transformer model. Although originally designed for text, the transformer has recently proved effective on image data as well, and applying it to images in a VLP model makes it possible to replace the traditional CNN.
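The "linear projection for each patch" mentioned above can be sketched in a few lines of NumPy. Shapes and variable names here are illustrative, not taken from the paper's code:

```python
import numpy as np

# Illustrative sizes: a 224x224 RGB image, 32x32 patches, hidden size 768.
C, H, W, P, hidden = 3, 224, 224, 32, 768
image = np.random.rand(C, H, W)

# Cut the image into N = (H/P) * (W/P) non-overlapping patches and flatten each.
N = (H // P) * (W // P)
patches = (image
           .reshape(C, H // P, P, W // P, P)
           .transpose(1, 3, 0, 2, 4)        # (H/P, W/P, C, P, P)
           .reshape(N, C * P * P))          # (N, P^2 * C)

# A single learned linear map replaces an entire CNN visual backbone.
W_proj = np.random.rand(C * P * P, hidden)
embeddings = patches @ W_proj               # (N, hidden) = (49, 768)
```

The whole "visual embedder" is one matrix multiplication, which is why it is so much lighter than a region- or grid-feature CNN.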

In the paper, the authors propose the Vision-and-Language Transformer (ViLT), a model that handles the two modalities (vision and language) in a unified way. The biggest difference compared to previous VLP models is the “shallow” nature of the proposed model: the input image does not have to go through a separate CNN. Eliminating this deep visual-embedding step greatly reduces the size and runtime of the overall model (see figure below).


This is also one of the cool ideas in the paper. Find out more in the sections below.


Vision-and-Language model classification

The authors propose to classify vision-and-language models based on the following two criteria:

  • How the two modalities compare in terms of parameters and computational resources
  • Whether the two modalities interact inside a deep network

Based on these two criteria, we have the 4 cases shown below. The height of each rectangle represents the relative computing resources of that module.


Some example models for these 4 cases:

  • In figure (a) we have VSE++ and SCAN. These two models use separate embedders for images and text (the image embedder being heavier), then represent the similarity between the embedded features with a simple dot product or shallow attention layers.
  • In figure (b) we have CLIP, which uses transformer embedders of similar complexity for the two separate modalities. The interaction between the pooled image vector and the text vector is still “shallow” (a dot product). However, using strong embedders for each modality does not guarantee that the model will learn complex vision-and-language tasks effectively. For example, fine-tuning an MLP head on NLVR2 instead of CLIP’s dot product gives an accuracy of only about 50.99%, so the representations are not capable of learning this task. This prompts the need for a tighter interaction structure between the modalities.
  • As shown in figure (c), recent VLP models use a deep transformer to model the interaction between image and text features, but a CNN is still used to extract and embed the image features. This makes the model large and demands more computational resources.
  • ViLT is the first canonical model of figure (d): its embedding layers are shallow and light, so the architecture focuses most of the computation on modeling the interaction between the modalities.

The interaction between the two modalities

The core of modern VLP models is the Transformer (wherever you go, you see Transformers). The model takes image and text embeddings as input, models intra-modal and inter-modal interactions across its layers, and then outputs a contextualized feature sequence. Here the authors use a single-stream approach, i.e. the layers operate on the concatenated image and text embeddings. This approach adds no extra parameters, unlike a dual-stream approach.

Visual Embedding

Instead of using region features or grid features, the authors introduce patch projection for the extraction module, specifically a linear projection of image patches. Patch projection will not be unfamiliar if you have read the ViT paper. The authors use a 32 × 32 patch projection, which requires only 2.4M parameters.
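The 2.4M figure follows directly from the projection matrix's shape, $(P^2 \cdot C) \times H$:

```python
# Parameter count of the 32x32 patch projection (weight matrix only;
# a bias term would add just `hidden` more parameters).
P, C, hidden = 32, 3, 768
params = (P * P * C) * hidden
print(params)  # 2359296, i.e. roughly 2.4M
```

Compare this with the tens of millions of parameters in a ResNet-style region or grid feature extractor.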

Vision-and-Language Transformer

Model overview


The architecture of ViLT is quite simple, as shown in the figure. A nice point is that the authors initialize the weights of the interaction transformer from pretrained ViT instead of BERT. This initialization “powers up” the interaction layers to handle visual features, which makes up for the lack of a separate deep visual embedder. The authors also experimented with initializing the layers from pretrained BERT while using the pretrained patch projection from ViT, but this was not effective.


The model works as follows. The input text $t \in \mathbb{R}^{L \times |V|}$ is embedded to $\bar{t} \in \mathbb{R}^{L \times H}$ with a word embedding matrix and a position embedding matrix. The input image $I \in \mathbb{R}^{C \times H \times W}$ is sliced into patches and flattened to $v \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where $P \times P$ is the patch size and $N = HW/P^2$; the linear patch projection and a position embedding then map it to $\bar{v} \in \mathbb{R}^{N \times H}$. The text and image embeddings are summed with their corresponding modal-type embedding vectors $t^{\mathrm{type}}, v^{\mathrm{type}} \in \mathbb{R}^H$, then concatenated into a combined sequence $z^0$.
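The embedding pipeline above (embedded text, projected patches, modal-type vectors, concatenation) can be sketched in NumPy with illustrative sizes; none of these names come from the paper's code:

```python
import numpy as np

L_txt, N, hidden = 40, 49, 768           # illustrative sequence lengths and width
t_bar = np.random.rand(L_txt, hidden)    # embedded text tokens
v_bar = np.random.rand(N, hidden)        # linearly projected image patches

# Each modality gets its own learned type vector, broadcast over its sequence,
# then the two sequences are concatenated into the transformer input z^0.
t_type = np.random.rand(hidden)
v_type = np.random.rand(hidden)
z0 = np.concatenate([t_bar + t_type, v_bar + v_type], axis=0)  # (L_txt + N, hidden)
```

From here on, a single transformer stack attends over all 89 positions at once, which is exactly the single-stream design.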

Pre-training Objectives

Two main and common objectives when training VLP models are image-text matching (ITM) and masked language modeling (MLM).

Image Text Matching

The authors randomly replace the aligned image with a different image with probability 0.5. The architecture adds an ITM head that projects the pooled output feature $p$ to logits over the two classes (matched / mismatched), and the ITM loss is the negative log-likelihood of the correct class.
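A minimal sketch of this objective, assuming a simple softmax-plus-NLL formulation over the two ITM logits (function names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def itm_labels(batch_size, replace_prob=0.5):
    # With probability 0.5 the aligned image is swapped for a random one;
    # label 1 = matched pair, label 0 = mismatched pair.
    swapped = rng.random(batch_size) < replace_prob
    return (~swapped).astype(int)

def itm_loss(logits, labels):
    # Negative log-likelihood of the correct class (softmax + NLL).
    shifted = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    return -np.log(probs[np.arange(len(labels)), labels]).mean()
```

In the real model the logits come from a linear head on the pooled feature $p$; here they would just be a `(batch, 2)` array.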

Another nice point is that the authors add word patch alignment (WPA), which computes an alignment score between two subsets of $z^D$: $z^D|_t$ (the textual subset) and $z^D|_v$ (the visual subset).

Masked Language Modeling

The main goal is to predict the labels of the masked text tokens $t_{\mathrm{masked}}$ from their contextualized vectors $z^D|_{t_{\mathrm{masked}}}$.
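Structurally, an MLM head is just a small network mapping each masked position's contextualized vector to vocabulary logits. A sketch with illustrative sizes (the vocabulary is shrunk so the example stays tiny):

```python
import numpy as np

hidden, vocab, L_txt = 768, 1000, 40    # vocab shrunk for illustration
z_text = np.random.rand(L_txt, hidden)  # contextualized text vectors z^D|_t
masked_positions = [3, 17]              # positions replaced by [MASK] in the input

# A two-layer head (linear -> ReLU -> linear) scores every vocabulary entry
# for each masked position.
W1 = np.random.rand(hidden, hidden)
W2 = np.random.rand(hidden, vocab)
logits = np.maximum(z_text[masked_positions] @ W1, 0) @ W2  # (2, vocab)
```

Training then applies a cross-entropy loss between these logits and the original token ids.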

Whole Word Masking

This is a masking technique that masks all of the subword tokens that make up a word. It has been shown effective on downstream tasks when applied to BERT.

The idea is as follows. Suppose the word “giraffe” is tokenized into the 3 pieces [“gi”, “##raf”, “##fe”] by the bert-base-uncased tokenizer. If not all of its tokens are masked, e.g. [“gi”, “[MASK]”, “##fe”], the model can rely only on the neighboring tokens [“gi”, “##fe”] to predict the masked piece “##raf”, without having to use any information from the image.
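The giraffe example can be turned into a small helper. This is an illustrative sketch, not the paper's implementation; it relies only on the WordPiece convention that continuation pieces start with “##”:

```python
def whole_word_mask(tokens, masked_word_indices):
    """Mask every WordPiece of each selected word (## marks continuations)."""
    # Group subword tokens into whole words.
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1].append(tok)
        else:
            words.append([tok])
    # Emit [MASK] for every piece of a selected word, others unchanged.
    out = []
    for i, pieces in enumerate(words):
        if i in masked_word_indices:
            out.extend(["[MASK]"] * len(pieces))
        else:
            out.extend(pieces)
    return out

tokens = ["a", "gi", "##raf", "##fe", "eats"]
print(whole_word_mask(tokens, {1}))
# ['a', '[MASK]', '[MASK]', '[MASK]', 'eats']
```

Masking all three pieces of “giraffe” forces the model to consult the image rather than reconstruct the word from its own fragments.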

Image Augmentation

The authors use RandAugment during fine-tuning. However, two augmentation operations are discarded: color inversion, because the input text may contain color information about the image, and cutout, because it can remove small but important objects from the image.
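In practice this amounts to filtering two operations out of RandAugment's policy list before sampling. A sketch with an illustrative op list (the exact names vary between RandAugment implementations):

```python
# Illustrative RandAugment-style op names; ViLT drops color inversion and cutout.
default_ops = ["AutoContrast", "Equalize", "Invert", "Rotate", "Posterize",
               "Solarize", "Color", "Contrast", "Brightness", "Sharpness",
               "ShearX", "ShearY", "TranslateX", "TranslateY", "Cutout"]
excluded = {"Invert", "Cutout"}
ops = [op for op in default_ops if op not in excluded]
```

The augmentation policy then samples only from the remaining operations.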


In this article, the authors contributed the ViLT VLP architecture to solve vision-and-language problems effectively. In addition, techniques such as augmentation and masking are applied to increase the model's performance. This is a model you can try on tasks like image captioning or visual question answering.


[1] ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision

[2] Logit and probit models



Source : Viblo