Author : Benjamin Graham
Title : LeViT: a Vision Transformer in ConvNet’s Clothing for Faster Inference
Venue : ICCV 2021
Affiliation : Facebook Research team
Link : https://arxiv.org/abs/2104.01136
Citations : 305 (as of 2023.10.02)
Summary of LeViT.
What is throughput?
- Throughput refers to the number of images processed per second.
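As a rough illustration of this definition, the sketch below times a callable over several batches and reports images per second. The names `measure_throughput` and `dummy_model` are hypothetical stand-ins, not part of LeViT; a real measurement would call an actual model and, on a GPU, synchronize the device before reading the clock.

```python
import time

def measure_throughput(process_batch, batch, n_iters=20, n_warmup=3):
    """Return images processed per second for a callable that handles one batch."""
    for _ in range(n_warmup):            # warm-up runs are excluded from timing
        process_batch(batch)
    start = time.perf_counter()
    for _ in range(n_iters):
        process_batch(batch)
    elapsed = time.perf_counter() - start
    return n_iters * len(batch) / elapsed

# Hypothetical stand-in for a model forward pass: sum each "image".
def dummy_model(batch):
    return [sum(image) for image in batch]

batch = [[0.5] * 1024 for _ in range(64)]    # 64 fake "images"
throughput = measure_throughput(dummy_model, batch)
print(f"{throughput:.0f} images/s")
```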
Why is it called LeViT?
- The model adopts a pyramid structure with pooling layers, akin to the LeNet architecture, instead of ViT's uniform repeated-block structure. This inspired the name LeViT.
Goal
- To develop a ViT-based family of models with better inference speed on highly parallel architectures such as GPUs, on regular CPUs, and on the ARM hardware found in mobile devices.
Objective
- To evaluate variants of convolution-transformer hybrids while controlling for runtime.
Contribution
- A multi-stage transformer architecture using attention as a downsampling mechanism.
- A computationally efficient patch descriptor that shrinks the number of features in the first layers.
- A learnt, per-head translation-invariant attention bias that replaces ViT’s positional embedding.
- A redesigned Attention-MLP block that improves the network capacity for a given compute time.
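The first contribution, attention as a downsampling mechanism, can be sketched as follows. This is a minimal single-head NumPy illustration, assuming the queries are computed from a stride-2 subsample of the activation map while keys and values still see every position. The function name, the shapes, the random weights, and the 1/sqrt(D) scaling are assumptions made for illustration, not the authors' implementation (which also uses multiple heads and an attention bias).

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def shrinking_attention(x, w_q, w_k, w_v, stride=2):
    """Single-head attention whose queries come from a subsampled grid.

    x: (H, W, C) activation map. Returns (H // stride, W // stride, D_v).
    """
    h, w, c = x.shape
    x_sub = x[::stride, ::stride]                    # subsample before the Q projection
    q = x_sub.reshape(-1, c) @ w_q                   # (N_out, D)
    k = x.reshape(-1, c) @ w_k                       # (N_in, D)
    v = x.reshape(-1, c) @ w_v                       # (N_in, D_v)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (N_out, N_in)
    out = attn @ v                                   # (N_out, D_v)
    return out.reshape(x_sub.shape[0], x_sub.shape[1], -1)

rng = np.random.default_rng(0)
x = rng.standard_normal((14, 14, 16))
w_q = rng.standard_normal((16, 8))
w_k = rng.standard_normal((16, 8))
w_v = rng.standard_normal((16, 32))
out = shrinking_attention(x, w_q, w_k, w_v)
print(out.shape)   # (7, 7, 32)
```

The output has a quarter of the tokens (7 x 7 instead of 14 x 14) and, because the value projection is wider, more channels, which is the pyramid behaviour the notes describe.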
Solution (Method)
- Convolutional components are reintroduced in place of transformer components that end up learning convolution-like features.
- LeViT uses a pyramid structure with pooling layers, in contrast to ViT's uniform stack of repeated blocks.
0. Abstract
- The authors propose a new attention-based model, named LeViT, optimized for both accuracy and efficiency.
- They revisit principles from the extensive literature on convolutional neural networks and apply them to transformers.
- They introduce an "attention bias" to incorporate positional information in ViT. It is loosely analogous to the bias in a convolutional layer in that it is a learned additive term. The difference is where it acts: the attention bias is added to the attention scores and depends on the relative position of the two tokens, whereas a convolutional bias is a single scalar per output channel that shifts the activation.
1. Introduction.
- The foundational transformer architecture consists of two residual blocks: the "MLP block" and the "self-attention block".
- The self-attention mechanism can be viewed as a bilinear function of its input, as detailed in the provided [link]. In comparison, the convolutional layers of a Convolutional Neural Network (CNN) are limited to fixed-size neighborhoods, whereas self-attention compares every position with every other position (global comparison).
- ViT achieves state-of-the-art (SOTA) results on the ImageNet-1k (IN-1k) dataset in terms of the speed-accuracy trade-off when pre-trained on JFT-300M, a very large annotated dataset. DeiT, on the other hand, delivers competitive results and improved throughput when trained on IN-1k alone.
- For a given compute budget (FLOPs), transformer architectures are typically faster than CNN architectures. Transformers rely mainly on attention and MLP blocks, which are implemented as large matrix multiplications, and most hardware is optimized for this operation. Convolutions, in contrast, require complex data access patterns, which often makes them IO-bound*.
You can find further explanations in the links provided below: [ViT] [DeiT]
IO-bound : A situation in which the time spent on memory access (data movement) exceeds the time spent on computation.
- The proposed model, LeViT, shows a better speed-accuracy trade-off than the ViT/DeiT models in the small and medium-sized regime.
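To make the "bilinear and global" description of self-attention concrete, here is a small NumPy sketch of single-head self-attention. The token count, dimensions, and random weights are arbitrary assumptions. The point is that the score matrix is a product of two linear maps of the same input, and that every token receives a non-zero weight from every other token, unlike a convolution with a fixed-size kernel.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over x of shape (N, C)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])       # bilinear in x: (x W_q)(x W_k)^T
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
n_tokens, dim = 6, 4
x = rng.standard_normal((n_tokens, dim))
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)

# Each row of `weights` spans all tokens: the receptive field is global.
print(weights.shape, bool((weights > 0).all()))
```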
2. Related work.
1. Transformer
- The integration of attention mechanisms into vision tasks has been explored in prior work, as detailed in the links below.
- Numerous studies have attempted to incorporate attention mechanisms into computer vision tasks. However, the cost of attention grows quadratically with the number of tokens, and this, together with the additional parameters, has often led researchers to work with smaller images or patches.
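The quadratic growth mentioned above is easy to check with a little arithmetic. The sketch below assumes a square image cut into non-overlapping square patches and counts the entries of one attention matrix (one head, one layer); `attention_cost` is a hypothetical helper name.

```python
def attention_cost(image_size, patch_size):
    """Token count and attention-matrix entries for one head of one layer."""
    tokens_per_side = image_size // patch_size
    n_tokens = tokens_per_side ** 2
    return n_tokens, n_tokens ** 2

for patch in (32, 16, 8):
    n, entries = attention_cost(224, patch)
    print(f"patch {patch:>2}: {n:>4} tokens, {entries:>7} attention entries")
```

For a 224x224 image, halving the patch size from 16 to 8 raises the token count from 196 to 784 (4x) and the attention matrix from 38,416 to 614,656 entries (16x).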
2. Vision Transformer (ViT)
- The ViT model was one of the first to apply a convolution-free transformer encoder to 224x224 images, leveraging an extensive training dataset known as JFT-300M. This brings it close to its NLP counterpart.
- Its success is attributed in part to strong data augmentation, which reflects the fact that transformers have less built-in structure than CNN models.
3. Other Works:
- While numerous attention-based architectures have been proposed for CV tasks, few have emphasized the trade-off between accuracy and efficiency as much as the LeViT paper.
3. Motivation
- Section 3 discusses the "grafting experiments", in which the authors combine the DeiT and ResNet-50 architectures into a single model.
3-1. Convolutions in the ViT architecture.
1. The Vision Transformer (ViT) divides an image into patches using a Conv2d layer: a k x k convolution filter applied with stride k, so the patches do not overlap. The patch embeddings then pass through the linear projections of the first MHSA block. The whole process can be viewed as a convolutional function of the input: the patch extractor is itself a convolution, the linear layer in the MHSA block composes with it and so also acts as a convolution filter, and the self-attention layer learns the relationships between the patches.
2. The third image shows the weights of the patch extractor, which exhibit patterns typical of CNN filters. Head 3, shown in color, captures low-frequency patterns, and Head 8, shown in grayscale, captures high-frequency patterns.
3. In a standard CNN, neighbouring convolution windows overlap heavily, so nearby pixels receive similar gradients and the learned masks become spatially smooth. In ViT, the patches do not overlap. However, ViT/DeiT are trained with strong augmentation (rotation, affine transforms), which likely explains why the attention-based model also learns spatial smoothness.
→ Despite the lack of "inductive bias" in the attention-based model, training can produce weights similar to those of CNN architectures.
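The claim in point 1, that ViT's patch extractor is a convolution with kernel size k and stride k, can be checked numerically. The sketch below uses NumPy with small assumed sizes (a 3x32x32 image, k = 16, 8 output channels) and compares "cut into patches, then apply a linear layer" against a naive strided convolution. Both function names are hypothetical.

```python
import numpy as np

def patchify_linear(img, weight, k):
    """Cut img (C, H, W) into non-overlapping k x k patches and project each one."""
    c, h, w = img.shape
    patches = img.reshape(c, h // k, k, w // k, k).transpose(1, 3, 0, 2, 4)
    patches = patches.reshape(h // k, w // k, c * k * k)     # one row per patch
    return patches @ weight.reshape(weight.shape[0], -1).T   # (H/k, W/k, D)

def strided_conv(img, weight, k):
    """Naive convolution with kernel size k and stride k, no padding."""
    d = weight.shape[0]
    c, h, w = img.shape
    out = np.zeros((h // k, w // k, d))
    for i in range(h // k):
        for j in range(w // k):
            window = img[:, i * k:(i + 1) * k, j * k:(j + 1) * k]   # (C, k, k)
            out[i, j] = (weight * window).sum(axis=(1, 2, 3))
    return out

rng = np.random.default_rng(0)
k = 16
img = rng.standard_normal((3, 32, 32))
weight = rng.standard_normal((8, 3, k, k))    # (D, C, k, k), as in a Conv2d filter bank

a = patchify_linear(img, weight, k)
b = strided_conv(img, weight, k)
print(a.shape, np.allclose(a, b))
```

The two outputs agree, so the "patchify plus linear projection" view and the "strided convolution" view describe the same operation.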
3-2. Preliminary experiment
- A transformer can be stacked on top of a ResNet-50, so that the transformer operates on the features extracted by the convolutional layers.
- When a ResNet is grafted onto an attention-based model, placing the convolutional layers at the beginning of the network brings a large benefit.
- The table shows that combining two ResNet stages with 6 DeiT layers gives a superior result.
- Figure 3 shows an interesting phenomenon: early in training, the accuracy curve of the grafted model resembles that of the convolutional network, but later in training its convergence rate is similar to that of the DeiT model. This suggests that the grafted model can learn low-level information efficiently in its earlier (convolutional) layers thanks to their inductive bias (translation invariance).
- We may hypothesize that the convolutional layers supply the priors (spatial information and translation invariance), while the transformer layers focus on solving the given task.
4. Model
- LeViT has no CLS token.
- As shown in the image below, LeViT uses a 4-layer patch embedding, which reduces the activation map from [3, 224, 224] to [256, 14, 14]. The number of channels goes through [3, 32, 64, 128, 256] across the layers.
- The attention bias is a learned per-head term B^h, indexed by the relative offset between two positions, that is added to the attention scores: A^h[(x,y),(x',y')] = Q[(x,y),:] · K[(x',y'),:] + B^h[|x-x'|,|y-y'|].
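The two points above can be sketched together. The first part checks the patch-embedding shapes, assuming four 3x3 convolutions with stride 2 as in the LeViT paper, with padding 1 as an assumption. The second part builds a per-head attention bias indexed by the absolute offset (|x-x'|, |y-y'|) between two positions. The helper names, the number of heads, and the random table values are hypothetical; in a trained model the table would be a learned parameter and the resulting bias would be added to Q·K^T before the softmax.

```python
import itertools
import numpy as np

def conv_out_size(size, kernel=3, stride=2, padding=1):
    """Spatial output size of a convolution (padding=1 is an assumption)."""
    return (size + 2 * padding - kernel) // stride + 1

# Patch embedding: four stride-2 convolutions, 224 -> 112 -> 56 -> 28 -> 14.
channels = [3, 32, 64, 128, 256]
resolution = 224
for c_in, c_out in zip(channels[:-1], channels[1:]):
    resolution = conv_out_size(resolution)
    print(f"{c_in:>3} -> {c_out:>3} channels, {resolution} x {resolution}")

def attention_bias(table, resolution):
    """Expand a per-head table B[h, offset] into a full (heads, N, N) bias."""
    points = list(itertools.product(range(resolution), range(resolution)))
    offsets = {}          # maps (|dx|, |dy|) to a column of the table
    idxs = []
    for x1, y1 in points:
        for x2, y2 in points:
            off = (abs(x1 - x2), abs(y1 - y2))
            idxs.append(offsets.setdefault(off, len(offsets)))
    n = len(points)
    idxs = np.array(idxs).reshape(n, n)
    return table[:, idxs]  # same value for every pair with the same offset

num_heads = 4                                   # hypothetical head count
rng = np.random.default_rng(0)
table = rng.standard_normal((num_heads, resolution * resolution))
bias = attention_bias(table, resolution)
print(bias.shape)   # (4, 196, 196)
```

Because the bias depends only on the offset, two pairs of positions that are the same distance apart share one parameter, which is what makes the bias translation-invariant.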
5. Experiment.
- "S" refers to the two structures shown in Section 4: one keeps the output resolution the same as the input, and the other is the bottleneck structure.