Paper Reading [DirectCLR] - Understanding Dimensional Collapse in Contrastive Self-supervised Learning

Today I want to look at dimensional collapse, focusing in particular on the dimensional collapse that occurs even in contrastive learning.

Title : Understanding Dimensional Collapse in Contrastive Self-supervised Learning

Venue : ICLR 2022

Authors : Li Jing, Pascal Vincent, Yann LeCun, Yuandong Tian

Citations : 164 (as of 2023.07.24)

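- Before starting: (a) illustrates the embedding space in the joint embedding setup. Two different augmentations are applied to the same image and passed through the same encoder to produce z and z'. (b) and (c) show how those embeddings can end up distributed.

- (b) is the complete collapse that comes up often in papers. Papers usually describe it as the embeddings converging to a constant.

- (c) The main topic of this paper is (c), dimensional collapse. (b) rarely happens, and it was assumed that negative samples would keep collapse from occurring in contrastive learning, but the paper shows that a problem like (c) still exists.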

Abstract

- The joint embedding approach aims to maximize the agreement between different views of the same image.

- However, this joint embedding approach suffers from the collapse problem.

- Contrastive learning tried to solve the collapse problem by using negative samples, but the paper shows that collapse actually occurs in contrastive learning as well, and that the cause lies in how contrastive learning operates.

- Non-contrastive learning methods have a different problem: the embeddings are mapped only onto a lower-dimensional subspace.

- The paper proposes a new contrastive learning method called DirectCLR.

Introduction

- A hallmark of recent self-supervised learning methods is that they use the joint embedding setup and reach performance close to supervised learning.

- The goal of joint embedding is to learn invariant representations by applying augmentations to the same image.

→ However, this objective admits a trivial solution in which every input is mapped to the same vector, and that is the collapsing problem.

SimCLR, MoCo : try to solve this by using negative samples.

BYOL, SimSiam : avoid collapse without negative pairs, using a stop-gradient together with a predictor.

Other approaches also existed, such as adding a clustering step or minimizing redundant information.

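→ The paper argues that these methods could prevent complete collapse, where the data bunches into a single point, but that dimensional collapse, where the embeddings collapse only along certain dimensions, still occurs. (The embedding space spans only a lower-dimensional subspace.)

- Contrastive learning tried to solve the collapse problem by using negative samples, but the paper argues that collapse actually occurs in contrastive learning as well, for the following two reasons.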

(1) Along a feature direction where the variance caused by data augmentation is larger than the variance of the original data distribution, the weights collapse in that direction.

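→ SimCLR argued that data augmentation is important in CL, and data augmentation increases the variance of the data so that diverse features can be learned. However, joint embedding methods by nature learn only the important information, so if the data augmentation is poorly chosen, the model will try to learn the augmented distribution rather than the important information, and this would also lead to weight collapse. That is my own take...

(2) Even when the covariance of the data augmentation is smaller than the variance of the data, weight collapse still occurs because of implicit regularization.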

Related work.

- Self-supervised learning methods → joint embedding approaches such as SimCLR and MoCo use the InfoNCE loss.

- Theoretical Understanding of Self-supervised Learning: papers on why CL is useful for downstream tasks.

Arora et al. A theoretical analysis of contrastive unsupervised representation learning. In ICML, 2019.

Lee et al. Predicting what you already know helps: Provable self-supervised learning. ArXiv, 2020.

Tosh et al. Contrastive learning, multi-view redundancy, and linear models. ArXiv, abs/2008.10150, 2021.

- Understanding self-supervised learning dynamics without contrastive pairs shows that, in BYOL and SimSiam, the alignment of eigenvectors between the predictor and its input correlation matrix helps prevent complete collapse.

- Implicit Regularization: in a linear neural network, gradient descent drives adjacent weight matrices to become aligned.

Dimensional Collapse

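C ∈ ℝ^{d×d} with d = 128, and C is the covariance matrix of the embedding layer output: C = (1/N) Σ_i (z_i - z̄)(z_i - z̄)ᵀ. Here z is the embedding from the figure at the very top, z̄ = (Σ_i z_i)/N, and N is the number of samples.

Each z is a 128-dimensional vector, and z̄ is the mean of z over the samples.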

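- In the figure on the right, the singular values σ_k are sorted and plotted as log(σ_k). From about the 95th one onward, they all drop to zero. We call these collapsed dimensions.

- Figure 2 is the result of running SVD on the covariance matrix C of the embedding space.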
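The spectrum check described above can be sketched in a few lines of NumPy. This is only a toy illustration, not the paper's experiment: the embeddings are synthetic, the helper name `singular_value_spectrum` is made up for this post, and k = 95 is a hypothetical choice that mimics the figure by confining the data to a 95-dimensional subspace of the 128-dimensional space.

```python
import numpy as np

def singular_value_spectrum(z):
    """Singular values (descending) of the covariance of embeddings z, shape (N, d)."""
    z_bar = z.mean(axis=0, keepdims=True)        # mean embedding
    centered = z - z_bar
    cov = centered.T @ centered / z.shape[0]     # C = (1/N) sum_i (z_i - z_bar)(z_i - z_bar)^T
    return np.linalg.svd(cov, compute_uv=False)  # only the singular values are needed

# Synthetic embeddings (assumption): N samples that live in a k-dim subspace of R^d.
rng = np.random.default_rng(0)
N, d, k = 2000, 128, 95
basis = rng.standard_normal((k, d))
z = rng.standard_normal((N, k)) @ basis

sigma = singular_value_spectrum(z)
log_sigma = np.log(sigma + 1e-12)                # small offset avoids log(0) when plotting

# Count singular values that are numerically zero relative to the largest one.
num_collapsed = int(np.sum(sigma < 1e-8 * sigma[0]))
print("collapsed dimensions:", num_collapsed)
```

With real embeddings, `z` would be replaced by the encoder outputs on a validation set, and `log_sigma` is what would be plotted against the sorted index.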

Dimensional Collapse Caused by Strong Augmentation (1)

1. Linear Model

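- Assume a single-layer network is used as the embedding layer, so z = Wx. The loss is InfoNCE. z and z' are the embeddings produced by the two branches seen in the first figure. Also, when z and z' are normalized to unit length, |z - z'|²/2 = 1 - zᵀz', so the negative distance -|z - z'|²/2 can be replaced by the inner product zᵀz' (up to an additive constant).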
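As a quick sanity check of the replacement described above, the sketch below evaluates an InfoNCE loss with in-batch negatives twice: once with the negative half squared distance as the similarity and once with the inner product. The exact loss form, the helper names (`info_nce`, `neg_half_sq_dist`), the shapes, and the noise-based "augmentation" are all assumptions made for illustration. For unit-norm embeddings the two similarities differ only by the constant 1, which cancels inside the softmax ratio, so the two losses should agree.

```python
import numpy as np

def info_nce(sim_pos, sim_all):
    """InfoNCE with in-batch negatives.
    sim_pos: (N,) similarity of each z_i with its positive z'_i.
    sim_all: (N, N) similarity of z_i with z_j; the diagonal (j == i) is ignored."""
    n = sim_pos.shape[0]
    off_diag = ~np.eye(n, dtype=bool)
    neg = np.sum(np.exp(sim_all) * off_diag, axis=1)  # sum over negatives j != i
    pos = np.exp(sim_pos)
    return -np.sum(np.log(pos / (neg + pos)))

def neg_half_sq_dist(a, b):
    """-|a_i - b_j|^2 / 2 for every pair (i, j)."""
    diff = a[:, None, :] - b[None, :, :]
    return -0.5 * np.sum(diff ** 2, axis=-1)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))                    # toy inputs
x_aug = x + 0.1 * rng.standard_normal((8, 16))      # toy "augmented" views
W = rng.standard_normal((4, 16))                    # single linear layer, z = W x

z, z_p = x @ W.T, x_aug @ W.T
z = z / np.linalg.norm(z, axis=1, keepdims=True)    # normalize to unit length
z_p = z_p / np.linalg.norm(z_p, axis=1, keepdims=True)

loss_dist = info_nce(np.diag(neg_half_sq_dist(z, z_p)), neg_half_sq_dist(z, z))
loss_dot = info_nce(np.sum(z * z_p, axis=1), z @ z.T)
print(np.allclose(loss_dist, loss_dot))
```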

2. Gradient flow dynamics

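- The first lemma: Ẇ denotes the time derivative of W under gradient flow, so Ẇ = -G, where G is the gradient of the loss with respect to W. Substituting the normalized-vector replacement from above gives the explicit form of G stated in the paper.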

- The proof is not difficult either: it follows from the chain rule.

- The gradients on the embedding vectors are obtained by expanding ∂L/∂z_i and ∂L/∂z'_i.

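- Based on the expressions above, G = -WX, a form similar to an ordinary linear layer. Note that G is the gradient with respect to the weight matrix W, which is different from g_{z_i}, the gradient on the embedding vector z_i.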

- Before explaining Lemma 2, I should first go and read Tian's paper, which this paper cites often.
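For reference, the chain-rule step behind the first lemma above can be written out as follows. This is a sketch under the linear-model assumption z_i = W x_i and z'_i = W x'_i, with g_{z_i} denoting the gradient of the loss on the embedding z_i; the explicit expressions for g_{z_i} are left to the paper.

```latex
% The loss depends on W only through the embeddings z_i = W x_i and z'_i = W x'_i.
\begin{aligned}
G \;=\; \frac{\partial L}{\partial W}
  &= \sum_i \left( \frac{\partial L}{\partial z_i}\, x_i^{\top}
     + \frac{\partial L}{\partial z'_i}\, {x'_i}^{\top} \right)
   = \sum_i \left( g_{z_i}\, x_i^{\top} + g_{z'_i}\, {x'_i}^{\top} \right), \\
\dot{W} &= -\frac{\partial L}{\partial W} \;=\; -G .
\end{aligned}
```

Each term is an outer product because ∂(W x_i)/∂W contributes the input x_i on the right, which is the same pattern as the weight gradient of an ordinary linear layer.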
