[논문분석] Learning Transferable Visual Models From Natural Language Supervision (ICMR 2021)

2 분 소요

논문링크

한줄요약 ✔

Contrastive Language–Image Pre-training (CLIP)은 Image-Language를 서로 Contrastive learning하는 방법론이다.
- 하나의 이미지와 그 이미지를 설명하는 텍스트를 매칭시키는 문제를 푸는 모델.

Preliminaries 🍱

Image Captioning

Traditional Image Captioning with a focus on raw image pixels generates language descriptions for images.

Zero-Shot Prediction

Zero-Shot Prediction은 특정 downstream task의 dataset을 일절 사용하지 않고 기학습 모델로 추론하는 것.

Challenges and Main Idea💣

Problem Definition ❤️

Given a dataset \(\mathcal{D}\) with pairs of an image and a text.

Return a pre-trained model \(\mathcal{S}\).

Such that \(\mathcal{S}\) achieves the SOTA performance in various tasks.

Proposed Method 🧿

Training Dataset

인터넷에 존재하는 약 4억 개의 (image ,text) 데이터 쌍 (ImageNet은 1천 4백만장의 데이터가 존재한다).
- 가령, 상기 이미지에서 (<강아지 이미지>, “pepper the aussie pup”) .

Training

Contrastive Pre-training

Zero-shot 학습이라서 Pre-training 이후 추가적인 fine-tuning 과정이 없다.
cosine similarity 연산으로 (이미지, 텍스트) 매칭을 학습한다.

Image Encoder

이미지 벡터화.

이미지 \(X_{i,j}\)를 Image Encoder에 통과시켜 벡터값 \(X^v_{i,j}\)으로 표현한다.
- \(X_{i,j}\): \(j\)번째 Batch의 \(i\)번째 입력 이미지.
\(X^v_{i,j}\)에 가중치 행렬 \(W^O_{image}\)를 곱하여 \(X^{vw}_{i,j}\)를 얻는다.
\(X^{vw}_{i,j}\)에 L2-normalization을 적용하여 \(I_{i,j}\) 임베딩을 얻는다.

Text Encoder

텍스트 벡터화.

텍스트 \(Y_{i,j}\)를 Text Encoder에 통과시켜 벡터값 \(Y^v_{i,j}\)으로 표현한다.
- \(Y_{i,j}\): \(j\)번째 Batch의 \(i\)번째 입력 Text.
- 문장의 끝인 [EOS] 토큰을 해당 text를 표현하는 벡터로 학습시켜 \(T\) 계산.
\(Y^v_{i,j}\)에 가중치 행렬 \(W^O_{text}\)를 곱하여 \(Y^{vw}_{i,j}\)를 얻는다.
\(Y^{vw}_{i,j}\)에 L2-normalization을 적용하여 \(T_{i,j}\) 임베딩을 얻는다.

Cosine Similarity

이전 단계에서 이미 L2 정규화를 거쳤기 때문에 단순 내적으로 cosine similarity 계산 가능.
- 정규화 후 서로 다른 두 벡터의 크기는 \(1\), 방향 벡터는 그대로 보존된다.
초록색 점선 : \(ImageLoss\).
빨간색 점선 : \(TextLoss\).

Loss Function

\(\mathcal{L}=\frac{<ImageLoss>+<TextLoss>}{2}\)

가령, \(ImageLoss\)는 하기 이미지에서 \(i\)번째 batch에 대해 계산한다.

\(N\): batch size.

Create Classifier from Label Text

Downstream task의 dataset에 존재하는 label을 텍스트로 변환.
- i.e., plane, car, dog, etc.
각 label을 “a photo of a (object)” phrase에서 (object)의 파라미터 값으로 활용하여 text \(\mathcal{T}\)를 생성.
\(\mathcal{T}\)를 Text Encoder에 통과시켜서 Text Vector \(\bar{\mathcal{T}}\)도출.

Zero-shot Prediction

Downstream task의 인풋 이미지를 Image Encoder에 통과시켜서 이미지 벡터 \(I_i\) 획득.
- Image Encoder은 pre-training 단계에서 학습됨.
\(I_i\)과 이전에 구한 Text Vector \(\bar{\mathcal{T}}\) 간의 cosine similarity 계산.

Experiment 👀

실험 내용이 많아서 주요한 결과만 보고 나머지는 각설한다.

Classification

Linear Probe:
- Fine-Tuning을 하는 방법 중 하나로 Output을 출력하는 Top Layer 부분을 데이터셋별로 다르게 학습하고, 그 이전 레이어의 parameters는 고정한다.
27개중 16개의 데이터셋에서 CLIP의 Zero-Shot Prediction이 Linear Probe 보다 성능이 비교적 좋은 것을 확인.

v. Few-shot

Applicability

CLIP-ViT는 기존 ImageNet의 pre-trained 모델에 비해 다양한 task에 더 유연하고 일반화된 모델.

Open Reivew 💗

Discussion 🍟

Major Takeaways 😃

Conclusion ✨

Reference

[1] 논문 리뷰: CLIP

Twitter Facebook LinkedIn