4 분 소요

논문링크


한줄요약 ✔

  • WSSS 사용할 시 Image-level weak supervision의 한계.
    • sparse object coverage
    • inaccurate object boundary
    • co-occurring pixels from non-target objects
  • Explicit Pseudo-pixel Supervision (EPS): two weak supervision으로 pixel-level feedback을 얻는다.
    • localization map \(\rightarrow\) distingusih different objects.
      • CAM(Class Activation Map)으로 생성.
    • saliency map \(\rightarrow\) rich boundary information.
      • Saliency detection model으로 생성.

Preliminaries 🍱

  • 일반적으로 WSSS의 전체적인 파이프라인은 two stage로 구성되어 있다.
    • pseudo-mask 생성 (image classifier 이용).
    • pseudo-mask를 GT로 사용하여 각 iteration마다 GT 갱신하며 recursive하게 segmentation model 학습.

Challenges and Main Idea💣

C1) sparse object coverage

C2) inaccurate object boundary

C3) co-occurring pixels from non-target objects

Idea) Explicit Pseudo-pixel Supervision (EPS) 제안


Problem Definition ❤️

Given a weak dataset \(\mathcal{D}\).

Return a model \(\mathcal{S}\).

Such that \(\mathcal{S}\) approximates the performance of its fully-supervised model \(\mathcal{T}\).


Proposed Method 🧿

Model Architecture (EPS)

[Figure 2]

image

  • Background를 포함해서 \(C+1\)개의 class로 분류하는 classifier로 \(C+1\)개의 localization map을 생성해서 saliency map과 비교한다.
  • Foreground
    • \(C\)개의 localization map을 합쳐서 foreground map을 생성하고, 이를 saliency map과 매칭시킨다. ⇒ improving boundaries of objects.
  • Background
    • background localization map과 saliency map의 바깥부분 \((1−M_s)\) 을 매칭시킨다. ⇒ mitigate the co-occuring pixels of non-target objects.
      • 고양이와 개가 함께 나타나는 이미지를 고려해 봅시다. 여기서 고양이와 개가 대상(target)이고, 다른 객체들은 대상이 아닌(non-target) 객체입니다. 이 때, “co-occurring pixels of non-target objects”는 예를 들어 바닥, 벽, 나무, 가구 등의 픽셀들로 구성될 것입니다. 이 픽셀들은 여러 객체들과 함께 등장하며, 각 객체의 일부가 될 수 있지만, 그 자체로는 특정 객체를 나타내지는 않습니다.

Loss Function

image

\(\mathcal{L}_{total}=\mathcal{L}_{sal}+\mathcal{L}_{cls}\)

  • \(\mathcal{L}_{sal}={1 \over H \cdot W} \Vert M_s - \hat{M}_s \Vert^2\) .
    • \(M_s\): the off-the-shelf saliency detection model– PFAN [51] trained on DUTS dataset
    • \(H \cdot W\)를 나누는 이유?
      • normalization: 이미지의 크기가 클수록 loss 값도 커질 가능성이 있습니다. 이를 방지하기 위해 loss 값을 이미지의 크기로 나누어 normalize합니다.
    • marked by red box/arrorw in Figure 2.
    • the sum of pixel-wise differences between our estimated saliency map and an actual saliency map.
    • involved in updating the parameters of \(C + 1\) classes, including target objects and the background.
  • \(\mathcal{L}_{cls}={-1 \over C} \Sigma^C_{i=1} y_i log \sigma (\hat{y_i}) + (1-y_i) log (1-\sigma (\hat{y_i}))\) .
    • \(\sigma\): sigmoid function.
    • 이진 교차 엔트로피 손실(Binary Cross-Entropy Loss) 공식.
    • marked by blue box/arrorw in Figure 2.
    • only evaluates the label prediction for \(C\) classes, excluding the background class.
      • the gradient from \(\mathcal{L}_{cls}\) does not flow into the background class.

Joint Training

  • By jointly training the two objectives, we can synergize the localization map and the saliency map with complementary information
  • we observe that noisy and missing information of each other is complemented via our joint training strategy, as illustrated in Figure 3
    • Missing objects: \((c)\)에서 놓친 의자와 보트 객체를 \((d)\)에서는 잘 segment하는 모습.
    • noise 제거: \((c)\)에 존재하는 비행기의 contrail(연기) 같은 것들이 \((d)\)에서는 제거된 모습.

\(\hat{M}_s=\lambda M_{fg} + (1-\lambda)(1-M_{bg})\)

  • \(\hat{M}_s\): estimated saliency map.
  • \(M_s\): actual saliency map.
  • \(M_{fg}\): foreground saliency map.
    • \(M_{fg}=\Sigma^C_{i=1} y_i \cdot M_i \cdot \mathbb{1} [ \mathcal{O} (M_i,M_s) > \tau]\) .
      • \(\mathcal{O}(M_i,M_s)\): is the function to compute the overlapping ratio between \(M_i\) and \(M_s\).
      • \(M_i\): \(i\)-th localization map.
        • Assigned to the foreground if \(M_i\) is overlapped with the saliency map more than \(\tau%\)%, otherwise the background.
      • \(y_i \in \mathbb{R}^C\): binary image-level label \([0 or 1]\).
        • 모델 예측이 객체가 존재하는 경우에 대해서만 각 객체에 대한 saliency map을 합하여 최종 foreground saliency map을 구성한다.
  • \(M_{bg}\): background saliency map.
    • \(M_{bg}=\Sigma^C_{i=1} y_i \cdot M_i \cdot \mathbb{1} [ \mathcal{O} (M_i,M_s) <= \tau] + M_{C+1}\) .
  • \(\lambda \in [0,1]\): a hyperparameter to adjust a weighted sum of the foreground map and the inversion of the background map.

Experiment 👀

image

Setup

Datasets

  • PASCAL VOC 2012 and MS COCO 2014
  • augmented training set with 10,582 images
    • Augmentation 빼면 성능 어떤가??

Baseline

  • ResNet38 pre-trained on ImageNet

Boundary Mismatch Problem

image

Co-occurrence Problem

  • What is it?
    • Some background classes frequently appear with target objects in PASCAL VOC 2012
  • Dataset: PASCAL-CONTEXTdataset.
    • provides pixel-level annotations for a whole scene (e.g., water and railroad).
  • Evaluation:

image

  • We choose three co-occurring pairs; boat with water, train with railroad, and train with platform. We compare IoU for the target class and the confusion ratio \(m_{k,c}={FP_{k,c} \over TP_c}\) between a target class and its coincident class.
    • \(FP_{k,c}\): the number of pixels mis-classified as the target class \(c\) for the coincident class \(k\).
    • \(TP_c\): the number of true-positive pixels for the target class \(c\).
    • \(k\): the coincident class.
    • \(c\): the target class.
  • SEAM 는 특이하게 self-supervised training을 기반으로 하기 때문에 서로 다른 객체에 대해 겹치는 픽셀들에 잘못된 target object 레이블을 할당하고 이것을 정답 레이블로 학습에 활용하여 더 confusion ratio가 더 높게 측정되는 모습이다.

Map Selection Strategies

image

  • Naive strategy:
    • The foreground map is the union of all object localization maps; the background map equals the localization map of the background class.
  • Pre-defined class strategy:
    • We follow the naive strategy with the following exceptions. The localization maps of several pre-determined classes (e.g., sofa, chair, and dining table) are assigned to the background map (i.e., pre-defined class strategy)
      • 특정 객체들이 target objects가 아님을 알고 미리 배경으로 지정.
  • Our adaptive strategy:
    • 상기 EPS 내용 참조.

Our adaptive 가 가장 IoU 수치가 높고, 이는 해당 전략이 target object를 가장 잘 표현함을 의미한다.

Comparison with state-of-the-arts

Accuracy of pseudo-masks

image

image

Accuracy of segmentation maps

image

image

image

image

Effect of saliency detection models

  • Notably, our EPS using the unsupervised saliency model outperforms all existing methods using the supervised saliency model

Open Reivew 💗

NA


Discussion 🍟

NA


Major Takeaways 😃

NA


Conclusion ✨

  • We propose a novel weakly supervised segmentation framework, namely explicit pseudo-pixel supervision (EPS).

Reference

NA

댓글남기기