4 min read

Paper Link


One-line Summary ✔

  • Weakly-supervised learning usually comes with many noisy labels.
    • This is why it still performs worse than fully-supervised models.
  • The benefits of existing WS methods are overstated.
    • Clean validation samples:
      • Validation on data with correct labels; used for early stopping, meta-learning, etc.
      • If these samples were instead included in the training dataset, performance would be even better.
        • A model fine-tuned on them as training data outperforms recent WSL models that only use them as a validation set (Figure 1).
          • WSL models do still outperform plain weak labels, though.
  • Existing models simply learn to tie together examples with similar linguistic correlations, which can hurt generalization because of bias.
    • Further tuning that forces contradictory labels (e.g., a negative label onto a positive example) → improved generalization.
  • Contributions:
    • Fine-tuning that uses the validation samples as training data.
      • Revisiting the true benefits of WSL.

Preliminaries 🍱

Weak Supervision

  • definition
    • proposed to ease the annotation bottleneck in training machine learning models.
    • uses weak sources to automatically annotate the data
  • drawback
    • its annotations can be noisy (some annotations are incorrect), causing poor generalization
    • solutions
      • to re-weight the impact of individual examples in the loss computation (see the sketch after this list)
      • to train noise-robust models using knowledge distillation (KD)
        • optionally equipped with meta-learning, since KD alone can be fragile
      • to leverage the knowledge of pre-trained LLMs
  • datasets
    • WRENCH
    • WALNUT
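
As a concrete illustration of the re-weighting idea, here is a minimal sketch of a per-example weighted cross-entropy. The weights below are hand-picked placeholders; methods like L2R learn such weights via meta-learning rather than setting them manually.

```python
import torch
import torch.nn.functional as F

def weighted_cross_entropy(logits, noisy_labels, sample_weights):
    """Cross-entropy where each (possibly noisy) example contributes with its own weight."""
    per_example = F.cross_entropy(logits, noisy_labels, reduction="none")
    return (sample_weights * per_example).sum() / sample_weights.sum().clamp(min=1e-8)

# Toy usage: 3 examples, 2 classes; the second example is suspected to be mislabeled.
logits = torch.tensor([[2.0, -1.0], [0.3, 0.1], [-1.5, 2.2]])
labels = torch.tensor([0, 1, 1])
weights = torch.tensor([1.0, 0.2, 1.0])
print(weighted_cross_entropy(logits, labels, weights))
```

Down-weighting suspected noisy examples limits how much an incorrect weak label can pull the decision boundary.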

Realistic Evaluation

Semi-supervised Learning

  • often trains with a few hundred training examples while retaining thousands of validation samples for model selection
    • manual labeling by humans is the bottleneck in the first place, so keeping thousands of labeled validation samples is unrealistic
  • realistic evaluations instead discard the validation set and use a fixed set of hyperparameters across datasets
    • rather than having a person hand-search the optimal hyperparameter combination for each dataset
  • prompt-based few-shot learning
    • sensitive to prompt selection and requires additional data for prompt evaluation
      • which contradicts the point of few-shot learning!
    • example prompt: "Decide whether the following review is positive or negative:"
  • recent work
    • omits fine-grained model selection
    • strictly controls the number of validation samples

Significance

  • To our knowledge, no similar work exists exploring the aforementioned problems in the context of weak supervision.

Challenges and Main Idea💣

C1) Existing WS methods overstate their benefits given how much clean data they actually use.

Idea) Including that clean validation data in the training dataset instead yields better performance.

C2) Existing WS methods carry a bias from how they are trained.

Idea) Further tuning that forces contradictory labels (e.g., a negative label onto a positive example) → improved generalization.


Problem Definition ❤️

Given a model trained on \(D_w \sim \mathcal{D}_n\).

Return a model.

Such that it generalizes well on \(D_{test} \sim \mathcal{D}_c\).


Methodology 👀

Setup

Formulation

  • \(\mathcal{X}\): feature space.
  • \(\mathcal{Y}\): label space.
    • \(\hat{y}_i\): label obtained from weak labeling sources; can differ from the ground-truth label \(y_i\).
  • \(D=\{(x_i,y_i)\}_{i=1}^{N}\).
    • \(\mathcal{D}_c\): clean data distribution.
    • \(D_w\): weakly labeled dataset.
    • \(\mathcal{D}_n\): noisy distribution.
  • The goal of WSL algorithms is to obtain a model that generalizes well on \(D_{test} \sim \mathcal{D}_c\) despite being trained on \(D_w \sim \mathcal{D}_n\).
  • baseline: RoBERTa-base.
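
For concreteness, a minimal sketch of fine-tuning RoBERTa-base on weakly labeled pairs \((x_i, \hat{y}_i)\). The toy texts and weak labels are made up, and this is just standard Hugging Face fine-tuning, not the paper's exact training code.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical stand-ins for D_w: texts x_i paired with weak labels y_hat_i.
weak_texts = ["the movie was great", "terrible service", "loved every minute"]
weak_labels = torch.tensor([1, 0, 1])

tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

enc = tok(weak_texts, padding=True, truncation=True, return_tensors="pt")
optim = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # a few passes over the tiny weakly labeled set
    out = model(**enc, labels=weak_labels)  # HF models return a loss when labels are passed
    out.loss.backward()
    optim.step()
    optim.zero_grad()
```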

Datasets


  • eight datasets covering different NLP tasks in English

WSL baselines


  • \(FT_W\): standard fine-tuning approach for WSL.
  • \(L2R\): meta-learning to determine the optimal weights for each noisy training sample.
  • \(MLC\): meta-learning for the meta-model to correct the noisy labels.
  • \(BOND\): noise-aware self-training framework designed for learning with weak annotations.
  • \(COSINE\): self-training with contrastive regularization to improve noise robustness further.
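
The self-training baselines (BOND, COSINE) revolve around replacing weak labels with the model's own confident predictions. Below is a rough sketch of that core step only; the threshold and the keep-only-confident logic are illustrative, not the exact procedures from either paper.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def confident_pseudo_labels(model, batch_inputs, threshold=0.9):
    """Return pseudo-labels and a mask for predictions above `threshold` confidence.

    A later training step would fine-tune on the masked (inputs, pseudo-labels)
    instead of the original weak labels; the details differ per method.
    """
    model.eval()
    probs = F.softmax(model(**batch_inputs).logits, dim=-1)
    confidence, pseudo = probs.max(dim=-1)
    mask = confidence >= threshold
    return pseudo, mask
```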

y-axis (relative performance improvement over weak labels):

\[G_{\alpha}=\frac{P_{\alpha}-P_{WL}}{P_{WL}}\]
  • \(P_{\alpha}\): the performance achieved by a certain WSL method \(\alpha\).
  • \(P_{WL}\): the performance achieved by the weak labels alone.
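
A quick made-up example of how to read this metric: if the weak labels alone reach \(P_{WL}=0.70\) accuracy and a WSL method reaches \(P_{\alpha}=0.84\), then

\[G_{\alpha}=\frac{0.84-0.70}{0.70}=0.20,\]

i.e. a 20% relative improvement over the weak labels.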

⇒ Without clean validation samples, existing WSL approaches do not work.

But is WSL really doomed to perform poorly without any clean data?

Clean Data


⇒ a small amount of clean validation samples may be sufficient for current WSL methods to achieve good performance


⇒ the advantage of using WSL approaches vanishes when we have as few as 10 clean samples per class

Continuous Fine-tuning (CFT)


  • CFT
    • In the first phase, we apply WSL approaches on the weakly labeled training set, using the clean data for validation.
    • In the second phase, we take the model trained on the weakly labeled data as a starting point and continue to train it on the clean data.
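
A minimal sketch of the two phases; the loaders are assumed to yield Hugging Face-style batches that include labels, and the epoch counts and optimizer are placeholders rather than the paper's settings.

```python
import torch

def continuous_fine_tuning(model, weak_loader, clean_loader, epochs=(3, 3), lr=2e-5):
    """Phase 1: train on weakly labeled batches (the clean data would only be used
    for validation / model selection, omitted here).
    Phase 2 (CFT): keep the same weights and continue training on the clean batches."""
    optim = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for loader, n_epochs in ((weak_loader, epochs[0]), (clean_loader, epochs[1])):
        for _ in range(n_epochs):
            for batch in loader:
                loss = model(**batch).loss  # HF-style models return a loss when labels are in the batch
                loss.backward()
                optim.step()
                optim.zero_grad()
    return model
```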

⇒ the net benefit of using sophisticated WSL approaches may be significantly overestimated and impractical for real-world use cases.

  • Plain continued fine-tuning on the clean data already gives a substantial performance boost over existing WSL methods (even when the amount of clean data is low).
  • For L2R on the Yelp dataset, performance actually drops after CFT; since L2R already uses the validation loss to update its parameters, the extra value of the validation samples may simply be smaller there.


⇒ Pre-training on more data clearly helps to overcome biases from weak labels.

  • pre-training provides the model with an inductive bias to seek more general linguistic correlations instead of superficial correlations from the weak labels


⇒ contradictory samples play a more important role here, and at least a minimal set of contradictory samples is required for CFT to be beneficial


Open Review 💗

NA


Discussion 🍟

NA


Major Takeaways 😃

NA


Conclusion ✨

Strength

  • If a proposed WSL method requires extra clean data, such as for validation, then the simple \(FT_W\)+CFT baseline should be included in the evaluation to claim the real benefits gained by applying the method.

Weakness

  • it may be possible to perform model selection by utilizing prior knowledge about the dataset
  • For low-resource languages where no PLMs are available, training may not be that effective
  • We have not extended our research to more diverse types of weak labels

Reference

NA
