MSPL: Multi-Step Pseudo-Labeling for
Open-Vocabulary Object Detection

¹KAIST AI, ²Boston University
ECCV 2026
[Figure: MSPL teaser]

MSPL builds an offline-to-online pipeline for open-vocabulary object detection. Offline, it turns foundational segmentation outputs into reliable open-vocabulary pseudo-labels through three-step reasoning. Online, these labels, region descriptions, and foreground/background decisions supervise detector training so that the model can detect base classes, novel classes, and objects that are otherwise unlabeled in the original training set.

Abstract

Open-vocabulary object detection (OVD) aims to recognize and localize object categories beyond the training set. Recent approaches leverage vision-language models to generate pseudo-labels using image-text alignment, allowing detectors to generalize to unseen classes without explicit supervision. However, these methods depend heavily on single-step image-text matching, neglecting the intermediate reasoning steps crucial for interpreting semantically complex visual contexts, such as crowding or occlusion. In this paper, we introduce MSPL, a framework that incorporates multi-step visual reasoning into the pseudo-labeling process for OVD. It decomposes complex scene understanding into three interpretable steps (object localization, category recognition, and background grounding), and the intermediate reasoning states serve as rich supervision sources. Extensive experiments on standard OVD evaluation protocols demonstrate that MSPL achieves state-of-the-art performance with superior pseudo-labeling efficiency, outperforming the strong baseline by 9.4 AP50 for novel classes on OV-COCO and improving box and mask APr (AP on rare categories) by 3.2 and 2.2, respectively, on OV-LVIS.

Intro

Limitations of single-step pseudo-labeling

Existing pseudo-labeling methods often reduce novel-class discovery to a single image-text matching step. This can assign labels from surrounding context rather than region-specific evidence, miss objects that are absent from captions, and absorb unlabeled or occluded objects into the background. MSPL addresses these failure modes by explicitly separating localization, recognition, and background grounding instead of compressing the whole scene into one matching step.

Method

Offline MSPL pseudo-label generation

In the offline stage, MSPL first obtains object-level region proposals and modulates visual context so the target region remains clear while distracting non-target areas are attenuated. A multimodal LLM then reasons over each candidate in three steps: verifying whether an object exists, assigning an open-vocabulary category with a region-level description, and deciding whether the predicted concept is foreground or background. The retained pseudo-labels become semantic anchors for training.
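The offline flow described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `PseudoLabel` record, the `modulate_context` dimming rule, and the three callables `exists`, `recognize`, and `is_foreground` (stand-ins for queries to a multimodal LLM) are all hypothetical names introduced here. It also assumes that every candidate passing step 1 is retained, with foreground and background concepts kept separately for later use.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]   # (x0, y0, x1, y1), upper bounds exclusive
Image = List[List[float]]         # toy grayscale image: rows of intensities


@dataclass
class PseudoLabel:
    box: Box
    category: str
    description: str
    is_foreground: bool


def modulate_context(image: Image, box: Box, dim: float = 0.3) -> Image:
    """Return a copy where pixels outside `box` are scaled by `dim`.

    Assumption: simple dimming stands in for the context modulation
    mentioned in the text; the actual operation is not specified there.
    """
    x0, y0, x1, y1 = box
    return [
        [px if (x0 <= x < x1 and y0 <= y < y1) else px * dim
         for x, px in enumerate(row)]
        for y, row in enumerate(image)
    ]


def label_region(
    image: Image,
    box: Box,
    exists: Callable[[Image, Box], bool],
    recognize: Callable[[Image, Box], Tuple[str, str]],
    is_foreground: Callable[[Image, Box, str], bool],
) -> Optional[PseudoLabel]:
    """Run the three reasoning steps on one candidate region."""
    view = modulate_context(image, box)
    # Step 1 (object localization): is there really an object here?
    if not exists(view, box):
        return None
    # Step 2 (category recognition): open-vocabulary name plus description.
    category, description = recognize(view, box)
    # Step 3 (background grounding): foreground or background concept?
    foreground = is_foreground(view, box, category)
    return PseudoLabel(box, category, description, foreground)


def generate_pseudo_labels(image, boxes, exists, recognize, is_foreground):
    """Label every proposal and drop the ones rejected in step 1."""
    results = (label_region(image, b, exists, recognize, is_foreground)
               for b in boxes)
    return [label for label in results if label is not None]
```

Keeping the three steps as separate callables mirrors the point of the method: each intermediate decision is available on its own, so the description and the foreground/background flag can be reused as supervision instead of being collapsed into a single matching score.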

Online MSPL training framework

In the online stage, MSPL uses the offline annotations only as training supervision. Pseudo-labels expand the open-vocabulary base set, region-level descriptions provide fine-grained language supervision through Region-Text Alignment, and background concepts serve as negatives for Contrastive Background Learning. This design improves feature disentanglement while avoiding inference-time dependency on pseudo-labels.
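The text above does not give the loss formulas, so the following is only a generic InfoNCE-style sketch of how a region embedding could be pulled toward one text embedding and pushed away from others. The function names, the temperature value, and the mapping of positives and negatives to Region-Text Alignment and Contrastive Background Learning are assumptions made for illustration.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; both vectors are assumed to be non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(
    region: Vector,
    positive: Vector,
    negatives: Sequence[Vector],
    temperature: float = 0.07,   # assumed value, not taken from the paper
) -> float:
    """Cross-entropy of selecting `positive` among positive + negatives."""
    logits = [cosine(region, positive) / temperature]
    logits += [cosine(region, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum - logits[0]


# Hypothetical use for Region-Text Alignment: the positive is the embedding
# of the region's own description, negatives are other descriptions.
# Hypothetical use for Contrastive Background Learning: the positive is the
# category text embedding, negatives are background-concept embeddings.
region = [0.9, 0.1]
description = [1.0, 0.0]
background_concepts = [[0.0, 1.0], [-1.0, 0.2]]
loss = contrastive_loss(region, description, background_concepts)
```

Because both terms only shape the training signal, nothing in this sketch is needed at inference time, which matches the statement that the detector does not depend on pseudo-labels once trained.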

Results

MSPL achieves strong performance on standard open-vocabulary detection benchmarks. On OV-COCO, it improves novel-class detection with both RN50 and RN50×4 backbones under pseudo-annotation supervision. The MSPL rows show that it reaches 43.4 novel-class AP50 (APN50) with RN50 and 47.8 with RN50×4, while keeping competitive base-class AP50 (APB50) of 58.9 and 60.9, respectively.

Instance labels in CB only (CLIP supervision)

| Method | Backbone | APN50 (novel) | APB50 (base) |
|---|---|---|---|
| ViLD-ens | RN50 (24M) | 27.6 | 51.3 |
| BARON | RN50 (24M) | 34.0 | 60.4 |
| CORA | RN50 (24M) | 35.1 | 35.4 |
| BIND | ViT-B/16 (86M) | 36.3 | 50.2 |
| CLIP-Self | ViT-B/16 (86M) | 37.6 | - |
| LBP | RN50 (24M) | 37.8 | 58.7 |
| CCKT-Det | RN50 (24M) | 38.0 | 35.0 |
| CAKE | RN50 (24M) | 38.2 | - |
| OV-DQUO | RN50 (24M) | 39.2 | - |
| DeCo-DETR | RN50 (24M) | 41.3 | - |
| BIND | ViT-L/16 (307M) | 41.5 | 54.8 |
| CCKT-Det | SwinB (88M) | 41.9 | 40.9 |
| CORA+ | RN50×4 (87M) | 43.4 | 43.8 |
| CLIP-Self | ViT-L/14 (307M) | 44.3 | - |
| OV-DQUO | RN50×4 (87M) | 45.6 | - |

Extra caption datasets, weak labels, or pseudo labels in CB ∪ CN

| Method | Supervision | Backbone | APN50 (novel) | APB50 (base) |
|---|---|---|---|---|
| Detic | IN21K & CC3M | RN50 (24M) | 27.8 | 42.0 |
| OV-DETR | Pseudo annotations | RN50 (24M) | 29.4 | 52.7 |
| CoDet | CC3M & COCO Caption | RN50 (24M) | 30.6 | 46.4 |
| PB-OVD | COCO Caption | RN50 (24M) | 30.8 | 46.4 |
| VL-PLM | Pseudo annotations | RN50 (24M) | 34.4 | 60.2 |
| RegionCLIP | CC3M | RN50 (24M) | 35.2 | 57.6 |
| OC-OVD | COCO Caption | RN50 (24M) | 36.6 | 49.4 |
| SAS-Det | COCO Caption | RN50 (24M) | 37.4 | 58.5 |
| DITO | LAION-2B | ViT-B/16 (86M) | 36.6 | 48.8 |
| LP-OVOD | Pseudo annotations | RN50 (24M) | 40.5 | 60.5 |
| MSPL (Ours) | Pseudo annotations | RN50 (24M) | 43.4 | 58.9 |
| CFM-ViT | LAION-2B | ViT-L/16 (307M) | 34.3 | 46.4 |
| RegionCLIP | CC3M | RN50×4 (87M) | 39.3 | 61.6 |
| DITO | DataComp-1B | ViT-L/16 (307M) | 40.2 | 54.6 |
| CORA+ | COCO Caption | RN50×4 (87M) | 43.1 | 56.2 |
| MSPL (Ours) | Pseudo annotations | RN50×4 (87M) | 47.8 | 60.9 |

Main results on OV-COCO. MSPL rows are labeled "(Ours)".
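As a reading aid, the novel-class margins within the second group (extra captions, weak labels, or pseudo labels) can be recomputed from the reported numbers. The snippet below is only a convenience sketch: the helper name `margin_over_best_prior` is introduced here, and the rows are copied from the RN50 and RN50×4 entries of that group.

```python
# (method, backbone, novel-class AP50) copied from the OV-COCO results
ROWS = [
    ("Detic", "RN50", 27.8),
    ("OV-DETR", "RN50", 29.4),
    ("CoDet", "RN50", 30.6),
    ("PB-OVD", "RN50", 30.8),
    ("VL-PLM", "RN50", 34.4),
    ("RegionCLIP", "RN50", 35.2),
    ("OC-OVD", "RN50", 36.6),
    ("SAS-Det", "RN50", 37.4),
    ("LP-OVOD", "RN50", 40.5),
    ("MSPL (Ours)", "RN50", 43.4),
    ("RegionCLIP", "RN50×4", 39.3),
    ("CORA+", "RN50×4", 43.1),
    ("MSPL (Ours)", "RN50×4", 47.8),
]


def margin_over_best_prior(rows, backbone, ours="MSPL (Ours)"):
    """Return (best prior method, margin in AP50 points) for one backbone."""
    same = [(m, ap) for m, b, ap in rows if b == backbone]
    our_ap = next(ap for m, ap in same if m == ours)
    best_method, best_ap = max(
        ((m, ap) for m, ap in same if m != ours), key=lambda pair: pair[1]
    )
    # Round to one decimal to hide floating-point noise in the subtraction.
    return best_method, round(our_ap - best_ap, 1)


for backbone in ("RN50", "RN50×4"):
    method, margin = margin_over_best_prior(ROWS, backbone)
    print(f"{backbone}: +{margin} over {method}")
```

With these rows, the margin is 2.9 points over LP-OVOD for RN50 and 4.7 points over CORA+ for RN50×4.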

BibTeX

@inproceedings{choi2026mspl,
  title     = {MSPL: Multi-Step Pseudo-Labeling for Open-Vocabulary Object Detection},
  author    = {Choi, Hojun and Lim, Youngsun and Shin, Jaeyo and Shim, Hyunjung},
  booktitle = {European Conference on Computer Vision},
  year      = {2026}
}