MSPL: Multi-Step Pseudo-Labeling for Open-Vocabulary Object Detection

MSPL builds an offline-to-online pipeline for open-vocabulary object detection. Offline, it converts the outputs of a foundation segmentation model into reliable open-vocabulary pseudo-labels through three-step reasoning. Online, these pseudo-labels, together with region descriptions and foreground/background decisions, supervise detector training so that the model detects base classes, novel classes, and objects left unlabeled in the original training set.

Abstract

Open-vocabulary object detection (OVD) aims to recognize and localize object categories beyond the training set. Recent approaches leverage vision-language models to generate pseudo-labels using image-text alignment, allowing detectors to generalize to unseen classes without explicit supervision. However, these methods depend heavily on single-step image-text matching, neglecting the intermediate reasoning steps crucial for interpreting semantically complex visual contexts, such as crowding or occlusion. In this paper, we introduce MSPL, a framework that incorporates multi-step visual reasoning into the pseudo-labeling process for OVD. It decomposes complex scene understanding into three interpretable steps (object localization, category recognition, and background grounding), and the intermediate reasoning states from these steps serve as rich sources of supervision. Extensive experiments on standard OVD evaluation protocols demonstrate that MSPL achieves state-of-the-art performance with superior pseudo-labeling efficiency, outperforming the strong baseline by 9.4 AP₅₀ for novel classes on OV-COCO and improving box and mask AP_r by 3.2 and 2.2, respectively, on OV-LVIS.

Intro

Limitations of single-step pseudo-labeling

Existing pseudo-labeling methods often reduce novel-class discovery to a single image-text matching step. This can assign labels from surrounding context rather than region-specific evidence, miss objects that are absent from captions, and absorb unlabeled or occluded objects into the background. MSPL addresses these failure modes by explicitly separating localization, recognition, and background grounding instead of compressing the whole scene into one matching step.

Method

In the offline stage, MSPL first obtains object-level region proposals and modulates visual context so the target region remains clear while distracting non-target areas are attenuated. A multimodal LLM then reasons over each candidate in three steps: verifying whether an object exists, assigning an open-vocabulary category with a region-level description, and deciding whether the predicted concept is foreground or background. The retained pseudo-labels become semantic anchors for training.
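The offline stage described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Box` format, the `PseudoLabel` record, the `dim` attenuation factor, the prompt wording, and the `ask` callable standing in for a multimodal LLM are all hypothetical names and choices introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]   # hypothetical (x0, y0, x1, y1) pixel box
Image = List[List[float]]         # toy grayscale image as nested lists


@dataclass
class PseudoLabel:
    box: Box
    category: str
    description: str
    is_foreground: bool


def modulate_context(image: Image, box: Box, dim: float = 0.3) -> Image:
    """Keep the target region unchanged and attenuate everything outside it."""
    x0, y0, x1, y1 = box
    return [
        [px if (x0 <= x < x1 and y0 <= y < y1) else px * dim
         for x, px in enumerate(row)]
        for y, row in enumerate(image)
    ]


def three_step_label(image: Image, box: Box,
                     ask: Callable[[Image, str], str]) -> Optional[PseudoLabel]:
    """Run the three reasoning steps on one candidate region.

    `ask(view, prompt)` is an assumed interface to a multimodal LLM.
    """
    view = modulate_context(image, box)

    # Step 1: verify that the region actually contains an object.
    if ask(view, "Is there an object in the clear region? yes/no").strip().lower() != "yes":
        return None

    # Step 2: assign an open-vocabulary category and a region-level description.
    category = ask(view, "Name the object category.").strip()
    description = ask(view, "Describe the object in one sentence.").strip()

    # Step 3: decide whether the predicted concept is foreground or background.
    answer = ask(view, f"Is '{category}' foreground or background?")
    return PseudoLabel(box, category, description,
                       answer.strip().lower() == "foreground")


def fake_mllm(view: Image, prompt: str) -> str:
    """Stub model used only to make the sketch runnable."""
    if "Is there" in prompt:
        return "yes"
    if "Name" in prompt:
        return "cat"
    if "Describe" in prompt:
        return "a cat lying on a sofa"
    return "foreground"


image = [[1.0] * 4 for _ in range(4)]
label = three_step_label(image, (1, 1, 3, 3), fake_mllm)
```

Background decisions are kept in the record rather than discarded, since the online stage uses background concepts as negatives.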

In the online stage, MSPL uses the offline annotations only as training supervision. Pseudo-labels expand the open-vocabulary base set, region-level descriptions provide fine-grained language supervision through Region-Text Alignment, and background concepts serve as negatives for Contrastive Background Learning. This design improves feature disentanglement while avoiding inference-time dependency on pseudo-labels.
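One way to picture how a region description can act as a positive and background concepts as negatives is an InfoNCE-style contrastive loss. The sketch below is an assumption made for illustration: the text above does not specify the loss form, and the function names, the temperature value, and the toy embeddings are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(region: Vector, positive: Vector,
                     negatives: List[Vector], temperature: float = 0.07) -> float:
    """InfoNCE-style loss (assumed form, not taken from the paper).

    `positive` plays the role of the region-level description embedding,
    `negatives` the role of background concept embeddings.
    """
    logits = [cosine(region, positive) / temperature]
    logits += [cosine(region, neg) / temperature for neg in negatives]
    # Numerically stable log-sum-exp over all logits.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]


text_embedding = [1.0, 0.0]          # toy description embedding
background_embeddings = [[0.0, 1.0]]  # toy background concept embedding

aligned = contrastive_loss([1.0, 0.0], text_embedding, background_embeddings)
misaligned = contrastive_loss([0.0, 1.0], text_embedding, background_embeddings)
```

In this toy setup the loss is near zero when the region embedding matches its description and large when it matches the background concept. Because the loss is only computed during training, nothing in it needs to run at inference time.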

Results

MSPL achieves strong performance on standard open-vocabulary detection benchmarks. On OV-COCO, it improves novel-class detection with both RN50 and RN50×4 backbones under pseudo-annotation supervision, reaching 43.4 AP^N₅₀ with RN50 and 47.8 AP^N₅₀ with RN50×4 while preserving competitive base-class performance (58.9 and 60.9 AP^B₅₀, respectively).

Instance labels in C_B only

Methods	Backbone	AP^N₅₀	AP^B₅₀
Instance labels in C_B (CLIP Supervision)
ViLD-ens	RN50 (24M)	27.6	51.3
BARON	RN50 (24M)	34.0	60.4
CORA	RN50 (24M)	35.1	35.4
BIND	ViT-B/16 (86M)	36.3	50.2
CLIP-Self	ViT-B/16 (86M)	37.6	-
LBP	RN50 (24M)	37.8	58.7
CCKT-Det	RN50 (24M)	38.0	35.0
CAKE	RN50 (24M)	38.2	-
OV-DQUO	RN50 (24M)	39.2	-
DeCo-DETR	RN50 (24M)	41.3	-
BIND	ViT-L/16 (307M)	41.5	54.8
CCKT-Det	SwinB (88M)	41.9	40.9
CORA+	RN50×4 (87M)	43.4	43.8
CLIP-Self	ViT-L/14 (307M)	44.3	-
OV-DQUO	RN50×4 (87M)	45.6	-

Extra caption datasets, weak labels, or pseudo labels

Methods	Supervision	Backbone	AP^N₅₀	AP^B₅₀
Extra caption datasets, Weak/Pseudo Labels in C_B ∪ C_N
Detic	IN21K & CC3M	RN50 (24M)	27.8	42.0
OV-DETR	Pseudo annotations	RN50 (24M)	29.4	52.7
CoDet	CC3M & COCO Caption	RN50 (24M)	30.6	46.4
PB-OVD	COCO Caption	RN50 (24M)	30.8	46.4
VL-PLM	Pseudo annotations	RN50 (24M)	34.4	60.2
RegionCLIP	CC3M	RN50 (24M)	35.2	57.6
OC-OVD	COCO Caption	RN50 (24M)	36.6	49.4
SAS-Det	COCO Caption	RN50 (24M)	37.4	58.5
DITO	LAION-2B	ViT-B/16 (86M)	36.6	48.8
LP-OVOD	Pseudo annotations	RN50 (24M)	40.5	60.5
MSPL (Ours)	Pseudo annotations	RN50 (24M)	43.4	58.9
CFM-ViT	LAION-2B	ViT-L/16 (307M)	34.3	46.4
RegionCLIP	CC3M	RN50×4 (87M)	39.3	61.6
DITO	DataComp-1B	ViT-L/16 (307M)	40.2	54.6
CORA+	COCO Caption	RN50×4 (87M)	43.1	56.2
MSPL (Ours)	Pseudo annotations	RN50×4 (87M)	47.8	60.9

Main results on OV-COCO. AP^N₅₀ and AP^B₅₀ denote AP at an IoU threshold of 0.5 on novel and base classes, respectively. MSPL rows are marked (Ours).
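As a reading aid, the margins in the second table can be recomputed directly from its rows. The sketch below copies the RN50 and RN50×4 rows from that table and, per backbone, subtracts the best competing AP^N₅₀ from the MSPL score. The `novel_margins` helper is a hypothetical name, and the comparison covers only the extra-supervision table, not the CLIP-supervision table above it.

```python
# RN50 and RN50x4 rows of the second OV-COCO table, tab-separated:
# method, supervision, backbone, novel AP50, base AP50.
TABLE = """\
Detic\tIN21K & CC3M\tRN50 (24M)\t27.8\t42.0
OV-DETR\tPseudo annotations\tRN50 (24M)\t29.4\t52.7
CoDet\tCC3M & COCO Caption\tRN50 (24M)\t30.6\t46.4
PB-OVD\tCOCO Caption\tRN50 (24M)\t30.8\t46.4
VL-PLM\tPseudo annotations\tRN50 (24M)\t34.4\t60.2
RegionCLIP\tCC3M\tRN50 (24M)\t35.2\t57.6
OC-OVD\tCOCO Caption\tRN50 (24M)\t36.6\t49.4
SAS-Det\tCOCO Caption\tRN50 (24M)\t37.4\t58.5
LP-OVOD\tPseudo annotations\tRN50 (24M)\t40.5\t60.5
MSPL (Ours)\tPseudo annotations\tRN50 (24M)\t43.4\t58.9
RegionCLIP\tCC3M\tRN50×4 (87M)\t39.3\t61.6
CORA+\tCOCO Caption\tRN50×4 (87M)\t43.1\t56.2
MSPL (Ours)\tPseudo annotations\tRN50×4 (87M)\t47.8\t60.9
"""


def novel_margins(table: str) -> dict:
    """Per backbone: MSPL novel AP50 minus the best other method's novel AP50."""
    ours = {}
    best_other = {}
    for line in table.strip().splitlines():
        method, _supervision, backbone, ap_novel, _ap_base = line.split("\t")
        score = float(ap_novel)
        if method.startswith("MSPL"):
            ours[backbone] = score
        else:
            best_other[backbone] = max(score, best_other.get(backbone, 0.0))
    # Round to one decimal to match the precision reported in the table.
    return {b: round(ours[b] - best_other[b], 1) for b in ours}


margins = novel_margins(TABLE)
# margins == {"RN50 (24M)": 2.9, "RN50×4 (87M)": 4.7}
```

Within this table, MSPL leads the next-best method on novel classes by 2.9 AP^N₅₀ with RN50 (over LP-OVOD) and by 4.7 AP^N₅₀ with RN50×4 (over CORA+).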

BibTeX

@misc{choi2026msplmultisteppseudolabelingopenvocabulary,
      title={MSPL: Multi-Step Pseudo-Labeling for Open-Vocabulary Object Detection}, 
      author={Hojun Choi and Youngsun Lim and Jaeyo Shin and Hyunjung Shim},
      year={2026},
      eprint={2510.14792},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.14792v4}, 
}