OVBEVSeg: Open-Vocabulary BEV Segmentation with 3D-Aware Geometric Constraints

Abstract

Bird's-eye view (BEV) perception fuses multi-camera images into a unified top-down representation for autonomous driving. Despite recent progress, state-of-the-art methods remain confined to closed-set scenarios, making them vulnerable to unpredictable real-world environments. In this work, we introduce open-vocabulary BEV segmentation (OVBS), which leverages vision-language models (VLMs) to recognize categories beyond the training set while maintaining precise BEV perception and real-time efficiency. A key challenge in OVBS lies in the 3D geometric inconsistency inherent in the ill-posed lifting of 2D VLM semantics into BEV. To address this, we propose OVBEVSeg, a geometry-aware OVBS framework that enhances efficient Gaussian splatting (GS)-based unprojection by leveraging robust 3D geometric constraints across three progressive stages: (1) 2D-to-BEV pseudo-labeling via reliable 3D projection for OV generalization; (2) joint 2D–BEV per-scene optimization with BEV structural constraints for 3D geometric consistency; and (3) 3D geometric distillation for efficiency. On the nuScenes dataset, OVBEVSeg achieves state-of-the-art performance, outperforming closed-set methods by 15.3 mIoU on unseen categories. Remarkably, despite being entirely label-free, it remains competitive with self- and semi-supervised baselines trained with up to 40% of ground-truth annotations. Furthermore, it achieves 2.5× faster inference with only 0.22× the memory consumption of projection-based methods.

Introduction

Limitations of 2D-OV-then-BEV pseudo-labeling

Existing BEV perception pipelines typically assume a closed set of semantic categories, whereas real driving scenes contain objects outside that set, such as trucks, strollers, and wheelchairs. A direct open-vocabulary extension, which detects objects in 2D and then lifts them into BEV, remains brittle because sparse-view 2D-to-3D unprojection amplifies localization and semantic errors. OVBEVSeg instead reverses the direction of reasoning: reliable 3D structure is projected into 2D and BEV to create geometrically consistent open-vocabulary supervision.

Method

OVBEVSeg consists of three progressive modules. First, Pseudo-BEV Labeling (PBL) establishes 2D–BEV correspondences by projecting unsupervised 3D object boxes into image and BEV spaces, then assigns open-vocabulary labels using object-centric CLIP embeddings. Second, BEV-Aware 3D Gaussian Splatting (BAGS) optimizes 3D Gaussians with image, depth, and BEV occupancy constraints so that the recovered geometry remains consistent in both camera and top-down views. Third, BEV-Aware Gaussian Distillation (BAGD) transfers this high-fidelity geometry into a feed-forward student model for efficient online inference.
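The 2D–BEV correspondence step in PBL can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes an axis-aligned box already expressed in the camera frame (x right, y down, z forward), a plain pinhole camera, and a BEV grid laid over the x–z plane. The function names, intrinsics, and cell size are hypothetical.

```python
import math
from itertools import product

def box_corners(center, extents):
    """Eight corners of an axis-aligned 3D box (yaw is omitted for brevity)."""
    cx, cy, cz = center
    ex, ey, ez = extents
    return [(cx + sx * ex / 2, cy + sy * ey / 2, cz + sz * ez / 2)
            for sx, sy, sz in product((-1, 1), repeat=3)]

def image_bbox(corners, fx, fy, cx, cy):
    """Pinhole-project camera-frame corners; return (u_min, v_min, u_max, v_max).

    Assumes at least one corner lies in front of the camera (z > 0).
    """
    pixels = [(fx * x / z + cx, fy * y / z + cy) for x, y, z in corners if z > 0]
    us, vs = zip(*pixels)
    return (min(us), min(vs), max(us), max(vs))

def bev_footprint(corners, cell):
    """Signed (row, col) grid cells covered by the box footprint on the x-z plane."""
    xs = [p[0] for p in corners]
    zs = [p[2] for p in corners]
    cols = range(math.floor(min(xs) / cell), math.ceil(max(xs) / cell))
    rows = range(math.floor(min(zs) / cell), math.ceil(max(zs) / cell))
    return {(r, c) for r in rows for c in cols}

# Hypothetical box: 10 m ahead, 2 m wide, 1.5 m tall, 4 m long.
corners = box_corners((0.0, 1.0, 10.0), (2.0, 1.5, 4.0))
correspondence = {
    "image_bbox": image_bbox(corners, fx=1000.0, fy=1000.0, cx=800.0, cy=450.0),
    "bev_cells": bev_footprint(corners, cell=0.5),
}
```

Because both the image box and the BEV footprint derive from the same 3D box, the pair is geometrically consistent by construction, which is the property the 3D-to-2D direction is meant to provide.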

The framework also supports language-aware 3D scene understanding. By linking image masks, CLIP image embeddings, and BEV masks through the 2D–BEV correspondence set, OVBEVSeg can propagate open-vocabulary semantics into BEV space. For language-embedded BAGS, a disentanglement distillation objective preserves category-level distinctions among semantically similar vehicle subclasses, enabling consistent language grounding across image, BEV, and 3D Gaussian representations.
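As a rough illustration of how semantics might flow through the correspondence set, the sketch below labels each object by the text embedding most similar to its image embedding, then writes that label into the object's BEV cells. The three-dimensional vectors are toy stand-ins, not real CLIP outputs, and the dictionary layout and function names are assumptions made for this example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def propagate_labels(correspondences, text_embeddings):
    """Assign each object its best-matching class name and copy it to its BEV cells."""
    bev_labels = {}
    for obj in correspondences:
        scores = {name: cosine(obj["image_embedding"], emb)
                  for name, emb in text_embeddings.items()}
        label = max(scores, key=scores.get)
        for cell in obj["bev_cells"]:
            bev_labels[cell] = label
    return bev_labels

# Toy text embeddings (hypothetical; a real system would query a VLM text encoder).
text_embeddings = {
    "truck": [1.0, 0.0, 0.0],
    "bus": [0.0, 1.0, 0.0],
    "motorcycle": [0.0, 0.0, 1.0],
}
correspondences = [
    {"image_embedding": [0.9, 0.3, 0.1], "bev_cells": [(16, 0), (16, 1)]},
    {"image_embedding": [0.1, 0.2, 0.8], "bev_cells": [(30, -4)]},
]
bev_labels = propagate_labels(correspondences, text_embeddings)
```

Adding a new category only requires a new entry in `text_embeddings`, which is what makes this kind of labeling open-vocabulary.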

Results

OVBEVSeg is evaluated on the nuScenes validation set under the multi-class OVBS setting. Without using novel-class ground-truth labels, it raises novel-class IoU from 0.0 for every closed-set baseline to 19.0 (truck), 20.3 (bus), and 6.6 (motorcycle), while running at 79.6 FPS with 32.4 MiB of memory. Compared with the GaussianLSS baseline, OVBEVSeg improves mean IoU by 6.6 points; compared with the previous state-of-the-art model TaDe, it improves mean IoU by 5.7 points.

Comparison of OVBS in the multi-class setting on nuScenes validation

Method	Memory (MiB) ↓	FPS ↑	Novel (IoU) ↑			Base (IoU) ↑					Mean IoU ↑
Method	Memory (MiB) ↓	FPS ↑	truck	bus	motorcycle	car	trailer	construction vehicle	pedestrian	bicycle	Mean IoU ↑
Fully Supervised^†	33.0	80.2	29.7	38.4	10.4	-	-	-	-	-	-
OVBS
VED^†	-	-	0.0	0.0	0.0	7.4	0.0	0.0	0.0	0.0	0.9
VPN^†	-	-	0.0	0.0	0.0	16.6	4.9	7.1	0.0	4.4	4.1
PON^†	38.6	43.8	0.0	0.0	0.0	24.7	16.6	12.3	8.2	9.4	8.9
DiffBEV^†	-	-	0.0	0.0	0.0	38.9	21.1	8.4	9.6	13.2	11.4
GaussianLSS^†	33.0	80.2	0.0	0.0	0.0	40.0	25.1	11.6	14.4	10.4	12.7
TaDe^†	41.5	51.4	0.0	0.0	0.0	42.8	26.3	11.4	14.0	14.2	13.6
OVBEVSeg (Ours)^†	32.4	79.6	19.0	20.3	6.6	41.8	26.9	13.1	15.0	12.0	19.3
Δ_B			(+19.0)	(+20.3)	(+6.6)	(+1.8)	(+1.8)	(+1.5)	(+0.6)	(+1.6)	(+6.6)
Δ_S			(+19.0)	(+20.3)	(+6.6)	(-1.0)	(+0.6)	(+1.7)	(+1.0)	(-2.2)	(+5.7)

† denotes visibility filtering. Δ_B and Δ_S denote absolute IoU differences (in points) between OVBEVSeg and GaussianLSS and TaDe, respectively.
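The summary columns can be recomputed from the per-class values in the table. The short script below does so for the three most relevant rows. It assumes that Mean IoU is the unweighted average over all eight classes and that the mean deltas are taken between the rounded means, which is the convention that reproduces the reported figures.

```python
CLASSES = ["truck", "bus", "motorcycle", "car", "trailer",
           "construction vehicle", "pedestrian", "bicycle"]
NUM_NOVEL = 3  # truck, bus, motorcycle

# Per-class IoU copied from the table, in CLASSES order.
iou = {
    "GaussianLSS": [0.0, 0.0, 0.0, 40.0, 25.1, 11.6, 14.4, 10.4],
    "TaDe":        [0.0, 0.0, 0.0, 42.8, 26.3, 11.4, 14.0, 14.2],
    "OVBEVSeg":    [19.0, 20.3, 6.6, 41.8, 26.9, 13.1, 15.0, 12.0],
}

def mean_iou(values):
    """Unweighted mean, rounded to one decimal as in the table."""
    return round(sum(values) / len(values), 1)

def per_class_delta(ours, baseline):
    """Absolute per-class IoU difference in points."""
    return [round(a - b, 1) for a, b in zip(ours, baseline)]

mean = {name: mean_iou(values) for name, values in iou.items()}
novel_mean = mean_iou(iou["OVBEVSeg"][:NUM_NOVEL])

delta_b = per_class_delta(iou["OVBEVSeg"], iou["GaussianLSS"])
delta_s = per_class_delta(iou["OVBEVSeg"], iou["TaDe"])
delta_b_mean = round(mean["OVBEVSeg"] - mean["GaussianLSS"], 1)
delta_s_mean = round(mean["OVBEVSeg"] - mean["TaDe"], 1)

for name in iou:
    print(f"{name}: mean IoU {mean[name]}")
print(f"novel-class mean {novel_mean}, delta_B {delta_b_mean}, delta_S {delta_s_mean}")
```

Under these assumptions the script gives mean IoU values of 12.7, 13.6, and 19.3, mean deltas of 6.6 and 5.7, and a novel-class mean of 15.3, consistent with the table and with the 15.3 mIoU gain on unseen categories quoted in the abstract.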

Qualitative results show that OVBEVSeg detects novel-class objects that are missed by closed-set baselines and produces sharper BEV boundaries. By enforcing 3D geometric consistency, BAGS reduces distorted or scattered BEV renderings from vanilla 3DGS and provides cleaner supervision for the online model.
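To make the idea of a BEV occupancy constraint concrete, here is a toy sketch. The exact BAGS objective is not given in this text, so everything below is an assumption: Gaussians are reduced to isotropic 2D blobs on the ground plane, occupancy is composited as one minus accumulated transmittance, and the penalty is a binary cross-entropy against a pseudo-BEV mask.

```python
import math

def render_bev_occupancy(gaussians, grid_size, cell):
    """Composite isotropic ground-plane Gaussians into a square occupancy grid.

    Each Gaussian is (x, y, sigma, alpha); cell values are 1 - prod(1 - alpha * w).
    """
    occ = [[0.0] * grid_size for _ in range(grid_size)]
    for r in range(grid_size):
        for c in range(grid_size):
            x, y = (c + 0.5) * cell, (r + 0.5) * cell  # cell center
            transmittance = 1.0
            for gx, gy, sigma, alpha in gaussians:
                w = math.exp(-((x - gx) ** 2 + (y - gy) ** 2) / (2 * sigma ** 2))
                transmittance *= 1.0 - alpha * w
            occ[r][c] = 1.0 - transmittance
    return occ

def bev_occupancy_loss(pred, target, eps=1e-6):
    """Mean binary cross-entropy between rendered and target occupancy."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp for numerical safety
            total -= t * math.log(p) + (1 - t) * math.log(1 - p)
            count += 1
    return total / count

# Pseudo-BEV mask: a 2x2 object in the middle of a 4x4 grid (hypothetical).
target = [[0, 0, 0, 0],
          [0, 1, 1, 0],
          [0, 1, 1, 0],
          [0, 0, 0, 0]]
aligned = [(2.0, 2.0, 0.8, 0.95)]    # Gaussian centered on the object
scattered = [(0.5, 3.5, 0.8, 0.95)]  # Gaussian drifted into a corner

loss_aligned = bev_occupancy_loss(render_bev_occupancy(aligned, 4, 1.0), target)
loss_scattered = bev_occupancy_loss(render_bev_occupancy(scattered, 4, 1.0), target)
```

In this toy setup the drifted Gaussian incurs a much larger loss than the aligned one. That is the kind of signal a top-down constraint adds on top of image and depth losses, which on their own may tolerate geometry that looks plausible from the cameras but smears in BEV.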

Language-embedded BAGS extends OVBEVSeg beyond standard BEV segmentation toward open-world 3D scene understanding. It embeds language-aware features into 3D Gaussians and preserves consistent projections across image and BEV views, supporting scalable multimodal auto-labeling in autonomous driving scenes.

BibTeX

@misc{choi2026openvocabularybevsegmentation3daware,
      title={Open-Vocabulary BEV Segmentation with 3D-Aware Geometric Constraints}, 
      author={Hojun Choi and Seulbin Hwang and Dae Jung Kim and Kisung Kim and Hyunjung Shim and Jinhan Lee},
      year={2026},
      eprint={2606.24353},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2606.24353}, 
}