Open-Vocabulary BEV Segmentation with
3D-Aware Geometric Constraints

¹KAIST AI, ²NAVER LABS
ECCV 2026
*Work done during an internship at NAVER LABS.
Co-corresponding authors.

Abstract

Bird’s-eye view (BEV) perception fuses multi-camera images into a unified top-down representation for autonomous driving. Despite recent progress, state-of-the-art methods remain confined to closed-set scenarios, making them vulnerable to unpredictable real-world environments. We introduce open-vocabulary BEV segmentation (OVBS), which leverages vision-language models to recognize categories beyond the training set while maintaining precise BEV perception and real-time efficiency. To address the 3D geometric inconsistency caused by ill-posed 2D-to-BEV lifting, we propose OVBEVSeg, a geometry-aware framework that uses 3D constraints across pseudo-BEV labeling, BEV-aware 3D Gaussian splatting, and BEV-aware Gaussian distillation. On nuScenes, OVBEVSeg achieves state-of-the-art OVBS performance, improving novel-class recognition while retaining competitive memory usage and real-time speed.

Intro

Limitations of 2D-OV-then-BEV pseudo-labeling

Existing BEV perception pipelines typically assume a closed set of semantic categories, while real driving scenes contain unseen objects such as trucks, strollers, and wheelchairs. A direct open-vocabulary extension—detecting objects in 2D and then lifting them into BEV—remains brittle because sparse-view 2D-to-3D unprojection amplifies localization and semantic errors. OVBEVSeg instead reverses the direction of reasoning: reliable 3D structure is projected into 2D and BEV to create geometrically consistent open-vocabulary supervision.
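The asymmetry between the two directions can be illustrated with a pinhole camera. The sketch below is not the paper's pipeline: the intrinsics, the pixel, and the depth values are hypothetical, and the functions `unproject_to_bev` and `project_to_pixel` are names introduced only for this illustration. It shows that one pixel maps to very different BEV positions under two depth guesses, whereas a known 3D point maps to exactly one pixel.

```python
# Illustrative pinhole-camera sketch; all numbers are assumptions, not values from the paper.

def unproject_to_bev(u, depth, fx, cx):
    """Lift pixel column u at an assumed depth to BEV (lateral x, forward z) in metres."""
    x = (u - cx) * depth / fx
    return x, depth

def project_to_pixel(x, z, fx, cx):
    """Project a known 3D point (lateral x, forward z) to its pixel column."""
    return fx * x / z + cx

fx, cx = 1266.0, 800.0  # hypothetical focal length and principal point (pixels)
u = 1100.0              # pixel column of a 2D detection

# 2D -> BEV: the same pixel under two plausible depth estimates.
near = unproject_to_bev(u, 20.0, fx, cx)
far = unproject_to_bev(u, 25.0, fx, cx)
bev_shift = ((far[0] - near[0]) ** 2 + (far[1] - near[1]) ** 2) ** 0.5

# 3D -> 2D: a known 3D point lands on a single, well-defined pixel.
u_back = project_to_pixel(near[0], near[1], fx, cx)

print(f"BEV shift caused by a 5 m depth error: {bev_shift:.2f} m")
print(f"Reprojected pixel column: {u_back:.1f}")
```

In this toy setting the depth ambiguity alone moves the object by more than five metres in BEV, while the projection of the 3D point returns the original pixel column.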

Method

OVBEVSeg framework

OVBEVSeg consists of three progressive modules. First, Pseudo-BEV Labeling (PBL) establishes 2D–BEV correspondences by projecting unsupervised 3D object boxes into image and BEV spaces, then assigns open-vocabulary labels using object-centric CLIP embeddings. Second, BEV-Aware 3D Gaussian Splatting (BAGS) optimizes 3D Gaussians with image, depth, and BEV occupancy constraints so that the recovered geometry remains consistent in both camera and top-down views. Third, BEV-Aware Gaussian Distillation (BAGD) transfers this high-fidelity geometry into a feed-forward student model for efficient online inference.
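As a rough sketch of the first step of PBL, the snippet below projects one 3D box into both the image plane and a BEV grid and stores the two results as a single correspondence. It makes several simplifying assumptions that are not taken from the paper: the box is axis-aligned in the camera frame (no yaw), lies entirely in front of the camera, and the intrinsics, grid origin, and cell size are hypothetical. The helper names are introduced here for illustration only.

```python
from itertools import product

def box_corners(center, size):
    """Eight corners of an axis-aligned box in the camera frame (x right, y down, z forward)."""
    cx, cy, cz = center
    w, h, l = size  # extent along x, y, z
    return [(cx + sx * w / 2, cy + sy * h / 2, cz + sz * l / 2)
            for sx, sy, sz in product((-1, 1), repeat=3)]

def image_bbox(corners, fx, fy, px, py):
    """2D bounding box (u_min, v_min, u_max, v_max) of the projected corners; assumes z > 0."""
    us = [fx * x / z + px for x, _, z in corners]
    vs = [fy * y / z + py for _, y, z in corners]
    return min(us), min(vs), max(us), max(vs)

def bev_cells(corners, cell_size, x_min, z_min):
    """Set of (row, col) BEV grid cells touched by the box footprint in the x-z plane."""
    xs = [x for x, _, _ in corners]
    zs = [z for _, _, z in corners]
    col0, col1 = int((min(xs) - x_min) // cell_size), int((max(xs) - x_min) // cell_size)
    row0, row1 = int((min(zs) - z_min) // cell_size), int((max(zs) - z_min) // cell_size)
    return {(r, c) for r in range(row0, row1 + 1) for c in range(col0, col1 + 1)}

# Hypothetical box: 2 m wide, 1.6 m tall, 4 m long, centred 20 m ahead of the camera.
corners = box_corners(center=(2.0, 1.0, 20.0), size=(2.0, 1.6, 4.0))
bbox = image_bbox(corners, fx=1000.0, fy=1000.0, px=800.0, py=450.0)
cells = bev_cells(corners, cell_size=0.5, x_min=-50.0, z_min=0.0)

# One 2D-BEV correspondence: the same object seen in the image and in the BEV grid.
correspondence = {"image_box": bbox, "bev_cells": cells}
print(correspondence["image_box"], len(correspondence["bev_cells"]))
```

Because both views are derived from the same 3D box, the image region and the BEV footprint agree by construction, which is the property the correspondence set relies on.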

Language-aware OVBEVSeg extension

The framework also supports language-aware 3D scene understanding. By linking image masks, CLIP image embeddings, and BEV masks through the 2D–BEV correspondence set, OVBEVSeg can propagate open-vocabulary semantics into BEV space. For language-embedded BAGS, a disentanglement distillation objective preserves category-level distinctions among semantically similar vehicle subclasses, enabling consistent language grounding across image, BEV, and 3D Gaussian representations.
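The propagation of semantics through the correspondence set can be sketched as a nearest-text-embedding lookup followed by a write into the BEV cells of each object. The code below is only an illustration under stated assumptions: the three-dimensional vectors are toy stand-ins for CLIP embeddings, the correspondence entries are made up, and cosine similarity with an argmax is a common choice assumed here rather than a detail confirmed by the source.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def assign_label(image_embedding, text_embeddings):
    """Return the class name whose text embedding is most similar to the image embedding."""
    return max(text_embeddings, key=lambda name: cosine(image_embedding, text_embeddings[name]))

# Toy vectors standing in for CLIP text embeddings of the class prompts (assumption).
text_embeddings = {
    "truck": [0.9, 0.1, 0.0],
    "bus": [0.1, 0.9, 0.0],
    "motorcycle": [0.0, 0.1, 0.9],
}

# Hypothetical 2D-BEV correspondences: an object-centric image embedding plus its BEV cells.
correspondences = [
    {"image_embedding": [0.8, 0.2, 0.1], "bev_cells": {(36, 102), (36, 103)}},
    {"image_embedding": [0.1, 0.2, 0.9], "bev_cells": {(10, 50)}},
]

# Propagate the open-vocabulary label of each object into its BEV footprint.
bev_labels = {}
for obj in correspondences:
    label = assign_label(obj["image_embedding"], text_embeddings)
    for cell in obj["bev_cells"]:
        bev_labels[cell] = label

print(bev_labels)
```

Adding a new category in this sketch only requires adding one more text embedding, which is what makes the labeling open-vocabulary.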

Results

OVBEVSeg is evaluated on the nuScenes validation set under the multi-class OVBS setting. Without using novel-class ground-truth labels, it reaches 19.0, 20.3, and 6.6 IoU on the novel classes truck, bus, and motorcycle, where every prior BEV segmentation model in the comparison scores 0.0, while preserving real-time efficiency (79.6 FPS at 32.4 MiB). Compared with the GaussianLSS baseline, OVBEVSeg improves the mean IoU by 6.6 points (12.7 to 19.3); compared with the previous SOTA model TaDe, it improves the mean IoU by 5.7 points (13.6 to 19.3).

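Comparison of OVBS in the multi-class setting on nuScenes validation. Truck, bus, and motorcycle are novel classes; car, trailer, construction vehicle, pedestrian, and bicycle are base classes. All class columns report IoU (↑).

| Setting | Method | Memory (MiB) ↓ | FPS ↑ | Novel: truck | Novel: bus | Novel: motorcycle | Base: car | Base: trailer | Base: construction vehicle | Base: pedestrian | Base: bicycle | Mean IoU ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Fully Supervised | Fully Supervised | 33.0 | 80.2 | 29.7 | 38.4 | 10.4 | - | - | - | - | - | - |
| OVBS | VED | - | - | 0.0 | 0.0 | 0.0 | 7.4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.9 |
| OVBS | VPN | - | - | 0.0 | 0.0 | 0.0 | 16.6 | 4.9 | 7.1 | 0.0 | 4.4 | 4.1 |
| OVBS | PON | 38.6 | 43.8 | 0.0 | 0.0 | 0.0 | 24.7 | 16.6 | 12.3 | 8.2 | 9.4 | 8.9 |
| OVBS | DiffBEV | - | - | 0.0 | 0.0 | 0.0 | 38.9 | 21.1 | 8.4 | 9.6 | 13.2 | 11.4 |
| OVBS | GaussianLSS | 33.0 | 80.2 | 0.0 | 0.0 | 0.0 | 40.0 | 25.1 | 11.6 | 14.4 | 10.4 | 12.7 |
| OVBS | TaDe | 41.5 | 51.4 | 0.0 | 0.0 | 0.0 | 42.8 | 26.3 | 11.4 | 14.0 | 14.2 | 13.6 |
| OVBS | OVBEVSeg (Ours) | 32.4 | 79.6 | 19.0 | 20.3 | 6.6 | 41.8 | 26.9 | 13.1 | 15.0 | 12.0 | 19.3 |
| | ΔB | | | (+19.0) | (+20.3) | (+6.6) | (+1.8) | (+1.8) | (+1.5) | (+0.6) | (+1.6) | (+6.6) |
| | ΔS | | | (+19.0) | (+20.3) | (+6.6) | (-1.0) | (+0.6) | (+1.7) | (+1.0) | (-2.2) | (+5.7) |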

† denotes visibility filtering. ΔB and ΔS denote the absolute change in IoU (in points) of OVBEVSeg relative to GaussianLSS and TaDe, respectively.
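The Mean IoU, ΔB, and ΔS entries can be reproduced from the per-class numbers in the table. The sketch below assumes that Mean IoU is the unweighted average over the three novel and five base classes and that the mean deltas are differences of the rounded means; the source does not state either convention, but the reported values are consistent with both.

```python
# Per-class IoU copied from the table, in the order:
# truck, bus, motorcycle, car, trailer, construction vehicle, pedestrian, bicycle.
iou = {
    "GaussianLSS": [0.0, 0.0, 0.0, 40.0, 25.1, 11.6, 14.4, 10.4],
    "TaDe": [0.0, 0.0, 0.0, 42.8, 26.3, 11.4, 14.0, 14.2],
    "OVBEVSeg": [19.0, 20.3, 6.6, 41.8, 26.9, 13.1, 15.0, 12.0],
}

def mean_iou(values):
    """Unweighted mean over all eight classes, rounded to one decimal (assumed convention)."""
    return round(sum(values) / len(values), 1)

def per_class_delta(ours, baseline):
    """Absolute per-class IoU difference in points, rounded to one decimal."""
    return [round(a - b, 1) for a, b in zip(ours, baseline)]

means = {name: mean_iou(values) for name, values in iou.items()}
delta_b = per_class_delta(iou["OVBEVSeg"], iou["GaussianLSS"])
delta_s = per_class_delta(iou["OVBEVSeg"], iou["TaDe"])
delta_b_mean = round(means["OVBEVSeg"] - means["GaussianLSS"], 1)
delta_s_mean = round(means["OVBEVSeg"] - means["TaDe"], 1)

print(means)
print("ΔB:", delta_b, delta_b_mean)
print("ΔS:", delta_s, delta_s_mean)
```

The negative ΔS entries for car and bicycle show that the overall gain over TaDe comes from the novel classes rather than from uniform improvement on the base classes.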

OVBEVSeg qualitative results

Qualitative results show that OVBEVSeg detects novel-class objects that are missed by closed-set baselines and produces sharper BEV boundaries. By enforcing 3D geometric consistency, BAGS reduces distorted or scattered BEV renderings from vanilla 3DGS and provides cleaner supervision for the online model.

Language-embedded BAGS results

Language-embedded BAGS extends OVBEVSeg beyond standard BEV segmentation toward open-world 3D scene understanding. It embeds language-aware features into 3D Gaussians and preserves consistent projections across image and BEV views, supporting scalable multimodal auto-labeling in autonomous driving scenes.

BibTeX

@inproceedings{choi2026ovbevseg,
  title     = {Open-Vocabulary BEV Segmentation with 3D-Aware Geometric Constraints},
  author    = {Choi, Hojun and Hwang, Seulbin and Kim, Dae Jung and Kim, Kisung and Shim, Hyunjung and Lee, Jinhan},
  booktitle = {European Conference on Computer Vision},
  year      = {2026}
}