Contrastive Language-Colored Pointmap Pretraining for Unified 3D Scene Understanding

Ye Mao · Weixun Luo · Ranran Huang · Junpeng Jing · Krystian Mikolajczyk

Imperial College London

A short video walkthrough of UniScene3D qualitative results and scene understanding behavior.

The scenes and text queries are drawn from wild, out-of-domain data, underscoring UniScene3D's strong generalization capability.

UniScene3D teaser figure

Figure 1. UniScene3D learns transferable 3D scene representations from multi-view colored pointmaps for viewpoint grounding, scene retrieval, scene type classification, and 3D visual question answering.

Abstract

Pretraining 3D encoders by aligning them with Contrastive Language-Image Pre-training (CLIP) has emerged as a promising direction for learning generalizable representations for 3D scene understanding. We propose UniScene3D, a transformer-based encoder that learns unified scene representations from multi-view colored pointmaps, jointly modeling image appearance and geometry. For robust colored pointmap representation learning, we introduce cross-view geometric alignment and grounded view alignment objectives that enforce cross-view geometric and semantic consistency. Extensive low-shot and task-specific fine-tuning evaluations on viewpoint grounding, scene retrieval, scene type classification, and 3D VQA demonstrate strong performance, supporting UniScene3D as a unified representation for 3D scene understanding.

Method

UniScene3D takes multi-view image-pointmap pairs, fuses appearance and geometry at the input stage, and learns a unified representation with view-level and scene-level alignment objectives.
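The input-stage fusion can be pictured as concatenating each view's RGB image with its per-pixel XYZ pointmap into 6-channel "colored pointmap" patch tokens. The sketch below is a minimal numpy illustration of that idea; the patch size, resolutions, and token layout are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def colored_pointmap_tokens(rgb, xyz, patch=16):
    """Fuse an RGB image with its pointmap into 6-channel patch tokens.

    rgb: (H, W, 3) colors in [0, 1]; xyz: (H, W, 3) per-pixel 3D points.
    Returns (num_patches, patch * patch * 6) flattened tokens.
    """
    fused = np.concatenate([rgb, xyz], axis=-1)          # (H, W, 6)
    H, W, C = fused.shape
    gh, gw = H // patch, W // patch                      # patch grid size
    tokens = (fused[:gh * patch, :gw * patch]
              .reshape(gh, patch, gw, patch, C)          # split into patches
              .transpose(0, 2, 1, 3, 4)                  # group patch pixels
              .reshape(gh * gw, patch * patch * C))      # flatten each patch
    return tokens

rgb = np.random.rand(64, 64, 3)
xyz = np.random.rand(64, 64, 3)
tokens = colored_pointmap_tokens(rgb, xyz)
print(tokens.shape)  # (16, 1536)
```

Each token then carries appearance and geometry jointly, so a single transformer can attend over both from the first layer.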

Overview of UniScene3D pretraining

Figure 2. The model combines early RGB-geometry fusion with cross-view geometric alignment, grounded view alignment, view-level alignment, and scene-level alignment.
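Alignment objectives of this kind are typically implemented as a symmetric InfoNCE loss between the scene encoder's embeddings and frozen CLIP targets. The following numpy sketch shows that standard formulation under that assumption; it is not the authors' exact objective, and the temperature and embedding sizes are placeholders.

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE: matched rows of a and b are positive pairs."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                       # (N, N) similarities
    # log-softmax over each row; the diagonal holds the positive pairs
    logp_ab = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    logp_ba = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    return -0.5 * (np.diag(logp_ab).mean() + np.diag(logp_ba).mean())

rng = np.random.default_rng(0)
view_emb = rng.standard_normal((8, 32))   # scene-encoder view features
clip_emb = rng.standard_normal((8, 32))   # frozen CLIP targets
print(float(info_nce(view_emb, clip_emb)))
```

Pulling matched pairs together while pushing mismatched pairs apart is what lets the learned representation inherit CLIP's language alignment.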

Qualitative Examples

These examples show how UniScene3D combines appearance and geometry cues to retrieve views that match language queries.

Visualization 1
Visualization 2


Evaluation

UniScene3D is evaluated on viewpoint grounding, scene retrieval, scene type classification, and 3D VQA.

Viewpoint Grounding

| Method | Input | ScanRefer R@1 | ScanRefer R@5 | ScanRefer R@10 | Nr3D R@1 | Nr3D R@5 | Nr3D R@10 | Sr3D R@1 | Sr3D R@5 | Sr3D R@10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DFN | I | 20.9 | 59.2 | 77.6 | 17.3 | 51.8 | 71.8 | 12.6 | 43.5 | 65.6 |
| SigLIP2 | I | 22.2 | 61.1 | 79.5 | 18.8 | 54.3 | 74.0 | 13.0 | 44.4 | 67.2 |
| FG-CLIP | I | 20.2 | 58.5 | 77.6 | 17.3 | 52.6 | 72.8 | 13.4 | 44.5 | 66.6 |
| Uni3D-g | PC | 4.2 | 19.0 | 35.5 | 3.6 | 17.2 | 33.4 | 3.4 | 16.5 | 32.2 |
| POMA-3D | PM | 16.4 | 50.6 | 71.2 | 13.6 | 44.5 | 65.9 | 10.2 | 37.4 | 59.2 |
| UniScene3D | CPM | 38.6 | 72.5 | 85.7 | 25.2 | 64.0 | 80.9 | 23.6 | 62.8 | 80.4 |

Input types: I = images, PC = point cloud, PM = pointmap, CPM = colored pointmap (ours).
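The Recall@K numbers above can be computed with a standard retrieval routine: rank gallery items by cosine similarity to each query and check whether the ground truth lands in the top K. A minimal numpy sketch, assuming one ground-truth item per query (the toy embeddings are hypothetical):

```python
import numpy as np

def recall_at_k(query_emb, gallery_emb, gt_idx, k):
    """Fraction of queries whose ground-truth item ranks in the top-k."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sim = q @ g.T                                   # (num_queries, gallery)
    topk = np.argsort(-sim, axis=1)[:, :k]          # indices of best matches
    hits = (topk == np.asarray(gt_idx)[:, None]).any(axis=1)
    return hits.mean()

gallery = np.eye(5, 8)          # 5 orthogonal "view" embeddings
queries = gallery.copy()        # each query matches exactly one view
print(recall_at_k(queries, gallery, gt_idx=[0, 1, 2, 3, 4], k=1))  # 1.0
```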

Scene Retrieval

| Method | Input | ScanRefer n=5 R@1 | ScanRefer n=5 R@5 | ScanRefer n=10 R@1 | ScanRefer n=10 R@5 | Nr3D n=5 R@1 | Nr3D n=5 R@5 | Nr3D n=10 R@1 | Nr3D n=10 R@5 | Sr3D n=5 R@1 | Sr3D n=5 R@5 | Sr3D n=10 R@1 | Sr3D n=10 R@5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DFN | I | 4.6 | 13.5 | 4.6 | 13.3 | 3.9 | 12.7 | 4.4 | 13.0 | 0.6 | 3.1 | 0.6 | 3.2 |
| SigLIP2 | I | 1.5 | 5.3 | 1.8 | 6.6 | 0.7 | 3.0 | 1.0 | 3.8 | 0.1 | 0.3 | 0.1 | 0.4 |
| FG-CLIP | I | 9.2 | 26.3 | 16.3 | 40.0 | 6.2 | 20.8 | 12.4 | 32.8 | 0.6 | 2.7 | 0.7 | 3.8 |
| Uni3D-g | PC | 0.3 | 1.4 | 0.3 | 1.3 | 0.5 | 1.4 | 0.4 | 1.5 | 0.3 | 1.6 | 0.2 | 1.4 |
| 3D-VisTA | PC | 0.6 | 2.4 | 0.7 | 2.2 | 0.1 | 0.9 | 0.2 | 1.3 | 0.2 | 1.5 | 0.2 | 1.7 |
| SceneVerse | PC | 0.5 | 2.1 | 0.6 | 2.7 | 0.5 | 1.3 | 0.5 | 1.4 | 0.4 | 1.8 | 0.1 | 1.9 |
| POMA-3D | PM | 13.8 | 34.9 | 20.4 | 47.2 | 11.9 | 32.7 | 19.3 | 46.8 | 1.6 | 6.9 | 2.4 | 9.5 |
| UniScene3D | CPM | 22.4 | 48.6 | 33.4 | 62.9 | 19.7 | 46.3 | 30.7 | 61.7 | 3.0 | 11.8 | 4.6 | 17.5 |
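Here n is the number of views available per scene, so retrieval compares a query against one embedding per scene. A common way to obtain that scene embedding, shown below as an assumption rather than the paper's exact readout, is to mean-pool the n view embeddings and renormalize:

```python
import numpy as np

def scene_embedding(view_embs):
    """Pool per-view embeddings (n, d) into one L2-normalized scene vector."""
    pooled = view_embs.mean(axis=0)
    return pooled / np.linalg.norm(pooled)

rng = np.random.default_rng(1)
views = rng.standard_normal((10, 32))    # n = 10 views of one scene
scene = scene_embedding(views)
print(scene.shape)                       # (32,)
```

Mean pooling keeps the scene vector in the same space as the view and text embeddings, so the same cosine-similarity ranking applies.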

Scene Type Classification

| Method | Input | 0-shot | 1-shot | 5-shot | 10-shot |
| --- | --- | --- | --- | --- | --- |
| DFN | I | 41.6 | 40.5 | 60.1 | 77.5 |
| SigLIP2 | I | 36.0 | 36.6 | 57.8 | 77.9 |
| FG-CLIP | I | 49.5 | 40.0 | 62.0 | 77.5 |
| Uni3D-g | PC | 8.6 | 21.4 | 37.1 | 45.4 |
| 3D-VisTA | PC | 1.9 | 28.9 | 49.4 | 55.0 |
| SceneVerse | PC | 0.8 | 24.0 | 51.1 | 64.2 |
| POMA-3D | PM | 63.9 | 32.2 | 64.9 | 74.1 |
| UniScene3D | CPM | 70.7 | 43.2 | 72.4 | 83.7 |
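The 0-shot column follows the standard CLIP-style protocol: a scene embedding is matched against text embeddings of class-name prompts, with no training on the target labels. A minimal sketch of that protocol, using hypothetical precomputed embeddings rather than the paper's pipeline:

```python
import numpy as np

def zero_shot_classify(scene_emb, class_text_embs):
    """Predict the class whose text embedding is most similar to the scene."""
    s = scene_emb / np.linalg.norm(scene_emb)
    t = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    return int(np.argmax(t @ s))            # index of best cosine match

class_embs = np.eye(4, 16)                  # one text embedding per scene type
scene = np.full(16, 0.1)
scene[2] = 5.0                              # closest to class 2's embedding
print(zero_shot_classify(scene, class_embs))  # 2
```

The few-shot columns instead fit a lightweight classifier on 1, 5, or 10 labeled examples per class on top of the frozen representation.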

3D VQA

| Method | Input | ScanQA EM@1 | ScanQA EM@10 | SQA3D EM@1 | SQA3D EM@10 | Hypo3D EM@1 | Hypo3D EM@10 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| FG-CLIP | I | 20.9 | 49.9 | 49.5 | 89.7 | 31.1 | 82.1 |
| 3D-VisTA | PC | 22.4 | 52.1 | 48.5 | 85.6 | 31.0 | 81.2 |
| SceneVerse | PC | 22.7 | 51.5 | 49.9 | 85.0 | 31.6 | 80.3 |
| POMA-3D | PM | 22.3 | 52.3 | 51.1 | 91.2 | 33.4 | 84.8 |
| UniScene3D | CPM | 23.2 | 53.5 | 52.5 | 92.3 | 35.2 | 85.9 |
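EM@K (exact match at K) measures whether a ground-truth answer string appears among the model's top-K ranked answers. A small sketch of that metric over hypothetical predictions, assuming plain string matching without answer normalization:

```python
def em_at_k(ranked_answers, gt_answers, k):
    """Exact match at k: does any ground-truth string appear in the top-k?

    ranked_answers: list of candidate-answer lists per question, best first.
    gt_answers: list of acceptable-answer sets per question.
    """
    hits = sum(
        any(a in gt for a in preds[:k])
        for preds, gt in zip(ranked_answers, gt_answers)
    )
    return hits / len(gt_answers)

preds = [["chair", "table"], ["red", "blue"]]
gts = [{"table"}, {"green"}]
print(em_at_k(preds, gts, k=1))  # 0.0
print(em_at_k(preds, gts, k=2))  # 0.5
```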

Main Findings

UniScene3D transfers well across zero-shot, few-shot, and task-specific fine-tuning settings.

Colored pointmaps outperform image-only, pointmap-only, and RGB-D alternatives.

The proposed alignment objectives consistently improve representation quality.

Performance remains strong across different numbers of input views.

Scaling to more pretraining scenes steadily improves downstream results.

UniScene3D generalizes better than prior methods on out-of-domain scenes.

Conclusion

UniScene3D is a 3D representation learning method that jointly models geometry and appearance from multi-view colored pointmaps. To exploit rich cross-view context, its pretraining introduces cross-view geometric alignment and grounded view alignment objectives.

Across multiple 3D scene understanding benchmarks, UniScene3D achieves state-of-the-art performance while remaining robust to varying numbers of input views. Looking forward, a natural next step is to scale UniScene3D to larger model sizes and explore its use as a general vision backbone for 3D foundation models.

Citation

@article{mao2026uniscene3d,
  title   = {Contrastive Language-Colored Pointmap Pretraining for Unified 3D Scene Understanding},
  author  = {Mao, Ye and Luo, Weixun and Huang, Ranran and Jing, Junpeng and Mikolajczyk, Krystian},
  journal = {arXiv preprint},
  year    = {2026}
}