A short video walkthrough of UniScene3D qualitative results and scene understanding behavior.
The scenes and text queries are drawn from in-the-wild, out-of-domain data, underscoring UniScene3D's strong
generalization capability.
Figure 1. UniScene3D learns transferable 3D scene representations from multi-view colored
pointmaps for viewpoint grounding, scene retrieval, scene type classification, and 3D visual question
answering.
Abstract
Pretraining 3D encoders by aligning with Contrastive Language-Image Pre-training (CLIP) has emerged as a
promising direction to learn generalizable representations for 3D scene understanding. We propose
UniScene3D, a transformer-based encoder that learns unified scene representations from multi-view colored
pointmaps, jointly modeling image appearance and geometry. For robust colored pointmap representation
learning, we introduce cross-view geometric alignment and grounded view alignment to enforce geometric and
semantic consistency across views. Extensive low-shot and task-specific fine-tuning evaluations on viewpoint
grounding, scene retrieval, scene type classification, and 3D VQA demonstrate strong performance, supporting
UniScene3D as a unified representation for 3D scene understanding.
Key Takeaways
Core Question: Unlike 2D vision, 3D scene understanding still lacks a generalizable encoder like CLIP, largely due to the scarcity of large-scale 3D pretraining data. This raises the question: can a 2D vision encoder be extended into a general 3D scene encoder without extensive 3D pretraining?
Preliminary Finding: Pointmaps encode world-frame geometry like point clouds while preserving an image-like format compatible with 2D vision models (see the unprojection sketch after this list). Our initial study shows that pretrained 2D vision weights are also beneficial for learning pointmap features.
Model Contribution: UniScene3D extends pretrained CLIP models to learn unified 3D scene representations from pixel-aligned, multi-view colored pointmaps by jointly encoding geometry and appearance.
Key Training Idea: We introduce cross-view geometric alignment and grounded view alignment to enforce geometric and semantic consistency across viewpoints.
Result: The learned representations effectively combine complementary information from images and pointmaps, generalize across diverse scenes, and transfer well to a broad range of downstream 3D tasks.
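To make the pointmap format concrete, below is a minimal sketch, not the paper's code, of how a world-frame pointmap can be built by unprojecting a depth map with pinhole intrinsics and a camera-to-world pose; the function name and conventions are illustrative assumptions.

```python
import numpy as np

def depth_to_pointmap(depth, K, T_cam2world):
    """Unproject a depth map into a world-frame pointmap (illustrative).

    depth:        (H, W) metric depth
    K:            (3, 3) pinhole intrinsics
    T_cam2world:  (4, 4) camera-to-world extrinsics
    returns:      (H, W, 3) per-pixel world coordinates
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Back-project each pixel to a camera-frame ray, then scale by its depth.
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)  # (H, W, 3)
    pts_cam = (pix @ np.linalg.inv(K).T) * depth[..., None]              # (H, W, 3)
    # Map into the shared world frame so all views are geometrically aligned.
    pts_hom = np.concatenate([pts_cam, np.ones((H, W, 1))], axis=-1)     # (H, W, 4)
    return (pts_hom @ T_cam2world.T)[..., :3]
```

Pairing this (H, W, 3) geometry tensor pixel-wise with the RGB image yields a colored pointmap, the input format UniScene3D consumes.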
Method
UniScene3D takes multi-view image-pointmap pairs, fuses appearance and geometry at the input stage, and
learns a unified representation with view-level and scene-level alignment objectives.
Figure 2. The model combines early RGB-geometry fusion with cross-view geometric alignment,
grounded view alignment, view-level alignment, and scene-level alignment.
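To make Figure 2's components more concrete, here is a hedged PyTorch sketch of early RGB-geometry fusion and a CLIP-style view-level alignment loss. The page does not specify the exact architecture or loss, so the 6-channel concatenation, module names, and temperature below are all assumptions, not the official implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ColoredPointmapEmbed(nn.Module):
    """Illustrative early fusion: RGB and the XYZ pointmap are concatenated
    channel-wise into a 6-channel image and patchified like a ViT input.
    (Assumed design; the actual fusion in UniScene3D may differ.)"""

    def __init__(self, dim=768, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(6, dim, kernel_size=patch, stride=patch)

    def forward(self, rgb, xyz):  # both (B, 3, H, W)
        fused = torch.cat([rgb, xyz], dim=1)                 # (B, 6, H, W)
        return self.proj(fused).flatten(2).transpose(1, 2)   # (B, N, dim) tokens

def view_level_alignment(view_emb, text_emb, tau=0.07):
    """CLIP-style InfoNCE between view embeddings and their captions.
    Assumes row i of each batch is a matched view-text pair; tau is a guess."""
    v = F.normalize(view_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / tau
    labels = torch.arange(len(v), device=v.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))
```

The cross-view geometric alignment and grounded view alignment objectives would be additional terms alongside such a view-level loss.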
Qualitative Examples
These examples show how UniScene3D uses appearance and geometry together to retrieve views that match language
queries.
Evaluation
UniScene3D is evaluated on viewpoint grounding, scene retrieval, scene type classification, and 3D VQA. In the tables below, the Input column denotes each method's input modality: I = images, PC = point cloud, PM = pointmap, CPM = colored pointmap.
Viewpoint Grounding
| Method | Input | ScanRefer R@1 | ScanRefer R@5 | ScanRefer R@10 | Nr3D R@1 | Nr3D R@5 | Nr3D R@10 | Sr3D R@1 | Sr3D R@5 | Sr3D R@10 |
|---|---|---|---|---|---|---|---|---|---|---|
| DFN | I | 20.9 | 59.2 | 77.6 | 17.3 | 51.8 | 71.8 | 12.6 | 43.5 | 65.6 |
| SigLIP2 | I | 22.2 | 61.1 | 79.5 | 18.8 | 54.3 | 74.0 | 13.0 | 44.4 | 67.2 |
| FG-CLIP | I | 20.2 | 58.5 | 77.6 | 17.3 | 52.6 | 72.8 | 13.4 | 44.5 | 66.6 |
| Uni3D-g | PC | 4.2 | 19.0 | 35.5 | 3.6 | 17.2 | 33.4 | 3.4 | 16.5 | 32.2 |
| POMA-3D | PM | 16.4 | 50.6 | 71.2 | 13.6 | 44.5 | 65.9 | 10.2 | 37.4 | 59.2 |
| UniScene3D | CPM | 38.6 | 72.5 | 85.7 | 25.2 | 64.0 | 80.9 | 23.6 | 62.8 | 80.4 |
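Here R@k is the fraction of text queries whose ground-truth view ranks among the k most similar candidates. A minimal sketch of that computation, assuming cosine similarity between query and view embeddings (illustrative, not the official evaluation script):

```python
import torch
import torch.nn.functional as F

def recall_at_k(text_emb, view_emb, gt_index, ks=(1, 5, 10)):
    """text_emb: (Q, D) query embeddings; view_emb: (V, D) candidate views;
    gt_index: (Q,) index of the correct view for each query."""
    sim = F.normalize(text_emb, dim=-1) @ F.normalize(view_emb, dim=-1).T  # (Q, V)
    ranks = sim.argsort(dim=-1, descending=True)       # views sorted by similarity
    hits = ranks == gt_index[:, None]                  # (Q, V) boolean matches
    return {f"R@{k}": hits[:, :k].any(dim=-1).float().mean().item() for k in ks}
```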
Scene Retrieval
| Method | Input | ScanRefer n=5 R@1 | ScanRefer n=5 R@5 | ScanRefer n=10 R@1 | ScanRefer n=10 R@5 | Nr3D n=5 R@1 | Nr3D n=5 R@5 | Nr3D n=10 R@1 | Nr3D n=10 R@5 | Sr3D n=5 R@1 | Sr3D n=5 R@5 | Sr3D n=10 R@1 | Sr3D n=10 R@5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DFN | I | 4.6 | 13.5 | 4.6 | 13.3 | 3.9 | 12.7 | 4.4 | 13.0 | 0.6 | 3.1 | 0.6 | 3.2 |
| SigLIP2 | I | 1.5 | 5.3 | 1.8 | 6.6 | 0.7 | 3.0 | 1.0 | 3.8 | 0.1 | 0.3 | 0.1 | 0.4 |
| FG-CLIP | I | 9.2 | 26.3 | 16.3 | 40.0 | 6.2 | 20.8 | 12.4 | 32.8 | 0.6 | 2.7 | 0.7 | 3.8 |
| Uni3D-g | PC | 0.3 | 1.4 | 0.3 | 1.3 | 0.5 | 1.4 | 0.4 | 1.5 | 0.3 | 1.6 | 0.2 | 1.4 |
| 3D-VisTA | PC | 0.6 | 2.4 | 0.7 | 2.2 | 0.1 | 0.9 | 0.2 | 1.3 | 0.2 | 1.5 | 0.2 | 1.7 |
| SceneVerse | PC | 0.5 | 2.1 | 0.6 | 2.7 | 0.5 | 1.3 | 0.5 | 1.4 | 0.4 | 1.8 | 0.1 | 1.9 |
| POMA-3D | PM | 13.8 | 34.9 | 20.4 | 47.2 | 11.9 | 32.7 | 19.3 | 46.8 | 1.6 | 6.9 | 2.4 | 9.5 |
| UniScene3D | CPM | 22.4 | 48.6 | 33.4 | 62.9 | 19.7 | 46.3 | 30.7 | 61.7 | 3.0 | 11.8 | 4.6 | 17.5 |
Scene Type Classification
| Method | Input | Accuracy (0-shot) | Accuracy (1-shot) | Accuracy (5-shot) | Accuracy (10-shot) |
|---|---|---|---|---|---|
| DFN | I | 41.6 | 40.5 | 60.1 | 77.5 |
| SigLIP2 | I | 36.0 | 36.6 | 57.8 | 77.9 |
| FG-CLIP | I | 49.5 | 40.0 | 62.0 | 77.5 |
| Uni3D-g | PC | 8.6 | 21.4 | 37.1 | 45.4 |
| 3D-VisTA | PC | 1.9 | 28.9 | 49.4 | 55.0 |
| SceneVerse | PC | 0.8 | 24.0 | 51.1 | 64.2 |
| POMA-3D | PM | 63.9 | 32.2 | 64.9 | 74.1 |
| UniScene3D | CPM | 70.7 | 43.2 | 72.4 | 83.7 |
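Since UniScene3D is aligned with CLIP's text space, the 0-shot column can be produced by prompt matching against class-name embeddings. A hedged sketch follows; the prompt template and the `text_encoder` callable are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_scene_type(scene_emb, text_encoder, class_names):
    """scene_emb: (B, D) scene embeddings from the 3D encoder.
    text_encoder: any callable mapping a list of strings to (C, D) embeddings.
    Both the callable and the prompt template below are hypothetical."""
    prompts = [f"a 3D scan of a {c}" for c in class_names]  # illustrative template
    text_emb = F.normalize(text_encoder(prompts), dim=-1)   # (C, D)
    scene_emb = F.normalize(scene_emb, dim=-1)              # (B, D)
    return (scene_emb @ text_emb.T).argmax(dim=-1)          # predicted class per scene
```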
3D VQA
| Method | Input | ScanQA EM@1 | ScanQA EM@10 | SQA3D EM@1 | SQA3D EM@10 | Hypo3D EM@1 | Hypo3D EM@10 |
|---|---|---|---|---|---|---|---|
| FG-CLIP | I | 20.9 | 49.9 | 49.5 | 89.7 | 31.1 | 82.1 |
| 3D-VisTA | PC | 22.4 | 52.1 | 48.5 | 85.6 | 31.0 | 81.2 |
| SceneVerse | PC | 22.7 | 51.5 | 49.9 | 85.0 | 31.6 | 80.3 |
| POMA-3D | PM | 22.3 | 52.3 | 51.1 | 91.2 | 33.4 | 84.8 |
| UniScene3D | CPM | 23.2 | 53.5 | 52.5 | 92.3 | 35.2 | 85.9 |
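Reading EM@k as "the ground-truth answer exactly matches one of the model's top-k predictions" (an assumption; the page does not define the metric), a minimal sketch:

```python
def em_at_k(topk_answers, gt_answers, k):
    """topk_answers: one ranked list of predicted answers per question;
    gt_answers: the set of acceptable answer strings per question."""
    hits = sum(
        any(pred.strip().lower() in {g.strip().lower() for g in gts}
            for pred in preds[:k])
        for preds, gts in zip(topk_answers, gt_answers)
    )
    return hits / len(topk_answers)
```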
Main Findings
UniScene3D transfers well across zero-shot, few-shot, and task-specific fine-tuning settings.
Colored pointmaps outperform image-only, pointmap-only, and RGB-D alternatives.
The proposed alignment objectives consistently improve representation quality.
Performance remains strong across different numbers of input views.
Scaling to more pretraining scenes steadily improves downstream results.
UniScene3D generalizes better than prior methods on out-of-domain scenes.
Conclusion
UniScene3D is a 3D representation learning method that jointly models geometry and appearance from multi-view
colored pointmaps. To leverage rich cross-view context, it introduces cross-view geometric alignment and
grounded view alignment objectives during pretraining.
Across multiple 3D scene understanding benchmarks, UniScene3D achieves state-of-the-art performance while
remaining robust to varying numbers of input views. Looking forward, a natural next step is to scale
UniScene3D to larger model sizes and explore its use as a general vision backbone for 3D foundation models.
Citation
@article{mao2026uniscene3d,
  title   = {Contrastive Language-Colored Pointmap Pretraining for Unified 3D Scene Understanding},
  author  = {Mao, Ye and Luo, Weixun and Huang, Ranran and Jing, Junpeng and Mikolajczyk, Krystian},
  journal = {arXiv preprint},
  year    = {2026}
}