Recent advances in Vision-Language Models (VLMs) have benefited open-world 3D representation learning, enabling zero-shot recognition of unseen 3D categories. Existing open-world methods pre-train an extra 3D encoder to align features from 3D data (e.g., depth maps or point clouds) with CAD-rendered images and their corresponding texts.
However, CAD-rendered images offer limited color and texture variation, which leads to overfitting during alignment. Furthermore, the large gap in scale between the 3D encoder's pre-training data and the VLM's results in sub-optimal transfer of 2D knowledge to 3D learning. To address these two challenges, we propose OpenDlign, a novel framework for open-world 3D representation learning that leverages depth-aligned images generated from point cloud-projected depth maps. Unlike CAD-rendered images, our generated images provide rich, realistic color and texture diversity while preserving geometric and semantic consistency with the depth maps. Additionally, OpenDlign optimizes the depth map projection and integrates depth-specific text prompts, improving the adaptation of 2D VLM knowledge to 3D learning through data-efficient fine-tuning.
Experimental results show OpenDlign's superiority on 3D zero-shot and few-shot tasks, exceeding prior benchmarks by 9.7% on ModelNet40 and 17.6% on OmniObject3D in zero-shot classification. Notably, aligning with our generated depth-aligned images also boosts the performance of existing methods, highlighting their potential to advance open-world 3D learning. Our code can be found in the supplemental material.
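To make the depth-map projection step described above concrete, below is a minimal sketch of turning a point cloud into a single-view depth map that could then condition an off-the-shelf depth-to-image generator to produce a depth-aligned RGB view. This is an illustrative assumption, not the released OpenDlign code: the function `project_depth_map`, the orthographic projection, the 224x224 resolution, and the specific view rotation are all placeholders.

```python
import numpy as np

def project_depth_map(points, image_size=224, view_rotation=None):
    """Project a point cloud to a single-view depth map (orthographic sketch).

    points: (N, 3) array of xyz coordinates.
    view_rotation: optional (3, 3) rotation matrix selecting the viewpoint.
    Returns an (image_size, image_size) uint8 depth image where nearer
    surfaces are brighter and empty pixels are dark.
    """
    pts = points @ view_rotation.T if view_rotation is not None else points.copy()

    # Normalize xy to pixel coordinates and z to [0, 1] depth.
    xy = pts[:, :2]
    xy = (xy - xy.min(0)) / (np.ptp(xy, axis=0) + 1e-8)
    uv = np.clip((xy * (image_size - 1)).astype(int), 0, image_size - 1)
    z = pts[:, 2]
    z = (z - z.min()) / (np.ptp(z) + 1e-8)

    # Z-buffer: keep the nearest point per pixel (smallest z wins).
    depth = np.full((image_size, image_size), np.inf)
    for (u, v), d in zip(uv, z):
        if d < depth[v, u]:
            depth[v, u] = d
    depth[np.isinf(depth)] = 1.0  # pixels with no points -> far plane

    # Invert so near surfaces are bright, the convention most
    # depth-conditioned image generators expect.
    return ((1.0 - depth) * 255).astype(np.uint8)

# Example: one of several viewpoints; the resulting map would condition a
# depth-to-image model to synthesize a "depth-aligned" RGB view.
cloud = np.random.rand(2048, 3)                            # stand-in point cloud
rot = np.array([[1, 0, 0], [0, 0, -1], [0, 1, 0]], float)  # rotate to a side view
depth_img = project_depth_map(cloud, view_rotation=rot)
print(depth_img.shape, depth_img.dtype)                     # (224, 224) uint8
```

In the full pipeline described in the abstract, multiple viewpoints would be projected this way, and each depth map would be paired with a generated RGB image before alignment with the VLM's image and text features.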
@article{ye2024opendlign,
  author  = {Ye Mao and Junpeng Jing and Krystian Mikolajczyk},
  title   = {OpenDlign: Enhancing Open-World 3D Learning with Depth-Aligned Images},
  journal = {arXiv preprint},
  year    = {2024},
}