OpenDlign: Enhancing Open-World 3D Learning with Depth-Aligned Images

Imperial College London
[Teaser figure]

Comparing conventional open-world 3D learning frameworks with OpenDlign: OpenDlign achieves multimodal alignment via depth-aligned images, which remain geometrically and semantically consistent with the depth maps while offering far richer color and texture variation than the rendered images used in prior frameworks. It also streamlines 3D representation learning by directly fine-tuning the CLIP image encoder, avoiding the additional encoder pre-training required by previous open-world methods. Notably, both rendered and depth-aligned images are required only while learning the alignment.

Abstract

Recent advancements in Vision and Language Models (VLMs) have benefited open-world 3D representation learning, enabling zero-shot recognition of unseen 3D categories. Existing open-world methods pre-train an extra 3D encoder to align features from 3D data (e.g., depth maps or point clouds) with CAD-rendered images and their corresponding texts.

However, these CAD-rendered images offer limited color and texture variation, causing the learned alignment to overfit. Furthermore, the large gap in scale between the pre-training datasets of the 3D encoder and the VLM leads to sub-optimal transfer of 2D knowledge to 3D learning. To address these two challenges, we propose OpenDlign, a novel framework for learning open-world 3D representations that leverages depth-aligned images generated from point-cloud-projected depth maps. Unlike CAD-rendered images, our generated images provide rich, realistic color and texture diversity while remaining geometrically and semantically consistent with the depth maps. Additionally, OpenDlign optimizes the depth-map projection and integrates depth-specific text prompts, improving the adaptation of 2D VLM knowledge to 3D learning via data-efficient fine-tuning.

Experimental results show OpenDlign's superiority in zero-shot and few-shot 3D tasks, exceeding prior benchmarks by 9.7% on ModelNet40 and 17.6% on OmniObject3D in zero-shot classification. Notably, aligning with the generated depth-aligned images also boosts the performance of existing methods, highlighting their potential to advance open-world 3D learning. Our code can be found in the supplemental material.
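
To make the "depth-specific text prompts" concrete, below is a minimal sketch (an assumption, not the authors' released code) of how such prompts could be templated and ensembled into per-class text embeddings with a CLIP-style text encoder. The templates and the encode_text callable are illustrative placeholders.

# Minimal sketch: depth-specific prompt ensembling (illustrative assumptions only).
import torch
import torch.nn.functional as F

DEPTH_PROMPT_TEMPLATES = [          # assumed templates, not the paper's exact ones
    "a depth map of a {}.",
    "a rendered depth image of a {}.",
    "a grayscale depth view of a {}.",
]

def class_text_embeddings(class_names, encode_text):
    """Return one L2-normalized, prompt-ensembled embedding per class.
    encode_text: callable mapping a list of N strings to an (N, D) tensor,
    e.g. a CLIP text encoder."""
    per_class = []
    for name in class_names:
        prompts = [t.format(name) for t in DEPTH_PROMPT_TEMPLATES]
        embeds = F.normalize(encode_text(prompts), dim=-1)        # (T, D)
        per_class.append(F.normalize(embeds.mean(dim=0), dim=-1))  # ensemble
    return torch.stack(per_class)                                  # (C, D)

Prompt ensembling over several templates is the standard CLIP zero-shot recipe; the resulting (C, D) matrix is what the classification step in the framework below compares depth-map features against.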

Overall Framework

[Chair animation]

Overview of OpenDlign. In (a), OpenDlign converts point clouds into multi-view depth maps using a contour-aware projection, which then guides the generation of depth-aligned RGB images with diverse textures that remain geometrically and semantically aligned with the maps. A transformer block, residually connected to the CLIP image encoder, is fine-tuned to align the depth maps with the depth-aligned images, yielding robust 3D representations. For zero-shot classification (b), OpenDlign aggregates multi-view logits from both the pre-trained and the fine-tuned encoders for label prediction; for few-shot classification (c), it employs a logistic regressor trained on multi-view features from both encoders.
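
A minimal sketch of the zero-shot inference in (b), under assumed tensor shapes and an assumed blending weight alpha: per-view depth-map features from the frozen and the fine-tuned CLIP image encoders are matched against the depth-specific text embeddings (e.g., from the prompt sketch above), and the per-view logits are aggregated into one prediction. This is an illustration of the described pipeline, not the released implementation.

import torch
import torch.nn.functional as F

def zero_shot_predict(depth_views, encoder_pretrained, encoder_finetuned,
                      text_embeds, alpha=0.5):
    """depth_views: (V, 3, H, W) multi-view depth maps of one object.
    encoder_*: callables mapping images to (V, D) features (CLIP image encoders).
    text_embeds: (C, D) depth-specific prompt embeddings, one per class.
    alpha: assumed weight blending the two encoders' logits."""
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = 0.0
    for encoder, weight in ((encoder_pretrained, 1.0 - alpha),
                            (encoder_finetuned, alpha)):
        feats = F.normalize(encoder(depth_views), dim=-1)     # (V, D)
        logits = logits + weight * feats @ text_embeds.T      # (V, C)
    class_scores = logits.mean(dim=0)                         # aggregate over views
    return class_scores.argmax().item()

For the few-shot variant in (c), the same multi-view features (e.g., averaged across views and concatenated across encoders) would instead be fed to a logistic regressor, such as sklearn.linear_model.LogisticRegression, trained on the labeled support set.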

Zero-Shot 3D Classification

[Figure: zero-shot 3D classification results]

Few-Shot 3D Classification

[Figure: few-shot 3D classification results]

Zero-Shot 3D Object Detection

[Figure: zero-shot 3D object detection results]

Cross-Modal Retrieval

[Figure: cross-modal retrieval results]

3D shape retrieval results. (a) Two most similar 3D shapes for each query image. (b) Most similar 3D shapes for each query text. (c) Two most similar 3D shapes for combined image and text queries.
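
As a rough illustration of how such retrieval can be performed (an assumption, not the released implementation): each 3D shape is represented by an embedding of its multi-view depth maps, and image, text, or combined queries are ranked by cosine similarity against the gallery.

import torch
import torch.nn.functional as F

def retrieve_top_k(shape_embeds, query_embed, k=2):
    """shape_embeds: (N, D) one embedding per 3D shape in the gallery.
    query_embed: (D,) embedding of an image query, a text query, or their fusion.
    Returns the indices of the k most similar shapes."""
    shape_embeds = F.normalize(shape_embeds, dim=-1)
    query_embed = F.normalize(query_embed, dim=-1)
    sims = shape_embeds @ query_embed            # (N,) cosine similarities
    return sims.topk(k).indices.tolist()

# For the combined image + text query in (c), one simple assumed fusion is to
# average the two normalized embeddings before retrieval:
# fused = F.normalize(F.normalize(img_emb, dim=-1) + F.normalize(txt_emb, dim=-1), dim=-1)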

BibTeX

@article{ye2024opendlign,
  author    = {Ye Mao and Junpeng Jing and Krystian Mikolajczyk},
  title     = {OpenDlign: Enhancing Open-World 3D Learning with Depth-Aligned Images},
  journal   = {arXiv preprint},
  year      = {2024},
}