concept-fusion.github.io - ConceptFusion: Open-set Multimodal 3D Mapping




Building 3D maps of the environment is central to robot navigation, planning, and interaction with objects in a scene. Most existing approaches that integrate semantic concepts with 3D maps remain confined to the closed-set setting: they can only reason about a finite set of concepts pre-defined at training time. Further, these maps can only be queried using class labels or, in recent work, text prompts.

We address both of these issues with ConceptFusion, a scene representation that is: (i) fundamentally open-set, enabling reasoning beyond a closed set of concepts; and (ii) inherently multimodal, enabling a diverse range of possible queries to the 3D map, from language, to images, to audio, to 3D geometry, all working in concert. ConceptFusion leverages the open-set capabilities of today's foundation models, pre-trained on internet-scale data, to reason about concepts across modalities such as natural language, images, and audio.
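Because every query modality is reduced to an embedding in the same feature space as the map, retrieval itself amounts to a similarity search. The sketch below illustrates that step under a few assumptions: the map stores one fused embedding per 3D point as an (N, D) array, and `query_embedding` comes from some modality-specific encoder (e.g. a CLIP-style text or image encoder, or an audio encoder). The function names are placeholders for illustration, not the project's actual API.

```python
# Minimal sketch: querying a multimodal 3D map by embedding similarity.
import numpy as np

def query_map(point_features, query_embedding, top_k=100):
    """Rank 3D map points by cosine similarity to a query from any modality.

    point_features: (N, D) array, one fused embedding per map point.
    query_embedding: (D,) embedding of a text / image / audio / 3D query.
    """
    pts = point_features / np.linalg.norm(point_features, axis=1, keepdims=True)
    q = query_embedding / np.linalg.norm(query_embedding)
    scores = pts @ q                        # cosine similarity per point
    order = np.argsort(-scores)[:top_k]     # indices of the best-matching points
    return order, scores[order]
```

The same function serves all modalities: whether the query started as a sentence, an image crop, or an audio clip, it is first mapped into the shared embedding space and then compared against the per-point features.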

ConceptFusion constructs pixel-aligned features from off-the-shelf foundation models that otherwise produce only a global (image-level) embedding vector. This is achieved by processing input images to generate generic (class-agnostic) object masks, extracting a local feature for each mask, computing a global feature for the input image as a whole, and fusing the region-specific features with the global feature using our proposed zero-shot pixel-alignment technique.
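The sketch below illustrates this fusion step, with `embed_image` standing in for a foundation-model encoder (e.g. a CLIP-style visual encoder) and `generate_masks` for a class-agnostic mask generator; both are assumed placeholders. The similarity-based mixing weight shown here is a simplified stand-in for the paper's zero-shot pixel-alignment rule, not an exact reimplementation.

```python
# Minimal sketch: building a pixel-aligned feature map from a global image
# encoder plus class-agnostic region masks.
import numpy as np

def pixel_aligned_features(image, embed_image, generate_masks):
    """Return an (H, W, D) map holding one fused embedding per pixel."""
    h, w = image.shape[:2]

    # Global (image-level) feature for the whole frame.
    f_global = embed_image(image)
    f_global = f_global / np.linalg.norm(f_global)

    masks = generate_masks(image)  # list of (H, W) boolean masks
    # Pixels outside any mask keep the global feature.
    feat_map = np.tile(f_global, (h, w, 1))

    for m in masks:
        # Local feature: embed the crop around this class-agnostic region.
        ys, xs = np.where(m)
        crop = image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
        f_local = embed_image(crop)
        f_local = f_local / np.linalg.norm(f_local)

        # One simple choice of mixing weight: regions that look like the whole
        # image lean on the global feature, distinctive regions stay local.
        sim = float(f_global @ f_local)   # cosine similarity in [-1, 1]
        w_mix = 0.5 * (sim + 1.0)         # rescaled to [0, 1]

        fused = w_mix * f_global + (1.0 - w_mix) * f_local
        feat_map[m] = fused / np.linalg.norm(fused)

    return feat_map
```

Each pixel thus carries an embedding in the same space as the global encoder, which is what allows the fused 3D map to be queried with any modality that encoder family supports.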
