SparseLGS: Sparse View Language Embedded Gaussian Splatting

Jun Hu1,2     Zhang Chen2     Zhong Li2     Yi Xu2     Juyong Zhang1
1University of Science and Technology of China        2OPPO US Research Center       

Abstract

Recently, several studies have combined Gaussian Splatting with language embeddings to obtain scene representations for open-vocabulary 3D scene understanding. While these methods perform well, they require very dense multi-view inputs, limiting their applicability in real-world scenarios. In this work, we propose SparseLGS to address the challenge of 3D scene understanding from pose-free, sparse-view input images. Our method leverages a learning-based dense stereo model to handle pose-free, sparse inputs, and a three-step region matching approach to resolve multi-view semantic inconsistencies, which are especially pronounced with sparse inputs. Instead of directly learning high-dimensional CLIP features, we extract low-dimensional features and build a bijection back to the original CLIP embeddings, avoiding excessive learning and storage costs. We further introduce a reconstruction loss during semantic training to improve Gaussian positions and shapes. To the best of our knowledge, we are the first to address 3D semantic field reconstruction from sparse, pose-free inputs. Experimental results show that SparseLGS reconstructs semantic fields of comparable quality from far fewer inputs (3–4 views) than previous SOTA methods with dense input. Moreover, given the same sparse input, SparseLGS leads significantly in quality and is roughly 5× faster.
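
To make the bijection idea concrete, below is a minimal Python sketch, not the paper's implementation: it assumes PCA as the dimensionality reduction (the actual scheme may differ), and build_bijection / lookup are hypothetical names. Each distinct region-level CLIP feature receives exactly one low-dimensional code, and a lookup table recovers the exact high-dimensional feature at query time, so only the low-dimensional codes need to be learned and stored per Gaussian.

import numpy as np
from sklearn.decomposition import PCA

def build_bijection(region_feats, dim=3):
    # region_feats: (N, 512) CLIP features, one per SAM region.
    # Deduplicate so each distinct feature gets exactly one code,
    # making the low-D <-> high-D mapping one-to-one by construction.
    uniq, inverse = np.unique(region_feats, axis=0, return_inverse=True)
    codes = PCA(n_components=dim).fit_transform(uniq)   # (M, dim) low-D codes
    table = {tuple(c): f for c, f in zip(codes, uniq)}  # code -> CLIP feature
    return codes[inverse], table                        # per-region codes + lookup

def lookup(query_code, table):
    # Snap a rendered low-D code to the nearest stored code, then map it
    # back to the full CLIP feature for open-vocabulary queries.
    keys = np.array(list(table.keys()))
    nearest = keys[np.argmin(np.linalg.norm(keys - query_code, axis=1))]
    return table[tuple(nearest)]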

Pipeline

SparseLGS generates high-quality language fields from pose-free, sparse-view inputs in just a few minutes. We first leverage SAM and CLIP to obtain object-wise semantic maps, then use a learning-based stereo model to derive camera poses and point clouds from the sparse inputs. To address semantic inconsistencies across views, we employ a three-step multi-view semantic alignment strategy (a simplified sketch of the idea follows below). To better integrate semantics with Gaussian Splatting, we establish a bijection between the original CLIP features and their dimensionality-reduced counterparts. During training, we incorporate RGB supervision to enhance the 3D consistency of the learned language field.
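
The exact three steps are detailed in the paper; the sketch below is a rough, simplified stand-in that keeps only a semantic-similarity step (the full procedure presumably also exploits cross-view correspondences), with an assumed threshold and hypothetical names. It greedily matches SAM regions across two views by the cosine similarity of their CLIP features and copies the reference view's feature onto each match, so both views describe the same object with the same embedding.

import numpy as np

def align_two_views(feats_ref, feats_src, thresh=0.8):
    # feats_*: (N, D) per-region CLIP features from two views.
    a = feats_ref / np.linalg.norm(feats_ref, axis=1, keepdims=True)
    b = feats_src / np.linalg.norm(feats_src, axis=1, keepdims=True)
    sim = a @ b.T                                   # pairwise cosine similarity
    aligned = feats_src.copy()
    used_ref, used_src = set(), set()
    # Greedy one-to-one assignment, most similar pairs first.
    for flat in np.argsort(-sim, axis=None):
        i, j = np.unravel_index(flat, sim.shape)
        if sim[i, j] < thresh:
            break                                   # remaining pairs too dissimilar
        if i in used_ref or j in used_src:
            continue
        aligned[j] = feats_ref[i]                   # adopt the reference feature
        used_ref.add(i)
        used_src.add(j)
    return aligned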

Comparison

We compare our method with two SOTA 3D language field reconstruction methods.

BibTeX

If you find SparseLGS useful for your work, please cite:

@article{hu2024sparselgs,
  title={SparseLGS: Sparse View Language Embedded Gaussian Splatting},
  author={Jun Hu and Zhang Chen and Zhong Li and Yi Xu and Juyong Zhang},
  journal={arXiv preprint arXiv:2412.02245},
  year={2024}
}