MonoDiffSplat

3D Scene Reconstruction using 2D Gaussian Splatting and Video Diffusion Inpainting

Room-Scale 3D Reconstruction via Video Diffusion Priors and Parallax-Inducing SfM

Seyed M. Hossein Hosseini
York University
💻 Code (Coming Soon)

Abstract

Reconstructing high-fidelity, room-scale 3D scenes from a single RGB image remains a persistent challenge. Recent methods such as G4Splat have made significant strides in sparse-view reconstruction by combining 2D Gaussian Splatting with generative priors (specifically, Stable Virtual Camera for single-view inputs) and enforcing planar constraints to regularize geometry. However, in single-view scenarios G4Splat still struggles with heavily occluded regions, where geometric priors fail to capture complex object interactions.

We present an enhanced G4Splat pipeline that addresses these limitations through two key innovations. First, we replace the standard MASt3R-SfM initialization with Depth Anything V3 (DA3), lifting the single view into a denser, metric-accurate point cloud that provides a stronger anchor for the Gaussian optimization and bypasses the MAtCha-based chart-alignment strategy, which often fails in non-overlapping regions. Second, we introduce a parallax-inducing camera-trajectory strategy for the generative inpainting stage. Unlike standard rotational paths or the plane-centric view selection used in the baseline, our "Wiggle & Dolly" trajectories force the video diffusion model to hallucinate valid depth cues and resolve disocclusions by actively moving into the scene. This approach provides a robust geometric baseline, significantly improving single-view reconstruction quality in the complex, unseen regions that prior methods struggle to resolve.

Pipeline Highlights: Metric Depth Initialization, Parallax-Inducing Generative Inpainting, and Plane-Based Geometric Regularization.

For each of three scenes, panels show the single input image alongside rendered frames of the generated scene.

Figure 1: Results of single-image 3D reconstruction using our proposed pipeline.


Method Overview

Our method tackles the sparse-view problem in three stages:

  • 1. Dense Metric Initialization: We utilize Depth Anything V3 to lift the 2D input image into a dense, metric-accurate 3D point cloud.
  • 2. Geometric Regularization (Adapted from G4Splat): We adopt G4Splat's geometry-guided framework, utilizing RANSAC plane fitting and SAM-based normal estimation to constrain Gaussian optimization in planar regions.
  • 3. Generative Inpainting with Parallax Trajectories (See3D): Unlike G4Splat, which defaults to Stable Virtual Camera for single-view inputs, we employ the See3D video diffusion prior by introducing a parallax-inducing trajectory. This strategy forces the model to hallucinate structural depth cues and resolve disocclusions (e.g., behind tables), effectively filling the "black void" artifacts that standard plane-aware view selection fails to address.
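The first and third stages above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the actual implementation: the pinhole intrinsics, the trajectory parameters, and the `wiggle_dolly_trajectory` helper are hypothetical names and values, and in the real pipeline such poses would condition the See3D video diffusion model rather than be consumed directly.

```python
import numpy as np

def backproject_depth(depth, K):
    """Lift a metric depth map (H, W) into a camera-space point cloud
    (H*W, 3) with pinhole intrinsics K (3x3). Assumes DA3-style metric
    depth; real pipelines would also mask invalid pixels."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T      # homogeneous pixels -> camera rays
    return rays * depth.reshape(-1, 1)   # scale each ray by its depth

def wiggle_dolly_trajectory(n_frames=24, wiggle_amp=0.05, dolly_dist=0.4):
    """Hypothetical parallax-inducing path: a sinusoidal lateral 'wiggle'
    combined with a forward 'dolly' into the scene. Returns camera-to-world
    4x4 poses that look at an assumed scene point."""
    target = np.array([0.0, 0.0, 2.0])   # assumed look-at point (illustrative)
    poses = []
    for i in range(n_frames):
        t = i / max(n_frames - 1, 1)
        cam = np.array([wiggle_amp * np.sin(2 * np.pi * t),  # lateral wiggle
                        0.0,
                        dolly_dist * t])                      # dolly forward
        fwd = target - cam
        fwd /= np.linalg.norm(fwd)
        right = np.cross(np.array([0.0, 1.0, 0.0]), fwd)
        right /= np.linalg.norm(right)
        up = np.cross(fwd, right)
        pose = np.eye(4)
        pose[:3, :3] = np.stack([right, up, fwd], axis=1)  # columns: x, y, z
        pose[:3, 3] = cam
        poses.append(pose)
    return np.stack(poses)
```

The forward motion is what distinguishes this from a purely rotational orbit: translating into the scene changes the relative image motion of near and far surfaces, which is the parallax cue the diffusion model must explain with consistent depth.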

Geometric Consistency

Surface normals and depth maps are estimated to regularize the Gaussian splat orientations and positions during optimization. This prevents floating artifacts and encourages accurate geometry from the input view(s).

For each scene below, panels show the RGB image, normal map, and depth map.

Scene 1: Outdoor scene with geometric detail

Scene 2: Room reconstruction


Acknowledgments

This work builds upon G4Splat and Depth Anything V3. We thank the authors for making their code and models publicly available.