MonoDiffSplat

3D Scene Reconstruction using 2D Gaussian Splatting and Video Diffusion Inpainting

Room-Scale 3D Reconstruction via Video Diffusion Priors and Parallax-Inducing SfM

Seyed M. Hossein Hosseini
York University
💻 Code (Coming Soon)

Abstract

Reconstructing high-fidelity, room-scale 3D scenes from a single RGB image remains a persistent challenge. Recent methods such as G4Splat have made significant strides in sparse-view reconstruction by combining 2D Gaussian Splatting with generative priors (specifically, Stable Virtual Camera for single-view inputs) and enforcing planar constraints to regularize geometry. However, in single-view scenarios G4Splat still struggles with heavily occluded regions, where geometric priors fail to capture complex object interactions.

We present an enhanced G4Splat pipeline that addresses these limitations through two key innovations. First, we replace the standard MASt3R-SfM initialization with Depth Anything V3 (DA3), lifting the single view into a denser, metric-accurate point cloud that provides a stronger anchor for the Gaussian optimization and bypasses the MAtCha-based chart-alignment strategy, which often fails in non-overlapping regions. Second, we introduce a parallax-inducing camera-trajectory strategy for the generative inpainting stage. Unlike standard rotational paths or the plane-centric view selection used in the baseline, our "Wiggle & Dolly" trajectories force the video diffusion model to hallucinate valid depth cues and resolve disocclusions by actively moving into the scene. This approach provides a robust geometric baseline, significantly improving single-view reconstruction quality in the complex, unseen regions that prior methods struggle to resolve.

Pipeline Highlights: Metric Depth Initialization, Parallax-Inducing Generative Inpainting, and Plane-Based Geometric Regularization.

For each of three scenes, panels show the single input image alongside rendered frames of the generated scene.

Figure 1: Results of single-image 3D reconstruction using our proposed pipeline.


Method Overview

Our method tackles the sparse-view problem in three stages:

  • 1. Dense Metric Initialization: We utilize Depth Anything V3 to lift the 2D input image into a dense, metric-accurate 3D point cloud.
  • 2. Geometric Regularization (Adapted from G4Splat): We adopt G4Splat's geometry-guided framework, utilizing RANSAC plane fitting and SAM-based normal estimation to constrain Gaussian optimization in planar regions.
  • 3. Generative Inpainting with Parallax Trajectories (See3D): Unlike G4Splat, which defaults to Stable Virtual Camera for single-view inputs, we employ the See3D video diffusion prior by introducing a parallax-inducing trajectory. This strategy forces the model to hallucinate structural depth cues and resolve disocclusions (e.g., behind tables), effectively filling the "black void" artifacts that standard plane-aware view selection fails to address.
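The first and third stages above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the actual implementation: the pinhole intrinsics, the trajectory parameters, and the `wiggle_dolly_trajectory` helper are hypothetical names and values, and in the real pipeline such poses would condition the See3D video diffusion model rather than be consumed directly.

```python
import numpy as np

def backproject_depth(depth, K):
    """Lift a metric depth map (H, W) into a camera-space point cloud
    (H*W, 3) with pinhole intrinsics K (3x3). Assumes DA3-style metric
    depth; real pipelines would also mask invalid pixels."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T      # homogeneous pixels -> camera rays
    return rays * depth.reshape(-1, 1)   # scale each ray by its depth

def wiggle_dolly_trajectory(n_frames=24, wiggle_amp=0.05, dolly_dist=0.4):
    """Hypothetical parallax-inducing path: a sinusoidal lateral 'wiggle'
    combined with a forward 'dolly' into the scene. Returns camera-to-world
    4x4 poses that look at an assumed scene point."""
    target = np.array([0.0, 0.0, 2.0])   # assumed look-at point (illustrative)
    poses = []
    for i in range(n_frames):
        t = i / max(n_frames - 1, 1)
        cam = np.array([wiggle_amp * np.sin(2 * np.pi * t),  # lateral wiggle
                        0.0,
                        dolly_dist * t])                      # dolly forward
        fwd = target - cam
        fwd /= np.linalg.norm(fwd)
        right = np.cross(np.array([0.0, 1.0, 0.0]), fwd)
        right /= np.linalg.norm(right)
        up = np.cross(fwd, right)
        pose = np.eye(4)
        pose[:3, :3] = np.stack([right, up, fwd], axis=1)  # columns: x, y, z
        pose[:3, 3] = cam
        poses.append(pose)
    return np.stack(poses)
```

The forward motion is what distinguishes this from a purely rotational orbit: translating into the scene changes the relative image motion of near and far surfaces, which is the parallax cue the diffusion model must explain with consistent depth.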

Geometric Consistency

Surface normals and depth maps are estimated to regularize the Gaussian splat orientations and positions during optimization. This prevents floating artifacts and encourages accurate geometry from the input view(s).

For each scene below, panels show the RGB image, normal map, and depth map.

Scene 1: Outdoor scene with geometric detail

Scene 2: Room reconstruction


Acknowledgments

This work builds upon G4Splat and Depth Anything V3. We thank the authors for making their code and models publicly available.