MonoDiffSplat

Iterative Sparse-View 3D Reconstruction via Plane-Guided Depth Refinement and Video Diffusion

Sam (Seyed M.) Hosseini
York University
Repository

Abstract

Reconstructing usable 3D geometry from very sparse inputs remains difficult, especially when large parts of a scene are unobserved. Recent pipelines can combine monocular depth, plane priors, Gaussian rendering, and generative completion, but the interaction between those components becomes fragile in single-view and low-view regimes.

MonoDiffSplat is a repository pipeline built on top of G4Splat, Depth Anything 3, See3D, and 2D Gaussian Splatting. In the current implementation, the system bootstraps depth and plane structure from the input views, trains an initial 2D Gaussian model, then runs up to three iterative rounds of novel-view rendering, See3D inpainting, base-cloud-anchored depth refinement, point-cloud quality control, and resumed Gaussian training.

Within that G4Splat-based framework, MonoDiffSplat extends the implementation with Depth Anything 3 initialization for sparse and single-view settings, stage-to-stage chaining of plane models and base clouds, structured coverage-driven view selection, and resume-stage geometry injection with additional pruning during Gaussian retraining. This page focuses on the repository pipeline and qualitative outputs.

[Figure 1 panels: three scenes, each showing the single input image alongside rendered frames of the generated scene.]

Figure 1: Qualitative sparse-view reconstruction examples.


Method Overview

MonoDiffSplat keeps the core G4Splat ingredients and reorganizes them into three phases with explicit stage chaining and iterative refinement:

  • Bootstrap. Depth Anything 3 produces an initial monocular depth estimate with global alignment. Geometry-branch normals and SAM masks drive plane extraction, an initial plane-aware refinement is run, and a first 2D Gaussian model is trained.
  • Iterative See3D refinement (1–3 rounds, default 2). Each round renders structured novel views, runs See3D inpainting with feathered visible-region merge, anchors the new depths against the previous round's unified point cloud while carrying forward fitted plane models, exports an aligned point cloud, and resumes Gaussian training with delta geometry injection and floater cleanup.
  • Finalization. Optional adaptive-TSDF tetra mesh extraction and held-out view evaluation.
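The three phases above chain into a simple control flow. The sketch below is a minimal illustration of that staging, assuming placeholder stubs for each stage; the function names and state fields are hypothetical, not the repository's actual API:

```python
# Hypothetical control-flow sketch of the three phases described above.
# All helpers are placeholder stubs; names and state fields are
# illustrative, not the repository's real interfaces.

def bootstrap(views):
    # Depth Anything 3 depth + plane extraction + initial 2DGS training.
    return {"cloud": "round0", "planes": [], "model": "gs0"}

def refinement_round(state, idx):
    # Render novel views, See3D-inpaint, anchor new depths to the previous
    # round's cloud, QC-export, and resume Gaussian training.
    return {"cloud": f"round{idx + 1}",
            "planes": state["planes"],    # plane models chain forward
            "model": f"gs{idx + 1}"}

def run_pipeline(views, num_rounds=2):    # 1-3 rounds, default 2
    state = bootstrap(views)
    for idx in range(num_rounds):
        state = refinement_round(state, idx)
    return state                          # finalization: mesh + evaluation
```

The key structural point is that each round consumes the previous round's point cloud and plane models rather than re-deriving them from scratch.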
MonoDiffSplat pipeline diagram

Figure 2: Pipeline summary. Bootstrap builds the first depth-refined point cloud and Gaussian model from sparse RGB inputs. The iterative phase then repeats one to three times: render novel views, inpaint with See3D, refine depth against the previous round's geometry, export QC point clouds, resume Gaussian training, and re-render.
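The feathered visible-region merge used when compositing inpainted pixels back over a rendered view can be sketched as follows. This is an illustrative blend, assuming a city-block distance transform and a linear ramp; the repository's exact feathering may differ:

```python
import numpy as np

# Sketch of a feathered visible-region merge: keep rendered pixels where
# the scene was visible, take See3D-inpainted pixels elsewhere, and blend
# across a soft seam. Distance metric and ramp width are illustrative.

def _dist_to_invisible(mask):
    """City-block distance from each visible (True) pixel to the nearest
    invisible (False) pixel, via directional min-sweeps."""
    d = np.where(mask, np.inf, 0.0)
    for _ in range(2):  # two rounds of four sweeps suffice here
        for i in range(1, d.shape[0]):           # top -> bottom
            d[i] = np.minimum(d[i], d[i - 1] + 1)
        for i in range(d.shape[0] - 2, -1, -1):  # bottom -> top
            d[i] = np.minimum(d[i], d[i + 1] + 1)
        for j in range(1, d.shape[1]):           # left -> right
            d[:, j] = np.minimum(d[:, j], d[:, j - 1] + 1)
        for j in range(d.shape[1] - 2, -1, -1):  # right -> left
            d[:, j] = np.minimum(d[:, j], d[:, j + 1] + 1)
    return d

def feather_merge(rendered, inpainted, visible_mask, feather_px=8):
    """Blend with a linear seam of width `feather_px` at the mask boundary."""
    w = np.clip(_dist_to_invisible(visible_mask) / feather_px, 0.0, 1.0)
    w = w[..., None]  # broadcast over colour channels
    return w * rendered + (1.0 - w) * inpainted
```

Deep inside the visible region the rendered pixels pass through untouched, which is what limits full-frame diffusion drift to the newly hallucinated areas.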


Geometric Consistency

Depth and normal estimates are used as geometric supervision throughout refinement and training. In the current implementation, this is meant to reduce floaters, keep planar regions better aligned, and make sparse-view reconstructions more stable when only a few input images are available.
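Supervision terms of this kind typically pair a scale-aware depth loss with a cosine normal loss. The sketch below shows common choices under that assumption; the repository's actual loss weights and forms may differ:

```python
import numpy as np

# Illustrative geometric supervision terms: a median-aligned L1 depth loss
# (tolerant of monocular scale ambiguity) and a cosine normal loss. These
# are standard choices, not necessarily the repository's exact losses.

def depth_loss(pred, target, mask):
    """L1 depth error after median scale alignment, over valid pixels."""
    p, t = pred[mask], target[mask]
    scale = np.median(t) / max(np.median(p), 1e-8)  # align mono scale
    return np.mean(np.abs(scale * p - t))

def normal_loss(pred, target, mask):
    """Mean (1 - cosine similarity) between unit normals over valid pixels."""
    dots = np.sum(pred * target, axis=-1)[mask]
    return np.mean(1.0 - dots)
```

Because the depth term is invariant to a global scale factor, a monocular prediction that is correct up to scale incurs no penalty, which matters when anchoring Depth Anything 3 outputs.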

[Figure panels: RGB image, normal map, and depth map for each scene.]

Scene 1: Outdoor scene with geometric detail

Scene 2: Room reconstruction


Extensions Over G4Splat

  • Base-cloud-anchored depth refinement. See3D depths are aligned to the previous round's QC point cloud via per-segment scale fitting, with hard acceptance gates and an adaptive view-rescue floor. Inherited plane models are matched first, with boundary-anchored linear mono fits as a weak-plane fallback.
  • Geometry-branch normals. Plane extraction on inpainted views uses normals computed from raw mono depth rather than from GS-warp-aligned depth, avoiding projective distortion on regions with sparse Gaussian coverage.
  • Structured view selection. Novel views are drawn from a fixed trajectory mix (parallax, dolly, orbit, top-down, plane-guided), filtered by visibility and coverage, and feather-merged with input-view backward warps to reduce full-frame diffusion drift.
  • QC-gated geometry injection and floater cleanup. Unified point cloud export applies multi-view consistency, plane-extension snapping, and an adaptive inlier gate before injection. Resumed Gaussian training adds depth-prune grace periods, post-densify shard cleanup, and an opacity-gated anisotropy pass.
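The per-segment scale fitting with a hard acceptance gate can be sketched as follows. This is a minimal illustration under simple assumptions (a scalar least-squares scale per segment and a relative-error inlier gate); segment ids, thresholds, and the gating rule are illustrative, and the repository's view-rescue floor and plane-model matching are omitted:

```python
import numpy as np

# Sketch of per-segment scale fitting against an anchor (base) cloud.
# For each segment, fit a scalar s minimizing ||s * mono - anchor||^2 on
# pixels with anchored depth, then accept the fit only if enough pixels
# agree after scaling. Thresholds here are hypothetical defaults.

def fit_segment_scales(mono_depth, anchor_depth, segments,
                       rel_tol=0.05, min_inlier=0.6):
    scales = {}
    for seg_id in np.unique(segments):
        sel = (segments == seg_id) & np.isfinite(anchor_depth)
        if sel.sum() < 10:
            continue  # too few anchored pixels to fit reliably
        m, a = mono_depth[sel], anchor_depth[sel]
        s = np.dot(m, a) / max(np.dot(m, m), 1e-12)  # closed-form LS scale
        inlier = np.abs(s * m - a) / np.maximum(a, 1e-6) < rel_tol
        if inlier.mean() >= min_inlier:  # hard acceptance gate
            scales[seg_id] = s
    return scales
```

Segments that fail the gate simply keep no fitted scale, which is where a fallback such as an inherited plane model or a weak-plane mono fit would take over.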

Acknowledgments

This repository is built on G4Splat and also depends on Depth Anything 3, See3D, and the broader 2D Gaussian Splatting toolchain. Credit for the base G4Splat formulation remains with the original authors; MonoDiffSplat primarily documents and implements the iterative sparse-view extensions described in this repository.

We thank the upstream authors for releasing code, models, and documentation that made this implementation possible.