How does it work?
During training (left), given a 2D source video (in blue) of a dynamic scene, we first reconstruct the scene using off-the-shelf monocular reconstruction algorithms such as MegaSAM to obtain the 3D scene geometry, \( \mathcal{G}_{\rm src} \), and camera odometry, \( \mathbf{c}_{\rm src} \). We then sample a set of arbitrary camera trajectories \( \{\mathbf{c}_1, \cdots, \mathbf{c}_N\} \) to simulate plausible occluded geometries, \( \{\mathcal{G}^{\rm cov}_{{\rm src},1}, \cdots, \mathcal{G}^{\rm cov}_{{\rm src},N}\} \), which, when rendered from the original camera trajectory \( \mathbf{c}_{\rm src} \), produce a mask of source pixels that are co-visible from the sampled trajectory (in orange). The source video and its masked variant form a self-supervised training pair for learning CogNVS, our video inpainting diffusion model (visualized in the next figure).

At inference (right), we finetune CogNVS on the given input sequence by constructing self-supervised training pairs in the same way. The final novel view is then generated with the finetuned CogNVS in a feed-forward manner.
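To make the co-visibility idea concrete, below is a minimal sketch (not the authors' code) of how such a mask could be computed for a single frame, assuming the reconstructed geometry is available as a per-pixel depth map with pinhole intrinsics and that cameras are given as 4x4 world-to-camera matrices. The function name and arguments (`compute_covisibility_mask`, `T_src`, `T_novel`) are illustrative, and occlusion testing against the full scene geometry (e.g. a z-buffer rendered in the novel view) is omitted for brevity.

```python
import numpy as np

def compute_covisibility_mask(D_src, K, T_src, T_novel):
    """Mask of source pixels whose 3D points also fall inside the novel camera's frustum.

    D_src:   (H, W) depth map of the source frame.
    K:       (3, 3) pinhole intrinsics shared by both cameras (an assumption).
    T_src:   (4, 4) world-to-camera pose of the source camera.
    T_novel: (4, 4) world-to-camera pose of a sampled novel camera.
    """
    H, W = D_src.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))

    # Unproject source pixels to 3D points in the source camera frame.
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    cam_pts = (np.linalg.inv(K) @ pix.T) * D_src.reshape(1, -1)        # (3, H*W)

    # Lift to world coordinates, then transform into the novel camera frame.
    cam_h = np.vstack([cam_pts, np.ones((1, cam_pts.shape[1]))])       # (4, H*W)
    world = np.linalg.inv(T_src) @ cam_h
    novel = T_novel @ world                                            # (4, H*W)

    # Project into the novel image plane.
    z = novel[2]
    proj = K @ novel[:3]
    x = proj[0] / np.clip(z, 1e-6, None)
    y = proj[1] / np.clip(z, 1e-6, None)

    # A pixel counts as co-visible if its point lies in front of the novel
    # camera and projects inside the novel image bounds.
    visible = (z > 0) & (x >= 0) & (x < W) & (y >= 0) & (y < H)
    return visible.reshape(H, W)
```

Masking out the pixels where this returns `False` in the source frame yields the partially observed input of a training pair, with the original source frame serving as the target; the same construction is reused at inference time for test-time finetuning.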
