Reconstruct, Inpaint, Finetune:
Dynamic Novel-view Synthesis from Monocular Videos

Kaihua Chen, Tarasha Khurana, Deva Ramanan
Carnegie Mellon University

TL;DR: CogNVS is a video diffusion model for dynamic novel-view synthesis, trained in a self-supervised manner using only 2D videos! We reformulate novel-view synthesis as a structured inpainting task: (1) reconstruct the input views with off-the-shelf SLAM systems, (2) create self-supervised training pairs for pretraining an inpainting model, and (3) test-time finetune the pretrained model on the input video at inference.

Given an in-the-wild monocular video capturing a dynamic scene, we first reconstruct the scene, render the reconstruction from the target novel view, and inpaint any unobserved regions. Because CogNVS is pre-trained via self-supervision, it can also be test-time finetuned on a given target video, enabling it to generalize zero-shot to novel domains. Here, we illustrate CogNVS’s “reconstruct, inpaint, finetune” pipeline on a sample video.
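To make the “reconstruct and render” step concrete, below is a minimal NumPy sketch (not the paper’s renderer): it splats a single RGB-D source frame into a hypothetical target camera with a z-buffer and returns both the partial render and the mask of holes left for inpainting. The intrinsics K, the per-pixel depth, and the relative pose T_tgt_src are assumed to come from an off-the-shelf reconstruction such as MegaSAM; all names here are illustrative, not from the CogNVS codebase.

import numpy as np

def render_novel_view(rgb, depth, K, T_tgt_src):
    """Splat an RGB-D source frame into a target camera.

    rgb:        (H, W, 3) source colors
    depth:      (H, W)    source depth (e.g., from an off-the-shelf reconstruction)
    K:          (3, 3)    pinhole intrinsics (assumed shared by both views)
    T_tgt_src:  (4, 4)    source-to-target camera transform

    Returns the partial target render and a mask of pixels that received
    no source point -- the regions the inpainting model must fill.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], -1).reshape(-1, 3).astype(np.float64)
    colors = rgb.reshape(-1, 3)

    # Unproject source pixels to 3D camera coordinates.
    pts_src = (np.linalg.inv(K) @ pix.T) * depth.reshape(1, -1)

    # Move points into the target camera frame and project them.
    pts_tgt = T_tgt_src[:3, :3] @ pts_src + T_tgt_src[:3, 3:4]
    z = pts_tgt[2]
    proj = K @ pts_tgt
    x = np.round(proj[0] / np.clip(z, 1e-6, None)).astype(int)
    y = np.round(proj[1] / np.clip(z, 1e-6, None)).astype(int)

    valid = (z > 0) & (x >= 0) & (x < W) & (y >= 0) & (y < H)
    render = np.zeros((H, W, 3), dtype=rgb.dtype)
    zbuf = np.full((H, W), np.inf)

    # Keep the nearest point per target pixel (simple z-buffer).
    for i in np.flatnonzero(valid):
        if z[i] < zbuf[y[i], x[i]]:
            zbuf[y[i], x[i]] = z[i]
            render[y[i], x[i]] = colors[i]

    hole_mask = np.isinf(zbuf)   # True where the target view saw nothing
    return render, hole_mask

In the full pipeline this rendering is repeated per frame along the target camera trajectory, and hole_mask marks exactly the pixels that the video diffusion model is asked to inpaint.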

In-the-wild Real-world Gallery

(a) Input video
(b) Novel-view renders
(c) Novel-view by CogNVS

In-the-wild Synthetic Gallery

(a) Input video
(b) Novel-view renders
(c) Novel-view by CogNVS

Abstract

We explore novel-view synthesis for dynamic scenes from monocular videos. Prior approaches rely on costly test-time optimization of 4D representations or do not preserve scene geometry when trained in a feed-forward manner. Our approach is based on three key insights: (1) covisible pixels (those visible in both the input and target views) can be rendered by first reconstructing the dynamic 3D scene and then rendering the reconstruction from the novel views, and (2) hidden pixels in novel views can be "inpainted" with feed-forward 2D video diffusion models. Notably, our video inpainting diffusion model (CogNVS) can be self-supervised from 2D videos, allowing us to train it on a large corpus of in-the-wild videos. This in turn enables (3): CogNVS can be applied zero-shot to novel test videos via test-time finetuning. We empirically verify that CogNVS outperforms almost all prior art for novel-view synthesis of dynamic scenes from monocular videos.

Comparison on Kubric-4D

We evaluate CogNVS on the synthetic Kubric-4D dataset. Under extreme novel-view changes, our method preserves sharp object boundaries and maintains 3D-consistent geometry of dynamic scenes, while revealing occluded objects and background regions.

Columns (left to right): Input view · GT point cloud · GCD · TrajCrafter · CogNVS · GT novel view

kubric2910
Spin
kubric
Teddy

Comparison on ParallelDomain-4D

On the synthetic ParallelDomain-4D dataset featuring autonomous driving scenarios, CogNVS effectively hallucinates plausible road layouts and vehicle motions in novel views.

Columns (left to right): Input view · GT point cloud · GCD · TrajCrafter · CogNVS · GT novel view

pardom48
pardom167
pardom173
pardom181
pardom294

Comparison on DyCheck

We also benchmark CogNVS on the real-world DyCheck dataset. Here, CogNVS1 leverages renders from MegaSAM, and CogNVS2 from Mosca. Despite starting from noisy and incomplete point cloud renders (e.g., from MegaSAM), our approach still generates photo-realistic and 3D-consistent novel views.

Columns (left to right): Input view · MegaSAM · Shape-of-Motion · Mosca · CAT4D · TrajCrafter · CogNVS1 · CogNVS2

Apple
Block
Paper Windmill
Spin
Teddy

How does it work?

During training (left), given a 2D source video (in blue) of a dynamic scene, we first reconstruct the scene using off-the-shelf monocular reconstruction algorithms like MegaSAM to obtain the 3D scene geometry, \( \mathcal{G}_{\rm src} \), and camera odometry, \( \mathbf{c}_{\rm src} \). We then sample a set of arbitrary camera trajectories \( \{\mathbf{c}_1, \cdots, \mathbf{c}_N\} \) to simulate plausible occlusions, yielding the co-visible geometries \( \{\mathcal{G}^{\rm cov}_{{\rm src},1}, \cdots, \mathcal{G}^{\rm cov}_{{\rm src},N}\} \); rendering each of these from the original camera trajectory \( \mathbf{c}_{\rm src} \) produces a mask of source pixels that are co-visible in the sampled trajectory (in orange). The source video and its masked variant form a self-supervised training pair for learning CogNVS, our video inpainting diffusion model (visualized in the next figure). A minimal sketch of this mask construction is shown below.

At inference (right), we finetune CogNVS on the given input sequence by constructing self-supervised training pairs in the same way. The final novel view is then generated by the finetuned CogNVS in a feed-forward manner.
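The sketch below illustrates the co-visibility mask that underlies these training pairs; it mirrors the projection in the earlier rendering sketch but inverts the question, asking which source pixels remain visible from a sampled camera \( \mathbf{c}_n \). It assumes a source-view depth map, shared pinhole intrinsics K, and the relative pose T_n_src to the sampled camera; the occlusion tolerance and all names are illustrative assumptions, not the paper’s exact implementation.

import numpy as np

def covisible_mask(depth_src, K, T_n_src, occl_tol=0.05):
    """Which source pixels stay visible from a sampled camera c_n?

    depth_src: (H, W) source-view depth rendered from the reconstruction G_src
    K:         (3, 3) shared pinhole intrinsics
    T_n_src:   (4, 4) transform from the source camera c_src to camera c_n
    occl_tol:  relative tolerance for the occlusion (z-buffer) test
    """
    H, W = depth_src.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], -1).reshape(-1, 3).astype(np.float64)

    # Lift every source pixel to 3D and move it into camera c_n.
    pts = (np.linalg.inv(K) @ pix.T) * depth_src.reshape(1, -1)
    pts_n = T_n_src[:3, :3] @ pts + T_n_src[:3, 3:4]
    z = pts_n[2]
    x = np.round(pts_n[0] / np.clip(z, 1e-6, None) * K[0, 0] + K[0, 2]).astype(int)
    y = np.round(pts_n[1] / np.clip(z, 1e-6, None) * K[1, 1] + K[1, 2]).astype(int)
    in_view = (z > 0) & (x >= 0) & (x < W) & (y >= 0) & (y < H)

    # Z-buffer of the same geometry as seen from c_n, used to reject
    # source points hidden behind closer surfaces.
    zbuf = np.full((H, W), np.inf)
    idx = np.flatnonzero(in_view)
    for i in idx:
        zbuf[y[i], x[i]] = min(zbuf[y[i], x[i]], z[i])

    covis = np.zeros(H * W, dtype=bool)
    covis[idx] = z[idx] <= zbuf[y[idx], x[idx]] * (1.0 + occl_tol)
    return covis.reshape(H, W)

# Self-supervised pair: the full source frame is the target, and the frame
# with non-covisible pixels zeroed out is the input, e.g.
#   masked_frame = frame * covisible_mask(depth, K, T_n_src)[..., None]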

Method Figure
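The finetuning step itself can be summarized by a short PyTorch sketch. TinyDenoiser below is a toy stand-in for the pretrained video diffusion backbone, and the loss is a standard epsilon-prediction diffusion objective on (masked render, full video) pairs, which may differ from the exact objective and schedule used in the paper; tensor shapes, the random mask, and hyperparameters are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDenoiser(nn.Module):
    """Toy stand-in for the video diffusion backbone: takes a noisy target
    video concatenated with its masked render and predicts the added noise."""
    def __init__(self, ch=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(2 * ch, 32, 3, padding=1), nn.SiLU(),
            nn.Conv3d(32, ch, 3, padding=1),
        )

    def forward(self, noisy, cond, t):
        # t is ignored by this toy network; a real backbone embeds the timestep.
        return self.net(torch.cat([noisy, cond], dim=1))

def finetune_step(model, opt, video, masked_render, T=1000):
    """One self-supervised finetuning step on a (masked render, video) pair,
    using an epsilon-prediction loss with a linear beta schedule."""
    betas = torch.linspace(1e-4, 0.02, T)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)

    t = torch.randint(0, T, (video.shape[0],))
    noise = torch.randn_like(video)
    a = alpha_bar[t].view(-1, 1, 1, 1, 1)
    noisy = a.sqrt() * video + (1 - a).sqrt() * noise

    pred = model(noisy, masked_render, t)
    loss = F.mse_loss(pred, noise)

    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Toy usage on random tensors shaped (batch, channels, frames, H, W);
# the random mask stands in for the co-visibility mask described above.
model = TinyDenoiser()
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
video = torch.randn(1, 3, 8, 64, 64)
masked = video * (torch.rand_like(video) > 0.3).float()
print(finetune_step(model, opt, video, masked))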

BibTeX

@article{chen2025cognvs,
  title={Reconstruct, Inpaint, Finetune: Dynamic Novel-view Synthesis from Monocular Videos},
  author={Kaihua Chen and Tarasha Khurana and Deva Ramanan},
  year={2025},
  archivePrefix={arXiv},
  eprint={2507.12646},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2507.12646}
}