We propose a training-free framework, named VipDiff, that conditions a pre-trained image diffusion model during the reverse diffusion process to produce temporally coherent video inpainting results, without requiring any training data or fine-tuning of the pre-trained model.
Our VipDiff uses optical flow as guidance to extract valid pixels from reference frames, which serve as constraints for optimizing the randomly sampled Gaussian noise; the generated results are then used for further pixel propagation and conditional generation (see the sketch after this list).
VipDiff also allows generating diverse video inpainting results from different sampled noise.
Our VipDiff produces spatially and temporally coherent video inpainting results, surpassing existing state-of-the-art methods on large masked regions.
Our VipDiff works with different pre-trained image-level diffusion models and different sampling strategies to generate diverse results.
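The sketch below is a minimal illustration (not the authors' code) of the core idea: a randomly sampled Gaussian noise is optimized so that the frame generated by a frozen, pre-trained image diffusion model agrees with valid pixels propagated into the masked region via optical flow. The function and variable names (`denoise_with_frozen_model`, `flow_propagated_pixels`, `valid_mask`) are illustrative assumptions, and the denoiser is a differentiable stand-in rather than an actual diffusion model.

```python
import torch


def denoise_with_frozen_model(noise: torch.Tensor) -> torch.Tensor:
    """Placeholder for a full reverse-diffusion pass of a frozen, pre-trained
    image diffusion model; a simple differentiable stand-in keeps the sketch runnable."""
    return torch.tanh(noise)


def optimize_noise(flow_propagated_pixels: torch.Tensor,
                   valid_mask: torch.Tensor,
                   steps: int = 200,
                   lr: float = 0.05) -> torch.Tensor:
    """Optimize the sampled noise (not the model weights) so that the generated
    frame matches flow-propagated valid pixels in the known region."""
    noise = torch.randn_like(flow_propagated_pixels, requires_grad=True)
    optimizer = torch.optim.Adam([noise], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        frame = denoise_with_frozen_model(noise)
        # Constrain only the pixels that optical flow could fill from reference frames.
        loss = ((frame - flow_propagated_pixels) * valid_mask).pow(2).mean()
        loss.backward()
        optimizer.step()
    return noise.detach()


if __name__ == "__main__":
    # Toy example: a single 3x64x64 frame where the left half is "valid".
    target = torch.rand(1, 3, 64, 64) * 2 - 1
    mask = torch.zeros(1, 1, 64, 64)
    mask[..., :32] = 1.0
    optimized = optimize_noise(target, mask)
    result = denoise_with_frozen_model(optimized)
```

Because only the noise is optimized and the diffusion model stays frozen, different noise samples (or different pre-trained models and sampling strategies) yield diverse inpainting results without any training or fine-tuning.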
@inproceedings{xie2025vipdiff,
author = {Xie, Chaohao and Han, Kai and Wong, Kwan-Yee Kenneth},
title = {VipDiff: Towards Coherent and Diverse Video Inpainting via Training-free Denoising Diffusion Models},
booktitle = {IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
year = {2025},
}