We propose a training-free framework, named VipDiff, that conditions a pre-trained image diffusion model during the reverse diffusion process to produce temporally coherent video inpainting results, without requiring any training data or fine-tuning of the pre-trained model.
Our VipDiff uses optical flow as guidance to extract valid pixels from reference frames, which serve as constraints for optimizing the randomly sampled Gaussian noise; the generated results are then used for further pixel propagation and conditional generation (see the sketch after this list).
VipDiff also allows generating diverse video inpainting results from different sampled noise.
Our VipDiff produces spatially and temporally coherent video inpainting results, surpassing existing state-of-the-art methods on large masked regions.
Our VipDiff works with different pre-trained image-level diffusion models and different sampling strategies to generate diverse results.
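The sketch below is a minimal illustration (not the authors' code) of the core idea: a randomly sampled Gaussian noise is optimized so that the frame generated by a frozen, pre-trained image diffusion model agrees with valid pixels propagated into the masked region via optical flow. The function and variable names (`denoise_with_frozen_model`, `flow_propagated_pixels`, `valid_mask`) are illustrative assumptions, and the denoiser is a differentiable stand-in rather than an actual diffusion model.

```python
import torch


def denoise_with_frozen_model(noise: torch.Tensor) -> torch.Tensor:
    """Placeholder for a full reverse-diffusion pass of a frozen, pre-trained
    image diffusion model; a simple differentiable stand-in keeps the sketch runnable."""
    return torch.tanh(noise)


def optimize_noise(flow_propagated_pixels: torch.Tensor,
                   valid_mask: torch.Tensor,
                   steps: int = 200,
                   lr: float = 0.05) -> torch.Tensor:
    """Optimize the sampled noise (not the model weights) so that the generated
    frame matches flow-propagated valid pixels in the known region."""
    noise = torch.randn_like(flow_propagated_pixels, requires_grad=True)
    optimizer = torch.optim.Adam([noise], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        frame = denoise_with_frozen_model(noise)
        # Constrain only the pixels that optical flow could fill from reference frames.
        loss = ((frame - flow_propagated_pixels) * valid_mask).pow(2).mean()
        loss.backward()
        optimizer.step()
    return noise.detach()


if __name__ == "__main__":
    # Toy example: a single 3x64x64 frame where the left half is "valid".
    target = torch.rand(1, 3, 64, 64) * 2 - 1
    mask = torch.zeros(1, 1, 64, 64)
    mask[..., :32] = 1.0
    optimized = optimize_noise(target, mask)
    result = denoise_with_frozen_model(optimized)
```

Because only the noise is optimized and the diffusion model stays frozen, different noise samples (or different pre-trained models and sampling strategies) yield diverse inpainting results without any training or fine-tuning.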
@inproceedings{xie2025vipdiff,
author = {Xie, Chaohao and Han, Kai and Wong, Kwan-Yee Kenneth},
title = {VipDiff: Towards Coherent and Diverse Video Inpainting via Training-free Denoising Diffusion Models},
booktitle = {IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
year = {2025},
}