Diffusion Models for Video Generation
Highlights
- Video diffusion faces two structural walls that image generation does not: temporal consistency across frames and scarce high-quality text-video pairs — arguably, data scarcity constrains scaling more than model capacity does
- The v-prediction parameterization (Salimans & Ho 2022) has the model predict a velocity in angular coordinates rather than the noise, which empirically avoids the color-shift artifacts that plague naive temporal extensions of image diffusion
- VDM's reconstruction guidance conditions generation of new frames on reconstructions of previously generated frames via a gradient correction with a tunable weighting factor — larger weights improve sample quality; notably, it is a sampling-time technique rather than an architectural change
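As a minimal sketch of the v-prediction idea above (illustrative only, not Salimans & Ho's implementation): with a variance-preserving schedule written in angular coordinates, α_t = cos φ and σ_t = sin φ, the noisy sample is z_t = α_t·x + σ_t·ε, and the network's target is the velocity v = α_t·ε − σ_t·x instead of ε. Because α_t² + σ_t² = 1, both the clean sample and the noise can be recovered algebraically from (z_t, v). The helper names below (`to_velocity`, `x_from_v`, `eps_from_v`) are hypothetical.

```python
import numpy as np

def noisy_sample(x, eps, alpha_t, sigma_t):
    # forward process in angular coordinates: z_t = alpha_t * x + sigma_t * eps
    return alpha_t * x + sigma_t * eps

def to_velocity(x, eps, alpha_t, sigma_t):
    # v-prediction training target: v = alpha_t * eps - sigma_t * x
    return alpha_t * eps - sigma_t * x

def x_from_v(z_t, v, alpha_t, sigma_t):
    # recover the clean sample: x = alpha_t * z_t - sigma_t * v
    # (relies on alpha_t**2 + sigma_t**2 == 1 for a variance-preserving schedule)
    return alpha_t * z_t - sigma_t * v

def eps_from_v(z_t, v, alpha_t, sigma_t):
    # recover the noise: eps = sigma_t * z_t + alpha_t * v
    return sigma_t * z_t + alpha_t * v

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))     # stand-in for clean frame features
eps = rng.standard_normal((4, 8))   # Gaussian noise
phi = 0.7                           # angular time coordinate
alpha_t, sigma_t = np.cos(phi), np.sin(phi)

z_t = noisy_sample(x, eps, alpha_t, sigma_t)
v = to_velocity(x, eps, alpha_t, sigma_t)
x_hat = x_from_v(z_t, v, alpha_t, sigma_t)      # matches x exactly
eps_hat = eps_from_v(z_t, v, alpha_t, sigma_t)  # matches eps exactly
```

The round-trip works because the (α_t, σ_t) pair is a rotation: predicting v amounts to predicting the tangent direction of that rotation, which keeps the target's scale stable across all noise levels — one reason it behaves better than ε-prediction near the high-noise end.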
Original excerpt
Diffusion models have demonstrated strong results on image synthesis in recent years. Now the research community has started working on a harder task: using them for video generation. The task itself is a superset of the image case, since an image is a video of a single frame, and it is much more challenging because:
1. It has extra requirements on temporal consistency across frames in time, which naturally demands more world knowledge to be encoded into the model.
2. In comparison to text or images, it is more difficult to collect large amounts of high-quality, high-dimensional video data, let alone text-video pairs.
🥑 Required Pre-read: Please make sure you have read the previous blog on “What are…
Content attributed to the original author (Lilian Weng). Burn 451 curates publicly available writing as a reading index. For removal requests, contact @hawking520.