A variety of Auto-Regressive Video Diffusion Models (ARVDMs) have achieved remarkable success in generating realistic long-form videos, yet theoretical analyses of these models remain scant. In this work, we develop theoretical underpinnings for these models and use our insights to improve the performance of existing models. We first develop Meta-ARVDM, a unified framework of ARVDMs that subsumes most existing methods. Using Meta-ARVDM, we analyze the KL-divergence between the videos generated by Meta-ARVDM and the true videos. Our analysis uncovers two important phenomena inherent to ARVDMs: error accumulation and memory bottleneck. By deriving an information-theoretic impossibility result, we show that the memory bottleneck phenomenon cannot be avoided. To mitigate the memory bottleneck, we design various network structures that explicitly use more past frames. By compressing these frames, we also achieve a significantly improved trade-off between memory-bottleneck mitigation and inference efficiency. Experimental results on DMLab and Minecraft validate the efficacy of our methods. Our experiments also demonstrate a Pareto frontier between error accumulation and memory bottleneck across different methods.
We build a unified framework, Meta-ARVDM, which subsumes most ARVDMs, to study the errors they share. We identify three conditions: Monotonicity, Circularity, and $0$-$T$ Boundary.
The error of the long video $Y_{1:K\Delta}^{0}$ generated by Meta-ARVDM can be decomposed as follows.
\[
\mathrm{KL}\big(X_{1:K\Delta}^{0}\,\big\|\,Y_{1:K\Delta}^{0}\big) \lesssim \mathrm{NIE} + \mathrm{SEE} + \mathrm{DE} + \sum_{k=1}^{K}\big[\mathrm{NIE}_{\mathrm{AR}} + \mathrm{SEE}_{\mathrm{AR}} + \mathrm{DE}_{\mathrm{AR}} + \mathrm{MB}_k\big].
\]
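The additive structure of this bound stems from the chain rule of the KL-divergence: the divergence of a joint distribution splits into a marginal term plus an expected conditional term, so per-AR-step errors add up. Below is a minimal numerical check of this identity on two hypothetical joint distributions over a pair of binary "clips" (the matrices `P` and `Q` are made up for illustration, not taken from the paper):

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical joints over two binary "clips" A, B (rows: A, columns: B).
P = [[0.30, 0.20], [0.25, 0.25]]
Q = [[0.25, 0.25], [0.30, 0.20]]

flat = lambda M: [x for row in M for x in row]
kl_joint = kl(flat(P), flat(Q))            # KL of the full joint

pA = [sum(row) for row in P]               # marginal of the first clip
qA = [sum(row) for row in Q]
kl_marg = kl(pA, qA)
# Expected conditional KL of the second clip given the first.
kl_cond = sum(pA[a] * kl([x / pA[a] for x in P[a]], [x / qA[a] for x in Q[a]])
              for a in range(2))

# Chain rule: KL(P_AB || Q_AB) = KL(P_A || Q_A) + E_A[ KL(P_{B|A} || Q_{B|A}) ].
print(kl_joint, kl_marg + kl_cond)
```

The two printed numbers agree exactly, which is the additivity that lets the per-step terms $\mathrm{NIE}_{\mathrm{AR}}, \mathrm{SEE}_{\mathrm{AR}}, \mathrm{DE}_{\mathrm{AR}}, \mathrm{MB}_k$ be summed over AR steps.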
It contains the noise initialization errors ($\mathrm{NIE}, \mathrm{NIE}_{\mathrm{AR}}$), score estimation errors ($\mathrm{SEE}, \mathrm{SEE}_{\mathrm{AR}}$), discretization errors ($\mathrm{DE}, \mathrm{DE}_{\mathrm{AR}}$), and the memory bottleneck ($\mathrm{MB}_k$). Here the memory bottleneck is the conditional mutual information between the past and the output of an AR step, conditioned on the input, i.e.,
\[
\mathrm{MB}_k := I\big(\texttt{Output}_k\,;\,\texttt{Past}_k \,\big|\, \texttt{Input}_k\big).
\]
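To make this definition concrete, here is a toy computation of the conditional mutual information for binary variables (our illustration, not the paper's estimator): the output copies the past with probability 0.9 independently of the input, so the past carries information about the output that the input does not.

```python
import math

# Hypothetical joint: Past and Input uniform and independent;
# Output copies Past with probability 0.9, independently of Input.
p = {}
for past in (0, 1):
    for inp in (0, 1):
        for out in (0, 1):
            p[(past, inp, out)] = 0.25 * (0.9 if out == past else 0.1)

def marg(dist, idxs):
    """Marginalize a joint {tuple: prob} onto the given coordinate positions."""
    out = {}
    for k, pr in dist.items():
        kk = tuple(k[i] for i in idxs)
        out[kk] = out.get(kk, 0.0) + pr
    return out

def memory_bottleneck(dist):
    """I(Output; Past | Input) = sum p * log[ p * p(inp) / (p(past,inp) * p(out,inp)) ]."""
    p_i = marg(dist, (1,))
    p_pi = marg(dist, (0, 1))
    p_oi = marg(dist, (2, 1))
    total = 0.0
    for (past, inp, out), pr in dist.items():
        total += pr * math.log(pr * p_i[(inp,)] / (p_pi[(past, inp)] * p_oi[(out, inp)]))
    return total

mb = memory_bottleneck(p)
print(f"MB = I(Output; Past | Input) = {mb:.4f} nats")  # prints 0.3681
```

Since the input here is independent of everything else, $\mathrm{MB}$ reduces to $I(\texttt{Output};\texttt{Past}) = \ln 2 - H(0.1) \approx 0.3681$ nats: information about the output that is irrecoverably lost if the AR step sees only the input.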
The error of the short video clip $Y_{K\Delta+1:(K+1)\Delta}^{0}$ generated at the $K$-th AR step can be decomposed as follows.
\[
\mathrm{KL}\big(X_{K\Delta+1:(K+1)\Delta}^{0}\,\big\|\,Y_{K\Delta+1:(K+1)\Delta}^{0}\big) \lesssim \mathrm{IE} + K\big[\mathrm{NIE}_{\mathrm{AR}} + \mathrm{SEE}_{\mathrm{AR}} + \mathrm{DE}_{\mathrm{AR}}\big].
\]
The multiplicative factor $K$ reflects the accumulation of errors in the video clip generated at the $K$-th AR step.
For general random variables $X, Y, Z$ and any deterministic function $g$, we have (adding historical data helps!)
\[
I\big(X; Z \,\big|\, Y, g(X)\big) \le I\big(X; Z \,\big|\, Y\big).
\]
Motivated by this, we augment the denoising networks to explicitly condition on (compressed) past frames.
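As a sanity check of this inequality, the following sketch evaluates both sides on a randomly generated discrete joint distribution over $(X, Y, Z)$ with $g(x) = x \bmod 2$ (all distributional choices here are arbitrary assumptions for illustration):

```python
import itertools
import math
import random

random.seed(0)

# Random joint over (x, y, z, g(x)) with x in {0..3}, y and z binary, g(x) = x % 2.
keys = [(x, y, z, x % 2) for x, y, z in itertools.product(range(4), range(2), range(2))]
w = [random.random() for _ in keys]
s = sum(w)
p = {k: wi / s for k, wi in zip(keys, w)}

def marg(dist, idxs):
    """Marginalize a joint {tuple: prob} onto the given coordinate positions."""
    out = {}
    for k, pr in dist.items():
        kk = tuple(k[i] for i in idxs)
        out[kk] = out.get(kk, 0.0) + pr
    return out

def cond_mi(dist, ix, iz, cond):
    """I(coordinate ix ; coordinate iz | coordinates in cond)."""
    pc = marg(dist, cond)
    pxc = marg(dist, [ix] + cond)
    pzc = marg(dist, [iz] + cond)
    pxzc = marg(dist, [ix, iz] + cond)
    total = 0.0
    for k, pr in dist.items():
        num = pxzc[tuple(k[i] for i in [ix, iz] + cond)] * pc[tuple(k[i] for i in cond)]
        den = pxc[tuple(k[i] for i in [ix] + cond)] * pzc[tuple(k[i] for i in [iz] + cond)]
        total += pr * math.log(num / den)
    return total

lhs = cond_mi(p, 0, 2, [1, 3])  # I(X; Z | Y, g(X))
rhs = cond_mi(p, 0, 2, [1])     # I(X; Z | Y)
print(f"I(X;Z|Y,g(X)) = {lhs:.4f} <= I(X;Z|Y) = {rhs:.4f}")
```

The inequality follows from the chain rule, $I(X;Z \mid Y) = I(g(X);Z \mid Y) + I(X;Z \mid Y, g(X))$, since $g(X)$ is a deterministic function of $X$; conditioning on more of the past can only shrink the residual mutual information, which is why feeding additional (compressed) history into the denoiser reduces the memory-bottleneck term.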
@misc{wang2025erroranalysesautoregressivevideo,
title={Error Analyses of Auto-Regressive Video Diffusion Models: A Unified Framework},
author={Jing Wang and Fengzhuo Zhang and Xiaoli Li and Vincent Y. F. Tan and Tianyu Pang and Chao Du and Aixin Sun and Zhuoran Yang},
year={2025},
eprint={2503.10704},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.10704},
}