A variety of Auto-Regressive Video Diffusion Models (ARVDMs) have achieved remarkable success in generating realistic long-form videos, yet theoretical analyses of these models remain scant. In this work, we develop theoretical underpinnings for these models and use our insights to improve the performance of existing models. We first develop Meta-ARVDM, a unified framework of ARVDMs that subsumes most existing methods. Using Meta-ARVDM, we analyze the KL-divergence between the videos generated by Meta-ARVDM and the true videos. Our analysis uncovers two important phenomena inherent to ARVDMs: error accumulation and memory bottleneck. By deriving an information-theoretic impossibility result, we show that the memory bottleneck phenomenon cannot be avoided. To mitigate the memory bottleneck, we design various network structures that explicitly use more past frames. By compressing these frames, we also achieve a significantly improved trade-off between memory-bottleneck mitigation and inference efficiency. Experimental results on DMLab and Minecraft validate the efficacy of our methods. Our experiments also demonstrate a Pareto frontier between error accumulation and memory bottleneck across different methods.
We build a unified framework, Meta-ARVDM, which subsumes most ARVDMs, to study the errors they share. We identify three conditions: Monotonicity, Circularity, and $0$-$T$ Boundary.
The error of the long video $Y_{1:K\Delta}^{0}$ generated by Meta-ARVDM can be decomposed as follows.
\[
\mathrm{KL}\big(X_{1:K\Delta}^{0}\,\big\|\,Y_{1:K\Delta}^{0}\big) \lesssim \mathrm{NIE} + \mathrm{SEE} + \mathrm{DE} + \sum_{k=1}^{K}\big[\mathrm{NIE}_{\mathrm{AR}} + \mathrm{SEE}_{\mathrm{AR}} + \mathrm{DE}_{\mathrm{AR}} + \mathrm{MB}_k\big].
\]
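The additive structure of this bound stems from the chain rule of the KL-divergence: the divergence of a joint distribution splits into a marginal term plus an expected conditional term, so per-AR-step errors add up. Below is a minimal numerical check of this identity on two hypothetical joint distributions over a pair of binary "clips" (the matrices `P` and `Q` are made up for illustration, not taken from the paper):

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical joints over two binary "clips" A, B (rows: A, columns: B).
P = [[0.30, 0.20], [0.25, 0.25]]
Q = [[0.25, 0.25], [0.30, 0.20]]

flat = lambda M: [x for row in M for x in row]
kl_joint = kl(flat(P), flat(Q))            # KL of the full joint

pA = [sum(row) for row in P]               # marginal of the first clip
qA = [sum(row) for row in Q]
kl_marg = kl(pA, qA)
# Expected conditional KL of the second clip given the first.
kl_cond = sum(pA[a] * kl([x / pA[a] for x in P[a]], [x / qA[a] for x in Q[a]])
              for a in range(2))

# Chain rule: KL(P_AB || Q_AB) = KL(P_A || Q_A) + E_A[ KL(P_{B|A} || Q_{B|A}) ].
print(kl_joint, kl_marg + kl_cond)
```

The two printed numbers agree exactly, which is the additivity that lets the per-step terms $\mathrm{NIE}_{\mathrm{AR}}, \mathrm{SEE}_{\mathrm{AR}}, \mathrm{DE}_{\mathrm{AR}}, \mathrm{MB}_k$ be summed over AR steps.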
It contains the noise initialization errors ($\mathrm{NIE}, \mathrm{NIE}_{\mathrm{AR}}$), score estimation errors ($\mathrm{SEE}, \mathrm{SEE}_{\mathrm{AR}}$), discretization errors ($\mathrm{DE}, \mathrm{DE}_{\mathrm{AR}}$), and the memory bottleneck ($\mathrm{MB}_k$). Here the memory bottleneck is the conditional mutual information between the past and the output of an AR step, conditioned on the input, i.e.,
\[
\mathrm{MB}_k := I\big(\texttt{Output}_k\,;\,\texttt{Past}_k \,\big|\, \texttt{Input}_k\big).
\]
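To make this definition concrete, here is a toy computation of the conditional mutual information for binary variables (our illustration, not the paper's estimator): the output copies the past with probability 0.9 independently of the input, so the past carries information about the output that the input does not.

```python
import math

# Hypothetical joint: Past and Input uniform and independent;
# Output copies Past with probability 0.9, independently of Input.
p = {}
for past in (0, 1):
    for inp in (0, 1):
        for out in (0, 1):
            p[(past, inp, out)] = 0.25 * (0.9 if out == past else 0.1)

def marg(dist, idxs):
    """Marginalize a joint {tuple: prob} onto the given coordinate positions."""
    out = {}
    for k, pr in dist.items():
        kk = tuple(k[i] for i in idxs)
        out[kk] = out.get(kk, 0.0) + pr
    return out

def memory_bottleneck(dist):
    """I(Output; Past | Input) = sum p * log[ p * p(inp) / (p(past,inp) * p(out,inp)) ]."""
    p_i = marg(dist, (1,))
    p_pi = marg(dist, (0, 1))
    p_oi = marg(dist, (2, 1))
    total = 0.0
    for (past, inp, out), pr in dist.items():
        total += pr * math.log(pr * p_i[(inp,)] / (p_pi[(past, inp)] * p_oi[(out, inp)]))
    return total

mb = memory_bottleneck(p)
print(f"MB = I(Output; Past | Input) = {mb:.4f} nats")  # prints 0.3681
```

Since the input here is independent of everything else, $\mathrm{MB}$ reduces to $I(\texttt{Output};\texttt{Past}) = \ln 2 - H(0.1) \approx 0.3681$ nats: information about the output that is irrecoverably lost if the AR step sees only the input.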
The error of the short video clip $Y_{K\Delta+1:(K+1)\Delta}^{0}$ generated at the $K$-th AR step can be decomposed as follows.
\[
\mathrm{KL}\big(X_{K\Delta+1:(K+1)\Delta}^{0}\,\big\|\,Y_{K\Delta+1:(K+1)\Delta}^{0}\big) \lesssim \mathrm{IE} + K\big[\mathrm{NIE}_{\mathrm{AR}} + \mathrm{SEE}_{\mathrm{AR}} + \mathrm{DE}_{\mathrm{AR}}\big].
\]
The multiplicative factor $K$ reflects the accumulation of errors in the video clip generated at the $K$-th AR step.
For general random variables $X, Y, Z$ and any deterministic function $g$, we have (adding historical data helps!)
\[
I\big(X; Z \,\big|\, Y, g(X)\big) \le I\big(X; Z \,\big|\, Y\big).
\]
Motivated by this, we augment the denoising networks to explicitly condition on (compressed) past frames.
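As a sanity check of this inequality, the following sketch evaluates both sides on a randomly generated discrete joint distribution over $(X, Y, Z)$ with $g(x) = x \bmod 2$ (all distributional choices here are arbitrary assumptions for illustration):

```python
import itertools
import math
import random

random.seed(0)

# Random joint over (x, y, z, g(x)) with x in {0..3}, y and z binary, g(x) = x % 2.
keys = [(x, y, z, x % 2) for x, y, z in itertools.product(range(4), range(2), range(2))]
w = [random.random() for _ in keys]
s = sum(w)
p = {k: wi / s for k, wi in zip(keys, w)}

def marg(dist, idxs):
    """Marginalize a joint {tuple: prob} onto the given coordinate positions."""
    out = {}
    for k, pr in dist.items():
        kk = tuple(k[i] for i in idxs)
        out[kk] = out.get(kk, 0.0) + pr
    return out

def cond_mi(dist, ix, iz, cond):
    """I(coordinate ix ; coordinate iz | coordinates in cond)."""
    pc = marg(dist, cond)
    pxc = marg(dist, [ix] + cond)
    pzc = marg(dist, [iz] + cond)
    pxzc = marg(dist, [ix, iz] + cond)
    total = 0.0
    for k, pr in dist.items():
        num = pxzc[tuple(k[i] for i in [ix, iz] + cond)] * pc[tuple(k[i] for i in cond)]
        den = pxc[tuple(k[i] for i in [ix] + cond)] * pzc[tuple(k[i] for i in [iz] + cond)]
        total += pr * math.log(num / den)
    return total

lhs = cond_mi(p, 0, 2, [1, 3])  # I(X; Z | Y, g(X))
rhs = cond_mi(p, 0, 2, [1])     # I(X; Z | Y)
print(f"I(X;Z|Y,g(X)) = {lhs:.4f} <= I(X;Z|Y) = {rhs:.4f}")
```

The inequality follows from the chain rule, $I(X;Z \mid Y) = I(g(X);Z \mid Y) + I(X;Z \mid Y, g(X))$, since $g(X)$ is a deterministic function of $X$; conditioning on more of the past can only shrink the residual mutual information, which is why feeding additional (compressed) history into the denoiser reduces the memory-bottleneck term.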
@misc{wang2025erroranalysesautoregressivevideo,
title={Error Analyses of Auto-Regressive Video Diffusion Models: A Unified Framework},
author={Jing Wang and Fengzhuo Zhang and Xiaoli Li and Vincent Y. F. Tan and Tianyu Pang and Chao Du and Aixin Sun and Zhuoran Yang},
year={2025},
eprint={2503.10704},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.10704},
}