Improved Techniques for Maximum Likelihood Estimation for Diffusion ODEs
Abstract
Diffusion models have exhibited excellent performance in various domains. The probability flow ordinary differential equation (ODE) of diffusion models (i.e., diffusion ODEs) is a particular case of continuous normalizing flows (CNFs), which enables deterministic inference and exact likelihood evaluation. However, the likelihood estimation results by diffusion ODEs are still far from those of the state-of-the-art likelihood-based generative models. In this work, we propose several improved techniques for maximum likelihood estimation for diffusion ODEs, including both training and evaluation perspectives. For training, we propose velocity parameterization and explore variance reduction techniques for faster convergence. We also derive an error-bounded high-order flow matching objective for finetuning, which improves the ODE likelihood and smooths its trajectory. For evaluation, we propose a novel training-free truncated-normal dequantization to fill the training-evaluation gap commonly existing in diffusion ODEs. Building upon these techniques, we achieve state-of-the-art likelihood estimation results on image datasets (2.56 on CIFAR-10, 3.43/3.69 on ImageNet-32) without variational dequantization or data augmentation, and 2.42 on CIFAR-10 with data augmentation. Code is available at https://github.com/thu-ml/i-DODE.
1 Introduction
Likelihood is an important metric to evaluate density estimation models, and accurate likelihood estimation is the key for many applications such as data compression (Ho et al., 2021; Helminger et al., 2020; Kingma et al., 2021; Yang & Mandt, 2022), anomaly detection (Chen et al., 2018c; Dias et al., 2020) and out-of-distribution detection (Serrà et al., 2020; Xiao et al., 2020). Many deep generative models can compute tractable likelihood, including autoregressive models (Oord et al., 2016; Salimans et al., 2017; Chen et al., 2018b), variational auto-encoders (VAE) (Kingma & Welling, 2014; Vahdat & Kautz, 2020), normalizing flows (Dinh et al., 2017; Kingma & Dhariwal, 2018; Ho et al., 2019) and diffusion models (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020; Song et al., 2021c, a; Karras et al., 2022). Among these models, variational diffusion models (VDM) (Kingma et al., 2021), a recent variant of diffusion models, achieve state-of-the-art likelihood estimation performance on standard image density estimation benchmarks.
There are two types of diffusion models, one is based on the reverse stochastic differential equation (SDE) (Song et al., 2021c), named as diffusion SDE; the other is based on the probability flow ordinary differential equation (ODE) (Song et al., 2021c), named as diffusion ODE. These two types of diffusion models define and evaluate the likelihood in different manners: diffusion SDE can be understood as an infinitely-deep VAE (Huang et al., 2021) and can only compute a variational lower bound of the likelihood (Song et al., 2021c; Kingma et al., 2021); while diffusion ODE is a variant of continuous normalizing flows (Chen et al., 2018a) and can compute the exact likelihood by ODE solvers. Thus, it is natural to hypothesize that the likelihood performance of diffusion ODEs may be better than that of diffusion SDEs. However, all existing methods for training diffusion ODEs (Song et al., 2021b; Lu et al., 2022a; Lipman et al., 2022; Albergo & Vanden-Eijnden, 2022; Liu et al., 2022b) cannot even achieve a comparable likelihood performance with VDM, which belongs to diffusion SDEs. It still remains largely open whether diffusion ODEs are also great likelihood estimators.
Real-world data is usually discrete, and evaluating the likelihood of discrete data by diffusion ODEs needs to first perform a dequantization process (Dinh et al., 2017; Salimans et al., 2017) to make sure the input data of diffusion ODEs is continuous. In this work, we observe that previous likelihood evaluation of diffusion ODEs has flaws in the dequantization process: the uniform dequantization (Song et al., 2021b) causes a large training-evaluation gap, and the variational dequantization (Ho et al., 2019; Song et al., 2021b) requires additional training overhead and is hard to train to the optimal.
In this work, we propose several improved techniques, including both the evaluation perspective and training perspective, to allow the likelihood estimation by diffusion ODEs to outperform the existing state-of-the-art likelihood estimators. In the aspect of evaluation, we propose a training-free dequantization method dedicated to diffusion models by a carefully designed truncated-normal distribution, which can fit diffusion ODEs well and improve the likelihood evaluation by a large margin compared to uniform dequantization. We also introduce an importance-weighted likelihood estimator to get a tighter bound. In the aspect of training, we split our training into pretraining and finetuning phases. For pretraining, we propose a new model parameterization method including velocity parameterization, which is an extended version of flow matching (Lipman et al., 2022) with practical modifications, and log-signal-to-noise-ratio timed parameterization. Besides, we find a simple yet efficient importance sampling strategy for variance reduction. Together, our pretraining has a faster convergence speed compared to previous work. For finetuning, we propose an error-bounded high-order flow matching objective, which not only improves the ODE likelihood but also results in smoother trajectories. Together, we name our framework Improved Diffusion ODE (i-DODE).
We conduct ablation studies to demonstrate the effectiveness of each component. Our experiments achieve state-of-the-art likelihood on image datasets (2.56 on CIFAR-10, 3.43/3.69 on ImageNet-32), surpassing the previous best ODEs of 2.90 and 3.48/3.82, with the added advantage that we use no data augmentation and do not need to train variational dequantization models.
2 Diffusion Models
2.1 Diffusion ODEs and Maximum Likelihood Training
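Suppose we have a $d$-dimensional data distribution $q_0(\mathbf{x}_0)$. Diffusion models (Ho et al., 2020; Song et al., 2021c) gradually diffuse the data by a forward stochastic differential equation (SDE) starting from $\mathbf{x}_0 \sim q_0(\mathbf{x}_0)$:
$$\mathrm{d}\mathbf{x}_t = f(t)\,\mathbf{x}_t\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w}_t, \tag{1}$$
where $f(t), g(t) \in \mathbb{R}$ are manually designed noise schedules and $\mathbf{w}_t \in \mathbb{R}^d$ is a standard Wiener process. The forward process $\{\mathbf{x}_t\}_{t\in[0,T]}$ is accompanied with a series of marginal distributions $\{q_t\}_{t\in[0,T]}$, so that $q_T(\mathbf{x}_T) \approx \mathcal{N}(\mathbf{x}_T \mid \mathbf{0}, \sigma_T^2 \mathbf{I})$ with some constant $\sigma_T > 0$. Since this is a simple linear SDE, the transition kernel is an analytical Gaussian (Song et al., 2021c): $q_{0t}(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t \mid \alpha_t \mathbf{x}_0, \sigma_t^2 \mathbf{I})$, where the coefficients satisfy $f(t) = \frac{\mathrm{d}\log\alpha_t}{\mathrm{d}t}$, $g^2(t) = \frac{\mathrm{d}\sigma_t^2}{\mathrm{d}t} - 2\frac{\mathrm{d}\log\alpha_t}{\mathrm{d}t}\sigma_t^2$ (Kingma et al., 2021). Under some regularity conditions (Anderson, 1982), the forward process has an equivalent probability flow ODE (Song et al., 2021c):
$$\frac{\mathrm{d}\mathbf{x}_t}{\mathrm{d}t} = f(t)\,\mathbf{x}_t - \frac{1}{2} g^2(t)\,\nabla_{\mathbf{x}} \log q_t(\mathbf{x}_t), \tag{2}$$
which produces the same marginal distribution $q_t$ at each time $t$ as that in Eqn. (1). The only unknown term is the score function $\nabla_{\mathbf{x}} \log q_t(\mathbf{x}_t)$ of $q_t$. By parameterizing a score network $\mathbf{s}_\theta(\mathbf{x}_t, t)$ to predict the time-dependent $\nabla_{\mathbf{x}} \log q_t(\mathbf{x}_t)$, we can replace the true score function, resulting in the diffusion ODE (Song et al., 2021c):
$$\frac{\mathrm{d}\mathbf{x}_t}{\mathrm{d}t} = f(t)\,\mathbf{x}_t - \frac{1}{2} g^2(t)\,\mathbf{s}_\theta(\mathbf{x}_t, t), \tag{3}$$
with the associated marginal distributions $\{p_t\}_{t\in[0,T]}$. Diffusion ODEs are special cases of continuous normalizing flows (CNFs) (Chen et al., 2018a), and thus can perform exact inference of the latents and exact likelihood evaluation.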
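Though traditional maximum likelihood training methods for CNFs (Grathwohl et al., 2019) are feasible for diffusion ODEs, these methods are expensive and hard to scale up because they require solving ODEs at each iteration. Instead, a more practical way is to match the generative probability flow $\{p_t\}_{t\in[0,T]}$ with $\{q_t\}_{t\in[0,T]}$ by a simulation-free approach. Specifically, Lu et al. (2022a) proves that $D_{\mathrm{KL}}(q_0 \,\|\, p_0)$ can be formulated by $D_{\mathrm{KL}}(q_0 \,\|\, p_0) = D_{\mathrm{KL}}(q_T \,\|\, p_T) + \mathcal{J}_{\mathrm{ODE}}(\theta)$, where
$$\mathcal{J}_{\mathrm{ODE}}(\theta) := \frac{1}{2}\int_0^T g(t)^2\, \mathbb{E}_{q_t(\mathbf{x}_t)}\Big[\big(\mathbf{s}_\theta(\mathbf{x}_t, t) - \nabla_{\mathbf{x}} \log q_t(\mathbf{x}_t)\big)^\top \big(\nabla_{\mathbf{x}} \log p_t(\mathbf{x}_t) - \nabla_{\mathbf{x}} \log q_t(\mathbf{x}_t)\big)\Big]\,\mathrm{d}t. \tag{4}$$
However, computing $\nabla_{\mathbf{x}} \log p_t(\mathbf{x}_t)$ requires solving another ODE and is also expensive (Lu et al., 2022a). To minimize $\mathcal{J}_{\mathrm{ODE}}(\theta)$ in a simulation-free manner, Lu et al. (2022a) also proposes a combination of weighted first-order and high-order score matching objectives. Particularly, the first-order score matching objective is
$$\mathcal{J}_{\mathrm{SM}}(\theta) := \int_0^T \frac{g(t)^2}{2\sigma_t^2}\, \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}}\Big[\big\|\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) - \boldsymbol{\epsilon}\big\|_2^2\Big]\,\mathrm{d}t, \tag{5}$$
where $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) := -\sigma_t \mathbf{s}_\theta(\mathbf{x}_t, t)$ is the noise predictor, $\mathbf{x}_t = \alpha_t \mathbf{x}_0 + \sigma_t \boldsymbol{\epsilon}$, and $\mathbf{x}_0 \sim q_0(\mathbf{x}_0)$, $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.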
2.2 Log-SNR Timed Diffusion Models
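Diffusion models have a manually designed noise schedule $\alpha_t, \sigma_t$, which has high freedom and affects the performance. Even for a restricted design space such as Variance Preserving (VP) (Song et al., 2021c), which constrains the noise schedule by $\alpha_t^2 + \sigma_t^2 = 1$, we could still have various choices about how fast $\alpha_t, \sigma_t$ change w.r.t. time $t$. To decouple the specific schedule form, variational diffusion models (VDM) (Kingma et al., 2021) use a negative log-signal-to-noise-ratio (log-SNR) for the time variable and can greatly simplify both noise schedules and training objectives. Specifically, denote $\lambda_t = -\log(\alpha_t^2/\sigma_t^2) = \log(\sigma_t^2/\alpha_t^2)$. The change-of-variable relation from $t$ to $\lambda$ is
$$\frac{\mathrm{d}\lambda_t}{\mathrm{d}t} = \frac{g^2(t)}{\sigma_t^2}. \tag{6}$$
Replacing the time subscript $t$ with $\lambda$, we get the simplified score matching objective with likelihood weighting:
$$\mathcal{J}_{\mathrm{SM}}(\theta) = \frac{1}{2}\int_{\lambda_0}^{\lambda_T} \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}}\Big[\big\|\boldsymbol{\epsilon}_\theta(\mathbf{x}_\lambda, \lambda) - \boldsymbol{\epsilon}\big\|_2^2\Big]\,\mathrm{d}\lambda. \tag{7}$$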
This result is in accordance with the continuous diffusion loss in Kingma et al. (2021).
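The step from the time-weighted objective in Eqn. (5) to the log-SNR form in Eqn. (7) can be checked by hand. The short derivation below is a sketch that assumes the standard relations $f(t) = \mathrm{d}\log\alpha_t/\mathrm{d}t$ and $g^2(t) = \mathrm{d}\sigma_t^2/\mathrm{d}t - 2 f(t)\sigma_t^2$ for the Gaussian transition kernel, and the weighting $g(t)^2/(2\sigma_t^2)$ on the noise-matching error; the symbols are those of Sec. 2.1.

```latex
% Change of variable from t to the negative log-SNR \lambda_t = \log(\sigma_t^2 / \alpha_t^2).
\begin{align*}
\frac{\mathrm{d}\lambda_t}{\mathrm{d}t}
  &= \frac{1}{\sigma_t^2}\frac{\mathrm{d}\sigma_t^2}{\mathrm{d}t}
     - 2\,\frac{\mathrm{d}\log\alpha_t}{\mathrm{d}t}
   = \frac{g^2(t) + 2 f(t)\sigma_t^2}{\sigma_t^2} - 2 f(t)
   = \frac{g^2(t)}{\sigma_t^2}, \\[4pt]
\int_0^T \frac{g(t)^2}{2\sigma_t^2}\,
  \mathbb{E}\big[\|\boldsymbol{\epsilon}_\theta - \boldsymbol{\epsilon}\|_2^2\big]\,\mathrm{d}t
  &= \frac{1}{2}\int_0^T
     \mathbb{E}\big[\|\boldsymbol{\epsilon}_\theta - \boldsymbol{\epsilon}\|_2^2\big]\,
     \frac{\mathrm{d}\lambda_t}{\mathrm{d}t}\,\mathrm{d}t
   = \frac{1}{2}\int_{\lambda_0}^{\lambda_T}
     \mathbb{E}\big[\|\boldsymbol{\epsilon}_\theta - \boldsymbol{\epsilon}\|_2^2\big]\,\mathrm{d}\lambda .
\end{align*}
```

The schedule-dependent weight is absorbed entirely into the measure $\mathrm{d}\lambda$, which is why the log-SNR form no longer depends on how fast the schedule moves in $t$.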
2.3 Dequantization for Density Estimation
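Many real-world datasets contain discrete data, such as images or texts. In such cases, fitting a continuous density model to these discrete data points leads to degenerate results (Uria et al., 2013) and cannot provide meaningful density estimations. A common solution is dequantization (Dinh et al., 2017; Salimans et al., 2017; Ho et al., 2019). Specifically, suppose $\mathbf{x}_0$ is 8-bit discrete data scaled to $[-1, 1]$. Dequantization methods assume that we have trained a continuous model distribution $p_{\mathrm{model}}$ for $\mathbf{x}_0$, and define the discrete model distribution by
$$P_{\mathrm{model}}(\mathbf{x}_0) := \int_{[-\frac{1}{256}, \frac{1}{256})^d} p_{\mathrm{model}}(\mathbf{x}_0 + \mathbf{u})\,\mathrm{d}\mathbf{u}.$$
To train $P_{\mathrm{model}}(\mathbf{x}_0)$ by maximum likelihood estimation, variational dequantization (Ho et al., 2019) introduces a dequantization distribution $q(\mathbf{u} \mid \mathbf{x}_0)$ and jointly trains $p_{\mathrm{model}}$ and $q(\mathbf{u} \mid \mathbf{x}_0)$ by a variational lower bound:
$$\log P_{\mathrm{model}}(\mathbf{x}_0) \geq \mathbb{E}_{q(\mathbf{u} \mid \mathbf{x}_0)}\big[\log p_{\mathrm{model}}(\mathbf{x}_0 + \mathbf{u}) - \log q(\mathbf{u} \mid \mathbf{x}_0)\big].$$
A simple choice for $q(\mathbf{u} \mid \mathbf{x}_0)$ is uniform dequantization, where we set $q(\mathbf{u} \mid \mathbf{x}_0) = \mathcal{U}(-\frac{1}{256}, \frac{1}{256})$ in each dimension.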
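As an illustration of uniform dequantization, the following NumPy sketch maps 8-bit pixels to bin centers in $(-1, 1)$ and adds uniform noise of half-width $1/256$. The scaling convention `(pixel + 0.5) / 128 - 1` and all function names are assumptions made for this example, not taken from the paper or its codebase.

```python
import numpy as np

def scale_8bit(pixels):
    """Map integers in {0, ..., 255} to bin centers in (-1, 1) with spacing 1/128.

    Assumed convention: each bin has half-width 1/256 around its center.
    """
    return (pixels.astype(np.float64) + 0.5) / 128.0 - 1.0

def uniform_dequantize(pixels, rng):
    """Add noise u ~ U[-1/256, 1/256) independently to every dimension."""
    x0 = scale_8bit(pixels)
    u = rng.uniform(-1.0 / 256.0, 1.0 / 256.0, size=x0.shape)
    return x0 + u, u

def uniform_log_q(u):
    """log q(u | x0) of the uniform dequantizer: the constant d * log(128)."""
    return u.size * np.log(128.0)

rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(3, 32, 32))   # a random stand-in "image"
x, u = uniform_dequantize(pixels, rng)

# Quantizing the dequantized values recovers the original pixels.
recovered = np.floor((x + 1.0) * 128.0).astype(np.int64)
```

Because the density of the uniform dequantizer is constant, the term $-\log q(\mathbf{u} \mid \mathbf{x}_0)$ in the variational bound is just $-d \log 128$ here.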
3 Diffusion ODEs with Truncated-Normal Dequantization
In this section, we discuss the challenges of training diffusion ODEs with dequantization and propose a training-free dequantization method for diffusion ODEs.
3.1 Challenges for Diffusion ODEs with Dequantization
We first discuss the challenges for diffusion ODEs with dequantization in this section.
Truncation introduces an additional gap.
Theoretically, we want to train diffusion ODEs by minimizing $D_{\mathrm{KL}}(q_0 \,\|\, p_0)$ and use $p_0$ for the continuous model distribution. However, as $\sigma_0 = 0$, we have $\lambda_0 = -\infty$. Due to this, it is shown in previous work (Song et al., 2021c; Kim et al., 2022) that there are numerical issues near $t = 0$ for both training and sampling, so we cannot directly compute the model distribution at time $0$. In practice, a common solution is to choose a small starting time $\epsilon > 0$ for improving numerical stability. The training objective then becomes minimizing $D_{\mathrm{KL}}(q_\epsilon \,\|\, p_\epsilon)$, which is equivalent to
$$\max_\theta\; \mathbb{E}_{q_0(\mathbf{x}_0)}\,\mathbb{E}_{q_{0\epsilon}(\mathbf{x}_\epsilon \mid \mathbf{x}_0)}\big[\log p_\epsilon(\mathbf{x}_\epsilon)\big], \tag{8}$$
and $p_\epsilon$ is directly used to evaluate the data likelihood. However, as $q_\epsilon \neq q_0$, such a method introduces an additional gap due to the mismatch between training ($q_\epsilon$) and testing ($q_0$), which may degrade the likelihood evaluation performance.
Uniform dequantization causes a train-test mismatch.
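After choosing $\epsilon$, the continuous model distribution is defined by $p_{\mathrm{model}} := p_\epsilon$. Let $q(\mathbf{u} \mid \mathbf{x}_0)$ be a dequantization distribution with support over $[-\frac{1}{256}, \frac{1}{256})^d$. The variational lower bound for the discrete model density is:
$$\log P_{\mathrm{model}}(\mathbf{x}_0) \geq \mathbb{E}_{q(\mathbf{u} \mid \mathbf{x}_0)}\big[\log p_\epsilon(\mathbf{x}_0 + \mathbf{u}) - \log q(\mathbf{u} \mid \mathbf{x}_0)\big].$$
One widely used choice for $q(\mathbf{u} \mid \mathbf{x}_0)$ is the uniform distribution (uniform dequantization). However, this leads to a training-evaluation gap: for training, we fit $p_\epsilon$ to the distribution $q_\epsilon$, which is a Gaussian distribution near each discrete data point because $q_{0\epsilon}(\mathbf{x}_\epsilon \mid \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_\epsilon \mid \alpha_\epsilon \mathbf{x}_0, \sigma_\epsilon^2 \mathbf{I})$ for $\mathbf{x}_0 \sim q_0$; while for evaluation, we test $p_\epsilon$ on uniformly dequantized data $\mathbf{x}_0 + \mathbf{u}$. Such a gap also degrades the likelihood evaluation performance and is not well studied.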
3.2 Training-Free Dequantization by Truncated Normal
In this section, we show that there exists a training-free dequantization distribution that fits diffusion ODEs well.
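As discussed in Sec. 3.1, the gap between training and testing of diffusion ODEs is due to the difference between the training input $\mathbf{x}_\epsilon = \alpha_\epsilon \mathbf{x}_0 + \sigma_\epsilon \boldsymbol{\epsilon}$ (where $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$) and the testing input $\mathbf{x}_0 + \mathbf{u}$. To fill such a gap, we can choose a dequantization distribution $q(\mathbf{u} \mid \mathbf{x}_0)$ which satisfies
$$\mathbf{x}_0 + \mathbf{u} \approx \alpha_\epsilon \mathbf{x}_0 + \sigma_\epsilon \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}). \tag{9}$$
For small enough $\epsilon$, we have $\alpha_\epsilon \approx 1$, then Eqn. (9) becomes $\mathbf{u} \approx \sigma_\epsilon \boldsymbol{\epsilon}$. We also need to ensure the support of $q(\mathbf{u} \mid \mathbf{x}_0)$ is $[-\frac{1}{256}, \frac{1}{256})^d$, i.e. the random variable $\sigma_\epsilon \boldsymbol{\epsilon}$ is approximately within $[-\frac{1}{256}, \frac{1}{256})^d$. To this end, we choose the variational dequantization distribution by a truncated normal distribution as follows:
$$q(\mathbf{u} \mid \mathbf{x}_0) = \mathcal{TN}\Big(\mathbf{u} \,\Big|\, \mathbf{0}, \sigma_\epsilon^2 \mathbf{I}, -\frac{1}{256}, \frac{1}{256}\Big), \tag{10}$$
where $\mathcal{TN}(\cdot \mid \boldsymbol{\mu}, \sigma^2 \mathbf{I}, a, b)$ is a truncated-normal distribution with mean $\boldsymbol{\mu}$, covariance $\sigma^2 \mathbf{I}$, and bounds $[a, b]$ in each dimension. Moreover, such truncated-normal dequantization provides a guideline for choosing the start time $\epsilon$: to avoid large deviation from the truncation by $\frac{1}{256}$, we need to ensure that $\sigma_\epsilon \boldsymbol{\epsilon}$ falls within $[-\frac{1}{256}, \frac{1}{256}]^d$ in most cases. We leverage the 3-$\sigma$ principle for the standard normal distribution and let $\epsilon$ satisfy $3\sigma_\epsilon = \frac{1}{256}$. As $\alpha_\epsilon \approx 1$, the critical start time $\epsilon$ satisfies that the negative log-SNR $\lambda_\epsilon = \log(\sigma_\epsilon^2/\alpha_\epsilon^2) \approx -13.3$. Surprisingly, this choice of $\lambda_\epsilon$ is exactly the same as the $\lambda_{\min}$ in Kingma et al. (2021), which is instead obtained by training. Such a dequantization distribution can ensure the conditions in Eqn. (9), and we validate in Sec. 6 that it provides a tighter variational bound with no additional training costs. We summarize the likelihood evaluation by such dequantization distribution in the following theorem.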
Theorem 3.1 (Variational Bound under Truncated-Normal Dequantization).
Suppose we use the truncated-normal dequantization in Eqn. (10), then the discrete model distribution has the following variational bound:
where
Besides, by drawing $K$ i.i.d. samples and applying Jensen’s inequality as in Burda et al. (2015), we also obtain the following importance-weighted likelihood estimator. As $K$ increases, the estimator gives a tighter bound, which enables more precise likelihood estimation.
Corollary 3.2 (Importance Weighted Variational Bound under Truncated-Normal Dequantization).
Suppose we use the truncated-normal dequantization in Eqn. (10), then the discrete model distribution has the following importance-weighted variational bound:
where
Remark 3.3.
Another way to bridge the discrete-continuous gap is the variational perspective. We can view the process from discrete to continuous as a variational autoencoder, where the prior is modeled by the diffusion ODE. The dequantization and variational perspectives of diffusion ODEs are closely related both theoretically and empirically, and we discuss them in detail in Appendix A.
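The truncated-normal dequantization and the importance-weighted bound of Sec. 3.2 can be sketched numerically as follows. This is a minimal illustration under stated assumptions: it takes $\alpha_\epsilon \approx 1$, uses rejection sampling for the truncated normal, and replaces the ODE density $\log p_\epsilon$ by a hypothetical Gaussian stand-in (`toy_log_p`, with an arbitrary width), since evaluating the real density requires an ODE solver. All function names are invented for the example.

```python
import math
import numpy as np

HALF_BIN = 1.0 / 256.0  # half-width of one 8-bit bin after scaling to [-1, 1]

def critical_log_snr(num_sigmas=3.0):
    """lambda = log(sigma^2 / alpha^2) at which num_sigmas * sigma / alpha = 1/256."""
    return 2.0 * math.log(HALF_BIN / num_sigmas)

def sample_truncated_normal(shape, sigma, rng):
    """Rejection-sample u ~ N(0, sigma^2) restricted to [-HALF_BIN, HALF_BIN]."""
    u = rng.normal(0.0, sigma, size=shape)
    bad = np.abs(u) > HALF_BIN
    while bad.any():
        u[bad] = rng.normal(0.0, sigma, size=int(bad.sum()))
        bad = np.abs(u) > HALF_BIN
    return u

def log_q_truncated_normal(u, sigma):
    """Log-density of the truncated normal, summed over the last axis."""
    tau = HALF_BIN / sigma
    log_z = math.log(math.erf(tau / math.sqrt(2.0)))  # mass kept by the truncation
    log_normal = -0.5 * (u / sigma) ** 2 - 0.5 * math.log(2.0 * math.pi * sigma**2)
    return (log_normal - log_z).sum(axis=-1)

def toy_log_p(x, mean, std):
    """Hypothetical stand-in for the ODE log-density log p_eps(x)."""
    z = (x - mean) / std
    return (-0.5 * z**2 - 0.5 * math.log(2.0 * math.pi * std**2)).sum(axis=-1)

rng = np.random.default_rng(0)
lam = critical_log_snr()            # about -13.29
sigma = math.exp(0.5 * lam)         # sigma / alpha, taking alpha ~ 1

x0 = np.zeros(8)                    # one toy data point with d = 8
K = 16                              # number of importance samples
u = sample_truncated_normal((K, x0.size), sigma, rng)
log_w = toy_log_p(x0 + u, mean=x0, std=2.0 * sigma) - log_q_truncated_normal(u, sigma)

elbo = log_w.mean()                                   # average of single-sample bounds
m = log_w.max()
iw_bound = m + math.log(np.exp(log_w - m).mean())     # log of the mean importance weight
```

On any fixed set of samples the importance-weighted value is at least the averaged single-sample bound, which is the Jensen gap that the $K$-sample estimator closes as $K$ grows.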
4 Practical Techniques for Improving the Likelihood of Diffusion ODEs
In this section, we propose some practical techniques for improving the likelihood of diffusion ODEs, including parameterization, a high-order training objective, and variance reduction by importance sampling. For simplicity, we denote $\dot{f}_t := \frac{\mathrm{d}f_t}{\mathrm{d}t}$ for any scalar function $f_t$.
4.1 Velocity Parameterization
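While the score matching objective $\mathcal{J}_{\mathrm{SM}}(\theta)$ only depends on the noise schedule, the training process is affected by many aspects such as network parameterization (Song et al., 2021c; Karras et al., 2022). For example, the noise predictor $\boldsymbol{\epsilon}_\theta$ is widely used to replace the score predictor $\mathbf{s}_\theta$, since the noise $\boldsymbol{\epsilon}$ has unit variance and is easier to fit, while $\mathbf{s}_\theta$ is pathological and explosive near $t = 0$ (Song et al., 2021c).
In this work, we consider another network parameterization which is to directly predict the drift of the diffusion ODE. The parameterized model is defined by
$$\frac{\mathrm{d}\mathbf{x}_t}{\mathrm{d}t} = \mathbf{v}_\theta(\mathbf{x}_t, t) := f(t)\,\mathbf{x}_t - \frac{1}{2} g^2(t)\,\mathbf{s}_\theta(\mathbf{x}_t, t). \tag{11}$$
Rewriting the (first-order) score matching objective $\mathcal{J}_{\mathrm{SM}}(\theta)$ in Eqn. (5) in terms of $\mathbf{v}_\theta$ gives the equivalent objective:
$$\mathcal{J}_{\mathrm{FM}}(\theta) := \int_0^T \frac{2}{g(t)^2}\, \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}}\Big[\big\|\mathbf{v}_\theta(\mathbf{x}_t, t) - \mathbf{v}\big\|_2^2\Big]\,\mathrm{d}t, \tag{12}$$
where $\mathbf{v} = \dot{\alpha}_t \mathbf{x}_0 + \dot{\sigma}_t \boldsymbol{\epsilon}$ is the velocity to predict. Given unlimited model capacity, the optimal $\mathbf{v}_\theta$ is
$$\mathbf{v}^*(\mathbf{x}_t, t) = f(t)\,\mathbf{x}_t - \frac{1}{2} g^2(t)\,\nabla_{\mathbf{x}} \log q_t(\mathbf{x}_t), \tag{13}$$
which is the drift of the probability flow ODE in Eqn. (2).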
We give an intuitive explanation for $\mathbf{v}$ in Appendix D: the prediction target is the tangent (velocity) of the diffusion path, and we therefore call $\mathbf{v}_\theta$ the velocity parameterization. Besides, we show empirically that it alleviates the imbalance problem in noise prediction.
In addition, we prove the equivalence between different predictors and different matching objectives for general noise schedules in Appendix B. We also show in Appendix E that the flow matching method (Lipman et al., 2022; Albergo & Vanden-Eijnden, 2022; Liu et al., 2022b) and related techniques for improving the sample quality of diffusion models (Karras et al., 2022; Salimans & Ho, 2022; Ho et al., 2022) can all be reformulated in velocity parameterization. To be consistent, we still call $\mathcal{J}_{\mathrm{FM}}$ flow matching. It is an extended version of Lipman et al. (2022) with likelihood weighting and several practical modifications, as detailed in Section 4.3.
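The link between noise prediction and velocity prediction described in Sec. 4.1 can be verified numerically. The sketch below assumes the relations $f = \mathrm{d}\log\alpha_t/\mathrm{d}t$, $g^2 = \mathrm{d}\sigma_t^2/\mathrm{d}t - 2f\sigma_t^2$ and $\mathbf{s}_\theta = -\boldsymbol{\epsilon}_\theta/\sigma_t$, and uses a cosine VP schedule chosen purely for illustration (it is not claimed to be the schedule used in the experiments). It checks that the ODE drift built from the true noise equals the path tangent $\dot{\alpha}_t\mathbf{x}_0 + \dot{\sigma}_t\boldsymbol{\epsilon}$, and that velocity and noise errors differ only by the factor $g^2/(2\sigma_t)$.

```python
import numpy as np

def schedule(t):
    """Illustrative VP schedule: alpha = cos(pi t / 2), sigma = sin(pi t / 2)."""
    a, s = np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)
    da, ds = -0.5 * np.pi * s, 0.5 * np.pi * a      # time derivatives
    return a, s, da, ds

def drift_coefficients(t):
    """f = d log(alpha)/dt and g^2 = d sigma^2/dt - 2 f sigma^2."""
    a, s, da, ds = schedule(t)
    f = da / a
    g2 = 2.0 * s * ds - 2.0 * f * s**2
    return f, g2

def velocity_target(x0, eps, t):
    """Tangent of the path x_t = alpha_t x0 + sigma_t eps."""
    _, _, da, ds = schedule(t)
    return da * x0 + ds * eps

def velocity_from_noise(x_t, eps_pred, t):
    """Drift f x_t - (g^2 / 2) s, with the score written as s = -eps_pred / sigma."""
    _, s, _, _ = schedule(t)
    f, g2 = drift_coefficients(t)
    return f * x_t + 0.5 * g2 / s * eps_pred

rng = np.random.default_rng(0)
x0, eps, t = rng.normal(size=5), rng.normal(size=5), 0.3
a, s, _, _ = schedule(t)
x_t = a * x0 + s * eps

v = velocity_target(x0, eps, t)
v_from_eps = velocity_from_noise(x_t, eps, t)       # uses the true noise

# An imperfect noise prediction maps to a rescaled velocity error.
eps_pred = eps + 0.1
f, g2 = drift_coefficients(t)
velocity_error = velocity_from_noise(x_t, eps_pred, t) - v
scaled_noise_error = 0.5 * g2 / s * (eps_pred - eps)
```

Squaring the last relation gives $\|\boldsymbol{\epsilon}_\theta - \boldsymbol{\epsilon}\|^2 = \frac{4\sigma_t^2}{g^4}\|\mathbf{v}_\theta - \mathbf{v}\|^2$, so a weight of $g^2/(2\sigma_t^2)$ on the noise error corresponds to a weight of $2/g^2$ on the velocity error.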
4.2 Error-bounded Second-Order Flow Matching
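According to Chen et al. (2018a), the ODE likelihood of Eqn. (11) can be evaluated by solving the following differential equation from $\epsilon$ to $T$:
$$\frac{\mathrm{d}\log p_t(\mathbf{x}_t)}{\mathrm{d}t} = -\mathrm{tr}\big(\nabla_{\mathbf{x}} \mathbf{v}_\theta(\mathbf{x}_t, t)\big). \tag{14}$$
The objective $\mathcal{J}_{\mathrm{FM}}$ in Eqn. (12) can only restrict the distance between $\mathbf{v}_\theta$ and $\mathbf{v}^*$, but not the gap between the divergences $\mathrm{tr}(\nabla_{\mathbf{x}} \mathbf{v}_\theta)$ and $\mathrm{tr}(\nabla_{\mathbf{x}} \mathbf{v}^*)$. The precision and smoothness of the trace affect the likelihood performance and the number of function evaluations for sampling. For simulation-free training of $\mathrm{tr}(\nabla_{\mathbf{x}} \mathbf{v}_\theta)$, we propose an error-bounded trace of second-order flow matching, where the second-order error is bounded by the proposed objective and the first-order error.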
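To make the role of the divergence concrete, the sketch below applies the instantaneous change-of-variables formula $\frac{\mathrm{d}\log p_t(\mathbf{x}_t)}{\mathrm{d}t} = -\mathrm{tr}(\nabla_{\mathbf{x}}\mathbf{v})$ to a toy problem with a known answer. The linear velocity field, the standard-normal prior at the end time, and the fixed-step RK4 integrator are all assumptions chosen so that the result can be compared with a closed form; a real diffusion ODE would use the learned $\mathbf{v}_\theta$ and typically an adaptive solver.

```python
import numpy as np

def velocity(x, a):
    """Hypothetical linear velocity field v(x) = a * x (elementwise)."""
    return a * x

def divergence(x, a):
    """tr(dv/dx) for the linear field above; it does not depend on x."""
    return a.sum()

def ode_log_likelihood(x_start, a, t_end, num_steps=1000):
    """Integrate x with RK4 from 0 to t_end while accumulating the divergence.

    Returns log p_0(x_start) = log p_T(x_T) + integral of tr(dv/dx) dt,
    with a standard-normal density assumed at the end time.
    """
    h = t_end / num_steps
    x, div_integral = x_start.copy(), 0.0
    for _ in range(num_steps):
        k1 = velocity(x, a)
        k2 = velocity(x + 0.5 * h * k1, a)
        k3 = velocity(x + 0.5 * h * k2, a)
        k4 = velocity(x + h * k3, a)
        div_integral += h * divergence(x, a)
        x = x + h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    log_prior = -0.5 * (x**2).sum() - 0.5 * x.size * np.log(2.0 * np.pi)
    return log_prior + div_integral

a = np.array([0.5, -0.3, 0.1])
x_start = np.array([0.2, -1.0, 0.7])
T = 1.0
numeric = ode_log_likelihood(x_start, a, T)

# Closed form: the flow maps N(0, diag(exp(-2 a T))) at time 0 to N(0, I) at time T.
var0 = np.exp(-2.0 * a * T)
analytic = (-0.5 * x_start**2 / var0 - 0.5 * np.log(2.0 * np.pi * var0)).sum()
```

In this toy case the divergence is constant, so the trace integral is exact and only the state integration carries error. For a learned velocity field, any error in the trace accumulates directly into the log-likelihood, which is the motivation for controlling it during finetuning.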
Theorem 4.1.
(Error-Bounded Trace of Second-Order Flow Matching) Suppose we have a first-order velocity estimator , we can learn a second-order trace velocity model which minimizes
by optimizing
(15) |
where
Moreover, denote the first-order flow matching error as , then , the estimation error for can be bounded by:
4.3 Timing by Log-SNR and Normalizing Velocity
In practice, we make two modifications to improve the performance. First, we use negative log-SNR to time the diffusion process. Still, we parameterize to predict the drift of the timed diffusion ODE i.e. , so the corresponding predictor . Second, the velocity of the diffusion path may have different scales at different , so we propose to predict the normalized velocity , with the parameterized network , which is equal to . The objective in Eqn. (12) reduces to
And the corresponding second-order objective:
(16) |
where is the stop-gradient version of , since we only use the parameterized first-order velocity predictor as an estimator. Our final formulation of parameterized diffusion ODE is
(17) |
4.4 Variance Reduction with Importance Sampling
The flow matching is conducted for all in through an integral. In practice, the evaluation of the integral is time-consuming, and Monte-Carlo methods are used to unbiasedly estimate the objective by uniformly sampling . In this case, the variance of the Monte-Carlo estimator affects the optimization process. Thus, a continuous importance distribution can be proposed for variance reduction. Denote , then
(18) |
We propose to use two types of importance sampling (IS), and empirically compare them for faster convergence.
Designed IS
Intuitively, we can choose . This way, the coefficient of is a time-invariant constant, and the velocity matching error is not amplified or shrunk at any . This is similar to the IS in Song et al. (2021b), where the weighting before the noise matching error is cancelled, and it corresponds to uniform under our parameterization.
For noise schedules used in this paper, we can obtain closed-form sampling procedures using inverse transform sampling, see Appendix C.
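The idea of a designed importance distribution sampled by inverse transform can be illustrated on a one-dimensional integral. In the sketch below, the integrand (a steep weight $e^{\lambda}$ times a mild variation), the interval $[-3, 3]$, and the proposal $p(\lambda) \propto e^{\lambda}$ are all hypothetical choices for demonstration; they are not the weights or schedules of Appendix C. The proposal cancels the steep weight, so the importance-weighted terms are nearly constant.

```python
import math
import numpy as np

A, B = -3.0, 3.0   # hypothetical integration range for lambda

def integrand(lam):
    """Toy per-level loss: a steep weight times a mild variation."""
    return np.exp(lam) * (1.0 + 0.1 * np.sin(lam))

def sample_proposal(n, rng):
    """Inverse-transform sampling from p(lam) proportional to exp(lam) on [A, B]."""
    u = rng.uniform(size=n)
    return np.log(math.exp(A) + u * (math.exp(B) - math.exp(A)))

def proposal_pdf(lam):
    """Normalized density of the proposal on [A, B]."""
    return np.exp(lam) / (math.exp(B) - math.exp(A))

def antiderivative(lam):
    """Closed-form antiderivative of the toy integrand, used as ground truth."""
    return math.exp(lam) * (1.0 + 0.05 * (math.sin(lam) - math.cos(lam)))

rng = np.random.default_rng(0)
n = 100_000

# Baseline: uniform sampling of lambda.
lam_uniform = rng.uniform(A, B, size=n)
terms_uniform = (B - A) * integrand(lam_uniform)

# Importance sampling: draw from the proposal and divide by its density.
lam_is = sample_proposal(n, rng)
terms_is = integrand(lam_is) / proposal_pdf(lam_is)

exact = antiderivative(B) - antiderivative(A)
estimate_uniform, estimate_is = terms_uniform.mean(), terms_is.mean()
variance_uniform, variance_is = terms_uniform.var(), terms_is.var()
```

Both estimators are unbiased for the same integral, but the per-sample variance under the proposal is much smaller because the weight that dominates the integrand has been divided out.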
Learned IS
The variance of the Monte-Carlo estimator depends on the learned network . To minimize the variance, we can parameterize the IS with another network and treat the variance as an objective. Actually, learning is equivalent to learning a monotone mapping , which is inverse cumulative distribution function of . We can uniformly sample , and regard the IS as change-of-variable from to .
(19) |
Suppose we parameterize with . Denote , which is a Monte-Carlo estimator of . Since its variance and is invariant to , we can minimize for variance reduction.
While this approach seeks the optimal IS, it causes extra overhead by introducing an IS network, requiring complex gradient operation or additional training steps. Thus, we only use it as a reference to test the optimality of our designed IS. We simplify the variance reduction in Kingma et al. (2021), and propose an adaptive IS algorithm, which is detailed in Appendix H. Empirically, we show that designed IS is a more preferred approach since it is training-free and achieves a similar convergence speed to learned IS.
5 Related Work
Diffusion models, also known as score-based generative models (SGMs), have achieved state-of-the-art sample quality and likelihood (Dhariwal & Nichol, 2021; Karras et al., 2022; Kingma et al., 2021) among deep generative models, yielding extensive downstream applications such as speech and singing synthesis (Chen et al., 2021; Liu et al., 2022a), conditional image generation (Ramesh et al., 2022; Rombach et al., 2022), guided image editing (Meng et al., 2022; Nichol et al., 2022), unpaired image-to-image translation (Zhao et al., 2022) and inverse problem solving (Chung et al., 2022; Kawar et al., 2022).
Diffusion ODEs are special formulations of neural ODEs and can be viewed as continuous normalizing flows (Chen et al., 2018a). Training of diffusion ODEs can be categorized into simulation-based and simulation-free methods. The former utilizes the exact likelihood evaluation formula of ODE (Chen et al., 2018a), which leads to a maximum likelihood training procedure (Grathwohl et al., 2019). However, it involves expensive ODE simulations for forward and backward propagation and may result in unnecessarily complex dynamics (Finlay et al., 2020) since it only cares about the model distribution at $t = 0$. The latter trains neural ODEs by matching their trajectories to a predefined path, such as the diffusion process. This approach is proposed in Song et al. (2021c), and extended in Lu et al. (2022a); Lipman et al. (2022); Albergo & Vanden-Eijnden (2022); Liu et al. (2022b). We propose velocity parameterization, which is an extension of Lipman et al. (2022) with practical modifications, and claim that the paths used in Lipman et al. (2022); Albergo & Vanden-Eijnden (2022); Liu et al. (2022b) are special cases of noise schedule. Aiming at maximum likelihood training, we also get inspiration from Lu et al. (2022a). We additionally apply likelihood weighting and propose to finetune the model with high-order flow matching.
Variance reduction techniques are commonly used for training diffusion models. Nichol & Dhariwal (2021) proposes an importance sampling (IS) scheme for discrete-time diffusion models that maintains the historical losses at each time step and builds the proposal distribution from them. Song et al. (2021b) designs an IS to cancel out the weighting before the noise matching loss. Kingma et al. (2021) proposes a variance reduction method that is equivalent to learning a parameterized IS. We simplify their procedure and propose an adaptive IS scheme for ablation. By empirically comparing different IS methods, we find a designed and analytical IS distribution that achieves a good performance-efficiency trade-off.
6 Experiments
In this section, we present our training procedure and experiment settings, and our ablation studies to demonstrate how our techniques improve the likelihood of diffusion ODEs.
We implement our methods based on the open-source codebase of Kingma et al. (2021), implemented with JAX (Bradbury et al., 2018), and use similar network and hyperparameter settings. We first train the model by optimizing our first-order flow matching objective for enough iterations, so that the first-order velocity prediction has little error. Then, we finetune the pretrained first-order model using a mixture of first-order and second-order flow matching objectives. The finetuning process converges in far fewer iterations than pretraining. Finally, we evaluate the likelihood on the test set using the variational bound under our proposed truncated-normal dequantization. The detailed training configurations are provided in Appendix I.
Our training and evaluation procedure is feasible for any noise schedule $\alpha_t, \sigma_t$. We choose two special noise schedules:
Variance Preserving (VP)
$\alpha_t^2 + \sigma_t^2 = 1$. This schedule is widely used in diffusion models, and yields a process with a fixed variance of one when the initial distribution has unit variance.
Straight Path (SP)
$\alpha_t + \sigma_t = 1$. This schedule is used in Lipman et al. (2022); Albergo & Vanden-Eijnden (2022); Liu et al. (2022b), where they call it the OT path and claim it leads to better dynamics since the pairwise diffusion paths are straight lines. We simply regard it as a special kind of noise schedule.
Under these two schedules, $\alpha_\lambda, \sigma_\lambda$ are uniquely determined by $\lambda$, and we do not have any extra hyperparameters. They also have corresponding objectives and designed IS, which can be expressed in closed form (see Appendix C for details). We train our i-DODE on CIFAR-10 (Krizhevsky et al., 2009) and ImageNet-32 (Deng et al., 2009), which are two popular benchmarks for generative modeling and density estimation.
Footnote 1 (ImageNet-32): There are two different versions of the ImageNet32 and ImageNet64 datasets. For fair comparisons, we use both versions of ImageNet32: one is downloaded from https://image-net.org/data/downsample/Imagenet32_train.zip, following Lipman et al. (2022), and the other is downloaded from http://image-net.org/small/train_32x32.tar (old version, no longer available), following Song et al. (2021b) and Kingma et al. (2021). The former dataset applies anti-aliasing and is easier for maximum likelihood training.
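The claim that both schedules are fixed once the negative log-SNR is given can be made explicit. The closed forms below are derived here from the assumed constraints $\alpha^2 + \sigma^2 = 1$ (VP) and $\alpha + \sigma = 1$ (SP) together with $\lambda = \log(\sigma^2/\alpha^2)$; they are a sketch for illustration and are not quoted from Appendix C. The function names and the evaluation grid are hypothetical.

```python
import numpy as np

def vp_schedule(lam):
    """VP: alpha^2 + sigma^2 = 1 with lam = log(sigma^2 / alpha^2)."""
    alpha = np.sqrt(1.0 / (1.0 + np.exp(lam)))     # alpha^2 = sigmoid(-lam)
    sigma = np.sqrt(1.0 / (1.0 + np.exp(-lam)))    # sigma^2 = sigmoid(lam)
    return alpha, sigma

def sp_schedule(lam):
    """SP: alpha + sigma = 1 with lam = log(sigma^2 / alpha^2)."""
    alpha = 1.0 / (1.0 + np.exp(0.5 * lam))        # alpha = sigmoid(-lam / 2)
    sigma = 1.0 / (1.0 + np.exp(-0.5 * lam))       # sigma = sigmoid(lam / 2)
    return alpha, sigma

lam = np.linspace(-13.3, 5.0, 7)                   # arbitrary grid of log-SNR values
alpha_vp, sigma_vp = vp_schedule(lam)
alpha_sp, sigma_sp = sp_schedule(lam)

# Recover lambda from each schedule as a consistency check.
lam_from_vp = 2.0 * np.log(sigma_vp / alpha_vp)
lam_from_sp = 2.0 * np.log(sigma_sp / alpha_sp)
```

With time measured in $\lambda$, the two schedules differ only in how $\alpha$ and $\sigma$ are read off from the same value, so no schedule-specific hyperparameter remains.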
6.1 Likelihood and Samples
Model | CIFAR-10 NLL | CIFAR-10 FID | CIFAR-10 NFE | ImageNet-32 NLL | ImageNet-32 FID | ImageNet-32 NFE |
---|---|---|---|---|---|---|
VDM (Kingma et al., 2021) | 2.65 | 7.60 | 1000 | 3.72 | | |
VDM (with data augmentation) (Kingma et al., 2021) | 2.49 | | | | | |
(Previous ODE) | | | | | | |
FFJORD (Grathwohl et al., 2019) | 3.40 | | | | | |
ScoreSDE (Song et al., 2021c) | 2.99 | 2.92 | | | | |
ScoreFlow (Song et al., 2021b) | 2.90 | 5.40 | | 3.82 | 10.18 | |
Soft Truncation (Kim et al., 2022) | 3.01 | 3.96 | | 3.90 | 8.42 | |
Flow Matching (Lipman et al., 2022) | 2.99 | 6.35 | 142 | 3.53 | 5.31 | 122 |
Stochastic Interp. (Albergo & Vanden-Eijnden, 2022) | 2.99 | 10.27 | | 3.48 | 8.49 | |
i-DODE (SP) (ours) | 2.56 | 11.20 | 162 | 3.44/3.69 | 10.31 | 138 |
i-DODE (VP) (ours) | 2.57 | 10.74 | 126 | 3.43/3.70 | 9.09 | 152 |
i-DODE (VP, with data augmentation) (ours) | 2.42 | 3.76 | 215 | | | |
Table 1 shows our experiment results on CIFAR-10 and ImageNet-32 datasets. Our models are pretrained with velocity parameterization, designed IS, and finetuned with second-order flow matching. We report the likelihood values using our truncated-normal dequantization with the importance-weighted estimator under . To compute the FID values, we apply an adaptive-step ODE solver to draw samples from the diffusion ODEs. We also report the NFE during the sampling process, which reflects the smoothness of the dynamics.
Combining our training techniques and dequantization, we exceed the likelihood of previous ODEs, especially by a large margin on CIFAR-10. In Figure 1, we compare our pretraining phase to VDM (Kingma et al., 2021), which indicates that our techniques converge 2× to 3× faster than previous work. We further strengthen the likelihood results by employing data augmentation techniques and a larger network, following VDM. We observe that augmented training data may cause fluctuations in the training and testing losses. When we select the models that achieve the best testing performance, we obtain an SDE likelihood of 2.46 at only around 2M iterations, compared to 2.49 for VDM at 10M iterations.
We do not observe the superiority of SP to VP such as lower FID and NFE as in Lipman et al. (2022). We suspect it may result from maximum likelihood training, which puts more emphasis on the high log-SNR region. More theoretical comparisons with Lipman et al. (2022) are given in Appendix E.2.
Randomly generated samples from our models are provided in Appendix J. Since we use network architecture and techniques targeted at the likelihood, our FID is worse than the state-of-the-art, which can be improved by designing time weighting to emphasize the training at small log-SNR levels (Kingma et al., 2021) or using high-quality sampling algorithms such as PC sampler (Song et al., 2021c). Besides, data augmentation and a larger network notably improve the FID to 3.76 on CIFAR-10, while achieving the state-of-the-art likelihood.
6.2 Ablations
Due to the expensive time cost of pretraining, we only conduct ablation studies on CIFAR-10 under the VP schedule. First, we test our techniques for pretraining when training from scratch. We plot the training curves with noise predictor (Kingma et al., 2021) and velocity predictor, then further implement our IS strategies (Figure 2). We find that velocity parameterization and IS both accelerate the training process, while designed IS performs slightly worse than adaptive IS. Considering the extra time cost for learning the IS network, we conclude that designed IS is a better choice for large-scale pretraining. Then we visualize different IS by plotting the mapping from uniform to importance sampled , as well as the variance at different noise levels on the pretrained model (Figure 3). We show that the IS reduces the variance by sampling more in high log-SNR regions.
Model | NLL (U) | NLL (TN) | FID | NFE |
---|---|---|---|---|
VDM (Kingma et al., 2021) | 2.78 | 2.64 | 8.65 | 213 |
Pretrain (ours) | 2.75 | 2.61 | 10.66 | 248 |
+ Finetune (ours) | 2.74 | 2.60 | 10.74 | 126 |
Next, we test our pretraining, finetuning and evaluation on the converged model (Table 2). As stated before, our pretraining has faster loss descent and converges to a higher likelihood than VDM. Building on it, our finetuning slightly improves the ODE likelihood and smooths the flow, leading to a much lower NFE when sampling. Our truncated-normal dequantization is also a key factor for precise likelihood computation, surpassing the previous uniform dequantization by a large margin.
7 Conclusion
We propose improved techniques for simulation-free maximum likelihood training and likelihood evaluation of diffusion ODEs. Our training stage involves improved pretraining and additional finetuning, which results in fast convergence, high likelihood and smooth trajectories. We improve the likelihood evaluation with a novel truncated-normal dequantization, which is training-free and tailored for diffusion ODEs. Empirically, we achieve state-of-the-art likelihood on image datasets without variational dequantization or data augmentation and make a breakthrough on CIFAR-10 compared to previous ODEs. Due to resource limitations, we did not explore tuning of hyperparameters and network architectures, which is left for future work.
Acknowledgements
This work was supported by the National Key Research and Development Program of China (2020AAA0106302); NSF of China Projects (Nos. 62061136001, 61620106010, 62076145, U19B2034, U1811461, U19A2081, 6197222, 62106120, 62076145); a grant from Tsinghua Institute for Guo Qiang; the High Performance Computing Center, Tsinghua University. J.Z was also supported by the New Cornerstone Science Foundation through the XPLORER PRIZE. The large-scale training was supported by Shengshu Technology.
References
- Albergo & Vanden-Eijnden (2022) Albergo, M. S. and Vanden-Eijnden, E. Building normalizing flows with stochastic interpolants. In The Eleventh International Conference on Learning Representations, 2022.
- Anderson (1982) Anderson, B. D. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313–326, 1982.
- Bradbury et al. (2018) Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., Necula, G., Paszke, A., VanderPlas, J., Wanderman-Milne, S., et al. JAX: composable transformations of Python+NumPy programs. Version 0.2, 5:14–24, 2018.
- Burda et al. (2015) Burda, Y., Grosse, R., and Salakhutdinov, R. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.
- Chen et al. (2021) Chen, N., Zhang, Y., Zen, H., Weiss, R. J., Norouzi, M., and Chan, W. Wavegrad: Estimating gradients for waveform generation. In International Conference on Learning Representations, 2021.
- Chen et al. (2018a) Chen, R. T., Rubanova, Y., Bettencourt, J., and Duvenaud, D. Neural ordinary differential equations. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 6572–6583, 2018a.
- Chen et al. (2018b) Chen, X., Mishra, N., Rohaninejad, M., and Abbeel, P. Pixelsnail: An improved autoregressive generative model. In International Conference on Machine Learning, pp. 864–872. PMLR, 2018b.
- Chen et al. (2018c) Chen, Z., Yeo, C. K., Lee, B. S., and Lau, C. T. Autoencoder-based network anomaly detection. In 2018 Wireless telecommunications symposium (WTS), pp. 1–5. IEEE, 2018c.
- Choi et al. (2022) Choi, K., Meng, C., Song, Y., and Ermon, S. Density ratio estimation via infinitesimal classification. In International Conference on Artificial Intelligence and Statistics, pp. 2552–2573. PMLR, 2022.
- Chung et al. (2022) Chung, H., Kim, J., Mccann, M. T., Klasky, M. L., and Ye, J. C. Diffusion posterior sampling for general noisy inverse problems. In The Eleventh International Conference on Learning Representations, 2022.
- Deng et al. (2009) Deng, J., Dong, W., Socher, R., Li, L., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.
- Dhariwal & Nichol (2021) Dhariwal, P. and Nichol, A. Q. Diffusion models beat GANs on image synthesis. In Advances in Neural Information Processing Systems, volume 34, pp. 8780–8794, 2021.
- Dias et al. (2020) Dias, M. L., Mattos, C. L. C., da Silva, T. L., de Macedo, J. A. F., and Silva, W. C. Anomaly detection in trajectory data with normalizing flows. In 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE, 2020.
- Dinh et al. (2017) Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using real nvp. In International Conference on Learning Representations, 2017.
- Dormand & Prince (1980) Dormand, J. R. and Prince, P. J. A family of embedded Runge-Kutta formulae. Journal of computational and applied mathematics, 6(1):19–26, 1980.
- Finlay et al. (2020) Finlay, C., Jacobsen, J.-H., Nurbekyan, L., and Oberman, A. How to train your neural ode: the world of jacobian and kinetic regularization. In International conference on machine learning, pp. 3154–3164. PMLR, 2020.
- Grathwohl et al. (2019) Grathwohl, W., Chen, R. T., Bettencourt, J., Sutskever, I., and Duvenaud, D. Ffjord: Free-form continuous dynamics for scalable reversible generative models. In International Conference on Learning Representations, 2019.
- Helminger et al. (2020) Helminger, L., Djelouah, A., Gross, M., and Schroers, C. Lossy image compression with normalizing flows. arXiv preprint arXiv:2008.10486, 2020.
- Ho et al. (2019) Ho, J., Chen, X., Srinivas, A., Duan, Y., and Abbeel, P. Flow++: Improving flow-based generative models with variational dequantization and architecture design. In International Conference on Machine Learning, pp. 2722–2730. PMLR, 2019.
- Ho et al. (2020) Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, volume 33, pp. 6840–6851, 2020.
- Ho et al. (2022) Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D. P., Poole, B., Norouzi, M., Fleet, D. J., et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
- Ho et al. (2021) Ho, Y.-H., Chan, C.-C., Peng, W.-H., Hang, H.-M., and Domański, M. Anfic: Image compression using augmented normalizing flows. IEEE Open Journal of Circuits and Systems, 2:613–626, 2021.
- Huang et al. (2021) Huang, C.-W., Lim, J. H., and Courville, A. A variational perspective on diffusion-based generative models and score matching. In Advances in Neural Information Processing Systems, 2021.
- Hutchinson (1990) Hutchinson, M. F. A stochastic estimator of the trace of the influence matrix for laplacian smoothing splines. Communications in Statistics-Simulation and Computation, 19(2):433–450, 1990.
- Karras et al. (2022) Karras, T., Aittala, M., Aila, T., and Laine, S. Elucidating the design space of diffusion-based generative models. In Advances in Neural Information Processing Systems, 2022.
- Kawar et al. (2022) Kawar, B., Elad, M., Ermon, S., and Song, J. Denoising diffusion restoration models. In Advances in Neural Information Processing Systems, 2022.
- Kim et al. (2022) Kim, D., Shin, S., Song, K., Kang, W., and Moon, I.-C. Soft truncation: A universal training technique of score-based diffusion model for high precision score estimation. In International Conference on Machine Learning, pp. 11201–11228. PMLR, 2022.
- Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Kingma & Dhariwal (2018) Kingma, D. P. and Dhariwal, P. Glow: generative flow with invertible 1x1 convolutions. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 10236–10245, 2018.
- Kingma & Welling (2014) Kingma, D. P. and Welling, M. Auto-encoding variational bayes. In International Conference on Learning Representations, 2014.
- Kingma et al. (2021) Kingma, D. P., Salimans, T., Poole, B., and Ho, J. Variational diffusion models. In Advances in Neural Information Processing Systems, 2021.
- Krizhevsky et al. (2009) Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.
- Lipman et al. (2022) Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2022.
- Liu et al. (2022a) Liu, J., Li, C., Ren, Y., Chen, F., and Zhao, Z. Diffsinger: Singing voice synthesis via shallow diffusion mechanism. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 11020–11028, 2022a.
- Liu et al. (2022b) Liu, X., Gong, C., et al. Flow straight and fast: Learning to generate and transfer data with rectified flow. In The Eleventh International Conference on Learning Representations, 2022b.
- Loshchilov & Hutter (2019) Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
- Lu et al. (2022a) Lu, C., Zheng, K., Bao, F., Chen, J., Li, C., and Zhu, J. Maximum likelihood training for score-based diffusion odes by high order denoising score matching. In International Conference on Machine Learning, pp. 14429–14460. PMLR, 2022a.
- Lu et al. (2022b) Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., and Zhu, J. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. In Advances in Neural Information Processing Systems, 2022b.
- Meng et al. (2022) Meng, C., Song, Y., Song, J., Wu, J., Zhu, J.-Y., and Ermon, S. SDEdit: Image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2022.
- Nichol & Dhariwal (2021) Nichol, A. Q. and Dhariwal, P. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pp. 8162–8171. PMLR, 2021.
- Nichol et al. (2022) Nichol, A. Q., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., Mcgrew, B., Sutskever, I., and Chen, M. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. In International Conference on Machine Learning, pp. 16784–16804. PMLR, 2022.
- Oord et al. (2016) Oord, A. v. d., Kalchbrenner, N., Vinyals, O., Espeholt, L., Graves, A., and Kavukcuoglu, K. Conditional image generation with pixelcnn decoders. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 4797–4805, 2016.
- Ramesh et al. (2022) Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
- Rombach et al. (2022) Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695, 2022.
- Salimans & Ho (2022) Salimans, T. and Ho, J. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations, 2022.
- Salimans et al. (2017) Salimans, T., Karpathy, A., Chen, X., and Kingma, D. P. Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications. In International Conference on Learning Representations, 2017.
- Serrà et al. (2020) Serrà, J., Álvarez, D., Gómez, V., Slizovskaia, O., Núñez, J. F., and Luque, J. Input complexity and out-of-distribution detection with likelihood-based generative models. In International Conference on Learning Representations, 2020.
- Sohl-Dickstein et al. (2015) Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256–2265. PMLR, 2015.
- Song et al. (2021a) Song, J., Meng, C., and Ermon, S. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021a.
- Song & Ermon (2019) Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, volume 32, pp. 11895–11907, 2019.
- Song et al. (2020) Song, Y., Garg, S., Shi, J., and Ermon, S. Sliced score matching: A scalable approach to density and score estimation. In Uncertainty in Artificial Intelligence, pp. 574–584. PMLR, 2020.
- Song et al. (2021b) Song, Y., Durkan, C., Murray, I., and Ermon, S. Maximum likelihood training of score-based diffusion models. In Advances in Neural Information Processing Systems, volume 34, pp. 1415–1428, 2021b.
- Song et al. (2021c) Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021c.
- Uria et al. (2013) Uria, B., Murray, I., and Larochelle, H. RNADE: The real-valued neural autoregressive density-estimator. Advances in Neural Information Processing Systems, 26, 2013.
- Vahdat & Kautz (2020) Vahdat, A. and Kautz, J. Nvae: a deep hierarchical variational autoencoder. In Proceedings of the 34th International Conference on Neural Information Processing Systems, pp. 19667–19679, 2020.
- Vincent (2011) Vincent, P. A connection between score matching and denoising autoencoders. Neural computation, 23(7):1661–1674, 2011.
- Xiao et al. (2020) Xiao, Z., Yan, Q., and Amit, Y. Likelihood regret: an out-of-distribution detection score for variational auto-encoder. In Proceedings of the 34th International Conference on Neural Information Processing Systems, pp. 20685–20696, 2020.
- Xu et al. (2022) Xu, Y., Liu, Z., Tegmark, M., and Jaakkola, T. S. Poisson flow generative models. In Advances in Neural Information Processing Systems, 2022.
- Xu et al. (2023) Xu, Y., Liu, Z., Tian, Y., Tong, S., Tegmark, M., and Jaakkola, T. Pfgm++: Unlocking the potential of physics-inspired generative models. arXiv preprint arXiv:2302.04265, 2023.
- Yang & Mandt (2022) Yang, R. and Mandt, S. Lossy image compression with conditional diffusion models. arXiv preprint arXiv:2209.06950, 2022.
- Zhao et al. (2022) Zhao, M., Bao, F., Li, C., and Zhu, J. Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. In Advances in Neural Information Processing Systems, 2022.
Appendix A Different perspectives of diffusion ODEs for bridging the gap between discrete and continuous data
Suppose the discrete data to be modelled are 8-bit integers $\{0, 1, \dots, 255\}$. Following the common transform in diffusion models, we first normalize them to the range $[-1, 1]$ by the mapping $x \mapsto x/127.5 - 1$. In the following discussions, we consider the model distribution on the transformed discrete data, which is equal to the model distribution on the original integers since the scaling does not alter the discrete probability.
A.1 Dequantization perspective
The discrete data has a uniform gap between two consecutive values on each dimension. We can define the discrete model distribution as
(20) |
where is the diffusion ODE defined at time . Then, we can introduce a dequantization distribution with support over . Treating as an approximate posterior, we obtain the following variational bound (Ho et al., 2019):
(21) |
The ODE term can be evaluated exactly by solving another ODE called “Instantaneous Change of Variables” (Chen et al., 2018a). As for the posterior , we can derive closed-form solutions for predefined posterior formulation. We provide the details for uniform dequantization and our proposed truncated-normal dequantization.
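To make the "Instantaneous Change of Variables" step concrete, the following is a minimal sketch on a toy problem rather than on a diffusion ODE. It assumes a hypothetical one-dimensional linear drift `a * x` and a standard normal initial density, so the result can be compared against a closed form. All function names are illustrative.

```python
import math


def log_density_along_linear_flow(x0, a, t_end, n_steps=10_000):
    """Euler-integrate dx/dt = a*x jointly with d(log p)/dt = -d(a*x)/dx = -a."""
    dt = t_end / n_steps
    x = x0
    # Initial log-density: log N(x0; 0, 1).
    log_p = -0.5 * x0 * x0 - 0.5 * math.log(2.0 * math.pi)
    for _ in range(n_steps):
        log_p += -a * dt  # the divergence of the drift a*x is the constant a
        x += a * x * dt
    return x, log_p


def exact_log_density(x, a, t):
    """Closed form: N(0, 1) pushed through the flow is N(0, exp(2*a*t))."""
    var = math.exp(2.0 * a * t)
    return -0.5 * x * x / var - 0.5 * math.log(2.0 * math.pi * var)
```

For a learned velocity field the divergence has no closed form and must be computed or estimated numerically, but the bookkeeping is the same: the state and the log-density are integrated together along the trajectory.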
Uniform dequantization
We simply use uniform posterior . In this case, is a constant, and the bound becomes
(22) |
Similar to Burda et al. (2015), we can also sample multiple to derive a tighter bound, which is called importance weighted estimator:
(23) |
However, this dequantization will cause a training-evaluation gap. For training, we fit to the distribution of . For evaluation, we test on uniform dequantized . This gap will degrade the likelihood performance, as we will show later.
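The importance weighted estimator above averages density ratios inside the logarithm. A minimal sketch of that computation follows. It assumes each entry of the hypothetical `log_weights` list already holds the log model density of a dequantized sample minus the log dequantization density, and it uses the log-sum-exp trick for numerical stability.

```python
import math


def importance_weighted_bound(log_weights):
    """Return log((1/K) * sum_k exp(log_w_k)), computed stably."""
    k = len(log_weights)
    m = max(log_weights)
    # Subtracting the maximum keeps every exponent non-positive.
    return m + math.log(sum(math.exp(lw - m) for lw in log_weights)) - math.log(k)
```

With a single sample this reduces to the plain variational bound. By Jensen's inequality the value is never smaller than the average of the individual log-weights.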
Truncated-normal dequantization
To bridge the training-evaluation gap, we test on , where obeys a truncated-normal distribution to make sure the range of on each dimension does not exceed . Specifically, denote , we define the truncated-normal distribution as
(24) |
Let
(25) |
By the change of variables for probability density, we have
(26) |
(27) |
where is the probability distribution function of truncated-normal distributions
(28) |
Here is the cumulative distribution function of standard normal distribution, and is the error function. Combining the equations above, the bound is reduced to
(29) |
Further, we can derive closed-form solutions for the entropy term of truncated-normal distribution:
(30) |
and we finally obtain the exact form of the bound:
(31) |
where the ODE log-likelihood can also be evaluated exactly. Similarly, we have the corresponding importance-weighted estimator by modifying Eqn. (29):
(32) |
where , and is expressed in Eqn. (28).
In our experiments, we choose the start time . Under this setting, we have , and the truncated-normal distribution is almost the same as the standard normal distribution due to the 3-$\sigma$ rule. Thus, used in training and used in testing are virtually identically distributed, resulting in a negligible training-evaluation gap.
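As a rough illustration of the quantities involved, the sketch below handles a one-dimensional standard normal truncated to a symmetric interval `[-tau, tau]`. The name `tau` and the standardized, unscaled form are assumptions made for this example; the distribution used in the paper is a scaled version applied per dimension. The code gives the log-density, the closed-form entropy, and a simple rejection sampler.

```python
import math
import random


def truncation_mass(tau):
    """Z = Phi(tau) - Phi(-tau) = erf(tau / sqrt(2))."""
    return math.erf(tau / math.sqrt(2.0))


def trunc_normal_logpdf(u, tau):
    """Log-density of N(0, 1) truncated to [-tau, tau]."""
    if abs(u) > tau:
        return float("-inf")
    return -0.5 * u * u - 0.5 * math.log(2.0 * math.pi) - math.log(truncation_mass(tau))


def trunc_normal_entropy(tau):
    """Closed-form differential entropy of the truncated standard normal."""
    z = truncation_mass(tau)
    phi_tau = math.exp(-0.5 * tau * tau) / math.sqrt(2.0 * math.pi)
    return 0.5 * math.log(2.0 * math.pi * math.e) + math.log(z) - tau * phi_tau / z


def sample_trunc_normal(tau, rng):
    """Rejection sampling; efficient when tau is around 3 or larger."""
    while True:
        u = rng.gauss(0.0, 1.0)
        if abs(u) <= tau:
            return u
```

For `tau = 3` the truncation mass is about 0.997, so almost no samples are rejected and the entropy is only slightly below that of the untruncated normal. This matches the observation that the two distributions are nearly identical.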
A.2 Variational perspective
From the variational perspective, we can view the transition from discrete to continuous as a variational autoencoder, where the prior is modeled by diffusion ODE, and the approximate posterior is the analytical Gaussian transition kernel in the forward diffusion process at the start. We have the variational bound:
(33) |
where . We want to use the reconstruction term to approximate . Note that
(34) |
for small enough , we have , so , where represents the -th dimension. Thus, we also choose as a factorized distribution, following Kingma et al. (2021):
(35) |
where each
(36) |
As is a discrete variable, the probability can be computed by softmax, so we have
(37) |
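A minimal per-dimension sketch of such a softmax reconstruction term is given below. It assumes the factorized discrete distribution is proportional to the Gaussian transition density evaluated at each of the 256 candidate values, in the style of Kingma et al. (2021). The argument names `alpha` and `sigma` stand for the signal and noise scales at the start time and are placeholders for this example.

```python
import math


def discrete_recon_log_prob(x_eps, true_index, alpha, sigma, num_levels=256):
    """Log-probability of one discrete value given one noisy coordinate x_eps."""
    half = (num_levels - 1) / 2.0  # 127.5 for 8-bit data
    # Unnormalized Gaussian log-likelihood of x_eps under each candidate value in [-1, 1].
    logits = [
        -((x_eps - alpha * (k / half - 1.0)) ** 2) / (2.0 * sigma ** 2)
        for k in range(num_levels)
    ]
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
    return logits[true_index] - log_norm
```

When `sigma` is small relative to the spacing between adjacent values, the probability concentrates on the nearest value and the log-probability of the true value is close to zero.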
Besides, the Gaussian entropy term can be computed exactly
(38) |
and the bound is reduced to
(39) |
where , is given in Eqn. (37) and is the exact ODE likelihood. We also have the importance weighted estimator by modifying Eqn. (33):
(40) |
A.3 Practical connections and results
Let us consider the bound without importance weighted estimator. By observing the bound in Eqn. (31) for truncated-normal dequantization and the bound in Eqn. (39) for variational perspective, we can find that they have similar formulations. Suppose we use , we have , and the bound in Eqn. (31) is approximately
(41) |
Next, consider the variational perspective. Though the reconstruction term in Eqn. (39) depends on the data distribution, empirically it is nearly a constant . So we have the approximate bound
(42) |
We note the only difference is that our proposed truncated-normal dequantization uses rather than for ODE likelihood evaluation, and there is a small constant difference in the bound.
NLL | Uniform Dequantization | | | Variational | | | Truncated-Normal Dequantization | | |
---|---|---|---|---|---|---|---|---|---|
CIFAR-10 (VP) | 2.74 | 2.72 | 2.71 | 2.60 | 2.59 | 2.58 | 2.60 | 2.58 | 2.57 |
CIFAR-10 (SP) | 2.81 | 2.79 | 2.78 | 2.61 | 2.59 | 2.58 | 2.60 | 2.57 | 2.56 |
ImageNet-32 (VP) | 3.52 | 3.51 | 3.50 | 3.46 | 3.44 | 3.44 | 3.45 | 3.44 | 3.43 |
ImageNet-32 (SP) | 3.57 | 3.56 | 3.55 | 3.48 | 3.47 | 3.46 | 3.47 | 3.45 | 3.44 |
Remark A.1.
For high-dimensional data such as images, directly comparing log-likelihood may suffer from scaling issues by the dimension. In practice, we usually compare the BPD (bits/dim) by
(43) |
where is the data distribution. Since BPD averages the log-likelihood on each dimension, scaling dimensionality has no effect on the final result.
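The conversion from a log-likelihood in nats to BPD is a one-line computation. The sketch below assumes the input is the log-likelihood of a single data point, or an average over data points, under the discrete model. The function name is illustrative.

```python
import math


def bits_per_dim(log_likelihood_nats, num_dims):
    """Convert a log-likelihood in nats into bits per dimension."""
    return -log_likelihood_nats / (num_dims * math.log(2.0))
```

For example, a CIFAR-10 image has 32 x 32 x 3 = 3072 dimensions, so `bits_per_dim(ll, 3072)` maps a total log-likelihood `ll` to the scale reported in the tables.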
We test the two types of dequantization and the variational perspective on our final models, using different numbers of importance samples . The results are listed in Table 3. Empirically, truncated-normal dequantization performs slightly better than variational, while uniform dequantization gives a bad likelihood due to the large training-evaluation gap. We also observe that increasing further improves the results by giving a tighter bound.
Remark A.2.
Since uniform dequantized data has a larger noise level than truncated-normal dequantized data, we find evaluating at start time leads to bad likelihood. Thus, we tune for uniform dequantization (Figure 4), and eventually choose for CIFAR-10 (VP), CIFAR-10 (SP), ImageNet-32 (VP), ImageNet-32 (SP) respectively.
Appendix B Equivalence of different predictors and matching objectives
We have the following theorem which demonstrates that different predictors are mutually transformable by a time-dependent skip connection, and they can be trained in a simulation-free approach by equivalent matching objectives.
Theorem B.1.
Let be the sample from data distribution, and be the sample from . Denote . Suppose we have four kinds of predictors parameterized by and corresponding matching objectives with positive time weighting function :
- score predictor and score matching loss
- noise predictor and noise matching loss
- data predictor and data matching loss
- velocity predictor and flow matching loss
For any , if we denote the optimal (ground-truth) predictors that minimize the corresponding matching losses as respectively, then they are equivalent by the following relations:
(44) | ||||
where is the ground-truth score.
Proof.
For any positive weighting , the overall optimum of the matching loss is achieved when the optimum of the inner expectation is achieved for any . For fixed , by denoising score matching (Vincent, 2011), we know minimizing is equivalent to minimizing , where . The inner expectation is a minimum mean square error problem, so the optimal score predictor satisfies
(45) |
Similarly, for , the optimal noise predictor satisfies
(46) |
For , the optimal data predictor satisfies
(47) | ||||
For , the optimal velocity predictor satisfies
(48) | ||||
∎
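The linear relations between the four prediction targets can be sketched for a single scalar sample as follows. The sketch assumes the usual forward parameterization, noisy sample = `alpha` times data plus `sigma` times noise, with `d_alpha` and `d_sigma` denoting the time derivatives of the two coefficients. These argument names are placeholders, and the code does not reproduce the exact form of Eqn. (44).

```python
def noise_to_score(eps, sigma):
    """Score of the Gaussian transition kernel expressed through the noise."""
    return -eps / sigma


def noise_to_data(x_t, eps, alpha, sigma):
    """Recover the data sample from the noisy sample and the noise."""
    return (x_t - sigma * eps) / alpha


def noise_to_velocity(x_t, eps, alpha, sigma, d_alpha, d_sigma):
    """Velocity along the diffusion path: d_alpha * data + d_sigma * noise."""
    x0 = noise_to_data(x_t, eps, alpha, sigma)
    return d_alpha * x0 + d_sigma * eps
```

Because every relation is linear in the target for a fixed noisy sample and time, the same conversions carry over to the conditional expectations, that is, to the optimal predictors.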
The equivalence of optimal predictors also implies the equivalence of parameterized predictors. From the above theorem, we know and are related by . In practice, we use timing. From the relationship , we obtain the noise predictor expressed by
(49) |
Further, we can replace with the normalized velocity predictor .
Moreover, we can derive the equivalent training objectives under different parameterizations by employing the relations discussed above freely. For example, when we replace the normalized velocity predictor with the score predictor in the second-order objective Eqn. (4.3), we can obtain the second-order denoising score matching similar to Lu et al. (2022a). However, though theoretically equivalent, the actual performance of these objectives highly depends on the specific model architecture, hyperparameters and parameterization, and the authors of Lu et al. (2022a) find that their high-order denoising score matching objectives only work for the VE schedule, but degrade the performance of pretrained models with the VP schedule.
Appendix C Specifications under VP and SP schedule
As stated in Section 4.3, using timing and normalized velocity predictor , the likelihood weighted first-order and second-order flow matching objectives are reformulated as:
(50) |
(51) |
where . For VP and SP schedule, since , using their schedule properties, are deterministic functions of without any hyperparameters. Thus, we can derive their specific objectives and equivalent predictors using the formula for general noise schedules. We summarize them in Table 4, where denotes the stop-gradient version of .
Formula | VP | SP |
---|---|---|
Next, we derive the designed IS procedure. We want to choose a proposal distribution , which is proportional for VP and SP. Since we have explicit expressions for the density, we utilize inverse transform sampling to design a sampling procedure. Concretely, we take uniform samples of a number , and solve the following equation about :
(52) |
Here we assume maximum time , and is a normalizing constant.
VP
We have (omit the constant of the indefinite integral)
(53) |
Then the equation for inverse transform sampling is
(54) |
The solution has a closed-form expression, which gives the inverse transformation from to
(55) |
SP
We have (omit the constant of the indefinite integral)
(56) |
Denote , then the equation for inverse transform sampling is
(57) |
The solution has no closed-form expressions. Similar to the implementation in Song et al. (2021b), we use the bisection method to find the root.
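Inverse transform sampling with a bisection root finder can be sketched as follows. The function `cdf_unnormalized` is a hypothetical stand-in for the antiderivative of the proposal density, which is assumed to be monotonically increasing on the interval. The schedule-specific expression is not reproduced here.

```python
def inverse_transform_bisect(cdf_unnormalized, u, t_min, t_max, tol=1e-10, max_iter=200):
    """Solve F(t) = F(t_min) + u * (F(t_max) - F(t_min)) for t by bisection."""
    f_lo = cdf_unnormalized(t_min)
    target = f_lo + u * (cdf_unnormalized(t_max) - f_lo)
    lo, hi = t_min, t_max
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if cdf_unnormalized(mid) < target:
            lo = mid  # the root lies to the right of mid
        else:
            hi = mid  # the root lies at or to the left of mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)
```

Drawing `u` uniformly from the unit interval and passing it to this routine yields samples distributed according to the normalized density. The normalizing constant is handled implicitly through the difference of the antiderivative at the two endpoints.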
Appendix D Illustration of velocity prediction and imbalance problem
First, we give an intuitive illustration of our velocity parameterization and corresponding flow matching objective in Section 4.1. As shown in Figure 5(a), for each pair where and , let . As increases, moves from to gradually, forming a diffusion path in the sample space, and is the velocity across the path. Thus, minimizing is to predict the expected velocity for all possible pairs.
Next, we interpret the superiority of velocity prediction from the perspective of balanced prediction difficulty. Intuitively, the noise prediction model suffers from an imbalance problem: at small , is similar to data, and extracting the insignificant noise component is hard; at large , is similar to noise, so the noise prediction is easy and has a small error. Velocity prediction, on the other hand, has the property that the prediction target is less relevant to the input . In Fig. 5(b) we empirically confirm this on our pretrained model. We plot the mean square prediction error (MSE) w.r.t. time , which shows that velocity prediction alleviates the imbalance problem by enlarging the training at large . Since the overall error is a weighted combination of the MSE at different and is invariant to the parameterization, we can conclude that under noise prediction, the MSE is lower near , but it is assigned a larger weight, so it has a larger gradient variance.
Appendix E Relationship between velocity parameterization and other works
In this section, we demonstrate how the techniques in related works (Karras et al., 2022; Lipman et al., 2022; Salimans & Ho, 2022; Ho et al., 2022) can be reformulated as velocity parameterization.
E.1 Interpretation by preconditioning
Works that aim at improving the sample quality of diffusion models also consider the network parameterizations that adaptively mix signal and noise. Karras et al. (2022) proposes to precondition the neural network with a time-dependent skip connection that allows it to estimate either data or noise , or something in between. Similarly, we write the noise predictor in the following formulation:
(58) |
where is the pure network, . The flow matching loss can be rewritten as
(59) | ||||
where
(60) |
Following first principles in EDM, we derive formulas for to ensure:
1. The training inputs of have unit variance.
2. The effective training target has unit variance.
3. We select to minimize , so that the errors of are amplified as little as possible.
From principle 1, we have
(61) | ||||
From principle 2, we have
(62) | ||||
From principle 3, we have
(63) | ||||
If we assume and consider VP schedule, we have , and the coefficients are reduced to
(65) |
In this case, the preconditioning is in agreement with our velocity parameterization by . In practice, we find setting as in Karras et al. (2022) leads to faster descent of the loss at the start, but slower convergence as the training proceeds.
E.2 Connection to flow matching in Lipman et al. (2022)
Lipman et al. (2022) defines a conditional probability path that gradually moves the data to a target distribution . Note that they use to represent data distribution and to represent target distribution. To be consistent, we reverse their time representation. They obtain the marginal probability path by marginalizing over :
(66) |
They want to learn a vector field , which defines a flow by
(67) |
so that the marginal can be generated by the push-forward . In practice, they consider the Gaussian conditional probability paths
(68) |
and propose a conditional flow matching (CFM) objective for simulation-free training of
(69) |
where
(70) |
Suppose the mean is linear to , and the standard deviation is invariant to , as the two experimented cases in flow matching. By setting , we have
(71) |
where we use since . Then we can observe that they are corresponding to our notations: the conditional probability path corresponds to the Gaussian transition kernel of the forward diffusion process; the marginal probability path corresponds to the ground-truth marginals associated with the forward diffusion process; the matching target in CFM corresponds to the velocity of the diffusion path in our formulation.
Therefore, the CFM objective in Lipman et al. (2022) is actually velocity parameterization when specific to Gaussian diffusion processes, which is similar to our first-order objective of the pretraining phase. We can express CFM in a simpler form, which is easier to analyze and generalize to any noise schedule. Then by the equivalence of different predictors (Theorem B.1) and the relationship between and , we have
(72) | ||||
which demonstrates that the CFM objective not only changes the parameterization but also imposes a different time weighting on the original denoising score matching objective. When the training aims to improve the sample quality (e.g., FID), the optimal choice for is still an open problem.
Comparing the CFM objective to our first-order objective Eqn. (50), the practical differences are that we use normalized predictor , timing, and apply likelihood weighting. The likelihood weighting refers to time weighting in Eqn. (5) and in Eqn. (12), which is consistent under different parameterizations and is the theoretically optimal choice for maximum likelihood training (Song et al., 2021c). Also, changing the time domain from to will not alter the value of the objective, but will affect the variance of Monte-Carlo estimation and the convergence speed, as we have discussed. For example, the OT path in Lipman et al. (2022) is , and the relation between and is . Under timing, we can decouple the choice of noise schedules to the greatest extent, and regard the change of variable from to as a tunable importance sampling procedure.
Besides, normalizing the field is necessary for stable training of the velocity predictor and is the key to unifying v prediction and preconditioning. Such strategies have also been adopted in more general physics-inspired generative models. For example, Xu et al. (2022, 2023) propose to normalize the Poisson field when training Poisson flow generative models.
E.3 Connection to v prediction
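In Salimans & Ho (2022); Ho et al. (2022), a technique called “v prediction” is used, which parameterizes a network to predict $\mathbf{v} = \alpha_t \boldsymbol{\epsilon} - \sigma_t \mathbf{x}_0$. Assuming a VP schedule following their choice, we have $\alpha_t^2 + \sigma_t^2 = 1$, so by taking the derivative w.r.t. $t$ we have $\alpha_t \dot{\alpha}_t + \sigma_t \dot{\sigma}_t = 0$, then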
(73) |
so
(74) | ||||
and the velocity is
(75) | ||||
Besides, we can compute the normalizing factor as
(76) |
so we have the normalized velocity
(77) |
Therefore, , which means that v prediction is a special case of velocity parameterization when the noise schedule is VP.
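A quick numerical check of this connection is sketched below. It assumes the angular form of a VP schedule, with signal scale `cos(phi)` and noise scale `sin(phi)`, and verifies by finite differences that the v-prediction target of Salimans & Ho (2022) equals the derivative of the noisy sample with respect to the angle `phi`. The angle parameterization and the function names are choices made for this example. Any further time-dependent normalization is left out.

```python
import math


def noisy_sample(phi, x0, eps):
    """VP path in angular form: cos(phi) * data + sin(phi) * noise."""
    return math.cos(phi) * x0 + math.sin(phi) * eps


def v_prediction_target(phi, x0, eps):
    """v-prediction target: (signal scale) * noise - (noise scale) * data."""
    return math.cos(phi) * eps - math.sin(phi) * x0


def finite_difference_velocity(phi, x0, eps, h=1e-6):
    """Central difference of the path with respect to the angle."""
    return (noisy_sample(phi + h, x0, eps) - noisy_sample(phi - h, x0, eps)) / (2.0 * h)
```

If the angle is itself a function of time, the velocity with respect to time is this target multiplied by the scalar rate of change of the angle, so the two differ only by a time-dependent scaling.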
Appendix F Error-bounded trace of second-order flow matching
Here we provide the proofs for the error-bounded trace of second-order flow matching. First, we provide a lemma that gives the Jacobian of the ground-truth velocity predictor .
Lemma F.1.
Suppose , denote , we have
(78) |
and
(79) |
Proof.
Then we prove Theorem 4.1 as follows.
Proof.
The optimization in Eqn. (15) can be rewritten as
(84) |
For fixed and , minimizing the inner expectation is a minimum mean square error problem for , so the optimal satisfies
(85) |
Using Lemma F.1 and , we have
(86) | ||||
Therefore, we can obtain the error bound by
(87) | ||||
where is the first-order estimation error. ∎
Appendix G Difference between our second-order flow matching and the previous time score matching in Choi et al. (2022)
We propose the error-bounded second-order flow matching objective to regularize , which is equal to by the “Instantaneous Change of Variables” formula of CNFs (Chen et al., 2018a). Choi et al. (2022) proposes a joint score matching method to estimate the data score as well as the time score , which seems related. However, they are essentially different.
Firstly, the change-of-variable for CNFs describes the total derivative of w.r.t. which evolves by the ODE flow trajectory, not each fixed data point . However, in Choi et al. (2022) describes the partial derivative of for , i.e., any fixed data point in the whole space. Specifically, according to the Fokker-Planck equation, we have
(88) |
It follows that
(89) |
Therefore, the total derivative we care about is different from the partial derivative in Choi et al. (2022), and their training objectives are also different (with different optimal solutions).
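The distinction can be made explicit with a short worked derivation. The symbols here are assumptions for illustration: $\mathbf{v}(\mathbf{x}, t)$ denotes the drift of the ODE $\mathrm{d}\mathbf{x}_t / \mathrm{d}t = \mathbf{v}(\mathbf{x}_t, t)$ and $p_t$ its marginal density. The notation may differ from that of Eqns. (88) and (89).

```latex
\begin{align}
\frac{\partial p_t(\mathbf{x})}{\partial t}
  &= -\nabla_{\mathbf{x}} \cdot \big( p_t(\mathbf{x})\, \mathbf{v}(\mathbf{x}, t) \big), \\
\frac{\partial \log p_t(\mathbf{x})}{\partial t}
  &= -\nabla_{\mathbf{x}} \cdot \mathbf{v}(\mathbf{x}, t)
     - \mathbf{v}(\mathbf{x}, t)^\top \nabla_{\mathbf{x}} \log p_t(\mathbf{x}), \\
\frac{\mathrm{d} \log p_t(\mathbf{x}_t)}{\mathrm{d} t}
  &= \frac{\partial \log p_t(\mathbf{x}_t)}{\partial t}
     + \mathbf{v}(\mathbf{x}_t, t)^\top \nabla_{\mathbf{x}} \log p_t(\mathbf{x}_t)
   = -\nabla_{\mathbf{x}} \cdot \mathbf{v}(\mathbf{x}_t, t).
\end{align}
```

The partial derivative at a fixed point therefore differs from the total derivative along the trajectory by the inner product of the drift and the data score.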
Moreover, there is another difference: Choi et al. (2022) trains another model to estimate the partial derivative , which is independent of the ODE velocity (in the form of the parameterized score function ). However, our method restricts the parameterized velocity itself, and does not employ another model.
Finally, the techniques used in Choi et al. (2022) and our work are also different. Choi et al. (2022) estimates the score matching loss for the partial derivative by the well-known integration by parts, which is used to derive the famous sliced score matching (Song et al., 2020), to avoid the computation of the score function ; however, our method leverages the property of mean square error (that its minimizer is the conditional mean), which is used to derive the famous denoising score matching (Vincent, 2011), to estimate the divergence of . In the score matching literature, sliced score matching and denoising score matching are two rather different techniques. As first-order denoising score matching is widely used in training diffusion models (such as Song et al. (2021c)), our proposed second-order flow matching is also suitable for training diffusion ODEs.
Appendix H Details of our adaptive IS
In this section, we give details of our adaptive IS stated in Section 4.4. First, we parameterize similar to Kingma et al. (2021):
(90) |
where is a dense monotone increasing network. Concretely, we use a two-layer fully-connected network where is the sigmoid activation function, are linear layers with positive weight and output units of 1024 and 1.
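A minimal sketch of such a monotone two-layer network is shown below. It assumes positivity is enforced by taking absolute values of randomly initialized weights, which is one of several possible choices. It uses a small hidden width for readability, whereas the text specifies 1024 hidden units. All names are illustrative.

```python
import math
import random


def stable_sigmoid(z):
    """Sigmoid written via tanh to avoid overflow for large |z|."""
    return 0.5 * (1.0 + math.tanh(0.5 * z))


def make_monotone_net(hidden, rng):
    """Two positive-weight linear layers with a sigmoid in between."""
    w1 = [abs(rng.gauss(0.0, 1.0)) for _ in range(hidden)]  # positive first-layer weights
    b1 = [rng.gauss(0.0, 1.0) for _ in range(hidden)]       # biases are unconstrained
    w2 = [abs(rng.gauss(0.0, 1.0)) for _ in range(hidden)]  # positive second-layer weights
    b2 = rng.gauss(0.0, 1.0)

    def net(t):
        h = [stable_sigmoid(w * t + b) for w, b in zip(w1, b1)]
        return sum(w * hi for w, hi in zip(w2, h)) + b2

    return net
```

Since the sigmoid is increasing and all weights are positive, the output is an increasing function of the scalar input, which is the property needed for a valid reparameterization of time.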
Then we present our adaptive IS procedure in Algorithm 1. Kingma et al. (2021) proposes to reuse the gradient to optimize and avoid a second backpropagation by decomposing the gradient using the chain rule. We simplify their learning of by removing the complex gradient operation in one iteration and propose to alternately optimize and . This may incur extra overhead, but it still seeks the optimal IS and is sufficient for ablation.
Appendix I Experiment details
In this section, we provide details of our experiment settings. Our network, hyperparameters and training are the same for different noise schedules on the same dataset.
Model architectures
Our diffusion ODEs are parameterized in terms of the -timed normalized velocity predictor , based on the U-Net structure of Kingma et al. (2021). This architecture is tailored for maximum likelihood training, employing special designs such as removing the internal downsampling/upsampling and adding Fourier features for fine-scale prediction. Our configuration for each dataset also follows Kingma et al. (2021): for CIFAR-10, we use a U-Net of depth 32 with 128 channels; for ImageNet-32, we still use a U-Net of depth 32, but double the number of channels to 256. All these models use a dropout rate of 0.1 in the intermediate layers. For CIFAR-10 (with data augmentation), we use a U-Net of depth 32 with 256 channels and decrease the dropout rate to 0.05.
Hyperparameters and training
We follow the same default training settings as Kingma et al. (2021). For all our experiments, we use the Adam (Kingma & Ba, 2014) optimizer with learning rate , exponential decay rates of and decoupled weight decay (Loshchilov & Hutter, 2019) coefficient of 0.01. We also maintain an exponential moving average (EMA) of model parameters with an EMA rate of 0.9999 for evaluation.
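The EMA of model parameters mentioned above can be sketched as follows. This is a minimal illustration, assuming the standard update `ema <- rate * ema + (1 - rate) * params` applied after each optimizer step; the function name `ema_update` and the toy parameter dictionary are hypothetical.

```python
import numpy as np

def ema_update(ema_params, params, rate=0.9999):
    # ema <- rate * ema + (1 - rate) * params, applied to every tensor.
    return {k: rate * ema_params[k] + (1.0 - rate) * params[k] for k in params}

# Toy parameters that stay constant across steps.
params = {"w": np.array([1.0, 2.0])}
ema = {"w": np.zeros(2)}
for _ in range(3):
    ema = ema_update(ema, params)
```

With a rate of 0.9999 the average effectively spans on the order of 10,000 recent steps, so the EMA weights used for evaluation change much more slowly than the raw weights.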
For other hyperparameters, we use fixed start and end times which satisfy , which is the default setting in Kingma et al. (2021). In the finetuning stage, we simply set the coefficient in the mixed loss to 0.1 with no further tuning, so that the magnitude of the second-order loss is negligible w.r.t. the first-order loss. Since the first-order matching accuracy is critical to the second-order matching, a large will make the training unstable or even degrade the likelihood performance.
All our training processes are conducted on 8 GPU cards of NVIDIA A40 except for ImageNet-32 (old version) and CIFAR-10 (with data augmentation). For CIFAR-10, we pretrain the model for 6 million iterations, which takes around 3 weeks. Then we finetune the model for 200k iterations, which takes around 1 day. For ImageNet-32 (new version), we pretrain the model for 2 million iterations, which takes around 2 weeks. Then we finetune the model for 250k iterations, which takes around 3 days. We use a batch size of 128 for both training stages and both datasets.
Note that in related works (Lipman et al., 2022; Albergo & Vanden-Eijnden, 2022), experiments on ImageNet-32 (new version) are conducted at a larger batch size (512 or 1024), which may improve the results. We did not use a larger batch size or train longer due to resource limitations.
For ImageNet-32 (old version), the training processes are conducted on 8 GPU cards of NVIDIA A100 (40GB). We pretrain the model for 2 million iterations using a batch size of 512, which takes around 2 weeks. Then we finetune the model for 500k iterations using a batch size of 128 and accumulate the gradient for every 4 batches, which takes around 2.5 days.
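The gradient accumulation used in this finetuning stage (batch size 128, accumulated over 4 batches) can be illustrated with a small sketch. It assumes the accumulated gradients are averaged, in which case the update matches a single batch of 512 for a loss that is a mean over examples; the toy least-squares loss and the name `grad_mse` are hypothetical stand-ins for the actual training loss.

```python
import numpy as np

def grad_mse(w, x, y):
    # Gradient of 0.5 * mean((x @ w - y) ** 2) with respect to w.
    return x.T @ (x @ w - y) / len(y)

rng = np.random.default_rng(0)
x = rng.normal(size=(512, 8))
y = rng.normal(size=(512,))
w = rng.normal(size=(8,))

micro_batch, accum_steps = 128, 4
accum = np.zeros_like(w)
for i in range(accum_steps):
    sl = slice(i * micro_batch, (i + 1) * micro_batch)
    accum += grad_mse(w, x[sl], y[sl])
# Average so the result matches one batch of 128 * 4 = 512 examples.
accum /= accum_steps

full = grad_mse(w, x, y)
```

The equivalence holds exactly only for deterministic per-example losses; stochastic components such as dropout or sampled noise would differ between the two schedules in practice.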
For CIFAR-10 (with data augmentation), the training processes are conducted on a cluster of 64 GPU cards of NVIDIA A800 (80GB). We pretrain the model for 2 million iterations using a batch size of 1024, which takes around 2 weeks. Due to the large training resource requirements and the regularization effect by data augmentation, we do not further finetune the model by the second-order flow matching loss.
Likelihood and sample quality
For likelihood, we use our truncated-normal dequantization. When the number of importance samples , we report the BPD on the test dataset, repeating the evaluation 5 times to reduce the variance of the trace estimator. When or , we do not repeat the dataset since the log-likelihood of each data sample is already evaluated multiple times. For sampling, since we focus on the ODE, we simply use an adaptive-step ODE solver with the RK45 method (Dormand & Prince, 1980) (relative tolerance and absolute tolerance ). We generate 50k samples and report the FID on them. Using higher-quality sampling procedures such as the PC sampler (Song et al., 2021c) or fast sampling algorithms such as DPM-Solver (Lu et al., 2022b) may improve the results; we leave this for future work.
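To illustrate why repeating the evaluation reduces the variance of the trace estimator, the following is a minimal sketch of a Hutchinson-style stochastic trace estimator. It assumes Rademacher probe vectors and uses an explicit matrix in place of the Jacobian of the ODE velocity; the probe distribution and the name `hutchinson_trace` are assumptions, not details stated in this section.

```python
import numpy as np

def hutchinson_trace(matvec, dim, num_probes, rng):
    # Average eps^T (A eps) over Rademacher probes eps in {-1, +1}^dim.
    total = 0.0
    for _ in range(num_probes):
        eps = rng.choice([-1.0, 1.0], size=dim)
        total += eps @ matvec(eps)
    return total / num_probes

rng = np.random.default_rng(0)
a = rng.normal(size=(16, 16))
exact = np.trace(a)

# A single probe versus the average of 5 probes.
one = hutchinson_trace(lambda v: a @ v, 16, 1, rng)
five = hutchinson_trace(lambda v: a @ v, 16, 5, rng)

# For a diagonal matrix, Rademacher probes recover the trace exactly.
d = np.diag(np.arange(1.0, 5.0))
diag_est = hutchinson_trace(lambda v: d @ v, 4, 3, np.random.default_rng(1))
```

Each probe gives an unbiased estimate of the trace, so averaging n independent probes divides the estimator variance by n. Repeating the test set 5 times plays the same role for the reported BPD.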