Denoising Markov Models Generalize Diffusion Through Reverse-Time Generators

Yinuo RenMicrosoft ResearchTuesday, May 26, 202615 min read

Stanford Ph.D. candidate Yinuo Ren argues that diffusion, discrete diffusion, and broader jump-based generative models can be treated as instances of the same problem: choose a forward Markov process that carries data toward a simple reference law, then learn its reverse-time generator. His framework gives conditions under which that reverse generator is explicit up to unknown densities and turns the resulting approximation problem into a path-space KL objective via Doob’s h-transform. The payoff, Ren says, is a principled way to design denoising models beyond Gaussian diffusion, including discrete and Lévy-type dynamics.

The common structure is a Markov path run backward

Yinuo Ren frames diffusion-based generative modeling as a problem of designing a Markovian probability path from data to an easy reference law, then learning how to run that path backward. In the familiar continuous case, the forward process moves data toward a simple reference distribution such as a Gaussian; the reverse-time process moves samples from the reference distribution back toward the data distribution. The learned object is usually the score, the gradient of the log intermediate density.

Ren’s broader motivation also includes stochastic interpolants and flow matching, which he presents as evidence that generative models can learn along prescribed probability paths. His proposed framework is narrower and more specific: choose a forward Markov process that sends the data law toward a simple reference law, ensure the process can be simulated explicitly, and train the reverse process with a tractable loss. In continuous diffusion, that forward process is an SDE on Euclidean space. In discrete diffusion, it is more naturally a continuous-time Markov chain with jumps on a finite state space. For applications involving multimodal data, scientific data, geometric constraints, or hard constraints, Ren argues that the natural object to study is not “diffusion” narrowly understood, but a general Markov process.

The main aim here is to give a very clear recipe for the design of denoising Markov models with almost any Markov processes.

Yinuo Ren · Source

That shift matters because the learned quantity changes with the process. In continuous diffusion, the reverse generator depends on $\nabla lo g p_{t}$ , the score. In discrete diffusion, the reverse jump rates depend on ratios such as $p_{t} (y) / p_{t} (x)$ . The loss also changes: continuous diffusion gives the familiar squared-error score-matching loss, while discrete diffusion gives a score-entropy-like objective. Ren’s framework treats both as coordinate choices inside a single generator-level construction.

The conceptual picture is simple, even if the mathematical machinery is not. One starts with samples $x_{0} \sim p_{0}$ from the data distribution. A forward noising process with generator $L_{t}$ pushes those samples toward an easy-to-sample reference distribution at time $T$ . That same forward process induces a true reverse-time process, but this reverse process depends on unknown intermediate densities. The practical model therefore introduces an approximate backward generator $K_{t}$ , learned from data and used to run an approximate reverse process from an initial law $q_{0}$ that is intended to be close to $p_{T}$ .

When Carles Domingo-Enrich asks how to interpret the measure $Q$ in the diagram, Ren clarifies that $P$ and $Q$ are path measures, not initial distributions. The distinction is central: the framework compares entire stochastic paths, not merely endpoints.

Question in the diagram	Object it relates	Role in the framework
Time reversal	Forward generator and true backward generator	Identifies the exact reverse process induced by the chosen forward Markov process
Parameterization	Approximate backward generator	Restricts the learned reverse process to a structural family tied to the forward generator
Gap	Approximate and true backward generators	Quantifies mismatch through a ratio between the learned positive function and the true density

The three gaps in Ren’s conceptual diagram organize the framework.

The reverse generator is explicit once the density is known

The generator of a Markov process is the infinitesimal expected change of a test function. For a Markov process $(x_{t})_{t \in [0, T]}$ on a state space $E$ , Ren defines the time-dependent generator as

L_{t} f (x) := h \to 0^{+} lim \frac{E [ f ( x _{t + h} ) ∣ x _{t} = x ] - f ( x )}{h} .

This generator compactly describes the evolution of the marginal densities through the Kolmogorov forward equation $\partial_{t} p_{t} = L_{t}^{*} p_{t}$ . Dynkin’s formula plays the corresponding integral role, linking expected function changes along the process to integrals of the generator. Ren also introduces the carré du champ operator,

Γ_{t} (u, v) = L_{t} (uv) - u L_{t} v - v L_{t} u,

which measures the extent to which the stochastic generator fails to behave like an ordinary derivative. If the operator were merely a derivative, this bilinear term would vanish. In stochastic settings it is the object that captures perturbations appearing in time reversal, functional inequalities, nonequilibrium dynamics, and Doob-type transforms.

The first of the three gaps is time reversal: what is the relationship between the forward generator $L_{t}$ and the true backward generator $\tilde{L}_{t}$ ? Under suitable regularity assumptions on the densities $p_{t}$ , the time-reversal theorem states that

\tilde{L}_{T - t} f = L_{t}^{*} f + p_{t}^{- 1} Γ_{t} (p_{t}, f) .

The reverse generator is therefore the adjoint forward generator plus a density-dependent perturbation. This is the abstract version of what happens in continuous and discrete diffusion. In continuous diffusion, the unknown density enters through a score. In discrete diffusion, it enters through score ratios. At the generator level, both are forms of the same perturbation term.

The second gap is parameterization. Since the true density $p_{t}$ is unknown, Ren restricts the approximate backward generator to the same structural family but replaces $p_{t}$ with a strictly positive function $φ_{t} : E \to R_{+}$ :

K_{T - t} f = L_{t}^{*} f + φ_{t}^{- 1} Γ_{t} (φ_{t}, f) .

In practice, $φ_{t}$ , or an equivalent coordinate representation, is parameterized by a neural network. This is not an arbitrary search over all possible backward Markov processes. It is a search over density-dependent perturbations of the adjoint generator, deliberately mirroring the exact time-reversal formula. If $φ_{t} = p_{t}$ , then $K_{t}$ equals the true reverse generator. Ren calls this the “holy grail” of the parameterization.

The third gap is the mismatch between the approximate backward generator and the true backward generator. Algebraically, this gap can be written in terms of the mismatch ratio

η_{t} := \frac{φ _{t}}{p _{t}} .

With this definition, the approximate generator is itself a perturbation of the true backward generator:

K_{T - t} f = \tilde{L}_{T - t} f + η_{t}^{- 1} Γ_{t}^{*} (η_{t}, f) .

Thus $K_{t}$ being a $φ_{t}$ -perturbation of the adjoint forward generator is equivalent to being an $η_{t}$ -perturbation of the true reverse generator. The ideal case is $η_{t} \equiv 1$ , meaning perfect recovery of the true backward process.

Doob’s transform turns mismatch into a path-space KL objective

The framework becomes trainable once the generator mismatch is connected to a quantitative path-space discrepancy. Ren does this through a generalized Doob’s $h$ -transform.

Given a Markov generator $L_{t}$ , a positive function $h_{t}$ , and a function $λ_{t}$ , the generalized transform defines

L_{t}^{λ, h} f = h_{t}^{- 1} L_{t} (h_{t} f) - λ_{t} f .

Under the theorem Ren cites, the transformed process is absolutely continuous with respect to the original path measure, and the Radon–Nikodym derivative has an explicit pathwise form. Ren identifies the estimated backward generator $K_{T - t}$ as the generalized Doob transform of the true backward generator $\tilde{L}_{T - t}$ , with $h_{t}$ equal to the mismatch ratio $η_{t}$ . Choosing

λ_{t} = η_{t}^{- 1} \tilde{L}_{T - t} η_{t}

makes the transformed generator conservative, so the approximate backward process is a valid Markov process rather than an operator that loses or creates total probability mass.

The training implication comes from the resulting change-of-measure theorem. By plugging the time-reversal formula and Dynkin’s formula for $lo g η_{t}$ into the generalized transform, Ren obtains a path-space KL expression. Under suitable assumptions, there exists a path measure $Q$ such that the time-reversal process under $Q$ is governed by $K_{t}$ , and

D_{KL} (P ∥ Q) = E_{P} [\int_{0}^{T} (η_{t} L_{t}^{*} η_{t}^{- 1} + L_{t}^{*} lo g η_{t}) (x_{t}) d t] =: L [η_{t}] .

Here $P$ is the path law of the forward process and its true reverse; $Q$ is the reweighted path law inducing the approximate backward generator. The functional $L [η_{t}]$ is the exact path-space discrepancy between the true and approximate backward processes.

When Carles Domingo-Enrich asks whether, in the purely continuous case, this recovers the Girsanov change of measure, Ren answers yes. If one plugs in the generator corresponding to a diffusion process, the expression reduces to the familiar squared-error loss. In Ren’s phrasing, it “magically” reduces to the $L^{2}$ loss.

The algorithmic consequence is the error bound. By the tower property of KL divergence, the terminal law $q_{T}$ generated by the approximate backward process satisfies

D_{KL} (p_{0} ∥ q_{T}) \leq D_{KL} (p_{T} ∥ q_{0}) + L [η_{t}] .

Ren interprets the first term as terminal mismatch: how well the chosen forward process reaches the easy reference distribution. The second is estimation error: how well the learned backward process matches the true backward process.

Minimizing the training loss is not merely fitting a heuristic score.

Yinuo Ren

The point is that the loss controls an upper bound on the generative KL error, together with the terminal mismatch created by the forward design. That is the connection between the abstract path-space construction and the endpoint distribution produced by the model.

There remains a practical problem: $η_{t}$ contains the unknown density $p_{t}$ . Ren’s generalized score-matching result replaces the marginal ratio by a conditional ratio,

η_{t ∣0} (x_{t} ∣ x_{0}) := \frac{φ _{t} ( x _{t} )}{p _{t ∣0} ( x _{t} ∣ x _{0} )} .

Up to constants independent of model parameters, minimizing the path-space KL is equivalent to minimizing a generalized score-matching loss involving the data distribution $p_{0}$ , the forward transition kernel $p_{t ∣0}$ , and the forward generator $L_{t}$ . This is the point where the abstract generator construction becomes an algorithm: sample data, sample an intermediate time, simulate the forward process to get $x_{t}$ , evaluate the generalized score-matching integrand, and update the parameters of $φ_{t}^{θ}$ .

At inference time, the procedure reverses in the qualified sense shown on Ren’s meta-algorithm slide: sample $y_{0}$ from $q_{0} \approx p_{T}$ , simulate the learned backward generator $K_{t}^{θ}$ from time 0 to $T$ , and output $y_{T}$ with the goal that $q_{T} \approx p_{0}$ .

Ren also adds numerical simulation error to the bound. If the numerical simulation of the approximate backward process has one-step error $O (κ^{r + 1})$ , the generated law $\overset{q}{^}_{T}$ satisfies a bound of the form

D_{KL} (p_{0} ∥ \overset{q}{^}_{T}) ≲ D_{KL} (p_{T} ∥ q_{0}) + L [η_{t}^{θ}] + T κ^{r} .

This separates three errors: truncation error from imperfect convergence of the forward process, estimation error from learning the wrong reverse dynamics, and numerical error from discretizing the backward process.

Term	Meaning in Ren’s bound
Terminal mismatch	How well the forward process reaches the easy reference law
Estimation error	How well the approximate backward process matches the true backward process
Numerical error	Error introduced by discretizing the backward simulation

Ren’s meta-error bound separates forward-process design, learning, and simulation.

Continuous and discrete diffusion are recovered as coordinate choices

The framework recovers standard continuous diffusion when the state space is $R^{d}$ and the forward generator has diffusion form,

L_{t} f = b_{t} \cdot \nabla f + \frac{1}{2} D_{t} : \nabla^{2} f .

Plugging this generator into the time-reversal and parameterization formulas gives a backward generator of the same diffusion type, with score coordinates

\overset{s}{^}_{t} = \nabla lo g φ_{t} .

The resulting generalized score-matching loss involves the diffusion matrix $D_{t}$ :

E [\int_{0}^{T} \frac{1}{2} D_{t} (x_{t}) : (\overset{s}{^}_{t} (x_{t}) - \nabla lo g p_{t ∣0} (x_{t} ∣ x_{0}))^{\otimes 2} d t] .

Ren notes that when $D_{t}$ is proportional to the identity, this becomes the standard diffusion score-matching objective, up to time reweighting.

Discrete diffusion appears when the state space is finite and the forward generator is a jump process,

L_{t} f (x) = y \in X \sum (f (y) - f (x)) λ_{t} (y, x) .

In that case, the backward rates are reweighted by the ratio of the positive parameterization function at two states:

\hat{λ}_{T - t} (x, y) = \frac{φ _{t} ( y )}{φ _{t} ( x )} λ_{t} (x, y) .

The natural score coordinate is therefore the ratio

\overset{s}{^}_{t} (x, y) = \frac{φ _{t} ( y )}{φ _{t} ( x )} .

This recovers the discrete diffusion objective Ren associates with score-entropy losses. The significance is not that continuous and discrete objectives look identical; they do not. The significance is that both arise from the same generator-level parameterization and the same path-space KL principle.

Heavy-tailed data expose a limitation of Gaussian forward processes

Lévy-type models show why expanding beyond Gaussian diffusion is not just a formal generalization. The general KL guarantee contains the terminal mismatch term $D_{KL} (p_{T} ∥ q_{0})$ . If the forward process and reference law are poorly matched to the data, this term can break the guarantee regardless of how well the reverse process is learned.

The obstruction Ren highlights is heavy tails. For affine Gaussian forward diffusions, if the data law $p_{0}$ is heavy-tailed with infinite second moment and the reference $q_{0}$ is Gaussian, then for every positive $T$ ,

D_{KL} (p_{T} ∥ q_{0}) = \infty.

Conceptually, affine Gaussian forward dynamics do not repair a tail mismatch against a Gaussian terminal reference. Even if the learning objective were driven to zero, the terminal mismatch would remain uncontrolled. Ren takes this as motivation for forward processes with matching tails and jump behavior, such as $α$ -stable processes.

The general form he uses for Lévy-type models is the Courrège form. Under suitable regularity assumptions, the forward generator can be written as a sum of deterministic transport, local Gaussian diffusion, and nonlocal jumps:

L_{t} f (x) = b_{t} (x) \cdot \nabla f (x) + \frac{1}{2} D_{t} (x) : \nabla^{2} f (x) + \int_{R^{d} ∖ {x}} (f (y) - f (x) - (y - x) \cdot \nabla f (x) 1_{B (x, 1)} (y)) λ_{t} (d y, x) .

The transport term moves mass deterministically. The diffusion term accounts for local Gaussian noise. The integral term describes jumps from $x$ to $y$ , with a small-jump compensation term because infinitely many small jumps can be indistinguishable from local transport.

Plugging this generator into Ren’s framework yields a backward Lévy triplet. The reverse jumps are reweighted by $φ_{t} (y) / φ_{t} (x)$ , the diffusion coefficient remains $D_{t}$ , and the backward drift includes the usual diffusion-score term plus an additional small-jump compensation term. The path-space KL decomposes into a diffusion term and a jump term:

L [η_{t}] = E [\int_{0}^{T} \frac{1}{2} D_{t} : (\nabla lo g η_{t})^{\otimes 2} d t] + E [\int_{0}^{T} \int λ_{t} (d y, x_{t}) ℓ (\frac{η _{t} ( y )}{η _{t} ( x _{t} )}) d t],

with

ℓ (r) = r - 1 - lo g r .

Ren’s conclusion from this decomposition is that one can design generative models from broader Lévy processes, provided the corresponding loss can be learned and the backward dynamics can be simulated.

Process choice can enforce geometry and change the probability path

Ren’s examples illustrate what changes when the forward process is chosen for the geometry of the problem rather than treated as a default implementation detail.

The first uses geometric Brownian motion as the forward process. The setup preserves the positive orthant $R_{++}^{d}$ by construction. The forward process has multiplicative noise,

d x_{t} = x_{t} ⊙ Σ d w_{t},

with diffusion matrix

D (x_{t}) = diag (x_{t}) Σ Σ^{⊤} diag (x_{t}) .

The target data are mixtures of Gaussians restricted to the positive orthant, and the model is trained with the generalized diffusion score-matching loss. Ren reports that training remains stable even though the score can be singular near the boundary, and generated samples match the target without boundary violations. The example supports his broader point: choosing dynamics consistent with the state space can enforce constraints structurally rather than by post hoc correction.

The second experiment is a pure jump model in continuous space. Here the drift and diffusion coefficient are both zero, and the domain is the torus $T^{2} = R^{2} / Z^{2}$ . The forward Lévy kernel is an isotropic Gaussian jump kernel, and the initial backward law is approximately uniform on the torus. Samples start from the uniform law and progressively jump into structured target distributions such as chessboard, Swiss roll, and moons patterns.

Ren emphasizes that this is not simply discrete diffusion translated to another notation. Discrete diffusion uses jump processes on a finite state space. The pure jump experiment uses jumps for generative modeling in continuous space. It produces probability paths that are neither ordinary deterministic transport nor diffusion. He contrasts this denoising setup with generator matching: here one defines a forward process and searches for its corresponding backward process, rather than prescribing a probability path first and learning the generator that realizes it.

In the Q&A, Carles Domingo-Enrich asks whether the Lévy-type section is motivated mainly by heavy-tailed distributions or by the possibility of modeling systems with both continuous and discrete aspects. Ren answers that he does have multimodal distributions in mind. Although the diffusion and jump terms appear decoupled in the KL loss, the pure jump example suggests to him that jump terms can also be imposed on continuous components in generative models. His safer design implication is that jumps may help model multimodal structure and create probability paths not available to pure diffusion.

Domingo-Enrich also asks whether typical modeled distributions, such as images, audio, video, or scientific data, are heavy-tailed in practice. Ren points to prior work using $α$ -stable processes for generative modeling and notes that these models require care. The backward Lévy triplet can be complicated, so practical backward simulation often involves approximations. For Ren, an important open direction is to understand and improve those approximations through further investigation and better numerical schemes.

The remaining questions are about choosing and simulating the right process

The recipe Ren leaves is four-part: choose a forward Markov process with generator $L_{t}$ that sends the data distribution toward an easy reference; use the time-reversal theorem to identify the exact backward generator, which depends on the intermediate density $p_{t}$ ; parameterize the approximate backward generator by a positive function $φ_{t}$ in the same structural form as the true reverse generator; and learn $φ_{t}$ through the change-of-measure theorem connecting Doob’s $h$ -transform and the path-space KL objective.

The framework leaves several design questions open. Ren singles out the need to understand how forward dynamics affect trainability, efficiency, and robustness; to develop scalable inference and numerical methods for general Markov and Lévy-type processes; and to select forward processes in a principled way for constraints, topology, and distributional characteristics.

He also notes that the framework itself has assumptions. It assumes the process is Markovian, though generative modeling may not require Markovianity. It also forces the forward and backward processes to live in the same space, which may not be necessary either. For Ren, those assumptions mark the boundary of the current framework, not the boundary of the design space.

AI Research Methods