Fixed-Point Bridge Matching Makes Diffusion Sampling Scalable Without Target Data

Lorenz RichterMicrosoft ResearchTuesday, May 26, 202618 min read

Lorenz Richter’s seminar argues for a non-Markovian route to diffusion-based sampling when the target distribution is known only through an unnormalized density rather than data. He presents existing Markovian path-space samplers as theoretically flexible but increasingly constrained by trajectory simulation and storage costs, then proposes building reciprocal bridge measures from endpoint couplings and learning their Markovian projection by fixed-point regression. The resulting Bridge Matching Sampler, Richter says, uses a single learned control, accommodates flexible priors and reference processes, and shows improved stability and mode preservation in high-dimensional synthetic and molecular benchmarks, especially with damping.

The hard case is sampling without target data

Lorenz Richter frames the problem as generative modeling in its less forgiving form: sampling from a complex, high-dimensional, multimodal distribution when the target is not given by examples, but only by an unnormalized density. In the data-based setting, samples already indicate where the modes are. In the data-free setting, the sampler has to discover them.

The classical notation is simple. Given an unnormalized density $ρ_{target}$ , the goal is to generate samples from

p_{target} = \frac{ρ _{target}}{Z}, Z = \int_{R^{d}} ρ_{target} (x) d x .

Richter gives two motivating domains. In computational physics and chemistry, Boltzmann densities take the form $ρ_{target} (x) \propto exp (- V (x) / (k_{B} T))$ , where the potential $V$ , Boltzmann constant $k_{B}$ , and temperature $T$ define distributions relevant to phase transitions and material properties. In diffusion-model fine-tuning, a pretrained model $p_{pre}$ may be tilted by a reward $r$ , giving $p_{target} (x) \propto p_{pre} (x) exp (r (x))$ . In both cases, there may be no empirical target dataset to train against.

The traditional stochastic-process approach is to transport samples from a tractable prior through a stochastic differential equation. Langevin dynamics is the canonical example: its drift uses the gradient of the log unnormalized target density, and with infinite runtime it converges to the target. The problem is not that this is conceptually wrong. The problem is that its convergence is asymptotic. In Richter’s double-well illustrations, the process can work on an easy target if run long enough, but when the prior is shifted away from the modes, trajectories struggle to cross energy barriers. He notes that the mean first exit time scales exponentially in the energy barrier of the corresponding potential.

That motivates adding a time-dependent control to the dynamics:

d X_{s}^{u} = (\nabla lo g ρ_{target} (X_{s}^{u}) + u (X_{s}^{u}, s)) d s + 2 d W_{s}, X_{0}^{u} \sim p_{prior} .

The control is meant to push the system out of equilibrium and steer it to the target at a finite terminal time. Richter’s target is to identify a control $u^{*}$ such that $X_{T}^{u^{*}} \sim p_{target}$ . In statistical-physics language, this is a non-equilibrium forcing. In sampling language, it is a way to replace asymptotic convergence with finite-time transport that can cover isolated modes.

Path-space methods make Markovian diffusion samplers programmable, but still trajectory-bound

The first family of methods Richter discusses is Markovian. The setup uses a controlled forward process from the prior and a backward helper process that starts at the target. The controls, usually denoted $u$ and $v$ , are to be learned so that the forward process is the time reversal of the backward process. If this succeeds, the terminal law of the forward process is the target.

The backward process is only a conceptual helper in the data-free setting. It cannot be simulated directly, because that would require target samples. Richter’s point is that one can still reason about its path-space measure. Let $P_{X}^{u}$ and $P_{Y}^{v}$ denote measures on continuous trajectories generated by the forward and time-reversed backward processes. A natural formal objective is to minimize a divergence between them. If that divergence can be driven to zero, the path measures agree and the forward process samples from the target at terminal time.

The path-space formulation becomes useful because the log-likelihood ratio between the path measures can be written explicitly. Richter shows a Radon–Nikodym derivative involving deterministic and stochastic integrals, the controls $u$ and $v$ , and a third control $w$ specifying the process along which the likelihood ratio is evaluated. He does not ask the audience to parse every term; the key claim is that the expression is explicit enough to implement.

One complication is non-uniqueness. There are infinitely many possible bridges from prior to target. Richter lists several ways to impose uniqueness: fix the backward control, which corresponds to score-based generative modeling; prescribe an annealing path from prior to target, as in CMCD; or add a control cost, yielding a Schrödinger bridge problem.

The divergence choice changes optimization behavior. The common option is KL divergence. Richter emphasizes another choice his group has explored: a log-variance divergence, which takes the variance of the log path-likelihood ratio rather than its expectation. The additional control $w$ matters because it specifies the trajectory distribution on which the divergence is evaluated. Since $w$ can be chosen off-policy, this formulation can balance exploration and exploitation, connect to replay-buffer ideas from reinforcement learning, and avoid differentiating through the SDE solver.

Richter states two reasons the log-variance objective is more than a cosmetic variant. First, when the functional derivative of the log-variance divergence is evaluated at $w = u$ , it matches the derivative of the KL divergence up to scaling. Second, at the Monte Carlo estimator level, the log-variance derivative is a control-variate version of the KL derivative. He reports seeing this variance reduction in numerical simulations: a KL-based loss fluctuates heavily, while the log-variance loss decreases more smoothly in the shown comparison.

There is also a robustness property at the solution. At $u = u^{*}$ or $v = v^{*}$ , the variance of the corresponding derivative of the log-variance divergence is zero. Richter connects this to what machine learning sometimes calls “sticking the landing”: once the optimizer reaches the optimum, Monte Carlo noise does not kick it away. He notes that this property is not true for the KL divergence unless additional tricks are used.

Path-space likelihood ratios also make it natural to add sequential Monte Carlo. The basic identity is importance sampling in path space: an expectation under the target can be written as an expectation over forward trajectories corrected by a likelihood-ratio weight. The full likelihood ratio can be factorized over subtrajectories, allowing resampling at intermediate times. In Richter’s description, one starts from the prior, runs a trained or partially trained controlled diffusion, resamples according to subtrajectory weights, optionally uses MCMC, then continues.

The resulting method is Sequential Controlled Langevin Diffusions, or SCLD. A slide cites Chen et al., “Sequential controlled Langevin diffusions,” ICLR 2025. To make the subtrajectory likelihood ratios tractable, SCLD uses an ansatz based on Nelson’s identity, setting the backward control in terms of the forward control and a prescribed density path from prior to target. This removes the need to separately learn the backward control and gives explicit formulas for log weights over subintervals because the densities at the initial and terminal times of each subtrajectory are prescribed.

On a multimodal Robot4 task in $d = 10$ , Richter says SCLD visually matches the ground truth more closely than CMCD-KL, CMCD-LV, and CRAFT in the point-cloud comparison he shows. He also shows ELBO curves on Credit, Brownian, Seeds, and Sonar tasks, arguing that the combination with SMC speeds training. His interpretation is that before learning strong controls, the method is already doing a reasonable naive SMC procedure with unbiased estimators; learning $u$ improves the SMC sampler.

The path-space approach also generalizes to underdamped diffusion bridges. Instead of a state variable alone, the process uses position and velocity. Noise and control enter only the velocity component, which is important for applying Girsanov’s theorem and gives smoother trajectories in the position variable. Richter says that without the learned control, there is rigorous analysis showing improved convergence in underdamped Langevin settings. With the control present, the theory is less clear because $u$ depends on both velocity and position, but numerically the underdamped structure helps.

The crucial advantage is not merely being underdamped. Richter emphasizes that the Hamiltonian structure allows splitting methods such as OBAB and BAOAB rather than relying on Euler–Maruyama. In experiments on Funnel with $d = 10$ , Cancer with $d = 31$ , and ManyWell with $d = 50$ , underdamped variants generally have higher effective sample size than overdamped ones in the shown results. Across multiple experiments, Richter’s interpretation is that the main gain comes from combining underdamped dynamics with structure-exploiting splitting integrators.

Still, these are trajectory-based Markovian methods. They require simulating trajectories, often with many discretization steps, and storing them in memory. In high dimensions, Richter says, this is “killing us.” Even if off-policy log-variance training avoids differentiating through the SDE solver, the trajectories must be stored; if gradients through trajectories are needed, the burden is worse.

The non-Markovian turn is to build bridges, then Markovianize them

The second half of Richter’s argument moves away from Markovian trajectories as the primary training object. The proposed non-Markovian framework still starts from path measures, but the target path measure is written as a coupling of endpoints plus a bridge process between them:

Π^{*} (X) = p_{prior} (X_{0}) p_{target} (X_{T}) P_{0, T} (X ∣ X_{0}, X_{T}) .

A Brownian bridge is the basic example. It starts at one endpoint and is conditioned to arrive at another. Richter uses the bar notation for the bridge process to signal that it is non-Markovian: it depends on the future endpoint. If one had target samples, as in the data-based setting, Brownian bridges between prior and target data could be sampled directly. More efficiently, the Brownian bridge’s conditional distribution is Gaussian, so one need not simulate the SDE trajectory step by step.

This bridge construction is related to the reciprocal class. A path measure is in the reciprocal class of a reference process if it can be written as an endpoint coupling times the reference bridge. The reciprocal projection of a path measure replaces its conditional path law with the reference bridge while retaining its endpoint coupling. Richter describes this, informally, as building bridges between endpoints.

A more general non-Markovian process has a path-dependent drift:

d \overset{ˉ}{X}_{t} = σ (t) ξ (\overset{ˉ}{X}, t) d t + σ (t) d W_{t}, \overset{ˉ}{X}_{0} \sim p_{prior} .

Here $ξ$ may depend on the entire trajectory. To sample in practice, Richter needs a Markov process. The Markovian projection of $\overset{ˉ}{X}$ is a Markov process $X$ with matching marginals at every time. The drift of this Markovian process is the conditional expectation of the non-Markovian drift given the current state:

u^{*} (x, t) = E_{Π^{*}} [ξ (\overset{ˉ}{X}, t) ∣ \overset{ˉ}{X}_{t} = x] .

Equivalently, Richter states, this $u^{*}$ minimizes a KL projection onto Markovian path measures. An audience member describes this as a kind of moment matching; Richter agrees and connects it to data-based methods such as stochastic interpolants and flow matching.

In stochastic interpolants, one constructs a non-Markovian interpolation between prior and target samples. Richter gives, as a concrete example, a linear interpolation plus Gaussian noise:

\overset{ˉ}{X}_{t} = (1 - t) X_{0} + t X_{T} + 2 t (1 - t) Z, Z \sim N (0, Id) .

The Markovian drift is given, modulo details, by a conditional expectation of $\partial_{t} \overset{ˉ}{X}_{t}$ given $\overset{ˉ}{X}_{t} = x$ . Richter’s view is that the conditional-expectation losses used in stochastic interpolants can be seen as Markovian projections under different terminology.

The difficulty is that, in his setting, there is no target dataset. The conditional expectation cannot be computed directly for two reasons: the target path measure $Π^{*}$ cannot be sampled, and the path-dependent drift $ξ$ is not known.

Fixed-point iterations replace unavailable target bridges with learned reciprocal projections

Richter’s proposed answer is a fixed-point iteration. Instead of taking the conditional expectation under $Π^{*}$ , iteration $i$ uses a measure $Π^{i}$ built as the reciprocal projection of the current controlled process $P^{u_{i}}$ . In abstract form,

u_{i + 1} = Φ (u_{i}) := E_{Π^{i}} [ξ (\tilde{X}, t) ∣ \tilde{X}_{t} = x] .

The same update can be written as an $L^{2}$ regression problem: choose the Markov control that best matches the non-Markovian drift target under the current bridge measure.

The algorithmic picture is important. At iteration $i$ , simulate the Markov process with control $u_{i}$ from the prior to terminal time. Keep the endpoint samples. Then forget the simulated trajectories. Construct bridges between selected initial and terminal samples. These bridge paths are non-Markovian but can be sampled efficiently. Finally, fit a new Markov control $u_{i + 1}$ by regression against the non-Markovian drift target, then simulate again.

This is where the memory profile changes. During the optimization of the regression objective, the method does not need to store full SDE trajectories from the controlled process. It uses efficiently sampled bridges. Simulation is still needed between fixed-point iterations, but not in the same dense trajectory-storage way as the Markovian path-space methods.

A central technical ingredient is a formula for $ξ$ . For a scaled Brownian reference process, Richter states that the path-dependent drift inducing a target measure $Π^{*}$ can be expressed using endpoint derivatives: gradients of the endpoint coupling with respect to the initial and terminal points, minus a known transition-density term from the reference process. He also gives a more general version involving the accumulated variance $κ (t) = \int_{0}^{t} σ^{2} (s) d s$ , the normalized time-change $γ (t) = κ (t) / κ (T)$ , and an arbitrary function $c : [0, T] \to [0, 1]$ . He describes $c$ as a design degree of freedom, potentially usable as a control variate for variance reduction, though he says that direction remains work in progress.

The tractability of the formula depends on how the endpoint coupling $Π_{0, T}^{i}$ is chosen.

Known adjoint samplers appear as coupling choices; BMS uses independent endpoints

Richter uses the endpoint-coupling choice to show that the framework recovers known methods before introducing the new sampler.

The first choice is a Schrödinger half-bridge, fixing the prior and using the current terminal marginal. In the corresponding target-measure expression, only the terminal marginal is prescribed. The path-dependent drift reduces to a gradient of the log ratio between the target terminal density and the known terminal density of the reference process. This is tractable because the target normalization constant drops out under the gradient, and the reference terminal distribution is known. Richter says this is exactly the expression used in the Adjoint Sampler, though derived there through optimal-control ideas.

The drawback is that the reference must satisfy an independence condition $P_{0 ∣ T} = P_{0}$ , often requiring a Dirac prior and potentially large $σ$ for exploration, which can cause instability.

The second choice is the classical Schrödinger bridge coupling. Richter recaps the Schrödinger bridge as the problem of finding a kinetically optimal drift subject to fixed initial and terminal marginals. It can be characterized through Schrödinger potentials $φ$ and $\overset{φ}{^}$ . If those are solved, the optimal controls are $u^{*} = σ \nabla lo g φ$ and $v^{*} = σ \nabla lo g \overset{φ}{^}$ .

Plugging that coupling into the non-Markovian drift formula gives an expression involving the target terminal density divided by the terminal Schrödinger potential $\overset{φ}{^}_{T}$ . Richter says this is the expression used in the Adjoint Schrödinger Bridge Sampler, or ASBS. Its drawback is that $\overset{φ}{^}_{T}$ must also be learned, for example through IPF-like iterations.

During a discussion, an audience member notes that in these formulas the non-Markovian drift essentially depends on the final point, so in the regression objective it becomes the target being matched. Richter agrees and says that, in the adjoint-sampling language, it corresponds to the terminal cost target.

The new proposal is to choose independent endpoint couplings from the current process:

Π_{0, T}^{i} = P_{0}^{u_{i}} \otimes P_{T}^{u_{i}} .

Under this choice, the drift formula becomes tractable while retaining both prior and target terms. In the form Richter shows, it combines a prior-score term at $X_{0}$ , a target-score term at $X_{T}$ , and a known reference transition-density correction:

σ (t)^{- 1} ξ (X, t) = \frac{1 - c ( t )}{1 - γ ( t )} \nabla_{x_{0}} lo g Π_{0}^{*} (X_{0}) + \frac{c ( t )}{γ ( t )} \nabla_{x_{T}} lo g Π_{T}^{*} (X_{T}) - \nabla_{x_{T}} lo g P_{T ∣ 0} (X_{T} ∣ X_{0}) .

Richter calls the resulting method the Bridge Matching Sampler, or BMS. Its practical attraction is that the prior is flexible, the reference process can be chosen flexibly, including $σ$ , and only one learned object is needed: the Markov control $u$ . Unlike ASBS, there is no additional Schrödinger potential to learn through alternating iterations.

Coupling choice	Method recovered or proposed	Main issue Richter highlights
Fixed prior with current terminal marginal	Schrödinger half-bridge / Adjoint Sampler	Reference must satisfy an independence condition; this can force restrictive priors or large noise.
Classical Schrödinger bridge coupling	Adjoint Schrödinger Bridge Sampler	The terminal Schrödinger potential must also be learned.
Independent current endpoints	Bridge Matching Sampler	Single learned control, with flexible priors and reference processes.

Richter’s coupling choices determine whether the framework recovers existing adjoint samplers or yields BMS.

An audience question clarifies that imposing independence is an extra coupling choice inside the general recipe. Richter confirms: simulate, keep terminal samples, take additional independent prior samples, construct bridges, then apply the same Markovianization step. He says the other coupling choices correspond respectively to Schrödinger half-bridges, Schrödinger bridges and ASBS, and independent coupling for BMS. When asked whether these choices lead to algorithms with guarantees, Richter answers “exactly,” but he does not spell out the guarantees in the seminar.

The BMS paper itself was described on the final reference slide as “in preparation,” under the title “Bridge Matching Sampler: Scalable sampling via generalized fixed-point diffusion matching.”

Damping is a stabilizer, not yet an adaptive trust region

The fixed-point iteration can be damped:

u_{i + 1} = α Φ (u_{i}) + (1 - α) u_{i}, α \in (0, 1] .

With $α = 1$ , this is the original update. With smaller $α$ , the new control keeps some memory of the previous iterate. Richter says this turns out to be crucial in numerical experiments.

The damped update also has a variational characterization. It minimizes the Markovian projection regression loss plus a penalty that keeps $u$ close to $u_{i}$ , with damping parameter $η = (1 - α) / α$ . Richter’s interpretation is straightforward: the algorithm is told not to do the whole projection all at once. The same framework allows damping to be applied to AS and ASBS, not only BMS.

In the Q&A, a participant asks whether damping is equivalent to trust regions. Richter says no: the experiments use a fixed damping parameter, whereas a trust-region method would choose the damping adaptively. He says an adaptive choice would be desirable and could be a follow-up direction. The group realized quickly that damping was a “game changer,” but they did not yet optimize it adaptively.

The numerical evidence is about scaling, damping, and mode preservation

Richter presents the experiments as evidence that the non-Markovian approach can be used in settings where, in his account, the earlier trajectory-based methods run into memory and simulation-cost limits.

The first test is a high-dimensional Gaussian mixture model. In dimension $d = 100$ with 100 modes, BMS is compared to ASBS over different numbers of modes using sliced total variation distance and Wasserstein-2 distance. Richter says BMS improves “a little bit” in the shown plots and appears, in his interpretation, more stable as the number of modes grows. His intuition is that avoiding the back-and-forth between $u$ and $\overset{φ}{^}$ helps prevent rapid mode collapse.

A more extreme Gaussian-mixture comparison uses $d = 2500$ with 20 modes. Richter says the trajectory-based methods from the first part of the talk cannot get close to this setting in practice because, in his account, the required trajectories cannot fit into memory. The plots compare ASBS and BMS under different damping values. In the shown comparisons, damping improves stability, especially for BMS. There is a trade-off: stronger damping converges more slowly but can reach a better final value; moving too fast can lose stability even in BMS.

2,500

dimension in Richter’s high-dimensional Gaussian-mixture comparison

The GMM slides themselves do some of the evidentiary work. One slide shows a $d = 100$ , 100-mode mixture and two curves comparing ASBS and BMS on sliced TVD and Wasserstein-2 as the number of modes changes. A later slide shows ASBS and BMS curves for $d = 2500$ , 20 modes, plotted against target evaluations under different damping parameters. The visual point is not a single reported number; it is the pattern Richter emphasizes verbally: damping trades speed for stability, and BMS is shown as less brittle in that comparison.

An ablation over dimensions and damping values reinforces the point in Richter’s presentation. ASBS performs well with damping, but without damping it often fails in the shown study. BMS also benefits from damping and appears somewhat more stable in that particular example.

The next set of experiments uses n-body particle systems: a double-well system in $d = 8$ , Lennard-Jones with $d = 39$ , and Lennard-Jones with $d = 165$ . The methods compared include PIS, DDS, IDEM, AS, ASBS, and BMS. Metrics include Wasserstein-2 and Wasserstein-2 of energy marginals. Richter says BMS improves somewhat over previous results in the reported table.

Experiment family	Dimensions shown	What Richter emphasizes
Gaussian mixture models	100; 2,500	BMS appears more stable than ASBS in the shown comparisons, especially with damping.
N-body particle systems	8; 39; 165	BMS improves somewhat over the compared diffusion-based samplers on the reported metrics.
Alanine systems	66; 126	The methods are trained in Cartesian space without pretraining; damping is important for mode discovery.

The numerical examples are framed around stability, damping, and feasibility at higher dimension.

The molecular-dynamics examples use alanine dipeptide and alanine tetrapeptide. Richter stresses that these are learned in Cartesian space without pretraining, not in reaction coordinates. For alanine dipeptide, heatmaps over $ϕ, ψ$ coordinates compare molecular-dynamics reference data with ASBS and BMS at damping values $η = 0$ and $η = 10$ . The methods are not perfect, but with damping they discover the main modes in the shown heatmaps. A line plot of Ramachandran $ϕ, ψ$ divergence over target evaluations suggests damping helps and BMS is more stable in that comparison.

The alanine heatmap slide is especially important because it makes Richter’s mode-discovery claim concrete. The reference molecular-dynamics panel is shown beside ASBS and BMS panels with and without damping. Richter does not present the damped results as solved molecular sampling; he says they are “already learning something quite reasonable” and can discover the main modes.

For alanine tetrapeptide, Richter describes the numbers as fresh and still being improved. The table compares ASBS and BMS on ALA2 with $d = 66$ and ALA4 with $d = 126$ using metrics including $(Φ, Ψ)$ -Jensen-Shannon divergence, free-energy RMSE, and TICA-Wasserstein-2. He reports that BMS is a little better than ASBS in the shown statistics. The main observation, he says, is feasibility: the trajectory-based methods from the first part would not be feasible in these settings.

The remaining work is exploration, variance reduction, and better damping

Asked about future directions, Richter gives a broad answer rather than a single prescription. One direction is variance reduction. The arbitrary function $c$ in the generalized non-Markovian drift formula is one possible handle, and he says the group is thinking about control-variate ideas. He finds variance reduction more straightforward in the Markovian setting; in the non-Markovian setting it is less obvious.

Another direction is mode discovery. The data-free problem remains hard because the sampler must explore without examples of the target modes. Richter mentions off-policy ideas, replay buffers, and possibly combining diffusion samplers with more classical exploration methods. He cites, as a possibility, work combining ASBS with metadynamics, suggesting similar hybrid approaches may be useful.

He connects this to the damping problem. Optimizing too quickly can collapse to modes too early; damping imposes a kind of caution by preventing each fixed-point update from moving all at once. But the damping used in the experiments is fixed, and Richter says an adaptive mechanism closer to a trust-region method would be better if it can be found.

Richter’s central move is to separate bridge construction from Markovian sampling. Markovian path-space methods provide flexible divergences, SMC integration, and underdamped variants, but their trajectory simulation and storage costs become prohibitive in high dimensions. The non-Markovian framework constructs reciprocal bridge measures from endpoint couplings, then learns a Markovian projection through a regression objective. Known adjoint samplers appear as special coupling choices; BMS comes from an independent-coupling choice that keeps the objective single and scalable. The empirical claim is cautious: not that the sampling problem is solved, but that this formulation improves feasibility and, in the examples Richter shows, can improve stability and mode coverage when combined with damping.

Data and Training Evals and Benchmarks AI Research Methods