Hamiltonian Flow Maps Learn Larger Molecular Dynamics Steps Without Trajectories

Winfried Ripken, Stanislav Nikolov · Microsoft Research · Tuesday, May 26, 2026 · 18 min read

Michael Plainer, Winfried Ripken and Gregor Lied argue that generative models can attack molecular dynamics’ central bottleneck: the gap between femtosecond integration steps and biological processes that unfold over timescales many orders of magnitude longer. In the Microsoft Research seminar, they separate the problem by timescale, using diffusion models to sample equilibrium Boltzmann states and extract force information, while proposing Hamiltonian flow maps for the intermediate regime where simulations need large, stable steps without training on expensive future-state trajectories.

The bottleneck is the gap between stable timesteps and useful timescales

Michael Plainer framed molecular dynamics as a problem of incompatible clocks. The microscopic integration step needed to keep simulations stable is on the order of femtoseconds, while the molecular changes of interest in biomolecular simulations may unfold over microseconds, milliseconds, or longer. For alanine dipeptide, he used the familiar two-angle representation of the molecule’s conformational state to show why this matters: local atomic motion is small and nearly continuous, while many useful questions concern transitions among metastable states and the long-run equilibrium distribution.

The numerical mismatch is severe. A femtosecond-scale step size, roughly $10^{-15}$ seconds, must be repeated about $10^{12}$ times to reach a millisecond, $10^{-3}$ seconds. Plainer described that as “three times the number of stars in our galaxy,” not as a rhetorical flourish but as a way to make the computational scale legible. Even if each individual force evaluation is fast, the sequential nature of integration becomes the limiting factor.
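The step count can be checked with a few lines of arithmetic. The sketch below is illustrative only: the one-microsecond wall-clock cost per step is a hypothetical figure chosen to show the scale, not a number from the seminar.

```python
# Back-of-the-envelope count of sequential integration steps.
FEMTOSECOND = 1e-15  # seconds
MILLISECOND = 1e-3   # seconds

def steps_needed(target_time_s: float, step_s: float) -> float:
    """Number of sequential steps of size step_s needed to cover target_time_s."""
    return target_time_s / step_s

n_steps = steps_needed(MILLISECOND, FEMTOSECOND)

# Hypothetical assumption: each step costs one microsecond of wall-clock time.
HYPOTHETICAL_SECONDS_PER_STEP = 1e-6
wall_clock_days = n_steps * HYPOTHETICAL_SECONDS_PER_STEP / 86_400

print(f"steps: {n_steps:.1e}, wall-clock days: {wall_clock_days:.1f}")
```

Because the steps are sequential, adding more hardware does not shorten this chain; only a larger stable step does.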

~10¹²

femtosecond steps needed to simulate one millisecond

The bottleneck has several layers. Molecular simulations often need high chemical or quantum-chemical accuracy; otherwise, Plainer said, “why even bother” doing the simulation. Ab initio quantum chemistry can provide that accuracy but is extremely slow. Machine-learned force fields can reduce the cost of force evaluation, but they do not remove the integrator’s stability limits. Standard numerical integrators such as velocity Verlet still require tiny steps, often around a few femtoseconds, to conserve energy and avoid blow-up. Increasing the time step naively can cause the simulation to diverge within a few steps.
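The stability limit is easy to reproduce on a toy system. The following sketch applies velocity Verlet to a one-dimensional harmonic oscillator with unit mass and unit frequency (a stand-in chosen for illustration, not a system from the seminar). For this oscillator the scheme is known to be stable only when the step is below $2 / \omega$.

```python
def velocity_verlet(x, p, dt, n_steps, k=1.0, m=1.0):
    """Integrate U(x) = k x^2 / 2 with velocity Verlet and return the final state."""
    f = -k * x
    for _ in range(n_steps):
        p = p + 0.5 * dt * f   # half kick with the old force
        x = x + dt * p / m     # full drift
        f = -k * x             # force at the new position
        p = p + 0.5 * dt * f   # half kick with the new force
    return x, p

def energy(x, p, k=1.0, m=1.0):
    """Total energy: kinetic plus harmonic potential."""
    return p**2 / (2.0 * m) + 0.5 * k * x**2

e_start = energy(1.0, 0.0)
e_small_step = energy(*velocity_verlet(1.0, 0.0, dt=0.05, n_steps=1000))
e_large_step = energy(*velocity_verlet(1.0, 0.0, dt=2.1, n_steps=100))

print(f"start {e_start:.3f}, small step {e_small_step:.3f}, large step {e_large_step:.3e}")
```

With the small step the energy stays close to its initial value; just past the stability threshold it grows exponentially, which is the blow-up described above.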

The seminar’s timescale slide organized the modeling choices on an axis from deterministic dynamics at $t = 0$ to probabilistic sampling at $t = \infty$. At one end is deterministic molecular dynamics: starting from a state, compute forces and advance by small steps. At the other end is independent equilibrium sampling: after an infinitely long interval, the next state is no longer meaningfully correlated with the starting state and should be regarded as a sample from the Boltzmann distribution. Between these extremes lies the practical regime the speakers focused on for Hamiltonian flow maps: larger-than-classical integration steps that retain useful kinetic information.

That distinction matters because equilibrium sampling and dynamical simulation answer different questions. Sequential molecular dynamics is expensive, but it preserves kinetic properties and allows one to estimate transition rates and pathways. Direct sampling from the Boltzmann distribution can converge quickly to equilibrium statistics, but it does not provide a physical trajectory or kinetics. Plainer’s point was not that one regime dominates the other, but that molecular modeling needs both: fast access to equilibrium structure and credible dynamics over time.

Diffusion models can sample equilibrium states, and their scores can be tied to forces

The equilibrium distribution Plainer focused on is the Boltzmann distribution. In the canonical setting, molecular systems are often simulated at fixed temperature rather than fixed total energy. Langevin dynamics introduces stochasticity, allowing the system to exchange energy with its environment. The force is obtained from the potential energy $U(x)$ through $-\nabla_{x} U(x)$, and long simulations converge to the Boltzmann distribution:

$$p(x) \propto \exp\left(-\frac{U(x)}{k_{B} T}\right)$$

Because of the exponential, a small change in energy can imply a large change in probability. High-energy states are unlikely; low-energy states dominate.
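To make the exponential sensitivity concrete, the sketch below computes the population ratio between two states separated by an energy gap. The temperature of 300 K and the gaps of 1 and 5 kcal/mol are illustrative choices, not values from the seminar.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)

def boltzmann_ratio(delta_u: float, temperature: float = 300.0) -> float:
    """Population of a state relative to one lying delta_u (kcal/mol) lower in energy."""
    return math.exp(-delta_u / (K_B * temperature))

ratio_1 = boltzmann_ratio(1.0)  # illustrative gap of 1 kcal/mol
ratio_5 = boltzmann_ratio(5.0)  # illustrative gap of 5 kcal/mol

print(f"1 kcal/mol gap: {ratio_1:.3f}, 5 kcal/mol gap: {ratio_5:.2e}")
```

A fivefold increase in the gap reduces the relative population by roughly three orders of magnitude at this temperature.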

Generative models enter first as Boltzmann generators or emulators. The condition is important: if one already has Boltzmann-distributed training data, a generative model can be trained to map a simple prior, such as a Gaussian, into molecular configurations distributed like the equilibrium ensemble. Plainer mentioned normalizing flows, flow matching, and diffusion models as examples. For models that provide a sample probability, one can reweight samples by the ratio between the true Boltzmann probability and the model probability. This can correct weights when the model covers the relevant states, but it cannot recover a mode the model never generates. If a third mode is missing from the model’s samples, reweighting has nothing to reweight. Reweighting can also be slow, especially when estimating likelihoods for flow-matching-style models.
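The missing-mode limitation can be shown with a three-state toy example. In this sketch the target populations and the model probabilities are made-up numbers, and self-normalized importance weights stand in for the reweighting step. The model never proposes the third state, so no weight can restore it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy values (not from the seminar): three metastable states.
p_true = np.array([0.5, 0.3, 0.2])   # target Boltzmann populations
q_model = np.array([0.7, 0.3, 0.0])  # model never generates state 2

samples = rng.choice(3, size=20_000, p=q_model)

# Importance weights p/q, self-normalized over the generated samples.
weights = p_true[samples] / q_model[samples]
weights /= weights.sum()

reweighted = np.bincount(samples, weights=weights, minlength=3)
print(reweighted)
```

Reweighting fixes the relative populations of the two covered states, but the third state keeps zero weight because it never appears among the samples.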

Diffusion offers a different practical path: learn to generate samples through an iterative denoising process. In the usual score-based formulation, a forward stochastic differential equation corrupts data into noise, and a reverse process uses a learned score, $\nabla_{x} \log p_{t}(x)$, to denoise samples back to molecular configurations. In the molecular setting Plainer described, the data distribution at zero noise is assumed to be Boltzmann distributed. The force-extraction claim depends on that assumption; it is not a claim that arbitrary molecular samples automatically yield physical forces.

The simple identity connecting a diffusion model’s score to physical forces follows from the Boltzmann form. If

$$p(x) = \frac{\exp\left(-U(x) / k_{B} T\right)}{Z}$$

then

$$\nabla_{x} \log p(x) = -\frac{\nabla_{x} U(x)}{k_{B} T}$$

The normalizing constant $Z$ drops out because $\log Z$ does not depend on $x$. Since the force is $-\nabla_{x} U(x)$, a diffusion model trained on Boltzmann-distributed samples should, at diffusion time zero, have a score equal to the physical force divided by $k_{B} T$.
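The identity can be checked numerically on a toy potential. The sketch below uses a one-dimensional double well and $k_{B} T = 0.5$, both chosen only for illustration. It normalizes the density on a grid, differentiates $\log p$ by finite differences, and compares the result with the analytic force divided by $k_{B} T$.

```python
import numpy as np

KT = 0.5  # illustrative thermal energy, arbitrary units

def potential(x):
    """Toy double-well potential U(x) = (x^2 - 1)^2."""
    return (x**2 - 1.0) ** 2

def force(x):
    """Analytic force F = -dU/dx."""
    return -4.0 * x * (x**2 - 1.0)

# Normalizing constant Z from a simple Riemann sum on a wide grid.
grid = np.linspace(-4.0, 4.0, 20_001)
Z = np.sum(np.exp(-potential(grid) / KT)) * (grid[1] - grid[0])

def log_p(x):
    """Normalized Boltzmann log-density."""
    return -potential(x) / KT - np.log(Z)

def score_fd(x, eps=1e-5):
    """Central finite-difference estimate of d/dx log p(x)."""
    return (log_p(x + eps) - log_p(x - eps)) / (2.0 * eps)

x_test = 0.3
score_value = score_fd(x_test)
scaled_force = force(x_test) / KT
print(score_value, scaled_force)
```

The value of `Z` has no effect on the finite-difference score, which mirrors the statement that the normalizing constant vanishes under the gradient.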

The closer we get to the data, the more physical the score will be.

Michael Plainer

This relation motivated Plainer’s “consistent sampling and simulation” view. In the prior work he described, with “a few tricks” needed to make it work, the same trained energy-based diffusion model can be used in two ways: as a sampler, by denoising noise into independent equilibrium samples, and as a way to extract forces for sequential simulation. The second use is possible without force labels, dynamics priors, or trajectory data, provided the model has learned from independent 3D molecular positions drawn from the equilibrium distribution and the physics relation between energy and probability is used correctly.
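A minimal sketch of the second use, extracting a force from a score and feeding it to a sequential simulator, might look as follows. Everything here is an assumption made for illustration: the analytic score of a harmonic potential stands in for a trained model, and overdamped Langevin dynamics with an Euler-Maruyama update stands in for whichever integrator the speakers used.

```python
import numpy as np

KT = 1.0  # illustrative thermal energy

def score(x):
    """Stand-in for a learned score at diffusion time zero (U(x) = x^2 / 2)."""
    return -x / KT

def force_from_score(x):
    """Invert the Boltzmann relation: F = kT * score."""
    return KT * score(x)

def overdamped_langevin(x, n_steps, dt, rng, friction=1.0):
    """Euler-Maruyama steps of dx = (F / gamma) dt + sqrt(2 kT / gamma) dW."""
    for _ in range(n_steps):
        noise = rng.normal(size=x.shape)
        drift = force_from_score(x) / friction * dt
        x = x + drift + np.sqrt(2.0 * KT * dt / friction) * noise
    return x

rng = np.random.default_rng(0)
# 5000 independent walkers, all started far from equilibrium at x = 3.
walkers = overdamped_langevin(np.full(5000, 3.0), n_steps=2000, dt=0.01, rng=rng)
print(walkers.mean(), walkers.var())
```

For this potential the equilibrium distribution is a unit Gaussian, so the walkers' mean and variance should settle near 0 and 1, up to sampling noise and a small time-discretization bias.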

The limitation remains the one Plainer had already attached to Boltzmann generators: the training data must represent the equilibrium distribution, and the model must cover the relevant modes. Diffusion-based force extraction does not remove those requirements. Plainer placed it at the extremes of the timescale picture: direct sampling at the probabilistic end, small-step deterministic simulation at the other. The next problem was the intermediate regime, where simulations need steps larger than classical stability limits but cannot discard kinetics.

Hamiltonian flow maps are meant to learn the large step, not replay tiny ones

Winfried Ripken introduced Hamiltonian flow maps as an attempt to occupy the middle ground between force integration and equilibrium sampling. Classical velocity Verlet alternates updates of positions and momenta using instantaneous quantities such as forces. It can simulate short timescales accurately, but when the step size grows too large it hits a discretization limit: artifacts appear, then the simulation becomes unstable.

The proposed replacement is not a better force field in the usual sense. It is a learned map in phase space. Given positions and momenta at time $t$, plus a requested time interval $\Delta t$, the model predicts a future state further ahead than velocity Verlet can safely reach in one step. Ripken called this a Hamiltonian flow map because it is intended to follow Hamiltonian dynamics while mapping directly between phase-space states.

Earlier machine-learning approaches to this kind of large-step prediction, as Ripken summarized them, commonly rely on regression targets: train on pairs of present and future states, or on multiple future states for different time horizons. That creates the first major problem. To obtain the training labels, one has to generate the very trajectories the method is supposed to avoid, using many small time steps.

The problem becomes sharper for ab initio molecular dynamics. Accurate quantum-mechanical force calculations can take minutes or hours for complex geometries. Standard machine-learned force-field datasets therefore try to extract maximum value from each expensive calculation: generate a broad cloud of geometries using a cheaper method, choose representative geometries, and compute accurate forces only for those relatively decorrelated examples. Ripken contrasted this with trajectory data, where adjacent points are highly correlated. If expensive quantum calculations are spent marching along a trajectory, each additional force evaluation may add little new information.

A proxy route exists: train a machine-learned force field on decorrelated ab initio data, use that proxy to generate long trajectories by small-step simulation, then train the large-step model on synthetic present-future pairs. Ripken identified two drawbacks. It still requires expensive small-step rollouts, even if the proxy is cheaper than ab initio quantum chemistry. It also inherits the proxy’s bias: the large-step model cannot exceed the quality of the synthetic labels used to train it.

The project’s alternative was to train the large-step model directly from the more widely available decorrelated positions and forces, without future-state labels. The desired model is continuous in time, so that at inference it can make arbitrary-sized jumps, rather than being tied to one fixed $\Delta t$.

The derivation starts from Hamiltonian dynamics in integral form. Moving from a state at $t$ to a state at $t^{*}$ requires integrating velocities and forces along the path:

$$(x_{t^{*}}, p_{t^{*}}) = (x_{t}, p_{t}) + \int_{t}^{t^{*}} (v_{\tau}, f_{\tau}) \, d\tau$$

Directly using this as a training objective would require solving the integral during training. Instead, the model learns the mean displacement over the interval:

$$\bar{u}(x_{t}, p_{t}, t^{*} - t) = \frac{1}{t^{*} - t} \int_{t}^{t^{*}} (v_{\tau}, f_{\tau}) \, d\tau$$

At inference, the future state is obtained by adding $\Delta t \cdot \bar{u}$ to the current state. Ripken used a one-dimensional harmonic oscillator as the intuition: a classical integrator takes many tiny steps around phase space, while the learned map should jump between two states on the same Hamiltonian flow in one step.
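The harmonic-oscillator intuition can be written out directly. In the sketch below, the closed-form interval averages of velocity and force for a unit-mass, unit-frequency oscillator play the role of a perfectly trained model; that substitution is an assumption for illustration, since the real method uses a neural network. One jump of size 2.5, beyond velocity Verlet's stability limit of 2 for this system, is compared with 25,000 tiny Verlet steps.

```python
import numpy as np

def velocity_verlet(x, p, dt, n_steps):
    """Reference small-step integrator for H = p^2 / 2 + x^2 / 2."""
    f = -x
    for _ in range(n_steps):
        p = p + 0.5 * dt * f
        x = x + dt * p
        f = -x
        p = p + 0.5 * dt * f
    return x, p

def mean_displacement(x, p, dt):
    """Exact interval averages of (v, f); stands in for the learned model output."""
    c, s = np.cos(dt), np.sin(dt)
    mean_v = (p * s + x * (c - 1.0)) / dt
    mean_f = (p * (c - 1.0) - x * s) / dt
    return mean_v, mean_f

def flow_map_step(x, p, dt):
    """One large jump: add dt times the mean displacement to the current state."""
    mean_v, mean_f = mean_displacement(x, p, dt)
    return x + dt * mean_v, p + dt * mean_f

x0, p0, big_dt = 1.0, 0.5, 2.5
x_jump, p_jump = flow_map_step(x0, p0, big_dt)
x_ref, p_ref = velocity_verlet(x0, p0, dt=1e-4, n_steps=25_000)
print((x_jump, p_jump), (x_ref, p_ref))
```

The single jump and the long small-step rollout land on the same phase-space point, which is the behavior the learned map is meant to approximate.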

The training objective borrows from recent flow-map and mean-flow work in generative modeling. Those methods try to avoid repeatedly integrating probability-flow ODEs when moving from noise to data; instead, they learn maps that can perform the transformation in one or a few steps. Ripken adapted the same idea to Hamiltonian dynamics. By multiplying the mean-displacement equation by the interval length, taking the total derivative with respect to time, applying the product rule and the fundamental theorem of calculus, the integral disappears. Expanding the remaining total derivative by the chain rule yields an integral-free, simulation-free objective.
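The steps named in this paragraph can be reconstructed as a short derivation. This is a sketch under stated assumptions, not the speakers' exact formulation: it writes $z_{t} = (x_{t}, p_{t})$ for the state, $h(z_{t}) = (v_{t}, f_{t})$ for the instantaneous velocity and force, and $\Delta t = t^{*} - t$, and it differentiates with respect to the start time $t$ while holding $t^{*}$ fixed. Sign conventions in the original work may differ.

```latex
\begin{aligned}
(t^{*} - t)\,\bar{u}(z_{t}, t^{*} - t)
  &= \int_{t}^{t^{*}} h(z_{\tau})\, d\tau
  && \text{(multiply by the interval length)} \\
-\bar{u} + (t^{*} - t)\,\frac{d\bar{u}}{dt}
  &= -h(z_{t})
  && \text{(product rule, fundamental theorem of calculus)} \\
\bar{u}(z_{t}, \Delta t)
  &= h(z_{t}) + \Delta t\,\frac{d\bar{u}}{dt}
  && \text{(rearrange)} \\
\frac{d\bar{u}}{dt}
  &= \partial_{z}\bar{u}\; h(z_{t}) - \partial_{\Delta t}\bar{u}
  && \text{(chain rule, using } \tfrac{d}{dt}\Delta t = -1\text{)}
\end{aligned}
```

The third line contains no integral, and only instantaneous quantities at $z_{t}$ appear on the right-hand side. Setting $\Delta t = 0$ reduces it to $\bar{u} = h(z_{t})$, that is, plain velocity and force matching.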

The resulting loss decomposes into two parts. One is a force-matching term, analogous to ordinary machine-learned force-field training. As $\Delta t \to 0$, the loss recovers force matching. The other is a self-consistency term, which enforces agreement across time horizons. Ripken described the training as self-bootstrapping: the model learns larger predictions by being anchored in small-time physics and made consistent across intervals.

Implementation requires sampling a time interval, computing a Jacobian-vector product for the network with respect to positions, momenta, and time, and using a stop-gradient formulation so that training does not need to backpropagate through the Jacobian-vector product. The resulting overhead, Ripken said, is about 30% compared with classical force matching.
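A sketch of how such a training target could be assembled is shown below. It rests on several assumptions: the target form $h + \Delta t \, (\partial_{z}\bar{u}\, h - \partial_{\Delta t}\bar{u})$ is a reconstruction rather than the speakers' stated loss, a central finite difference stands in for an autodiff Jacobian-vector product, and the exact harmonic-oscillator mean displacement stands in for the network. The names `u_bar`, `jvp_fd`, and `hfm_target` are hypothetical.

```python
import numpy as np

def u_bar(z, dt):
    """Stand-in for the network: exact mean (v, f) of a unit harmonic oscillator."""
    x, p = z
    c, s = np.cos(dt), np.sin(dt)
    return np.array([(p * s + x * (c - 1.0)) / dt,
                     (p * (c - 1.0) - x * s) / dt])

def instantaneous(z):
    """Instantaneous labels (v, f) for m = k = 1: v = p, f = -x."""
    x, p = z
    return np.array([p, -x])

def jvp_fd(fn, z, dt, tangent_z, tangent_dt, eps=1e-5):
    """Central finite-difference stand-in for an autodiff Jacobian-vector product."""
    plus = fn(z + eps * tangent_z, dt + eps * tangent_dt)
    minus = fn(z - eps * tangent_z, dt - eps * tangent_dt)
    return (plus - minus) / (2.0 * eps)

def hfm_target(fn, z, dt):
    """Regression target; in training it would be held constant (stop-gradient)."""
    h = instantaneous(z)
    # Tangent: the state moves along h while the remaining interval shrinks.
    total_derivative = jvp_fd(fn, z, dt, h, -1.0)
    return h + dt * total_derivative

z = np.array([0.7, -0.4])
dt = 1.3  # one sampled time interval
residual = u_bar(z, dt) - hfm_target(u_bar, z, dt)
loss = float(np.sum(residual**2))
print(loss)
```

Because the stand-in is the exact flow-map average, the loss is zero up to finite-difference error. With a real network, the same residual would be minimized while gradients are blocked from flowing through the target.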

~30%

training overhead over classical force matching for the proposed HFM objective

The crucial claim is not merely that a neural network can predict a future state. It is that a continuous large-step Hamiltonian flow map can be trained from instantaneous labels—positions, momenta, and forces—rather than from expensive trajectories or future-state rollouts.

Large steps need conservation machinery, because the learned map is not automatically physical

Gregor Lied turned from training to simulation. A trained Hamiltonian flow map can be applied to molecular dynamics, but naive use creates a problem in the microcanonical NVE ensemble. NVE assumes constant particle number, volume, and total energy. When the large-step model is run directly, the total energy can drift over time. The example shown for ethanol had total energy increasing over the course of the simulation.

In ordinary machine-learned force fields, one common way to avoid this drift is to predict potential energy and obtain conservative forces by differentiating that energy with respect to positions. Lied said that is not feasible in the same way for Hamiltonian flow maps because the predicted quantities are step-size dependent. The model is not simply a scalar potential whose gradient gives forces.

The workaround is to apply post-processing filters after each molecular-dynamics step. The basic filter rescales momenta so that the total energy remains constant:

$$E_{\text{total}} = \frac{p^{2}}{2m} + E_{\text{pot}}(x)$$

Ripken elaborated that the total energy can be computed by combining analytic kinetic energy with potential energy from an existing force field, a machine-learned force field, or an added prediction head on the Hamiltonian flow map. The work also proposes a coupled conservation filter that keeps both total energy and total angular momentum constant by solving a quadratic optimization problem during inference.
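A minimal version of the basic energy filter could look like the sketch below. It assumes a uniform scaling of all momenta, which is one simple way to restore a target total energy; the seminar did not spell out the exact rescaling, and the coupled energy and angular-momentum filter is not reproduced here. The function names and test values are hypothetical.

```python
import numpy as np

def kinetic_energy(p, masses):
    """Sum over atoms of |p_i|^2 / (2 m_i); p has shape (n_atoms, 3)."""
    return float(np.sum(np.sum(p**2, axis=1) / (2.0 * masses)))

def rescale_momenta(p, masses, e_pot, e_target):
    """Uniformly scale momenta so that kinetic + potential energy equals e_target."""
    ke_target = e_target - e_pot
    ke_current = kinetic_energy(p, masses)
    if ke_target < 0.0 or ke_current == 0.0:
        raise ValueError("no momentum rescaling can reach the target energy")
    return p * np.sqrt(ke_target / ke_current)

# Toy two-atom state after a step whose energy drifted away from the target.
masses = np.array([1.0, 2.0])
p_after_step = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
e_pot_after_step = -1.0   # would come from a force field or a prediction head
e_target = 2.0            # total energy at the start of the NVE run

p_filtered = rescale_momenta(p_after_step, masses, e_pot_after_step, e_target)
e_total = kinetic_energy(p_filtered, masses) + e_pot_after_step
print(e_total)
```

Uniform scaling changes only the magnitude of the momenta, not their directions, so it also leaves the direction of the total angular momentum unchanged while altering its size. That side effect is one reason a coupled filter is useful.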

The filters are not decorative. With the coupled conservation filter, the shown simulations kept total energy constant over time and total angular momentum constant as well. Plainer later added that without filters, the model runs into the expected failure modes; they are “really necessary” for stability unless an aggressive thermostat or similar mechanism masks the drift.

Other optional filters serve different invariance or stability roles. A rotation filter can be used when the underlying neural network is not strictly equivariant: randomly rotate the system before the integration step and rotate it back afterward, averaging out artifacts from orientation dependence. A drift filter can impose conservation of total linear momentum and uniform motion of the center of mass, matching properties that velocity Verlet preserves. For NVT simulations, a thermostat such as a Langevin thermostat can be applied as well.
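The rotation and drift filters can be sketched in a few lines. This is an illustrative reading of the description above, not the speakers' implementation: `step_fn` is a hypothetical placeholder for the learned map, the random rotation is drawn from a QR decomposition of a Gaussian matrix, and the drift filter shifts momenta in proportion to mass so the total linear momentum matches a target.

```python
import numpy as np

def random_rotation(rng):
    """Draw a random proper rotation matrix (determinant +1)."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))  # fix column signs for a well-defined draw
    if np.linalg.det(q) < 0.0:
        q[:, 0] = -q[:, 0]       # flip one axis to remove a reflection
    return q

def rotated_step(step_fn, x, p, dt, rng):
    """Rotate the system, apply the step, and rotate the result back."""
    rot = random_rotation(rng)
    x_new, p_new = step_fn(x @ rot.T, p @ rot.T, dt)
    return x_new @ rot, p_new @ rot

def conserve_linear_momentum(p, masses, p_total_target):
    """Shift momenta (mass-weighted) so their sum equals p_total_target."""
    correction = (p_total_target - p.sum(axis=0)) / masses.sum()
    return p + masses[:, None] * correction
```

For a step function that is already rotation-equivariant, `rotated_step` returns the same result as calling the step directly. For one that is not, repeated random rotations average out orientation-dependent artifacts over many steps.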

The inference pipeline therefore combines learned prediction and physical correction. The network receives positions, momenta, the requested time step, and atom information. It predicts mean velocities and mean forces, which update the state. Filters then enforce selected conservation or ensemble constraints.

This division of labor is important. The Hamiltonian flow map is trained to learn large-time evolution from instantaneous supervision, but it does not by construction guarantee symplecticity, exact energy conservation, or all physical invariances. The filters are the mechanism used in this work to make long simulations stable enough to evaluate.

The reported speedups come from stable larger steps, not faster per-step evaluation alone

The reported gains came from taking larger stable steps than a conventional machine-learned force field with velocity Verlet, using the same training data. For paracetamol, Ripken compared a machine-learned force field plus velocity Verlet against a Hamiltonian flow map. Both simulations were visualized in a two-dihedral state space. Starting at a 0.5 femtosecond step size, velocity Verlet remained stable only up to about 1.5 femtoseconds, while the Hamiltonian flow map was run stably up to 17 femtoseconds. The larger stable step translated into faster exploration of state space.

The slide made the comparison explicit: “MLFF + Velocity Verlet” on the left, “Hamiltonian Flow Map (Ours)” on the right, the same training data, and a label of “> 10x faster.” The point was not that each neural-network call is necessarily cheaper. At the same step size, Ripken said the Hamiltonian flow map had roughly the same nanoseconds-per-day rate as an equivariant NequIP-style force field in the examples shown. The gain appeared when the HFM step size increased. At larger stable steps, he reported up to 380 nanoseconds per day, surpassing a lightweight SchNet model in the comparison.
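The relation between step size and throughput is simple arithmetic. In the sketch below, the rate of 250 model evaluations per second is a hypothetical figure used only to show the scaling; the 1.5 fs and 17 fs step sizes are the stability limits quoted for paracetamol.

```python
SECONDS_PER_DAY = 86_400
NS_PER_FS = 1e-6

def ns_per_day(step_fs: float, steps_per_second: float) -> float:
    """Simulated nanoseconds per wall-clock day at a given step size and step rate."""
    return step_fs * NS_PER_FS * steps_per_second * SECONDS_PER_DAY

HYPOTHETICAL_STEPS_PER_SECOND = 250.0  # assumed, identical for both methods

verlet_rate = ns_per_day(1.5, HYPOTHETICAL_STEPS_PER_SECOND)
flow_map_rate = ns_per_day(17.0, HYPOTHETICAL_STEPS_PER_SECOND)
speedup = flow_map_rate / verlet_rate
print(f"{verlet_rate:.1f} vs {flow_map_rate:.1f} ns/day, speedup {speedup:.1f}x")
```

At equal per-step cost the speedup equals the ratio of step sizes, which for these two limits is a little above 11 and consistent with the “> 10x faster” label.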

| Result | System or setting | Claim in the seminar |
| --- | --- | --- |
| Velocity Verlet stability | Paracetamol with MLFF + velocity Verlet | Stable up to about 1.5 fs |
| Hamiltonian flow map stability | Paracetamol with HFM | Stable up to 17 fs |
| Throughput | HFM at larger stable step sizes | Up to 380 ns/day |
| Long NVT simulation | Alanine dipeptide | At least 1 µs |
| Problematic step sizes | Alanine dipeptide | Failures around 10–11 fs, coinciding with hydrogen-X vibrational periods |

Selected empirical claims from the seminar.

380 ns/day

reported simulation throughput for the Hamiltonian flow map at larger stable step sizes

Ripken emphasized that the proposed loss is architecture-agnostic. The reported results used a particular implementation, but the method is not tied to that neural architecture. If a more efficient or more expressive architecture is used, he argued, the throughput and quality could improve.

Alanine dipeptide served as a more dynamics-focused test. In an NVT simulation, the model was run for at least one microsecond and recovered the reference free energy surface, including the relevant metastable states, at a 12 femtosecond step size. Ripken described that 12 femtosecond setting as roughly one order of magnitude larger than how one would usually run the simulation with machine-learned force fields.

The results were not monotonic in step size. In the detailed alanine dipeptide plots, the model recovered relevant free-energy modes up to 9 femtoseconds, showed failure modes at 10 and 11 femtoseconds, and then recovered accurate behavior again at 12 femtoseconds. The speakers treated this as a real instability to explain, not something to smooth over.

The current scope remains narrow. The systems are still small, and the immediate aim is accelerating ab initio simulations. Open questions include whether the approach scales to larger systems such as proteins, whether the models can become transferable across chemical space, how to apply enhanced sampling methods such as metadynamics or umbrella sampling, whether a more stable loss formulation would help, and whether symplectic architectures could improve long-time energy behavior.

The failures at 10 and 11 femtoseconds exposed a specific oscillatory difficulty

A question from the discussion asked why alanine dipeptide failed around 10 and 11 femtoseconds: model bias, architecture limits, training data sparsity, or something else. The speakers’ current hypothesis came from ground-truth force-field simulations. The troublesome time steps coincided with vibrational peaks of hydrogen-X bonds. The slide separated all atoms, hydrogen, carbon, nitrogen, and oxygen, and highlighted that fast hydrogen-X vibrations have periods of 10 and 11 femtoseconds.

At exactly those periods, the model must effectively predict a full vibration cycle: a hydrogen atom should return to the correct relative distance from the bonded atom after the time-step prediction. A tiny error in the predicted period becomes a large relative error when the desired result is exact cancellation over the cycle. Because the model is continuous in $\Delta t$ and trained across many time horizons simultaneously, it has to learn substantial motion at nearby intervals while producing cancellation at the vibrational period itself. Ripken said that was their best current guess for why the simulations become hard to stabilize and errors accumulate.
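The cancellation argument can be illustrated with a single cosine mode. The sketch below assumes a bond coordinate oscillating with a 10 fs period and a model whose implied period is off by 1%; both the functional form and the 1% figure are illustrative assumptions, not measurements from the seminar.

```python
import numpy as np

def displacement(amplitude, dt, period):
    """Change in x(t) = amplitude * cos(2 pi t / period) after a step of size dt."""
    return amplitude * (np.cos(2.0 * np.pi * dt / period) - 1.0)

TRUE_PERIOD_FS = 10.0
MODEL_PERIOD_FS = 10.1  # hypothetical 1% error in the implied period

# Half a period: the true displacement is large, so a small error is negligible.
true_half = displacement(1.0, 5.0, TRUE_PERIOD_FS)
pred_half = displacement(1.0, 5.0, MODEL_PERIOD_FS)
relative_error_half = abs(pred_half - true_half) / abs(true_half)

# Full period: the true displacement cancels, so any residual is pure error.
true_full = displacement(1.0, 10.0, TRUE_PERIOD_FS)
pred_full = displacement(1.0, 10.0, MODEL_PERIOD_FS)

print(relative_error_half, true_full, pred_full)
```

At half a period the same period error is a tiny fraction of the displacement. At the full period the target is zero, so the leftover displacement has nothing to be small relative to, and it is re-applied at every subsequent step.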

Plainer saw symplectic architectures as a promising possible fix. His reasoning was that the loss extrapolates from smaller time steps: if a smaller-step prediction is wrong, the larger one will inherit the problem. But in the observed case, direct force predictions at the problematic step sizes looked accurate, and even larger step sizes such as 12 and 13 femtoseconds were stable again. That suggested to him that the model had learned useful local predictions, while long simulations remained unstable because of direct prediction, filtering, and lack of stronger geometric structure. A symplectic architecture, he speculated, could improve stability, especially for chaotic systems. Ripken agreed.

The method’s limits are not only about average regression error. A large-step molecular model can fail at specific physical timescales because the dynamics contain fast oscillatory modes that impose cancellation constraints. Stability depends on how the learned map composes over many steps, not just whether it predicts instantaneous quantities accurately.

The open problems are coordinate choice, scale, starting structures, and stochasticity

The discussion narrowed what Hamiltonian flow maps can be asked to do in their current form. Sasank Edara asked whether the model outputs are Cartesian coordinates or internal coordinates such as bond angles and dihedrals. Ripken said the work predicts in Cartesian space: velocities and forces are Cartesian. Reparameterizations are possible in principle, but not used in this work.

That choice leaves the model exposed to invalid geometries when it is pushed too far. Edara asked whether Cartesian predictions sometimes generate a bond length that is badly wrong and causes an energy blow-up. Ripken said this was not observed at smaller timescales, but once the Hamiltonian flow map is pushed beyond roughly 15 femtoseconds, depending on the system, such failures can occur. The architecture does not contain a hard force-field prior that prevents invalid geometries; if the state goes out of distribution, failure is possible. Plainer added that filters are central to avoiding these modes.

Scale is still an open boundary. Stanislav Nikolov asked how large a protein the current method can target. Ripken answered that the current preprint’s largest reported system was alanine dipeptide with 22 atoms, though more recent coarse-grained protein dynamics had shown promising preliminary results that may be added to a future update. Plainer’s answer was more tentative. He said the largest systems they had looked at were protein-sized, mentioning BBA with 28 amino acids, and that they had also applied the approach to some larger molecules. He was uncertain on the exact atom count of the highest-dimensional molecular case, recalling an Ala3 system from MD22 at around 42 or 46 atoms. For Hamiltonian flow maps, he said they had not noticed much difference merely from system size, aside from the dependence on architecture: whether the model can represent and scale to the higher-dimensional system.

Starting from sequence alone is a different problem from accelerating dynamics from a known state. Nikolov asked whether one could generate a reliable trajectory for an intrinsically disordered protein (IDP) from only the amino-acid sequence. Plainer answered cautiously and called the question speculative. He compared Hamiltonian flow maps to ordinary machine-learned force fields: success depends on whether the model has learned the relevant physics and priors. For IDPs, he expected many priors would be needed, because training data cannot cover everything. Repulsive forces or other architectural safeguards may be necessary to prevent rapid out-of-distribution behavior. Even then, large time steps might miss modes or struggle with chaotic dynamics. When Nikolov pressed that without a starting position the trajectory may not be valuable, Plainer suggested one might start from a structure generated by AlphaFold or BioEmu and relax it, while acknowledging that such tools may not work well for IDPs.

Stochastic dynamics raises a separate modeling question. Carles Domingo-Enrich asked how the approach would change for systems with explicit stochasticity, such as adding Brownian motion to the velocity update. Ripken said related generative-modeling work, including approaches he referred to as Diamond maps or meta flow maps, attempts to extend flow-map ideas to stochastic settings by conditioning on a noise sample so that the dynamics become deterministic conditional on that noise. For thermostatted molecular dynamics, where the source of stochasticity is known, he thought a derived formulation might be manageable.

He drew a sharper boundary around another kind of uncertainty: when the time step is extended so far that it approaches the system’s Lyapunov time, trajectories become effectively chaotic. Multiple trajectories from the same starting conditions can diverge because of numerical errors. That stochasticity is harder to formalize and therefore harder to distill into the same framework. Ripken said he was not optimistic, at least currently, about solving that version.

The role of thermostats and noise is therefore practical as well as theoretical. In a deterministic NVE setting, the speakers rely heavily on filters because the learned map has no theoretical guarantee of energy conservation or symplecticity. With a thermostat and noise, some deterministic errors can be buried or corrected by the thermostat pulling the system back toward the desired temperature. But Ripken cautioned that systematic model errors can still shift the effective temperature: if the model consistently overshoots, the thermostat may struggle and the simulation may run too hot on average. Filters can still help by reducing the systematic offset before the thermostat acts.

Plainer added a practical observation: in many biomolecular simulations, especially proteins, practitioners often use NVT rather than purely deterministic NVE, so stochastic thermostat terms are common. That does not eliminate the need for physical structure in the learned model, but it changes which errors are most damaging.
