Flow Policies Need New Q-Learning Methods for Online Robot Adaptation
UC Berkeley PhD student Qiyang “Colin” Li argues that the flow-matching and diffusion policies now effective for robotic manipulation expose a weakness in standard Q-learning. These policies model complex, multimodal action chunks well, but they are hard to optimize with the reparameterized actor gradients that efficient continuous-control RL relies on. He presents two approaches, Flow Q-learning and Q-learning with Adjoint Matching, as ways to make off-policy RL work with such policies while reusing prior robot data. The trade-off, in Li’s account, is between the stability gained by distilling flows into one-step actors and the expressivity preserved by keeping multistep flow policies.
Microsoft Research · May 26, 2026 · 19 min read