Energy-Based Fine-Tuning Improves Accuracy Without RLVR’s Validation-Loss Penalty
Mujin Kwun and Carles Domingo-Enrich present energy-based fine-tuning (EBFT) as a post-training method that replaces next-token imitation or task-specific rewards with sequence-level feature matching. Their argument is that supervised fine-tuning (SFT) is efficient but trains under teacher forcing, conditioning on reference prefixes rather than on the model's own generations, while RL with verifiable rewards (RLVR) can improve accuracy without preserving the target completion distribution. EBFT instead samples model rollouts, compares their frozen-model feature embeddings with those of reference completions, and uses that signal for policy-gradient updates. In the reported coding and translation experiments, it matched or exceeded RLVR accuracy while producing lower validation cross-entropy than both RLVR and SFT.
Microsoft Research · May 26, 2026 · 18 min read
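To make the rollout-scoring idea in the summary concrete, here is a minimal sketch of how a feature-matching signal could be turned into per-rollout weights for a policy-gradient update. It is an illustration under stated assumptions, not the authors' implementation: the summary does not specify the comparison function or baseline, so cosine similarity and a mean baseline are assumed here, and the function names and toy embeddings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def feature_matching_advantages(rollout_feats, reference_feat):
    """Score each rollout against the reference, then center the scores.

    Assumption: reward = cosine similarity between a rollout's feature
    embedding and the reference completion's embedding. The mean reward
    across rollouts serves as a simple baseline.
    """
    rewards = [cosine(f, reference_feat) for f in rollout_feats]
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


# Toy 2-D vectors standing in for frozen-model feature embeddings
# (made-up numbers, for illustration only).
reference = [1.0, 0.0]
rollouts = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

advantages = feature_matching_advantages(rollouts, reference)
print(advantages)
```

In a REINFORCE-style update, each advantage would weight the log-probability of its rollout, so rollouts whose features sit closer to the reference are made more likely and those farther away less likely.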