Orply.

Robots Need Game-Theoretic Planning to Navigate Human Interaction

Negar Mehr | Stanford Online | Wednesday, May 20, 2026 | 19 min read

UC Berkeley roboticist Negar Mehr uses a Stanford robotics seminar on interactive autonomy to argue that robots cannot handle shared spaces by treating people and other robots as moving obstacles. She frames interaction as a coupled decision problem: agents must predict how others will respond to their own actions, coordinate across multiple possible equilibria, and learn from demonstrations of interaction rather than isolated behavior. Her broader case is that game-theoretic structure, multi-agent learning, and training-time foundation-model coaching can make that coupling tractable without replacing deployed control policies.

Interactive autonomy fails when robots treat other agents as scenery

Negar Mehr frames interactive autonomy around a practical requirement: robots must be able to act safely and intelligently around other agents, including humans and other robots. The difficulty is not limited to dramatic human-robot incidents. Mehr uses restaurant-service failures, autonomous-vehicle confusion in San Francisco, and warehouse robots stuck in standoffs to illustrate the same underlying problem: when multiple agents share space, independent planning is often not enough.

The hallway example is the simplest version of the problem. Two people walking toward each other can usually negotiate who yields where, sometimes with a brief miscoordination. Robots can get stuck in the equivalent situation because the task is not merely obstacle avoidance. Each agent’s action changes the likely future actions of the others. A robot navigating a hallway, changing lanes, carrying an object, or moving through a warehouse needs to reason about how others will react to its own decision.

Mehr calls this requirement “joint prediction and planning.” In human terms, it resembles theory of mind: when a driver nudges into a lane and infers from another driver’s slowing that the lane change is possible, the driver is not only predicting the other car’s trajectory; the driver is predicting the other driver’s response to the driver’s own action.

For robotics, Mehr formalizes that kind of interdependence as a dynamic game. Each agent has a state, a control input, dynamics, and a cost function capturing what it wants to do. In a navigation problem, one agent may want to minimize distance to its goal while also avoiding collision costs with the others. The central complication is that no agent can optimize its own objective in isolation, because each objective depends on what other agents do.

The game-theoretic solution concept she emphasizes is the Nash equilibrium. In this setting, an equilibrium is not a control-theoretic rest point where a derivative is zero. It is a set of actions in which every agent is doing the best it can given its prediction of the others. No agent has an incentive to unilaterally change its action.
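One common way to write this setup down is sketched below. The notation is an assumption for illustration and is not taken from the talk: x_t stacks the states of all N agents, u_t^i is agent i's control, f is the joint dynamics, c^i is agent i's stage cost, and u^{-i} denotes the controls of everyone except agent i.

```latex
% Joint dynamics and agent i's cost over a horizon T
x_{t+1} = f\left(x_t, u_t^1, \dots, u_t^N\right), \qquad
J^i\left(u^i, u^{-i}\right) = \sum_{t=0}^{T} c^i\left(x_t, u_t^1, \dots, u_t^N\right)

% Nash equilibrium: no agent can lower its own cost by deviating alone
J^i\left(u^{i*}, u^{-i*}\right) \le J^i\left(u^i, u^{-i*}\right)
\quad \text{for all agents } i \text{ and all admissible } u^i
```

Each J^i depends on the other agents' controls through the shared state, which is why the N optimization problems cannot be solved separately.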

That captures the desired joint reasoning, but it creates an algorithmic problem. Computing such equilibria for robots typically means solving coupled nonlinear optimal-control problems, often in real time and in a receding-horizon fashion. Mehr’s starting tension is therefore straightforward: Nash equilibria express the right interactive structure, but direct computation can be too hard for practical robotic systems.

Potential games turn coupled interaction planning into one optimization problem

The main structural simplification in Negar Mehr’s planning work is that many practical multi-agent interactions can be treated as dynamic potential games. Potential games are an established class in game theory, not a new invention by her lab. The relevant property is that equilibria can be found by minimizing a single potential function over the joint state and actions of all agents, rather than solving a collection of coupled optimal-control problems.

Mehr presents a simple three-agent example. Each agent’s cost contains its own tracking cost plus pairwise collision-avoidance terms. If agent one penalizes collision with agent two in the same way agent two penalizes collision with agent one, and the same symmetry holds for all agent pairs, then the game has a potential function. That potential is the sum of all agents’ tracking costs plus each pairwise collision-avoidance cost counted once. Minimizing that potential subject to the agents’ dynamics gives an equilibrium of the original game.
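The defining property can be checked numerically in a stripped-down version of the three-agent example. The sketch below is a static, single-step toy (no dynamics, no horizon), and the quadratic tracking cost, the hinge-style collision penalty, and every name in it are assumptions chosen for illustration. It verifies that when one agent deviates alone, its own cost changes by exactly the same amount as the potential.

```python
import numpy as np

def tracking_cost(p, goal):
    # Squared distance to the agent's own goal (assumed form).
    return float(np.sum((p - goal) ** 2))

def collision_cost(p, q, radius=1.0):
    # Symmetric pairwise penalty: identical whichever agent evaluates it.
    d = np.linalg.norm(p - q)
    return float(max(0.0, radius - d) ** 2)

def agent_cost(i, positions, goals):
    # Agent i's own cost: its tracking term plus every pair it belongs to.
    cost = tracking_cost(positions[i], goals[i])
    for j in range(len(positions)):
        if j != i:
            cost += collision_cost(positions[i], positions[j])
    return cost

def potential(positions, goals):
    # All tracking terms, plus each pairwise collision term counted once.
    n = len(positions)
    total = sum(tracking_cost(positions[i], goals[i]) for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            total += collision_cost(positions[i], positions[j])
    return total

goals = np.array([[2.0, 0.0], [-2.0, 0.0], [0.0, -2.0]])
positions = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.6]])

# Agent 0 deviates unilaterally; the other two stay where they are.
deviated = positions.copy()
deviated[0] = positions[0] + np.array([0.3, -0.2])

delta_cost = agent_cost(0, deviated, goals) - agent_cost(0, positions, goals)
delta_potential = potential(deviated, goals) - potential(positions, goals)
print(delta_cost, delta_potential)
```

Because the two differences agree for any unilateral deviation, a local minimizer of the potential is a point from which no single agent can improve locally. That is what lets a single optimization stand in for the coupled problem.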

The reduction matters because it converts multi-agent planning into a form that can be attacked with ordinary trajectory optimization. Mehr’s lab used this structure in potential iLQR and related work, including quadcopter demonstrations. She reports that, compared with existing equilibrium solvers, their method was about 20 times faster in two-agent and four-agent setups, with the savings expected to grow as the number of agents increases.

20×: reported average speedup over state-of-the-art equilibrium solvers in the planning benchmark

The same line of work was extended from soft collision penalties to explicit constraints. Mehr describes a collaborative-transport task in which two robots must carry a rigid object while interacting with humans. The robots are subject to equality constraints that must hold throughout the motion, while still reasoning about how nearby humans respond. In the demonstration she describes, two small robots connected by a rod navigate around two people. As the agents approach, the robots adjust the rod’s orientation and let the human motion resolve around them. Mehr attributes the behavior to the game-theoretic reasoning running fast enough in a receding-horizon loop.

The planning result is not simply that robots can avoid collisions. It is that, when dynamics and costs are known and the interaction has the right structure, a robot can plan interactively without treating every other agent as an unpredictable disturbance. The game structure supplies the computational shortcut.

Equilibria are multiple, and coordination depends on choosing the same one

Finding one equilibrium is not enough. Negar Mehr emphasizes that interaction equilibria are generally not unique. Two agents approaching each other can avoid collision if both yield in one coordinated convention, or if both yield in another. Either equilibrium may be good. Failure occurs when the agents pick incompatible modes.

Mehr’s example is mundane but useful: during a conference trip to Singapore, she kept bumping into people before realizing that the local walking convention was different from the one she expected in the United States. Yielding left and yielding right can both work; they fail when people assume different conventions.

For robots, that means coordination requires more than computing an efficient trajectory. A robot must recognize the interaction mode selected by others and adapt to it. In the potential-game framework, Mehr’s lab explored this by seeking multiple local minima of the potential function. Each local minimum can correspond to a different local interaction equilibrium. The robot can then infer in real time which equilibrium a human appears to be selecting and adjust accordingly.
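The idea of enumerating modes and then matching the observed one can be sketched on a toy problem. Everything below is an illustrative assumption rather than the lab's solver: the potential is a one-dimensional double well over a lateral passing offset, its two minima stand in for "pass left" and "pass right", and the mode inference is a simple nearest-mode rule.

```python
import numpy as np

def grad(y):
    # Gradient of the toy potential (y**2 - 1)**2, which has local minima
    # at y = -1 ("pass left") and y = +1 ("pass right").
    return 4.0 * y * (y ** 2 - 1.0)

def descend(y0, step=0.05, iters=500):
    # Plain gradient descent from a single starting point.
    y = y0
    for _ in range(iters):
        y -= step * grad(y)
    return y

def find_modes(starts, tol=1e-3):
    # Multi-start local search; keep each distinct local minimum once.
    modes = []
    for y0 in starts:
        y = descend(y0)
        if not any(abs(y - m) < tol for m in modes):
            modes.append(y)
    return sorted(modes)

# Eight evenly spaced starts; none sits on the stationary point at y = 0.
modes = find_modes(np.linspace(-2.0, 2.0, 8))

# Once the other agent starts drifting, pick the mode closest to what is seen.
observed_offset = 0.3
selected = min(modes, key=lambda m: abs(m - observed_offset))
print(modes, selected)
```

In a real planner each mode would be a full joint trajectory rather than a scalar, and the inference step would weigh evidence over time, but the structure is the same: several local minima are computed up front and the observation selects among them.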

In a human-robot passing experiment, the person sometimes yields right and sometimes yields left. The robot does not know the person’s intended convention at the start. Once the person begins to deviate, the robot identifies the corresponding mode and selects the compatible equilibrium. Mehr says this allows the robot to do what she failed to do immediately in Singapore: detect the convention from the other agent’s behavior and adapt.

The multiplicity can be unintuitive. Mehr describes a two-agent swap problem with a circular obstacle between the agents and limited visibility until the agents approach each other. She initially expected four modes. After running the solver, the lab found six: two modes where both agents go below the obstacle with one yielding to the other, two analogous modes above the obstacle, and two modes where one goes above while the other goes below.

The broader point is that multimodality is inherent in interaction. It is not a nuisance added by perception noise or imperfect controllers. Even idealized interactions can have several valid coordination modes. A robot that commits to one mode without reading the other agent’s behavior can be locally optimal and socially wrong.

During questions, Mehr adds that perfectly symmetric cases are the tricky ones. In many real settings, some asymmetry breaks the tie and makes one yielding pattern more efficient. When symmetry persists, adding noise can break it. But she cautions that deadlocks remain a real concern. The advantage of the game-theoretic planner, in her view, is that both agents’ decisions are modeled jointly rather than optimized independently.

Imitating a human in isolation misses the interaction

The planning story assumes that the robot knows each agent’s dynamics and cost function. Negar Mehr turns next to the obvious objection: how would the robot know what objective a human is optimizing?

The standard answer in reinforcement-learning literature is inverse reinforcement learning. Under the usual model, a human is treated as a noisy optimizer of an unknown cost or reward function. Demonstrations are collected, and the learner infers the underlying objective. The action model is maximum-entropy in spirit: actions with higher reward, or lower expected cost, are more likely, but human behavior is stochastic rather than deterministic.

Mehr’s objection is not to noisy optimality itself. It is to learning from a human in isolation when the target behavior is interactive. If agents’ decisions are interdependent during planning, she argues, then that interdependence must also appear during cost learning. To learn how someone interacts, it is more informative to observe them interacting with others than to observe them acting alone.

Her analogy is advising style. A prospective student would not learn much about Mehr’s advising by watching her work alone in her office. The revealing data would be her interactions with students. The same principle applies to pedestrian motion, collaborative manipulation, and human-robot coordination: preferences are often expressed most clearly under coupling.

To model noisy multi-agent behavior, Mehr’s group drew on quantal response equilibria from cognitive science and game theory. A quantal response equilibrium can be thought of as a noisy version of a Nash equilibrium: each agent maintains a probability distribution over actions, and those distributions form a mutual fixed point. Mehr repeats the quip that if Nash had been a statistician, he might have discovered this notion of equilibrium instead.

The static quantal-response idea then has to be extended to dynamic games. Mehr’s group does this through a multi-agent Q-function. Each agent’s expected cost-to-go is averaged over the policies of the other agents, producing a marginalized quantity. At equilibrium, an agent’s action probability is proportional to the exponential of the negative marginalized expected cost. Mehr calls the resulting concept an entropic cost equilibrium.

The name is deliberate. She describes entropic cost equilibria as the multi-agent extension of the maximum-entropy principle widely used in reinforcement learning. Adding an entropy-weight parameter allows the model to represent different levels of rationality. As agents become more “irrational” under this model, trajectories become noisier, and new interaction modes can appear that would not be explained under perfect rationality.
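A static, two-agent version of this fixed point is easy to compute and shows the effect of the entropy weight. The sketch below is not the dynamic-game algorithm from the talk: there is no Q-function or time horizon, the cost matrices are invented for illustration, and the damped fixed-point iteration is just one simple way to find a solution. Each agent's action distribution is proportional to the exponential of the negative expected cost, averaged over the other agent's distribution.

```python
import numpy as np

def softmin(costs, temperature):
    # Probabilities proportional to exp(-cost / temperature).
    logits = -costs / temperature
    logits = logits - logits.max()  # shift for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def quantal_response_fixed_point(C1, C2, temperature, iters=500, damping=0.5):
    # C1[a1, a2] and C2[a1, a2] are the costs to agent 1 and agent 2.
    p1 = np.full(C1.shape[0], 1.0 / C1.shape[0])
    p2 = np.full(C1.shape[1], 1.0 / C1.shape[1])
    for _ in range(iters):
        # Expected cost of each action, marginalized over the other agent.
        new_p1 = softmin(C1 @ p2, temperature)
        new_p2 = softmin(C2.T @ p1, temperature)
        p1 = damping * p1 + (1.0 - damping) * new_p1
        p2 = damping * p2 + (1.0 - damping) * new_p2
    return p1, p2

# Toy coordination game: matching actions is cheap, mismatching costs 1,
# and the (0, 0) convention is slightly cheaper than (1, 1).
C1 = np.array([[0.0, 1.0], [1.0, 0.2]])
C2 = C1.copy()

sharp_p1, sharp_p2 = quantal_response_fixed_point(C1, C2, temperature=0.2)
noisy_p1, noisy_p2 = quantal_response_fixed_point(C1, C2, temperature=5.0)
print(sharp_p1, noisy_p1)
```

At the low temperature the agents concentrate on one convention, while at the high temperature both distributions are close to uniform. The uniform starting point is a choice, and at low temperatures a different start can settle on a different fixed point, which mirrors the multiplicity of modes discussed earlier.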

That framework then supports inverse learning. Costs are modeled as weighted combinations of features. The algorithm seeks weights such that the expected features under the learned model match the expected features in demonstrations. Mehr’s stated intuition is that the learned policies should be close to what appears in expert interaction data.
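The feature-matching idea can be illustrated in the simplest possible setting. The sketch below is a single-agent, single-step simplification (the multi-agent version would marginalize over the other agents' policies), and the feature table, the "true" weights, and the step size are all made up for the example. For a policy proportional to exp(-w · phi(a)), the log-likelihood gradient is the model's expected features minus the demonstrated ones, so gradient ascent stops exactly when the two match.

```python
import numpy as np

# phi(a) for three candidate actions, two features each (illustrative values).
features = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])

def policy(w):
    # Maximum-entropy action model: probability proportional to exp(-cost).
    logits = -features @ w
    logits = logits - logits.max()
    p = np.exp(logits)
    return p / p.sum()

# Stand-in for demonstration data: expected features of a noisy "expert"
# whose weights the learner does not get to see.
true_w = np.array([1.0, 2.0])
demo_features = policy(true_w) @ features

w = np.zeros(2)
for _ in range(2000):
    model_features = policy(w) @ features
    # Ascent step on the demonstration log-likelihood.
    w += 0.5 * (model_features - demo_features)

print(w, policy(w) @ features, demo_features)
```

In this toy the weights are recovered exactly because the features distinguish all three actions. With real trajectories the expectation under the model has to be estimated, which is where most of the computational effort goes.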

In a recent pedestrian study with collaborators at UT Austin, cameras collected trajectories of people walking on campus. The goal was to predict how humans move around each other in realistic pedestrian crowds. Mehr reports that their multi-agent inverse-reinforcement-learning approach produced more accurate motion prediction than state-of-the-art imitation-learning baselines, including single-agent IRL, behavioral cloning, and PECNet, on an RMSE-position plot where lower is better. In the question period, she says the improvement was roughly 30% in prediction accuracy.

≈30%: reported improvement in prediction accuracy in the pedestrian-interaction study, according to Mehr in Q&A

The result is presented as evidence for a specific claim: when data are limited and interaction is the object of study, learning the cost or reward structure from multi-agent demonstrations can be more effective than learning from isolated behavior.

Policy imitation has a multi-agent failure mode: exploitability

Mehr also considers the simpler alternative: skip reward learning and imitate policies directly. If large datasets of interactions are available, behavioral cloning seems attractive. The question her group studied is whether multi-agent behavioral cloning differs from the single-agent case in a fundamental way.

In single-agent imitation learning, one standard concern is value gap: how much value is lost by playing an approximate learned policy rather than the expert policy. In a multi-agent domain, Negar Mehr argues, this is not sufficient. A second concern appears: exploitability.

Exploitability captures how much regret agents can experience because, in a multi-agent setting, other agents are not fixed parts of the environment. If one agent plays a suboptimal learned policy, another agent can react to that suboptimality. Small deviations from the expert policy can propagate through strategic response and push the system away from the training distribution.
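A small matrix game makes the quantity concrete. The sketch below uses one standard definition, the sum over agents of how much each could gain by best-responding to the others, which is assumed here to correspond to the notion in the talk. The matching-pennies cost matrix and the "cloned" policy are illustrative choices.

```python
import numpy as np

# Matching pennies written as costs: C1[a1, a2] for agent 1, C2 = -C1.
C1 = np.array([[1.0, -1.0], [-1.0, 1.0]])
C2 = -C1

def exploitability(p1, p2):
    # Cost each agent actually incurs under the joint policy.
    j1 = p1 @ C1 @ p2
    j2 = p1 @ C2 @ p2
    # Cost each agent could reach by best-responding to the other.
    best1 = (C1 @ p2).min()
    best2 = (p1 @ C2).min()
    return float((j1 - best1) + (j2 - best2))

expert = np.array([0.5, 0.5])   # equilibrium policy of this game
cloned = np.array([0.6, 0.4])   # slightly inaccurate imitation of it

print(exploitability(expert, expert))
print(exploitability(cloned, expert))
```

The cloned policy incurs the same expected cost as the expert against an unchanged opponent, so the value gap alone looks harmless. The exploitability is nonetheless positive, because the opponent can shift its own policy to take advantage of the imitation error.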

The theoretical result Mehr presents is stringent. Low exploitability can be achieved only with full state support and exact matching of state-action occupancy measures. In finite state and action spaces, that means the data and learned policy must cover all relevant states and state-action pairs with exact occupancy matching. Mehr calls this a “very, very, very strong assumption.”

The implication is that imitation accuracy alone is not the right target for reliable multi-agent policies. To learn policies that are not easily exploited, one must add structural assumptions about the strategic interaction. Mehr mentions dominant strategies as one such strong assumption: roughly, that agents will not deviate too much in response to others.

The issue becomes more severe when interactions are multimodal. Even if a policy imitates two observed modes well, it may not generalize to four unobserved modes. Reward learning may recover additional modes because they arise from the underlying objective structure. Behavioral cloning, by contrast, depends on demonstration coverage: if a behavior is not in the data, the policy has little reason to learn it.

Mehr’s lab also explored the case where the demonstration data do contain multiple modes. Standard learning algorithms trained on multimodal data can collapse through mode averaging. To address this, the group used diffusion policies, motivated by their ability to represent multimodal trajectory distributions. The robotics-specific constraint is decentralization: each agent must execute its own policy without a central controller deciding the joint action at runtime.

In the collaborative-transport example, two robot arms lift an object and can coordinate by moving to the left or right of an obstacle. The policies are trained jointly on demonstrations but executed independently. Mehr reports that decentralized diffusion policies allowed the robots to implicitly coordinate. Across rollouts, the robots chose the left mode about half the time and the right mode about half the time. The important point is not the exact split but the preservation of multiple coordination modes under decentralized execution.

The lesson she draws is narrower than “use diffusion.” To imitate coordination, a robot must learn from demonstrations of multi-agent interactions. That can be done through reward learning, behavioral cloning, or diffusion-style policy learning, but the data must contain interaction, not merely individual behavior.

Reinforcement learning from scratch struggles without a coach

After imitation, Mehr asks whether multi-agent reinforcement learning can learn interaction without demonstrations. Her example is the same two-arm collaborative lifting task. The group tried to train the behavior from scratch using RL, with extensive reward shaping. Mehr says the best effort still failed to produce reliable coordination, even though the student working on the problem had years of experience with reward shaping and RL in robotics.

Her diagnosis is blunt: multi-agent RL in the real world remains far more limited than single-agent RL. The question then becomes how humans learn difficult interactions. Mehr points to coaching: sports teams acquire complex coordination not only through reward signals, but through curricula, feedback, and credit assignment.

That motivates her lab’s use of foundation models as coaches. The models are not used as low-level action generators during deployment. They are used during training to help structure the learning problem.

Mehr decomposes coaching into four roles: generating a curriculum, defining subtasks or rewards, providing feedback on policy performance, and assigning credit among agents. The first version she describes is single-agent humanoid locomotion. An off-the-shelf language model is prompted with the environment, robot, and target task. It breaks “humanoid running” into a sequence: basic stability, walking, increasing speed, and the target running task. The model then helps generate reward code for each subtask, and policies are trained sequentially.

Mehr stresses that this was not a heavily engineered model-training project. There was no model fine-tuning. The language model supplied task decomposition and reward guesses. She says the group was able to train a robot to run backward within six months, contrasting that with hand-tuned locomotion reward functions containing many parameters accumulated through years of trial and error in another lab.

The crucial observation was that one-shot reward generation for the final task did not work, even with many iterations. Breaking the task into simpler subtasks made the reward-design problem tractable enough for the model to help.

For multi-robot interaction, the coaching system became more complex. Mehr says their coaching system itself became multi-agent: one module generated curriculum, while other modules evaluated policies and refined advice. The underlying training algorithm remained multi-agent PPO. The difference was the coaching wrapper around it.

In the previously failing two-arm pot or basket task, the coached system learned a sequence: align with the handle, grasp it, then lift and balance together, all in decentralized execution. In a seesaw task, one quadruped learns to stabilize the seesaw so another can climb it. Mehr contrasts these outcomes with MAPPO baselines that fail to do anything useful in the shown examples.

The coaching approach is also applied to human-robot collaborative transport with quadrupeds. Mehr says she recently tried the demo herself and found it surprisingly easy, describing closing her eyes while the robot helped move things around. She treats this as evidence that coaching may help with tasks that are difficult to formalize directly.

During questions, she clarifies how language or vision models are connected to numerical training. In the bipedal setting, they initially gave numerical trajectories directly to the language model, and it “somehow” worked despite her expectation that it would not. For coordination, they found sequences of images (initial, intermediate, and final frames) more informative for a vision-language model. Later, an even more effective input was the training curve itself, especially when decomposed into separate reward-feature plots rather than a single total reward. Mehr says she does not know why plots worked best, but in their experience the representation of rollout information given to the model mattered more than the advice mechanism itself.

Credit assignment is where language models act as critics

The hardest coaching role Mehr isolates is credit assignment. In a team task, the group can fail even when one agent did the right thing. If Alice goes to the wrong place and Bob goes to the correct location for a two-agent apple-picking task, the team still receives no reward. Without a mechanism to distinguish Bob’s useful contribution from Alice’s failure, learning can stall.

Negar Mehr draws a human analogy from group projects: instructors routinely hear that one teammate did the work while another did not contribute. Humans are relatively good at judging contribution from context. Standard multi-agent RL training loops are not.

Her lab therefore uses an LLM as a critic inside the training loop. A trajectory is rolled out, then the language model is asked which agents should receive credit and which should not. The model is not retrained; it is prompted to act as a credit assigner. That individualized feedback is then integrated into standard multi-agent RL.
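The shape of that loop can be sketched without any model at all. Everything in the block below is hypothetical: `stub_critic` is a keyword rule standing in for a prompted language model, the text summaries are invented, and adding a fixed bonus to credited agents is only one assumed way of folding the critic's judgment into per-agent rewards. The talk does not specify the integration rule.

```python
from typing import Callable, Dict

Summary = Dict[str, str]                        # agent name -> what it did
Critic = Callable[[Summary], Dict[str, bool]]   # agent name -> deserves credit?

def stub_critic(summary: Summary) -> Dict[str, bool]:
    # Placeholder for a prompted language model (hypothetical keyword rule).
    return {agent: "correct" in text for agent, text in summary.items()}

def individual_rewards(team_reward: float, summary: Summary,
                       critic: Critic, bonus: float = 0.1) -> Dict[str, float]:
    # Ask the critic who contributed, then add a bonus on top of the shared
    # team reward for each credited agent (assumed integration rule).
    credit = critic(summary)
    return {agent: team_reward + (bonus if credit[agent] else 0.0)
            for agent in summary}

# The apple-picking failure case: the team reward is zero, but Bob acted well.
rollout = {"alice": "went to the wrong tree", "bob": "went to the correct tree"}
rewards = individual_rewards(team_reward=0.0, summary=rollout, critic=stub_critic)
print(rewards)
```

The per-agent rewards would then be passed to an otherwise standard multi-agent RL update. Keeping the critic behind a plain callable makes it easy to swap the stub for a real model call during training while leaving the deployed policy untouched.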

In a robotic warehouse example with multiple robots moving packages, Mehr reports that an LLM-critic method outperformed MAPPO, QMIX, and LICA by orders of magnitude in reward across small and medium scenes with different agent counts. The plotted green curve for the lab’s method is visibly far above the baselines, which she says are so low as to be hard to see. She also mentions preliminary work using a VLM critic to teach two humanoids to work with each other.

The claim is carefully bounded. Foundation models are not replacing control policies at inference time. They are used during training to provide high-level structure, feedback, or credit. In Q&A, when asked about latency and deployment, Mehr emphasizes that the language model is not queried during execution. Once the policy is trained, it is deployed as a policy. The cost is in training, where queries to the model can make training slow and must be used strategically.

She also says the models perform best when asked for high-level decisions, not low-level actions. In ongoing humanoid work, low-level model-based controllers have been effective when combined with high-level coaching. In their setup, an RL policy alone seemed unlikely to discover accurate behavior from scratch, although she notes that others have pursued related approaches.

Asked what would matter for scaling this kind of coaching, Mehr identifies three ingredients from her experience: task breakdown, the form of evaluation given to the model, and the abstraction of rollout information. With the right coaching, she says, the choice of RL algorithm mattered less than the decomposition of the task and the information supplied for evaluation. Training curves, in particular, were a compact representation the model could use without being overwhelmed.

Multi-agent perception becomes a lens on continual learning

Mehr briefly turns from planning and learning to perception. Her lab has worked on collaborative scene mapping, where multiple robots combine traditional distributed-optimization tools with neural representations such as NeRFs and Gaussian splats. She traces part of this line to auditing a class on distributed optimization and consensus ADMM during her postdoc, a topic she says unexpectedly became central to many projects in her lab.

The multi-agent perception work asks how several agents can collaboratively map an environment. The newer twist is to apply a multi-agent view to a nominally single-agent problem: continual learning. A robot observing a changing environment must learn new information without forgetting old information. Mehr describes the lab’s realization this way: continual learning can be viewed as reaching consensus with past versions of oneself.

Instead of retaining all past data, the learner can maintain a form of consensus over model parameters across time. In a mapping example, a yellow chair is moved from one location to another. Replay-based methods that keep past data produce two chairs: the old and the new. Regularization-based methods resist forgetting but fail to adapt to the scene change. Mehr says a simple ADMM-style consensus in neural-network weight space can identify which parameters need to change to reflect the new scene and which should remain stable.
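For readers unfamiliar with consensus ADMM, the mechanics can be shown on a toy problem. The sketch below is not the lab's continual-learning method: the two "agents" are an old and a new parameter fit, each pulling toward its own target with an equally weighted quadratic loss, and all names and values are assumptions. It shows only how local copies, a shared consensus variable, and dual variables interact.

```python
import numpy as np

def consensus_admm(targets, rho=1.0, iters=200):
    # Each row of targets is one agent's preferred parameter vector, with
    # local loss 0.5 * ||w_k - target_k||**2 (illustrative choice).
    targets = np.asarray(targets, dtype=float)
    w = np.zeros_like(targets)           # local parameter copies
    u = np.zeros_like(targets)           # scaled dual variables
    z = np.zeros(targets.shape[1])       # shared consensus parameters
    for _ in range(iters):
        # Local step: closed-form minimizer of loss plus consensus penalty.
        w = (targets + rho * (z - u)) / (1.0 + rho)
        # Consensus step: average the dual-corrected local copies.
        z = (w + u).mean(axis=0)
        # Dual step: accumulate each copy's disagreement with the consensus.
        u = u + w - z
    return z, w

old_fit = np.array([1.0, 0.0, 3.0])   # parameters fitted to the earlier scene
new_fit = np.array([1.0, 2.0, 3.0])   # parameters fitted to the changed scene
z, w = consensus_admm([old_fit, new_fit])
print(z)
```

Only the coordinate on which the two fits disagree moves away from both targets, while the coordinates they share stay where they were. In this toy the disputed coordinate settles on a plain average. A method aimed at scene changes would need loss terms that let new evidence win where the scene has actually changed.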

She connects this back to the seminar’s broader theme: multi-agent structure can be useful even when the “agents” are temporal versions of the same model. The technical motif is the same as in the planning work: find structure in the interaction, then exploit that structure computationally.

Safety has to be enforced inside learned policies, not added afterward

The safety portion is brief but direct. Mehr says her lab has been thinking about how to enforce safety constraints within the training pipeline of learned policies, including both behavioral cloning and reinforcement learning. The aim is provable constraint enforcement, not merely empirical avoidance.

The example shown is dynamically admissible trajectory generation using diffusion policies. In the open-loop diffusion setting, the dynamics of the robot are enforced directly in the diffusion generation pipeline, producing dynamically feasible trajectories even for underactuated systems. She also references work on closed-loop robot control policies with provable satisfaction of hard constraints.

Mehr does not present safety as solved. Her point is that as robots move closer to real-world integration, safety constraints become more important, and they must be considered in the policy-generation and training process itself.

The useful data are the tightly coupled moments

In the Q&A, Mehr is asked what human-interaction data would be most useful and which real-world human-robot problems need these strategies. She says she does not have a complete answer, partly because it is not fully clear which differences make these problems hard. But she identifies one kind of data as especially valuable: tightly coupled interactions.

For navigation, the most informative data are not long stretches where people are far apart and barely influence each other. They are the moments when humans get close enough that their decisions strongly affect one another. For physical collaboration, the coupling is even richer. Carrying a table together requires implicit coordination, force understanding, and all the challenges of single-agent manipulation in a multi-agent setting.

Mehr also emphasizes heterogeneity. She does not believe there is a single nominal human model. People differ, cultural norms differ, and social conventions matter, as in her Singapore walking example. Human feedback can also be biased: when a robot-human team succeeds, a human may attribute success to themselves; when the team fails because of a human mistake, the human may blame the robot.

That is why sample efficiency matters. Humans are not simulators from which unlimited data can be collected. If robots are to learn from people, especially in preference-sensitive interaction tasks, they need to learn quickly from informative interaction data.
