Neuro-Symbolic Planning Makes Robot Learning More Data-Efficient

Jiayuan Mao · Stanford Online · Wednesday, May 20, 2026 · 17 min read

Jiayuan Mao, a Member of Technical Staff at Amazon Frontier AI & Robotics and incoming University of Pennsylvania assistant professor, argues in a Stanford Robotics Seminar that robot learning should be built around planning over compositional world models rather than direct policy fitting alone. His case is that neuro-symbolic systems — neural models embedded in symbolic constraint graphs for objects, relations, actions and effects — can learn from few demonstrations, compose skills at inference time and generalize to new objects, states and goals more reliably than end-to-end policies.

Mao contrasts direct policy fitting with planning over learned world models

Jiayuan Mao frames the problem as general-purpose physical intelligence: systems that perceive, understand language, and act in the physical world. One common route treats that goal as function fitting. A policy maps historical observations to the next action, and the system is trained from teleoperation or other collected data.

Mao’s concern is not that this approach has failed. He points to visible progress: robot arms folding cardboard boxes, policies controlling excavators, and demonstrations from companies such as Physical Intelligence and Actor Labs. But the examples also expose the cost of the paradigm. In one highlighted excavator demo, the on-screen post says the team had collected “a massive corpus of real-world data with natural language labels” and that its first successful task demo used “just 200 trajectories.” For a manipulation task Mao characterizes as simple, he treats that as evidence of low data efficiency: tasks that appear easy to humans still require many demonstrations for policy-learning systems.

The contrast with people is the organizing claim. A person can watch an excavator operator roll and stand up a heavy stone cylinder using the bucket, infer something about the trick, and, with practice, adapt it. A person can also face a contrived manipulation problem — holding a pen in one hand and a cap in the other, then using those two tools to lift a glass without touching it — and make an internal plan after seeing only the situation or a single demonstration. The task may be novel and useless, but the planning structure is not opaque to human cognition.

Mao argues robotics should aim for that kind of few-shot generalization: learning new capabilities from one to ten demonstrations and applying them reliably to new states, new objects, and new goals. His proposed route is to combine machine learning with planning. Instead of only using a policy that maps observation history directly to action, the robot should use a world model: what objects exist, what properties they have, what actions are possible, and what those actions are likely to do before execution.

That is where Mao places “neuro-symbolic concepts.” The symbolic side is the compositional abstraction over states and actions: objects, relations, constraints, task skeletons, contact sequences, and effects. The neural side supplies learned models that estimate correspondences, generate trajectories, predict future states, or score spatial relations. Planning then composes those pieces under constraints.

The key idea is to plan with such kind of a compositional abstraction of states and actions, to solve the problem and to achieve the generality.

Jiayuan Mao

The repeated technical pattern is consistent across Mao’s examples: formulate a robotics problem as constraint satisfaction or constrained optimization; learn only the parts that are hard to specify by hand; use classical geometry, physics, motion planning, or search where they are reliable; and compose the learned and engineered pieces at inference time.

Actions become composable when they are constraints, not isolated policies

Jiayuan Mao’s first technical move is to redefine what an action model should contain. A simple answer is that an action is a policy: given the current state, output controls. But this breaks down when actions must be stitched together.

His example is the two-tool glass-lifting task. The robot might decompose the task into picking up the pen, picking up the cap, and pivoting or lifting the glass. But the way it grasps the pen and cap cannot be chosen independently of the later pivot. A grasp that succeeds for merely holding the pen may fail for using the pen as a functional tool against the glass. If each action is learned as a separate policy and executed one after another, the sequence may not compose.

The alternative is to generate actions through constrained optimization. For a plate-picking task, Mao describes path constraints such as staying within joint limits and avoiding collisions, plus subgoal constraints such as eventually holding the target. For a pick-and-place task, constraints accumulate over time: first the robot must hold the plate, then continue holding it while moving, then place it on the rack. Mathematically, the robot searches for trajectories and intermediate states that minimize a cost such as total trajectory length, subject to dynamics, collision avoidance, grasping, holding, and final placement constraints.
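One way to write the pick-and-place problem described above is sketched below. The notation is an assumption introduced here for illustration and does not come from the talk: x_t is the world state at step t, q_t is the robot's joint configuration within it, u_t is the control, f is the dynamics model, and t_1 is the step at which the grasp is first achieved.

```latex
\begin{aligned}
\min_{x_{0:T},\; u_{0:T-1},\; t_1} \quad
  & \sum_{t=0}^{T-1} \lVert q_{t+1} - q_t \rVert
  && \text{(total trajectory length)} \\
\text{s.t.} \quad
  & x_{t+1} = f(x_t, u_t), \quad t = 0, \dots, T-1
  && \text{(dynamics)} \\
  & q_{\min} \le q_t \le q_{\max}, \quad t = 0, \dots, T
  && \text{(joint limits)} \\
  & \mathrm{collision\_free}(x_t), \quad t = 0, \dots, T
  && \text{(collision avoidance)} \\
  & \mathrm{holding}(x_t, \text{plate}), \quad t = t_1, \dots, T-1
  && \text{(grasp, then keep holding)} \\
  & \mathrm{on}(x_T, \text{plate}, \text{rack})
  && \text{(final placement)}
\end{aligned}
```

Written this way, the path constraints apply at every step, while the subgoal constraints switch on at particular times, which is what lets constraints from consecutive actions accumulate in one problem.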

This formulation matters because it separates learning into two levels. The first question is: what constraints define successful execution of a task? The second is: what continuous values — trajectories, contact points, grasp poses, object poses — satisfy those constraints? Not all constraints must be learned. Rigid-body dynamics can be handled by simulators. Geometric constraints can be handled by motion planners. Task-relevant constraints, such as where to contact an object to hang it, can be learned from data.

Mao’s one-shot hanging example makes the division concrete. The demonstration shows a robot picking up a yellow hanger and placing it on a rod. The target generalization asks the robot to hang a mug on a mug tree. This is not simply “pick up mug, move to tree.” The grasp and final hanging pose constrain each other. If the mug is grasped by the handle, there may be no way to place the handle onto the mug tree. The robot must jointly choose contact points for grasping and hanging.

The proposed system learns the task as a contact analogy. In the demonstration slide, Mao highlights the relevant contact point. Given a novel object, the system asks: what point on this new object has the same functional role as the demonstrated point? Mao says the implementation uses pretrained visual features, specifically DINOv2 in the shown case, to compute correspondences. But those correspondences are not trusted as final answers. The visual heat map is noisy and produces wrong proposals. It is used as guidance for a model-based planner, which verifies candidates through motion planning, grasping analysis, and stability analysis.

That hybrid is the point. A blind planner that randomly samples possible contact points would search an enormous space. A pure correspondence policy would make many mistakes. Visual features narrow the search; physical analysis verifies the result.
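The propose-then-verify pattern can be sketched in a few lines. This is an illustrative toy and not the system's implementation: the function name `propose_and_verify`, the `is_feasible` callback, and the scored candidates are all assumptions standing in for the correspondence heat map and for the motion-planning, grasp, and stability checks.

```python
from typing import Callable, Iterable, Optional, Tuple

Point = Tuple[float, float, float]


def propose_and_verify(
    candidates: Iterable[Tuple[Point, float]],
    is_feasible: Callable[[Point], bool],
    max_checks: int = 50,
) -> Optional[Point]:
    """Return the highest-scoring candidate that passes the physical check."""
    # Rank proposals by (noisy) correspondence score, best first.
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    # Only verify the top few, so the expensive check runs on a narrowed set.
    for point, _score in ranked[:max_checks]:
        if is_feasible(point):
            return point
    return None


# Toy proposals: (contact point, correspondence score). Values are made up.
scored = [
    ((0.0, 0.0, 0.1), 0.9),  # top visual match, but rejected by the check below
    ((0.1, 0.0, 0.1), 0.7),
    ((0.0, 0.2, 0.1), 0.4),
]


def toy_check(p: Point) -> bool:
    # Hypothetical stand-in for motion planning + grasp + stability analysis.
    return p[0] > 0.05


best = propose_and_verify(scored, toy_check)
```

In this toy run the top-scoring proposal fails verification and the second one is returned, which mirrors the division of labor described above: scores order the search, and the check has the final say.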

The shown results include a real robot hanging a yellow mug on a wooden mug tree after learning from the hanger demonstration, then generalizing to other mugs and kitchenware. Mao also highlights a harder stress test: 3D-printed alphabetic shapes, which look unlike mugs or hangers and vary substantially in geometry, are hung on the mug tree. The visible benchmark chart compares one-shot policy learning at 0%, policy learning with correspondences at 24%, and Mao’s approach at 93%.

Method	Success rate shown
Policy learning	0%
Policy learning with correspondences	24%
Mao’s method	93%

The seminar’s benchmark chart for one-shot hanging on 3D-printed alphabetic shapes.

Mao uses the result to draw a narrower lesson than “neural methods are bad” or “symbolic methods are enough.” The successful system relies on neural representations to propose functional correspondences and on physical models to check stability and feasibility inside a constrained optimization framework. In his phrase, the system gets “the best of both worlds”: efficiency and generalization.

He then extends the same idea to a slightly longer-horizon manipulation setting. A two-armed robot holds two sticks and learns three skills — rotate, push, and lift — with one demonstration per skill. It then generalizes those individual skills to unseen target shapes and composes them to lift letters and other simple rigid objects from a table. Mao is careful about the limitation: the objects are simple and rigid, and the setting is constrained. But he presents it as an early sign that learned, generalizable actions can be composed rather than merely replayed.

Spatial reasoning becomes a graph first and exact poses second

Jiayuan Mao’s second extension moves from contact-rich short-horizon manipulation to object arrangement, specifically setting up tables. Here the robot must transform a language instruction into spatial relationships and then into exact object poses.

A representative instruction is: “Let’s set up a breakfast table! No stacks today.” Mao formulates the arrangement as a spatial constraint satisfaction problem. Given object shapes — an apple, plate, spoon, mug, and Cheez-It box in the slide — find object poses such that the apple is left of the plate, the spoon is right of the plate, the plate is near the front edge, and relevant objects are horizontally or vertically aligned.

The symbolic representation is an abstract spatial relationship graph. Objects are nodes; relations such as left-of, right-of, near-front-edge, horizontally aligned, and vertically aligned are constraints. Once the problem is written this way, learning again happens at two levels: learning the constraint set and finding the values that satisfy it.
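As a minimal sketch, the relationship graph can be held in a small data structure. The class names `Relation` and `SpatialGraph` are hypothetical, and the choice of which objects are horizontally aligned is an assumption made for the example; the talk only says that relevant objects are aligned.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Relation:
    name: str               # e.g. "left-of"
    args: Tuple[str, ...]   # object names the relation constrains


@dataclass
class SpatialGraph:
    objects: List[str]
    relations: List[Relation] = field(default_factory=list)

    def add(self, name: str, *args: str) -> None:
        # Reject relations that mention objects not present in the scene.
        unknown = [a for a in args if a not in self.objects]
        if unknown:
            raise ValueError(f"unknown objects: {unknown}")
        self.relations.append(Relation(name, args))

    def constraints_on(self, obj: str) -> List[Relation]:
        # All constraints whose satisfaction depends on this object's pose.
        return [r for r in self.relations if obj in r.args]


graph = SpatialGraph(["apple", "plate", "spoon", "mug", "cheez_it_box"])
graph.add("left-of", "apple", "plate")
graph.add("right-of", "spoon", "plate")
graph.add("near-front-edge", "plate")
graph.add("horizontally-aligned", "apple", "plate", "spoon")  # assumed grouping

plate_constraints = graph.constraints_on("plate")
```

Nothing in the graph fixes a pose. It only records which constraints a later continuous solver has to satisfy for each object.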

Mao argues that many constraint-set decisions are commonsense knowledge rather than robot-control knowledge. Which object should go on the left of which? What does “breakfast table” imply? What does “no stacks” forbid? These are the kinds of symbolic-level judgments he says can be delegated to large language models or vision-language models rather than learned entirely from robot demonstrations.

The system starts with a language goal and an image containing object shapes. A vision-language model generates the abstract spatial relationship graph. Additional examples can be provided to express preferences. Mao gives the example of showing how his family usually sets a dining table, so the generated arrangement reflects that style. The next step is continuous: a compositional diffusion model generates exact object poses, and the robot then moves objects into those poses.

The diffusion component is not presented as a monolithic scene generator. It is a library of relation-specific models. For each relationship type, such as left-of, there is a dedicated diffusion model. Mao asks the audience to think of each such model as an energy function over object shapes and poses: how well do the inputs satisfy “A is left of B”? The global arrangement problem becomes minimizing the sum of energies for all constraints in the graph.

In practice, he says, the diffusion models predict gradients over inputs rather than scalar energy values. If moving an object left would better satisfy the left-of relation, the model predicts a gradient in that direction. To compose constraints, the system adds the predicted gradients from the relevant relation models. The composed field then points toward placements that jointly satisfy the constraints.

Mao shows individual fields for four constraints: the plate near the front edge, the plate in the central column, the apple left of the plate, and the spoon right of the plate. Adding them forms a composed field from which the system samples a plate pose. He mentions using the Unadjusted Langevin Algorithm to sample and says the paper proves the method gives an unbiased sample of the composed distribution; he skips the mathematical details in the talk.
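The composition step can be illustrated with a toy sampler. Everything below is an assumption made for the sketch: the four hand-written functions stand in for the learned relation-specific diffusion models, the apple and spoon are held fixed, and the constants and step size are arbitrary. Only the structure follows the description: sum the per-relation gradients, then run unadjusted Langevin updates on the plate pose.

```python
import math
import random

APPLE_X, SPOON_X = -0.3, 0.3  # fixed neighbours on the table (toy values)
FRONT_Y = 0.1                 # y-coordinate near the front edge (toy value)
SIGMA = 0.05                  # width of the soft "near" constraints
MARGIN = 0.15                 # minimum horizontal gap for left-of / right-of
K = 400.0                     # stiffness of the hinge penalties


def central_column(x, y):
    return (-(x - 0.0) / SIGMA**2, 0.0)


def near_front_edge(x, y):
    return (0.0, -(y - FRONT_Y) / SIGMA**2)


def apple_left_of_plate(x, y):
    # Pushes the plate right only while it is too close to the apple.
    return (K * max(0.0, MARGIN - (x - APPLE_X)), 0.0)


def spoon_right_of_plate(x, y):
    # Pushes the plate left only while it is too close to the spoon.
    return (-K * max(0.0, MARGIN - (SPOON_X - x)), 0.0)


RELATIONS = [central_column, near_front_edge, apple_left_of_plate, spoon_right_of_plate]


def composed_score(x, y):
    # Composition is just the sum of the per-relation gradient fields.
    gx = gy = 0.0
    for relation in RELATIONS:
        dx, dy = relation(x, y)
        gx += dx
        gy += dy
    return gx, gy


def langevin_sample(x, y, steps=2000, step=1e-3, rng=None):
    # Unadjusted Langevin: gradient step plus Gaussian noise, no accept/reject.
    rng = rng or random.Random(0)
    noise = math.sqrt(2.0 * step)
    for _ in range(steps):
        gx, gy = composed_score(x, y)
        x += step * gx + noise * rng.gauss(0.0, 1.0)
        y += step * gy + noise * rng.gauss(0.0, 1.0)
    return x, y


plate_x, plate_y = langevin_sample(0.8, 0.9)  # start far from any valid pose
```

Adding or removing an entry in `RELATIONS` changes the problem without touching the other functions, which is the inference-time composition property the talk attributes to the real system.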

The operational advantage is inference-time composition. Once relation-specific models are trained, new sets of relationships can be assembled without retraining. The same learned building blocks support study desks, coffee tables, and dining tables.

The real-robot version adds motion constraints. In the shown setup, two robot arms sit on opposite sides of a table, and each has its own workspace. The placement plan must satisfy the spatial relations predicted from language and the motion constraints imposed by the arms’ reachability. The robot synthesizes a feasible solution and executes the arrangement. Mao’s summary is that few-shot object arrangement comes from composing neural diffusion models inside a neuro-symbolic constraint optimization framework, while using large language models for commonsense symbolic knowledge.

Long-horizon manipulation needs action models that predict their effects

Spatial arrangement still does not solve temporal planning. Long-horizon manipulation requires the robot to decide which actions to take, in what order, and whether intermediate states will make later actions possible. Jiayuan Mao’s third system learns planning-compatible action models from a small number of demonstrations.

The setup uses around 10 teleoperated trajectories for a task family. In the dishwashing example, each demonstration shows washing a single plate. The target is a longer-horizon task such as washing and placing two plates on a dish rack. This is not merely repetition. Mao notes that the order matters: the movement of a farther plate may be blocked by the initial position of the first plate, so the robot must reason about physical feasibility before committing to an order.

The system uses language to impose structure on demonstrations. A video-language model, which Mao identifies as Gemini “back in the days,” segments each long demonstration trajectory into meaningful temporal chunks and assigns action names such as move, grasp, adjust, or place. Vision models segment the raw observations into object point clouds. For each step, the system identifies relevant objects and tracks the trajectories of object point clouds and the robot hand.

This produces a few examples — on the order of one to ten — for each individual skill. But planning needs more than skill execution. It needs a model of what will happen if the skill is applied. Mao therefore represents each action with two parts: a body and an effect. The body is a trajectory constraint, used to generate feasible trajectories for the action. The effect is a state predictor, used to estimate the future state after executing a sampled trajectory.

For a “place book on shelf” action, the trajectory model takes inputs such as the initial gripper pose and object point clouds and generates possible trajectories using a diffusion model. The model may generate placements to the left or right depending on the demonstrations and current scene. The state predictor takes the initial state and the inferred trajectory and predicts where relevant object points will be afterward. Mao notes that other state changes describable in language could be included, but the presented focus is geometric change such as object pose.

Planning then uses a high-level task skeleton, samples trajectories or target poses for the actions, simulates their effects, and rejects physically infeasible plans. In the bookshelf example, the language goal is to place a green book on a shelf. A possible skeleton is: pick up the green book, adjust the pink book left, place the green book. The system can internally simulate moving the pink book left and then placing the green book. In one sampled trajectory, the placement collides with a clock, so the plan fails. Another sampled trajectory places the book on the left and succeeds, after which the robot executes the plan: pick up the book, push the pink book left, and insert the green book.
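The sample, simulate, and reject loop can be sketched on a one-dimensional toy shelf. This is not the learned system: the uniform samplers stand in for the diffusion-based trajectory models, the interval arithmetic stands in for the learned state predictor, and all names and numbers are hypothetical. Each `Action` carries a `sample` function (the body) and an `effect` function, matching the two-part action model described earlier.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

Interval = Tuple[float, float]
State = Dict[str, Interval]   # object name -> (left, right) extent on the shelf
GREEN_WIDTH = 0.12            # toy width of the green book


@dataclass
class Action:
    name: str
    sample: Callable[[State, random.Random], float]  # body: propose a parameter
    effect: Callable[[State, float], State]          # effect: predict next state


def overlaps(a: Interval, b: Interval) -> bool:
    return a[0] < b[1] and b[0] < a[1]


def feasible(state: State) -> bool:
    spans = list(state.values())
    on_shelf = all(0.0 <= lo and hi <= 1.0 for lo, hi in spans)
    clear = all(not overlaps(spans[i], spans[j])
                for i in range(len(spans)) for j in range(i + 1, len(spans)))
    return on_shelf and clear


def shift_pink(state: State, d: float) -> State:
    lo, hi = state["pink_book"]
    return {**state, "pink_book": (lo - d, hi - d)}


def place_green(state: State, x: float) -> State:
    return {**state, "green_book": (x, x + GREEN_WIDTH)}


SKELETON = [
    # Picking up has no geometric effect in this 1-D toy, so it is a no-op.
    Action("pick up green book", lambda s, r: 0.0, lambda s, p: dict(s)),
    Action("adjust pink book left", lambda s, r: r.uniform(0.0, 0.3), shift_pink),
    Action("place green book", lambda s, r: r.uniform(0.0, 1.0 - GREEN_WIDTH), place_green),
]


def plan(start: State, skeleton, rng: random.Random, tries: int = 200):
    for _ in range(tries):
        state, steps = dict(start), []
        for action in skeleton:
            param = action.sample(state, rng)
            state = action.effect(state, param)   # simulate before executing
            if not feasible(state):
                break                              # e.g. book collides with clock
            steps.append((action.name, param))
        else:
            return steps, state                    # every step stayed feasible
    return None


start = {"pink_book": (0.30, 0.40), "clock": (0.80, 1.00)}
result = plan(start, SKELETON, random.Random(0))
```

A sampled placement that overlaps the clock is discarded inside the loop, and the planner tries again, so only a fully feasible sequence is ever returned for execution.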

The claimed generalization depends on the structure of the training and testing split. Mao says training demonstrations have objects axis-aligned, with at most two books stacked. The system is tested on new heights, new positions, new orientations, and novel obstacles. He says the learned planning-compatible model finds plans across these settings, while an end-to-end policy trained to solve the whole task directly cannot do so; he refers to additional comparisons in the paper.

The mug-tree task is revisited as a long-horizon planning problem. Training data consists of 10 demonstrations per mug, and each demonstration hangs only one mug through a two-step sequence: grasp mug, hang mug. The target task is to hang multiple mugs on the mug tree — three mugs in a six-step task or four mugs in an eight-step task. Now the robot must decide which mug goes on which branch and in what order, because future hanging actions may collide with mugs already placed. Mao describes the model as making deliberate plans and executing them physically.

Task family	Training data described	Longer-horizon test
Dishwashing	Around 10 teleoperated trajectories; each demo washes one plate	Wash and place two plates on the dish rack
Book placement	Demonstrations with axis-aligned objects and at most two books stacked	New heights, positions, orientations, and obstacles
Mug-tree hanging	10 demonstrations per mug; each demo hangs one mug	Hang three mugs in six steps or four mugs in eight steps

Mao’s long-horizon examples keep demonstrations short but test planning over longer sequences.

The summary is consistent with the earlier systems: learned trajectory-generation models produce possible actions; learned transition models predict future states; planning composes them and selects a sequence. Large vision and language models do object segmentation, relevant-object identification, and commonsense symbolic work, but the physical feasibility of actions is handled through learned low-level models and planning.

Foundation models make composition more important, not less

Jiayuan Mao anticipates the objection that his examples emphasize low-data regimes, while robotics may eventually benefit from large-scale foundation models. His answer is not that scale is irrelevant. It is that neuro-symbolic structure remains useful scientifically, architecturally, and as a way to compose foundation models.

First, he argues that neuro-symbolic systems provide scientific insight into tasks and learning. A purely black-box framing gives little traction on questions such as how hard a task is, what makes it identifiable, how much data is required, or whether a finite dataset can train a model capable of the task. Mao points to related work on the expressiveness of graph neural networks and transformers, parameterized circuit complexity for transformer policies, and identifiability and sample complexity of transformers. The unifying point is that symbolic structure can make task and model complexity analyzable rather than merely empirical.

Second, he says practical robot systems are inherently compositional because of data availability and latency. A deployed robot is not just a single learned model; it must combine perception, memory tracking, high-level planning, and low-level control, often at different frequencies. Mao presents Retriever, a programming framework for closed-loop robot agents, as an example of this systems view. Its principles include asynchronous robot action and time-explicit typing: modules explicitly annotate their control frequency and how they compose with other modules.

The Retriever demo shows two robot arms working over a table with a drawer set and a plate of food under the instruction “Season the steak with chili flakes.” The overlay describes the system as “asynchronous robot agent programming with time-explicit typing,” running autonomously at 2x playback. The system searches drawers, updates memory when the chili flakes are not found in one drawer, revises the plan, and controls robot actions. Mao emphasizes that the video is close to true speed and says asynchronous execution and explicit treatment of time make the system efficient and smooth.

In response to a question about whether planning is open-loop and whether closed-loop execution is possible, Mao says the Retriever example is already closed-loop. A high-level planner may operate around 0.5 Hz, memory tracking around 1 Hz, and modules continue adjusting from new observations. The bookshelf work did not use closed-loop execution, but he says it is possible; the difficulty is engineering asynchronous modules that run at different speeds and synchronize state.
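To make the multi-rate idea concrete, here is a small single-threaded simulation in which modules declare their own rate and share state. It is not Retriever's API and it is not truly asynchronous. The `Module` class, the `run` scheduler, the 10 Hz controller rate, and the drawer scenario are assumptions for illustration; only the 0.5 Hz planner and 1 Hz memory rates come from the talk.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Module:
    name: str
    rate_hz: float                  # each module states its own frequency
    step: Callable[[float], None]
    next_due: float = 0.0


def run(modules: List[Module], duration_s: float, tick_s: float = 0.05) -> List[str]:
    """Advance a simulated clock and call each module when its period elapses."""
    log = []
    for i in range(int(round(duration_s / tick_s))):
        now = i * tick_s
        for m in modules:
            if now + 1e-9 >= m.next_due:   # small tolerance for float error
                m.step(now)
                log.append(m.name)
                m.next_due += 1.0 / m.rate_hz
    return log


# Shared state that the modules read and write (toy scenario).
state = {"drawer_1_checked": False, "plan": "open drawer_1", "commands": []}


def memory(now: float) -> None:
    # Pretend drawer 1 has been inspected and found empty by about t = 1 s.
    state["drawer_1_checked"] = now >= 0.99


def planner(now: float) -> None:
    # Replan from the latest memory rather than from the initial observation.
    state["plan"] = "open drawer_2" if state["drawer_1_checked"] else "open drawer_1"


def controller(now: float) -> None:
    # Fast loop: always act on whatever plan is current.
    state["commands"].append(state["plan"])


modules = [
    Module("memory", 1.0, memory),
    Module("planner", 0.5, planner),
    Module("controller", 10.0, controller),  # assumed rate, not from the talk
]
log = run(modules, duration_s=4.0)
```

The loop is closed because the controller keeps issuing commands while the slower modules update the plan underneath it. A real system would run these concurrently and would need the state synchronization that Mao identifies as the engineering difficulty.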

Third, Mao argues that as foundation models improve, robotics will need principled composition more, not less. Better vision models, language models, and action models still leave the problem of how to combine them under uncertainty and constraints. He contrasts this with composition that is “just like a simple chain of thoughts,” suggesting that probabilistic reasoning and planning algorithms are needed to compose model outputs into reliable action.

His imagined scaled-up system includes a model orchestrator. Given a context such as washing a plate, it would determine which constraints, features, and action models matter: has-soap, dirtiness, plate, faucet-open, detector models, physical dynamics models, relation models, and so on. Individual models could output booleans, poses, trajectories, forces, or other typed values from sensory input and a concept name. The orchestration layer would decide how to assemble them into a task-specific reasoning and planning problem.

This also changes the data story. Mao notes that robot data is hard to collect, and current systems still depend on humans providing demonstrations. His longer-term goal is continual learning for planning: start with basic compositional foundation models, use reasoning, planning, and exploration to gain new experience, turn that experience into data, train stronger foundation models, and iterate. He compares this to patterns already visible in agent research and vision-language model research, where models with different capabilities are composed and their knowledge distilled into more capable systems. His example is combining a vision-language model that can label objects from conversation context with a segmentation model, then distilling toward promptable segmentation.

The hard boundaries are semantics, preferences, and verification

The questions after the talk test where the framework ends and where unresolved design problems begin.

Asked whether today’s large language models have enough commonsense knowledge at the symbolic level, Jiayuan Mao answers that it depends on what counts as symbolic. Fine-grained geometry can be described in very long language, but robotics is still far from exploiting the full repertoire of semantic knowledge already present in large models. His point is that the field has not exhausted what those models may already contain for robotics.

Asked how future states in the bookshelf example are evaluated as successes or failures, Mao distinguishes task success from physical feasibility. In that system, a vision-language model helps generate a task skeleton using its semantic knowledge for task-goal understanding. The learned future-state prediction handles physical feasibility: no collision, stability, and related low-level constraints. The semantic model and physical transition model do different work.

A related question asks where the diffusion policy fits. Mao says the planner needs not just one trajectory but a probabilistic model of possible trajectories or target poses. For arranging books on a shelf, there may be multiple viable placements; the diffusion model represents that distribution.

Another question presses on secondary effects and preferences. A robot could remove a clock from the shelf and place a book there, or choose a path that gets the book in but scratches the shelf. How should the system verify that a behavior is preferred, not merely feasible? Mao says the specific work did not explore this much, but he suggests treating human preferences as additional utility functions factorized into the planning framework. A user preference such as placing books together rather than spreading them across the whole shelf becomes another constraint or cost added at inference time. That, for Mao, is another reason to favor compositional systems: learned pose and trajectory generators can be combined with new physical constraints or user preferences when planning.
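Mao's suggestion of treating preferences as extra utility terms can be sketched as follows. This is a hypothetical illustration rather than anything shown in the talk: the function `choose`, the shelf coordinates, and the weights are all made up. The point is only that a preference enters as one more weighted cost at selection time, with no retraining of the candidate generator.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Cost = Tuple[float, Callable[[float], float]]  # (weight, cost function)


def choose(
    candidates: Sequence[float],
    feasible: Callable[[float], bool],
    costs: List[Cost],
) -> Optional[float]:
    """Pick the feasible candidate with the lowest weighted total cost."""
    ok = [c for c in candidates if feasible(c)]
    if not ok:
        return None
    return min(ok, key=lambda c: sum(w * f(c) for w, f in costs))


# Toy shelf: positions along one axis. All numbers are illustrative.
existing_books = [0.1, 0.2]
candidates = [0.3, 0.6, 0.9]      # e.g. samples from a placement generator


def not_on_clock(x: float) -> bool:
    return x < 0.85               # the clock occupies the right end of the shelf


def reach_cost(x: float) -> float:
    return abs(x - 0.6)           # the robot reaches most easily near 0.6


def togetherness_cost(x: float) -> float:
    return min(abs(x - b) for b in existing_books)


base = [(1.0, reach_cost)]
with_preference = base + [(2.0, togetherness_cost)]  # preference added at inference

default_choice = choose(candidates, not_on_clock, base)
preferred_choice = choose(candidates, not_on_clock, with_preference)
```

With only the reach cost the book goes to the easiest spot, and adding the togetherness term moves it next to the existing books. Feasibility stays a hard filter in both cases.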

Finally, asked what makes the work “neuro” rather than just symbolic planning, Mao gives the cleanest definition of the framework. The symbolic part is the constraint graph and reasoning structure: find values that satisfy constraints about geometry, task state, contact, stability, placement, or object relations. The neural part is attached to the graph’s pieces. Individual edges or constraints may use neural networks to generate object poses, grasp poses, contact points, or trajectories. The neural networks do not operate independently and then vote. They are integrated by an inference algorithm that searches for values satisfying all constraints simultaneously.

The symbolic part is all about that graphical structure of that constraint graph. And then each individual, let's say edge of that graph is associated actually with a neural network.