Geometric Priors Can Make Robot Learning Far More Data Efficient
In a Stanford Robotics Seminar talk, Northeastern computer science professor Robert Platt argues that robot learning should occupy the middle ground between brittle hand-coded models and data-hungry generalist policies by building geometry into learned systems. His case is that representations such as equivariant point-cloud policies, spherical image embeddings, ray-based attention and image-plane control can make robots generalize over pose without having to learn that structure from scratch. Platt presents the payoff as data efficiency: geometric bias does not replace scaling, but can shift the curve so scarce robot demonstrations count for more.

The useful middle is not a return to hand-coded robotics
Robert Platt frames the recent history of robotics as a swing between two extremes. On one side are highly structured, hand-coded geometric and physics-based systems: detect objects, estimate state, plan with a hard-coded model, act. On the other side are today’s generalist vision-language-action models: take images and instructions, run them through learned encoders and attention, infer actions directly.
The older structured systems were powerful precisely because they made strong assumptions. Platt’s example was YODO, “You Only Demonstrate Once,” the RSS 2022 best paper by Wen, Lian, Bekris, and Schaal. It estimated object locations using previously known CAD models and then used a geometric plan. The benefit was visible in the title: with a strong enough geometric prior, one demonstration could be enough. The failure mode was just as visible. If the model misestimated the location of something in the world, the plan failed because the assumed world and the actual world diverged.
The newer generalist models avoid that particular brittleness by learning directly from data in the environment. But Platt’s concern is that they often throw away exactly the structure robots need. He used recent models such as Toyota Research Institute’s LBM, ManiFlow, DiT-Block Transformer, and X-VLA to motivate a simplified architecture: vision and language encoders, often from a pretrained VLM, followed by self-attention and then an action head such as a diffusion transformer.
In that simplified architecture, Platt argued, “the last thing that these models know about the geometric structure of the environment is positional encodings for the image patches that came out of the vision encoder.” After self-attention, he said, the reasoning becomes “completely disembodied.” If the model has discarded geometry, it must relearn geometry from data.
His proposal is not to go back to fully hand-coded robot models. It is to take a generalist model and structure it better: embed observations in a 3D world, retain geometric information in the representation, and use model architectures that reason in terms of that structure.
The four methods he presented all follow that recipe, but choose different representations:
| Method | Observation representation | Geometric mechanism |
|---|---|---|
| Equivariant Diffusion Policy | Point cloud | Finite subgroups of SO(2), SO(3), or SE(3) |
| Image 2 Sphere Policy | Image embedded on the 2-sphere | Spherical harmonics and SO(3) Fourier-space reasoning |
| RAVEN | Image patches expressed as 3D rays | Geometric Transform Attention with coordinate frames |
| Pix2Act | Raw stereo images | Image-plane trajectory prediction followed by triangulation |
The question Platt explicitly set aside was whether big data alone might solve the problem. He acknowledged that possibility but did not argue it. His narrower claim was conditional: if robotics needs something between brittle hand-coded models and data-hungry generalist models, the candidate is a learned model whose architecture is biased toward the geometry of the physical world.
Equivariance is the way Platt tries to put physical structure back into learning
Robert Platt connects the desire for geometric structure to symmetry. If the goal is to encode something like physical conservation laws into a learned robot policy, symmetry is the natural place to start. He introduced Noether’s theorem as the conceptual anchor: continuous symmetries of physical systems with conservative forces correspond to conservation laws. Time symmetry corresponds to conservation of energy; spatial translation symmetry to conservation of momentum; spatial rotation symmetry to conservation of angular momentum.
The mapping is conceptual, not literal, for the policies under discussion. Vision-language-action models are not second-order physical simulators in which rotation or translation symmetry straightforwardly installs conservation of momentum. Many robot policies are closer to zero-order or first-order systems. Still, translation and rotation symmetry are useful inductive biases. They encode a basic expectation: if a scene is translated or rotated, the appropriate action or action distribution should translate or rotate with it.
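Platt illustrated this with a planar pushing example. A block pushed from one orientation and the same block-scene-action configuration rotated by 90 degrees should have analogous transition dynamics. In Markov decision process notation, with $T$ the transition function and $g$ an element of the rotation group $G$ acting on states and actions, the invariant transition relation reads in standard form:

$$T(s, a, s') = T(g s, g a, g s') \quad \text{for all } g \in G.$$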
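For worlds where transition dynamics are group rotation invariant, Platt said the optimal policy should be group rotation equivariant. Writing $\pi^*$ for the optimal policy and $g$ for an element of the rotation group $G$, the standard form of that condition is:

$$\pi^*(g s) = g \, \pi^*(s) \quad \text{for all } g \in G.$$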
In plain terms, if the input scene is rotated, the policy’s output action should rotate accordingly. Platt used image segmentation as the simpler analogy: a good segmentation function should produce the same segmentation whether one segments first and rotates later, or rotates the image first and segments later.
The key architectural question is how to constrain a neural network so it can only represent functions with that property. Platt gave a deliberately tiny convolution example. A standard convolutional kernel mapping a 3-by-3 image to a 2D vector would have 18 free weights, ignoring biases. But if the function must be equivariant to 90-degree rotations, many of those weights are tied or negated. In the example, the equivariant kernel had only five free parameters. Rotate the input image, apply the same constrained kernel, and the output vector rotates with it.
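The weight-tying idea can be illustrated with a minimal NumPy sketch. It does not reproduce the slide's hand-derived kernel. Instead it assumes a different but standard construction, group averaging, which turns an arbitrary 2-by-9 weight matrix into a map that commutes with 90-degree rotations. The names `R90` and `equivariant_map` are hypothetical and chosen only for this illustration.

```python
import numpy as np

# 90-degree rotation acting on a 2D output vector.
R90 = np.array([[0.0, -1.0], [1.0, 0.0]])

def equivariant_map(W, image):
    """Average an unconstrained linear map over the four 90-degree rotations (C4).

    W is any (2, 9) weight matrix; image is a (3, 3) array. Averaging over the
    group yields a map f that satisfies f(rot90(image)) == R90 @ f(image).
    """
    out = np.zeros(2)
    for k in range(4):
        rotated = np.rot90(image, k).reshape(-1)   # rotate the input k times
        undo = np.linalg.matrix_power(R90.T, k)    # rotate the output back k times
        out += undo @ (W @ rotated)
    return out / 4.0

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 9))          # unconstrained weights (assumption: random)
image = rng.normal(size=(3, 3))

lhs = equivariant_map(W, np.rot90(image))   # rotate first, then apply
rhs = R90 @ equivariant_map(W, image)       # apply first, then rotate
print(np.allclose(lhs, rhs))
```

The check holds for any `W`, which is the point: the symmetry is a property of the constrained function class, not something the weights have to learn from data.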
This small example showed the larger idea: equivariant layers reduce the function class to functions that respect a chosen symmetry. In Equivariant Diffusion Policy, that idea is applied throughout the model. A point cloud is encoded with an equivariant point cloud transformer, equivariant with respect to translation and a finite subgroup of SO(3). The denoising policy is also equivariant. In finite-group implementations, Platt said, the tensor gains an additional dimension corresponding to group elements.
The resulting policy is still a diffusion policy: it learns a flow field over action space, or action chunks. But the flow field is constrained so that rotations or translations of the input automatically generate corresponding rotations or translations of the output flow field.
“We’re talking about models that automatically generalize in ways that other models would need to learn,” Platt said.
That automatic generalization is the central practical claim. Equivariance does not remove the need to learn the task. It removes some of the burden of learning that the same task can appear at different poses.
The strongest result is a point-cloud policy that trades structure for data efficiency
Robert Platt described Equivariant Diffusion Policy as the strongest empirical result among the methods he presented. The benchmark was MimicGen, a set of 12 manipulation tasks. In the comparison he emphasized, models were trained without pretraining on 100, 200, or 1,000 demonstrations.
The headline result was that the point-cloud version of Equivariant Diffusion Policy trained on 100 demonstrations outperformed an out-of-the-box diffusion policy trained on 1,000 demonstrations.
The table shown in the seminar gave average success rates over 12 environments. EquiDiff using point clouds reached 76.5 with 100 demonstrations, 81.6 with 200, and 82.3 with 1,000. The DiffPo-C Abs baseline reached 42.0, 57.8, and 71.4 at the same demonstration counts. Other baselines shown included DiffPo-T, DP3, and ACT.
| Method | 100 demos | 200 demos | 1,000 demos |
|---|---|---|---|
| EquiDiff (PC) | 76.5 | 81.6 | 82.3 |
| EquiDiff (No) | 63.0 | 72.6 | 77.9 |
| EquiDiff (Im) | 53.7 | 68.5 | 79.2 |
| DiffPo-C Abs | 42.0 | 57.8 | 71.4 |
| DiffPo-T | 29.0 | 43.0 | 64.9 |
| DP3 | 23.9 | 35.1 | 56.8 |
| ACT | 21.3 | 38.2 | 63.3 |
The gains were not uniform across tasks. Platt said the method helped most where there was the greatest need to generalize over pose. A bar chart grouped tasks into high-, intermediate-, and low-equivariance categories. The high-equivariance tasks showed the largest improvements. The less pose variation mattered, the less important equivariance became.
The same method was also tested on real robot tasks that Platt characterized as “non-ridiculous.” The slide listed eight tasks, all with fewer than 100 demonstrations except coffee making, which used 159. Because the slide’s photo labels and table labels are shown separately, the safest reading is to preserve the table order as displayed rather than infer task-photo mappings. In that displayed order, the tasks and results were:
| Task as listed on slide | Demonstrations | EquiDiff (PC) success |
|---|---|---|
| Twist pipe | 20 | 90% (18/20) |
| Seat wiping | 20 | 95% (19/20) |
| Screwdriver to drawer | 40 | 90% (18/20) |
| Trash sweeping | 50 | 100% (20/20) |
| Toast making | 50 | 70% (14/20) |
| Bagel baking | 60 | 85% (17/20) |
| Tool box | 99 | 70% (14/20) |
| Coffee making | 159 | 85% (17/20) |
Platt’s interpretation was not that point-cloud equivariance solves robotics. He listed concrete tradeoffs. The approach generalizes “almost perfectly” only over the finite subgroup it is built for. If the group is C4, with rotations every 90 degrees, generalization to 45 degrees still requires augmentation or other support. Larger finite groups are more computationally expensive. Point cloud encoders are expensive at training and inference time. Point clouds are also typically lower resolution than RGB images, which matters for precision tasks, and they are not a natural fit for eye-in-hand observations.
The result therefore establishes a point, not an endpoint: strong geometric structure can buy large data-efficiency gains when the representation and the task align.
Putting an image on a sphere keeps RGB while recovering some SO(3) structure
Robert Platt introduced Image 2 Sphere Policy, or ISP, as an attempt to preserve one of the advantages of image-based policies: RGB input. RGB is important for tasks requiring precision. His example was placing a coffee pod into a coffee machine slot. A point cloud policy could miss that kind of detail because point clouds are sparse relative to images.
The challenge is that a single RGB image is not obviously embedded in SE(3). A standard learned policy can feed it through a ResNet or similar encoder, but then the image becomes just patches and features. ISP instead projects image features onto the 2-sphere.
The pipeline starts with an image encoded by a ResNet, either pretrained or equivariant. The resulting feature map is projected onto a sphere, producing a sampled function over the sphere. The sphere is then rotated into the camera’s coordinate frame. This matters because the whole point of the spherical representation is that SO(3) rotations can be applied to it.
Once the feature function is on the sphere, ISP applies a Fourier transform using spherical harmonics, which Platt described as a Fourier basis for functions on the 2-sphere. The model then performs convolution in Fourier space: one convolution on the 2-sphere and one in SO(3). The SO(3) convolution uses Wigner-D matrices as basis functions, though Platt did not go into the details. After an inverse Fourier transform, the representation is brought into a discrete subgroup of SO(3), such as an icosahedral group, and the model then applies the same equivariant diffusion-policy machinery discussed earlier.
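Why a harmonic representation makes rotations easy to apply can be seen in a toy case. The sketch below is not ISP's pipeline. It assumes a function on the sphere that contains only degree-0 and degree-1 content and fits it in the unnormalized basis (1, x, y, z), which spans the real spherical harmonics of those degrees. In that basis the degree-1 coefficients rotate with the ordinary 3-by-3 rotation matrix, while higher degrees would need the Wigner-D matrices mentioned above. All function names are hypothetical.

```python
import numpy as np

def sample_sphere(n, rng):
    """Draw n roughly uniform points on the unit 2-sphere."""
    p = rng.normal(size=(n, 3))
    return p / np.linalg.norm(p, axis=1, keepdims=True)

def fit_low_degree(points, values):
    """Least-squares coefficients in the basis (1, x, y, z)."""
    basis = np.hstack([np.ones((len(points), 1)), points])
    coeffs, *_ = np.linalg.lstsq(basis, values, rcond=None)
    return coeffs

def rotation(axis, angle):
    """Rotation matrix from Rodrigues' formula."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

rng = np.random.default_rng(0)
points = sample_sphere(200, rng)
a = np.array([0.3, -1.2, 0.7])       # toy degree-1 content (assumption)

def feature(p):
    """A toy band-limited feature on the sphere: constant plus linear term."""
    return 0.5 + p @ a

R = rotation([1.0, 2.0, 3.0], 0.8)
coeffs = fit_low_degree(points, feature(points))
# Rotating a function f by R means evaluating f at R^{-1} p; for row-vector
# points, R^{-1} p is computed as points @ R.
coeffs_rot = fit_low_degree(points, feature(points @ R))

print(np.allclose(coeffs_rot[0], coeffs[0]))         # degree 0 is unchanged
print(np.allclose(coeffs_rot[1:], R @ coeffs[1:]))   # degree 1 rotates with R
```

Each degree transforms on its own under rotation, so a rotation of the scene becomes a block-wise linear operation on coefficients rather than a resampling of the image.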
The results were weaker than Equivariant Diffusion Policy but still meaningful. On MimicGen, Platt said ISP outperformed the baselines by about a factor of two in data efficiency: with 100 demonstrations, it performed a little better than baselines with 200 demonstrations.
The slide reported mean scores of 65.2 for ISP-SO(3) with 100 demonstrations and 75.0 with 200. DiffPo scored 53.6 and 64.1. EquiDiff in that comparison scored 53.0 and 64.5. ACT scored 23.0 and 40.9.
| Method | 100 demos mean | 200 demos mean |
|---|---|---|
| ISP-SO(3) | 65.2 | 75.0 |
| ISP-SO(2) | 65.0 | 75.1 |
| DiffPo | 53.6 | 64.1 |
| EquiDiff | 53.0 | 64.5 |
| ACT | 23.0 | 40.9 |
Platt also said pretraining the image encoder improves the method, raising performance to 72% in one comparison he mentioned.
The most concrete demonstration was a bean-pouring task trained with 50 demonstrations. A baseline model failed to notice when all the beans had spilled out of the scoop. The policy needed to look at the scoop and continue turning until it was empty. ISP learned that behavior. Platt’s explanation was not that equivariance taught the model about beans. It was that the baseline had to learn generalization over position, the policy, and task variation all at once. ISP did not have to spend as much data learning pose generalization, leaving more capacity and data for the task-specific behavior.
For real-world tasks, the ISP slide listed box-pipe disassembly, U-pipe disassembly, 3D-pipe disassembly, and grocery bag retrieval, with 60 or 65 demonstrations. ISP-SO(3) achieved 80%, 85%, 75%, and 95% success respectively. DiffPo achieved 10%, 65%, 15%, and 75%.
The advantages and limits were clear. ISP uses RGB directly and does not require camera calibration. It is therefore better suited than point clouds for some precise eye-in-hand tasks. But it is not as sample efficient as the point-cloud Equivariant Diffusion Policy, and Platt said there is no clear method for incorporating multiple cameras or modalities.
RAVEN treats pixels as rays so different sensors can share a frame
Robert Platt presented RAVEN as a third representation choice: embed image patches as 3D rays. Each patch is not treated merely as a pixel feature. It is associated with a ray from the camera origin through the center of the patch. Platt described the ray as having orientation with respect to the image patch. The intuitive claim is simple: if an image is produced by a camera in the world, its patches can be treated as rays in the world.
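The patch-to-ray step can be sketched with a standard pinhole camera model. This is a minimal illustration, not RAVEN's code: the function `patch_ray`, the intrinsics values, and the camera pose are all assumptions, and the sketch computes only the ray's origin and direction rather than the full coordinate frame attached to each patch.

```python
import numpy as np

def patch_ray(u, v, K, R_world_cam, t_world_cam):
    """Return (origin, direction) of the world-frame ray through pixel (u, v).

    K is the 3x3 pinhole intrinsics matrix. R_world_cam and t_world_cam map
    camera-frame points to the world frame: p_world = R @ p_cam + t.
    """
    pixel = np.array([u, v, 1.0])
    d_cam = np.linalg.solve(K, pixel)         # back-project: K^{-1} [u, v, 1]
    d_cam = d_cam / np.linalg.norm(d_cam)     # unit direction in the camera frame
    d_world = R_world_cam @ d_cam             # rotate direction into the world frame
    return t_world_cam, d_world               # the ray starts at the camera center

# Hypothetical calibration: 500 px focal length, principal point at (320, 240).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.0, 0.0, 1.0])

origin, direction = patch_ray(320.0, 240.0, K, R, t)   # patch at the image center
print(origin, direction)
```

Both `K` and the camera pose enter the computation, which is why the ray representation depends on calibration.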
RAVEN then uses Geometric Transform Attention, or GTA. The visible slides attribute this material inconsistently across names and years, so the safe technical point is the mechanism rather than a bibliographic claim. GTA modifies standard transformer attention by transforming query, key, and value features into a common reference frame before attention is computed, then placing them back into their respective frames. Platt said the rows in the displayed equations can be thought of, for example, as rotation matrices; the slide described them as representations of group elements such as rotation matrices or translations.
This does not require a wholly different learning process. It changes the attention calculation so that reference frames are incorporated into computation. In RAVEN’s architecture, a ResNet produces image patches; those patches are converted into rays; each ray gets a coordinate frame; then the model uses GTA self-attention and a GTA version of a diffusion transformer. Platt also noted that the decoder has details related to placing tokens in the coordinate frame in which the action chunk should be expressed.
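The transform-attend-transform-back pattern described above can be sketched in a few lines. This is a toy under strong assumptions, not the published GTA formulation: features are 3-dimensional, each token's frame is a pure rotation, and there is a single attention head with no learned projections. The names `frame_aware_attention` and `random_rotation` are hypothetical.

```python
import numpy as np

def random_rotation(rng):
    """Random 3x3 rotation matrix via QR decomposition."""
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    q[:, 0] *= np.sign(np.linalg.det(q))      # force determinant +1
    return q

def frame_aware_attention(q, k, v, frames):
    """Single-head attention over tokens that each carry a rotation frame.

    q, k, v: (n, 3) features expressed in each token's local frame.
    frames:  (n, 3, 3) rotations mapping local coordinates to a shared frame.
    """
    def to_shared(x):
        return np.einsum("nij,nj->ni", frames, x)

    qs, ks, vs = to_shared(q), to_shared(k), to_shared(v)
    scores = qs @ ks.T / np.sqrt(q.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)   # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    out_shared = weights @ vs
    # Map each output back into its token's local frame (R^T undoes R).
    return np.einsum("nji,nj->ni", frames, out_shared)

rng = np.random.default_rng(0)
n = 5
q, k, v = (rng.normal(size=(n, 3)) for _ in range(3))
frames = np.stack([random_rotation(rng) for _ in range(n)])

# Re-expressing every frame in a different shared (world) frame G leaves the
# output unchanged, because only relative geometry between tokens matters.
G = random_rotation(rng)
out_a = frame_aware_attention(q, k, v, frames)
out_b = frame_aware_attention(q, k, v, G @ frames)
print(np.allclose(out_a, out_b))
```

The invariance check at the end is the property that makes the shared frame useful: the choice of world origin does not change what the attention layer computes.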
The main virtue Platt assigned to RAVEN was not peak benchmark performance. He said the method performed a little worse than the previous approaches, while still outperforming baselines. The source slides showed benchmark and real-world result tables, but the visible text did not expose enough numeric detail to support a more granular comparison. The qualitative claim Platt emphasized was flexibility: RAVEN offers a consistent way to combine multiple views and potentially multiple modalities. Pixels, points, and even force information can be used if each is attached to a coordinate frame.
That makes RAVEN a bridge from single-view image methods to multi-sensor policies. The price is that the rays must be correct. RAVEN requires camera calibration. Its effectiveness also depends on design decisions about what coordinate frames to attach to pixels, points, or other pieces of data. Platt also noted that there are questions about how well this approach incorporates equivariance into the model compared with explicitly equivariant architectures.
The RAVEN real-world slide showed tasks including banana picking, beans scooping, coffee cleanup, and box cleanup. Platt did not dwell on the numbers, instead emphasizing the representation’s ability to put heterogeneous observations into a shared geometric computation.
Pix2Act removes explicit equivariance but forces viewpoint robustness through image-plane control
Robert Platt described Pix2Act as work the group was doing at the time of the seminar. The slide identified it as “Huang, et al., CoRL 2024 (submission),” and Platt said it was “in submission or will be in submission to CoRL.” Unlike the earlier methods, Pix2Act has no equivariant layers. It is still part of the same broader argument because it changes the representation so the model does not need to infer 3D action directly from a disembodied latent state.
Pix2Act uses raw stereo images. In the setup shown, two cameras are attached to the robot end effector and look at opposite sides of the gripper. The gripper is controlled through several colored control points. The model predicts trajectories for those keypoints in the image planes of the in-hand cameras. Those 2D trajectories are then triangulated back into 3D trajectories.
The model uses ResNets to encode an agent view and the in-hand views. A multi-view transformer performs self-attention within each image and cross-attention between images. Additional proprioceptive or task-conditioning information can be included. The transformer outputs tokens that feed separate diffusion heads, one for each in-hand view. Each head diffuses a trajectory in its own image plane. Triangulation combines the image-plane predictions into physical 3D motion.
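The final triangulation step can be sketched with the textbook linear (DLT) method. This is an assumption about how such a step could be done, not Pix2Act's implementation: the two projection matrices below describe a simple side-by-side pair with made-up intrinsics rather than the gripper-mounted geometry from the talk, and `triangulate` and `project` are hypothetical helper names.

```python
import numpy as np

def triangulate(P1, P2, uv1, uv2):
    """Linear (DLT) triangulation of one 3D point from two pixel observations.

    P1, P2: 3x4 camera projection matrices. uv1, uv2: (u, v) pixel coordinates.
    """
    rows = []
    for P, (u, v) in ((P1, uv1), (P2, uv2)):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]                      # null-space direction: homogeneous 3D point
    return X[:3] / X[3]

def project(P, X):
    """Project a 3D point into pixel coordinates with projection matrix P."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Hypothetical calibration: shared intrinsics, second camera offset by 10 cm.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.1], [0.0], [0.0]])])

# A short 3D keypoint trajectory, observed as two image-plane trajectories.
trajectory_3d = np.array([[0.0, 0.0, 0.5],
                          [0.02, 0.01, 0.45],
                          [0.04, 0.02, 0.4]])
recovered = np.array([
    triangulate(P1, P2, project(P1, X), project(P2, X)) for X in trajectory_3d
])
print(np.allclose(recovered, trajectory_3d, atol=1e-6))
```

With noise-free image-plane predictions the 3D trajectory is recovered exactly; with predicted trajectories from two diffusion heads, the same least-squares step would return the point most consistent with both views.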
The unusual part is the data augmentation. Platt compared it to domain randomization: if there are variables the model should ignore, randomize them so the model learns they contain garbage information. Here, the variables to ignore are global image structures that might not generalize across views. Pix2Act independently rotates the two in-hand cameras around their visual axes during augmentation. Platt described it as imagining each gripper-mounted camera can spin between images.
The effect is to force the model to infer each image-plane trajectory from local visual features. If the robot needs to move toward a drawer, the predicted keypoint trajectories should move toward the drawer in the image regardless of the drawer’s global location or how it corresponds to other structures in the model’s learned world representation. Global context can still provide high-level task information, but the image-plane movement must be grounded locally.
The visible Pix2Act result slides showed comparisons on single-task MimicGen settings, grouped distribution and precision settings, and a multi-task MimicGen setting, but the numeric values were not legible in the source record. Platt’s spoken claims were qualitative: the model “performs very well” and was “outperforming a whole bunch of stuff.” The specific comparison he emphasized was the multi-task result against TRI’s LBM model. In that setting, all methods were trained or fine-tuned on 600 demonstrations from six MimicGen tasks, 100 demonstrations per task. The slide stated that Pix2Act used no pretraining, while LBM used a pretrained CLIP vision encoder. Platt said Pix2Act was still edging out LBM.
That result matters within the seminar’s larger argument because Pix2Act is not an explicitly equivariant model. It suggests Platt’s broader claim is not “equivariance everywhere.” It is that robot learning improves when the representation and architecture make the physical problem easier. In Pix2Act, the model reasons in image planes where local visual motion is directly meaningful, and only projects into 3D at the end.
Scaling laws make data-only robotics look like a costly victory
Robert Platt closed by connecting geometric structure to scaling laws. In language modeling, loss improves smoothly with model size, dataset size, and compute, often following power-law relationships. Similar plots have appeared in robotics, including a Generalist AI plot of next-action prediction error against pretraining dataset size for a clothes-handling task.
Platt made a small scaling plot of his own from MimicGen: average success rate for a baseline diffusion policy over 12 tasks at 100, 200, and 1,000 data points, fitting a power law through only those three points. He said it was “probably not too far from the truth,” but did not present it as definitive evidence. The point was qualitative: relying purely on more data can put the field in a tough spot, because large increases in data may produce only small gains.
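A three-point fit of this kind is easy to reproduce in outline. The sketch below makes two assumptions that the talk summary does not confirm: it uses the DiffPo-C Abs averages from the MimicGen table above as the baseline, and it fits the failure rate (100 minus success) rather than success itself. Three points cannot validate a power law, so the output is illustrative only.

```python
import numpy as np

demos = np.array([100.0, 200.0, 1000.0])
success = np.array([42.0, 57.8, 71.4])    # DiffPo-C Abs averages (%), assumed baseline
error = 100.0 - success                   # assumption: fit the failure rate

# Fit error = a * N^(-b) by linear regression in log-log space.
slope, intercept = np.polyfit(np.log(demos), np.log(error), 1)
a, b = np.exp(intercept), -slope

def demos_needed(target_error):
    """Invert the fitted power law: N = (a / error)^(1 / b)."""
    return (a / target_error) ** (1.0 / b)

print(f"fitted exponent b = {b:.2f}")
print(f"demos for 20% error: about {demos_needed(20.0):.0f}")
print(f"demos for 10% error: about {demos_needed(10.0):.0f}")
```

Under a fit like this, halving the remaining error requires multiplying the dataset by a constant factor of 2^(1/b), which is the sense in which each further gain gets more expensive.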
His analogy was Pyrrhus of Epirus, who defeated the Romans but lost so many people across long supply lines that continued victory became ruinous. Platt’s robotics version is that a model may keep improving with more data, yet the cost of each additional improvement may be strategically unattractive.
The hoped-for role of geometry is not to abolish scaling laws. Platt explicitly said the scaling curve will always be there and that more data is always better. The hope is to shift the curve left. A smarter model should make each data point count for more by automatically generalizing in ways a less structured model would need to learn.
He cited a Brehmer et al. 2024 workshop paper as partial support. In that work, equivariant models were used for rigid-body forward modeling involving between three and ten polyhedral objects. The model was E(3)-equivariant. The plot compared a non-equivariant transformer, a non-equivariant transformer trained with data augmentation, and an equivariant transformer as data increased. Platt said it looked like the equivariant model shifted the scaling line. He also said he performed a similar experiment on MimicGen with ISP and shifted the power law, while presenting this as future-work territory rather than settled evidence.
This is where he returned to the two-extremes framing. Robotics is not, in his view, moving all the way back to completely hand-coded one-shot models. It is moving left from generalist models toward structured models. The object is not to discard learning, but to bias learned models toward physical structure.
In machine-learning terms, Platt called this a bias-variance tradeoff. The model is biased to better fit the physical world, which “basically amounts to incorporating knowledge from physics.” He also bounded the claim: the seminar only dealt with translation, rotation, and coordinate invariance. That is not all of physics. But in the work he presented, it was “surprisingly effective.”
The structure is a bias, not a fully modeled world
Robert Platt faced the central objection in the question period: structured models can lose generality. If the model assumes a limited set of groups, such as rotations or SE(3), what happens when the task does not cleanly match those symmetries? A table and another object, for example, do not necessarily have a clear SE(3)-equivariant relation between them.
Platt agreed that the risk is real. Moving too far left on his structured-generalist spectrum eventually reaches a fully modeled world, and then the robot “is going to get in deep trouble” when the model is wrong. But he argued that the methods he presented are nowhere near that extreme. They add a bias toward invariance, not a complete world model.
If exact equivariance is a problem, methods from approximate equivariance could be used. In practice, however, Platt said his group has not found that necessary. Their environments are not perfect matches to the model structure, yet the model can adapt around the equivariant constraints. His condition was that the architecture must not preclude solutions. If it does not, then learning can still find them, with the geometric structure acting as a useful data-efficiency bias.
The same principle shaped his answers about tactile data and IMUs. Tactile data, he said, applies “without modification” if it can be mapped into the physical world. Force data can be treated as vectors localized in space. GelSight-style tactile pads can be treated as images with coordinate frames in the world. The question is not whether the modality is vision; it is whether the data can be associated with a coordinate frame or geometric meaning.
He acknowledged that tactile applications require effort and that benchmarks have not driven much of the work. He then recalled that his group did have a tactile paper using GelSight on a two-finger setup, where the model generalized over the orientation of a grasped part during insertion.
On IMU data, he gave the same answer: any source of data that can be associated with a coordinate frame or geometric meaning is a natural fit. The key is placing data into a common reference frame to the extent possible.
Asked how these ideas relate to larger data-hungry models, Platt’s answer was that the structured policy should not be seen as an alternative to big models. “We create big models with a structure like this,” he said. The structure should not preclude using information in the data. The question is what neural network structure best fits the data being modeled. His claim, still not demonstrated at large scale, is that these geometric structures are a more effective way to follow the data in robotics.


