Autonomous Medical Robots Need Physics Models, Not Just Foundation Models

Michael Yip · Stanford Online · Tuesday, May 12, 2026 · 17 min read

UC San Diego professor Michael Yip argues in a Stanford Robotics Seminar that medical robotics must move beyond teleoperation if it is to address healthcare labor shortages. Current surgical robots can improve precision but still depend on a surgeon’s skill, while surgery’s scarce data, deformable tissue, safety constraints, and need for millimeter accuracy make end-to-end learning an inadequate answer on its own. Yip makes the case for a hybrid path: modern perception where it works, explicit physics and control where contact demands it, and humanoid platforms where broader hospital tasks require more general embodiment.

Healthcare needs robots with skill, not just robots with surgeons behind them

Michael Yip frames medical robotics as a response to a labor shortage that healthcare already cannot meet. In the United States, he said, the projected shortage is on the order of tens of thousands of surgeons and hundreds of thousands of nurses. The slide he showed put the surgeon shortfall at roughly 30,000 and the nurse supply-demand gap at roughly 450,000, citing the Association of American Medical Colleges and the National Center for Health Workforce Analysis.

Workforce category | Shortage or gap shown
Surgeons           | ~30,000 projected shortfall
Nurses             | ~450,000 supply-demand gap
The workforce-shortage figures Yip used to motivate autonomous medical robotics

Robots are attractive here not because they are futuristic but because their properties map directly onto labor scarcity: they do not fatigue in the human sense, they can be available continuously, they can operate with repeatable precision, and their capabilities can in principle be updated fleet-wide. Yip compared this to autonomous-vehicle software updates: a robot skill, once developed and deployed, need not be taught one clinician at a time through apprenticeship.

But the surgical robots deployed today do not yet solve that scarcity problem. The da Vinci Surgical System, Yip’s central example of the current paradigm, is teleoperated. Instruments enter the patient’s body, and the surgeon manipulates them from a console. The robot can improve precision and reduce fatigue, but the skill is still the surgeon’s. Current surgical robots, he said, have “zero inherent skill.”

That distinction matters because adding a teleoperated robot does not necessarily let a hospital perform more surgery. Robotic procedures can require more personnel than non-robotic ones. In manufacturing, by contrast, autonomous robots are already deployed, but they are autonomous in a narrow and “unskilled” sense: rigid, pre-programmed, dependent on structured environments, and requiring setup and oversight. They are fast, precise, and repeatable, but the operating world is constrained around them.

The research problem is the attempt to combine these two categories: the dexterous, clinically useful embodiment of surgical robots with the capacity for autonomy. Surgical autonomy is not a new ambition. Yip pointed to ROBODOC in orthopedic surgery in 1992, AESOP as a voice-controlled laparoscopic camera holder in 1994, and later university work on imitation learning and autonomous suturing. What has changed is not the aspiration but the available techniques, especially in perception, simulation, learning, and control.

Recent robotics has been pushed by foundation models, vision-language-action models, large datasets, and world models. Yip does not treat those as irrelevant to surgery. His objection is that the assumptions under which they have recently succeeded are almost the opposite of the assumptions in surgery. Many current robot-learning demonstrations happen in clean, controlled settings with low stakes, easy resets, and plentiful demonstrators. A person can sit at a console and demonstrate folding laundry many times. Surgery offers scarce data, non-resettable environments, critical safety constraints, few expert demonstrators, low incentive for those experts to generate robot-training data, and privacy constraints around patient data. Yip said world models have “no concrete path to maturity” in this domain under those constraints.

That is why his lab’s approach returns to four older pillars of autonomous robotics: perception, modeling and simulation, planning, and control. The purpose is not nostalgia for classical robotics. It is to retain explainability and safety while building systems that can perceive surgical context, estimate physical state, plan interventions, and execute with millimeter-level precision.

Robot autonomy, in my opinion, can only be as good as its perception and its ability to recognize and understand the world.
Michael Yip

Surgical perception needs millimeter accuracy in one of vision AI’s worst environments

Michael Yip said the first step toward surgical autonomy is contextual awareness, which for him begins with perception. Modern computer vision has only recently become capable enough to segment, localize, and track objects at the level robotics needs. His slides cited Meta AI’s Segment Anything and Meta AI/Oxford’s CoTracker as examples of recent perceptual capabilities that can identify, localize, and track targets.

Surgery, however, is a hostile environment for perception AI. The constraints include poor field of view, bad or nonexistent depth measurement, specular reflection from fluid, blood, smoke, constant occlusion by instruments, deformable tissue, insufficient data, and a need for one- to two-millimeter accuracy for control. Even stereo cameras do not solve the problem cleanly, he said, because their baseline can be so tight that depth perception remains poor. A single camera requires inference from shadows, anatomy, and prior knowledge.
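
A rough back-of-the-envelope sketch shows why a tight stereo baseline is a problem at surgical working distances. The numbers below (baseline, focal length, disparity error, working distance) are assumptions chosen for illustration, not figures from the talk; the point is only that depth error grows as the baseline shrinks and can approach the one- to two-millimeter budget.

```python
# Rough stereo depth-error estimate: why a tight endoscope baseline hurts.
# All numbers below are assumptions for illustration, not seminar values.

def depth_error_mm(depth_mm, baseline_mm, focal_px, disparity_err_px):
    """Approximate depth error: dZ ~= Z^2 * d_disparity / (f * B)."""
    return (depth_mm ** 2) * disparity_err_px / (focal_px * baseline_mm)

focal_px = 1000.0        # assumed focal length in pixels
disparity_err_px = 1.0   # assumed stereo-matching error on wet, specular tissue
depth_mm = 80.0          # assumed working distance to tissue

for baseline_mm in (4.0, 25.0, 120.0):   # endoscope vs. hand-held vs. larger rig
    err = depth_error_mm(depth_mm, baseline_mm, focal_px, disparity_err_px)
    print(f"baseline {baseline_mm:5.1f} mm -> depth error ~{err:.2f} mm")
```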

Perception component           | What the robot must know
Precision robot proprioception | Where its instruments are in 3D space
Precision object localization  | Where objects are relative to the instruments
3D scene reconstruction        | What the deformable surgical environment looks like
Scene fusion                   | How narrow camera views combine into a broader spatial model over time
Yip's four-part breakdown of the surgical perception problem

Yip said his UC San Diego lab has spent roughly 10 to 11 years working through these problems. Examples shown included surgical tool tracking, needle tracking, suture-thread tracking, 3D scene reconstruction, and fusion for deformable environments. These are not separate academic exercises in his account; they are prerequisites for getting instruments to grasp objects accurately and for letting a robot understand what tissue is doing as it is manipulated.

The point of reconstructing deformable scenes is not only to create better visuals. In surgery, reconstruction becomes a safety input. If a surgeon or robot is excising a tumor, for instance, the system needs to track anatomy as it moves and deforms so that it can preserve enough margin and avoid missing a lesion. But the larger claim is that vision alone is not enough. A robot that sees tissue move still needs a model of why it moves that way. To cut, dissect, retract, peel, or tear tissue safely, the robot must estimate mechanics.

That is where the work shifts from perception to digital twinning: building a model that can match observed tissue geometry and predict how it will respond to contact.

Digital twins are useful only if they can stay synchronized with deforming tissue

Michael Yip described digital twinning as a way to connect surgical vision to surgical action. The robot observes tissue, reconstructs a surface, initializes a simulation, and then uses that simulation to plan and control interactions. For his lab, a key early technique was position-based dynamics, which simulates deformable scenes using particles and constraints.

Two properties made position-based dynamics attractive. First, it can run faster than real time. That matters because a controller can simulate possible futures, evaluate interactions with tissue, and choose an action without waiting for the real world to unfold. Second, it can satisfy position constraints exactly. A surface reconstructed from camera observations can be imposed on the simulator quickly, without many iterative steps that would destroy the speed advantage.
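
To make the mechanism concrete, here is a minimal position-based-dynamics sketch, not the lab's simulator: a small particle chain with distance constraints, where one particle is pinned exactly to a position that stands in for a camera-reconstructed surface point. It illustrates the two properties above, cheap per-step updates and exact position constraints.

```python
# Minimal position-based dynamics (PBD) sketch: a particle chain with
# distance constraints and one particle pinned to an "observed" position.
import numpy as np

n, dt, iters = 10, 1e-2, 20
rest = 0.1                                    # rest length between neighbouring particles
x = np.stack([np.arange(n) * rest, np.zeros(n)], axis=1)   # positions, shape (n, 2)
v = np.zeros_like(x)
gravity = np.array([0.0, -9.81])

def step(x, v, pinned_idx, pinned_pos):
    p = x + dt * v + dt * dt * gravity        # unconstrained position prediction
    for _ in range(iters):                    # Gauss-Seidel projection of constraints
        for i in range(n - 1):                # distance constraints along the chain
            d = p[i + 1] - p[i]
            dist = np.linalg.norm(d) + 1e-9
            corr = 0.5 * (dist - rest) * d / dist
            p[i] += corr
            p[i + 1] -= corr
        p[pinned_idx] = pinned_pos            # observed position imposed exactly
    v_new = 0.99 * (p - x) / dt               # velocities recovered from positions, mild damping
    return p, v_new

for _ in range(200):
    # pin particle 0 to a position reconstructed from the camera (here a fixed point)
    x, v = step(x, v, pinned_idx=0, pinned_pos=np.array([0.0, 0.0]))
print(x[-1])                                  # free end of the chain after swinging under gravity
```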

The first problem the lab encountered was partial observability. A surgical camera sees a surface. It does not directly reveal what is happening beneath that surface. If a simulator is initialized from surface geometry alone, it must guess the subsurface structure and mechanics. When the simulated tissue moves, it can diverge from real tissue.

To correct this, Yip described the use of differentiable rendering. The system compares camera observations with simulated renderings, computes a loss between them, and backpropagates that loss to update the simulator. The parameters that change can include tissue mechanics and camera pose. In one result shown from work by X Liang and colleagues, online adaptation of heterogeneous tissue mechanics reduced prediction error from roughly five millimeters to below two millimeters.

sub-2 mm
prediction error after online heterogeneous stiffness adaptation in the example Yip showed
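
The adaptation loop Yip described can be sketched in a few lines if the renderer and PBD simulator are replaced with a toy differentiable deformation model. The structure is the same: simulate, compare with the observation, backpropagate the loss, update the mechanics parameters. Everything in the sketch (the linear spring model, the load, the dimensions) is an assumption for illustration.

```python
# Real-to-sim adaptation loop: simulate, compare with an observation,
# backpropagate, update mechanics. The toy "simulator" is a differentiable
# spring surface standing in for a PBD simulator plus differentiable rendering.
import torch

n_nodes = 50
true_stiffness = torch.linspace(0.5, 2.0, n_nodes)   # heterogeneous ground truth (assumed)
force = torch.ones(n_nodes) * 0.3                     # assumed applied load

def simulate(stiffness):
    # toy quasi-static model: displacement = force / stiffness
    return force / stiffness

observed = simulate(true_stiffness) + 0.001 * torch.randn(n_nodes)  # "camera" observation

stiffness = torch.ones(n_nodes, requires_grad=True)   # initial guess
opt = torch.optim.Adam([stiffness], lr=0.05)

for _ in range(500):
    opt.zero_grad()
    pred = simulate(stiffness)
    loss = torch.mean((pred - observed) ** 2)          # observation vs. simulation
    loss.backward()                                    # gradients flow through the simulator
    opt.step()
    with torch.no_grad():
        stiffness.clamp_(min=0.05)                     # keep mechanics physically plausible

print(float(torch.mean(torch.abs(stiffness - true_stiffness))))   # remaining parameter error
```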

The same real-to-sim idea was shown beyond mesh-like tissue surfaces. For rope-like objects, the system could observe a rope, initialize a simulation, pull on it, and learn properties such as elasticity and viscoelasticity. Yip connected this to surgical vessels, where stiffness matters for safe manipulation. For fluids, the lab used video of chocolate milk being poured to estimate viscosity. Yip called fluids “the extreme end of a soft body.”

The surgical application of the fluid work was hemorrhage control. When blood pools in a surgical field, the robot’s goal is to suction it as quickly as possible. The lab formulated the task as an optimal control problem, using model predictive control with a physics simulation of the fluid as the model. The robot’s motion plan could anticipate blood flow rather than merely react to pixels.
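
A minimal sampling-based model predictive control sketch captures the structure of that formulation: roll candidate suction-tip motions through a fluid model, score how much blood each plan removes, and execute only the first step of the best plan. The particle "blood" model below is a crude stand-in for the physics simulation used in the actual work.

```python
# Sampling-based MPC sketch for the suction task.
import numpy as np

rng = np.random.default_rng(0)
blood = rng.uniform(-1, 1, size=(200, 2))      # particle positions of the pooled blood (assumed)
tip = np.array([1.5, 1.5])                     # suction-tip position
suction_radius, horizon, n_samples, dt = 0.25, 10, 64, 0.1

def rollout(tip, blood, actions):
    """Simulate particles drifting toward the pool centre while the tip suctions."""
    blood = blood.copy()
    removed = 0
    for a in actions:
        tip = tip + a * dt
        drift = -0.05 * blood                  # crude stand-in for the fluid model
        blood = blood + drift * dt
        keep = np.linalg.norm(blood - tip, axis=1) > suction_radius
        removed += np.count_nonzero(~keep)
        blood = blood[keep]
    return removed

best_score, best_plan = -1, None
for _ in range(n_samples):
    plan = rng.normal(0, 1.0, size=(horizon, 2))   # candidate tip-velocity sequence
    score = rollout(tip, blood, plan)
    if score > best_score:
        best_score, best_plan = score, plan

tip = tip + best_plan[0] * dt                   # execute only the first action, then replan
print(best_score, tip)
```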

A second tissue problem was finding where tissue is connected before cutting. Yip described an active-sensing approach: when tissue is pulled, a sharp discontinuity in texture translation can reveal a connection. The lab wrapped this in a Bayesian inference framework so the robot could estimate not only likely attachment locations but also uncertainty. Once uncertainty is explicit, it can be incorporated into safety-aware control.
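
The Bayesian framing can be sketched as a posterior over candidate attachment sites that is updated after each tug. The likelihood model below, in which a texture discontinuity is seen with high probability only near a true attachment, is an assumed stand-in for the vision cue described in the talk; the useful output is an estimate plus an explicit uncertainty.

```python
# Bayesian active-sensing sketch over candidate tissue attachment sites.
import numpy as np

rng = np.random.default_rng(1)
n_sites = 20
posterior = np.full(n_sites, 1.0 / n_sites)      # uniform prior over the attachment site
true_site = 7                                     # hidden ground truth, used only to simulate

def likelihood(observed_discontinuity, probe_site, candidate_site):
    near = abs(probe_site - candidate_site) <= 1
    p_disc = 0.9 if near else 0.1                 # assumed sensor model
    return p_disc if observed_discontinuity else 1.0 - p_disc

for _ in range(8):
    probe = int(np.argmax(posterior))             # tug near the current best guess
    obs = rng.random() < likelihood(True, probe, true_site)   # simulated observation
    posterior *= np.array([likelihood(obs, probe, c) for c in range(n_sites)])
    posterior /= posterior.sum()

entropy = -np.sum(posterior * np.log(posterior + 1e-12))
print(int(np.argmax(posterior)), float(entropy))  # estimate plus explicit uncertainty
```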

The JIGGLE framework shown in the seminar asked whether there is an optimal exploration policy for identifying deformables connected to the environment. In simulation, a robot could tug on tissue to reveal hidden connected regions while maximizing information gain and staying below an energy threshold associated with tearing. This energy metric, Yip later clarified in Q&A, came from a local position-based-dynamics simulator, not from a foundation model. Vision provided the association between observed movement and viscoelastic or boundary-energy estimates.
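
In sketch form, that exploration policy is a constrained selection problem: among candidate tugs, pick the one with the highest expected information gain whose predicted deformation energy stays under the tearing threshold. The two estimator functions below are placeholders; in the system Yip described, the energy comes from the local position-based-dynamics simulator and the gain from the attachment posterior.

```python
# Constrained exploration sketch in the spirit of the JIGGLE framework.
import numpy as np

rng = np.random.default_rng(2)
candidate_tugs = rng.uniform(-1, 1, size=(30, 3))   # candidate pull directions and magnitudes
energy_threshold = 0.8                               # assumed tearing limit

def predicted_energy(tug):
    # placeholder for the PBD simulator's strain-energy estimate
    return float(np.linalg.norm(tug))

def expected_info_gain(tug):
    # placeholder for the expected entropy reduction of the attachment posterior
    return float(np.abs(tug[0]) + 0.5 * np.abs(tug[1]))

feasible = [t for t in candidate_tugs if predicted_energy(t) < energy_threshold]
best = max(feasible, key=expected_info_gain) if feasible else None
print(best)
```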

These models then fed into cutting and dissection. In the MEDIC work Yip showed, the robot learned to peel back butcher-shop meat to expose the cutting region for a second arm. He emphasized that the videos were bench experiments on butcher-shop meats, not clinical deployment. The autonomy was built around model predictive control methods.

One advantage of explicit modeling, in Yip’s account, is recovery. Foundation-model robotics demos sometimes show policies that fail and then try again. He called that promising but uncontrolled: there is no guarantee the system will return and correct the right failure. A real-to-sim model can instead query whether a cut actually disconnected the tissue. If residual attachments remain hidden until tissue is elongated, the robot can reveal, observe, adapt, and correct. In the examples shown, a failed dissection could be followed by targeted replanning and cutting of the remaining connected regions.
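
The recovery behavior reduces to a verify-and-replan loop: after each cut, query the adapted model for residual attachments and plan a targeted cut rather than blindly retrying. The helpers below are placeholders for the lab's perception and simulation components; only the loop structure is the point.

```python
# Verify-and-replan sketch for model-based recovery after a failed dissection.

def remaining_attachments(model_state):
    """Placeholder: attachment sites the adapted simulator still predicts."""
    return model_state.get("attachments", [])

def plan_cut(site):
    return {"type": "cut", "target": site}

def execute(action, model_state):
    # placeholder execution plus re-observation: the targeted attachment is removed
    model_state["attachments"] = [a for a in model_state["attachments"] if a != action["target"]]
    return model_state

state = {"attachments": [(0.02, 0.01), (0.05, 0.03)]}   # assumed residual connections
for _ in range(10):                                      # bounded retry budget
    left = remaining_attachments(state)
    if not left:
        print("dissection complete")
        break
    action = plan_cut(left[0])                           # targeted replanning, not blind retry
    state = execute(action, state)
```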

The broader claim is that model-based autonomy gives the supervisor and the system a way to ask what the robot is doing and why. Yip described his lab’s work as a “sharable, autonomous surgery toolkit” of vision models, adaptive physics models, and control and planning algorithms. Assembling those tools in different configurations can broaden what is automated while retaining explainability.

Lifelong learning is meant to sequence skills, not replace the surgical toolbox

Michael Yip acknowledged that much of the autonomy he showed had been engineered task by task. Each procedure required the lab to assemble perception, modeling, planning, and control components in a specific way. That does not scale cleanly. The next question is how a robot should sequence and combine a set of existing behaviors, and whether new skills can be added over time.

His lab’s approach is called knowledge-grounded reinforcement learning. The idea is to treat learned or engineered behaviors as knowledge modules and use a sparse neural-network architecture to learn how to combine them. A module might move a camera to scan tissue, guide a tool to cut, pick up an object, transfer an object between hands, or perform another bounded behavior. Reinforcement learning then learns how those modules should be weighted and sequenced to complete longer tasks.
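
Structurally, that is a gating network over a library of behaviors: each knowledge module proposes an action, and a small learned gate decides how to weight them for the current observation. The sketch below shows only that forward structure with made-up dimensions; in the approach Yip described, the gate and the modules it combines would be trained with reinforcement learning on the downstream task.

```python
# Module-mixing sketch: knowledge modules propose actions, a gate weights them.
import torch
import torch.nn as nn

obs_dim, act_dim, n_modules = 32, 7, 4   # assumed dimensions for illustration

class KnowledgeModule(nn.Module):
    """Stand-in for a learned or engineered behaviour (scan, cut, pick, transfer)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(obs_dim, act_dim)
    def forward(self, obs):
        return torch.tanh(self.net(obs))

class GatedPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.modules_list = nn.ModuleList(KnowledgeModule() for _ in range(n_modules))
        self.gate = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                  nn.Linear(64, n_modules))
    def forward(self, obs):
        weights = torch.softmax(self.gate(obs), dim=-1)          # learned mixing weights
        proposals = torch.stack([m(obs) for m in self.modules_list], dim=-2)
        return (weights.unsqueeze(-1) * proposals).sum(dim=-2)   # weighted combination

policy = GatedPolicy()
obs = torch.randn(1, obs_dim)
action = policy(obs)           # in practice, trained with reinforcement learning
print(action.shape)
```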

Yip’s example knowledge set came from tasks associated with the Fundamentals of Laparoscopic Surgery, an exam used to demonstrate laparoscopic competence. The tasks included grasping and regrasping suture needles, moving blocks on pegboards, transferring objects between hands, endoscopic camera control, gauze retrieval, and related bimanual training behaviors. The lab learned these policies in simulation and then transferred them to physical robotic systems.

He showed multi-step autonomy for suturing, including sequencing steps for multi-throw needle suturing. The significance is not that reinforcement learning alone performs surgery end to end. It is that a learned sequencing layer may help scale a library of bounded behaviors without rebuilding the whole controller for each task.

That distinction also shaped Yip’s answer to a later question about foundation models. His lab already uses foundation models for vision, segmentation, and tracking when they are good enough, either as main methods within one pillar or as supplementary methods. He sees value in mixing approaches for robustness rather than relying entirely on black-box architectures.

But he was skeptical that robotics will consolidate around a single universal robot model soon, especially for surgery. Robotics problems are diverse, and surgery has specific needs, data limitations, and safety constraints. It would be hard, he said, to imagine in the near term that the same model that cleans a kitchen could also perform surgery.

Humanoids enter the argument as a scaling platform, not as a replacement for surgical robots

Michael Yip then turned to a different scaling problem: embodiment. His lab’s work had largely used the da Vinci platform, but the surgical-robot landscape is diverse. If autonomy has to be rebuilt for each specialized robot, scaling becomes harder.

Humanoid robots became interesting to Yip not because of backflips or kung fu demonstrations, which he said did not obviously advance society, but because hospitals contain many tasks performed by people using human-oriented spaces and tools. A sufficiently capable humanoid could in principle serve as a remote surgeon avatar, an autonomous surgical assistant, or an autonomous nurse or technician.

The lab deployed a humanoid robot named Surgie through mostly teleoperation in “as many hospital settings as we could,” as Yip put it, and the examples shown also included task-trainer scenarios, mannequins, simulated setups, and bench-style medical tasks. The videos did not present a claim of clinical deployment on patients. Surgie pressed a stethoscope to a person’s chest and back for auscultation of heart and lungs; used an ultrasound wand in an exam scenario; performed Leopold’s maneuvers on a pregnant mannequin; carried out laryngoscopy and endotracheal intubation on a training dummy; used ultrasound-guided needle placement on a blue training block for vascular-access or biopsy-style work; performed bag-mask ventilation on a training dummy; and sutured on a silicone pad for laceration-repair-style practice.

For intubation, Yip noted, a patient may have only minutes before suffocation and death. In a setting without a physician present, the option to use a robot autonomously or teleoperate it could be easier to justify than having no intervention available. The point was a value proposition for emergency-style and access-constrained scenarios, not a claim that the demonstrated system is clinically ready.

The cost and deployment constraints were also explicit. A da Vinci system costs more than $3 million, Yip said, and even many U.S. hospitals can afford only one or two. Remote communities may have much less capital and access. Meanwhile, da Vinci is specialized for laparoscopic surgery. Hospitals also need assistants to hold cameras, retract tissue, hand off instruments, operate ultrasound probes, and perform nursing or technician tasks. Buying and approving separate robots for each role is difficult to justify. A humanoid with enough skill could, in principle, cover multiple roles.

Potential hospital role        | Examples Yip associated with it
Remote surgeon avatar          | Teleoperated humanoid use, including laparoscopic-tool experiments
Autonomous surgical assistant  | Camera holding, tissue retraction, assistive operating-room tasks
Autonomous nurse or technician | Instrument handoff, ultrasound support, common bedside or procedural tasks
The humanoid value proposition Yip laid out for hospitals

The lab tested humanoid laparoscopic teleoperation in a project called Lap-Surgie, comparing manual tool control, Surgie, and da Vinci. Manual laparoscopy is difficult because motion through a fulcrum inverts axes and changes scale. With robotics, scale and axis inversion can be corrected in software. For the training task Yip showed, users found it easier to work through the humanoid than with manual tools. The da Vinci remained the easiest system to use. Yip treated that result as expected: da Vinci has been engineered for 25 to 30 years specifically for laparoscopic surgery. The humanoid result was not parity, but enough to justify continued investigation.
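
The fulcrum effect and its software correction are easy to state concretely. With a manual tool, lateral handle motion is inverted and rescaled at the tip; a teleoperated robot can apply the inverse mapping so the tip simply tracks the user's hand. The lengths in the sketch below are assumed for illustration.

```python
# Fulcrum effect and its software correction in laparoscopic teleoperation.
import numpy as np

l_outside = 0.20    # handle-to-fulcrum distance in metres (assumed)
l_inside = 0.30     # fulcrum-to-tip distance in metres (assumed)

def tip_motion_from_handle(handle_delta):
    """Fulcrum effect: lateral tip motion is inverted and rescaled."""
    return -(l_inside / l_outside) * handle_delta

def handle_command_for_tip(desired_tip_delta):
    """Software correction: invert the mapping so the tip tracks the hand."""
    return -(l_outside / l_inside) * desired_tip_delta

hand_delta = np.array([0.01, 0.00])                    # user moves 1 cm to the right
manual_tip = tip_motion_from_handle(hand_delta)        # manual tool: tip goes left, scaled up
robot_tip = tip_motion_from_handle(handle_command_for_tip(hand_delta))  # robot: tip follows hand
print(manual_tip, robot_tip)
```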

In Q&A, he added that the humanoid’s physical limits affected performance. The study used the same user interface and camera, but the G1 humanoid did not have a large wingspan, and users frequently hit joint limits while working around the laparoscopic fulcrum. The da Vinci has been designed to minimize those limitations. When asked whether the G1 was still doing one-to-one control around the remote center of motion, Yip said that had been programmed in.

Hands, haptics, and tactile data are a hard bottleneck

Michael Yip said the hospital-task videos exposed a major limitation: diverse instrument manipulation is challenging. A humanoid may resemble a person, but a robot hand is not a human hand. Robot hands have fewer degrees of freedom, size mismatches, and limited sensing. Some are much larger than human hands — “basketball player size hands,” as Yip put it. Tactile sensors may exist only on selected fingertip and phalange regions, leaving large areas effectively numb.

Teleoperating such hands proved nearly impossible in the lab’s initial attempts. The team tried using cameras to observe human hands and make the robot hand match them. Objects fell out, or the robot squeezed too hard. Yip’s conclusion was that the robot should often learn how to use instruments in its own morphology rather than imitate human grasping exactly.

The lab used reinforcement learning to train robot hands, in massively parallel simulation, to hold and manipulate articulated tools. The examples included tweezers, tongs, forceps, pliers, and a laparoscopic tool with a trigger. These are hard because one hand must maintain a stable grasp while also opening, closing, or actuating the tool.

Yip still wants demonstrations where possible, because reinforcement learning is computationally heavy. But good demonstrations require touch. It is not enough for the robot to sense contact; the human demonstrator also needs force feedback. In hospital settings, clinicians manipulate buttons, knobs, plugs, wires, probes, syringes, and instruments. Commercially available haptic devices, he said, did not provide the kind of feedback needed for even a simple straight-on button press. Gloves that resist finger closure do not reproduce the feeling of pushing a fingertip forward into a button.

That led the lab to build the N2D haptic glove, which Yip described as the first haptic glove they know of that can deliver directional haptics to every finger. Linkage systems and actuators push and pull the fingers in multiple directions, providing force feedback for actions such as pushing or sliding across a surface. The system is bulky, especially around the hand, but Yip said that was partly a design choice: they avoided mechanical tethers to the wrist to reduce ergonomic strain.

The glove was shown in dexterous hand teleoperation, virtual-reality simulation, and humanoid teleoperation. Attaching it to the humanoid robot allowed users to push buttons and collect demonstrations more effectively.

In Q&A, Yip returned to tactile data as one of the central training-data bottlenecks. Raw and labeled data both matter, but for robot interaction he highlighted force and touch. Much of human interaction with the world depends on feeling it. Yet tactile sensing faces a material-science tradeoff: sensitive sensors may break easily, while robust sensors tend to be sparse or less sensitive. Yip said that will be “quite a long road,” but argued researchers should still incorporate whatever force and tactile data are available now.

The exchange also clarified what was and was not being delegated to learned models. Asked whether the tearing-energy concept came from foundation models or their physics, Yip said the closed-loop tissue-tearing work used a local position-based-dynamics simulator, not a foundation model. The system could obtain an energy metric from that simulator, while vision associated observed tissue movement with viscoelasticity and boundary-energy estimates. The answer matters because it separates two layers that can be conflated: learned perception may help identify and track the scene, but the safety-relevant estimate in that example came from an explicit local physics model.

Wound care shows why medical autonomy is not limited to operating rooms

Michael Yip closed the technical arc by moving beyond hospitals and operating rooms to chronic wound care. A chronic wound, as his slide defined it, is a wound that fails to heal within an expected timeframe. Such wounds recur, heal slowly, and affect about 2% of the population in North America, according to the slide.

2%
share of the North American population with chronic wounds, as shown on Yip’s slide

For patients with mobility impairments, wound care can require frequent, often daily, dressing changes. Yip described this as a large burden on patients, caregivers, and families, especially given nursing shortages. The burden is not merely logistical; he said patient dignity is affected when family members must provide repeated intimate care. That creates a case for autonomous robots that help people live more independently.

The lab applied its deformable modeling and optimization-based control tools to dressing removal. The robot learned how to peel bandages in directions intended to minimize skin stretching and reduce pain. Later work addressed tape manipulation: removing dressings, preparing tape, and placing new tape. Yip characterized this as early work and pointed out visible shakiness in the robot hand, while also saying he was hopeful about the social impact if the capability matures.
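
The peel-direction choice can be sketched as a small optimization: evaluate candidate directions with a deformation model and take the one with the lowest predicted skin stretch. The stretch model below, with an assumed axis of highest skin tension, is a placeholder for the lab's deformable simulation.

```python
# Peel-direction sketch: pick the direction with the lowest predicted skin stretch.
import numpy as np

tension_axis = np.array([1.0, 0.0])        # assumed direction of highest skin tension

def predicted_stretch(direction):
    """Placeholder model: peeling along the tension axis stretches skin the most."""
    d = direction / np.linalg.norm(direction)
    return 1.0 + 2.0 * abs(d @ tension_axis)

angles = np.linspace(0, np.pi, 36)                    # candidate peel directions
candidates = np.stack([np.cos(angles), np.sin(angles)], axis=1)
best = min(candidates, key=predicted_stretch)
print(best, predicted_stretch(best))                  # near-perpendicular to the tension axis
```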

A small hardware detail captured the nature of the problem: robot hands did not have fingernails, so the lab had to 3D print fingernails to help peel tape. The example reinforces Yip’s broader argument that medical autonomy cannot be reduced to a software policy. The physical interface between robot and world — fingertip, sensor, tool, adhesive, tissue — often determines whether the task is possible.

The research path is hybrid: foundation models where useful, physics where necessary

Michael Yip did not reject foundation models. He said they are already useful in vision, segmentation, and tracking, and that his lab will integrate external models when they work. His argument is about where autonomy in surgical and medical robotics needs additional structure.

For high-stakes, contact-rich, deformable environments, Yip believes mathematical models and physics understanding remain central. They make systems more generalizable and adaptable, and they help preserve explainability. The supervisor must be able to ask what the robot is doing, whether it is safe, and how control remains with the human rather than the machine.

In response to a question about other imaging modalities, he agreed that surgery should use whatever modality is available, including ultrasound. His lab has worked with ultrasound before, and he described one deployment mode: an ultrasound probe can be inserted into the body, picked up by a surgical tool, and scanned internally. A robot may be able to combine ultrasound, spatial awareness, and other information in ways humans cannot easily juggle.

On hardware versus software, Yip admitted his mechanical-engineering background biases him, but his answer was direct: software is critical, yet no amount of software can make a poor robot achieve what it needs to achieve. Surgical robots are carefully designed for inherent safety, backdrivability, low backlash, smoothness, and mechanical reliability. Hardware design remains important.

Yip’s position is neither classical robotics alone nor end-to-end learning alone. It is a layered approach: use modern perception where it helps, build physical models where contact and deformation demand it, learn policies and skill sequencing where they can scale behavior, and choose embodiments that let useful autonomy reach more clinical tasks.
