AI Engineering Is Moving From Model Benchmarks to Production Harnesses

Nathan Labenz | The Cognitive Revolution | Monday, June 22, 2026 | 27 min read

Shawn “swyx” Wang argues that AI engineering is shifting from a race over raw model capability to the production systems that make models usable: evals, harnesses, memory, routing, infrastructure and auditability. Drawing on Cognition’s Frontier Code benchmark and his view of the AI Engineer agenda, Wang says the key software frontier is no longer whether agents can pass tests, but whether they can produce maintainable, mergeable code inside real organizations. His broader case is that unstable model access and enterprise constraints make the surrounding system, not the model alone, the durable product boundary.

The new software frontier is mergeable code, not passing tests

Shawn Wang treated the current state of AI engineering as a production-systems problem, not simply a model leaderboard problem. The most concrete example was Frontier Code, the benchmark he helped develop at Cognition. He called it “my baby,” while crediting most of the work to Cognition’s research team and contractors, including collaboration with Mercor. The name was inspired by Frontier Math. Wang said he realized Epoch was not going to do a code equivalent, so Cognition would.

Frontier Code is positioned as a successor to earlier agentic software benchmarks. Wang described SWE-bench as the first major agentic benchmark because it gave models open-ended tasks and let them use tools to solve real issues. But he argued that saturated benchmarks such as SWE-bench no longer distinguish model quality well enough. The gains are often one or two percentage points, and it is hard to know how much is memorization. More importantly, models can pass tests while producing code that would not be merged.

Frontier Code tries to measure the thing production teams care about: not merely whether a patch passes a test, but whether it is acceptable code. It adds rubrics and shifts the task distribution toward production work across languages including C++, Python, and Java, rather than a narrower set of issues. It is out of sample, heavily graded, and designed around observed failure modes.

Wang said Cognition has an internal catalog of roughly 20 ways models cheat. In training, these show up as reward hacks. In benchmarks, they show up as false positives: code that technically satisfies a test but violates the spirit or maintainability requirements of the repository. Frontier Code translated those patterns into rubrics. As Wang put it, the goal is to “guide the evolution of models towards maintainable code against slop.”

A Cognition chart shown during the discussion compared FrontierCode Diamond scores across models and systems. The visible bars ran from 0.7 for Llama 3 70B to 13.6 for CLE. The point of the chart was not just that CLE led the visible comparison, but that Frontier Code was intended to create separation where older coding benchmarks had compressed differences into small leaderboard moves.

Model or system	FrontierCode Diamond score shown
Llama 3 70B	0.7
Mixtral 8x22B	1.0
DeepSeek V2	1.2
GPT-4o	2.4
Claude 3.5 Sonnet	3.5
Kim v0.1	3.6
Kim v0.2	3.7
Kim v0.3	4.2
Claude Opus 3	4.7
Kim v0.4	5.3
CLE	13.6

Wang also cited a separate claim from METR that about 50% of SWE-bench-passing code is “completely unmergeable.” He gave examples: modifying too many files, cheating the test, ignoring code style, or otherwise producing a patch that passes but does not belong in a production repository. The benchmark’s purpose, in his telling, is to counter incentives that reward models for lines of code or superficial test-passing rather than maintainability.
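The gap between passing tests and being mergeable can be sketched as a two-stage grader. This is an illustration only: the `Patch` fields, the three rubric checks, and the `max_files` threshold are hypothetical, chosen to mirror the examples Wang gave. They are not Frontier Code's actual rubrics, which the discussion did not spell out.

```python
from dataclasses import dataclass, field


@dataclass
class Patch:
    """Hypothetical summary of an agent's patch; field names are assumptions."""
    tests_passed: bool
    files_changed: list[str] = field(default_factory=list)
    style_violations: int = 0


def rubric_failures(patch: Patch, max_files: int = 5) -> list[str]:
    """Return the illustrative rubric items this patch violates."""
    failures = []
    if len(patch.files_changed) > max_files:
        failures.append("modifies too many files")
    if any(path.startswith("tests/") for path in patch.files_changed):
        failures.append("edits the tests it is graded on")
    if patch.style_violations > 0:
        failures.append("ignores repository code style")
    return failures


def grade(patch: Patch) -> dict:
    """Passing tests is necessary; a clean rubric is what makes it mergeable."""
    failures = rubric_failures(patch)
    return {
        "passes_tests": patch.tests_passed,
        "mergeable": patch.tests_passed and not failures,
        "rubric_failures": failures,
    }


# A patch that passes only because it touched the test file, and a clean one.
cheat = Patch(tests_passed=True, files_changed=["src/app.py", "tests/test_app.py"])
clean = Patch(tests_passed=True, files_changed=["src/app.py"])
print(grade(cheat))
print(grade(clean))
```

Both patches pass their tests, but only the second would count as mergeable under these made-up rules, which is the kind of false positive a test-only benchmark cannot see.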

That created a subtle disagreement with Nathan Labenz, who asked whether human maintainability preferences might cap AI’s upside. If a model finds a strange but correct solution, is insisting on human taste a mistake? Labenz compared it to math proofs: if a model proves a theorem, even in an inelegant or hard-to-read way, there is still a sense in which it has won.

Wang agreed the point was real. Mathematicians reviewing Lean-generated proofs sometimes find them too detailed and unlike what a human would write, even if they are formally provable. In software, he said, there may be cases where one can stop looking at lines of code and care only whether the black box behaves correctly.

But production software has constraints math proofs do not. Other agents may need to read and maintain AI-generated code. Critical software may be subject to SEC, healthcare, or other regulatory obligations. “AI coded this thing, I don’t know what’s going on in there” is unlikely to be an acceptable excuse for the next 30 years, he said. He expects some governance around code quality to remain, though it may loosen over time.

His own practice reflects that. Wang said he has a skill called Kakuna that only hardens code, because “that’s the only skill I know.” It tries to make code parallelizable, because if models are left alone today they often create “monster files” of 6,000 lines that become unmanageable. He cautioned against going “YOLO straight into” not caring about code quality.

Frontier Code is also designed to move. Wang rejected the idea that one benchmark can avoid saturation forever, especially if it uses open-source repositories. Frontier Code 2026, he said, will likely be saturated by the end of the year, perhaps reaching around 80%. That is expected. The answer is an annual cadence: Frontier Code 2027, 2028, and so on, with the goalposts moving each year. The simplest first theme is code quality. Wang’s candidate for the next theme is security.

The other direction is private held-out evaluation. Cognition works with large banks and other Fortune 500 companies, and Wang wants private evals that reflect unsolved enterprise problems: Frontier Code Finance, Retail, Telecom, Government, and similar domain sets. In his framing, an agent lab can translate industry problems into evaluation signals that model labs can use to improve models.

The product boundary has moved from the model to the harness

Wang described the state of AI engineering less as a single capability race than as a set of systems problems hardening around models: harnesses, retrieval, memory, evaluation, deployment infrastructure, data quality, and the organizational layer that turns raw model output into production software.

His vantage point comes partly from curating the AI Engineer World’s Fair. He framed that role bluntly: once a year, he gets to be “a dictator of AI engineering” because selecting the conference tracks effectively sets an agenda for what the field treats as live. This year, that agenda has shifted. A “coding agents” track has become “software factories.” RAG has been reframed as “search and retrieval.” GPU-related work has split into pre-training, mid-training, post-training, and inference. Data quality has become a pre-training concern because, in Wang’s words, “your model is only as good as the stuff that you train on,” and better data can make training more efficient.

The larger claim is that the product boundary has moved. Wang cited Greg Brockman saying, in effect, that the model alone is no longer the product; the model plus the harness is the product. Wang said people on the systems side had been saying that for a while. “It used to be cope and now it’s less cope,” he said, while adding that it is “still a little bit cope.”

The point was not that model capability no longer matters. It was that capability only becomes commercially useful when the surrounding infrastructure can route tasks, preserve memory, enforce security, provide tools, manage cost, and produce outputs that can survive contact with real organizations.

That same cost/capability tension runs through model deployment. Prakash Narayanan asked whether startups and enterprises are shifting from the hunt for capability to the hunt for cost effectiveness as more powerful and expensive models become available. Wang separated the market: startups should maximize capability; enterprises should maximize cost effectiveness.

The feeling has shifted, he said, because companies have seen public examples of token-maxing problems at large organizations. Enterprises are asking whether the tokens produce ROI. Some CEOs ask why they should pay for developers and also pay for the tokens developers use. Still, Wang said a core group of influencers, experimenters, and R&D teams will continue to chase maximum capability because that is how the industry discovers what will become affordable later.

Fable-class models, in Wang’s telling, are worth exploring even if they are slow and expensive, because they preview capabilities that may be broadly affordable two or three years later. He said he is glad frontier labs still make frontier models available through APIs. A darker future would be one in which a lab does not release a model through an API and only allows access through its own first-party product. That would constrain the long tail of builders who want to apply frontier capabilities to things beyond coding and B2B SaaS.

Labenz read the Frontier Code results as evidence that a high willingness to pay for Fable-class coding models is rational. His rough takeaway was that Fable achieved something like a 25% correct, approved, mergeable rate, while Opus was in the high single digits, with less than a 2x increase in cost. If the success rate more than doubles while the cost less than doubles, he said, the simple conclusion is to move tokens to Fable where available.
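Labenz's reasoning can be made concrete as expected cost per mergeable patch. The sketch below assumes independent retries, so the expected spend is cost per attempt divided by merge rate. The inputs are placeholders: 0.09 stands in for "high single digits," 1.9 for "less than 2x," and costs are in arbitrary units. None of these are published figures.

```python
def cost_per_merged_patch(cost_per_attempt: float, merge_rate: float) -> float:
    """Expected spend per mergeable patch, assuming independent retries."""
    if not 0 < merge_rate <= 1:
        raise ValueError("merge_rate must be in (0, 1]")
    return cost_per_attempt / merge_rate


# Placeholder inputs (assumptions, not measured values):
#   Opus:  1.0 cost unit per attempt, 9% merge rate ("high single digits")
#   Fable: 1.9 cost units per attempt ("less than 2x"), 25% merge rate
opus = cost_per_merged_patch(cost_per_attempt=1.0, merge_rate=0.09)
fable = cost_per_merged_patch(cost_per_attempt=1.9, merge_rate=0.25)

print(f"Opus:  {opus:.1f} cost units per merged patch")
print(f"Fable: {fable:.1f} cost units per merged patch")
```

Under these placeholder inputs the more expensive model comes out cheaper per accepted patch (about 7.6 units versus about 11.1). The conclusion flips if the cheaper model's merge rate on a given task is already high, which matches Wang's caveat about commodity problems.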

Wang agreed for the hardest problems, but not for everything. Model and trajectory efficiency are underrated, and frontier labs are improving them. On hard tasks, higher reasoning effort can produce real uplift. But most people, most of the time, are working on commodity problems. For commodity problems, Wang said, they do not need Fable.

Memory is split between weights and workflows

Labenz pressed Wang on continual learning, an area Labenz said he had been watching for a long time while waiting for something to tip. Wang answered by first warning that published work from frontier labs, especially Google, should be read as a partial signal. For the last two or three years, he said, a consistent view from frontier-lab people has been that if an idea at Google is good, it probably does not get published. “So you should have that in mind when you read anything that is published from Google.”

On continual learning itself, Wang said the field is split between “model people” and “systems people.” The core question is whether memory should update model weights or remain outside the model in a controllable system.

The systems side can look like RAG under another name: store memories in a database, retrieve them when relevant, write reusable skills, and call those skills later. Wang acknowledged that this does constitute learning in a functional sense. If an agent does a thing, writes a skill, and calls that skill next time, “it does learn.” But it is not machine learning. It is “zero gradient,” in his phrase, and better understood as in-context learning or workflow accumulation.
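A minimal sketch of that "zero gradient" loop, assuming a plain in-process dictionary as the memory store. `SkillLibrary`, `solve_from_scratch`, and `run` are hypothetical names, and the slug function merely stands in for an expensive model call.

```python
from typing import Callable, Optional

Skill = Callable[[str], str]


class SkillLibrary:
    """Hypothetical external memory: skills live in a dict, not in weights."""

    def __init__(self) -> None:
        self._skills: dict[str, Skill] = {}

    def learn(self, name: str, skill: Skill) -> None:
        """Store a reusable procedure after the agent has solved a task once."""
        self._skills[name] = skill

    def recall(self, name: str) -> Optional[Skill]:
        """Look up a stored procedure, or None if nothing was learned yet."""
        return self._skills.get(name)


def solve_from_scratch(task: str) -> str:
    """Stand-in for an expensive model call that works out a procedure."""
    return task.strip().lower().replace(" ", "-")


def run(library: SkillLibrary, task_type: str, task: str) -> tuple[str, str]:
    """Reuse a stored skill if one exists; otherwise solve and remember."""
    skill = library.recall(task_type)
    if skill is None:
        skill = solve_from_scratch
        library.learn(task_type, skill)
        source = "solved from scratch"
    else:
        source = "reused stored skill"
    return skill(task), source


library = SkillLibrary()
first = run(library, "slugify", " Hello World ")
second = run(library, "slugify", "Frontier Code")
print(first)
print(second)
```

Nothing in the model changes between the two calls. The "learning" is entirely the entry added to the library, which is why it can be listed, edited, or removed like any other record.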

The weights side argues that true internalization requires training. If the model is to absorb what it has learned, the weights probably need to change. Wang named companies such as Trajectory AI, N-gram, and Adaption Labs as examples of the weight-updating side represented at his conference, with systems-oriented memory companies occupying the other half of the track.

He did not choose a side. He emphasized that the two camps “don’t like each other.” Model people, he said, do not view the systems people as legitimate. Systems people answer that weight updates produce memory systems no one can inspect or reliably control.

The dispute, in Wang’s telling, is ultimately about control, interpretability, and forgetting. Bad facts will be recalled. Sensitive facts may need to be deleted. Organizations will need to debug memory. Maximum control comes from not updating weights and instead controlling what enters an external system.

For enterprises, Wang said the demand is simple: “cheap and perfect and private.” That combination currently skews toward systems rather than weight updates. A single security incident — customer data leaking into a trained model, or one teammate’s information being exposed to another — could make an enterprise decide not to give its data to a model at all.

Still, Wang said weight-updating companies can do serious proof-of-concept work with large enterprises, and at that level even a POC can be “a few million dollars.” The unresolved question is whether they can solve the research and trust problems before money or patience runs out.

The practical architecture he described is not ideological. Keep the traditional harness, with RAG-like memory, running as the stable production path. Run an online-learning or weight-updating system in shadow mode. Compare them. A/B test them. Use production traffic to learn. The problems, in his view, are solvable, but they are open research questions. He also expects relatively few papers because the problem is too commercially valuable.
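The shadow-mode arrangement can be sketched as a wrapper that always answers from the production path and runs the candidate only for comparison. The handler names and the uppercase "answers" are stand-ins. A real comparison would need a task-specific notion of agreement rather than string equality.

```python
from typing import Callable, Optional

Handler = Callable[[str], str]


def serve_with_shadow(request: str, production: Handler, shadow: Handler,
                      log: list) -> str:
    """Answer from production; run the shadow candidate only to compare."""
    answer = production(request)
    candidate: Optional[str] = None
    error: Optional[str] = None
    try:
        candidate = shadow(request)
    except Exception as exc:  # a shadow failure must never reach the user
        error = repr(exc)
    log.append({
        "request": request,
        "production": answer,
        "shadow": candidate,
        "agree": candidate == answer,
        "shadow_error": error,
    })
    return answer


def agreement_rate(log: list) -> float:
    """Fraction of requests where shadow and production gave the same answer."""
    return sum(entry["agree"] for entry in log) / len(log) if log else 0.0


def rag_harness(query: str) -> str:
    """Stand-in for the stable production path."""
    return query.upper()


def online_learner(query: str) -> str:
    """Stand-in for the experimental weight-updating system."""
    if "edge" in query:
        raise RuntimeError("candidate not ready for this input")
    return query.upper()


log: list = []
requests = ["refund policy", "edge case", "reset password"]
answers = [serve_with_shadow(r, rag_harness, online_learner, log) for r in requests]
print(answers)
print(f"agreement: {agreement_rate(log):.2f}")
```

Users only ever see the production answer, so the experimental system can fail or disagree without consequence while the log accumulates evidence for or against promoting it.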

Infinite context is not arriving fast enough to remove systems work

Wang’s account of memory was grounded in a practical constraint: context length is not scaling fast enough to eliminate system design. He called context length “the slowest Moore’s law in the industry.” In his estimate, the industry has gone from roughly thousand-token context windows to million-token context windows in about three years, which he characterized as slow relative to other parts of AI progress. He does not expect a near-term jump to 100 million or 100 trillion tokens.

~1K to ~1M: context-window growth over roughly three years, in Wang’s estimate

That matters because the common workaround for not updating weights is to keep more material in context or retrieve it from an external store. If context does not become effectively infinite, then agents will continue to need memory architectures, skill libraries, workflow histories, and retrieval layers.

If a different architecture made long-term memory much more effective, the balance could change. Wang mentioned state-space models as an example, while noting that most people do not expect them to solve the problem. But the present condition is simpler: “we don’t have infinite context because we don’t have infinite memory,” he said.

That leaves the field stitching systems together. This is why, in Wang’s telling, the systems agenda has become more credible. Harness engineering, retrieval, workflow persistence, skill storage, and agent-specific infrastructure are not just interim hacks while everyone waits for the next model. They are the operating layer through which current and near-future models become usable.

The distinction matters for enterprise adoption. A system that stores memory outside the weights can be inspected, edited, deleted, audited, and permissioned. A system that trains memory into the model may provide deeper internalization, but it also creates harder questions about privacy, provenance, and control. Enterprises do not merely want the smartest answer. They want answers that are cheap enough to scale, correct enough to trust, and private enough not to create an incident.

Advisor routing helps until the labs bake it into the model

The “advisor” architecture — a cheaper model detects when it is out of depth and calls a stronger model for help — is, in Wang’s view, a version of model routing. He placed OpenRouter, Fusion, Sakana’s Fugu, Not Diamond, Martian, and Berkeley-related work in that broader lineage. Cognition co-founder Walden Yan had an earlier version called Smart Friend, released in Windsurf.

Wang’s analysis was ambivalent. As a cost optimization, starting with a cheap model and calling a smarter model as a tool can be useful. It is a natural next upgrade from using only the base cheap model. But he thinks there is a theoretical limitation: the weak model often does not know when the strong model is needed. In Wang’s formulation, “the dumb model doesn’t know what the smart model can do. It just knows the rough shape of what the smart model can do.”

If a question is more complex than the weak model can understand, it may answer directly rather than recognize that it should escalate. In a pure intelligence architecture, Wang said, he would prefer the smart model first, delegating simpler subtasks to cheaper models. But for cost and efficiency, people want the reverse. The reverse is less satisfying and does not solve the deeper problem.
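The advisor pattern and its weak point can both be shown with stub models. Everything here is hypothetical: `cheap_model` and `strong_model` are hard-coded stand-ins, and self-reported confidence with a fixed threshold is just one possible escalation signal, not how any named product works.

```python
from dataclasses import dataclass


@dataclass
class Draft:
    answer: str
    confidence: float  # self-reported by the cheap model: the weak point


def cheap_model(question: str) -> Draft:
    """Stand-in for a small model whose confidence is not always calibrated."""
    known = {
        "2+2": Draft("4", 0.99),
        "capital of France": Draft("Paris", 0.95),
    }
    if question in known:
        return known[question]
    if "prove" in question:
        return Draft("not sure", 0.20)  # recognizes it is out of its depth
    return Draft("plausible guess", 0.90)  # overconfident, so never escalates


def strong_model(question: str) -> str:
    """Stand-in for the expensive model used as an advisor."""
    return f"careful answer to: {question}"


def advisor_route(question: str, threshold: float = 0.7) -> tuple[str, str]:
    """Start cheap; call the strong model only when the cheap one is unsure."""
    draft = cheap_model(question)
    if draft.confidence < threshold:
        return strong_model(question), "escalated"
    return draft.answer, "cheap"


for q in ["2+2", "prove the lemma", "subtle concurrency bug"]:
    print(q, "->", advisor_route(q))
```

The third question is the failure Wang describes: the cheap model does not recognize the problem as hard, reports high confidence, and the strong model is never consulted.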

He expects advisor routing could be a trend for “maybe three months” before labs bake adaptive routing into next-generation models. A single model trained end-to-end for adaptive thinking is likely to make better routing judgments than an external, systems-level routing layer, he said.

That view is not necessarily Cognition’s preferred business outcome. Wang noted that Cognition benefits from a multi-model future in which it can act as a router. But model labs are strongly incentivized to bring routing in-house. He cited OpenAI’s acquisition of Statsig as an example of that incentive.

The strategic conclusion was stark: most users want the “God model” that simply figures out how much reasoning to spend, when to call tools, and how to solve the task. Tinkerers may enjoy building “Frankenmodel” stacks that combine Opus, Gemini, and other models at precise points. Wang does not expect that to become the industry default.

This fed into a broader discussion of platform convergence. Labenz relayed Andrew Lee’s claim that everyone is trying to build the same thing, and asked whether companies like Cognition or Taskade are converging on a neutral layer that prevents lock-in to a single model provider.

Wang said early markets do look commodity-like. For a while, everyone built a VS Code fork. Now everyone is building a background agent and trying to beat Devin. Everyone is building a Slack bot. Ramp, Coinbase, Shopify, and others all have internal versions, and everyone says theirs is transformative. That sameness is a sign of market immaturity.

His analogy was luxury handbags. To an outsider, Chanel, Louis Vuitton, and Kate Spade may all look like bags. In a mature market, segmentation becomes legible to buyers who care. AI agent companies may similarly separate by organization size, compliance needs, industry, and workflow. Some winner-take-most enterprise giants will emerge.

He also expects the old Apple-versus-Android pattern to recur. OpenAI and Anthropic, in his view, want to be the Apple: vertically integrated, with users living inside their ecosystems. Microsoft is skewing toward the open ecosystem role. Smaller startups will need a differentiated tactic or eventually be acquired.

Agent traffic is forcing infrastructure to confront non-human scale

Agent activity is already stressing infrastructure built around human assumptions. Narayanan pointed to Cursor Origin as a Git replacement for agents, GitHub struggling with commit volume, Cloudflare at the edge, and the possibility that agents become 99.9% of internet volume.

Wang said he had asked GitHub’s CEO directly why GitHub was not handling the volume. The answer, in Wang’s summary, was that GitHub is facing the largest scaling challenge it has ever seen “in every dimension ever” — roughly 10x to 15x across commits, parallelism, CI/CD, and related dimensions.

The broader point, Wang said, is that cloud infrastructure must be rebuilt for agents. Existing systems assume identity maps to humans. API tokens are obtained through user interfaces. Those assumptions need to change. Scaling also requires more instantly forkable sandboxes and file systems where agents can operate.

He sees the work by Graphite and Cursor Origin as necessary. But he also sees a structural mismatch: GitHub is now competing with startups that can pay talent far more. Someone capable of solving these problems inside Microsoft may earn a strong large-company salary; the same person at Cursor could, in Wang’s example, earn something like $5 million a year and be working on the “cool, sexy” frontier. GitHub, in his telling, has crossed from the old assumption that a startup cannot beat the incumbent into a regime where the startup may be the preferred place to solve the problem.

The scaling pressure is not limited to Git. Wang said sandbox companies such as E2B and Daytona have been growing at least 50% month over month for the past year or 18 months. The demand is not only for GPUs; it also consumes CPUs and memory. A large amount of it may be wasteful, but the economics are currently working because customers pay for it.

50%+: monthly growth Wang attributed to sandbox companies such as E2B and Daytona over the past year or 18 months
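To get a feel for what that rate compounds to, here is a quick calculation. It assumes a constant 50% every month for the whole period, which is a simplification of Wang's "at least 50%" and is not something the discussion established.

```python
def compound_multiple(monthly_growth: float, months: int) -> float:
    """Total growth multiple after compounding a fixed monthly rate."""
    return (1 + monthly_growth) ** months


for months in (12, 18):
    multiple = compound_multiple(0.5, months)
    print(f"{months} months at 50%/month: about {multiple:,.0f}x")
```

At a steady 50% per month, usage would multiply roughly 130-fold over a year and roughly 1,478-fold over 18 months, which helps explain why CPU and memory, not only GPUs, come under pressure.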

The human version of this problem is spam. Wang runs a large newsletter and now receives replies from Claude instances reading emails and trying to respond. He knows there is no human on the other side, but he still has to block them manually. That pushes him toward needing an agent to read other agents’ emails. He called it “a huge giant recursive bullshit” loop and said he has stopped reading emails.

The next escalation is agents with money. Labenz said his own agent has a Mercury credit card with a modest limit and can buy groceries through tools. Wang encouraged connecting agents to “rent a human” services to see what they do. Giving humans as tools to agents, he said, can produce much more powerful and “unhinged” behavior, because delegation can proceed from agent to human to agent to human.

He compared it to earlier human-in-the-loop APIs such as Amazon Mechanical Turk, CrowdFlower, and Scale AI. Developers learn to think API calls are cheap until a bug runs up hundreds of thousands of dollars in human-task costs.

Wang’s worry is not only load but enclosure. If companies respond to agent traffic by walling off parts of the internet, the result could be not merely a dead internet but a set of closed gardens. He used China’s Baidu and Tencent ecosystems as an analogy: one person lives in one universe, another in another. He said he does not want that for the web. Cloudflare banning some agents while allowing those inside Cloudflare would raise open-web concerns, though he emphasized the same issue could arise with OpenAI, Vercel, or others.

Vibe-coded internal tools are replacing SaaS faster than trust practices are forming

Labenz described writing his own messaging system between himself and several agents instead of using Slack or Telegram. Claude wrote it for him. It lacked most Slack features, but he did not need most of them. The question was whether agents are pushing organizations to rebuild systems of record in-house.

Wang said he is seeing the same pattern and living it himself. He has his own GitHub clone and Slack clone. Immediately before the discussion, he had been on a call with a team in which three people had each vibe-coded their own source of truth for speaker reconciliation. Each person checked a separate internal tool, and together they caught different things.

It sounded wasteful — three systems of record instead of one — but Wang said it was “wasteful in the small but actually more precise in the large.” The redundancy surfaced errors that a single tool or single perspective might have missed.

He expects people will replace SaaS with internal vibe-coded versions. He also expects “dramatic data loss” and outages as a result. The tools will be fragile, but they will be tailored. They will be “your special baby.”

The missing discipline is auditability. Most people have not been trained to write auditable code: audit logs, rollback, recovery, authorization, and data retention. Wang thinks the industry will speedrun those lessons, with models guiding users part of the way.
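Three of those practices (authorization, an audit log, and rollback) fit in a small sketch. `AuditedStore` is a hypothetical in-memory class written for illustration. It does not cover data retention or durable recovery, which a real system of record would also need.

```python
import copy
from datetime import datetime, timezone


class AuditedStore:
    """Hypothetical key-value system of record with an append-only audit log."""

    def __init__(self, writers: set[str]) -> None:
        self._data: dict[str, str] = {}
        self._writers = writers
        self._snapshots: list[dict[str, str]] = []
        self.audit_log: list[dict] = []

    def _record(self, actor: str, action: str, key: str) -> None:
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "actor": actor,
            "action": action,
            "key": key,
        })

    def _authorize(self, actor: str, key: str) -> None:
        """Log and reject any change from an actor without write access."""
        if actor not in self._writers:
            self._record(actor, "denied", key)
            raise PermissionError(f"{actor} may not modify this store")

    def write(self, actor: str, key: str, value: str) -> None:
        self._authorize(actor, key)
        self._snapshots.append(copy.deepcopy(self._data))  # recovery point
        self._data[key] = value
        self._record(actor, "write", key)

    def rollback(self, actor: str) -> None:
        """Restore the state from before the most recent write."""
        self._authorize(actor, "*")
        if not self._snapshots:
            raise RuntimeError("nothing to roll back")
        self._data = self._snapshots.pop()
        self._record(actor, "rollback", "*")

    def read(self, key: str):
        return self._data.get(key)


store = AuditedStore(writers={"alice"})
store.write("alice", "speaker:42", "confirmed")
store.write("alice", "speaker:42", "cancelled")  # a mistaken update
store.rollback("alice")
print(store.read("speaker:42"))
print([entry["action"] for entry in store.audit_log])
```

The mistaken update is undone, and the log still shows that it happened and who made it, which is the property a vibe-coded tool typically lacks until the first data loss.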

Behind that is a bigger strategic challenge to SaaS. Wang said top founders have been talking to him about the company having a sovereign system of record rather than leaving each category of data inside a SaaS vendor. Much of the SaaS economy is built on owning a system of record — meetings, calendars, email, CRM, finance — and now vendors are adding a chatbot sidebar on top.

Wang dismissed that as insufficient. A finance chatbot inside Mercury, for example, does not know everything else he does, does not share his global memory, and may not allow importing his skills.

In his preferred model, the personal or company agent syncs data to the user’s side, and the user decides what to do with it. That raises the question: how many systems of record should there be, and how many $20-per-month subscriptions can survive when users can route their own agents across data?

Wang called Salesforce and Marc Benioff “forward thinking” for saying everything will be available by API. That may require a business-model change, but he framed it as preemptive defense: either Salesforce opens access, or a Salesforce killer will.

Narayanan then asked whether a system-of-record vendor could move into the business it serves. For example, could Veeva, with deep CRM infrastructure for pharmaceutical sales, move directly into pharmaceutical sales as chatbots and workflows blur the boundary? Wang said everyone will try to become “the operating system of” their domain. Most will lose. The key is starting position.

He argued that whoever owns auth tends to win. The authoritative identity layer is difficult to dislodge. Rippling, he said, picked a strong starting point by owning the org chart; once it knows everyone in the company, it can add IT administration, payroll, SSO, and other layers. Wang said this view reflects his current bias, but it has changed his thinking: where he once might have prioritized owning developers or front-end workflows, he now sees auth as a stronger strategic position.

IPO timing may reveal what insiders think about the plateau

Wang’s most speculative concern was not that AI lab IPOs would create liquidity. It was that many informed people seem to be deciding to sell at the same time. He called it his most “tinfoil hat” view and described the worry as an “Illuminati” group chat deciding now is the time to exit.

He has seen this pattern twice before in his career, he said: suddenly all the good companies IPO, and two years later the market is in a deep slide. He emphasized that nobody knows the future and that IPO conversations are complicated. But as a poker-style read, the tell is that “they’re all IPOing at the same time.”

The bull case is that the market still underestimates the earning power of agents deployed to the whole world. Wang asked what happens if agents run not only Labenz’s life experiments, but his mother’s life, his neighbor’s life, and everyday workflows at global scale. That could be a 100x revenue opportunity. If Anthropic were earning $50 billion, a 100x increase would imply $5 trillion in revenue, Labenz and Wang calculated aloud.

The immediate counter is compute. A 100x revenue case may imply something like a 100x increase in compute consumption. Wang asked where that compute would come from, citing the idea that Taiwan is sold out for the next two years and that spinning up a semiconductor industry in America takes a decade.

On balance, he said, people should be “at least a bit worried” that the big run-up has already happened and that the result is a split between those now permanently wealthy and those in the “permanent underclass.”

He also sketched a bear case: perhaps LLM scaling has reached a plateau. Fable-class models may be close to the ceiling of current transformer-based systems. If there is now a pseudo-ban or restraint around Fable-class deployment, and enterprise deployment of current models is already priced in, then future upside may require a new research era and new architectures — the kind of thing Ilya Sutskever has gestured toward.

If frontier-lab employees cash out and leave, Narayanan asked, where would they go with large amounts of capital and freedom? Wang’s answer was direct: robotics first, then medical applications, aging, cancer, and other “grand quests.” If software has been solved, the physical world remains. Medical work is especially compelling to him. He said he attended the Midjourney medical launch and found it eye-opening: maybe it does not work, but turning from cat-picture generation toward faster, cheaper, more appealing CT scans is a worthwhile direction.

He pointed to David Holz as evidence that wealth from one AI domain can fund meaningful work in another. He also cited Elon Musk, Jeff Bezos, and Mark Zuckerberg and Priscilla Chan as examples of wealthy founders pursuing ambitious projects. On Zuckerberg and Chan’s Biohub, Wang argued that in a thousand years people may not remember Facebook, but they may know far more about the human body because Facebook funded Biohub.

Google still has the inputs, even if the product muscle looks uneven

Asked simply “Google. What’s up?”, Wang gave a two-sided answer. The departures of John Jumper and Noam Shazeer looked badly timed. He sees organizational and product issues, and said Google does not have as strong a product gene as it needs. NotebookLM was an early hit, but he suggested that when someone at Google has a hit, they often leave.

At the same time, he argued that “hardware and data are destiny,” and by those measures Google remains in a powerful position. Google invested early in TPUs, LLM research, and the infrastructure needed to remain a frontier lab. Wang said he is “pretty happy” with recent Gemini model quality, including Flash and, presumably, Pro. Google may not be clearly leading, but it is still a frontier lab.

Its strongest strategic position may be the personal operating agent. Wang referred to Google’s internal personal-agent direction as possibly Scout or Spark, while noting that names vary. Jeff Dean, he said, described wanting a personal agent with access to email, calendar, and related personal data. Google has that data natively. Everyone else is syncing from Google. Apple, in Wang’s view, is less likely to execute on this, so the more urgent question might be “what’s up with Apple?” rather than Google.

Google also retains the ability to fund side projects and fundamental research. Wang contrasted that with OpenAI, where he said side projects appear to be dead as the company focuses on coding and enterprise. Google can still fund Veo, Nano Banana, Omni, and text diffusion, which he said was a major success at AI Engineer Europe.

The company has the scale, resources, and talent for 10- or 20-year research bets. Its real peers are Meta and Microsoft. Still, Wang said Google insiders want the models and products to be better. They know the problems and are working through them. He did not claim to know better than they do what exactly should change.

The junior-engineer market now rewards taste more than credentials

Narayanan raised a striking pair of numbers: Stanford computer science graduates down 42%, Berkeley down 61%. He said that at Stanford, much of the decline appears to come from students who previously treated computer science as a safety credential alongside another degree. At one point, he said, two-thirds to 75% of graduates had a computer science minor or major. Now many still take CS classes but no longer complete the secondary degree. Core CS students remain, while peripheral CS students have dropped the credential.

42% / 61%: declines Narayanan cited for Stanford and Berkeley computer science graduates

Wang said the market for junior engineers is “very challenging.” Fewer organizations are hiring junior engineers. Students are right to be concerned.

But he drew a sharp distinction between generic junior engineers and exceptional young builders. The “cool kids” and highly compensated technical staff at OpenAI, Cognition, and similar places are often under 21, he said. He joked that he is the oldest person in the room, sometimes twice their age. The market still believes in young computer science talent, but candidates must demonstrate that they are “cracked.”

The scarce thing is not coding ability or LeetCode performance. It is execution and taste. Wang advised graduates to do interesting work and to demonstrate research taste, project taste, and the ability to ship something that makes people ask what it is or want to use it. He recommended reading the emerging genre of posts explaining how people got into frontier labs and attending leading industry conferences to understand what people are actually building and what problems they have.

His career advice returned to a theme he has advocated since before the AI boom: learning in public. Periodically collect what you are thinking about, write it down, and publish it as notes to yourself. Over time, that can establish someone as an authority. Many people fear looking junior or being wrong in public, but Wang argued there is no better teacher. People do not care only about the level at which someone debuts; they care about the slope. If they see improvement and responsiveness, they want to coach.

The way to get mentors, he said, is not to ask to buy coffee or “pick their mind.” That is outdated. Work on interesting things, respond to feedback, and pick up what people put down.

Unstable model availability makes the harness more important

The systems emphasis in Wang’s argument sits against a policy and availability backdrop that Labenz and Narayanan had been discussing before he joined. Their concern was not just whether the best model exists, but who can access it, whether it is public, whether it remains available by API, and whether open-weights competitors compress the gap faster than U.S. policy can react.

Labenz said he had been reflecting on Judd Rosenblatt’s criticism that the AI safety community lacks cognitive empathy for the Trump administration. His revised view was that, even if the administration’s tactics look clumsy and export-control-heavy, it is engaged and taking AI seriously. He does not want AI policy pulled deeper into the national security apparatus or made secret by default, and he continues to believe diffusion matters. But he said the safety community should meet policymakers where they are rather than ridicule them for not sharing insider-level understanding.

That matters for builders because frontier access is no longer a neutral assumption. Labenz discussed Dean Ball’s move from the Trump administration to OpenAI, where Ball is joining to lead a Strategic Futures team. Labenz described Ball as the lead staffer behind America’s AI action plan and said Ball told him the OpenAI role may be more important and higher-stakes than his White House work. Ball’s role, as Labenz summarized it, is not conventional lobbying. OpenAI already has a government affairs team for bills and policy fights. Ball’s team is supposed to focus on frontier AI policy: catastrophic risk, recursive self-improvement, labor disruption, and lab governance six to twelve months out.

The tension Labenz identified is whether this is a leading AI governance mind getting access to ground truth, or regulatory capture framed as a hiring announcement. Ball’s stated reason, according to Labenz, is that governing recursive self-improvement from outside is hard if the public record is only hints and tea leaves. To make good policy recommendations, Ball believes he needs to understand RSI where it is being pushed hardest, in close contact with researchers.

Labenz also discussed an unverified claim from Andrew Curran that a more capable version of Anthropic’s Mythos had emerged from training and might remain internal. Labenz said he had been told that people at Anthropic, or nearly all of them, still did not have internal access to Mythos or Fable. He had not asked whether Anthropic had some other non-public model it could use. His point was conditional: if a lab can keep developing and deploying internal successors while earlier models are withheld or embargoed, then policymakers and outside builders may be reasoning over an incomplete map of what exists.

Narayanan argued that GLM-5.2, an open-weights Chinese model, sharply compresses the timeline. He cited Matt Veloso’s post saying he had used GLM-5.2 all day and that it was the first open model that “passes the bar as a daily driver.” Narayanan said the post’s reach exploded, and he showed a Google Finance chart indicating a large five-day stock move for Knowledge Atlas Technology, which he identified as the creator of GLM-5.2. In Narayanan’s view, the U.S. government has only a short window to secure its own stack before Chinese open models approach Mythos- or Fable-level performance.

Labenz added two qualifications. First, AI is broadly unpopular, so the administration may be able to hold the line against even Nobel Prize-winning scientists if the public is anxious about AI. Second, he said it is credibly claimed that GLM-5.2 contains a significant amount of Claude through distillation: it may sometimes identify as Claude and show a Claude-like voice. He said he had not used it enough to have a firm view. If Chinese models rely heavily on distilling American frontier models, then freezing U.S. model releases could slow the Chinese timeline too, at least for Fable-like behavior.

The implication for Wang’s AI-engineering thesis is practical. If frontier APIs can be delayed, restricted, bundled into first-party products, routed differently by labs, or overtaken in some workloads by open-weights alternatives, then the durable work shifts even more toward harnesses and operational layers that can survive model churn. Wang’s preference for an open API ecosystem, his skepticism of purely external advisor routing, and his focus on evals, memory, auditability, and infrastructure all become more important when model availability is unstable rather than guaranteed.
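
To make that implication concrete, here is a minimal sketch of a harness that outlives any single model. It is an illustration only, not a design described by Wang, Labenz, or Narayanan. Every name in it (`Backend`, `Harness`, `ModelUnavailable`, the stub model functions) is hypothetical, and the eval check is a toy stand-in for a real rubric. The idea is that the ordered backend list, the eval gate, and the audit log stay fixed while the models behind them change.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


class ModelUnavailable(Exception):
    """Hypothetical error for a backend that is delayed, restricted, or withdrawn."""


@dataclass
class Backend:
    name: str
    call: Callable[[str], str]  # assumed interface: prompt in, completion out


@dataclass
class Harness:
    backends: List[Backend]              # tried in order of preference
    passes_eval: Callable[[str], bool]   # stand-in for a real eval or rubric
    audit_log: List[dict] = field(default_factory=list)

    def run(self, prompt: str) -> Optional[str]:
        for backend in self.backends:
            try:
                output = backend.call(prompt)
            except ModelUnavailable as exc:
                # Record the outage and fall through to the next backend.
                self.audit_log.append(
                    {"backend": backend.name, "status": "unavailable", "detail": str(exc)}
                )
                continue
            accepted = self.passes_eval(output)
            self.audit_log.append(
                {"backend": backend.name, "status": "accepted" if accepted else "rejected"}
            )
            if accepted:
                return output
        return None  # no backend produced output that passed the eval


# Stub backends standing in for a restricted frontier API and an open-weights model.
def frontier_api(prompt: str) -> str:
    raise ModelUnavailable("access restricted")


def open_weights_model(prompt: str) -> str:
    return f"patch for: {prompt}"


harness = Harness(
    backends=[Backend("frontier-api", frontier_api), Backend("open-weights", open_weights_model)],
    passes_eval=lambda out: out.startswith("patch"),
)
result = harness.run("fix flaky test")
```

In this toy run the first backend is unavailable, so the harness falls back to the second and records both outcomes in `audit_log`. Swapping, reordering, or removing a backend does not touch the eval gate or the audit trail, which is the sense in which the surrounding system can survive model churn.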
