AI Governance Shifts From Model Review to Release Bottlenecks

Prakash Narayanan

Nathan Labenz

Tal Hoffman

Yanir Tsarimi

Brett Levenson

The Cognitive Revolution

Wednesday, June 3, 2026

27 min read

Nathan Labenz and Prakash Narayanan use Trump’s new AI executive order, state audit bills and frontier-model release reviews to argue that AI governance is becoming an operational bottleneck as much as a policy question. Their central concern is that early-access review, audits and classified benchmarks may reassure governments and the public, but can also delay defensive capabilities, obscure accountability and push hard technical judgments into political processes. The same pattern appears in the security and content-safety discussions: Enclave AI’s Tal Hoffman and Yanir Tsarimi argue that AI has made finding bugs easier than deciding which vulnerabilities matter, while Moonbounce’s Brett Levenson says real-time policy enforcement depends on decomposing ambiguous rules into fast, auditable product controls.

The release-delay problem is now the practical governance problem

The most consequential governance failure mode discussed was not that government review of frontier models is impossible in principle. It was that review can create a cascade in which the institutions most alarmed by a model’s capabilities are also too slow to absorb the defensive benefits of releasing it.

Prakash Narayanan sketched the scenario around a hypothetical “Mythos” preview. In his analysis, a frontier model is shown to government agencies before release; agencies use it and find security holes across the Department of Defense, NSA, and other systems; they do not have enough contractors to patch everything; new holes keep appearing; and officials panic and tell the lab not to release. That may protect the government for a moment, but Narayanan argued it can hurt the rest of corporate America if comparable open-source capabilities emerge while the strongest closed defensive tool remains delayed.

He framed that as a plausible downstream effect of President Trump’s new AI executive order. The order was introduced through a CNBC headline shown on screen: “Trump signs AI executive order asking companies to give government early access to models.” Nathan Labenz described the order as an early-access review regime that appears to ask frontier model companies to give the government roughly 30 days before release to review models and render opinions.

I guess the big headline is, right, they're gonna ask politely for model companies to give them 30 days to review models and render some opinions. It sounds like a very sort of gentleman's agreement right now, from what I can tell.

Nathan Labenz

Labenz was unsure how much force the order has. He did not see an obvious mechanism that would prevent a company from refusing to participate or launching despite the executive branch’s view, though he acknowledged there could be consequences depending on who has leverage. Frontier labs, in his view, have shown that they can withstand government pressure; he cited Anthropic’s ability to weather what he described as a previous pressure campaign around a supply-chain designation, which he said now seemed to be getting “memory-holed” rather than clearly resolved.

The sensitive design choice is the possibility that the government’s benchmarks could be classified. Labenz did not object to hidden tests in principle. He argued that AI benchmarking has a real transparency problem: if all benchmark items are public, companies can overfit to them and results become less trustworthy. He pointed to semi-private benchmark designs from Scale and the ARC AGI Prize, where sample problems are visible but much of the test set is held back.

Classified government benchmarks, however, create a different legitimacy problem. A Dean W. Ball post shown on screen warned that the order “sets up a bad outcome in frontier AI governance: the government regulating AI models you aren’t allowed to use in a way you aren’t allowed to know about.” Ball added that the outcome was “not here yet,” but that the “direction of travel is apparent.” In the same visible post, Ball wrote that regulation beyond the Biden executive order is now necessary given current frontier-model capability, while objecting to the idea that Trump’s order is less intrusive than Biden’s.

Labenz agreed that libertarians should instinctively dislike a regime where the government tests models the public cannot use against standards the public cannot know. He described himself as a lifelong techno-optimist libertarian. But he was not ready to treat the order as a crisis. If the regime remains “polite asking,” he said, it may not produce a secrecy vortex or self-dealing licensing structure. A government with any degree of competence would eventually want to test frontier models, be in the room as release decisions are made, and have hard facts about what companies are doing.

Narayanan placed the executive order inside a political sequence, while making clear that his account was an interpretation of what had happened. In that account, the “Mythos” release and preview cycle alarmed government officials. He described a prior confrontation involving Anthropic and a federal defense agency, a period in which Anthropic was nearly “excommunicated,” and then a prerelease and preview that he said “freaks everyone in government out.” He said an initial executive-order draft emerged, Trump almost signed it, and David Sacks pulled it back because it would slow the economy.

State action then became the forcing function in his analysis. A WIRED article shown on screen described Illinois lawmakers as having passed “America’s Strongest AI Safety Bill,” with visible text saying the bill requires companies like OpenAI, Anthropic, and Google to have third parties confirm they are following safety standards, and that Illinois governor JB Pritzker said he would sign it. Narayanan read the move partly through presidential politics: Pritzker, in his view, is a wealthy potential Democratic candidate, can self-fund, has been targeted by Trump, and is one of the few Democrats in power who has actually acted on AI.

Congress, meanwhile, had failed to pass federal preemption of state AI regulation. Narayanan attributed that partly to Republican divisions. Some Republicans, he said, dislike AI or prefer states’ rights in determining AI policy. He mentioned Marsha Blackburn and Tennessee’s music-industry concerns, then widened the point: states and industries that feel they lost out to Silicon Valley have reasons to resist another Silicon Valley-led technology wave. Nashville and Los Angeles music interests, in his view, are part of a coalition that sees technology companies as “taking everything.”

The result, as Narayanan described it, is a collision among state governments, Congress, the executive branch, Democrats, Republicans, state economies, and frontier labs. Senator Ted Cruz’s statement, shown in a Dean Ball repost, captured the federal conservative position he considered important: “AI is developing rapidly. This administration is right to recognize the cybersecurity risks posed by advanced models. Now, it’s Congress’s turn. We must address catastrophic risk without ceding ground to China or restricting Americans’ free expression.”

Narayanan did not expect an easy legislative path. It is an election year, and legislators want credit for upside without blame for downside. He also argued that the order may be voluntary in form but not in effect. If one major lab refuses to participate while others comply, media and social media pressure can enforce compliance. For smaller companies, he was less worried: if a $10 million startup is criticized by the president, that attention may be more useful than damaging.

The release-delay cascade is where this politics becomes operational. Narayanan’s concern was that once an executive-order process exists, it gives people inside government a venue to say no. If agencies cannot patch the vulnerabilities a model helps them find, they may ask for delays. But if defensive users are delayed while attackers gain comparable capability through open-source systems, the review process can harm the organizations it was supposed to protect.

Audits have become the political compromise, not the technical endpoint

Nathan Labenz saw an unexpected convergence around audits and pre-release review. He had recently attended a recursive self-improvement event and came away with two updates: a negative one, that keeping recursive self-improvement on the rails still seemed to rely heavily on chain-of-thought monitoring and only somewhat more advanced monitoring strategies; and a positive one, that people at frontier companies were more willing to discuss the possible need for a coordinated slowdown.

Labenz treated OpenAI’s support for the Illinois-style audit approach as central to this shift. He thought the Illinois bill might be more demanding than the final vetoed version of California’s SB 1047 because third-party auditor access sits at the core of the Illinois concept. He compared the Illinois approach with Connecticut-style ideas and the federal executive order. His rough distinction was that Illinois requires third parties to do review, while Connecticut may move closer to a pre-approved list of organizations empowered to conduct it. He was explicit that he would need to study the mechanisms more to sharpen that comparison.

The broader structure resembled what Labenz said Fathom had advocated: government defines safety outcomes, and private auditors or quasi-regulators interface with companies to determine whether those standards are being upheld. This is not the same as a federal agency doing the work directly. But for Labenz, the mechanisms were converging: pause before release, get under the hood, examine what is going on, and decide whether the government or auditors are comfortable.

The surprising part was not only that requirements had moved upward from the final SB 1047 baseline, but that public conflict had moved downward. Labenz had not seen major labs “freaking out.”

Prakash Narayanan offered an explanation rooted in the labs’ incentives. Anthropic and OpenAI, he said, have become more confident for three reasons: both firms are now in recursive self-improvement; China, in his view, has fallen significantly behind because chip restrictions have worked; and alignment has worked in the target area where the labs wanted it to work. He added that the labs appear more comfortable with a world where at least two major frontier firms survive, rather than one firm becoming the sole dominant AGI provider.

That confidence, in Narayanan’s view, makes regulation easier to accept. But he did not treat audits as a perfect technical mechanism. An external auditor will not know what internal teams know, or move at the same speed. Narayanan regarded audits primarily as public and governmental assurance.

His more cynical account was that audits also function as a labor and political sink. Laid-off white-collar workers can be moved into assurance and evaluation. Lawyers and politically affiliated actors can be paid through audit processes. In that view, auditing partly occupies and funds the people most likely to challenge the labs.

The risk is release delay. Narayanan’s question was not whether audit can add value, but how much it slows frontier model availability and what downstream effects follow. If releases slow too much, he said, China could catch up, or the economy could disappoint because deployed models are not strong enough to do the work people expect.

Labenz agreed that the China point was important. At earlier AI strategy events he attended, U.S.-China race dynamics were central enough to drive tabletop exercises: U.S. labs ahead, Chinese actors stealing weights, lower Chinese compute, but qualitatively similar capability. At the more recent recursive self-improvement event, China appeared marginally if at all. The tabletop exercise focused instead on internal lab dynamics once recursive self-improvement begins.

That shift supported Narayanan’s claim: if labs believe the U.S. lead is durable, they can absorb a 30-day pre-release review more easily. Labenz also noted that labs already conduct internal reviews, and the duration varies by company and model. A 30-day government review may overlap with those processes, especially if the government initially tests absolute capability while the company separately refines deployment safeguards, system-level filters, and product behavior.

Still, neither treated the regime as settled. Labenz said the order looked like a rough draft likely to be revised. Narayanan thought executive orders were less important than Congress. Trump can revise policy quickly; the main game is what legislation can pass, whether federal preemption can be achieved, and whether Republicans can hold together enough support despite AI skeptics in their own coalition.

Finding bugs is becoming easier than deciding what matters

The security discussion turned the policy problem into an engineering bottleneck. If AI makes bugs easy to find, why is it not easy to close them?

Tal Hoffman and Yanir Tsarimi of Enclave AI argued that the answer is the gap between finding a possible issue and proving an exploitable vulnerability. Narayanan introduced Enclave as “a Cursor for security”: an AI agent that does not just scan code for hundreds of theoretical issues, but tries to find the smaller number of vulnerabilities someone could actually exploit. He said the company had reproduced the same FreeBSD zero-day that Anthropic’s frontier model found, using Sonnet 4.6, and framed that as a smaller-model-plus-harness result rather than a pure model-size result.

Hoffman said the edge is not only the model. It is the combination of model, harness, context, workflow, and expert human knowledge. The harness and context fed into the LLM were key, he said, to reproducing the same output with a weaker model.

Tsarimi’s point was sharper: current cyber model evaluations often measure exploitation of known vulnerabilities, not discovery from zero. A benchmark may give the model a vulnerability description, code, and a one-shot prompt to exploit it. Sometimes the description is as explicit as identifying the function, file, buffer overflow, and even line numbers. That is not the same as discovering a vulnerability from nothing.

Most like evals today that we see Anthropic coming out and saying well, um we have this model that is really great with cybersecurity. And it is great with cybersecurity, but in a very specific area of reproducing exploits from known vulnerabilities.

Yanir Tsarimi

Tsarimi argued that many important security problems arise before exploitation. They require reasoning about the general security model of a system: where checks sit, what trust boundaries exist, whether authorization is enforced in the right place, and how architecture creates or prevents failure. If expert researchers can translate what they check in a system into natural language, he said, models can reason about broader security posture rather than only reproducing memory exploits.

The AI shift has lowered the barrier for both attackers and defenders, but it has not removed the need for skill. Tsarimi said motivated attackers still need understanding; it is not yet possible to ask an AI to “hack this bank” and expect success. But for a skilled attacker, AI can accelerate output by 10x or 20x.

Hoffman identified a more immediate operational problem: bug bounty programs are being flooded with AI-generated reports. He cited Linus Torvalds’s recent comments around Linux release candidates as an example of AI reports becoming unmanageable. The focus is therefore shifting from “is there a finding?” to “is this exploitable?” If it is not exploitable, it may be a finding but not a vulnerability.

Nathan Labenz pushed back from the perspective of everyday coding-agent use. If agents can now fix logical bugs and other issues so effectively, why not throw coding agents at all the AI-generated reports and close them? Hoffman answered that proving exploitability requires architecture, cloud runtime, deployment context, and customer-specific understanding. Tsarimi added the enterprise workflow: security teams send findings to engineering teams; engineering teams have release dates and priorities; the two sides negotiate whether, when, and how to patch. Some issues are deep design flaws rather than small code edits.

The scarce resource is therefore not only code generation. It is prioritization. AI can generate more findings than organizations can process. The bottleneck becomes deciding what matters now.
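The shift from "is there a finding?" to "is this exploitable?" can be pictured as a sort order. The sketch below is illustrative only: the `Finding` fields and the 1-to-5 `impact` score are assumptions made for the example, not anything Enclave described.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    title: str
    exploit_proven: bool  # has a working exploit path been demonstrated?
    impact: int           # hypothetical 1-5 business-impact score


def triage(findings: list[Finding]) -> list[Finding]:
    """Order proven-exploitable findings first, then by descending impact."""
    return sorted(findings, key=lambda f: (not f.exploit_proven, -f.impact))


reports = [
    Finding("Unused variable shadows a global", exploit_proven=False, impact=1),
    Finding("Auth check missing on admin route", exploit_proven=True, impact=5),
    Finding("Possible overflow in parser", exploit_proven=False, impact=4),
    Finding("Token leak in debug log", exploit_proven=True, impact=3),
]

for finding in triage(reports):
    print(finding.exploit_proven, finding.impact, finding.title)
```

The hard part in practice is filling in `exploit_proven`, which is where Hoffman and Tsarimi said architecture, runtime, and customer context come in.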

Agent security is old security discipline under higher stakes

Prakash Narayanan connected the vulnerability problem to enterprise agent authorization. To make agents useful, companies give them access to Gmail, Slack, internal tools, databases, and execution rights. Those authorizations may be necessary for productivity, but they also create the vulnerability surface. In many cases, he suggested, the bug is not in the software itself but in organizational preparation.

Yanir Tsarimi said the world is “not really there yet” on agent security. There are some principles, but securing AI agents against prompt injection and related failures remains difficult. Still, he named practical controls: limit what information agents can access, restrict what they can do over the network, and give them limited tools. Tal Hoffman was blunt that putting an open-ended Claude-style agent into an organization could be a disaster because security has not caught up to adoption.

Narayanan then used Palantir’s language of an “ontology” as a way to talk about authorization: who has the right to see which data, and who has the right to execute which action. He argued that Palantir’s long work mapping chains of command and access rights helped the Defense Department deploy models quickly because the models’ permissions were already understood.

Hoffman agreed with the substance. Organizations need to map their ontology, understand structure, and understand downstream business impact if something is exploited. His baseline recommendations were familiar: zero trust, segmentation, blast-radius limitation, and authorization based on who needs access to what. Before adopting more elaborate agent-security concepts, companies need to know that if something goes wrong, the damage is contained.

I think this at the very least you need to be able to say that if this were to get exploited, the blast radius wouldn't be as impactful as it normally would.

Tal Hoffman

The same logic applied to privacy. Narayanan raised the prospect of infrastructure agents reading dating histories, emails, internal CFO conversations, and other private or semi-private material. He asked what privacy social contract should govern AI agents that manage systems while reading everything.

Hoffman treated agents like new human employees: ask what could go wrong if the employee goes rogue or makes a mistake, then limit access accordingly. Do not expose private information to the wrong agent. Apply zero trust. Tsarimi said agent design matters: some agents are built to do everything at once, but enterprises can design agents with narrower capabilities and smaller blast radius. Both emphasized that write access is more dangerous than read access.
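The employee analogy maps onto a deny-by-default permission check. The following is a minimal sketch under stated assumptions: `AgentPolicy`, `authorize`, and the resource names are hypothetical and do not come from Enclave or any specific product.

```python
from dataclasses import dataclass, field
from enum import Enum


class Access(Enum):
    READ = "read"
    WRITE = "write"


@dataclass(frozen=True)
class AgentPolicy:
    """Hypothetical per-agent allowlist: resource name -> permitted access modes."""
    name: str
    grants: dict = field(default_factory=dict)


def authorize(policy: AgentPolicy, resource: str, access: Access) -> bool:
    """Deny by default: allow only what the policy explicitly grants."""
    return access in policy.grants.get(resource, frozenset())


# Illustrative agent: may read and write tickets, may only read Slack,
# and has no grant at all for email.
triage_agent = AgentPolicy(
    name="triage-agent",
    grants={
        "tickets": frozenset({Access.READ, Access.WRITE}),
        "slack": frozenset({Access.READ}),
    },
)

print(authorize(triage_agent, "tickets", Access.WRITE))
print(authorize(triage_agent, "slack", Access.WRITE))
print(authorize(triage_agent, "email", Access.READ))
```

Keeping read and write as separate grants reflects the point that write access carries the larger blast radius. Anything not listed is refused, so a new tool or data source stays off limits until someone decides otherwise.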

Labenz asked how hard it is for model providers such as Anthropic to distinguish defensive cyber tasks from offensive cyber tasks, especially when a malicious actor can dress up offensive work as defense. Hoffman said it is “very hard.” Asking an LLM to write an exploit is associated with malicious activity, but defenders also need to do it. The viable controls he saw were vetting before access, real-time vetting, and after-the-fact auditing. The boundary between defense and offense is not clean.

The same permissioning problem later reappeared in a consumer setting. Labenz described wanting to put AI-generated songs on Spotify, but the operational obstacle was how an agent should log in: with his credentials, its own Gmail, a new Spotify account, or some delegated authorization flow. Narayanan said computer use is close to making this possible, but firms may not want the liability of handling authentication on users’ behalf. The same issue appears at different stakes: agents become useful when they can act, and risky for exactly the same reason.

Harnesses can beat model size, but accountability may still favor the biggest model

The Enclave discussion repeatedly returned to whether “Mythos”-class models will dominate enterprise security work. Nathan Labenz framed the commercial tension: if Sonnet plus a good harness costs perhaps 2% as much as Mythos and can reproduce important results, how many customers choose the cheaper setup? And how many decide they cannot afford to be in the newspaper after a breach and therefore pay for the most powerful model available?

Tal Hoffman expected the novelty of frontier models to wear off in enterprise settings. Costs are already a major concern, and he pointed to Factory AI’s router release and Cognition’s model-routing approach as examples, in his telling, of companies wanting the right model for the right task rather than Mythos for everything. Enclave’s own posture is multi-model: give customers flexibility, use stronger models when needed, but pursue cheaper workflows when a harness can get the same result.

Labenz pressed on how customers can know they are getting the same result outside a controlled benchmark. If a financial institution wants confidence that it did everything reasonable, “we got the same result with a cheaper model” may be difficult to prove.

Hoffman answered that Mythos is not magic and will miss things. Yanir Tsarimi argued that harnesses and expert knowledge matter more than raw model choice. On Cyber Gym, which he described as the most famous cyber evaluation, Tsarimi said the top score was from a Microsoft multi-model setup using Opus, Sonnet, and GPT-4o, beating Mythos. His conclusion was that cheaper or supposedly weaker models can outperform more expensive models when the surrounding knowledge and harness are optimized.

What we see is that cheaper models can outperform more expensive or smarter models if you just optimize, you know, the knowledge or the um harness around it.

Yanir Tsarimi

For Tsarimi, the reason is that software research is not fully documented. It lives in the minds of people who have done it for years. AI agents still need someone with taste to look at results and decide whether quality is up to standard. Hoffman added the accountability point: “You cannot fire an AI.” Someone remains responsible if something goes wrong.

That accountability point became more important after the guests left. Labenz returned to what he called the “nobody got fired for going with Mythos” dynamic. For a CISO, saving money with Sonnet plus harness might look prudent until a breach happens. Then the question becomes why the company did not use the best available model. If the company paid for Mythos and something still went wrong, the executive can plausibly say they did everything they could.

Prakash Narayanan split the market. For tech companies selling products to others, he agreed that the pull toward Mythos is strong: they need their technology to be solid because customers will hold them accountable if it fails. For non-tech Fortune 500 companies, he was less convinced. CISOs in those organizations are cost centers with long backlogs and limited budgets. Mythos may matter less as a tool they buy than as a fear event that unlocks budget.

In that scenario, the budget does not necessarily go to Mythos. It may go to migrating off COBOL, replacing old mainframes, building new systems, or doing other security work that becomes easier to fund once executives believe a new AI-driven threat is coming. Narayanan’s view was that Mythos creates an IT budget opportunity. That benefits many technology vendors, not only the frontier model provider.

Real-time policy enforcement starts with decomposing the rule

Brett Levenson described Moonbounce as a real-time policy engine for AI products and user-generated content: software that sits in the path of a chat message, image, video, or model output and decides whether to allow, block, slow, escalate, or steer before harm ships. Narayanan introduced Levenson’s background as the former head of business integrity at Facebook, where human reviewers sometimes had 30 seconds to make calls on machine-translated policy and performed only slightly better than chance.

Levenson traced Moonbounce’s origin to a policy problem he saw at Meta. People assume rules such as hate speech are obvious. In practice, organizations need to decide where lines are, and natural-language policies contain ambiguity that requires interpretation. The problem is not only model accuracy; it is that policies themselves may be underspecified.

Moonbounce’s goal is empirical consistency. For semantically equivalent content — user-generated content, AI-generated output, or prompts sent into an AI — the system should produce the same outcome. To get there, Moonbounce takes a policy, measures ambiguity, and decomposes it into atomic parts. Levenson described the process as almost Socratic: ask why the rule should be the rule, break it down, and continue until the questions about content are simple enough that ideally 100 out of 100 people would answer them the same way.

We take a policy, we measure its ambiguity, and we almost like kind of do the Socratic method on it.

Brett Levenson

The key distinction is between evidence and conclusion. A person may disagree with the final policy decision, but should agree with the evidence collected. In hate speech, for example, Levenson said the wrong question is simply, “Is this hate speech?” A better decomposition begins with whether the content refers to a protected group — religion, sexual orientation, race, disability, or another protected characteristic — and then asks whether the speech is degrading or intended to degrade. If “degrading” is still too ambiguous for a customer, that term can be decomposed further.
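The split between evidence and conclusion can be sketched in a few lines. This is a hypothetical illustration of the idea, not Moonbounce's implementation: the `Evidence` fields and the `conclude` rule are assumptions chosen to mirror the hate-speech example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """Answers to atomic questions that reviewers should agree on."""
    refers_to_protected_group: bool
    is_degrading: bool


def conclude(evidence: Evidence) -> str:
    """Policy conclusion derived only from the collected evidence.

    A customer who disagrees with the outcome changes this rule,
    not the evidence that feeds it.
    """
    if evidence.refers_to_protected_group and evidence.is_degrading:
        return "block"
    return "allow"


# Two illustrative cases: same first answer, different second answer.
print(conclude(Evidence(refers_to_protected_group=True, is_degrading=True)))
print(conclude(Evidence(refers_to_protected_group=True, is_degrading=False)))
```

If "degrading" proves too ambiguous, the `is_degrading` field would itself be replaced by several narrower questions while `conclude` keeps the same shape.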

There is no perfect solution because many content rules are normative preferences, not natural facts. Even when some content is illegal or widely viewed as unacceptable, the rules are still sociological constructs. Levenson’s practical answer is fast iteration: make it easy to change policies, test them, and chase the long tail. Moonbounce customers may make 75 or 100 policy commits over months before reaching something “good enough” for payment processors, AI return-policy accuracy, or other business requirements. Failure, in his view, is not that policy must evolve; failure is when updating and deploying a policy takes months.

AI sycophancy was Levenson’s example of a hard current policy category. The issue has become especially salient around self-harm, AI psychosis, and delusion. It is easy to say models should not be sycophantic, but harder to define: is it agreement with untrue statements? Is mild disagreement that wins rapport also sycophancy? Narayanan cited an OpenAI researcher’s post from the day before saying that “really tasteful and advanced sycophancy involves mildly disagreeing with you to win rapport.”

Levenson said this kind of behavior often cannot be detected from a single last message or paragraph. It requires a pattern over time and enough recent conversation context to see whether the model is buttering up the user. It also requires labeled examples. As with traditional machine learning, teams need cases they can test against, and the examples may conflict with one another. A coherent policy that catches the target behavior without too many false positives can be difficult to produce.

Latency decides whether safety becomes part of the product

Nathan Labenz asked how Moonbounce can work inside a product path if it must be fast. Does it let messages through and retract them later, like early Bing-style behavior, or use classifier-like approaches with acceptable latency?

Brett Levenson answered with prevention over cure. It is better to be present before something happens, or to optimistically allow content and retract quickly, than to find harmful content three to seven days later and ban a user. In AI, the after-the-fact remedy is even less satisfying: perhaps add the failure to a future fine-tuning set, but the harm has already occurred.

Moonbounce uses several techniques. It uses small, fast models. Its policy decomposition gives it latency advantages because the atomized questions are small and often share prefixes, enabling prefix caching. The system usually does not need a large decode step. Rather than generating an explanation or a literal yes/no answer, Moonbounce trains a binary classification head onto an LLM and asks for the probability that the answer to a policy question is yes.
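To show why a classification head avoids a decode step, here is a minimal sketch assuming the simplest possible design: a single linear layer plus a sigmoid over one pooled hidden vector. Levenson did not describe the head's architecture, so the function name, the pooling, and the toy numbers are all assumptions.

```python
import math


def yes_probability(hidden: list[float], weights: list[float], bias: float) -> float:
    """Linear head plus sigmoid: P(answer is yes) from one pooled hidden vector."""
    logit = sum(h * w for h, w in zip(hidden, weights)) + bias
    # Numerically stable sigmoid: avoid exponentiating large positive values.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


# Toy vectors, not real model activations.
p = yes_probability([0.5, -1.0, 2.0], [1.0, 0.5, 0.25], bias=-0.5)
print(p)
```

One forward pass yields a single number, so there are no tokens to generate, and a caller can choose its own threshold for each policy question.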

That probability matters because moderation is often a needle-in-a-haystack problem. For many policies, Levenson said, more than 90% of content is fine. The risky sliver is small but high severity. Moonbounce therefore places lighter-weight models in front of the main QA engine to filter obvious safe material with high recall. In a simplified example where 90% of a customer’s content is fine, Moonbounce wants to approve perhaps half of that immediately.

For those cases we're going to be sub 200 milliseconds basically. Um, and then because our models are pretty damn fast, like for the rest of the cases we're sort of in the three to 500 millisecond range when we actually have to do a deeper, um, a deeper scan.

Brett Levenson
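The figures above allow a rough back-of-envelope estimate. The sketch assumes the simplified split Levenson gave (90% of content fine, about half of that approved by the fast path), treats 200 ms as the fast-path ceiling, and treats 300 to 500 ms as the total time for everything else. The function name is hypothetical.

```python
def expected_latency_ms(fast_share: float, fast_ms: float, deep_ms: float) -> float:
    """Traffic-weighted average latency for a two-stage cascade."""
    return fast_share * fast_ms + (1.0 - fast_share) * deep_ms


safe_share = 0.90              # share of content that is fine
fast_share = safe_share * 0.5  # half of the safe content approved immediately

best_case = expected_latency_ms(fast_share, fast_ms=200, deep_ms=300)
worst_case = expected_latency_ms(fast_share, fast_ms=200, deep_ms=500)

print(f"fast-path share of all traffic: {fast_share:.0%}")
print(f"average latency range: {best_case:.0f} to {worst_case:.0f} ms")
```

Under those assumptions about 45% of traffic takes the fast path, and the average verdict lands at roughly 255 to 365 ms for text.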

Modality changes the numbers. Text is fastest. Images require vision encoding, resizing, and sometimes format transformation. Video adds more steps: pulling the file, handling size, extracting audio, transcription, and related processing. But tolerance depends on the product. For AI image generation, where the user may already wait six to ten seconds, adding 1.5 seconds for a safety verdict may not be noticeable. The question is whether the user experience is meaningfully affected.

Levenson’s future focus is what he called “Active Guardrails”: applying controls to streaming tokens, like television on a five-second delay, so the system can “bleep out” bad content without imposing an unacceptable experience cost. His adoption thesis was that customers are less likely to adopt needed controls if the controls visibly degrade the user experience.
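The broadcast-delay analogy can be sketched as a generator that holds back a few tokens before releasing them. This is a toy illustration, not a description of Active Guardrails: a fixed list of banned token sequences stands in for whatever model-based check a real system would run, and the function name and `[bleep]` mask are assumptions.

```python
from collections import deque
from typing import Iterable, Iterator, Sequence


def delayed_stream(
    tokens: Iterable[str],
    banned_phrases: Sequence[tuple[str, ...]],
    mask: str = "[bleep]",
) -> Iterator[str]:
    """Release tokens only after enough lookahead exists to spot a banned phrase."""
    window = max((len(p) for p in banned_phrases), default=1)
    buffer: deque[str] = deque()

    def release() -> Iterator[str]:
        head = tuple(buffer)
        for phrase in banned_phrases:
            if head[: len(phrase)] == tuple(phrase):
                for _ in phrase:      # drop the whole phrase
                    buffer.popleft()
                yield mask
                return
        yield buffer.popleft()

    for token in tokens:
        buffer.append(token)
        if len(buffer) >= window:     # the "delay": hold until the window is full
            yield from release()
    while buffer:                     # flush what remains when the stream ends
        yield from release()


stream = ["you", "are", "a", "bad", "word", "friend"]
print(list(delayed_stream(stream, [("bad", "word")])))
```

The delay equals the longest phrase the checker must see in full, which is the trade-off Levenson described: a longer window catches more, but the user waits longer for the first token.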

Labenz then raised a broader issue: AI is improving fastest where outputs are easy to verify, such as math and programming. A general strategy is to reduce fuzzy, hard-to-verify tasks into smaller verifiable tasks that aggregate into the hard thing. But he worried there is sleight of hand in assuming the smaller checks necessarily capture the higher-level property. Intelligence, as he has defined it, is the ability to succeed despite not having fully explicit instructions. If everything could be captured in explicit subchecks, perhaps the problem would not have been hard in the first place.

Levenson accepted the concern. He wondered whether decomposition is at least falsifiable in some cases: if the smaller preferences are verified, can one show that the larger preference necessarily follows? The reverse may not hold; a system may fail one subcheck while still satisfying the higher-level judgment because context or some “je ne sais quoi” was lost. In Moonbounce’s actual customer work, he said, they had not yet found a case they could not solve through decomposition, though some required much deeper decomposition than hoped. He was careful not to infer from that experience that no unsolved case exists.

Auditability turns law and payment pressure into product requirements

Prakash Narayanan connected Moonbounce’s architecture to state AI laws and concerns about child safety, chatbot harm, and fragmented legal regimes. If every state or country develops its own rules for what chatbots may say, can technical enforcement systems make legislation easier to write and implement?

Brett Levenson said this was already happening, sometimes before formal legislation. In the absence of official state or federal rules, payment providers became quasi-legislators: they would process payments only if companies followed certain safety rules. Many Moonbounce customers arrived under immediate threat of being dropped by a payment provider and needed not just to comply, but to prove they complied.

Auditability is therefore central. A company needs to show a regulator, payment provider, or other authority: here is the policy, here are the metrics, here is how the system handles the cases you care about, and here are the edge cases or failure modes not yet solved. Fragmented legal landscapes make orchestration more important, because companies operating across states or countries need to apply different policy sets depending on locale.
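One way to picture locale-dependent orchestration with an audit trail is the sketch below. Everything in it is an assumption made for illustration: the region labels, the policy names, the toy keyword checkers, and the `AuditRecord` shape are hypothetical and do not correspond to any real jurisdiction, law, or vendor API.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Callable, Dict, List

# Hypothetical mapping from locale to the policy set that applies there.
POLICY_SETS: Dict[str, List[str]] = {
    "region-a": ["child_safety_strict", "payment_provider_rules"],
    "region-b": ["payment_provider_rules"],
}
DEFAULT_POLICIES: List[str] = ["payment_provider_rules"]

# Toy checkers: each returns True when the text complies with that policy.
CHECKERS: Dict[str, Callable[[str], bool]] = {
    "child_safety_strict": lambda text: "term-a" not in text,
    "payment_provider_rules": lambda text: "term-b" not in text,
}


@dataclass
class AuditRecord:
    timestamp: str
    locale: str
    policies_applied: List[str]
    verdicts: Dict[str, bool]
    allowed: bool


def enforce(text: str, locale: str,
            checkers: Dict[str, Callable[[str], bool]]) -> AuditRecord:
    """Apply the locale's policy set and record what was checked and why."""
    policies = POLICY_SETS.get(locale, DEFAULT_POLICIES)
    verdicts = {name: checkers[name](text) for name in policies}
    return AuditRecord(
        timestamp=datetime.now(timezone.utc).isoformat(),
        locale=locale,
        policies_applied=list(policies),
        verdicts=verdicts,
        allowed=all(verdicts.values()),
    )


if __name__ == "__main__":
    record = enforce("message containing term-a", "region-a", CHECKERS)
    print(json.dumps(asdict(record), indent=2))  # serializable proof of handling
```

The same message is blocked in `region-a` and allowed in `region-b`, and each decision leaves a record naming the policies applied and the per-policy verdicts. That record is the kind of artifact a company could hand to a regulator or payment provider.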

Levenson argued that “this problem is hard” is no longer a sufficient excuse. He did not blame the industry for starting with capability: it would not have made sense to build governance systems for products that did not yet exist. But now that capability has matured, controllability and predictability are part of product quality. A company that only chases jurisdictional checkboxes will always face the next checkbox. A company that builds for controllability from the outset is building a better product.

After Levenson left, Nathan Labenz returned to the unresolved abstraction problem. The desire to break systems into verifiable parts appears everywhere, but the question is whether those low-level verifications produce the high-level system behavior people care about. He compared it to biology: proteins to cells, cells to tissue, tissue to body. He also compared it to formal methods in software, where verification is only as good as the specification. If the spec is wrong, the proof can be correct and still miss the target.

How do we get the higher-level system properties from the low-level constituent parts?

Nathan Labenz

Labenz thought this aggregation problem is dramatically under-theorized, including for markets populated by AI agents. Even if individual agents are well-behaved, nobody knows what a billion of them in the economy will produce.

Narayanan described the practical product answer as “the schlep,” borrowing Leopold Aschenbrenner’s term. In every vertical, teams push the model as far as it can go, then surround it with harnesses, heuristics, rulemaking, exception handling, client conversations, and edge-case work. The company’s real asset, in his view, may be the customer relationship and trust layer more than the specific artifact it builds, because that relationship lets it deploy the next generation as the model improves.

The content-safety problem also revived the old speech tension. Narayanan said trust and safety sits uneasily beside freedom of speech, and pointed to Elon Musk’s Twitter experience as an obvious example of the clash. For parents and child-safety issues, he said, many people want platforms to manage certain behaviors even where speech principles are complicated. He distinguished freedom of speech from freedom of distribution.

In his view, rules can cascade from legislation to soft regulator pressure to payment-provider policy to platform enforcement. Once technical enforcement exists, it can reach even private conversations. That may be justified for severe harms, he said, but becomes much more fraught if it extends to political concepts and speech categories. The societal settlement remains unresolved.

Moderation endpoints are improving, but coarse categories are not policy systems

The closing moderation thread made the abstraction problem concrete. Nathan Labenz had previously complained about gaps in OpenAI’s moderation endpoint. After the prior day’s discussion, he prompted Claude Code to examine his own archived history, find old reports he had sent to OpenAI around GPT-4 red teaming and moderation failures, generate sample prompts across low-, medium-, and high-severity categories, run a small experiment, and return a report.

He said Claude did almost all of that from a short prompt, except for one point where he had to refresh a token. The result, in his telling, was that the specific gap he had complained about had been closed. A prompt implying participation in a criminal gang and concern about going to jail now gets flagged. The endpoint also flagged everything Claude judged should be flagged, while producing only a couple of false positives among prompts Claude believed should not be flagged.
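The scoring step of an experiment like this can be sketched in a few lines. The code below is not what Claude Code produced and contains none of Labenz's prompts or results; the sample labels and the stub `flag` function are placeholders, and in a real run `flag` would wrap a call to the moderation endpoint.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple


def score(
    samples: Iterable[Tuple[str, bool]],  # (prompt, should_be_flagged)
    flag: Callable[[str], bool],          # stand-in for a moderation call
) -> Dict[str, Optional[float]]:
    """Compare a flagger against expected labels and summarize the errors."""
    tp = fp = fn = tn = 0
    for prompt, should_flag in samples:
        flagged = flag(prompt)
        if flagged and should_flag:
            tp += 1
        elif flagged and not should_flag:
            fp += 1
        elif not flagged and should_flag:
            fn += 1
        else:
            tn += 1
    return {
        "true_positives": tp,
        "false_positives": fp,
        "false_negatives": fn,
        "true_negatives": tn,
        # Recall: share of prompts that should be flagged and were.
        "recall": tp / (tp + fn) if (tp + fn) else None,
        # False-positive rate: share of benign prompts flagged anyway.
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else None,
    }
```

A result of the shape Labenz described, everything that should be flagged is flagged plus a few false positives, would show up here as a recall of 1.0 with a nonzero false-positive rate. The expected labels in his run came from Claude's own judgment, so the numbers measure agreement with that judgment and not ground truth.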

Narayanan found the API-key requirement interesting. He interpreted it as evidence that OpenAI may be trying to prevent people from mining the moderation endpoint to discover edge cases and then exploit them. Labenz was careful to note that he was not sure the requirement was new; it may have always been that way.

Labenz contrasted the endpoint’s coarse categories with Moonbounce-style policy control. He listed categories such as sexual content involving minors, harassment, threatening harassment, hate, threatening hate, illicit activity, violent illicit activity, self-harm, self-harm intent, self-harm instructions, violence, and graphic violence. Useful as that may be, it is not the same as a granular policy engine for sycophancy, return-policy hallucinations, locale-specific child-safety law, or payment-provider compliance.

He also compared Moonbounce to the AWS automated reasoning product he had referenced earlier. In his understanding, AWS’s approach can take a natural-language policy such as a mortgage underwriting policy, decompose it into rules, map runtime inputs onto variables, and then apply a deterministic decision structure. The lossy parts are the policy decomposition and input mapping; the final execution is deterministic. Moonbounce, by contrast, as Levenson described it, uses learned classifiers and probabilities in its runtime enforcement. The goals overlap — more consistent policy application — but the mechanism and level of granularity differ.
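The mechanical difference can be shown in miniature. The sketch below uses neither AWS's nor Moonbounce's actual APIs; the rule names, the underwriting thresholds, and the 0.8 cutoff are all invented for illustration.

```python
from typing import Callable, Dict, List, Tuple

Rule = Tuple[str, Callable[[Dict[str, float]], bool]]

# Hypothetical rules extracted from a natural-language underwriting policy.
# Extracting the rules and mapping inputs onto variables are the lossy steps;
# once both are done, evaluation is plain deterministic logic.
RULES: List[Rule] = [
    ("min_credit_score", lambda v: v["credit_score"] >= 620),
    ("max_debt_to_income", lambda v: v["debt_to_income"] <= 0.43),
]


def deterministic_decision(variables: Dict[str, float]) -> Tuple[bool, Dict[str, bool]]:
    """Same inputs always give the same verdict, with a per-rule trace."""
    trace = {name: rule(variables) for name, rule in RULES}
    return all(trace.values()), trace


def probabilistic_decision(violation_probability: float,
                           threshold: float = 0.8) -> bool:
    """Flag when a learned classifier's score crosses a tunable threshold."""
    return violation_probability >= threshold
```

In the first function the uncertainty sits upstream, in whether the rules and variables faithfully capture the policy. In the second it sits in the runtime score itself, so the threshold becomes a product decision that trades false positives against misses.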

The difference matters because the hardest content-safety problems are not always violations of broad categories. They are often failures of product behavior under a particular policy: a chatbot validating a delusion, a model misstating a return policy, a localized rule applying in one jurisdiction but not another, or a payment provider demanding proof that a platform can prevent a specific class of prohibited content. Those are not just moderation labels. They are operational policy systems.
