Topic

AI Application Architecture

Patterns for building production AI applications, including orchestration, memory, tool use, state management, APIs, and system design.

RecursiveMAS Lets AI Agents Collaborate Without Translating Through English

Károly Zsolnai-Fehér presents RecursiveMAS, a paper by Xiyuan Yang, Jiaru Zou and coauthors, as an attempt to fix a coordination cost in multi-agent AI systems: agents repeatedly translating internal work into English for one another. The paper’s claim is that agents can instead pass latent numerical representations directly, improving collaboration while cutting token use. Zsolnai-Fehér says the reported gains are substantial on small models, including better math results and far fewer tokens, but frames the work as early research rather than a deployable agent product.

Károly Zsolnai-FehérTwo Minute PapersJun 19, 20266 min read

Agents Often Claim Web Access After Being Blocked or Challenged

Rafael Levi of Bright Data argues that many web-dependent agents fail not because they cannot produce answers, but because they report success after web access has broken. In a demo using Bright Data’s Web MCP, Levi shows the same agent failing against sites such as LinkedIn, Instagram, Amazon and TikTok without live access, then producing usable results when given infrastructure for search, scraping, JavaScript rendering and CAPTCHA handling. His broader case is that reliable agents need a real public-web access layer, not prompts that assume the model saw the page.

Rafael LeviAI EngineerJun 17, 20269 min read

Hermes Uses a Minimal Agent Loop to Preserve State Across Channels

Alejandro AO’s walkthrough of Hermes presents the agent as a deliberately small always-on system rather than a complex orchestration stack. He argues that Hermes’ usefulness comes from a simple loop that builds context from Markdown files, message history, tools, skills and memory, then preserves state through compression, SQLite transcripts, optional external memory providers, gateway integrations and scheduled cron jobs. The architecture’s central concern is continuity: keeping enough context across channels and time for the agent to behave like a persistent assistant.

Alejandro AOHugging FaceJun 17, 202611 min read

Enterprise AI Is Blocked by Context, Not Model Intelligence

Databricks chief executive Ali Ghodsi argues that enterprise AI is constrained less by model intelligence than by access to company context: data, documents, processes and relationships that agents need to operate inside businesses. In a Bloomberg Tech interview with Ed Ludlow, Ghodsi said Databricks is building products such as Genie Ontology and Lakehouse to make that context usable, while adoption in critical workflows remains slowed by security, legal and approval processes. He also declined to confirm reports of a new funding round and said Databricks is not rushing toward an IPO.

Ed Ludlow · Ali GhodsiBloomberg TechnologyJun 16, 20266 min read

GRU Space Plans Lunar-Regolith Bricks as the First Step Toward a Moon Hotel

On This Week in Startups, GRU Space founder Skyler Chan argues that a Moon hotel is the first commercial wedge for a larger off-Earth manufacturing business: using lunar regolith to make construction materials rather than shipping them from Earth. Chan lays out a plan to prove the technology by making a brick on the Moon, then scale toward robotic habitats, NASA construction work, space tourism and eventual claims on lunar resources. The same episode turns to Anthropic’s forced shutdown of Fable 5 and Mythos 5, which Jason Calacanis and Lon Harris frame as a warning that frontier capabilities can be cut off before law, politics and operating norms have settled.

Jason Calacanis · Lon Harris · Skyler ChanThis Week in StartupsJun 16, 202621 min read

AI Market Power Is Moving Beyond the Frontier Model

Alex Kantrowitz and Ranjan Roy argue that the AI market is shifting away from standalone model capability and toward control of infrastructure, access and workflow layers. Their discussion frames SpaceX’s IPO as a public-market AI-cloud story that complicates OpenAI’s ambitions, Anthropic’s Fable rollout as a case where safety policy also looks like market power, and OpenAI’s possible price cuts as a test of whether frontier models can remain premium products. Apple’s Siri, in their telling, matters for the same reason: usefulness may come less from the best model than from where the model sits.

Alex Kantrowitz · Ranjan RoyAlex KantrowitzJun 15, 202619 min read

Human Attention Is Becoming the Bottleneck in AI Coding Workflows

Zack Proser, an Applied AI engineer at WorkOS, argues that AI coding has shifted the bottleneck from tool speed to human attention. His proposed workflow uses voice dispatch, isolated git worktrees, Slack and Linear-reading agents, remote phone control, and layered verification so developers can keep agent loops moving without staying pinned to a desk or rubber-stamping work they can no longer track.

Zack ProserAI EngineerJun 11, 202614 min read

Models Will Absorb Today’s Agent Harnesses Within a Year

Logan Kilpatrick, who leads Google AI Studio and the Gemini API, argues that the current rush to build agent harnesses may have a short shelf life. In an interview with Sequoia Capital’s Sonya Huang, he says models are absorbing the scaffolding around agents and could make much of today’s custom harness layer less distinctive within about 12 months. Google’s own strategy runs on both sides of that claim: Antigravity has become a shared agent layer across products, while Kilpatrick says the durable advantage for builders will move to focus, domain knowledge, risk tolerance and useful outcomes for users.

Logan Kilpatrick · Sonya HuangSequoia CapitalJun 11, 202619 min read

A 4B Model Beat Qwen3 235B by Learning Tool Discipline

Kobie Crawford of Snorkel argues that some enterprise AI failures are less about model size than about whether models behave correctly inside constrained tool environments. In Snorkel’s FinQA work with UC Berkeley’s rLLM/Agentica, a 235B Qwen model hallucinated a financial answer after failed SQL calls, while a 4B model fine-tuned with reinforcement learning learned to inspect tables, correct errors and calculate from retrieved data. Crawford presents the result as evidence that targeted RL, structured evals and behavior-specific training can outperform simply moving to a larger model for this class of financial analysis task.

Kobie CrawfordAI EngineerJun 10, 20269 min read

A Python Decorator Replaces the GPU Deployment Container Loop

RunPod’s Audrey Hsu argues that GPU inference development should not require a commit, container build, registry push and server provisioning cycle for every model change. In a demo of Flash, RunPod’s Python SDK, she shows how adding a `@flash.endpoint` decorator to an async function can package that function as a GPU-backed cloud endpoint while the rest of the application stays in the developer’s IDE. Her broader case is that teams should experiment on Pods or low worker counts, then move to Serverless when they need autoscaling inference across many GPU workers.

Audry HsuAI EngineerJun 9, 202610 min read

RAG Is Becoming Agentic Retrieval, Not Disappearing

Kuba Rogut, a deployed engineer at Turbopuffer, argues that claims about RAG’s death rely on defining it as a narrow, one-shot vector search pattern. In his account, retrieval-augmented generation is becoming a broader agentic retrieval system: vector search, full-text search, grep, regex, glob and filters used iteratively by models that keep looking until they have the right context. He points to Cursor’s semantic-search gains and contrasts its upfront indexing with Claude Code’s per-session grep approach to frame embeddings as cached compute whose value depends on reuse.

Kuba RogutAI EngineerJun 9, 20266 min read

Brilliant’s Koji Uses AI to Make Students Solve Problems Themselves

Brilliant founder Sue Khim tells This Week in Startups that the company’s new AI tutor, Koji, is built to counter the education use case parents fear most: software that gives students answers while eroding their ability to think. Khim argues the opportunity is not generic AI in the classroom, but a constrained tutor embedded in Brilliant’s lessons that uses Socratic prompting, visual scaffolding, and assessment to help students solve problems themselves. Jason Calacanis frames the same idea more broadly, saying AI is useful when it strengthens the person doing the work rather than replacing the work.

Jason Calacanis · Sue KhimThis Week in StartupsJun 8, 202617 min read

NVIDIA Says Agentic AI Is Forcing a Redesign of Enterprise Computing

At GTC Taipei during COMPUTEX, NVIDIA founder and chief executive Jensen Huang argued that agentic AI and frontier models have already changed the computer industry. The company’s case was that enterprises now need full agent-building infrastructure, AI-capable PCs such as RTX Spark represent a break from the old laptop model, and production hardware including Vera Rubin will underpin the next phase of AI computing. NVIDIA framed that shift through Taiwan’s manufacturing ecosystem, presenting Taipei as both industrial partner and symbolic home.

Jensen Huang · Wayne ChiangNVIDIAJun 8, 20264 min read

Developers Want Siri APIs That Turn Apple Intelligence Into Infrastructure

Paul Hudson, creator of Hacking with Swift, argues that Apple’s AI opportunity for developers depends less on a smarter prompt box than on APIs that let Siri serve as an integration layer across apps. Speaking to Bloomberg’s Ed Ludlow, Hudson said developers want to expose app data and functions while Apple Intelligence handles user intent, privacy and cross-device execution—ideally through Apple-controlled infrastructure even if Google’s Gemini is part of the stack.

Ed Ludlow · Paul HudsonBloomberg TechnologyJun 8, 20265 min read

Huge Pre-IPO Rounds Are Making Seed Investing More Important

Kindred Ventures founder Steve Jang argues that enormous pre-IPO rounds have not made seed investing less relevant; they have made company formation more important. In a Bloomberg Technology interview with Caroline Hyde after Kindred raised $355 million for deep-tech and robotics funds, Jang said early investors still do the work that late-stage capital cannot: helping founders turn technical vision into products, teams, customers and revenue before the IPO or acquisition options appear.

Caroline Hyde · Steve JangBloomberg TechnologyJun 8, 20265 min read

Apple’s Siri Overhaul Tests Whether AI Can Become an Operating-System Layer

Bloomberg’s WWDC preview frames Apple’s AI challenge as a test of integration rather than invention. Mark Gurman reports that Apple is expected to use the conference to make Siri more capable across apps, screens, personal data and web search, moving it from a weak voice assistant toward an operating-system layer; Carolina Milanesi and Paul Hudson argue that its value will depend on whether that layer is consistent, private and useful across Apple devices.

Caroline Hyde · Ed Ludlow · Mark Gurman · Ian King · Jared Isaacman · Laura Crabtree · Bailey Lipschultz · Jensen Huang · Ryan Vlastelica · Paul Hudson · Peter Diamandis · Steve Jang · Melissa Azari · Carolina Milanesi · Ava Benny-Morrison · Helene NorlemBloomberg TechnologyJun 8, 202615 min read

Coding Is AI’s First Breakout Market, but Value Capture Remains Unsettled

Tech analyst Benedict Evans argues in an a16z interview with Erik Torenberg that AI now looks less like a solved platform shift than a market with one clear breakout use case: coding. Evans says agentic software development has reached real product-market pull, while larger questions about consumer adoption, enterprise workflows, model differentiation, infrastructure spending and value capture remain unresolved. His central case is that AI resembles the internet in 1997: obviously important, already useful in places, but still too early to know which layer of the stack will own the economics.

Erik Torenberg · Benedict Evansa16zJun 8, 202623 min read

Ulta Uses AI to Personalize HR Support for 65,000 Workers

Ulta Beauty executives Rachel Williamson and Josh Siebert describe the retailer’s ServiceNow-backed HR automation rollout as a response to a concrete operating problem: 65,000 employees could not reliably find the policies and support they needed. In a sponsored interview, they argue that the value of AI was not the chatbot itself, but its ability to personalize answers, route routine HR work away from overloaded teams, and preserve human judgment for sensitive cases. Their account frames AI as an enabler of workflow redesign, not an end in itself.

Alex Kantrowitz · Rachel Williamson · Josh SiebertAlex KantrowitzJun 8, 202610 min read

Code Agents Need Context Engineering, Not Larger Prompts

Nupur Sharma of Qodo argues that larger context windows have not solved a core agent failure: models still tend to use the beginning and end of an input while losing important material in the middle. Her case is that agent quality depends less on giving a model more context than on engineering how context is retrieved, ranked, constrained and checked. She describes Qodo’s approach as a mix of iterative retrieval, specialist agents, judge nodes and bounded orchestration that reserves high-reasoning models for discovery while using stricter, lighter steps for validation.

Nupur SharmaAI EngineerJun 8, 202612 min read

Durable Objects and Dynamic Workers Reopen Eval for AI Agents

Cloudflare engineers Sunil Pai and Matt Carey argue that AI agents need compute primitives beyond stateless functions: Durable Objects for addressable, persistent coordination, and Dynamic Workers for safely running generated code. Pai frames Durable Objects as the execution unit behind Cloudflare’s Agents SDK, giving agents state, resumable streams, scheduling, and multi-client sync without pushing distributed-systems work onto developers. Carey and Pai present Dynamic Workers as the larger shift: a sandboxed “eval++” model where LLM- or user-generated code starts with no ambient authority and receives only explicitly granted capabilities.

Sunil Pai · Matt CareyAI EngineerJun 8, 202611 min read

LSEG Grounds AI Strategy in Trusted Financial Data and Controls

Emily Prince, group head of AI at LSEG, argues in an OpenAI Customer Ignite talk that AI in financial services only becomes useful at scale when it is grounded in trusted data, evaluation frameworks and governance that fit regulated work. She presents LSEG’s strategy as an effort to make its financial data and analytics available inside the tools customers and employees already use, including through APIs and Model Context Protocol, rather than treating AI as a generic answer engine. The case is that speed and experimentation matter, but only if controls, source quality and industry-specific workflows are built into the system.

Emily Prince · Nikolai SkaboOpenAIJun 8, 202610 min read

Role-Specific Agents Move AI From Prompting Into Financial Services Workflows

OpenAI solutions engineer Lee Spacagna argued that enterprise AI in financial services is moving from individual ChatGPT use and isolated product integrations toward role-specific agents embedded in daily work. He presented ChatGPT workspace agents and Frontier as the operational layer for that shift: agents that connect to tools such as email, calendars, Teams, SharePoint, and Salesforce; encode team practices as repeatable skills; and are managed at scale under enterprise controls.

Lee SpacagnaOpenAIJun 8, 20266 min read

Erste Builds AI as a Governed Platform Inside Digital Banking

Maurizio Poletto, Chief Platform Officer and COO of Erste Group, argues that AI in banking has to be built as a governed platform inside the bank’s existing digital architecture, not treated as a chatbot deployment. In a customer talk with OpenAI, he says Erste has allowed local teams to move quickly on employee productivity tools while centralizing customer-facing AI, especially where customer data is involved, because trust, compliance and product quality make that work slower and harder.

Folley Ogundele · Maurizio PolettoOpenAIJun 8, 202610 min read

Telemetry, Not Code, Audits Nondeterministic AI Agents

Dat Ngo of Arize argues that LLM observability has to account for failures in execution paths, not just broken components, because agents can call tools in different orders, branch, loop, and change behavior across runs. In his account, traces become the audit record for nondeterministic systems, while evaluation must combine model judges, human feedback, golden datasets, deterministic checks, and business metrics at the right scope. Arize’s stated direction is to connect observability, evals, experimentation, and improvement into an increasingly automated loop.

Dat NgoAI EngineerJun 7, 202610 min read

ElevenLabs Unveils Dubbing v2 and Previews More Controllable Eleven v4

ElevenLabs co-founder Mati Staniszewski used a Warsaw summit keynote to argue that AI’s next constraint is not intelligence but communication people can trust. He presented two new models — Dubbing v2, designed to preserve an original performance across languages, and a preview of Eleven v4, aimed at finer control over speech, emotion, accent, whispering and song — as evidence of that thesis. The broader case was that voice AI becomes commercially useful only when models are tied to agents, integrations, authentication, memory and deployment systems that let companies put spoken interfaces into production.

Mati StaniszewskiElevenLabsJun 7, 202610 min read

RunPod’s Serverless LLM Endpoint Trades Cold Starts for Lower Idle Cost

Audry Hsu presents RunPod as a cloud AI infrastructure company trying to move GPU provisioning and operations behind a deployable model endpoint. In the walkthrough, she shows a Qwen model deployed from RunPod’s Hub as an OpenAI-compatible vLLM serverless endpoint on H100s in under five minutes, with billing tied to workers while they handle requests. Her case is narrower than eliminating infrastructure tradeoffs: the first request waited 41.6 seconds on cold start, while subsequent execution took about 1.5 seconds, leaving teams to choose between lower idle cost and keeping workers warm for lower latency.

Audry HsuAI EngineerJun 7, 20266 min read

Agents Can Build and Repair Scrapers Instead of Parsing Every Page

Rafael Levi of Bright Data argues that the hard part of web data collection has moved from scraping a page to maintaining the pipeline after sites change. In his session, he presents Bright Data’s MCP, APIs and browser infrastructure as a way for agents to inspect public websites, generate reusable scrapers, run them at scale and repair them when selectors, pagination or access conditions break. The economic case is that LLMs should spend tokens learning site structure and writing code, not repeatedly parsing every page.

Rafael LeviAI EngineerJun 7, 202613 min read

VS Code Can Render MCP Tool Results as Interactive Apps

GitHub’s Marlene Mhangami and Liam Hampton argue that MCP apps turn chat from a text response surface into a place where tool output can be operated directly. In their VS Code demo, an MCP server profiles a Go app, returns data plus a reference to a bundled HTML UI, and VS Code renders the result as a sandboxed interactive flame graph inside Copilot chat. Their case is that the useful boundary is precise: tools provide data, resources provide the interface, and the host contains the app while keeping the user in context.

Marlene Mhangami · Liam HamptonAI EngineerJun 6, 202611 min read

Enterprises Face a 100,000-Agent Governance Problem

Barndoor AI co-founder and CEO Oren Michaels argues that enterprises are approaching a governance problem created by AI agents that can act across Salesforce, Slack, email and other workplace systems. In a conversation with Craig Smith, Michaels says connectivity protocols such as MCP have made it easier for agents to reach enterprise tools, but have not solved the harder question of what a given agent should be allowed to do for a given task. His central claim is that companies will need a separate control layer to manage thousands of task-specific agents, because traditional identity systems assume human judgment that agents do not have.

Craig Smith · Oren MichelsEye on AIJun 6, 202618 min read

Cline’s Terminal-Bench Gains Came From Harness Tuning, Not Model Switching

Ara Khan of Cline argues that AI evals are too noisy to treat as truth but too useful to replace with vibes. Using Cline’s Terminal-Bench work as the case study, he says the company’s jump from 43% to 57% came from harness changes — container CPU and memory, longer timeouts, and model-family-specific prompting — rather than a better model. His prescription is to run evals skeptically, inspect failed traces, allocate failures by cause, and improve only the levers that survive contact with product behavior.

Ara KhanAI EngineerJun 6, 202611 min read

Stripe Says Agent Payments Need Deterministic Controls, Not Browser Automation

Stripe’s Steve Kaliski argues that autonomous agents can use probabilistic reasoning to discover products, services and tools, but payments should move through deterministic infrastructure. In his talk, he presents Stripe’s approach to agent commerce: scoped payment credentials, HTTP-based paid tool calls and structured checkout APIs designed to prevent agents from paying the wrong merchant, buying the wrong item, authorizing the wrong amount or exposing the wrong credential.

Steve KaliskiAI EngineerJun 6, 202610 min read

Emergent Says AI App Builder Reached $100M ARR in Nine Months

At Startup School India, Emergent co-founder and CEO Mukund Jha argues that AI can move software creation beyond programmers, letting non-technical users build, ship and monetize working products rather than demos. In a conversation with YC managing partner Jared Friedman, Jha says the company’s rapid growth came from betting on autonomous software-engineering agents before the models were fully ready, then rebuilding its architecture as those models improved. He also frames Emergent as a test of whether a global, technology-first company can be built from Bangalore.

Jared Friedman · Mukund JhaY CombinatorJun 6, 202612 min read

Frontier Labs Treat Recursive Self-Improvement as a Near-Term Control Problem

AI in the AM’s first weekly highlights edition argues that the important AI signal in early June was not a model launch but a pattern: frontier labs are treating AI-accelerated AI research as near-term, while their main control strategy remains AI systems monitoring other AI systems. Nathan Labenz presents that as a safety concern, and the source contrasts thin recursive-self-improvement plans with OpenAI’s more concrete tax-agent example, where the harness improves from practitioner corrections rather than from changes to model weights. The through-line is that value and risk are moving into the layers around the model: tax harnesses, private data and expert judgment in cyber, real-time moderation guardrails, and safety architecture in mental-health deployments.

Nathan Labenz · John Wasseige · Matthew Sanders · Brett Levenson · Prakash Narayanan · Taras Pohrebniak · Snehal Antani · Hooman Radfar · Peter Jansen · Arthur Fernandes · Tal Hoffman · Yair TsarfatyThe Cognitive RevolutionJun 6, 202624 min read

Tool-Call Repairs Let DeepSeek v4 Beat Opus 4.7 in Internal Evals

Ahmad Awais, founder of CommandCode.ai, argues that many open models appear weak at coding-agent work because the harness around them mishandles tool schemas, design instructions and user preferences. Drawing on Command Code’s internal logs and evals, he says small deterministic repairs to tool inputs helped DeepSeek v4 Pro beat Opus 4.7 in six of ten internal comparisons. His broader case is that “taste” — explicit contracts for tools, design patterns and developer habits — can narrow the gap between cheaper open models and frontier coding systems without changing the model itself.

Shawn Wang · Ahmad AwaisLatent SpaceJun 6, 202614 min read

AI Application Companies Are Moving Beyond Frontier APIs to Protect Margins

Baseten founder and chief executive Tuhin Srivastava used a Stanford MS&E435 seminar with instructor Apoorv Agrawal to argue that inference is becoming the cost of goods sold for AI applications. His case is that scaled AI companies will need to move beyond default frontier-model APIs toward custom or post-trained models, both to improve margins and to protect the workflows and user signals that make their products defensible. Baseten’s role, as Srivastava framed it, is to provide the production inference stack and compute access needed to run that custom intelligence at scale.

Apoorv Agrawal · Tuhin SrivastavaStanford OnlineJun 5, 202618 min read

Inference Constraints Are Reshaping Language Model Architecture

In a Stanford CS336 guest lecture, Dan Fu argued that language-model inference is no longer downstream plumbing but a central research and design constraint. Fu described serving as the machinery that turns a trained model into a usable system, where schedulers, KV caches, GPU kernels, routing policies and hardware choices determine which architectures are practical, economical and reliable at scale.

Dan FuStanford OnlineJun 5, 202622 min read

AI Infrastructure Is Shifting From Accelerator Racks to Distributed Agent Systems

At Dell Technologies World, Nvidia chief Jensen Huang and Dell CEO Michael Dell argued that enterprise AI is moving from experimental promise to operational infrastructure, with agentic systems driving a sharp increase in compute demand. Huang said agents change the workload from single prompt-response transactions to long-running loops of reasoning, planning and tool use, while Dell framed the response as a pragmatic push toward distributed, “unmetered” intelligence across PCs, data centers and cloud-scale systems.

Michael Dell · Jensen HuangNVIDIAJun 5, 20267 min read

LLMs Play Games Better When They Write Simulators First

DeepMind research scientist Wolfgang Lehrach argues that language models should not be asked to play games directly when their outputs are slow, strategically weak, or illegal. In a Stanford HAI seminar, he presents Code World Models, which use LLMs to translate natural-language rules and play traces into executable game simulators that planners such as Monte Carlo Tree Search or reinforcement learning can use. He also describes Autoharness, a narrower system that synthesizes code to check action legality, as part of the same broader case for turning LLM knowledge into executable structure rather than immediate moves.

Wolfgang LehrachStanford HAIJun 5, 202617 min read

OpenAI Adds Workspace App Publishing to Codex

OpenAI’s Corey Ching presents Sites in Codex as a way for teams to turn prompts and trusted internal material into hosted applications that colleagues can use inside a workspace. The product is framed not as a document or slide generator, but as an application layer for internal dashboards, meeting-prep tools, event briefs, and decision memos, with hosting, authentication, storage, database support, sharing, and iterative refinement built into the workflow.

Corey ChingOpenAIJun 5, 20265 min read

Hackathon Caps Models at 32B Parameters to Reward Tinkerable AI Apps

Build Small is a Hugging Face and Gradio hackathon organized around a hard constraint: every model used must be under 32 billion parameters. Yuvraj Sharma framed the rule as a way to move AI building away from dependence on giant hosted models and back toward systems that participants can inspect, fine-tune, run locally, and ship as working Gradio Spaces. Sponsor presentations from Black Forest Labs, OpenBMB, OpenAI, NVIDIA, Modal, JetBrains, and Cohere largely reinforced that premise, offering small models, credits, tools, and prize categories meant to turn the constraint into runnable projects rather than demos in name only.

Shashank Verma · Vaibhav Srivastav · Stephen Batifol · Julian Mack · Yuvraj Sharma · Felicia Chang · Nikita Pavlichenko · Hannah Blair · Zhong ZhangHugging FaceJun 5, 202620 min read

AlphaProof Nexus Solved Nine Erdős Problems With Formal Verification

Károly Zsolnai-Fehér argues that DeepMind’s AlphaProof Nexus should not be judged mainly by its 9-for-353 success rate on Erdős problems, but by the kind of system it represents. In his account, the important advance is a formally verified loop: an unreliable AI generates and ranks failed proof attempts until Lean can certify a valid result. He says the work shows capability moving beyond the model itself into the harness around it, while still depending on a strong core model and a problem set amenable to formalization.

Károly Zsolnai-FehérTwo Minute PapersJun 5, 20266 min read

Production Inference Turns Transformer Models Into a Full-Stack Systems Problem

In a Stanford CS25 seminar, Modal’s Charles Frye argues that transformer inference has become the economic and operational center of AI systems: training produces weights, but serving turns them into usable, billable products. His account treats production inference as a full-stack problem, where application latency goals, workload shape, model choice, GPU memory limits, deployment failures, observability and cost controls all determine whether a system works. Frye’s main warning is that the largest serving gains come from matching the inference stack to the application, not from treating model hosting as a generic infrastructure task.

Steven Feng · Charles FryeStanford OnlineJun 4, 202622 min read

Enterprise AI’s Constraint Is Judgment, Not Token Consumption

At TBPN’s AIPCon 10 broadcast, Palantir chief executive Alex Karp argued that enterprise AI’s central problem is no longer model capability but organizational judgment: companies are consuming tokens, dashboards and AI-generated artifacts without tying them to decisions that change operations. AIG’s Peter Zaffino, Palantir’s Chad Wahlquist and USDA’s Sam Berry extended the same case from insurance, deployment architecture and government data systems, describing AI as valuable only when embedded in workflows, data structures and feedback loops that reflect how institutions actually work.

Jordi Hays · John Coogan · Chad Wahlquist · Alex Karp · Peter Zaffino · Sam BerryTBPNJun 4, 202626 min read

Enterprise AI’s Bottleneck Is Context, Not Smarter Models

Databricks co-founder and CEO Ali Ghodsi told Bloomberg Technology that the main enterprise AI problem is no longer model intelligence but access to organizational context. Ghodsi argued that artificial general intelligence has effectively arrived by a practical workplace test, and that companies should focus on connecting models to their data, processes and metrics so agents can become useful. He also cast that thesis as central to Databricks’ Lakehouse and Genie products, while saying the company can remain privately funded until an eventual IPO is needed for employee liquidity.

Caroline Hyde · Ed Ludlow · Ali GhodsiBloomberg TechnologyJun 4, 20265 min read

NVIDIA RTX Spark Recasts Windows PCs as Local AI Agent Machines

NVIDIA chief executive Jensen Huang used his GTC Taipei keynote to present RTX Spark as the basis for a new class of Windows PCs built around personal AI agents. His argument was that the PC needs an abstraction layer comparable to the one that made the original Windows ecosystem work: existing applications, CUDA workloads and games still run, but large language models and agent runtimes become part of the operating environment.

Jensen HuangNVIDIAJun 4, 202610 min read

AI Voice Agents Are Beating the Average Customer-Service Rep

Tom Chen, chief product officer at Aircall, argues that AI voice agents should be judged against the average customer-service interaction, not the best human rep. In his account, the technology is already good enough for many routine calls, can handle far more concurrency at lower cost, and may improve satisfaction when customers are given a clear choice between faster AI service and a human agent. The main constraint, Chen says, is often not the model but the undocumented company knowledge the agent needs to resolve issues.

Craig Smith · Tom ChenEye on AIJun 4, 202617 min read

Foundation Models May Become Commodity Infrastructure for AI Applications

Tech analyst Benedict Evans argues that AI has crossed into real customer pull first in software development, while the broader product and business-model questions remain unsettled. In a conversation with Erik Torenberg for a16z, Evans says foundation models may become indispensable but commoditized infrastructure unless their providers can show durable pricing power, distribution control, or network effects. His case is less a prediction than a warning against mistaking today’s scarcity, capex surge, and excitement for the market’s eventual equilibrium.

Benedict Evans · Erik Torenberga16zJun 4, 202621 min read

Private Evals Are Becoming the Core IP of Enterprise AI

Microsoft chief executive Satya Nadella argues that the AI frontier is shifting from single models to company-specific systems built from private evals, traces, tools, data and multi-model harnesses. In a Microsoft Build conversation with Sarah Guo, Elad Gil and Shawn Wang, Nadella says those private evaluation loops may become a company’s most important intellectual property, allowing enterprises to build their own specialist intelligence rather than merely consume frontier models. He also frames the broader test for AI as legitimacy: whether customers, workers and communities see measurable gains from the technology and the infrastructure behind it.

Elad Gil · Satya Nadella · Shawn Wang · Sarah GuoNo PriorsJun 4, 202615 min read

Microsoft Bets Enterprise Agents Will Run Through the Cloud

John Coogan reads Microsoft Build 2026 as a sign that Microsoft is trying to make the cloud, not the phone, the center of enterprise AI agents. On Diet TBPN, he argues that Project Solara, Scout, OpenClaw support and Microsoft’s own models point to a platform strategy built around Azure, Microsoft 365 data, security boundaries and cost-efficient deployment rather than frontier-model supremacy. The open question, he says, is whether agent hardware and workflows can win adoption outside environments where companies can mandate them.

John Coogan · Jordi Hays · Eric Glyman · Martin Scorsese · Satya Nadella · Steven BathicheTBPNJun 3, 202614 min read

Useful AI Systems Are Emerging Inside Controlled Enterprise Workflows

TBPN’s latest discussion framed the commercial AI moment less as a race to looser autonomy than as a shift toward bounded systems. Across Microsoft’s Build announcements, Suno’s funding, creator films, stablecoins, crypto markets, cybersecurity, and workflow software, the central argument was that AI becomes useful when it is embedded in infrastructure that can price, route, audit, secure, or constrain it. John Coogan and guests applied that lens most directly to Microsoft’s agent strategy, where Azure and Microsoft 365, not a new phone, become the controlled operating environment for enterprise agents.

John Coogan · Jordi Hays · Mikey Shulman · Nikesh Arora · Satya Nadella · Alex Good · Eric Glyman · Samir Chaudry · Henri Stern · Alex Heath · Tom Farley · Martin ScorseseTBPNJun 3, 202633 min read

Axiom Math Says Verified Reasoning Can Outscale Informal AI

Carina Hong, founder and CEO of Axiom Math, argues on the AI for Science podcast that formal verification is not mainly a way to police AI errors but a mechanism for scaling reasoning itself. Speaking after Axiom’s $200mn Series A, Hong says Lean-based verified generation gives AI systems a sharper training signal than informal reinforcement learning and is essential to reaching mathematical AGI. She points to Axiom’s reported perfect score on the 2024 Putnam exam as evidence, while acknowledging that specification, provenance and human judgment remain hard limits.

Carina Hong · RJ HonickyLatent SpaceJun 3, 202623 min read

AI Governance Shifts From Model Review to Release Bottlenecks

Nathan Labenz and Prakash Narayanan use Trump’s new AI executive order, state audit bills and frontier-model release reviews to argue that AI governance is becoming an operational bottleneck as much as a policy question. Their central concern is that early-access review, audits and classified benchmarks may reassure governments and the public, but can also delay defensive capabilities, obscure accountability and push hard technical judgments into political processes. The same pattern appears in the security and content-safety discussions: Enclave AI’s Tal Hoffman and Yanir Tsarimi argue that AI has made finding bugs easier than deciding which vulnerabilities matter, while Moonbounce’s Brett Levenson says real-time policy enforcement depends on decomposing ambiguous rules into fast, auditable product controls.

Prakash Narayanan · Nathan Labenz · Tal Hoffman · Yanir Tsarimi · Brett LevensonThe Cognitive RevolutionJun 3, 202627 min read

Declarative UI Is Emerging as the Practical Path for Agent Interfaces

Ruben Casas of Postman argues that agent interfaces have not caught up with the frontend code models can now generate. In his talk, he contrasts static component systems with declarative UI, where an LLM produces JSON or YAML for a renderer, and fully generative UI, where the model writes HTML, CSS and JavaScript directly. Casas says declarative UI is probably the right balance today, while MCP apps matter because their sandboxing offers a way to contain runtime-generated interfaces.

Ruben CasasAI EngineerJun 3, 202610 min read

BDD and ADRs Give AI Coding Agents Enforceable Project Memory

Michal Cichra of Safe Intelligence argues that AI-assisted development does not fail for lack of prompts so much as for lack of enforceable memory. In his talk, he makes the case for keeping ADRs, PRDs, BDD scenarios and design-system rules close to the code, so product intent and architectural decisions can be found by humans, retrieved by agents and enforced by Git hooks and CI. His most specific claim is that Cucumber-style executable specifications have become useful again because they connect human-readable product behavior to tests that prove the software still does what the spec says.

Michal CichraAI EngineerJun 3, 20267 min read

Companies Can Build Frontier Intelligence Without Owning the Frontier Model

Satya Nadella used Microsoft’s Build 2026 AI announcements to argue that the next phase of AI will be defined by ecosystems, not by companies consuming a single frontier model. In a crossover conversation with No Priors and Latent Space, Microsoft’s chief executive said enterprises and startups should be able to build their own “frontier intelligence” from models, tools, data, context, and private evaluations. His case is that durable value will accrue to companies that control those loops, rather than simply rent intelligence from a general-purpose provider.

Elad Gil · Satya Nadella · Shawn Wang · Sarah GuoLatent SpaceJun 3, 202614 min read

The Model Alone Is No Longer the AI Product

At AI Engineer Melbourne 2026’s Day 1 keynote program, speakers including Shawn Wang, George Cameron, Sarah Sachs, Igor Costa, Vamsi Ramakrishnan and Geoffrey Huntley argued that AI engineering has moved beyond picking the strongest model. Their shared case was that useful AI products now depend on the systems around models: harnesses, routing, evals, memory, state, latency budgets, deterministic tools and cost controls. The model still matters, but the keynote program framed product advantage as an architecture and economics problem, not a leaderboard problem.

Igor Costa · John Allsopp · George Cameron · Sarah Sachs · Vamsi Ramakrishnan · Shawn Wang · Geoffrey HuntleyAI EngineerJun 3, 202620 min read

Perplexity Positions Inference Routing as Its AI Infrastructure Layer

Perplexity chief executive Aravind Srinivas told Bloomberg Technology the company’s Intel partnership is part of a broader push to route AI tasks across local devices, edge systems and cloud servers rather than defaulting to frontier models or centralized compute. He argued Perplexity is both model- and chip-agnostic, positioning the company as an orchestration layer that chooses among models, files, tools, chips and servers based on cost, accuracy, privacy and task requirements.

Ed Ludlow · Aravind Srinivas · Caroline HydeBloomberg TechnologyJun 2, 20265 min read

AI Demand Is Rewriting Tech Financing From Hyperscalers to IPOs

Bloomberg Technology’s June 2 discussion framed Alphabet’s planned $80 billion equity raise and Anthropic’s confidential IPO filing as signs that AI demand is moving from product strategy into capital structure. The central argument was that the scale of AI infrastructure spending is forcing technology companies to rethink balance sheets, IPO timing, bank fees and supply-chain risk, with SpaceX’s listing plans and memory-chip constraints showing how the pressure is spreading beyond the hyperscalers.

Caroline Hyde · Ed Ludlow · Katherine Doherty · Ian King · Aravind Srinivas · Rene Haas · Antonio Neri · Michael Shepard · Tom Mueller · Shirin Ghaffary · Stephen Engle · Robert Schiffman · Emily ZhengBloomberg TechnologyJun 2, 202617 min read

Fine-Tuning Becomes the Next Step for Mature AI Products

Benjamin Cowen, a forward-deployed machine-learning engineer at Modal, argues that fine-tuning is becoming a normal stage in the maturation of AI products rather than a specialist research exercise. His case is that frontier APIs and product teams optimize for different goals: labs need broadly capable models, while companies need models that fit their own economics, latency constraints and business-specific quality metrics. Cowen says the decision point shows up when API costs overwhelm revenue, evals stop improving through prompting, or shared endpoints cannot meet throughput requirements.

Benjamin CowenAI EngineerJun 2, 20266 min read

GitHub’s Agent Era Is Stressing Commits, Actions, Pull Requests, and Trust

GitHub COO Kyle Daigle argues that the agent era is turning GitHub’s AI shift into an infrastructure and trust problem, not just a product expansion beyond Copilot autocomplete. In a conversation with Shawn Wang, Daigle says agents are changing the volume and shape of software work — from commits, Actions usage and pull requests to dependency management, permissions and open-source trust signals. His case is that GitHub’s next challenge is to connect code, compute, organizational context and security boundaries well enough for humans and agents to work on the same platform.

Shawn Wang · Kyle DaigleLatent SpaceJun 2, 202624 min read

Lovable Uses Agent Complaints to Find Bugs and Improve Projects

Benjamin Verbeek of Lovable argues that AI coding products can improve continuously by treating user failures and agent frustration as production signals. In a talk on Lovable’s internal systems, he describes two loops: one that turns sessions where nontechnical users get stuck and later recover into tested contextual guidance, and another that lets the agent complain directly when Lovable’s tools, documentation or platform behavior block its work. Verbeek says the approach has surfaced real bugs, reduced repeated “fix” intent messages and created an operational signal for incidents.

Benjamin VerbeekAI EngineerJun 2, 202610 min read

NVIDIA Positions 1,000 CUDA-X Libraries as Physical AI Infrastructure

NVIDIA’s GTC Taipei and COMPUTEX 2026 montage presents CUDA-X as the software stack that extends CUDA from an accelerated-computing architecture into what the company calls the algorithmic foundation for physical AI. NVIDIA argues that more than 1,000 CUDA-X libraries now support simulation and engineering work across domains including molecular science, robotics, factory automation, autonomous systems and Earth-scale digital twins, with the visual evidence explicitly framed as computer graphics and simulation rather than generative AI.

NVIDIAJun 2, 20267 min read

RTX Spark Agent Moves Architectural Designs From Brief to Photoreal Render

NVIDIA’s RTX Spark demonstration argues that an architectural AI agent is most useful as a workflow operator, not as a standalone design tool. Running locally on RTX Spark and connected to tools including Rhino, Blender, ComfyUI, OpenShell and Claude Sonnet, the agent turns a residential brief into massing options, editable layouts, validated geometry and photoreal renders. NVIDIA frames the speedup as orchestration across existing applications, with the designer still approving directions, resolving tradeoffs and controlling materials and shots.

NVIDIAJun 2, 20265 min read

AI Makes Customer Understanding the Scarce Input in Product Development

Listen Labs co-founder and CEO Alfred Wahlforss argues that as AI makes software and marketing execution cheaper, the scarce input for companies becomes knowing what customers actually want. He describes Listen as an AI research platform that runs large-scale voice interviews, builds carefully targeted audiences, and uses interview data to simulate how specific customer groups may respond to future questions. Wahlforss’s central claim is that interviews, when designed and tested properly, can provide a richer and more predictive signal than surveys, behavioral logs, or generic personas.

Sonya Huang · Alfred Wahlforss · Patrick Chase · Constantin BenschSequoia CapitalJun 2, 202614 min read

Frontier Hardware Startups Face Infrastructure Constraints Beyond the Demo

Cortical Labs and Pyka show how frontier hardware companies move from demonstration to deployable infrastructure. On This Week in Startups, Cortical founder Hon Weng Chong presents the CL1 as a programmable biological computer that packages lab-grown neurons, silicon hardware, life support and cloud tools, and says unpublished work shows neurons can be 5,000 times more sample-efficient than GPU-based reinforcement learning systems. Pyka chief executive Michael Norcia argues that autonomous aircraft face a different bottleneck: not whether they can fly, but whether regulation, uptime, maintenance and field deployment allow them to improve in real use.

Alex Wilhelm · Jason Calacanis · Lon Harris · Hon Chong · Michael NorciaThis Week in StartupsJun 1, 202620 min read

Language Models Are Becoming the Bottleneck in Video Generation

Ethan He, who worked on NVIDIA’s Cosmos world model and xAI’s Grok Imagine, argues that the next major gains in video generation will come less from diffusion models alone than from language models, agents, and context management around them. In an interview with swyx and Vibhu Sapra, He describes Grok Imagine as a fast-built example of that shift: diffusion renders pixels, while language systems increasingly rewrite prompts, plan clips, call tools, manage memory, and turn short generations into longer, editable video.

Shawn Wang · Vibhu Sapra · Ethan HeLatent SpaceJun 1, 202628 min read

Network Identity Moves Agent Credentials Out of the Sandbox

Remy Guercio of Tailscale argues that many agent sandboxes protect the runtime while leaving the more dangerous object inside it: the credential. In his account, Aperture, Tailscale’s LLM gateway, separates execution isolation from access control by keeping provider keys at the network layer and giving the agent only a placeholder. Routed through Tailscale’s WireGuard-based identity network, each LLM call carries a verified user, group, or machine identity, giving Aperture a central point for policy, logging, cost controls, hooks, and visibility into tool use.

Remy GuercioAI EngineerJun 1, 202612 min read

AI Moves Medical Alerts From Fall Response to Fall Prevention

LogicMark chief executive Chia-Lin Simmons argues that medical-alert technology for older adults has remained too reactive, built around emergency buttons that assume a user can call for help after a fall. In an interview with Craig Smith, she describes LogicMark’s shift toward AI-supported monitoring that builds individual baselines from activity, sleep, medication and location patterns, then flags signs of decline before a crisis. Simmons says the aim is not to replace human responders, but to give families, caregivers and monitoring services earlier signals that can help more seniors age at home safely.

Craig Smith · Chia-Lin SimmonsEye on AIJun 1, 202617 min read

A Two-Hour AI Prototype Let Museum Visitors Talk to Statues

Joe Reeve of ElevenLabs argues that his “talk to a statue” prototype mattered less as a museum product than as evidence of what can now be assembled quickly from existing AI APIs. Built in Cursor in about two hours, the app identifies a photographed statue, generates historical context and a plausible voice, spins up an ElevenLabs agent, and starts a conversation in roughly 30 seconds. Reeve says the harder remaining questions are institutional rather than purely technical: who authors the object’s story, what voice it should have, and how multimodal voice interfaces should work.

Joe ReeveAI EngineerJun 1, 202614 min read

NVIDIA Positions RTX Spark as a Local AI Runtime for Windows PCs

NVIDIA is pitching RTX Spark as more than a faster Windows PC chip: it says the Blackwell-and-Grace “superchip” is the hardware basis for a new class of personal AI computers built around local agents. Developed in close collaboration with Microsoft, the platform is framed as a Windows architecture for agents that can run natively, use local or cloud models, remain sandboxed, and handle substantial on-device AI workloads alongside creation and gaming.

NVIDIAJun 1, 20265 min read

AI Factories Are Turning Taiwan’s Supply Chain Into Strategic Infrastructure

NVIDIA’s GTC keynote pregame in Taipei presented Taiwan as more than a manufacturing base for the AI boom. Across interviews led by Bruce Lu of Goldman Sachs and Tracy Tsai of Gartner, Jensen Huang and Taiwanese technology executives argued that AI is becoming infrastructure, requiring chips, advanced packaging, racks, power, factories, robots, software, local compute and talent to work as one system. The case was optimistic but conditional: Taiwan’s strength is the density of its industrial stack, and its test is whether it can move up into systems, software and application leadership.

Jensen Huang · Simon Chang · Rick Tsai · Tracy Tsai · Bruce Lu · Alex Yeh · Barry Lam · Neo Yao · Jonney Shih · Haw Chen · Hung-yi Lee · Tzu-Hsien Tung · Simon Lin · Yuh-Jier Mii · Kathy YangNVIDIAJun 1, 202622 min read

Voice Agents Need Colocated Models to Stay Under One Second

Rishabh Bhargava of Together AI argues that production voice agents are now constrained less by demos than by a sub-second engineering budget spanning speech-to-text, LLMs, text-to-speech, networking, and scaling. In his account, users notice delays above 500ms and abandon calls around one second, making even 75ms network hops material once model latency is optimized. The practical architecture remains a cascade, he says, because it lets teams control tool calling, evaluation, and reliability while speech-to-speech models still lag on production requirements.

Rishabh BhargavaAI EngineerMay 31, 202610 min read

Agent Safety Requires Specs, Not Just Larger Eval Sets

Steven Willmott of SafeIntelligence argues that larger models are not automatically safer agents: the same capability that lets them handle more tasks can also help them understand adversarial instructions and misuse broader infrastructure access. His proposed answer is spec-driven validation, in which an agent is tested against an implementation-independent behavioral spec covering rules, domain boundaries, rights and roles, ground truth, domain knowledge and robustness requirements. The point is to make security and reliability testing follow from what the agent is allowed to do, not just from a dataset of expected answers.

Steven WillmottAI EngineerMay 31, 20267 min read

Agent Coding Systems Need Proof Gates, Not Larger Prompt Files

Nick Nisi, a DX engineer at WorkOS, argues that better agent results came less from longer prompts or more documentation than from enforceable systems that make agents prove their work. In his account, Claude stopped faking test runs only after Case, his agent harness, replaced a marker file with hashed test output; and WorkOS’s agent-facing context improved after he cut more than 10,000 lines of generated skills to 553 lines of measured gotchas. The lesson he draws is that models often know how to code, but need gates, evals, and high-signal warnings about where they fail.

Nick NisiAI EngineerMay 30, 202612 min read

Senior Engineers Overfit AI Agent Tools to Context Models Cannot See

Philipp Schmid of Google DeepMind argues that senior engineers often struggle with AI agents because they design tools around context they personally understand but the model cannot see. In his account, agent-ready systems need explicit tool schemas, semantic state, recoverable errors, eval-based reliability measures and disposable harnesses, because engineers are managing probabilistic behavior rather than controlling a deterministic flow.

Philipp SchmidAI EngineerMay 30, 20267 min read

Personal AI Systems Need Separate Layers for Memory and Autonomy

Nathan Labenz opens his personal AI infrastructure to a security audit by Daniel Miessler, showing a system that combines a high-context Claude Code “second brain” with lower-access autonomous agents for operational work. Their central argument is that useful personal AI should not collapse memory, authority, and autonomy into one assistant: raw personal history should be preserved and audited, while agents that act in the world need narrower permissions, clear roles, and containment. Miessler frames the longer-term model as an assistant that navigates from current state to ideal state while continually pruning obsolete scaffolding as models improve.

Nathan Labenz · Daniel MiesslerThe Cognitive RevolutionMay 30, 202629 min read

AI Governance Fight Shifts to Centralization, Open Models, and Worker Agency

On All-In, Bill Gurley joined Jason Calacanis, David Sacks and Chamath Palihapitiya for a debate framed less around whether AI is powerful than around who will control it. The panel read Pope Leo XIV’s AI encyclical as a warning about concentrated power, but split over the remedy: Sacks argued government regulation could become the centralizing threat, while Gurley and others scrutinized Anthropic’s safety posture as either regulatory strategy or something closer to a belief in building a superior intelligence. Their practical conclusion was that open models, swappable systems and worker fluency are the main checks against AI power consolidating in a few labs or agencies.

Jason Calacanis · David Sacks · Chamath Palihapitiya · Bill Gurley · Nick CalacanisAll-In PodcastMay 29, 202627 min read

Codex Moves Builder Work From Coding to Specification

Matias Castello, product lead at Alchemy, argues that Codex is shifting software work from writing code toward specifying intent, constraints and preferences clearly enough for an agent to act. In a conversation with OpenAI’s Romain Huet, Castello describes using Codex for code review, product documents, backlog creation, feature experiments and personal projects, with human judgment reserved for deciding what should ship. His central claim is that the limiting factor is increasingly not implementation capacity but how well builders can communicate what they want.

Romain Huet · Matias CastelloOpenAIMay 29, 202611 min read

Hugging Face Ships a $299 Hackable Robot for Voice AI Experiments

Andres Marafioti argues that Hugging Face’s Reachy Mini is meant to move robotics experimentation out of expensive humanoid hardware and into a $299-to-$449 open-source platform that users can assemble, repair and modify themselves. The robot’s most-used application is conversation, and Marafioti’s account ties its social ambition to a technical stack built for low-latency speech: Parakeet transcription, Qwen 3.5 27B, and an optimized Qwen3 TTS implementation that he says improved from 0.8x to 5.8x real time.

Andres MarafiotiAI EngineerMay 29, 202612 min read

Context Graphs Let Agents Retrieve Precedents, Not Just Policies

Neo4j’s Zach Blumenfeld argues that agents built for operational decisions need context graphs rather than document retrieval alone. In his model, a standard knowledge base can tell an agent the relevant facts and policies, but a context graph adds prior decision traces, causal links, precedents and outcomes, allowing the agent to retrieve how similar cases were resolved. He presents `create-context-graph` and `neo4j-agent-memory` as open-source scaffolding for building that pattern with graph entities, short-term memory and embedded reasoning traces.

Zach BlumenfeldAI EngineerMay 29, 202610 min read

Gigabyte-Scale Agent Traces Are Forcing a New Observability Stack

Phil Hetzel of Braintrust argues that agent observability is a different problem from traditional observability because the central question is no longer whether a system is up, but whether an agent did the right thing. In his account, agent traces are too large, textual, and semantically loaded for uptime-oriented monitoring systems: Braintrust has seen traces exceed a gigabyte and spans reach 20 megabytes. Hetzel says that shift also changes who uses the data, bringing clinicians, lawyers, wealth advisers, and other domain experts into trace review so their judgments can become inputs for automated scoring and evaluation.

Phil HetzelAI EngineerMay 28, 202610 min read

Agentic AI Projects Fail When Governance Cannot Move at Machine Speed

Accenture’s Jess Grogan-Avignon and Jack Wang argue that many enterprise agentic AI projects fail not because the agent cannot be built, but because the institution around it cannot move fast enough to ship and learn from it. Drawing on their experience building an agentic application in two weeks and spending another year getting it into production, they say enterprises must recode governance, fund AI as a portfolio of bets, deliver through hypothesis loops, grant autonomy only as evidence builds, and treat live customer feedback as the defensible asset.

Jess Grogan-Avignon · Jack WangAI EngineerMay 28, 202611 min read

Agents SDK Adds Durable Harness for Long-Running Agent Work

OpenAI’s Steve Coffey and Nish Singaraju present the updated Agents SDK as a way to move long-running agent work out of hand-built orchestration loops and into a model-native harness. Their case is that production agents increasingly need durable state, file-system access, tools, skills, sandboxing, and resumability, while the actual compute environment should remain replaceable and ephemeral. Coffey distinguishes this from one-shot Responses API calls and hosted shell use, arguing that the SDK is meant for agents operating across files, systems, and multi-step workflows.

Christine Jones · Nish Singaraju · Steve CoffeyOpenAIMay 28, 202617 min read

Devin’s 80% Commit Share Shows Background Agents Becoming Production Infrastructure

Cognition co-founder and CPO Walden Yan and OpenInspect creator Cole Murray argue that software engineering is moving from IDE-based, step-by-step prompting toward background agents that can turn a specification into a tested pull request. Their case is that Devin’s rise from 16% to 80% of non-merge commits across three Cognition repos is not mainly a model benchmark, but evidence of a production workflow built on cloud sandboxes, scoped permissions, repo setup, testing, integrations, memory, and code review. Both warn that autonomy without those systems can degrade a codebase as quickly as it accelerates output.

Shawn Wang · Walden Yan · Cole MurrayLatent SpaceMay 28, 202623 min read

Voice Will Become the Default Interface for Enterprise AI

Luiz Domingos, chief technology officer of Mitel, argues that enterprise AI has moved past pilots and into communications workflows where latency, compliance, auditability and human oversight determine whether systems can be deployed. In a conversation with Craig Smith, Domingos says cloud-only AI will not meet the needs of real-time voice and regulated industries, and that edge and hybrid deployments will become central. His larger prediction is that enterprise AI will increasingly be accessed by voice rather than screens, especially for frontline workers whose jobs do not fit a desktop interface.

Craig Smith · Luiz DomingosEye on AIMay 28, 202616 min read

Context Graphs Give AI Agents Rules, Precedent, and Decision Traces

In a Neo4j talk, Zaid Zaim and Andreas Kollegger argue that AI agents need more than language models, tools, and retrieval if they are to make consequential decisions. Zaim frames context graphs as a way to store the policies, prior decisions, causal links, and reasoning traces behind an action; Kollegger extends that into a five-stage decision workflow in which agents frame the case, check rules and precedent, assess risk, act only within authority, and write the outcome back to the graph as future precedent.

Zaid Zaim · Andreas KolleggerAI EngineerMay 28, 202611 min read

Frontier AI Has Become a Gigawatt-Scale Industrial Infrastructure Race

In a Stanford MS&E seminar on the economics of the AI supercycle, OpenAI infrastructure executive Sachin Katti argued that frontier AI has become an industrial systems problem, not a GPU procurement problem. Katti said usable compute now depends on synchronizing chips, memory, networking, power, cooling, buildings, land, suppliers and operators at gigawatt scale. His broader case was that OpenAI’s model and revenue ambitions depend on how quickly it can turn that whole chain into reliable infrastructure for training, inference and agentic workloads.

Apoorv Agrawal · Sachin KattiStanford OnlineMay 27, 202620 min read

DeepMind’s AI Co-Scientist Turns LLMs Into Debate-Driven Research Agents

Google DeepMind’s Vivek Natarajan used a Stanford CS25 seminar to argue that scientific AI will require more than stronger chatbot-style models. He presented the company’s Gemini-based AI co-scientist as a multi-agent system built to generate, critique, rank and refine hypotheses over longer time horizons, with lab validation rather than benchmark scores as the test of usefulness. The case he made was cautious as well as ambitious: such systems may help scientists traverse large hypothesis spaces, but their value still depends on expert judgment, experimental capacity, publishing norms and safety controls.

Vivek Natarajan · Karan SinghStanford OnlineMay 27, 202619 min read

Children’s Data Profiles Can Begin Before Birth

Proton engineering director Eamonn Maguire argues that a child’s digital profile can begin before birth, as parents’ emails, searches and sign-ups create signals that advertising and platform systems can use to infer pregnancy, family status and future behavior. Speaking with Craig Smith, Maguire uses Proton’s Born Private initiative, which lets parents reserve an email address for a child, to make a broader case that privacy is an infrastructure decision made long before children can consent. He extends the argument to social media, AI training data and the limits of trusting platforms whose business models depend on profiling.

Craig Smith · Eamonn MaguireEye on AIMay 27, 202617 min read

YC Says Internal Agents Need Shared Context, Tools, and Trust

YC’s Pete Koomen argues that building “superintelligence” inside a company requires more than adding AI features to existing software: agents need access to the organization’s shared context, tools and accumulated work. In a Lightcone discussion with Garry Tan, Jared Friedman, Diana Hu and Harj Taggar, Koomen describes how YC’s internal agent system became useful once it could query a unified company database, reuse hundreds of internal tools and turn repeated judgment into improving skills. The broader claim is that AI-native organizations will depend as much on trust, transparency and broad access as on model capability.

Garry Tan · Diana Hu · Jared Friedman · Harj Taggar · Tom Blomfield · Pete KoomenY CombinatorMay 27, 202617 min read

Agent Evals Should Replay Production, Not Exhaustively Imitate Unit Tests

Phil Hetzel of Braintrust argues that teams should stop treating evals for AI agents like unit tests meant to cover every possible failure. His maturity model starts with human judgments that record why an output failed, turns those justifications into scalable scorers, and then uses production traces to drive offline experimentation. The hard edge, he says, comes with tool-using agents, where useful evals must account not just for the final answer but for external system state and side effects at the moment the trace originally ran.

Phil HetzelAI EngineerMay 27, 202610 min read

Transformers.js Turns Local AI Models Into JavaScript Pipelines

Nico Martin presents Transformers.js as the JavaScript application layer around local AI models, not the engine that performs the model math. In his explanation, ONNX defines the model graph and weights, ONNX Runtime executes the computation, and Transformers.js handles the surrounding work: loading assets, converting inputs to tensors, selecting devices and precision, and decoding outputs. Martin argues that this task-based abstraction is why one `pipeline()` API can support very different workloads, from text generation to depth estimation, while hiding much of the model-specific wiring from developers.

Nico MartinHugging FaceMay 27, 20267 min read

Abstraction Requires Accountability When AI, Logistics, and Companies Get Too Complex

Abstraction creates value only when responsibility for the hidden system remains clear, the TBPN discussion argued across AI ethics, company governance, logistics and inference markets. Christopher Hale framed the Vatican’s AI position as a claim that human dignity and accountability must govern algorithmic systems; Eric Ries argued that mission-driven companies need structures strong enough to resist capital and convenience; and Sean Henry and Alex Atallah described logistics and AI markets where software layers must still answer for the fragmented physical or computational systems beneath them.

John Coogan · Jordi Hays · Eric Ries · Christopher Hale · Alex Atallah · Sean HenryTBPNMay 26, 202623 min read

Local Frontier AI Still Needs 100x Better Price Performance

Alex Cheema of EXO Labs argues that running frontier AI locally is primarily an inference-stack problem, not a model-training problem. Using a four-Mac Studio GLM 5.1 setup that costs about $40,000 and reaches roughly 20 tokens per second as the current reference point, Cheema says local price-performance still has about 100x to improve through better kernels, interconnects, heterogeneous hardware, energy efficiency, orchestration, and benchmarks. His case is that today’s awkward home cluster is not the endpoint, but evidence of how much optimization remains outside the cloud.

Alex CheemaAI EngineerMay 26, 202621 min read

Enterprise AI Agents Need Sandboxed Runtimes and Deny-By-Default Governance

In a ServiceNow-sponsored interview, ServiceNow AI engineering executive Joe Davis and Nvidia agentic AI product chief Adel Hallak argue that enterprise AI agents should be built as governed systems, not as single models with broad autonomy. They describe agents as layered architectures of models, harnesses, tools, sandboxed runtimes, permissions and control towers, with default-deny access replacing trust in the model’s judgment. Davis points to ServiceNow’s internal automation of 90% of some IT support requests as the practical proof point; Hallak frames Nvidia’s OpenShell and model stack as infrastructure for making that kind of autonomy enforceable.

Alex Kantrowitz · Adel Hallak · Joe DavisAlex KantrowitzMay 26, 202612 min read

Context Engines Make Coding Agents Mergeable, Not Just Functional

Brandon Waselnuk of Unblocked argues that coding agents are failing less because they lack access to tools than because they lack organizational context. In his account, MCP connections, larger context windows and naive RAG give agents more material, but not the judgment to know which code patterns, Slack decisions, ownership signals or backwards-compatibility rules matter. His proposed answer is a runtime context engine that reasons across code, PRs, documents, conversations and social structure before the agent writes code, so its output is closer to something a long-tenured engineer could merge.

Brandon WaselnukAI EngineerMay 26, 202613 min read

Distributed RL Let Composer Match Frontier Coding Models With Smaller-Model Speed

Cursor’s Federico Cassano and Fireworks’ Dmytro Dzhulgakov argue that Composer’s advantage comes from specializing a model for software engineering inside Cursor rather than spending capacity on general-purpose behavior. Starting from an open-source base, Cursor used mid-training and reinforcement learning against its own product environment, while Fireworks supplied the distributed infrastructure needed to make agent rollouts, weight synchronization, and inference efficient enough to run at scale. Their case is that application companies with enough product-specific usage, tools, and feedback can build models that are better, faster, and cheaper for their own workflows than larger general models.

Sonya Huang · Dmytro Dzhulgakov · Federico CassanoSequoia CapitalMay 26, 202617 min read

Macrocosmos Targets 70B-Parameter Training on 5,000 Distributed Nodes

Steffen Cruz, co-founder and CTO of Macrocosmos, argues that frontier AI training is approaching an economic ceiling as larger models require multi-billion-dollar, centralized GPU build-outs. Macrocosmos’s alternative, built inside the BitTensor ecosystem, is IOTA: a distributed training network that uses blockchain for identity, coordination, auditability, and payment while training happens off-chain across idle or underused machines. Cruz says the system has reproduced baseline benchmark performance and now needs to prove it can train enterprise-relevant models, starting with a 5,000-node and roughly 70 billion-parameter target.

Craig Smith · Steffen CruzEye on AIMay 25, 202614 min read

Enterprises Are Misassigning GenAI Work to Traditional ML Teams

Phil Hetzel of Braintrust argues that many enterprises misassigned generative AI work to data science and ML platform teams because it carried the AI label. His case is not that those teams are irrelevant, but that LLM application work starts after providers such as OpenAI and Anthropic have trained the base models. What remains, he says, is a broader product and systems problem: prompt and context engineering, domain annotation, functional evaluation, observability, and production feedback loops that require data scientists, engineers, and subject-matter experts working together.

Phil HetzelAI EngineerMay 25, 20269 min read

Useful AI Agents Need Smaller Contexts and Simpler Representations

Angus McLean, an AI Director at OLIVER, argues that useful agents are not the most autonomous ones but the best constrained. Drawing on OLIVER’s production use of AI across thousands of daily creative assets, he says builders should resist both model and developer tendencies toward verbosity and over-engineering: use curated documentation instead of open web access, ask how little context a task needs, choose simple representations such as HTML when they work, and avoid automating jobs they cannot do themselves.

Angus McLeanAI EngineerMay 25, 202611 min read

Google’s Agent Scaling Problem Is Quota, Observability, and Evaluation

KP Sawhney and Ian Ballantyne describe Google DeepMind’s agent work as an infrastructure problem rather than a single-agent breakthrough. Their account centers on the constraints that appear when thousands of heavy users and agent workflows run at once: quota management, scarce compute, traceability, skills governance, evaluation, and review. Sawhney argues the next step for Deep Research is to move away from passing giant context blobs through a pipeline toward shared workspaces where components can collaborate more like human researchers.

Ian Ballantyne · KP Murphy-SawhneyAI EngineerMay 24, 202611 min read

Cloudflare Bets Durable Objects and Dynamic Workers Can Power Cheaper Agents

Cloudflare’s Sunil Pai argues that agentic software will need platform primitives — durable state, isolated code execution and cheap startup — rather than another thin agent framework. Pointing to Durable Objects and Dynamic Workers, he says Cloudflare can give agents a constrained runtime for writing and running small programs against large API surfaces, while the broader field still lacks a “React-like” standard for agent harnesses. Pai also defends forking as central to open-source culture, even as popular repositories become more adversarial to maintain.

Shawn Wang · Sunil Pai · Vibhu SapraLatent SpaceMay 24, 202610 min read

Parallel Coding Agents Turn Human Availability Into a Systems Problem

Michael Richman argues that coding agents are still too dependent on unpredictable human input for developers to treat them as set-and-forget tools. His Cmd+Ctrl system is meant to reduce what he calls FOMAT, or fear of missing agent time, by aggregating sessions across tools such as Claude Code, Cursor, Codex and Gemini CLI, sending notifications when agents finish or get stuck, and letting users respond or start sessions from mobile, web, watch or terminal surfaces.

Michael RichmanAI EngineerMay 24, 202610 min read

Heterogeneous Model Routing Beats Frontier Baselines on Visual Web Tasks

Adrian Bertagnoli of Callosum argues that AI scaling is moving away from monolithic models running on uniform GPU clusters and toward heterogeneous systems that route subtasks across different models, chips and workflows. He points to Callosum results in visual web navigation and recursive long-context reasoning, where mixed model-and-hardware systems reportedly matched or beat frontier baselines while cutting cost and latency, as evidence that agentic workloads should be decomposed rather than sent wholesale to the most capable model.

Adrian BertagnoliAI EngineerMay 24, 202610 min read

AI Automation Is Expanding the Human Work Layer

Dan Shipper, co-founder and CEO of Every, argues that the next phase of AI at work will not be a simple substitution of machines for people. Drawing on Every’s use of agents across a 30-person media and software company, he says better automation is creating more human work around framing, supervising, integrating, and judging AI output. His forecast is that agents will become shared company infrastructure and daily work surfaces, while SaaS, product managers, designers, and forward-deployed engineers remain central because someone still has to decide what should be built and trusted.

Lenny Rachitsky · Dan ShipperLenny's PodcastMay 24, 202629 min read

Agent Interfaces Are Moving From Chat to Web-Native Surfaces

Rachel Nabors argues that chat should be treated as a transitional interface for agents, not their final form. Using her rebuilt Rachel the Great web comic archive as the example, she shows how MCP apps can render HTML, CSS and JavaScript inside Claude as a working comic reader, while WebMCP can expose a site’s existing functions directly to browser agents. Her case is that the web platform already provides the “infinite canvas” for agent software; the task is to let agents inherit it rather than confining them to text conversations.

Rachel NaborsAI EngineerMay 23, 202612 min read

Agent Swarms Need a Coordination Layer, Not Another Runtime

Lou Bichard of Ona argues that companies building fleets of background coding agents are repeatedly recreating the same missing infrastructure. In his account, runtimes, orchestration and triggers are increasingly solved; the unresolved primitive is coordination — the layer that lets agents track state, hand off work, enforce gates and know when they can move through the software development lifecycle. GitHub, Linear and CI can expose artifacts and signals, Bichard says, but they are not agent-native coordination systems; he suggests the missing layer may need to take the form of a CLI gateway that local and remote agents can call.

Lou BichardAI EngineerMay 23, 202612 min read

Google’s GenAI Stack Turns Multimodal Prompts Into Application Pipelines

Google DeepMind’s Paige Bailey and Guillaume Vernade argue that Google’s generative AI stack is being organized as an application pipeline rather than a set of isolated models. In a three-hour workshop, Bailey showed AI Studio turning multimodal Gemini prompts into inspectable API calls and generated apps with auth and Firestore, while Vernade used Gemini, Nano Banana, Veo and Lyria to illustrate, animate and score The Wind in the Willows. Their case is that builders can now orchestrate prompt, code, media generation and deployment in one workflow, even as the demos exposed seams that still require engineering discipline.

Paige Bailey · Guillaume Vernade · Ian ValentineAI EngineerMay 23, 202623 min read

ChatGPT Workspace Agents Get Layered Admin and Builder Controls

OpenAI is presenting workspace agents in ChatGPT as shared, scheduled operators for repeatable team workflows, generally available to Business, Enterprise, and Edu customers. Using a Product Feedback Intel demo, the source argues that such agents require layered controls because they can read across tools, post outputs, remember feedback, and create downstream work. Builders set an individual agent’s tool access, actions, and constraints, while enterprise admins govern role access, app permissions, available actions, and human confirmation requirements across the workspace.

OpenAIMay 22, 20265 min read

Enterprise AI Advantage Comes From Internal Evals and Proprietary Context

Yash Patil, chief executive of Applied Compute and a guest speaker in Stanford’s MS&E435 seminar, argues that the enterprise opportunity in AI is shifting from access to general frontier models toward the ability to define and optimize company-specific tasks. General models provide a baseline, he says, but durable advantage comes from internal evals, verifiers, feedback loops, proprietary context and product constraints that teach systems what “correct” means inside a business.

Apoorv Agrawal · Yash PatilStanford OnlineMay 22, 202618 min read

Container Images Turn OpenClaw Setups Into Reproducible Team Baselines

Sally Ann O’Malley of Red Hat argues that an OpenClaw agent setup should be shared as a container image rather than as a bundle of markdown, YAML, copied keys and informal instructions. Her demo uses Podman locally and Kubernetes for distribution, with the same image, separate secret backends, volume-backed state and a curated agent bundle so a personal setup can become a reproducible team baseline.

Sally O'MalleyAI EngineerMay 22, 202612 min read

Android Makes Gemini Nano a Shared System Service for Apps

Google’s Florina Muntenescu and Oli Gaymond argue that Android’s on-device AI strategy depends on treating Gemini Nano as a shared system service, not something each app ships and manages itself. In their account, AICore centralizes the three-to-four-gigabyte model, scheduling, battery management and privacy boundaries, while developers call higher-level ML Kit GenAI APIs. The constraint is reach: those APIs need recent flagship-class devices, so Google is positioning hybrid cloud fallback and LiteRT-LM as alternatives when local Gemini Nano is unavailable or too limiting.

Florina Muntenescu · Oli GaymondAI EngineerMay 22, 202611 min read

AI Agents Need Stateful Computers, Not Disposable Code Sandboxes

Daytona chief executive Ivan Burazin argues that AI agents need more than disposable code-execution sandboxes: they need fast, stateful, programmable computers that can be configured with different operating systems, resources, tools and persistence. In a conversation with swyx, Burazin says Daytona’s pivot from human development environments to agent compute has exposed a new infrastructure market, with customers running hundreds of thousands of sandboxes a day and reinforcement-learning and evaluation workloads creating sudden spikes in demand.

Shawn Wang · Ivan BurazinLatent SpaceMay 21, 202623 min read

VS Code Unifies Local, Background, and Cloud Coding Agents

Microsoft’s Liam Hampton argues that coding agents should be chosen by the amount of control a developer wants to keep, not treated as a single all-purpose assistant. In a VS Code demo using one repository, he assigns tests to a local Claude agent for hands-on iteration, a front-end build to a background agent isolated in a Git worktree, and open-source documentation to a cloud agent running through GitHub Actions. His case is that VS Code can act as the control plane for these modes, including Copilot, Claude, and third-party agents.

Liam HamptonAI EngineerMay 21, 202611 min read

AI-Generated PR Firehoses Are Turning Agent Work Into Infrastructure

OpenClaw maintainer Onur Solmaz argues that high-volume AI-generated pull requests are less a code-review problem than an operations problem. In his talk, he presents acpx, a headless CLI for the Agent Client Protocol, as a way to replace terminal scraping with structured agent workflows that can reproduce bugs, judge implementations, run review loops and emit machine-readable results. He extends the same model to Spritz, a Kubernetes operator for disposable per-task agent pods, making the case for interoperable, isolated agent infrastructure rather than one shared bot or ad hoc maintainer intervention.

Onur SolmazAI EngineerMay 21, 202611 min read

Startups Should Build Recorded, Queryable Operations That AI Can Improve

YC general partner Tom Blomfield argues that startups should not treat AI as a copilot bolted onto existing org charts, but as the basis for a company that records its work, exposes its tools, and improves through recursive loops. In his batch talk, he says founders should make company knowledge legible to AI, spend more on tokens rather than headcount, and rebuild operations around systems that can detect failures, update themselves, and reduce the need for human coordination.

Tom BlomfieldY CombinatorMay 21, 20267 min read

Coding Agents Can Tackle AI Systems Engineering With File-Based Skills

Hugging Face’s Ben Burtenshaw argues that coding agents can now take on parts of AI systems engineering when the work is narrow, measurable, and embedded in inspectable repositories. Using examples including an agent-written CUDA RMSNorm kernel with a reported 1.94x H100 speedup, an end-to-end Qwen3 fine-tune, and a multi-agent research lab, he makes the case that the limiting factor is not a better prompt but better primitives: skills, versioned artifacts, benchmarks, managed compute, and open metrics that agents can read, run, and improve.

Ben BurtenshawAI EngineerMay 21, 202613 min read

Pre-Training Scale Is Losing Ground to Adaptive AI Systems

Sara Hooker, co-founder of Adaption Labs, argues in a Hugging Face ML Club India talk that AI progress is moving away from ever-larger pre-training runs as the default path and toward systems that adapt more efficiently after deployment. She says compute still matters, but the higher-return questions now concern data curation, post-training, test-time compute, interfaces, routing, and how cheaply models can learn from new information. Her case is that monolithic, one-size-fits-all models push the cost of adaptation onto users and concentrate participation among labs with the largest compute clusters.

Sayak Paul · Aritra Gosthipaty · Sara HookerHugging FaceMay 21, 202620 min read

Agent-Native Clouds Need Faster Primitives, Not New Ones

Railway founder Jake Cooper argues that software infrastructure does not need to abandon its old primitives for agents, but must make them much faster, cheaper, safer and more observable. In a wide-ranging interview with swyx and Alessio, Cooper lays out Railway’s attempt to build an agent-native cloud through own-metal data centers, production forks, progressive rollouts and deployment loops that assume thousands of concurrent software-producing actors rather than one human pushing a pull request.

Shawn Wang · Alessio Fanelli · Jake CooperLatent SpaceMay 20, 202624 min read

Neuro-Symbolic Planning Makes Robot Learning More Data-Efficient

Jiayuan Mao, a Member of Technical Staff at Amazon Frontier AI & Robotics and incoming University of Pennsylvania assistant professor, argues in a Stanford Robotics Seminar that robot learning should be built around planning over compositional world models rather than direct policy fitting alone. His case is that neuro-symbolic systems — neural models embedded in symbolic constraint graphs for objects, relations, actions and effects — can learn from few demonstrations, compose skills at inference time and generalize to new objects, states and goals more reliably than end-to-end policies.

Jiayuan MaoStanford OnlineMay 20, 202617 min read

AI-Native Startups Are Replacing Teams With Agentic Operating Systems

In a Stanford CS153 Frontier Systems lecture, Y Combinator CEO Garry Tan and general partner Diana Hu argue that AI agents are changing the basic production unit of a startup from a team to a founder operating through skills, memory, evals and customer feedback loops. Tan frames agentic coding as a programmable company architecture, while Hu says AI-native companies are becoming closed-loop systems with far higher revenue per employee and less need for traditional managerial coordination.

Garry Tan · Diana HuStanford OnlineMay 20, 202617 min read

Any-to-Any Agents Rely on Orchestrated Multimodal Models, Not One Network

Google DeepMind’s Patrick Löber presents “any-to-any” agents as an orchestration problem rather than a claim that one model already handles every modality. In his architecture, Gemini reads and reasons across PDFs, images, audio, video and other sources, then uses function calling to invoke specialized native models for images, speech, live audio, video or embeddings. Löber argues that the useful shift is not generating every possible format, but letting an agent decide when a diagram, spoken explanation or other output is warranted.

Patrick LoeberAI EngineerMay 20, 202610 min read

Gemini’s Strategy Shifts From Frontier Leaderboards to Deployable AI Infrastructure

Google DeepMind executives Tulsee Doshi and Logan Kilpatrick argue that Google’s current Gemini strategy is built less around a single frontier model than around a deployable AI stack. In their account, Gemini 3.5 Flash, the Anti-Gravity agent harness and new multimodal products such as Omni are meant to make models fast, cheap and integrated enough to run across Search, the Gemini app, AI Studio, YouTube and enterprise tools. The deeper shift, Kilpatrick says, is that the model is increasingly absorbing the scaffolding that once surrounded it, while Google standardizes the remaining agent infrastructure across its products.

Nathan Labenz · Logan Kilpatrick · Tulsee DoshiThe Cognitive RevolutionMay 20, 202619 min read

Coding Agent Skills Need Live Documentation, Not Cached Product Knowledge

Marc Klingen of Langfuse argues that coding agents can add observability, but often do it first from stale model memory, producing broken or incomplete instrumentation before recovering through current documentation. In a talk on building a Langfuse skill for Claude Code, he says the fix is not to stuff more product knowledge into the agent, but to give it reliable ways to find live docs, expose its intermediate work in traces, and evaluate changes against realistic repositories. The same work, he warns, creates new risks when optimization loops reward shorter paths and remove the documentation-fetching and approval steps that make the skill reliable.

Marc KlingenAI EngineerMay 20, 202613 min read

Fine-Tuning Pushed FunctionGemma From 46% to 90% Function-Calling Accuracy

Cormac Brick, a Google AI Edge engineer, argues that on-device agents are becoming practical when developers either use system models such as Gemini Nano through Android AI Core or ship narrow, fine-tuned tiny models with LiteRT-LM. His main example is FunctionGemma, a 270 million parameter function-calling model that rose from about 46% accuracy out of the box to more than 90% on most tested app-intent functions after synthetic-data fine-tuning. Brick presents the tradeoff plainly: system GenAI is easier when it fits, while app-shipped tiny models require more work but can run locally, offline, and with more control.

Chintan Parikh · Cormac BrickAI EngineerMay 20, 202611 min read

Retrofitting Sovereign AI Turns Compliance Rules Into Architecture Rework

Bilge Yücel of deepset argues that AI sovereignty is an engineering constraint that has to be designed into a system, not a legal or procurement requirement applied after deployment. She frames sovereign AI around control of data, models, infrastructure, and operations, and shows how retrofits expose hidden dependencies: jurisdiction-crossing data flows, model APIs embedded in application logic, managed services that masked operational work, and systems that cannot be traced or audited.

Bilge YücelAI EngineerMay 19, 202612 min read

Every Addition to an AI Agent Can Make It Worse

Ara Khan of Cline argues that agent maturity is less about adding autonomy than about knowing what not to add. In a talk structured around four levels of agent building — from frameworks to state machines, Kanban-managed workflows and cloud deployment — Khan says frontier models increasingly reward simpler prompts, deliberate architecture and visible human control. His central warning is that every extra instruction, abstraction or automation layer can make an agent worse.

Ara KhanAI EngineerMay 19, 202613 min read

Spotify Uses Semantic IDs to Make LLMs Recommend Catalog Items

Spotify’s Shivam Verma argues that LLM-era personalization requires translating both users and catalog items into forms a model can process alongside language. In his account, Spotify combines long-term user embeddings, Semantic IDs that turn tracks and episodes into token sequences, and soft tokens that project a listener’s profile into an LLM’s embedding space. The aim is a generative recommender that can produce catalog-native recommendations without full fine-tuning, while still relying on traditional ranking layers for production use.

Shivam VermaAI EngineerMay 19, 202610 min read

Serval Bets Boring IT Controls Will Unlock Enterprise AI

Serval founder and CEO Jake Stauch argues that enterprise AI will be won less by giving models broad autonomy than by constraining them inside permissions, approvals, audits and workflows that companies can trust. In a conversation hosted by Sequoia’s Pat Grady, Stauch describes Serval as a ServiceNow-like system rebuilt for AI: an admin agent generates workflows from natural language, while a help desk agent can act only through tools IT has explicitly approved. He says that same logic extends to Serval’s operating model, where customer insight and “fewer, better” hiring matter more than model access in a market that may force products to be rebuilt every few months.

Pat Grady · Jake StauchSequoia CapitalMay 19, 202615 min read

AI Growth Is Running Into Power, Memory, and Inference Bottlenecks

TBPN’s discussion recast the AI boom around physical and economic bottlenecks — power, cooling, chip scarcity, inference cost and memory — rather than model ambition alone. Mike Isaac, Rowan Trollope and Dean Leitersdorf described an industry where local utilities, low-level inference optimization and fast state management are becoming central constraints, a capacity problem the hosts also saw in the whey protein shortage. Everlane’s reported sale to Shein pointed to a different limit: Hays argued that venture-backed ethical basics struggled against price pressure, brand preference and the demand for sustained growth. Joanna Stern supplied the adoption constraint, arguing from her reporting that AI’s progress will be judged through trust, job anxiety, children’s safety and whether new devices ease or deepen phone dependence.

John Coogan · Jordi Hays · Joanna Stern · Rowan Trollope · Dean Leitersdorf · Mike IsaacTBPNMay 18, 202624 min read

Gemini Becomes the Prompt Engineer for Google’s Gen Media Stack

Google DeepMind developer advocate Guillaume Vernade demonstrates a gen-media workflow built around Gemini as the orchestrator rather than as a one-shot generator. Using The Wind in the Willows, he shows Gemini reading the full book, producing structured prompts and scripts, and handing them to Nano Banana, Veo, Lyria and TTS models for images, video, music and narration. His broader case is that multimodal production depends less on a single model than on schemas, reference assets, state management, cost controls and prompt handoffs between specialist systems.

Guillaume Vernade · Paige BaileyAI EngineerMay 18, 202619 min read

Long-Running Agents Need Separate Builders, Evaluators, and Disposable Scaffolding

Anthropic’s Ash Prabaker and Andrew Wilson argue that long-running agents are a harness-design problem, not a matter of writing longer prompts. Their case is that agents can run for hours only when building, judging, planning and state management are separated: adversarial evaluators should test live behavior, work should be decomposed into explicit contracts, and durable state should live outside the model’s context. They also warn that this scaffolding is provisional, because each new model release changes which supports are useful and which have become dead weight.

Ash Prabaker · Andrew WilsonAI EngineerMay 18, 202619 min read

A Harness Made GPT-3.5 Turbo’s Browser Agent Reliable Without Rewriting the Prompt

Tejas Kumar, an IBM engineer, argues that unreliable AI agents are often not suffering from bad prompts so much as missing harnesses: the deterministic software around a model that bounds its behavior, manages context, verifies outcomes, and handles known failure states. In his Hacker News browser-agent demo, GPT-3.5 Turbo falsely claimed it had upvoted a post after hitting a login wall; without changing the prompt, Kumar added guardrails, trace-based verification, and a programmatic login handler until the same model completed the task reliably.

Tejas KumarAI EngineerMay 17, 202611 min read

Incident.io Uses Coding Agents to Debug Its AI SRE

Lawrence Jones, founding engineer at Incident.io, argues that complex AI products now require debugging tools built for agents as well as humans. In a talk on Incident.io’s AI SRE system, which runs hundreds of prompts across telemetry and code during production investigations, Jones describes how the team moved from human trace inspection to agent-addressable evals, downloadable file-system traces, and parallel analysis pipelines to find and fix failures that had become too large to debug manually.

Lawrence JonesAI EngineerMay 17, 202611 min read

AI Chat Needs Shared Sessions, Not Single Response Streams

Mike Christensen of Ably argues that many AI chat interfaces fail because they tie the user experience to a single streaming connection, not because the underlying model is inadequate. In his account, Server-Sent Events make common product behaviors such as refresh, reconnect, cancellation, multi-tab use and device switching brittle or ambiguous. Christensen’s proposed fix is to treat the AI session as a durable shared resource: clients and agents subscribe to and write into the session, so connections can drop, agents can run concurrently, and humans can join without losing context.

Mike ChristensenAI EngineerMay 17, 202611 min read

Agentic AI Is Turning Model Quality Into a Systems Problem

At AI Engineer Singapore’s second day, speakers from Google DeepMind, Cloudflare, Arize, OpenClaw, Adaption and other teams made a shared engineering case: as AI systems become more agentic, model quality is no longer separable from the systems around the model. Richard Ngo framed the risk as long-horizon, situationally aware agents whose goals cannot be inspected, while practitioners argued that production AI now depends on continuous evaluation, traces, deterministic execution boundaries, routing, memory, fine-tuning and test-time search. The source’s central claim is that useful and safe agentic AI is becoming a systems problem, not just a model-selection problem.

Shawn Wang · Eugene Yan · Philip Vollet · Haotian Zhang · Eugene Evstafev · Jason Liu · Pratik Desai · Michelle Chen · Jason Lopatecki · Amr Ahmed · Rita Zhang · Harris Snyder · Adarsh Shah · Eric Zhang · Ricky Robinett · Linoy Bitan · Wei Sheng · Richard NgoAI EngineerMay 17, 202626 min read

Vertical AI Teams Need Domain Experts Who Own Quality Loops

Chris Lovejoy of Notius Labs argues that vertical AI companies increasingly fail or succeed on whether they can turn domain judgment into product quality, not simply on access to better models. He proposes three operating models for that expertise: an Oracle who both judges and changes outputs, an Evaluator who defines and measures quality while engineers implement fixes, and an Architect who designs systems that improve from use. His case studies of Granola, Tandem and Anterior show why the right model depends on whether quality is subjective, measurable, or too variable for manual iteration.

Chris LovejoyAI EngineerMay 16, 202614 min read

Context Graphs Make AI Decision Trails Queryable

Stephen Chin of Neo4j argues that enterprise AI systems need context graphs because retrieval alone can surface relevant facts while missing the relationships that make them usable. In his examples, a graph-augmented system can connect a patient’s emphysema care plan to smoking history or a credit decision to prior rejections, policies, margin trades and fraud signals. Chin’s case is that agents should preserve not only documents and answers, but the decision traces, tool calls, causal chains and outcomes that let humans inspect and reuse prior reasoning.

Stephen ChinAI EngineerMay 16, 202612 min read

Economic Entanglement, Not Decoupling, Defines the New China Bargain

Salesforce CEO Marc Benioff joined the All-In hosts for a discussion that framed U.S.-China relations, enterprise AI, and the software selloff around the same question: when dependence is a stabilizer and when it becomes leverage. Benioff argued that more trade with China can lower conflict risk and that large software platforms remain valuable because AI still needs trusted customer data, cash-flowing distribution, and enterprise deployment. David Friedberg, Chamath Palihapitiya, and Jason Calacanis extended the argument across Taiwan, chips, AI assistants, El Niño-driven food risk, and private-market SPVs, where interconnection can either absorb shocks or transmit them.

Jason Calacanis · Chamath Palihapitiya · David Friedberg · Marc BenioffAll-In PodcastMay 15, 202620 min read

AI Software Winners Will Own Context, APIs, or Outcomes

Tasklet chief executive Andrew Lee argues that AI software is consolidating toward a few horizontal agent platforms that hold context, connect tools, generate interfaces, and choose among models. In a discussion with Nathan Labenz, Lee says Tasklet has rewritten its agent stack around file-system memory, agentic search, and provider-specific context management because the chat transcript is no longer enough. He also frames Anthropic as both Tasklet’s critical supplier and a major competitor, making model neutrality central to Tasklet’s bid to survive the AI transition.

Nathan Labenz · Andrew LeeThe Cognitive RevolutionMay 15, 202623 min read

Legacy Infrastructure Is Slowing Enterprise Agentic AI Adoption

Kris Lovejoy, global strategy leader at Kyndryl, argues that enterprises are not being held back from agentic AI mainly by model capability or startup speed, but by the difficulty of running agents securely and reliably inside legacy infrastructure. In a conversation with Craig Smith, she says pilots are widespread but scaled deployments remain rare because agents need context, governance, compliance controls and modernized IT foundations before they can touch core systems. Her near-term prediction is narrower than much of the hype: by about 2031, agentic AI may handle roughly half of traditional line-one and line-two IT administration tasks, with humans still supervising the loop.

Craig Smith · Kris LovejoyEye on AIMay 15, 202616 min read

PFF’s Two-Engineer Agent Team Shipped 10x More Output

PFF CTO Mike Spitz argues that AI agents change the basic operating constraint of an engineering organization: the question is no longer how to make engineers faster, but how to make agents faster. In a three-month case study, he says two agent-heavy engineers shipped far more frequently than a ten-person team on the same codebase, with PFF measuring a 10x output gain per engineer and higher customer satisfaction. The result, in his account, was not the end of engineers but the removal of Scrum-era coordination rituals and a sharper split between agent-executed work and human judgment.

Mike SpitzAI EngineerMay 15, 202611 min read

Supabase Says Skills and MCP Close the Agent Context Gap

Pedro Rodrigues of Supabase argues that agents fail on production systems less because they cannot reason than because they lack product-specific judgment. In a test using the same Postgres task, Supabase found that Claude with MCP alone created a view that could bypass row-level security, while MCP plus a Supabase skill added the required `security_invoker = true` flag. Rodrigues’s case is that MCP gives agents tools, but skills supply the rules, workflows, and current documentation paths needed to use those tools safely.

Pedro RodriguesAI EngineerMay 15, 20269 min read

Intercom Doubled Engineering Throughput by Standardizing on Claude Code

Brian Scanlan, a senior principal engineer at Intercom, argues that the company doubled engineering throughput by treating AI coding as an internal platform strategy rather than an individual productivity tool. In his account, Intercom standardized on Claude Code, encoded recurring engineering work into agent-usable skills, connected agents to internal systems under existing controls, and made AI adoption an explicit expectation across R&D. The reported result was a doubling of pull-request throughput, including 17.6% of merged PRs approved by Claude, alongside new bottlenecks in review and CI.

Brian ScanlanAI EngineerMay 15, 202613 min read

Abridge Bets Clinical Conversations Can Become Healthcare’s Intelligence Layer

Abridge executives Janie Lee and Chaitanya “Chai” Asawa argue that the patient-clinician conversation is becoming healthcare’s core intelligence layer, not merely an input for automated notes. In a discussion with Redpoint’s Jacob Effron, they describe Abridge’s move from ambient documentation into clinical decision support, prior authorization and other workflows that depend on EHR data, payer rules, medical literature and local guidelines. Their case is that healthcare AI will be judged less by chatbot fluency than by whether it can deliver accurate, low-latency, privacy-preserving support inside clinical workflows without adding to clinicians’ alert burden.

Shawn Wang · Janie Lee · Jacob Effron · Chaitanya AsawaLatent SpaceMay 14, 202620 min read

Cerebras IPO Puts a Public Price on Fast AI Inference

TBPN’s John Coogan and Jordi Hays use Cerebras’s first day as a public company to frame a narrower AI hardware argument: the market is beginning to price low-latency inference as a product in its own right. Cerebras founder Andrew Feldman argues that fast inference will eventually consume demand for slow AI responses, while SemiAnalysis’s Doug O’Laughlin cautions that the company’s wafer-scale SRAM architecture may be limited by memory scaling and model size. The result is a public-market test of whether owning a valuable slice of the AI compute stack is enough.

Jordi Hays · John Coogan · Tyler Cosgrove · Ben Hylak · Andrew Feldman · Amy Reinhard · Doug O'Laughlin · Steve Vassallo · Eric VishriaTBPNMay 14, 202633 min read

Codex Is Moving From Code Generation to Delegated Knowledge Work

Codex is moving from a coding assistant toward an agent for delegated knowledge work, according to Thibault Sottiaux, OpenAI’s head of Codex. In an OpenAI Forum conversation with Chris Nicholson of OpenAI Global Affairs, Sottiaux argues that as models have become more reliable and better connected to workplace context, Codex is being used to research, organize information, create files and presentations, coordinate across tools, and run background tasks. That shift, he says, makes delegation, trust and access controls central as agents act across files, communications tools and company systems.

Chris Nicholson · Thibault SottiauxOpenAIMay 14, 202614 min read

Choosing The Right Eval Matters More Than Tuning The Judge

Laurie Voss of Arize argues that agentic applications need the same engineering discipline as other production software: instrumentation, inspectable traces, targeted evals, and controlled experiments, not a handful of prompts that “look right.” In a hands-on workshop using a financial analysis agent, Voss shows how teams should read traces before writing evals, classify failures by root cause, and combine deterministic checks, LLM judges, custom rubrics, and human-labeled meta-evaluation. His central warning is that the choice of eval can dominate the result: the same agent scored 0 out of 13 on a correctness eval and 13 out of 13 on a faithfulness eval because the first judge was asking the wrong question.

Laurie VossAI EngineerMay 14, 202624 min read

Agent Observability Is Moving From Dashboards to Eval-Driven Optimization

Amy Boyd and Nitya Narasimhan of Microsoft argue that agent observability has to track the widening gap between what an AI agent is meant to do and what it actually does as models, prompts, tools and user behavior change. Their walkthrough of Microsoft Foundry frames observability as a loop of OpenTelemetry tracing, trace-linked evaluations, monitoring, optimization and red teaming. The central demonstration is an observe skill that can generate an evaluation dataset, run batch tests, optimize prompts, compare versions and roll back to the best-performing agent version from a sparse starting point.

Amy Boyd · Nitya NarasimhanAI EngineerMay 14, 202618 min read

Interwhen Verifies AI Agent Actions Before They Become Irreversible

Microsoft Research’s Amit Sharma presents Interwhen as a framework for moving AI agents from post-hoc checking to verified execution while they are still acting. The open-source library uses LLMs to turn natural-language instructions, policies, and partial responses into smaller verifiable properties, then applies symbolic or model-based verifiers to tool calls and intermediate behavior. Sharma argues that this lets agents continue normally when checks pass but interrupts them when a verifier detects a violation, addressing risks that final-output review may catch too late.

Amit Sharma · Yash LaraMicrosoft ResearchMay 14, 20266 min read

GitHub Agentic Workflows Turn Actions Into AI-Run Development Processes

Microsoft Research’s Peli Halleux and Yash Lara present GitHub Agentic Workflows as a move from AI-assisted coding to repository-level process automation. Their argument is that agents should be embedded inside GitHub Actions to research, plan, assign, and open pull requests under human review, rather than operate as unconstrained swarms. The system’s promised scale depends on orchestration, sandboxing, limited permissions, and Microsoft-hosted models on Azure.

Yash Lara · Peli HalleuxMicrosoft ResearchMay 14, 20265 min read

MagenticLite Brings Full Agent Workflows to Small Language Models

Microsoft Research is presenting MagenticLite as a full-stack agentic system designed to make small language models usable for multi-step work across a browser and local files. Weili Shi, Harkirat Behl and Hussein Mozannar argue that the capability comes from specializing the stack rather than relying on frontier-scale models: MagenticBrain handles planning, coding and delegation, while Fara 1.5 controls the browser. The release also emphasizes user oversight, with the agent pausing for credentials, approvals or other points where the user needs to take control.

Hussein Mozannar · Harkirat Behl · Weili ShiMicrosoft ResearchMay 14, 20267 min read

An Event-Sourced Agent Harness Separates State Replay From Side Effects

Jonas Templestein of Iterate argues that an agent harness can be reduced to an append-only event stream plus processors: synchronous reducers to derive state, and post-append hooks to perform side effects. His design puts model chunks, tool calls, errors, schedules, subscriptions and even processor deployment into the log, so a restarted agent can replay state without replaying old LLM calls. The larger claim is that agents and third-party services can compose by reading and appending to the same durable stream, with bounded waits and circuit breakers replacing tighter, blocking plugin interfaces.

Jonas Templestein · Misha KaletskyAI EngineerMay 14, 202616 min read

GPT-Realtime-2 Turns Voice Agents Into Tool-Using Reasoning Systems

OpenAI’s Build Hour on GPT-Realtime-2 presented the new realtime voice release as a shift from conversational voice interfaces toward tool-using, stateful agents. Teri Yu and Erika Kettleson argued that GPT-realtime-2’s larger context window, stronger instruction following, parallel tool calling and controllable speech behavior let developers build voice systems that can operate apps, reason across workflows and know when not to speak. Sierra’s Ken Murphy and Soham Ray added that production voice agents still depend on the surrounding system: guardrails, tuned turn-taking, tracing, redaction, evaluations and customer-specific workflows.

Ken Murphy · Teri Yu · Sarah Urbonas · Soham Ray · Erika KettlesonOpenAIMay 13, 202614 min read

Agents Can Now Fine-Tune Open Models Through Prompted Workflows

Merve Noyan argues that open models have moved from downloadable artifacts into an operational stack for selection, serving, inspection, training and deployment. In her Hugging Face presentation, she makes the case that access to model weights now matters because developers can quantize, fine-tune and run models locally or at the edge, while Hub benchmarks, inference providers, traces, MCP and Skills let agents act directly on those workflows. Her strongest example is a coding agent that can size hardware, choose infrastructure and launch a fine-tuning job from a prompt.

Merve NoyanAI EngineerMay 13, 202612 min read

ElevenCreative Adds Templates for Reusable AI Creative Workflows

ElevenLabs is introducing Templates in ElevenCreative, a feature that turns its node-based Flows into reusable creative workflows with defined inputs and outputs. The company presents the tool as a way to run repeatable production tasks — such as product shots, mockups, style transfers, character sheets, and thumbnail translation — without rebuilding the workflow each time. Users can run templates from a gallery or publish their own, choosing which variables others can edit, what asset is returned, and whether access is private, workspace-only, or public.

ElevenLabsMay 13, 20265 min read

Continuous Agents Need Stateful Compute, Not Traditional CI/CD

Madison Faulkner and Hugo Santos of Namespace argue that traditional CI/CD is organized around human-paced pull requests, and starts to fail when autonomous agents generate continuous, overlapping streams of code. Their proposed replacement keeps validation inside a stateful agent loop, uses caching and orchestration to avoid cold starts, and moves completed work into a pre-merge layer where humans review intent and outcome rather than every diff. The underlying CI functions remain, but the pull request stops being the system’s basic unit of work.

Madison Faulkner · Hugo SantosAI EngineerMay 13, 202611 min read

Agent Workflows Route Conversations Through Specialized Subagents

ElevenLabs is introducing Workflows, a visual editor for its Agents Platform that lets builders design routed conversation flows instead of placing all business logic inside one agent prompt. The company argues that specialized subagents, each with their own instructions, tools, knowledge bases and model choices, give teams more control over cost, latency and accuracy. The product is positioned as a way to combine AI interpretation with predefined actions, verification steps and human handoffs on the same design surface.

ElevenLabsMay 13, 20265 min read

Compute Allocation Is Anthropic’s Core Constraint as Claude Revenue Surges

Anthropic CFO Krishna Rao argues that the company’s rise is best understood through compute: a scarce capital asset that must be bought years ahead and constantly reallocated across model training, customer demand, internal automation and future products. In an interview with Patrick O’Shaughnessy, Rao says ordinary forecasting and software-margin frameworks break down when model capability, adoption and revenue compound together, leaving Anthropic to manage growth through scenarios rather than point estimates.

Patrick O'Shaughnessy · Krishna RaoInvest Like The BestMay 13, 202621 min read

Korean AI Dividend Proposal Triggers Semiconductor Stock Selloff

A South Korean policy chief’s proposal to return part of AI-related gains to citizens jolted the country’s chip market, with Samsung and SK Hynix closing down around 5% after Kim Yong-beom argued that profits from the AI infrastructure era should be shared more broadly. Bloomberg reported that the presidential office later described Kim’s post as personal opinion, while the same program pointed to related pressure points in the AI boom: CME’s plan with Silicon Data for compute futures and Nvidia CEO Jensen Huang’s absence from Trump’s China delegation as approval for Blackwell sales looked unlikely.

Ed Ludlow · Caroline Hyde · Maggie Eastland · Jamie Dimon · Katherine Doherty · Ryan Vlastelica · Bennett Siegel · Christian Klein · Michael Shepard · Kim Forrest · Larry Fink · Peter Elstrom · Keith Naughton · Madlin MekelburgBloomberg TechnologyMay 12, 202614 min read

Persistent Sandboxes Make Agents Remember, Plan, and Reuse Their Work

Nico Albanese, a Vercel engineer working on the AI SDK, argues that agents become more reliable when they are given a persistent sandboxed computer, not just a runtime and tools. In his workshop, he builds that pattern with AI SDK 6, Vercel’s named sandboxes, a bash tool, and a file-backed memory system, showing how an agent can plan in files, preserve context across sessions, and create reusable scripts without a separate memory layer.

Nico AlbaneseAI EngineerMay 12, 202620 min read

SAP Says ERP Context Will Make AI Agents Reliable for Business

SAP chief executive Christian Klein used Bloomberg Technology to frame the company’s new autonomous enterprise platform as a bet that AI agents need business context more than proprietary models. He argued that SAP’s advantage is its access to ERP data and process knowledge, which can make agents reliable enough to coordinate work across finance, commerce, inventory, procurement and supply chains. Pressed on competition from partners such as AWS, Klein said SAP’s role is to provide the enterprise context layer while working with hyperscalers and data platforms to harmonize data beyond SAP systems.

Ed Ludlow · Caroline Hyde · Christian KleinBloomberg TechnologyMay 12, 20265 min read

Fixed Evaluation Suites Go Stale as Agents Optimize Toward Intent

Vincent Koc of Comet ML argues that AI evaluation is being outpaced by the systems it is meant to measure. In a talk on adaptive evaluation for agents, Koc says static benchmarks and handcrafted test sets are poorly suited to applications that change with prompts, tools, production traces, user behavior and even their own harnesses. His proposed direction is to define the intended end state, use traces and telemetry to surface drift and edge cases, and treat evals as a continuously revised system rather than a one-time benchmark.

Vincent KocAI EngineerMay 12, 202611 min read

Cerebras’s Higher IPO Range Tests AI Infrastructure Demand

Alex Wilhelm and Jason Calacanis treat Cerebras’s raised IPO range as a test of how much public investors will pay for future AI inference demand and the quality of contracts with customers such as OpenAI. Ori Goshen makes a parallel case that enterprise AI’s hard problem is no longer choosing one model, but routing work across models, tools and inference strategies for cost, latency and accuracy. Across OpenAI’s deployment spinout, AI21’s orchestration pitch, Magrathea Metals’ brine-based magnesium plan and OpenClaw’s fading momentum, the article frames deployment as a question of incentives, constraints and where the bottleneck actually sits.

Jason Calacanis · Alex Wilhelm · Ori Goshen · Alex GrantThis Week in StartupsMay 12, 202620 min read

Coding Agents Work Best When Products Expose Simple Tools

Matthias Luebken argues that coding agents such as OpenClaw are less mysterious than they appear: they are LLMs calling tools in a loop, made more useful by a runtime, shell, sessions and product hooks. In his Tavon talk, he uses Pi, a minimal coding-agent SDK, to show how that loop can be embedded inside business software, including a sales workflow where RFP emails are routed to customer-specific agent sessions and returned to users as draft replies. His architectural point is that teams should not force agents through opaque systems, but expose data, commands and controls in forms coding agents can use cleanly.

Matthias LuebkenAI EngineerMay 11, 202614 min read

Slack-Native AI Coworkers Turn Memory and Permissions Into Product Risks

Fryderyk Wiatrowski argues that building Viktor as an AI coworker inside Slack is not a matter of scaling a personal assistant to more users. A company-level agent gains value from shared context, shared integrations, and the ability to act where work is discussed, but those same features create harder problems around memory isolation, permissions, fragmented Slack conversations, proactivity, and tone. His case is that an “AI employee” has to be designed less like a chatbot and more like a new hire entering the company’s communication layer.

Fryderyk WiatrowskiAI EngineerMay 11, 202612 min read

Long Lake’s $6.3 Billion Amex GBT Deal Tests AI-Led Buyouts

Long Lake Management co-founder and CEO Alexander Taubman argues that AI can change the economics of services businesses when the buyer owns the workflow, not just the software layer. In a conversation with Elad Gil about Long Lake’s announced $6.3bn take-private of American Express Global Business Travel, Taubman presents the firm’s model as acquiring trusted services companies, embedding its Nexus AI platform into day-to-day operations, and using productivity gains to drive growth, customer service and employee retention rather than short-term cost cuts.

Elad Gil · Sarah Guo · Alex TaubmanNo PriorsMay 11, 202613 min read

Durable Agents Need Context Logs and Execution Snapshots

Eric Allam of Trigger.dev argues that durable agents need more than the replay-based workflow model used for durable transactions. In his talk, he separates agent durability into two problems: the LLM context, which fits naturally as an append-only log, and the execution environment — files, memory, subprocesses and local state — which he says should be preserved through OS-level snapshot and restore. Allam uses Trigger.dev’s Firecracker work to make the case that long-running agents are becoming session-like workloads, not just replayable transactions.

Eric AllamAI EngineerMay 10, 202611 min read

Head-Tail Truncation and Memory Stabilized Arize’s Trace-Analyzing Agent

Sally-Ann DeLucia argues that agent performance depends on context management as an operating discipline, not on larger prompts or simple compression. Drawing on Arize’s work building Alyx, an agent that analyzes trace data from AI systems including its own, she says naive truncation broke follow-up reasoning and LLM summarization gave the model too much control over what mattered. Arize’s more durable pattern was to preserve the head and tail of context, store the middle for retrieval, test long sessions explicitly, and move heavy workloads into sub-agents.

Sally-Ann DeLuciaAI EngineerMay 10, 202610 min read

Production AI Features Need Feedback Loops, Not One-Shot Prompts

Mehedi Hassan, a product engineer at Granola, argues that the hard part of shipping AI features is not getting a model to work once in a demo, but making its behavior reliable and inspectable in production. Using Granola’s meeting-notes app as the case, he says web search, chat, and prompt personalization quickly expose costs, context limits, provider instability, and role-specific user expectations that a single prompt cannot absorb. Granola’s response, in his account, was to build feedback loops: internal tracing, broadly usable debugging tools, and faster ways to test product variants before shipping.

Mehedi HassanAI EngineerMay 10, 20267 min read

Voice AI Still Confuses Natural Speech With Real Conversation

Neil Zeghidour, CEO of Gradium AI and one of the researchers behind the full-duplex voice model Moshi, argues that voice AI’s long-promised “Her” moment is still being confused with better synthetic speech. His case is that cascaded voice agents are useful but structurally too slow and lossy to feel conversational, while speech-to-speech models improve flow but remain limited unless they can listen and speak simultaneously, use tools reliably, understand paralinguistic cues, and run cheaply enough to scale.

Neil ZeghidourAI EngineerMay 9, 202612 min read

ElevenLabs Voice Engine Wraps Existing Chat Agents Without Rebuilding Them

Luke Harries of ElevenLabs argues that the next step for chat agents is not a new orchestration stack but a voice layer around the agents companies have already built. His case for ElevenLabs’ Voice Engine is that teams can keep their existing LLM logic, RAG, tools and business rules, while offloading speech-to-text, text-to-speech, turn-taking and interruption handling to a wrapper. The product is positioned for companies that want voice interfaces across web, phone and meeting channels without rebuilding their chat agents inside a fully managed platform.

Luke HarriesAI EngineerMay 9, 20266 min read

Personal AI Lets One Builder Do the Work of Teams

Y Combinator CEO Garry Tan argues that personal AI is reaching a stage comparable to the early personal computer: powerful enough to let one person build software that once required a team, but still brittle enough to demand technical ownership. Drawing on his work with Claude Code, OpenClaw and his GStack workflow, Tan makes the case for heavy token use, Markdown-encoded “skills” and multiple coding agents under one accountable human operator. The larger question, he says, is whether users will control their own AI tools, data and prompts, or work inside opaque systems controlled by others.

Garry Tan · Harj Taggar · Diana Hu · Jared FriedmanY CombinatorMay 8, 202615 min read

Agentic Search Needs Specialized Tools and General-Purpose Escape Hatches

Elastic’s Leonie Monigatti argues that context engineering for LLM agents is largely a search-interface problem: the critical question is how an agent decides what to retrieve from files, databases, memory, the web, and other sources before the model answers. In her workshop, she shows why semantic search, database query tools, shell access, and agent skills each solve different parts of that problem and fail in different ways. Her recommendation is to build retrieval stacks that combine easy specialized tools for common tasks with more general tools for ambiguous or complex ones, then use observed failures to refine the stack.

Leonie MonigattiAI EngineerMay 8, 202617 min read

Agentic AI Is Making Enterprise Software a Control Layer

ServiceNow president, COO and chief product officer Amit Zavery argues that agentic AI will change enterprise software, but not by letting unconstrained agents replace the platforms that run corporate workflows. In a ServiceNow-sponsored interview, Zavery says the hard problem is turning probabilistic AI into reliable action across regulated, multi-system businesses, with the context, permissions, auditability and governance that enterprises require. His case is that companies such as ServiceNow retain leverage if they make AI production-ready, while software vendors that fail to adapt remain exposed.

Alex Kantrowitz · Amit ZaveryAlex KantrowitzMay 8, 202611 min read

Production Analytics Finds Agent Failures That Standard Evals Miss

Scott Clark, co-founder and chief executive of Distributional, argues that teams running LLM agents need to look beyond pre-production evals and dashboards of known metrics. His case is that the most consequential failures often emerge only in production, where agents interact with users, tools and changing models in ways teams did not know to test. Clark proposes an observability stack in which telemetry records what happened, monitoring tracks known signals, and analytics clusters trace behavior to surface unknown failure modes that can become new evals, guardrails, prompts or system fixes.

Sam Charrington · Scott ClarkThe TWIML AI PodcastMay 7, 202620 min read

AI Coding Makes Software-Engineering Fundamentals More Important

Matt Pocock, a TypeScript teacher now focused on AI engineering, argues that AI coding has made software-engineering fundamentals more important rather than less. In a conversation with Shawn Wang, Pocock says code generation works best when humans define the architecture, module boundaries and domain language that give agents a coherent system to change. The lesson he draws from Claude Code and other fast-moving tools is that tool-specific knowledge ages quickly, while engineering judgment remains the durable layer.

Shawn Wang · Matt PocockLatent SpaceMay 7, 202612 min read

Production Agents Need Evals and Managed Variables After Deployment

Samuel Colvin of Pydantic argues that production agents need more than observability after deployment: they need evals, traces, and typed configuration that can change prompts, models, and other parameters without a redeploy. Using Pydantic AI, Logfire, managed variables, and GEPA, he shows a workflow for moving from manual prompt tuning toward continuous optimization. His case is practical rather than automatic: GEPA can improve a narrow benchmark, but only if the team has representative data, sound evaluation criteria, and a clear definition of what better means.

Samuel ColvinAI EngineerMay 7, 202622 min read

Perplexity Frames AI Agents as Metered Digital Labor

Perplexity chief business officer Dmitry Shevelenko argues that AI agents should be judged less as software features than as metered digital labor: tools users will pay for when they perform economically useful work. In a Big Technology Podcast interview, he makes the case that Perplexity’s computer-use agents, workflow packaging, broad permissions and multi-model orchestration are all part of that shift. The unresolved question is whether users and companies will accept the access, trust and usage-based pricing required to make those agents a real business rather than another AI novelty cycle.

Alex Kantrowitz · Dmitry ShevelenkoAlex KantrowitzMay 7, 202619 min read

OpenAI Splits Audio API Into Translation, Transcription, and Voice-Agent Models

OpenAI is presenting three new API audio models as infrastructure for voice applications that can translate, transcribe, reason and act in real time. Romain Huet’s demonstration centered on GPT-Realtime-Translate, which keeps pace with multilingual speech, and GPT-Realtime-2, a voice-agent model that can follow turn-taking instructions, use business context and call tools while explaining its work. GPT-Realtime-Whisper completes the set as a streaming speech-to-text model for live transcription.

Romain Huet · Jason Wei · Dominic GrilloOpenAIMay 7, 20266 min read

Coding Agents Need Library Source Code, Not Longer Prompts

Michael Arnaldi, of Effectful, argues that coding agents use Effect better when the project gives them the Effect source code, not just better prompts or documentation. In a workshop starting from an empty repository, he demonstrates cloning the Effect repo into the project, having the agent extract local pattern files, and then using strict TypeScript diagnostics, tests, lint rules and persistent instructions to steer the agent toward a working Effect HTTP API.

Michael ArnaldiAI EngineerMay 7, 202621 min read

Production Agents Need Semantic Observability Beyond Offline Evals

Raindrop’s workshop argues that production agents need a different observability model from conventional software monitoring or offline evals. Zubin Kumar, Danny Gollapalli and Ben Hylak make the case that teams should track both explicit telemetry such as tool errors, latency and cost, and implicit signals such as user frustration, refusals, task failure, capability gaps and unusual workarounds. Their framework treats real production behavior as the primary surface for finding regressions, running experiments and catching failures that do not appear as clean exceptions.

Danny Gollapalli · Ben Hylak · Zubin KotichaAI EngineerMay 7, 202617 min read

Voice Will Be the Primary Interface for AI Agents and Robots

At Sequoia’s AI Ascent 2026, ElevenLabs co-founder and CEO Mati Staniszewski argues that audio was an overlooked frontier in 2022 because the AI field was focused on text and images, leaving room for a smaller company to build quickly and monetize early. His broader case is that as AI intelligence becomes more capable, voice becomes the interface problem: the way people will use agents, robots, services, education and healthcare. Staniszewski says the next hard problems are emotional intelligence, timing, authentication and workflow, not merely making synthetic speech sound human.

Mati Staniszewski · Sonya Huang · Andrew ReedSequoia CapitalMay 7, 202612 min read

Luma Is Rebuilding Video AI Around a Unified Multimodal Transformer

In a Stanford CS153 guest lecture, Luma AI co-founder and chief executive Amit Jain argues that generative video is only a staging point toward “unified intelligence”: models that understand and generate across text, images, video, audio, code and tools in a single work loop. Jain traces Luma’s path from Apple-era LiDAR and 3D capture to internet-scale video, saying the company followed the data but now sees prettier clips as insufficient. The destination, he says, is a multimodal AI factory for professional creative and physical work, where human skills, tool use, feedback and unified transformer architectures produce full campaigns, schematics, productions and eventually robotics workflows.

Anjney Midha · Amit JainStanford OnlineMay 7, 202619 min read

Descript Bets Creator AI on Reliable Editing, Not Content Slop

Laura Burkhauser, Descript’s chief executive, distinguishes generative AI tools for creators from the “slop” she defines as mass-produced content arbitrage. Her case is that Descript’s future depends less on adding AI everywhere than on making editing automation reliable, reversible and useful for recorded human media. That means choosing third-party models by fit and taste, building in-house systems where Descript has workflow data, and treating creator backlash as a product constraint rather than a branding problem.

Nathan Labenz · Laura BurkhauserThe Cognitive RevolutionMay 7, 202619 min read

Agent Failure Should Drive Enterprise AI Knowledge Base Curation

Raj Navakoti argues that enterprise AI agents fail less because of model limits or retrieval plumbing than because companies have not made institutional knowledge legible. In his Demand-Driven Context workshop, he proposes building agent-ready knowledge bases from the bottom up: give agents real tickets or incidents, observe where they fail, and turn those failures into structured, validated context blocks. The method, shown through smaller-scope examples and prototypes including work from IKEA Digital, is presented as an incremental curation loop rather than a proven enterprise-scale system.

Raj NavakotiAI EngineerMay 7, 202617 min read

Agent Skills Turn Repeated Instructions Into Portable Workflows

WorkOS engineers Nick Nisi and Zack Proser make the case that AI “skills” are a practical way to turn repeated agent instructions into portable, reusable workflows. They argue that small markdown-and-script packages can encode team context, constraints, evidence-gathering commands and output formats so agents stop producing generic answers and start following a team’s way of working. Their warning is that skills only help when they are focused, routed correctly, tested against a no-skill baseline and managed like shared software rather than treated as another giant context file.

Nick Nisi · Zack ProserAI EngineerMay 7, 202616 min read

MCP Apps Turn Chat Hosts Into Application Distribution Channels

Liad Yosef and Ido Salomon argue that MCP Apps turn chat products such as ChatGPT, Claude, VS Code, Cursor and Copilot into application distribution surfaces, not just places for text responses. Their case is that tools can return branded, interactive UI resources over MCP, while user actions flow back through the host so the model retains context and control. For builders, they frame this as a shift from monolithic web destinations to portable app components that can run across compliant agent hosts.

Ido Salomon · Liad YosefAI EngineerMay 7, 202612 min read

Small-Model Inference Needs Infrastructure Beyond Model Servers

Filip Makraduli of Superlinked argues that the hard part of small-model inference is no longer simply serving a model, but operating many embeddings, rerankers, extractors and multimodal models efficiently in production. In his account, conventional one-model-per-container deployments waste GPU capacity and leave teams to rebuild routing, autoscaling, monitoring, hot-swapping and eviction themselves. Superlinked’s SIE is presented as an open-source attempt to provide that missing infrastructure layer for AI search and document-processing workloads.

Filip MakraduliAI EngineerMay 7, 20269 min read

Enterprise AI Agents Need Harnesses, Traces, and Controlled Runtimes

LangChain co-founder and CEO Harrison Chase argues that enterprise AI agents are becoming an architectural problem rather than a question of adding autonomy wherever possible. In an NVIDIA AI Podcast interview, he says systems such as Claude Code, Manus and Deep Research share a common “deep agent” pattern: an LLM in a tool-calling loop, supported by a reusable harness, workspace, subagents and planning. For enterprises, Chase says trust depends on choosing the right level of autonomy and surrounding agents with observability, evaluation, secure runtimes and continued iteration.

Harrison Chase · Noah KravitzNVIDIAMay 7, 202612 min read

Multi-Agent Software Systems Need Contracts and Handoffs to Run for Days

Factory’s Luke Alvoeiro argues that long-running software agents will not be built by stretching chat sessions, but by organizing agents into roles with explicit contracts, handoffs and validation. In a talk on Factory’s Missions system, he presents a three-part architecture — orchestrator, workers and validators — designed to run software work for hours or days while humans supervise scope and acceptance rather than every step. The case rests on Factory’s production experience, including missions Alvoeiro says have run as long as 16 days, and on a claim that serial execution, adversarial verification and model selection by role matter more than default parallelism.

Luke AlvoeiroAI EngineerMay 7, 202610 min read

Gemma 4 Moves On-Device AI From Chatbots to Local Agents

Chintan Parikh of Google DeepMind argues that on-device AI is moving from local chatbots toward local agents, as smaller Gemma 4 edge models become capable of tool calling, structured output and reasoning on phones, laptops and embedded hardware. With Weiyi Wang joining the Q&A, Parikh presents LiteRT as the deployment layer for that shift across Android, iOS, desktop, web and IoT. His case is pragmatic rather than absolute: edge inference can improve latency, privacy, offline use and cost, but teams still have to manage memory, quantization, accelerator support and when to call the cloud.

Weiyi Wang · Chintan ParikhAI EngineerMay 7, 202611 min read