The Model Alone Is No Longer the AI Product

Igor Costa

John Allsopp

George Cameron Sarah Sachs

Vamsi Ramakrishnan

Shawn Wang

Geoffrey HuntleyAI EngineerWednesday, June 3, 202620 min read

At AI Engineer Melbourne 2026’s Day 1 keynote program, speakers including Shawn Wang, George Cameron, Sarah Sachs, Igor Costa, Vamsi Ramakrishnan and Geoffrey Huntley argued that AI engineering has moved beyond picking the strongest model. Their shared case was that useful AI products now depend on the systems around models: harnesses, routing, evals, memory, state, latency budgets, deterministic tools and cost controls. The model still matters, but the keynote program framed product advantage as an architecture and economics problem, not a leaderboard problem.

The product boundary has moved outside the model

Shawn Wang framed 2026 around a shift that recurred across the keynote program: AI engineering is no longer centered on the model alone. The model still matters, but the work that makes it useful is increasingly in the harness, product surface, data, orchestration, deployment, memory, and inference economics around it.

Wang put Greg Brockman’s old and new formulations side by side. In 2023, the slide read, “The model is the product.” In 2026: “the model alone is no longer the product.” His examples pointed to where attention is moving. AI21 appeared in the context of layoffs, while DeepSeek was shown hiring for a “Harness team” to build “Code Harness from the ground up.” OpenAI appeared through a slide about launching a deployment company to help businesses build around intelligence. A Claude Code source-code leak appeared under the heading “Brand/Product,” alongside Wang’s broader point that AI is also about services, data, brand, and product.

That distinction matters because model choice is necessary but insufficient. A model becomes useful when it is embedded in a system that decides when to call it, when not to call it, how to route traffic, how to preserve context or memory, how to expose capability to users, and when the task should be handled by deterministic software instead of another expensive model call.

George Cameron made the same point from the benchmarking side. Artificial Analysis benchmarks agents, models, inference providers, hardware, and multimodal systems across text, image, video, speech, and music. Cameron introduced the Artificial Analysis Intelligence Index as a synthesis of 10 language-model intelligence benchmarks. On that index, he said Claude Opus 4.8 had recently taken the lead from GPT-5.5.

But he did not treat that as the whole decision. Cost, speed, and other trade-offs remain decisive. The highest-index model is not always the right model for a product or workflow. His coding-agent benchmark made the harness point concrete: the relevant product is not “Claude 3.7 Sonnet” in isolation, but “Claude Code + Claude 3.7,” “Cursor Agent + Claude 3.7,” or “Copilot V2 + Claude 3.7.” The slide ranked “Claude Code + Claude 3.7” at 97, “Claude Code + Claude 3.5” at 92, “Cursor Agent + Claude 3.7” and “Copilot V2 + Claude 3.7” at 83, and base “Claude 3.7 Sonnet” at 61.

Coding setup	Artificial Analysis Coding Agent Index score shown
Claude Code + Claude 3.7	97
Claude Code + Claude 3.5	92
Cursor Agent + Claude 3.7	83
Copilot V2 + Claude 3.7	83
Claude 3.7 Sonnet (Base)	61

Cameron used Artificial Analysis’ coding-agent benchmark to show that model and harness both drive performance.

Wang’s second theme was that agents “really work now.” He pointed to software as the clearest real-world benchmark: how much code is being written by AI. A SemiAnalysis visual estimated Claude Code at 4% of public GitHub commits in February 2026, trending toward 25–50% by year-end. Wang said his current estimate was about 10%, with a possible rise to 40–50% by the end of the year.

4–5%

Claude Code share of public GitHub commits in February 2026, as Wang cited from SemiAnalysis

The claim was not limited to software. Wang pointed to Claude’s computer-use capability, Claude Design for prototypes and slides, and examples at the frontier of math and science. But he placed those examples inside a larger paradox: the more agents can do, the more AI engineering expands rather than disappears. His closing argument was that AI engineering may be “the last job” because the work increasingly consists of everything around the model — harnesses, services, data, product, deployment, and workflows.

Model-routing economics now decide product strategy

George Cameron argued that frontier progress remains fast and competitive. Artificial Analysis’ chart of leading model releases over several years, in his reading, contradicted claims that AI progress had slowed. He said those claims were common in late 2025, before Opus 4.6 and longer-horizon agents quieted them down.

The current model landscape, in Artificial Analysis’ view, is heavily concentrated in the United States and China. France, South Korea, and the United Arab Emirates appeared among the countries with competitive models. Australia, Cameron said, was notably absent from frontier language-model intelligence, and he did not expect that to change soon.

He also argued that open-weight models remain viable even though they trail proprietary models. Artificial Analysis compared open-weight and proprietary intelligence over time, beginning around Llama 2 70B in mid-2023. Cameron described open weights as roughly 3–9 months behind the proprietary frontier, but said that lag still leaves them powerful enough for many practical uses. If open weights are around the level of Opus 4.5 or GPT-5.2, his point was, “you could do a lot” with them.

Question	Cameron’s answer
Has frontier progress slowed?	No. Artificial Analysis sees more leading releases and continued movement up and to the right.
Which model led the Intelligence Index?	Claude Opus 4.8 had recently taken the lead from GPT-5.5.
Where are frontier labs concentrated?	Primarily the US and China, with some presence from France, South Korea, and the UAE.
Are open-weight models competitive?	They trail proprietary models by roughly 3–9 months, but remain viable for many tasks.

Artificial Analysis’ view of the language-model landscape in June 2026

To connect benchmark progress to real-world output, Cameron used Artificial Analysis’ GDPVal-AA agentic benchmark, built from an OpenAI dataset. In an inventory-management task, Claude 4 Sonnet from a year earlier produced a useful table. GPT-5.5, by contrast, added synthesis, executive summary, totals, and further analysis. In a music-video mood-board task, smaller open-weight Gemma variants produced results Cameron did not think he could use; GPT-5.5 produced something he said he might be able to do something with. His claim was that benchmark deltas do correlate with practical differences in economically valuable work and creative-adjacent tasks.

The cost picture was more complicated. Cameron described two forces moving in opposite directions. On one side, intelligence is cheaper: smaller models are getting more capable, sparse architectures reduce active parameters, inference software is improving, quantization is moving below BF16 into 4-bit approaches, and newer hardware can serve more users at scale even if the hardware itself costs more.

On the other side, companies are spending more. Frontier models are larger. Reasoning models generate more reasoning tokens. Agents multiply calls by taking many turns. For GDPVal-AA tasks, Cameron said Artificial Analysis commonly sees around 60 turns, with 20–100 turns sensible for many knowledge-work tasks. The agent improves the output by exploring files, checking its own work, and iterating, but that behavior multiplies cost.

20–100 turns

typical range Cameron described for many agentic knowledge-work tasks

His most important pricing claim was that the cost of accessing a given level of intelligence is falling quickly. Artificial Analysis bucketed models by intelligence tier and tracked how the cheapest way to reach each tier changed over time. In 6–18 month periods, Cameron said, the cost of a given level of intelligence often falls by 10–100x. That creates an engineering opportunity: if a task worked with Opus 4.5 six months ago, a newer cheaper model may now perform it at orders-of-magnitude lower cost.

At the frontier, however, cost is rising. Cameron said running Artificial Analysis’ 10-benchmark Intelligence Index against frontier models now costs more than $4,000 for models such as Opus 4.8. The model-selection problem is therefore a Pareto problem, not a leaderboard problem: different points on the curve may differ by more than 100x in cost.

Sarah Sachs took that cost curve into product strategy. She focused on companies that are not Fortune 50 buyers: teams with real AI products, real customers, and little leverage in token markets.

Her first examples were deliberately anonymized. In one, a reasoning model was upgraded at the same per-token price but used roughly three times as many output tokens on certain tasks. In another, a successor model made significant reasoning gains, cost 40% more than its predecessor, and arrived with deprecation pressure on the older model. Her question was practical: if your product is built on the predecessor, do you raise prices by 40%? Her answer was no — or at least, “hopefully not.”

The Fortune 500, she said, has dedicated AI teams and negotiating leverage. Everyone else negotiates alone. Notion, in her framing, represents the “Fortune 5 Million”: the long tail of businesses and users whose access to AI is determined by the price of workflows. Notion handles “tens and tens of trillion tokens a month” across more than 100 million customers, giving Sachs’ team leverage it can translate into product decisions for those customers.

Her critique of the market was blunt: the supplier is also your competitor. Applied AI companies can lock themselves into a frontier lab in exchange for large discounts, committing tens of millions of dollars and losing the ability to move when a better open-weight or frontier model appears. Or they can build value they cannot defend, because the lab can subsidize tokens in its own product while the applied company must maintain margins.

You can't win on the token price. Okay? It's impossible.

Sarah Sachs · Source

Sachs’ answer was to “win on the product, not the token.” That means data flywheels, reinforcement fine-tuning on open weights, product moats, UI, orchestration, architecture, integrations, and product value that justifies the model spend. She described Notion as moving away from the ambition to train the best model and toward building the best product that uses many models.

A Notion example showed managed agents moving work across different systems: Decagon agents, Claude Code writing a fix, Codex reviewing it, and humans reviewing the final task in a Notion board. Sachs said Notion can charge sticker price or a small discount on the models because the value is in the experience around them.

Her governing metric was “cost per capability per second.” Capability alone, she said, is no longer enough. To illustrate the risk of ignoring cost and latency, her slides showed article headlines claiming that Uber had burned through its 2026 AI budget in four months, that one company spent half a billion dollars on Claude in a single month, and that Microsoft reports were exposing AI’s real cost problem. A joke tweet shown immediately after described “Uber employees using $100k of Claude tokens to respond to an email.” Sachs’ point was that customers should not be put in that position.

Not all traffic is equal. Sachs divided tasks into moderate and frontier categories. Changing a database field, triaging an inbox, and summarizing meeting notes should not automatically move to the most expensive next frontier model. Autonomous agent paths, large-scale data analysis, and deep research may justify frontier spend. The product’s job is to choose.

Optionality beats discounts when the best model keeps changing

Sarah Sachs said Notion changes its default model for customers roughly every three to four weeks. That cadence makes lock-in dangerous: if a product is tied to a single provider and the frontier changes from January to February, users may get a worse experience because the company cannot switch.

Notion’s response is its Auto model. The picker shown on screen included Sonnet 4.6, Opus 4.7, Opus 4.8, Gemini 3.1 Pro, GPT-5.2, GPT-5.4, GPT-5.5, Grok 4.3, Grok Build 0.1, and open models such as Kimi K2.6 and DeepSeek V4 Pro hosted by US providers. Auto handles 75% of Notion’s AI traffic, while 25% of users choose explicitly. Sachs said the manual choice is valid because customers differ: some want to spend heavily on email triage, while some do not care as much about research accuracy.

Playbook item	What Sachs said it requires
Build for multi-model	Run infrastructure across major providers so you can walk away.
Evaluate on value, not tokens	Measure cost, capability, and latency across the whole task, not a single API call.
Switch fast, switch often	Move every 2–3 weeks as new models and tool requirements change.
Give labs something back	Provide eval scorecards and feedback explaining why a provider was or was not chosen.
Forgo discounts for optionality	Avoid short-term margin gains that create long-term lock-in.

Notion’s Auto Model playbook as presented by Sachs

The phrase “evaluate on value, not tokens” carried most of the operational weight. Sachs showed Notion evals for web search providers and argued that the relevant unit is the whole task, not one API request. A provider may be cheaper per request but cause more agent calls, worse accuracy, more retries, or higher latency. Product teams, in her view, are the experts on quality for their own use case and should measure accordingly.

She also described competitive feedback as a tool. When Notion picks one provider but the result is close, Sachs said the team can tell the winner to stay sharp and tell the others exactly why they were not chosen. That gives providers useful scorecards and gives the buyer more influence than a simple price negotiation.

Open weights were Sachs’ third option. She said Notion already runs open-weight traffic and reinforcement-learned models in production. Open weights are not, in her view, about immediately occupying the upper-right frontier of capability; they are about being strong enough for moderate workloads, lowering cost, and creating negotiating leverage. Kimi K2.6 was the first moment, she said, when Notion saw an open-weight model compare to GPT-5.2 in quality for some Notion-specific tasks.

An eval table from Notion compared Kimi K2.6, Sonnet 4.6, Opus 4.8, Opus 4.7, GPT-5.2, GPT-5.4, and GPT-5.5 across average score, eval cost, tool calls, errors, and time. Sachs emphasized that the evals were Notion-specific: each product team owns the definition of quality for its own product. She also warned that errors cost money because retries and failed calls still consume tokens.

Her final turn was away from tokens altogether. If a repeated task can be done deterministically with a shell command, CPU workflow, state machine, or internal API call, it should not burn reasoning tokens. “Frontier labs want you to token max,” she said. “Your users want you to outcome max.” Notion’s Workers, built with Vercel, let agents call internal APIs, use computer sandboxes, and host small code actions that the LLM invokes. Sachs said this reduced token costs by up to 80% for some repeated customer tasks.

up to 80%

token-cost reduction Sachs attributed to Notion Workers on some repeated tasks

Architecture mattered as much as model shopping. At Notion, Sachs said, harness engineering and architecture decisions can account for about 3x the change in price as model selection. Sometimes that means native harnesses such as Codex or Claude APIs, sometimes open source, and sometimes a custom harness tuned to caching and product-specific constraints.

Agent memory is the next reliability boundary

Igor Costa argued that coding agents forget because the industry has treated context and memory as if they were the same problem. He began with an old Melbourne electricity advertisement whose promise was that electricity gave households “the power of countless workers without wages, sleep, or rest.” The analogy to AI workers was intentional, but Costa added the caveat: today’s agents are not that smart because they are built stateless.

His description of the current agent loop was simple: request, context window, response, session ends, memory discarded. The next session starts from scratch. Costa said that was why he left GitHub and started building his own tools. Frontier coding agents, in his experience, did not scale well beyond about 20 simultaneous instances on his machine, and after 10 or 15 messages a session could lose track of what it was doing.

The industry response, as Costa described it, has been to add more: tools, retrieval, reflection, Memory.md, and ever-larger context windows. He traced context growth from GitHub Copilot’s 4,000-token era through GPT-4, Claude, Gemini, and million-token windows. But he argued that context-window growth has plateaued in its ability to solve the deeper problem, because the industry treats context and memory as the same thing.

Costa separated memory into primitives: working, episodic, semantic, procedural, reflective, predictive, strategic, sensational, and collective memory. He focused on four. Semantic memory is the domain-rules layer — AGENTS.md, API contracts, conventions, tools, best practices, and architecture patterns — but Costa said seven out of 10 agents do not follow what is written there. Procedural memory is the workflow layer: understand, plan, code, test, review, deploy, monitor. Episodic memory adds time: something learned in January may or may not still be valid in June. Reflective memory is the loop that observes what happened, analyzes why it happened, learns from it, and adapts.

The hard problem is not merely storage. It is agreement, validity, and drift. In teams, different humans and agents hold different opinions. Different opinions create lack of consensus; lack of consensus creates drift and collapse. Costa’s warning sign for drift was the phrase “you’re absolutely right.” If an agent says it, he advised shutting down the session because “you are completely lost.”

AutoHand’s open-source coding CLI implements user preferences and project memory from the beginning, storing memory in simple file-based formats. Costa argued that file reads on an SSD beat more elaborate storage choices for this purpose and joked that adopting a fancy KV or vector database may be “basically adopting SAP.” He claimed the coding CLI had been downloaded around 300,000 times and was used by around 150,000 people daily.

For multiple agents, AutoHand’s experiment is collective memory: successful outcomes, failures, and reflections are written back so future agents start from accumulated experience rather than zero. Near-duplicate memories are merged. Patterns are captured after sessions and reused. His AgentSpawn paper, shown in an arXiv screenshot, proposed adaptive multi-agent collaboration for code generation: agents spawn dynamically based on task complexity, transfer context and expertise at runtime, and selectively share memory to avoid context overflow while maintaining coherence.

Costa’s motivating target is long-horizon coding: tasks longer than 48 hours. He described experiments that detect drift between agents by comparing memory and behavior, then encourage merging, coordination, or failure. Collapse occurs when agents hold incompatible information, pursue contradictory goals, degrade communication, and stall in loops or contradictory actions.

His most ambitious experiment was a system attempting to migrate the Linux kernel to Rust. The slide showed a “Linux kernel to Rust migration” running for more than 300 days, 76% migrated, with 12,434 errors and 99.8% auto-resolved. Costa said the experiment was still running after more than 10 months. It had not succeeded — “I’m sorry, Linus. Not today” — but he said it had moved from about 12% migrated at the start toward its current state.

The “secret sauce,” he said, was a Hierarchical Reasoning Model architecture from a Singapore AI lab, where “memory is the model.” Rather than make a large language model the first-class citizen, Costa said AutoHand uses LLMs as dependencies. The HRM approach, as shown on slides, involves smaller dense reflective models from 20 million to 2 billion parameters, easier training, lower resource intensity, and more steerable outcomes. Costa said training cycles moved from weekly checkpoints to roughly three to five hours, depending on the day.

He did not present memory as solved. Two unsolved problems were explicit. First is memory correctness: everyone is building memory, but almost nobody is building memory verification. An agent can remember outdated, hallucinated, or contradictory information; there is no equivalent yet of type checking, unit testing, or formal verification for memory. Second is memory as a first-class training signal: today, models learn and memory stores, but the architecture does not cleanly couple memory changes, weight changes, curriculum changes, and future exploration.

Voice agents expose the cost of non-deterministic architecture

Vamsi Ramakrishnan brought the same anti-token-maximal instinct into voice infrastructure. His subject was a Rust SDK for Gemini Live, built from lessons deploying full-duplex voice agents at scale. He described customers in countries such as India looking for cost-effective ways to call perhaps 10 million users per week, and high-value customer interactions where even a slight glitch or pause would ruin the experience.

Ramakrishnan’s core point was that native speech-to-speech models need a control loop, not just a model call. The architecture he showed had full-duplex streaming for simultaneous audio and text frames, plus state derivation through a separate low-latency observer. A transcript stream becomes an out-of-band control mechanism: compute state from live text, run logic through state machines, regex, or low-latency LLMs, then steer the model cadence by detecting turn boundaries, pruning system prompts dynamically, and injecting state.

He explicitly connected this to deterministic software. If the customer must repeat a phrase for consent in a debt-collection workflow, a regex or state machine can detect it with ultra-low latency. There is no reason to waste an LLM call on that. State machines and regex, he said, go well together.

The shared architecture is a typed state store: phases, structure, tools, actions, extractors, facts, watchers, reactions, telemetry, and signals all converge on one concurrent, typed, prefix-scoped “spine.” Phases gate, extractors write, watchers fire, telemetry populates, and everything reads from and writes to the same place. The system becomes deterministic where it must be, while the model handles arbitrariness where it is useful.

A Rust code slide modeled flow as a governed DAG. Steps such as verifying an account, initiating payment, charging a card, and waiting for 2FA can be enforced in order. Ramakrishnan said current SDKs from frontier model companies do not include these capabilities because the companies have not yet seen many customers deploy real-time voice at this scale. He expects such capabilities may be absorbed into products later, but for now “it’s really the Wild West.”

He was careful not to dismiss Python. Python ADK still belongs in exploration, notebooks, ecosystem-heavy work, research, evals, and batch jobs where iteration speed wins. Rust belongs on the real-time hot path: production voice at scale, where the wire is unforgiving and the budget is measured in milliseconds.

The latency budget is the architecture here for voice agents.

Vamsi Ramakrishnan · Source

Text agents have UX forgiveness: the model can think while the user reads. Voice does not. Ramakrishnan said his team rebuilt ADK in Rust and reached a scale of a million outbound calls per day and more than 100,000 inbound calls per day for demanding contact centers handling vernacular languages in India. The SDK was published as gemini-rs, with repository, book, and crate links, under Apache 2.0. He said the license would remain Apache 2.0 and asked for contributors.

Cheap execution moves the bottleneck to judgment and organizations

Geoffrey Huntley pushed the same cost-and-harness logic into labor, identity, and organizational design. If Cameron’s and Sachs’ claims hold — that capability can be routed, harnessed, substituted, and made cheaper over time — then the scarce work shifts away from typing code and toward deciding what should exist, how systems should be organized, and which humans can operate in the new loop.

Huntley’s provocation was that software development now costs less than minimum wage, software has been commoditized, and “anyone is now a software developer.” He compared the change to smartphone photography: anyone can take photos, but that does not make everyone a wedding photographer. Similarly, product managers and designers can now create software, but that does not make them software engineers.

Huntley connected the current generation of long-running coding tasks to the “Ralph Wiggum” loop he had popularized a year earlier. The host described tools such as Codex, Cursor grind mode, and other long-running tasks as implementations of that loop. Huntley said the unit economics of business had changed because software is now easy to create, even if creating the right software remains hard.

His broader frame was a shift from a knowledge-scarcity economy to a knowledge-abundance economy. Software developers, accountants, lawyers, and other white-collar professionals have historically charged for scarce expertise. Huntley argued that AI makes many previously scarce skills more broadly available. His instrument analogy was aimed at both employees and employers: people getting the most out of AI, he said, have put in deliberate practice. A musician does not strum a guitar once, fail, declare the guitar bad, and quit. Yet many corporate AI rollouts, in his view, have become little more than curiosity tests: will employees pick up the instrument and invest in themselves?

That curiosity test became a hiring filter. Huntley described a progression from “AI is not good enough” to experimenting, fearing job loss, using Claude productively, and finally programming swarms of AI. He said he no longer hires people on the left side of that line. A senior engineer, in his view, should be able to explain how AI works under the hood.

The new basic interview question, he said, is not “what is a primary key?” but “what is an agent?” He reduced the basic agent to a simple loop: take user input, call the LLM, execute a tool if requested, feed the result back, and continue. A slide showed it as a Python while True loop. A senior software engineer should be able to explain that as a sequence diagram, he argued, and should be able to build a simple coding agent in a few hundred lines.

Experience as a software developer today does not guarantee relevance tomorrow.

Geoffrey Huntley · Source

The organizational implications were sharper. Huntley argued that companies now split into two classes: new “model-first” companies that intend to stay lean, and existing companies that must go through a three- or four-year J-curve transformation. The danger for incumbents is that AI-native startups move at “slope-on-slope” pace while larger organizations are still transforming.

His critique of venture economics followed from the same premise. If a five-person team can build what previously required seed capital to hire a team for, then limited partners will ask general partners why pre-seed and seed capital are needed in the same way. Software remains investable, he said, but must be charged on outcome economics rather than legacy per-seat SaaS.

For existing teams, his operational advice was less glamorous than “use more AI”: remove waste. A client with one Git repo per UI component in a design library was, in his words, “stupid.” Multiple sources of truth across Jira, Google Docs, and other systems are waste. He said engineering-manager candidates should be asked what AI has broken in their systems and processes, whether they still use Agile, where they removed waste, and what outcomes followed.

This follows the same architecture-first argument Sachs and Ramakrishnan made from a different angle. If an organization has fragmented repos, duplicated planning systems, unclear ownership, or process theater, more tokens will not fix the bottleneck. Cheap execution amplifies the quality of the surrounding system; it does not make the system coherent.

Huntley’s final inversion was about ideas. The old saying was that ideas are worth nothing and execution is everything. Huntley argued execution has been commoditized enough that ideas — what to build — become more important again. He described taking screenshots of competitors’ marketing material and feeding them into a coding agent to get the feature. That makes product judgment, not code production alone, the scarce skill.

AI Application Architecture Evals and Benchmarks Inference and Deployment Voice and Audio AI Agents and Autonomy Open Models AI Business Models AI Economics and Labor AI Product Management Coding Assistants