
AI Engineer
Talks, workshops, events, and training for AI engineers.
Agents Often Claim Web Access After Being Blocked or Challenged
Rafael Levi of Bright Data argues that many web-dependent agents fail not because they cannot produce answers, but because they report success after web access has broken. In a demo using Bright Data’s Web MCP, Levi shows the same agent failing against sites such as LinkedIn, Instagram, Amazon and TikTok without live access, then producing usable results when given infrastructure for search, scraping, JavaScript rendering and CAPTCHA handling. His broader case is that reliable agents need a real public-web access layer, not prompts that assume the model saw the page.
Human Attention Is Becoming the Bottleneck in AI Coding Workflows
Zack Proser, an Applied AI engineer at WorkOS, argues that AI coding has shifted the bottleneck from tool speed to human attention. His proposed workflow uses voice dispatch, isolated git worktrees, Slack and Linear-reading agents, remote phone control, and layered verification so developers can keep agent loops moving without staying pinned to a desk or rubber-stamping work they can no longer track.
A 4B Model Beat Qwen3 235B by Learning Tool Discipline
Kobie Crawford of Snorkel argues that some enterprise AI failures are less about model size than about whether models behave correctly inside constrained tool environments. In Snorkel’s FinQA work with UC Berkeley’s rLLM/Agentica, a 235B Qwen model hallucinated a financial answer after failed SQL calls, while a 4B model fine-tuned with reinforcement learning learned to inspect tables, correct errors and calculate from retrieved data. Crawford presents the result as evidence that targeted RL, structured evals and behavior-specific training can outperform simply moving to a larger model for this class of financial analysis task.
A Python Decorator Replaces the GPU Deployment Container Loop
RunPod’s Audrey Hsu argues that GPU inference development should not require a commit, container build, registry push and server provisioning cycle for every model change. In a demo of Flash, RunPod’s Python SDK, she shows how adding a `@flash.endpoint` decorator to an async function can package that function as a GPU-backed cloud endpoint while the rest of the application stays in the developer’s IDE. Her broader case is that teams should experiment on Pods or low worker counts, then move to Serverless when they need autoscaling inference across many GPU workers.
RAG Is Becoming Agentic Retrieval, Not Disappearing
Kuba Rogut, a deployed engineer at Turbopuffer, argues that claims about RAG’s death rely on defining it as a narrow, one-shot vector search pattern. In his account, retrieval-augmented generation is becoming a broader agentic retrieval system: vector search, full-text search, grep, regex, glob and filters used iteratively by models that keep looking until they have the right context. He points to Cursor’s semantic-search gains and contrasts its upfront indexing with Claude Code’s per-session grep approach to frame embeddings as cached compute whose value depends on reuse.
Untied Ulysses Pushes Llama-3-8B Training to 5 Million Tokens
Together AI’s Max Ryabinin argues that training transformers at multi-million-token context lengths is chiefly a memory-scheduling problem, not a matter of applying a single long-context technique. Using a Llama 3-8B run on an 8xH100 node as the example, he shows how fully sharded data parallelism, DeepSpeed Ulysses, activation checkpointing, CPU offloading and chunked sequence training each remove one bottleneck and expose the next. His proposed addition, Untied Ulysses, chunks attention heads and reuses context-parallelism buffers, with the presented results claiming scaling to 5 million tokens with limited throughput loss.
Code Agents Need Context Engineering, Not Larger Prompts
Nupur Sharma of Qodo argues that larger context windows have not solved a core agent failure: models still tend to use the beginning and end of an input while losing important material in the middle. Her case is that agent quality depends less on giving a model more context than on engineering how context is retrieved, ranked, constrained and checked. She describes Qodo’s approach as a mix of iterative retrieval, specialist agents, judge nodes and bounded orchestration that reserves high-reasoning models for discovery while using stricter, lighter steps for validation.
Durable Objects and Dynamic Workers Reopen Eval for AI Agents
Cloudflare engineers Sunil Pai and Matt Carey argue that AI agents need compute primitives beyond stateless functions: Durable Objects for addressable, persistent coordination, and Dynamic Workers for safely running generated code. Pai frames Durable Objects as the execution unit behind Cloudflare’s Agents SDK, giving agents state, resumable streams, scheduling, and multi-client sync without pushing distributed-systems work onto developers. Carey and Pai present Dynamic Workers as the larger shift: a sandboxed “eval++” model where LLM- or user-generated code starts with no ambient authority and receives only explicitly granted capabilities.
Telemetry, Not Code, Audits Nondeterministic AI Agents
Dat Ngo of Arize argues that LLM observability has to account for failures in execution paths, not just broken components, because agents can call tools in different orders, branch, loop, and change behavior across runs. In his account, traces become the audit record for nondeterministic systems, while evaluation must combine model judges, human feedback, golden datasets, deterministic checks, and business metrics at the right scope. Arize’s stated direction is to connect observability, evals, experimentation, and improvement into an increasingly automated loop.
RunPod’s Serverless LLM Endpoint Trades Cold Starts for Lower Idle Cost
Audry Hsu presents RunPod as a cloud AI infrastructure company trying to move GPU provisioning and operations behind a deployable model endpoint. In the walkthrough, she shows a Qwen model deployed from RunPod’s Hub as an OpenAI-compatible vLLM serverless endpoint on H100s in under five minutes, with billing tied to workers while they handle requests. Her case is narrower than eliminating infrastructure tradeoffs: the first request waited 41.6 seconds on cold start, while subsequent execution took about 1.5 seconds, leaving teams to choose between lower idle cost and keeping workers warm for lower latency.
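The trade-off in the walkthrough can be made concrete with a little arithmetic. The sketch below uses only the two timings quoted above (a 41.6-second cold-start wait and roughly 1.5 seconds of execution) and assumes a cold request pays both; the cold-request fractions are hypothetical traffic mixes, not figures from the talk.

```python
COLD_START_WAIT_S = 41.6  # wait on the first request, as quoted in the walkthrough
EXECUTION_S = 1.5         # approximate execution time, as quoted in the walkthrough


def mean_latency(cold_fraction: float) -> float:
    """Expected seconds per request when `cold_fraction` of requests hit a cold worker.

    Assumption: a cold request pays the cold-start wait plus normal execution.
    """
    if not 0.0 <= cold_fraction <= 1.0:
        raise ValueError("cold_fraction must be between 0 and 1")
    return EXECUTION_S + cold_fraction * COLD_START_WAIT_S


for fraction in (0.0, 0.01, 0.10, 1.0):  # hypothetical traffic mixes
    print(f"{fraction:>5.0%} cold -> {mean_latency(fraction):.2f} s mean latency")
```

Even a small share of cold requests moves the mean noticeably, which is why keeping workers warm is framed as a latency purchase rather than a default.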
Agents Can Build and Repair Scrapers Instead of Parsing Every Page
Rafael Levi of Bright Data argues that the hard part of web data collection has moved from scraping a page to maintaining the pipeline after sites change. In his session, he presents Bright Data’s MCP, APIs and browser infrastructure as a way for agents to inspect public websites, generate reusable scrapers, run them at scale and repair them when selectors, pagination or access conditions break. The economic case is that LLMs should spend tokens learning site structure and writing code, not repeatedly parsing every page.
VS Code Can Render MCP Tool Results as Interactive Apps
GitHub’s Marlene Mhangami and Liam Hampton argue that MCP apps turn chat from a text response surface into a place where tool output can be operated directly. In their VS Code demo, an MCP server profiles a Go app, returns data plus a reference to a bundled HTML UI, and VS Code renders the result as a sandboxed interactive flame graph inside Copilot chat. Their case is that the useful boundary is precise: tools provide data, resources provide the interface, and the host contains the app while keeping the user in context.
Cline’s Terminal-Bench Gains Came From Harness Tuning, Not Model Switching
Ara Khan of Cline argues that AI evals are too noisy to treat as truth but too useful to replace with vibes. Using Cline’s Terminal-Bench work as the case study, he says the company’s jump from 43% to 57% came from harness changes (container CPU and memory, longer timeouts, and model-family-specific prompting) rather than a better model. His prescription is to run evals skeptically, inspect failed traces, allocate failures by cause, and improve only the levers that survive contact with product behavior.
Stripe Says Agent Payments Need Deterministic Controls, Not Browser Automation
Stripe’s Steve Kaliski argues that autonomous agents can use probabilistic reasoning to discover products, services and tools, but payments should move through deterministic infrastructure. In his talk, he presents Stripe’s approach to agent commerce: scoped payment credentials, HTTP-based paid tool calls and structured checkout APIs designed to prevent agents from paying the wrong merchant, buying the wrong item, authorizing the wrong amount or exposing the wrong credential.
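One way to picture a deterministic control of this kind is a credential that carries its own limits, checked by plain code rather than by the model. This is a minimal sketch under stated assumptions: `ScopedCredential` and `authorize` are hypothetical names invented for illustration and are not Stripe's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScopedCredential:
    """Hypothetical payment credential limited to one merchant, currency and ceiling."""
    merchant_id: str
    currency: str
    max_amount_cents: int


def authorize(cred: ScopedCredential, merchant_id: str, currency: str, amount_cents: int) -> bool:
    """Deterministic check: every field must match the scope, with no model judgment involved."""
    return (
        merchant_id == cred.merchant_id
        and currency == cred.currency
        and 0 < amount_cents <= cred.max_amount_cents
    )


cred = ScopedCredential(merchant_id="merchant_a", currency="usd", max_amount_cents=5_000)
print(authorize(cred, "merchant_a", "usd", 4_200))  # inside the scope
print(authorize(cred, "merchant_b", "usd", 4_200))  # wrong merchant
print(authorize(cred, "merchant_a", "usd", 9_900))  # wrong amount
```

The agent can still reason probabilistically about what to buy; the check above only decides whether the resulting payment falls inside what it was granted.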
OpenClaw’s 3,000-Commit Day Shows Code Review Becoming the Bottleneck
Vincent Koc uses OpenClaw’s high-velocity refactor to argue that agentic software development is becoming an industrial management problem, not a prompting trick. In his account, a project that briefly touched 82% of its core codebase and produced thousands of commits exposed a new bottleneck: the human ability to supervise parallel agents, trust the test harness, reject bloat, and stop sessions that have lost the plot.
Voice AI Benchmarks Understate Errors in Real Multi-Speaker Audio
Hervé Bredin of pyannoteAI argues that voice AI benchmarks often make speech-to-text look more solved than it is by evaluating cleaner, more single-speaker-like audio. In his talk, he shows Nvidia Parakeet scoring 11.4% word error rate on AMI meeting audio in the Open ASR Leaderboard but 26% in pyannoteAI’s run on the same dataset using the table microphone rather than headset audio. Bredin’s broader case is that conversational AI needs fine-grained speaker diarization and speaker-attributed transcription, because words alone do not capture who spoke, when they overlapped, or how real multi-speaker conversations are structured.
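For readers unfamiliar with the metric, word error rate is the word-level edit distance between a reference transcript and the system's output, divided by the number of reference words. The sketch below is a plain implementation of that standard definition; the example sentences are invented, and it does not reproduce the text normalization used by any particular leaderboard.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Invented example: one deletion, one substitution, one insertion over six words.
print(word_error_rate("we should ship the fix today", "we should the fix to day"))
```

Note that the metric says nothing about who spoke each word, which is the gap speaker-attributed transcription is meant to close.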
Text Diffusion Trades Batch Throughput for Faster, Revisable Generation
Google DeepMind’s Brendon Dillon argues that text diffusion changes language generation by refining blocks of tokens rather than committing to one token at a time. In his account, that gives diffusion models lower latency and the ability to revise earlier text after later reasoning emerges, but it also creates a serving problem: weaker throughput when many requests are batched at scale. Dillon frames the technology as most compelling today for on-device and interaction-heavy products, where fast, revisable generation matters more than large-batch economics.
AI Evaluation Is Falling Behind Agent Deployment in High-Stakes Domains
Vincent Chen of Snorkel AI argues that agent evaluation has not kept pace with the systems now being pushed toward real deployment. Drawing on more than 120 applications to Snorkel’s Open Benchmarks Grants, he lays out a framework for benchmarks that are rigorous enough to measure capability and opinionated enough to direct research. In Chen’s account, the next useful benchmarks will need validated tasks, intentional distributions, unsaturated headroom, and evaluation methods that capture realistic constraints, while also betting on richer environments, longer autonomy, and more complex outputs.
Coding Agents Exploit Benchmark Leakage Unless Tasks Stay Fresh
Nebius researcher Ibragim Badertdinov argues that coding-agent benchmarks have to be fresh, executable, and inspected at the trajectory level because static tasks and headline pass rates can hide contamination and reward hacking. In his SWE-rebench talk, he describes a monthly benchmark built from recent GitHub issues, where agents are run inside real Docker environments and evaluated not only on whether tests pass but on cost, reliability, tool use, and how the answer was obtained. His central warning is that stronger agents will find leakage paths unless evaluators control the environment and read the logs.
AI Engineering Must Preserve Craft as Work Shifts to Verification
At AI Engineer Melbourne, Jeremy Howard, Annie Vella and Mic Neale each argued against treating AI adoption as an automatic productivity upgrade. Howard warned that coding tools can simulate autonomy and flow while eroding mastery; Vella presented research showing engineers feel more productive even as parts of developer experience deteriorate; and Neale made the case for pooling idle edge devices as an alternative to defaulting all inference to centralized, metered infrastructure.
Declarative UI Is Emerging as the Practical Path for Agent Interfaces
Ruben Casas of Postman argues that agent interfaces have not caught up with the frontend code models can now generate. In his talk, he contrasts static component systems with declarative UI, where an LLM produces JSON or YAML for a renderer, and fully generative UI, where the model writes HTML, CSS and JavaScript directly. Casas says declarative UI is probably the right balance today, while MCP apps matter because their sandboxing offers a way to contain runtime-generated interfaces.
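The declarative pattern can be sketched in a few lines: the model emits a JSON description, and a renderer the developer controls maps it onto a fixed set of components. The component names and the JSON shape below are assumptions made for illustration, not a format from the talk.

```python
import json
from html import escape

# Hypothetical allowlist: the only leaf components the renderer will emit.
ALLOWED_COMPONENTS = {"heading": "h2", "text": "p", "button": "button"}


def render(node: dict) -> str:
    """Render a model-produced UI description into HTML, rejecting unknown components."""
    kind = node.get("type")
    if kind == "stack":
        children = "".join(render(child) for child in node.get("children", []))
        return f"<div>{children}</div>"
    if kind not in ALLOWED_COMPONENTS:
        raise ValueError(f"unsupported component: {kind!r}")
    tag = ALLOWED_COMPONENTS[kind]
    # Text is escaped, so model output cannot smuggle markup into the page.
    return f"<{tag}>{escape(str(node.get('text', '')))}</{tag}>"


spec = json.loads(
    '{"type": "stack", "children": ['
    '{"type": "heading", "text": "Refund status"},'
    '{"type": "text", "text": "<b>Approved</b>"}]}'
)
print(render(spec))
```

The model chooses what to show, but only the renderer decides what markup exists, which is the balance the declarative approach is credited with.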
Semantic Search Cut Claude Code’s Wasted File Reads to One in Eight
Kuba Rogut of Turbopuffer benchmarked Claude Code on 50 ContextBench tasks to test whether it found the right code context, not whether it solved the tasks. He argues that adding semantic search to windowed grep made Claude Code’s file reads much more precise, cutting irrelevant reads from about one in three to one in eight, but did not make semantic retrieval a blanket replacement for grep. In Rogut’s results, semantic search helped when related code shared behavior rather than keywords, while grep remained stronger when the relevant term or import path was explicit.
BDD and ADRs Give AI Coding Agents Enforceable Project Memory
Michal Cichra of Safe Intelligence argues that AI-assisted development does not fail for lack of prompts so much as for lack of enforceable memory. In his talk, he makes the case for keeping ADRs, PRDs, BDD scenarios and design-system rules close to the code, so product intent and architectural decisions can be found by humans, retrieved by agents and enforced by Git hooks and CI. His most specific claim is that Cucumber-style executable specifications have become useful again because they connect human-readable product behavior to tests that prove the software still does what the spec says.
The Model Alone Is No Longer the AI Product
At AI Engineer Melbourne 2026’s Day 1 keynote program, speakers including Shawn Wang, George Cameron, Sarah Sachs, Igor Costa, Vamsi Ramakrishnan and Geoffrey Huntley argued that AI engineering has moved beyond picking the strongest model. Their shared case was that useful AI products now depend on the systems around models: harnesses, routing, evals, memory, state, latency budgets, deterministic tools and cost controls. The model still matters, but the keynote program framed product advantage as an architecture and economics problem, not a leaderboard problem.
AI Builders Are Urged to Architect the Future Through Early Adoption
At the Day 1 keynote livestream for AI Engineer Melbourne 2026, the opening speaker acknowledged the public debate over AI’s risks but argued that builders should not stop there. The speaker framed early adoption as a way to enter deeper conversations, form faster connections, and help “architect” the direction AI takes, with the conference itself presented as a participatory setting for that work.
Fine-Tuning Becomes the Next Step for Mature AI Products
Benjamin Cowen, a forward-deployed machine-learning engineer at Modal, argues that fine-tuning is becoming a normal stage in the maturation of AI products rather than a specialist research exercise. His case is that frontier APIs and product teams optimize for different goals: labs need broadly capable models, while companies need models that fit their own economics, latency constraints and business-specific quality metrics. Cowen says the decision point shows up when API costs overwhelm revenue, evals stop improving through prompting, or shared endpoints cannot meet throughput requirements.
High-Quality Agentic Tasks Drove 5x More Fine-Tuning Uplift
Snorkel’s Kobie Crawford argues that task quality, not just model size or compute, can determine whether agentic fine-tuning produces useful gains. In a Terminal-Bench-style experiment holding the base model, compute budget and task count constant, Snorkel reported that fine-tuning on rejected low-quality tasks improved Qwen3-8B by about one percentage point, while accepted high-quality tasks improved it by 6.2 points. Crawford’s case is that well-specified, reliable tasks create learnable failures, while ambiguous prompts, mismatched tests and broken environments mostly add noise.
Lovable Uses Agent Complaints to Find Bugs and Improve Projects
Benjamin Verbeek of Lovable argues that AI coding products can improve continuously by treating user failures and agent frustration as production signals. In a talk on Lovable’s internal systems, he describes two loops: one that turns sessions where nontechnical users get stuck and later recover into tested contextual guidance, and another that lets the agent complain directly when Lovable’s tools, documentation or platform behavior block its work. Verbeek says the approach has surfaced real bugs, reduced repeated “fix” intent messages and created an operational signal for incidents.
State-of-the-Art AI Models Are a Pareto Frontier, Not a Ranking
Bertrand Charpentier, cofounder and chief scientist at Pruna AI, argues that state-of-the-art image generation should not be defined by a single leaderboard rank. Using Design Arena-style evaluation as his example, he says a slow top model can require 20 days of compute, about $5,300 and 556 kWh to evaluate, while a fast compressed model can run the same test in 7 hours for $265. His broader case is that model selection should be based on a Pareto frontier of quality, latency, cost and energy, not a podium that treats efficiency as secondary.
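A Pareto frontier is simply the set of candidates that no other candidate beats on every axis at once. The sketch below uses the evaluation time and cost figures quoted above; the quality scores and the third model are hypothetical placeholders added so the filter has something to discard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    quality: float        # higher is better (hypothetical scores)
    eval_hours: float     # lower is better
    eval_cost_usd: float  # lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True when `a` is at least as good as `b` everywhere and strictly better somewhere."""
    no_worse = (
        a.quality >= b.quality
        and a.eval_hours <= b.eval_hours
        and a.eval_cost_usd <= b.eval_cost_usd
    )
    strictly_better = (
        a.quality > b.quality
        or a.eval_hours < b.eval_hours
        or a.eval_cost_usd < b.eval_cost_usd
    )
    return no_worse and strictly_better


def pareto_frontier(candidates: list[Candidate]) -> list[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


candidates = [
    Candidate("slow-top-model", quality=0.90, eval_hours=20 * 24, eval_cost_usd=5300),
    Candidate("fast-compressed-model", quality=0.85, eval_hours=7, eval_cost_usd=265),
    Candidate("dominated-model", quality=0.80, eval_hours=30, eval_cost_usd=900),  # hypothetical
]
print([c.name for c in pareto_frontier(candidates)])
```

Both the slow and the fast model survive the filter, which is the point: neither is "the" state of the art until a latency, cost or energy budget is fixed.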
Network Identity Moves Agent Credentials Out of the Sandbox
Remy Guercio of Tailscale argues that many agent sandboxes protect the runtime while leaving the more dangerous object inside it: the credential. In his account, Aperture, Tailscale’s LLM gateway, separates execution isolation from access control by keeping provider keys at the network layer and giving the agent only a placeholder. Routed through Tailscale’s WireGuard-based identity network, each LLM call carries a verified user, group, or machine identity, giving Aperture a central point for policy, logging, cost controls, hooks, and visibility into tool use.
A Two-Hour AI Prototype Let Museum Visitors Talk to Statues
Joe Reeve of ElevenLabs argues that his “talk to a statue” prototype mattered less as a museum product than as evidence of what can now be assembled quickly from existing AI APIs. Built in Cursor in about two hours, the app identifies a photographed statue, generates historical context and a plausible voice, spins up an ElevenLabs agent, and starts a conversation in roughly 30 seconds. Reeve says the harder remaining questions are institutional rather than purely technical: who authors the object’s story, what voice it should have, and how multimodal voice interfaces should work.
Voice Agents Need Colocated Models to Stay Under One Second
Rishabh Bhargava of Together AI argues that production voice agents are now constrained less by demos than by a sub-second engineering budget spanning speech-to-text, LLMs, text-to-speech, networking, and scaling. In his account, users notice delays above 500ms and abandon calls around one second, making even 75ms network hops material once model latency is optimized. The practical architecture remains a cascade, he says, because it lets teams control tool calling, evaluation, and reliability while speech-to-speech models still lag on production requirements.
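The budget argument is easy to check by hand. The sketch below takes the 75ms hop and the one-second ceiling from the account above; the per-stage latencies and hop counts are hypothetical values chosen only to show how hops add up.

```python
BUDGET_MS = 1000      # around where callers abandon, per the account above
NETWORK_HOP_MS = 75   # per-hop cost cited above


def turn_latency_ms(stage_ms: dict[str, int], network_hops: int) -> int:
    """Total response latency: model stages plus a fixed cost per network hop."""
    return sum(stage_ms.values()) + network_hops * NETWORK_HOP_MS


# Hypothetical stage latencies for one cascaded turn.
stages = {"speech_to_text": 150, "llm_first_token": 400, "text_to_speech": 200}

for label, hops in (("models on separate hosts", 4), ("colocated models", 1)):
    total = turn_latency_ms(stages, hops)
    verdict = "within" if total <= BUDGET_MS else "over"
    print(f"{label}: {total} ms ({verdict} budget)")
```

With identical models, the only difference between the two rows is where they run, which is the case for colocation.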
Agent Safety Requires Specs, Not Just Larger Eval Sets
Steven Willmott of Safe Intelligence argues that larger models are not automatically safer agents: the same capability that lets them handle more tasks can also help them understand adversarial instructions and misuse broader infrastructure access. His proposed answer is spec-driven validation, in which an agent is tested against an implementation-independent behavioral spec covering rules, domain boundaries, rights and roles, ground truth, domain knowledge and robustness requirements. The point is to make security and reliability testing follow from what the agent is allowed to do, not just from a dataset of expected answers.
Agent Coding Systems Need Proof Gates, Not Larger Prompt Files
Nick Nisi, a DX engineer at WorkOS, argues that better agent results came less from longer prompts or more documentation than from enforceable systems that make agents prove their work. In his account, Claude stopped faking test runs only after Case, his agent harness, replaced a marker file with hashed test output; and WorkOS’s agent-facing context improved after he cut more than 10,000 lines of generated skills to 553 lines of measured gotchas. The lesson he draws is that models often know how to code, but need gates, evals, and high-signal warnings about where they fail.
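The difference between a marker file and hashed output can be illustrated generically. This is a sketch of the idea only, not Case's implementation: the function names and the sample test output are hypothetical, and it assumes the harness captures the test output itself and compares digests.

```python
import hashlib
import hmac


def output_digest(test_output: str) -> str:
    """SHA-256 hex digest of the raw test output."""
    return hashlib.sha256(test_output.encode("utf-8")).hexdigest()


def gate_passes(harness_output: str, claimed_digest: str) -> bool:
    """Pass only when the agent's claimed digest matches output the harness captured itself."""
    return hmac.compare_digest(output_digest(harness_output), claimed_digest)


captured = "12 passed, 0 failed"  # hypothetical output recorded by the harness
print(gate_passes(captured, output_digest(captured)))  # claim backed by the real output
print(gate_passes(captured, "tests-passed"))           # bare marker with no proof
```

A marker can be written without running anything; a digest of the real output cannot be produced without having that output.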
Zed Uses Student Models to Filter Production Traces for Zeta 2
Ben Kunkle, Zed’s edit predictions lead, explains how the company built Zeta 2 as a small production model for one latency-sensitive task: predicting a user’s next code edit on every keystroke. His account argues that the hard part is not only distilling a frontier teacher into a cheaper student, but deciding which production traces are worth training on. Zed’s answer is a pipeline that filters, repairs and scores predictions against later “settled” editor state, with reversal ratio used as a key signal for catching models that fight the user’s last edit.
Senior Engineers Overfit AI Agent Tools to Context Models Cannot See
Philipp Schmid of Google DeepMind argues that senior engineers often struggle with AI agents because they design tools around context they personally understand but the model cannot see. In his account, agent-ready systems need explicit tool schemas, semantic state, recoverable errors, eval-based reliability measures and disposable harnesses, because engineers are managing probabilistic behavior rather than controlling a deterministic flow.
Hugging Face Ships a $299 Hackable Robot for Voice AI Experiments
Andres Marafioti argues that Hugging Face’s Reachy Mini is meant to move robotics experimentation out of expensive humanoid hardware and into a $299-to-$449 open-source platform that users can assemble, repair and modify themselves. The robot’s most-used application is conversation, and Marafioti’s account ties its social ambition to a technical stack built for low-latency speech: Parakeet transcription, Qwen 3.5 27B, and an optimized Qwen3 TTS implementation that he says improved from 0.8x to 5.8x real time.
Context Graphs Let Agents Retrieve Precedents, Not Just Policies
Neo4j’s Zach Blumenfeld argues that agents built for operational decisions need context graphs rather than document retrieval alone. In his model, a standard knowledge base can tell an agent the relevant facts and policies, but a context graph adds prior decision traces, causal links, precedents and outcomes, allowing the agent to retrieve how similar cases were resolved. He presents `create-context-graph` and `neo4j-agent-memory` as open-source scaffolding for building that pattern with graph entities, short-term memory and embedded reasoning traces.
Claude Code Reverse Engineers Viking VoIP Phone’s Undocumented Configuration Protocol
Boris Starkov of ElevenLabs presents the Viking K-1900D-IP phone as a reverse-engineering case study in which Claude Code turned an unusable, undocumented VoIP handset into a working AI demo. Starkov argues that Claude did the investigative work: discovering a two-letter command protocol, brute-forcing valid registers, intercepting the manufacturer’s Windows XP-era software through a TCP proxy, and deriving the one-byte checksum needed to write persistent configuration. His account is also a claim about agency in hardware work: he says he acted largely as Claude’s hands while Claude orchestrated the protocol break.
Gigabyte-Scale Agent Traces Are Forcing a New Observability Stack
Phil Hetzel of Braintrust argues that agent observability is a different problem from traditional observability because the central question is no longer whether a system is up, but whether an agent did the right thing. In his account, agent traces are too large, textual, and semantically loaded for uptime-oriented monitoring systems: Braintrust has seen traces exceed a gigabyte and spans reach 20 megabytes. Hetzel says that shift also changes who uses the data, bringing clinicians, lawyers, wealth advisers, and other domain experts into trace review so their judgments can become inputs for automated scoring and evaluation.
Agentic AI Projects Fail When Governance Cannot Move at Machine Speed
Accenture’s Jess Grogan-Avignon and Jack Wang argue that many enterprise agentic AI projects fail not because the agent cannot be built, but because the institution around it cannot move fast enough to ship and learn from it. Drawing on their experience building an agentic application in two weeks and spending another year getting it into production, they say enterprises must recode governance, fund AI as a portfolio of bets, deliver through hypothesis loops, grant autonomy only as evidence builds, and treat live customer feedback as the defensible asset.
Context Graphs Give AI Agents Rules, Precedent, and Decision Traces
In a Neo4j talk, Zaid Zaim and Andreas Kollegger argue that AI agents need more than language models, tools, and retrieval if they are to make consequential decisions. Zaim frames context graphs as a way to store the policies, prior decisions, causal links, and reasoning traces behind an action; Kollegger extends that into a five-stage decision workflow in which agents frame the case, check rules and precedent, assess risk, act only within authority, and write the outcome back to the graph as future precedent.
Comprehension Made Up 67% of One Engineer’s Claude Coding Sessions
Priscila Andre de Oliveira, a senior engineer at Sentry, argues that the most useful daily AI skill in a large production codebase is not code generation but comprehension. After analyzing 116 of her own Claude sessions, she found that 67% of her prompts were about understanding code and just 2% were generation. Her workflow, built around a local “catch me up” skill, uses AI to trace architecture, conventions, tests, history and behavior before any planning or implementation begins, because she says slop starts when the engineer’s mental model is wrong.
Rust’s Compiler Turns AI Coding Errors Into Pre-Production Feedback
Daniel Szoke, the Rust SDK maintainer at Sentry, argues that Rust is better suited to agentic or “vibe” coding than languages that let models produce runnable code quickly. His case is that TypeScript, Python and JavaScript impose too few constraints, allowing some model-generated bugs to compile, run and fail only intermittently. Rust, by contrast, turns classes of type, memory and concurrency errors into compiler feedback that an agent can use to repair code before it reaches production.
Agent Evals Should Replay Production, Not Exhaustively Imitate Unit Tests
Phil Hetzel of Braintrust argues that teams should stop treating evals for AI agents like unit tests meant to cover every possible failure. His maturity model starts with human judgments that record why an output failed, turns those justifications into scalable scorers, and then uses production traces to drive offline experimentation. The hard edge, he says, comes with tool-using agents, where useful evals must account not just for the final answer but for external system state and side effects at the moment the trace originally ran.
Local Frontier AI Still Needs 100x Better Price Performance
Alex Cheema of EXO Labs argues that running frontier AI locally is primarily an inference-stack problem, not a model-training problem. Using a four-Mac Studio GLM 5.1 setup that costs about $40,000 and reaches roughly 20 tokens per second as the current reference point, Cheema says local price-performance still has about 100x to improve through better kernels, interconnects, heterogeneous hardware, energy efficiency, orchestration, and benchmarks. His case is that today’s awkward home cluster is not the endpoint, but evidence of how much optimization remains outside the cloud.
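To put the reference point in one number, the sketch below divides the quoted hardware cost by the quoted throughput and applies the 100x target. Using dollars of hardware per token per second as the price-performance metric is an assumption made for illustration; it ignores energy and other running costs.

```python
CLUSTER_COST_USD = 40_000  # four-Mac Studio setup, as quoted above
TOKENS_PER_SECOND = 20     # approximate throughput, as quoted above
TARGET_IMPROVEMENT = 100   # improvement factor cited above


def usd_per_token_per_second(cost_usd: float, tokens_per_second: float) -> float:
    """Hardware dollars needed for each token per second of throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return cost_usd / tokens_per_second


current = usd_per_token_per_second(CLUSTER_COST_USD, TOKENS_PER_SECOND)
target = current / TARGET_IMPROVEMENT
print(f"current: ${current:,.0f} per token/s, target after 100x: ${target:,.0f} per token/s")
```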
Strong AI Agents Bound Scope, Expose Work, and Undo Mistakes
Mardu Swanepoel of Flinn AI argues that the best agent products are not defined by maximum autonomy, but by how carefully they bound and expose it. Looking across Harvey, Cursor, Manus, and Claude, he identifies four shared patterns: focused modes that narrow the task, transparent execution that lets users inspect the work, personalization that reflects user or organizational methods, and reversibility that limits the cost of mistakes.
Context Engines Make Coding Agents Mergeable, Not Just Functional
Brandon Waselnuk of Unblocked argues that coding agents are failing less because they lack access to tools than because they lack organizational context. In his account, MCP connections, larger context windows and naive RAG give agents more material, but not the judgment to know which code patterns, Slack decisions, ownership signals or backwards-compatibility rules matter. His proposed answer is a runtime context engine that reasons across code, PRs, documents, conversations and social structure before the agent writes code, so its output is closer to something a long-tenured engineer could merge.
Agent Benchmarks Are Measuring Harnesses as Much as Models
Nicholas Kang and Michael Aaron of Google DeepMind’s Kaggle team argue that AI evaluation is failing less because of a shortage of benchmarks than because benchmark results are hard to reproduce, easy to distort through hidden harness choices, and shaped by too narrow a group of authors. Their case is that agentic evals need shared infrastructure: transparent execution, community-created tests, model-versus-model arenas, and low-friction exams for builders who are not research labs. The recurring example is a wastewater treatment engineer in Turkey whose field experience produced a safety benchmark no lab was likely to create on its own.
Enterprises Are Misassigning GenAI Work to Traditional ML Teams
Phil Hetzel of Braintrust argues that many enterprises misassigned generative AI work to data science and ML platform teams because it carried the AI label. His case is not that those teams are irrelevant, but that LLM application work starts after providers such as OpenAI and Anthropic have trained the base models. What remains, he says, is a broader product and systems problem: prompt and context engineering, domain annotation, functional evaluation, observability, and production feedback loops that require data scientists, engineers, and subject-matter experts working together.
Useful AI Agents Need Smaller Contexts and Simpler Representations
Angus McLean, an AI Director at OLIVER, argues that useful agents are not the most autonomous ones but the best constrained. Drawing on OLIVER’s production use of AI across thousands of daily creative assets, he says builders should resist both model and developer tendencies toward verbosity and over-engineering: use curated documentation instead of open web access, ask how little context a task needs, choose simple representations such as HTML when they work, and avoid automating jobs they cannot do themselves.
Google’s Agent Scaling Problem Is Quota, Observability, and Evaluation
KP Sawhney and Ian Ballantyne describe Google DeepMind’s agent work as an infrastructure problem rather than a single-agent breakthrough. Their account centers on the constraints that appear when thousands of heavy users and agent workflows run at once: quota management, scarce compute, traceability, skills governance, evaluation, and review. Sawhney argues the next step for Deep Research is to move away from passing giant context blobs through a pipeline toward shared workspaces where components can collaborate more like human researchers.
Parallel Coding Agents Turn Human Availability Into a Systems Problem
Michael Richman argues that coding agents are still too dependent on unpredictable human input for developers to treat them as set-and-forget tools. His Cmd+Ctrl system is meant to reduce what he calls FOMAT, or fear of missing agent time, by aggregating sessions across tools such as Claude Code, Cursor, Codex and Gemini CLI, sending notifications when agents finish or get stuck, and letting users respond or start sessions from mobile, web, watch or terminal surfaces.
Heterogeneous Model Routing Beats Frontier Baselines on Visual Web Tasks
Adrian Bertagnoli of Callosum argues that AI scaling is moving away from monolithic models running on uniform GPU clusters and toward heterogeneous systems that route subtasks across different models, chips and workflows. He points to Callosum results in visual web navigation and recursive long-context reasoning, where mixed model-and-hardware systems reportedly matched or beat frontier baselines while cutting cost and latency, as evidence that agentic workloads should be decomposed rather than sent wholesale to the most capable model.
Agent Interfaces Are Moving From Chat to Web-Native Surfaces
Rachel Nabors argues that chat should be treated as a transitional interface for agents, not their final form. Using her rebuilt Rachel the Great web comic archive as the example, she shows how MCP apps can render HTML, CSS and JavaScript inside Claude as a working comic reader, while WebMCP can expose a site’s existing functions directly to browser agents. Her case is that the web platform already provides the “infinite canvas” for agent software; the task is to let agents inherit it rather than confining them to text conversations.
Agent Swarms Need a Coordination Layer, Not Another Runtime
Lou Bichard of Ona argues that companies building fleets of background coding agents are repeatedly recreating the same missing infrastructure. In his account, runtimes, orchestration and triggers are increasingly solved; the unresolved primitive is coordination — the layer that lets agents track state, hand off work, enforce gates and know when they can move through the software development lifecycle. GitHub, Linear and CI can expose artifacts and signals, Bichard says, but they are not agent-native coordination systems; he suggests the missing layer may need to take the form of a CLI gateway that local and remote agents can call.
Google’s GenAI Stack Turns Multimodal Prompts Into Application Pipelines
Google DeepMind’s Paige Bailey and Guillaume Vernade argue that Google’s generative AI stack is being organized as an application pipeline rather than a set of isolated models. In a three-hour workshop, Bailey showed AI Studio turning multimodal Gemini prompts into inspectable API calls and generated apps with auth and Firestore, while Vernade used Gemini, Nano Banana, Veo and Lyria to illustrate, animate and score The Wind in the Willows. Their case is that builders can now orchestrate prompt, code, media generation and deployment in one workflow, even as the demos exposed seams that still require engineering discipline.
Fast Coding Models Require Smaller Tasks and Continuous Validation
Sarah Chieng of Cerebras argues that fast coding models such as Codex Spark, which she says can generate code at roughly 1,200 tokens per second, require more disciplined developer workflows rather than looser ones. In her account, a 20x speedup over models such as Sonnet and Opus makes old habits — large prompts, unattended agents, delayed validation, and sprawling context — produce technical debt faster than developers can inspect it. Her playbook is to use speed for bounded execution, continuous testing and linting, variant generation, stricter permissions, and external memory that keeps short sessions from losing the plan.
Container Images Turn OpenClaw Setups Into Reproducible Team Baselines
Sally Ann O’Malley of Red Hat argues that an OpenClaw agent setup should be shared as a container image rather than as a bundle of markdown, YAML, copied keys and informal instructions. Her demo uses Podman locally and Kubernetes for distribution, with the same image, separate secret backends, volume-backed state and a curated agent bundle so a personal setup can become a reproducible team baseline.
Android Makes Gemini Nano a Shared System Service for Apps
Google’s Florina Muntenescu and Oli Gaymond argue that Android’s on-device AI strategy depends on treating Gemini Nano as a shared system service, not something each app ships and manages itself. In their account, AICore centralizes the three-to-four-gigabyte model, scheduling, battery management and privacy boundaries, while developers call higher-level ML Kit GenAI APIs. The constraint is reach: those APIs need recent flagship-class devices, so Google is positioning hybrid cloud fallback and LiteRT-LM as alternatives when local Gemini Nano is unavailable or too limiting.
VS Code Unifies Local, Background, and Cloud Coding Agents
Microsoft’s Liam Hampton argues that coding agents should be chosen by the amount of control a developer wants to keep, not treated as a single all-purpose assistant. In a VS Code demo using one repository, he assigns tests to a local Claude agent for hands-on iteration, a front-end build to a background agent isolated in a Git worktree, and open-source documentation to a cloud agent running through GitHub Actions. His case is that VS Code can act as the control plane for these modes, including Copilot, Claude, and third-party agents.
AI-Generated PR Firehoses Are Turning Agent Work Into Infrastructure
OpenClaw maintainer Onur Solmaz argues that high-volume AI-generated pull requests are less a code-review problem than an operations problem. In his talk, he presents acpx, a headless CLI for the Agent Client Protocol, as a way to replace terminal scraping with structured agent workflows that can reproduce bugs, judge implementations, run review loops and emit machine-readable results. He extends the same model to Spritz, a Kubernetes operator for disposable per-task agent pods, making the case for interoperable, isolated agent infrastructure rather than one shared bot or ad hoc maintainer intervention.
Coding Agents Can Tackle AI Systems Engineering With File-Based Skills
Hugging Face’s Ben Burtenshaw argues that coding agents can now take on parts of AI systems engineering when the work is narrow, measurable, and embedded in inspectable repositories. Using examples including an agent-written CUDA RMSNorm kernel with a reported 1.94x H100 speedup, an end-to-end Qwen3 fine-tune, and a multi-agent research lab, he makes the case that the limiting factor is not a better prompt but better primitives: skills, versioned artifacts, benchmarks, managed compute, and open metrics that agents can read, run, and improve.
Any-to-Any Agents Rely on Orchestrated Multimodal Models, Not One Network
Google DeepMind’s Patrick Löber presents “any-to-any” agents as an orchestration problem rather than a claim that one model already handles every modality. In his architecture, Gemini reads and reasons across PDFs, images, audio, video and other sources, then uses function calling to invoke specialized native models for images, speech, live audio, video or embeddings. Löber argues that the useful shift is not generating every possible format, but letting an agent decide when a diagram, spoken explanation or other output is warranted.
Coding Agent Skills Need Live Documentation, Not Cached Product Knowledge
Marc Klingen of Langfuse argues that coding agents can add observability, but often do it first from stale model memory, producing broken or incomplete instrumentation before recovering through current documentation. In a talk on building a Langfuse skill for Claude Code, he says the fix is not to stuff more product knowledge into the agent, but to give it reliable ways to find live docs, expose its intermediate work in traces, and evaluate changes against realistic repositories. The same work, he warns, creates new risks when optimization loops reward shorter paths and remove the documentation-fetching and approval steps that make the skill reliable.
Fine-Tuning Pushed FunctionGemma From 46% to 90% Function-Calling Accuracy
Cormac Brick, a Google AI Edge engineer, argues that on-device agents are becoming practical when developers either use system models such as Gemini Nano through Android AICore or ship narrow, fine-tuned tiny models with LiteRT-LM. His main example is FunctionGemma, a 270-million-parameter function-calling model that rose from about 46% accuracy out of the box to more than 90% on most tested app-intent functions after synthetic-data fine-tuning. Brick presents the tradeoff plainly: system GenAI is easier when it fits, while app-shipped tiny models require more work but can run locally, offline, and with more control.
Retrofitting Sovereign AI Turns Compliance Rules Into Architecture Rework
Bilge Yücel of deepset argues that AI sovereignty is an engineering constraint that has to be designed into a system, not a legal or procurement requirement applied after deployment. She frames sovereign AI around control of data, models, infrastructure, and operations, and shows how retrofits expose hidden dependencies: jurisdiction-crossing data flows, model APIs embedded in application logic, managed services that masked operational work, and systems that cannot be traced or audited.
Every Addition to an AI Agent Can Make It Worse
Ara Khan of Cline argues that agent maturity is less about adding autonomy than about knowing what not to add. In a talk structured around four levels of agent building — from frameworks to state machines, Kanban-managed workflows and cloud deployment — Khan says frontier models increasingly reward simpler prompts, deliberate architecture and visible human control. His central warning is that every extra instruction, abstraction or automation layer can make an agent worse.
Spotify Uses Semantic IDs to Make LLMs Recommend Catalog Items
Spotify’s Shivam Verma argues that LLM-era personalization requires translating both users and catalog items into forms a model can process alongside language. In his account, Spotify combines long-term user embeddings, Semantic IDs that turn tracks and episodes into token sequences, and soft tokens that project a listener’s profile into an LLM’s embedding space. The aim is a generative recommender that can produce catalog-native recommendations without full fine-tuning, while still relying on traditional ranking layers for production use.
UK Government Tests an Insurgent Model for In-House AI Delivery
Eoin Mulgrew of the Number 10 data science team argues that the UK state’s AI problem is less a shortage of use cases than a shortage of technical people with the access, mandate, and proximity to build inside government workflows. In a talk on the No. 10 Innovation Fellowship, he presents the model as a deliberate hack around normal civil-service constraints: market-rate pay, outside recruitment, a highly selective technical process, and authority to enter departments and ship tools that remain with the teams using them.
Gemini Becomes the Prompt Engineer for Google’s Gen Media Stack
Google DeepMind developer advocate Guillaume Vernade demonstrates a gen-media workflow built around Gemini as the orchestrator rather than as a one-shot generator. Using The Wind in the Willows, he shows Gemini reading the full book, producing structured prompts and scripts, and handing them to Nano Banana, Veo, Lyria and TTS models for images, video, music and narration. His broader case is that multimodal production depends less on a single model than on schemas, reference assets, state management, cost controls and prompt handoffs between specialist systems.
Long-Running Agents Need Separate Builders, Evaluators, and Disposable Scaffolding
Anthropic’s Ash Prabaker and Andrew Wilson argue that long-running agents are a harness-design problem, not a matter of writing longer prompts. Their case is that agents can run for hours only when building, judging, planning and state management are separated: adversarial evaluators should test live behavior, work should be decomposed into explicit contracts, and durable state should live outside the model’s context. They also warn that this scaffolding is provisional, because each new model release changes which supports are useful and which have become dead weight.
A Harness Made GPT-3.5 Turbo’s Browser Agent Reliable Without Rewriting the Prompt
Tejas Kumar, an IBM engineer, argues that unreliable AI agents are often not suffering from bad prompts so much as missing harnesses: the deterministic software around a model that bounds its behavior, manages context, verifies outcomes, and handles known failure states. In his Hacker News browser-agent demo, GPT-3.5 Turbo falsely claimed it had upvoted a post after hitting a login wall; without changing the prompt, Kumar added guardrails, trace-based verification, and a programmatic login handler until the same model completed the task reliably.
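A minimal sketch of the trace-based verification idea, not Kumar's actual code: the `Step` trace format, the `click_upvote` action name and the `example.com` URLs are all hypothetical. The point it illustrates is that the harness decides the outcome from recorded actions rather than from the model's own claim.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One recorded browser action (hypothetical trace format)."""
    action: str
    url: str
    ok: bool


def verify_upvote(claimed_success: bool, trace: list[Step]) -> str:
    """Judge the outcome from the trace, not from the model's own report."""
    # A successful upvote click in the trace is the only evidence we accept.
    if any(s.action == "click_upvote" and s.ok for s in trace):
        return "verified"
    # Known failure state: the agent landed on a login page.
    if any("login" in s.url for s in trace):
        return "needs_login"
    # No evidence of success: a success claim is therefore false.
    return "false_claim" if claimed_success else "not_done"


# The agent claims success, but the trace shows it was sent to a login wall.
blocked = [
    Step("goto", "https://example.com/item", True),
    Step("click_upvote", "https://example.com/login", False),
]
print(verify_upvote(True, blocked))  # prints "needs_login"
```

A `needs_login` result is where a programmatic login handler of the kind Kumar describes could take over before the task is retried.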
Incident.io Uses Coding Agents to Debug Its AI SRE
Lawrence Jones, founding engineer at Incident.io, argues that complex AI products now require debugging tools built for agents as well as humans. In a talk on Incident.io’s AI SRE system, which runs hundreds of prompts across telemetry and code during production investigations, Jones describes how the team moved from human trace inspection to agent-addressable evals, downloadable file-system traces, and parallel analysis pipelines to find and fix failures that had become too large to debug manually.
AI Chat Needs Shared Sessions, Not Single Response Streams
Mike Christensen of Ably argues that many AI chat interfaces fail because they tie the user experience to a single streaming connection, not because the underlying model is inadequate. In his account, Server-Sent Events make common product behaviors such as refresh, reconnect, cancellation, multi-tab use and device switching brittle or ambiguous. Christensen’s proposed fix is to treat the AI session as a durable shared resource: clients and agents subscribe to and write into the session, so connections can drop, agents can run concurrently, and humans can join without losing context.
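The shared-session idea can be sketched as an offset-addressed message log that any client or agent can write to and resume from. This is an illustrative in-memory stand-in, not Ably's API: the `Session` class and its methods are hypothetical, and a real system would need durable storage and push delivery.

```python
class Session:
    """Hypothetical shared session: an ordered log that outlives any connection."""

    def __init__(self):
        self._log = []  # list of (offset, author, text) tuples

    def publish(self, author: str, text: str) -> int:
        """Append a message and return its offset in the session."""
        offset = len(self._log)
        self._log.append((offset, author, text))
        return offset

    def read_from(self, offset: int):
        """Return every message at or after the given offset."""
        return self._log[offset:]


session = Session()
session.publish("user", "Summarize the report")
session.publish("agent", "Working on it")

# The client has seen offsets 0 and 1, then its connection drops.
last_seen = 2
session.publish("agent", "Here is the summary")

# On reconnect (or from another tab or device) it resumes from its offset.
missed = session.read_from(last_seen)
print(missed)  # prints [(2, 'agent', 'Here is the summary')]
```

Because the state lives in the session rather than in one response stream, a refresh or device switch only changes which offset a client resumes from.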
Agentic AI Is Turning Model Quality Into a Systems Problem
At AI Engineer Singapore’s second day, speakers from Google DeepMind, Cloudflare, Arize, OpenClaw, Adaption and other teams made a shared engineering case: as AI systems become more agentic, model quality is no longer separable from the systems around the model. Richard Ngo framed the risk as long-horizon, situationally aware agents whose goals cannot be inspected, while practitioners argued that production AI now depends on continuous evaluation, traces, deterministic execution boundaries, routing, memory, fine-tuning and test-time search. Their central claim is that useful and safe agentic AI is becoming a systems problem, not just a model-selection problem.
Playwright Lets Agents Test Feature Requests Before They Write Code
Microsoft’s Marlene Mhangami argues that AI-generated tests can make a codebase look healthier than it is, because agents often write tests that confirm their own implementation rather than validate the user-visible behavior a feature is meant to deliver. Her prescription is to reverse the common workflow: start from the feature request, have the agent write failing Playwright tests against expected behavior, then generate code to pass them. In a GitHub Copilot demo using the Playwright MCP server, she applies that approach to a toy-store search and filtering feature, with the browser showing the agent exercise the product experience directly.
Vertical AI Teams Need Domain Experts Who Own Quality Loops
Chris Lovejoy of Notius Labs argues that vertical AI companies increasingly fail or succeed on whether they can turn domain judgment into product quality, not simply on access to better models. He proposes three operating models for that expertise: an Oracle who both judges and changes outputs, an Evaluator who defines and measures quality while engineers implement fixes, and an Architect who designs systems that improve from use. His case studies of Granola, Tandem and Anterior show why the right model depends on whether quality is subjective, measurable, or too variable for manual iteration.
Context Graphs Make AI Decision Trails Queryable
Stephen Chin of Neo4j argues that enterprise AI systems need context graphs because retrieval alone can surface relevant facts while missing the relationships that make them usable. In his examples, a graph-augmented system can connect a patient’s emphysema care plan to smoking history or a credit decision to prior rejections, policies, margin trades and fraud signals. Chin’s case is that agents should preserve not only documents and answers, but the decision traces, tool calls, causal chains and outcomes that let humans inspect and reuse prior reasoning.
PFF’s Two-Engineer Agent Team Shipped 10x More Output
PFF CTO Mike Spitz argues that AI agents change the basic operating constraint of an engineering organization: the question is no longer how to make engineers faster, but how to make agents faster. In a three-month case study, he says two agent-heavy engineers shipped far more frequently than a ten-person team on the same codebase, with PFF measuring a 10x output gain per engineer and higher customer satisfaction. The result, in his account, was not the end of engineers but the removal of Scrum-era coordination rituals and a sharper split between agent-executed work and human judgment.
Supabase Says Skills and MCP Close the Agent Context Gap
Pedro Rodrigues of Supabase argues that agents fail on production systems less because they cannot reason than because they lack product-specific judgment. In a test using the same Postgres task, Supabase found that Claude with MCP alone created a view that could bypass row-level security, while MCP plus a Supabase skill added the required `security_invoker = true` flag. Rodrigues’s case is that MCP gives agents tools, but skills supply the rules, workflows, and current documentation paths needed to use those tools safely.
Intercom Doubled Engineering Throughput by Standardizing on Claude Code
Brian Scanlan, a senior principal engineer at Intercom, argues that the company doubled engineering throughput by treating AI coding as an internal platform strategy rather than an individual productivity tool. In his account, Intercom standardized on Claude Code, encoded recurring engineering work into agent-usable skills, connected agents to internal systems under existing controls, and made AI adoption an explicit expectation across R&D. The reported result was a doubling of pull-request throughput, including 17.6% of merged PRs approved by Claude, alongside new bottlenecks in review and CI.
Choosing The Right Eval Matters More Than Tuning The Judge
Laurie Voss of Arize argues that agentic applications need the same engineering discipline as other production software: instrumentation, inspectable traces, targeted evals, and controlled experiments, not a handful of prompts that “look right.” In a hands-on workshop using a financial analysis agent, Voss shows how teams should read traces before writing evals, classify failures by root cause, and combine deterministic checks, LLM judges, custom rubrics, and human-labeled meta-evaluation. His central warning is that the choice of eval can dominate the result: the same agent scored 0 out of 13 on a correctness eval and 13 out of 13 on a faithfulness eval because the first judge was asking the wrong question.
Agent Observability Is Moving From Dashboards to Eval-Driven Optimization
Amy Boyd and Nitya Narasimhan of Microsoft argue that agent observability has to track the widening gap between what an AI agent is meant to do and what it actually does as models, prompts, tools and user behavior change. Their walkthrough of Microsoft Foundry frames observability as a loop of OpenTelemetry tracing, trace-linked evaluations, monitoring, optimization and red teaming. The central demonstration is an observe skill that can generate an evaluation dataset, run batch tests, optimize prompts, compare versions and roll back to the best-performing agent version from a sparse starting point.
An Event-Sourced Agent Harness Separates State Replay From Side Effects
Jonas Templestein of Iterate argues that an agent harness can be reduced to an append-only event stream plus processors: synchronous reducers to derive state, and post-append hooks to perform side effects. His design puts model chunks, tool calls, errors, schedules, subscriptions and even processor deployment into the log, so a restarted agent can replay state without replaying old LLM calls. The larger claim is that agents and third-party services can compose by reading and appending to the same durable stream, with bounded waits and circuit breakers replacing tighter, blocking plugin interfaces.
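The reducer-plus-hooks split can be sketched in a few lines. This is an assumption-laden illustration, not Iterate's implementation: `Event`, `EventLog`, `reduce_messages` and the `model_chunk` event kind are hypothetical names. It shows state being rebuilt by replay without re-running side effects.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass(frozen=True)
class Event:
    kind: str
    payload: dict[str, Any]


@dataclass
class EventLog:
    """Append-only log with a synchronous reducer and post-append hooks."""
    reducer: Callable[[dict, Event], dict]
    hooks: list[Callable[[Event], None]] = field(default_factory=list)
    events: list[Event] = field(default_factory=list)
    state: dict = field(default_factory=dict)

    def append(self, event: Event) -> None:
        self.events.append(event)
        self.state = self.reducer(self.state, event)  # derive state synchronously
        for hook in self.hooks:                        # side effects run after the append
            hook(event)

    @classmethod
    def replay(cls, events: list[Event], reducer) -> "EventLog":
        """Rebuild state from stored events without running any hooks."""
        log = cls(reducer=reducer)
        for event in events:
            log.events.append(event)
            log.state = reducer(log.state, event)
        return log


def reduce_messages(state: dict, event: Event) -> dict:
    """Pure reducer: collect model output chunks into a message list."""
    messages = list(state.get("messages", []))
    if event.kind == "model_chunk":
        messages.append(event.payload["text"])
    return {**state, "messages": messages}


calls = []  # stands in for a side effect such as an LLM or tool call
live = EventLog(reducer=reduce_messages, hooks=[lambda e: calls.append(e.kind)])
live.append(Event("model_chunk", {"text": "Hello"}))
live.append(Event("tool_call", {"name": "search"}))

restored = EventLog.replay(live.events, reduce_messages)
print(restored.state == live.state, len(calls))  # prints "True 2"
```

Keeping the reducer pure is what makes the replay safe: a restarted agent recovers the same state while the hook-driven work is not repeated.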
Agents Can Now Fine-Tune Open Models Through Prompted Workflows
Merve Noyan argues that open models have moved from downloadable artifacts into an operational stack for selection, serving, inspection, training and deployment. In her Hugging Face presentation, she makes the case that access to model weights now matters because developers can quantize, fine-tune and run models locally or at the edge, while Hub benchmarks, inference providers, traces, MCP and Skills let agents act directly on those workflows. Her strongest example is a coding agent that can size hardware, choose infrastructure and launch a fine-tuning job from a prompt.
Continuous Agents Need Stateful Compute, Not Traditional CI/CD
Madison Faulkner and Hugo Santos of Namespace argue that traditional CI/CD is organized around human-paced pull requests, and starts to fail when autonomous agents generate continuous, overlapping streams of code. Their proposed replacement keeps validation inside a stateful agent loop, uses caching and orchestration to avoid cold starts, and moves completed work into a pre-merge layer where humans review intent and outcome rather than every diff. The underlying CI functions remain, but the pull request stops being the system’s basic unit of work.
Persistent Sandboxes Make Agents Remember, Plan, and Reuse Their Work
Nico Albanese, a Vercel engineer working on the AI SDK, argues that agents become more reliable when they are given a persistent sandboxed computer, not just a runtime and tools. In his workshop, he builds that pattern with AI SDK 6, Vercel’s named sandboxes, a bash tool, and a file-backed memory system, showing how an agent can plan in files, preserve context across sessions, and create reusable scripts without a separate memory layer.
Enterprise GenAI Pilots Fail When Feedback Cannot Reach the Model
Alessandro Cappelli, co-founder and chief customer officer of Adaptive ML, argues that enterprise generative AI pilots fail to reach production because companies lack a systematic way to turn defects, user feedback, business metrics and production signals into model improvement. In a talk on Fortune 500 deployments, he says prompting and instruction fine-tuning can produce credible demos, but reinforcement learning is the mechanism needed to train models and agents against enterprise-specific environments, rewards and KPIs. His case is that agents make this feedback loop more urgent, because they consume more tokens, touch live systems and leave less room for error.
Fixed Evaluation Suites Go Stale as Agents Optimize Toward Intent
Vincent Koc of Comet ML argues that AI evaluation is being outpaced by the systems it is meant to measure. In a talk on adaptive evaluation for agents, Koc says static benchmarks and handcrafted test sets are poorly suited to applications that change with prompts, tools, production traces, user behavior and even their own harnesses. His proposed direction is to define the intended end state, use traces and telemetry to surface drift and edge cases, and treat evals as a continuously revised system rather than a one-time benchmark.
Coding Agents Work Best When Products Expose Simple Tools
Matthias Luebken argues that coding agents such as OpenClaw are less mysterious than they appear: they are LLMs calling tools in a loop, made more useful by a runtime, shell, sessions and product hooks. In his Tavon talk, he uses Pi, a minimal coding-agent SDK, to show how that loop can be embedded inside business software, including a sales workflow where RFP emails are routed to customer-specific agent sessions and returned to users as draft replies. His architectural point is that teams should not force agents through opaque systems, but expose data, commands and controls in forms coding agents can use cleanly.
Slack-Native AI Coworkers Turn Memory and Permissions Into Product Risks
Fryderyk Wiatrowski argues that building Viktor as an AI coworker inside Slack is not a matter of scaling a personal assistant to more users. A company-level agent gains value from shared context, shared integrations, and the ability to act where work is discussed, but those same features create harder problems around memory isolation, permissions, fragmented Slack conversations, proactivity, and tone. His case is that an “AI employee” has to be designed less like a chatbot and more like a new hire entering the company’s communication layer.
Apple-Device AI Is Becoming Viable Without Cloud Inference
Prince Canuma presents MLX, Apple’s array framework for Apple Silicon, as a practical foundation for running AI agents locally rather than through cloud services. His case is rooted in accessibility and unreliable connectivity, but extends to product constraints for voice agents, robots and multimodal apps: vision, speech, video generation and long-context inference can increasingly run on Macs, iPhones and iPads without a network call. Canuma does not argue that local models replace every frontier cloud system, but that the boundary has moved far enough to make on-device AI a serious deployment option.
Durable Agents Need Context Logs and Execution Snapshots
Eric Allam of Trigger.dev argues that durable agents need more than the replay-based workflow model used for durable transactions. In his talk, he separates agent durability into two problems: the LLM context, which fits naturally as an append-only log, and the execution environment — files, memory, subprocesses and local state — which he says should be preserved through OS-level snapshot and restore. Allam uses Trigger.dev’s Firecracker work to make the case that long-running agents are becoming session-like workloads, not just replayable transactions.
Head-Tail Truncation and Memory Stabilized Arize’s Trace-Analyzing Agent
Sally-Ann DeLucia argues that agent performance depends on context management as an operating discipline, not on larger prompts or simple compression. Drawing on Arize’s work building Alyx, an agent that analyzes trace data from AI systems including its own, she says naive truncation broke follow-up reasoning and LLM summarization gave the model too much control over what mattered. Arize’s more durable pattern was to preserve the head and tail of context, store the middle for retrieval, test long sessions explicitly, and move heavy workloads into sub-agents.
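A minimal sketch of head-tail truncation under stated assumptions: the function names, the placeholder text and the keyword-based lookup are illustrative and are not drawn from Arize's implementation. It keeps the first and last messages in context and sets the middle aside so it can be fetched on demand.

```python
def head_tail_truncate(messages: list[str], head: int = 2, tail: int = 4):
    """Keep the first `head` and last `tail` messages; return (kept, stored_middle)."""
    if len(messages) <= head + tail:
        return list(messages), []
    cut = len(messages) - tail
    middle = messages[head:cut]
    # A placeholder tells the model that content exists and can be retrieved.
    marker = f"[{len(middle)} messages stored for retrieval]"
    return messages[:head] + [marker] + messages[cut:], middle


def retrieve(stored: list[str], query: str) -> list[str]:
    """Naive keyword lookup over the stored middle (stand-in for real retrieval)."""
    q = query.lower()
    return [m for m in stored if q in m.lower()]


history = [f"m{i}" for i in range(10)]
kept, middle = head_tail_truncate(history, head=2, tail=3)
print(kept)
# prints ['m0', 'm1', '[5 messages stored for retrieval]', 'm7', 'm8', 'm9']
print(retrieve(middle, "m4"))  # prints ['m4']
```

Unlike LLM summarization, the selection rule here is deterministic, so the model does not decide which parts of the history survive.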
Production AI Features Need Feedback Loops, Not One-Shot Prompts
Mehedi Hassan, a product engineer at Granola, argues that the hard part of shipping AI features is not getting a model to work once in a demo, but making its behavior reliable and inspectable in production. Using Granola’s meeting-notes app as the case, he says web search, chat, and prompt personalization quickly expose costs, context limits, provider instability, and role-specific user expectations that a single prompt cannot absorb. Granola’s response, in his account, was to build feedback loops: internal tracing, broadly usable debugging tools, and faster ways to test product variants before shipping.
Text-to-Speech Models Are Converging on LLM-Style Architectures
Samuel Humeau of Mistral argues that modern text-to-speech has converged on an architecture that resembles large language modeling: an autoregressive transformer generates compressed audio tokens frame by frame, rather than raw waveform samples. Using Mistral’s open-weight Voxtral TTS model as the example, he says neural audio codecs make that possible by reducing dense speech signals to token-like representations a transformer can handle. The remaining latency frontier, in his account, is not just streaming playable audio early, but letting TTS consume an LLM’s text stream as it is still being written.
Voice AI Still Confuses Natural Speech With Real Conversation
Neil Zeghidour, CEO of Gradium AI and one of the researchers behind the full-duplex voice model Moshi, argues that voice AI’s long-promised “Her” moment is still being confused with better synthetic speech. His case is that cascaded voice agents are useful but structurally too slow and lossy to feel conversational, while speech-to-speech models improve flow but remain limited unless they can listen and speak simultaneously, use tools reliably, understand paralinguistic cues, and run cheaply enough to scale.
ElevenLabs Voice Engine Wraps Existing Chat Agents Without Rebuilding Them
Luke Harries of ElevenLabs argues that the next step for chat agents is not a new orchestration stack but a voice layer around the agents companies have already built. His case for ElevenLabs’ Voice Engine is that teams can keep their existing LLM logic, RAG, tools and business rules, while offloading speech-to-text, text-to-speech, turn-taking and interruption handling to a wrapper. The product is positioned for companies that want voice interfaces across web, phone and meeting channels without rebuilding their chat agents inside a fully managed platform.
Pretraining and Attention Infrastructure Made Vision Transformers Practical
Isaac Robinson of Roboflow argues that transformers overtook convolutional networks in vision not because images stopped needing visual structure, but because that structure moved from hand-built architecture into pretraining, scaling and tooling. In his account, ViT-style models first lacked the inductive biases and efficiency that made CNNs dominant, but self-supervised vision pretraining and attention infrastructure from the LLM world made the simpler architecture practical. Robinson frames the next problem as deployment: turning large foundation backbones into model families that can meet real latency, cost and hardware constraints.
BFL Is Moving FLUX From Image Generation Toward Physical AI
Stephen Batifol of Black Forest Labs argues that FLUX is no longer just an image-generation line but the start of a broader push toward visual intelligence: models that can generate, edit, understand, and eventually act across images, video, audio, and physical environments. In the talk, he presents FLUX.1, Kontext, FLUX.2, and FLUX.2 Klein as product steps toward that goal, while BFL’s Self-Flow research is framed as the mechanism for moving representation learning inside multimodal generative models rather than relying on external encoders.
Agentic Search Needs Specialized Tools and General-Purpose Escape Hatches
Elastic’s Leonie Monigatti argues that context engineering for LLM agents is largely a search-interface problem: the critical question is how an agent decides what to retrieve from files, databases, memory, the web, and other sources before the model answers. In her workshop, she shows why semantic search, database query tools, shell access, and agent skills each solve different parts of that problem and fail in different ways. Her recommendation is to build retrieval stacks that combine easy specialized tools for common tasks with more general tools for ambiguous or complex ones, then use observed failures to refine the stack.
Production Agents Need Evals and Managed Variables After Deployment
Samuel Colvin of Pydantic argues that production agents need more than observability after deployment: they need evals, traces, and typed configuration that can change prompts, models, and other parameters without a redeploy. Using Pydantic AI, Logfire, managed variables, and GEPA, he shows a workflow for moving from manual prompt tuning toward continuous optimization. His case is practical rather than automatic: GEPA can improve a narrow benchmark, but only if the team has representative data, sound evaluation criteria, and a clear definition of what better means.
Coding Agents Need Library Source Code, Not Longer Prompts
Michael Arnaldi of Effectful argues that coding agents use Effect better when the project gives them the Effect source code, not just better prompts or documentation. In a workshop starting from an empty repository, he demonstrates cloning the Effect repo into the project, having the agent extract local pattern files, and then using strict TypeScript diagnostics, tests, lint rules and persistent instructions to steer the agent toward a working Effect HTTP API.
Production Agents Need Semantic Observability Beyond Offline Evals
Raindrop’s workshop argues that production agents need a different observability model from conventional software monitoring or offline evals. Zubin Kumar, Danny Gollapalli and Ben Hylak make the case that teams should track both explicit telemetry such as tool errors, latency and cost, and implicit signals such as user frustration, refusals, task failure, capability gaps and unusual workarounds. Their framework treats real production behavior as the primary surface for finding regressions, running experiments and catching failures that do not appear as clean exceptions.
Agent Failure Should Drive Enterprise AI Knowledge Base Curation
Raj Navakoti argues that enterprise AI agents fail less because of model limits or retrieval plumbing than because companies have not made institutional knowledge legible. In his Demand-Driven Context workshop, he proposes building agent-ready knowledge bases from the bottom up: give agents real tickets or incidents, observe where they fail, and turn those failures into structured, validated context blocks. The method, shown through smaller-scope examples and prototypes including work from IKEA Digital, is presented as an incremental curation loop rather than a proven enterprise-scale system.
Agent Skills Turn Repeated Instructions Into Portable Workflows
WorkOS engineers Nick Nisi and Zack Proser make the case that AI “skills” are a practical way to turn repeated agent instructions into portable, reusable workflows. They argue that small markdown-and-script packages can encode team context, constraints, evidence-gathering commands and output formats so agents stop producing generic answers and start following a team’s way of working. Their warning is that skills only help when they are focused, routed correctly, tested against a no-skill baseline and managed like shared software rather than treated as another giant context file.
MCP Apps Turn Chat Hosts Into Application Distribution Channels
Liad Yosef and Ido Salomon argue that MCP Apps turn chat products such as ChatGPT, Claude, VS Code, Cursor and Copilot into application distribution surfaces, not just places for text responses. Their case is that tools can return branded, interactive UI resources over MCP, while user actions flow back through the host so the model retains context and control. For builders, they frame this as a shift from monolithic web destinations to portable app components that can run across compliant agent hosts.
Small-Model Inference Needs Infrastructure Beyond Model Servers
Filip Makraduli of Superlinked argues that the hard part of small-model inference is no longer simply serving a model, but operating many embeddings, rerankers, extractors and multimodal models efficiently in production. In his account, conventional one-model-per-container deployments waste GPU capacity and leave teams to rebuild routing, autoscaling, monitoring, hot-swapping and eviction themselves. Superlinked’s SIE is presented as an open-source attempt to provide that missing infrastructure layer for AI search and document-processing workloads.
Multi-Agent Software Systems Need Contracts and Handoffs to Run for Days
Factory’s Luke Alvoeiro argues that long-running software agents will not be built by stretching chat sessions, but by organizing agents into roles with explicit contracts, handoffs and validation. In a talk on Factory’s Missions system, he presents a three-part architecture of orchestrator, workers and validators, designed to run software work for hours or days while humans supervise scope and acceptance rather than every step. The case rests on Factory’s production experience, including missions Alvoeiro says have run as long as 16 days, and on a claim that serial execution, adversarial verification and model selection by role matter more than default parallelism.
Gemma 4 Moves On-Device AI From Chatbots to Local Agents
Chintan Parikh of Google DeepMind argues that on-device AI is moving from local chatbots toward local agents, as smaller Gemma 4 edge models become capable of tool calling, structured output and reasoning on phones, laptops and embedded hardware. With Weiyi Wang joining the Q&A, Parikh presents LiteRT as the deployment layer for that shift across Android, iOS, desktop, web and IoT. His case is pragmatic rather than absolute: edge inference can improve latency, privacy, offline use and cost, but teams still have to manage memory, quantization and accelerator support, and decide when to call the cloud.