AI Engineering Must Preserve Craft as Work Shifts to Verification

Annie VellaAI EngineerThursday, June 4, 202619 min read

At AI Engineer Melbourne, Jeremy Howard, Annie Vella and Mic Neale each argued against treating AI adoption as an automatic productivity upgrade. Howard warned that coding tools can simulate autonomy and flow while eroding mastery; Vella presented research showing engineers feel more productive even as parts of developer experience deteriorate; and Neale made the case for pooling idle edge devices as an alternative to defaulting all inference to centralized, metered infrastructure.

AI can deepen craft or hollow it out

Jeremy Howard framed AI engineering as a psychological problem before he framed it as a technical one. The work people choose, he argued, does not merely produce outputs; it shapes the person doing it. At a moment when AI is changing what engineers spend their days doing, Howard asked the audience to treat that as more than a productivity question.

His starting point was self-determination theory, particularly Richard Ryan and Edward Deci’s review paper, “Self-Determination Theory and the Facilitation of Intrinsic Motivation, Social Development, and Well-Being.” Howard emphasized that the paper was not an isolated claim but a synthesis after decades of research and “hundreds and hundreds of experiments.” He read its opening contrast closely: people at their best are “curious, vital, and self-motivated,” agentic and inspired, striving to learn and master new skills; yet the same human spirit can be “diminished or crushed,” producing apathy, alienation, and passivity.

That split gave Howard the core distinction for the talk: eudaimonia versus hedonia. Hedonia, in his description, is “frictionless pleasant ease” — not inherently wrong, but passive and oriented toward easy pleasure. Eudaimonia is living well by “fully actualizing your capacities.” Howard’s concern was that much AI tooling is being sold as hedonic software for knowledge work: a way to remove friction, have things done for you, generate more output, and reduce the need to struggle with the work itself.

Self-determination theory, as Howard presented it, explains why that matters. Motivation is not only a way to make workers more productive, despite how business books often frame it. Authentic motivation is associated with more interest, excitement, confidence, performance, persistence, creativity, vitality, self-esteem, and well-being. Howard organized the relevant needs around autonomy, mastery, relatedness, and purpose, then narrowed his focus to autonomy and mastery.

Hundreds

of controlled experiments Howard cited as backing self-determination theory’s claims about authentic motivation

Autonomy, in this account, is not simply having a choice between options. Mastery is not simply producing more things. Autonomy means meaningful agency in the work; mastery means building genuine craft through effortful practice and feedback. AI can support both, Howard argued, but it can also decay both.

Psychological need	How AI can support it	How AI can decay it
Autonomy	Break down barriers to work that was previously out of reach	Offer fake choices between poorly understood options, creating an illusion of control
Mastery	Help people tackle more complex tasks while focusing on foundational learning	Outsource challenges to AI, reducing effortful practice and foundational learning

Howard’s autonomy/mastery chart separated AI that augments craft from AI that weakens it.

The decay path is familiar to anyone who has used agentic coding tools without understanding the choices the agent is asking them to approve. Howard gave the example of an AI system offering a choice between architectural options — a distributed system, green threads, a polling loop — to a user who does not understand the tradeoffs and chooses arbitrarily. Psychologists call this an “illusion of control.” It feels like agency, but it is not autonomy in the sense self-determination theory cares about.

The mastery risk is similar. AI can help a person tackle more complex tasks while learning the underlying foundations, or it can invite them to outsource more of the challenge, more quickly, with less effortful practice and less learning. Howard’s warning was direct: AI itself is neither psychologically good nor bad, but the incentives around it are not neutral.

The people getting you to use AI don't care about your autonomy and mastery. They care about your outputs.

Jeremy Howard

He named the interested parties plainly: model vendors, platforms, harness providers, and managers seeking output metrics. Their marketing is rarely about helping engineers become more capable. It is more often about summarizing, writing, deciding, or generating on the user’s behalf. Howard’s counsel was that engineers will have to protect their own autonomy and mastery because the default commercial pressure will be toward the decay path.

The dangerous version of flow feels productive

Howard’s critique of AI coding depended on a specific distinction between real flow and “junk flow.” He drew on Mihaly Csikszentmihalyi’s work to define positive flow as the condition in which a person’s skills are adequate to the challenge in a goal-directed, rule-bound system with clear feedback about performance. Howard’s personal example was learning to ride a motorcycle fast around the Phillip Island circuit: the rules were clear, the feedback was immediate, the challenge was high, and the skill was real.

But Csikszentmihalyi also described darker versions of flow: experiences that look like flow at first but become addictive rather than growth-producing. Howard connected this to gambling environments, which are designed to give players an “illusion of control” and keep them engaged in a loop that no longer develops skill.

That is the frame Howard applied to some forms of AI coding. He pointed to Rachel Thomas’s article “Breaking the Spell of Vibe Coding,” which described “sinister variations” on the positive state of flow. The danger is not that the interaction is boring. It is that it is compelling. It can feel like flow while detaching the engineer from real feedback, real understanding, and real craft.

Howard then showed that this concern is not confined to AI skeptics. He cited public reflections by developers who had been enthusiastic users of coding agents and had begun describing the psychological pull of the tools with caution. Armin Ronacher, whom Howard identified as the creator of Flask, wrote about getting “hooked on Claude,” spending months excessively prompting it, building tools he did not use, and feeling the dopamine hit of agent-assisted work. In the quoted passage Howard showed, Ronacher wrote that the process can make everything feel productive and coherent while becoming “decoupled from any external validation.” As long as nobody looks under the hood, the project feels fine; once an outsider inspects it, it can look “pretty crazy.”

Howard paired that with another developer’s reflection on getting from zero to 95 percent quickly, then spending many more hours failing to reach done. The first version appeared fast. The remainder became uncertain and unpleasant: problems kept appearing, the number of undiscovered problems was unknown, and the developer wondered what had been gained besides “dopamine anticipation.”

A third example came from Howard’s own community. A member described working on a genuinely interesting product requiring domain expertise, but with the core challenge obscured by “200k lines” of code closer to vibe coding. The pace had slowed as models improved and token spend increased, because the team could generate more code with less careful engineering. Demos and documentation worked for simple cases; harder work did not. Debugging and evaluations became painful. The team felt progress day to day, but quarterly meetings forced reality checks: what had shipped, what real-world accuracy looked like, and how many clients had signed. The results, in the quoted account, were “not good.”

Howard did not treat these stories as proof that AI coding is useless. He treated them as evidence of a specific failure mode: a tool can increase the sensation of progress while making verification, understanding, and accountability harder. The “slot machine lever” version of flow is dangerous precisely because it does not feel like stagnation while it is happening.

That concern also appeared in the broader business context Howard referenced. He showed a collage of mainstream coverage from The Wall Street Journal, Business Insider, and Axios about token costs, “tokenmaxxing,” and corporate concern over AI spending and return on investment. Howard mentioned that Uber was imposing a stricter token budget because it was not seeing the ROI. His focus, however, stayed on the human side: the same practices that create questionable economic returns may also erode the conditions under which engineers learn and flourish.

The older computer tradition was augmentation, not replacement

Howard’s alternative to AI-as-outsourcing was not nostalgic hand-coding. It was an older and richer tradition of computing as intelligence amplification. He placed AI in a lineage that began long before current language models: Ivan Sutherland’s Sketchpad, Douglas Engelbart’s “Mother of All Demos,” Kenneth Iverson’s APL, Bret Victor’s interactive environments, and Chris Lattner’s work across LLVM, Clang, Swift, Playgrounds, MLIR, and Mojo.

In 1963, Sutherland demonstrated direct manipulation on a computer display with a light pen, drawing and constraining geometry on a vector monitor. Howard described it as “an extraordinary level of deep connection between the human and the computer.” The computer was not doing the work instead of Sutherland; it was creating a tighter medium for thought and craft.

Engelbart’s 1968 demonstration extended the same ambition. Howard listed what Engelbart introduced: the mouse, hypertext and hyperlinks, real-time collaborative editing, video conferencing, word processing, screen windowing, and dynamic file linking. The purpose, expressed in Engelbart’s 1962 language, was “intelligence amplification”: organizing human intellectual capabilities into “higher levels of synergistic structuring” so that the resulting human-tool system exhibited more intelligence than an unaided human could.

Iverson represented the same principle through notation. Howard showed Iverson’s Turing Award lecture, “Notation as a Tool of Thought,” and explained that APL’s notation let Iverson generalize mathematical and computational relationships more deeply than standard notation. The inner product in APL was not merely multiplication plus addition; it could combine arbitrary functions. Through notation, Howard said, Iverson could prove properties about whole classes of functions, including ones mathematicians had not separately examined. The point was not terseness for its own sake. It was that a better notation changes what a person can think.

Bret Victor’s demos gave Howard a more recent example of environments that make abstract systems explorable. Howard described Victor as having built “extraordinary and inspiring” real-world demos for understanding climate, electricity, games, graphics, and waveforms. In Victor’s environments, code and representation are connected; changing something can immediately change the visible system. Howard’s interpretation was that Victor’s work supports “effortful craft.” It does not remove the need to understand; it makes understanding more direct.

Chris Lattner’s work, including Swift Playgrounds, sat in the same lineage for Howard. Across compiler infrastructure, languages, playgrounds, and newer systems, Howard saw an effort to improve the human-computer connection.

The lineage gives AI engineers a different design target. The question is not whether AI can generate more artifacts. The question is whether it helps people confer with computers “more intuitively and more deeply,” as Howard’s slide put it. He argued that AI can continue the chain from Sketchpad and Engelbart if it is designed to augment human creativity rather than replace it.

That principle is also where Howard placed his own work at Answer.AI. He said the mission had been his focus for years and was now central to the company: seek to augment human creativity, not replace it. He contrasted that with the usual pitch for AI products, which is about having something summarized, written, or otherwise done for the user.

Howard’s SolveIt examples kept the human inside the loop

Howard’s demonstration of SolveIt was meant to show what augmentation can look like in practice. He presented it as a system that keeps the human in contact with the material while making it easier to ask questions, test hypotheses, run code, and build artifacts inside the learning process.

The first example was a paper on recursive language models. Howard loaded the paper into SolveIt and read it section by section. When he reached a figure he did not understand, he asked the system to explain it. SolveIt identified the tasks in the figure — S-NIAH, OOLONG, and OOLONG-Pairs — and described them as a progression from constant to linear to quadratic complexity. Howard emphasized that this kind of clarification helped him avoid the common habit of skipping over confusing figures or citations.

The interaction became active rather than merely explanatory. Howard said he works poorly at an abstract level and needs concrete examples, so he asked for examples of each task and used the system to clarify another figure. He then pushed beyond reading comprehension into replication. His hypothesis was that a recursive language model might be reproducible as a tool loop with the right tools. SolveIt clarified where sub-agents fit; because it had a sub-agent tool, Howard asked it to use Python to compute a complex square root. He could then run code himself, compare results, and confirm that his environment could perform the relevant steps.

He also used SolveIt to inspect related work. When the paper cited recent work such as RULER, Context Rot, and OOLONG-Pairs, Howard stopped rather than skipping the references. He asked SolveIt to read the cited papers and explain what they were and why they mattered. From there, he tested the recursive-language-model idea against evaluation tasks. He selected a CodeQA task, had SolveIt locate data, debugged failed code, inspected the dataset, and asked the system to solve examples. It returned answers; he checked them, asked how the system reached them, and repeated the process on other tasks, including harder quadratic ones. Howard said that in the examples he tried, SolveIt correctly solved every task. He described the result as having not only understood recursive language models but reimplemented the relevant behavior in a couple of hours, and concluded that the platform he was using was more powerful than the RLM setup in the paper.

His second example was Julia Evans’s article “Moving away from Tailwind, and learning to structure my CSS.” Howard loaded the article into SolveIt and read it interactively, asking questions as he went. When Evans mentioned Tailwind Preflight, he asked what alternatives existed. Because SolveIt runs in a browser context, he could work with actual HTML and styles while discussing them. He experimented with CSS layers, rebuilt parts of Evans’s styling approach, and then created components such as badges and buttons.

The CSS example made the distinction between assistance and replacement visible. Howard formed ideas, tested them, viewed rendered output, and refined the result. With colors, he developed a new palette, mapped colors to semantic roles such as “danger,” viewed foreground and background combinations, and generated swatches. With typography, he created and inspected font scales. The AI assisted the work, but the work remained grounded in Howard’s judgment, experimentation, and comprehension.

The examples made Howard’s abstract claim concrete. A tool that supports autonomy and mastery does not merely produce output. It creates a tighter loop between question, explanation, code, execution, visual feedback, and revised understanding. It lets the user take on a more complex task while remaining responsible for sense-making.

Vella’s study found the craft moving from creation to verification

Annie Vella approached the same broad shift empirically. Her question was whether AI’s impact on software engineering is an evolution in the long history of computing or a genuine revolution. A show of hands in the room produced a mixed result, with more people leaning toward “revolution,” but Vella stressed that nobody really knows where the change is going.

That uncertainty took her back to university in early 2024. She completed a part-time Master’s of Engineering and designed a longitudinal study of professional software engineers using AI at work or at home. The study used two questionnaires spaced six months apart, run at the end of 2024 and beginning of 2025, with participants from 28 countries. Vella shared four findings from that research.

The first was a “creation to verification shift.” Vella wanted to know whether AI coding assistants change engineers’ perceived focus across common development tasks such as writing, refactoring, designing, reviewing, testing, and debugging. Most engineers reported spending less time on all but one of these tasks. Reviewing code was the exception, increasing slightly. Across the six-month period, the study found a statistically significant shift away from creation-focused tasks and toward verification-focused tasks.

countries represented among participants in Vella’s longitudinal study of professional software engineers using AI

For Vella, that meant the nature of software engineering craft is changing. The work is not disappearing, and engineers are not suddenly free at midday. Instead, new work appears around the AI-assisted development process. She gave it a name: supervisory engineering work.

Vella described this as a new “middle loop” between the traditional inner loop and outer loop. The inner loop is increasingly automated by AI. The outer loop remains integration and delivery. The middle loop consists of directing AI, evaluating its output, and integrating the results into a coherent engineering process. The dimensions of craft do not vanish; they relocate into a new kind of work.

That relocation changes what verification means. If AI accelerates creation, then judgment, testing, debugging, review, and integration become more central to mastery rather than peripheral chores. The engineer’s responsibility shifts from producing every line to supervising systems that produce candidate work. But supervision still requires domain understanding, technical taste, and the ability to tell whether an output is correct, maintainable, and fit for purpose.

Engineers feel more productive while developer experience deteriorates

Vella’s second finding was a paradox. Before AI, productivity and developer experience were commonly assumed to move together. Leaders were told that improving developer experience improves productivity. Vella’s study found that AI may decouple the two.

She asked engineers whether they felt more productive with AI. At both study points, 84% said yes. That result was stable. But the same engineers increasingly reported declines in developer experience, which Vella measured across cognitive load, flow state, and feedback loops.

At the first time point, 14% said at least one of those three dimensions had worsened. Six months later, that had almost doubled to 27%. Flow state was the most negatively affected, followed by rising cognitive load. Feedback loops, however, improved.

Measure	First questionnaire	Second questionnaire
Engineers reporting higher productivity with AI	84%	84%
Engineers reporting decline in at least one developer-experience dimension	14%	27%
Most negatively affected dimension	Flow state	Flow state
Dimension that improved	Feedback loops	Feedback loops

Vella’s study found stable perceived productivity gains but a growing deterioration in developer-experience measures.

Vella’s interpretation was that the improved feedback may itself interrupt flow. AI gives more frequent responses, suggestions, corrections, and possibilities. That can be useful, but it can also fragment attention. Engineers may feel more productive while the things that made the work feel like a craft begin to erode.

This finding sits near Howard’s distinction between positive flow and junk flow, though Vella’s evidence came from her own study. More feedback is not automatically better work. A tool can shorten loops while increasing cognitive load, and it can create the feeling of movement while making sustained concentration harder. The issue is not whether engineers feel busy or even productive; it is whether the new loop remains sustainable.

Vella linked this to emerging language such as “AI burnout” and mentioned Steve Yegge’s blog post “The AI Vampire” as another discussion of the phenomenon. Her message to engineers was that if they are feeling this pattern, they are not imagining it. Her message to leaders was sharper: measuring productivity alone is insufficient. If productivity appears to rise while cognitive load increases and flow deteriorates, leaders may be missing the cost that determines whether the work remains healthy.

Vella’s third finding offered more agency. She expected demographics to matter: seniority, company size, perhaps tools or work setting. Instead, the strongest predictor of productivity and developer-experience outcomes was self-efficacy: a person’s belief in their own ability to accomplish something.

Engineers who felt more confident with AI-assisted software work were more than 10 times more likely to report higher productivity gains. Vella emphasized that this matters because self-efficacy is a belief, and beliefs can change more readily than demographics or job circumstances. Engineers do not have to wait for the right title, employer, or perfect tool. They can build confidence through mastery experiences, and mastery experiences can be gained through experimentation.

Her loop was simple: experiment, learn, gain confidence, adapt. The more engineers experiment, the more they learn; the more they learn, the more confidence they build; the more confidence they build, the better they can adapt.

The future roles split between artisanal, orchestrator, and clerical work

Vella’s final framework described where software engineering work may be going. Inspired by “The AI-Native Developer” by Chaudhuri et al., she presented three possible futures: artisanal, orchestrator, and clerical.

The artisanal developer is the craftsperson writing software more directly by hand. Vella expects this work to remain valuable, especially in safety-critical or heavily regulated domains, but to become relatively rare.

The clerical coder is the role she warned against. In that future, agents do much of the work overnight and humans arrive in the morning to accept pull requests uncritically. Vella described this as the path with the least joy. It is supervision stripped of craft: approving generated output without creativity, ownership, or meaningful judgment.

The center of gravity, in her view, is the orchestrator or “code conductor”: someone building software with agents at scale. Within that role, she saw at least two directions. Engineers closer to the domain may focus on understanding the problem deeply, capturing the “what” and “why” in strong specifications, and feeding those to agents. Engineers closer to the platform may focus on the harness: “building the machine that builds the machine.”

Her advice followed directly. Engineers should be curious, experiment, and take control of their craft. Leaders should name what is possible and create pathways toward it. That means opening opportunities, blurring boundaries, and helping people move into the kinds of work where pride and joy can survive the transition.

Vella ended on a deliberately uncertain note. Nobody has this figured out — not top practitioners, not technology leaders, not academics. She said she had been in rooms with all of them during the year, and all had more questions than answers. Her conclusion was not that engineers should wait for clarity. It was that they should treat the future of their craft as something to design.

Pride and joy, they don't just happen by accident. They're outcomes that you can design into your system.

Annie Vella

Neale wants idle edge devices to become an inference mesh

Mic Neale shifted from the craft of AI-assisted work to the physical and economic constraints behind AI inference. His premise was that compute is “all around us” — in phones, laptops, desktops, and other consumer devices — and much of it sits idle. At the same time, inference is becoming expensive, resource-heavy, and subject to anxiety over new data-center builds and token spending.

Neale’s project, mesh-llm, asks whether that idle compute can be pooled for AI workloads. The architecture he showed treated clients and peers as devices that can host a model, share resources, and perform inference. A mesh-llm server runs inference, but the broader design points toward a network in which ordinary devices can participate rather than all inference being centralized in hyperscale data centers.

The argument was not that a phone is equivalent to a data-center GPU. Neale explicitly cautioned that the comparison is not “apples with apples.” His slide was meant to show the scale of latent consumer hardware, not to claim identical performance characteristics across device classes. The figures were presented on Neale’s slide as peak inference-throughput estimates based on industry reports, company public data, and discussions from 2024–2025.

Fleet	Installed base or active devices	Peak inference throughput shown
Data-center AI fleet	~8M H100/H200/B200-class AI GPUs	~10–15 ExaOps
AI smartphones	~1.4B active use	~20 ExaOps
AI PCs with NPUs	~200–300M active use	~10 ExaOps
Consumer edge fleet total	Phones, AI PCs, Apple devices	~30+ ExaOps

Neale’s slide compared centralized AI accelerators with the latent peak compute of consumer edge devices; he cautioned that this was not an apples-to-apples comparison.

On those assumptions, the global installed base of H100/H200/B200-class AI GPUs was roughly 8 million units with peak inference throughput around 10–15 ExaOps. The consumer edge fleet — AI smartphones, AI PCs with NPUs, and Apple devices — was shown at roughly 30+ ExaOps of peak inference throughput. The slide’s conclusion was that consumer edge devices deliver roughly two to three times the inference compute of the data-center AI fleet on a peak-operations basis. Neale’s point was the scale of the unused pool, not an audited claim that the edge fleet can straightforwardly replace data-center infrastructure.

Neale gave four reasons to care: sovereignty, zero marginal cost, efficiency, and modal choice. Sovereignty, as he used the term, means optionality over where computation and data run. In an Australian context, he described the dependence implied by sending data through fiber routes out of Bondi Beach toward California. Local or national compute capacity changes those choices.

Zero marginal cost changes behavior too. If inference can run on hardware already owned and idle, the usage model differs from metered cloud tokens. That linked Neale’s talk back to the broader “tokenmaxxing” anxiety elsewhere in the source: if every agentic action burns paid tokens, experimentation and deployment patterns are constrained by invoices. If some inference runs on already-owned local devices, different designs become possible.

Neale said mesh-llm had already succeeded in running an approximately 80B-class model — “what used to be frontier” — on commodity hardware over Ethernet with good throughput, leaving GPU and CPU capacity spare when machines were not needed. He also described a “Mesh Mixture of Agents” that could span internet scale.

The public mesh, meshllm.cloud, was presented as “demo-ware.” Users can discover it via nostr, set up a private mesh without publishing, invite friends or family with a single token to pool compute, or join an existing public or private mesh as a client or server. The result is an OpenAI- or Anthropic-compatible virtual LLM that can be used by an agent or chat application. Neale stressed that the public mesh has no financial incentives, no fair-use mechanism yet, and no money behind it. It is a proof point, not a token scheme.

Technically, Neale said the project uses QUIC, address discovery through relays with no shared information in the clear, and upgrades to direct UDP for lower latency where possible. It can optionally publish via nostr using NIP-89 Kind 31990. The closing invitation was pragmatic: join the project and help “stop waste.”

Inference and Deployment Agents and Autonomy AI Infrastructure and Compute AI Economics and Labor Human-AI Interaction Coding Assistants