Codex Moves Builder Work From Coding to Specification

Matias Castello · OpenAI · Friday, May 29, 2026 · 11 min read

Matias Castello, product lead at Alchemy, argues that Codex is shifting software work from writing code toward specifying intent, constraints and preferences clearly enough for an agent to act. In a conversation with OpenAI’s Romain Huet, Castello describes using Codex for code review, product documents, backlog creation, feature experiments and personal projects, with human judgment reserved for deciding what should ship. His central claim is that the limiting factor is increasingly not implementation capacity but how well builders can communicate what they want.

The work is moving from execution into specification

Matias Castello’s use of Codex is not centered on code completion. He described a broader shift: more of the builder’s work moves into specifying intent, preferences, constraints, and working assumptions, while Codex handles planning, task creation, implementation, review, and follow-up experiments.

That shift is especially visible because Castello is not an engineer by training. He said his ability to build comes from being interested in many things, trying them over time, and improving as the models improve. Looking back at founding a company six or seven years ago, he described the earlier path for a non-engineer as much heavier: build a prototype, often by “copy-pasting code left and right until something worked,” raise money, hire three or four engineers, and spend months getting to an MVP.

Romain Huet made the same comparison from his own startup experience. He said it took roughly 15 engineers and about a year and a half to reach the first shipped V1 milestone for many customers. With current AI tools, he guessed a first version might take him less than a week by himself.

Castello’s point was not just that prototypes are faster. It was that anyone with an idea can now test it, provided they can communicate enough of what they want. He described a period of anxiety in which he felt he should be building constantly because the tools made so much possible. His response was to create a workflow that lets Codex work for hours while he is doing something else.

That workflow depends on reusable skills and up-front clarification. For a new idea, he can ask Codex to create a plan, implement it, test it, and report back. For an existing product, he can ask Codex to research competitors, infer possible features from his goals and preferences, and build the strongest candidates as modular experiments. The aim is to avoid becoming the bottleneck for either implementation or ideation while keeping final product judgment with the human.

I can prompt Codex before going to bed to go do some research about an app that I have, tell it to build the top 10 features it can come up with, build them as experiments in the product, wake up to 10 feature flags that I can toggle on and off.

Matias Castello

Castello described dispatching work in short “Codex moments”: during lunch, before returning to work, before going out on weekends, and before bed. The leverage comes less from one-off prompting than from investing in repeatable processes that let Codex continue while he is away from the computer.

Alchemy adopted Codex when review caught a real bug

At Alchemy, Matias Castello said the first internal use of Codex was modest: editing developer documentation from Slack. Instead of running the docs site locally and going through the full editing process, employees could mention Codex in the company docs channel and ask it to make small changes. Alchemy still uses that pattern, though Castello said the setup is now more sophisticated.

The larger turning point was code review. Alchemy had a small incident tied to a large migration from months earlier. In the postmortem, the team identified and fixed a race condition. Someone then suggested running Codex code review retroactively to see whether it would have caught the bug. Castello said it did.

That mattered because it tested the tool against a failure the team already understood. Castello said the team repeated the exercise a few times as curiosity grew. Soon after, he saw an engineer using Codex review inside a pull request almost as a teammate: submit the PR, ask “@Codex review,” address the comments, ask again, and iterate. For Castello, that behavior showed engineers getting past the assumption that LLMs were not good enough for a professional setting with a large, complex codebase and production-quality requirements.

Romain Huet said he had seen the same adoption wedge at other companies: code review was often where teams realized past incidents might have been caught automatically. He cited Datadog as having said in January that more than one incident out of five could have been prevented by Codex, and speculated that with GPT-5.5 the figure could now be “more than half.” Castello suggested “9 out of 10 maybe,” which Huet accepted as possibly “most of them.”

Castello also said Codex is now used for product-management work at Alchemy: writing documents and PRDs, analyzing customer feedback, and reusing internal skills created for those workflows. Those skills live in a shared company repository across functions, so the benefit is not limited to PMs. In his framing, skills make product work faster and more consistent, while also allowing people outside the formal product role to perform parts of the workflow.

The platform implications are broader for Alchemy. Huet framed developers not only as humans consuming APIs and infrastructure but also as agents that need to integrate with those systems. Castello separated the problem into two cases. First, human developers increasingly build with AI, so Alchemy assumes “100% of developers are building software with the help of AI.” Second, a more nascent class of autonomous agents may show up, make implementation decisions, sign up, integrate tools, and use the blockchain to perform tasks. Castello said those two groups need different tooling today, though he expects their needs to converge over time.

The operating model is skills, Linear, and fewer assumptions

Matias Castello demonstrated the operating model through a personal project called poly, a macOS menu-bar assistant for rewriting outbound text. He built it first for his own use at work, later extending the same idea into an iOS app with a keyboard extension. The app uses Codex App Server and, in his setup, his ChatGPT subscription behind the scenes; Castello said it has a model picker and that he had recently added GPT-4o.

Poly opens with a global shortcut and rewrites rough dictated text into a selected style. In the demonstration, a casual sentence — “Hey, I’m not totally sure I understand what you’re saying. Maybe we should just get together and figure it out or something like that” — became: “Hi, I’m not sure I fully understand your point. It may be best for us to meet and work through it together.”

The more consequential layer was not the rewrite interface but the project system behind it. In Linear, poly appeared among a dozen projects in progress. Castello said the project had 159 completed issues, with others in review and in progress, and that he had not written any of them. Codex had created them.

His role was to describe what he wanted through skills he had written for himself. Codex turned the idea into a plan, milestones, and tasks. The visible milestones included “Foundation & App Shell,” “Preferences, Persistence & Modes,” “History & UX Polish,” and “Invocation & Context Capture.”

The key artifact is an agents.md file that summarizes how Castello likes to work. He said he spent time concentrating his preferences into that file so he can initialize new projects with the same operating assumptions. From there, a skill creates the plan, and he can tell Codex to build it.

Huet noted that Linear had become the interface for backlog, milestones, and releases, while Codex both wrote issues and picked them up to close them. Castello agreed: the delegation covers project management as well as execution.

The reason this works, in Castello’s view, is that it reduces the model’s need to guess. When an LLM produces something surprising in a bad way, he said, it is usually because it had to make one or more assumptions and did not make them the way the user would have. His process tries to clarify those assumptions before the build starts.

Feature flags preserve human product judgment

After a first version exists, Matias Castello uses Codex to generate and test new features without handing over the final product decision. His pattern is to ask Codex to research an area, inspect competitors, propose features, score or prioritize them, and build multiple options as experiments behind feature flags.

In poly, the experimental flags shown on screen included quick rewrite actions, advanced quick actions, rewrite alternatives, output language, audience target, output format, readability target, and protected phrases. Castello said those features were built overnight while he was sleeping. Before bed, he asked Codex to research AI writing assistants, identify interesting or impactful features, and implement them in a modular way so he could toggle each one independently in the morning.

Romain Huet emphasized that Codex was being used before implementation, during research and product definition. Castello said Codex is better than he is at research because it does not get bored. The condition is that the user must specify what to research, how findings should be structured, and what should happen next. In Castello’s setup, those instructions are encoded into skills, so Codex can execute the sequence in one run.

The resulting features were small but concrete. Turning on output language exposed translation choices such as original language, English, French, Spanish, German, Italian, Dutch, and Polish. Turning on output format exposed choices such as message, bulleted, email, and Slack style. Castello’s point was not that each feature was difficult in isolation. It was that Codex could research, propose, build, and isolate many of them so that his job became evaluation rather than manual production.
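As a rough illustration of the pattern described here, the sketch below models experiments as independently toggleable flags that stay off until a human turns them on. It is not poly's actual code: the `FeatureFlags` class, the `visible_options` helper, and the flag identifiers are hypothetical names chosen for this example, and only the option lists are taken from the choices mentioned above.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureFlags:
    """Registry of independently toggleable experiments (hypothetical design)."""

    enabled: dict[str, bool] = field(default_factory=dict)

    def register(self, name: str, default: bool = False) -> None:
        # New experiments start switched off so the human opts in explicitly.
        self.enabled.setdefault(name, default)

    def toggle(self, name: str, on: bool) -> None:
        if name not in self.enabled:
            raise KeyError(f"unknown flag: {name}")
        self.enabled[name] = on

    def is_on(self, name: str) -> bool:
        return self.enabled.get(name, False)


# Option menus each experiment exposes once enabled (flag names are assumed).
EXPERIMENT_OPTIONS = {
    "output_language": [
        "original", "English", "French", "Spanish",
        "German", "Italian", "Dutch", "Polish",
    ],
    "output_format": ["message", "bulleted", "email", "Slack"],
}


def visible_options(flags: FeatureFlags) -> dict[str, list[str]]:
    """Return only the option menus whose flag is currently on."""
    return {
        name: options
        for name, options in EXPERIMENT_OPTIONS.items()
        if flags.is_on(name)
    }


flags = FeatureFlags()
for flag_name in EXPERIMENT_OPTIONS:
    flags.register(flag_name)

# Evaluate one experiment at a time by switching on a single flag.
flags.toggle("output_format", True)
print(visible_options(flags))
```

Keeping each experiment behind its own flag is what makes the morning review cheap: a feature that does not earn its place is switched off rather than unpicked from the codebase.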

Codex App Server turns small devices into job launchers

Matias Castello also built interfaces around Codex App Server for starting coding jobs away from a desktop. One example used OpenClaw, where his assistant is named Lou. Castello said Lou has two jobs: making jokes in group chats with friends and helping him code.

At home, Lou runs on a dedicated machine with access to Codex. Castello connected OpenClaw, Discord, and Codex so that specific Discord channels map to specific repositories. If he wants to trigger work from his phone while walking around, he can talk to Lou in the relevant Discord channel and have Codex work on that repository.
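The channel-to-repository mapping can be pictured as a simple lookup table. The sketch below is an assumption about how such routing might look, not a description of Castello's setup or of any OpenClaw, Discord, or Codex API: `CHANNEL_REPOS`, `CodingJob`, `route_message`, and the repository names are all hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical table: one chat channel per repository (placeholder names).
CHANNEL_REPOS = {
    "poly": "example-user/poly",
    "snapcat": "example-user/snapcat",
}


@dataclass(frozen=True)
class CodingJob:
    """A unit of work to hand to the coding agent (illustrative only)."""

    repo: str
    instruction: str


def route_message(channel: str, text: str) -> Optional[CodingJob]:
    """Turn a chat message into a job for the repository mapped to its channel."""
    repo = CHANNEL_REPOS.get(channel)
    instruction = text.strip()
    if repo is None or not instruction:
        # Unmapped channel or empty message: nothing to dispatch.
        return None
    return CodingJob(repo=repo, instruction=instruction)


job = route_message("poly", "  Add a readability target option  ")
print(job)
```

Binding the repository to the channel means the spoken or typed request only has to describe the change, not where it belongs.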

He also built an Apple Watch flow: a complication that records a short voice memo and turns it into a Codex job. His example was a small one-shot task: “Create a test.md file in my skills repo.” The memo went to an iPhone app, which transcribed it, inferred intent, routed it to the correct place, understood that “skills repo” referred to his personal skills repository because it had GitHub context, and ran the task. A few minutes later, a pull request arrived adding the file.

The app could run with a typical OpenAI API key or through what its settings called a backend session, which Castello described as Codex App Server running elsewhere from his account, connected to GitHub with his context. He said this is not useful for everything and works best for small tasks. But he framed it as a step toward making implementation “an implementation detail”: speak to a watch for 10 seconds and have the system route, execute, and open the pull request.

Romain Huet tied that use case to OpenAI’s decision to open source the Codex CLI, Codex Harness, and Codex App Server. He said the purpose was to foster an ecosystem where builders could extend Codex into workflows OpenAI had not shipped. Castello’s response was that OpenAI should build the Apple Watch complication and iPhone home-screen entry point so he would not have to.

Snapcat is a personal eval for model progress

Matias Castello described two recent surprises. The first was computer use. While working at home on a Raspberry Pi, he needed to copy and paste a list of domain URLs into an admin panel. He asked Codex to SSH into the Raspberry Pi, find the URLs in a file, connect through his browser, and perform the admin-panel task. He said it completed the work. The irony, he added, was that he tried it to save time but spent the whole time watching because he was impressed by how it opened the right things and pasted the right values.

The second surprise was GPT-5.5 on Snapcat, a hackathon project Castello has had for more than 10 years and now uses as a personal model eval. Snapcat is an app for cats to take selfies. It shows a black screen with a moving red laser dot. When the cat taps at the dot, the app takes a photo with the front-facing camera. The owner exits cat mode with the volume buttons, which cats cannot easily access, and then reviews the images in a gallery.

Castello said the original hackathon version took more than a full day and a team of five people. With GPT-5.5, he said, one well-crafted prompt and a couple of skills could build the whole thing in one go.

The UI workflow also changed. Castello described the style he wanted — light, colorful, playful, appropriate to the theme — generated an image of that UI, and told Codex to build it. After seeing the camera-roll view, he asked it to redesign the settings page in the same style; it generated another image and implemented it. It did the same for the home screen. The resulting settings shown on screen included laser speed, movement style, tap feedback, and tap sound, with playful copy such as “Quick jumps for laser pros” and “Cats are not surprised.”

Huet summarized the ritual: every time a new model comes out, Castello rebuilds Snapcat. Castello said he judges “how easy it is for cats to take selfies.” Asked whether the GPT-5.5 version was the best so far, he said it was “the best by far,” much better than an attempt a few months earlier, and “infinitely better” than what the five-person team built in 24 hours. He said he had made it in one shot the previous night.

Assume it is possible before blaming the tool

Matias Castello’s advice for builders rested on three assumptions.

First, assume it is possible. If someone has an idea, he said, more often than not it can probably be done.

Second, assume you can do it. Castello called this more personal: many people could be building, but stop because they think they cannot. His advice was to assume they can, and “maybe be in denial even that you can’t do it.”

Third, when an LLM fails to produce the desired result, assume first that the communication failed rather than that the tool is incapable. Castello said this requires putting ego aside, but his experience has been that repeated attempts, clearer instructions, and better-encoded preferences are what improve outcomes.

When you try to get something out of an LLM and it doesn't work, this one's a bit difficult, you have to put your ego aside, assume it's your fault. Don't assume the tool isn't capable.

Matias Castello

Romain Huet’s final formulation was that “anyone can just build anything.” Castello agreed, while noting that it sounds cheesy. “It’s true,” he said.
