AI-Native Startups Are Replacing Teams With Agentic Operating Systems

Diana Hu · Stanford Online · Wednesday, May 20, 2026 · 17 min read

In a Stanford CS153 Frontier Systems lecture, Y Combinator CEO Garry Tan and general partner Diana Hu argue that AI agents are changing the basic production unit of a startup from a team to a founder operating through skills, memory, evals and customer feedback loops. Tan frames agentic coding as a programmable company architecture, while Hu says AI-native companies are becoming closed-loop systems with far higher revenue per employee and less need for traditional managerial coordination.

The unit of production is no longer the team

Frontier progress is not blocked only by chips, power, or compute. The capital and company-formation layer is also a systems problem: standards and institutions can remove bottlenecks just as technical infrastructure can. The opening framing compares Y Combinator’s SAFE to earlier standardization moments in infrastructure. Electrical standards and utility coordination made electricity a dependable platform; YC’s simple agreement for future equity made early-stage financing more scalable by replacing a fragmented seed-financing environment with a shared instrument founders and investors could use.

Garry Tan accepts that analogy but shifts the mechanism from legal documents to code. The SAFE was a legal instrument. The next standard, in Tan’s telling, is closer to “markdown as code”: structured instructions, agent skills, resolvers, memory systems, and evals that let one founder operate with the output of a much larger organization.

The practical thesis is not simply that AI makes coding faster. Tan and Diana Hu argue that company operations themselves are becoming programmable: workflows can be captured, routed, tested, remembered, evaluated, and improved through closed loops. The founder’s job becomes less like personally doing every task and more like designing a reliable agentic operating system around customer work.

Tan’s core claim is that AI changes the “unit of production.” The default startup unit used to be a team. The new unit is “human + agents + memory + evals + customer loop.” Humans and teams still matter, he says, but they are no longer the whole production function. Tan tells the students that their generation is going to “create the cognitive layer for all of society.”

Hu states the company-level implication more directly: YC is seeing portfolio companies go from zero to tens of millions of dollars in revenue in a year, a level of traction she says previously would have taken four or five years and Series B-scale progress. Tan adds that the old path would also have required far more capital. Their shared claim is not that AI makes companies effortless; it is that the leverage profile has changed enough that the organizational model has to change with it.

Tan illustrates the compression with his own old startup. In 2008, Posterous raised about $4 million, hired roughly 10 people, built a simple blogging platform over about two years, and sold to Twitter three years later for $20 million. Tan says that with a $200-per-month Claude Code Max plan, he recently recreated the software in about five days. The point is not merely that code generation is faster. It is that the same founder, working with agents, memory, and process, can now do work that once required a funded team.

5 days

Tan’s stated time to rebuild the software his old startup had built over two years

The contrast presented to students is blunt: in 2010, a 10-person startup felt “impossibly lean”; in 2026, Tan and Hu argue, a six-person team can reach $10 million in revenue when each person is amplified by agents, memory, evals, and customer loops.

The 1000x engineer claim is bounded by production discipline

Tan’s path into this argument began with a claim from Steve Yegge: “People using AI coding agents are 10x to 100x as productive as engineers using Cursor and chat today, and roughly 1000x as productive as Googlers were back in 2005.” Tan says he saw that claim, opened Claude Code, and ended up writing around a million lines of code. But he also spends time distinguishing the serious version of the claim from the unserious one.

The standard objections to AI coding are explicitly treated as reasonable. LLMs generate verbose code; “demo” is not the same as production; hallucinations can produce plausible-looking false outputs; lines of code are a poor metric. Tan agrees that each objection is “partially true.” His response is not that the objections are wrong. It is that an AI-native software factory must be built specifically to prevent those failure modes.

The crucial distinction is between a quick demo and a production system. Tan says production requires tests, reviews, and process. He uses a GStack skill called plan-eng-review around 20 times a day, with the goal of reaching 80% to 90% test coverage before shipping. Testing is the boundary he draws between agentic speed and slop.

Tan presents GStack as evidence that the work is not only a demo. He says GStack was first a toolkit he built for himself and then open-sourced. The figures he presents for GStack include 87,000 GitHub stars in two months, 14,965 unique installations with opt-in telemetry, 305,309 skill invocations since January, a 95.2% success rate across skill runs, 7,000 weekly active users, and 27,157 real-browser Playwright sessions. Tan separately says GBrain has 13,000 stars and describes his post-December open-source work as totaling more than 100,000 GitHub stars. He also says about 15,000 people use the work every day; that spoken figure is a different measure from the installation and weekly-active-user counts and should not be read as the same number.

Project or metric	Reported value
GStack GitHub stars	87,000 in two months
GBrain GitHub stars	13,000
Unique GStack installations	14,965 with opt-in telemetry; Tan says the real number is 2x+
Skill invocations	305,309 since January
Success rate across skill runs	95.2%
Weekly active users	7,000
Real-browser sessions	27,157 Playwright sessions
Tan’s separate spoken usage claim	About 15,000 people use it every day

Adoption and usage figures Tan presents for his GStack and GBrain projects

Tan’s deeper claim is that the interface has shifted from “copilot” to “software factory.” The old mode was one assistant, one prompt at a time, helping a human write code. In that model, typing remains the bottleneck and quality is often whatever the first answer happened to be. The new mode, as he describes it, is orchestration: product, engineering, design, security, QA, release, and retrospectives become roles that can be invoked as specialist skills.

That is why GStack is organized around commands rather than a single omniscient assistant. A sprint loop maps to commands: /office-hours asks what should be built; /plan-ceo-review asks how big the opportunity could be; /plan-design-review focuses on making the user happy; /plan-eng-review on building it well; /review, /qa, /ship, and /retro close the loop. Tan says the office-hours skill is a distillation of YC partner conversations: questions about the problem, the customer, how the founder knows, and what should be built.

The skill file does not contain magic. Its value is that a repeatable role can be made executable. Tan says /plan-ceo-review asks for the “10x version” or “platonic ideal” of the product, then keeps the current work on a straight-line roadmap toward that ideal. In the old organizational setting, that would be product management judgment in a meeting. In the new setting, it becomes an invocable process.

Tan’s most provocative phrase is that founders should “boil the ocean.” In normal company language, that phrase is a warning against overreach. Tan argues that agentic leverage changes the baseline: if a person at a terminal can do the work of hundreds or a thousand people, then many inherited assumptions about company scope are “a thousand x wrong.” He notes that even Claude Code’s own estimates often lag the new reality: it may say a task will take three weeks, then complete it in an hour once the user approves the plan.

Skills are latent; code is deterministic

Tan’s technical architecture begins with a boundary: skills are for latent work, code is for deterministic work. He says agentic systems break when that boundary is violated — when deterministic tasks are left to markdown instructions, or when judgment-heavy tasks are forced into code.

Latent work, in his definition, includes reading and interpreting, synthesis across documents, pattern recognition, calls of judgment, and holding contradictions in mind. Deterministic work is where trust lives: SQL queries, compiled code, arithmetic, timestamps, math, combinatorial optimization, and any same-input, same-output process.

Tan’s example is seating people at a dinner. An LLM can take biographies for eight guests, search or infer context, and reason about who should sit next to whom. But at the scale of an 800-person dinner party, or 6,000 people attending Startup School, the model alone is not enough. It will hallucinate plausible-looking outputs. The correct design, he says, is to combine latent interpretation with deterministic optimization.
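
One way to picture that split, as a minimal sketch rather than Tan's implementation: assume the latent step (a model reading the biographies) has already produced pairwise affinity scores, and let deterministic code search the arrangements. The names `affinity`, `table_score`, and `best_round_table`, the guest names, and the scores are all hypothetical. The exhaustive search only suits one small table; an 800-person event would need a proper combinatorial optimizer.

```python
from itertools import permutations

def table_score(order, affinity):
    """Sum the affinity of every pair of neighbours at a round table."""
    n = len(order)
    total = 0
    for i in range(n):
        a, b = order[i], order[(i + 1) % n]
        total += affinity.get(frozenset((a, b)), 0)
    return total

def best_round_table(guests, affinity):
    """Try every seating; the first guest is fixed so rotations are not repeated."""
    first, rest = guests[0], guests[1:]
    best_order, best = None, float("-inf")
    for perm in permutations(rest):
        order = (first, *perm)
        score = table_score(order, affinity)
        if score > best:
            best_order, best = order, score
    return list(best_order), best

# Assumed output of the latent step: higher means "seat these two together".
affinity = {
    frozenset(("Ana", "Ben")): 3,
    frozenset(("Ben", "Cy")): 2,
    frozenset(("Cy", "Dee")): 3,
    frozenset(("Dee", "Ana")): 1,
    frozenset(("Ana", "Cy")): -2,
    frozenset(("Ben", "Dee")): -1,
}
order, score = best_round_table(["Ana", "Ben", "Cy", "Dee"], affinity)
```

The model supplies judgment about who gets along; the search guarantees that the same scores always yield the same seating.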

A skill file is a runbook. Tan uses /investigate as a recipe with parameters for target, question, and dataset, then steps: scope the dataset, build a timeline, diarize every document, synthesize, argue both sides, cite sources, and write the brief. Critics may dismiss this as “just markdown,” Tan says, but LLMs make markdown instructions operational in a new way: a human-readable process can call tools and code, and the agent can execute the sequence.
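
To make the "recipe with parameters" idea concrete, the /investigate runbook could be modelled roughly as below. This is an illustrative sketch only: Tan's skills are markdown files whose exact layout is not shown in the lecture, and the `Runbook` class and its rendered format are assumptions made for this example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Runbook:
    """A named procedure: required parameters plus ordered steps."""
    name: str
    params: tuple
    steps: tuple

    def render(self, **values):
        """Fill in the parameters and return the text handed to an agent."""
        missing = [p for p in self.params if p not in values]
        if missing:
            raise ValueError(f"missing parameters: {missing}")
        lines = [f"# /{self.name}"]
        lines += [f"- {p}: {values[p]}" for p in self.params]
        lines += [f"{i}. {step}" for i, step in enumerate(self.steps, 1)]
        return "\n".join(lines)

# Parameters and steps as listed in the lecture; everything else is hypothetical.
investigate = Runbook(
    name="investigate",
    params=("target", "question", "dataset"),
    steps=(
        "Scope the dataset",
        "Build a timeline",
        "Diarize every document",
        "Synthesize",
        "Argue both sides",
        "Cite sources",
        "Write the brief",
    ),
)
```

Calling `investigate.render(target=..., question=..., dataset=...)` yields a plain-text procedure; the refusal to render with a missing parameter is the small deterministic guard around an otherwise latent task.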

The deterministic side is equally important. Tan gives a simple example from OpenClaw: if he asks an LLM what time it is, it may assume UTC or Greenwich time. His fix was to write TypeScript code, context-now.mjs, with tests, so the agent receives the current local time and upcoming events from a deterministic source. The lesson is not about clocks; it is about system design. Do not ask the model to infer what code can know.
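
The same idea can be sketched in a few lines. This is not Tan's context-now.mjs, whose contents are not shown; it is a Python stand-in, and the function name `context_now`, its parameters, and the shape of the returned dictionary are assumptions for illustration.

```python
from datetime import datetime, timedelta

def context_now(events, now=None, horizon_hours=24):
    """Report the current time and upcoming events from the clock, not the model.

    `events` is a list of (title, start) pairs with timezone-aware datetimes.
    """
    if now is None:
        now = datetime.now().astimezone()  # local time with its UTC offset attached
    cutoff = now + timedelta(hours=horizon_hours)
    upcoming = sorted(
        (start, title) for title, start in events if now <= start <= cutoff
    )
    return {
        "now": now.isoformat(timespec="minutes"),
        "upcoming": [
            {"title": title, "start": start.isoformat(timespec="minutes")}
            for start, title in upcoming
        ],
    }
```

Accepting `now` as an argument keeps the function testable: a test can pin the clock, which is the kind of check the lecture says accompanied the original script.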

Resolvers solve the opposite problem: not every instruction belongs in the context window. Tan describes a common failure mode in Claude Code: CLAUDE.md grows to tens of thousands of lines because the user keeps adding instructions whenever the model does something wrong. The result is slower responses and degraded attention. His remedy is a resolver: a compact decision tree of pointers. Instead of keeping a long changelog policy in the main context, the agent sees an instruction such as “anytime you have to write to the changelog, load changelog.md.”

The before-and-after comparison is stark: 20,000 lines crammed into CLAUDE.md versus 200 lines pointing to the right knowledge on demand. Tan calls the resolver “the core of having a really great agent.” It determines which skill or code path should run for a request: “check my signatures” maps to an executive-assistant skill; “who is Pedro Franceschi” maps to brain operations; “save this article” maps to idea ingest; “what time is my meeting” maps to context-now; “find my 2016 trip” maps to calendar recall.
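
A resolver of this kind can be approximated as a short table of pointers that are loaded only when needed. The sketch below is a simplification under stated assumptions: in Tan's setup the agent itself reads the resolver and decides, whereas this example uses plain substring matching, and the file paths and the names `ROUTES`, `resolve`, and `load_on_demand` are hypothetical.

```python
from pathlib import Path

# Hypothetical routing table: trigger phrases -> pointer to a skill file.
ROUTES = [
    (("signature",), "skills/executive-assistant.md"),
    (("who is",), "skills/brain-ops.md"),
    (("save this article",), "skills/idea-ingest.md"),
    (("what time",), "skills/context-now.md"),
    (("changelog",), "skills/changelog.md"),
]

def resolve(request, routes=ROUTES):
    """Return the pointer for the first route whose phrase appears in the request."""
    text = request.lower()
    for phrases, pointer in routes:
        if any(phrase in text for phrase in phrases):
            return pointer
    return None

def load_on_demand(request, root=Path(".")):
    """Read only the resolved file, so the always-loaded context stays short."""
    pointer = resolve(request)
    if pointer is None:
        return ""
    path = root / pointer
    return path.read_text(encoding="utf-8") if path.exists() else ""
```

The main instruction file then only needs the table; the long policies live in the files the table points to.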

The same pattern produces one of Tan’s most important primitives: Skillify. Skillify is the process of turning a one-off successful workflow into a reusable, tested skill. Tan’s example is “save this article.” First, he has the agent perform the task once. He examines the input and output until it behaves as desired. Then he says “Skillify,” and the system promotes the workflow into a skill.

But promotion is not just writing a markdown file. Tan’s checklist has ten parts: SKILL.md as the contract; deterministic code in scripts; unit tests; integration tests; LLM evals; a resolver trigger in AGENTS.md; resolver evals to verify routing; CheckResolvable and DRY audit; an end-to-end smoke test; and brain filing rules. Tan emphasizes that writing the skill and writing the code are only two of the ten steps. The rest is compliance, testing, routing, and memory hygiene.
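
The ten-part checklist lends itself to a simple gate. The following is a minimal sketch, assuming a workflow is promoted only when every item is complete; the item labels paraphrase the list above, and `missing_steps` and `is_promotable` are hypothetical helper names.

```python
SKILLIFY_CHECKLIST = (
    "SKILL.md contract",
    "deterministic code in scripts",
    "unit tests",
    "integration tests",
    "LLM evals",
    "resolver trigger in AGENTS.md",
    "resolver evals",
    "CheckResolvable and DRY audit",
    "end-to-end smoke test",
    "brain filing rules",
)

def missing_steps(completed):
    """List the checklist items not yet done, in checklist order."""
    unknown = set(completed) - set(SKILLIFY_CHECKLIST)
    if unknown:
        raise ValueError(f"not on the checklist: {sorted(unknown)}")
    return [item for item in SKILLIFY_CHECKLIST if item not in completed]

def is_promotable(completed):
    """A workflow becomes a skill only when nothing is missing."""
    return not missing_steps(completed)

# A draft with only the skill file and the code written: two of ten steps.
draft = {"SKILL.md contract", "deterministic code in scripts"}
```

For the draft above, `missing_steps(draft)` returns the eight remaining items, which mirrors Tan's point that the writing is the smaller part of the work.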

This is where Tan connects agentic systems to organizational structure. In a human company, much of the work that looks bureaucratic exists because human systems are messy. In agentic systems, the same need reappears. The system has to know what it can do, route requests to the right capability, avoid duplicate processes, and prove that the claimed capability actually works.

Memory becomes the company brain

Tan’s next layer is memory. His project GBrain is described as a three-layer memory system built on top of the “knowledge wiki” idea. The first layer is a brain repo: markdown files in Git, intended to be plain, human-readable, diffable, greppable, and reviewable. The second layer is retrieval: Postgres plus pgvector, hybrid vector and BM25 search, reciprocal rank fusion, backlink-boosted ranking, and a typed knowledge graph. The third layer is agent skills: an MCP server with typed tools including Skillify and CheckResolvable.
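
Reciprocal rank fusion, one of the retrieval pieces named above, is simple enough to show directly. This sketch uses the standard formulation, where each document scores the sum of 1 / (k + rank) over the ranked lists it appears in, with the commonly used constant k = 60. It is not GBrain's code, it omits backlink boosting, and the file names are invented.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of ids: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is reproducible.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical results from a vector search and a BM25 keyword search.
vector_hits = ["people/pedro.md", "companies/brex.md", "notes/dinner.md"]
bm25_hits = ["companies/brex.md", "notes/2016-trip.md", "people/pedro.md"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

A document that ranks well in both lists rises above one that tops only a single list, which is why the method suits hybrid vector and keyword search.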

Agentic layer	Company analogue	Function
Skills	Employees	Each one has a capability
Resolver	Org chart	Decides who handles what and how escalation works
Filing rules	Internal process	Determines where information lives
CheckResolvable	Audit and compliance	Checks whether the system can do what it claims
Trigger evals	Performance reviews	Checks whether the right team responds

The mapping Tan gives between agentic infrastructure and company operating structure

The progression matters because a simple markdown wiki eventually breaks under scale. Tan says his own knowledge wiki started “falling over” because it relied on grep. Retrieval, graph structure, backlinks, and typed knowledge are additions needed to make memory useful to agents without making it opaque to humans.

Tan also wants the system to distinguish kinds of knowledge. He says he is adding an epistemology system so GBrain can track whether something is a hunch, a belief held by a particular person, or world knowledge. His example is a founder’s early conviction: someone believes the world needs something before anyone else believes it, spends years proving it, and later the system should be able to trace that belief from hunch to manifested reality.
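
As a rough illustration of what tracking epistemic status might involve, the sketch below tags each claim with a status and records every promotion. The three statuses come from the lecture; treating them as an ordered progression, and the `Claim` class itself, are assumptions for this example rather than GBrain's design.

```python
from dataclasses import dataclass, field
from enum import Enum

class Status(Enum):
    HUNCH = 1
    BELIEF = 2            # held by a particular person
    WORLD_KNOWLEDGE = 3

@dataclass
class Claim:
    text: str
    holder: str
    status: Status = Status.HUNCH
    history: list = field(default_factory=list)

    def promote(self, new_status, evidence):
        """Move a claim forward and keep the evidence that justified the move."""
        if new_status.value <= self.status.value:
            raise ValueError("claims only move forward in this sketch")
        self.history.append((self.status, new_status, evidence))
        self.status = new_status
```

The `history` list is what would let a system trace a conviction from hunch to accepted fact, as in Tan's founder example.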

That interest leads to a broader design requirement: memory schemas must be dynamic. Tan says the current schema is built for his own use case, but a researcher, journalist, politician, or founder would need different ontologies. The memory system cannot be only a personal filing cabinet; if it is to become the substrate for AI-native work, it has to support different domains and ways of knowing.

The organizational mapping is the key bridge to Hu’s company argument. Tan presents “The Agentic Company” as a set of equivalences: skills are employees, each with a capability; the resolver is the org chart, deciding who handles what and how escalation works; filing rules are internal process, deciding where information lives; CheckResolvable is audit and compliance, proving the system can do what it claims; trigger evals are performance reviews, checking whether the right team responds.

This mapping is deliberately imperfect, because Tan’s point is that agentic systems are “squishy” like human systems. They are not pure deterministic programs. A trigger eval itself has to test a latent routing behavior. A resolver can be wrong. A skill can overlap with another skill. A memory rule can file information in the wrong place. The more agents resemble an organization, the more they require organizational controls.
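
Two of those controls can be sketched as ordinary checks. The code below is a hedged illustration, not the real CheckResolvable or trigger-eval tooling: `check_resolvable` audits that every routed skill exists and no trigger is claimed twice, and `trigger_eval` measures how often requests reach the expected skill. The `route` function here is an exact-match stand-in for what would be a model-driven routing call, and the skill names are hypothetical.

```python
ROUTES = [
    ("check my signatures", "executive-assistant"),
    ("save this article", "idea-ingest"),
    ("what time is my meeting", "context-now"),
    ("find my 2016 trip", "calendar-recall"),
]

def route(request, routes=ROUTES):
    """Stand-in router: exact match on the lowercased request."""
    return dict(routes).get(request.lower())

def check_resolvable(routes, installed):
    """Audit: every routed skill is installed and no trigger appears twice."""
    missing = sorted({skill for _, skill in routes if skill not in installed})
    seen, duplicates = set(), []
    for trigger, _ in routes:
        if trigger in seen:
            duplicates.append(trigger)
        seen.add(trigger)
    return {"missing_skills": missing, "duplicate_triggers": duplicates}

def trigger_eval(route_fn, cases):
    """Return routing accuracy and the (request, expected, actual) misroutes."""
    misrouted = []
    for request, expected in cases:
        actual = route_fn(request)
        if actual != expected:
            misrouted.append((request, expected, actual))
    return 1 - len(misrouted) / len(cases), misrouted
```

Because routing is a latent behavior, the eval reports a rate rather than a proof; the audit, by contrast, is fully deterministic.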

AI-native companies are closed-loop systems

Diana Hu moves the argument from individual leverage to company design. Her claim is that AI-native companies must fundamentally change how companies are run. Pre-AI companies, she says, operate as open-loop systems: decisions are made, information returns slowly, and status is lossy. In control-systems terms, errors accumulate and the system drifts off course.
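
The control-systems framing can be illustrated with a toy simulation. This is a sketch under simple assumptions, not a model of any real company: each step adds a fixed amount of drift, and `gain` is the hypothetical fraction of the observed error that feedback corrects.

```python
def simulate(disturbances, gain):
    """Track deviation from plan over time.

    gain = 0 models an open loop (no feedback); a gain near 1 models a tight
    closed loop that corrects most of the error it observes at each step.
    """
    error, trace = 0.0, []
    for drift in disturbances:
        error = error + drift          # new drift arrives
        error = error - gain * error   # feedback removes part of the observed error
        trace.append(error)
    return trace

weekly_drift = [1.0] * 10
open_loop = simulate(weekly_drift, gain=0.0)
closed_loop = simulate(weekly_drift, gain=0.8)
```

With no feedback the error grows by one unit every step; with feedback it settles near a small constant instead of accumulating.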

A closed-loop company is different because every workflow produces artifacts agents can read. Hu contrasts open-loop company behavior — information in human heads, DMs and side channels, unwritten meetings, memory and vibes, lossy human status updates, agents seeing perhaps 10% of company state — with closed-loop behavior: Linear tickets, GitHub commits, Slack channels rather than DMs, recorded sales calls, customer feedback in Pylon, documents in Notion, and agents able to read the full state.

Hu says an agent such as Hermes or OpenClaw should be embedded into company decision-making with read access to every artifact the company produces. For a student project, that might mean connecting an agent to the GitHub codebase, Discord, and recorded teammate meetings. With that context, the agent can suggest next items, bug fixes, or filings into memory. The system starts to become self-healing because its state is captured and available to the agents that operate on it.

The company-performance claim is large: Hu says YC is seeing companies where each employee contributes at least $1 million to $2 million in revenue. She contrasts that with public-company revenue-per-employee figures, naming Salesforce; the passage is imprecise in the transcript, but it suggests a gap of at least 10x between such figures and what YC is seeing in startups. The central point is that Hu regards AI-native startups as operating with a sharply different revenue-per-employee profile.

$1M–$2M

revenue per employee Hu says YC is seeing in AI-native startups

Hu also says YC has applied the same pattern internally with its engineering team. She says YC was able to cut sprint time in half and produce 10x the amount of work; she connects that to a broader pattern in which agents can read the full state of work rather than receiving a lossy status summary.

The organizational consequence is a flatter structure with less middle management. Hu references Jack Dorsey’s writing on the agentic organization and argues that middle management has often existed to route lossy information. In the AI-native structure, she identifies three roles.

The first is the individual contributor. Everyone builds. Even non-technical people can ship prototypes or automate their own workflows. A salesperson, for example, might build a pipeline for calls, meetings, and follow-ups rather than waiting for an engineering team.

The second is the DRI, or directly responsible individual. Every outcome has one owner. The DRI orchestrates the ICs and the agents toward a concrete result. Hu gives the example of a company goal to increase revenue 3x by the end of the week: the DRI coordinates sales calls, engineering work, and other necessary actions end to end.

The third is the AI founder. This role cannot delegate model strategy. Hu says the best founders are “living at the edge of the future” with the tools because the tools are changing too quickly to understand from a distance. She dates a major shift in agentic coding to the late-2025 release of Claude 4.5, when she says things “started to work.” Garry Tan adds that people still operating at last year’s Copilot level “are not gonna make it.”

Evals are where taste becomes operational

Diana Hu treats taste as the durable human constraint. If shipping code is moving toward zero marginal cost, the scarce capability is knowing whether the output is good. In AI-native systems, she says, that taste has to become executable through evals.

Generic benchmarks are insufficient product tests. MMLU does not tell a company whether its collections agent upset a customer. A product’s context length, tool count, domain, customer expectations, and business rules differ from the public benchmark. What matters is multidimensional grading per call: did the agent follow instructions, was the answer correct, did it preserve customer trust, did it hit the business goal, and did it comply with domain rules?
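
Per-call grading along those five dimensions could be recorded as follows. This is a minimal sketch: the field names paraphrase the questions above, booleans stand in for what might be richer scores, and `CallGrade` and `pass_rates` are hypothetical names.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class CallGrade:
    followed_instructions: bool
    answer_correct: bool
    preserved_trust: bool
    hit_business_goal: bool
    complied_with_rules: bool

    def failed_dimensions(self):
        """Names of the dimensions this call failed."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    @property
    def passed(self):
        """A call passes only if every dimension passes."""
        return not self.failed_dimensions()

def pass_rates(grades):
    """Per-dimension pass rate across a batch of graded calls."""
    names = [f.name for f in fields(CallGrade)]
    return {n: sum(getattr(g, n) for g in grades) / len(grades) for n in names}
```

Keeping the dimensions separate shows where an agent fails: a call can be factually correct and still lose the customer's trust, which a single aggregate score would hide.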

Hu’s point is that the real judge is the user. If founders want to build companies, they have to define correctness in the domain where users experience the product. There is no universal benchmark that can substitute for that. The founder has to inspect traces, label failures, and decide which interactions should become eval cases.

Garry Tan adds a next step he has not yet released: cross-model eval. He describes a system in which frontier models such as Opus, GPT-5.5, and DeepSeek V4 evaluate inputs and outputs, rate them, and feed the critique back to the original sub-agent for another attempt. The goal is iterative meta-prompting: a second or third pass that can be “ten times better than the first version.” He describes founders already stacking models in this way, with one model treated as an “ADHD CEO” and another as a “nearly non-verbal 200 IQ CTO,” each useful for different parts of the evaluation process.
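
Since Tan has not released this system, the following is only a guess at the shape of such a loop. The judges and the generator are plain Python callables standing in for model calls; the 0 to 10 rating scale, the `target` threshold, and the name `refine` are assumptions made for illustration.

```python
def refine(task, generate, judges, max_rounds=3, target=8.0):
    """Generate a draft, gather ratings and critiques, and retry with the critiques.

    `generate(task, feedback)` returns a draft; each judge returns (score, critique).
    Returns the best draft seen and its mean score.
    """
    feedback = []
    best_score, best_draft = float("-inf"), None
    for _ in range(max_rounds):
        draft = generate(task, feedback)
        reviews = [judge(task, draft) for judge in judges]
        mean = sum(score for score, _ in reviews) / len(reviews)
        if mean > best_score:
            best_score, best_draft = mean, draft
        if mean >= target:
            break
        feedback = [critique for _, critique in reviews]
    return best_draft, best_score

# Stubs standing in for real model calls.
def stub_generate(task, feedback):
    return f"{task} (revised x{len(feedback)})" if feedback else task

def strict_judge(task, draft):
    return (9.0, "ok") if "revised" in draft else (4.0, "add detail")

def lenient_judge(task, draft):
    return (9.0, "ok") if "revised" in draft else (6.0, "tighten the wording")
```

Averaging across judges is one arbitrary way to combine them; a real setup might weight models differently depending on what each is good at.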

Hu converts that into a production loop. Founders build the evals because the inputs are the customer’s context and no one outside the company has them. The loop is: capture traces, convert failures to eval cases, replay regressions, improve prompts and tools, and retest before deploy. The principle is direct: “Taste maps to evals. If you can’t evaluate it, you can’t scale it.”
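
The loop can be sketched in a few functions. This is an illustrative outline, not a production harness: traces are assumed to be dictionaries with `input`, `label`, and a human-supplied `corrected_output`, and exact string comparison stands in for the multidimensional grading a real product would need.

```python
def failures_to_cases(traces):
    """Turn traces labeled as failures into eval cases with the approved output."""
    return [
        {"input": t["input"], "expected": t["corrected_output"]}
        for t in traces
        if t["label"] == "fail"
    ]

def replay(agent, cases):
    """Run every saved case against the current agent; return the ones that regress."""
    return [c for c in cases if agent(c["input"]) != c["expected"]]

def safe_to_deploy(agent, cases):
    """Gate a release on a clean replay of all saved cases."""
    return not replay(agent, cases)
```

Every captured failure becomes a permanent test, so a later prompt or tool change that reintroduces the problem is caught before deploy rather than by a customer.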

That loop also clarifies why Tan’s Skillify checklist is long. A one-off prompt improvement is not a system. A failure becomes useful only when it is captured, converted into a repeatable test, routed through the right skill, and replayed so the system does not regress. In that sense, taste is not just aesthetic judgment; it is the founder’s ability to define, measure, and enforce the product’s standard of correctness.

The wedge is a painful workflow, not an AI demo

Diana Hu gives concrete company-formation advice: pick one painful workflow, live inside the customer, and become the forward-deployed engineer. The best AI startups, she says, do not demo intelligence; they deploy solutions.

She gives three YC examples. Salient builds voice agents for loan servicing and, according to Hu, closed deals with top U.S. banks through forward-deployed pilots. HappyRobot works in logistics voice, embedded itself with freight forwarders, and 10x’d revenue in less than a year by automating coordination with truckers and timelines. Reducto works on document parsing, which Hu describes as enabling infrastructure: better document processing improves the RAG, memory, and agent systems that depend on reading documents.

The founders of these companies, Hu says, did not necessarily come from the industries they entered. Garry Tan interjects that the knowledge was “not in the training set.” The way founders became experts was by shadowing, taking jobs, or otherwise learning the workflow in depth before automating it. One tactic is to “go undercover”: Hu describes a founder taking a Zoom job in medical billing for three months before writing a line of code.

The target vertical should have high-volume repetitive labor, messy domain rules, workflows still run through phone, email, spreadsheets, and portals, and a buyer with urgent ROI. Those characteristics make the latent-deterministic architecture useful: agents can handle messy interpretation, while code and tools handle the parts that must be reliable.

Hu then points to an Anthropic chart on agent deployment by domain to argue that the opportunity is still early outside software engineering. The chart is attributed on screen to Anthropic’s work on measuring agent autonomy. It shows software engineering at 49.7% of tool calls, back office automation at 13.0%, data at 7.7%, and every other listed domain below 5%. Medicine and healthcare, legal, travel and logistics, education, customer service, cybersecurity, finance and accounting, sales and CRM, and marketing all appear far below software engineering.

Domain	Share of tool calls shown
Software engineering	49.7%
Back office automation	13.0%
Data	7.7%
Marketing and copywriting	4.4%
Sales and CRM	4.2%
Finance and accounting	3.9%
Cybersecurity	3.1%
Academic research	2.5%
Engineering	2.5%
Customer service	2.4%
Gaming and interactive media	2.3%
Document and presentation writing	2.0%
Education and tutoring	1.8%
E-commerce operations	1.3%
Medicine and healthcare	1.0%
Legal	1.0%
Travel and logistics	0.8%

Agent tool-call deployment by domain in the Anthropic chart Hu discusses

Hu’s conclusion from the chart is that there is large white space across back office, finance, data, academics, cybersecurity, customer service, healthcare, legal, logistics, and other verticals. Tan says the field is at “the first pitch of the first inning.” Hu says some students may feel the ideas are done, but YC’s portfolio suggests the opposite.

The growth data is part of the same argument. Hu says Paul Graham’s old benchmark was 10% week-over-week growth, and that historically only the top 1% of companies achieved it. In older YC batches, she says, perhaps Airbnb and one other company would hit that level. In the last three YC batches, she says 10% week-over-week revenue growth is now the average across the batch — what used to be the single best company per batch is now the median.
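
To see what that benchmark implies, the compounding can be worked out directly. This is plain arithmetic on the 10% figure, not data from YC; the function names are hypothetical.

```python
import math

def compound(weekly_rate, weeks):
    """Growth multiple after compounding a weekly rate for a number of weeks."""
    return (1 + weekly_rate) ** weeks

def weeks_to_double(weekly_rate):
    """Weeks needed to double at a constant weekly growth rate."""
    return math.log(2) / math.log(1 + weekly_rate)

annual_multiple = compound(0.10, 52)   # roughly 142x over a year
doubling_time = weeks_to_double(0.10)  # a little over 7 weeks
```

Sustained for a full year, 10% week over week multiplies revenue roughly 142-fold, which is why holding that rate even briefly was historically rare.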

Tan says this has never happened before in YC’s history. The signal, for him, is not press or demos; it is customers saying they cannot believe the product exists, thanking the company, paying for it, and then 10% more customers paying the next week.
