Good Agent Skills Are Small, Auditable, and Behaviorally Specific

Matt Pocock · AI Engineer · Monday, June 29, 2026 · 12 min read

Matt Pocock argues that the bottleneck in agent skills has shifted from access to judgment: developers and teams can now download and share skills, but lack a reliable way to tell which ones work, how they should be invoked, and what makes them fail. His proposed answer is a four-part checklist — trigger, structure, steering, and pruning — for turning agent skills from loose markdown instructions into small, inspectable units of behavior.

The skill problem is no longer availability. It is judgment.

Matt Pocock frames the current state of agent skills as a new version of a familiar developer trap. Tutorial hell was the loop of consuming tutorials without assembling durable understanding. Framework hell was the churn of learning the next JavaScript framework every few minutes. “Skill hell,” in Pocock’s account, is what happens when skills are abundant, downloadable, editable, and shareable, but developers cannot tell which ones are good, how they compose, or why they fail to deliver the results they appear to promise.

The problem is not only individual. Organizations face the same missing layer when they try to convert operating procedures into things an agent can do. Without a shared way to evaluate and improve skills, teams do not know how to turn procedural knowledge into agent behavior.

His answer is a checklist: trigger, structure, steering, and pruning. The checklist is meant to make a skill inspectable. A reader should be able to look at a skill and ask whether it is invoked at the right time, whether its internal layout separates procedure from supporting material, whether it actually steers the model’s behavior, and whether it is small enough to maintain and cheap enough to use.

Pocock has also encoded the framework into a new skill in his mattpocock/skills repository: /writing-great-skills. His stated purpose is practical rather than theoretical: use it to write new skills, improve existing ones, and evaluate community-authored skills before pulling them into an agent environment.

The first design decision is who invokes the skill

The trigger is the way a skill is invoked. Matt Pocock distinguishes between user-invoked skills and model-invoked skills, and treats the choice as one of the most important early design decisions.

A user can always invoke a skill manually by telling the agent to use it. In some harnesses that may look like a slash command such as /my-skill; in others it may be expressed differently. The key point is that the user deliberately calls the skill.

A model-invoked skill works through the agent’s context. Its description is placed where the model can see it. That description functions as what Pocock calls a “context pointer”: it tells the model that another file exists, typically SKILL.md, and that the model can read it when the described situation applies. When the model decides the skill is relevant, it follows that pointer and brings the skill’s main file into the context window.

Pocock shows the distinction through two examples from his own repository. His codebase-design skill includes a name and description in its frontmatter. The visible description says it provides “shared vocabulary for designing deep modules” and is meant for work such as designing or improving module interfaces, finding deepening opportunities, deciding where seams go, or making code more testable or AI-navigable. Because that description is available to the model, codebase-design is model-invocable.

By contrast, his grill-me skill contains disable-model-invocation: true. The visible file describes it as “a relentless interview to sharpen a plan or design,” then instructs the agent to run a /grilling session. Pocock says the disabling flag means the description is only shown to the user and is not visible to the agent.
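A minimal Python sketch of that distinction follows. It assumes a hypothetical harness that builds the model-visible list of context pointers from each skill's frontmatter. The `Skill` class, the `context_pointers` function, the pointer wording, and the `name/SKILL.md` path layout are illustrative assumptions; only the `disable-model-invocation` flag and the two skill descriptions come from the examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str
    description: str
    # Mirrors the `disable-model-invocation: true` frontmatter flag.
    disable_model_invocation: bool = False


def context_pointers(skills: list[Skill]) -> list[str]:
    """Return the description lines a harness would expose to the model.

    Skills with model invocation disabled contribute nothing to the
    agent context; the user has to call them explicitly.
    ASSUMPTION: the pointer wording and path layout are invented here.
    """
    return [
        f"{s.name}: {s.description} (read {s.name}/SKILL.md when this applies)"
        for s in skills
        if not s.disable_model_invocation
    ]


skills = [
    Skill("codebase-design", "Shared vocabulary for designing deep modules."),
    Skill(
        "grill-me",
        "A relentless interview to sharpen a plan or design.",
        disable_model_invocation=True,
    ),
]

for line in context_pointers(skills):
    print(line)
```

Only codebase-design produces a pointer in this sketch; grill-me stays out of the agent context and depends on the user knowing it exists.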

The trade-off is not simply that model-invoked skills are better because they are more flexible. Every model-invoked skill adds context load. Its description costs tokens on every request, and it also adds another possible tool-like concept for the model to consider.

In Pocock’s example, 100 model-invoked skills would mean 100 descriptions inside the agent context.
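The arithmetic behind that context load can be sketched in a few lines of Python. The figure of 40 tokens per description and the 50-request session are illustrative guesses, not measurements from Pocock's repository or any particular harness.

```python
def description_overhead(
    num_skills: int, tokens_per_description: int, requests: int = 1
) -> int:
    """Tokens spent on skill descriptions alone across a number of requests."""
    return num_skills * tokens_per_description * requests


# ASSUMPTION: 40 tokens per description and 50 requests are made-up inputs.
per_request = description_overhead(num_skills=100, tokens_per_description=40)
per_session = description_overhead(
    num_skills=100, tokens_per_description=40, requests=50
)

print(per_request)  # 4000
print(per_session)  # 200000
```

Under these assumed numbers the descriptions cost a few thousand tokens before the agent has read a single SKILL.md file, which is the overhead user-invoked skills avoid.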

User-invoked skills move the burden elsewhere. They keep the agent’s context load smaller, but increase the cognitive load on the user. The user must know which skills exist, when to call them, and how to pilot the agent through them.

This trade-off explains the difference Pocock draws between his own mattpocock/skills repository and obra/superpowers, another popular set of engineering skills. Superpowers is primarily model-invoked: it gives the agent superpowers. His own skills lean toward user invocation because he prefers to remain “in full control.” That choice reduces the model’s context load but requires him to understand his skills more deeply.

The other cost of model invocation is unpredictability. A context pointer can always be ignored. Even when a skill is perfect for a task, the model may choose not to invoke it. That creates an evaluation problem: if a skill is supposed to be model-invoked, teams need to evaluate whether it is actually being called at the right time. Pocock calls that “nasty” and says he prefers to remove that class of failure by accepting more cognitive load as the user.
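One way to picture that evaluation problem is as a labeled set of prompts, each recording whether the skill should have fired and whether it did. The sketch below is an assumption about how a team might tally the results; `EvalCase`, `invocation_report`, and the sample cases are hypothetical, and the `did_invoke` values would in practice have to come from harness logs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    should_invoke: bool  # human label: is the skill relevant to this prompt?
    did_invoke: bool     # observed: did the agent actually read SKILL.md?


def invocation_report(cases: list[EvalCase]) -> dict[str, float]:
    """Summarize how often a model-invoked skill was missed or misfired."""
    relevant = [c for c in cases if c.should_invoke]
    irrelevant = [c for c in cases if not c.should_invoke]
    missed = sum(1 for c in relevant if not c.did_invoke)
    spurious = sum(1 for c in irrelevant if c.did_invoke)
    return {
        "miss_rate": missed / len(relevant) if relevant else 0.0,
        "spurious_rate": spurious / len(irrelevant) if irrelevant else 0.0,
    }


# Invented cases for illustration only.
cases = [
    EvalCase("Redesign the billing module interface", True, True),
    EvalCase("Where should the seam go in the importer?", True, True),
    EvalCase("Make the parser more testable", True, False),
    EvalCase("Fix the typo in the README", False, False),
]

report = invocation_report(cases)
print(report)
```

A user-invoked skill has no miss rate to measure, which is the class of failure Pocock says he prefers to remove.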

The resulting rule is not “always use user-invoked skills.” It is narrower and more useful: decide deliberately. Model-invoked skills impose context load and invocation uncertainty. User-invoked skills impose cognitive load and require a more skilled pilot.

A skill should separate procedure from supporting material

Most skills can be decomposed into two units: steps and reference. For Matt Pocock, that distinction is the basis for making a skill easier to design, audit, and shrink.

Steps are the procedure the agent should walk through. Reference is the supporting material needed to execute those steps. Some skills may be all reference and no procedure. Others may be little more than a sequence of steps. Treating these as distinct parts prevents the procedure and its supporting material from blurring into one undifferentiated markdown file.

His example is /to-prd, a skill that creates a product requirements document from the current context window. It has three steps: find relevant context, confirm test seams, and write the PRD. The second step is a human-in-the-loop checkpoint. The skill confirms test seams with the user “to make sure we’re not doing anything weird with the testing,” which Pocock considers important. To support the steps, the skill needs two pieces of reference material: an explanation of what a test seam is, and a PRD template.

That gives Pocock a writing method. Start by asking whether the skill needs steps. Write those steps. Then ask what reference material those steps require, and place that supporting material in its own location rather than letting it blur into the procedure.

This leads directly to his next constraint: SKILL.md should be as small as possible. Every skill is composed of a description, a main SKILL.md file, and any reference material branching from it. The main file should be kept small because smaller skills are easier to maintain, easier to audit, cheaper in tokens, and lighter for both maintainers and users.

The main mechanism for shrinking SKILL.md is branch analysis. A branch is a different way a skill can be used. If reference material is needed on every branch, it may belong in the main file. If it is only needed on one branch, it is a candidate to move behind a context pointer.

In /to-prd, both pieces of reference are likely needed every time. The PRD template is always needed because the skill always writes a PRD. The test-seam explanation is probably always needed because the skill always asks about test seams. Pocock says /to-prd effectively has one branch, so the reference material probably belongs in SKILL.md.

The situation is different for /domain-modeling. That skill does two things: it updates a local glossary called CONTEXT.md, and it creates architectural decision records. It may also choose to do neither. That creates two or three branches. The CONTEXT.md template and ADR template do not need to live in the main skill file, because they are only relevant on particular branches.

The technique is to place a pointer in SKILL.md that tells the agent when to open a separate file, such as CONTEXT-TEMPLATE.md. Pocock calls that separate file an external reference: bundled with the skill, easy for the agent to pull in, but not loaded into the main skill unless the branch requires it.

The design rule is therefore more specific than “organize your markdown.” A good skill identifies its branches and keeps branch-specific reference material out of the main path.
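The branch analysis can be sketched as a small set comparison: a reference that every branch needs stays in the main file, and anything else moves out. The function name, the branch labels, and the placement strings below are hypothetical; the skills and their reference material are the ones described above.

```python
def place_references(
    branches: set[str], used_on: dict[str, set[str]]
) -> dict[str, str]:
    """Decide where each piece of reference material should live.

    A reference needed on every branch stays inline in SKILL.md; one
    needed on only some branches moves behind a context pointer; one
    needed on no branch is a candidate for deletion.
    """
    placement: dict[str, str] = {}
    for reference, needed_by in used_on.items():
        if not needed_by:
            placement[reference] = "candidate for deletion"
        elif needed_by >= branches:
            placement[reference] = "inline in SKILL.md"
        else:
            placement[reference] = "external reference"
    return placement


# /to-prd effectively has one branch, so both references are always needed.
to_prd = place_references(
    branches={"write-prd"},
    used_on={
        "PRD template": {"write-prd"},
        "test seam explanation": {"write-prd"},
    },
)

# /domain-modeling has several branches; each template serves only one.
domain_modeling = place_references(
    branches={"update-glossary", "write-adr", "do-neither"},
    used_on={
        "CONTEXT-TEMPLATE.md": {"update-glossary"},
        "ADR template": {"write-adr"},
    },
)

print(to_prd)
print(domain_modeling)
```

The sketch treats branch membership as known in advance. In practice, identifying the branches is the judgment call, and the placement follows from it.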

Steering depends on words the model will repeat to itself

The central steering technique is what Matt Pocock calls “leading words.” He presents it as the main idea he wants readers to take from the framework.

The failure mode is familiar: the skill appears to specify the desired behavior, but the agent still does not do it. Pocock’s diagnosis is that the skill is often missing compact phrases that carry a lot of behavioral meaning. A leading word, or short phrase, “packs in a bunch of meaning into a very small space.”

The mechanism matters. A leading word enters the skill text, then gets repeated by the agent in its operations, reasoning tokens, and output. Because the model re-emphasizes the word, and because the word describes the intended behavior, Pocock says it can change how the model acts.

These leading words are really powerful with agents because you put the leading word in the skill itself in the text, and then the agent will repeat the leading word back to itself as part of its operations, as part of its thinking tokens, and as part of its output to you.

Matt Pocock

His concrete example is the tendency of agents to code layer by layer. Given a large implementation task, an agent may build all of the database layer, then all of the schemas, then all API endpoints, then all frontend work. Pocock contrasts that with the “typical human thing”: get feedback early, make something small work, then expand outward.

A skill could try to prevent the behavior with a literal instruction: do not code layer by layer; create a small slice first. Pocock’s preferred alternative is to use the phrase “vertical slice.” It is compact, familiar development terminology, and it invokes a larger set of expectations: implement a thin end-to-end path through the system instead of horizontal layers.

The technique is observable, but not automatic. If the skill says “vertical slice,” Pocock says the reasoning traces may begin to say something like “a thin vertical slice,” and the resulting implementation plans should improve. The point is not to reduce the skill to a two-word incantation. It is to find a phrase that captures the desired behavior, repeat it consistently, and watch whether the model adopts that frame.

Pocock says many people recognize the technique once it is named: they have already been using small phrases to nudge agents. His recommendation is to make that practice more systematic. If the agent is not doing what the skill intends, make the leading words more consistent and more powerful. He also notes that agents themselves can help find candidate phrases.
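Because the technique is observable, a rough check is possible: look for the leading phrase in the skill text and again in the agent's reasoning or output. The sketch below assumes the transcript is available as plain text. The function names and sample strings are hypothetical, and a phrase match shows only that the model echoed the words, not that its plan improved.

```python
import re


def count_phrase(text: str, phrase: str) -> int:
    """Count case-insensitive occurrences of a phrase.

    Whitespace between words may vary, and plural endings such as
    "vertical slices" also match because only the start is anchored.
    """
    words = [re.escape(w) for w in phrase.split()]
    pattern = r"\b" + r"\s+".join(words)
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def leading_word_echoed(skill_text: str, transcript: str, phrase: str) -> bool:
    """True when the phrase is in the skill and the agent repeats it."""
    return count_phrase(skill_text, phrase) > 0 and count_phrase(transcript, phrase) > 0


# Invented sample text for illustration only.
skill_text = "Plan the work as a vertical slice."
adopted = "I'll start with a thin vertical slice: one endpoint, one table, one screen."
ignored = "First I will build the whole database layer."

print(leading_word_echoed(skill_text, adopted, "vertical slice"))  # True
print(leading_word_echoed(skill_text, ignored, "vertical slice"))  # False
```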

English is a pretty wide API in terms of different functions you can call.

Matt Pocock

The second steering lever is hiding future steps when the agent is not doing enough work in the current step. Pocock’s example is plan mode. In his experience, plan mode usually contains two steps: ask clarifying questions, then create a plan. The problem is that the model sees the final goal — creating the plan — and rushes through the question-asking step. It asks a few clarifying questions and eagerly proceeds.

His solution is to split the workflow. Instead of one plan-mode skill, he uses /grill-with-docs for the clarifying phase and then /to-prd afterward. The agent only sees one phase at a time. Because the future goal is hidden, it puts more legwork into the current task.

Pocock does not say every skill should be split into individual steps. He says the technique is especially useful when one phase requires more effort than the agent otherwise gives it. When the model under-invests in a step because it is looking ahead, hide the future step.
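The effect of splitting can be sketched as a loop in which each phase gets its own prompt, so instructions for later phases never appear while an earlier one is running. This automates what Pocock does by hand when he invokes two skills in sequence. The `Agent` type, `run_phases`, the stub agent, and the phase wording are all hypothetical and do not reproduce the contents of his skills.

```python
from typing import Callable

# ASSUMPTION: an agent is modeled as a function from prompt text to reply text.
Agent = Callable[[str], str]


def run_phases(agent: Agent, phases: list[str], task: str) -> list[str]:
    """Run each phase in a separate prompt so later phases stay hidden."""
    outputs: list[str] = []
    carried = task
    for instructions in phases:
        # Only the current phase's instructions are included in the prompt.
        prompt = f"{instructions}\n\nContext:\n{carried}"
        reply = agent(prompt)
        outputs.append(reply)
        # Earlier replies are carried forward as context for the next phase.
        carried = f"{carried}\n\n{reply}"
    return outputs


seen_prompts: list[str] = []


def recording_agent(prompt: str) -> str:
    """Stub agent that records what it was shown instead of calling a model."""
    seen_prompts.append(prompt)
    return f"(reply {len(seen_prompts)})"


phases = [
    "Ask clarifying questions until the plan has no open decisions.",
    "Write a PRD from the context so far.",
]
run_phases(recording_agent, phases, task="Add CSV export to the reports page.")

print("Write a PRD" in seen_prompts[0])  # False: the goal is hidden in phase one
```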

Pruning is a behavioral test, not a neatness pass

Pruning comes after the skill is working. Matt Pocock defines the goal as removing everything that does not materially help the skill perform.

The most visible failure mode is a massive skill. Pocock says size is usually a symptom of some other problem rather than the root issue: a large skill may contain duplication, accumulated sediment, irrelevant branch material, or no-ops.

Duplication is the simplest. Pocock wants each part of the skill to have a single source of truth. If a PRD template exists, do not repeat it in multiple places. If the skill defines “test seam,” do not redefine it elsewhere. The same applies across reference material. Each concept or artifact should live in one place.
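Exact duplication is the easiest of these problems to detect mechanically. The sketch below assumes a skill's files have been loaded into a dictionary of filename to text, and it flags paragraphs that appear more than once after whitespace and case are normalized. The function names, file names, and sample definition are hypothetical, and reworded duplicates would slip past it.

```python
import re
from collections import defaultdict


def normalize(paragraph: str) -> str:
    """Collapse whitespace and case so trivial differences do not hide a repeat."""
    return re.sub(r"\s+", " ", paragraph).strip().lower()


def duplicated_paragraphs(
    files: dict[str, str], min_chars: int = 40
) -> dict[str, list[str]]:
    """Map each repeated paragraph to the files that contain it."""
    locations: defaultdict[str, list[str]] = defaultdict(list)
    for name, text in files.items():
        # Paragraphs are separated by one or more blank lines.
        for paragraph in re.split(r"\n\s*\n", text):
            key = normalize(paragraph)
            # Skip short fragments such as headings, which repeat legitimately.
            if len(key) >= min_chars:
                locations[key].append(name)
    return {p: names for p, names in locations.items() if len(names) > 1}


# ASSUMPTION: placeholder definition text, not taken from any real skill.
SEAM = (
    "A test seam is a place where behavior can be observed or swapped out "
    "without editing the code under test."
)
files = {
    "SKILL.md": "## Steps\n\n1. Confirm test seams with the user.\n\n" + SEAM,
    "reference/TESTING.md": SEAM + "\n\nPrefer seams that already exist.",
}

dupes = duplicated_paragraphs(files)
for paragraph, names in dupes.items():
    print(names, paragraph[:40])
```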

Sediment is what accumulates when multiple people contribute to a shared markdown file without deleting or restructuring each other’s additions. People add their own material, hesitate to remove anyone else’s, and the file grows. Pocock says sediment often contains irrelevant or poorly laid-out material. The first response should be structural: determine whether added material applies to all branches. If not, move it to the correct branch. If it is irrelevant or stale, remove it.

No-ops are subtler. Pocock says they are common when an agent writes the skills. A no-op is text that appears to do something but does not actually influence the agent’s behavior in the context of the skill. His example is an implementation skill with a full paragraph instructing the agent to write a long, detailed commit message. If the agent would still write a decent, detailed commit message with that paragraph deleted, the paragraph was probably not doing meaningful work.

That leads to Pocock’s deletion test: remove a passage and ask whether behavior changes. If behavior does not change, the passage is a candidate for deletion. He connects the small size of his own skills to this practice, along with compacting instructions into leading words, removing irrelevant material, and avoiding sediment.
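The deletion test is essentially an ablation: run the skill with and without a passage and compare what the agent does. The sketch below assumes two hypothetical callables, one that runs a skill and returns a description of the agent's behavior, and one that judges whether two behaviors are equivalent. The stub agent is deterministic so the example can run; real model output varies between runs, so a real test would need repeated trials and a more careful comparison.

```python
from typing import Callable


def deletion_test(
    skill_text: str,
    passage: str,
    run_skill: Callable[[str], str],
    same_behavior: Callable[[str, str], bool],
) -> bool:
    """Return True when removing the passage leaves behavior unchanged."""
    if passage not in skill_text:
        raise ValueError("passage not found in skill text")
    baseline = run_skill(skill_text)
    ablated = run_skill(skill_text.replace(passage, "", 1))
    return same_behavior(baseline, ablated)


def fake_agent(skill_text: str) -> str:
    """Stub: always writes detailed commits, but plans in slices only if told to."""
    plan = "vertical slice" if "vertical slice" in skill_text else "layer by layer"
    return f"plan={plan}; commit=detailed"


# Invented skill text for illustration only.
skill = "Work in a vertical slice.\nWrite a long, detailed commit message.\n"

commit_is_noop = deletion_test(
    skill, "Write a long, detailed commit message.", fake_agent, lambda a, b: a == b
)
slice_is_noop = deletion_test(
    skill, "Work in a vertical slice.", fake_agent, lambda a, b: a == b
)

print(commit_is_noop)  # True: a candidate for deletion
print(slice_is_noop)   # False: this passage changes behavior
```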

Pruning, in this sense, is not about making the file look cleaner. It is about testing whether each phrase, paragraph, and reference changes the agent’s behavior enough to justify its cost.

The checklist turns skill writing into an auditable practice

The framework is cumulative. Matt Pocock uses the trigger pass to ask whether the skill should be user-invoked or model-invoked, and therefore whether the design is imposing context load on the agent or cognitive load on the user. The structure pass asks whether the skill is organized into steps and reference, whether its branches are understood, and whether branch-specific material has been moved out of SKILL.md behind context pointers. The steering pass asks whether the skill uses consistent leading words and whether the agent is doing enough legwork on each step. The pruning pass removes duplication, sediment, stale material, and no-ops.

| Pass | Diagnostic question | Failure mode or trade-off |
| --- | --- | --- |
| Trigger | Who invokes the skill: the user or the model? | Model invocation adds context load and uncertainty; user invocation adds cognitive load. |
| Structure | Is the skill separated into steps, reference, and branches? | Branch-specific reference material bloats `SKILL.md` when it should sit behind a context pointer. |
| Steering | Does the language actually change the agent’s behavior? | Weak or inconsistent leading words fail to shape plans; visible future steps can reduce legwork. |
| Pruning | What can be deleted without changing behavior? | Duplication, sediment, stale material, and no-ops make skills massive without making them better. |

Pocock’s four-part checklist for evaluating and improving agent skills

The underlying standard is that a skill should be small, purposeful, and inspectable. It should be clear why it is invoked, what procedure it runs, what reference material it needs, how its language changes model behavior, and what every remaining line contributes.

Pocock’s practical recommendation is to start with the /writing-great-skills skill in mattpocock/skills. He suggests using it not only to improve one’s own skills, but also to evaluate community-authored skills before adopting them.
