Coding Agents Can Tackle AI Systems Engineering With File-Based Skills

Ben BurtenshawAI EngineerThursday, May 21, 202613 min read

Hugging Face’s Ben Burtenshaw argues that coding agents can now take on parts of AI systems engineering when the work is narrow, measurable, and embedded in inspectable repositories. Using examples including an agent-written CUDA RMSNorm kernel with a reported 1.94x H100 speedup, an end-to-end Qwen3 fine-tune, and a multi-agent research lab, he makes the case that the limiting factor is not a better prompt but better primitives: skills, versioned artifacts, benchmarks, managed compute, and open metrics that agents can read, run, and improve.

A 1.94x kernel speedup is the hook; the infrastructure is the point

Ben Burtenshaw’s strongest concrete example was an agent-generated CUDA kernel for Qwen3-8B RMSNorm on H100 that Hugging Face benchmarked at a 1.94x average speedup. He described the result as a 94% speedup, while also stressing that it was not state of the art for that model. The point was more practical: coding agents can pick up useful systems-engineering work when the task is narrow, benchmarkable, and surrounded by the right repository structure.

1.94x

average speedup Burtenshaw reported for an agent-generated Qwen3-8B RMSNorm kernel on H100

Burtenshaw’s broader claim is that coding agents are no longer confined to ordinary application-code assistance. They can be applied to AI systems engineering and machine-learning engineering: writing and benchmarking CUDA kernels, fine-tuning models, and running automated experiment loops. The constraint is less glamorous but more important. Agents need standard repositories and primitives they can inspect, version, test, and reuse.

He described coding agents as having crossed an “acceptance gradient.” Many engineers had already been using them for years, but in recent months they had become broadly accepted enough that the professional question changes. If agents can handle more of the software layer, Burtenshaw argued, engineers should move “closer to the silicon” and use agents to tackle harder problems rather than stop at routine coding.

Burtenshaw grouped the examples into three increasingly autonomous levels. The first is interactive: an engineer uses an agent to write and benchmark a CUDA kernel. The second is a zero-shot task: an agent is given a prompt to fine-tune an LLM on Hugging Face. The third is a multi-agent “autoresearch” lab: researcher, planner, worker, and reporter agents run parallel experiments, launch GPU jobs, and report results to a dashboard.

The through line is not that agents suddenly understand every specialized domain unaided. Burtenshaw repeatedly returned to “skills”: file-based context that gives the agent examples, scripts, references, templates, and instructions. His shorthand was that skills turn a zero-shot task into a few-shot task. The agent is not being handed an opaque API and asked to guess. It is being given a repository structure and domain examples it can open, inspect, and execute.

Burtenshaw’s compressed recommendation was: “Do not abstract. Expose, well.” Abstractions remain useful, in his view, but opaque layers create ceilings for agents. The systems he favored expose files, metrics, jobs, compatibility metadata, and data structures that agents can actually operate on.

The CUDA example starts with memory, not math

Burtenshaw used CUDA kernels as the first test case because they have long looked like an awkward fit for coding agents. Writing a custom kernel requires specialized languages and domain-specific knowledge, integration with hardware, benchmarking, testing, and dealing with a difficult hardware-software compatibility matrix. He said the common assumption that agents could not do this has turned out, in many cases, to be wrong. He pointed to kernel hackathons, GPU mode, AMD hackathon work, and papers such as KernelBench as evidence that agents can write valid and optimized CUDA kernels.

His explanation of why custom kernels matter began with the mechanics of GPU execution. When an AI model runs on a GPU, the work is executed through kernels: functions compiled for the hardware and typically invoked from Python. Operations like matrix multiplication, softmax, and GELU dispatch to GPU kernels. Custom kernels can exploit hardware-specific features for a particular mathematical operation and “squeeze everything” out of the hardware so inference runs faster.

The main performance bottleneck, in Burtenshaw’s telling, is often not compute. He broke deep-learning efficiency into compute, memory, and overhead. Compute is the FLOPs: the matrix multiplications and arithmetic. Memory is the movement of tensors between memory locations, typically from slower to faster memory. Overhead is everything else, including Python, PyTorch dispatch, and the surrounding runtime environment.

The intuitive assumption is that compute dominates because the math is large. Burtenshaw said that is usually wrong. On a modern GPU such as an H100, he said, computation can reach a petaflop per second while memory bandwidth is around three terabytes. The result is that the GPU may sit idle waiting for tensors to arrive. The job of many optimized kernels is to increase arithmetic intensity: move data across, do as much math as possible while it is on the GPU, and then write it back. He used Flash Attention as the poster child for this pattern and summarized the goal as keeping “the GPUs warm.”

That creates an opportunity for agents because much of the work is narrow, testable, and benchmarkable. A kernel either compiles, runs correctly, and improves a metric on a target hardware configuration, or it does not. Burtenshaw’s first example was not framed as agents inventing state-of-the-art kernels from scratch. It was framed as agents picking up low-hanging optimization work for a specific model and hardware generation.

He said the Qwen3-8B RMSNorm result was about compatibility and the optimization matrix. Models and kernels are not always optimized for the hardware a team actually has available. If a cloud provider has cheap hardware that is not the ideal target for a model, there may be practical speedups available simply by optimizing for that configuration. His recommendation was to “come here and pick up some easy speedups.”

Generated kernels still need to become usable artifacts

Burtenshaw’s CUDA section did not stop at whether an agent could write a fast kernel. He asked what happens after the agent generates one: how it is distributed, how it gets into inference engines, and how engineers know where it works.

His answer was Hugging Face’s kernels library, which he described as a way to distribute custom kernels using a predictable project structure. A kernel package has a TOML file describing which hardware it supports and which CUDA or software versions it requires. These kernel projects are also repositories on the Hugging Face Hub, “just like models.” His pitch was that a kernel writer, or an engineer using an agent to become one, can now be a kernel publisher in the same way someone can publish a model.

The Hub interface for a sample kernel exposed compatibility metadata, supported hardware, and an installation command. Burtenshaw emphasized this because the hard part of using custom kernels is not only writing the GPU code; it is knowing whether the kernel is compatible with a specific machine, software stack, or deployment environment. The “boring” infrastructure—standard repos, metadata, versioning, and compatibility declarations—is what makes the agent output usable by others.

The skills layer sits inside that same practical structure. In the CUDA-kernel example, the skill contained a compact instruction file plus scripts and references. The file tree included benchmark templates, a micro-benchmark for RMSNorm, examples for Diffusers and Transformers integration, a Hugging Face kernels integration example, hardware optimization guides for H100, A100, and T4, kernel templates, and troubleshooting notes.

That matters because the agent is not merely being told, “write a CUDA kernel.” It is being given a workbench: how to benchmark, where to integrate, how to test, what pitfalls to avoid, and how previous examples are structured. Burtenshaw argued that skills should live inside the projects they support wherever possible, maintained by the project maintainers rather than floating around as unreviewed prompt snippets. Hugging Face also maintains a separate huggingface-skills repository for more experimental examples, but his preferred model is project-owned skills that evolve with the tool.

He also described upskill, an open-source library Hugging Face maintains for evaluating skills across models. In his explanation, upskill generates skills, generates evaluations for them, and lets users compare different models on the same skill. The point is partly cost and model selection: if a team uses a skill regularly, it can evaluate whether a cheaper or open model performs well enough with that skill.

The eval examples compared models including GPT-o3, GPT-o1, Kimi, and Haiku on accuracy and token usage with and without a skill. Burtenshaw’s practical recommendation was to use such evaluations to iterate on skills rather than assume the first version is good. Skills are not just prompt decoration in this framing; they are artifacts that can be tested, improved, and matched to models.

Fine-tuning becomes promptable when the repository already knows the steps

The second example was deliberately shorter: using an agent to fine-tune a model end to end. Burtenshaw pointed to prior Hugging Face work in which Claude was prompted to fine-tune Qwen3-0.6B on the open-r1/codeforces-cots dataset. He described the dataset as a chain-of-thought dataset intended to improve the model’s chain-of-thought behavior.

His point was not that fine-tuning is trivial in the abstract. It was that the agent can perform it when the surrounding system already exposes the right primitives: Hub integration, GPU jobs, Hugging Face CLI skills, datasets, training scripts, and enough repository structure for the task to become executable. In his framing, the user can ask for the model and dataset combination, and the agent can run the training workflow on Hugging Face infrastructure.

He also referenced a related workflow using Unsloth and Hugging Face Jobs. Unsloth was presented as a way to use optimized models for cheaper training; the visible material described it as allowing 2x faster training and roughly 60% less VRAM usage than standard methods, making small-model training cost only a few dollars. Burtenshaw said the example is maintained by Unsloth and Hugging Face, and he encouraged engineers to try it, noting that associated posts often include free credits.

The fine-tuning example reinforces the same pattern as the kernel example. The agent succeeds because the task is not an open-ended request floating outside a system. It is attached to a repository, a command-line interface, a compute substrate, a training library, and skills that encode the expected workflow. Specialized ML engineering knowledge is compressed into file-based operational context the agent can use.

A multi-agent lab turns one autoresearch loop into divided work

The most ambitious example was Burtenshaw’s “autolab,” a multi-agent autoresearch setup. He traced the inspiration to Andrej Karpathy’s autoresearch project, which used Claude Code to propose improvements to a training script derived from the nanoGPT and nanoChat line of work. Burtenshaw described Karpathy’s version as an automated loop where experiments changed the training script and kept improvements that reduced validation loss, measured in bits per byte.

Burtenshaw said the part he found strange was that one agent was doing the whole process in a single path: finding improvements, implementing them, and iterating. His response was to distribute the work across a research team of agents.

Role	Function in Burtenshaw’s autolab
Researcher	Searches Hugging Face papers or other paper sources, acts as a literature scout, and turns ideas into hypotheses.
Planner	Maintains the experiment queue, compares hypotheses with current state, and defines jobs.
Worker	Receives one hypothesis and one Git worktree, implements the change in the training script, and launches the GPU job.
Reporter	Monitors jobs, pushes metrics and status into Trackio, and maintains the dashboard.

Burtenshaw divided the automated research loop into four agent roles.

Hugging Face papers were useful in this setup because they have a CLI, allowing the researcher agent to pull and search papers from the Hub. Burtenshaw said arXiv papers could also be used. The researcher’s job is not to run the experiment; it formulates paper-derived ideas as hypotheses. The planner turns those hypotheses into an experiment queue. Workers each pick up one hypothesis and implement it, often as an architecture change or hyperparameter adjustment. The reporter monitors the jobs and maintains observability.

The implementation used a main branch containing the master train.py and the best validation bits-per-byte score. Parallel workers received one hypothesis each and launched managed GPU jobs on the Hub, using H200 GPUs in the example. The workers submitted patches back. The reporter pushed status and metrics into Trackio. The local operator machine used OpenCode in the demonstration, but Burtenshaw said the repository also includes implementations for Codex and Claude. He had also implemented it in another tool he called “Gastown,” though he described that as “wild west stuff” and kept it separate.

The Git structure was central to the design. The main branch keeps the promoted training script and a separate original training script. A data structure on main tracks scores. Each branch or worktree isolates a proposed change. The agents exchange structured state about the current score, failed and successful experiments, hyperparameters available for modification, and queued jobs.

Burtenshaw admitted that the demonstration’s inter-agent tables were probably more verbose than necessary, but he recommended the repository as a starting point for anyone who found the pattern useful. The implementation details matter because the lab is not an undefined swarm of agents. It is a queue, a Git workflow, a set of job launchers, role-specific prompts, and a metric that determines whether a change is worth keeping.

Observability is part of the agent system, not an afterthought

Burtenshaw treated observability as a core primitive for autonomous research. In the autolab setup, the reporter agent operates through Trackio, an open-source dashboard for metrics. He described Trackio as particularly useful for agents because it has an open data layer based on Parquet. If the dashboard is not enough, the agent can read the underlying data and create its own visualization or analysis.

Trackio exposed 84 runs in the autolab example and charts for metrics such as step time, train loss, training seconds, and validation bits per byte (val_bpb). It also recorded reports with warnings and errors, including an example where a worker escalation reported a Hub refresh timeout and a blocked worker. Burtenshaw said Trackio can report events and warnings from different agents, filter them, and tie them to notifications such as email if agents “are kind of going rogue” and need intervention.

He emphasized Trackio’s freeform structure: agents can write tables that do not necessarily fit a rigid experiment-tracking schema. The Media & Tables view included a job_board table with fields such as wave ID, bead ID, job ID, stage, mode, worker name, creation time, and hypothesis. The Hugging Face Jobs view showed hundreds of jobs, with labels indicating mode, launcher, master commit, and hypothesis. Burtenshaw noted that labels make the jobs sortable and reviewable.

Because Trackio’s data is open, Burtenshaw showed a Gantt chart generated from the underlying run data. The chart plotted experiments over time and annotated each with its resulting bits-per-byte score. It included a master score of 0.962777 and a best validation score of 0.938925, with named runs such as amber, citrine, obsidian, garnet, turquoise, onyx, granite, topaz, and mica. Burtenshaw’s point was not that this specific visualization is required. It was that the data layer lets agents and humans inspect what happened in whatever form is useful.

This is where the “open primitives” theme becomes operational. The lab is not a black-box automation system. It is a Git repository, a queue, a set of worktrees, managed Hub jobs, metrics tables, reports, labels, and files that agents can read and modify. Burtenshaw argued that if the experiment is verifiable—training a model, writing a kernel, improving a validation metric—then building a small AI lab around it is relatively straightforward.

The recommendation is to expose more, not hide more

Burtenshaw’s final takeaways were deliberately infrastructural. First, agents work best with primitives. Tools such as Trackio and kernels should expose their internals well enough that agents can control them, inspect them, and build on them. Abstraction is not the enemy, but opaque abstraction creates a ceiling. His phrasing was: “Do not abstract. Expose, well.”

Second, he argued that the Hugging Face Hub is ready for these workloads because it already has storage, tracking, compute, and versioning. In the examples, the Hub is not just a place to publish final artifacts. It is the substrate where kernels are packaged, skills live near projects, jobs run, model fine-tuning happens, experiment results are tracked, and agent-generated work becomes inspectable.

The more ambitious claim is about professional leverage. Burtenshaw presented CUDA kernels and ML training pipelines as domains that have historically required deep specialization: hardware knowledge, integration work, benchmarking, training workflows, and enough operational judgment to know whether a result is usable. He did not claim that agents eliminate that knowledge or make correctness automatic. His examples instead show how much of the operational path can be encoded into repositories and skills: benchmark scripts, hardware guides, integration examples, job launchers, dashboards, and review loops.

In that setup, a coding agent can attempt work that would otherwise be gated by systems expertise, and a human can evaluate the output through measurable artifacts: speedups, compatibility, loss curves, bits per byte, failed runs, submitted patches, and logs. The agent’s usefulness depends less on a magical prompt than on whether the environment gives it enough concrete surface area to act.

AI Application Architecture Data and Training Evals and Benchmarks Inference and Deployment Agents and Autonomy AI Infrastructure and Compute Coding Assistants