Charles Frye

Charles Frye is a Member of Technical Staff at Modal, where he writes and speaks about AI infrastructure, GPU workloads, model deployment, and production inference. He previously worked at Weights & Biases and holds a PhD from UC Berkeley.

Production Inference Turns Transformer Models Into a Full-Stack Systems Problem

In a Stanford CS25 seminar, Modal’s Charles Frye argues that transformer inference has become the economic and operational center of AI systems: training produces weights, but serving turns them into usable, billable products. His account treats production inference as a full-stack problem, where application latency goals, workload shape, model choice, GPU memory limits, deployment failures, observability and cost controls all determine whether a system works. Frye’s main warning is that the largest serving gains come from matching the inference stack to the application, not from treating model hosting as a generic infrastructure task.

Stanford Online · Jun 4, 2026 · 22 min read