
Steven Feng
Steven Feng is a Stanford Computer Science PhD student affiliated with the Stanford AI Lab and Stanford NLP Group. He is the lead instructor for Stanford CS25: Transformers United, and his research focuses on foundation models, language models, reasoning, generalization, efficiency, and human-aligned AI learning.
Native Multimodal Models Extend LLMs but Still Lack Unified Representations
Victoria Lin of Thinking Machines uses a Stanford CS25 seminar to argue that native multimodal models have extended much of the large-language-model recipe into images, audio, video and action, but have not yet unified multimodal intelligence. Her account is that tokenization, Transformers, autoregressive conditioning and scaling transfer only partly: images, video and action require different representations, objectives and sometimes modality-specific parameters. The result, she says, is a field moving beyond text-only systems while still relying on text as its strongest abstraction for reasoning.
Production Inference Turns Transformer Models Into a Full-Stack Systems Problem
In a Stanford CS25 seminar, Modal’s Charles Frye argues that transformer inference has become the economic and operational center of AI systems: training produces weights, but serving turns them into usable, billable products. His account treats production inference as a full-stack problem, where application latency goals, workload shape, model choice, GPU memory limits, deployment failures, observability and cost controls all determine whether a system works. Frye’s main warning is that the largest serving gains come from matching the inference stack to the application, not from treating model hosting as a generic infrastructure task.
Language Models Generalize Differently From Parameters Than From Context
In a Stanford CS25 seminar, Anthropic researcher Andrew Lampinen argues that language models generalize differently depending on whether information is stored in their parameters or supplied in context. His experiments find that models can often use relations flexibly when the relevant facts are visible in the prompt, but fail to make the same reversals, syllogistic inferences, or codebook translations when those facts have only been learned through training. Lampinen presents augmentation, retrieval, and reinforcement-learned recall as partial ways to make latent implications more usable, while stressing that parametric learning and in-context learning remain complementary rather than substitutes.
Reasoning Gains Persist When Models Learn Them During Pretraining
Shrimai Prabhumoye of Mistral AI uses a Stanford CS25 seminar to argue that large-language-model pretraining is becoming less a matter of adding tokens and more a question of training strategy. Drawing on studies of curriculum ordering, early reasoning data, and reinforcement as a pretraining objective, she says base models improve when they see broad data before high-quality data, encounter reasoning traces during pretraining rather than only in post-training, and are rewarded for intermediate thoughts that improve prediction.
Ultra-Scale Training Depends on Memory Sharding and Communication Overlap
Nouamane Tazi of Hugging Face uses a Stanford CS25 seminar to argue that ultra-scale model training is less a question of adding GPUs than of managing memory, communication, batch size, and hardware topology. His central case is that 5D parallelism (data, tensor, pipeline, context, and expert parallelism) lets training runs span massive clusters only when each axis is chosen for a specific bottleneck. The practical rule, he says, is conservative: shard only as much as the workload requires, because every added parallelism dimension buys scale by spending communication, complexity, or both.