Kobie Crawford

Developer Advocate at Snorkel AI focused on AI benchmarks, agentic evaluation, and high-quality data for model training; author and speaker on topics including Terminal-Bench-style agentic tasks, coding-agent evaluation, and task fidelity scaling laws.

A 4B Model Beat Qwen3 235B by Learning Tool Discipline

Kobie Crawford of Snorkel argues that some enterprise AI failures are less about model size than about whether models behave correctly inside constrained tool environments. In Snorkel’s FinQA work with UC Berkeley’s rLLM/Agentica, a 235B Qwen model hallucinated a financial answer after failed SQL calls, while a 4B model fine-tuned with reinforcement learning learned to inspect tables, correct errors and calculate from retrieved data. Crawford presents the result as evidence that targeted RL, structured evals and behavior-specific training can outperform simply moving to a larger model for this class of financial analysis task.

AI Engineer · Jun 10, 2026 · 9 min read

High-Quality Agentic Tasks Drove 5x More Fine-Tuning Uplift

Snorkel’s Kobie Crawford argues that task quality, not just model size or compute, can determine whether agentic fine-tuning produces useful gains. In a Terminal-Bench-style experiment holding the base model, compute budget and task count constant, Snorkel reported that fine-tuning on rejected low-quality tasks improved Qwen3-8B by about one percentage point, while accepted high-quality tasks improved it by 6.2 points. Crawford’s case is that well-specified, reliable tasks create learnable failures, while ambiguous prompts, mismatched tests and broken environments mostly add noise.

AI Engineer · Jun 2, 2026 · 9 min read