Small Local Models Can Replace Frontier Calls When Product Evals Prove Fit

Rachel NaborsAI EngineerMonday, June 29, 202611 min read

RL Nabors of Arize argues that teams should stop treating frontier-model calls as the default for production AI features. Her case is to prototype with the strongest model when needed, then use golden datasets, capability evals, and trace-based measurements to work down to the smallest local or task-specific model that meets the product’s own bars for accuracy, latency, cost, privacy, and reliability. In her Mima summarization example, that process moved a Claude-backed feature to an on-device Llama 3.2 3B configuration, with evals becoming the guardrail for future changes.

Most frontier calls should be treated as a hypothesis, not a default

Rachel Nabors’s central claim is practical rather than ideological: many teams reach for frontier models by default, pay for them on every call, and never verify whether the task actually requires that scale. Her proposed replacement is not “use small models everywhere.” It is a measurement loop: prove the behavior with the strongest available model, define what success means, then work down to the smallest model that passes the tests.

The costs she wants teams to account for are broader than token prices. A cloud model means data leaves the user’s environment and is sent to remote servers, with what Nabors described as risks of exposure, interception, retention, and breach. Latency is also part of the bill. She cited research on LLM-powered conversational agents that puts four seconds as a threshold after which user experience degrades; in her own later evals, Claude Sonnet crossed that bound at p95. Remote inference also fails outright when the application is offline, in an outage, on poor connectivity, or in secure environments where cloud access is unavailable.

Cash is only one dimension, but it matters. Nabors argued that falling token prices do not necessarily lower total inference spend because agentic and reasoning workloads can consume tokens faster than prices decline. A feature that looks cheap per call can become expensive when an agent makes multiple calls, or when a consumer product runs the same operation repeatedly for many users.

Prototype big. Deploy small.

Rachel Nabors

The alternative begins with asking whether the task needs a general-purpose language model at all. If the input is a camera stream, Nabors pointed to vision models such as MobileNet, YOLO, and MediaPipe. If it is audio, models such as Whisper or Wav2Vec2 may be the right class. If the task is chat, translation, analysis, or summarization, a small language model may be enough.

Small language models, in Nabors’s framing, are not toy systems. They contain millions to billions of parameters, while larger foundation models may have hundreds of billions or trillions. The important consequence is not just model size as an abstract number, but where the model can run. Smaller models can fit on consumer devices, especially when quantized to 8-bit or 4-bit formats, which can halve disk and memory requirements. Nabors said one billion parameters fits at roughly 2 GB in FP16, and her slide listed examples ranging from TinyLlama at 1.1B parameters and about 2.2 GB to Qwen 3 variants from 0.6B to 8B parameters and about 1.2 GB to 16 GB in FP16.

That size difference supports a different product architecture: more work on device, fewer round trips, lower latency, and offline availability. Nabors also emphasized that keeping information on device can reduce the need to transmit user data or PII to a remote model provider. On energy, she presented a research citation saying SLMs consume the same or less energy than LLMs to produce correct responses, and a simplified comparison chart in which an SLM used about 25% of the proportional energy of an LLM for the illustrated task, with a task-specific model using still less.

The right model is the smallest one that passes your own eval

The selection process Nabors described has four steps. First, prove the task is possible using the most capable model or most capable task-specific system available. Second, set success criteria using representative inputs and expected outputs. Third, test models from small to large. Fourth, choose the smallest model that gives acceptable responses for the actual inputs.

The point of starting with a frontier model is speed of discovery, not long-term dependency. Nabors used Claude to prototype a summarization feature in Mima, her social client. The feature summarizes long conversation threads — the kind of thread where a user wakes up to dozens of replies and wants to know what happened, who is talking about what, and whether there is a problem. Claude’s outputs were good enough to establish that the feature was possible.

The next step was to turn that prototype into an eval. Nabors exported a golden dataset: a curated set of input-output pairs used as the ground truth for model evaluation. In her case, the dataset contained public social-thread data with fields such as author handle, content, creation time, and whether the message came from her. She built a JSONL dataset from 14 threads and evaluated each one in two forms: a short summary and a summary with references into the conversation.

golden-dataset examples from 14 Mima threads, with two outcomes per thread

The success criteria were intentionally concrete. Nabors measured whether outputs were valid JSON by attempting to parse them. She measured whether reference tokens such as [ref:N] actually pointed to messages that existed in the thread. She measured factual consistency — whether the summary stayed faithful to the underlying thread rather than inventing claims — using an LLM-as-judge, with Claude scoring each summary against its source. She also measured length compliance, p50 latency, and p95 latency.

Dimension	Question	Measurement
JSON validity	Does the output parse?	Try JSON.parse and count successes
Reference structural validity	Do [ref:N] tokens point to real messages?	Extract refs with regex and check each N against the message count
Factual consistency	Does the summary stay faithful to the thread?	LLM-as-judge scores each summary against its source
Length compliance	Does it stay in the target word band?	Word count against context-specific limits
p50 latency	What is the typical time to summary?	Median across the eval set
p95 latency	What is the worst-case wait?	95th percentile across the eval set

Nabors’s Mima eval combined deterministic checks, LLM-as-judge scoring, and latency measurements.

This is where her advice diverges from casual model selection. She did not recommend asking colleagues which model they like and shipping that. She tested a set of local models against the same dataset and compared them with Claude Sonnet as a baseline. The contestants were Qwen 2.5 Instruct at 1.5B parameters and 1.0 GB on disk, Qwen 3 at 1.7B and 1.1 GB, Llama 3.2 Instruct at 3B and 2.0 GB, and Gemma 4 E2B-IT at roughly 5B and 3.1 GB. These were quantized sizes as shown in the eval slide.

The baseline was not free. Nabors said Claude Sonnet averaged 2.9 seconds and cost about 22 cents to run 14 tasks. Based on her own Mima usage, she estimated roughly a dollar per day in inference. That was too expensive for the kind of consumer product she wanted to build, especially if many users were making similar calls. Local inference, by contrast, moved the compute cost to the user’s device, eliminating her API bill while raising a different product question: whether the app would drain the user’s battery. Nabors noted that battery impact would itself be a future eval target.

The first result did not favor the model she initially expected. Qwen 2.5 was fastest, around one second p50 latency, but its reference accuracy was too low. Gemma 4 E2B had been recommended to her by engineers, but in this test it was much slower, around eight seconds. Claude was the accuracy ceiling but was slower than the best local candidate. Llama 3.2 3B landed near 90% reference accuracy while staying much faster than Gemma and close enough to Claude’s outputs that Nabors said many raw responses were “pretty much indistinguishable” from Claude’s when inspected in Phoenix.

Her conclusion for this feature was that Llama 3.2 3B was the “SAGE” model — her term for “small and good enough.” Nabors interpreted the result partly through domain fit: because Llama 3.2 comes from Meta, and Mima is a social-thread summarization product, she said it made sense that a model from a company with strong interest in social inputs would perform well on that kind of material. The broader lesson was not that Llama is always the right choice. It was that the right answer emerged only after task-specific evals.

Prompt engineering narrowed the gap, but only one variant helped

After choosing Llama 3.2 3B, Rachel Nabors treated the remaining performance gap as an optimization problem. A local model at about 90% accuracy may still be unacceptable depending on the feature. Her next step was to test whether prompt changes could improve Llama’s output without changing the model.

This matters most when the team cannot easily control or update the model. Nabors gave mobile apps as an example: if each capability improvement requires shipping a newly distilled one- or two-gigabyte model, users may have to download too much data. A team may also be committed to a specific on-device model for release reasons. In those cases, prompt engineering is often the available lever.

She narrowed the target metrics to the places where Sonnet and Llama diverged: JSON and reference structural validity, factual consistency, p50 latency, and p95 latency. Her bars were strict where system correctness required it. JSON and reference structural validity needed to be at least 99%, because parse failures and invalid references introduce bugs. Factual consistency needed to be at least 95%, with the remaining 5% allowing for ambiguity and reasonable inference rather than outright invention. p50 latency needed to be at or below 1500 ms, and p95 latency at or below 3500 ms, just under the four-second user-experience threshold she had cited earlier.

Metric	Bar	Why it mattered
JSON and reference structural validity	≥99%	Outputs must be parseable and references must not introduce bugs
Factual consistency	≥95%	Anti-hallucination bar, allowing limited ambiguity and inference
p50 latency	≤1500 ms	Feels instant enough on an M-series Mac
p95 latency	≤3500 ms	Stays under the four-second worst-case experience threshold

The tightened success bars Nabors used while optimizing Llama 3.2 3B.

Nabors then tested five prompt variants, changing one variable at a time. V1 was the baseline prompt. V2 reformatted the thread as numbered messages rather than JSON, based on the hypothesis that small models might track natural-language indices better than array offsets. V3 added a few worked examples, on the theory that small models learn format faster from examples than from rules. V4 added strict negative constraints such as “no preamble” and instructions to count words before responding. V5 used chain of thought, forcing the model to identify key moments before writing, with the hypothesis that explicit reasoning would improve grounding.

The results were not evenly positive. Reformatted input barely helped and added latency. Explicit rules made performance worse. Chain of thought improved length somewhat but hurt reference accuracy and factuality while adding roughly 638 ms of latency. The few-shot prompt was the clear winner: length compliance improved by 10 points, reference accuracy by 8.3 points, factual consistency by 5.8 points, and latency increased by only 241 ms.

Variant	Length	Ref accuracy	Factual	Latency
Baseline	77.4%	91.2%	87.1%	1055 ms
Reformatted input	+1.2	-1.1	+0.6	+606 ms
Few-shot	+10.0	+8.3	+5.8	+241 ms
Explicit rules	-4.8	-6.6	-3.4	~flat
Chain of thought	+5.9	-5.3	-1.9	+638 ms

Prompt-variant results for Llama 3.2 3B. The slide described averages across a 29-example golden dataset with three repetitions per example, while Nabors earlier described 28 examples from 14 threads.

Her interpretation was blunt: the model did not respond well to being bossed around with explicit negative rules. The few-shot prompt, by contrast, gave it examples of the desired behavior and produced the largest overall gain for a small latency cost.

The last gap was not all model quality

The optimized comparison between Claude Sonnet and Llama 3.2 3B showed both the promise and the ambiguity of eval-driven replacement. With the few-shot prompt alone, Llama reached 100% JSON validity, 91.7% reference structural validity, 92.9% factual consistency, 93.3% length compliance, 1296 ms p50 latency, and 3998 ms p95 latency. Claude Sonnet had 100% JSON validity, 100% reference structural validity, 100% length compliance, 3046 ms p50 latency, and 4750 ms p95 latency. Claude’s factual consistency was not scored in the same way because Claude was being used as the judge.

Metric	Bar	Claude Sonnet	Llama 3.2 3B with few-shot prompt
JSON validity	≥99%	100%	100%
Ref structural validity	100%	100%	91.7%
Factual consistency	≥95%	—	92.9%
Length compliance	≥90%	100%	93.3%
p50 latency	≤1500 ms	3046 ms	1296 ms
p95 latency	≤3500 ms	4750 ms	3998 ms

With the few-shot prompt alone, Llama beat Claude on p50 latency but still missed the p95 target and some structural and factual targets.

The next move was to inspect the traces rather than accept the aggregate scores at face value. On factual consistency, the Claude judge could be strict in ways that were subjective rather than product-critical. Nabors’s example was a distinction like calling someone “angsty” versus “cross.” The disagreement was not necessarily hallucination; it could be a stricter preference by the evaluator model. That did not make the score useless, but it meant the team had to understand what the judge was penalizing.

Reference consistency and length did not need to be solved entirely by the model. Those failures were amenable to deterministic post-processing. A harness can check the number of messages in a thread and strip references outside the valid range. It can also enforce a word-count limit by truncating output that is too long.

After adding those safety nets, Nabors presented a shipped local configuration: Llama 3.2 3B with the V3 few-shot prompt, post-hoc reference validation, post-hoc word-count truncation, and KV-cache reuse on the few-shot prefix. The distinction matters: the post-processing handled reference and length compliance, while the p95 latency improvement came from KV-cache reuse; the slide noted that V3 alone was 3998 ms at p95.

In the final configuration, JSON validity was 100%, reference structural validity was 100%, length compliance was 100%, factual consistency remained 92.9% under the Claude judge, p50 latency was 1296 ms, and p95 latency came in under 3500 ms. Claude Sonnet, by comparison, showed 3046 ms p50 and 4750 ms p95 in the same comparison slide. Nabors said the local setup ended up meeting and beating Claude Sonnet after that extra work, while saving her roughly a dollar a day in inference costs.

The lesson was not simply that prompt engineering fixed the model. The final system combined model choice, prompt choice, evaluator inspection, deterministic application logic, and a latency optimization. Some failures belonged in the prompt. Some belonged in post-processing. Some belonged in a human review of the evaluator’s judgment.

Evals become the guardrail after the migration

Once a local model is good enough to ship, Rachel Nabors argued, the evals should not be discarded. They become regression tests. A prompt change, model upgrade, or LLM migration can undo prior gains: summaries might expand into paragraphs, references might break, or hallucinations might reappear. Nabors compared regression evals to CI/CD tests and described them as the way to keep a CTO from accidentally destroying an agentic experience with a seemingly minor change.

That regression frame is important because the model-selection process is not a one-time benchmark. The golden dataset, evaluators, traces, and success bars become a product asset. They encode what “good enough” means for the feature, not in general but for the inputs and user experience the product actually has.

Nabors closed by turning the argument into an audit question: how many Claude calls could be Llama calls? For an existing system, the first step is to inspect current AI spending and identify calls that may not require a frontier model. For new work, she advised looking for small or specialized models already available on the user’s device.

One visual example was Chrome’s built-in local-AI architecture. The slide, attributed to chrome.com/docs/ai/built-in, showed a web app interfacing with Chrome exploratory APIs, task APIs, Gemini Nano, expert models or fine-tunings, and an AI runtime running on CPU, GPU, NPU, or TPU. The practical point was that if a browser already provides access to a local model such as Gemini Nano, a team may not need to ship a model to every user at all.

The same principle applies across the stack: prototype with the largest or most capable model when uncertainty is high, then replace production paths with small language models or task-specific models when evals show they satisfy the use case. The replacement is justified only when measured against the product’s own criteria: parseability, references, factuality, length, latency, cost, offline behavior, and whether the product can avoid sending data to a remote provider.

AI Application Architecture Evals and Benchmarks Inference and Deployment Open Models