Orply.

Five Code-Level Fixes Cut Agent Token Costs Without Prompt Changes

AI EngineerSunday, June 28, 20265 min read

Erik Hanchett, a developer advocate at AWS, argues that production agents can become expensive not because they fail, but because they repeatedly send too much to the model and use costly models for work that does not require them. His talk lays out five code-level controls for token spend: cache static prompts, route simpler tasks to cheaper models, keep large tool outputs out of repeated context, cap tool loops, and trim long conversation histories.

Agent bills rise when the same material keeps going back to the model

Agent cost can rise even when the agent is working as intended. ? erik-hanchett frames the problem as unnecessary token use: static prompts are resent, long histories are carried forward, large tool outputs are added back into context, and expensive models handle calls that could have gone to cheaper ones.

His five fixes are implementation changes rather than prompt rewrites: cache the system prompt, route requests by task difficulty, offload large tool results, cap tool loops, and trim conversation history. Each targets a specific way token use compounds as an agent runs.

The recap slide states the intended outcome as “Same Agent. Smaller Bill.” The practical premise is that the bill is not only a function of the final answer the user sees. It also depends on what the agent repeatedly sends to the model while deciding what to do next.

Static prompt text should not be sent as fresh context every turn

The first target is the system prompt. ? erik-hanchett uses AWS’s Strands Agents and a Bedrock model configuration in the example, while saying the idea works across different providers. The slide is presented as pseudocode: an agent is configured with a BedrockModel, model_id="claude-sonnet", and cache_prompt="default", while the agent still receives system_prompt=BIG_SYSTEM_PROMPT.

The behavior Hanchett describes is that the first agent call sends the full system prompt, and later calls send a much smaller version because the prompt is cached. He does not spell out provider-level mechanics. The implementation point is to use caching where the framework or provider supports it, so static instructions are not treated the same way as changing context on every call.

On the first call of your agent, it will send the full system prompt over. And then on every subsequent call, it will have a much reduced system prompt being sent over.

? erik-hanchett · Source

Hanchett also says tool prompts and messages can be cached. The broader rule is to identify parts of the agent context that do not change and avoid repeatedly sending them in full.

Model choice should depend on how hard the call is

? erik-hanchett’s second fix is to route by difficulty. His warning is direct: do not use the most expensive model for every task.

The example is intentionally simple. A pick_model(task) function returns "claude-haiku" when task.is_simple is true and returns "claude-sonnet" otherwise. The slide labels Haiku as “cheap” and Sonnet as “expensive reasoning.” Hanchett says difficult tasks may justify a newer frontier model, but simpler inference calls should be sent to a cheaper model.

The routing logic can be as basic as an if statement. Hanchett also says a cheap model can be used to decide which model should handle the request. The main design choice is to make model selection part of the agent’s execution path instead of setting one expensive default for everything the agent does.

Large tool results should become summaries and references

The third fix addresses tool outputs that are too large to keep passing through the agent loop. ? erik-hanchett shows a fetch_report tool where api.get(id) returns data annotated as “10k tokens.” Instead of returning that full payload to the model, the tool stores the data locally or in the cloud, then returns a summary plus a reference key.

The slide’s footnote makes the intended behavior explicit: “only the summarized tool result gets added not the full tool result.” The pattern shown is compact: fetch the data, store it, and return a summary with a reference.

StepCode shownPurpose
Fetch`data = api.get(id)`Retrieve a large result, shown as 10k tokens.
Store`key = store.put(data)`Move the full result out of the agent context.
Return`return summarize(data, key)`Send the model a smaller summary plus a reference.
Hanchett’s manual pattern for keeping large tool results out of repeated context

Hanchett says Strands Agents offers APIs for this, though the slide shows a manual version. The framework-independent principle is to keep the heavy object outside the context window and give the model the smaller representation it needs for the next step.

This matters because a large tool result can be charged more than once. If the result is added back into context every time the agent loops, the cost of that one oversized payload can multiply across later model calls.

Runaway tool loops need hard limits before they need tuning

? erik-hanchett’s fourth recommendation is to cap tool loops. He describes agents calling the same tool “over and over and over again.” Without a cap, a tool call might run 10 or 20 times, or even enter an infinite loop.

The code example sets a maximum number of iterations on the agent: max_iterations=8. The number eight is an example, not a universal recommendation. The important control is the ceiling, so repeated tool use stops before it becomes unbounded token spend.

If you don't cap that tool call, then it might run 10, 20 times, it might get into an infinite loop, which would be very bad for your token usage.

? erik-hanchett

Hanchett pairs the hard limit with measurement. Before deployment, he recommends using observability tools to inspect tool-call behavior for every tool: how long calls run, how often they loop, and whether the pattern looks efficient. The cap prevents the worst case; observability shows where the loop itself needs improvement.

Long conversations should not resend the entire history

The final fix is to trim multi-turn history. In long-running conversations, ? erik-hanchett says the whole conversation history may be sent back to the model on every call. That can consume hundreds or thousands of tokens as the exchange grows.

His Strands Agents example uses SlidingWindowConversationManager with window_size=10, so the agent sends only the last 10 messages. The window size is configurable. The point is to stop treating the full transcript of the conversation as necessary input for every new turn.

There is a tradeoff: trimming the window means earlier messages drop out of direct context. Hanchett’s mitigation is summarization. Older history can be compressed into a smaller summary and placed into the context window, rather than resending every old message verbatim.

That gives the agent a bounded recent history while preserving earlier conversation only in summarized form. The savings come from replacing repeated full-history sends with a smaller window and, when needed, a summary of older turns.

The frontier, in your inbox tomorrow at 08:00.

Sign up free. Pick the industry Briefs you want. Tomorrow morning, they land. No credit card.

Sign up free