Claude Cowork’s Travel Test Shows Agent Value Beyond Token Consumption

Alex Kantrowitz · Thursday, May 21, 2026 · 7 min read

Anthropic’s Claude Code head Boris Cherny argues that agentic AI should be judged by completed work, not raw token use, citing a recent test in which Claude Cowork checked his email and calendar, corrected his itinerary, and booked eight flights and five hotels. Pressed by Alex Kantrowitz on whether corporate AI adoption is being distorted by “tokenmaxxing,” Cherny says the more important signal is the scale of productivity gains Anthropic and customers are seeing, and that companies may need to redesign work around AI rather than simply mandate usage.

Useful autonomy is completed work, not token consumption

Boris Cherny described a recent travel-planning test as the strongest result he has seen from Anthropic’s agent tools: he gave Claude Cowork a rough itinerary for a multi-city international trip, asked it to check his email and calendar, and then told it to book the travel.

The task was not simply “buy tickets.” Cherny needed to be in several places at specific times for Code with Claude events in London and Tokyo, plus other stops. He gave Cowork an approximate schedule for five stops and asked it to verify the plan. After he authorized it to check his email, it found two stops he had omitted and several dates he had given incorrectly.

He then asked it to book the travel and returned to other work. About an hour later, it had booked eight flights and five hotels. One hotel was in the wrong area; he asked Cowork to rebook it, and it did.

8 flights, 5 hotels: what Cherny said Claude Cowork booked after checking his email and calendar

For Cherny, the significance was not that the trip happened to be complicated. It was that the system handled a task he has repeatedly used as a test case. He tries the same kinds of real workflows with Cowork and Claude Code as models improve, and this was “the best result” he had received.

The lesson he drew was about expectation lag. Engineers who tried a model a year earlier may still assume it cannot be trusted beyond a few lines of code, because that was their prior experience. Cherny argued that the current experience is materially different, and that AI is unusual because users have to keep revisiting tasks that previously failed.

“This is the first technology I've used like this where every month there's a step change in what it can do.”

Boris Cherny

The practical habit he recommended was a “beginner mindset”: retry the technology on tasks where it was not good before, because the next model may do it well.

The demand question is whether agents are doing work or merely burning tokens

Agentic software shifts the user from navigating interfaces to delegating outcomes. Alex Kantrowitz framed Cherny’s travel example as a move away from software built around scale: booking flows, menus, preference settings, and features that may or may not fit the user’s immediate goal. In the agentic model, the user describes the result and lets the system act online according to the user’s preferences.

That shift helps explain the intense interest in products like Claude Code and Claude Cowork. But Kantrowitz pressed on the core measurement problem: whether high usage reflects real demand or demand distorted by incentives.

His concern centered on “tokenmaxxing”: companies encouraging or requiring employees to consume large volumes of AI tokens, sometimes through leaderboards, rewards, or adoption goals tied to AI actions rather than clear productivity gains.

Kantrowitz said he personally uses many tokens and finds Claude Code and Claude Cowork useful for his own business. But he questioned whether large corporate budgets are partly distorted by poor incentives. He cited a recent Financial Times report stating that some Amazon staff used AI tools for unnecessary tasks to inflate usage scores. The quoted report said employees were using software to automate additional unnecessary AI activity to increase token consumption, under pressure after Amazon introduced targets for more than 80 percent of developers to use AI weekly.

Kantrowitz added that he had checked the claim with an Amazon employee, who told him the behavior was real. According to that employee, “I triggered an automation that runs for hours and then gets deleted every day in order to meet these targets.”

The challenge to Cherny was not whether agents can be useful. It was whether token consumption is a reliable signal of that usefulness. Completed work — corrected itineraries, booked flights, usable code, changed workflows — is one kind of evidence. Token volume alone is a weaker proxy.

Cherny says the productivity gains are too large to reduce to token games

Boris Cherny said he does not think tokenmaxxing accounts for a large share of usage. To explain why, he contrasted current AI-driven productivity gains with the kind of productivity work he saw before joining Anthropic, when he worked at Facebook.

At Facebook, one of his responsibilities was the health of code across Meta’s apps, including Facebook, Instagram, and WhatsApp. Code quality mattered because better code made engineers more productive. Before models like Claude, a large productivity team might work for a long time and eventually produce a one, two, or three percent annual improvement in productivity per engineer. That kind of gain was considered meaningful and was “very hard won.”

Claude Code, in Cherny’s telling, changed the order of magnitude. He said Anthropic and its largest customers are reporting productivity gains “on the order of hundreds of percentage points.” The last number Anthropic reported, according to Cherny, was that the amount of code written per engineer at Anthropic had grown by about 250 percent since Claude Code was introduced, while code quality, reliability, and related measures remained stable.

~250%: Cherny’s reported increase in code written per engineer at Anthropic since Claude Code was introduced
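
To put the two scales side by side, here is a minimal arithmetic sketch in Python. It assumes that "grown by about 250 percent" means an increase of 250 percent over the baseline, or 3.5 times the original output. It also treats code written per engineer as loosely comparable to the one-to-three-percent annual productivity gains Cherny described, which the source does not claim. The function names are illustrative only.

```python
import math


def growth_to_multiplier(growth_percent: float) -> float:
    """Convert a percentage increase into a multiple of the baseline.

    Assumption: "grew by 250 percent" means baseline + 250% of baseline.
    """
    return 1.0 + growth_percent / 100.0


def years_to_match(growth_percent: float, annual_percent: float) -> float:
    """Years of compounding at annual_percent needed to reach the same multiple."""
    target = growth_to_multiplier(growth_percent)
    return math.log(target) / math.log(1.0 + annual_percent / 100.0)


if __name__ == "__main__":
    multiple = growth_to_multiplier(250)
    print(f"250% growth is {multiple:.1f}x the baseline")
    # Compare against the 1-3% annual gains described for pre-AI productivity teams.
    for rate in (1, 2, 3):
        years = years_to_match(250, rate)
        print(f"At {rate}% per year, matching that takes about {years:.0f} years")
```

Under these assumptions, compounding at two percent a year would take a little over six decades to reach the same multiple, which is the gap in order of magnitude that Cherny is pointing to.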

Cherny did not recommend tokenmaxxing as a blanket practice. His advice to companies was more basic: give employees enough tokens to experiment without asking for approval every time, and create psychological safety around trying new workflows. Some experiments will fail; some will work. The organizational problem, in his view, is that companies cannot reliably predict where the useful innovations will come from.

The breakthroughs may not come from the people management already thinks of as top performers. They could come from an accountant automating accounting in a way no engineer would have imagined, a marketer automating marketing, or a new graduate engineer building something unexpected.

That unpredictability shaped his guidance: let people experiment first, then optimize around use cases once they scale. If a competitive token-usage culture works for a particular company, he said, that may be fine. If another company prefers Anthropic’s approach — creating space and safety for experimentation — that can be fine too. He treated the right method as company-dependent rather than universal.

AI adoption may require redesigning work, not just mandating usage

Boris Cherny did not claim to know how widespread tokenmaxxing is. He had heard of it “as a trend” but did not know how many companies were doing it. On Claude Code specifically, his counterpoint was that Anthropic has “many, many, many customers,” so usage is not being driven by a single company.

His broader answer was that mandates and targets may be attempts at organizational change, even if some implementations produce bad incentives. He avoided speaking for those companies and suggested asking them directly, but characterized the likely goal as business-process change: figuring out how a company can actually benefit from AI.

Cherny compared the current AI adoption problem to the adoption of personal computers. He recalled a Harvard Business Review article from the 1990s asking why computers were not yet producing obvious productivity gains. In retrospect, he said, it seems obvious that computers make people more productive. At the time, it was not.

His explanation was that companies could buy computers without redesigning work around them. If paper filing cabinets, drawers, and pen-and-paper processes remained central, and the computer sat on the periphery, the company would not get much benefit. The companies that gained were the ones that went through the painful process of putting computers at the center of business operations.

Cherny sees a similar split forming around AI. Companies are trying to capture productivity improvements, but the right operating model is unclear because every company has a different business, culture, organization, and way of working. Some are experimenting with mandates and targets. Others are creating room for bottom-up workflows. He did not present one model as correct.

The unresolved tension is how to tell the difference between adoption and performance. Kantrowitz’s Amazon example points to the risk of fake activity: systems running because the metric rewards usage. Cherny’s answer is that the underlying productivity gains are large enough that companies are rationally searching for ways to reorganize around AI. The risk, in that frame, is not only inflated demand; it is also failing to revisit what the technology can now do.

Topics: Agents and Autonomy, AI Economics and Labor, Coding Assistants, Enterprise AI Adoption
