AI Voice Agents Are Beating the Average Customer-Service Rep

Tom Chen · Eye on AI · Thursday, June 4, 2026 · 17 min read

Tom Chen, chief product officer at Aircall, argues that AI voice agents should be judged against the average customer-service interaction, not the best human rep. In his account, the technology is already good enough for many routine calls, can handle far more concurrency at lower cost, and may improve satisfaction when customers are given a clear choice between faster AI service and a human agent. The main constraint, Chen says, is often not the model but the undocumented company knowledge the agent needs to resolve issues.

The benchmark is not the best human agent

Tom Chen’s case for AI voice agents rests on a comparison he thinks many companies get wrong. The question is not whether an AI voice agent can outperform the most experienced person in a contact center: the veteran who knows the undocumented exceptions, the product quirks, the customer histories, and the internal workarounds. Chen’s answer to that is still no, or at least “probably not yet.”

The more relevant comparison, he argues, is the median rep. Against that benchmark, he believes AI voice agents can already look better than the average customer-service experience.

Is an AI voice agent going to be as performant as your smartest agent who's been around, who kind of has all the knowledge of the things that are never documented? Probably not yet. But I do think that it is better than the average rep in a call center.

Tom Chen

The reasons are partly operational and partly experiential. From the company’s side, newer AI models are more likely to adhere to the intended workflow. They do not go off script in the same way a human rep might, and they can be instructed to handle or escalate particular topics according to the company’s rules. Earlier models were worse at this, he said, but the newer ones are “much better.”

From the customer’s side, the advantage is patience. An AI voice agent does not become irritated, does not let tone drift after a difficult interaction, and does not tire. Chen described its patience level as effectively “infinite.” For customers with a simple issue, that can make the interaction cleaner and faster than a human call.

Craig Smith pressed the point from the customer side. He said he had encountered AI voice agents that were hard to distinguish from humans, and that he found the experience “kind of a relief” because the conversation tended to be simpler and more direct. Chen said many Aircall customers have moved toward that same view as they have used the technology more.

The surprising finding, in Chen’s telling, is that customers do not always resist AI when the tradeoff is presented plainly. He described an Aircall customer, which he believed was in Australia, and said he had heard similar examples from U.S. customers. Instead of sending every caller directly to an AI voice agent, the company asked whether the caller wanted to speak with a human or get faster service through an automated AI agent. Chen said the share choosing the AI option was “much, much higher” than the company expected.

Chen’s interpretation is that the choice allowed customers to classify their own problem. If the issue seemed simple and speed mattered, the AI option was attractive. If the issue was complex, emotionally sensitive, or required judgment, the human option remained available. That design improved operational efficiency, and Chen said CSAT scores in those operations were higher because customers had more control and optionality.

The design principle is not that every customer wants AI. Many customers want resolution, and will accept — or prefer — AI if the bargain is explicit: faster service from an AI agent, or a human if that is what the issue requires.

The hard part is often the missing company knowledge

Chen’s clearest limitation for AI voice agents is not the voice model itself. It is the company’s own memory.

Some customers, he said, are overly optimistic about how high their resolution rates will be with a voice agent. If the AI had perfect information, those expectations might be reasonable. The obstacle, especially in smaller companies, is often missing context.

He contrasted customer-service automation with code generation. In coding, much of the context a tool like Copilot needs is already documented in the codebase. In many small businesses, customer-service and sales knowledge is not documented in the same way. It lives in employees’ heads. The best rep performs better not merely because they are smarter, but because they know the exceptions, the undocumented processes, and the company-specific details others do not.

That tribal knowledge becomes the bottleneck. The question, Chen said, is not how well AI can perform with perfect information. It is how to extract and structure the missing information without forcing small businesses to sit down and document everything manually.

He pointed to context engineering, data services, and ease of use as central product challenges. Large enterprises may be able to dedicate resources to creating and maintaining AI agents. Smaller companies generally cannot. Chen said SMB-oriented AI agent providers are all trying to solve this problem because it is widespread across smaller businesses.

This is also why Chen resists treating the underlying AI technology as the whole differentiator. He described AI technology as “somewhat commoditized” in many ways. What is less commoditized, in his view, are the products and services that help discover the missing knowledge inside a company and make it usable by the AI.

The limitation also clarifies the earlier benchmark. An AI voice agent may outperform the average rep when the workflow is clear and the context is available. It is less likely to beat the veteran employee whose value comes from undocumented knowledge the system has not yet captured.

Adoption starts where voice used to be uneconomic

Chen described the market for AI customer communication as early but accelerating. Aircall, which he joined as chief product officer a little over two years before the interview, had been operating in what he called “a completely different world” soon after ChatGPT launched. Since then, he said, Aircall has shifted from being understood primarily as a SaaS phone and customer-communication business to being viewed more as an AI business.

The customer expectation changed with it. Companies no longer ask only for a phone system or communications software, in Chen’s account. They ask what the system can resolve: tickets, phone conversations, customer questions, and, increasingly, autonomous conversations. He said deployment is not universal and described the market as still in “single-digit early innings,” but said businesses are already looking for partners they can trust for the next phase.

That trust is not only about model quality. It includes telephony reliability, chat reliability, and the ability to navigate telecommunications regulations in different countries. For a customer-communication platform, the AI layer sits on top of infrastructure that still has to work.

The adoption pattern Chen sees is incremental. Large customer operations often have extensive staffing, fixed shifts, and carefully tuned workflows. He referred to contact-center operations with 10,000 or even 100,000 reps working across shifts over a year of operations. Those companies may have little appetite for replacing a core daytime operation all at once, because the downside of disrupting a well-oiled system is too high.

The first step, therefore, is often not the main queue. It is after-hours coverage, overflow calls, or other cases where the company was not answering anyway. In those situations, even a partial answer, a structured intake, or a captured message can be “upside only” or close to it. Once customers see those use cases work, Chen said, they begin expanding coverage.

His broader claim about voice is that LLMs changed the economics of a channel many companies had tried to avoid. Before large language models, he said, the prevailing assumption was that messaging and asynchronous chat were the future. Voice was expensive because it required trained people, staffing plans, and real-time availability. If a company can offer voice without scaling human headcount at the same rate, voice becomes a competitive advantage rather than a cost center hidden behind forms, chatbots, and phone trees.

Chen described the past 25 years as a “deflection culture,” in which companies treated deflection as the key metric. In his view, that is at odds with what actually matters: customer satisfaction. Smith agreed from his own experience, noting how often he cannot find a company phone number at all and instead finds a website form with no confidence anyone is monitoring it.

Chen’s reply was that every communication opportunity is a business opportunity. A complaint can strengthen a customer relationship if handled well. A call can also be a qualified sales opportunity that the company misses if it is inaccessible. His view is that it makes business sense to have something capable of taking customer calls 24/7, and that the downside is limited in many cases.

Every communication opportunity is a business opportunity.

Tom Chen

Lower cost allows companies to reverse design choices that were made because voice was too expensive. A company that previously buried its number, hid behind a form, or used a rigid phone tree may be able to make live-feeling voice access available again.

Chen expects this advantage to be temporary. If AI voice becomes common, companies will have to find a new frontier of differentiation. For now, he said, many businesses are still trying to get the basics in place: overcome the fear of AI, create an initial deployment, and move incrementally toward more coverage. Small and midsize businesses may move more slowly at first, but Chen expects acceleration because the return on investment is “too high to ignore.”

Local context still matters when the software speaks 100 languages

Aircall’s international footprint came up as more than a coverage map. Modern AI providers can offer multilingual transcription or voice-agent support across more than 100 languages, Chen said. The more distinctive Aircall advantage, in his account, is not simply speaking a language but operating with local commercial context.

Aircall has teams on the ground in London, Paris, Madrid, Sydney, Berlin, Mexico City, and U.S. hubs including San Francisco, New York, and Seattle. Chen argued that the difference between language competence and cultural business fluency is large. Local sales, support, and customer-success teams know how businesses operate in their markets, not just how to translate words.

He attributed part of that to Aircall’s European heritage. A company starting in Europe is forced to think across markets early, whereas a U.S. company can initially operate inside one large domestic market. Aircall, he said, had to confront multiple markets from day one and built its products and go-to-market structure with that in mind.

In a market full of AI automation and, as Chen put it, “AI spam,” he thinks human help and local business understanding become scarcer and more valuable. For multinational smaller businesses, the ability to work with teams that understand local norms can matter as much as the AI itself.

The human handoff is a business choice, not a universal rule

Aircall’s customers, Chen said, are mostly not contact centers themselves. Some may use offshore or onshore contact centers that in turn procure Aircall, but Aircall’s typical relationship is direct with businesses that need customer communication infrastructure.

Tom Chen does not describe AI deployment as an all-or-nothing replacement for human agents. He separates Aircall’s AI work into two broad modes.

The first is assistance technology: AI running in the background while a human leads the conversation. In those cases, the system listens for tone, keywords, and configured topics; helps reps follow playbooks; and speeds training and onboarding. Chen said there are many businesses where, even if AI could perform at the same level as a human, the customer relationship still calls for a person. A company trying to establish a long-term relationship may not want the first interaction to feel like it is being handled by an AI agent. In that context, AI is better positioned as a coach or assistant than as the voice of the brand.

The second mode is autonomous agents with designed escalation points. For higher-volume, more transactional businesses, customers can configure when the AI should attempt resolution and when it should route the call to a human team. Refunds were Chen’s example of a workflow many smaller companies would not want to leave entirely to AI because of fraud and approval concerns.

The configuration can be specific. If the AI detects topic A, it can route to one team; if it detects topic B, it can route elsewhere. Companies can also adjust how hard the AI should try to answer before escalating.

Here Chen drew a distinction between companies that prioritize fast human intervention and those that retain what he called a deflection mindset. Some customers tell Aircall to escalate as soon as frustration appears. Others instruct the agent to keep trying to deflect the issue and only escalate after several signs of frustration. Chen said businesses have the right to choose how they weigh deflection against satisfaction, and Aircall provides configurability and best-practice guidance depending on the industry.
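
The routing and escalation choices described above can be pictured as a small policy object. The following is a minimal sketch, not Aircall's configuration format: the class, field names, team names, and thresholds are all hypothetical, chosen only to show how a "topic A goes to one team" rule and a frustration threshold might combine.

```python
from dataclasses import dataclass, field


@dataclass
class EscalationPolicy:
    # Hypothetical mapping: detected topic -> human team that takes the call.
    topic_routes: dict = field(default_factory=dict)
    # Hypothetical knob: frustration signals tolerated before handing off.
    frustration_limit: int = 1
    # Hypothetical fallback team for frustration-driven handoffs.
    default_team: str = "support_team"


def next_step(policy: EscalationPolicy, topic: str, frustration_signals: int) -> str:
    """Return "ai" to keep the agent on the call, else the team to route to."""
    # Topics the business never leaves to the AI are routed immediately.
    if topic in policy.topic_routes:
        return policy.topic_routes[topic]
    # Otherwise the AI keeps trying until the frustration limit is reached.
    if frustration_signals >= policy.frustration_limit:
        return policy.default_team
    return "ai"


# Two illustrative businesses: one escalates at the first sign of
# frustration, the other keeps deflecting until the third sign.
escalate_fast = EscalationPolicy({"refund": "billing_team"}, frustration_limit=1)
deflect_first = EscalationPolicy({"refund": "billing_team"}, frustration_limit=3)

print(next_step(escalate_fast, "order_status", 1))  # support_team
print(next_step(deflect_first, "order_status", 1))  # ai
print(next_step(deflect_first, "refund", 0))        # billing_team
```

The same caller, with the same single sign of frustration, reaches a person under one policy and stays with the AI under the other. That difference is a business decision expressed in configuration, not a property of the model.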

The tension is central to the category. AI can make customer communication more available and more efficient. It can also be configured to reproduce the same frustrating avoidance behavior customers already dislike. The technology does not, by itself, decide whether a company uses voice agents to serve customers or keep them away from humans.

Call data becomes useful only when the phone system is modern enough

Craig Smith asked whether AI voice agents create useful back-end data that can improve service, and whether the same is true for human agents. Chen said that, in a modern phone or contact-center system, the answer is generally yes: calls can be recorded, transcribed, analyzed, and routed into other systems, subject to country and state-level consent rules.

He emphasized that recording requirements vary. In the United States, different states have different rules. Other countries have their own requirements. That is why callers often hear a message that the call is being recorded for training or related purposes. If a call is being recorded, Chen said, that needs to be disclosed to the customer.

But he also said a large share of the world is still not on modern systems.

More than 40% of the world is not on a modern contact center or phone system, according to Chen

Without a digital phone or contact-center system, the AI layer has little to work with. Calls cannot easily be transcribed, analyzed, or integrated into a CRM. Chen said the AI wave appears to be motivating some companies to modernize systems they had left alone for years, though he added that he did not have hard data to prove that market observation. Digitization by itself may not have been enough to trigger the change; the potential of AI may be making modernization more compelling.

Chen was careful about Aircall’s own use of customer data. He said Aircall does not train on customers’ data as a platform matter, and that if Aircall needs to do analysis on customer data for its own purposes, it gets consent from customers through an opt-in arrangement. When the system parses data to show it back to the customer as part of Aircall’s services, Chen described that as part of the AI functionality customers buy. He said customers should be wary of services that are not clear about how their data is used.

For Aircall customers, the value of the data is broader. Digitized and transcribed calls can feed into a CRM and be combined with more structured customer data. Live transcription can assist agents during calls. The system can detect sentiment and configured keywords, surface answers, remind agents what to say, and then handle post-call work such as parsing transcripts and sending structured information to CRM systems.
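
As a rough illustration of the post-call step, the sketch below turns a transcript into a small structured record of the kind that could be sent to a CRM. It assumes plain substring matching on configured keywords; the function name and record fields are hypothetical, and real sentiment detection or CRM integration is out of scope here.

```python
def post_call_record(call_id: str, transcript: str, keywords: list) -> dict:
    """Build a structured summary of one call from its transcript."""
    text = transcript.lower()
    # Keep only the configured keywords that actually appear in the call,
    # preserving the order in which they were configured.
    found = [k for k in keywords if k.lower() in text]
    return {
        "call_id": call_id,
        "keywords_found": found,
        "transcript": transcript,
    }


record = post_call_record(
    "call-001",  # hypothetical identifier
    "I would like to cancel my plan and get a refund.",
    ["refund", "cancel", "upgrade"],
)
print(record["keywords_found"])  # ['refund', 'cancel']
```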

Chen likened this assistance layer to a sales coach, but one that can be present across calls rather than sampling a small subset after the fact. A human sales coach might cost “upwards of” $150,000 to $200,000, he said, and still cannot listen live to every call. Aircall’s assistance technology, in his description, improves onboarding speed, call quality, and the consistency of human-agent work.

In Chen’s account, the value comes from turning spoken interactions into structured and usable operational data, then applying that data during and after the call.

The economics depend on concurrency as much as labor cost

Aircall prices its AI voice-agent offering on pay-per-use rather than outcome-based pricing. Some AI companies, especially in narrower use cases where the desired result is clear, can price by resolution or outcome. Chen said Aircall has avoided that because it is a horizontal platform with highly variable customer goals.

The problem, as he sees it, is attribution. If Aircall charged for outcomes, it would first need to define resolution with each customer. That could create disputes about whether the AI caused the result, whether the result counted, and how value should be assigned. Aircall instead charges for usage, while actions after the call are free. If a customer configures the system so one call triggers 10 automations, produces leads, or saves human follow-up work, Aircall is not trying to capture that entire upside.
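
The difference from outcome-based pricing shows up in a toy calculation. This sketch assumes usage is metered per minute at a made-up rate; the source says only that pricing is pay-per-use and that post-call actions are free, so the metering unit, the rate, and the function name are all assumptions.

```python
def usage_charge_cents(call_minutes, rate_cents_per_minute, automations_per_call=0):
    """Charge for call usage only, in whole cents."""
    # automations_per_call is accepted but deliberately ignored: in this
    # sketch, actions triggered after the call do not change the bill.
    return sum(call_minutes) * rate_cents_per_minute


calls = [3, 5, 2]  # hypothetical call lengths in minutes
print(usage_charge_cents(calls, rate_cents_per_minute=10))                           # 100
print(usage_charge_cents(calls, rate_cents_per_minute=10, automations_per_call=10))  # 100
```

Because the charge depends only on usage, there is nothing to dispute about whether a given call "counted" as a resolution.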

Chen framed this as a practical choice in a competitive market. He said he can imagine arguments against his own view, but his guess is that, outside extreme enterprise contexts with high compliance requirements and few viable vendors, the long-term market may move closer to cost-plus pricing than pure value-based pricing. In Aircall’s customer base, he said, the debate over defining outcomes may not be worth the theoretical incentive alignment.

The economic case also depends on scalability. A human agent can take one call at a time. Chen said an AI voice agent on Aircall can take 100 concurrent calls, does not sleep, and does not take lunch breaks. That gives the system “infinite flexibility and scalability” relative to human staffing patterns.

100 concurrent calls can be handled by one Aircall AI voice agent, according to Chen

When customers compare an AI agent’s work to a fully loaded human operation, Chen said the AI agent may come to one-third, one-fourth, or even one-tenth of the cost. He framed the comparison as an ROI estimate that includes flexibility, instant training, and actual cost.
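
The arithmetic behind those two claims is simple enough to write down. In the sketch below, the 100-call concurrency figure and the cost fractions come from Chen's remarks; the peak call volume and the human cost figure are invented inputs, and the function names are hypothetical.

```python
import math

AI_CONCURRENT_CALLS = 100  # figure Chen gave for one Aircall AI voice agent


def agents_needed(peak_simultaneous_calls: int) -> dict:
    """Agents required to answer every call at the busiest moment."""
    return {
        # A human agent takes one call at a time.
        "human": peak_simultaneous_calls,
        "ai": math.ceil(peak_simultaneous_calls / AI_CONCURRENT_CALLS),
    }


def ai_cost_range(human_cost: float) -> dict:
    """Apply the one-third, one-fourth, and one-tenth fractions Chen cited."""
    return {
        "one_third": human_cost / 3,
        "one_fourth": human_cost / 4,
        "one_tenth": human_cost / 10,
    }


print(agents_needed(250))       # {'human': 250, 'ai': 3}
print(ai_cost_range(120_000))   # hypothetical fully loaded human cost
```

Neither function captures the softer parts of Chen's ROI framing, such as flexibility and instant training, which do not reduce to a single number.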

Chen rejected the idea that Aircall tells customers to replace their human agents. He said he does not personally believe that is how businesses will operate. In Aircall’s customer base, he said, the workload taken by AI agents often allows existing human agents to move into higher-level conversations rather than disappear from the operation. Companies are recategorizing work, in his view, away from repetitive handling and toward more demanding forms of human work.

Implementation begins with a decision about access

For the end customer, Aircall is mostly invisible. A caller does not know what phone system the business uses, except that they may encounter an AI agent powered by Aircall. For the business, implementation begins with deciding that voice — and sometimes WhatsApp messaging — should be a channel for customer communication.

Tom Chen said almost all companies need some version of voice at some point, though the importance varies by business. A small e-commerce shop may choose to do most things online, but many businesses, especially local businesses, need phone access.

Aircall helps companies procure phone numbers, which Chen said is less trivial than it sounds because phone numbers are regulated across countries. Businesses may need to register their identity, particularly if they want to use SMS in the United States. The regulatory burden exists because governments and carriers are trying to manage spam and fraud.

Once phone lines are established, companies decide how visible and available they want to be. A business might give every sales rep an individual number, use one shared line, publish the number publicly on a website, reveal it only after purchase, or include it in sales-rep email footers. Those are demand-shaping decisions: the more visible the number, the more calls the company should expect.

AI agents can be attached to phone lines at different points in the routing logic. They can sit at the top of an IVR tree, handle calls after hours, operate all day with escalation during business hours, or take messages when human escalation is unavailable. Human agents are attached to the same system through teams and routing structures.
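
Those placement options amount to a routing decision made when a call arrives. The sketch below is one way to express it, assuming a fixed 9-to-5 schedule; the placement names, return labels, and business hours are hypothetical and do not reflect Aircall's actual routing settings.

```python
from datetime import time

# Hypothetical business hours for the example.
OPEN_AT, CLOSE_AT = time(9, 0), time(17, 0)


def in_business_hours(call_time: time) -> bool:
    return OPEN_AT <= call_time < CLOSE_AT


def first_responder(call_time: time, placement: str) -> str:
    """Decide who answers first for a given AI placement."""
    if placement == "after_hours_only":
        # Humans keep the daytime queue; the AI covers the rest.
        return "human_team" if in_business_hours(call_time) else "ai_agent"
    if placement == "ai_first":
        # The AI sits at the top of the routing tree all day.
        return "ai_agent"
    raise ValueError(f"unknown placement: {placement}")


def escalation_target(call_time: time) -> str:
    """Where the AI hands off when it cannot resolve the call."""
    # Outside business hours no human is available, so take a message.
    return "human_team" if in_business_hours(call_time) else "take_message"


print(first_responder(time(22, 30), "after_hours_only"))  # ai_agent
print(first_responder(time(10, 0), "after_hours_only"))   # human_team
print(escalation_target(time(22, 30)))                    # take_message
```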

Chen said Aircall tries to make this simple enough for small teams to set up quickly. Unlike enterprise products that may be difficult to launch the same day, he said Aircall’s goal is that customers can get operations live “within the hour.” That simplicity matters more as AI products become more complex: companies may need advanced control eventually, but the starting point should not require unnecessary complexity.

The voice-agent stack is simple to describe and difficult to control

The technical architecture has two major layers. The first is the voice stack: telephony infrastructure, VoIP systems, communications providers such as Twilio, carriers around the world, phone-number acquisition, and cloud compute. This stack handles the movement from traditional telecommunications into digital communication.

The second is the generative AI stack. In the traditional pipeline, human speech is converted to text through speech-to-text; the text is sent to an LLM such as OpenAI’s GPT models, Gemini, or Claude; the LLM produces a text response; and text-to-speech turns it back into spoken audio. That pipeline is then placed inside the VoIP stack so the interaction can happen live.
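
One conversational turn in that traditional pipeline can be sketched as three function calls in sequence. The stand-in functions below are placeholders, not real speech or model APIs: they treat the audio as UTF-8 text so the example runs on its own. In practice each would be a call to whichever speech-to-text, LLM, and text-to-speech service a company chooses.

```python
from typing import Callable


def handle_turn(
    caller_audio: bytes,
    speech_to_text: Callable[[bytes], str],
    generate_reply: Callable[[str], str],
    text_to_speech: Callable[[str], bytes],
) -> bytes:
    """Run one turn: caller audio in, agent audio out."""
    caller_text = speech_to_text(caller_audio)   # leg 1: speech to text
    reply_text = generate_reply(caller_text)     # leg 2: LLM response
    return text_to_speech(reply_text)            # leg 3: text to speech


# Stand-ins so the sketch is runnable without any external service.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You asked: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


reply = handle_turn(b"where is my order", fake_stt, fake_llm, fake_tts)
print(reply)  # b'You asked: where is my order'
```

Passing the three stages in as separate functions mirrors the control argument in the text: each leg can be swapped, measured, or debugged independently.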

Chen also noted the rise of voice-to-voice models, where the system takes voice in and produces voice out without the company controlling each separate leg of the pipeline in the same way. OpenAI, Groq, and other companies have voice-to-voice models, he said. But there are tradeoffs. The more a company operates at scale and needs control, the more likely it may be to stitch the components together itself.

The reasons are edge cases and latency. Different customers and industries produce different needs, and without control over the speech-to-text, model, and text-to-speech layers, debugging and optimization become harder. Voice is also highly latency-sensitive; callers do not want to wait 1.5 seconds for each response. Latency can come not only from the model but also from the telephony stack, including routing packets across regions such as the U.S., London, or Paris.
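
A per-stage latency budget makes the point concrete. In the sketch below, the 1.5-second ceiling comes from Chen's remark about what callers will not tolerate; every stage timing is an invented number for illustration, and the function name is hypothetical.

```python
def latency_report(stage_ms: dict, too_slow_ms: int = 1500) -> dict:
    """Sum per-stage latency and flag the slowest stage."""
    total = sum(stage_ms.values())
    slowest = max(stage_ms, key=stage_ms.get)
    return {
        "total_ms": total,
        "slowest_stage": slowest,
        "acceptable": total < too_slow_ms,
    }


# Hypothetical timings for a call served from a nearby region.
nearby = {
    "telephony_and_network": 150,
    "speech_to_text": 200,
    "llm": 500,
    "text_to_speech": 200,
}
# Same pipeline, but with packets routed across distant regions.
cross_region = dict(nearby, telephony_and_network=650)

print(latency_report(nearby))
# {'total_ms': 1050, 'slowest_stage': 'llm', 'acceptable': True}
print(latency_report(cross_region))
# {'total_ms': 1550, 'slowest_stage': 'telephony_and_network', 'acceptable': False}
```

In the second case the model is unchanged, yet the turn is too slow. A company that controls each leg can see where the time went; with an opaque voice-to-voice model, that breakdown is harder to obtain.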

So while the concept of a voice agent can be described simply, Chen said the complexity is greater than it appears. The “devil,” as he put it, is “100% in the details.”
