AI Makes Customer Understanding the Scarce Input in Product Development

Patrick Chase, Constantin Bensch · Sequoia Capital · Tuesday, June 2, 2026 · 14 min read

Listen Labs co-founder and CEO Alfred Wahlforss argues that as AI makes software and marketing execution cheaper, the scarce input for companies becomes knowing what customers actually want. He describes Listen as an AI research platform that runs large-scale voice interviews, builds carefully targeted audiences, and uses interview data to simulate how specific customer groups may respond to future questions. Wahlforss’s central claim is that interviews, when designed and tested properly, can provide a richer and more predictive signal than surveys, behavioral logs, or generic personas.

The scarce input is no longer execution, but knowing what to build

Alfred Wahlforss describes Listen Labs as an attempt to make customer understanding continuous, scalable, and eventually usable by other software. The premise is simple: if AI makes building cheaper and faster, the harder question becomes what to build, what to say, and which customer signals to trust.

Listen’s current product is an AI-first customer research platform. A company can ask a question such as how to improve Cursor’s onboarding; Listen generates an interview guide, finds relevant participants from an audience Wahlforss says now numbers 30 million, runs hundreds or thousands of voice interviews, analyzes the results, and returns recommendations. The interviews are not form-fill surveys. Wahlforss compares the experience to a Zoom call with an AI agent, including voice, video, transcription, and what he describes as emotion detection from a participant’s face, eyes, and tone.

The larger ambition is to turn those interviews into a simulation layer. After a company has run enough interviews, Listen wants to predict how the relevant customer population would answer future questions. Wahlforss frames this as the beginning of a loop between customer understanding and execution: a churn interview identifies a bug; a coding agent receives the finding; the bug is fixed; the company learns again from users. In his formulation, the future company runs on “write code and talk to users,” with Listen occupying the latter half of the loop.

As we get closer to AGI, it will be easier to build things, but the hard part will be knowing what to build.

Alfred Wahlforss

The examples Wahlforss gives are deliberately concrete. Chubbies, one of Listen’s early customers, uses the platform for marketing tests and product feedback. In one case, he says, the company discovered that chest hair interacted badly with a shirt material, making the shirt uncomfortable; after changing the shirt, it became “radically more comfortable.” Manscaped changed a Super Bowl ad using Listen insights. Skims is also named as a customer. These examples are not presented as grand strategy cases; they are presented as the kind of small, specific feedback that becomes available when the cost of asking customers falls.

Interviews can outperform surveys when they force reasoning, not clicking

Sonya Huang raises the central objection to the category: surveys are often unreliable. People are paid to take them, creating selection bias. They say they will behave one way and then behave another. In that view, behavioral telemetry and real-world data should matter more than stated preference.

Wahlforss accepts part of the critique. He says Listen has tested repeat survey answers by returning to the same person and asking the same multiple-choice question again; the answers were “radically inconsistent.” His claim is not that asking people is automatically reliable, but that the modality changes the quality of the signal. In Listen’s own tests, he says, when people have to think through an answer in an interview rather than select an option, they are much more consistent in how they answer the same question.
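The repeat-answer check described above can be sketched as a test-retest agreement rate. This is a minimal illustration, not Listen’s method: the function name `retest_agreement`, the participant IDs, and the answers are all hypothetical, and it assumes interview answers have already been coded into comparable categories.

```python
def retest_agreement(first: dict, second: dict) -> float:
    """Fraction of participants whose second answer matches their first.

    Both arguments map a participant ID to that participant's answer.
    Only participants present in both rounds are compared.
    """
    shared = first.keys() & second.keys()
    if not shared:
        raise ValueError("no participant answered both rounds")
    matches = sum(1 for pid in shared if first[pid] == second[pid])
    return matches / len(shared)


# Made-up data for illustration only: the same multiple-choice
# question asked twice to the same four people.
round_one = {"p1": "B", "p2": "A", "p3": "C", "p4": "A"}
round_two = {"p1": "B", "p2": "C", "p3": "C", "p4": "D"}

agreement = retest_agreement(round_one, round_two)
print(f"test-retest agreement: {agreement:.0%}")
```

Running the same calculation on survey answers and on coded interview answers to the same question would give two agreement rates that can be compared directly.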

He also treats interview data as something to be checked against later behavior. With Chubbies, for example, Listen tests different shirts and then looks back months later at actual sales performance. Wahlforss still calls A/B testing “the holy grail,” but argues that in practice it is often difficult: a company needs enough users, enough volume, and enough time. For many decisions, his view is that some structured customer input is better than none.

The platform’s design is meant to address another skepticism about AI summaries: traceability. For every data point, Wahlforss says a user can click through to the underlying quote or video, so the user can see where the AI’s conclusion came from rather than taking a summary on faith.

He also argues that the video and voice format carries signal that a Likert scale cannot. In advertising tests, a participant may report high purchase intent on a five-point scale, but the participant’s visible enthusiasm can be a stronger indicator. Wahlforss says Listen has seen ads with this kind of enthusiastic response perform better later in performance marketing on channels such as Meta and LinkedIn.

The audience is the hard part, not the interview bot

The quality of customer research, in Wahlforss’s account, depends less on asking a generic population and more on finding the right people. He says Listen spends 80% of its engineering resources on audience. That is because, in his view, every company is driven by a power law in customer segmentation: a small, highly specific population often accounts for a large share of revenue or insight.

Sweetgreen is his example. The product may appear broad, but Wahlforss says the right research audience is typically urban, high household income, mostly female, and aware of seed oils — a trait he says only about 1% of the population has. He says some people go to Sweetgreen every day, and presents that kind of highly engaged segment as the group whose feedback can make research more actionable than a generic sample.

80%: the share of Listen’s engineering resources that Wahlforss says goes into audience

The long-term goal, Wahlforss says, is an audience of one billion people, stratified by what each person is actually knowledgeable about. A person might reveal in one unrelated interview that they are a sneaker obsessive or early adopter. Listen can preserve that profile and later find that person when a sneaker brand needs to interview the right audience. Wahlforss says this cross-interview learning was not possible in the old model, where panels were separate entities, outreach was manual, and companies relied on email lists and screening.

He points to “incidence rate” as one of the frictions in the traditional system. If only 10% of invited participants qualify for a study, nine out of ten people may be screened out before they can earn anything. That creates churn in respondent databases because the experience is annoying: people repeatedly answer screening questions and receive no compensation.
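The arithmetic behind the incidence-rate problem is easy to work through. The sketch below uses the 10% rate from the example above; the target of 200 completed interviews and the function names are assumptions made for illustration, and the rate is passed as an exact fraction to avoid floating-point rounding.

```python
import math
from fractions import Fraction


def invites_needed(target_completes: int, incidence_rate: Fraction) -> int:
    """Invitations required to expect `target_completes` qualifying people."""
    if not 0 < incidence_rate <= 1:
        raise ValueError("incidence_rate must be in (0, 1]")
    return math.ceil(target_completes / incidence_rate)


def expected_screen_outs(invites: int, incidence_rate: Fraction) -> Fraction:
    """Expected number of invitees who fail screening and earn nothing."""
    return invites * (1 - incidence_rate)


rate = Fraction(1, 10)                 # 10% incidence, as in the text
invites = invites_needed(200, rate)    # 200 completes is a hypothetical target
screened = expected_screen_outs(invites, rate)
print(f"invite {invites}, expect {screened} screened out")
```

At a 10% incidence rate, a study that needs 200 qualified participants sends 2,000 invitations and turns away about 1,800 people unpaid, which is the source of the respondent churn described above.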

Listen can also connect to a customer’s CRM, but Wahlforss says the more interesting problem is reaching prospective customers and comparing them with current customers. A brand like Sweetgreen may know its best existing customers, but it may not know adjacent groups that could become customers. He also notes that CRM data is often disorganized, and large companies may face regulatory or operational constraints in contacting their own users directly. If a company such as Google wants customer feedback, he says, it cannot simply email everyone who uses Gmail.

AI interviewers may get more honesty because they lower social pressure

Wahlforss makes a counterintuitive claim: people often prefer being interviewed by AI, and they may be more honest with it. Asked whether people like being interviewed by an AI, he says the “objective answer is yes” because Listen can pay people less to talk to an AI than to a human interviewer. His explanation is that the format is asynchronous, lower pressure, and easier to fit into a participant’s life.

He also argues that AI can reduce social pressure. Participants, he says, sometimes open up because the agent is a nonjudgmental entity that is interested in them. That matters for sensitive categories, including interviews with kids about products. The children’s market, he says, is hard to research because it requires parental consent, scheduling around school and extracurricular activities, and finding the right child participants.

The broader point is that lowering the friction of customer contact changes which decisions receive input. Wahlforss argues that most business decisions are not made with customer input because the cost and effort of getting that input are too high. If a company can get real feedback within five minutes, and if hundreds of people can populate an interview study asynchronously, more decisions can include customer signal.

Consulting is compressed at the task layer, not necessarily at the outcome layer

The implications for services firms are mixed in Wahlforss’s account. He says Listen works with Bain, which uses the platform to accelerate traditional research processes. He does not argue that consultants vanish. Implementation, synthesis, and the broader ability to drive organizational change remain valuable.

But he does expect pressure on margins and service packaging. Some work that previously required consulting teams — designing research, finding participants, conducting calls, analyzing results — can be accelerated or partially handled by AI agents. That forces services firms to unbundle parts of their offering and decide where human judgment and implementation still command value.

Huang presses on who receives the economic surplus when AI makes research cheaper. Wahlforss’s answer is that speed can justify higher prices, not only lower ones. If research can be done much faster and in harder-to-reach populations, it may be worth more. He gives the example of studies where Listen charged hundreds of thousands of dollars to speak with 20 doctors across eight countries.

Over time, Wahlforss expects individual interviews may become cheaper, but the amount of research conducted could grow by orders of magnitude. Simulation is the extension of that logic: a way to handle “the 99% of use cases” where no one would otherwise have time to talk to real people. Big decisions, such as a Super Bowl ad, may still call for fresh interviews. Smaller decisions — a conference title, a billboard tagline, a product message — may be routed through simulations built from interview data.

Simulation is useful only if the system knows what it cannot know

Constantin Bensch asks Wahlforss to define generative agent simulation. Wahlforss’s description starts at the individual level: if he interviews someone for an hour, he can begin to predict that person’s preferences. He says large language models can do some of this too, especially when given rich information about a person. In some cases, he says, Listen reaches 95% accuracy in predicting how someone will answer certain questions.

But the prediction problem has hard boundaries. People change, markets change, and, in Wahlforss’s phrasing, “chaos theory” makes the future hard to predict. For Listen, the important question is not only whether the simulation can answer, but whether it can identify the domain in which it has enough knowledge to answer.

The company tests this through back-testing. It removes a question from the data, asks the simulation to predict the answer, and checks the result against the held-out response. It also introduces questions that should not be predictable — for example, the name of someone’s dog — and tests whether the model recognizes that it cannot know.
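A minimal sketch of that back-testing loop, under stated assumptions: the predictor interface (a function that returns an answer, or `None` to abstain), the `toy_predict` stand-in, and the sample records are all hypothetical and do not represent Listen’s system.

```python
from typing import Callable, Optional

# A predictor sees a respondent's known answers plus a question, and
# returns a predicted answer or None when it declines to guess.
Predictor = Callable[[dict, str], Optional[str]]


def backtest_accuracy(records: list, question: str, predict: Predictor) -> Optional[float]:
    """Hide `question` from each record, predict it, score against the truth."""
    correct = attempted = 0
    for answers in records:
        if question not in answers:
            continue
        visible = {q: a for q, a in answers.items() if q != question}
        guess = predict(visible, question)
        if guess is None:          # abstentions are not counted as attempts
            continue
        attempted += 1
        correct += guess == answers[question]
    return correct / attempted if attempted else None


def abstention_rate(records: list, probe: str, predict: Predictor) -> float:
    """Share of respondents for whom the predictor refuses an unknowable probe."""
    if not records:
        raise ValueError("records must not be empty")
    return sum(predict(r, probe) is None for r in records) / len(records)


def toy_predict(visible: dict, question: str) -> Optional[str]:
    """Stand-in model: one hard-coded rule, abstains on everything else."""
    if question == "prefers_dark_mode":
        return "yes" if visible.get("uses_terminal") == "yes" else "no"
    return None


records = [
    {"uses_terminal": "yes", "prefers_dark_mode": "yes"},
    {"uses_terminal": "no", "prefers_dark_mode": "no"},
    {"uses_terminal": "yes", "prefers_dark_mode": "no"},
]
accuracy = backtest_accuracy(records, "prefers_dark_mode", toy_predict)
refusals = abstention_rate(records, "dog_name", toy_predict)
```

Reporting the two numbers separately keeps a model from looking good merely by answering everything: held-out accuracy measures what it knows, and the abstention rate on unknowable probes measures whether it recognizes what it cannot know.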

Question type	Wahlforss’s view of predictability	Example from the discussion
Message testing	Useful and comparatively strong	Choosing a talk title, billboard tagline, or advertising message
Large, high-stakes decisions	Still likely to require real interviews	A Super Bowl ad or major launch
Unknown private facts	Should be rejected as unknowable	The name of a participant’s dog
Future behavior in changing markets	Limited by changing conditions and chaotic dynamics	New trends that suddenly reshape marketing strategy

Wahlforss’s view of where simulation can help, where it should be qualified, and where it should refuse to guess.

Wahlforss says message testing is one of the most useful applications. He describes using Listen’s simulation to choose a title for a conference talk. He generated 100 possible titles and ran them through a simulated panel of Listen’s customer base. The top title scored about twice as well as the next one. He does not claim certainty — “I don’t know if it’s correct,” he says — but he found the guidance useful for making a small decision.

He also compared Listen’s simulation with ChatGPT on a retrospective test. He fed both systems examples from a less successful talk and a more successful talk. ChatGPT picked the wrong one, while Listen’s simulation picked the right one. Wahlforss presents this as early evidence, not a finished proof. His explanation is that general models are trained toward an average person, while companies need to understand specific niches.

Sonya Huang pushes on that explanation. Why not simply prompt ChatGPT to behave like a narrow persona — for example, a grumpy 35-year-old software engineer who likes the terminal? Wahlforss says such prompting performs somewhat better than vanilla ChatGPT, but that interview data performs much better. Listen has tried inputs such as credit card spend, behavioral data, and purchase behavior; his claim is that interviews are the strongest dataset because they expose reasoning, tangents, and behavioral context. But not any interview will do: the questions must be designed well.

Bensch finds the intuition plausible: if the goal is to understand someone, asking them many questions is a direct way to do it. If Listen has enough interviews from a specific group, it can model that group better than a system aimed at the average respondent.

The product becomes a human API for other agents

Wahlforss describes Listen’s “augmented responses” as a way for a company to use accumulated interviews to answer ad hoc questions quickly. A company could run thousands of interviews with founders, doctors, engineers, or consumers, and then ask the resulting system for guidance. On implementation, he says Listen uses post-training, retrieval-augmented generation, and other proprietary techniques.

The architectural idea is that this customer-response layer can live inside other agents. Wahlforss imagines a “human API” that coding agents, strategy agents, and other systems can call to understand user preferences before deciding what to build or how to communicate. If execution agents can write code or generate marketing concepts, Listen aims to give them access to customer preferences inferred from interviews and simulations.

Huang asks whether multi-agent debate is part of the simulation approach — for example, having modeled customers debate one another before producing an answer. Wahlforss says Listen does not do that yet. He is open to exploring it but skeptical, again because interactions can compound unpredictably. Huang suggests an analogy to an AI council, where multiple LLM outputs are judged and synthesized by another model; she says this can improve answers on average. Wahlforss does not reject the possibility, but his emphasis remains on modeling representative individuals well and aggregating them, rather than simulating social interaction among them.

The simulation layer also changes ideation. Huang asks whether market research could move beyond testing ideas that companies already have and toward live product ideation: for example, an AI brainstorming new solutions during an interview as a customer complains about a product problem. Wahlforss says customers already use AI to generate images of concepts and feed them into interviews manually. Listen also has an MCP (Model Context Protocol) integration that lets Claude run Listen in a loop to generate and test marketing ideas or concepts. Live brainstorming inside the interview is presented as an appealing next step, not as a finished product.

Vertical AI advantage comes from evals, workflow discipline, and accumulated data

Wahlforss’s account of Listen’s moat has several layers. The first is the panel: supply and demand network effects around access to participants. The second is data: more interviews should improve simulation. The third is stickiness: companies accumulate interview histories in the platform, want to track changes over time, and have less incentive to leave.

But he also emphasizes a less abstract product advantage: making a complicated research process “stupid simple.” He credits Sequoia partner Bryan Schreier with the observation that founders often want to build complex products, while customers want something simple that works. In market research, that simplicity is not trivial. Creating a good interview guide is hard enough to be an academic subject. Pricing research, brand perception, and product feedback require different methodologies. Bad questions can lead the witness and produce unusable data.

Wahlforss says Listen initially used vanilla large language models to generate interviews. Customers would run studies, receive the data, and come back frustrated because they could not use it. Listen treated that as its responsibility and trained the system around market-research best practices so that interview guides would produce useful data.

He gives a similar account of internal evaluation. Early on, Listen had an eval for whether the AI asked repetitive questions or followed instructions. With GPT-4, he says, the system sometimes asked the same question a hundred times. That eval started around 20% and improved to 85%. Then Listen created a harder eval, including whether the AI can understand what a user is doing on screen during a screen recording or skip questions that no longer apply. On the new eval, the system is back around 20%. For Wahlforss, this is part of the vertical AI company’s advantage: define proprietary evals that represent the actual job, then climb them.
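An eval of the first kind, whether the interviewer repeats itself, might look like the sketch below. It is an assumption-laden illustration rather than Listen’s eval: the normalization rule, the function names, and the sample transcripts are hypothetical, and a production check would likely also need to catch paraphrased repeats.

```python
import string

_PUNCTUATION = str.maketrans("", "", string.punctuation)


def normalize(question: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    return " ".join(question.lower().translate(_PUNCTUATION).split())


def has_repeat(questions: list) -> bool:
    """True if the interviewer asked the same normalized question twice."""
    seen = set()
    for question in questions:
        key = normalize(question)
        if key in seen:
            return True
        seen.add(key)
    return False


def pass_rate(transcripts: list) -> float:
    """Share of transcripts in which no question was repeated."""
    if not transcripts:
        raise ValueError("transcripts must not be empty")
    return sum(not has_repeat(t) for t in transcripts) / len(transcripts)


# Each transcript is the list of questions the AI interviewer asked.
transcripts = [
    ["How did you first hear about the product?", "What made you sign up?"],
    ["What made you sign up?", "what made you sign up", "Anything else?"],
    ["Why did you cancel?"],
    ["Why did you cancel?", "Why did you cancel?"],
    ["What would you change?", "How often do you use it?"],
]
rate = pass_rate(transcripts)
print(f"no-repeat pass rate: {rate:.0%}")
```

Tracking a single pass rate like this over successive model versions is what makes the climb Wahlforss describes, from about 20% to 85%, measurable in the first place.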

And I think that’s the advantage you have as a vertical AI company, that you can essentially train this agent to follow the best practices in the work that you do.

Alfred Wahlforss

The model is old: find the niche, then build around it

When asked who has historically listened to customers well, Wahlforss points to Procter & Gamble as the archetype. He describes P&G as a market research organization that identifies niches people care about and then builds brands to serve them. His example is Tide Pods: Wahlforss says P&G used customer interviews to learn that washing liquid was inconvenient and that people wanted something easier to use. In his account, the product became a successful response to that insight.

He also cites Mars and M&Ms, referencing a story from Acquired. In Wahlforss’s telling, M&Ms were originally designed for the army as a sweet treat that would not melt in a pocket. He says Mars used early market research in the 1950s to identify young kids as another strong segment and shifted advertising toward the fact that the candy would not melt and ruin furniture.

The historical point is not nostalgia for focus groups. Wahlforss uses these examples to argue that customer understanding has always created large businesses when applied well. What changes with AI is the frequency, speed, and granularity with which that understanding can be gathered. For a consumer packaged goods company, a launch in a new market can involve tens of millions of dollars or more. Wahlforss says that is one reason Procter & Gamble is a big Listen customer: if the cost of being wrong is high, front-loaded customer understanding matters.

That does not mean simulation replaces human input. Wahlforss is confident that companies will still need to ask people because humans remain irrational: even if AGI were perfectly rational, customers could still become suddenly obsessed with a new product or TikTok trend and force a marketing strategy to change. He is less certain about how far simulation can go. It should extend customer input into low-friction decisions, not pretend to replace reality where reality is needed.

Constantin Bensch frames the same point from the other side. If companies exist to serve people, and AI intelligence keeps improving, then the remaining delta — what is in a human’s mind that is not in the AI’s mind — becomes more important, not less. Wahlforss’s answer is effectively that Listen wants to capture that delta continuously and make it available to the agents that build, market, and iterate products.
