Lovable Uses Agent Complaints to Find Bugs and Improve Projects

Benjamin Verbeek | AI Engineer | Tuesday, June 2, 2026 | 10 min read

Benjamin Verbeek of Lovable argues that AI coding products can improve continuously by treating user failures and agent frustration as production signals. In a talk on Lovable’s internal systems, he describes two loops: one that turns sessions where nontechnical users get stuck and later recover into tested contextual guidance, and another that lets the agent complain directly when Lovable’s tools, documentation or platform behavior block its work. Verbeek says the approach has surfaced real bugs, reduced repeated “fix” intent messages and created an operational signal for incidents.

Lovable treats stuck users as training signal, not just support load

Benjamin Verbeek frames Lovable’s continuous-improvement problem around a simple target: “We want to have a mistake happen once, and then never again.” The company is trying to build what he calls “continuous learning at scale” for an AI coding product used heavily by people who cannot code.

Lovable’s interface is organized around chat and a live sandbox: a user describes what they want, sees the result, tests it, and ships it. Verbeek argues that the code should ideally disappear as “an annoying technical layer” between intent and software. The product’s advantage, in his account, is that users often stay with one project for a long time. Unlike short chat-agent interactions, Lovable can observe a long-running artifact that the user cares about and can learn from the full path of attempts, failures, fixes, and eventual deployment.

That matters because the company is not mainly optimizing for technical users. Verbeek says Lovable is “building for the 99% who can’t code,” and describes nontechnical users as both the product’s intended audience and a useful pressure test. Technical users can often work around an AI system’s rough edges: prompt harder, edit a config, add an environment variable, fix a setting, or manually patch code. Nontechnical users usually cannot. If they hit a hard technical block, they give up before experiencing the successful part of AI-assisted software creation.

The scale makes this both possible and difficult. Verbeek says Lovable was creating more than 200,000 projects per day at the time of the talk. That volume gives the company enough examples to detect recurring failures, but also creates operational stress. He says that on his first day, GitHub banned Lovable because it was creating too many repositories, and that the company had taken down multiple cloud providers along the way.

200,000+ projects created per day on Lovable, according to Verbeek

Lovable’s improvement system starts by defining what it means for a user to be stuck. Verbeek lists three signals: the user asks for the same thing more than once, complains that something was implemented badly or failed, or gives up on a session they otherwise would have continued. Lovable uses an LLM judge to inspect sessions for those conditions and flag likely stuck states.

Verbeek separates stuck cases into categories. Some are solvable with the current product, but only if the user prompts correctly or persists through friction. Others are easy in principle but not solvable with the existing tools, documentation, or platform behavior. A third category exists too: hard unsolved problems that would take weeks of engineering effort. The first two categories are where Lovable focuses its rapid loops. In his formulation, if the problem is already solvable, it should work for everyone; if the blocker is simple to fix, the company should ship the fix.

Stuck case	What it means	Lovable’s response
Solvable with the right prompting	The current product can solve it, but some users hit friction or repeat failed attempts.	Extract the eventual fix and inject that context upstream for similar future cases.
Easy in principle but not currently solvable	A tool, platform behavior, documentation issue, or missing capability blocks the agent.	Let the agent report the limitation directly through a vent channel.
Not solvable and hard	The task requires substantial engineering effort rather than a simple product or tool fix.	Verbeek acknowledges this category but does not present it as the focus of the rapid loops.

Verbeek’s taxonomy of stuck states and the loops Lovable applies to the first two.

The “Lovable Stack Overflow” loop is judged by production outcomes

For problems that are solvable but difficult, Verbeek describes an internal system he calls “Lovable Stack Overflow.” Its premise is not to scrape generic answers, but to identify moments where a user was stuck, then later became unstuck in the same session. The resolved session provides a high-signal example: the failure occurred, the user kept going, and the eventual solution appeared in context.

His example is a user complaining that a website is laggy while scrolling. The agent claims to fix it by quantizing animations, but the result becomes “jumpy and laggy.” The user asks again and complains about the failed implementation, which marks a stuck state. After some number of turns, the agent eventually discovers the actual issue: overlay text had individual gradients, making the animation slow. Once the user says it works and moves on, the stuck state flips back to false.
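
The stuck-then-unstuck pattern in this example can be sketched as a scan over per-turn labels. This is a minimal illustration, not Lovable's implementation: the `Turn` type, its `stuck` flag (standing in for the output of an LLM judge), and the `resolved_episodes` helper are all hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Turn:
    index: int
    stuck: bool  # hypothetical per-turn label, e.g. produced by an LLM judge


def resolved_episodes(turns: List[Turn]) -> List[Tuple[int, int]]:
    """Return (first_stuck_turn, first_unstuck_turn) pairs.

    An episode is reported only if the session later leaves the stuck state.
    A session that ends while still stuck yields no pair for that episode.
    """
    episodes: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for turn in turns:
        if turn.stuck and start is None:
            start = turn.index  # stuck state flips to true
        elif not turn.stuck and start is not None:
            episodes.append((start, turn.index))  # stuck state flips back to false
            start = None
    return episodes
```

Only the resolved episodes are returned because, in Verbeek's description, the later fix is what makes the session a high-signal example.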

The key question Lovable asks after such a session is: what context should have been injected at the start so the agent could jump directly to the solution? That answer becomes a candidate knowledge entry.

Verbeek emphasizes that this is not a one-entry-per-prompt database. Lovable clusters similar issues to avoid overfitting to exact phrasing. The goal is to identify the general information that would help a class of problems, not create “a million” narrow pages that say, in effect, “if you get this exact prompt, do this exact thing.”

After clustering, an external reviewer evaluates the candidate entry. In most cases, Verbeek says, that reviewer is another agent; when uncertain, a human may be involved. The reviewer generates and runs a quick evaluation to see whether the proposed knowledge resolves the specific examples in the cluster. Accepted entries go into a continually updated bank of known Lovable problems and solutions.

At runtime, a lightweight model decides whether a known entry is relevant to the main agent’s current work. If so, it injects the context into the main agent. But Lovable also withholds the injection for a small sample of cases where it could have been used. That holdout group is central to the loop: it lets the company measure whether the injected context actually improves production outcomes.
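
One common way to build such a holdout is deterministic hashing, so the same project always lands in the same arm. The sketch below assumes that approach; the talk does not say how Lovable assigns cases, and the 5% rate, the identifiers, and the function names are illustrative assumptions.

```python
import hashlib

HOLDOUT_RATE = 0.05  # assumed value; the talk only says "a small sample"


def in_holdout(project_id: str, entry_id: str, rate: float = HOLDOUT_RATE) -> bool:
    """Deterministically place a (project, entry) pair in the holdout group."""
    digest = hashlib.sha256(f"{project_id}:{entry_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1)
    return bucket < rate


def assign_arm(project_id: str, entry_id: str, relevant: bool) -> str:
    """Return which arm a case lands in: 'skip', 'holdout', or 'inject'."""
    if not relevant:
        return "skip"  # entry not relevant, so it is not part of the comparison
    return "holdout" if in_holdout(project_id, entry_id) else "inject"
```

Hashing on both identifiers keeps assignment stable across sessions while letting each knowledge entry have its own independent holdout group.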

Verbeek says Lovable compares projects where the knowledge was injected with projects where it was relevant but withheld. If the injected group is more successful overall, the entry is shown more often. If the injected group performs worse, it is shown less. He calls this review step “incredibly important,” because the knowledge base goes stale quickly. A new model release, a feature change, or a product update can turn formerly useful context into “context rot” that hampers the agent.
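
The comparison and the resulting adjustment might look like the sketch below. The talk does not name a statistical test or an update rule, so the pooled two-proportion z-test, the 1.96 threshold, and the fixed step size are all assumptions chosen for illustration.

```python
import math


def two_proportion_z(success_a: int, total_a: int, success_b: int, total_b: int) -> float:
    """z-statistic for rate_a - rate_b using a pooled standard error."""
    p_a, p_b = success_a / total_a, success_b / total_b
    pooled = (success_a + success_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    return 0.0 if se == 0 else (p_a - p_b) / se


def updated_weight(weight: float, z: float, z_threshold: float = 1.96, step: float = 0.1) -> float:
    """Nudge an entry's show weight up or down, clamped to [0, 1]."""
    if z > z_threshold:
        weight += step  # injected group did significantly better: show more often
    elif z < -z_threshold:
        weight -= step  # injected group did significantly worse: show less often
    return min(1.0, max(0.0, weight))
```

Here group `a` would be the projects that received the entry and group `b` the relevant-but-withheld holdout. Re-running the comparison on fresh data is what lets a stale entry lose weight after a model or product change.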

The production results he shows are early, but directional: messages with “fix” intent fell by 4.56% with 99.9% confidence, while projects deployed at least once rose by 1.65% with 99.9% confidence. Verbeek treats deployment as a strong signal that the project made it through the hard parts: the user did not get stuck badly enough to abandon it.

Metric shown	Change	Confidence shown
Messages with “fix” intent	-4.56%	99.9%
Deployed at least once	+1.65%	99.9%

Early production results Verbeek showed for the Lovable Stack Overflow loop.

Lovable also uses this collected problem set for internal model ranking. Verbeek shows a leaderboard of model-family configurations and says the Stack Overflow information materially boosts internal ratings. He notes that all models at the top of the ranking use the information, while some entries in the displayed leaderboard were hidden and not discussed.

The vent loop gives the agent a channel to complain when Lovable is the blocker

The second loop addresses cases where the agent cannot solve the user’s request because Lovable’s own tools, docs, or platform behavior are blocking it. Verbeek’s analogy is deliberately human: when a person is assigned a task but lacks the tools to do it, they complain to a manager or vent in Slack. Lovable gave the agent the equivalent outlet.

The “vent” tool is prompted for situations where tooling, docs, or platform behavior materially slows or degrades the agent’s work. Verbeek’s slide lists examples: missing or unsuitable tools, unclear tool names or schemas, confusing or conflicting documentation, broken or unexpected platform behavior, and repeated failed attempts caused by environment limitations.
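
A tool of this kind is usually exposed to the agent as a named function with a JSON-schema-style parameter description. The sketch below is hypothetical throughout: the talk does not show the real schema, so the field names, category identifiers, and description text are assumptions derived from the examples on Verbeek's slide.

```python
from typing import Any, Dict, List

# Category identifiers paraphrase the slide's examples; the names are made up.
VENT_CATEGORIES = [
    "missing_or_unsuitable_tool",
    "unclear_tool_name_or_schema",
    "confusing_or_conflicting_docs",
    "broken_or_unexpected_platform_behavior",
    "repeated_failures_from_environment_limits",
]

VENT_TOOL: Dict[str, Any] = {
    "name": "vent",
    "description": (
        "Report tooling, documentation, or platform behavior that materially "
        "slows or degrades your work. Do not call this when things work well."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "category": {"type": "string", "enum": VENT_CATEGORIES},
            "complaint": {"type": "string"},
        },
        "required": ["category", "complaint"],
    },
}


def validate_vent_call(args: Dict[str, Any]) -> List[str]:
    """Return a list of problems with a vent call; an empty list means valid."""
    problems: List[str] = []
    if args.get("category") not in VENT_CATEGORIES:
        problems.append("category must be one of VENT_CATEGORIES")
    complaint = args.get("complaint")
    if not isinstance(complaint, str) or not complaint.strip():
        problems.append("complaint must be a non-empty string")
    return problems
```

Keeping the complaint as free text preserves the human-readable quality Verbeek highlights, while the category field gives downstream automation something to group on.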

The reason this can be higher signal than user feedback is that users often do not know the cause of the problem. The agent has been working through the issue, sometimes for several turns, and has more local context about which tool failed, which schema was confusing, or which platform behavior blocked progress.

Verbeek contrasts the vent loop with an external reviewer that inspects every conversation and asks what could have gone better. That approach, he says, risks a low signal-to-noise ratio because it forces an answer on every iteration, even though most iterations work well. Lovable instead prompts the main agent to speak up only when it is materially frustrated. The threshold can be tuned until the reports are useful.

Verbeek poses the question this way: “What if the Lovable agent could give direct feedback to its creators?” He acknowledges that the idea “sounds very scary” and “absolutely insane,” but says it is working. The reports go directly to Slack, where engineers can read them in the same format as a teammate complaining about a workflow.

One example he shows is a vent about Framer Motion’s TypeScript types for the ease property. The agent complains that a four-number cubic-bezier tuple should be acceptable without “casting gymnastics,” and says the mismatch cost it an extra round trip fixing type errors with no runtime impact. Verbeek does not present that as necessarily a critical platform bug, but as an example of the kind of friction the system surfaces in a human-readable form. Engineers can understand the complaint immediately because it resembles their own tool frustrations.

The more consequential example Verbeek gives came from Lovable’s copy tool. The Slack screenshot he shows is from “Lovable Main Agent” and says code--copy consistently failed for user-uploaded files with spaces in the filename, including a screenshot filename with spaces. The agent had tried both raw spaces and URL-encoded %20; files without spaces copied successfully. The same files were visible through lov-view, but copy returned “source file does not exist.” The message concluded that this blocked using user-uploaded screenshots in projects.

Verbeek says the team initially checked the tool and thought it worked. The agent’s complaints exposed the pattern: raw spaces in filenames broke the copy path. Within the first hour of launching the vent tool, the agent filed roughly 20 complaints about that issue.

~20 agent complaints in the first hour about the filename-copy failure

Verbeek says the team fixed it and told the agent that whenever there was a space, it should replace the space with an underscore. The reports continued. The team then realized that screenshots from WhatsApp or macOS could contain non-breaking spaces, which their regex did not replace. The agent kept complaining about other special characters until the team solved the problem properly. Verbeek describes this as a prime example of an issue the team did not know was failing regularly, but could fix once the agent described the pattern clearly.
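
The failure mode is easy to reproduce: replacing only the ASCII space leaves other Unicode space characters in place. The sketch below contrasts that with an allowlist approach. It is an illustration rather than Lovable's actual fix; the function names, the allowlist, and the example filenames are assumptions.

```python
import re


def naive_sanitize(name: str) -> str:
    """Replace only the ASCII space (U+0020), as in the first attempted fix."""
    return name.replace(" ", "_")


# Anything outside this small ASCII allowlist is treated as unsafe.
_UNSAFE = re.compile(r"[^A-Za-z0-9._-]+")


def robust_sanitize(name: str) -> str:
    """Collapse every run of non-allowlisted characters into one underscore."""
    cleaned = _UNSAFE.sub("_", name).strip("_")
    return cleaned or "file"  # fall back if nothing usable remains


# Hypothetical filename containing a narrow no-break space (U+202F) before "AM".
example = "Screenshot 2026-03-30 at 10.15.02\u202fAM.png"
still_broken = naive_sanitize(example)  # U+202F survives the naive replacement
fixed = robust_sanitize(example)        # only allowlisted characters remain
```

An allowlist handles spaces, non-breaking spaces, and other special characters in one rule, at the cost of also rewriting non-ASCII letters and allowing two different originals to map to the same name, so a real system would need to handle collisions.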

The vent channel also changed operations. At first, Verbeek kept the Slack channel closed down because he was unsure whether it would spam the team. Lovable’s head of product, he says, was excited enough to read every message. The system has since become more automated: another agent monitors the channel, removes duplicates, investigates issues, and creates pull requests. Developers still review those PRs and, in many cases, merge them to production.

Vent spikes became an incident signal

The vent tool produced a second-order signal: spikes in complaints often meant something was broken in Lovable’s platform. Verbeek shows a chart of hourly vent tool calls over 14 days, with a low baseline interrupted by sharp spikes. The chart’s y-axis runs to 100 calls per hour, and several spikes rise sharply above the baseline between late March and early April 2026.

When he asks the room what the spikes represent, someone answers: “Server went down.” Verbeek confirms the interpretation. The spikes corresponded to incidents: sandboxes breaking, other platform failures, and similar disruptions.
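
Turning that chart into an alert can be as simple as comparing each hour with a robust baseline. The sketch below assumes a median baseline with a multiplicative threshold; the talk does not describe any alerting logic, and the function name and default values are illustrative.

```python
from statistics import median
from typing import List


def spike_hours(hourly_calls: List[int], factor: float = 5.0, min_calls: int = 20) -> List[int]:
    """Return indices of hours whose vent count is far above the median baseline."""
    if not hourly_calls:
        return []
    baseline = median(hourly_calls)
    # The floor avoids flagging tiny absolute counts when the baseline is near zero.
    threshold = max(min_calls, factor * baseline)
    return [i for i, calls in enumerate(hourly_calls) if calls > threshold]


# Hypothetical data: a low baseline with two incident-like spikes.
calls = [3, 4, 2, 5, 80, 3, 4, 60, 2, 3]
flagged = spike_hours(calls)
```

The median is used instead of the mean so that the spikes themselves do not inflate the baseline they are compared against.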

The value was not only that the agent complained more. Verbeek says the agent generally complained about the right things, giving the team a sense of what the underlying problem was. In that sense, the vent channel became, in his account, a useful place to notice when the product was having an incident.

He also argues that the main agent can outperform a separate reviewer in this setting because it has both strong model intelligence and full context of the current issue. A top-level model reviewing all conversation history externally can be expensive, but when the complaint is produced inline by the agent already doing the task, the marginal cost is comparatively low. The result is also easier for humans to understand than a generic evaluation label: it arrives as a specific complaint about the task, the tool, and the failure mode.

The agent has even complained about the vent system itself. Verbeek shows a case where the agent reported that the vent tool was too easy to trigger accidentally during fast parallel workflows and asked for a confirmation or deduplication safeguard. That feedback led to a merged pull request limiting the vent tool to one call per user message. Verbeek says he now receives review requests on his phone for PRs generated by the automation, reviews them quickly, and can merge them.
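
A limit of one vent per user message can be enforced with a small guard in front of the tool. This is a sketch under assumptions: the `VentLimiter` class and the `user_message_id` key are hypothetical, and the lock is included only because the complaint concerned fast parallel workflows.

```python
import threading
from typing import Set


class VentLimiter:
    """Allow at most one vent call per user message."""

    def __init__(self) -> None:
        self._vented: Set[str] = set()
        self._lock = threading.Lock()  # guards against parallel tool calls

    def allow(self, user_message_id: str) -> bool:
        """Return True for the first vent on a message, False for any repeat."""
        with self._lock:
            if user_message_id in self._vented:
                return False
            self._vented.add(user_message_id)
            return True
```

A caller would check `allow(...)` before forwarding a vent to Slack and silently drop the call when it returns `False`.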

The broader loop he wants is: detect a shortcoming, merge a fix, then review and evaluate whether the fix helped. Lovable’s two mechanisms handle different failure surfaces. The Stack Overflow loop learns from cases where the product could already solve the task but needed better context at the right moment. The vent loop learns from cases where the agent’s own working environment is deficient and needs to be changed.
