Orply.

A Two-Hour AI Prototype Let Museum Visitors Talk to Statues

Joe ReeveAI EngineerMonday, June 1, 202614 min read

Joe Reeve of ElevenLabs argues that his “talk to a statue” prototype mattered less as a museum product than as evidence of what can now be assembled quickly from existing AI APIs. Built in Cursor in about two hours, the app identifies a photographed statue, generates historical context and a plausible voice, spins up an ElevenLabs agent, and starts a conversation in roughly 30 seconds. Reeve says the harder remaining questions are institutional rather than purely technical: who authors the object’s story, what voice it should have, and how multimodal voice interfaces should work.

The prototype worked because the hard parts were already APIs

Joe Reeve built the statue app as a demonstration of how far a small amount of “glue” can now go when it is stitched across capable AI APIs. The premise was simple: point a phone at a statue, identify it, generate a historically plausible voice and persona, and start talking to it.

The app’s pipeline had five steps. A user took a photo. A multimodal OpenAI model identified the object. OpenAI deep research generated historical context about the statue and produced prompts for what the statue’s voice might sound like if it were alive. ElevenLabs’ Voice Design API then generated a matching synthetic voice from that description. Finally, the system created an ElevenLabs agent and started a call.

Reeve said the full loop ran in about 30 seconds: photo, research, voice generation, agent creation, and conversation. The result, shown in a British Museum demo video, let users “speak” with objects including Pharaoh Amenhotep, Demeter, Hoa Hakananai‘a, Hans Sloane, a guardian lion, a young rider, and a war horse.

For Reeve, the important technical point was not that he had solved a deep new infrastructure problem. It was that he had combined existing systems that were already built to scale. ElevenLabs, he explained, is an audio AI foundation-model company spanning text-to-speech, speech-to-text, music, sound effects, voice creation, and managed agents. The statue app leaned especially on a part of the platform he thinks developers underuse: creating and editing voices.

He described the agent platform, half-jokingly, as something that “sounds like a mouthful of SaaS,” but the point was practical. The app did not need to own the hardest runtime components. Voice generation, agent hosting, and conversational audio infrastructure could be delegated to APIs.

That framing shaped his answer to a production question from the audience. A prototype is easy, an attendee said; scaling to millions of users is the hard part. Reeve agreed there would be real production work, especially around user management, but argued that the core load-bearing parts of this particular app were not where an individual builder would have to invent scale from scratch.

If the app “goes absolutely gangbusters,” he said, it still would not make a dent in the volume that ElevenLabs’ APIs are designed to handle. Authentication, user accounts, logins, magic links, and similar production concerns could be built or bought with standard tools such as Supabase. The part he saw as newly valuable was the “glue piece”: arranging existing systems into an interaction that feels coherent, and telling a story about why that interaction matters.

A two-hour build became a business-development signal

Reeve said he built the app in Cursor in two hours on a Sunday because he was “tired and bored.” He then published the one-shot prompt to the ElevenLabs blog, made a short video, and posted it. The first post reached about 50,000 impressions, which he considered a good result.

The more consequential response came from institutions and companies. According to Reeve, three museums or museum-representing groups reached out, along with other businesses, including companies he described as TripAdvisor competitors. One museum CEO found his WhatsApp number and called him. The CEO told him he had a team of 10 people working on the same idea for a year and asked how Reeve had built it.

Reeve reposted the next day, emphasizing that he had vibe-coded the app in roughly two hours. He said he did not present that as a brag, but as evidence that the interaction pattern itself had become surprisingly accessible. That second post went viral, moving from 50,000 impressions on the first day to about 1.5 million on the second.

2 hours
time Reeve said it took to build the first version in Cursor

The viral spread was not only developer interest. Reeve said artists, creatives, portrait museums, Bonhams, Christie’s, and others contacted him because they wanted people to talk to objects they displayed, preserved, or sold. The inbound interest also helped lead to ElevenHacks, an ElevenLabs hackathon program shown in the source as a series of weekly challenges with free credits, prizes, and a displayed total of more than $240,000 in cash prizes awarded so far.

The app therefore functioned as more than a demo. It surfaced demand from institutions that had collections, audiences, and content-management problems. It also revealed a gap between what a motivated individual could now prototype and what organizations had previously assumed required long, staffed efforts.

Museums do not just need identification; they need curatorial authorship

When asked about evals, Reeve shifted from model performance to institutional responsibility. The first prototype used a photograph and internet-backed research to infer a statue’s identity and assemble context. He did not present that as the long-term museum-grade solution.

For museums, he said, the “important bit” would be working with curators to decide the actual narrative. Rather than pulling “random things you found from Google,” institutions would need to design the content deliberately. That work would likely be the long tail: not whether the model can identify an object, but who decides what the object says, how it says it, and what interpretive frame it carries.

Reeve said many museums already view their databases as core intellectual property. That can make integration more plausible, because institutions may already have structured APIs or internal collections data. He specifically mentioned the V&A as having a public API for its materials.

An audience member then asked what interface would let curators design the experience. Reeve said he had not designed that interface yet. In the current version, the closest option would be for a curator to log into the ElevenLabs dashboard and edit the agent’s system prompt or knowledge-base files. He suspected that for information management, the best interaction pattern would probably remain text editing rather than voice, even though voice selection itself requires listening and judgment.

The deeper design problem is not just factual content. It is what an object should sound like. Reeve described conversations with Jago, formerly head of the Americas at the British Museum and now running the Sainsbury Centre, who is working through the question academically. If an object’s material came from a mountain in China, was shipped to Vietnam, carved there, and then spent two centuries in a British museum, what accent or vocal identity should it have? Should it carry Chinese origins, a Vietnamese inflection, a British influence, or something shaped by tourists and institutional life?

Reeve did not claim to have an answer. He treated the question as evidence that voice design for inanimate things is becoming a real cultural and product problem. ElevenLabs can give voices to more than people. That means builders have to ask what elevators, museum objects, and other everyday systems ought to sound like. Lifts already have voices, he noted, but he often finds them “discordant” with what a lift is.

Voice is not enough; the interface has to become multimodal

Asked about voice as an interface to applications, Reeve described current voice interaction as powerful but still awkward. One problem is that it is often binary: either the user is interacting by voice or by some other interface. He thinks the stronger pattern will be voice plus interactive or generative UI, but said the industry has not fully seen that yet.

His example was a coding tool such as Lovable. He does not necessarily want to talk directly to the coding agent that changes the application. He wants to talk to a product-manager-like agent that understands intent, negotiates requirements, and then triggers the coding agent to do the work. In that pattern, the thing the user talks to is not the thing doing the underlying work.

That distinction matters because voice has low visible structure. A voice agent may be thinking, extracting entities, creating plans, or calling tools, but the user often cannot see the intermediate state. Reeve pointed to an audience member’s app that showed what a voice agent was thinking and extracting from the conversation while allowing the user to interact with that material at the same time. He expects more “multi-modal conversations” where voice and visual interfaces operate together.

A second problem is social: people are too polite to interrupt agents. Reeve said he has learned to interrupt agents more aggressively, and that doing so makes the experience better, but he does not know how to give users permission to interrupt. Human conversation has visual cues, timing, body language, and micro-signals that make interruption and turn-taking manageable. Voice agents strip much of that away.

The audience pushed on a related frustration: voice can be a fast input channel, but it is often a poor output channel for dense information. One attendee said they like using voice to get thoughts out quickly, but want diagrams, text, or some higher-density representation back from the system. Reeve agreed. He wants to speak easily and quickly, receive perhaps a small amount of voice, but mostly get generated UI, diagrams, or context-specific visual output.

Another attendee added that voice supplies something other than information density: companionship. They may learn more from diagrams or text, but a talking system can reduce loneliness and increase motivation to continue tinkering. A different attendee argued that visual processing is easier because the visual cortex is older than voice and text. The room’s view was not that voice should replace visual interfaces, but that voice changes the emotional and motivational texture of interaction.

The missing primitive is skimmable listening

The discussion sharpened around a specific gap: there is no good equivalent of skim reading for voice. One attendee noted that concise text is welcome when requested, but concise speech can sound rude. Reeve asked what “skim listening” would look like.

He imagined audio interfaces with forward and backward controls more granular than podcast scrubbing: perhaps buttons that move by half-sentence, or a gesture like an old iPod wheel. The right unit might not be sentences at all. In skim reading, the eye does not necessarily move according to grammar; it hunts for concepts and salient words. A listening interface might need to scroll through concepts rather than time.

The problem is especially obvious with agents. If an agent begins giving a three-paragraph answer and the user wants “the next one,” a normal interruption becomes a new prompt. The agent may then elaborate on the next paragraph rather than simply skipping ahead. If the user asks it to be concise, the speech may become socially awkward or brusque.

Audience members suggested summaries for each paragraph, expandable sections, and interaction patterns similar to the Claude app, which one attendee said shows a higher-level structure while the user hears something else, allowing the user to tap into sections. But that also changes the interaction away from ordinary human conversation.

Reeve noted that humans handle these situations through cues agents usually lack. A speaker can tell when someone is about to interrupt and may speed up or move on. Listeners can signal “yeah, yeah” without taking the floor. With most agents, backchannels such as “uh-huh” or “go back” are not cleanly represented.

The group explored lightweight visual cues that could sit on top of audio. One builder, working on voice interaction for a PS5-like game world, said pure voice interruptibility had been unreliable, so he used push-to-talk: hold to speak, release to finish. He suggested an agent might show a small circle or other indicator when it wants to ask a question, letting the user decide whether to yield the floor. Reeve extended that idea: the agent could signal not only that it wants to interrupt, but what topic it wants to interrupt about. That might give more context than a human conversational cue.

Agent transcripts are more malleable than most products expose

The technical discussion turned to how such steering could work. One audience member asked whether the system would stream audio through WebRTC, use a back channel, update the system prompt, or handle tool calls in the background to manage the state of the interaction.

Reeve said that, as far as he knew, the proposed interaction had not yet been built. His first approach would be to analyze the transcript asynchronously: repeatedly inspect what has been said and ask whether there is anything useful to add. Rather than making the behavior a tool call inside the main agent loop, he would have another process observe the transcript.

When the questioner pressed on steering the original agent after the plan changes, Reeve described agents as less opaque than they often appear. An agent is partly the logic that decides the next message, but it is also a transcript. That transcript is “completely malleable.”

Most agent platforms, he said, generate a full text response and then begin generating audio. If the user speaks while the audio is still being read, the platform may append the user’s message after the full generated text, even though the user interrupted partway through hearing it. But if the platform has timestamps and knows how far the audio got before interruption, it could edit the transcript: treat the model’s unheard tail as if it had never been generated.

That would allow interruptions to become structurally meaningful rather than just another user message appended after a completed assistant turn. The audience member described wanting a second type of user message: something like steering rather than a standard conversational turn. Reeve’s response was practical rather than definitive: go to the ElevenLabs booth and vibe-code something to see.

Consumer vibe coding has not had its Instagram-filter moment

Reeve sees vibe coding as culturally important, but not yet mainstream for consumers. Tools such as Lovable may feel accessible, but he said they still appear oriented toward building B2B SaaS-style apps with common components, databases, and Supabase-like back ends.

The more interesting energy, in his view, shows up at evening vibe-coding events. They feel to him like early hackathons because people arrive who have never thought seriously about writing code. Some do not even have a mental model that apps are made by people. They do not know terms such as hamburger menu or accordion; they simply describe what they want. The LLM then attempts to produce it, often yielding something “completely wacky.” A software engineer might have translated the request into familiar primitives, but the novice’s language can lead the system somewhere stranger.

Reeve expects some equivalent of the Instagram filters moment or TikTok moment for vibe coding: a pattern that makes social creation mainstream. He does not know what it will look like, but he thinks it is worth experimenting.

He mentioned a few current attempts, including Spielwork, a mobile app for vibe-coding games with TikTok-style swiping, and another London-based game-oriented vibe-coding tool whose name he could not recall. But he was skeptical that games are necessarily the right path, because they are complex. In his view, many efforts are still basically “Lovable but on your mobile phone.”

The strongest analogy he offered came from a previous social gaming platform: Facebook Instant Games. He described buying a Fruit Ninja clone for £15, instrumenting it with the Facebook Instant Games API, and waking up the next day to 15 million users, mostly in Vietnam. The API provided simple primitives: user information, friends, leaderboards, basic key-value storage, rewarded ads, and interstitial ads. Those primitives made it easy to create a social-graph-enabled, ad-enabled consumer experience.

Reeve said he did not make much money from it, but the distribution lesson stuck. The social elements let people instantly share the experience. Something like that, he suggested, may become the template for social vibe coding: not merely making creation easier, but embedding the primitives for distribution, social context, and feedback.

The viral video was made with phone editing, captions, music, and a fast hook

Asked how he approached the statue video, Reeve said video performance is unpredictable. Things he expects to go viral often do not; things that do go viral can surprise him. His practical advice was to practice, iterate, and avoid assuming that high production value is required at the start.

He edited the statue video on his phone in CapCut. The edit took about 20 to 25 minutes. He described the quality as “relatively janky,” but good enough. Getting beyond that level, he said, quickly becomes harder: desktop editing tools add complexity, and even CapCut on desktop feels to him significantly harder and more expensive than the phone version.

Several tactics mattered. Captions help. The hook has to come early. Based on his platform analytics, Reeve said his videos tend to have a median view time of roughly 6 to 12 seconds before people drop off, so the interesting part has to be frontloaded.

Music also makes a large difference. Reeve said people underrate how much music changes the feeling of a video, even when it is relatively quiet. With ElevenLabs music generation, he can now make a video with narration and then test different genres until something fits, or generate music first to establish a mood and then shape the speech around it.

For the statue video, he chose the music before making the final video. He wanted something that fit the British Museum setting and the playful grandeur of talking statues — “a bit of imperial outside the British Museum,” as he put it. The visual material also improved through iteration. He first went to the museum and recorded footage that felt boring, then returned and took “a bunch of silly photos.” Those became the final video.

The frontier, in your inbox tomorrow at 08:00.

Sign up free. Pick the industry Briefs you want. Tomorrow morning, they land. No credit card.

Sign up free