Agents Can Build and Repair Scrapers Instead of Parsing Every Page

Rafael Levi · AI Engineer · Sunday, June 7, 2026 · 13 min read

Rafael Levi of Bright Data argues that the hard part of web data collection has moved from scraping a page to maintaining the pipeline after sites change. In his session, he presents Bright Data’s MCP, APIs and browser infrastructure as a way for agents to inspect public websites, generate reusable scrapers, run them at scale and repair them when selectors, pagination or access conditions break. The economic case is that LLMs should spend tokens learning site structure and writing code, not repeatedly parsing every page.

The expensive part is not scraping once; it is keeping the scraper alive

Rafael Levi framed the problem as a shift from one-off extraction to production data collection. The common request he sees is not technically hard in isolation: scan 10,000 products, parse pages with an LLM, return structured data. The problem is that doing this page by page through a model burns tokens, and writing traditional scrapers creates a maintenance burden that often outlasts the initial build.

His answer was not to ask the model to parse every page. It was to ask the model to build a scraper that can parse the pages instead.

The point of Bright Data’s MCP setup, as Levi presented it, is to give an agent enough access and context to inspect a site, understand the page structure, generate a reusable scraper, and run that scraper through Bright Data infrastructure. The model uses MCP for exploration and intelligence; the API is used for execution. In his framing, the LLM should spend tokens learning the structure and producing code, not repeatedly consuming raw HTML or markdown for every product page.

The “scraper tax,” as shown on Bright Data’s slide, is the familiar sequence: inspect elements, find selectors, handle pagination, deal with JavaScript rendering, test, debug, and repeat. Then the site changes. Selectors move, anti-bot behavior changes, a React frontend changes the DOM, or a redesign invalidates the assumptions. The slide claimed that “80% of engineering time goes to upkeep, not building,” and Levi’s own description matched that emphasis: writing the scraper may take less time than maintaining it, especially on sites that change constantly.

Scraper maintenance is a full-time job nobody wants.

The pipeline he described has four stages. First, the agent explores the site through MCP, including pages that would otherwise be blocked by CAPTCHA, JavaScript rendering issues, or bot-detection systems. Second, it identifies the data structure: product names, prices, images, selectors, pagination patterns, and the shape of the site. Third, it writes a production Python script targeting Bright Data’s API. Fourth, it executes that script against many URLs through batch processing rather than sending every page through the model one by one.
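The fourth stage, batch execution with no model in the loop, could look roughly like the sketch below. It is an illustration rather than Bright Data's implementation: `run_batch`, `fetch_page`, and `parse_page` are hypothetical names, and the fetcher is passed in so that any HTTP client, including one that routes through Bright Data's API, could be substituted.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List


def run_batch(
    urls: Iterable[str],
    fetch_page: Callable[[str], str],
    parse_page: Callable[[str], List[dict]],
    max_workers: int = 8,
) -> Dict[str, object]:
    """Fetch and parse many URLs concurrently; no LLM call is involved."""
    url_list = list(urls)
    records: List[dict] = []
    failures: Dict[str, str] = {}

    def process(url: str) -> List[dict]:
        return parse_page(fetch_page(url))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything first, then collect results in submission order.
        futures = {url: pool.submit(process, url) for url in url_list}
        for url, future in futures.items():
            try:
                records.extend(future.result())
            except Exception as exc:  # keep going and report the failure instead
                failures[url] = repr(exc)

    return {"records": records, "failures": failures}
```

Collecting failures per URL, rather than aborting on the first error, leaves the later validation step something concrete to inspect.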

Levi also described a maintenance loop. In his own collections, he said an LLM process runs every 30 minutes, checks whether the collected data looks valid, and shuts down if everything is fine. If validation fails — for example, a required data point is missing — the agent can repair the pipeline rather than waking a human. His operational claim was simple: if an agent can access the site, inspect it, and generate the scraper, then it can also diagnose and update the scraper when the site changes.
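A validation pass of the kind described here can be very small. The sketch below assumes a hypothetical record schema (`name`, `price`, `url`) and a hypothetical helper name, `find_problems`; Levi did not show his own checks.

```python
from typing import Iterable, List

REQUIRED_FIELDS = ("name", "price", "url")  # hypothetical schema, not from the talk


def find_problems(records: Iterable[dict], min_records: int = 1) -> List[str]:
    """Return human-readable problems; an empty list means the run looks valid."""
    rows = list(records)
    problems: List[str] = []
    if len(rows) < min_records:
        problems.append(f"expected at least {min_records} records, got {len(rows)}")
    for index, row in enumerate(rows):
        # A field counts as missing when it is absent, None, or an empty string.
        missing = [f for f in REQUIRED_FIELDS if row.get(f) in (None, "")]
        if missing:
            problems.append(f"record {index} is missing: {', '.join(missing)}")
    return problems
```

An empty result would let the scheduled process exit immediately; a non-empty one is the text an agent could be handed as the starting point for a repair.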

MCP supplies the web access the agent needs to inspect protected sites

The Bright Data MCP, as Levi described it, gives an AI agent a set of tools for web access. In response to a question about what is inside the MCP, he said it provides 66 tools. Some allow the agent to send a cURL-style request to a URL and receive the HTML back through Bright Data’s system. That system, in his description, handles CAPTCHA where needed, sends the relevant token, and uses the headers and cookies required for the site to treat the request as browser-like.

The agent can request full HTML when it needs selectors and page structure. It can also request a cleaner markdown representation when it only needs the page text. Levi repeatedly emphasized markdown as a token-saving mechanism: if the model does not need tags, attributes, and DOM structure, it should not consume them.

Bright Data also provides prebuilt APIs for roughly 500 domains, according to Levi. In those cases, the agent may not need to construct a scraper at all. For a site such as Amazon, he said, an agent can use a prebuilt API and receive product data as JSON rather than inspect the page and generate code. Bright Data also exposes remote browser infrastructure for cases where a site requires browser interaction rather than simple URL fetching.

The live demonstration used Claude Code. Levi showed Bright Data’s GitHub repository for a Claude Code plugin, described in the visible page as bringing Bright Data’s web infrastructure directly into Claude Code. The repository page listed capabilities including scraping webpages as clean markdown while bypassing bot detection, CAPTCHA, and JavaScript rendering; searching through a Bright Data search engine; extracting structured data into JSON or CSV; orchestrating more than 80 MCP tools; and building scrapers and scripts from the terminal with a single prompt. The page also claimed token savings of “up to 60%” for complex web pipelines by giving the LLM only the exact data it needs.

Levi first tried to target Walmart, which he described as having aggressive anti-bot systems. The cleaner demonstration moved to Very.co.uk, a site suggested by the audience. Claude Code inspected the Bright Data skills, examined the scraper structure, identified the search URL pattern for Very, and began generating the script. The visible terminal showed it finding a search URL pattern like https://www.very.co.uk/e/q/headphones.end?pageNumber=1, then fixing a price-extraction issue caused by the pound symbol being encoded or parsed oddly.
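The price fix was not shown in detail. One plausible shape for it, offered only as a sketch, is to ignore the currency symbol entirely and extract the numeric part, so that a mis-decoded pound sign (for example `Â£`) no longer matters. The function name `parse_price` is hypothetical.

```python
import re
from decimal import Decimal
from typing import Optional

# Matches the first number such as 1,299.99 or 49 and ignores whatever precedes it.
_PRICE_NUMBER = re.compile(r"(\d{1,3}(?:,\d{3})+|\d+)(\.\d{1,2})?")


def parse_price(text: str) -> Optional[Decimal]:
    """Extract a decimal amount from strings like '£49.99' or 'Â£1,299.00'."""
    match = _PRICE_NUMBER.search(text)
    if match is None:
        return None
    whole = match.group(1).replace(",", "")  # drop thousands separators
    fraction = match.group(2) or ""
    return Decimal(whole + fraction)
```

Returning `None` instead of raising keeps a single odd listing from stopping the run, while still leaving a missing value for validation to catch.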

That small parser bug was the practical point. Levi was not only showing a tool that can fetch a page. He was showing an agent doing the work that normally turns into scraper maintenance: discovering URL patterns, understanding pagination, generating a parser, and fixing field-extraction logic when the page’s formatting does not match the first attempt.

The token-saving argument depends on moving extraction out of the model

Levi’s strongest economic claim was that LLMs should not parse every page directly when a reusable script can do the work. The model’s role is to inspect the site and write the parser. After that, collection happens in Python, with ordinary HTTP request costs rather than model-token costs for every page.

In the demo, he asked Claude Code to estimate the token savings of using the generated scraper rather than manually requesting and parsing every page through Claude. The figures were visible estimates produced inside the demo, not audited benchmarks. One comparison said that scraping 90 products in Python happened at “zero token cost” for the actual data collection, apart from HTTP request costs. It estimated each page would consume roughly 12,000 to 15,000 tokens of markdown if handled manually. For three pages, the table estimated a manual approach at roughly 41,000 to 53,000 tokens, compared with about 17,000 tokens for the scripted path, for savings of roughly 24,000 to 36,000 tokens, or about 60%.
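The arithmetic behind those figures is easy to reproduce. The sketch below uses only the numbers quoted above; the helper name `savings` is illustrative.

```python
def savings(manual_tokens: int, scripted_tokens: int) -> tuple:
    """Return (tokens saved, percentage saved rounded to a whole number)."""
    saved = manual_tokens - scripted_tokens
    return saved, round(100 * saved / manual_tokens)


low = savings(41_000, 17_000)   # (24000, 59)
high = savings(53_000, 17_000)  # (36000, 68)
print(low, high)
```

On these inputs the saving ranges from about 59% to about 68%, so "about 60%" corresponds to the low end of the estimate.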

~60%: estimated token savings in the visible three-page product-search comparison

Another visible comparison for building a code scraper for Very.co.uk showed larger savings: script total around 5,500 tokens versus manual total around 30,500 tokens, a savings of roughly 25,000 tokens, or about 81%. Levi cautioned that the earlier 60%-plus result was “not a high number” and speculated that the site may have had relatively structured HTML.

Approach	Input tokens	Output tokens	Total
Script	~4.5K	~1K	~5.5K
Manual	~25.5K	~5K	~30.5K
Saved	~21K (~82%)	~4K (~80%)	~25K (~81%)

Visible Claude Code estimate comparing a scripted Very.co.uk product-search scraper with manual LLM parsing

The mechanism matters more than the exact number. Levi described three levels of waste reduction. First, use markdown instead of full HTML when only text is required. Second, use JSON rather than markdown when a scraper has already extracted the fields. Third, once the script exists, let the model invoke the script instead of re-reading the site. In one example, he said executing the script might cost only a few dozen to around 100 tokens, after which the agent can operate on a compact JSON result.

He also argued that this pattern is useful even outside enterprise-scale scraping. If a user wants to find the best-reviewed headphones across several marketplace pages, the agent can build a scraper, collect the product and review data, and then analyze the resulting JSON. The benefit is not only that the request can avoid blocks; it is that the same scraper can be reused later without repeatedly paying the model to interpret the site.

For Levi, this changes even personal automation. He said he keeps the MCP connected and generally asks the LLM to build scripts it can reuse, rather than asking it to browse and extract ad hoc each time.

The demonstration contrasted blocked fetching with MCP-assisted access

Levi showed a direct contrast between ordinary fetching and Bright Data MCP-assisted access on Walmart. In Claude Code, he asked the agent to go to Walmart, search for headphones, and report the first result without using Bright Data MCP. The visible output said the fetch to https://www.walmart.com/search?q=headphones was blocked and returned a Cloudflare “Robot or human” CAPTCHA page.

He then asked it to do the same with Bright Data MCP. The visible output returned results for “headphones,” including “Beats Solo 4 Wireless On-Ear Headphones” and “JBL Tune 510BT Wireless On-Ear Headphones.” Levi explained that the agent was using “Scrape as a Markdown,” pulling text rather than full HTML. When an audience member asked whether a person was clicking a box somewhere, Levi said it could be opening a browser and holding Walmart’s click-and-hold challenge. Bright Data, he said, has in-house CAPTCHA-solving systems, including AI that moves and clicks.

The claim was not just that MCP can fetch pages. Levi described the infrastructure as including browser access, CAPTCHA handling, headers, cookies, IP selection, and domain-specific APIs. In his formulation, this matters most on a minority of sites: he said MCP is “mostly useful” on about 20% of domains, specifically those protected by systems such as Akamai, DataDome, and Cloudflare. Those are also often the most valuable domains for data work, in his examples: real estate and major e-commerce sites.

The generated code shown in VS Code used Bright Data’s Web Unlocker API. The visible Python function sent a requests.post call to https://api.brightdata.com/request, included an authorization bearer token, specified a Bright Data zone, passed the target URL, requested raw format, and returned response.text. Another function, parse_products, used BeautifulSoup-style parsing to extract product data. Levi pointed out that the generated script had exactly the two inputs he asked for: a keyword search and a maximum page count.
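A request of that shape might be assembled as follows. This is a sketch under stated assumptions, not the code from the demo: the demo used `requests.post`, whereas this version uses the standard library's `urllib` to avoid a third-party dependency, and the JSON field names (`zone`, `url`, `format`) are inferred from the description above rather than copied from the screen. The zone and token values are placeholders.

```python
import json
import urllib.request

API_ENDPOINT = "https://api.brightdata.com/request"  # endpoint named in the demo


def build_unlocker_request(target_url: str, zone: str, api_token: str) -> urllib.request.Request:
    """Build (but do not send) the POST request described above."""
    # Field names are assumptions based on the talk's description of the payload.
    payload = {"zone": zone, "url": target_url, "format": "raw"}
    return urllib.request.Request(
        API_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def fetch_html(target_url: str, zone: str, api_token: str, timeout: float = 60.0) -> str:
    """Send the request and return the response body as text."""
    request = build_unlocker_request(target_url, zone, api_token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Separating request construction from sending makes the payload inspectable without network access, which is convenient when an agent is asked to verify its own generated script.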

The live run slowed, as live demos often do, but the workflow artifact on screen was the relevant result. The agent had inspected the site, found the URL pattern, created a parser, handled a price parsing issue, and produced code that could be invoked with a command such as python very_scraper.py laptops 2.
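The two-input interface can be sketched with `argparse`. The URL pattern is the one visible in the demo terminal; everything else here is assumed, including the function names and the way multi-word keywords are encoded.

```python
import argparse
from typing import List, Optional
from urllib.parse import quote

# Pattern seen in the demo terminal; the placeholders are filled in per page.
SEARCH_PATTERN = "https://www.very.co.uk/e/q/{keyword}.end?pageNumber={page}"


def build_search_urls(keyword: str, max_pages: int) -> List[str]:
    """Return one search URL per page, numbered from 1."""
    # Assumption: the keyword is lowercased and percent-encoded into the path.
    slug = quote(keyword.strip().lower())
    return [SEARCH_PATTERN.format(keyword=slug, page=page) for page in range(1, max_pages + 1)]


def main(argv: Optional[List[str]] = None) -> List[str]:
    """Parse 'keyword max_pages' and return the URLs a scraper would fetch."""
    parser = argparse.ArgumentParser(description="Search-results scraper sketch")
    parser.add_argument("keyword", help="search term, e.g. laptops")
    parser.add_argument("max_pages", type=int, help="number of result pages to fetch")
    args = parser.parse_args(argv)
    return build_search_urls(args.keyword, args.max_pages)
```

Calling `main(["laptops", "2"])` yields the two page URLs for that search; a real script would pass `sys.argv[1:]` and hand each URL to its fetch and parse functions.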

Self-healing means validation, reinspection, and redeployment

The self-healing pipeline Levi described starts with validation. A pipeline returning empty or wrong data is not just a failed job; it is a signal for the agent. Bright Data’s slide summarized the loop as detect, diagnose, fix and redeploy. Detect means noticing that an extraction count dropped or expected fields disappeared. Diagnose means re-exploring the site through MCP, comparing the current DOM with the expected structure, and identifying what changed. Fix and redeploy means updating selectors, adapting to the new structure, testing output, and resuming the pipeline.
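The detect and diagnose steps can be approximated without a model at all, which narrows what the agent then has to reason about. The sketch below is an assumption about how such checks might look, not Bright Data's implementation: it flags a drop in record count against a baseline and reports which expected CSS class names no longer appear in the current page. The names `count_dropped` and `missing_classes`, and the 50% tolerance, are hypothetical.

```python
from html.parser import HTMLParser
from typing import Iterable, List, Set


class _ClassCollector(HTMLParser):
    """Collects every CSS class name that appears in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.classes: Set[str] = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def count_dropped(current: int, baseline: int, tolerance: float = 0.5) -> bool:
    """Detect: flag a run whose record count fell below a share of the baseline."""
    return current < baseline * tolerance


def missing_classes(html: str, expected: Iterable[str]) -> List[str]:
    """Diagnose: list expected class names that no longer appear in the page."""
    collector = _ClassCollector()
    collector.feed(html)
    return [name for name in expected if name not in collector.classes]
```

A non-empty result from `missing_classes` is a compact description of what changed, which is cheaper to give an agent than the full page.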

The site changes. The agent repairs. No tickets. No on-call.

Levi’s claim rests on a chain: if the agent can access the site, it can inspect the site; if it can inspect the site, it can build a scraper; if it can build a scraper, it can maintain one. The maintenance use case is where MCP becomes more than a browsing convenience. A model without access to the current page cannot diagnose a changed selector. A model that can re-open the target site, retrieve the current structure, and compare it with the previous parser can propose or apply repairs.

He described this as a way to avoid human on-call work for scraper breakage. If a validation rule catches a missing field or a sudden drop in record count, the agent can run the diagnosis and patch cycle. In his own phrasing, the agent fixes it in minutes and “you don’t have to wake up in the middle of a night.”

That framing also explains why Levi kept returning to pipelines rather than isolated prompts. A single prompt can answer a one-time question. A pipeline can be scheduled, validated, repaired, and reused. The LLM is not the scraper; it is the builder and maintainer of the scraper.

Public data is the boundary Levi drew around the system

An audience member asked whether the same approach works with authorization inside a site. Levi answered no: Bright Data deals only with public data. Behind-login data is private, and creating an account means accepting a website’s terms and conditions. He warned that if terms say not to scrape or not to use robots, ignoring those terms can lead to litigation.

He tied the point to the broader fight over data access. In Levi’s account, data has become “the new gold” or “the oil,” and platforms are increasingly locking it down. He cited Twitter under Elon Musk as an example of a service that had been more open and became locked down. He also said Bright Data had been sued by Meta and by Elon Musk shortly after the Twitter acquisition. Levi characterized the legal outcome in broad terms: in his telling, judges treated public data as public data, available in the same sense as prices visible on a counter in a shop. The session did not provide case records or legal citations, so this should be read as Levi’s characterization rather than an independently established legal rule.

His practical guidance was narrower than the broad rhetoric. Use public data, do not go behind logins, do not accept terms and conditions in order to access data through the system, and be careful when a site requires an explicit checkpoint or agreement.

That boundary also shaped the later discussion of actions. When asked whether the system can do more than scrape — for example, fill out a form — Levi said yes. It can fill and submit forms. What it cannot do, in his stated boundary, is log in. For public workflows where a site requires interaction rather than a predictable URL, he said the agent can use a remote browser: click buttons, enter search parameters, and navigate workflows.

His example was flight search, where the URL may be a hash or otherwise not useful for direct parameter construction. In such cases, a browsing agent can open a remote browser, choose a geolocation if needed, and interact with the site through the UI.

Browser automation is presented as human-like interaction, not just headless control

Levi described Bright Data’s browser infrastructure as designed to mimic real user behavior. When an agent clicks, he said, it is not an instantaneous teleporting cursor; it uses pre-recorded mouse movement. When it types, it can type at variable speed and even make mistakes. The reason is that some websites continuously send behavioral signals back to the server, and automation that looks mechanically perfect may be detected.

In this account, the browsing model itself does not need to be the most capable frontier model. Levi said he does not use top models for browsing agents; for browsing, a model such as Haiku is often enough if the browser behavior is masked as real human interaction. The hard part is not always reasoning. Sometimes it is getting the site to serve the page and accept the interaction.

Levi’s examples deliberately spanned enterprise and personal use. At enterprise scale, the pipeline might collect millions of records. At personal scale, he described using Claude Code and Bright Data to watch for a house in a target area under a given price, notify him when one appeared, and ultimately help him find the place where he now lives. He also mentioned a public-site action-automation example involving a hard-to-book restaurant table: a listener waiting for a spot to open and acting when it appears. He did not describe that example as a login-based workflow, and it sits under the same boundary he stated elsewhere: public data and no credentialed access.

Those examples show the same pattern at different scales: a listener, a target condition, scheduled checks, public web access, and action or notification when the condition is met. The technical claim is that once agents can reliably access and operate websites, the difference between “scrape a catalog” and “watch for a specific event” becomes mostly a matter of prompt, validation, and script.
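The listener pattern reduces to a small function that a scheduler calls repeatedly. The sketch below is illustrative only: `check_once`, the listing fields, and the notification callback are all hypothetical, and fetching the listings is assumed to happen elsewhere.

```python
from typing import Callable, Iterable, List, Set


def check_once(
    listings: Iterable[dict],
    matches: Callable[[dict], bool],
    seen_ids: Set[str],
    notify: Callable[[dict], None],
) -> List[dict]:
    """Run one scheduled check and notify about matches not seen before."""
    new_hits = []
    for item in listings:
        if item["id"] in seen_ids or not matches(item):
            continue
        seen_ids.add(item["id"])  # remember it so later checks stay quiet
        notify(item)
        new_hits.append(item)
    return new_hits
```

For the house-hunting case, `matches` could be something like `lambda h: h["area"] == "north" and h["price"] <= 500_000`, with `notify` sending a message. Swapping the condition and the callback turns the same loop into the restaurant-table watcher.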
