FineWeb Shows LLM Dataset Quality Depends on Measured Web Filtering

Alejandro AO · Hugging Face · Tuesday, June 2, 2026 · 11 min read

Alejandro AO’s overview of Hugging Face’s FineWeb argues that building a competitive LLM pretraining dataset from Common Crawl is a measurement-driven engineering process, not a matter of collecting more web text. He presents FineWeb as an open recipe in which Hugging Face chose raw HTML extraction over Common Crawl’s text extracts, found that global deduplication removed valuable data, and selected filters by training and evaluating small models. The same logic underpins FineWeb-Edu, where Llama-3-70B labels were distilled into a smaller classifier to filter the corpus for educational value at scale.

FineWeb treats web data as an engineering problem, not a raw ingredient

Alejandro AO frames FineWeb as an open attempt to show, step by step, how a large language model pretraining dataset can be built from the public web. The motivation is not just that pretraining data matters, but that the best-known LLM datasets are often closed or undocumented even when the models trained on them are released. FineWeb is Hugging Face’s effort to make the recipe inspectable, reproducible, and competitive.

The headline dataset is English-only, derived from Common Crawl, and contains 15 trillion GPT-2 tokens. Its source material spans 96 Common Crawl snapshots released since 2013. The accompanying FineWeb-Edu dataset is a filtered educational subset produced from FineWeb with model-assisted labeling; at the stricter threshold discussed by Alejandro, it contains 1.3 trillion tokens.

15T: GPT-2 tokens in FineWeb after extraction, filtering, deduplication, and final cleanup

The working mental model is “decanting” the web. FineWeb starts from roughly 200 trillion tokens of raw web data, separates useful text from boilerplate, duplicates, spam, non-English pages, navigation, and other low-quality material, and iterates until the remaining corpus trains better models. The process is not a single filter that discovers quality. It is a sequence of extraction choices, baseline cleaning, deduplication decisions, quality-filter ablations, and model-based evaluation.

Stage	Output described
Raw web data	Roughly 200T tokens
Deduplication and filtration	Roughly 20T tokens
FineWeb	15T tokens
FineWeb-Edu	1.3T tokens at the stricter educational threshold

Hugging Face’s decanting model moves from raw Common Crawl data to progressively filtered web-text corpora.
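The reduction at each stage is easier to see as ratios. The following is a minimal sketch that uses only the approximate token counts from the table above; the function name `retention_report` is made up for illustration.

```python
# Approximate token counts in trillions, copied from the stage table above.
STAGES = [
    ("Raw web data", 200.0),
    ("Deduplication and filtration", 20.0),
    ("FineWeb", 15.0),
    ("FineWeb-Edu (threshold 3)", 1.3),
]


def retention_report(stages):
    """Return (name, tokens, share_of_raw, share_of_previous_stage) per stage."""
    raw = stages[0][1]
    previous = raw
    report = []
    for name, tokens in stages:
        report.append((name, tokens, tokens / raw, tokens / previous))
        previous = tokens
    return report


for name, tokens, of_raw, of_prev in retention_report(STAGES):
    print(f"{name:30s} {tokens:6.1f}T  {of_raw:7.2%} of raw  {of_prev:7.2%} of previous")
```

Because the inputs are rounded figures, the resulting percentages are rough orders of magnitude, not exact retention rates.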

The important constraint is that “quality” is measured by downstream training. Alejandro repeatedly emphasizes that Hugging Face trained small models on candidate datasets using the same architecture and design, evaluated them on benchmarks such as MMLU, aggregated those results, and used the score to decide whether a dataset change helped. He presents this as the best practical way to evaluate a pretraining dataset because the intended use of the data is to train models.
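The decision rule described here can be sketched in a few lines. This is an illustration under stated assumptions: the unweighted mean is an assumed aggregation (the talk only says results were aggregated), the benchmark names are placeholders, and the numbers are invented inputs, not results from Hugging Face's ablations.

```python
from statistics import mean


def aggregate_score(benchmark_scores):
    """Collapse per-benchmark scores into one number (unweighted mean, an assumption)."""
    return mean(benchmark_scores.values())


def change_helped(baseline_scores, candidate_scores, min_gain=0.0):
    """True if the candidate dataset's aggregate exceeds the baseline's by more than min_gain."""
    gain = aggregate_score(candidate_scores) - aggregate_score(baseline_scores)
    return gain > min_gain


# Placeholder numbers for illustration only; they are not results from the talk.
baseline = {"benchmark_a": 0.30, "benchmark_b": 0.50}
candidate = {"benchmark_a": 0.32, "benchmark_b": 0.51}
print(change_helped(baseline, candidate))
```

The comparison is only meaningful when both score sets come from models with the same architecture and training budget, which is the constraint the talk stresses.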

The first costly choice was to avoid Common Crawl’s pre-extracted text

Common Crawl gave Hugging Face scale without crawling the web from scratch. It publishes web snapshots roughly every one to two months, and each crawl can contain hundreds of tebibytes of raw HTML. But that scale arrives in an extremely messy form: boilerplate, menus, spam, duplicated pages, non-English text, code-like fragments, templates, and page chrome.

The first major decision was whether to use Common Crawl’s WET files or its WARC files. WET files are text-only extracts already produced by Common Crawl. They are cheaper and simpler to process. WARC files contain raw HTML and metadata, which makes them more expensive because the dataset builder must do the extraction work.

FineWeb chose WARC plus trafilatura, an HTML-to-text extraction library. Alejandro says this was “probably the most expensive part of the entire process,” and notes that smaller teams attempting a similar pipeline might choose to skip it. But Hugging Face’s ablations found that the easier WET path produced worse model performance than extracting from WARC with their own cleaning method. The resulting dataset was smaller, but the models trained on it performed better.
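To make concrete what "doing the extraction work" means, here is a toy HTML-to-text extractor built on Python's standard library. It is a stand-in, not trafilatura's algorithm: the `SKIP_TAGS` set and the class name are assumptions for illustration, and a real pipeline would call trafilatura (which exposes an `extract` function) on HTML read from WARC records.

```python
from html.parser import HTMLParser

# Tags whose contents this toy example treats as page chrome (an assumption).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class MainTextExtractor(HTMLParser):
    """Collect text nodes that are not nested inside any tag in SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


sample = (
    "<html><body><nav>Home | About</nav>"
    "<p>FineWeb is built from Common Crawl.</p>"
    "<script>track()</script><footer>Cookie notice</footer></body></html>"
)
print(extract_text(sample))
```

Even this crude version drops the navigation, script, and footer text that a plain text dump would keep, which is the kind of difference the WET versus WARC ablation measured.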

This choice establishes a recurring pattern in the pipeline: cheaper processing and larger retained volume are not automatically better. Extraction quality changed the downstream model score enough to justify a more expensive first stage.

Base filtering reduced the corpus, but did not solve quality

After extraction, Hugging Face applied a first pass of standard filters inspired by RefinedWeb and MassiveText. URL filtering removed adult content using a URL blocklist. Language identification kept English documents using fastText with a score of at least 0.65. Quality and repetition filters removed some obviously problematic documents.
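A minimal sketch of the first two base filters follows. The 0.65 threshold comes from the text; everything else is assumed for illustration, including the document dictionary shape, the domain-level blocklist matching, and `fake_identify_language`, which is a hypothetical stand-in for a fastText language-identification model.

```python
from urllib.parse import urlparse

ENGLISH_MIN_SCORE = 0.65  # language-score threshold reported in the text


def url_allowed(url, blocked_domains):
    """Reject a URL whose host is a blocked domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return not any(host == d or host.endswith("." + d) for d in blocked_domains)


def keep_document(doc, blocked_domains, identify_language):
    """Apply the URL blocklist first, then the English language-score threshold."""
    if not url_allowed(doc["url"], blocked_domains):
        return False
    language, score = identify_language(doc["text"])
    return language == "en" and score >= ENGLISH_MIN_SCORE


def fake_identify_language(text):
    """Hypothetical language identifier returning (label, score); not fastText."""
    return ("en", 0.9) if " the " in f" {text.lower()} " else ("unknown", 0.2)


docs = [
    {"url": "https://example.org/post", "text": "The crawl is large."},
    {"url": "https://blocked.example/page", "text": "The page is blocked."},
    {"url": "https://example.org/es", "text": "Hola mundo."},
]
kept = [d for d in docs if keep_document(d, {"blocked.example"}, fake_identify_language)]
print([d["url"] for d in kept])
```

Running the cheap URL check before language identification avoids scoring documents that would be discarded anyway.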

That first pass left roughly 36 trillion tokens. Alejandro presents it explicitly as an initial cleanup, not the core of FineWeb’s quality gains. The web-scale corpus still needed deduplication and more targeted filtering.

~36T: tokens remaining after text extraction and base filtering

The base filtering step also clarifies the scope of this version of FineWeb. The dataset discussed here is English-only, although Alejandro notes that a multilingual version exists separately. The English filter is part of the recipe being explained, not a claim about how all web-scale datasets must be constructed.

Global deduplication looked principled and removed the wrong data

Deduplication was the most counterintuitive part of the pipeline. Alejandro describes the usual reasons to deduplicate web text: improve generalization, reduce memorization, avoid spending training steps on repeated text, and increase effective data diversity. FineWeb used MinHash fuzzy deduplication over word 5-grams to detect near-duplicate documents, not merely exact copies.
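The idea behind MinHash over word 5-grams can be sketched with the standard library. This is an illustration of the general technique, not FineWeb's implementation: the hash function, the 64-hash signature length, and the handling of very short texts are all assumptions, and a production pipeline would add a bucketing step to avoid comparing every pair of documents.

```python
import hashlib


def word_ngrams(text, n=5):
    """Return the set of word n-grams (shingles) in a text."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _seeded_hash(shingle, seed):
    """64-bit hash of a shingle; each seed acts as a different hash function."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text, num_hashes=64):
    """For each seed, keep the minimum hash over all shingles of the text."""
    shingles = word_ngrams(text)
    if not shingles:
        return None
    return [min(_seeded_hash(s, seed) for s in shingles) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog near the quiet river bank today"
doc_b = "the quick brown fox jumps over the lazy dog near the quiet river bank tonight"
doc_c = "web scale corpora need careful extraction filtering and evaluation before training"

sig_a, sig_b, sig_c = (minhash_signature(d) for d in (doc_a, doc_b, doc_c))
print(estimated_jaccard(sig_a, sig_b), estimated_jaccard(sig_a, sig_c))
```

Two documents that share most of their 5-grams agree on most signature positions, which is how near-duplicates are caught even when they are not byte-identical.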

The first attempt treated the 96 Common Crawl snapshots as one global deduplication space. Hugging Face deduplicated the newest crawl against itself, then deduplicated the next crawl against the already-deduplicated newer crawl and itself, then continued backward through older crawls. By the oldest snapshots, documents were being compared against more than 90 newer snapshots. More than 90% of the data from some old crawls disappeared.

The resulting global MinHash dataset fell from the post-base-filtering corpus of about 36 trillion tokens to roughly 4 trillion tokens. That would have been acceptable if model quality improved. It did not. Small models trained on the aggressively deduplicated dataset performed about the same as models trained on the non-deduplicated baseline, and did not close the gap to RefinedWeb, the open dataset Alejandro identifies as a strong reference point because it was used to train Falcon.

~4T: tokens kept after the failed global MinHash deduplication attempt

The autopsy was even more damaging to the intuition that “more deduplication is better.” Hugging Face took an older crawl and separately trained small models on the data that global deduplication had removed and on the data it had kept. The model trained on the removed data performed better. The data left behind after aggressive cross-snapshot deduplication was, on inspection, often template material, navigation, ads, and similar low-value text.

The failure mode was that “unique across all crawls” was not equivalent to “high quality.” Valuable older content could reappear across time and be discarded as duplicate, while low-value page fragments survived because they were less duplicated across the global corpus. Alejandro’s summary is blunt: deduplication is not monotonic, and more aggressive deduplication can hurt the dataset.

The fix was to deduplicate within each Common Crawl snapshot independently. Each dump was MinHash-deduplicated against itself, not against every newer crawl. The kept text from each deduplicated dump was then combined. This produced the 15 trillion-token FineWeb dataset and, in Hugging Face’s ablations, matched RefinedWeb while clearly beating the global deduplication attempt.
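The difference between the two strategies is only the scope of the "already seen" set. The sketch below uses exact string matching as a stand-in for MinHash near-duplicate detection, and the snapshot contents are toy data chosen to show the failure mode.

```python
def dedup_per_snapshot(snapshots):
    """Deduplicate each snapshot against itself only; the seen set resets per snapshot."""
    result = []
    for docs in snapshots:
        seen = set()
        kept = []
        for doc in docs:
            if doc not in seen:
                seen.add(doc)
                kept.append(doc)
        result.append(kept)
    return result


def dedup_global(snapshots):
    """Newest first: drop any document already seen in this or any newer snapshot."""
    seen = set()
    result = []
    for docs in snapshots:
        kept = []
        for doc in docs:
            if doc not in seen:
                seen.add(doc)
                kept.append(doc)
        result.append(kept)
    return result


# Snapshots ordered newest first; strings stand in for documents (toy data).
snapshots = [
    ["essay A", "essay B", "essay B"],   # newest crawl
    ["essay A", "essay B", "old menu"],  # older crawl
]
print(dedup_per_snapshot(snapshots))
print(dedup_global(snapshots))
```

In the toy data, global deduplication leaves the older snapshot with only the low-value "old menu" page, while per-snapshot deduplication keeps the essays in both crawls. That mirrors the pattern the autopsy found, although the real effect was measured with trained models, not string matching.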

The final filters were chosen by ablation, not by copying C4 wholesale

After per-crawl deduplication, Hugging Face compared FineWeb against another strong baseline: C4. Alejandro describes C4 as impressive because it achieves strong results with a small number of strict rules. The C4-style filters discussed include removing documents with bad length profiles, documents containing lorem ipsum, JavaScript or cookie-notice artifacts, and documents containing curly braces.

But FineWeb did not simply adopt the full C4 filtering approach. One C4 rule, the terminal-punctuation filter, was too destructive at FineWeb scale. It rejected documents whose lines or sentences did not end with punctuation, and in FineWeb’s case removed about 30% of tokens by itself. Hugging Face kept the other C4-style filters and replaced that strict punctuation rule with softer, data-driven thresholds.
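The C4-style rules, as this article describes them, are simple string checks. The sketch below follows that description at document level; the real C4 implementation may apply some checks per line, and the set of terminal punctuation marks and both function names are assumptions.

```python
# Assumed set of terminal punctuation marks.
TERMINAL_MARKS = (".", "!", "?", '"')


def c4_style_reject_reason(text):
    """Return the first rule that rejects the document, or None if it passes."""
    lowered = text.lower()
    if "lorem ipsum" in lowered:
        return "lorem_ipsum"
    if "{" in text or "}" in text:
        return "curly_brace"
    if "javascript" in lowered:
        return "javascript"
    return None


def strict_terminal_punctuation_ok(text):
    """Strict rule: every non-empty line must end with a terminal punctuation mark."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return bool(lines) and all(ln.endswith(TERMINAL_MARKS) for ln in lines)


print(c4_style_reject_reason("Please enable JavaScript to view this page."))
print(strict_terminal_punctuation_ok("A full sentence.\nRead more"))
```

The strict rule fails a document over a single unpunctuated line such as a heading or a "Read more" link, which gives some intuition for why it removed so many tokens at FineWeb scale.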

The custom filters came from comparing document-level metric distributions between better and worse datasets. Three rules were selected:

remove documents where the share of lines ending with punctuation was too low, using a threshold of less than or equal to 0.12;
remove documents where duplicated lines accounted for too many characters, using a threshold of greater than or equal to 0.1;
remove documents dominated by very short lines, using a threshold of greater than or equal to 0.67 for lines under 30 characters.

Together these custom filters removed about 22% of tokens, less than the 30% removed by the single strict C4 terminal-punctuation filter, and improved benchmark performance. Alejandro describes the gains as small but consistent. The custom filters were not a replacement for C4-style filtering; they were a final cleanup pass selected through ablation.
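The three custom rules can be sketched as line-level statistics compared against the thresholds quoted above. The thresholds come from the text; the definition of a "line", the punctuation set, and the way duplicated characters are counted (every repeat after the first occurrence) are assumptions for illustration.

```python
# Assumed set of terminal punctuation marks.
TERMINAL_MARKS = (".", "!", "?", '"', "'")


def line_stats(text):
    """Return (punctuated-line share, duplicated-line character share, short-line share)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return None
    punct_share = sum(ln.endswith(TERMINAL_MARKS) for ln in lines) / len(lines)
    short_share = sum(len(ln) < 30 for ln in lines) / len(lines)
    seen = set()
    dup_chars = 0
    for ln in lines:
        if ln in seen:
            dup_chars += len(ln)  # count characters of repeated lines only
        seen.add(ln)
    total_chars = sum(len(ln) for ln in lines)
    return punct_share, dup_chars / total_chars, short_share


def passes_custom_filters(text):
    """Apply the three thresholds quoted in the text; empty documents fail."""
    stats = line_stats(text)
    if stats is None:
        return False
    punct_share, dup_char_share, short_share = stats
    if punct_share <= 0.12:      # too few lines end with punctuation
        return False
    if dup_char_share >= 0.1:    # too many characters sit in duplicated lines
        return False
    if short_share >= 0.67:      # dominated by lines under 30 characters
        return False
    return True


article = (
    "FineWeb is a large English web dataset built from Common Crawl.\n"
    "Its filters were chosen by training and evaluating small models."
)
menu = "Home\nAbout\nContact\nLogin"
print(passes_custom_filters(article), passes_custom_filters(menu))
```

Unlike the strict terminal-punctuation rule, these are soft thresholds: a document can contain some unpunctuated or short lines and still pass.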

The final FineWeb recipe, as presented by Hugging Face, is: start with 96 Common Crawl WARC snapshots; extract text with trafilatura; apply URL, language, and quality filters; perform MinHash deduplication within each crawl; add C4-style and custom filters; release a 15 trillion-token dataset. In Hugging Face’s ablations, the final FineWeb dataset outperformed the open baselines shown, including RefinedWeb, C4, Dolma, SlimPajama, RedPajama2, and The Pile.

Step	FineWeb recipe
01 — Raw	96 Common Crawl WARC snapshots
02 — Extract	trafilatura HTML-to-text extraction
03 — Base clean	URL, language, and quality filters
04 — Dedup	MinHash within each crawl
05 — Filter	C4-style and custom filters
06 — Ship	FineWeb, 15T GPT-2 tokens

The final FineWeb pipeline combined extraction, base cleaning, per-crawl deduplication, and ablation-selected filters.

FineWeb-Edu used a large model once, then a smaller classifier at scale

FineWeb-Edu starts from a different question: whether broad web data can be filtered for educational value at trillion-token scale. Alejandro links the motivation to reports that Llama 3 and Phi-3 used educational-quality filtering, while their classifiers and filtered datasets were not public. Hugging Face’s open version used a strong LLM to label examples, then trained a smaller classifier to score the full FineWeb corpus.

The first step used Llama-3-70B-Instruct as an annotator. It scored 500,000 FineWeb samples on a 0-to-5 educational-value scale. Alejandro describes 0 as no educational value and 5 as very high educational value, “something like PhD level.” The prompt did not merely ask for a bare rating. It used a detailed additive rubric in which each additional point required a concrete reason. According to Hugging Face’s materials, this worked better than a simple Likert-style rating and avoided over-favoring very technical pages by focusing on grade-school and middle-school educational value.

The second step converted those expensive labels into a scalable filter. Hugging Face trained a smaller classifier on the 500,000 Llama-3-70B-scored examples. The model was Snowflake Arctic Embed, trained for 20 epochs with a frozen encoder, reaching about 82% F1 at threshold 3. Alejandro’s point is practical: the 70-billion-parameter model could label a sample, but it was not the tool to run over all 15 trillion tokens. The smaller classifier made full-corpus scoring feasible.
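The "F1 at threshold 3" figure treats the task as binary: a document counts as educational if its score is at least 3. A minimal sketch of that evaluation follows; the function name and the toy scores are assumptions, and the sample values are invented inputs, not data from Hugging Face.

```python
def f1_at_threshold(annotator_scores, classifier_scores, threshold=3):
    """Binarize both score lists at the threshold and compute F1 for the positive class."""
    pairs = list(zip(annotator_scores, classifier_scores))
    true_pos = sum(t >= threshold and p >= threshold for t, p in pairs)
    false_pos = sum(t < threshold and p >= threshold for t, p in pairs)
    false_neg = sum(t >= threshold and p < threshold for t, p in pairs)
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Toy values for illustration only: LLM labels (0 to 5) and classifier outputs.
llm_labels = [0, 1, 3, 4, 5, 2, 3]
classifier_outputs = [0.4, 2.6, 3.2, 3.9, 2.8, 3.1, 3.0]
print(f1_at_threshold(llm_labels, classifier_outputs))
```

Measuring agreement at the same threshold later used for filtering keeps the evaluation aligned with how the classifier is actually applied.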

500k: FineWeb samples scored by Llama-3-70B-Instruct for educational value

Scoring the whole corpus still required significant compute: roughly 6,000 H100 GPU hours. Hugging Face then selected score thresholds. A threshold of at least 3 produced 1.3 trillion tokens. A more inclusive threshold of at least 2 produced 5.4 trillion tokens. Alejandro focuses on the threshold-3 version, which kept documents judged educational without restricting them to university-level material; secondary and high-school-level content was included.
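Choosing a threshold is a trade-off between strictness and retained volume. The sketch below shows the mechanics on toy data; the document tuple layout and the function name are assumptions, and the scores and token counts are invented.

```python
def tokens_kept(scored_docs, threshold):
    """Sum token counts of documents whose educational score meets the threshold."""
    return sum(tokens for score, tokens in scored_docs if score >= threshold)


# Toy (score, token_count) pairs for illustration only.
scored_docs = [(0.5, 100), (2.2, 200), (3.4, 300), (4.1, 400)]
total = sum(tokens for _, tokens in scored_docs)

for threshold in (2, 3):
    kept = tokens_kept(scored_docs, threshold)
    print(f"threshold >= {threshold}: {kept} tokens kept ({kept / total:.0%} of total)")
```

Applied to the figures in the text, 1.3 trillion of 15 trillion tokens is just under 9% of FineWeb at threshold 3, and 5.4 trillion is 36% at threshold 2.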

In the benchmarks Hugging Face showed, the 1.3 trillion-token FineWeb-Edu dataset outperformed broader web datasets on knowledge-heavy evaluations such as MMLU. At 350 billion training tokens, FineWeb-Edu produced the strongest MMLU result among the compared open web datasets. Across the training curve shown, FineWeb-Edu stayed ahead of FineWeb, C4, Dolma, SlimPajama, RedPajama2, and The Pile in aggregate score.

That does not make educational filtering a general substitute for all pretraining data decisions. The claim in the source is narrower and more useful: for the evaluated knowledge-heavy benchmarks and training runs, filtering the same open FineWeb base for educational value changed the data mix enough to improve performance.

The lesson Alejandro draws is that a large model can bootstrap a scalable data-quality signal. A strong LLM labels a manageable sample; a smaller model learns the signal; the smaller model scores the full dataset.

Common Crawl snapshots are not interchangeable

Alejandro AO adds a finding that complicates the idea of Common Crawl as a uniform source. Hugging Face evaluated FineWeb-style data one dump at a time by training small models on individual snapshots. The result shown was a long-term trend: newer crawls generally trained better models.

Hugging Face’s interpretation is that recent dumps are not just more data, but often better data. Alejandro connects this back to the deduplication lesson. If snapshots vary in quality over time, treating them as interchangeable during analysis or deduplication can hide useful differences. Keeping snapshots separate helped Hugging Face detect both the failed global deduplication effect and the time trend in crawl quality.

The source then raises a possible related signal: synthetic text. Around the 2022–2024 period, Hugging Face examined whether the growing presence of LLM-generated web content might be visible in newer crawls. Alejandro is careful about uncertainty. He says it is difficult to classify a text reliably as LLM-generated, so Hugging Face used a proxy: counting words and phrases unusually associated with synthetic text, such as “delve into” and “certainly.”

The proxy stayed relatively constant before the ChatGPT period and rose sharply in late 2023 and 2024. Alejandro says this is not proof that the text is LLM-generated, but a warning signal and “a pretty good hint” that some of the newer material may have been generated by ChatGPT or similar systems.
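A phrase-frequency proxy of this kind is simple to compute. The sketch below counts marker phrases per million words for a set of texts; the two phrases are the examples named in the text, while the normalization, the function name, and the sample sentences are assumptions for illustration.

```python
import re

MARKER_PHRASES = ("delve into", "certainly")  # the two examples named in the text


def marker_rate(texts, phrases=MARKER_PHRASES, per=1_000_000):
    """Occurrences of marker phrases per `per` words across a snapshot's texts."""
    total_words = 0
    hits = 0
    for text in texts:
        lowered = text.lower()
        total_words += len(lowered.split())
        for phrase in phrases:
            hits += len(re.findall(r"\b" + re.escape(phrase) + r"\b", lowered))
    return hits * per / total_words if total_words else 0.0


# Toy snapshots for illustration only.
older_snapshot = ["The report covers rainfall totals for the region."]
newer_snapshot = ["Certainly! Let us delve into the rainfall totals."]
print(marker_rate(older_snapshot), marker_rate(newer_snapshot))
```

As the talk cautions, a rising rate is a hint and not a classifier: humans also write these phrases, so the measure only makes sense as a trend across snapshots.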

The striking empirical observation is that snapshots with more of this synthetic-text proxy currently produce better models, at least while synthetic material remains a small fraction of the crawl. The source does not establish that synthetic text caused the newer dumps’ stronger performance. It leaves the important question open: what happens when a much larger share of future web snapshots is generated by large language models?

The open questions are about measurement, not only scale

The final implications are practical. FineWeb’s gains came from a series of measurement-driven choices: WARC plus trafilatura beat WET extraction; per-crawl deduplication beat global deduplication; model ablations selected filters better than intuition alone; educational labels from a large LLM could be distilled into a smaller scalable classifier.

Alejandro’s forward-looking questions follow from that pattern. First, synthetic text is now part of the web. Future crawls will contain more model-written pages, paraphrases, and SEO spam. The unresolved problem is how to filter harmful synthetic material without discarding useful human text or useful synthetic text.

Second, smaller models may change the economics of data pipelines. FineWeb-Edu used Llama-3-70B for 500,000 labels and then trained a classifier to scale the signal. Alejandro notes that newer small models, some of them below 20 billion parameters, can be much stronger than earlier models of that size. That raises the possibility of richer and cheaper classification pipelines, and perhaps more sophisticated synthetic-data generation or filtering workflows.

The closing claim is not that bigger crawls alone will improve LLM datasets. FineWeb’s recipe suggests that the next gains may come from better extraction, better ablation design, better quality measurement, and better use of models inside the data pipeline itself.
