Models Are Trained on Curated Corpora, Not the Internet

Tatsunori Hashimoto · Stanford Online · Wednesday, May 20, 2026 · 22 min read

Stanford CS336’s data lecture, taught by Tatsunori Hashimoto, argues that training data is both the most consequential and least transparent part of modern language models. Hashimoto says models are not trained on “the internet” in any simple sense, but on static corpora shaped by crawlers, access limits, licensing, copyright risk, filtering, deduplication and conversion choices. The lecture’s central claim is that data construction is a legal and operational pipeline, not a passive input, and that those choices materially distinguish otherwise similar models.

Data is the least transparent part of the model

Tatsunori Hashimoto frames data as the part of language-model training that matters most and is disclosed least. Open-weight models can reveal architectures and even training procedures, while saying almost nothing concrete about what they were trained on. The Llama 3 paper, in his example, gives full architectural transparency and training details, but its pre-training data section says only that the dataset comes from “a variety of data sources,” with de-duplication, cleaning, and removal of domains containing large amounts of personally identifiable information or known adult content.

Data is the most important thing to get right in training language models.

Tatsunori Hashimoto

There are two reasons for this opacity. One is competitive: the composition and processing of the data are “competitive secret sauce.” The other is legal: naming the data can create copyright liability, especially if the source includes copyrighted material whose use is contested.

Before foundation models, data work often meant manually annotating labeled examples for supervised learning. In foundation-model pre-training, there is less annotation, but still extensive curation, cleaning, filtering, deduplication, and transformation. Data remains a “long-tail problem” that scales with human effort in a way architecture and systems work do not. A large model developer can parallelize data work across many people because the model is supposed to cover an enormous range of domains, tasks, formats, and failure modes.

The training pipeline is easiest to understand as a movement from more data of lower quality to less data of higher quality. Hashimoto uses three broad stages. Pre-training uses raw text, especially documents from the web, to build broad language and world knowledge from very large corpora. Mid-training uses higher-quality web data, instruction data, synthetic data, math data, and long-context data to enhance specific capabilities. Post-training uses chat transcripts, preference data, reinforcement-learning environments, and safety data to shape the model for assistant behavior and task performance.

In this terminology, a “base model” usually means a model after pre-training and mid-training, while an instruct or chat model is the result of post-training. But the terms are becoming less reliable. Some large models are released only as instruct models, without visible intermediate checkpoints. Qwen 3.5 397B-A17B is cited as an example where the base-model checkpoint is not released.

Open projects such as OLMo make the pipeline more visible. The OLMo 2 1124 pre-training mix includes 3.90 trillion tokens from DCLM-Baseline web pages, StarCoder code, academic papers, arXiv STEM papers, OpenWebMath, Algebraic Stack, and Wikipedia/Wikibooks. Its mid-training includes high-quality web subsets, FLAN instruction data, Stack Exchange, and synthetic math mixes. Its post-training mix includes prompt datasets for general assistance, knowledge, math, reasoning, coding, safety, multilingual behavior, and precise instruction following.

Source	Type	Tokens
DCLM-Baseline	Web pages	3.71T
StarCoder	Code	83.0B
peS2o	Academic papers	58.6B
arXiv	STEM papers	20.8B
OpenWebMath	Math web pages	12.2B
Algebraic Stack	Math proofs code	11.8B
Wikipedia & Wikibooks	Encyclopedic	3.7B
Total	—	3.90T

Hashimoto’s OLMo 2 1124 pre-training mix example

The point is not simply that models use many datasets. It is that each dataset has a provenance, a legal status, a conversion process, a filtering logic, and a set of artifacts that affect the resulting model.

Models are not trained on “the internet”

The claim that language models are trained on “the entire internet” is not even the right type of claim. The web is made of live servers. A pre-training corpus is not a model wandering around the internet interacting with live applications. It is a static collection of downloaded content, usually produced by a crawler or by bulk dumps provided by particular services.

A crawler begins with a seed set of URLs, downloads pages, extracts hyperlinks, adds them to a queue, and repeats this graph traversal at scale. But that process cannot access all web content. Much of the modern web is dynamic: the URL is not a full specification of the content, and the useful material may require clicks, forms, session state, or application interaction. Discord and Weights & Biases are examples of sites where content is not simply recovered by curling a URL.
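The traversal described above can be sketched as a breadth-first search over a link graph. This is a minimal illustration rather than a production crawler: the `fetch_links` callable and the toy in-memory `TOY_WEB` graph are hypothetical stand-ins for real HTTP downloads and HTML link extraction.

```python
from collections import deque
from typing import Callable, Iterable


def crawl(seeds: Iterable[str],
          fetch_links: Callable[[str], list[str]],
          max_pages: int = 1000) -> list[str]:
    """Breadth-first traversal of a link graph starting from seed URLs."""
    unique_seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    queue = deque(unique_seeds)
    seen = set(unique_seeds)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)               # stands in for downloading the page
        for link in fetch_links(url):     # stands in for hyperlink extraction
            if link not in seen:          # enqueue each URL at most once
                seen.add(link)
                queue.append(link)
    return visited


# Toy link graph standing in for the live web (hypothetical URLs).
TOY_WEB = {
    "https://a.example/": ["https://b.example/", "https://c.example/"],
    "https://b.example/": ["https://a.example/", "https://d.example/"],
    "https://c.example/": [],
    "https://d.example/": ["https://c.example/"],
}

order = crawl(["https://a.example/"], lambda url: TOY_WEB.get(url, []))
```

Anything the graph does not link to, or that only appears after a click, a login, or a form submission, never enters the queue, which is the limitation the lecture emphasizes.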

Authentication is another barrier. Facebook, X, LinkedIn, and The New York Times contain large volumes of content behind accounts, paywalls, or other walled gardens. A company that owns one of those platforms can train on its own data without crawling it; everyone else cannot simply access it through a generic crawler.

Even for accessible pages, crawlers face technical and normative restrictions. Websites can publish robots.txt files that specify what different user agents are allowed to crawl. Hashimoto stresses that robots.txt is not itself a legal restriction; it is a “good citizen” convention. But sites can also use bot detection, CAPTCHAs, IP blocks, country blocks, and rate limits. There are also terms of service, which may prohibit bots, scraping, AI training, or commercial use, and separate copyright questions even when technical access is possible.

The New York Times robots.txt example explicitly says New York Times content is available for personal, non-commercial use subject to its terms of service and prohibits automated data mining or scraping without prior written permission. Its visible text names prohibited uses including text and data mining under the EU copyright directive, the development of software, machine learning, artificial intelligence, and large language models, creation of archived or cached datasets, and commercial purposes. It also disallows user agents including ChatGPT-User and ClaudeBot.
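A crawler that wants to follow the “good citizen” convention can check these rules before fetching a page. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content, the `ExampleAIBot` and `OtherBot` user agents, and the `news.example` URLs are invented for illustration and are not copied from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(user_agent: str, url: str) -> bool:
    """Return True if the parsed rules permit this agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


ai_bot_ok = allowed("ExampleAIBot", "https://news.example/articles/1")
other_bot_ok = allowed("OtherBot", "https://news.example/articles/1")
other_bot_private = allowed("OtherBot", "https://news.example/private/x")
```

The check is purely voluntary: nothing in the protocol stops a crawler from skipping it, which is why sites layer bot detection and rate limits on top.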

Restrictions have intensified. Hashimoto cites Shayne Longpre and coauthors’ “Consent in Crisis,” which examined robots.txt and terms-of-service restrictions for URLs in common datasets including C4, RefinedWeb, and Dolma. In the chart Hashimoto uses, full robots.txt restrictions were fairly stable until around mid-2023, when they rose sharply, approaching about half of the measured sites. Terms-of-service restrictions also increased: whereas many pages had no terms in 2016, by the time of the cited analysis many had terms, often terms forbidding AI use.

The practical consequence is that the accessible, legally lower-risk web is smaller than it was when many early web-scale datasets were created. Even if it was possible to crawl much of the internet around 2020, what can legally be crawled now is, in Hashimoto’s phrase, “actually much smaller.”

Crawling also imposes costs on the sites being crawled. Complaints from Kyle Wiens and Eric Holscher about aggressive AI crawlers separate crawling ethics from copyright. Wiens said Anthropic hit his servers a million times in 24 hours; Holscher said Read the Docs was also being “hammered.” Even before asking whether training on the content is lawful, a crawler can violate terms, ignore robots.txt, degrade service, and impose hosting costs.

Shadow libraries add a more explicit legal boundary problem. Sites such as Library Genesis, Z-Library, Anna’s Archive, and Sci-Hub are technically reachable on the web, but they disregard copyright and bypass paywalls. Their defenders may argue that they make freely available what should be free, but from a legal perspective Hashimoto characterizes them as piracy and copyright infringement. They have received takedown orders, lawsuits, and blocks in various countries, often circumvented by moving servers.

The summary is deliberately deflationary: the internet is huge, but raw access to it is constrained by application design, authentication, anti-bot systems, rate limits, robots.txt, terms of service, licensing, and copyright.

Copyright is not reducible to memorization

Tatsunori Hashimoto treats copyright as one of the main reasons data discussions cannot be reduced to engineering. The legal conclusions are narrow at the outset: whether training on copyrighted data is lawful remains unsettled and evolving, and recent rulings are specific to particular cases rather than a blanket rule for all model training.

The relevant legal frame is intellectual property law, whose goal is to incentivize creation of intellectual goods. Copyright is the most relevant branch for language-model data. In the U.S. context, the Copyright Act of 1976 protects “original works of authorship fixed in any tangible medium of expression.” The threshold is low. Unlike patents, copyright does not require registration: a website is copyrighted as soon as it is fixed. Registration is required only before suing for infringement, and the stated registration cost is $65.

This leads to a blunt premise: basically everything on the internet is copyrighted. That does not mean everything is unusable. There are two routes: get a license, or appeal to fair use.

Licenses are contracts in which a licensor grants a licensee permission to use a work in specified ways. Hashimoto’s shorthand is that a license is “a promise not to sue.” Creative Commons licenses are an important bridge between public domain and conventional copyright: a creator can say, in legal form, that others may freely distribute or use the work rather than waiting for copyright to expire. The lecture lists Wikipedia, OpenCourseWare, Khan Academy, Free Music Archive, Flickr images, MusicBrainz images, and YouTube videos as examples of Creative Commons-licensed material. Model developers also license data commercially; examples include Google and Reddit, OpenAI and Shutterstock, and OpenAI and Stack Exchange.

Fair use is less mechanical. The four Section 107 factors are the purpose and character of use, the nature of the copyrighted work, the amount and substantiality used, and the effect on the market for the original. Educational use is favored over commercial use; transformative use is favored over reproduction; factual and non-creative work is favored over fictional and creative work; snippets are favored over whole works; and uses that damage the market for the original are disfavored.

Examples of fair use include watching a movie and writing a summary, reimplementing an algorithm idea rather than copying its code, and Google Books indexing books and showing snippets after the Authors Guild v. Google litigation, which Hashimoto says ended in Google’s favor after 11 years and set some precedent for thinking about language-model training.

Copyright is not only about verbatim memorization. Verbatim reproduction is one way to infringe, but plots and characters can be copyrightable as well. Harry Potter, as a character or expressive universe, can be protected even apart from exact text. Parody may be fair use because it imitates in order to make fun of the original. The concise lesson is that copyright is about semantics and economics, not n-gram overlap.

For language models, that creates a mixed picture. Copying data as the first step of training may itself be an issue, even before training begins. Training has a “transformative flavor,” because it is not the same as rehosting a work and is aimed at learning general ideas, world knowledge, and patterns rather than preserving a concrete expression. But language models can also affect markets for writers, artists, and other creators, which matters under the fourth fair-use factor.

Terms of service are a separate layer. Even where a work is licensed or a fair-use argument may apply, the platform’s terms may restrict automated downloading. YouTube is the example: some videos may be Creative Commons licensed, but YouTube’s terms of service prohibit downloading videos through scraping or bots.

The litigation landscape remains active. The New York Times sued OpenAI in 2023, alleging training on and reproduction of Times articles, with examples of ChatGPT producing near-verbatim news articles. Hashimoto says that case was still pending. In Authors v. Anthropic, the allegation was that Anthropic pirated millions of books and trained on plaintiff books. He describes a 2025 summary judgment as holding that training on books in that instance was fair use, while pirating copies was not. He also says Anthropic bought and scanned books, which the court treated as fair use, but that did not undo the earlier alleged piracy. The outcome presented was a $1.5 billion settlement, about $3,000 per book.

A similar suit against Meta alleged training on the plaintiffs' books, which Hashimoto says was revealed in the Llama paper. A 2025 summary judgment held that training on books in that instance was fair use, while an allegation about torrenting books remained pending. The summary stays narrow: so far, training has been deemed fair use in specific instances, but not in a way that settles all training on all copyrighted content; pirating books is clearly illegal; and the field is still evolving.

Raw sources become datasets only after lossy choices

Common Crawl is the default public substrate for many web datasets, but it is raw material rather than a dataset ready for modeling. Common Crawl is a nonprofit founded in 2007 that runs web crawls roughly monthly, adding 3 to 5 billion pages per crawl, with overlap but some attempt to diversify. The April 2026 crawl in Hashimoto’s notes has 2.19 billion pages and 372.2 terabytes, mostly text rather than images.

2.19B pages in the April 2026 Common Crawl dump cited by Hashimoto

Crawling itself involves policy choices: which pages to select, how to respect robots.txt and avoid overloading servers, when to revisit pages that may change, and how to handle dynamic URLs or many URLs leading to the same content. Common Crawl releases data in WARC format, the raw HTTP response such as HTML, and WET format, a lossy conversion to text. Conversion from HTML to text materially affects model quality. In a DataComp-LM ablation, resiliparse and trafilatura outperform Common Crawl’s WET files on downstream task accuracy.
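To make the lossiness of HTML-to-text conversion concrete, here is a deliberately naive extractor built on Python's standard `html.parser`. It is not how resiliparse, trafilatura, or Common Crawl's WET conversion work; the `SKIP` tag set and the sample `PAGE` are assumptions chosen only to show that the extractor decides what counts as content.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text while skipping tags assumed to be boilerplate."""

    SKIP = {"script", "style", "nav", "footer"}  # illustrative choice

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html: str) -> str:
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return "\n".join(extractor.chunks)


PAGE = """<html><head><style>p {color: red}</style></head>
<body><nav>Home | About</nav><p>Data matters.</p>
<script>track();</script><footer>Terms of use</footer></body></html>"""

text = html_to_text(PAGE)
```

Changing the `SKIP` set changes the training text, which is one way to see why the choice of extractor shows up in downstream accuracy.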

Specialized sources have different access patterns and different artifacts. Wikipedia is a free online encyclopedia founded in 2001, with 67 million articles across 361 language editions as of May 2026. It excludes original thought, opinions, promotions, and personal pages, and includes articles based on notability and significant coverage from reliable sources. Its content is produced by internet volunteers, with vandalism reverted by administrators and bots. A small number of Wikipedians account for much of the work; Steven Pruitt is cited as an example with 5 million edits. Crucially, Wikipedia produces periodic dumps, so model builders do not need to crawl it.

Those dumps create their own vulnerability. Hashimoto cites Carlini and coauthors’ work on poisoning web-scale training datasets. If an attacker knows when a Wikipedia dump will happen, they can inject malicious edits just before the dump and have them captured in the dataset before the edits are rolled back. Prior work shows that injected examples can cause models to associate negative sentiment with trigger phrases such as “iPhone.” Hashimoto says he thinks this has since been fixed, but the takeaway is broader: even high-quality sources can contain bad content once adversaries are considered.

GitHub is useful not only for programming tasks but also, according to what Hashimoto calls “folklore,” for reasoning. As of May 2026, GitHub has more than 420 million repositories, of which 28 million are public. A repository is not just files; it includes directory structure, commit history, issues, pull requests, comments, and other metadata. Code has heavy duplication from copying and forks. Hashimoto says GitHub has deemed training on public repositories with permissive licenses such as MIT or Apache allowable. Repository data should be downloaded through the git protocol rather than by scraping the website; metadata is available through the GitHub API and GitHub Archive hourly event snapshots. Software Heritage, founded in 2016, preserves software repositories from GitHub, GitLab, Bitbucket, PyPI, and other sources, though it focuses on repositories rather than issue and comment metadata.

arXiv is another structured source. Founded in 1991 for free sharing of research papers, it now covers physics, math, computer science, statistics, and other areas. It has roughly 3 million submissions. Each submission includes metadata, a PDF, and optionally LaTeX source. The distinction matters because “training on arXiv” could mean converting PDFs to text or using LaTeX source files. arXiv papers are lightly approved, not peer reviewed. Authors choose whether to reserve rights or use Creative Commons licenses, while metadata such as title and abstract is under a permissive CC0 license. Like Wikipedia, arXiv can be bulk downloaded rather than crawled.

Pre-training datasets are built around proxies for quality

The dataset history is less a procession of names than a sequence of answers to one question: what can stand in for “quality” at web scale? The recurring design choices are visible across otherwise different corpora. Some builders use social signals, such as Reddit links. Some use reference corpora, such as Wikipedia or books, and select pages that resemble them. Some rely on hand-written rules for language, punctuation, repetition, toxicity, and boilerplate. Some train classifiers. More recent systems add licensing as a first-order constraint.

BERT’s training data consisted of Wikipedia and books. The books came from BooksCorpus, which was built from self-published books priced at $0 on Smashwords. It contained 7,000 books and 985 million words. BooksCorpus was taken down because it violated Smashwords’ terms of service, illustrating that “free” access and lawful dataset construction are not the same thing. BERT’s sequences were documents rather than isolated sentences, contrasting with earlier language-modeling benchmarks built around sentence data.

GPT-2’s WebText used a social proxy for quality: pages linked from Reddit submissions with at least three karma. The idea was that good Reddit posts link to good pages. WebText contained 8 million pages and 40GB of text, but was not released. OpenWebTextCorpus attempted to replicate it by extracting URLs from Reddit submissions, filtering non-English pages with Facebook’s FastText classifier, and removing near duplicates.
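The social proxy amounts to a threshold filter over link submissions. A minimal sketch, assuming a hypothetical `Submission` record with a URL and a karma score (the real WebText pipeline was not released, so the field names and sample data are invented):

```python
from dataclasses import dataclass


@dataclass
class Submission:
    url: str
    karma: int


def select_urls(submissions: list[Submission], min_karma: int = 3) -> list[str]:
    """Keep each outbound URL once if any submission of it meets the karma bar."""
    seen = set()
    kept = []
    for sub in submissions:
        if sub.karma >= min_karma and sub.url not in seen:
            seen.add(sub.url)
            kept.append(sub.url)
    return kept


sample = [
    Submission("https://a.example/post", 5),
    Submission("https://b.example/spam", 2),
    Submission("https://a.example/post", 10),  # same link posted again
    Submission("https://c.example/essay", 3),
]
urls = select_urls(sample)
```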

CCNet used a reference-corpus proxy. It aimed to automatically construct large high-quality pre-training datasets, including for low-resource languages such as Urdu. It deduplicated paragraphs, used FastText language identification to keep a target language, and applied quality filtering by keeping documents that looked like Wikipedia under a KenLM 5-gram model. It showed that BERT models trained on CCNet-filtered Common Crawl could outperform models trained only on Wikipedia.
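The reference-corpus idea is to score each document by how surprising it looks to a language model trained on trusted text, then keep the low-perplexity documents. The sketch below swaps in an add-one-smoothed unigram model for CCNet's KenLM 5-gram model, purely to keep the example self-contained; the reference and candidate strings are invented.

```python
import math
from collections import Counter


def train_unigram(reference: str) -> tuple[Counter, int]:
    """Count words in the reference corpus (a stand-in for Wikipedia)."""
    counts = Counter(reference.lower().split())
    return counts, sum(counts.values())


def perplexity(text: str, model: tuple[Counter, int]) -> float:
    """Per-word perplexity under an add-one-smoothed unigram model."""
    counts, total = model
    vocab_size = len(counts) + 1  # one extra slot for unseen words
    words = text.lower().split()
    log_prob = sum(
        math.log((counts[w] + 1) / (total + vocab_size)) for w in words
    )
    return math.exp(-log_prob / max(1, len(words)))


model = train_unigram(
    "the city is the capital of the region and the largest city in the country"
)
in_domain = perplexity("the capital of the country", model)
out_of_domain = perplexity("buy cheap pills now click", model)
```

A filter would then keep documents whose perplexity falls below a chosen threshold. Where that threshold sits is itself one of the heuristic choices the lecture highlights.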

C4, the Colossal Clean Crawled Corpus used for T5, relied on manual heuristics rather than a classifier. Starting from an April 2019 Common Crawl snapshot with 1.4 trillion tokens, it kept lines ending in punctuation with at least five words; removed pages with fewer than three sentences; removed pages containing “bad words,” code-like curly braces, “lorem ipsum,” “terms of use,” and similar boilerplate; and used langdetect to keep English with probability 0.99. The resulting dataset was 806GB of text, or 156 billion tokens. The curly-brace rule filtered out much code, indicating that the builders were not optimizing for code models at the time.
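The rules listed above are simple enough to express in a few lines. This sketch follows the description in this article rather than the released C4 code: the `BLOCKLIST` is a small illustrative subset, the “bad words” list and language detection are omitted, and the sentence splitter is a crude regular expression.

```python
import re

BLOCKLIST = {"lorem ipsum", "terms of use"}   # illustrative subset
TERMINAL_PUNCT = (".", "!", "?", '"')


def clean_page(text: str, min_words: int = 5, min_sentences: int = 3):
    """Return the cleaned page text, or None if the page is dropped."""
    lowered = text.lower()
    # Page-level removals: code-like braces and boilerplate phrases.
    if "{" in text or any(phrase in lowered for phrase in BLOCKLIST):
        return None
    # Line-level rule: keep lines that end in punctuation and are long enough.
    kept = [
        line.strip()
        for line in text.splitlines()
        if line.strip().endswith(TERMINAL_PUNCT)
        and len(line.split()) >= min_words
    ]
    cleaned = "\n".join(kept)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", cleaned) if s]
    if len(sentences) < min_sentences:
        return None
    return cleaned


good = (
    "The crawl started in April of that year.\n"
    "Menu\n"
    "It produced a very large amount of text.\n"
    "Most of it was removed by simple rules."
)
cleaned_good = clean_page(good)
dropped_code = clean_page("function f() { return 1; } is a nice example of code here.")
dropped_short = clean_page("Only one sentence lives on this page.")
```

The brace check drops any page containing a single `{`, which shows how a rule written for prose quietly excludes most source code.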

GPT-3 combined processed Common Crawl, WebText2, Books1 and Books2, and Wikipedia, producing 570GB or 400 billion tokens. The books corpora are described in the paper as internet-based but remain mysterious in detail. For Common Crawl processing, GPT-3 used a quality classifier trained to distinguish WebText, Wikipedia, Books1, and Books2 from the rest, plus fuzzy deduplication, including against benchmark data.

The Pile, created by EleutherAI after GPT-3, pushed in a different direction: open curation across many domains. It included Common Crawl, PubMed Central, Books3, OpenWebText2, arXiv, GitHub, FreeLaw, Stack Exchange, patents, PubMed abstracts, Project Gutenberg, OpenSubtitles, Wikipedia, mathematics, Ubuntu IRC, BookCorpus2, EuroParl, Hacker News, YouTube subtitles, philosophy papers, NIH ExPorter, and Enron emails. Its inclusion of Enron emails is a reminder that available datasets can be strange distributions: Enron’s 500,000 emails from 150 senior-management users became public through the company’s investigation and collapse.

Books datasets illustrate the legal gradient. Project Gutenberg, started in 1971, includes books that received copyright clearance, mostly in the public domain; PG-19 packages Gutenberg books published before 1919. Books3, by contrast, consisted of 196,000 books from the shadow library Bibliotik and included authors such as Stephen King, Min Jin Lee, and Zadie Smith. It was used in The Pile and later taken down due to copyright infringement and lawsuits. Hashimoto says Books3 “got them in a lot of trouble” when later model papers disclosed training on it, and points to that as one reason model developers no longer talk about data in detail.

Stack Exchange is valuable for another reason: some pre-training data already resembles instruction data. It consists of user-contributed questions and answers, beginning with Stack Overflow in 2008 and expanding to many topics. It uses votes, reputation, badges, tags, and comments, all useful for filtering. Hashimoto notes that Q&A format is close to real user interaction, so some apparent question-answering behavior in models may not be “super magically emergent”; similar structures exist in the web data.

Filtering approach	Examples discussed	What counts as “quality”
Social proxy	GPT-2 WebText, OpenWebText	Pages linked by Reddit posts with enough karma
Reference-corpus similarity	CCNet, GPT-3 quality classifier	Documents that resemble Wikipedia, WebText, books, or other selected high-quality sets
Manual rules	C4, MassiveWeb, RefinedWeb, FineWeb, Dolma	Documents passing hand-written language, structure, toxicity, boilerplate, and repetition rules
Model-based filtering	DCLM, Nemotron-CC	Documents scored by learned classifiers or language-model-generated labels
Licensing filter	The Stack, CommonPile	Documents with permissive licenses or public-domain status

The main filtering logics recur across otherwise different pre-training datasets

The modern pipeline is a quality–quantity trade-off

The central post-2021 design tension is whether to preserve broad coverage with interpretable but crude rules, or to use model-based filters that identify higher-quality documents more aggressively. Neither choice is clean. Rules can be controlled and inspected, but miss many cases. Classifiers can work surprisingly well, but encode the dataset builder’s chosen definition of quality.

DeepMind’s MassiveText, used for Gopher, combined MassiveWeb, C4, books, news, GitHub, and Wikipedia, though the paper gives little detail on some of those sources. MassiveWeb kept English, deduplicated, handled train-test overlap, and used manual quality rules such as requiring that 80% of words contain at least one alphabetic character. It used Google SafeSearch for toxicity rather than word lists. The result was 10.5TB of text, though Gopher trained on only 300 billion tokens, about 12% of the dataset.
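The alphabetic-character rule is a good example of how small these manual heuristics are. A minimal sketch, assuming whitespace tokenization (the paper's exact tokenization is not described here):

```python
def alpha_word_fraction(text: str) -> float:
    """Fraction of whitespace-separated words with at least one letter."""
    words = text.split()
    if not words:
        return 0.0
    with_alpha = sum(any(ch.isalpha() for ch in word) for word in words)
    return with_alpha / len(words)


def passes_alpha_rule(text: str, threshold: float = 0.8) -> bool:
    return alpha_word_fraction(text) >= threshold


prose_ok = passes_alpha_rule("The model was trained on web text")
table_ok = passes_alpha_rule("Prices rose 3 % in 2021 .")
```

The second string has three words containing letters out of seven tokens, so it falls below the 80% bar. Pages dominated by numbers or symbols are dropped the same way.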

LLaMA 1 is presented as one of the last non-fully-open models to describe data processing in detail. Its dataset included Common Crawl processed with CCNet, C4, GitHub with permissive licenses and manual filtering, Wikipedia in 20 languages, Project Gutenberg and Books3 from The Pile, arXiv with LaTeX processing, and Stack Exchange with answers sorted by score. The result was 1.2 trillion tokens. Together’s RedPajama V1 reproduced it from the description; Cerebras’s SlimPajama produced a 627 billion-token deduplicated subset. Early copyright decisions propagate downstream: RedPajama initially included Books3 because it reproduced LLaMA’s mixture, then had to strip it out.

RefinedWeb made the argument that web data alone could be sufficient. It started from WARC rather than WET files, used trafilatura for HTML-to-text extraction, applied Gopher-style rules, and avoided ML-based filtering to reduce the risk of narrowing or biasing the web. It used MinHash over 5-grams for fuzzy deduplication and produced 5 trillion tokens, of which 600 billion were released. FineWeb began as a replication of RefinedWeb and improved it with 95 Common Crawl dumps, URL filtering, language ID, Gopher and C4 rules, fuzzy deduplication, email and public IP anonymization, and PII handling, resulting in 15 trillion tokens.
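MinHash estimates the Jaccard similarity between two documents' 5-gram sets without comparing the sets directly. The sketch below is a generic textbook version, not RefinedWeb's implementation: word-level 5-grams, 128 hash functions, and the use of salted `hashlib.blake2b` as the hash family are all assumptions made for illustration.

```python
import hashlib


def shingles(text: str, n: int = 5) -> set[str]:
    """Set of word n-grams; short texts yield a single (possibly partial) gram."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def _hash(seed: int, shingle: str) -> int:
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text: str, num_hashes: int = 128, n: int = 5) -> list[int]:
    """For each seeded hash function, keep the minimum hash over all shingles."""
    grams = shingles(text, n)
    return [min(_hash(seed, g) for g in grams) for seed in range(num_hashes)]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Share of signature positions that agree, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(a: str, b: str, n: int = 5) -> float:
    sa, sb = shingles(a, n), shingles(b, n)
    return len(sa & sb) / len(sa | sb)


doc_a = "the quick brown fox jumps over the lazy dog near the quiet river bank today"
doc_b = "the quick brown fox jumps over the lazy dog near the quiet river bank tonight"
doc_c = "permissive licenses let model builders train without relying on fair use"

near_dup = estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_b))
unrelated = estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_c))
```

In a real pipeline the signatures are usually bucketed with locality-sensitive hashing so that only likely duplicates are compared. That step is omitted here.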

Dolma, from AI2, mixed Common Crawl, The Stack, C4, Reddit from Pushshift, PeS2o academic papers from Semantic Scholar, Project Gutenberg, and Wikipedia/Wikibooks. Its Common Crawl processing used FastText language identification to keep English, Gopher and C4 rules for quality filtering, rules and a Jigsaw classifier for toxicity filtering, and Bloom filters for deduplication. It produced about 3 trillion tokens.
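A Bloom filter lets a pipeline ask “have I seen this text before?” in fixed memory, at the cost of occasional false positives. The sketch below is a generic illustration rather than Dolma's implementation; the bit-array size, the number of hash functions, and deduplicating on whole paragraph strings are assumptions.

```python
import hashlib


class BloomFilter:
    """Fixed-size set membership with possible false positives, no false negatives."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str):
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def dedupe(paragraphs: list[str]) -> list[str]:
    """Keep the first occurrence of each paragraph the filter has not seen."""
    bloom = BloomFilter()
    kept = []
    for paragraph in paragraphs:
        if paragraph not in bloom:
            bloom.add(paragraph)
            kept.append(paragraph)
    return kept


unique = dedupe(["alpha", "beta", "alpha", "gamma", "beta"])
```

A false positive here means a genuinely new paragraph is discarded, so the filter size trades memory against a small amount of lost data.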

DCLM (DataComp-LM) is where model-based quality filtering started to become the norm in the open community. DCLM’s original motivation was to define a standard dataset and pipeline for evaluating data processing algorithms, but the released dataset itself became widely used. It processed Common Crawl into DCLM-Pool, an enormous 240 trillion-token pool. After English filtering, URL filtering, heuristic cleaning, repetition and length filters, deduplication, and model-based filtering, DCLM-Baseline kept only about 1.4% of the original documents.

DCLM filtering stage	Share of original documents
DCLM-Pool	100%
English filter	50.8%
Heuristic cleaning / RefinedWeb reproduction	19.9%
Deduplication	13.7%
Bloom filter deduplication	12.3%
Model-based filtering / DCLM-Baseline	1.4%

The DCLM-Baseline funnel in Hashimoto’s DataComp-LM example
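The cumulative shares in the table hide how selective each individual stage is. Dividing each share by the one before it gives the per-stage retention; the figures below are copied from the table, and the short stage labels are just abbreviations of its rows.

```python
# Cumulative share of original documents remaining after each stage (percent).
FUNNEL = [
    ("DCLM-Pool", 100.0),
    ("English filter", 50.8),
    ("Heuristic cleaning", 19.9),
    ("Deduplication", 13.7),
    ("Bloom filter deduplication", 12.3),
    ("Model-based filtering", 1.4),
]


def stage_retention(funnel):
    """Fraction of documents entering each stage that survive it."""
    return {
        name: share / prev_share
        for (_, prev_share), (name, share) in zip(funnel, funnel[1:])
    }


retention = stage_retention(FUNNEL)
for stage, kept in retention.items():
    print(f"{stage}: keeps {kept:.1%} of its input")
```

Read this way, the model-based filter keeps only about 11% of the documents that reach it, making it by far the most aggressive single step in the funnel.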

The model-based filter was surprisingly effective. Positive examples came from OpenHermes-2.5, mostly GPT-4-generated instruction data, and ELI5, a subreddit of curiosity questions and answers. Negative examples came from RefinedWeb. A FastText classifier trained on these examples produced a 3.8 trillion-token result. In the comparison Hashimoto presents, the FastText OpenHermes-plus-ELI5 filter outperformed RefinedWeb reproduction, PageRank, SemDedup, BGE features, AskLLM, perplexity filtering, and top-k average logits on CORE and EXTENDED evaluations.

Nemotron-CC, from Nvidia, reacted to the aggressiveness of DCLM and FineWebEdu. If high-quality filters remove roughly 90% of data, the problem becomes how to get more tokens without giving up too much quality. Nemotron-CC used jusText for HTML-to-text because it returned more tokens, and used classifier ensembling: prompt Nemotron-340B-Instruct to score FineWeb documents for educational value, distill those labels into a faster model, and combine with the DCLM classifier. It also leaned into synthetic transformations. Low-quality data was rephrased by a language model to make it more useful; high-quality data was used to generate tasks such as question-answer pairs, summaries, or key-information extraction. The result was 6.3 trillion tokens, with a 1.1 trillion-token high-quality subset. For comparison, Llama 3 trained on 15 trillion tokens and Qwen 3 on 36 trillion, though Hashimoto warns that token counts in papers can include repeated epochs and are not necessarily unique tokens.

The pattern across these systems is a trade-off. Common Crawl can yield hundreds of trillions of raw tokens, but much is low quality. Aggressive filters can produce a few trillion tokens that work well, but may be too small or too narrow. The choices are mostly heuristic: rules, classifiers, thresholds, deduplication settings, and definitions of “quality.”

Data does not fall from the sky. You have to work to get it.

Tatsunori Hashimoto

Code data is not just files; it is the software-development process

The Stack is Hashimoto’s main example of a purpose-built code dataset. Its first version took repository names from GitHub Archive from 2015 to 2022, git-cloned 137 million repositories containing 51 billion files, found 5 billion unique files, kept permissively licensed repositories using go-license-detector, removed near duplicates with MinHash and Jaccard similarity, and produced 3.1TB of code.

Stack v2 broadened the object of training from code files to software work. It added issues, comments, and pull requests from GitHub Archive; repositories from Software Heritage; and documentation from crawling sites such as PyPI, npm, and devdocs.io. The processing removed binary files, malware, bot activity, duplicates, and PII, and subsampled pull requests to keep the dataset manageable.

One design Hashimoto highlights is pairing low-resource programming languages with LLVM, a shared low-level intermediate representation. A language such as Nim may have little public code. If examples can be compiled into LLVM, the model can learn a relationship between the low-resource language and an intermediate representation that has much more data across other languages.

Pull requests demonstrate why “code data” often must be linearized. A pull request is a structured object containing title, repository, base files, diffs, comments, review states, reply relationships, event IDs, file paths, and surrounding context. To train an autoregressive language model, this structure must be turned into a token sequence. The example uses XML-like tags such as <pr_title>, <pr_base>, <pr_diff>, <pr_comment>, <pr_review_state>, and <pr_diff_hunk>. The dataset builder must decide how much context to include around a diff: a changed line alone, a few neighboring lines, or the whole file. These decisions determine whether the model learns only code syntax or also the collaborative workflow around code.
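A toy version of this linearization makes the context decision visible. The tag names come from the lecture's example, but everything else here is assumed: the `PullRequest` fields, the single-line edit, the tag layout without closing tags, and the `context_lines` parameter are hypothetical simplifications, not the Stack v2 format.

```python
from dataclasses import dataclass, field


@dataclass
class PullRequest:
    title: str
    base_file: list[str]      # lines of the file before the change
    changed_line: int         # 0-based index of the edited line
    new_text: str             # replacement for that line
    comments: list[str] = field(default_factory=list)


def diff_hunk(pr: PullRequest, context_lines: int) -> str:
    """Render the edit with the requested number of neighboring lines."""
    lo = max(0, pr.changed_line - context_lines)
    hi = min(len(pr.base_file), pr.changed_line + context_lines + 1)
    out = []
    for i in range(lo, hi):
        if i == pr.changed_line:
            out.append("-" + pr.base_file[i])
            out.append("+" + pr.new_text)
        else:
            out.append(" " + pr.base_file[i])
    return "\n".join(out)


def linearize(pr: PullRequest, context_lines: int = 1) -> str:
    """Flatten the structured pull request into one training string."""
    parts = [
        f"<pr_title>{pr.title}",
        f"<pr_diff_hunk>{diff_hunk(pr, context_lines)}",
    ]
    parts += [f"<pr_comment>{comment}" for comment in pr.comments]
    return "\n".join(parts)


pr = PullRequest(
    title="Fix off-by-one",
    base_file=["def f(n):", "    return n + 2", "print(f(1))"],
    changed_line=1,
    new_text="    return n + 1",
    comments=["Looks good."],
)
narrow = linearize(pr, context_lines=0)   # changed line only
wide = linearize(pr, context_lines=1)     # plus one neighbor on each side
```

With `context_lines=0` the model sees the edit but not the function it belongs to. Widening the window costs tokens but exposes the surrounding code.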

Permissively licensed data is possible, but it changes the frontier

CommonPile asks what happens if a model builder refuses to rely on unsettled fair-use arguments. The premise is: almost everything on the internet is copyrighted; some of it is permissively licensed or public domain; fair use remains unsettled. If uncertainty is treated as a “no,” the question becomes whether a good model can be trained only on permissively licensed data.

CommonPile collected 8TB of such data. The largest category is code, at 4,775GB, followed by government and legal data at 1,172GB, wikis at 528GB, academic papers at 379GB, public-domain books at 244GB, online forums at 165GB, and smaller categories of other and educational resources. The web category appears with two different figures in the source visuals (288GB in one view and 398GB in another), so it is best described as a few hundred gigabytes rather than given as a single figure. Sources include Stack v2, PEPs, USPTO, legal corpora, Regulations.gov, Wikimedia, C4-derived material, PubMed, arXiv papers, OpenAlex, Stack Exchange, GitHub Archive, Ubuntu IRC, pre-1929 books, Library of Congress, Project Gutenberg, Creative Commons YouTube, DOAB, Pressbooks, and OER Commons.

CommonPile category	Size
Code	4,775GB
Government & Legal	1,172GB
Wikis	528GB
Academic Papers	379GB
Web	A few hundred GB; the source visuals show 288GB and 398GB
Public Domain Books	244GB
Online Forums	165GB
Other	29GB
Educational Resources	15GB
Total	8TB

CommonPile’s permissively licensed data mix, preserving the web-category discrepancy in the source visuals

Collecting permissively licensed data is harder than reading a license field. License laundering is one problem: someone can redistribute copyrighted work and attach a Creative Commons license, and it can be hard to determine whether the license is legitimate. Collection licenses are another problem. A dataset such as Dolma can have a permissive collection license, but that does not necessarily mean every individual document inside it is permissively licensed. Hugging Face datasets may show permissive licenses at the dataset level while containing individual works with different rights.
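The collection-license problem can be stated as a rule: a document is kept only if its own license is known and permissive, regardless of the license on the dataset that contains it. The sketch below is illustrative only. The `PERMISSIVE` set, the dictionary fields, and the sample records are hypothetical and do not reproduce CommonPile's actual license list, and no code can detect license laundering from a license field alone.

```python
# Hypothetical allowlist for illustration; not CommonPile's actual criteria.
PERMISSIVE = {"mit", "apache-2.0", "cc0-1.0", "cc-by-4.0", "public-domain"}


def keep_document(doc: dict, collection_license: str) -> bool:
    """Require a permissive document-level license.

    The collection license is accepted as an argument only to make the point
    that it is deliberately ignored: it says nothing about individual works.
    """
    license_id = doc.get("license")
    return license_id is not None and license_id.lower() in PERMISSIVE


collection = {
    "license": "odc-by",  # permissive at the collection level (hypothetical)
    "documents": [
        {"id": 1, "license": "CC-BY-4.0"},
        {"id": 2, "license": "all-rights-reserved"},
        {"id": 3},  # no per-document license recorded
    ],
}
kept_ids = [
    doc["id"]
    for doc in collection["documents"]
    if keep_document(doc, collection["license"])
]
```

Treating a missing license as a rejection mirrors the premise that uncertainty counts as a “no,” and it is also why the resulting corpus shrinks so sharply.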

CommonPile also avoided synthetic data. The legal status of synthetic data from models trained on unlicensed data is unclear. Open-weight models may have permissive licenses that allow use of their outputs, but if those models were trained on unlicensed data, treating their outputs as clean can look like a form of data laundering if one is being strict.

The performance result is mixed. CommonPile’s Comma v0.1-1T model is not as good as Qwen models on the displayed knowledge, reasoning, and coding benchmarks, but it performs decently and can outperform older baselines from early 2023 in some comparisons. Hashimoto’s conclusion is cautious: permissively licensed data can get “reasonably” far, but it is tough to compete without more tokens. He does not treat CommonPile as the final word; he thinks more performance may be possible from public, permissively licensed sources with further effort.
