AI Math Progress Is Jagged, Not a Clean AGI Benchmark

Grant Sanderson, Dwarkesh Patel · Tuesday, June 30, 2026 · 25 min read

Grant Sanderson argues that AI’s rapid gains in mathematics are less a clean proxy for AGI than a map of uneven capabilities: solving contest problems, finding cross-field connections, inventing definitions, verifying proofs, and explaining ideas are different tasks with different signals. In conversation with Dwarkesh Patel, Sanderson says the most consequential mathematical breakthroughs may be the hardest for current systems to learn, because their value often depends on delayed judgment, human taste, and concepts that compress a field rather than merely prove a theorem.

Math progress is a testbed, not a clean AGI proxy

AI progress in mathematics is useful evidence precisely because it is uneven. In Grant Sanderson’s account, “AI can do math” is not one capability advancing along one line. It can mean solving posed problems, finding bridges between fields, inventing useful definitions, checking long arguments, compressing proofs into explanations, or curating what is worth understanding. Those capabilities have different reward signals and different implications for the rest of the economy.

Dwarkesh Patel framed mathematics as the field where AI appears to be moving fastest, and therefore as a concrete preview of how progress may or may not generalize elsewhere. He began with a retrospective. Three years earlier, he had asked Sanderson whether an AI that could win gold at the International Math Olympiad would effectively be AGI. Sanderson had answered that it would become another benchmark rather than a decisive threshold. Patel said that now looked right, and asked why: if Olympiad success did not produce a general “aha moment,” what should one infer from future mathematical breakthroughs?

Sanderson’s answer was that even “math ability” has a jagged internal frontier. Mathematics is one of the spikes in AI progress, but within mathematics the spikiness has what he called a fractal character: when one zooms in, some subdomains are much easier than others.

Math is just right there in one of the spikes. But there's kind of a fractal nature to that spikiness because when you zoom into the specific progress within math, you have some things are a lot easier than others.

Grant Sanderson

The International Math Olympiad is the clean example. Its problems are traditionally grouped into geometry, number theory, algebra, and combinatorics. Sanderson said systems had become especially strong at geometry. In 2024, they could “cold solve geometry basically,” sometimes very quickly, because geometry admits brute-force approaches. Students have versions of brute-force geometry methods too, he said, which is part of the “dirty secret” of the IMO: many problems can be trained for.

Combinatorics remains different in his telling. It is more “playful” and “puzzly,” and the 2024 test happened to include two combinatorics problems. Sanderson’s counterfactual was conditional: had the distribution included more geometry, he suggested, an AI system might have received gold that year.

That example matters because it keeps the benchmark in proportion. IMO success is impressive, but a hard benchmark can still fail to imply broad automation. Patel pressed the stronger version: if an AI solves a Millennium Prize problem, could there still be many economically important tasks humans can do that AI cannot?

Sanderson said the answer depends on what the solution looks like. A solution to the Riemann hypothesis, for instance, could arise in several different ways.

One possibility is a “lightning bolt” between already mature fields. Sanderson pointed to the story of Hugh Montgomery and Freeman Dyson at the Institute for Advanced Study. Montgomery, a number theorist, was studying statistical correlations among zeros of the Riemann zeta function. Dyson, a physicist, recognized an expression from random Hermitian matrices, which arise in the study of nuclear energy levels. That kind of bridge between analytic number theory and physics is the sort of thing LLMs might be especially suited for: superhuman breadth across domains, plus the ability to notice similarity without needing the right two people to happen to talk over lunch. But Sanderson said that ability is “totally different” from what makes an AI a useful editor or general white-collar worker.

A second possibility is “mountain building”: the creation of a new theory or conceptual apparatus. Sanderson compared this to Fermat’s Last Theorem, whose elementary statement eventually required machinery from elliptic curves and modular forms. If the Riemann hypothesis required building a new mathematical mountain, then solving it would imply a deeper capability: not just finding a connection across known terrain, but inventing the terrain from which the problem becomes tractable. Sanderson said that if a system could build the “correct new theory” that crystallizes a subject, it would be surprising if that did not spill into other parts of the economy.

A third possibility, developed later in the exchange, is raw exertion: a proof that is extremely long, perhaps thousands of pages, with no new conceptual compression. That would demonstrate correctness and persistence across long chains of reasoning, but it might not produce much human understanding.

Patel acknowledged that he was “moving the goalposts.” He had previously wondered why systems with broad knowledge were not making discoveries by connecting facts across domains. Now, after examples he took to include the AI-assisted counterexample to the unit distance conjecture, he asked what the next benchmark should be. His candidates were not theorem proving alone, but coming up with interesting problems and inventing new objects or conceptualizations that unify fields.

Sanderson agreed that this moves closer to the center of mathematics. In discussions around AI progress, he said, one mathematician had invoked a hierarchy: good mathematicians prove theorems, great mathematicians come up with conjectures, and the greatest mathematicians come up with definitions. A “conjecture generator” would be impressive; a “definition generator” would be the premium tier.

Good mathematicians prove theorems, great mathematicians come up with conjectures, and the greatest mathematicians come up with definitions.

Grant Sanderson

But that hierarchy exposes a measurement problem. A proof has a scoreboard: the theorem is proved or it is not. A conjecture or definition usually does not. Sanderson said a headline claiming that a model “came up with a really good conjecture” would not land like a theorem-proof announcement, because the value of the conjecture may be subjective or delayed.

He expects progress here to show up less as a benchmark and more as a tone shift among mathematicians. The signal would be mathematicians saying that models are not only solving their problems, but helping them decide what research direction is worth pursuing.

Patel connected this directly to training. The kinds of work that are hardest to benchmark are also, in the current paradigm, hardest to train for. There may be no deep metaphysical reason AI cannot do them; the practical issue is that current reinforcement learning from verifiable rewards works best where the reward can be checked.

The most valuable concepts may have century-long reward loops

The deepest tension is not whether AI can prove hard theorems. It is whether AI can generate the sort of concepts whose value only becomes visible after a long historical arc.

Dwarkesh Patel used Galois theory as the test case. Abel had proved the general quintic unsolvable by radicals, but Galois’s work eventually led to group theory, a far more generative conceptual framework. If the question is whether group theory was useful, Patel suggested, the verification loop may be a century long: its value becomes visible through later applications to symmetry, physics, cryptography, and other areas. That sort of breakthrough is hard to reward in a training loop because the reward arrives long after the move.

Grant Sanderson, who said he had spent roughly a year thinking about Galois for a shelved project, treated the example as almost ideal. The value of Galois’s insight did not come from immediate utility, and even human verifiers at the time did not recognize it as valuable.

Sanderson reconstructed the historical path. The quadratic formula was ancient in spirit, though algebraic notation came later. Italian mathematicians found formulas for cubics and quartics, making the quintic a natural next challenge. The quartic formula was already so complicated that it was usually not written out in full, so one might have expected the quintic formula, if it existed, to be even more monstrous.

Lagrange made the key preliminary move. He saw that solving polynomials was related to symmetries of algebraic expressions under permutations of their roots. Sanderson illustrated this with simple expressions: a sum such as $a + b + c + d$ remains unchanged under any permutation, while an expression such as $(a + b)(c + d)$ is invariant under some permutations but not others. Lagrange recognized that expressions with certain symmetry properties could reduce a quartic problem to a cubic one. Extending the method to quintics would require an expression in five variables that takes only three or four distinct values across all $5! = 120$ permutations of its variables, a puzzle-like condition that one might begin to suspect is impossible.
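
The permutation counting in this paragraph can be checked by brute force. The sketch below is illustrative only: the helper name `distinct_values` and the sample inputs 2, 3, 5, 7 are assumptions, chosen so that the four-variable expressions do not coincide by accident. It counts how many different values an expression takes when its arguments are rearranged in every possible order.

```python
from itertools import permutations

def distinct_values(expr, inputs):
    """Count the distinct values expr takes over all orderings of inputs."""
    return len({expr(*ordering) for ordering in permutations(inputs)})

# Arbitrary sample "roots" (an assumption for illustration, not real polynomial roots).
roots = [2, 3, 5, 7]

def symmetric_sum(a, b, c, d):
    return a + b + c + d

def paired_product(a, b, c, d):
    return (a + b) * (c + d)

print(distinct_values(symmetric_sum, roots))   # 1: fully symmetric
print(distinct_values(paired_product, roots))  # 3: 24 orderings collapse to 3 values
```

The three values of the paired product correspond to the three ways of splitting four roots into two pairs, which is the kind of collapse that lets a quartic be reduced to a cubic.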

Lagrange did not solve the quintic. But he planted the instinct that symmetry of roots was the right way to think about polynomial solvability. Abel and Galois later pushed that instinct further. Abel proved impossibility. Galois moved toward abstraction: not the formulas themselves, but the symmetries underlying them.

The striking part, for Sanderson, is that Galois’s work did not pass the verifiers of his own time. He submitted papers; they were rejected. Sanderson did not present this simply as institutional blindness. The work was not fully coherent, not a complete proof, and not yet a clear theory. Galois was a young mathematician “getting his bearings.” Even stating precisely what problem Galois solved is tricky. Modern Galois theory lets one analyze whether a specific polynomial has roots expressible by radicals, but Galois himself did not cleanly present that in the modern form.

After Galois died, the ideas still took decades to take hold. His brother and a friend tried to circulate the notes. More than a decade later, in the 1840s, Liouville saw that there might be something there and tried to clean them up. Another two decades or so passed before Jordan, in 1870, produced something closer to a modern treatment of group theory and attributed the ideas to Galois. Only much later did group theory become central in physics; Sanderson pointed to Murray Gell-Mann anticipating quarks through group-theoretic reasoning as one striking application.

The lesson is uncomfortable for any reward-driven account of discovery. Lagrange had an instinct. Galois had an instinct. Liouville had an instinct that a dead young mathematician’s scattered notes mattered. None of those instincts had an immediate clean reward signal.

Sanderson suggested that if there is any route to formalizing this, it may involve compression. A concept that yields a smaller, more predictive representation feels closer to intelligence. He floated Kolmogorov complexity as a possible analogy for elegance: perhaps one could reward not only whether a proof works, but whether it reduces the conceptual machinery needed to see why it works.
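
As a rough illustration of the compression idea, the sketch below uses the length of a zlib-compressed byte string as a crude, computable stand-in for description length. Kolmogorov complexity itself is uncomputable, so this proxy, the sample inputs, and the helper name `description_length` are all assumptions made for illustration; none of it was proposed in the conversation as a reward signal.

```python
import random
import zlib

def description_length(data: bytes) -> int:
    """Size in bytes of the zlib-compressed input (a crude complexity proxy)."""
    return len(zlib.compress(data, 9))

# A highly regular sequence: a short rule ("repeat 'ab'") accounts for all of it.
patterned = b"ab" * 500

# A pseudo-random sequence of the same length: zlib finds no short rule for it.
rng = random.Random(0)
noisy = bytes(rng.getrandbits(8) for _ in range(1000))

print(len(patterned), description_length(patterned))
print(len(noisy), description_length(noisy))
```

Both inputs are 1,000 bytes long, but the regular one compresses to a small fraction of that while the pseudo-random one barely compresses at all. Whether any such measure could capture mathematical elegance is exactly the open question raised above.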

But he did not claim this is easy. The Galois instinct is precisely the kind of thing current verifiable-reward methods do not naturally capture.

Proof and explanation can come apart

A common worry about AI in mathematics is that a system may prove the Riemann hypothesis while leaving human understanding no better off. Dwarkesh Patel questioned whether that should be the default expectation. Humans often invent abstractions and subgoals because they are useful for solving complicated problems; perhaps the simplest route to a proof is also the route through natural concepts. He also pointed to recent AI-assisted mathematical progress where, as he understood it, the reasoning was understandable to mathematicians and used known concepts in natural language.

Grant Sanderson again separated cases. If the proof is a lightning bolt between existing fields, it may be quite human-parseable. The contribution is small but potent: here is a concept in one field, here is a concept in another, and here is the bridge. Mathematicians can often run with that once shown the start and end points.

If the proof requires mountain building, understanding takes longer. Humans must climb the new theory. Sanderson mentioned the attempted proof of the abc conjecture, associated with inter-universal Teichmüller theory, as a nearby example of alien-feeling theory building. He did not want to dwell on the example, and he described it as probably not a correct solution, but the relevant feature was that an otherwise reputable mathematician proposed a whole new framework and other mathematicians spent years trying to parse it. The fear would be an AI producing something like that, especially if, after years of effort, people conclude it is wrong. Even if it were right, hiking the mountain would take work.

If the proof is raw hustle — a very long chain of reasoning without conceptual compression — then the worry is strongest. One may have correctness without insight. Patel invoked David Bessis’s essay “The Fall of the Theorem Economy” as a reference point for a world where the theorem-proving part is automated and the valuable work shifts toward definitions, insights, and consolidation. If AI produces direct arguments for many important conjectures, humans or future AIs may then need to extract the higher-level concepts behind them.

Sanderson thought even a proof that outstrips current understanding would be valuable. Much of mathematical research is being wrong: wandering, trying things, and discovering that they fail. If one knows that a body of reasoning leads to a correct solution, that alone changes the search. It gives direction for digestion.

He drew a distinction between proof and explanation through Timothy Chow’s framing of forcing, as Sanderson described it. The continuum hypothesis asks, roughly, whether there is a size of infinity strictly between that of the natural numbers and that of the real numbers. The answer depends on one’s axioms: the hypothesis is independent of the standard ZFC axioms, so they can neither prove nor refute it. The method used to show this involves forcing, which is famously hard to understand. Sanderson cited Chow’s proposal of an “unsolved expository problem”: not an unsolved theorem, but a theorem whose proof exists before the field has a satisfying explanation of why it is true.

Everyone knows the idea of an unsolved research problem. Like I wanna propose the idea of an unsolved expository problem. Where like, sure we've proven it, but we don't really know why it's true.

Grant Sanderson

That distinction matters beyond mathematics. Patel suggested that incentives in science may shift from proving things about the world toward consolidating proofs into higher-level insight. But he also noted that sometimes the representation is not a mere aid to understanding; it is the idea. Minkowski spacetime diagrams are not just illustrations of special relativity, he suggested, but part of the conceptualization of the reality itself.

Sanderson agreed that there is a strong correlation between people who produce genuinely novel insights and people who explain them clearly. Einstein, Claude Shannon, and Richard Feynman were his examples: their papers or explanations are lucid, not merely expert documents requiring a machete. He said his own view had shifted. He once imagined AIs becoming automated theorem provers while humans retained the role of explaining. Now he suspects the same capability that finds the right new idea may also be good at explaining it.

That leaves a narrower human role. Sanderson does not expect to stop doing his kind of work, but he thinks part of the explainer’s role may become more relational and curatorial. If AIs can generate and explain enormous amounts of mathematics, people will still want guidance through the space of ideas. He compared future mathematicians or explainers to art museum curators: the art exists, and perhaps an AI can describe it well, but a trusted human helps decide what is worth attending to. Motivation is social. Audiences trust a curator’s taste, not merely their ability to summarize.

The next wave may be engineered serendipity

A few examples of AI-enabled connection-making do not exhaust the category. Grant Sanderson said that even if models have begun striking “lightning bolts,” he still expects a “flourishing future” over the next couple of years in connecting ideas.

Some major breakthroughs may themselves look like connections when viewed at the right scale. Dwarkesh Patel floated general relativity as a case: Riemannian geometry plus special relativity. Sanderson responded by pointing to the Langlands program, not as a single field but as a research ethos organized around deep correspondences between distant areas. Fermat’s Last Theorem is one instance of this broader style: a bridge between seemingly disparate structures leads to the solution of a problem. Much mathematical work, Sanderson said, is not aimed at knocking down a particular problem but at filling in a large map of connections.

That is where AI breadth may matter most. Sanderson’s guess is that much of the useful progress from models over the next five years will look like filling in that landscape with the bridges an expert in multiple fields might draw. It will be hard to score. If a problem falls, the headline is obvious. If a connection feels right and generative, humans may need to remain in the loop to judge whether it is the kind of connection mathematicians are seeking.

Sanderson also wondered why broad language models have not already produced more of these bridges. He speculated about autoregression itself. A language model produces output by repeatedly predicting the next token from context. He compared this to locking an intelligent person in a box, giving them slips of paper, asking them to predict what comes next, then wiping their memory each time. The resulting essay might be much worse than what the person would have written if allowed to compose, plan, and revise. In particular, the model may become “a slave to your context.” A connection to another field is, by nature, unlikely relative to the immediate local continuation.
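
A toy version of that loop makes the speculation easier to see. The sketch below is a deliberately simplified assumption: real language models condition on the whole context and usually sample rather than always taking the top choice, and the probability table `NEXT_TOKEN` is invented for illustration. It only shows how always taking the locally likeliest continuation can skip a lower-probability token that would have led toward another field.

```python
# Hypothetical next-token probabilities, keyed by the previous token only.
NEXT_TOKEN = {
    "zeta": {"zeros": 0.9, "function": 0.1},
    "zeros": {"spacing": 0.8, "eigenvalues": 0.2},  # "eigenvalues" is the cross-field bridge
    "spacing": {"statistics": 1.0},
    "eigenvalues": {"of": 1.0},
    "of": {"random": 1.0},
    "random": {"matrices": 1.0},
}

def greedy_continue(start, max_steps=5):
    """Repeatedly append the single most probable next token."""
    tokens = [start]
    for _ in range(max_steps):
        options = NEXT_TOKEN.get(tokens[-1])
        if options is None:  # no known continuation: stop
            break
        tokens.append(max(options, key=options.get))
    return tokens

print(greedy_continue("zeta"))
# ['zeta', 'zeros', 'spacing', 'statistics']
```

In this toy table the path through "eigenvalues" to "random matrices" exists, but the greedy loop never takes it because the local continuation is always more likely.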

Patel thought data and environments matter more than architecture or loss function. He pointed to autonomous coding agents learning to step back, search a codebase, assess mistakes, and continue. For math or science, he imagines frontier-style problems designed to require cross-field connections, perhaps with synthetic ways to make them harder by removing assumptions. The essential requirement is an environment that incentivizes the ability.

Sanderson found it plausible that such environments will produce many more lightning bolts. Patel then emphasized a different AI advantage: parallelism. AI systems need not be one idiosyncratic genius who makes a few connections and dies young. They can be scaled across problems and copied with similar knowledge. “Quantity has a quality all of its own,” he said.

Sanderson extended the thought by returning to the Montgomery-Dyson lunch. An institute is smarter than an individual partly because it creates serendipitous conversations. One could imagine engineering such conversations among agents with different areas of expertise. But Sanderson also suggested that the advantage may not only be knowledge pooling; it may be context manipulation.

Humans often get trapped in a way of thinking. Some problems require backing up, trying the opposite, or escaping the assumptions produced by training. Sanderson described an IMO problem that many strong students and even Terence Tao failed on, which he plans to discuss in his AI-and-math series. People were angry at it, calling it a “troll problem,” because the contest context suggested a sophisticated approach while the best solution was almost “brain-dead.” The way to solve it was to escape the IMO frame and approach it like an ordinary brain teaser.

Digital minds may be good at this if prompted or organized correctly. One could spin up agents with different contexts: one trying to prove a statement, one trying to disprove it, one using one family of tactics, another using a different one. Patel noted that this inverts a common worry about AI: that models collapse toward the same patterns because they are trained similarly. Perhaps systems can instead be used to increase entropy by systematically trying negations and biases that human communities neglect.

Sanderson connected this to scientific heuristics. Einstein’s bias that physics should look the same in different reference frames was productive. His resistance to quantum randomness was not. There is no single correct heuristic for science; one needs multiple independent research programs with their own biases. Patel imagined “old school software” wrapped around models to explore an ontology of approaches: prove, disprove, try different tactics, enforce breadth.

Verifiability is not enough; progress also needs grindability

Dwarkesh Patel accepted verifiability as one reason AI has advanced quickly in math, but argued that the usual explanation is incomplete. A domain also has to be grindable: it must allow many parallel attempts, clean resets, and usable credit assignment.

What computer use lacks is grindability.

Dwarkesh Patel

Computer use is verifiable in many cases. Did the package get ordered? Was the event booked? Did the task complete? But it is hard to grind. Websites have bot detectors. Running thousands of parallel rollouts through the same checkout flow is expensive and likely to trigger defenses. Building clones of every website is labor-intensive. The current deep learning regime still needs many rollouts, Patel said, because sample efficiency remains limited.

Coding and math are unusual because they can be replayed, containerized, and farmed. In coding, one can snapshot a repository, spin up many containers, ask agents to implement a feature, and compare outcomes. Because the starting point is deterministic, credit assignment is easier: if one rollout succeeds and another fails, the difference between them carries information. In many real-world domains — building a business, trading markets, interacting with people — the world changes, cannot be reset cleanly, and cannot be rolled out in identical parallel branches.
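
The properties named here (a fixed snapshot, clean resets, many attempts, and outcome comparison) can be sketched in a few lines. Everything specific below is a hypothetical stand-in: the action names, the toy verifier, and the dictionary used as a "repository snapshot" are assumptions for illustration, and the rollouts run one after another although they are independent and could be farmed out in parallel.

```python
import random

ACTIONS = ["read_code", "write_patch", "run_tests"]  # hypothetical action set

def rollout(snapshot, seed, steps=3):
    """One attempt from a fresh copy of the snapshot; returns (actions, success)."""
    rng = random.Random(seed)   # a per-rollout seed keeps each attempt replayable
    state = dict(snapshot)      # clean reset: never mutate the shared snapshot
    actions = []
    for _ in range(steps):
        action = rng.choice(ACTIONS)
        actions.append(action)
        if action == "write_patch":
            state["patched"] = True
        elif action == "run_tests" and state["patched"]:
            state["verified"] = True  # toy verifier: tests ran after a patch existed
    return actions, state["verified"]

snapshot = {"patched": False, "verified": False}
results = [rollout(snapshot, seed) for seed in range(200)]

def success_rate(with_action):
    """Share of rollouts containing with_action that the verifier accepted."""
    outcomes = [ok for actions, ok in results if with_action in actions]
    return sum(outcomes) / len(outcomes)

for action in ACTIONS:
    print(action, round(success_rate(action), 2))
```

Because every rollout starts from the same state, differences in success rate can be attributed to the actions taken rather than to a changing world, which is the crude form of credit assignment the paragraph describes.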

Patel therefore downplayed Lean as the central driver of recent AI math progress. In the unit distance conjecture counterexample, the released chain of thought, or at least the rewritten version, was in natural language rather than Lean. He argued that process supervision from formal proof assistants may be less important than grindable, verifiable outcomes.

Grant Sanderson partly agreed, but defended Lean’s importance in two other roles. The first is open-ended autonomous exploration. AlphaGo and AlphaZero could play enormous numbers of games in their own universe because the reward was automated. In natural-language mathematics, a human still ultimately reviews a proof. That bounds how far one can let systems run. In Lean, by contrast, one could imagine an AI endlessly trying to extend Mathlib — the Lean repository meant to contain formalized mathematics. It might be a fork, not the canonical human-curated version, but it could grow without constant human checking.
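
For readers who have not seen Lean, a minimal example of what automated checking looks like may help. The snippet below is a small Lean 4 sketch, not an excerpt from Mathlib; the theorem name `sum_comm` is made up for illustration. Each statement is accepted only if its proof checks, with no human reviewer involved.

```lean
-- A statement together with a proof term; Lean rejects the file if the proof does not check.
theorem sum_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- A concrete arithmetic fact, verified by computation.
example : 2 + 2 = 4 := rfl
```

An autonomous system extending a library would be producing many statements of this form, each one mechanically verified, which is what makes the unattended exploration described above conceivable.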

Sanderson presented this as a possibility, not as a settled prediction. A system could “press go,” pour compute into formal mathematical exploration, look away for years, and return to a vast tree of formally correct results. The open question is usefulness. The system might generate many definitions, theorems, and branches that are true but uninteresting. Still, he said it would be surprising if no mathematical insight came from such a process.

The second role for Lean is trust. If AI mathematicians produce ten natural-language papers per day, even a small error rate becomes intolerable. Sanderson cited Alex Kontorovich’s concern: if 99 out of 100 papers are right, a mathematician may still not know whether it is worth spending days checking one, because finding the error is labor-intensive and frustrating. A formal proof gives a “green check mark.” It does not make the proof understandable or interesting, but it establishes correctness. Sanderson said every other field would “kill” for that.
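
A back-of-the-envelope calculation shows why a 1 percent error rate bites at that volume. It uses the two figures mentioned above (ten papers per day, 99 out of 100 correct); treating errors as independent across papers, and imagining a new result that relies on $k$ earlier papers, are added simplifying assumptions.

```latex
\begin{align*}
\text{flawed papers per year} &\approx 10 \times 365 \times 0.01 = 36.5 \\
P(\text{all } k \text{ papers relied on are correct}) &= 0.99^{k} \\
0.99^{10} &\approx 0.90 \\
0.99^{50} &\approx 0.61 \\
0.99^{100} &\approx 0.37
\end{align*}
```

Under these assumptions, a result resting on a hundred unchecked papers is more likely than not to sit on at least one error, which is the situation a formal check is meant to rule out.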

Patel introduced examples of natural-language verification as a possible alternative. He described DeepSeekMath as using a verifier and a meta-verifier for natural-language proofs in an Art of Problem Solving style, and said the approach appeared to work in the published literature. He also said coding agents seem to be improving at not merely writing functional code but writing clean, refactored code, plausibly using model-as-judge processes for taste and structure. Sanderson thought natural-language verification is more plausible for math than for writing because correctness in math is more objective.

But he still saw formal exploration as complementary. A Lean-based tree of logic could, in his speculation, explore paths disconnected from human phrasing and prior heuristics, somewhat like AlphaGo’s Move 37. Natural-language progress may remain closer to existing mathematical culture; formal autonomous exploration might open a different search process.

Writing is the counterexample to the verifier story

Writing exposes what current models still do poorly. Dwarkesh Patel’s first explanation was reward hacking. Models may discriminate between a B essay and an A essay, but then prefer what he called a “B star” essay: something bad that hits the superficial features of quality. His second explanation was modularity. Code and math can tolerate local messiness if the final artifact works. A function can be written several ways and still produce the correct result. A lemma can be ugly but usable. In writing, the output is the artifact itself. Every sentence and word is part of the substance. There is no separate functional product that the prose merely produces.

Grant Sanderson pushed back: if agents are improving from merely functional code to clean, mergeable pull requests, why would the same progress not yield clearer writing?

Patel conceded that models can be excellent at some forms of explanation. He often prefers pasting material into an LLM and asking for an explanation rather than reading the original. Even when speaking with a human expert, he sometimes wishes it were socially acceptable to pause, ask an LLM for a basic concept, and then return to the expert’s unique knowledge.

Sanderson drew a line between distillation and writing. If the task is a book report, an LLM may perform well. But when people say models are bad at writing, they usually mean something else: generating the insight, deciding what is worth saying, and making the unpredictable move at the right moment. Writing is not just clear explanation of pre-existing ideas. A good author explores the world, selects what matters, and builds a motivated narrative. Autoregression is awkward for this because good writing requires deliberate novelty, not merely a higher-temperature next-token guess.

Patel added theory of mind. Good writing requires modeling the reader’s mind sentence by sentence: what image appears first, what confusion arises, what association will persist months later. He referred to work by Andy Matuschak and a collaborator on getting LLMs to write good spaced-repetition prompts. The difficulty, as Patel understood it, was that a good flashcard requires projecting a learner’s mind months into the future and crafting a prompt that elicits the intended memory. He saw writing as similar. One must constantly ask what is happening in the reader’s mind now.

Sanderson did not find it surprising that models struggle here. He described a study he remembered — with caveats that he might be misremembering — in which people who had just received Botox became worse at reading facial expressions. The proposed explanation was that recognizing an emotion partly involves subtly mimicking it with one’s own facial muscles. If one cannot move the muscles, one loses some ability to infer the feeling.

The analogy was that humans understand other humans partly by running them on shared hardware. Models have read what people wrote, but they do not have faces, bodies, or human minds. Their theory of mind, if it emerges, is alien and indirect rather than grounded in the same machinery.

Learning still depends on human sequence and taste

LLMs can be useful tutors, but Grant Sanderson argued that their best current use is often not to replace human-created explanations. It is to route learners toward them.

His advice began from a pre-LLM principle: “who matters more than what.” Students choosing courses should care less about their current topic preferences and more about whether the teacher is good and resonates with them. Readers should often follow authors they trust, not just subjects they already care about.

LLM explanations, to Sanderson, currently feel like Wikipedia: astonishingly useful compared with the pre-Wikipedia world, but often lacking the crafted motivation of a single author. Wikipedia’s best use is often the references at the bottom. Similarly, he often uses LLMs to find the right human-created resource. He described asking Claude for a visual explanation of semiconductors; Claude falsely attributed a video to 3Blue1Brown, but the linked video itself was useful. The best use was not the model’s own explanation, but its ability to locate a human explanation.

Dwarkesh Patel agreed. His best learning sessions involve a human artifact — article, book, lecture, or video — that orders concepts correctly and builds motivation step by step, with the LLM used to prune side branches. He described studying Steven Strogatz’s textbook on chaos and nonlinear dynamics with a lecture on one part of the screen, the textbook on another, and an LLM on the third. The human author supplied the sequence and motivation; the model answered local questions.

Both saw a remaining weakness in LLM tutoring: a good human can reject the premise of a student’s question and say they are thinking about the topic the wrong way. Models are too placating. Sanderson connected this again to theory of mind. A strong teacher recognizes that a question reveals a mental structure and may need to reframe it. An exceptional teacher can go further: take the student’s unusual framing seriously and “jujitsu” it into the right path rather than simply replacing it.

The human role shifts toward judgment, curation, and teaching

Students who love math but worry about AI should begin with a question that mattered before AI: where does the money come from, and what value are you adding? Grant Sanderson heavily caveated his advice. He is a YouTuber outside the formal mathematical institution, not someone whose career guidance should be treated as authoritative. But he thought this principle applies before and after AI.

Many students, he said, want to pursue math because they have always been rewarded for being good at it. They imagine becoming mathematicians as a continuation of the same sequence of hoops. But a job is embedded in a value chain. For a prestigious mathematician, a university may be paying partly for brand value. For grant-funded work, the money comes through public-good beliefs about basic science, filtered through institutions and bureaucracy that try to predict valuable progress. For many academic jobs, teaching is central: parents and students pay for access to institutions with experts who can educate them.

AI changes some tasks, but not all of these social functions in the same way. If theorem proving becomes mostly automated and explanation also improves, Sanderson still expects an internal mathematical culture to decide what counts as valuable contribution. The prestige signals may shift from theorem proving toward definitions, curation, or identifying meaningful directions. The institution may still exist because society values basic science and trusts mathematicians to judge where effort should go.

Teaching, in his view, is especially stable. Even if LLMs become excellent explainers, teaching is relational: coaching, mentoring, motivating, and forming a human relationship. Sanderson called it one of the more stable careers over the next 50 years. Students interested in mathematics should take seriously the value of becoming math educators, not only theorem provers.

I actually think teaching is one of the most stable post-AGI jobs that there is, because it's so relational.

Grant Sanderson

Dwarkesh Patel argued that this becomes even more true in an extreme scenario. If, over the next five to ten years, AIs were not only solving Millennium Prize problems but inventing new problems, fields, and mathematical objects, then the area where AI minds would have seen furthest beyond human horizons might be mathematics. In that world, Patel suggested, there would be demand for humans who can explain what the AIs have seen. If any jobs exist in such a world, distilling AI-discovered mathematics would plausibly be one of them.

Sanderson added that this framing almost presumes math is useless. If new mathematics has practical applications, then people who understand it and can point the “behemoth” of AI-generated mathematics toward useful ends become more economically leveraged, not less.

Patel asked whether accelerated mathematics would actually matter economically. Group theory began in questions about polynomial roots and later found applications across physics and other domains. If math accelerates 10x or 100x, should one expect “crazy” downstream effects, or will other fields bottleneck progress?

Sanderson’s answer was again spiky. Progress in algebraic number theory may not obviously unlock immediate applications. But progress in areas such as dynamics, PDEs, and simulation can be application-adjacent. He described, with explicit uncertainty about whether he was summarizing the example correctly, a mathematician whose group had ideas that helped Boeing do more aircraft testing in simulation rather than through costly physical disassembly and rebuilding; Sanderson said the work saved Boeing “billions of dollars or something” and led the company to fund the group. In such domains, better mathematics might improve engine design, wing shapes, computational fluid dynamics, or simulation efficiency. The effects may be meaningful but incremental rather than a single dramatic breakthrough.

He also allowed a more awkward possibility. AI-accelerated mathematics may force the field to confront how much recent work is disconnected from physical application. If mathematical progress increases dramatically and the rest of the world does not see corresponding benefits, people may ask what the work is for. Mathematicians might then have to reckon with the possibility that some grant-proposal stories about eventual utility were more tenuous than advertised.
