AlphaProof Nexus Solved Nine Erdős Problems With Formal Verification

Károly Zsolnai-Fehér | Two Minute Papers | Friday, June 5, 2026 | 6 min read

Károly Zsolnai-Fehér argues that DeepMind’s AlphaProof Nexus should not be judged mainly by its 9-for-353 success rate on Erdős problems, but by the kind of system it represents. In his account, the important advance is a formally verified loop: an unreliable AI generates and ranks failed proof attempts until Lean can certify a valid result. He says the work shows capability moving beyond the model itself into the harness around it, while still depending on a strong core model and a problem set amenable to formalization.

Nine solved problems is the wrong number to dismiss

DeepMind’s AlphaProof Nexus attempted 353 Erdős problems and solved nine. That leaves a 97.5% failure rate, at roughly $200 per problem. Károly Zsolnai-Fehér treats that not as a reason to dismiss the result, but as the wrong denominator for understanding it.
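The headline percentages follow directly from the two counts quoted above. A minimal check of the arithmetic, using only the 353 and 9 from the article:

```python
attempted = 353  # Erdős problems attempted, as quoted above
solved = 9       # problems solved, as quoted above

success_rate = solved / attempted
failure_rate = 1 - success_rate

print(f"success: {success_rate:.2%}")  # success: 2.55%
print(f"failure: {failure_rate:.2%}")  # failure: 97.45%
```

Rounded to one decimal place, the failure rate is the 97.5% cited in the text.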

9: Erdős problems solved by AlphaProof Nexus out of 353 attempted

The relevant context, as Zsolnai-Fehér presents it, is that Paul Erdős left more than a thousand open problems to the world, and the system was working on decades-old mathematical problems that had not been solved by people. The visuals highlight a 56-year-unsolved case, but the source does not establish that every one of the nine solved problems had been open for exactly that long. The claim is narrower and still material: AlphaProof Nexus reached into a class of long-standing open problems and produced formally checkable solutions to nine of the 353 attempted.

He also rejects the criticism that the work is not “fundamentally new” by putting it on a short capability timeline. Four years ago, he says, people objected that GPT-3 could not reliably add numbers. Two years ago, the complaint was that AI could not reliably solve high-school competition problems. One year ago, it was that AI could not reliably win a Mathematical Olympiad gold medal. Today, the complaint has shifted to whether it can reliably solve 50-year-old unsolved problems.

The point is not that every earlier criticism was false when made. It is that the bar keeps moving. Zsolnai-Fehér invokes what he calls the “first law of papers”: do not look only at where the work is now; look at where it may be “two more papers down the line.” On that basis, he calls the AlphaProof Nexus result “absolutely amazing” and “stunning.”

The system does not ask the model to be trustworthy

The core technique is not simply “ask an AI to prove a theorem.” Zsolnai-Fehér emphasizes that ordinary AI assistants hallucinate and make things up, which makes them unreliable proof generators. AlphaProof Nexus instead routes the work through Lean, a formal mathematical language in which a proposed proof can be checked mechanically.

The setup begins with a human mathematician formalizing the problem in Lean while leaving the proof blank. In the example shown, the theorem statement ends with a sorry placeholder: the mathematical claim is written down, but the proof still has to be found. A prover subagent then tries to solve it. Zsolnai-Fehér’s expectation is blunt: it fails, because the problems are too hard.
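To make the sorry placeholder concrete, here is a toy Lean 4 statement in the same shape. It is an illustration only: the theorem and its name are hypothetical, and it is far simpler than anything in the Erdős set. The claim is fully written down, and the proof is left open for a prover to fill in.

```lean
-- Toy statement for illustration only; not one of the Erdős problems.
-- The claim is stated precisely, but the proof is left blank with `sorry`.
theorem toy_statement (a b : Nat) : a + b = b + a := by
  sorry
```

Lean accepts the file but flags that the declaration relies on sorry, so the statement does not count as proved until a real proof replaces the placeholder (here, `exact Nat.add_comm a b` would do).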

The architecture shown for AlphaProof Nexus includes a mathematician, Lean problem formalization, domain knowledge and proof libraries, a prover subagent using an LLM plus AlphaProof, a proof validator, a rater subagent using an LLM critic and tournaments, and a population database containing sketches, goal proofs, and Elo scores.

Component	Role in the system
Mathematician	Writes the problem in Lean and leaves the proof blank
Prover subagent	Uses an LLM plus AlphaProof to attempt a proof
Proof validator	Checks whether the proposed proof is formally valid
Rater subagent	Uses an LLM critic and tournaments to compare candidate solutions
Population database	Stores sketches, goal proofs, and Elo scores for reuse

AlphaProof Nexus as shown in the architecture diagram

That first failure is not the end of the loop. A proof validator checks whether the proof is correct. Another AI component critiques bad attempts and explains why they are bad. The distinctive mechanism, as Zsolnai-Fehér presents it, is a cheaper judging model that compares two previous solutions and picks a winner. Both can be wrong; the judge only needs to pick the one that is “a bit better.”

The system assigns scores to candidate proof attempts using a tournament structure analogous to chess Elo ratings. In this analogy, proofs are the players. A bad proof can still earn a relatively high score if it beats other bad proofs. The next iteration does not restart from nothing; it starts from the highest-scoring failed solution and tries to improve it.
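The scoring idea can be sketched with the standard chess Elo update. This is a sketch under stated assumptions: the article only says the tournament is analogous to Elo, so the K-factor of 32, the 400-point scale, the starting rating of 1000, the attempt names, and the judge verdicts below are all hypothetical rather than DeepMind's actual settings.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one pairwise comparison."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - exp_a)
    return rating_a + delta, rating_b - delta

# Three failed proof attempts, all starting at the same hypothetical rating.
ratings = {"attempt_1": 1000.0, "attempt_2": 1000.0, "attempt_3": 1000.0}

# Hypothetical judge verdicts as (winner, loser). Every attempt is still wrong;
# the judge only says which of the two is a bit better.
verdicts = [
    ("attempt_2", "attempt_1"),
    ("attempt_2", "attempt_3"),
    ("attempt_3", "attempt_1"),
]

for winner, loser in verdicts:
    ratings[winner], ratings[loser] = update_elo(ratings[winner], ratings[loser], a_won=True)

# The next iteration would start from the highest-rated failed attempt.
best = max(ratings, key=ratings.get)
print(best, ratings)
```

With these verdicts, attempt_2 ends up on top despite being an invalid proof, which is the point of the analogy: a rating measures standing relative to the other attempts, not correctness.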

“This is incredible, because it takes an unreliable AI, runs it over and over again, and it can lie its rear end off as much as it wants, and we still get a reliable system out of this.”

Károly Zsolnai-Fehér

The reliability comes from the validator and the tournament loop, not from trusting the language model’s statements. The model can generate false or incomplete proof attempts. The rater compares imperfect candidates. The validator is the non-negotiable endpoint: the loop keeps running until the formal checker says the proof works.
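A toy version of that loop shows how unreliable parts can still yield a trustworthy result. Everything here is a hypothetical stand-in: an exact string check plays the role of the Lean validator, random one-letter edits play the prover, and a judge that is wrong 20% of the time plays the rater. For brevity the Elo tournament is replaced by a simpler champion-versus-challenger comparison, and the toy judge peeks at the target, which a real rater cannot do.

```python
import random

TARGET = "LEAN"    # toy "correct proof"; hypothetical stand-in
ALPHABET = "AELN"

def verifier(candidate: str) -> bool:
    """Stand-in for the Lean check: accepts only a fully correct candidate."""
    return candidate == TARGET

def unreliable_generator(parent: str, rng: random.Random) -> str:
    """Stand-in for the prover: rewrites one random position, usually wrongly."""
    i = rng.randrange(len(parent))
    return parent[:i] + rng.choice(ALPHABET) + parent[i + 1:]

def noisy_judge(a: str, b: str, rng: random.Random, error_rate: float = 0.2) -> str:
    """Stand-in for the cheap rater: prefers the closer attempt, but errs sometimes."""
    def closeness(c: str) -> int:
        return sum(x == y for x, y in zip(c, TARGET))
    better, worse = (a, b) if closeness(a) >= closeness(b) else (b, a)
    return worse if rng.random() < error_rate else better

def search(start: str, max_rounds: int = 10_000, seed: int = 0):
    """Generate, verify, judge, keep the preferred attempt; stop when verified."""
    rng = random.Random(seed)
    champion = start
    for round_no in range(1, max_rounds + 1):
        challenger = unreliable_generator(champion, rng)
        if verifier(challenger):   # the only step whose verdict is trusted
            return challenger, round_no
        champion = noisy_judge(champion, challenger, rng)
    return None, max_rounds        # budget exhausted without a verified result

proof, rounds = search("AAAA")
print(proof, rounds)
```

The search can only return a candidate the verifier has accepted, so generator and judge errors cost extra rounds but never produce a false positive.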

The intelligence is partly in the harness

For Károly Zsolnai-Fehér, AlphaProof Nexus points to a broader shift in how AI systems improve. Previously, he says, the story was mainly “make the model smarter.” Here, the story becomes “make the harness around it tighter.”

His claim is not that the underlying model no longer matters. It is that the model is no longer the only place where capability lives. A system can become more powerful by changing the loop around the model: how attempts are generated, judged, ranked, stored, reused, and formally checked.

The phrase he returns to is that the system is “built from unreliable parts.” The LLM can fail repeatedly. The judge need not produce a proof. The tournament does not need every comparison to be perfect. But if the loop can preserve incremental progress and the validator can reject invalid final answers, repeated failure becomes a search procedure.

This is why the result matters beyond the nine theorems. The method points toward a design pattern: use a fallible generator, a comparative judge, memory of prior attempts, and a hard external verifier. Zsolnai-Fehér says everyone is experimenting with different kinds of loops, and he describes the current change as one where “the intelligence is not just in the model, but it is in the loop around it.”

The limitations are real, but not disqualifying

Károly Zsolnai-Fehér explicitly separates the result from the version likely to travel in simplified coverage. The first limitation is selection bias. Erdős left more than a thousand problems, while AlphaProof Nexus was tested on roughly 350. Zsolnai-Fehér says he thinks DeepMind selected a subset that was easier to formalize.

He does not treat that as a fatal weakness. His position is that the work had to start somewhere, and solving any decades-old open problems under formal verification remains significant. The caveat matters because the denominator matters: this was not a full sweep of all Erdős problems, and the attempted set may have been filtered by formalization feasibility.

Limitation	What Zsolnai-Fehér says it means
Selection bias	The system was tested on about 350 of more than 1,000 Erdős problems, likely a subset easier to formalize
Smaller models	Smaller models solved zero problems, so a strong core model is still required
Cost tradeoff	It remains open whether equal budgets are better spent on a larger model with fewer rounds or a smaller model with more rounds

The main caveats Zsolnai-Fehér attaches to the AlphaProof Nexus result

The second limitation is sharper: smaller models solved zero problems. Zsolnai-Fehér stresses the number repeatedly — zero, nothing — and concludes that a “beefy AI system” is still needed at the core. That qualifies the harness argument. Better loops may matter, but they did not make weak models sufficient for this task.

The result also raises an engineering tradeoff: if costs are equal, should one use a larger model with fewer tournament rounds, or a smaller model with more rounds? Zsolnai-Fehér does not answer it, but presents it as the kind of question this result makes practical. He also says the finding matches his experience that fast, cheap models can look close to frontier systems on benchmarks while feeling much weaker in practice.
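The tradeoff can be framed as back-of-the-envelope arithmetic. This sketch rests on loudly labeled assumptions: the per-round costs and per-round success chances are invented for illustration, and rounds are treated as independent, which the real loop is not, since it reuses earlier attempts. Only the budget of roughly $200 per problem comes from the article.

```python
def rounds_for_budget(budget_cents: int, cost_per_round_cents: int) -> int:
    """How many rounds a fixed budget buys (integer cents avoid float rounding)."""
    return budget_cents // cost_per_round_cents

def chance_of_any_success(p_per_round: float, rounds: int) -> float:
    """P(at least one success) if every round were an independent attempt."""
    return 1.0 - (1.0 - p_per_round) ** rounds

BUDGET_CENTS = 20_000  # roughly $200 per problem, the figure quoted in the article

# name: (cost per round in cents, per-round success chance). All hypothetical.
scenarios = {
    "large model, fewer rounds": (200, 0.01),
    "small model, more rounds": (20, 0.001),
    "small model that never succeeds": (20, 0.0),
}

results = {}
for name, (cost_cents, p) in scenarios.items():
    rounds = rounds_for_budget(BUDGET_CENTS, cost_cents)
    results[name] = (rounds, chance_of_any_success(p, rounds))
    print(f"{name}: {rounds} rounds, chance of a solve = {results[name][1]:.3f}")
```

Under these made-up numbers the first two options come out nearly even, while a model whose per-round chance is zero stays at zero however many rounds it gets, which matches the observation that smaller models solved nothing.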

The strongest claim is about formal verification, not human-like genius

The strongest version of Károly Zsolnai-Fehér’s argument is precise: AlphaProof Nexus solved nine long-standing open Erdős problems at a cost of roughly $200 per problem by letting an unreliable AI fail thousands of times inside a judge-and-validator loop that could eventually certify a proof.

That should not be read as a simple claim that a model has become a human-like mathematician. The mechanism matters. The important output is a proof accepted by the Lean validator, and the path to it can be messy, iterative, and full of failed attempts. The system is not trusted because it sounds plausible. It is trusted, in this account, when the formal checker accepts the proof.

Zsolnai-Fehér connects this to the four-year progression from “can’t add numbers” to “solves decades-old open problems.” His conclusion is not that limitations have disappeared. He repeats that limitations apply. But he argues that the balance of importance has shifted: raw model size still matters, while algorithmic harnesses and multi-agent loops now matter too.
