OpenAI Model Disproves Erdős’s 80-Year-Old Unit Distance Conjecture

Hongxun Wu | OpenAI | Thursday, June 4, 2026 | 12 min read

OpenAI reasoning researchers Alexander Wei, Hongxun Wu and Lijie Chen say a general-purpose model disproved Paul Erdős’s 80-year-old unit distance conjecture, a central problem in discrete geometry, by finding a construction that beat the square-grid arrangement Erdős had proposed as essentially optimal. In the podcast, they argue the result is significant not just because of the problem’s status, but because the model was not a bespoke math system: given enough inference-time compute, it produced a proof idea that internal reviewers initially doubted and that other mathematicians quickly began using. Their broader claim is that AI is moving beyond contest math toward a collaborative role in research, where models solve hard problems and humans verify, interpret and extend the ideas.

The model did not just solve an olympiad problem

The result that animated Alexander Wei, Hongxun Wu, and Lijie Chen was not another benchmark score. Wei said an OpenAI model produced a disproof of Paul Erdős’s unit distance conjecture, an 80-year-old open problem in combinatorial geometry.

Wei described the problem as asking, for $n$ points on a plane, how many pairs can be exactly one unit apart, and how that number grows as the number of points increases. Erdős’s original conjecture, as Wei framed it, was that the best way to maximize these unit-distance pairs was essentially to arrange the points in a square grid. The model’s proof showed, in Wei’s account, that the square grid is not close to optimal, and that a better construction exists using “a lot of high-powered number theory.” In other words, the claim was not simply that a model found a clever configuration; it was that the model overturned Erdős’s proposed square-grid optimality for the unit-distance problem.
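To make the counting question concrete, here is a minimal Python sketch. It is not from the podcast and does not reproduce the model's construction. It only illustrates the standard intuition for grid constructions: in an integer grid, a distance that can be written as a sum of two squares in several ways (such as 25 = 5² + 0² = 3² + 4²) occurs between more pairs of points than the nearest-neighbor distance does, so rescaling the grid to make that distance the unit yields more unit-distance pairs. The function names and the 10×10 grid size are illustrative assumptions.

```python
from itertools import combinations


def square_grid(k):
    """Return the k*k points of an integer grid with coordinates 0..k-1."""
    return [(x, y) for x in range(k) for y in range(k)]


def count_pairs_at_squared_distance(points, d2):
    """Brute-force count of unordered point pairs whose squared distance is d2."""
    return sum(
        1
        for (x1, y1), (x2, y2) in combinations(points, 2)
        if (x1 - x2) ** 2 + (y1 - y2) ** 2 == d2
    )


grid = square_grid(10)

# Nearest-neighbor pairs: the grid spacing itself is the unit distance.
adjacent_pairs = count_pairs_at_squared_distance(grid, 1)

# Pairs at distance 5: rescaling the grid by 1/5 would make these unit distances.
rescaled_pairs = count_pairs_at_squared_distance(grid, 25)

print(adjacent_pairs, rescaled_pairs)  # 180 268
```

The same 100 points give 180 pairs at distance 1 but 268 pairs at distance 5, which is why the choice of which distance to treat as the unit matters even within a grid.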

Chen emphasized that this was not a minor curiosity among Erdős’s many questions. Erdős asked more than a thousand problems, some with monetary prizes attached and some simply noted. This one carried a $500 prize, but Chen’s point was not the money. He described it as “one of the central questions” in discrete geometry, heavily discussed in papers and regarded as a concrete major open problem in its field.

$500: the Erdős prize that Chen said was attached to the unit distance problem

The distinction mattered because the researchers had already seen models reach strong performance on competitive math benchmarks. The International Mathematical Olympiad and the International Olympiad in Informatics had long functioned as implicit grand challenges for AI, Wei said: high school competitions with brutally difficult problems, two sessions of roughly four and a half to five hours, and only three problems per session. Wei recalled that when he joined OpenAI, Noam Brown asked him when models would get IMO gold. Many researchers, he said, thought it might be out of reach that year and perhaps possible in 2026. Wei thought it might be possible by April; he said it took until June to get a strong model, and when the IMO rolled around, “we were able to get gold.”

But Chen drew a clear line between that and the Erdős result. IMO-level problem solving, however hard, was now “far in the rearview mirror” of frontier AI, in Wei’s phrase. The Erdős disproof, Chen said, was “way beyond” IMO level and “something that can be published in like the best journal of math.”

The proof first looked implausible, then harder to dismiss

The researchers were not looking for this particular breakthrough. Wu said they wanted to test the upper bound of a model’s capability, so they used a selected subset of Erdős problems as probes. Wei and Wu were testing two slightly different internal models and both saw what appeared to be correct solutions.

The first verification step was to ask the model to check its own work. Chen was explicit that this was not enough: models are sometimes unreliable. The team then asked mathematically trained colleagues inside the company, including Matev and Maciej, to review the proof. According to Chen, their initial reaction was disbelief: “There’s no way this can be true. It’s like a major open problem.” After a day of failing to find a mistake, they became more convinced.

Wu described the shift in credibility as gradual. Matev initially said the proof was definitely wrong, but Wu discounted that because he believed the review had only taken five or ten minutes. When Matev later moved to “50%,” Wu half-joked that if the trend continued, the next night it would be 100%. The experience felt “dreamlike,” he said, but also increasingly natural: they knew a model would eventually do something of this scale, just not that quickly.

Wei’s own instinct was skepticism. As a researcher, he said, one learns that results that look too good to be true usually contain a bug. His prior was that Lijie and Hongxun would eventually find the mistake. Instead, as the days passed, his view shifted toward the possibility that this was “the one in a hundred times where it’s too good to be true, but it’s actually true.”

Chen later described the model as having produced a 125-page reasoning trace. He treated parts of that trace as creatively interesting, even where those paths did not work. Wu, however, cautioned that reading such a long trace was probably not very helpful for a mathematician. The answer itself mattered more: it conveyed an idea that mathematicians could learn from and reuse.

Wu said that reuse happened quickly. In his account, some of the mathematicians who reviewed the proof, together with collaborators, used the idea to disprove a sum-product conjecture for real numbers. Wei described that as remarkable: according to him, within about a week, a group of mathematicians had used the model’s result as inspiration to attack another problem he characterized as of similar importance to the unit distance conjecture.

The important variable was time spent thinking

Wei’s path into reasoning began with the idea of spending more compute at inference time to solve harder problems. Before “test-time compute,” he said, models answered immediately, “right off the cuff.” Giving a model inference-time compute lets it think, revise, try different paths, and improve its answer before producing a final response.

The podcast’s host, Andrew Mayne, summarized this as letting the model think longer. Wei agreed: that extra time makes models smarter in the operational sense that they can do things they could not do instantly.

Chen pointed to the Erdős experiment as evidence that reasoning works. In the plot on the official blog, he said, the model’s accuracy on the problem rose as it was given more time to think; with a lot of time, it got the problem right almost 50% of the time. Wei later described the same result as a test-time compute finding: with a large enough test-time compute budget, the model could solve the problem around 50% of the time. Chen’s shorthand was: “More thinking, more correctness.”

~50%: the success rate that Chen and Wei said the model reached on the problem with enough test-time compute

Wei also stressed that the model was not a bespoke mathematics system. It was a general-purpose model they took “on a test drive” against very challenging math problems. Chen said the same model worked well in a general ChatGPT or Codex-like setup: it could code, browse websites, find information, write and execute Python. It did not, in this case, write Lean.

That generality produced one comic but revealing detail. Chen said the model’s first action when it accessed the web was to check the meaning of “unit” in the Cambridge Dictionary. It wanted to make sure it had the exact correct understanding of the term. Wu added that models often restate definitions in their answers to show they have grounded the problem.

The model did not need elaborate scaffolding for the original result, according to Wei. It was essentially asked to solve the problem, and it produced the answer; the original prompt and response, he said, were uploaded in a note on OpenAI’s blog. Later reproductions by other models or labs appeared to involve more structure or steering. But he interpreted the central lesson as test-time compute scaling: pour in more compute at inference time and results improve.

The creative act was a bridge between fields

Wei was careful about the limits of his own mathematical judgment. The proof was, he said, “well above my own mathematical pay grade.” But at a high level, he understood the core move as taking class field theory and applying it to combinatorial geometry. Some mathematicians may have known that a bridge between those areas could exist, he said, but making the connection required insight and creativity, and executing the proof required delicate, careful work.

Wu’s most vivid description was operational rather than technical: tell the model to do something, go to lunch, and come back to find that it did far better than expected. That was the moment when the model felt “really amazing.”

Chen was interested in the distinction between combining distant ideas and generating wholly new ideas. In the Erdős proof, he characterized the final idea as largely a sophisticated combination of existing material. But he said that within the long reasoning trace there were creative thoughts that did not work out. He wants to see whether AI can eventually generate “completely new ideas from scratch,” something he said has not yet been concretely seen.

Wu was more cautious about new forms of mathematics. Models are now very good at coming up with ideas to solve problems, he said, but not yet good at proposing a completely new kind of math or a new theory. How to make models do that remains open.

Wei framed the issue in terms of time horizon. He sees something like Moore’s Law in the duration over which models can work effectively and independently: every few months, the amount of human-equivalent time they can operate for seems to double. Some hard problems may have short solution paths if one is very good at finding them. But inventing new ways of doing mathematics is a years- or decades-long process, so Wei expects it will take time for that exponential improvement to reach theory-building.
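The arithmetic behind that expectation can be written out. As a rough sketch, assume the horizon doubles every $T$ months from a starting horizon $H_0$, and let $H^{*}$ be a target horizon. All three symbols are illustrative; the podcast gives no specific figures.

$$H(t) = H_0 \cdot 2^{t/T}, \qquad t^{*} = T \log_2\!\left(\frac{H^{*}}{H_0}\right)$$

Here $t^{*}$ is the time needed to reach $H^{*}$. Under these assumptions, going from a one-day horizon to a ten-year horizon is a factor of about 3650, or roughly $2^{11.8}$, so it would take about twelve doublings regardless of the value of $T$.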

The near-term pattern is collaboration, not replacement

Mayne pressed the question of whether this should intimidate mathematicians. Hongxun Wu said it should be empowering. After the unit distance proof, he said, mathematicians improved the bound and used the construction’s intuition to knock down other open problems. He expects that pattern to continue: models may make breakthroughs on hard questions, while humans digest the ideas and apply the methods elsewhere.

Lijie Chen made a related argument about comparative strengths. AI knows a lot and can connect distant ideas, he said, while humans can still think longer and build broader theories. “Currently it seems AI cannot build a new theory for math,” he said. But with AI help, humans can pull ideas from distinct fields and use them to make new discoveries.

The optimism was not presented as denial of professional anxiety. Wu said concern is legitimate, especially in fields where much of the work is problem-solving, because models are going to become very good at problem-solving. But mathematics, in his view, is more than that. It is about understanding structure and building new theories. The optimistic path is to use models to solve encountered problems faster and thereby accelerate theory-building and understanding.

Chen compared the likely future of mathematics to what has already happened in coding. As Codex becomes more capable, one might expect to work less. In practice, he said, capable tools make him work more because there is more he can do. He imagines mathematicians with ten ideas assigning them to ten models, seeing which one succeeds, and avoiding tedious calculations themselves.

That shift is already visible in the researchers’ own workflows. Alexander Wei said much of his work is now done by coding agents, allowing him to do far more. Wu described an even sharper change: half a year earlier, he was hand-coding and searching Slack for directions; now his default is to ask Codex, let it work, go to lunch, or talk to people. “The work completely changes,” he said.

Wei also warned against treating this as a race by OpenAI to solve as many Erdős problems as possible. He said the goal is to empower academic communities, not to enter from the outside, solve a bunch of problems, and hand communities “AI slop.” The better approach is to make researchers aware of the capability and give them access, so that they can direct test-time compute toward the problems they consider important.

Using the models well means recalibrating trust

Hongxun Wu gave two practical recommendations for researchers: use a model that can think longer, and ask the boldest question possible. He has sometimes decomposed a problem into smaller pieces, asked the model those subproblems, and found that it performed worse than when asked the original question directly. His explanation was that human priors about how a problem should be solved are useful for reducing thinking time, but they can also be wrong and create blind spots. Models can surprise users by finding hidden paths.

Alexander Wei said researchers need to learn how far to trust the model. Under-trusting it prevents a researcher from using its full capability; over-trusting it causes failures. He said Chen had taught him a lot about using these tools, and described himself as a “dinosaur” in adoption because he began working at OpenAI before these tools existed and still carried old habits from models of six months prior.

Lijie Chen offered a simple heuristic: double your trust in the model, see where it fails, and if it fails, back up. Repeat this every month. That way, a researcher can quickly approach the point of maximum useful trust without breaking their work. Over the previous five months, he said, that trust boundary had been moving exponentially.
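Chen’s heuristic can be read as a simple doubling loop with back-off. The sketch below is a toy illustration, not anything described in the podcast: “trust” is modeled as the size of the task handed to the model, and the model’s real capability is a made-up fixed ceiling (`CAPABILITY_CEILING`). The function name and parameters are hypothetical.

```python
def calibrate_trust(succeeds, start=1, rounds=8):
    """Double the delegated task size each round; keep the last size that worked."""
    trust = start
    history = []
    for _ in range(rounds):
        candidate = trust * 2      # "double your trust"
        if succeeds(candidate):    # "see where it fails"
            trust = candidate
        # On failure, trust stays at the last level that worked ("back up").
        history.append(trust)
    return trust, history


# Hypothetical ceiling: the model handles tasks up to size 40 and fails beyond it.
CAPABILITY_CEILING = 40

final_trust, history = calibrate_trust(lambda size: size <= CAPABILITY_CEILING)
print(final_trust, history)  # 32 [2, 4, 8, 16, 32, 32, 32, 32]
```

In this toy setting the loop reaches the largest power of two under the ceiling in five rounds and then holds. If the ceiling itself rises over time, as Chen suggested the trust boundary has, the later rounds would keep succeeding instead of stalling.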

Wu added that a model’s usefulness does not end at the answer. After asking it to solve a question, a researcher can ask how it solved it, request explanations of specific proof steps, and have the model walk through the proof line by line. He liked that the system becomes not just a one-shot solver, but a patient tutor for the solution it produced.

Some famous problems remain out of reach

The researchers did not suggest that all open mathematics is about to fall. Lijie Chen said some Erdős problems are “very, very hard.” Alexander Wei noted that the Erdős list includes problems such as the Collatz conjecture, which feel far beyond the mathematical technology of today even though they are simple to state.

Chen’s aspirational target was P versus NP, but he had already explained why it looks different from the unit distance result. Solving P versus NP may require building an entirely new theory, perhaps “many books of new ideas.” At present, he said, models still seem far from that.

Wei’s next milestone was AI that can do AI research. He sees many unsolved problems in AI as limited by human intelligence, and he is broadly optimistic about making AI available because demand for intelligence in the world exceeds what humans can supply.

The researchers were more speculative when asked about fields beyond math. Chen wants people to use models to discover new things not only in mathematics but across science. He imagines a “dream world” in which everyone has access to top-level reasoning ability and researchers can use it to discover whatever they want to discover. In his framing, OpenAI’s mission is to accelerate science by empowering every scientist.

Cryptography may be stress-tested; quantum computing is a different paradigm

On cryptography, Lijie Chen focused on the conjectural foundations. Much of cryptography rests on assumptions that certain problems, such as factoring, are hard for classical computers. Those are not generally backed by mathematical proof. If models become strong at algorithms, he said, they might prove some cryptographic conjectures and show that protocols are secure — or they might find loopholes. Either outcome would matter because models could stress-test the foundations of security.

The question of quantum computing was treated differently. Chen, who said his first paper was on quantum advantage, described quantum computing as different territory. Models are classical computers; they do what humans can do, perhaps better. Quantum computers can in some cases do different “fancy stuff,” such as simulating quantum effects in chemistry, though he cautioned that he was not an expert on that comparison.

He did expect AI to accelerate the development of quantum computers. In particular, he pointed to recent improvements in quantum error correction and imagined AI proposing new quantum error-correction algorithms, speeding physical implementation and development.
