Mathematicians have developed a very precise common language for saying whatever they want to say. This precision is embodied first of all in the definitions of the objects they work with, stated usually in the framework of a more or less axiomatic set (or category) theory, and in the skillful use of metalanguage (which our natural languages provide) to qualify the statements. All the other vehicles of mathematical rigor are secondary, even that of rigorous proof.

– Yuri Manin, *The interrelation between mathematics and physics*^{[1]}

Google DeepMind’s AlphaProof system has achieved silver-medal performance at this year’s International Mathematical Olympiad (see the GDM blog post and Tim Gowers’ Twitter thread discussing it).

Currently, the fastest way to find a proof of a theorem is to have a mathematician search for it manually. The domains where efficient proof searches exist^{[2]} are narrow, and their use usually can’t justify the cost of formalizing mathematics in a proof assistant. Two things are changing here: general RL-based proof-search methods are becoming extremely powerful, and LLM-based *autoformalization* is reducing the cost of using proof assistants.
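To make the notion of autoformalization concrete, here is a hand-written illustration (the theorem name is my own; an autoformalizer would aim to produce this kind of Lean statement, and ideally the proof, from the English sentence automatically):

```lean
-- Informal: "every natural number is less than its successor".
-- A possible Lean 4 formalization, with a one-line proof from the
-- core library:
theorem lt_succ_example (n : Nat) : n < n + 1 :=
  Nat.lt_succ_self n
```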

Therefore, I think there is a good chance automatic proof search will soon render manual proof search obsolete (contrary to the title of this post, proofs will still exist, as trivially-verifiable certificates of truth - but they will not primarily be produced by humans).

The primary importance of mathematicians will be the production of definitions. As proof search makes the *inspection* of the formal properties of formal objects extremely cheap, it becomes more valuable to produce formal objects whose formal properties are of interest - and which can be *understood as such* by users.

For example, the safeguarded AI programme involves verifying that a system is safe to operate by producing a model of the system and its environment, such that verifying a formal safety property of the model implies, with a high degree of confidence, that the system will be safe to operate in reality. Automatic proof search can find a *proof* of such a formal safety property. But the model has to be produced in a way that leaves us confident in its accuracy, which implies a high degree of human input.
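As a schematic sketch of my own (not the programme’s actual formalism), the division of labour might look like this in Lean: proof search can discharge the safety statement, but only humans can vouch that the model’s ingredients reflect reality.

```lean
-- A toy shape for "system + environment" models: an initial state,
-- a transition function, and a predicate marking bad states.
structure Model (S : Type) where
  init : S
  step : S → S
  Bad  : S → Prop

-- The state reached after n steps of the model.
def run {S : Type} (M : Model S) : Nat → S
  | 0     => M.init
  | n + 1 => M.step (run M n)

-- The formal safety property: no reachable state is bad.
-- Proof search targets `Safe M`; whether `step` and `Bad`
-- faithfully model the real system is a human judgment.
def Safe {S : Type} (M : Model S) : Prop :=
  ∀ n, ¬ M.Bad (run M n)
```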

## What AlphaProof actually did

AlphaProof is an automatic proof searcher for Lean. It combines a language model with the AlphaZero reinforcement-learning algorithm. The training problems were produced by using the Gemini LLM to autoformalize a large corpus of mathematical statements.

This was combined with a model called AlphaGeometry2, which is somewhat similar but specialized to geometric problems.

For the IMO benchmark, the problems were *manually* formalized - apparently the autoformalization technology is not yet 100% reliable. (Note that, for training, it doesn’t have to be: learning to solve a slightly different problem from the one in the dataset is just fine, since the purpose of the dataset is only to supply example problems.)

The output of the system is simply Lean code defining a proof - there is no automatic *deformalization*/interpretation step.
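For a sense of what such output is like, here is a toy illustration (my own, not actual AlphaProof output): a complete, machine-checkable Lean 4 proof. Real outputs have the same shape, just far longer, and a human must read the Lean directly or ask an LLM to interpret it.

```lean
-- A complete Lean 4 proof of a simple statement, as a proof term.
-- The kernel checks it; no natural-language explanation is attached.
theorem add_comm_toy (m n : Nat) : m + n = n + m :=
  Nat.add_comm m n
```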

The system apparently took up to three days to solve some problems, in contrast to the two times 4.5 hours given to IMO contestants, so perhaps it is a bit premature to give it the silver medal.

The IMO problems can be downloaded here. Of course it is worth noting that these problems are designed to have fairly short solutions (writeable in a few pages of handwriting), and for example do not require any complicated auxiliary definitions. How effectively this approach scales is a key question.

## Outlook

Betting on the effective scaling of big ML systems to fail has not been a winning strategy recently - still, it is not clear how much one should infer from the increasing performance of LLMs about a fairly different domain like this.

On the other hand, finding long proofs is also very difficult for humans - the system doesn’t have to become so good that it can solve hard open problems *very quickly and cheaply* to be superhuman.

IMO problems are also different from research mathematics in that they do not usually draw upon a lot of existing theory (they should at least be understandable by high school students without a great deal of extra teaching), and of course accessing a large knowledge base is something large neural networks are very good at. This is a reason to expect these systems to become even better (relative to humans) when applied to research mathematics.

So overall I think these systems will surpass us at proof-search “soon”. How this will affect academic mathematics is very hard to forecast. There may be a period where computer-aided mathematicians cash in on the fame of cracking a bunch of open problems, but eventually the mere act of typing some interesting-looking term into a proof search algorithm will be recognized as not very novel. I think it will increase the *economic* value of trained mathematicians, since part of their expertise is designing formal objects (definitions), which will become much more useful.

What does this look like? A partial answer is that you will write things in natural language, then have them turned into Lean by a language model (you may hand-write some Lean too, of course). Then you’ll put the cursor in a proposition and hit CTRL-ALT-P to have the system search for a proof or counterexample. There will also be a language model which can explain the formal proof back to you in natural language.
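In Lean terms, the workflow sketched above might look like this (the key binding and the surrounding tooling are imagined, not an existing product):

```lean
-- Hypothetical workflow: the human states the proposition and
-- leaves a hole; the proof searcher fills the hole with a term,
-- which the Lean kernel then checks as usual.
theorem my_conjecture (n : Nat) : n ≤ n + 1 := by
  sorry -- "CTRL-ALT-P" would replace this with a found proof,
        -- e.g. `exact Nat.le_succ n`, or report a counterexample
```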

But in part, this will probably lead to a deeper change in the practice of mathematics. There is probably a much better possible interface than the existing ones. One area that seems ripe for innovation is the connection between the systems being verified and their mathematical models as they exist in Lean (this is connected to the model=code idea which has been floating around the ACT space since forever, maybe due to Jules Hedges?). I think this sort of thing brings together a bunch of ideas that people around here have already been thinking about, and it’s going to be a very exciting field going forward.

I think this might be the source of the oft-repeated line “Proofs are more important than theorems, definitions are more important than proofs,” which is often attributed to Manin but which I could not identify a source for. ↩︎

Any time you use a SAT solver, you are of course running a proof search in some sense, but there is also tooling like Sledgehammer. ↩︎