A Summary of PUTNAMBENCH

While looking at computer science research areas that I could find interesting, I stumbled upon formal methods, and more specifically, automated symbolic reasoning, theorem proving, and the integration of modern machine learning with formal reasoning. I decided to read some research papers to get a feel for this area, since it looked quite interesting to me. In this article, I am going to review and outline one interesting paper I read in this area. I will continue writing further articles about other papers I read.

PUTNAMBENCH

The first interesting paper I stumbled upon was the PUTNAMBENCH paper by Tsoukalas et al., where the capabilities of modern neural models in proving theorems in the framework of theorem provers such as Lean 4, Isabelle, and Coq are tested. These frameworks can automatically and rigorously verify the correctness of the proofs provided by the neural models. In this paper, the authors formalized hundreds of problems from the William Lowell Putnam Mathematical Competition.

The improvement that PUTNAMBENCH makes on existing benchmarks is that it introduces college level problems into the mix, with some problems even requiring ideas from research level mathematics, according to the paper. A few additional reasons cited for the creation of this benchmark were:

The limited scope of existing benchmarks
Existing benchmarks being designed for older frameworks
Preventing the leakage of benchmark data into the training data for LLMs (in general, the paper claims that this necessitates periodically creating new benchmarks)

One issue that PUTNAMBENCH had to address was that Putnam problems often aren't stated as logical propositions. In fact, more often than not, they require the student to both come up with a closed form solution and then prove that the solution is indeed correct. PUTNAMBENCH addresses this issue by splitting up generation of closed form solutions from the proofs of correctness into two tasks of different difficulty levels, where success in one task likely has high correlation with success in the other. The second task only asks for a proof of correctness of a pre-provided closed form solution. The first task is a strict superset of the second task, since it requires not only the generation of a closed form solution, but also a proof of correctness.

PUTNAMBENCH is claimed in the paper to be the first formalization of a large number of Putnam problems in Lean, Isabelle, or Coq, which is what is used to justify the idea that there isn't much cross-contamination between the dataset produced by the paper and the data used by large language models for training. I found this to be an interesting claim. Large language models probably have seen Putnam problems and their solutions in their natural language forms, but the claim that they haven't been exposed to formalizations of these problems and their proofs does seem plausible. It is then an interesting question whether or not seeing the natural language variants would give a language model an unfair advantage in the solving of the formalizations. The paper does acknowledge the possibility of such an indirect form of contamination.

The results of running various theorem proving models on these formalizations was quite astonishing to me, as a newcomer to this field. None of provers were able to get more than even a handful of the problems. I don't know whether this is typical of benchmarks for formal theorem proving, but it was a surprise to me. It also indicated to me that there is much progress left to be made in this area.

After reading this paper, I was curious to learn about the current state of the art in neurosymbolic reasoning. I wanted to learn how some of the models used (though unsuccesfully) in the PUTNAMBENCH paper worked. I therefore started reading some papers on this area. I also wanted to a bit about theorem proving frameworks, so I also began reading about those.