Sólrún Halla Einarsdóttir (Chalmers University of Technology)
Emily First (University of California, San Diego)
George Granberry (Chalmers University of Technology)
Moa Johansson (Chalmers University of Technology)
Nicholas Smallbone (Chalmers University of Technology)
ABSTRACT. Generating useful and novel lemmas is a challenging task for Large Language Models (LLMs), as well as for previous symbolic approaches. We present ongoing work on neuro-symbolic lemma conjecturing, combining LLMs and symbolic methods. We train an LLM to generate lemma templates that describe the shape of a lemma, and symbolic methods are used to fill in the details. Compared to purely LLM-based generation, our neuro-symbolic tool aims to avoid repetitive and incorrect lemmas, while producing more varied lemmas than purely symbolic methods. By combining the strengths of symbolic and neural methods we can generate useful lemmas for a wide range of input domains, facilitating computer-assisted theory development and formalization.
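As a rough illustration (ours, not an example from the abstract), a template fixes the shape of a lemma while leaving the function symbols as holes, which the symbolic back end then instantiates, for example:

```latex
% Illustrative only: a template with holes ?F, ?G and one possible instantiation.
\[
  \underbrace{?F(?G(x,y)) = ?G(?F(y), ?F(x))}_{\text{template}}
  \quad\rightsquigarrow\quad
  \mathit{rev}(xs \mathbin{+\!\!+} ys) = \mathit{rev}(ys) \mathbin{+\!\!+} \mathit{rev}(xs)
\]
```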
Uzma Shafiq (Coventry University)
Matthew England (Coventry University)
Jürgen Gerhard (Maplesoft)
ABSTRACT. See attached paper.
ABSTRACT. The contrast between the remarkable fluency of large language models and their flawed reasoning is linked in a recent assessment of LLMs to the distinction between pre-training powered by next-word prediction and fine-tuning associated with alignment such as reinforcement learning [Mahowald et al, 2024]. The present work attempts to understand this distinction by exploring the view that next-word prediction presupposes a discrete notion of state which alignment may refine in any number of ways. Open-ended as that process may be, some structure is imposed on it by developing Kleene’s representation of nerve nets into a logical system [Goguen and Burstall 1992] on which information-theoretic costs are introduced around a discretized least action principle [Marsden and West 2001], generalized to a finite-state setting, supported by tools such as Mona.
Asterios Gkantzounis (National Technical University of Athens)
Foivos Skarpelos (National Technical University of Athens)
Petros Stefaneas (NTUA)
ABSTRACT. Useful interactions between automated theorem proving (ATP) and machine learning (ML), more specifically Large Language Models (LLMs), would benefit from an abstract logical framework for proofs and validity. We propose the use of the theory of Institutions, an abstract model-theoretic tool for specification and programming.
Lean and Metamath are proof assistants that rely on distinct mathematical-logical foundations, i.e., specific Institutions. Lean is built on dependent type theory, which enables the construction of highly structured mathematical objects and proofs. On the other hand, Metamath centers around its main database, the Metamath Proof Explorer, which is based on first-order logic. Additionally, other Metamath databases explore alternative systems, including intuitionistic logic, Quine’s New Foundations, and higher-order logic. Metamath verifies proofs using a simple substitution rule by focusing on the correctness of substitutions, without assuming a specific logic system. Hence, we need a logic-independent approach to enhance the development of semantics for such tools. While Metamath and Lean are powerful tools, they have certain limitations that highlight the need for a more advanced logic-independent framework, such as Institutions.
Institutions form the core of categorical abstract model theory, which formalizes the notion of a logical system by defining its syntax, semantics, and the satisfaction relation between them. They provide a common framework that moves beyond the details of specific logics, which helps in studying their connections. An Institution consists of four main components: (a) a collection of signatures (serving as vocabularies for forming sentences in a logical system) and signature morphisms, (b) a set of sentences for each signature, which represents the syntax of the logic, (c) a set of models for each signature, which provides the meaning or semantics of the sentences, and (d) a satisfaction relation that connects the models with the sentences, indicating which sentences are true in which models. Institutions extend Tarski's semantic definition of truth and generalize Barwise's Translation Axiom. Institutions highlight the fact that truth is invariant under change of notation. The key aspect of any logical system is the satisfaction relationship between its syntax (i.e., its sentences) and its semantics (i.e., its models). While this relationship assumes a fixed vocabulary, Institutions provide the flexibility to work with multiple vocabularies simultaneously. This ability allows for translations between different logical systems, enabling one to be interpreted within another. A main result of Institution Theory gives the conditions under which proofs can be translated from one theorem prover to another.
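For reference, the standard satisfaction condition of Institution theory can be stated as follows; it makes precise the sense in which truth is invariant under change of notation.

```latex
% Standard satisfaction condition (Goguen--Burstall), stated for reference:
% for every signature morphism \sigma : \Sigma \to \Sigma', every
% \Sigma-sentence \varphi, and every \Sigma'-model M',
\[
  M' \models_{\Sigma'} \mathrm{Sen}(\sigma)(\varphi)
  \quad\Longleftrightarrow\quad
  \mathrm{Mod}(\sigma)(M') \models_{\Sigma} \varphi .
\]
```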
Systems like Lean and Metamath, for which it has been demonstrated that LLMs can assist in constructing formal proofs, can clearly benefit from the Institutional approach. The main methodological implications of providing a general framework for ML and LLMs are:
- The relativistic view
- The translation of problems and solutions
- The multi-language aspect
In our presentation, we will demonstrate how Institutions can be used to approach the semantics of systems like Lean and Metamath and the benefits of such an approach.
Alexander Lew (Massachusetts Institute of Technology)
Timothy O'Donnell (McGill University, Mila – Québec AI Institute)
Vikash Mansinghka (Massachusetts Institute of Technology)
ABSTRACT. Recent advances in the control of large language models via probabilistic programming allow us to imagine new workflows in theorem proving and formal mathematics. In order to spark a broader conversation about the role of probabilistic reasoning, we propose several speculative approaches: a reweighted wake-sleep algorithm for improving model quality, an involutive approach to autoformalization and autoinformalization, and a coarse-to-fine approach to proof synthesis.
ABSTRACT. The integration of large language models (LLMs) with proof assistants promises to enhance mathematical reasoning, blending intuitive human interaction with rigorous verification. However, this coupling might introduce subtle risks. In this talk, we investigate how LLMs, proof checkers, and proof assistants handle inconsistent assumptions: sets of assumptions that can lead to seemingly valid but ultimately flawed proofs. Our focus is on GPT-o1 as an LLM and Isabelle and Coq as proof assistants.
We use the case study of Minimal Probability, defined as a theory of statements about expectations of a finite number of random variables. This theory is rich enough to cover a variety of simple applications and exercises, including known elementary paradoxes. Probability theory offers advantages for this study due to its status as a formal theory connected to simple, concrete situations and its susceptibility to subtle inconsistencies involving probabilities, expectations, conditional probabilities, and independence.
Key findings from our investigation into the behavior of LLMs and proof assistants when faced with inconsistent assumptions include:
- When the possibility of inconsistency is not mentioned, GPT-o1 often detects simple inconsistencies but fails in more complex cases.
- When the possibility of inconsistency is raised in the prompt, GPT-o1 decides about inconsistencies at a level comparable to human experts.
- Proof checkers (e.g., Isabelle and Coq used solely to verify a submitted proof) do not detect contradictions, even when apparent.
- Proof assistants (e.g., Isabelle and Coq with active intervention on justifying proof steps) might flag inconsistencies, but only occasionally.
- GPT-o1 can be instructed to deceive the proof checker: it can create an inconsistent set of assumptions, opposite conclusions, and the proof checker code in such a way that both Isabelle and Coq would certify either conclusion as correct.
These findings highlight how potential inconsistencies could infiltrate probabilistic modeling of seemingly simple situations, leading LLMs and proof assistants to produce seemingly valid but ultimately meaningless proofs. Historically, when experienced humans designed setups, such risks were remote. However, these risks could grow as machines take on more control and assumptions become increasingly complex. Although possibly remote, the mere possibility that an LLM could deliberately craft assumptions to gain certification of unwarranted conclusions is particularly concerning.
This suggests the need for potential remedies or at least mitigating strategies. Our proposed actions include:
(0) Raising awareness of the issue to direct efforts toward mitigating the potential dangers.
(1) Maintaining human supervision of assumptions before integrating them into the proof environment for as long as possible.
(2) Using multiple proof assistants—built on diverse logical foundations—to increase the likelihood of detecting contradictions.
(3) Integrating automated consistency checks or specialized consistency-verification tools to promptly flag contradictions.
We explore (2) in a number of examples, with moderate success.
As for (3), we consider Minimal Probability, and we show that its axiomatization in a semantically complete first-order theory allows for the algebraization of assumptions, transforming probabilistic problems into polynomial relations. Consistency is then equivalent to the nonemptiness of a semi-algebraic set, a decidable problem that can be analyzed using tools such as cylindrical algebraic decomposition, which fully classifies consistent and inconsistent assumptions. This procedure has the complexity of the existential theory of a real closed field, and we have implemented it in Mathematica for the examples considered.
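A toy illustration of the algebraization (our example, not one from the paper): suppose the assumptions are P(A) = 0.3, P(B) = 0.4 and P(A and B) = 0.35, and write p1, ..., p4 for the probabilities of the four conjunctions of A, B and their negations. The assumptions become the polynomial (here linear) system

```latex
\[
  p_1 + p_2 = 0.3, \qquad p_1 + p_3 = 0.4, \qquad p_1 = 0.35, \qquad
  p_1 + p_2 + p_3 + p_4 = 1, \qquad p_1, p_2, p_3, p_4 \ge 0,
\]
```

whose solution set (a semi-algebraic set) is empty because it forces p2 = -0.05 < 0, so the assumptions are inconsistent. An independence assumption would contribute a genuinely nonlinear equation such as p1 = (p1 + p2)(p1 + p3).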
The talk will present examples, tests, and mitigating actions.
ABSTRACT. I report on the progress of the DeepIsaHOL project for doing modern machine learning (ML) on data from the Isabelle interactive theorem prover (ITP). The project is ongoing and focused on creating infrastructure for ML experiments with Isabelle data, such as fine-tuning large language models (LLMs) or converting Isabelle into a reinforcement learning (RL) environment. The project's ultimate objective is to provide tools based on these experiments to aid Isabelle users in the proving process. This report describes the project's current state and the tools it has generated, including a generic data extraction algorithm for Isabelle's theory files, its Scala and Python interfaces, simple read-eval-print loops (REPLs) in Python and Scala for interacting with Isabelle programmatically, and an initial use of the framework's data to train Google's FLAN-T5 Small model. The project remains open and welcomes ITP and ML enthusiasts to collaborate on it.
Orestis Melkonian (IOG)
Jean-Philippe Bernardy (Gothenburg University, Department of Computer Science and Engineering)
ABSTRACT. We introduce an ML-ready dataset for dependently-typed program-proofs in the Agda proof assistant, and present the underlying extraction tool. We then demonstrate how a bidirectional Transformer encoder retrofitted with structure-specific adjustments can achieve high performance in a realistic premise selection setup. Unlike most go-to LLM architectures, both the dataset and the neural model work in tandem to maximally preserve the typing information offered by Agda. The resulting pipeline is agnostic and impervious to most syntactic confounds, producing its inferences on the basis of type-structure alone. We provide empirical evidence that this makes for a general, robust, and efficient modeling approach.
ABSTRACT. We discuss opportunities for integrating ATP, ML and LLMs in the context of exploring the well-known Andrews-Curtis conjecture in combinatorial group theory, and report on initial experiments.
ABSTRACT. When proving an inductive problem, we often prove auxiliary lemmas that are useful for proving the original problem. If these auxiliary lemmas are themselves challenging, we must introduce further lemmas to prove them. To automate such multi-step conjecturing, we developed the Abduction Prover. Given a proof goal, the Abduction Prover conjectures a series of lemmas and attempts to prove the original goal using these lemmas. Our working prototype of the Abduction Prover for Isabelle/HOL is publicly available on GitHub.
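A schematic sketch (ours; the names try_prove and conjecture_lemmas are hypothetical and do not reflect the actual implementation) of the multi-step abductive loop described above:

```python
# Schematic sketch of multi-step abductive conjecturing (illustrative only).
# `try_prove` and `conjecture_lemmas` stand in for the prover and the
# conjecturing component; both are hypothetical names.
def abduction_prove(goal, try_prove, conjecture_lemmas, depth=2):
    if try_prove(goal, assumptions=[]):
        return True
    if depth == 0:
        return False
    for lemmas in conjecture_lemmas(goal):
        # The goal must follow from the conjectured lemmas ...
        if not try_prove(goal, assumptions=lemmas):
            continue
        # ... and each lemma must itself be provable, possibly via
        # further rounds of abduction.
        if all(abduction_prove(l, try_prove, conjecture_lemmas, depth - 1)
               for l in lemmas):
            return True
    return False
```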
ABSTRACT. This project explores the intersection of mathematics, crystallography, and machine learning by developing a method to detect symmetry groups of tilings (crystals). The approach originates from a mathematical technique I introduced in [1], which can be effectively modeled using classical machine learning methods. The work aligns with the principles of explainable AI, as preliminary results suggest the existence of a precise conjecture that should be provable based on data-driven modeling. This poster presents the methodology, key findings, and potential directions for collaboration in bridging theoretical mathematics with computational techniques in crystallography.
[1] Growth functions of periodic space tessellations, with Jakub Malinowski, Zbigniew Dauter and Mariusz Jaskolski, Acta Crystallogr., Sect. A: Found. Adv. (2025), Vol. 81, No. 1, 64-81.
ABSTRACT. Abstraction logic is a minimal yet maximally expressive logical system that serves both as a logic and a logical framework. While existing logics often trade simplicity for expressiveness, abstraction logic achieves both, making it an ideal foundation for machine-learning approaches to theorem proving. Its simplicity enables ML systems to manipulate and adapt it easily, while its expressiveness ensures no practical limitations. We present the core principles of abstraction logic and discuss its unique characteristics that make it well suited for AITP (artificial intelligence theorem proving).
ABSTRACT. Generative artificial intelligence (AI) applications based on large language models have not enjoyed much success in symbolic processing and reasoning tasks, thus making them of little use in mathematical research. However, DeepMind's AlphaProof and AlphaGeometry 2 applications have recently been reported to perform well in mathematical problem solving. These applications are hybrid systems combining large language models with rule-based systems, an approach sometimes called neuro-symbolic AI. In this paper, I present a scenario in which such systems are used in research mathematics, more precisely in theorem proving. In the most extreme case, such a system could be an autonomous automated theorem prover (AATP), with the potential of proving new humanly interesting theorems and even presenting them in research papers. The use of such AI applications would be transformative to mathematical practice and demand clear ethical guidelines. In addition to that scenario, I identify other, less radical, uses of generative AI in mathematical research. I analyse how guidelines set for ethical AI use in scientific research can be applied in the case of mathematics, arguing that while there are many similarities, there is also a need for mathematics-specific guidelines.
Sebastian Mežnar (Jozef Stefan Institute, Faculty of Mathematics and Physics, University of Ljubljana)
Ljupčo Todorovski (Faculty of Mathematics and Physics, University of Ljubljana)
ABSTRACT. We introduce a neural generator that produces valid Lean 4 expressions directly at the kernel level, bypassing Lean's elaboration process. To train the generator, we implement Lean's core type checking in Python and integrate it into a reinforcement learning environment that assigns rewards for well-typed subexpressions. We evaluate the model's ability to learn and adapt within this environment.
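For illustration, a minimal sketch (ours, with hypothetical names such as infer_type) of the kind of reward shaping the abstract describes: the reward is the fraction of subexpressions that the reimplemented kernel accepts as well typed.

```python
# Illustrative sketch only: `infer_type` stands in for the Python
# reimplementation of Lean's kernel type checking mentioned in the
# abstract; its actual interface may differ.
def subexpressions(expr):
    """Yield expr and all of its subexpressions (post-order)."""
    for child in getattr(expr, "children", []):
        yield from subexpressions(child)
    yield expr

def well_typedness_reward(expr, env, infer_type):
    """Fraction of subexpressions that the kernel accepts as well typed."""
    subs = list(subexpressions(expr))
    ok = 0
    for sub in subs:
        try:
            infer_type(env, sub)   # raises if `sub` is ill typed
            ok += 1
        except Exception:
            pass
    return ok / len(subs) if subs else 0.0
```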
ABSTRACT. In recent years, LLMs have been increasingly applied to mathematical and reasoning tasks.
These models require significant amounts of training, and yet they still dream up incorrect solutions or struggle with simple algebraic tasks. For this reason, we investigate methods to improve the performance of models on symbolic reasoning tasks. Term rewriting systems (TRS) offer an intuitive, reliable, verifiable and Turing-complete reasoning framework over formal languages.
With this work, we want to explore the potential of combining TRS with reinforcement learning (RL). We introduce treewrite, a Python library designed to facilitate the integration of TRS with neural networks, enabling efficient problem-solving through RL.
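To make the term-rewriting setting concrete, here is a minimal, self-contained rewriting sketch in plain Python; it is illustrative only and does not use the treewrite API.

```python
# Minimal term rewriting sketch (illustrative; not the treewrite API).
# Terms are nested tuples such as ("+", ("0",), ("s", ("0",))).
RULES = [
    (("+", ("0",), "X"), "X"),                         # 0 + X -> X
    (("+", ("s", "Y"), "X"), ("s", ("+", "Y", "X"))),  # s(Y) + X -> s(Y + X)
]

def match(pattern, term, subst):
    if isinstance(pattern, str) and pattern.isupper():  # variable
        if pattern in subst:
            return subst if subst[pattern] == term else None
        return {**subst, pattern: term}
    if not isinstance(term, tuple) or len(pattern) != len(term) or pattern[0] != term[0]:
        return None
    for p, t in zip(pattern[1:], term[1:]):
        subst = match(p, t, subst)
        if subst is None:
            return None
    return subst

def instantiate(pattern, subst):
    if isinstance(pattern, str) and pattern.isupper():
        return subst[pattern]
    return (pattern[0], *[instantiate(p, subst) for p in pattern[1:]])

def rewrite_once(term):
    """Apply the first applicable rule at the root, else recurse into arguments."""
    for lhs, rhs in RULES:
        s = match(lhs, term, {})
        if s is not None:
            return instantiate(rhs, s)
    if isinstance(term, tuple):
        return (term[0], *[rewrite_once(t) for t in term[1:]])
    return term

# Example: s(0) + s(0) rewrites to s(s(0)).
t = ("+", ("s", ("0",)), ("s", ("0",)))
while True:
    nxt = rewrite_once(t)
    if nxt == t:
        break
    t = nxt
print(t)   # ('s', ('s', ('0',)))
```

In an RL setting, the fixed rule order above would be replaced by a learned policy that chooses which rule to apply, and where in the term to apply it.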
ABSTRACT. I describe the strengths and weaknesses of various AI techniques with respect to mathematical reasoning and propose a hybrid architecture for future mathematical AI systems to combine the best features of the different approaches.
Igor Engel (JetBrains Research, Constructor University Bremen)
Stanislav Alekseev (JetBrains Research, Neapolis University Pafos)
Daniil Berezun (JetBrains Research)
Ekaterina Verbitskaia (JetBrains Research, Constructor University Bremen)
Anton Podkopaev (JetBrains Research, Constructor University Bremen)
ABSTRACT. Despite their promise of producing reliable software, formal methods have hardly been adopted in mainstream programming. With the advent of large language models, it becomes more feasible to automatically generate code along with verification guarantees. This research explores whether LLMs can produce verified code from a textual description and a partial specification. We achieve a 63% success rate in Nagini and 40% in Verus on the HumanEval benchmark.
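For context, the kind of output targeted in the Nagini track looks roughly like the following hand-written sketch (ours, not generated by the models discussed in the abstract), assuming Nagini's standard contract DSL:

```python
# Hand-written sketch of a Nagini-style verified function (illustrative only);
# assumes Nagini's contract DSL from nagini_contracts.
from nagini_contracts.contracts import Ensures, Requires, Result

def abs_val(x: int) -> int:
    Requires(True)                              # no precondition
    Ensures(Result() >= 0)                      # result is non-negative
    Ensures(Result() == x or Result() == -x)    # result is x up to sign
    if x < 0:
        return -x
    return x
```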
ABSTRACT. In this talk I present how theorem provers can be used to define machine learning applications together with correctness invariants of interest. This approach can be seen as a pathway towards encoding provably safe AI.
Nikita Khramov (JetBrains Research, Constructor University Bremen)
Andrei Kozyrev (JetBrains Research, Constructor University Bremen)
Anton Podkopaev (JetBrains Research, Constructor University Bremen)
ABSTRACT. We present the CoqPilot benchmarking framework, tailored for conducting experiments involving LLMs and other techniques in the Coq proof-generation domain. This framework allows a seamless setup of experiment pipelines while ensuring the reliability and efficiency of their execution, making it suitable both for small preliminary studies and for large-scale experiments. In addition to describing the benchmarking framework and its features, we present the experimental results obtained with it.
ABSTRACT. Artificial Intelligence (AI) is increasingly contributing to mathematics, from assisting in proof verification to uncovering patterns in complex structures. However, AI's role in creative problem-solving remains an open question. This paper examines the strengths and limitations of AI in mathematics, focusing on whether AI can engage in genuinely creative reasoning. We explore computational creativity through the lens of John McCarthy's definition—where creativity involves introducing new concepts not present in the problem statement—and the functional model of creative generation and evaluation. We illustrate this with the Mutilated Chessboard Problem, a classic combinatorial challenge that requires conceptual insight. AI-generated responses to this problem, while well-formed, fail to demonstrate the underlying creative reasoning, highlighting AI's current limitations. The discussion extends to conceptual blending, which models human creativity as the integration of different conceptual spaces. While AI systems can merge data-driven insights, they still struggle with flexible abstraction and deep conceptual innovation. Philosophically, AI's role in mathematical discovery raises critical questions about the nature of creativity and co-creativity between humans and machines. While AI excels at computation and pattern recognition, human mathematicians provide intuition, abstraction, and the ability to work with ambiguity. Rather than replacing human insight, AI offers a powerful collaborative tool for expanding mathematical knowledge. This paper provides suggestions for a balanced perspective where AI enhances rather than diminishes human creativity, emphasizing the complementary strengths of both.
References
[1] Blanchette, J. C., Kaliszyk, C., Paulson, L. C., & Urban, J. (2016). Hammering towards QED. Journal of Formalized Reasoning, 9(1), 101–148. https://doi.org/10.6092/issn.1972-5787/4593
[2] Bloom, T. F., & Mehta, B. (2023). On a density conjecture about unit fractions. arXiv preprint arXiv:2112.03726. https://arxiv.org/abs/2112.03726
[3] Gizzi, E., Nair, L., Chernova, S., & Sinapov, J. (2022). Creative problem solving in artificially intelligent agents: A survey and framework. Journal of Artificial Intelligence Research, 75. https://doi.org/10.48550/arXiv.2204.10358
[4] Gizzi, E., et al. (2023). Conceptual spaces and AI: Creative problem-solving in intelligent agents. CEUR Workshop Proceedings, 3432, 43–50. https://ceur-ws.org/Vol-3432/paper5.pdf
[5] Gowers, W. T., Green, B., Manners, F., & Tao, T. (2023). On a conjecture of Marton. arXiv preprint arXiv:2311.05762. https://arxiv.org/abs/2311.05762
[6] Halina, M. (2021). Insightful artificial intelligence. Mind and Language, 36(2), 315–329.
[7] Johansson, M., & Smallbone, N. (2023). Exploring Mathematical Conjecturing with Large Language Models. In NeSy (pp. 62–77). https://ceur-ws.org/Vol-3432/paper5.pdf
[8] Lieto, A., Perrone, F., Pozzato, G. L., & Chiodino, E. (2019). Beyond subgoaling: A dynamic knowledge generation framework for creative problem solving in cognitive architectures. Cognitive Systems Research, 58, 305–316.
[9] McCarthy, J. (1964). A tough nut for proof procedures.
[10] McCarthy, J. (1999). Creative solutions to problems. Paper presented at the AISB Workshop on Artificial Intelligence and Creativity.
[11] Olšák, M., Kaliszyk, C., & Urban, J. (2020). Property invariant embedding for automated reasoning. In ECAI 2020 (pp. 1395–1402). IOS Press. https://doi.org/10.3233/FAIA200244
[12] Schorlemmer, M., & Plaza, E. (2021). A uniform model of computational conceptual blending. Cognitive Systems Research, 65, 118–137.
[13] Starikova, I. (2024). Thought experiments in mathematics: From fiction to facts. In Handbook of the History and Philosophy of Mathematical Practice.
[14] Starikova, I., & Van Bendegem, J. P. (2021). Revisiting the mutilated chessboard or the many roles of a picture. Logique et Analyse, 255, 289–312.
[15] Urban, J., & Jakubův, J. (2020). First neural conjecturing datasets and experiments. In C. Benzmüller & B. Miller (Eds.), Intelligent Computer Mathematics. CICM 2020. Lecture Notes in Computer Science (vol 12236). Springer, Cham. https://doi.org/10.1007/978-3-030-53518-6_24
[16] Yu, & Song. (2019). Creative problem solving in intelligent agents: A framework and survey. arXiv preprint arXiv:2204.10358. https://arxiv.org/abs/2204.10358
Guillaume Baudart (IRIF, Université Paris Cité, Inria, CNRS)
Emilio Jesús Gallego Arias (IRIF, Université Paris Cité, Inria, CNRS)
Marc Lelarge (DI ENS, PSL University, Inria)
ABSTRACT. Thanks to recent advances in LLM capabilities, we believe natural language can serve as a universal interface for reasoning about formal proofs. In this work, we develop an LLM-based agent that can interact with the Rocq proof assistant. We present the following contributions: 1) Pétanque: a new fast and lightweight environment to interact with the Rocq theorem prover. 2) An interactive proof protocol leveraging natural language reasoning: hierarchical proof templating. 3) A search algorithm leveraging feedback from the ITP and natural language to rerank proof candidates. Using our method with state-of-the-art LLMs, we can successfully synthesize proofs for 46-58% of the first 100/260 lemmas from the newly published Busy Beaver proofs.
Bernhard Beckert (Karlsruhe Institute of Technology)
ABSTRACT. Along with the success of Large Language Models (LLMs) in other domains, a quickly developing body of research is investigating the use of LLMs to process program code. While much of this research focuses on code generation, our work explores the use of LLMs for generating (auxiliary) specifications. In contrast to pure code generation, where LLMs could produce buggy code, specification generation allows us to rigorously check consistency between pre-existing source code and generated annotations using theorem provers. To this end, we present our efforts to couple the deductive verification tool KeY with GPT to automate the process of annotating Java code with JML specifications.
Adam Pease (Naval Postgraduate School)
Angelos Toutsios (Naval Postgraduate School)
Roberto Milanese Jr. (Naval Postgraduate School)
Jarrad Singley (Naval Postgraduate School)
ABSTRACT.
1 Introduction
In recent years Large Language Models (LLMs) have improved significantly in their ability to ingest massive amounts of information and produce statistically relevant answers to queries. Unfortunately, this approach lacks formalism, and real reasoning is absent in LLM responses. Symbolic AI, on the other hand, is able to logically reason through a problem with concrete steps, but draws from a relatively small quantity of formal axioms and is thus limited to specific domains. The neuro-symbolic approach draws from the benefits of both and limits the disadvantages inherent in each. We present a system that can take natural language and translate sentences into higher-order formal axioms using an LLM (the code repository for our system is found on GitHub at https://github.com/ontologyportal/sumonlp). Automated theorem proving (ATP) is then used to ensure our growing knowledge base is consistent, and to answer questions from our knowledge base in a provable way. Human language is expressive and diverse, more so than can be expressed in First Order Logic [4]. Thus, to create our system, we use a large ontology called the Suggested Upper Merged Ontology (SUMO) [2], which contains over 230,000 axioms formalized in expressive higher-order logics. Axioms in SUMO are continuously being checked for consistency using the Vampire and E provers [1]. Our system is constantly improving, being able to successfully translate a growing range of sentence structures.
2 System Overview
The dataflow of our system is shown in Figure 1. The target user of our system is a domain expert composing policy. The user inputs a policy document, written in natural English, to a computer. The text is first sent to a Metaphor Handler, which uses an LLM to detect and replace metaphorical words and phrases. The result is sent to a Sentence Simplification module, a combination of heuristics and an LLM which rephrases the sentences into simpler grammatical structures, similar to the sentences on which our Language to Logic (L2L) translator was trained. Every word in the WordNet lexicon [3] is mapped to SUMO; however, even with such an expansive dataset, unknown words and organizations will be encountered. Rather than fail, the system attempts to produce logic by using Stanford's Stanza Named Entity Recognizer (NER) to detect the word type. The word is replaced with an appropriate label in the sentence, which our L2L translator has been trained to handle. After translation to logic is complete, the label is replaced with the original word. Thus, successful translation is still possible even when an unknown word is encountered.
Fig. 1. Language to logic architecture
2.1 Language to Logic Translation and Theorem Proving
The L2L translator is trained from a synthetic corpus of around 12 million sentences with their corresponding logic translations. Language-logic pairs are built by creating a frame structure for each sentence and filling in the slots of the frame from the contents of SUMO: the objects, processes and relationships. Because we understand the intended semantics of the frame, we are able to generate equivalent English and logic expressions of the frame. This is in contrast to methods that rely on untrained, non-learning methods, or on human manual generation of a training set. After translation to logic, the new axioms are added to the SUMO ontology.
The Vampire ATP is then tasked with finding any contradictions that might have been introduced. The user can then use SUMO to ask questions of the knowledge base. These questions are similarly translated to a formal conjecture, to which Vampire responds.
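As a rough illustration (ours; the slot names and SUMO terms are chosen for exposition and may differ from the actual generator), a frame instantiation might pair an English sentence with its logic translation like this:

```python
# Rough illustration of frame-based (English, logic) pair generation;
# the slot names and the exact SUO-KIF produced by the real system may differ.
def agent_process_frame(agent_word, agent_class, process_word, process_class):
    english = f"A {agent_word} {process_word}s."
    logic = (
        "(exists (?A ?P) (and "
        f"(instance ?A {agent_class}) "
        f"(instance ?P {process_class}) "
        "(agent ?P ?A)))"
    )
    return english, logic

print(agent_process_frame("dog", "Canine", "walk", "Walking"))
# ('A dog walks.',
#  '(exists (?A ?P) (and (instance ?A Canine) (instance ?P Walking) (agent ?P ?A)))')
```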
Guillaume Baudart (IRIF, Université Paris Cité, Inria, CNRS)
Emilio Jesús Gallego Arias (IRIF, Université Paris Cité, Inria, CNRS)
Marc Lelarge (DI ENS, PSL University, Inria)
ABSTRACT. In this work, we conduct an experiment using state-of-the-art LLMs to translate MiniF2F into Rocq. The translation task focuses on generating a Rocq theorem based on three sources: a natural language description, the Lean formalization, and the Isabelle formalization. We conducted our experiment in 3 stages of increasing complexity, from basic one-shot prompting to multi-turn conversations that incorporate feedback from unsuccessful attempts. At each stage, we perform multiple rounds of translation using increasingly advanced models: GPT-4o mini, Claude 3.5 Sonnet, o1 mini, and o1. We successfully translated 478 out of 488 theorems.
ABSTRACT. Recent advances in large language models have markedly influenced mathematical reasoning and automated theorem proving within artificial intelligence. Yet, despite their success in natural language tasks, these models face notable obstacles in formal theorem proving environments such as Lean and Isabelle, where exacting derivations must adhere to strict formal specifications. Even state-of-the-art models encounter difficulty generating accurate and complex formal proofs, revealing the unique blend of mathematical rigor required in this domain. In the DeepSeek-Prover series (V1 and V1.5), we have explored specialized methodologies aimed at addressing these challenges. This talk will delve into three foundational areas: the synthesis of training data through autoformalization, reinforcement learning that utilizes feedback from proof assistants, and test-time optimization using Monte Carlo tree search. I will also provide insights into current model capabilities, persistent challenges, and the future potential of large language models in automated theorem proving.
ABSTRACT. The possibility of covering an incomplete chessboard with dominos, without gaps or overlaps, is (in the case when two diagonally opposite corners of a standard board are removed) a classical example where the impossibility of a cover can be formally proven directly, but such a direct proof would in practice be too tedious for a human. A new idea, natural for a human but not immediate in a direct formalisation, settles the question at once: colour the chessboard squares in two colours, black and white, and notice that a domino always covers exactly one square of each colour, while unequal numbers of the two colours are present on the truncated board.
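The colouring argument is easy to mechanise; a few lines of Python (ours, for illustration) already expose the mismatch:

```python
# Mechanising the colour-counting argument (illustrative sketch).
def colour_counts(removed):
    """Count squares of each colour on an 8x8 board minus the `removed` squares."""
    counts = {0: 0, 1: 0}                      # colour = (row + col) % 2
    for row in range(8):
        for col in range(8):
            if (row, col) not in removed:
                counts[(row + col) % 2] += 1
    return counts

# Diagonally opposite corners (0,0) and (7,7) share a colour, so:
print(colour_counts({(0, 0), (7, 7)}))         # {0: 30, 1: 32}
# Each domino covers one square of each colour, hence no cover exists.
```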
Meanwhile, counting the _number_ of such coverings (when a cover is possible) is a classical problem in combinatorics and in statistical physics; for the full board, for example, it was solved by methods discovered by Kasteleyn, by Temperley and Fisher, and by Lieb in the 1960s. For small enough finite boards, counting can also be done by a computer through an exhaustive search, reproducing in those cases the more creative results achieved by humans (and answering by brute force, as a corollary, the questions of possibility or impossibility of covers enjoyed by philosophers and logicians).
With the rapid advances of machine learning and AI, we were curious how much of this would be accessible to present-day AI, and how creative it would be in tackling such problems. In my talk, and in a follow-up paper, I will discuss what we have found. The results were essentially disappointing: they exhibited little mathematically valid creativity, and rather displayed known problems of foundation models such as hallucination and false confidence (and, under the hood, apparently very large formal VC dimensions of the transformer models compared to the training set sizes).
Some highlights of what we found. An LLM of the ChatGPT kind was able to correctly inform us of the impossibility of covering with dominos a standard chessboard with diagonally opposite corners removed, to write a Lean program making this explanation formal, and to provide us with the correct count of the number of coverings of the full board. But those results were essentially in the training set, and did not require much creativity.
With a small modification, the corners now removed on the same side, we found that what had so far been learned by the AI was quickly abandoned. We were confidently told that a covering is still impossible, for the same reason of a parity mismatch, although in this case, unlike the previous one, the black and white counts are equal; in fact a direct search easily (for a computer) finds a large (for a human) number, 2436304, of possible inequivalent covers. Moreover, we were nevertheless offered Lean code "proving" that such a cover was _impossible_, with the caveat that the theorem coded in Lean used the miracle procedure "sorry" as a placeholder for "the proof is not yet complete."
We proceeded to interrogate the AI on coverings of full or truncated boards by longer 1 x n dominos. In some sufficiently simple cases, for example when the number of squares remaining on the board was not divisible by n, or when n was larger than either of the board's dimensions, the impossibility of covers was correctly established by the AI. Yet in cases where there was no immediately obvious reason either way, the feedback was typically not mathematically sound (though linguistically plausible).
Further, we explored whether, in the case of 1 x 3 domino covers, the AI would try a new creative element. It seems natural that a human would try colouring the board in 3 colours in this case, generalising the success of the black-and-white colouring for the 1 x 2 domino problems. In fact, 3 colours do help in some of the 1 x 3 covering problems. However, we did not find signs that the AI would explore this (at least when not directly prompted).
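For reference, the 3-colouring idea can be checked mechanically in the same way (our sketch, not the AI's output): colouring square (row, col) with (row + col) mod 3 makes every horizontal or vertical 1 x 3 piece cover one square of each colour, so equal colour counts are a necessary condition for a cover.

```python
# Necessary condition for covering a truncated 8x8 board with 1x3 pieces
# (illustrative sketch): with colour (row + col) % 3, every horizontal or
# vertical 1x3 piece covers one square of each colour, so the three colour
# counts of the remaining squares must be equal.
def colour3_counts(removed, size=8):
    counts = [0, 0, 0]
    for row in range(size):
        for col in range(size):
            if (row, col) not in removed:
                counts[(row + col) % 3] += 1
    return counts

print(colour3_counts({(0, 0)}))   # [20, 22, 21]: unequal, so removing one
                                  # corner leaves a board with no 1x3 cover,
                                  # even though 63 is divisible by 3.
```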
We hope that our investigation can contribute to the discussion of the meaning of AI and its role in mathematical proofs.