Abstract: In this talk, we explore the transformative potential of custom reinforcement learning (RL) algorithms in accelerating solutions to complex, research-level mathematical challenges. We begin by illustrating how these algorithms have achieved a 10X improvement in areas where previous advances of the same magnitude required many decades. A comparative analysis of different network architectures is presented to highlight their performance in this context. We then delve into the application of RL algorithms to exceptionally demanding tasks, such as those posed by the Millennium Prize problems and the smooth Poincaré conjecture in four dimensions. Drawing on our experiences, we discuss the prerequisites for developing new RL algorithms and architectures that are tailored to these high-level challenges. This talk is largely based on recent work with A. Shehper, A. Medina-Mardones, L. Fagan, B. Lewandowski, A. Gruen, Y. Qiu, P. Kucharski, and Z. Wang.
Bio: After receiving his PhD from Princeton University, Sergei Gukov spent five years at Harvard University as a research fellow of the Clay Mathematics Institute and two years at the School of Mathematics at the Institute for Advanced Study. His passion is building new bridges between different areas of mathematical physics and pure mathematics, such as quantum topology, mirror symmetry, and gauge theory. His more recent interests involve new connections between mathematics and machine learning.
Cameron Freer (Massachusetts Institute of Technology, United States) Alexander Lew (Massachusetts Institute of Technology, United States) Timothy O'Donnell (McGill University, Mila – Québec AI Institute, Canada) Vikash Mansinghka (Massachusetts Institute of Technology, United States)
Autoformalization and autoinformalization via probabilistic reasoning
ABSTRACT. Recent advances in the control of large language models via probabilistic programming allow us to imagine new workflows in theorem proving and formal mathematics. In order to spark a broader conversation about the role of probabilistic reasoning, we propose several speculative approaches: a reweighted wake-sleep algorithm for improving model quality, an involutive approach to autoformalization and autoinformalization, and a coarse-to-fine approach to proof synthesis.
Frontier of Formal Theorem Proving with Large Language Models: Insights from the DeepSeek-Prover
ABSTRACT. Recent advances in large language models have markedly influenced mathematical reasoning and automated theorem proving within artificial intelligence. Yet, despite their success in natural language tasks, these models face notable obstacles in formal theorem proving environments such as Lean and Isabelle, where exacting derivations must adhere to strict formal specifications. Even state-of-the-art models encounter difficulty generating accurate and complex formal proofs, revealing the unique blend of mathematical rigor required in this domain. In the DeepSeek-Prover series (V1 and V1.5), we have explored specialized methodologies aimed at addressing these challenges. This talk will delve into three foundational areas: the synthesis of training data through autoformalization, reinforcement learning that utilizes feedback from proof assistants, and test-time optimization using Monte Carlo tree search. I will also provide insights into current model capabilities, persistent challenges, and the future potential of large language models in automated theorem proving.
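The test-time search component mentioned above can be illustrated in miniature. The sketch below is not DeepSeek-Prover code; it is a plain UCB-style Monte Carlo tree search over a toy stand-in for a proof assistant, in which a goal closes only under certain known tactic lines. All tactic names and the environment are invented for illustration.

```python
import math
import random

# Toy stand-in for a proof assistant: a goal is closed exactly when the
# tactic sequence matches one of the known winning lines.
WINNING_LINES = {("induction", "simp"), ("induction", "ring")}
TACTICS = ["intro", "induction", "simp", "ring"]
MAX_DEPTH = 2

def is_proved(seq):
    return tuple(seq) in WINNING_LINES

class Node:
    def __init__(self, seq, parent=None):
        self.seq, self.parent = seq, parent
        self.children, self.visits, self.value = [], 0, 0.0

    def ucb(self, c=1.4):
        # Upper confidence bound: unvisited nodes are explored first.
        if self.visits == 0:
            return float("inf")
        exploit = self.value / self.visits
        explore = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore

def rollout(seq):
    # Random playout: extend the tactic sequence until proved or too deep.
    seq = list(seq)
    while len(seq) < MAX_DEPTH and not is_proved(seq):
        seq.append(random.choice(TACTICS))
    return 1.0 if is_proved(seq) else 0.0

def mcts(iterations=200, seed=0):
    random.seed(seed)
    root = Node([])
    for _ in range(iterations):
        # Selection: descend by UCB until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=lambda n: n.ucb())
        # Expansion: grow the tree under previously visited, open leaves.
        if node.visits > 0 and len(node.seq) < MAX_DEPTH and not is_proved(node.seq):
            node.children = [Node(node.seq + [t], node) for t in TACTICS]
            node = node.children[0]
        # Simulation and backpropagation.
        reward = rollout(node.seq)
        while node:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Extract the most-visited path as the best tactic sequence.
    node = root
    while node.children:
        node = max(node.children, key=lambda n: n.visits)
    return node.seq
```

In the real setting, the random rollout would be replaced by the prover model's policy and the mock environment by Lean feedback; the search skeleton stays the same.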
ABSTRACT. We introduce an ML-ready dataset for dependently-typed program-proofs in the Agda proof assistant, and present the underlying extraction tool. We then demonstrate how a bidirectional Transformer encoder retrofitted with structure-specific adjustments can achieve high performance in a realistic premise selection setup. Unlike most go-to LLM architectures, both the dataset and the neural model work in tandem to maximally preserve the typing information offered by Agda. The resulting pipeline is agnostic and impervious to most syntactic confounds, producing its inferences on the basis of type-structure alone. We provide empirical evidence that this makes for a general, robust, and efficient modeling approach.
ABSTRACT. Abstraction logic is a minimal yet maximally expressive logical system that serves both as a logic and a logical framework. While existing logics often trade simplicity for expressiveness, abstraction logic achieves both, making it an ideal foundation for machine-learning approaches to theorem proving. Its simplicity enables ML systems to manipulate and adapt it easily, while its expressiveness ensures no practical limitations. We present the core principles of abstraction logic and discuss its unique characteristics that make it well suited for AITP (artificial intelligence theorem proving).
The Generative AI Laboratory (GAIL) at the University of Edinburgh is a centre for excellence dedicated to researching all aspects of generative artificial intelligence (AI) in society. Uniting the diverse research expertise across the University with generative AI at its core, GAIL taps into a thriving AI landscape with recognised strengths in natural language processing, machine learning, and data-driven innovation.
This interdisciplinary initiative focuses on advancing generative AI techniques in the key areas of future health and care, climate and sustainability, and economic growth, and is committed to innovating in generative AI through partnerships with global stakeholders, driving new collaborations and dynamic projects across government, industry, and the public sector.
To learn more about GAIL, see here: https://gail.ed.ac.uk/
Contributed Talks devoted to Large Language Models for translating between different provers, proof formats, and mathematical proofs written in natural language
MiniF2F in Rocq: Automatic Translation Between Proof Assistants — A Case Study
ABSTRACT. In this work, we conduct an experiment using state-of-the-art LLMs to translate MiniF2F into Rocq. The translation task focuses on generating a Rocq theorem based on three sources: a natural language description, the Lean formalization, and the Isabelle formalization. We conducted our experiment in three stages of increasing complexity, from basic one-shot prompting to multi-turn conversations that incorporate feedback from unsuccessful attempts. At each stage, we perform multiple rounds of translation using increasingly advanced models: GPT-4o mini, Claude 3.5 Sonnet, o1-mini, and o1. We successfully translated 478 out of 488 theorems.
NLIR: Natural Language Intermediate Representation for Mechanized Theorem Proving
ABSTRACT. Thanks to recent advances in LLM capabilities, we believe natural language can serve as a universal interface for reasoning about formal proofs. In this work, we develop an LLM-based agent that can interact with the Rocq proof assistant. We present the following contributions: 1) Pétanque: a new fast and lightweight environment to interact with the Rocq theorem prover. 2) An interactive proof protocol leveraging natural language reasoning: hierarchical proof templating. 3) Search algorithms leveraging feedback from the ITP and natural language to rerank proof candidates. Using our method with state-of-the-art LLMs, we can successfully synthesize proofs for 46-58% of the first 100/260 lemmas from the newly published Busy Beaver proofs.
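The reranking idea in contribution 3) can be sketched as follows. This is not the NLIR implementation: the prover and the candidate list below are hypothetical mocks standing in for Pétanque and for LLM-generated tactics, and the scoring rule is one plausible choice.

```python
def mock_prover(goal, tactic):
    """Stand-in for the ITP: returns (remaining_goals, error_message)."""
    table = {
        ("a + b = b + a", "ring"): ([], None),
        ("a + b = b + a", "reflexivity"): (None, "Unable to unify"),
        ("a + b = b + a", "induction a"): (
            ["0 + b = b + 0", "S a + b = b + S a"], None),
    }
    return table.get((goal, tactic), (None, "Unknown tactic"))

def rerank(goal, candidates):
    """Order candidate tactics by prover feedback: failing tactics last,
    tactics leaving fewer open goals first."""
    scored = []
    for tactic in candidates:
        goals, error = mock_prover(goal, tactic)
        score = float("inf") if error else len(goals)
        scored.append((score, tactic))
    scored.sort(key=lambda pair: pair[0])
    return [tactic for _, tactic in scored]

best = rerank("a + b = b + a", ["reflexivity", "induction a", "ring"])[0]
print(best)  # → ring
```

The error messages themselves could additionally be fed back to the LLM in natural language, which is the kind of interaction loop the abstract describes.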
Abstract: AI for Mathematics (AI4Math) is not only intriguing intellectually but also crucial for AI-driven discovery in science, engineering, and beyond. In the recent past, there have been many efforts on AI4Math that mirror techniques in NLP, in particular, by training large language models on textual math datasets. As a complementary yet less explored avenue, formal mathematical reasoning is grounded in formal systems such as proof assistants, which can verify the correctness of reasoning and provide automatic feedback. In this talk, we advocate for formal mathematical reasoning as a task for LLMs and argue that it is indispensable for advancing AI4Math to the next level. In recent years, we have seen steady progress in using AI to perform formal reasoning, including core tasks such as theorem proving and autoformalization, as well as emerging applications such as verifiable generation of code and hardware designs. However, significant challenges remain to be solved for AI to truly master mathematics and achieve broader impact. We summarize existing progress, discuss open challenges, and envision critical milestones to measure future success. We end by calling on the research community to come together to drive transformative advancements in this field.
Bio: Swarat Chaudhuri is a Professor of Computer Science and the director of the Trishul laboratory at UT Austin, as well as a Visiting Researcher at Google DeepMind, London. His research lies at the interface of machine learning, programming languages, and formal methods. Through a synthesis of ideas from these areas, he seeks to develop a new generation of intelligent systems that are reliable, transparent, and secure by construction and can solve mathematical reasoning, scientific discovery, and system engineering tasks beyond the scope of contemporary AI. Dr. Chaudhuri has received the NSF CAREER award, the ACM SIGPLAN John Reynolds Dissertation award, Meta and Google Research awards, and several ACM SIGPLAN and SIGSOFT distinguished paper awards. He also served as a Program Chair for CAV 2016 and ICLR 2024.
Contributed Talks devoted to applications of Large Language Models in formal verification for software, hardware, and systems engineering, including industrial applications
Aleksandr Shefer (JetBrains Research, Constructor University Bremen, Germany) Igor Engel (JetBrains Research, Constructor University Bremen, Germany) Stanislav Alekseev (JetBrains Research, Neapolis University Pafos, Cyprus) Daniil Berezun (JetBrains Research, Netherlands) Ekaterina Verbitskaia (JetBrains Research, Constructor University Bremen, Netherlands) Anton Podkopaev (JetBrains Research, Constructor University Bremen, Netherlands)
Are LLMs Ready for Software Verification?
ABSTRACT. Despite their ability to guarantee reliable software, formal methods have hardly been adopted in mainstream programming. With the advent of large language models, it becomes more feasible to automatically generate code along with verification guarantees. This research explores whether LLMs can produce verified code from a textual description and a partial specification. We achieved a 63% success rate in Nagini and 40% in Verus on the HumanEval benchmark.
16:30
Samuel Teuber (Karlsruhe Institute of Technology, Germany) Bernhard Beckert (Karlsruhe Institute of Technology, Germany)
Towards LLM-support for Deductive Verification of Java Programs
ABSTRACT. Along with the success of Large Language Models (LLMs) in other domains, a quickly developing body of research is investigating the use of LLMs to process program code. While much of this research focuses on code generation, our work explores the use of LLMs for generating (auxiliary) specifications. In contrast to pure code generation, where LLMs could produce buggy code, specification generation allows us to rigorously check consistency between pre-existing source code and generated annotations through the use of theorem provers.
To this end, we present our efforts to couple the deductive verification tool KeY with GPT to automate the process of annotating Java code with JML specifications.
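The coupling described above amounts to a generate-and-check loop. The sketch below is a hypothetical mock, not the KeY/GPT pipeline: the candidate JML annotations stand in for LLM output, and the verifier is reduced to a toy acceptance check.

```python
# Candidate JML postconditions for `int max(int a, int b)`, standing in
# for LLM-generated annotations (invented for illustration).
CANDIDATES = [
    "//@ ensures \\result == a;",                    # wrong: ignores b
    "//@ ensures \\result >= a && \\result >= b;",   # a correct postcondition
]

def mock_key_verifier(method, spec):
    """Stand-in for the deductive verifier: accepts only specs that
    bound the result by both arguments."""
    return "\\result >= a" in spec and "\\result >= b" in spec

def annotate(method, candidates):
    """Try proposed JML specs in order until the verifier accepts one;
    rejected candidates would, in the real loop, prompt a retry."""
    for spec in candidates:
        if mock_key_verifier(method, spec):
            return spec
    return None

print(annotate("int max(int a, int b)", CANDIDATES))
```

The essential point survives the mock: because the theorem prover arbitrates, a hallucinated specification is rejected rather than silently trusted.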
The need for ethical guidelines in mathematical research in the time of generative AI
ABSTRACT. Generative artificial intelligence (AI) applications based on large language models have not enjoyed much success in symbolic processing and reasoning tasks, thus making them of little use in mathematical research. Recently, however, DeepMind's AlphaProof and AlphaGeometry 2 applications have been reported to perform well in mathematical problem solving. These applications are hybrid systems combining large language models with rule-based systems, an approach sometimes called neuro-symbolic AI. In this paper, I present a scenario in which such systems are used in research mathematics, more precisely in theorem proving. In the most extreme case, such a system could be an autonomous automated theorem prover (AATP), with the potential of proving new humanly interesting theorems and even presenting them in research papers. The use of such AI applications would be transformative to mathematical practice and demand clear ethical guidelines. In addition to that scenario, I identify other, less radical, uses of generative AI in mathematical research. I analyse how guidelines set for ethical AI use in scientific research can be applied in the case of mathematics, arguing that while there are many similarities, there is also a need for mathematics-specific guidelines.
17:20
Irina Starikova (Institute of Philosophy, Slovak Academy of Sciences, Slovakia)
Mathematics and Creative AI
ABSTRACT. Artificial Intelligence (AI) is increasingly contributing to mathematics, from assisting in proof verification to uncovering patterns in complex structures. However, AI's role in creative problem-solving remains an open question. This paper examines the strengths and limitations of AI in mathematics, focusing on whether AI can engage in genuinely creative reasoning. We explore computational creativity through the lens of John McCarthy’s definition—where creativity involves introducing new concepts not present in the problem statement—and the functional model of creative generation and evaluation.
We illustrate this with the Mutilated Chessboard Problem, a classic combinatorial challenge that requires conceptual insight. AI-generated responses to this problem, while well-formed, fail to demonstrate the underlying creative reasoning, highlighting AI’s current limitations. The discussion extends to conceptual blending, which models human creativity as the integration of different conceptual spaces. While AI systems can merge data-driven insights, they still struggle with flexible abstraction and deep conceptual innovation.
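The parity argument that the Mutilated Chessboard Problem turns on can be stated in a few lines of code. The insight itself, colouring the squares, is exactly what McCarthy's definition highlights: a concept not present in the problem statement. The check below only verifies the counting once that concept is supplied.

```python
# Colouring argument for the mutilated chessboard: every domino covers
# one dark and one light square, so a region with unequal colour counts
# admits no perfect domino tiling. Removing two opposite corners removes
# two squares of the same colour.
board = {(r, c) for r in range(8) for c in range(8)}
mutilated = board - {(0, 0), (7, 7)}  # opposite corners, both dark

dark = sum((r + c) % 2 == 0 for r, c in mutilated)
light = len(mutilated) - dark

print(dark, light)  # 30 dark vs 32 light: no tiling by 31 dominoes exists
```

A brute-force tiling search would also decide the instance, but only the colouring argument explains it, which is the distinction the paper draws between producing an answer and creative reasoning.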
Philosophically, AI’s role in mathematical discovery raises critical questions about the nature of creativity and co-creativity between humans and machines. While AI excels at computation and pattern recognition, human mathematicians provide intuition, abstraction, and the ability to work with ambiguity. Rather than replacing human insight, AI offers a powerful collaborative tool for expanding mathematical knowledge. This paper provides suggestions for a balanced perspective where AI enhances rather than diminishes human creativity, emphasizing the complementary strengths of both.
References
[1] Blanchette, J. C., Kaliszyk, C., Paulson, L. C., & Urban, J. (2016). Hammering towards QED. Journal of Formalized Reasoning, 9(1), 101–148. https://doi.org/10.6092/issn.1972-5787/4593
[2] Bloom, T. F., & Mehta, B. (2023). On a density conjecture about unit fractions. arXiv preprint arXiv:2112.03726. Retrieved from https://arxiv.org/abs/2112.03726
[3] Gizzi, E., Nair, L., Chernova, S., & Sinapov, J. (2022). Creative problem solving in artificially intelligent agents: A survey and framework. Journal of Artificial Intelligence Research, 75. https://doi.org/10.48550/arXiv.2204.10358
[4] Gizzie, E., et al. (2023). Conceptual spaces and AI: Creative problem-solving in intelligent agents. CEUR Workshop Proceedings, 3432, 43-50. Retrieved from https://ceur-ws.org/Vol-3432/paper5.pdf
[5] Gowers, W. T., Green, B., Manners, F., & Tao, T. (2023). On a conjecture of Marton. arXiv preprint arXiv:2311.05762. Retrieved from https://arxiv.org/abs/2311.05762
[6] Halina, M. (2021). Insightful artificial intelligence. Mind and Language, 36(2), 315-329.
[7] Johansson, M., & Smallbone, N. (2023, January). Exploring Mathematical Conjecturing with Large Language Models. In NeSy (pp. 62-77). Retrieved from https://ceur-ws.org/Vol-3432/paper5.pdf
[8] Lieto, A., Perrone, F., Pozzato, G. L., & Chiodino, E. (2019). Beyond subgoaling: A dynamic knowledge generation framework for creative problem solving in cognitive architectures. Cognitive Systems Research, 58, 305-316.
[9] McCarthy, J. (1964). A tough nut for proof procedures.
[10] McCarthy, J. (1999). Creative solutions to problems. Paper presented at the AISB Workshop on Artificial Intelligence and Creativity.
[11] Olšák, M., Kaliszyk, C., & Urban, J. (2020). Property invariant embedding for automated reasoning. In ECAI 2020 (pp. 1395-1402). IOS Press. https://doi.org/10.3233/FAIA200244
[12] Schorlemmer, M., & Plaza, E. (2021). A uniform model of computational conceptual blending. Cognitive Systems Research, 65, 118-137.
[13] Starikova, I. (2024). Thought experiments in mathematics: From fiction to facts. In Handbook of the History and Philosophy of Mathematical Practice.
[14] Starikova, I., & Van Bendegem, J. P. (2021). Revisiting the mutilated chessboard or the many roles of a picture. Logique et Analyse, 255, 289-312.
[15] Urban, J., & Jakubův, J. (2020). First neural conjecturing datasets and experiments. In C. Benzmüller & B. Miller (Eds.), Intelligent Computer Mathematics. CICM 2020. Lecture Notes in Computer Science (vol 12236). Springer, Cham. https://doi.org/10.1007/978-3-030-53518-6_24