Contributed Talks devoted to Machine Learning and Generative AI in automating proof search and proof construction, in the context of Automated Theorem Proving.
Automated Theorem Provers as the Hub of the AI Math Ecosystem
ABSTRACT. I describe the strengths and weaknesses of various AI techniques with respect to mathematical reasoning and propose a hybrid architecture for future mathematical AI systems to combine the best features of the different approaches.
In search of simplicity: a case study in ML- and LLM-assisted applications of ATP in mathematics
ABSTRACT. We discuss opportunities for integrating ATP, ML, and LLMs in the context of exploring the well-known Andrews-Curtis conjecture in combinatorial group theory, and report on initial experiments.
ABSTRACT. In recent years, LLMs have been increasingly applied to mathematical and reasoning tasks.
These models require significant amounts of training, yet are still capable of dreaming up incorrect solutions or struggling with simple algebraic tasks.
For this reason, we investigate methods to improve the performance of models on symbolic reasoning tasks.
Term rewriting systems (TRS) offer an intuitive, reliable, verifiable, and Turing-complete reasoning framework over formal languages.
With this work, we want to explore the potential of combining TRS with reinforcement learning (RL).
We introduce treewrite, a Python library designed to facilitate the integration of TRS with neural networks, enabling efficient problem-solving through RL.
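As a flavour of the ingredients involved, here is a minimal sketch of our own (the names and the rule encoding are invented; treewrite's actual API is not shown): a two-rule TRS for Peano addition, normalised by leftmost-outermost rewriting, with a naive step-count reward of the kind an RL agent could be trained against.

RULES = [
    (("add", "?x", ("zero",)), "?x"),                                # x + 0 -> x
    (("add", "?x", ("succ", "?y")), ("succ", ("add", "?x", "?y"))),  # x + S y -> S (x + y)
]

def match(pattern, term, env):
    """Syntactic matching; pattern variables are strings starting with '?'."""
    if isinstance(pattern, str) and pattern.startswith("?"):
        env[pattern] = term
        return env
    if (isinstance(pattern, tuple) and isinstance(term, tuple)
            and len(pattern) == len(term) and pattern[0] == term[0]):
        for p, t in zip(pattern[1:], term[1:]):
            if match(p, t, env) is None:
                return None
        return env
    return None

def substitute(rhs, env):
    """Instantiate the right-hand side of a rule under the bindings in env."""
    if isinstance(rhs, str):
        return env[rhs] if rhs.startswith("?") else rhs
    return tuple([rhs[0]] + [substitute(r, env) for r in rhs[1:]])

def step(term):
    """One leftmost-outermost rewrite step; None if term is in normal form."""
    for pat, rhs in RULES:
        env = match(pat, term, {})
        if env is not None:
            return substitute(rhs, env)
    if isinstance(term, tuple):
        for i in range(1, len(term)):
            new = step(term[i])
            if new is not None:
                return term[:i] + (new,) + term[i + 1:]
    return None

term = ("add", ("succ", ("zero",)), ("succ", ("zero",)))  # 1 + 1
steps = 0
while (nxt := step(term)) is not None:
    term, steps = nxt, steps + 1
print(term, "reward:", -steps)  # ('succ', ('succ', ('zero',))), reward -2

Here the reward simply penalises the number of rewrite steps; a real RL setup would let the agent choose which rule to apply where, rewarding it for reaching a normal form quickly.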
Formalizing Natural Language: Cultivating LLM Translations Using Automated Theorem Proving
ABSTRACT. 1 Introduction
In recent years Large Language Models (LLMs) have improved significantly in their ability to ingest massive amounts of information and produce statistically relevant answers to queries. Unfortunately, this approach lacks formalism, and real reasoning is absent from LLM responses. Symbolic AI, on the other hand, can logically reason through a problem in concrete steps, but draws from a relatively small quantity of formal axioms and is thus limited to specific domains. The neuro-symbolic approach draws on the benefits of both and limits the disadvantages inherent in each. We present a system that takes natural language and translates sentences into higher-order formal axioms using an LLM (the code repository for our system is found on GitHub at https://github.com/ontologyportal/sumonlp). Automated theorem proving (ATP) is then used to ensure our growing knowledge base is consistent, and to answer questions from the knowledge base in a provable way.
Human language is expressive and diverse, more so than can be expressed in First-Order Logic [4]. To create our system, we therefore use a large ontology, the Suggested Upper Merged Ontology (SUMO) [2], which contains over 230,000 axioms formalized in expressive higher-order logics. Axioms in SUMO are continuously checked for consistency using the Vampire and E provers [1]. Our system is continually improving and can successfully translate a growing range of sentence structures.
2 System Overview
The dataflow of our system is shown in Figure 1. The target user of our system is a domain expert composing policy. The user inputs a policy document, written in natural English, to a computer. The text is first sent to a Metaphor Handler, which uses an LLM to detect and replace metaphorical words and phrases. The result is sent to a Sentence Simplification module, a combination of heuristics and an LLM that rephrases the sentences into simpler grammatical structures, similar to the sentences on which our Language to Logic (L2L) translator was trained.
Every word in the WordNet lexicon [3] is mapped to SUMO; however, even with such an expansive dataset, the system will encounter unknown words and organizations. Rather than fail, the system attempts to produce logic by using Stanford's Stanza Named Entity Recognizer (NER) to detect the word type. The word is replaced with an appropriate label in the sentence, which our L2L translator has been trained to handle. After translation to logic is complete, the label is replaced with the original word. Thus, successful translation is still possible even when an unknown word is encountered.
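A minimal sketch of this mask-and-restore strategy, using Stanza's NER (the label format, the known-word check, and the translate_to_logic stub are our assumptions, not the system's actual code):

import stanza

# stanza.download("en")  # fetch the English models on first run
nlp = stanza.Pipeline(lang="en", processors="tokenize,ner")

def translate_to_logic(sentence):
    return f'(proposition "{sentence}")'  # stand-in for the trained L2L model

def mask_unknown_entities(sentence, known_words):
    """Replace unknown named entities with labels the L2L model can handle."""
    doc = nlp(sentence)
    replacements = {}
    for i, ent in enumerate(doc.ents):
        if ent.text not in known_words:
            label = f"UNK_{ent.type}_{i}"  # e.g. UNK_ORG_0, UNK_PERSON_1
            replacements[label] = ent.text
            sentence = sentence.replace(ent.text, label)
    return sentence, replacements

masked, repl = mask_unknown_entities("Acme Corp hired Jane Doe.", set())
logic = translate_to_logic(masked)
for label, original in repl.items():  # restore the original words in the logic
    logic = logic.replace(label, original)
print(logic)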
Fig. 1. Language to logic architecture
2.1 Language to Logic Translation and Theorem Proving
The L2L translator is trained on a synthetic corpus of around 12 million sentences with their corresponding logic translations. Language-logic pairs are built by creating a frame structure for each sentence and filling the slots of the frame from the contents of SUMO: its objects, processes, and relationships. Because we understand the intended semantics of the frame, we can generate equivalent English and logic expressions of the frame. This contrasts with methods that rely on untrained, non-learning techniques or on manual human generation of a training set.
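As a toy illustration of this frame-filling idea (the frame, slots, and fillers below are invented for the example; the real system draws them from SUMO and WordNet):

import random

FRAME = {
    "en":    "{agent} {verb_past} the {patient}.",
    "logic": "(exists (?E) (and (instance ?E {process}) "
             "(agent ?E {agent_c}) (patient ?E {patient_c})))",
}
FILLERS = [
    {"agent": "Mary", "agent_c": "Mary", "verb_past": "drove",
     "process": "Driving", "patient": "car", "patient_c": "Car"},
    {"agent": "John", "agent_c": "John", "verb_past": "cooked",
     "process": "Cooking", "patient": "meal", "patient_c": "Meal"},
]

def sample_pair():
    """Fill both sides of the frame from the same slot assignment."""
    slots = random.choice(FILLERS)
    return FRAME["en"].format(**slots), FRAME["logic"].format(**slots)

sentence, logic = sample_pair()
# e.g. "Mary drove the car."  ->
# (exists (?E) (and (instance ?E Driving) (agent ?E Mary) (patient ?E Car)))

Because sentence and logic are generated from the same slot assignment, the pair is correct by construction, which is what makes large-scale synthetic training data possible.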
After translation to logic, the new axioms are added to the SUMO ontology. The Vampire ATP is then tasked with finding any contradictions that might have been introduced. The user can then use SUMO to ask questions of the knowledge base. These questions are similarly translated to formal conjectures, to which Vampire responds.
ABSTRACT. When proving an inductive problem, we often prove auxiliary lemmas
that are useful for proving the original problem.
If these auxiliary lemmas are themselves challenging, we must introduce further lemmas to prove them.
To automate such multi-step conjecturing, we developed Abduction Prover.
Given a proof goal, Abduction Prover conjectures a series of lemmas and attempts to prove the original goal using these lemmas.
Our working prototype of Abduction Prover for Isabelle/HOL is publicly available on GitHub.
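To illustrate the phenomenon the tool targets (our own toy example, rendered in Lean 4 rather than Isabelle/HOL): proving that reversing a list twice is the identity stalls under induction until an auxiliary lemma relating reverse and append is conjectured and proved first.

-- Illustrative only: the induction in `reverse_reverse'` needs the
-- auxiliary lemma `reverse_append'`, which must be conjectured first.
theorem reverse_append' {α : Type} (xs ys : List α) :
    (xs ++ ys).reverse = ys.reverse ++ xs.reverse := by
  induction xs with
  | nil => simp
  | cons x xs ih =>
    rw [List.cons_append, List.reverse_cons, ih, List.reverse_cons,
        List.append_assoc]

theorem reverse_reverse' {α : Type} (l : List α) : l.reverse.reverse = l := by
  induction l with
  | nil => rfl
  | cons x xs ih => simp [List.reverse_cons, reverse_append', ih]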
ABSTRACT. Generating useful and novel lemmas is a challenging task for LLMs, as well as for previous symbolic approaches.
We present ongoing work on neuro-symbolic lemma conjecturing, combining Large Language Models (LLMs) and symbolic methods.
We train an LLM to generate lemma templates that describe the shape of a lemma, and symbolic methods are used to fill in the details.
The aim of our neuro-symbolic tool is to avoid the repetitive and incorrect lemmas typical of purely LLM-based generation, while producing more varied output lemmas than our purely symbolic methods. By leveraging the best of both symbolic and neural methods, we can generate useful lemmas for a wide range of input domains, facilitating computer-assisted theory development and formalization.
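A minimal sketch of the symbolic half of such a pipeline, under our own toy assumptions (an associativity-shaped template over a three-symbol "theory", filtered by counterexample testing before any proof attempt; the tester hardcodes the template's shape for brevity):

import random

# An LLM might propose a template fixing the shape of a lemma ...
template = "{f}({f}(x, y), z) == {f}(x, {f}(y, z))"  # associativity shape

# ... and a symbolic loop instantiates its holes with theory symbols.
theory = {"add": lambda a, b: a + b,
          "sub": lambda a, b: a - b,
          "mul": lambda a, b: a * b}

def survives_testing(f, trials=500):
    """Keep an instance only if random inputs find no counterexample."""
    for _ in range(trials):
        x, y, z = (random.randint(-99, 99) for _ in range(3))
        if f(f(x, y), z) != f(x, f(y, z)):
            return False  # counterexample found: discard this instance
    return True

candidates = [name for name, f in theory.items() if survives_testing(f)]
print(candidates)  # expect ['add', 'mul']; 'sub' is refuted by testing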
Contributed Talks. Can Generative AI improve user experience? This session discusses extensions to theorem proving user interfaces that incorporate Large Language Models and other Machine Learning assistance.
DeepIsaHOL progress report: current machine learning for the Isabelle proof assistant
ABSTRACT. I report on the progress of the DeepIsaHOL project for doing modern machine learning (ML) on data from the Isabelle interactive theorem prover (ITP). The project is ongoing and focused on creating infrastructure for ML experiments with Isabelle data, such as fine-tuning large language models (LLMs) or turning Isabelle into a reinforcement learning (RL) environment. The project's ultimate objective is to provide tools based on these experiments that aid Isabelle users in the proving process. This report describes the project's current state and the tools it has generated, including a generic data extraction algorithm for Isabelle's theory files, its Scala and Python interfaces, simple read-eval-print loops (REPLs) in Python and Scala for interacting with Isabelle programmatically, and initial use of the framework's data for training Google's FLAN-T5-small LLM. The project remains open and welcomes ITP and ML enthusiasts to collaborate on it.
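To make the training step concrete, here is a minimal sketch of our own (not the project's code; the (proof state, next step) pair format is an assumption) using the Hugging Face transformers API:

from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)
from datasets import Dataset

pairs = [  # toy stand-ins for pairs extracted from Isabelle theories
    {"state": 'lemma "rev (rev xs) = xs"', "step": "by (induct xs) auto"},
]
tok = AutoTokenizer.from_pretrained("google/flan-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

def encode(ex):
    # encode the proof state as input and the proof step as target labels
    enc = tok(ex["state"], truncation=True, max_length=512)
    enc["labels"] = tok(text_target=ex["step"], truncation=True,
                        max_length=128)["input_ids"]
    return enc

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="flan_t5_isabelle",
                                  num_train_epochs=1),
    train_dataset=Dataset.from_list(pairs).map(
        encode, remove_columns=["state", "step"]),
    data_collator=DataCollatorForSeq2Seq(tok, model=model),
)
trainer.train()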
14:30
Gleb Solovev (JetBrains Research, Constructor University Bremen, Germany) Nikita Khramov (JetBrains Research, Constructor University Bremen, Germany) Andrei Kozyrev (JetBrains Research, Constructor University Bremen, Germany) Anton Podkopaev (JetBrains Research, Constructor University Bremen, Netherlands)
CoqPilot benchmarking framework
ABSTRACT. We present the CoqPilot benchmarking framework, tailored for conducting experiments involving LLMs and other techniques in the Coq proof-generation domain. The framework allows seamless setup of experiment pipelines while ensuring reliable and efficient execution, making it suitable both for small preliminary studies and for large-scale experiments. In addition to describing the benchmarking framework and its features, we present experimental results obtained with it.
15:00
Job Petrovčič (Faculty of Mathematics and Physics, University of Ljubljana, Slovenia) Sebastian Mežnar (Jozef Stefan Institute, Faculty of Mathematics and Physics, University of Ljubljana, Slovenia) Ljupčo Todorovski (Faculty of Mathematics and Physics, University of Ljubljana, Slovenia)
Kernel-level expression generator
ABSTRACT. We introduce a neural generator that produces valid Lean 4 expressions directly at the kernel level, bypassing Lean's elaboration process. To train the generator, we implement Lean's core type checking in Python and integrate it into a reinforcement learning environment that assigns rewards for well-typed subexpressions. We evaluate the model's ability to learn and adapt within this environment.
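An illustrative miniature of the reward idea (a toy simply-typed term language of our own, far simpler than Lean's actual kernel): well-typed subexpressions earn reward, so ill-typed generations score lower.

from dataclasses import dataclass

@dataclass(frozen=True)
class Arrow:
    src: object
    dst: object

NAT = "Nat"
CTX = {"zero": NAT, "succ": Arrow(NAT, NAT), "add": Arrow(NAT, Arrow(NAT, NAT))}

def infer(term):
    """Infer a type: a term is a constant name or an application ('app', f, a)."""
    if isinstance(term, str):
        return CTX.get(term)
    _, f, a = term
    tf, ta = infer(f), infer(a)
    if isinstance(tf, Arrow) and tf.src == ta:
        return tf.dst
    return None  # ill-typed application

def reward(term):
    """RL-style shaping: +1 for every well-typed subterm."""
    if isinstance(term, str):
        return (infer(term) is not None) * 1.0
    _, f, a = term
    return reward(f) + reward(a) + (infer(term) is not None) * 1.0

good = ("app", "succ", ("app", ("app", "add", "zero"), "zero"))
bad = ("app", "zero", "succ")      # zero is not a function
print(reward(good), reward(bad))   # 7.0 vs 2.0: the ill-typed term earns less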
How LLMs could Fool a Proof Checker: The Risks of Inconsistent Assumptions in Automated Proof Systems
ABSTRACT. The integration of large language models (LLMs) with proof assistants promises to enhance mathematical reasoning, blending intuitive human interaction with rigorous verification. However, this coupling might introduce subtle risks. In this talk, we investigate how LLMs, proof checkers, and proof assistants handle inconsistent assumptions: sets of assumptions that can lead to seemingly valid but ultimately flawed proofs. Our focus is on GPT-o1 as an LLM and Isabelle and Coq as proof assistants.
We use the case study of Minimal Probability, defined as a theory of statements about expectations of a finite number of random variables. This theory is rich enough to cover a variety of simple applications and exercises, including known elementary paradoxes. Probability theory offers advantages for this study due to its status as a formal theory connected to simple, concrete situations and its susceptibility to subtle inconsistencies involving probabilities, expectations, conditional probabilities, and independence.
Key findings from our investigation into the behavior of LLMs and proof assistants when faced with inconsistent assumptions include:
- When the possibility of inconsistency is not mentioned, GPT-o1 often detects simple inconsistencies but fails in more complex cases.
- When the possibility of inconsistency is raised in the prompt, GPT-o1 decides about inconsistencies at a level comparable to human experts.
- Proof checkers (e.g., Isabelle and Coq used solely to verify a submitted proof) do not detect contradictions, even when they are apparent.
- Proof assistants (e.g., Isabelle and Coq with active intervention on justifying proof steps) might flag inconsistencies, but only occasionally.
- GPT-o1 can be instructed to deceive the proof checker: it can craft an inconsistent set of assumptions, two opposite conclusions, and the proof-checker code in such a way that both Isabelle and Coq would certify either conclusion as correct.
These findings highlight how potential inconsistencies could infiltrate probabilistic modeling of seemingly simple situations, leading LLMs and proof assistants to produce seemingly valid but ultimately meaningless proofs. Historically, when experienced humans designed setups, such risks were remote. However, these risks could grow as machines take on more control and assumptions become increasingly complex. Although possibly remote, the mere possibility that an LLM could deliberately craft assumptions to gain certification of unwarranted conclusions is particularly concerning.
This suggests the need for potential remedies or at least mitigating strategies. Our proposed actions include:
(0) Raising awareness of the issue to direct efforts toward mitigating the potential dangers.
(1) Maintaining human supervision of assumptions before integrating them into the proof environment for as long as possible.
(2) Using multiple proof assistants—built on diverse logical foundations—to increase the likelihood of detecting contradictions.
(3) Integrating automated consistency checks or specialized consistency-verification tools to promptly flag contradictions.
We explore (2) in a number of examples, with moderate success.
As for (3), we consider Minimal Probability and show that its axiomatization in a semantically complete first-order theory allows for the algebraization of assumptions, transforming probabilistic problems into polynomial relations. Consistency is then equivalent to the nonemptiness of a semi-algebraic set, a decidable problem analyzable using tools like cylindrical algebraic decomposition, which fully classifies consistent and inconsistent assumptions. This procedure has the complexity of the existential theory of real closed fields, and we have implemented it in Mathematica for the examples considered.
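To make the algebraization concrete, here is a small worked example of our own (not taken from the talk). The assumptions P(A) = 0.6, P(B) = 0.7, P(A ∧ B) = 0.2, written in terms of variables p_1, ..., p_4 for the probabilities of the four atoms A∧B, A∧¬B, ¬A∧B, ¬A∧¬B, become the semi-algebraic system

\[
  p_1 + p_2 + p_3 + p_4 = 1, \qquad p_i \ge 0, \qquad
  p_1 + p_2 = 0.6, \qquad p_1 + p_3 = 0.7, \qquad p_1 = 0.2,
\]

which forces p_4 = 1 - 0.2 - 0.4 - 0.5 = -0.1 < 0. The semi-algebraic set is therefore empty, and the assumptions are inconsistent.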
The talk will present examples, tests, and mitigating actions.
15:30
Tim Fernando (Computer Science, Trinity College Dublin, Ireland)
Between min cost search and least action
ABSTRACT. The contrast between the remarkable fluency of large language models and their flawed reasoning is linked, in a recent assessment of LLMs, to the distinction between pre-training powered by next-word prediction and fine-tuning associated with alignment, such as reinforcement learning [Mahowald et al., 2024]. The present work attempts to understand this distinction by exploring the view that next-word prediction presupposes a discrete notion of state, which alignment may refine in any number of ways. Open-ended as that process may be, some structure is imposed on it by developing Kleene's representation of nerve nets into a logical system [Goguen and Burstall, 1992], on which information-theoretic costs are introduced around a discretized least-action principle [Marsden and West, 2001], generalized to a finite-state setting and supported by tools such as Mona.
ABSTRACT. Useful interactions between Automated theorem proving (ATP) and Machine learning (ML), more specifically Large Language Models (LLMs), would benefit from an abstract logical framework for proofs and validity. We propose the use of the theory of Institutions, an abstract model-theoretic tool for specification and programming.
Lean and Metamath are proof assistants that rely on distinct mathematical-logical foundations, i.e., specific Institutions. Lean is built on dependent type theory, which enables the construction of highly structured mathematical objects and proofs. On the other hand, Metamath centers around its main database, the Metamath Proof Explorer, which is based on first-order logic. Additionally, other Metamath databases explore alternative systems, including intuitionistic logic, Quine’s New Foundations, and higher-order logic. Metamath verifies proofs using a simple substitution rule by focusing on the correctness of substitutions, without assuming a specific logic system. Hence, we need a logic-independent approach to enhance the development of semantics for such tools. While Metamath and Lean are powerful tools, they have certain limitations that highlight the need for a more advanced logic-independent framework, such as Institutions.
Institutions form the core of categorical abstract model theory, which formalizes the notion of a logical system by defining its syntax, semantics, and the satisfaction relation between them. They provide a common framework that moves beyond the details of specific logics, which helps in studying their connections. An Institution consists of four main components: (a) a collection of signatures (serving as vocabularies for forming sentences in a logical system) and signature morphisms, (b) a set of sentences for each signature, which represents the syntax of the logic, (c) a set of models for each signature, which provides the meaning or semantics of the sentences, and (d) a satisfaction relation that connects the models with the sentences, indicating which sentences are true in which models. Institutions extend Tarski's semantic definition of truth and generalize Barwise's Translation Axiom. Institutions highlight the fact that truth is invariant under change of notation. The key aspect of any logical system is the satisfaction relationship between its syntax (i.e., its sentences) and its semantics (i.e., its models). While this relationship assumes a fixed vocabulary, Institutions provide the flexibility to work with multiple vocabularies simultaneously. This ability allows for translations between different logical systems, enabling one to be interpreted within another. A main result from Institution Theory gives the conditions under which you can translate proofs from one theorem prover to another.
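The heart of the definition is the satisfaction condition of Goguen and Burstall, which makes the invariance of truth under change of notation precise: for every signature morphism σ : Σ → Σ', every Σ'-model M', and every Σ-sentence φ,

\[
  \mathrm{Mod}(\sigma)(M') \models_{\Sigma} \varphi
  \quad\Longleftrightarrow\quad
  M' \models_{\Sigma'} \mathrm{Sen}(\sigma)(\varphi).
\]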
Systems like Lean and Metamath that demonstrate how LLMs can be utilized to assist in constructing formal proofs can clearly benefit from the Institutional approach. The main methodological implications of providing a general framework for ML and LLMs are:
- the relativistic view;
- the translation of problems and solutions;
- the multi-language aspect.
In our presentation, we will demonstrate how Institutions can be used to approach the semantics of systems like Lean and Metamath and the benefits of such an approach.
ABSTRACT. In this talk I present how theorem provers can be used to define machine learning applications together with correctness invariants of interest. This approach can be seen as a pathway towards encoding provably safe AI.
ABSTRACT. The possibility of covering an incomplete chessboard with dominoes, without gaps or overlaps, is a classical example (in the case where two diagonally opposite corners of a standard board are removed) where the impossibility of a cover can be formally proven directly, but such a direct proof would in practice be too tedious for a human. Introducing a new idea, natural for a human but not immediate in a direct formalisation, establishes the impossibility immediately: colour the chessboard squares in two colours, black and white, and notice that a domino always covers exactly one square of each colour, while the truncated board contains unequal numbers of the two colours.
Meanwhile, counting the _number_ of such coverings (when a cover is possible) is a classical problem in combinatorics and in statistical physics; for the full board, for example, it was solved by methods discovered by Kasteleyn, by Temperley and Fisher, and by Lieb in the 1960s. For small enough finite boards, counting can also be done by a computer through exhaustive search, reproducing in those cases the more creative results achieved by humans (and answering by brute force, as a corollary, the questions of possibility or impossibility of covers enjoyed by philosophers and logicians).
With the rapid advances of machine learning and AI, we were curious how much of this would be accessible to present-day AI, and how creative it would be in tackling such problems. In my talk, and in a follow-up paper, I will discuss what we have found. The results were essentially disappointing: they exhibited little mathematically valid creativity and instead displayed known problems of foundation models such as hallucination and false confidence (and, under the hood, apparently very large formal VC dimensions of the transformer models compared to the training set sizes).
Some highlights of what we found. An LLM of the ChatGPT kind was able to correctly inform us of the impossibility of covering with dominoes a standard chessboard with diagonally opposite corners removed, to write a Lean program making this explanation formal, and to provide us with the correct count of the number of coverings of the full board. But those results were essentially in the training set and did not require much creativity.
With a small modification, the corners now removed on the same side, we found that what the AI had learned so far was quickly abandoned. We were confidently told that a covering is still impossible, for the same reason of parity mismatch, although in this case, unlike the previous one, the black and white counts are equal; in fact, a direct search easily (for a computer) finds a large (for a human) number, 2436304, of possible inequivalent covers. Moreover, we were nevertheless offered Lean code "proving" that such a cover was _impossible_, with the caveat that the theorem coded in Lean used the miracle procedure "sorry" as a placeholder for "the proof is not yet complete."
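For readers who want to check such counts themselves, here is a small broken-profile dynamic-programming sketch of our own (not from the talk; note that conventions on which covers count as "inequivalent" may differ) that counts domino tilings of a truncated board:

def count_tilings(rows, cols, blocked):
    """Count domino tilings of a rows x cols board minus the `blocked`
    squares, sweeping row by row; the DP state is the bitmask of cells
    already filled by vertical dominoes protruding from the row above."""
    def transitions(r, filled):
        results = []
        def go(c, protrude):
            if c == cols:
                results.append(protrude)
                return
            if (r, c) in blocked or filled >> c & 1:
                go(c + 1, protrude)           # nothing to place here
                return
            if r + 1 < rows and (r + 1, c) not in blocked:
                go(c + 1, protrude | 1 << c)  # vertical domino into row r+1
            if (c + 1 < cols and (r, c + 1) not in blocked
                    and not filled >> (c + 1) & 1):
                go(c + 2, protrude)           # horizontal domino
        go(0, 0)
        return results

    counts = {0: 1}
    for r in range(rows):
        nxt = {}
        for filled, ways in counts.items():
            for protrude in transitions(r, filled):
                nxt[protrude] = nxt.get(protrude, 0) + ways
        counts = nxt
    return counts.get(0, 0)

print(count_tilings(8, 8, set()))             # 12988816 (Kasteleyn's count)
print(count_tilings(8, 8, {(0, 0), (7, 7)}))  # 0: the mutilated chessboard
print(count_tilings(8, 8, {(0, 0), (0, 7)}))  # same-side corners removed;
                                              # the talk reports 2436304 covers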
We proceeded to interrogate the AI about coverings of full or truncated boards by longer 1 x n dominoes. In some sufficiently simple cases, for example when the number of squares remaining on the board was not divisible by n, or when n was larger than either of the board's dimensions, the impossibility of covers was correctly established by the AI. Yet in cases where there was no immediately obvious reason either way, the feedback was typically not mathematically sound (though plausible linguistically).
Further, we explored whether, in the case of 1 x 3 domino covers, the AI would try a new creative element. It seems natural that a human would try colouring the board in 3 colours in that case, generalising the success of black-and-white colourings for 1 x 2 domino problems. In fact, 3 colours do help in some 1 x 3 covering problems. However, we found no signs that the AI would explore this (at least when not directly prompted).
We hope that our investigation can contribute to the discussion of the meaning of AI and its role in mathematical proofs.
From Mathematical Theory to Machine Learning: Detecting Symmetry Groups in Crystallographic Tilings
ABSTRACT. This project explores the intersection of mathematics, crystallography, and machine learning by developing a method to detect symmetry groups of tilings (crystals). The approach originates from a mathematical technique I introduced in [1], which can be effectively modeled using classical machine learning methods. The work aligns with the principles of explainable AI, as preliminary results suggest the existence of a precise conjecture that should be provable based on data-driven modeling. This poster presents the methodology, key findings, and potential directions for collaboration in bridging theoretical mathematics with computational techniques in crystallography.
[1] Growth functions of periodic space tessellations (with Jakub Malinowski, Zbigniew Dauter and Mariusz Jaskolski), Acta Crystallogr. Sect. A: Found. Adv. 81 (2025), no. 1, 64-81.
This panel session will provide an opportunity for community building and free-form discussion of current trends in the area, as well as its future directions.
Possible drinks and meal at the Marriott Hotel: https://www.marriott.com/en-us/hotels/edihw-courtyard-edinburgh-west/overview/?scid=f2ae0541-1279-4f24-b197-a979c79310b0