CICM 2023: 16TH CONFERENCE ON INTELLIGENT COMPUTER MATHEMATICS
PROGRAM FOR FRIDAY, SEPTEMBER 8TH

09:00-11:00 Session 18: CICM 2023 - Invited keynote & paper presentation
Location: Upper Hall
09:00
Progress on proof systems interoperability

ABSTRACT. Proof system interoperability is a research topic that started about 30 years ago and received some attention with the formalization of Hales' proof of the Kepler Conjecture in the 2000s, as parts of this proof were initially formalized in two different provers. Proof system interoperability then received new interest in the 2010s with the increasing use of automated theorem provers in proof assistants. Around the same time, a new logical framework called Dedukti appeared, which extends LF by allowing the identification of types modulo some equational theory. It has been shown that various logical systems can be nicely encoded in Dedukti, and various tools have been developed to represent proofs from different provers in Dedukti and to translate them to other provers. In this talk, I will review some of these works and tools, and present recent efforts to handle entire libraries of proofs.
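As a rough gloss on the "modulo some equational theory" part (my rendering, not the speaker's): in the λΠ-calculus modulo rewriting that underlies Dedukti, the conversion rule identifies types up to β-reduction together with a set R of user-declared rewrite rules,

    \frac{\Gamma \vdash t : A \qquad \Gamma \vdash B : s \qquad A \equiv_{\beta R} B}{\Gamma \vdash t : B}

so two syntactically different types become interchangeable whenever the congruence generated by β and R relates them.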

10:00
Morphism Equality in Theory Graphs
PRESENTER: Franziska Weber

ABSTRACT. Theory graphs have theories as nodes and theory morphisms as edges. They can be seen as generators of categories with the nodes as the objects and the paths as the morphisms. But in contrast to generated categories, theory graphs do not allow for an equational theory on the morphisms. That blocks formalizing important aspects of theory graphs such as isomorphisms between theories.

MMT is essentially a logic-independent language for theory graphs. It previously supported theories and morphisms, and we extend it with morphism equality as a third primitive. We show the importance of this feature in several elementary formalizations that critically require stating and proving certain non-trivial morphism equalities. The key difficulty of this approach is that important properties of theory graphs now become undecidable and require heuristic methods.
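As a schematic example (standard categorical notation, not MMT concrete syntax): stating that theories S and T are isomorphic via morphisms \varphi : S \to T and \psi : T \to S amounts to exactly the kind of morphism equalities this extension makes expressible,

    \psi \circ \varphi = \mathrm{id}_S \qquad \varphi \circ \psi = \mathrm{id}_T

and these equations cannot even be stated if the graph carries no equational theory on its morphisms.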

10:30
ProofLang: the Language of arXiv Proofs

ABSTRACT. The ProofLang Corpus includes over three million English-language proofs—558 million words—mechanically extracted from the papers (Math, CS, Physics, etc.) posted on arXiv.org between 1992 and 2020. The focus of this corpus is on written proofs, not the explanatory text that surrounds them, and more specifically on the language used in such proofs; mathematical content is filtered out, resulting in sentences such as "Let MATH be the restriction of MATH to MATH." This dataset reflects how people prefer to write informal proofs. It is also amenable to statistical analyses and to experiments with Natural Language Processing (NLP) techniques.
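As a minimal illustration of the kind of statistical analysis the abstract mentions, here is a Python sketch that counts the most frequent sentence-opening words; it assumes, hypothetically, that the corpus is available locally as plain-text files with one proof sentence per line (the actual distribution format of the ProofLang Corpus may differ).

    # Hypothetical layout: one proof sentence per line in *.txt files.
    from collections import Counter
    from pathlib import Path

    def first_word_frequencies(corpus_dir: str) -> Counter:
        """Count how often each word opens a proof sentence."""
        counts = Counter()
        for path in Path(corpus_dir).glob("*.txt"):
            with path.open(encoding="utf-8") as f:
                for line in f:
                    words = line.split()
                    if words:
                        counts[words[0]] += 1
        return counts

    if __name__ == "__main__":
        for word, n in first_word_frequencies("prooflang_corpus").most_common(10):
            print(f"{word}\t{n}")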

11:30-13:00 Session 19: CICM 2023 - Open session

The open session is an opportunity for people to demonstrate their system developments on their laptops. In this session, people can also present other material, such as preliminary work, research proposals, or offers of collaboration.

Location: Upper Hall
11:30
A Catalogue of Mathematical Objects — micropublishing knowledge of examples

ABSTRACT. The right examples can be crucial for making progress in mathematical research, but they are often hard to find in the literature. An online catalogue where researchers can publish examples they come to think of would therefore be a valuable resource. Such a database may also be an arena for exposing mathematicians in general to mathematical knowledge management, as well as a place where formal verification may exist side by side with traditional informal presentations.

This work-in-progress aims to develop the cultural and software infrastructure for such a service. Collaborators are welcome.

12:00
Naproche demonstration

ABSTRACT. Naproche demonstration

14:00-16:00 Session 20: Workshop on Libraries of Formal Proofs and Natural Mathematical Language - Invited talk & Paper presentation
14:00
Informalizing formalized mathematics using the Lean theorem prover (Invited Talk)

ABSTRACT. One of the applications of interactive theorem provers in pure mathematics is being able to produce machine-verified formal proofs. I will talk about a less-obvious application, which is using formalized mathematics to author interactive informal expositions. I will demonstrate a prototype of an "auto-informalization" system written in Lean that presents the reader with an interface to view proofs at a desired level of detail. I will also discuss thoughts on the impact of such tools in mathematics. This is joint work with Patrick Massot.

15:00
CheckMate: an adaptable prototype platform for humans to interact with and evaluate LLMs

ABSTRACT. The standard methodology of evaluating large language models (LLMs) based on static pairs of inputs and outputs is insufficient for developing assistants: this kind of assessment fails to take into account the essential interactive element in their deployment, and therefore limits how we understand language model capabilities. We introduce CheckMate, an adaptable prototype platform for humans to interact with and evaluate LLMs. We conduct a study with CheckMate to evaluate three language models (InstructGPT, ChatGPT, and GPT-4) as assistants in proving undergraduate-level mathematics, with a mixed cohort of participants ranging from undergraduate students to professors of mathematics. We release the resulting interaction and rating dataset, MathConverse. By analysing MathConverse, we derive a preliminary taxonomy of human behaviours and find that, despite a generally positive correlation, there are notable instances of divergence between correctness and perceived helpfulness in LLM generations, amongst other findings. Further, we identify useful scenarios and existing issues of GPT-4 in mathematical reasoning through a series of case studies contributed by expert mathematicians. We conclude with actionable takeaways for ML practitioners and mathematicians: models that communicate uncertainty, respond well to user corrections, and are more interpretable and concise may constitute better assistants; interactive evaluation is a promising way to continually navigate the capability of these models; and humans should be aware of language models’ algebraic fallibility and, for that reason, discern where they should be used.
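To make the correctness/helpfulness point concrete, here is an illustrative Python sketch (the rating scheme and numbers are invented, not MathConverse's schema) of how one might measure the overall correlation and flag helpful-but-incorrect responses.

    from statistics import correlation  # Pearson correlation, Python 3.10+

    # Hypothetical (correctness, perceived-helpfulness) ratings for six responses.
    ratings = [(1.0, 5), (1.0, 4), (0.0, 4), (0.5, 2), (0.0, 1), (1.0, 5)]
    correct = [c for c, _ in ratings]
    helpful = [h for _, h in ratings]

    # Overall the two tend to move together ...
    print(f"Pearson r = {correlation(correct, helpful):.2f}")

    # ... but individual responses can be rated helpful while being wrong.
    divergent = [(c, h) for c, h in ratings if h >= 4 and c < 0.5]
    print("helpful-but-incorrect responses:", divergent)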

15:30
Formal Reasoning using Distributed Assertions

ABSTRACT. When a proof system checks a formal proof, we can say that its kernel asserts that the formula is a theorem in a particular logic. We describe a general framework in which such assertions can be made globally available so that any other proof assistant willing to trust the assertion’s creator can use that assertion without rechecking any associated formal proof. This framework, called DAMF (Distributed Assertion Management Framework), is heterogeneous and allows each participant to decide which tools and operators they are willing to trust in order to accept external assertions. This framework can also be integrated into existing proof systems by making minor changes to the input and output subsystems of the prover. DAMF achieves a high level of distributivity using such off-the-shelf technologies as IPFS (InterPlanetary File System) [1], IPLD (InterPlanetary Linked Data) [2], and public key cryptography. We illustrate the framework by describing an implemented tool for validating and publishing assertion objects and a modified version of the Abella theorem prover [3] that can use and publish such assertions.
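The following Python sketch is a conceptual illustration only (it is not DAMF's actual data model, API, or wire format; it uses the cryptography package): an assertion is a content-addressed, signed claim that some agent, using some tool, asserts a statement; consumers decide whether to trust the agent and the tool.

    import hashlib
    import json
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
    from cryptography.hazmat.primitives.serialization import Encoding, PublicFormat

    def content_id(obj: dict) -> str:
        """Stand-in for an IPFS/IPLD CID: hash of a canonical JSON encoding."""
        return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()

    signing_key = Ed25519PrivateKey.generate()
    agent_id = signing_key.public_key().public_bytes(Encoding.Raw, PublicFormat.Raw).hex()

    statement = {"logic": "first-order", "formula": "forall x, p(x) -> p(x)"}  # hypothetical
    assertion = {
        "statement": content_id(statement),  # link to the asserted statement
        "agent": agent_id,                   # who asserts it
        "tool": "abella",                    # which checker the agent used
    }
    signature = signing_key.sign(json.dumps(assertion, sort_keys=True).encode())

    published = {"assertion": assertion, "signature": signature.hex()}
    print("assertion id:", content_id(published))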

On a higher level, we explore and develop a new dimension in the general quest for interoperability between proof systems and communities. Motivations for achieving interoperability include efficiency through reuse of developments, avoiding unnecessary duplication, enhancing collaboration, cross-validating theorems, and avoiding extreme centralization. Another reason is the appeal of modularity and composability in theories, systems, and logics. Some approaches attempt a solution between two specific systems [4] [5], which is not always feasible, mainly due to incompatibilities between proof languages and implementations. Others aim for universal interoperability, which is more challenging. The evidential tool bus [6] was proposed as a means to integrate specialized inference systems into a unified formal system. Approaches like Dedukti [7] and MMT [8] aim to create a framework in which other systems and logics can be expressed, thereby combining theorems and proofs produced by different systems in a single trusted system. Program verification tools such as TLAPS [9] approach the problem from the opposite direction: an attempted proof is composed of different parts, and each part is delegated upfront to a separate proof system trusted by the main system. Standard formats and languages such as TPTP [10] have been developed as bridges between systems, and universal libraries have been proposed [11] [12].

In DAMF, we take as given the existence of different communities, different approaches, different languages, etc. Hence, the starting criterion is to consider the diversity of logics, languages, and systems, which are not necessarily completely compatible, mapped to each other, or tied to a central system. We want to enable heterogeneous producers and consumers of formal proofs to communicate coherently despite these differences, and thus we start by laying the groundwork for such communication.

Publications, documentation, and source code related to DAMF can be found on the distributed-assertions website [13].

References:

[1] What is IPFS? https://docs.ipfs.tech/concepts/what-is-ipfs/

[2] IPLD documentation. https://ipld.io/docs/

[3] The Abella theorem prover. https://abella-prover.org/

[4] Armand, M., Faure, G., Grégoire, B., Keller, C., Théry, L., and Werner, B. A modular integration of SAT/SMT solvers to Coq through proof witnesses. In Certified Programs and Proofs (CPP 2011) (2011), J.-P. Jouannaud and Z. Shao, Eds., vol. 7086 of LNCS, pp. 135–150.

[5] Fontaine, P., Marion, J.-Y., Merz, S., Nieto, L. P., and Tiu, A. F. Expressiveness + automation + soundness: Towards combining SMT solvers and interactive proof assistants. In TACAS: Tools and Algorithms for the Construction and Analysis of Systems (2006), H. Hermanns and J. Palsberg, Eds., vol. 3920 of LNCS, Springer, pp. 167–181.

[6] Rushby, J. M. An evidential tool bus. In Formal Methods and Software Engineering, 7th International Conference on Formal Engineering Methods, ICFEM 2005, Manchester, UK, November 1-4, 2005, Proceedings (2005), K. Lau and R. Banach, Eds., vol. 3785 of LNCS, Springer, pp. 36–36.

[7] Assaf, A., Burel, G., Cauderlier, R., Delahaye, D., Dowek, G., Dubois, C., Gilbert, F., Halmagrand, P., Hermant, O., and Saillard, R. Dedukti: a logical framework based on the λΠ-calculus modulo theory, 2016.

[8] Rabe, F. The MMT Language and System. https://uniformal.github.io/, 2022.

[9] Chaudhuri, K., Doligez, D., Merz, S., and Lamport, L. The TLA+ proof system: Building a heterogeneous verification platform. In Proceedings of the 7th International Colloquium on Theoretical Aspects of Computing (ICTAC) (Natal, Rio Grande do Norte, Brazil, Sept. 2010), A. Cavalcanti, D. Déharbe, M.-C. Gaudel, and J. Woodcock, Eds., vol. 6256 of LNCS, Springer, p. 44.

[10] Sutcliffe, G. The TPTP Problem Library and Associated Infrastructure: The FOF and CNF Parts, v3.5.0. Journal of Automated Reasoning 43, 4 (2009), 337–362.

[11] Wiedijk, F. The QED manifesto revisited. Studies in Logic, Grammar and Rhetoric 10, 23 (2007), 121–133.

[12] Hales, T. Formal abstracts. https://formalabstracts.github.io/.

[13] Distributed Assertions website. https://distributed-assertions.github.io/

16:30-18:00 Session 21: Workshop on Libraries of Formal Proofs and Natural Mathematical Language - Paper presentation
16:30
LeanAIde - A statement autoformalisation tool for the Lean theorem prover

ABSTRACT. We present LeanAIde, a tool to translate mathematical statements within Lean doc-strings into formal Lean code. The tool is integrated with the Lean 4 editor via code actions for ergonomic usage.

At its core, LeanAIde relies on the abilities of a large language model to translate between natural language and code. Further optimisations to the input and output of the language model are used to improve the success rate of translation. The prompt to the language model is customised to the statement at hand by gathering similar doc-strings from the Lean mathematical library (mathlib) using sentence embeddings. Through post-processing and filtering, which is done efficiently using Lean's meta-programming capabilities, only the outputs corresponding to well-formed Lean expressions are retained. The tool is implemented entirely in Lean 4.
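As a schematic sketch of the prompt-building step described above (in Python for brevity; LeanAIde itself is implemented in Lean 4, and the embedding function here is a placeholder): pick the mathlib doc-strings whose sentence embeddings are closest, by cosine similarity, to the statement being formalised.

    import numpy as np

    def top_k_similar(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 10) -> np.ndarray:
        """Indices of the k doc-strings most cosine-similar to the query."""
        q = query_vec / np.linalg.norm(query_vec)
        d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
        return np.argsort(-(d @ q))[:k]

    # Hypothetical usage, where `embed` is some sentence-embedding model:
    #   idxs = top_k_similar(embed(new_statement), np.stack([embed(s) for s in docstrings]))
    #   prompt = "\n\n".join(docstrings[i] + "\n" + formal_statements[i] for i in idxs)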

When used with OpenAI language models, the tool is able to formalise short mathematical statements at the undergraduate level with about 65% accuracy when tested on 120 theorem statements. The test dataset is split into normal, silly and false categories, where statements from the last two categories are highly unlikely to be a part of the language model's training data and hence guard against contamination. The tool has also been benchmarked against the new ProofNet dataset for auto-formalisation.

The source code for LeanAIde is available online at https://github.com/siddhartha-gadgil/LeanAide.

References:

[1] Zhangir Azerbayev and Edward W. Ayers. lean-chat: user guide. Lean. 2023. url: https://github.com/zhangir-azerbayev/lean-chat.

[2] Zhangir Azerbayev et al. ProofNet: Autoformalizing and Formally Proving Undergraduate-Level Mathematics. 2023. arXiv: 2302.12433 [cs.CL].

[3] The mathlib Community. “The Lean Mathematical Library”. In: Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs. CPP 2020. New Orleans, LA, USA: Association for Computing Machinery, 2020, 367–381. isbn: 9781450370974. doi:10.1145/3372885.3373824. url: https://doi.org/10.1145/3372885.3373824.

[4] Naman Jain et al. “Jigsaw: Large language models meet program synthesis”. In: Proceedings of the 44th International Conference on Software Engineering. 2022, pp. 1219–1231.

[5] Albert Q Jiang et al. “Draft, sketch, and prove: Guiding formal theorem provers with informal proofs”. In: arXiv preprint arXiv:2210.12283 (2022).

[6] Leonardo de Moura and Sebastian Ullrich. “The Lean 4 theorem prover and programming language”. In: Automated Deduction–CADE 28: 28th International Conference on Automated Deduction, Virtual Event, July 12–15, 2021, Proceedings 28. Springer. 2021, pp. 625–635.

[7] Arvind Neelakantan et al. “Text and code embeddings by contrastive pre-training”. In: arXiv preprint arXiv:2201.10005 (2022).

[8] OpenAI. GPT-4 Technical Report. 2023. arXiv: 2303.08774 [cs.CL].

[9] Qingxiang Wang et al. “Exploration of neural machine translation in autoformalization of mathematics in Mizar”. In: Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs. 2020, pp. 85–98.

[10] Yuhuai Wu et al. “Autoformalization with large language models”. In: Advances in Neural Information Processing Systems 35 (2022), pp. 32353–32368.

17:00
Solving logical puzzles with ChatGPT

ABSTRACT. I will show the limitations and merits of Large Language Models (LLMs) in solving logic puzzles. For the experiments, I use the 144 puzzles from the library https://users.utcluj.ro/~agroza/puzzles/maloga [1]. The library contains puzzles of various types, including arithmetic puzzles, logical equations, Sudoku-like puzzles, zebra-like puzzles, truth-telling puzzles, grid puzzles, strange numbers, and self-reference puzzles. The correct solutions were obtained using Prover9 and Mace4, based on human modelling in equational first-order logic.

The aim is to investigate how many of these 144 puzzles are actually solved by LLMs. Some of these puzzles have graphical input that makes them difficult to textualise for ChatGPT, e.g. Futoshiki, Kakurasu, Takuzu, Kakuro. After eliminating these graphical puzzles, a set of 100 logical puzzles remains. Each puzzle was prompted to ChatGPT with “Solve this puzzle”. The results were classified as follows. Only 7 puzzles were correctly solved and backed by correct reasoning. 1 puzzle was solved correctly, but the language model provided only an incomplete justification. For 9 puzzles the correct solution was provided, but it was backed by faulty justifications or reasoning chains; the correctness was actually a guess, since some of the puzzles ask to check whether a given sentence is true or false. Wrong solutions backed by faulty reasoning were provided for 70 puzzles. In addition to these wrongly solved puzzles, ChatGPT gave the following kinds of answers: claiming no solution exists when one does, claiming there is not enough information to solve the task, or providing wrong solutions without any justification at all. There was one case in which, instead of the exact solution, an inefficient but valid algorithm was provided. Perhaps of most interest is one puzzle for which ChatGPT admitted its inability to solve it.
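A minimal sketch of how the prompting step could be scripted, assuming the openai Python package (v1 client); the study itself may well have used the ChatGPT web interface, so this is illustrative only.

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def solve(puzzle_text: str, model: str = "gpt-3.5-turbo") -> str:
        """Send a single puzzle with the prompt used in the study."""
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": f"Solve this puzzle\n\n{puzzle_text}"}],
        )
        return response.choices[0].message.content

    # Each answer is then classified by hand: correct solution with correct reasoning,
    # correct solution with faulty or incomplete reasoning, wrong solution, refusal, etc.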

In most cases, the answers had a three-part structure: (i) task understanding, (ii) solving strategy, and (iii) the chain of reasoning. First, ChatGPT “seems” to master task understanding, since it just reformulates the input in a more structured way. Second, we identify and annotate various solving strategies, e.g. “let’s do it step by step”, “backward reasoning”, “systematic search”, “trial and error”, “backtracking”, “Euclidean algorithm”, “principle of inclusion-exclusion”, “assumption based”, “analysing possibilities”. Third, we classify the reasoning faults conveyed by ChatGPT, such as: wrong justification, inconsistency, wrong assumption, implication does not hold, wrong evaluation, bad arithmetic.

Related work has focused on mathematical problems. Frieder et al. [4] analysed the mathematical capabilities of ChatGPT and showed that it provides fewer than 50% correct answers. PaLM enhanced with chain-of-thought prompting and an external calculator was able to solve 58% of basic mathematical reasoning tasks (e.g. from the MathQA or GSM8K datasets) [5]. Patterns of ChatGPT failures have been classified by Borji [3], including reasoning, factual errors, math, coding, and bias. I focus here on identifying and classifying the reasoning flaws of ChatGPT.

The discussion during the workshop targets: (1) what are the patterns of logical faults made by LLMs? (2) which types of puzzles are easy or difficult for LLMs to solve? (3) what are the reasoning abilities of various free LLMs (e.g. ChatGPT, Bard, Claude)? (4) which prompts work better when solving logical puzzles?

As an alternative to engineering prompts to solve logic tasks, one could use LLMs to translate the text of the puzzles from natural language to some logic. For instance, Ishay et al. have used LLMs to generate answer set programs [2]. This approach is justified by the fact that translation is the strongest skill of LLMs. Part of our 100 puzzles and their corresponding formalisations could be used to fine-tune the GPT Curie model, while the remaining part could be used to assess the model's capabilities.

One dimension is how LLMs can support students in learning first-order logic. Logical puzzles are an efficient pedagogical instrument to engage students in learning knowledge representation and reasoning. Along this line, pedagogical tasks for LLMs include: (i) signalling logical faults in a formalisation provided by the student; (ii) explaining in natural language a formalisation, proof, or interpretation model; (iii) completing a partial formalisation provided by the student.

One output of this study is the benchmark of 100 logical puzzles, for which ChatGPT provided both a correct answer and a correct justification for only 7%. A second output is the classification of the reasoning faults conveyed by ChatGPT. Third, each wrong statement within the ChatGPT answers was manually annotated, with the aim of quantifying the amount of faulty text generated by the language model.

  • [1] Groza, A. Modelling Puzzles in First Order Logic. Springer, 2021.
  • [2] Ishay, A., Yang, Z., and Lee, J. Leveraging Large Language Models to Generate Answer Set Programs. arXiv preprint arXiv:2307.07699 (2023).
  • [3] Borji, A. A categorical archive of ChatGPT failures. arXiv preprint arXiv:2302.03494 (2023).
  • [4] Frieder, S., et al. Mathematical capabilities of ChatGPT. arXiv preprint arXiv:2301.13867 (2023).
  • [5] Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., et al. PaLM: Scaling language modeling with Pathways. arXiv preprint arXiv:2204.02311 (2022).

17:30
An indexer and query language for libraries written in LambdaPi/Dedukti

ABSTRACT. I will present recent work on integrating into LambdaPi an indexer and query language for libraries written in LambdaPi/Dedukti syntax. Terms are indexed using discrimination trees to allow queries up to generalization and instantiation. Moreover, I provide a query language to combine basic queries using logical operators (conjunction/disjunction), set operators (restrictions to certain domains), and localization operators (specifying where a pattern is supposed to match a statement, e.g. in the conclusion or in a hypothesis). I use LambdaPi rewriting rules to implement alignments or to remove clutter from encodings before indexing the terms.
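To illustrate the indexing idea (a toy Python sketch, not the LambdaPi implementation): terms are flattened in pre-order, pattern variables become a wildcard symbol, and retrieval up to generalization follows both the exact child and the wildcard child, skipping a whole subterm of the query when a wildcard is taken.

    # Toy discrimination tree: terms are tuples (head, arg1, ...), atoms are strings,
    # and strings starting with '?' are pattern variables.

    def flatten(term):
        """Pre-order flattening into (symbol, arity) pairs; variables become ('*', 0)."""
        if isinstance(term, str):
            return [("*", 0)] if term.startswith("?") else [(term, 0)]
        head, *args = term
        out = [(head, len(args))]
        for a in args:
            out += flatten(a)
        return out

    def subterm_end(syms, i):
        """Index just past the subterm starting at position i."""
        pending = 1
        while pending > 0:
            pending += syms[i][1] - 1
            i += 1
        return i

    class DiscTree:
        def __init__(self):
            self.children = {}
            self.values = []

        def insert(self, term, value):
            node = self
            for sym in flatten(term):
                node = node.children.setdefault(sym, DiscTree())
            node.values.append(value)

        def generalizations(self, term):
            """Values stored under keys that match `term` up to generalization."""
            def walk(node, syms):
                if not syms:
                    yield from node.values
                    return
                exact = node.children.get(syms[0])
                if exact is not None:
                    yield from walk(exact, syms[1:])
                star = node.children.get(("*", 0))
                if star is not None:                      # wildcard eats a whole subterm
                    yield from walk(star, syms[subterm_end(syms, 0):])
            return list(walk(self, flatten(term)))

    index = DiscTree()
    index.insert(("add", "?x", "0"), "add_zero")
    index.insert(("add", "?x", "?y"), "add_any")
    print(index.generalizations(("add", ("succ", "0"), "0")))  # ['add_zero', 'add_any']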