09:00 | Progress on proof systems interoperability ABSTRACT. Proof system interoperability is a research topic that started about 30 years ago and received some attention with the formalization of Hales' proof of the Kepler Conjecture in the 2000s, as parts of that proof were initially formalized in two different provers. Proof system interoperability then received renewed interest in the 2010s with the increasing use of automated theorem provers in proof assistants. At about the same time, a new logical framework appeared, called Dedukti, which extends LF by allowing the identification of types modulo some equational theory. Various logical systems have been shown to admit natural encodings in Dedukti, and tools have been developed to represent proofs from a range of provers in Dedukti and to translate them to other provers. In this talk, I will review some of these works and tools, and present recent efforts to handle entire libraries of proofs.
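As a rough illustration (my notation, not the speaker's) of identifying types modulo an equational theory: declaring a rewrite rule for addition makes two vector types that differ only up to that rule convertible, so a proof of a statement about one type is accepted at the other without any explicit coercion.

    \[ 0 + n \;\hookrightarrow\; n \quad\Longrightarrow\quad \mathsf{Vec}\,(0 + n) \;\equiv\; \mathsf{Vec}\,n \]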
10:00 | Morphism Equality in Theory Graphs PRESENTER: Franziska Weber ABSTRACT. Theory graphs have theories as nodes and theory morphisms as edges. They can be seen as generators of categories, with the nodes as the objects and the paths as the morphisms. But in contrast to generated categories, theory graphs do not allow for an equational theory on the morphisms. This blocks formalizing important aspects of theory graphs, such as isomorphisms between theories. MMT is essentially a logic-independent language for theory graphs. It previously supported theories and morphisms, and we extend it with morphism equality as a third primitive. We show the importance of this feature in several elementary formalizations that critically require stating and proving certain non-trivial morphism equalities. The key difficulty of this approach is that important properties of theory graphs now become undecidable and require heuristic methods.
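A hedged, hypothetical sketch in Python (not MMT's actual implementation) of why morphism equality is both useful and hard: morphisms are paths of edges, declared equations are used as rewrite rules in both directions, and equality is approximated by a bounded search, which can only be a heuristic given the undecidability mentioned above.

    class TheoryGraph:
        def __init__(self):
            self.edges = {}        # edge name -> (source theory, target theory)
            self.equations = []    # pairs of parallel paths asserted to be equal

        def add_edge(self, name, src, tgt):
            self.edges[name] = (src, tgt)

        def add_equation(self, lhs, rhs):
            self.equations.append((tuple(lhs), tuple(rhs)))

        def _neighbours(self, path):
            # All paths obtained by replacing one side of an equation by the other.
            for lhs, rhs in self.equations:
                for a, b in ((lhs, rhs), (rhs, lhs)):
                    for i in range(len(path) - len(a) + 1):
                        if path[i:i + len(a)] == a:
                            yield path[:i] + b + path[i + len(a):]

        def equal(self, p, q, fuel=10_000):
            # Bounded breadth-first search from p towards q modulo the equations.
            p, q = tuple(p), tuple(q)
            seen, frontier = {p}, [p]
            while frontier and fuel > 0:
                fuel -= 1
                path = frontier.pop(0)
                if path == q:
                    return True
                for nxt in self._neighbours(path):
                    if nxt not in seen:
                        seen.add(nxt)
                        frontier.append(nxt)
            return False   # not provably equal within the search bound

    # An isomorphism between theories A and B: f;g = id_A and g;f = id_B,
    # with the empty path playing the role of the identity morphism.
    tg = TheoryGraph()
    tg.add_edge("f", "A", "B")
    tg.add_edge("g", "B", "A")
    tg.add_equation(["f", "g"], [])
    tg.add_equation(["g", "f"], [])
    print(tg.equal(["f", "g", "f"], ["f"]))   # True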
10:30 | ProofLang: the Language of arXiv Proofs PRESENTER: Christopher Stone ABSTRACT. The ProofLang Corpus includes over three million English-language proofs (558 million words) mechanically extracted from the papers (Math, CS, Physics, etc.) posted on arXiv.org between 1992 and 2020. The focus of the corpus is on the written proofs themselves, not on the explanatory text that surrounds them, and more specifically on the language used in such proofs; mathematical content is filtered out, resulting in sentences such as "Let MATH be the restriction of MATH to MATH." This dataset reflects how people prefer to write informal proofs. It is also amenable to statistical analyses and to experiments with Natural Language Processing (NLP) techniques.
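A hedged sketch in Python of the kind of lightweight statistical analysis such a corpus invites; the actual file layout and tokenisation of the ProofLang Corpus may differ, so a toy sample is used here in which mathematical content has already been replaced by the token MATH.

    from collections import Counter

    sample = [
        "Let MATH be the restriction of MATH to MATH.",
        "Let MATH be an arbitrary element of MATH.",
        "By induction on MATH, we may assume MATH.",
        "Suppose for contradiction that MATH.",
    ]

    # Count the most frequent two-word sentence openers, a crude proxy for proof idioms.
    openers = Counter(" ".join(s.split()[:2]).rstrip(".,") for s in sample)
    for phrase, count in openers.most_common():
        print(f"{count}  {phrase}")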
The open session is an opportunity for people to show their system developments on their laptops. In this session, people may also present other things, such as preliminary work, research proposals, or offers of collaboration.
16:30 | LeanAIde - A statement autoformalisation tool for the Lean theorem prover PRESENTER: Anand Rao Tadipatri ABSTRACT. We present LeanAIde, a tool to translate mathematical statements within Lean doc-strings into formal Lean code. The tool is integrated with the Lean 4 editor via code actions for ergonomic usage. At its core, LeanAIde relies on the ability of a large language model to translate between natural language and code. Further optimisations to the input and output of the language model are used to improve the success rate of translation. The prompt to the language model is customised to the statement at hand by gathering similar doc-strings from the Lean mathematical library (mathlib) using sentence embeddings. Through post-processing and filtering, which is done efficiently using Lean's meta-programming capabilities, only the outputs corresponding to well-formed Lean expressions are retained. The tool is implemented entirely in Lean 4. When used with OpenAI language models, the tool is able to formalise short mathematical statements at the undergraduate level with about 65% accuracy when tested on 120 theorem statements. The test dataset is split into normal, silly, and false categories, where statements from the last two categories are highly unlikely to be part of the language model's training data and hence guard against contamination. The tool has also been benchmarked against the new ProofNet dataset for autoformalisation. The source code for LeanAIde is available online at https://github.com/siddhartha-gadgil/LeanAide.
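A schematic Python sketch of the pipeline described in the abstract, with deliberately hypothetical placeholder hooks (embed, complete, elaborates_in_lean); LeanAIde itself is written in Lean 4 and its actual interfaces may differ.

    import math

    def embed(text: str) -> list[float]:
        """Placeholder for a sentence-embedding call (hypothetical)."""
        raise NotImplementedError

    def complete(prompt: str, n: int) -> list[str]:
        """Placeholder for sampling n completions from a language model (hypothetical)."""
        raise NotImplementedError

    def elaborates_in_lean(code: str) -> bool:
        """Placeholder: does the candidate parse and elaborate against mathlib? (hypothetical)"""
        raise NotImplementedError

    def cosine(u: list[float], v: list[float]) -> float:
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / (norm + 1e-12)

    def translate(doc_string: str, library: list[tuple[str, str]], k: int = 8) -> list[str]:
        # 1. Retrieve the k library doc-strings most similar to the input statement.
        q = embed(doc_string)
        ranked = sorted(library, key=lambda pair: cosine(embed(pair[0]), q), reverse=True)
        # 2. Build a few-shot prompt from the retrieved (doc-string, formal statement) pairs.
        prompt = "\n\n".join(f"/-- {d} -/\ntheorem {s}" for d, s in ranked[:k])
        prompt += f"\n\n/-- {doc_string} -/\ntheorem "
        # 3. Sample several candidates and keep only those that elaborate to well-formed Lean.
        return [c for c in complete(prompt, n=10) if elaborates_in_lean(c)]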
17:00 | Solving logical puzzles with ChatGPT ABSTRACT. I will show the limitations and merits of Large Language Models (LLMs) in solving logic puzzles. For the experiments I use the 144 puzzles from the library https://users.utcluj.ro/~agroza/puzzles/maloga [1]. The library contains puzzles of various types, including arithmetic puzzles, logical equations, Sudoku-like puzzles, zebra-like puzzles, truth-telling puzzles, grid puzzles, strange numbers, and self-reference puzzles. The correct solutions were obtained using Prover9 and Mace4, based on human modelling in equational first-order logic. The aim is to investigate how many of these 144 puzzles are actually solved by LLMs. Some of these puzzles (e.g. Futoshiki, Kakurasu, Takuzo, Kakuro) have graphical input that makes them difficult to textualise for ChatGPT. After eliminating these graphical puzzles, a set of 100 logical puzzles remains. Each puzzle was given to ChatGPT with the prompt "Solve this puzzle". The results have been classified as follows. Only 7 puzzles were correctly solved and backed by correct reasoning. One puzzle was solved correctly, but only an incomplete justification was provided by the language model. For 9 puzzles the correct solution was provided but backed by faulty justifications or reasoning chains; the correctness was effectively a guess, since some of the puzzles only ask to check whether a given sentence is true or false. Wrong solutions backed by faulty reasoning were provided for 70 puzzles. In addition to these wrongly solved puzzles, ChatGPT also gave the following kinds of answers: claiming that no solution exists when one does, claiming that there is not enough information to solve the task, or providing wrong solutions without any justification at all. There was one case in which, instead of the exact solution, an inefficient but valid algorithm was provided. Of most interest might be the one puzzle for which ChatGPT admitted its inability to solve it. In most cases, the answers had a three-part structure: (i) task understanding, (ii) solving strategy, and (iii) the chain of reasoning. First, ChatGPT "seems" to master task understanding, since it merely reformulates the input in a more structured way. Second, we identify and annotate various solving strategies, e.g. "let's do it step by step", "backward reasoning", "systematic search", "trial and error", "backtracking", "Euclidean algorithm", "principle of inclusion-exclusion", "assumption-based", "analysing possibilities". Third, we classify the reasoning faults conveyed by ChatGPT, such as wrong justification, inconsistency, wrong assumption, implications that do not hold, wrong evaluation, and bad arithmetic. Related work has focused on mathematical problems. Simon et al. [4] analysed the mathematical capabilities of ChatGPT and showed that ChatGPT provides less than 50% correct answers. PaLM enhanced with chain-of-thought prompting and an external calculator was able to solve 58% of basic mathematical reasoning tasks (i.e. from the MathQA and GSM8K datasets) [5]. Patterns of ChatGPT failures have been classified by Borji [3], including reasoning, factual errors, math, coding, and bias. I focus here on identifying and classifying the reasoning flaws of ChatGPT. The discussion during the workshop targets: (1) What are the patterns of logical faults made by LLMs? (2) Which types of puzzles are easy or difficult for LLMs to solve? (3) What are the reasoning abilities of various free LLMs (e.g. ChatGPT, Bard, Claude)? (4) Which prompts work better when solving logical puzzles?
As an alternative to engineering prompts to solve logic tasks, one could use LLMs to translate the text of the puzzles from natural language into some logic. For instance, Ishay et al. have used LLMs to generate answer set programs [2]. This approach is justified by the fact that the strongest skill of LLMs is translation. Part of our 100 puzzles and their corresponding formalisations could be used to fine-tune the GPT-Curie model, while the remaining ones could be used to assess the model's capabilities. One dimension is how LLMs can support students in learning first-order logic. Logical puzzles are an effective pedagogical instrument for engaging students in learning knowledge representation and reasoning. Along this line, pedagogical tasks for LLMs include: (i) signalling logical faults in a formalisation provided by the student; (ii) explaining in natural language a formalisation, proof, or interpretation model; and (iii) completing a partial formalisation provided by the student. One output of this study is the benchmark of 100 logical puzzles, for which ChatGPT provided both a correct answer and a correct justification in only 7% of cases. A second output is the classification of reasoning faults conveyed by ChatGPT. Third, each wrong statement within the ChatGPT answers was manually annotated, with the aim of quantifying the amount of faulty text generated by the language model.
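A minimal Python tally of the reported breakdown; only the counts stated in the abstract are used, and the answer types that the abstract does not quantify individually are lumped into a single remainder.

    results = {
        "correct solution, correct reasoning": 7,
        "correct solution, incomplete justification": 1,
        "correct solution, faulty reasoning (effectively a guess)": 9,
        "wrong solution, faulty reasoning": 70,
    }
    # The remaining answers include claims of no solution, claims of insufficient
    # information, and unjustified wrong solutions (see the abstract for details).
    results["other answers (not quantified individually)"] = 100 - sum(results.values())

    for outcome, count in results.items():
        print(f"{count:3d}%  {outcome}")   # counts are percentages, as there are 100 puzzles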
17:30 | An indexer and query language for libraries written in LambdaPi/Dedukti ABSTRACT. I will present recent work on integrating into LambdaPi an indexer and a query language for libraries written in LambdaPi/Dedukti syntax. Terms are indexed using discrimination trees to allow queries up to generalization and instantiation. Moreover, I provide a query language that combines basic queries using logical operators (conjunction/disjunction), set operators (restrictions to certain domains), and localization operators (specifying where a pattern is supposed to match a statement, e.g. in the conclusion or in the hypotheses). I use LambdaPi rewriting rules to implement alignments or to remove the clutter from encodings before indexing the terms.
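A toy Python illustration (my own simplification, not the LambdaPi implementation) of retrieval up to generalization and instantiation with discrimination-tree-style keys: terms are flattened in preorder with explicit arities, variables become wildcard steps that can skip a whole subterm, and, for brevity, the keys are kept in a flat list rather than shared in an actual prefix tree.

    def flatten(t):
        # Preorder flattening; variables (strings starting with "?") become wildcards.
        if isinstance(t, str):
            return [("*", 0)] if t.startswith("?") else [(t, 0)]
        head, *args = t
        out = [(head, len(args))]
        for a in args:
            out += flatten(a)
        return out

    def skip(seq, i):
        # Index just past the subterm starting at position i, using the stored arities.
        todo = 1
        while todo:
            todo += seq[i][1] - 1
            i += 1
        return i

    class Index:
        def __init__(self):
            self.entries = []                      # (flattened key, original term)

        def insert(self, term):
            self.entries.append((flatten(term), term))

        def retrieve(self, pattern):
            # Stored terms whose key matches the pattern's key, with variables on
            # either side acting as wildcards (covers generalizations and instances).
            key = flatten(pattern)
            return [t for k, t in self.entries if self._match(key, 0, k, 0)]

        def _match(self, pat, i, key, j):
            if i == len(pat):
                return j == len(key)
            if pat[i][0] == "*":                   # query variable skips a stored subterm
                return self._match(pat, i + 1, key, skip(key, j))
            if key[j][0] == "*":                   # stored variable skips a query subterm
                return self._match(pat, skip(pat, i), key, j + 1)
            return pat[i] == key[j] and self._match(pat, i + 1, key, j + 1)

    # Index two statements and query for any equation whose left-hand side is an addition.
    idx = Index()
    idx.insert(("eq", ("add", "?n", "zero"), "?n"))
    idx.insert(("eq", "zero", ("succ", "zero")))
    print(idx.retrieve(("eq", ("add", "?x", "?y"), "?z")))   # only the first entry matches

The sets of results returned by such basic queries can then be combined with set union and intersection, which is roughly what the disjunction and conjunction operators of the query language provide.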