Enhancing Tree-Based Keyphrase Extraction Technique Using Embedding Technique

Title: Enhancing Tree-Based Keyphrase Extraction Technique Using Embedding Technique

Authors: M. Washim Akram, Saiful Azad, Tanbin Ahmed and Md. Solaiman Mia

Conference: STI 2024

Tags: Automatic Keyphrase Extraction Technique, Binary Tree, Candidate Keyphrase, Document Processing and Embedding System

Abstract:

Automatic keyphrase extraction techniques aim to extract high-quality keyphrases for document summarization and many other purposes, including indexing and search optimization, content classification and categorization, topic modeling and analysis, text mining, and data analysis. However, most existing techniques are domain-specific or require domain knowledge. Some employ complex statistical approaches that require significant computational resources, while others depend on large training datasets that may not always be available. To address these challenges, this paper proposes a new keyphrase extraction technique, named ETeKET (Enhanced Tree-Based Keyphrase Extraction Technique), which is domain-independent, requires minimal statistical knowledge, and does not rely on training data. It employs a special variant of the binary tree, called the Keyphrase Extraction (KePhEx) tree, to extract final keyphrases from candidate keyphrases. To assess how cohesive each node is with respect to the root, it also employs a measure called the Cohesiveness Index (CI), which adds flexibility to the process. Since the generation of final keyphrases depends on the creation of candidate keyphrases, the proposed technique integrates the KeyBERT model to leverage contextual embeddings for effective and efficient candidate keyphrase extraction. The effectiveness of ETeKET is evaluated on three benchmark datasets: SemEval-2010 (long texts such as scientific articles), SemEval-2017 (short texts), and Thesis100. Experimental results show significant improvement over its predecessor across various performance metrics.
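
The abstract does not give the construction rules of the KePhEx tree or the formula for CI, so the following Python sketch is only a toy illustration of the general idea and not the authors' method. It assumes that candidate keyphrases are already available (in the paper they come from KeyBERT; here they are hard-coded), that each node keeps at most one preceding word as its left child and one following word as its right child, and that a simple occurrence count stands in for CI. The names `Node`, `build_tree`, `extract_phrase`, and `min_ci` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    word: str
    ci: int = 0                      # toy stand-in for the Cohesiveness Index
    left: Optional["Node"] = None    # word directly before this one
    right: Optional["Node"] = None   # word directly after this one


def build_tree(root_word: str, candidates: List[str]) -> Node:
    """Grow a binary tree around root_word from the candidate keyphrases."""
    root = Node(root_word)
    for phrase in candidates:
        words = phrase.lower().split()
        if root_word not in words:
            continue
        i = words.index(root_word)
        root.ci += 1
        # Walk leftwards from the root over the words that precede it.
        node = root
        for w in reversed(words[:i]):
            if node.left is None:
                node.left = Node(w)
            if node.left.word != w:
                break                # assumption: a mismatching word is ignored
            node.left.ci += 1
            node = node.left
        # Walk rightwards from the root over the words that follow it.
        node = root
        for w in words[i + 1:]:
            if node.right is None:
                node.right = Node(w)
            if node.right.word != w:
                break
            node.right.ci += 1
            node = node.right
    return root


def extract_phrase(root: Node, min_ci: int) -> str:
    """Keep only the nodes on each side whose toy CI reaches min_ci."""
    before: List[str] = []
    node = root.left
    while node is not None and node.ci >= min_ci:
        before.append(node.word)
        node = node.left
    after: List[str] = []
    node = root.right
    while node is not None and node.ci >= min_ci:
        after.append(node.word)
        node = node.right
    return " ".join(list(reversed(before)) + [root.word] + after)


# Hard-coded candidates, standing in for the output of an embedding-based extractor.
candidates = [
    "keyphrase extraction",
    "automatic keyphrase extraction",
    "keyphrase extraction technique",
    "automatic keyphrase extraction",
    "keyphrase extraction tree",
]
tree = build_tree("extraction", candidates)
print(extract_phrase(tree, min_ci=2))  # automatic keyphrase extraction
print(extract_phrase(tree, min_ci=3))  # keyphrase extraction
```

Raising `min_ci` prunes weakly supported words and yields shorter phrases, which is one way to picture the flexibility the abstract attributes to CI.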