Tags: Cosine similarity, Deep learning, Ensemble, Keyword analysis, Large language models (LLM), Title generation
Abstract:
Automatic title generation has emerged as a significant area of interest in natural language generation, since titles readily attract readers' attention. It supports tasks such as web search, academic writing, and news headline generation. Title generation in the Malayalam language is still in its early stages, and little significant work has been done in this area. In this work, we address the title generation task in Malayalam, using social science textbook content from the Social-sum-Mal dataset as training data. Generating titles for Malayalam school textbook content will improve the readability of the documents and contribute to academics. This study employs an ensemble-based method that uses the outputs of three fine-tuned large language models: mBART-50, IndicBART, and mT5. The outputs generated by the fine-tuned LLMs are passed to a scoring mechanism, where they are evaluated and scored on three criteria: keyword analysis, length scoring, and cosine similarity. Based on the overall score, the best of the candidate titles is selected as the title for the input document. The system has been rigorously evaluated using metrics such as ROUGE, BLEU, and BERTScore. The results demonstrate that the ensemble approach effectively generates meaningful titles.
Title-Sum: an LLM-Based Title Generation System for Domain-Specific Documents in Malayalam Language
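
The selection step described in the abstract (scoring candidate titles by keyword analysis, length, and cosine similarity, then keeping the highest-scoring one) can be illustrated with a minimal sketch. The abstract does not give the scoring formulas, so everything below is an assumption made for illustration: whitespace tokenization, keywords taken as the most frequent document tokens, a hypothetical target length of six tokens, bag-of-words cosine similarity, and equal weights. The function names are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: plain whitespace tokenization; real Malayalam text
    # would likely need a language-aware tokenizer.
    return text.lower().split()


def cosine_similarity(a_tokens, b_tokens):
    # Cosine similarity between bag-of-words term-frequency vectors.
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def keyword_score(title_tokens, doc_tokens, top_k=5):
    # Assumption: keywords are the top_k most frequent document tokens.
    keywords = {t for t, _ in Counter(doc_tokens).most_common(top_k)}
    if not keywords:
        return 0.0
    return len(keywords & set(title_tokens)) / len(keywords)


def length_score(title_tokens, target_len=6):
    # Assumption: titles closest to target_len tokens score highest.
    if not title_tokens:
        return 0.0
    return max(0.0, 1.0 - abs(len(title_tokens) - target_len) / target_len)


def select_title(document, candidates, weights=(1.0, 1.0, 1.0)):
    # Return the candidate with the highest weighted overall score.
    doc_tokens = tokenize(document)

    def overall(candidate):
        tokens = tokenize(candidate)
        return (
            weights[0] * keyword_score(tokens, doc_tokens)
            + weights[1] * length_score(tokens)
            + weights[2] * cosine_similarity(tokens, doc_tokens)
        )

    return max(candidates, key=overall)
```

In this sketch, the `candidates` list would hold one generated title from each of the three fine-tuned models, and `select_title(document, candidates)` returns the one with the highest combined score. The weights and the target length are placeholders that would need tuning on held-out data.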