
Lemmatisation for under-resourced languages with sequence-to-sequence learning: A case of Early Irish

12 pages. Published: March 18, 2019


Lemmatisation, one of the most important stages of text preprocessing, consists in grouping the inflected forms of a word together so that they can be analysed as a single item, identified by the word’s lemma, or dictionary form. It is not a very complicated task for languages such as English, where a paradigm consists of a few forms close in spelling; but for morphologically rich languages, such as Russian, Hungarian or Irish, lemmatisation becomes more challenging. Nevertheless, the task is often considered solved for most resource-rich modern languages regardless of their morphological type. The situation is dramatically different for ancient languages, which are characterised not only by a rich inflectional system but also by a high level of orthographic variation and, more importantly, by a very small amount of available data. These factors make automatic morphological analysis of historical language data an underrepresented field in comparison to other NLP tasks. This work describes the creation of an Early Irish lemmatiser with a character-level sequence-to-sequence learning method that proves effective at overcoming data scarcity. A simple character-level sequence-to-sequence model trained for 34,000 iterations reached an accuracy of 99.2% for known words and 64.9% for unknown words on a rather small corpus of 83,155 samples. It outperforms both the baseline and the rule-based model described in [21] and [76], and its results are comparable to those of other systems working with historical data.
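To make the two reported figures concrete, the following minimal Python sketch shows one way accuracy could be computed separately for known and unknown words, together with the kind of character-level input a sequence-to-sequence model consumes. It is an illustration under stated assumptions, not the paper's implementation: the function names, the `<s>`/`</s>` markers, the English toy pairs, and the lookup lemmatiser are all hypothetical, and "known" is assumed here to mean that the word form occurs in the training data.

```python
from typing import Callable, Iterable

Pair = tuple[str, str]  # (inflected form, lemma)


def to_char_sequence(word: str) -> list[str]:
    """Split a word into characters with boundary markers (hypothetical symbols)."""
    return ["<s>", *word, "</s>"]


def split_accuracy(
    train: Iterable[Pair],
    test: Iterable[Pair],
    lemmatise: Callable[[str], str],
) -> dict[str, float]:
    """Accuracy on test forms seen in training ("known") versus unseen ("unknown")."""
    known_forms = {form for form, _ in train}
    hits = {"known": 0, "unknown": 0}
    totals = {"known": 0, "unknown": 0}
    for form, lemma in test:
        bucket = "known" if form in known_forms else "unknown"
        totals[bucket] += 1
        hits[bucket] += int(lemmatise(form) == lemma)
    # An empty bucket has no defined accuracy, so report NaN for it.
    return {b: hits[b] / totals[b] if totals[b] else float("nan") for b in hits}


# Toy English data, for illustration only (not taken from the paper's corpus).
train_pairs = [("walked", "walk"), ("walks", "walk"), ("ran", "run")]
test_pairs = [("walked", "walk"), ("ran", "run"), ("running", "run"), ("walking", "walk")]

# Hypothetical stand-in for a trained model: memorise training pairs and
# return the form unchanged when it has not been seen.
lookup = dict(train_pairs)


def lookup_lemmatiser(form: str) -> str:
    return lookup.get(form, form)


scores = split_accuracy(train_pairs, test_pairs, lookup_lemmatiser)
print(to_char_sequence("ran"))
print(scores)
```

A pure lookup of this kind cannot generalise to unseen forms, which is the gap a character-level model is meant to narrow: because it reads and writes words character by character, it can in principle produce a lemma for a form it has never encountered.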

Keyphrases: Early Irish, lemmatisation, Natural Language Processing, neural networks, sequence-to-sequence learning, under-resourced languages

In: Gerhard Wohlgenannt, Ruprecht von Waldenfels, Svetlana Toldova, Ekaterina Rakhilina, Denis Paperno, Olga Lyashevskaya, Natalia Loukachevitch, Sergei O. Kuznetsov, Olga Kultepina, Dmitry Ilvovsky, Boris Galitsky, Ekaterina Artemova and Elena Bolshakova (editors). Proceedings of Third Workshop "Computational linguistics and language science", vol 4, pages 113--124

BibTeX entry
  author    = {Oksana Dereza},
  title     = {Lemmatisation for under-resourced languages with sequence-to-sequence learning: A case of Early Irish},
  booktitle = {Proceedings of Third Workshop "Computational linguistics and language science"},
  editor    = {Gerhard Wohlgenannt and Ruprecht von Waldenfels and Svetlana Toldova and Ekaterina Rakhilina and Denis Paperno and Olga Lyashevskaya and Natalia Loukachevitch and Sergei O. Kuznetsov and Olga Kultepina and Dmitry Ilvovsky and Boris Galitsky and Ekaterina Artemova and Elena Bolshakova},
  series    = {EPiC Series in Language and Linguistics},
  volume    = {4},
  pages     = {113--124},
  year      = {2019},
  publisher = {EasyChair},
  bibsource = {EasyChair,},
  issn      = {2398-5283},
  url       = {},
  doi       = {10.29007/cxtl}}