On the Way To the Relevant Grammatical Tagset for Tatar National Corpus

9 pages. Published: November 28, 2016


Developing a metalanguage for annotation is one of the topical issues in modern corpus linguistics. A central problem in designing a grammatical tagset for the Tatar National Corpus is to establish the inventory of inflectional categories and to create an optimal metalanguage for describing them. We discuss the factors that complicate grammatical annotation of Turkic corpora in general: the need to overcome the influence of the Indo-European grammatical tradition on the description of Turkic languages, the lack of generally accepted standards for corpus annotation, the lack of a common metalanguage for describing the grammatical categories of Turkic languages, and the poor differentiation between word-building and form-building in Turkic languages, among others. In the course of work on the grammatical annotation system for the Tatar Corpus, we compiled an inventory of the grammatical categories of the Tatar language and developed a metalanguage for describing them.
The tagset currently contains 93 tags. Tags for parts of speech and grammatical categories were designed to conform to international standards, primarily the Leipzig Glossing Rules.

Keyphrases: annotation, grammatical category, metalanguage, tagset, Turkic languages

In: Antonio Moreno Ortiz and Chantal Pérez-Hernández (editors). CILC2016. 8th International Conference on Corpus Linguistics, vol 1, pages 121--129

