
Dialectones: Finding statistically significant dialectal boundaries using Twitter data

EasyChair Preprint no. 12

14 pages. Published: March 15, 2018

Abstract

MOTIVATION Most NLP applications assume that a language is homogeneous across the regions where it is spoken. However, each language varies considerably throughout its geographical distribution. When dialectal variation is large, it can substantially reduce the effectiveness of oral and written communication. To make NLP sensitive to dialects, a reliable, representative and up-to-date source of information that quantitatively represents such variation is necessary. PROBLEM Some current approaches have disadvantages such as the subjectivity of the regions found, the need for parameters, the omission of geographical coordinates from the analysis, and the lack of a statistical test for the existence of the identified dialectal regions. METHOD The detection of ecotones is an analogous problem in ecology that focuses on detecting boundaries in ecosystems instead of regions, which facilitates the construction of statistical tests. We adapted a popular ecotone detection technique called “wombling” to the detection of dialectal boundaries, using the Hilbert-Schmidt independence criterion (HSIC) as the underlying non-parametric statistical test. In addition to addressing the aforementioned drawbacks, the use of HSIC provides robustness against non-linearities present in the linguistic and geographical variables. The proposed method was applied to a large corpus of Spanish tweets produced in 250 locations in Colombia through the analysis of unigram features. RESULTS The resulting dialectal boundaries (i.e. dialectones) proved to be meaningful and spatially correlated with regions identified by other authors using classic dialectology. CONCLUSION We concluded that the automatic detection of dialectones is a convenient alternative to classical methods in dialectology.
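As a rough illustration of the statistical test named in the abstract, the sketch below computes a biased empirical HSIC estimate with Gaussian kernels and obtains a p-value by permutation. It is not the authors' implementation and does not reproduce the wombling step. The function names (`rbf_gram`, `hsic`, `hsic_permutation_test`), the median-heuristic bandwidth, the 1/n² normalization and the permutation scheme are all assumptions chosen for this example.

```python
import numpy as np


def rbf_gram(x, sigma=None):
    """Gaussian (RBF) Gram matrix for the rows of x (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    sq = np.sum(x ** 2, axis=1)
    # Pairwise squared Euclidean distances, clipped to avoid tiny negatives.
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (x @ x.T), 0.0)
    if sigma is None:
        # Assumption: median heuristic for the kernel bandwidth.
        positive = d2[d2 > 0]
        sigma = np.sqrt(0.5 * np.median(positive)) if positive.size else 1.0
    return np.exp(-d2 / (2.0 * sigma ** 2))


def hsic(K, L):
    """Biased empirical HSIC: trace(K H L H) / n^2, with H the centering matrix."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n ** 2


def hsic_permutation_test(x, y, n_perm=500, seed=0):
    """Return (statistic, p_value) for the null hypothesis that x and y are independent."""
    rng = np.random.default_rng(seed)
    K, L = rbf_gram(x), rbf_gram(y)
    observed = hsic(K, L)
    exceed = 0
    for _ in range(n_perm):
        p = rng.permutation(L.shape[0])
        # Permuting the rows and columns of L breaks any pairing between x and y.
        if hsic(K, L[np.ix_(p, p)]) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (exceed + 1) / (n_perm + 1)
```

In a dialectological setting, `x` could hold the geographical coordinates of a set of locations and `y` a linguistic feature such as the relative frequency of a unigram at those locations; a small p-value would then suggest that the feature is not independent of geography. Because the test is kernel-based, it can also pick up non-linear dependence.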

Keyphrases: Dialectal boundaries detection, dialectology, dialectometry, Dialectone, Ecotone, HSIC

BibTeX entry
BibTeX does not have the right entry for preprints. This is a hack for producing the correct reference:
@Booklet{EasyChair:12,
  author = {Carlos A. Rodriguez-Diaz and Sergio Jimenez and George Dueñas and Johnatan Bonilla and Alexander Gelbukh},
  title = {Dialectones: Finding statistically significant dialectal boundaries using Twitter data},
  howpublished = {EasyChair Preprint no. 12},
  doi = {10.29007/9wpx},
  year = {EasyChair, 2018}}