CMC2023: CMC-CORPORA 2023
PROGRAM FOR FRIDAY, SEPTEMBER 15TH

09:00-09:30 Session 8: Community Building

Speakers: Alexander König and Egon Stemle

Location: Aula O102
09:30-10:30 Session 9A: Corpora Construction 2
Location: Aula O102
09:30
Anonymization of Persons in Videos of Authentic Social Interaction: Machine Learning Model Selection and Parameter Optimization.

ABSTRACT. Automatic anonymization of persons in video recordings requires robust detection of face and head areas. Machine learning-based face and posture detectors provide bounding boxes of face and head regions, but specific parameters need to be optimized to maximize the number of correctly anonymized persons and minimize manual annotation and verification efforts. Three different, state-of-the-art ML models (RetinaFace Detector (RFD), Dual-Shot Face Detector (DSFD) and Yolo7-Pose Detector (Y7PD)) were evaluated regarding their suitability for face- and head-region anonymization. Results on our specific anonymization test dataset show that RFD slightly outperforms DSFD if recall (maximizing anonymization) is favored over precision (minimizing false positive face detections). Y7PD yields an even better recall, but at the cost of comparatively low precision. Besides anonymization, collected detector outputs can provide useful data for multimodal interaction research, like body-posture trajectories and face locations.

10:00
A Pipeline for the Large-Scale Acoustic Analysis of Streamed Content

ABSTRACT. Vast quantities of audio and video data are available from video sharing sites, streaming services, and social media platforms, but relatively little of this content has been utilized for acoustic, phonetic, or multimodal analysis of linguistic variation. This article describes a Python-based scripting pipeline for the extraction and analysis of audio from websites that use the common DASH streaming protocol. The pipeline comprises elements from the open-source Python libraries yt-dlp and Parselmouth and uses the Montreal Forced Aligner for aligning audio with text. The scripts are customizable and suitable for the automatic extraction of video as well as audio data. An exploratory proof-of-concept analysis considers the nucleus of the /eɪ/ diphthong in American English: Starting from videos indexed in the Corpus of North American Spoken English, almost 9 million tokens of the segment were retrieved using the pipeline and their values in F1/F2 formant space mapped. As expected, the diphthong nucleus has a more back starting point for speakers in the American Southeast.

09:30-10:30 Session 9B: Linguistics of Inclusion and Discrimination
Location: SO 418
09:30
Digital Corpus Linguistic Analysis of the Language on disability and inclusion in social media – in a German corpus of 2,559 Tweets on #disability and #inclusion between 1st of December – 31st of December 2020

ABSTRACT. This presentation examines digital language use on disability and inclusion, by disabled and non-disabled people, in Social Media. For this examination, we use a small corpus of 2,559 Tweets with 61,249 tokens, which is part of a larger corpus of 214,926 Tweets in total with 5,663,504 tokens. The whole corpus consists of Tweets published between 2007 and 2023 under the hashtags 'inclusion' and 'disability', while the small corpus was published from the first day until the last day of December 2020 UTC. This linguistic study provides valuable insights into the lexicon on disability and inclusion and into the co-occurrences of the lexical units, using AntConc. In addition, our research focuses on the classification of the Tweets via Sentiment Analysis (SentiStrength). The paper shows the potential and capacity of a quantitative Corpus Linguistic examination carried out on a German corpus from Social Media on disability, inclusion, discrimination and exclusion. The study not only provides a quantitative lexicon analysis with AntConc but also combines this with a Sentiment Analysis, which is a research desideratum in German Corpus Linguistics. Our paper also describes the potential of a quantitative lexical and sentiment analysis with SentiStrength for language studies on the communication on disability and inclusion in Social Media, accompanied by a critical reflection on methodological issues.

10:00
The representation of the ‘Jew’ as enemy in French public Telegram channels within the identitarian-conspiratorial milieu

ABSTRACT. The ‘Jew’ as enemy is not new, neither in France nor in Europe. However, according to the CNCDH Report, discourses reminiscent of conspiracy theories have resurfaced during the Covid-19 pandemic (CNCDH, 2022). The hatred is partly driven by a milieu that situates itself between conspiracism and identitarianism and that prefers to spread its ideas on the internet and social networks (Froio 2017). Super-conspiratorial narratives (Soteras 2019) circulate on these platforms and stigmatise the ‘Jew’ as enemy (Schwarz-Friesel 2013). What are the denominative patterns that the milieu uses in France to designate them? To answer this question, a corpus of 90,000 messages posted in ten Telegram messenger channels between January 2018 and May 2022 was analysed. These social media channels are particularly characterised by the homogeneity of their users. Approaches from DA and CxG were applied to the corpus in order to find recurring patterns.

09:30-10:30 Session 9C: Digital Identities 2
Location: O 048
09:30
IDA - Incel Data Archive: a multimodal comparable corpus for exploring extremist dynamics in online interaction.

ABSTRACT. Extremist online communities are growing fast, representing a potential threat for many European and extra-European countries. To understand the dynamics of interaction inside web-based extremist groups, we introduce IDA, the Incel Data Archive: a multilingual and multimodal corpus created by gathering data from Incel forums in both Italian and English. With its cluster of forums, blogs and websites, the Incelosphere represents an ideal case study for delving into the dynamics of interaction within extremist online communities from a cross-cultural perspective. Thus, the contribution of this work is twofold: first, it offers a novel cross-cultural perspective on the Incel phenomenon. Second, our contribution discusses in detail the challenges and opportunities presented by constructing a multimodal and multilingual corpus from discussion forums. To do this, we employ a mixed-method approach to Computer-Mediated Communication. To highlight some important differences between the two communities, we carried out an exploratory analysis using a novel topic modeling technique based on Transformer architectures, which enabled a thematic exploration of the two corpora. The results of our thematic exploration show how the two communities diverge not only in terms of their focus on different discussion topics, but also in the targets of their hateful content.

10:00
“Don’t be afraid of Greeklish”: Adolescent students’ transliteration practices

ABSTRACT. Greeklish, the Latin-alphabet Greek used for the past 30 years in Computer-Mediated Communication (CMC), has sparked much debate in Greek society. However, previous research has mainly recorded adults’ transliteration practices. This study is concerned with the Greeklish transliteration practices used nowadays, mainly by adolescent users, and reports on any differences observed compared to those of adults. The analysis of the Greeklish corpus that was built for this purpose shows that adolescents transliterate some graphemes differently; however, consistency is observed in the transliteration of Greek graphemes that can be transliterated with more than one Latin grapheme, except for the grapheme <y>. Adolescents use mainly the mixed transliteration type, the combination of phonetic and orthographic transliteration, but prefer the orthographic transliteration of vowels in verbs and nouns, especially when those are positioned on the suffix of those word classes.

10:30-11:00 Coffee Break
11:00-12:30 Session 10A: Multimodality in Digital Spaces
Location: Aula O102
11:00
Workflows and Methods for Creating Structured Corpora of Multimodal Interaction

ABSTRACT. Corpus analysis of computer-mediated and/or multimodal interaction can draw on methods developed for written and spoken corpora, while also providing further information such as gaze or walk annotations, or sensor-based data such as Kinect recordings, motion capture, or robot log files. We propose a workflow leveraging the developments of both worlds while simultaneously focussing on standard formats and a sustainable way of research data management.

11:30
Multimodal Intertextual Practices in Video Film Reviews

ABSTRACT. A paradigmatic shift in the functions and formats of film reviews has altered the ways that intertextuality, or the relations of the text to other texts, is rendered in various film review formats. As one of the fastest growing multimodal social networking sites, YouTube offers a nuanced inventory for expressions of intertextuality, which extend beyond exclusively textual practices. This study is part of a larger project which investigates the semiotic resources of evaluation in online video film reviews. Based on a detailed case study of a sample from the Corpus of Online Video Film Reviews (CoVFR), this paper aims to explore the diversity and complexity of multimodal intertextual practices in video film reviews. 

12:00
Acquiring, Analyzing, and Understanding Multimodal TikTok Short Video Data: The Case of Online Sex Worker Visibility Management

ABSTRACT. This paper introduces some key questions that affect how multimodal short video data on TikTok can be accessed, acquired, and analysed. The accompanying research ethical questions will also be highlighted. The issue of data collection is approached in terms of TikTok’s platform features that most readily affect how videos are made visible and available to users: audio centricity of content and its delivery via the For You Page (FYP) recommender algorithm. The specific context of the presented approaches on data gathering and multimodal discourse analysis are connected to the visibility management of sex workers on TikTok as affected by the platform’s content and visibility moderation. The paper presents work-in-progress approaches to data gathering for building multimodal corpora, multimodal discourse analysis, and research ethics of TikTok videos.

11:00-12:30 Session 10B: Features of Digitally-Mediated Communication 2
Location: SO 418
11:00
Phonetic Metaphor of Chinese Emojis: An Approach of Neologism Formation

ABSTRACT. In Internet-mediated communication, emoji has gradually become a non-negligible element, and the visual writing system of language is also experiencing the impact of emoji. Neologisms have also emerged as a result, and one interesting way of creating new words is through phonetic metaphors. Chinese, with its unique character system and one-character-one-syllable feature, is more likely to produce emoji-related phonetic metaphors. For example, through phonetic metaphors, [chilli][chicken] acquires the phonetic sound of 垃圾 “trash” laji and is used to refer to trash-like useless people. This paper explains that the instantiation of this phonetic metaphor approach is a two-way result; on the one hand, the need for expression cannot be directly satisfied by the existing emojis, and on the other hand, it is possible to extract phonetic materials from emojis. Moreover, the paper also argues with semantic and pragmatic evidence that these emoji expressions are neologisms rather than new calligraphic forms.

11:30
"megageil", "mega geil", and "voll mega": Intensification in YouTube comments

ABSTRACT. This paper analyses intensification in German digitally-mediated communication (DMC) using a corpus of YouTube comments written by young people (the NottDeuYTSch corpus). Research on intensification in written language has traditionally focused on two grammatical aspects: syntactic intensification, i.e. the use of particles and other lexical items, and morphological intensification, i.e. the use of compounding. Using a wide variety of examples from the corpus, the paper identifies novel ways that have been used for intensification in DMC, and suggests a new taxonomy of classification for future analysis of intensification.

11:00-12:30 Session 10C: Digitally-Mediated Interaction 2
Location: O 048
11:00
The Reply Function in WhatsApp Chat Communication

ABSTRACT. This paper is based on empirical observations and focuses on the reply function in WhatsApp chat communication. Although this function has been available for several years, it has received little attention in academic research. Through a corpus analysis of both the data collected in the Mobile Communication Database 2 (MoCoDa2) and self-collected chat data from WhatsApp, this study identifies various uses of the reply function in different chat contexts. In one-on-one chats, the reply function can serve diverse purposes, including thematic reference, forming pair sequences, and improving comprehensibility. In contrast, in group chats, it can be used to address one or more participants, continue a previous conversation, or clarify misunderstandings. Structurally, while the quotation’s placement can be influenced by various factors, messages that have been sent to the chat are quoted sequentially. This paper summarizes the functions and structures of using the reply function in individual and group chats, respectively, and contributes to research on internet-based and internet-supported chat communication.

11:30
Specific behaviours in Wikipedia talk pages: some insights from extreme cases

ABSTRACT. Based on a dataset of 3.4 million threads from English Wikipedia talk pages, we specifically focus on extreme cases. We propose a qualitative analysis of the most prolific message authors, the longest threads in terms of messages, contributors and duration, as well as the longest monologues (single-user threads). These case studies allow us to identify a number of behaviours that can significantly differ from the typical discussions between Wikipedians. While some threads do not have a real dialogic status (polls, monologues, logbooks and diaries), some of them push online communication to its limits across time. These sometimes unexpected behaviours can help us get a more precise understanding of this unique source of computer-mediated communication data.

12:00
A Corpus study on the negotiation of pronominal address on talk pages of the German, French, and Italian Wikipedia

ABSTRACT. The adequate use of social deixis is highly dependent on the situation and context and has therefore always been at the center of linguistic pragmatics. So far, principles of pronominal address have mainly been modelled with a focus on oral, co-present interaction. The use of pronominal address in computer-mediated communication with its translocal and partially anonymous contexts is still a research gap. In this context, it is particularly interesting that different digital platforms have developed specific customs or netiquettes regarding the appropriate use of address pronouns. This paper asks, from a contrastive perspective, how the appropriate use of address pronouns is negotiated on talk pages of the German, French, and Italian Wikipedia. The corpus study is based on the multilingual Wikipedia corpora of the Leibniz Institute for the German Language.

12:30-14:00 Lunch
14:00-15:00 Session 11A: Corpora Construction 3
Location: Aula O102
14:00
MMWAH! Compiling a Corpus of Multilingual / Multimodal WhatsApp Discussions by Swedish-speaking Young Adults in Finland

ABSTRACT. This WiP paper will report the compilation of the corpus of Multilingual / Multimodal WhatsApp discussions at Hanken (MMWAH). The target data donors are Swedish-speaking young adults in Finland. This demographic group is of particular interest in the study of bi- and multilingualism and the coexistence of Swedish, English and Finnish in CMC. Furthermore, the corpus will lend itself to research on multimodal / polysemiotic practices in instant messaging and maintenance of social networks and identity-building through the linguistic and other semiotic resources at the speakers’ disposal.

14:30
MigrTwit Corpora. (Im)migration Tweets of French Politics.

ABSTRACT. Since the early 2010s, French politicians have steadily utilized Twitter as a communication tool. Political tweets about immigration are prolifically produced by Marine Le Pen (former leader of the French far-right populist party Rassemblement National). Their biased social representation of migrants, immigrants, and asylum seekers seems to be instilled in many voters through online public debate and media. To characterize political immigration discourse on Twitter, we developed a diachronic bilingual corpus of political tweets posted over the last 12 years, from 2011 to 2022. The whole MigrTwit corpus consists of three subcorpora, for a total of 23,869 tweets with 703,016 words. Building the French MigrTwit corpora enabled us to study the evolution of immigration discourse using comparative and corpus-based approaches.

14:00-15:00 Session 11B: Digital Identities 3
Location: SO 418
14:00
“Hebrew level: Bibist.”: Online Hebrew language corrections as a tool for “civilized” bashing

ABSTRACT. Despite the potential for democratic engagement offered by online commenting, research suggests that political discourse within the Israeli online commenting sphere falls short of realizing its democratic potential. A notable example is the presence of language policing practices. Hebrew online comments often display non-standard language forms, occasionally prompting corrections from individuals who adhere to a standard language ideology. It is these instances of correction that serve as the focal point for the present study. Examining interactions containing language corrections drawn from four prominent Israeli Facebook news pages, we compare interactions that follow posts related to political issues (the judicial reform/coup in Israel) with those that follow posts related to (mostly) non-political matters (celebrities in Israel and abroad). Findings indicate that language corrections are more prevalent in the political context than in the non-political context. Qualitative analysis suggests that language corrections manifest as a form of supposedly “civilized” and sanctioned bashing. These language corrections are not driven by a genuine concern for the Hebrew language, but rather stem from the desire of (typically) left-wing correctors to establish their intellectual superiority over the (typically) right-wing individuals being corrected, thus contributing to the perpetuation of existing stereotypes prevalent in Israeli society.

14:30
Towards a more inclusive approach of digital literacy: social media writing at an older age

ABSTRACT. We present two complementary pilot studies on older adults’ social media literacy. The first pilot discusses a survey among two generations of older adults, the second is based on family WhatsApp conversations between young adults and their parents. While the survey results show a restricted command of abbreviation strategies and emoji pragmatics, in spite of a clear predilection for emoji, the WhatsApp conversations point to a more elaborate exploitation of emoji functions by the parent generation. Still, older adults’ practices clearly do not always align with those of the younger generations. Both lack of knowledge and dislike of specific online practices seem to be determining factors. The pilots constitute the starting point for a more extensive research project on seniors’ social media literacy which in the end should lead to a more inclusive approach of present-day digital literacy.

15:30-16:00 Coffee Break
16:00-17:30 Session 13: Workshop: Working with audio data (Instructor: Thomas Schmidt)

The workshop gives a basic introduction to methodological and technological aspects of working with multimedia data as part of CMC corpora. We will give an overview of the most important tools for manual transcription of audio or video, discuss the role of automatic methods (ASR, Automatic Speech Recognition) and look at standardisation in the TEI framework, including integration with solutions for written CMC data. Researchers at all career stages are welcome; no special equipment is required to follow the workshop.

Location: Aula O102