CMC2023: CMC-CORPORA 2023
PROGRAM FOR THURSDAY, SEPTEMBER 14TH

09:00-09:30 Session 2: Conference welcome

Opening addresses from:

Louis Cotgrove (Organising Committee)

Cornelia Ruhe (Dean of the School of Humanities, University of Mannheim)

Angelika Wöllstein (Deputy Director of the IDS)

Location: Aula O102
09:30-10:30 Session 3: Keynote
Location: Aula O102
09:30
Individual linguistic variability in social media

ABSTRACT. Computer-mediated language has become a popular source of data for analyses in linguistics and social science, aided by convenient access to large-scale ad-hoc corpora. In this talk I will present several case studies to support two main points. First, social media present a varied source of informal, spontaneous, situated discourse that can inform linguistic theory (amongst other lines of research). Second, it is important to study linguistic behavior by actors across media and domains, since both "computer-mediated communication" as a whole as well as specific social media show considerable within-corpus variation that is partially due to intra-speaker variability. Finally, I will discuss how to construct corpora that support this kind of research, from practical, ethical, and sustainability perspectives.

10:30-11:00 Coffee Break
11:00-12:30 Session 4A: Features of Digitally-Mediated Communication 1
Location: Aula O102
11:00
Ellipsis Points in Messaging Interactions and on Wikipedia Talk Pages

ABSTRACT. In this paper, we examine the usage of ellipsis points (EP) in two genres of computer-mediated communication (CMC) using corpora. In two studies, we describe and compare the formal and functional characteristics of EP usage in WhatsApp chats and on Wikipedia talk pages. (1) We present a typology of pragmatic functions of EP in WhatsApp interactions that has been derived from the analysis of a corpus sample and discuss how the practices of EP usage in these data originate from traditions of writing. (2) We investigate typographic and allographic variation of EP on Wikipedia talk pages and examine whether the categories resulting from study 1 also fit this type of CMC discourse. Our analyses show that EP are frequently used for the sequential organization of written interactions and for relationship management between interlocutors in different CMC genres.

11:30
A Multivariate Register Perspective on Reddit: Exploring Lexicogrammatical Variation in Online Communities

ABSTRACT. Even though social media have shaped day-to-day communication for years, the internal linguistic variation associated with these emergent register contexts still remains largely unknown. To fill this gap, the present study evaluates a geometric multivariate approach for this domain by investigating patterns in the visualisation of forty-two lexicogrammatical features derived from systemic functional theory for thirty-three communities on Reddit. The results successfully demonstrate that these subreddits can be interpreted as subregisters of a yet hypothetical macro-register that align with contextual and thus functional differences. Accordingly, this study argues that investigating individual texts rather than broad feature correlation patterns would improve multidimensional analyses of subregisters in general and hybrid web registers specifically. This perspective can not only improve our understanding of language variation at lower levels of instantiation but also hopes to incentivise further research on platform-internal register variation in light of practical implications for context-informed automatic classifications of web documents in functional terms.

12:00
“Also ehrlich” – From adjectival use to interactive discourse marker

ABSTRACT. In this paper we will take a closer look at the German word “ehrlich”. Traditionally, it is seen and described as an adjective. However, this word, as we will demonstrate with corpus data, has widened its domain of usage and is now used frequently, and in combination with other words, as an interactive unit (“interaktive Einheit”). As such, it is typically used in spoken discourse and in written dialogues, while having lost central aspects of its core meaning. Our findings are based on a German reference corpus and a corpus of Wikipedia discussion pages.

11:00-12:30 Session 4B: Corpora Construction 1
Location: SO 418
11:00
Collecting Health Memes for a Subcorpus of Peer Health Discourse

ABSTRACT. A discussion of a social media subcomponent of CADOH, the Corpus of American Discourses on Health, a corpus focusing on how health information in English is conveyed among non-specialists. Previous online data in the corpus is now supplemented by a collection of 200 food and health-focused internet memes collected between 2018 and 2022 using Google image searches. Query terms were chosen to align with the other corpus components: calories, cold, Covid, cholesterol, colonoscopy, diet, fat, flu, health, nutrition, shots, vaccination, germs, and sanitizer. Metadata was collected to track the URL, author, posting date, wording of the caption, keywords of the topic, topic of the image macro, and a jpeg of each meme. The paper discusses the benefits of thematic dataset collection. Applications are suggested for linguistic analyses such as contrasting the performance of online speech acts and for public health investigations into lay beliefs about causality and health outcomes.

11:30
Collecting and de-identifying half a million WhatsApp messages

ABSTRACT. Instant messaging (IM) applications, especially WhatsApp, have become ubiquitous in contemporary computer-mediated communication practices. IM data have the potential to constitute a rich source of research material for corpus linguistics and cultural analytics, owing to their similarities with face-to-face conversations as well as their private nature. In this work, we outline the creation process of a large curated dataset of WhatsApp messages in French. The paper covers the protocol for collecting these messages as well as the de-identification process for removing sensitive information liable to identify the users in these messages. The de-identified dataset will ultimately be made available to researchers on request.
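The abstract does not describe the de-identification protocol itself. As a rough illustration of what one rule-based step in such a process could look like, the sketch below replaces e-mail addresses, URLs and phone-like digit sequences with placeholders. The patterns, the placeholder labels and the function name `deidentify` are assumptions made for this example, not the authors' method; names of people and places would need additional handling such as manual review or named-entity recognition.

```python
import re

# Hypothetical patterns for illustration only; the paper's actual
# de-identification rules are not given in the abstract.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "<EMAIL>"),
    (re.compile(r"https?://\S+"), "<URL>"),
    # A digit, then at least seven digits/spaces/dots/hyphens, then a digit.
    (re.compile(r"\+?\d[\d .-]{7,}\d"), "<PHONE>"),
]


def deidentify(message: str) -> str:
    """Replace each matched span with its placeholder, in pattern order."""
    for pattern, placeholder in PATTERNS:
        message = pattern.sub(placeholder, message)
    return message


if __name__ == "__main__":
    print(deidentify("Appelle-moi au 06 12 34 56 78 ou écris à jean.dupont@example.org"))
```

Placeholders rather than deletions keep the message length and syntactic slot visible, which tends to matter for linguistic analysis of the de-identified text.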

12:00
Little Big Data: Karelian Twitter Corpus

ABSTRACT. This paper investigates Karelian language visibility on Twitter and describes the first corresponding data collection using language-related keywords and hashtags. In total, 2626 entries written fully or partially in Livvi, South and Viena Karelian were scraped with Postman API. The visibility of Karelian on Twitter has been considerably increasing in the past few years, Livvi-Karelian being the most prominent dialect. The data were analysed linguistically (manually and with language detection software) and thematically. Although language-related topics are the most popular, there is a substantial number of entries in eight further topics. Applicability of the collected data for linguistic and sociological research, and further data collection considerations are discussed.

12:30-14:00 Lunch
14:00-15:30 Session 5A: Digital Identities 1
Location: Aula O102
14:00
Not an expert, but not a fan either. A corpus-based study of negative self-identification as epistemic index in web forum interaction.

ABSTRACT. This study examines the linguistic micro-management of identity in and across online contexts, drawing upon corpus-based pragmatic analysis of a structure with a meaning potential to examine wider questions about identity in digitally mediated social life. The structure analyzed is the negative self-identifier of the type “I + copula + not + indefinite NP” as used in UK web discussion forums. It was chosen because it explicitly relates the speaker to the notion of interest, namely identity, and, by negating explicitly stated or presupposed claims, indexes how speakers perceive, and discursively create, the context they are writing into. A qualitative and quantitative analysis of the forms and functions of 936 instances of the structure in their co-texts found negative self-identifiers from the fields of expertise and preferences to be salient in the data. They frame co-texts in which speakers linguistically enact various forms of expertise, which points to heightened reflexivity regarding the epistemic status and social impact of their utterances and to a reconceptualization of expertise as a transient discourse phenomenon rather than a more permanent identity feature.

14:30
Balancing expert and peer-student identities in online discussion forums

ABSTRACT. This paper analyses how students collaborating to improve a translation in online discussion forums construct credibility by projecting an expert image. The analysis focuses on the writing style of three prestige-prominent students, and how they manage to balance the conflicting goals of demonstrating expertise to legitimize their status as advice-givers and asserting their student identities to mitigate imposition. They present themselves as 1) knowledgeable and trustworthy, by using academic and specialized language, adopting a professorial role, citing reliable sources or claiming personal experience; but also 2) as sensitive towards other participants through displays of honesty, humility and in-group solidarity. Their distinct ways of balancing expertise and peer-solidarity arguably explain their relative prominence in the forums, rendering their contributions more reliable and acceptable, and consequently more worth reading by their colleagues, while also probably securing them better grades. The findings are of pedagogical interest for the teaching of academic online discussion skills.

15:00
Studying Socially Unacceptable Discourse Classification (SUD) through different eyes: "Are we on the same page ?"

ABSTRACT. We study Socially Unacceptable Discourse (SUD) characterization and detection in online text. We first build and present a novel corpus that contains a large variety of manually annotated texts from different online sources used so far in state-of-the-art machine learning (ML) SUD detection solutions. This global context allows us to test the generalization ability of SUD classifiers that acquire knowledge around the same SUD categories but from different contexts. From this perspective, we can analyze how (possibly) different annotation modalities influence SUD learning by discussing open challenges and possible open research directions. We also provide several data insights which can support domain experts in the annotation task.

14:00-15:30 Session 5B: Digital Knowledge-Building
Location: SO 418
14:00
Scientific communication on social media: Analysing Twitter for knowledge recontextualisation

ABSTRACT. In the last decades, and as a result of the growing concern with ensuring the democratisation of science, Twitter has gained importance within the scientific community as a means for the transmission of specialised knowledge both among experts and to non-expert audiences. Considering this, the present paper studies the phenomenon of science dissemination and popularisation on Twitter, taking the official accounts of Greenpeace and WWF as its object of analysis. For this purpose, a total of 100 tweets from these accounts were gathered and analysed through manual reading. Based on this analysis, it was found that Twitter for science dissemination primarily responds to informative and engagement purposes, which are materialised through the convergence of the verbal and visual modes and the combination of diverse types of hyperlinks (e.g., hashtags, tags and outbound links). As a result, it was concluded that the use of Twitter features and affordances for science dissemination is determined by the prevalence of non-expert audiences over expert ones.

14:30
The recontextualization of expert knowledge: intertextual patterns in digital science dissemination

ABSTRACT. Technology is having an unprecedented impact on the communication of specialized knowledge, which takes advantage of the use of digital modes and media to reach multiple, diversified audiences. To serve this purpose, expert knowledge is subjected to various processes of recontextualization. I here explore the intertextual patterns used in digital scientific dissemination to recontextualize expert knowledge to reach less specialized audiences, as well as the role played by digital affordances to shape the intertextual patterns identified. For such purposes, I focus on a corpus of 30 digital scientific feature articles, where, among other features, the use of direct/indirect quotations and intertextual text-types and networks are explored. Results reveal the existence of patterns of intertextuality in the feature articles analysed, in which digital affordances (i.e. hyperlinks) combine with other “offline” intertextual resources for different recontextualization purposes and in various ways, depending on the level of expertise and specialization aimed at in the text.

15:00
Can I Publish my Social Media Corpus? Legal Considerations for Data Publication

ABSTRACT. This paper is intended as an aid for linguists and other researchers wishing to compile and publish collections of social media data. Based on the example of the Corpus of Political Tweets by Trump and US Senators (PoTTUS), it demonstrates potential technical steps for obtaining social media data and discusses the requirements social media data need to fulfil to make publication possible: a legal basis for data processing beyond the end of the research process, metadata standards for documentation, and the need for a technical format fit for long-term preservation. The discussion reveals that oftentimes a compromise between researchers’ wishes and the legal limitations imposed by social media companies is the only viable solution. In the case of Twitter data, this might mean publication of Tweet IDs instead of the content of the tweets. The downside of this is so-called ‘data rot’, i.e. loss of parts of the data.
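The trade-off between ID-only publication and ‘data rot’ can be made concrete with a small sketch. It assumes that each collected tweet is a dictionary with an `id` field and that a later re-retrieval attempt yields the set of IDs still available; both the record layout and the function names `dehydrate` and `data_rot` are hypothetical and not taken from the PoTTUS workflow.

```python
from typing import Iterable


def dehydrate(tweets: Iterable[dict]) -> list[str]:
    """Keep only the tweet IDs, i.e. the form that may be publishable."""
    return [str(tweet["id"]) for tweet in tweets]


def data_rot(published_ids: Iterable[str], retrievable_ids: Iterable[str]) -> float:
    """Share of published IDs whose content can no longer be retrieved."""
    published = set(published_ids)
    if not published:
        return 0.0
    lost = published - set(retrievable_ids)
    return len(lost) / len(published)


if __name__ == "__main__":
    corpus = [{"id": 1, "text": "a"}, {"id": 2, "text": "b"},
              {"id": 3, "text": "c"}, {"id": 4, "text": "d"}]
    ids = dehydrate(corpus)
    # Suppose tweet 2 has been deleted by the time someone rehydrates the IDs.
    print(data_rot(ids, ["1", "3", "4"]))
```

Tracking this share over time would give corpus users an estimate of how much of the originally documented data they can still reconstruct.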

15:30-15:45 Coffee Break
15:45-17:15 Session 6: Poster Session
Location: Aula O102
Structural linguistic characteristics of podcasts as an emerging register of computer-mediated communication

ABSTRACT. Podcasts, a relatively recent audio medium, have risen in popularity since their initial appearance in the mid-2000s. Yet, little is known about their structural linguistic characteristics and their relation to other registers. Addressing this gap in the literature, we apply Biber-style multidimensional analysis (MDA) to a representative sample of Spotify podcast transcripts and compare their structural linguistic characteristics to those of selected computer-mediated registers (e.g., informational blog, interview) as well as traditional spoken registers (e.g., broadcast, conversation). Our results reveal that, while podcasts share some linguistic characteristics with traditional spoken registers such as broadcast discussion and unscripted speech, they are unlike any of the analysed registers. In fact, they exhibit unique structural characteristics combining features of involved spoken language with some features typical of informational production and narration. In short, we show that podcasts are a newly emerging register of computer-mediated communication.

Ellipsis of the subject pronoun ich (‘I’) in German WhatsApp chats: A usage-based approach

ABSTRACT. Previous research into ellipsis in “keyboard-to-screen communication” (Jucker/Dürscheid 2012) shows that the first-person singular subject pronoun is frequently omitted in SMS text messages (cf. Androutsopoulos/Schmidt [2002: 69] for German; cf. also Frick [2017: 88-89] for Swiss German). In contrast, Stark and Meier (2017: 224) point out that subject omission is a “quite rare” phenomenon in their data (488 German WhatsApp messages) drawn from the corpus “What’s up, Switzerland?”. This, however, raises the question of whether this decline in the occurrence of subject omissions is linked to the affordances of WhatsApp (cf. Androutsopoulos 2023) or whether it can be explained with reference to other factors, especially with reference to “strong interferences with local Swiss German dialects” (Stark/Meier 2017: 226), which Stark and Meier have observed in their WhatsApp data. Building on previous research into ellipsis in spoken and written interactions, this contribution presents the preliminary results of a study on the omission of the first-person singular pronoun ich (‘I’) based on 712 German WhatsApp chats with 31,693 messages (241,675 tokens) drawn from the “Mobile Communication Database 2”. Using the MAXQDA software, all occurrences of the omitted first-person singular subject pronoun – as well as approximately 8,700 occurrences of ich – identified in this corpus have been manually annotated with a range of morpho-syntactic, sequential and other functional features. These annotations serve as a basis for the analysis with the statistical software R in order to examine whether omissions of subject pronouns are formally and/or functionally motivated and whether some structures including omitted subject pronouns can be interpreted as constructions in the sense of Interactional Construction Grammar (cf. Deppermann 2006; Imo 2015).

References

Androutsopoulos, J. (2023): Kontextualisierung digital: Repertoires und Affordanzen in der schriftbasierten Interaktion. In: Meier-Vieracker, S./Bülow, L./Marx, K./Mroczynski, R. (Hg.): Digitale Pragmatik. Digitale Linguistik, vol 1. Berlin, Heidelberg: J.B. Metzler. https://doi.org/10.1007/978-3-662-65373-9_2.

Androutsopoulos, J./Schmidt, G. (2002): SMS-Kommunikation. Ethnografische Gattungsanalyse am Beispiel einer Kleingruppe. In: Zeitschrift für Angewandte Linguistik 36, 49–80.

Deppermann, A. (2006): Construction Grammar – Eine Grammatik für die Interaktion? In: Deppermann, A./Fiehler, R./Spranz-Fogasy, T. (Hg.): Grammatik und Interaktion. Radolfzell: Verlag für Gesprächsforschung, 43–65.

Frick, K. (2017): Elliptische Strukturen in SMS. Eine korpusbasierte Untersuchung des Schweizerdeutschen. Berlin: De Gruyter.

Imo, W. (2015): Interactional Construction Grammar. In: Linguistics Vanguard, 1–9.

Jucker, A. H./Dürscheid, Ch. (2012): The linguistics of keyboard-to-screen communication. A new terminological framework. In: Linguistik Online 56(6), 39–64. https://doi.org/10.13092/lo.56.255.

Stark, E./Meier, P. (2017): Argument Drop in Swiss WhatsApp Messages. A Pilot Study on French and (Swiss) German. In: Zeitschrift für französische Sprache und Literatur 127(3), 224–252.

ChrisTof: A Novel Corpus of Christian Online Forums

ABSTRACT. We present a novel corpus of online forum posts in a custom-designed TEI XML format, in which posts are grouped according to the discussion threads in which they were produced and the original thread structure is retained. The data include complete archives from two religious German online forums (jesus.de and mykath.de) and two English subreddits on Christianity (r/TrueChristian and r/OpenChristian).

We stored a comprehensive amount of post metadata. This includes post IDs, the timestamp of the post and reactions to posts such as upvotes and downvotes. Additionally, we preserved as much of the forum markup as possible, which means that formatting (e.g. boldface), as well as structural information such as quotations and replies is retained for future analyses. The user names have been automatically pseudonymized.
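The abstract states only that user names were pseudonymized automatically, not how. One common approach, shown here purely as an assumption-laden sketch, is to derive a stable pseudonym from a salted hash so that the same user receives the same label across all threads. The function name `pseudonymize`, the `user_` prefix and the ten-character truncation are illustrative choices, not details from the ChrisTof corpus.

```python
import hashlib


def pseudonymize(username: str, salt: str) -> str:
    """Map a user name to a stable pseudonym via a salted SHA-256 hash.

    The salt is assumed to be kept secret by the corpus creators so that
    pseudonyms cannot be reproduced by hashing known user names.
    """
    digest = hashlib.sha256((salt + username).encode("utf-8")).hexdigest()
    return f"user_{digest[:10]}"


if __name__ == "__main__":
    salt = "replace-with-secret-salt"  # hypothetical value
    print(pseudonymize("maria_1985", salt))
    print(pseudonymize("maria_1985", salt))  # identical to the line above
```

A deterministic mapping keeps reply and quotation relations between users analysable after the original names are gone. User names mentioned inside post text would still need separate treatment.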

Each post was automatically sentence segmented and tokenized using SoMaJo (Proisl and Uhrig, 2016) and all sentences and tokens were given unique IDs. The retained thread structure sets our corpus apart from previous corpora of forum-like CMC represented in TEI. We expect that this would additionally facilitate comprehensive analyses of discourse relations between posts and the discourse structure of discussion threads.
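How unique sentence and token IDs might be derived from a post ID can be sketched as follows. The input is assumed to be already segmented and tokenized (in the corpus this is done with SoMaJo, which is not called here), and the dotted ID scheme `post.sN.tM` as well as the function name `assign_ids` are hypothetical; the corpus's actual TEI identifiers are not given in the abstract.

```python
def assign_ids(post_id: str, sentences: list[list[str]]) -> list[dict]:
    """Attach hierarchical IDs to pre-tokenized sentences of one post.

    `sentences` is a list of sentences, each a list of token strings.
    """
    annotated = []
    for s_idx, tokens in enumerate(sentences, start=1):
        sent_id = f"{post_id}.s{s_idx}"
        annotated.append({
            "id": sent_id,
            "tokens": [
                {"id": f"{sent_id}.t{t_idx}", "text": token}
                for t_idx, token in enumerate(tokens, start=1)
            ],
        })
    return annotated


if __name__ == "__main__":
    post = [["Gott", "ist", "Licht", "."], ["Amen", "."]]
    for sentence in assign_ids("p42", post):
        print(sentence["id"], [t["id"] for t in sentence["tokens"]])
```

Hierarchical IDs of this kind let token-level annotations, such as metaphor tags, point back unambiguously to their sentence, post and thread.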

The main purpose of our data is the annotation of metaphorically used words according to the Metaphor Identification Procedure VU Amsterdam (MIPVU) (Steen et al., 2010) and quantitative as well as qualitative analyses of religious metaphors in the forums. We will present the results of an initial annotation round, where we applied MIPVU to threads from our corpus. Additionally, we use the corpus to train topic models for an analysis of the different Christian communities, of which we will also present early results.

We will make the data available to interested researchers after contacting [removed for reviewing].

Negotiating knowledge in cooperative learning scenarios: a multimodal approach to practices of computer-mediated and face-to-face communication in the university classroom

ABSTRACT. The contemporary ‘digital condition’ (Kultur der Digitalität, Stalder, 2016) has given rise to innovative concepts of learning and teaching. Learning activities and interactions between teachers and learners as well as among learners do not exclusively take place in the physical classroom anymore but also – partly or completely – in the digital sphere using the potentials of computer-mediated communication (CMC). On my poster I will give an outline of my dissertation project, which I have been working on since January 2022 at the University of Duisburg-Essen, and discuss issues related to the collection of a multimodal corpus of interactions including (i) audio and video recordings of Zoom and face-to-face meetings (in peer-to-peer and plenary discussions with teachers), (ii) collaborative text annotation of digitized papers and discussion threads, (iii) cooperative text production with Etherpads on the learning platform Moodle, (iv) interviews with selected students, and (v) logfiles of private text messaging among students. These data are used for the investigation and modelling of digital and face-to-face practices of negotiating knowledge in a hybrid learning scenario in higher education (university seminar in linguistics). The learning scenario is designed to foster students’ competencies in comprehending researchers’ perspectives and approaches in linguistics papers and to improve their skills in discussing theoretical concepts derived from their readings. The scenario is characterized by the following features (examples with rough English translations illustrate the different stages of the setting):

• Student teams cooperatively elaborate on the theoretical frameworks and findings reported in papers and book chapters and discuss them on the basis of key questions provided by the teacher. They annotate and discuss texts using the Moodle activity type ‘Textlabor’ where they can comment on the text and also verbalize when they have difficulties understanding certain text passages. Example 1 (comment in a Textlabor discussion thread on a figure in a paper by Auer (2000), November 23, 2022): Student 1: “Also ich kann das hier gar nicht gut nachvollziehen. Können wir da Freitag drüber reden?” I really can’t comprehend this [figure]. Can we talk about that on Friday?

• Based on their annotations and discussion threads, the teams talk about their findings and questions regarding text passages in Zoom meetings. They use Etherpads to take notes and/or edit texts containing their results. Example 2 (Zoom meeting two days after student 1 posted his comment represented in Ex. 1, November 25, 2022): Student 2 shares her screen that shows the Textlabor comment from Ex. 1: “so (---) das ist eh die frage (.) im zweiten text”. this is uh the question (.) in the second text. Student 1 expresses his displeasure without repeating or rephrasing his comment: “ACH ja (.) hm ja (.) also DIE grafik […] (das war) also WIRklich” [kollektives Lachen] oh right (.) um yeah (.) well THAT figure […] (that was) HONestly [students laugh collectively] […] Student 2: „ich kann das auch nicht verstehen (---) ehm (3.0) SCHWIErig (---) °h OH (.) d_DAS ist ein [beispiel] von kombination also man s_also der autor meinte in (dieser) äußerung […]“ i don’t understand it either (---) um (3.0) DIFFicult (---) °h OH (.) th_THAT is an [example] of combination so you s_so in this statement the author meant […]

• In class they discuss their results and open questions face-to-face with the other teams and the teacher.

• The students use private messaging apps and/or other communication channels of their own choice to organize their team work. Example 3 (WhatsApp messages on November 25, 2022, 90 minutes prior to the Zoom meeting): Student 3: „Meint ihr, wir sollten gemeinsam Absatz für Absatz im Meeting durchgehen […]“ Do you think we should discuss paragraph by paragraph in our meeting […] Student 1: […] „Vielleicht gucken wir uns erstmal an, was im Textlabor bearbeitet wurde und orientieren uns dann an den Aufgaben?“ Maybe we should take a look at the comments in the Textlabor first and then focus on the assignments [provided by the teacher]? […] Student 1: „[…] finde besonders den Auer-Text echt schwierig...“ […] especially the text by Auer is really difficult…

On my poster I will include sample data and give insights into my analyses. In my analyses I combine the perspectives and concepts of interactional linguistics (Imo/Lanwer, 2019), research on ‘digital practices’ (see Androutsopoulos, 2016; Beißwenger, 2016) and research on negotiating knowledge in classroom discourse (see e.g. Morek/Heller/Quasthoff, 2017).

At its current stage, the collected data set has the status of a “corpus in the wider sense” (sensu Beißwenger/Lüngen, 2022): a collection of audio and video files, stored logfiles and text documents as well as transcribed audio and video files for the purpose of linguistic and conversational analysis. On the poster I will present the procedure of data collection in three linguistics classes (April 2022 to July 2023) with a special focus on the handling of ethical and GDPR issues and on the challenge of dealing with the observer’s paradox, i.e. the challenge of designing the observation process as unobtrusively as possible:

• Prior to the data collection, the students were informed about the project without expanding on the research questions in order to avoid priming effects.

• It was pointed out that participation in the data collection is voluntary and non-participation does not have any negative implications.

• The students gave informed consent by specifying which data types may or may not be collected (Gestufte Einverständniserklärung, Stukenbrock, 2022: 313).

• Data were collected in a “natural”, i.e. non-experimental setting by using unobtrusive recording devices and, whenever possible, in my absence (see Stukenbrock, 2022: 312).

A main motivation for my presentation at the conference is to get in touch with other researchers with experience in using state-of-the-art corpus technology and to discuss issues of representing and analyzing (multimodal) corpora with heterogeneous data types.

References

Androutsopoulos, J. (2016). Mediatisierte Praktiken: Zur Rekontextualisierung von Anschlusskommunikation in den Sozialen Medien. In A. Deppermann, H. Feilke & A. Linke (Eds.), Sprachliche und kommunikative Praktiken (pp. 337–367). Berlin, Boston: de Gruyter.

Beißwenger, M. (2016). Praktiken in der internetbasierten Kommunikation. In A. Deppermann, H. Feilke & A. Linke (Eds.), Sprachliche und kommunikative Praktiken (pp. 279–310). Berlin, New York: de Gruyter.

Beißwenger, M. and Lüngen, H. (2022). Korpora internetbasierter Kommunikation. In M. Beißwenger, L. Lemnitzer & C. Müller-Spitzer (Eds.), Forschen in der Linguistik. Eine Methodeneinführung für das Germanistik-Studium (pp. 431–448). Paderborn: Brill|Fink (= UTB 5711).

Imo, W. and Lanwer, J. P. (2019). Interaktionale Linguistik. Eine Einführung. Stuttgart: Metzler.

Morek, M., Heller, V. and Quasthoff, U. (2017). Erklären und Argumentieren. Modellierungen und empirische Befunde zu Strukturen und Varianzen. In I. Meißner & E. L. Wyss (Eds.), Begründen – Erklären – Argumentieren. Konzepte und Modellierungen in der Angewandten Linguistik (pp. 11–45). Tübingen: Stauffenburg.

Stalder, F. (2019). Kultur der Digitalität. Berlin: Suhrkamp.

Stukenbrock, A. (2022). Audio- und Videographie. In M. Beißwenger, L. Lemnitzer & C. Müller-Spitzer (Eds.), Forschen in der Linguistik. Eine Methodeneinführung für das Germanistik-Studium (pp. 307–323). Paderborn: Brill|Fink (= UTB 5711).

Deontic Authority in Computer-mediated Communication Between University Teachers and Students: A Comparative Study of German and Chinese

ABSTRACT. My poster presents work in progress. I will shed light on the central question of my doctoral project, which I have been conducting since October 2022 at the University of Duisburg-Essen. It focuses on deontic authority between teachers and students at German and Chinese universities, i.e., how it is constructed, demonstrated and negotiated through different practices. The data are drawn on the one hand from WeChat interactions between Chinese university teachers and students, and on the other hand from email correspondence between German university teachers and students. The study takes a comparative perspective and aims to provide insights into computer-mediated communication in institutional contexts for two different languages and cultures. Deontic authority, which forms the main theoretical framework of my study, is about getting the world to match the words, i.e., determining what ought to be (Stevanovic 2013). It is no exaggeration to say that most of our actions have something to do with deontic authority. In institutional communication, for instance between teachers and students, interlocutors have to deal with (i) deontic status, which is determined, e.g., by their institutional roles, and (ii) deontic stance, i.e., locally and interactionally positioned expressions regarding deontic rights (Frick/Palola 2022). It is the dynamics of deontic authority in authentic interactions that makes this topic interesting. I examine my data within the methodological framework provided by (i) interactional linguistics (Couper-Kuhlen/Selting 2018) and conversation analysis and (ii) research on the affordances and practices of computer-mediated communication (e.g., Beißwenger 2016).

I focus on types of interaction that are typical of digital institutional communication between teachers and students (making appointments, discussing term papers or bachelor's and master's theses, answering questions related to lectures and seminars, etc.) and try to explain how deontic authority comes into play. The following example from my WeChat data shows some interesting practices of the teacher (Q) and the student (Nan). Q is the supervisor of Nan's master's thesis. Firstly, the student is observed to perform transformed turns and to adjust her actions successively (e.g., making extreme case formulations and promises in posts 14 to 19) in the face of the perceived deontic authority of the teacher. With an institutional task in mind, she infers the teacher's intention and grants him deontic authority. By performing resistance, i.e., refusing Q's advice, she nevertheless does not completely relinquish her deontic rights. Secondly, the teacher responds to the student's turns only selectively (a possibly even clearer indication is his abrupt abandonment of the dialogue) and drives the sequential development forward in line with his own intention, which, I argue, demonstrates a stronger deontic stance through control of the sequential organization. At the same time, the teacher refers to and relies on his epistemic primacy (posts 4-8, 15) as a vehicle and veil for imposing his deontic authority, i.e., he demonstrates his greater know-that about writing a thesis in order to persuade the student to accept his advice.

Example: Q 14:09 论文的整体框架出来了没有 Have you already finished the main structure of your paper? WeChat #1, (03.02.2022, 14:09 PM)

Q 14:09 什么时候给我交一个整体的论文 When will you hand over a complete paper WeChat #2, (03.02.2022, 14:09 PM)

Nan 14:16 老师,二月 10 号发给您整体的论文~ Teacher, I will send you the complete paper on 10th February~ WeChat #3, (03.02.2022, 14:16 PM)

Q 14:16 时间来得及吗 Do you have enough time WeChat #4, (03.02.2022, 14:16 PM)

Q 14:16 我需要时间看 I need time to read it WeChat #5, (03.02.2022, 14:16 PM)

Q 14:16 你还要修改 And you still have to revise it WeChat #6, (03.02.2022, 14:16 PM)

Q 14:16 我还要再看 And I have to read it again WeChat #7, (03.02.2022, 14:16 PM)

Q 14:16 够吗 Is it enough WeChat #8, (03.02.2022, 14:16 PM)

Q 14:16 建议你延期毕业 I advise you to postpone your graduation WeChat #9, (03.02.2022, 14:16 PM)

Nan 14:16 这周日给您 I will give you this weekend WeChat #10, (03.02.2022, 14:16 PM)

Q 14:16 我不是第一次提醒你 This is not the first time I remind you WeChat #11, (03.02.2022, 14:16 PM)

Q 14:16 延期到今年年底毕业 Postpone your graduation till the end of this year WeChat #12, (03.02.2022, 14:16 PM)

Q 14:16 这样你还有大半年的时间好好写论文 So you will still have most of the year to focus on your paper WeChat #13, (03.02.2022, 14:16 PM)

Nan 14:16 老师,我的论文大致已经写完了 目前在按您第一次提的意见修改 Teacher, I have already finished most of my paper Currently I am revising it according to your first advice WeChat #14, (03.02.2022, 14:16 PM)

Q 14:16 你时间不够 Your time is not enough WeChat #15, (03.02.2022, 14:16 PM)

Q 14:16 利用假期跟家人充分沟通 Make use of your vacation to communicate with your family WeChat #16, (03.02.2022, 14:16 PM)

Q 14:16 准备延期毕业 Prepare to postpone your graduation WeChat #17, (03.02.2022, 14:16 PM)

Nan 14:16 老师,我这段时间一定尽全力修改 Teacher, I promise to try my best to revise it recently WeChat #18, (03.02.2022, 14:16 PM)

Nan 14:16 所有的论文真的已经写完了 I have really finished my paper WeChat #19, (03.02.2022, 14:16 PM)

This study is intended to be qualitative. I have already obtained approximately 1200 sequences of WeChat interactions between teachers and students, based on voluntary donation and with due consideration of privacy protection. As a next step, I am planning to collect email correspondence between German teachers and their students. For this purpose, I have created a data-collection plan which also attends to the privacy protection of the interlocutors involved. At the same time, I am analyzing the WeChat data and further developing my categories of analysis. At the moment, the WeChat sequences are represented as more or less ‘raw’ data (1: screenshots of the original sequences as displayed on the students’ smart devices, 2: the written and graphic content of the sequences stored in text documents, 3: a prose representation of the metadata relevant for my analysis). It is thus not (yet) a “corpus in the narrower sense” (sensu Beißwenger/Lüngen 2020). By presenting my project at the CMC-CORPORA conference, I hope to discuss and learn how other researchers represent and handle their CMC data for the purposes of documentation and analysis, and to learn about tools that may be useful for storing and annotating my data (two languages, two different types of CMC).

References

Beißwenger, M. (2016). Praktiken in der internetbasierten Kommunikation. In A. Deppermann, H. Feilke & A. Linke (eds.), Sprachliche und kommunikative Praktiken (pp. 279-310). Berlin, New York: de Gruyter.

Beißwenger, M., & Lüngen, H. (2020). CMC-core: a schema for the representation of CMC corpora in TEI / Le CMC-core : un schéma de représentation des corpus de la CMR en TEI. Corpus, 20.

Couper-Kuhlen, E., & Selting, M. (2017). Interactional linguistics: Studying language in social interaction. Cambridge University Press.

Frick, M., & Palola, E. (2022). Deontic Autonomy in Family Interaction: Directive Actions and the Multimodal Organization of Going to the Bathroom. Social Interaction. Video-Based Studies of Human Sociality, 5(1).

Stevanovic, M. (2013). Deontic rights in interaction: A conversation analytic study on authority and cooperation.

CoDEC-M: the multi-lingual Manosphere subcorpus of the Corpus of Digital Extremism and Conspiracies

ABSTRACT. The widespread social isolation and spike in right-wing rhetoric driven by the COVID-19 pandemic have been accompanied by relentless news and social media coverage of incels, the red pill, and pickup artists. Additionally, the language and rhetoric of these communities have received more mainstream attention (consider Andrew Tate). Academics have documented these primarily online communities (Ribeiro et al., 2021) and their speech (Pelzer et al., 2021), but previous corpus analysis of CMC data focusing on the manosphere has targeted only English data (e.g. Thomas, 2022; Bogetic, 2022). The Corpus of Digital Extremism and Conspiracies (CoDEC) is an open-source, open-access corpus made up of several subcorpora documenting different online spaces where extremists and conspiracy theorists gather. CoDEC-M is a subcorpus that addresses this growing interest in the manosphere. The pilot of this multi-lingual manosphere subcorpus was designed to address the gap in knowledge on language contact and transfer between the English-speaking manosphere and its non-English-speaking equivalents. While we plan to expand the subcorpus to include more languages, the pilot to be presented focuses on comparing the shared themes and rhetoric of Russian- and English-speaking communities. CoDEC-M consists of data in two languages taken from two sources: 1,000,000 words of Russian from the ongoing /incel/ thread on 2chan’s /sex/ board and 1,000,000 words of English from various threads on the forum incels.is. In addition to presenting the process behind scraping CoDEC-M and discussing the practical and ethical issues associated with big data corpora documenting these communities, we will also compare the rhetoric of the English and Russian components of CoDEC-M to one another (in translation) and to relevant reference corpora hosted by SketchEngine in order to identify possible transfer of the misogynist themes and language currently seen in mainstream media.

References

Bogetic, K. (2022). Race and the language of incels: Figurative neologisms in an emerging English cryptolect. English Today, pp. 1-11. DOI: 10.1017/S0266078422000153.

Pelzer, B., Kaati, L., Cohen, K., & Fernquist, J. (2021). Toxic language in online incel communities. SN Social Sciences, 1:213. https://doi.org/10.1007/s43545-021-00220-8.

Ribeiro, M.H., Blackburn, J., Bradlyn, B., de Cristofaro, E., Stringhini, G., Long, S., Greenberg, S., & Zannettou, S. (2021). The Evolution of the Manosphere Across the Web. Proceedings of the Fifteenth International AAAI Conference on Web and Social Media, pp. 196-207.

Thomas, M. (2022). A Quantitative Analysis of the Language Used by Violent and Non-Violent Incels. [Master’s thesis, University of North Carolina at Chapel Hill]. Carolina Digital Repository. https://cdr.lib.unc.edu/downloads/vt150t849.

Zooming in on emerging norms: Preliminary findings from a cross-linguistic investigation of backchanneling

ABSTRACT. In this pilot study, we examine conversational norms in the multimodal register of Zoom across several typologically distinct languages and varieties. Specifically, we combine methods from variationist and interactional linguistics to investigate backchanneling, i.e., the use of minimal responses like uh-huh and mhm to signal interlocutor engagement (Oreström 1983, Eiswirth 2020).

Data collection is ongoing: so far, we have videoconferencing data from six language varieties (American English, Asante Twi, Finland Swedish, German German, Ghanaian English, and Gulf Arabic). Our goal for this poster is two-fold: first, to introduce our pilot videoconferencing corpus, consisting of 13 sociolinguistic interviews conducted via Zoom from five countries (USA, Ghana, Germany, Finland, and Kuwait); second, to present preliminary findings from a cross-linguistic analysis of backchanneling.

We follow Eiswirth (2020) in quantifying backchanneling as the normalized number of backchannels per number of words in a turn. Consistent with previous work, backchanneling increases with turn length; however, turns are shorter in videoconferencing (mean length=25.9 words) than previously reported for face-to-face interaction (mean length in Eiswirth 2020=100 words). With respect to the frequency of backchanneling, a conditional inference tree reveals that the interview group is significant, while variety is not. This suggests that individual communicative styles and rapport may be more relevant than interlocutors’ linguistic backgrounds and that norms are still emergent.
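The normalization described above can be illustrated with a minimal sketch. It assumes that each turn has already been annotated with a word count and a backchannel count; the `Turn` class, the per-100-words base, and the toy numbers are illustrative assumptions, not the authors' actual pipeline or data.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Turn:
    """One speaker turn; both counts are hypothetical annotation outputs."""
    n_words: int         # words spoken by the turn holder
    n_backchannels: int  # listener backchannels (e.g. "uh-huh", "mhm") during the turn


def backchannel_rate(turn: Turn, per_words: int = 100) -> float:
    """Backchannels per `per_words` words of the turn (assumed normalization base)."""
    if turn.n_words <= 0:
        raise ValueError("a turn must contain at least one word")
    return turn.n_backchannels * per_words / turn.n_words


def mean_turn_length(turns: list[Turn]) -> float:
    """Average turn length in words across a set of turns."""
    return mean(t.n_words for t in turns)


# Invented toy data, not taken from the corpus.
turns = [Turn(n_words=20, n_backchannels=1), Turn(n_words=40, n_backchannels=3)]
rates = [backchannel_rate(t) for t in turns]
```

Normalizing by turn length keeps long and short turns comparable, which matters here because the reported videoconferencing turns are much shorter than the face-to-face turns they are compared with.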


References

Archibald, M. M., Ambagtsheer, R. C., Casey, M. G. & Lawless, M. (2019). Using Zoom videoconferencing for qualitative data collection: Perceptions and experiences of researchers and participants. International Journal of Qualitative Methods, 18, 1–8.

Eiswirth, M. E. (2020). Increasing interactional accountability in the quantitative analysis of sociolinguistic variation. Journal of Pragmatics, 170, 172-188.

Oreström, B. (1983). Turn-taking in English conversation. Gleerup: Lund.

Semantic Prosody Evolution of the Word "女权[Feminist]" in Chinese Social Media

ABSTRACT. Semantic prosody refers to the phenomenon whereby a word attracts words with shared semantic characteristics, thus forming a habitual collocation pattern and semantic atmosphere. Semantic prosody is an extension of associative meaning beyond the boundaries of words. This kind of association can be said to be caused by the user's native language, but it has a certain breadth and stability. Through the study of the semantic prosody of a given term, we can intuitively understand people's attitudes and emotions towards it, and then explore the cultural and social significance behind it. In China, the feminist movement was long a persistent but niche voice. Since the #MeToo movement was introduced to China in 2018, setting off a wave of women's complaints against sexual harassment and sexual assault, women's rights have received unprecedented attention and discussion. At the same time, however, feminists are facing more and more "counterattacks" on the Internet, and the word "女权[Feminist]" has gradually acquired a certain offensive meaning. This study collected language data from mainstream Chinese social media in recent years and constructed a small longitudinal corpus; then, using corpus software such as AntConc and identifying significant collocations by their MI and T values, the semantic prosody of "女权[Feminist]" was analyzed. Using a data-driven research approach, this study examines the evolution of the semantic prosody of the word "女权[Feminist]" in order to analyze the attitudes and emotions of the people behind it (the group represented by social media users) towards this word and the concept of "feminism".
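For readers unfamiliar with the two collocation measures mentioned, the following is a minimal sketch of how MI and T values are commonly computed from frequency counts. The expected-frequency formula shown is one common textbook variant and is an assumption here; the abstract does not state which variant or which settings were used, and the counts in the example are invented.

```python
import math


def expected_cooccurrence(f_node: int, f_collocate: int,
                          corpus_size: int, span: int = 1) -> float:
    """Expected co-occurrence count under independence (assumed variant)."""
    return f_node * f_collocate * span / corpus_size


def mi_score(observed: int, expected: float) -> float:
    """Mutual information: log2 of observed over expected co-occurrences."""
    return math.log2(observed / expected)


def t_score(observed: int, expected: float) -> float:
    """T-score: difference between observed and expected, scaled by sqrt(observed)."""
    return (observed - expected) / math.sqrt(observed)


# Hypothetical counts for a node word and one collocate in a 1,000,000-token corpus.
expected = expected_cooccurrence(f_node=1000, f_collocate=500, corpus_size=1_000_000)
mi = mi_score(observed=50, expected=expected)
t = t_score(observed=50, expected=expected)
```

MI tends to favour rare but exclusive word pairs, while the T-score favours frequent pairs, which is one reason the two values are often considered together when selecting significant collocates.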

What kind of socio-emotional support can families dealing with autism receive in social media? Sentiment and attitudinal analyses of comments on Bilibili videos from 2019-2022

ABSTRACT. Informal social support provided by individuals such as relatives and friends has been reported as more effective than formal social support provided by autism communities or organizations (Shepherd et al., 2020). This study focused on the informal social support provided by individuals online, specifically the emotional support provided by viewers commenting on autism-themed videos on social media. We constructed a small social media corpus of 192,263 words, comprising 3,766 comments extracted from 30 autism-themed videos on the well-known Chinese social platform Bilibili (https://www.bilibili.com/). This study adopted a mixed-methods approach combining sentiment analysis using natural language processing (NLP) techniques and attitudinal analysis based on the appraisal system (Martin and White, 2005). The results of the sentiment analysis (M = 0.77, SD = 0.29) indicated that families dealing with autism might encounter a positive and supportive ambiance when they engage with social media. Three types of comments were clustered in the attitudinal analysis: positive attitude, information about autism, and negative attitude. Based on the attitude subsystem of the appraisal system (Martin and White, 2005), we found four subtypes of positive attitudes (i.e., sympathy, defense, wishes, and suggestions) and three of negative attitudes (i.e., antipathy, slander, and violence). Four themes extracted in the category of information about autism were a) asking about symptoms of autism, b) sharing one’s experience facing autism or people with autism, c) popularizing what autism is, and d) promoting treatments for autism. Notably, misinformation was found in comments expressing both positive and negative attitudes, which indicates that public misconceptions of autism persist to some degree.
This study revealed the supportive social ambiance composed of public sympathy, defense against stigmatization, best wishes for autistic people and their families, and kind suggestions for autism groups on Chinese social media.

Tracing Perceptions of Black History by Comparison of Two Corpora

ABSTRACT. This paper contrasts modern perspectives on Black history with historical documents through close and distant readings. A Twitter corpus was created using the hashtag #BlackHistoryMonth. It was then examined through topic modeling using BERTopic. Based on the results, thematically matching historical sub-corpora were assembled using documents from the "BWAT - Black Writing and Thought Collection" at the University of Chicago. To ensure the linguistic comparability of the corpora, linguistic measures such as the type-token ratio were applied. The results show that the topics of #BlackHistoryMonth discourse span almost all areas of life and often involve historical figures, indicating that Black memory culture is associated with Black individuals rather than historical events. In contrast to the historical narratives, the tweets show that African Americans follow the white American doctrine of national heroism. The tweets also include criticism of recent critical race theory legislation and call for alternative methods of teaching Black history, such as visiting memorials.
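As an illustration of the comparability measure named above, the following minimal sketch computes a type-token ratio (TTR). Because plain TTR falls as texts get longer, the sketch also includes a standardized variant averaged over fixed-size windows; the tokenizer, the window size, and the choice of variant are assumptions for illustration, since the abstract does not specify how the ratio was computed.

```python
import re


def tokenize(text: str) -> list[str]:
    """Very simple lowercase word tokenizer (an assumption, not the authors' tool)."""
    return re.findall(r"\w+(?:'\w+)?", text.lower())


def type_token_ratio(tokens: list[str]) -> float:
    """Number of distinct word forms divided by the number of tokens."""
    if not tokens:
        raise ValueError("cannot compute TTR of an empty token list")
    return len(set(tokens)) / len(tokens)


def standardized_ttr(tokens: list[str], window: int = 1000) -> float:
    """Mean TTR over consecutive full windows; falls back to plain TTR for short texts."""
    chunks = [tokens[i:i + window]
              for i in range(0, len(tokens) - window + 1, window)]
    if not chunks:
        return type_token_ratio(tokens)
    return sum(type_token_ratio(c) for c in chunks) / len(chunks)


# Invented example sentence, not corpus data.
sample = tokenize("The cat saw the other cat")
ratio = type_token_ratio(sample)
```

A windowed variant is worth considering when comparing short tweets with longer historical documents, since corpus size alone can otherwise drive the difference in plain TTR.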

Building corpora of Russian fake and genuine news for linguistic analysis

ABSTRACT. The speed with which fake news spreads online and people’s resistance to changing their minds continue to be a growing problem (Mosleh et al. 2021). Addressing stylistic and grammatical features of fake news is one of the promising lines of research (Grieve and Woodfield 2023). In Russia, where it is intertwined with the limitations imposed on freedom of speech, the issue has become particularly pressing during the Covid-19 pandemic and the invasion of Ukraine.

This work describes the challenges of building a corpus of Russian fake news and a matching reference corpus of genuine news for the purpose of linguistic analyses, including comparing patterns of grammatical variation based on multidimensional register analysis (Biber 1988). The primary aim of creating the corpora is to investigate the language and style of fake news in Russian. A secondary goal is to use the datasets to improve fake news detection by automating the identification of its defining linguistic features.

Building the Russian fake news corpora is a unique process with its own challenges and implications. Unlike similar English corpora, the Russian fake news datasets consist mostly of social media texts, predominantly from Telegram and Facebook, rather than texts from well-known news outlets. This is due to the tendency of news outlets to reproduce false claims by citing or paraphrasing other sources to avoid legal responsibility. Investigations performed by fact-checkers usually lead to original messages in smaller and mostly anonymous social-media-based outlets.

To date, the fake news corpus is a dataset of over 140,000 tokens, with the veracity of claims confirmed by carefully chosen fact-checking agencies. The compilation of the reference (genuine news) dataset is in its early stages. The need to control for source, register, size, authorship, and other variables makes it a demanding but rewarding task, resulting in unique, well-balanced datasets that are ready for exploration.

Building a Parallel Discourse-annotated Multimedia Corpus

ABSTRACT. We present the building process of a novel parallel discourse-annotated multimedia corpus, including data collection, preprocessing, paragraph-level alignment between documents and annotation of discourse structure. Our goal in building such a corpus is to compare how the same idea is linguistically realized in different media. The presented corpus contains texts from two parallel media that present the same information in two communicative situations: podcasts and blog posts. After the podcasts had been automatically transcribed, each podcast episode and its corresponding blog post were annotated manually for parallel segments. This paragraph-level alignment was carried out so that discourse structure between texts can be compared in terms of linguistic features. The resulting corpus comprises 73 episodes in each medium (14,598 tokens in the blog posts, 125,182 tokens in the transcribed podcasts). The blog posts and the corresponding parallel podcast segments have been discourse-annotated in two frameworks: Rhetorical Structure Theory (RST) and Questions Under Discussion (QUD). Both theories represent discourse structure as a tree. RST derives plausible global text structures by connecting discourse units using discourse relations (Mann & Thompson, 1988). It was primarily designed for analyzing well-written text. The QUD model, on the other hand, treats discourse as a series of implicit and explicit questions that are answered one by one (Roberts, 2012). It is mostly used to analyze dialogue. We compare the resulting annotations to find similarities and differences between the discourse models and between the two media. We plan to release the corpus under a Creative Commons license.

Linguistic features, device affordances, and contextual factors: A mixed-methods, two-corpora approach

ABSTRACT. In CMC research, the role of the particular technological device used to send messages is rarely taken into account; messages sent by computer and phone are implicitly treated as broadly similar (cf. Jucker & Dürscheid, 2013). In contrast, laypeople often believe their messages vary between device types, e.g., regarding message length, use of emoji, capitalisation, etc. The present study thus investigates the potential influence of device on such microlinguistic features. Rejecting any technological determinism, i.e., that computer-sent and phone-sent messages differ categorically, the study instead favours an affordance-based approach (cf. Hutchby, 2001). Device properties (e.g., keyboard type, autocorrect) afford the use of various linguistic features more easily or less, which may lead to linguistic variation in CMC messages. However, affordances are one influence among many, as contextual factors like synchronicity can also play a role in linguistic variation, as can individual user style.

Drawing inspiration from computational sociolinguistics (cf. Nguyen et al., 2016), the empirical study uses quantitative and qualitative methods to investigate both device affordances and their interactions with other factors. To explore both aspects, a two-strand approach was designed, with each strand relying on its own type of corpus. Section (1) focuses solely on the influence of device affordances, and uses a large-scale corpus, so as to explore general trends found across device types. Section (2) focuses on interactions between affordances and contextual factors, and thus uses a smaller-scale corpus with richer information about both message context and the users. This two-strand, two-corpora approach allows for a richer understanding of both the possibilities and limits of the influence of device affordances.

Thus, for Section (1), a large-scale corpus of a million anonymous Twitter messages was collected, with the only metadata being device type. Five categories of microlinguistic features were examined: length, acronyms and abbreviations, emoji and emoticons, punctuation, and non-standard orthography. Quantitative analysis found weak but relatively consistent differences across them, for example, a higher frequency of emoji on the phone. For Section (2), a small-scale corpus of 50,000 messages from the platforms Twitter and Discord was collected from the same eleven participants. The two platforms differ in regard to contextual factors like synchronicity and audience size, and thus it is possible to compare how the influence of device affordances differs across them. Quantitative analysis found variation for both device type and platform, while a fine-grained qualitative analysis showed that users differed also in the extent to which they adhered to or circumvented device affordances. For example, mirroring the findings in the large-scale corpus, across both Twitter and Discord phone-based emoji frequency was overall found to be higher, while contrary results were also found for individual users due to their personal device and platform-related habits. The study thus illustrates both the interaction of device, contextual factors, and style, as well as the usefulness of complementary large-scale and smaller-scale corpus analysis.

Hutchby, I. (2001). Technologies, texts and affordances. Sociology, 35(2), pp. 441-456.

Jucker, A.H. & Dürscheid, C. (2013). The linguistics of keyboard-to-screen communication. A new terminological framework. Linguistik Online, 56(6), pp. 39-64.

Nguyen, D., Doğruöz, A.S., Rosé, C.P. & de Jong, F. (2016). Computational sociolinguistics: A survey. Computational Linguistics, 42(3), pp. 537-593.