APCLC 2020: ASIA PACIFIC CORPUS LINGUISTICS CONFERENCE 2020
PROGRAM FOR TUESDAY, FEBRUARY 11TH

09:00-10:40 Session 4A
Location: Choi Young Hall
09:00
Discourse functions and semantic characteristics of high-frequency phrase frames: A learner corpus study

ABSTRACT. Much research to date on formulaic language has focused on n-grams: continuous multiword sequences of length n. However, more recently, discontinuous multiword sequences, or 'phrase frames', have come into focus (Renouf & Sinclair, 1991; Römer, 2010; Gray & Biber, 2015). Phrase frames, or simply 'frames', are n-grams with an internal variable slot such as "the * of the" or "in the * of". The asterisk represents a variable slot occupied by a 'filler' such as end or beginning. The present study attempts to elucidate the discourse functions and semantic characteristics of highly frequent four- or five-word frames (i.e., 4-frames or 5-frames) built around the 3-frame "the * of" with an optional preposition preceding and/or "the" following the frame: (preposition) the * of (the).
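As a rough illustration of how frames relate to n-grams, the sketch below derives 4-frames from a token list by replacing each internal position of a 4-gram with an asterisk and tallying the fillers. The sample sentence and the function names are invented for illustration and are not taken from the study.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def phrase_frames(tokens, n=4):
    """Map each frame (an n-gram with one internal slot) to its fillers."""
    frames = defaultdict(Counter)
    for gram in ngrams(tokens, n):
        # Only internal positions can be variable slots (not first or last).
        for slot in range(1, n - 1):
            frame = gram[:slot] + ("*",) + gram[slot + 1:]
            frames[frame][gram[slot]] += 1
    return frames


# Hypothetical sample text, not learner data.
tokens = "at the end of the day and at the beginning of the year".split()
frames = phrase_frames(tokens, n=4)

for frame in [("the", "*", "of", "the"), ("at", "the", "*", "of")]:
    print(" ".join(frame), dict(frames[frame]))
```

Each frame's filler counts can then feed a later step, such as comparing fillers semantically or coding the frame for discourse function.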

Target frames were extracted from the Japanese and Spanish sub-corpora of the International Corpus of Learner English (ICLE) and the LOCNESS corpus (Granger et al., 2009). The semantic relationship between fillers of frames was investigated using network analysis based on Wu-Palmer semantic similarity scores generated via the WordNet Lexical Database (Miller, 1995). Biber et al.'s (2004) functional taxonomy was used to code frames for discourse function: referential, stance, and text organizing. Ultimately, 3,602 frames were analyzed. The probability of a given discourse function was explored via a multinomial logistic regression with the following predictor variables: L1, proficiency, essay topic, and preposition (or lack thereof). The final logistic regression model suggests that, controlling for other predictors, topic and preposition better predict discourse function than L1 or proficiency.
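For readers unfamiliar with the Wu-Palmer measure, the sketch below computes it on a tiny hand-made taxonomy rather than on WordNet. The score is twice the depth of the deepest shared ancestor divided by the sum of the two nodes' depths, with the root counted as depth 1. The taxonomy and its node names are hypothetical stand-ins; the study itself used WordNet.

```python
# Hypothetical toy taxonomy: child -> parent (the root has parent None).
PARENT = {
    "entity": None,
    "point": "entity",
    "end": "point",
    "beginning": "point",
    "part": "entity",
    "middle": "part",
}


def path_to_root(node, parent):
    """Return the chain of nodes from `node` up to the root."""
    path = [node]
    while parent[node] is not None:
        node = parent[node]
        path.append(node)
    return path


def depth(node, parent):
    """Depth of a node, counting the root as depth 1."""
    return len(path_to_root(node, parent))


def wu_palmer(a, b, parent):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_b = set(path_to_root(b, parent))
    # Walking upward from a, the first shared node is the deepest one.
    lcs = next(n for n in path_to_root(a, parent) if n in ancestors_b)
    return 2 * depth(lcs, parent) / (depth(a, parent) + depth(b, parent))


print(round(wu_palmer("end", "beginning", PARENT), 3))
print(round(wu_palmer("end", "middle", PARENT), 3))
```

Pairwise scores of this kind can serve as edge weights in a filler network, which is one plausible reading of the network analysis mentioned above.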

09:25
Acquisition of the Chinese indefinite determiner “one + classifier” and English articles in two-way learner corpora

ABSTRACT. We present findings concerning L2 use of classifiers and articles in the TUFS Learners’ Corpus of Chinese and the TUFS Learners’ Corpus of English, an error-tagged two-way learner corpus containing intermediate and advanced learners’ written production. First, English L1 learners of Chinese overuse the “one + classifier” construction for indefinite reference, analogous to English indefinite articles, whereas Japanese L1 learners show underuse of this construction, despite Chinese and Japanese both being “classifier languages”. Analysis of the I-JAS corpus reveals that Chinese L1 learners of Japanese use the “one + classifier” construction more frequently than native speakers. This is argued to stem from overgeneralization of “one + classifier” in Chinese. The second study examines Chinese L1 and Japanese L1 learners’ use of English articles in a translation task. Chinese L1 learners show higher levels of Target-Like Use (TLU), particularly for the definite article. This may be related to more extensive use of demonstratives (zhè and nà) in Chinese to mark definiteness. The two studies provide partial evidence for the effect of L1 on acquisition of L2 Chinese and English, namely that the presence of salient functional equivalents to an L2 target form in L1 can lead to positive transfer and greater mastery of the target form. This can supersede ostensible typological similarities, for example the classification of both Chinese and Japanese as classifier languages.

09:50
L2 English Article Use by Chinese and Japanese Native Speakers in a Learner Corpus

ABSTRACT. This study provides quantitative and qualitative analysis of the L2 use of English articles by native speakers of Chinese and Japanese, two languages lacking an article system. Production in a translation task is compared in order to ascertain trends of use and misuse, and whether such trends can partially be explained by characteristics of learners’ L1. The following differences are observed. First, Japanese L1 learners use the indefinite article more frequently than the definite article, whereas the reverse is true for Chinese L1 learners. Second, Target-Like Use (TLU) of both articles is higher for Chinese L1 learners than for Japanese L1 learners, suggesting meaning transfer through partial functional similarities between the Chinese classifier and English article systems. Within each group, TLU is higher for the indefinite article than for the definite article. Japanese L1 learners appear to have particular difficulty using the definite article appropriately, suggesting that lack of an equivalent L1 form inhibits acquisition. Third, error typology is broadly uniform; in both groups, errors of omission and errors of incorrect article choice account for over 60% and over 20% of errors respectively, a reflection of L1 typology. Nonetheless, qualitative analysis reveals a number of distinct error trends, notably Japanese L1 learners’ erroneous marking of definites with the indefinite article, and Chinese L1 learners’ erroneous marking of non-referentials with the definite article, another potential instance of meaning transfer. This study adds to growing evidence that finer-grained analyses are required when examining L2 English article use by speakers of “article-less languages”.

10:15
A corpus analysis on pre-service English teachers' morphology in use

ABSTRACT. The present study compiled a corpus of pre-service Korean EFL teachers (KTC) and investigated the teachers’ actual knowledge of using morphologically complex words. The two reference corpora were a corpus of native English-speaking pre-service teachers (ETC) and the Corpus of Contemporary American English (COCA). AntConc word list analysis showed that education-specific terms were used more in the teacher corpora (KTC and ETC) than in the general corpus (COCA). Filter analysis in Excel showed that the most frequent suffix in KTC was -ly and the most suffixed word with the suffix was an adverb only. Contrastive cluster and concordance line analyses revealed that Korean pre-service EFL teachers were less knowledgeable than native English-speaking teachers and general speakers in using the suffix -ly in varied and accurate contexts. Implications for teacher training and ideas for future research are discussed.

09:00-10:40 Session 4B
09:00
An NLP analysis of VP-ellipsis and Gapping in EFL input in Korea

ABSTRACT. While VP-ellipsis (e.g., Mom sat on the sofa and dad did [e] too.) and Gapping (e.g., Mom sat on the sofa and dad [e] on the chair.) have received some attention in L2 research<1–4>, their frequency in L2 input remains unknown. The present study addresses this gap by investigating the incidence of VP-ellipsis and Gapping in the EFL input to which elementary-to-high school students in Korea are exposed. Four types of such EFL input made up the corpus, resulting in 44,646 utterances (207,288 words) for analysis: L1-Korean EFL teacher speech; L1-English EFL teacher speech; written input from EFL textbooks<5–9>; spoken input from EFL textbooks<5–9>. First, the data were parsed with spaCy<10> and Benepar<11> in Python. Next, for VP-ellipsis, all sentences containing any modal/auxiliary verb--which potentially delimits this construction--were extracted; for Gapping, all sentences manifesting fewer thematic verbs than the number of clauses were extracted. Lastly, each extracted sentence was manually checked for instances of VP-ellipsis and Gapping. Results reveal that Gapping occurred only twice (0.004%), both in a coordinate clause; by contrast, there were 601 instances of VP-ellipsis (1.346%), most (561 utterances) either questions or responses following the (immediately) preceding utterance. Notably, VP-ellipsis never appeared in a conjunct clause, and it appeared only once (0.002%) in an adjunct clause. We discuss whether such input alone could lead Korean EFL learners in Korea to learn subtle contrasts between the two constructions--e.g., VP-ellipsis is possible in both conjunct clauses and adjunct clauses, but Gapping is possible only in conjunct clauses.
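The reported percentages appear to be per-utterance rates over the 44,646 utterances in the corpus. The short sketch below recomputes them under that assumption; the dictionary labels are informal names chosen here, not categories from the study.

```python
TOTAL_UTTERANCES = 44_646  # corpus size given in the abstract

# Raw counts reported in the abstract (labels are informal).
counts = {
    "VP-ellipsis": 601,
    "Gapping": 2,
    "VP-ellipsis in an adjunct clause": 1,
}


def rate_percent(count, total):
    """Share of utterances as a percentage, rounded to three decimals."""
    return round(100 * count / total, 3)


rates = {label: rate_percent(n, TOTAL_UTTERANCES) for label, n in counts.items()}
for label, rate in rates.items():
    print(f"{label}: {rate}%")
```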

09:25
Using word embeddings to explore a corpus of Australian Aboriginal life writing as a historical source and a literary genre

ABSTRACT. In the 1960s, almost 200 years after Australia was invaded by Europeans, Australian Aboriginal people turned to life writing to share an alternative history using the oppressors' language, English. As they inevitably see the world through the prism of their experiences of racism, oppression, exploitation and injustice, the genre of Aboriginal life writing is defined by its content and the language use challenging European perceptions, concepts, and values.

This submission aims at expanding research on the use of word embeddings in literary and historical studies and will present the preliminary results of an ongoing PhD project applying word embeddings to analyse discursive spaces of a corpus of digitised Australian Aboriginal autobiographies published from the 1960s to the 2010s. In this context, we have selected a number of keywords related to the dominant themes in Australian Aboriginal literature and examined the word embeddings nearest to these keywords to identify the vocabulary of corresponding discourses. We have compared word embeddings trained on the corpus in question with embeddings trained on various reference corpora as well as based on the decade of publication. Finally, we have explored embeddings of words from Aboriginal languages.

The preliminary results of the experiments suggest that word embeddings trained on the corpus of autobiographies reflect the historical, political, and cultural environment of the authors, their experiences and perspectives, and the change of the use of concepts over time.

We are currently extending the study to include the analysis of embeddings trained with different algorithms, and to add the gender dimension.
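Examining the embeddings "nearest" to a keyword usually means ranking other words by cosine similarity to the keyword's vector. The sketch below shows that ranking on invented three-dimensional vectors; the words and values are hypothetical and bear no relation to the project's trained embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(word, vectors, k=2):
    """Return the k words whose vectors are most similar to `word`."""
    scores = [
        (other, cosine(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Invented toy vectors, for illustration only.
vectors = {
    "country": [0.9, 0.1, 0.2],
    "land": [0.8, 0.2, 0.3],
    "home": [0.6, 0.5, 0.3],
    "mission": [0.1, 0.9, 0.4],
}

for neighbour, score in nearest("country", vectors):
    print(neighbour, round(score, 3))
```

Running the same query against embeddings trained on different corpora or decades, and comparing the neighbour lists, is one simple way to carry out the kind of comparison described above.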

09:50
A Corpus Linguistics Approach with Artificial Intelligence on Large-Scale Data towards a Novel Theory of Support Constructs in Social Media Discourse

ABSTRACT. The study aims to critically inform the way people directly affected by various issues and conditions can support each other on social media in the domain of chronic illness. A significant challenge is to identify what online ‘support’ entails, and how to analyse large-scale online data in social media support discourse for constructs of support. The study uses one of the most influential theories of language, Halliday’s (1985) systemic functional linguistics (SFL), and Martin and White’s (2005) related Attitude appraisal framework. It makes a methodological offer with new ethical considerations in dealing with personal data. It seeks to apply a combination of corpus linguistics and artificial intelligence (AI) computational analyses (Blei et al., 2003) to timely, online-generated large-scale data: over 200,000 anonymised Facebook Diabetes UK posts and 16,137 diabetes-related users. The AI analyses include latent Dirichlet allocation (LDA) topic modelling, automated content analysis and annotation for entity recognition. The implications of the new theory are aimed at healthcare communicators, who can work with organisations to help their social media users support each other by understanding a peer-focused view of chronic illness support. Corpus linguistics may benefit from the use of combined AI and DA approaches to anonymised large-scale online data. The study also offers preliminary work for AI support-bots to be programmed to use these language patterns to automatically support people who need it. Such bots may be able to converse instantaneously with many people while doing so in natural ways.

10:15
NLP-assisted automatic extraction of English passive constructions from learner writing

ABSTRACT. This study investigates the validity of an automatic tool for extracting L2 production of constructions in a learner corpus by way of Natural Language Processing (NLP). English passives, the focus of our study, have been identified as a particular construction type presenting learning difficulties for L2 learners (Ju, 2000; MacWhinney, 1987). Evidence shows that learners’ ability to produce passives is closely associated with L2 proficiency. Although researchers agree upon the need to measure L2 production of passives for a precise assessment of L2 proficiency, it requires an enormous amount of time and effort to manually track down learners’ production of passives. To address this issue, we conducted an NLP-assisted automatic extraction of English passives from learner writing. A Python programme was developed by using spaCy (Honnibal & Johnson, 2015; https://spacy.io/) as a platform for this task. We first identified the passives from the Yonsei English Learner Corpus through manual coding, and then submitted them to the automatic pattern-finding process to see how accurately the programme detected the intended passive pattern. This pattern-finding process, as an initial attempt, considered information about dependency relations crucial for the passives (nsubjpass; auxpass). Our results showed that the programme successfully detected 414 cases out of the 417 instances of passives, with an accuracy of 99.28%. Our tool is expected to support researchers in identifying L2 production of passives accurately with less effort. We plan to expand the application of this programme to other construction types by adding algorithms for target constructions and model training.
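A minimal sketch of the dependency-based idea follows. It is not the authors' programme: to stay self-contained it works on hand-written (token, dependency label) pairs instead of calling a parser, and it assumes a sentence counts as passive if any token carries the label nsubjpass or auxpass. The example sentences and the function name are hypothetical.

```python
# Dependency labels named in the abstract as crucial for passives.
PASSIVE_DEPS = {"nsubjpass", "auxpass"}


def is_passive(parsed_sentence):
    """Flag a sentence if any token bears a passive dependency label."""
    return any(dep in PASSIVE_DEPS for _, dep in parsed_sentence)


# Hand-annotated toy parses (assumed labels, not parser output).
sentences = [
    [("The", "det"), ("book", "nsubjpass"), ("was", "auxpass"),
     ("written", "ROOT"), ("by", "agent"), ("her", "pobj")],
    [("She", "nsubj"), ("wrote", "ROOT"), ("the", "det"), ("book", "dobj")],
]
flags = [is_passive(sentence) for sentence in sentences]
print(flags)

# Detection rate reported in the abstract: 414 of 417 passives found.
detection_rate = round(100 * 414 / 417, 2)
print(detection_rate)
```

In a real pipeline the label pairs would come from a dependency parser, and the flagged sentences would be compared against manual coding, as the abstract describes.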

09:00-10:40 Session 4C
Location: IBK Hall
09:00
Expressions of Necessity/Obligation in a Curriculum-Based EFL Textbook in Indonesia: A Corpus-Based Analysis

ABSTRACT. A well-designed textbook can equip learners with sufficient and relevant materials that correspond to actual language use. As non-native English users, Indonesian ELT practitioners can make use of existing English corpora to inform themselves about real language use when designing and developing EFL textbooks. In practice, however, this benefit is still ignored or neglected among EFL practitioners in Indonesia. This study therefore investigates a curriculum-based EFL textbook in Indonesia, focusing on expressions of necessity/obligation modality in the conversation sections. The corpus consulted in this study is the spoken section of the Corpus of Contemporary American English (COCA), comprising 118 million words. The analysis focuses on the frequency of necessity/obligation modal verbs in the textbook and the corpus. The results show that there is a significant difference in the use of necessity/obligation between the textbook and the corpus. In the textbook, necessity/obligation expressions outnumber other modality expressions (e.g., prediction, volition). By contrast, necessity/obligation modal verbs are the least frequent modality in the spoken section of COCA. Moreover, core modals expressing necessity/obligation (e.g., must, should) dominate the conversation sections of the textbook, while quasi-modals (e.g., have to) are more frequent in the spoken section of COCA. In terms of variants, the textbook presents a small number of modal verbs expressing necessity/obligation (e.g., must, should, have to, be supposed to). Pedagogically speaking, it is highly recommended that ELT practitioners in Indonesia design and develop EFL textbooks based on corpus investigation to enhance textbook quality.

09:25
The Development of Prepositional Phrases in the Writings of L2 Learners: An Annotated Corpus-based Study

ABSTRACT. Prepositional phrases (PPs) rank among the most frequently occurring grammatical structures in English (Biber et al., 1999). PPs serve important syntactic functions in that they contribute to the construction of “structurally elaborate and informatively dense clauses” (Brandes & Ravid, 2017:84). As such, they are considered to be an essential aspect of mature writing in English, especially of academic writing (Biber et al., 2011). Despite their significance, few studies have attempted to examine the development of English PPs in L2 writing. To that end, this study examines (1) the use of prepositions heading PPs; and (2) the development of different syntactic types of PPs in 410 narrative written texts produced by young high school EFL learners across four writing proficiency levels. A dependency-annotated corpus is used as an instrument to extract PPs and identify their grammatical functions. Analyses of the annotated corpus yielded the following two main findings. As for the types of prepositions heading PPs, regardless of writing proficiency, young high school students mainly use a restricted range of prepositions, including “at”, “on”, “with”, “to” and “from”, possibly due to the high input frequency of these prepositions and their multiple syntactic and semantic functions. Furthermore, significant differences are found in the distribution of different syntactic types of PPs among the four levels. Students of higher writing proficiency are more apt to use PPs as verbal complements and noun modifiers and less likely to employ them as adverbials. These findings have important implications for language teaching, testing and materials design.

09:50
Comparing L1-L2 differences in lexical bundles in student and expert writing

ABSTRACT. Numerous studies have compared the use of lexical bundles in L1 and L2 academic writing or in student and expert writing. However, the results of these studies are mixed due to differences in the control of potentially confounding variables (e.g., discipline, the level of expertise). It is still unclear whether L1 background or the level of expertise (i.e., student vs. expert) accounts for the differences in the use of lexical bundles. To clarify this issue, the present study compared L1-L2 differences in the use of lexical bundles in master's theses and research articles while controlling for discipline (i.e., applied linguistics), the level of expertise and research paradigm (i.e., quantitative texts). The study shows that L2-English academic writers employ more bundle types and tokens than L1-English academic writers regardless of levels of expertise. Structurally, both L1 and L2 academic writers use proportionally more phrasal bundles as their levels of expertise increase. Functionally, L1 academic writers use proportionally more participant-oriented bundles than L2 academic writers regardless of levels of expertise. Our findings also indicate that both L1 background and the level of expertise affect the structural differences in lexical bundles. In addition, L1 background matters to the functional differences in lexical bundles. The potential pedagogical implications for teaching English for academic writing will also be discussed.

10:15
Dynamism of collocation in L2 English writing: A bigram-based study

ABSTRACT. Here we present a longitudinal corpus-based study of collocation, highlighting bigram development in L2 writing over a five-month span. Our analysis is primarily based on the learner corpus, a collection of 570 untimed L2 texts, which is benchmarked against a reference corpus, the BNC. We applied two measures, the mean MI and the average t, computed from the learner corpus, to quantify both overall performance and individual variation. Interviews with three learners were conducted to reveal factors contributing to the dynamism of bigram development. Our study found that, across the five times of writing, (1) learners’ command of bigrams, those of high quality in particular, is fairly unstable; (2) inter- and intra-individual variations account for the individual differences in managing bigrams over time; and (3) learners’ perception of collocation and their allocation of resources add to the dynamism of bigram development. Our study implies that teaching collocation from a dynamic perspective deserves increased attention; it is advisable to consider individual differences when allocating resources in teaching collocation.
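The two association measures can be sketched as follows, assuming the standard corpus-linguistic definitions: with observed bigram frequency O and expected frequency E = f1 × f2 / N, MI = log2(O / E) and t = (O − E) / √O. The reference frequencies below are invented for illustration and are not BNC counts, and the abstract does not specify how the authors computed the scores.

```python
import math


def expected(f1, f2, n):
    """Expected bigram frequency if the two words were independent."""
    return f1 * f2 / n


def mi_score(observed, f1, f2, n):
    """Mutual information: log2 of observed over expected frequency."""
    return math.log2(observed / expected(f1, f2, n))


def t_score(observed, f1, f2, n):
    """t-score: (observed - expected) / sqrt(observed)."""
    return (observed - expected(f1, f2, n)) / math.sqrt(observed)


# Hypothetical reference data: bigram -> (bigram freq, word1 freq, word2 freq).
REFERENCE_SIZE = 1_000_000
REFERENCE = {
    ("strong", "tea"): (50, 1000, 500),
    ("make", "decision"): (200, 4000, 1000),
}


def text_profile(bigrams):
    """Mean MI and mean t over the bigrams attested in the reference data."""
    scored = [REFERENCE[b] for b in bigrams if b in REFERENCE]
    if not scored:
        return None
    mean_mi = sum(mi_score(o, f1, f2, REFERENCE_SIZE) for o, f1, f2 in scored) / len(scored)
    mean_t = sum(t_score(o, f1, f2, REFERENCE_SIZE) for o, f1, f2 in scored) / len(scored)
    return mean_mi, mean_t


profile = text_profile([("strong", "tea"), ("make", "decision"), ("blue", "idea")])
print(profile)
```

Computing such a profile for each text at each writing time would give the per-learner trajectories that a longitudinal comparison needs.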

09:00-10:40 Session 4D
09:00
Some Features Of Clickbait Discourse

ABSTRACT. Clickbait: the word appeared quite recently, but it is heard more and more often these days. It seems that everyone has at least once seen a clickbait headline on the Internet, and perhaps even clicked on it. Looking at such headlines, I wondered whether there is something that makes clickbait distinctive from a discourse point of view. Previous research has shown that the “forward-reference technique” is one of the strategies used in clickbait headlines (Blom & Hansen, 2015) and that clickbait headlines follow certain “headline construction formulas” (Agrici, Alves, Antunes, Sousa, & Ramos, 2016). In my research, however, I decided to use a corpus-based approach to find out whether the language of clickbait headlines has other features. For the purposes of this study, I compiled a corpus of clickbait headlines from two websites: BuzzFeed and Clickhole. BuzzFeed was one of the first websites on the Internet to use this unusual style for its headlines. Clickhole was created later as a parody of BuzzFeed and other clickbait websites. The content from these websites provided very diverse and up-to-date data for the analysis and showed different approaches to writing clickbait headlines. As a result, some other interesting features of clickbait were identified, e.g. a high frequency of 2nd person pronouns and imperative forms, and the use of evaluative adjectives at the beginning of a sentence, among others.

09:25
“We” in Hard Science Articles: Disciplinary Distributions, Co-Selection Patterns and Discourse Functions

ABSTRACT. Traditional theory holds that academic writing in hard sciences should adopt an impersonal writing style and avoid the use of first person pronouns (Lester 1993; Smith 1996; Spencer & Arbon 1996). However, recent corpus-based studies have shown that hard science articles are no longer written in a completely impersonal style but show a preference for personal involvement (Hyland 2012; Hyland & Jiang 2017; Gillaerts & Velde 2010). This poses a challenge to the traditional writing style in hard disciplines and raises questions for academic writers, especially novice and non-native English writers: whether they should use first person pronouns, how frequently they can use them, and how to use them appropriately. Among all the expressions of personal involvement, the first person “we” is used far more frequently than any other type and has the most complicated semantic references. Based on corpus data, this paper describes the characteristics of “we” in hard science articles with regard to its semantic references, co-selection patterns and discourse functions. Findings suggest that “we” has been extensively used in hard science articles either to construct salient authorial identity or to interact with readers and peer researchers; in particular, it has four kinds of semantic references: author, author-reader, whole discipline, and general people. This paper also identifies the primary academic discourse function that each “we” performs by collocating with different types of verbs and regroups “we”-sentences into different functional categories under each semantic category of “we”.

09:50
POPULARIZATION OF SCIENCE THROUGH ‘VISIBLE SCIENTISTS’: A CORPUS-INFORMED STUDY OF DISCURSIVE CONSTRUCTION AND REALIZATION OF NOBEL LAUREATES’ SCIENTIFIC ACHIEVEMENTS

ABSTRACT. Objectives The research investigates how a writer’s control of themes and linguistic features in popular science discourse may facilitate promotion of science as a commodity, using Nobel Prize popular science discourse as a case study.

Methods Corpus-driven analysis using WMatrix3 was conducted to cover keyness comparison at lexical, grammatical and semantic levels between the two specialized popular science corpora of ‘visible scientists’, namely the biographical texts of Nobel Laureates in Chemistry and the ‘popular information – information for the public’ texts from 2000 to 2017 (2000 marked the first year of publication of ‘popular information’), and a reference corpus of general writings, the Corpus of American English 2006.

Results Nobel Prize discourse highlighted visible scientists through action (‘change’), ability (‘success’) and newness. The use of hard facts to justify the writer’s claims follows the rational appeal of Aristotle’s theory of persuasive communication. Matching linguistic features like affirmative tones were deployed to create an authoritative positioning of their discoveries. General writings, on the contrary, placed more emphasis on sensory details and feelings to appeal to lay readers. Adding a personal touch could be a strategy to engage lay readers with popular science articles.

Conclusion The comparison facilitated an investigation of how linguistic features may contribute to the branding of ‘visible scientists’ to increase the impact of natural science on society. In addition, an investigation of linguistic features deployed relatively less frequently in Nobel Prize popular science articles, as compared to general writings, might also inspire writers on ways to further enhance writer-reader communication.
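Keyness comparisons of the kind described in the Methods are often based on the log-likelihood statistic, which compares an item's frequency in a study corpus with its frequency in a reference corpus. The sketch below assumes that statistic; the abstract does not say which measure was used, and the frequencies shown are invented.

```python
import math


def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood keyness for one item across two corpora."""
    total_freq = freq_study + freq_ref
    total_size = size_study + size_ref
    # Expected frequencies if the item were equally common in both corpora.
    expected_study = size_study * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    ll = 0.0
    for observed, exp in ((freq_study, expected_study), (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / exp)
    return 2 * ll


# Invented counts: an item seen 120 times in the study corpus and
# 40 times in the reference corpus, each of 100,000 words.
score = log_likelihood(120, 100_000, 40, 100_000)
print(round(score, 2))
```

Higher scores indicate a larger frequency difference between the corpora; the direction of the difference has to be read from the relative frequencies themselves.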

10:15
The Importance of Multimodal Corpora in the Study of Pragmatics: A Case Study of Fictive Apologies

ABSTRACT. Traditionally, the study of pragmatics implied using data collected through controlled means, such as discourse completion tests, role-plays, interviews, and written questionnaires (Cohen and Olshtain 1994). The use of corpora brought real-life examples, which provided a much better understanding of speech acts (Tomasello, 2000; Félix-Brasdefer & Bardovi-Harlig, 2010; Author, 2011). Nonetheless, even these studies relied only on the linguistic manifestation of the speech acts as they were using corpora of written discourse or at most syntactically tagged transcripts of spoken discourse. Recent developments have shown, however, that non-linguistic features of communication such as gestures, gaze, intonation, and visual clues contribute to the meaning of speech acts (Alibali, Kita, & Young, 2000; Pascual, 2014; Author, 2016). The aim of this presentation is to show that multimodal corpora containing such non-linguistic information are crucial to the analysis of speech acts. The presentation will use the case study of fictive apologies to show how the availability of information about gestures, changes in pitch and intonation, as well as other multimodal features of communication can completely change the meaning of the same linguistic expression analyzed in a monomodal corpus. The presenter will discuss examples of fictive apologies occurring in the UCLA NewsScape Archive of television and video news programs facilitated by the Distributed Little Red Hen Lab, co-directed by Francis Steen and Mark Turner. In conclusion, the presentation suggests that multimodal corpora add an important dimension to the analysis of fictive apologies in particular as well as the study of pragmatics in general.

09:00-10:40 Session 4E
Location: Helinox Hall
09:00
Trends of corpus-based research on English linguistics in Korea: A review of ‘Korean Journal of English Language and Linguistics’

ABSTRACT. This study surveys corpus linguistic research trends in Korea, in particular in the field of English linguistics. The aims of the study are to examine the research themes and methodologies of corpus-related studies published in ‘Korean Journal of English Language and Linguistics’ and to propose directions and prospects for future corpus-based research in English linguistics. For the purposes of the study, corpus-related research articles published in ‘Korean Journal of English Language and Linguistics’ will be analyzed in terms of research topics, methodology, and characteristics of the corpora used in the studies. ‘Korean Journal of English Language and Linguistics’ was selected for analysis as it is a representative Korean journal with a specialized scope in the research domain of English linguistics. The journal has been issued four times a year since 2001, and its scope includes English history, grammar, English education, phonetics, phonology, syntax, semantics, pragmatics, and discourse analysis. The publisher of ‘Korean Journal of English Language and Linguistics’ is the Korean Association for the Study of English Language and Linguistics (KASELL), established in 2000. Corpus studies from ‘Korean Journal of English Language and Linguistics’ can show the distinctive research trends of English linguistics in Korea and reflect research progress in corpus linguistics over the past two decades. The results of the analysis will reveal tendencies in the research topics and methods adopted in the corpus studies published in the journal and provide implications for corpus-based research in English linguistics.

09:20
Analysis of Corpus Linguistics Studies in English Teaching

ABSTRACT. This presentation will analyze the trends of corpus linguistics studies published in English Teaching, one of the most eminent applied linguistics journals in South Korea. The first issue appeared in 1965, and the journal has since been published four times a year. In order to investigate the trends of corpus linguistics research in this journal, a corpus was built from all research papers in English Teaching from 1965 to 2019. Among the papers, studies relevant to corpus linguistics were selected to examine their research topics and methods. The papers were sub-categorized by time period. The presentation will introduce the characteristics of corpus linguistics studies in each five-year period, and will show the similarities and differences in corpus linguistics research trends over 55 years.

09:40
An analysis of research trends of the corpus-based studies in ‘Secondary English Education’

ABSTRACT. Many researchers in Korea have conducted studies on corpus linguistics and corpus-based approaches. These individual studies are required, but it is also necessary to systematically categorize these studies and analyze research trends to assess the overall corpus-based research, get educational insight and implications, and open a new door to future corpus-based approaches. Therefore, some researchers have tried to examine the trends of corpus-based research. However, there are not many studies analyzing the trends of corpus-based research related to English education. In particular, there is little research on the subject published in more specific areas of secondary English education. To identify the trends of corpus-based research related to secondary English education, this study is mainly intended to review and investigate the research themes and foci of corpus-based studies published in ‘Secondary English Education’. For this purpose, corpus-based research published in ‘Secondary English Education’ will be reviewed and explored from various perspectives, including research topics and methodologies. This study will shed meaningful light on the future directions of corpus-based research.

10:00
An analytical study on corpus-related research projects published in ‘Primary English Education’

ABSTRACT. The present study investigates and analyses the corpus-related research projects published in Primary English Education in an effort to identify the characteristics of the studies by time period and their recent trends. The main purpose of this study is to draw educational implications and future directions for corpus-related research, particularly on primary English education in Korea. For this purpose, a total of fourteen research articles published in Primary English Education containing ‘corpus’ among the keywords were selected as subjects of analysis. The analysis focused mainly on the quantity of studies, the diversity of research topics and writers, the types of corpora used in the studies, the methodology, and the purposes for which corpora were applied. The results firstly showed that there were relatively few studies using corpus data or corpus-based analysis in this domain. In addition, the diversity of writers and research topics was low relative to the number of studies. Moreover, the corpora used in the studies can be classified into two broad types by source: teaching and learning materials for English education, and learner corpora. Lastly, the methodological analysis of the studies revealed that the major purposes of utilizing corpora were developing English learning materials and improving Korean elementary school students’ English vocabulary or grammar learning. Educational implications and future research recommendations for corpus-related research projects, especially in the field of Korean primary English education, are discussed.

10:20
Trends of corpus related studies in 'Korean Journal of Applied Linguistics'

ABSTRACT. A corpus is a collection of naturally occurring spoken and/or written texts which is electronically searchable. There has been a growing body of literature that recognises the importance of corpus-related studies in Korean academic societies. This study attempts to analyse corpus-related studies that have been published in the Korean Journal of Applied Linguistics (KJAL), which is one of the most prestigious academic journals in South Korea, totalling 175 research articles from 1983 to 2019. In this talk, I will discuss characteristics of corpus-related studies, specifically in the KJAL, focusing on topics, methodologies, and results. I will also propose future directions for corpus-related studies based on the findings from the KJAL. This talk will be of interest not only to corpus researchers who study English language corpora, but also to corpus researchers who study languages other than English.

10:50-12:30 Session 5A
Location: Choi Young Hall
10:50
Metaphorical usage with the expression ‘love’ in Learner Corpus

ABSTRACT. Lakoff and Johnson (1980) propose a conceptual system that indicates what and how we think and express metaphorically in our daily speech. Conceptual metaphors refer to understanding an abstract domain, which is generally hard to describe, by means of another concrete and clear domain (Kövecses, 2010). In addition to the theoretical approach to metaphorical expressions in human language, recent work in natural language processing has paid attention to identifying metaphorical usage of language expressions. Since a number of metaphorical expressions are interrelated with cultural differences (Deignan, 2003), different cultures differ in their attitudes towards metaphors. If this hypothesis is valid, such a difference can be detected when exploring learner corpora. Thus, the present study set out to examine whether Korean L2 learners know about these conceptual meanings and, if they do, which one is the most preferred, and whether there are any cultural differences in the use of the metaphors. To compare usage between native English speakers and Korean L2 learners, metaphorical expressions with ‘love’ were collected from two different corpora: COCA as an L1 corpus and the Gachon Learners Corpus as an L2 corpus. This research then adopts a corpus-based method called Metaphorical Pattern Analysis, suggested by Stefanowitsch (2006). In this method, there are two domains: the target domain (TD) and the source domain (SD). As mentioned above, the target domain is normally an abstract concept, and the source domain is a more concrete one.

11:15
The Influence of the Learners’ Environment on the Characteristics of Learner Corpora: A Comparison of Corpus from Cadets and University Students

ABSTRACT. The purpose of this study was to examine whether the characteristics of learner corpora are affected by the learner’s environment. For this purpose, corpora constructed from English diary entries were collected from 73 cadets and 75 university students over 3.5 months and analyzed using WordSmith 6.0. The study revealed that: (1) The environmental differences of learners had a significant effect on the highest-frequency vocabulary used by cadets and college students; (2) The environmental differences of learners did not seem to greatly affect the types of errors committed by cadets and college students. That is, cadets and college students showed overall similar error patterns; and (3) Cadets and college students, however, showed different patterns of error in terms of using Korean words in their English writing. Lastly, the results of the study suggest the following educational implications: (1) The EFL learners’ error patterns found in their corpus should be carefully incorporated in language teaching and learning; and (2) More ESP textbooks and curricula, based on the ESP learners’ corpus, should be developed and applied with caution. Further implications and issues will be discussed.

11:40
The Use of Native and Learner Corpora in Enhancing EFL Textbook Treatment: The case of modal verbs

ABSTRACT. This study aims to investigate how native and learner corpora can be combined to enhance modal verb treatment in EFL textbooks in mainland China. The linguistic focus is will, would, can, could, may, might, shall, should and must. The native corpus is the spoken component of BNC2014 (BNCS2014) (Love et al., 2017). The standard query option of CQPweb was used to sample 5% of each of the nine modals from BNCS2014. The learner corpus is the “secondary school” component of the Ten-thousand English Compositions of Chinese Learners (Xue, 2015). The textbook corpus contains a series of five secondary coursebooks. All data in both the learner and textbook corpora were retrieved through the concordance functions of WordSmith Tools (Scott, 2008). The textbook corpus was compared with BNCS2014 regarding distributional features, semantic functions, and co-occurring constructions to assess the authenticity of textbook language. The learner corpus was analyzed in terms of use (distributional features, semantic functions, and co-occurring constructions) and misuse (syntactic errors) to uncover potential difficulties. The results indicate discrepancies between textbook presentation of modal verbs and authentic modal use in natural discourse in terms of frequency distributions, semantic functions and co-occurring structures. In addition, error analysis revealed that must, may, could, should and would are the most difficult for Chinese learners and that both inter-linguistic and intra-linguistic interference is behind the difficulties. The findings point to the need to adjust the textbooks based on the authentic modal patterns found in BNCS2014 and the learner difficulties identified in the learner corpus.

12:05
A study on the usage and error patterns of ‘-ge doeda’ expression through learner corpus analysis.

ABSTRACT. This study aims to present implications for teaching ‘-ge doeda’ effectively to foreign learners of Korean through an analysis of errors in usage. In order to collect the error data and study the error patterns of ‘-ge doeda’, the Korean learners’ corpus of the National Institute of the Korean Language was analyzed. The analysis is basically divided by the proficiency level at which this expression is used. First, the study examines how Korean textbooks and dictionaries present the target expression ‘-ge doeda’. The ‘-ge doeda’ construction is a high-frequency expression carrying three basic semantic functions: change-of-state, passive and politeness. ‘-ge doeda’ deserves treatment as a basic teaching expression for intermediate-level learners. The researcher investigates the error corpus to examine how many of the functions of ‘-ge doeda’ are reflected in it. Korean language learners who use the expression produce many grammatical and lexical errors. Based on the learners’ collocational competence, the researcher tries to classify the causes and patterns of errors generated by Korean language learners’ use of the ‘-ge doeda’ expression. The results of the error analysis suggest that learners should be more clearly aware of the combination of the preceding predicate and the situational context when using the ‘-ge doeda’ expression. This paper presents methods for assessing the expression based on fluency and accuracy evaluation criteria respectively. The study also attempts to suggest useful teaching material based on the analysis of students' repetitive errors.

10:50-12:30 Session 5B
10:50
Sequentially robust neural language representation model in an unsupervised way

ABSTRACT. This study attempts to show, using the BERT model, that a neural language representation model can work well even when the order is reversed in some part of the sentence, and that it can be trained in an unsupervised way. In principle, this model will be better trained if the order remains the same or similar, even when the order is reversed, but there is a limit to training this model in an unsupervised way, due to the lack of proper datasets. Therefore, in this study, after splitting Wikipedia data into pairs of sentences, the order of the first sentence was kept the same, while tokens of the second sentence were extracted with a certain probability (15%) and in a factorial order. In the training phase, two sentences which are in part shuffled were put into the BERT model. The [CLS] token generated by the BERT model, representing the meaning of the whole sentences, was set as the initial state, and two tokens at a time were extracted and fed to the recurrent neural network. A fully connected layer and a softmax layer follow. Through this, we implemented a model that predicts whether the order is arranged properly (1) or has been changed by the sentence generator (0), with one softmax output for every two tokens. After training, this model outperforms the original BERT base model on some GLUE tasks.

11:15
Development of written production for adult learners of Korean: Automatic measurement of text similarity through Word2Vec

ABSTRACT. The present study investigates to what degree learner writing is similar to (or distant from) native speakers’ writing by proficiency. For this purpose, we adopt topic modelling, a Natural Language Processing approach to detecting hidden topics from large volumes of text in an unsupervised manner. In particular, we calculate lexical/semantic similarity between learner writing and native speakers’ writing automatically. 36 Chinese-speaking learners and 10 native speakers of Korean were asked to write argumentative essays about two topics separately in 20 minutes. The learners were divided into three groups through proficiency measurement (cf. Lee-Ellis, 2009): high (n=12), intermediate (n=13), and low (n=11). All the essays were converted to electronic form, with misspellings and typos left uncorrected. Text similarity of each group’s essays with reference to native speakers’ essays was calculated by using a Word2Vec model through Gensim (Rehurek & Sojka, 2010). Since Word2Vec employs neural networks, which require a massive amount of training data for application, we used pre-trained embeddings (https://github.com/Kyubyong/wordvectors) for this task. Results showed that the quality of learner writing approximated that of native speakers’ writing as proficiency increased, which indicates development of L2 written production in light of lexical/semantic features. Subsequent research will measure the relation between these similarity scores and evaluation from human raters, supporting the relevance of automatic evaluation of learner writing by way of topic modelling.

Topic                                            NSK~high   NSK~intermediate   NSK~low
Topic 1: protection vs. exploitation of nature   0.762      0.743              0.708
Topic 2: cooperation vs. competition             0.764      0.737              0.696

(Note. NSK = native speakers of Korean)
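
The abstract does not state how the similarity scores were computed from the word vectors. As a minimal sketch of one common approach, the Python below averages the word vectors of each essay and takes the cosine similarity of the two averages. The embedding table TOY_EMBEDDINGS and the function names are hypothetical and stand in for the pre-trained vectors; this is an illustration under those assumptions, not the authors' procedure.

```python
import math

# Hypothetical toy embedding table standing in for pre-trained word vectors.
TOY_EMBEDDINGS = {
    "nature": [1.0, 0.0],
    "protect": [0.8, 0.2],
    "exploit": [0.1, 0.9],
}

def mean_vector(tokens, embeddings):
    """Average the vectors of all tokens that have an embedding."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("no token has an embedding")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def essay_similarity(tokens_a, tokens_b, embeddings):
    """Similarity of two tokenized essays via their averaged word vectors."""
    return cosine_similarity(
        mean_vector(tokens_a, embeddings),
        mean_vector(tokens_b, embeddings),
    )
```

Under this scheme, out-of-vocabulary tokens (for example, uncorrected misspellings) are simply skipped, which is one reason the choice of embedding table matters for learner data.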

11:40
Implementing an Automatic Linguistic Abstract Generator

ABSTRACT. The present talk addresses the implementation of an automatic abstract writing system via computational text generation. The system has been fed with approximately 50,000 abstracts extracted from the linguistics journals indexed by SCOPUS (dating back to 2014). The topics of this source include the full range of contemporary linguistics. We selected GPT-2, a pre-trained transformer-based language model, for our abstract generator. This model has demonstrated impressive efficacy on text generation. We focused on fine-tuning OpenAI's GPT-2 to generate linguistics paper abstracts, using the 50,000 abstracts as the training data. The abstract generator is based upon a vector mapping of the data points we have collected. The better the mapping, the better the quality of our automatically generated thesis. Evaluating whether the generator generates abstracts well enough is two-fold, viz. accuracy (a quantitative analysis) and validity (a qualitative analysis). The goal of the present study is not only to present an automatically generated abstract but also to identify the recent trends of the global linguistics community. As we attain our goal, we hope to present a map of the current flow of the linguistics community on a calculable grid.

12:05
Towards a computational modeling of word comprehension

ABSTRACT. One of the questions most urgently in need of an answer in the cognitive science of language is the nature of language representation during comprehension: how are language units like words analyzed in the mind/brain, linearly or hierarchically? The thesis that Chomsky (1956, 1957) put forward is that language is structured: sentences are composed not of linear but of hierarchical structures of “words”, leading to the conclusion that syntax is (at least) a “Context-Free Grammar (CFG)”, not a “Finite-State Grammar (FSG).” According to the “Chomsky Hierarchy (CH)”, the set of string sets generated by CFGs properly contains the set of string sets generated by FSGs. However, it is a controversial question whether words, like sentences, are hierarchically structured, being composed of morphemes. Some psycholinguists have assumed that “words” are axiomatic units. Likewise, computational linguistics has maintained that morphology is FSG (Karttunen, 1983; Beesley & Karttunen, 2003; Roark & Sproat, 2007). In this paper, taking Chomsky’s stand more seriously, especially anti-lexicalist theories such as Distributed Morphology (DM; Halle & Marantz, 1993), which hold that not words but morphemes are axiomatic units, we rely on the relevant corpus of the Korean Lexicon Project and design a computational modeling experiment testing whether morphology is CFG or FSG. Our current prediction is that if language is hierarchical and only one grammar is available to build both sentences and words (the “single engine hypothesis” of DM), then word structures should also be hierarchical.

10:50-12:30 Session 5C
Location: IBK Hall
10:50
Single and multi-unit vocabulary in a corpus study of English movies

ABSTRACT. While English movies have been shown to be effective instructional materials, limited corpus research has been conducted to provide evidence-based insights for language learning. Lexical coverage (known single words) in corpus analysis provides valuable findings for comprehension, but learners also need to acquire multi-word units for processing advantages to demonstrate fluent performance. This exploratory study compiled 30 recent English movies (released by American or British companies between 2015 and 2019, 308,487 tokens, about 57 hours). Both lexical coverage (using AntWordProfiler) and the frequency, structures, and functions of bundles (Biber et al., 2004) were analyzed using automatic (AntConc) and manual means. The results showed that knowing the most common 3,000 words is needed to reach the 95% comprehension level, and 6,000 words to reach 98%, similar to the results for Webb and Rogers’ earlier films (2009). Concerning bundle classification, the functional distributions of major stance expressions and discourse organizers are similar to those in Coxhead et al.’s (2017) labs and tutorial corpora, but referential bundles were different. “Oh, my god” ranked at the top. Verb-phrase bundles accounted for over 70%, typical of conversation in ‘oral’ registers. Cross-matching with the 200 commonly used spoken phrases in the Academic Formulas List (Simpson-Vlach & Ellis, 2010) as a reference for conversational English indicates that most (73%) also appeared in our film bundles. The linguistic make-up of these latest movies shows the unique features of vocabulary profiles from both single and multi-word perspectives. Pedagogical implications are discussed. To illustrate, stance expressions such as “I don’t think/want to” can help L2 speakers present ideas or manage turn-taking.
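
For readers unfamiliar with lexical coverage, the following is a minimal Python sketch of how cumulative coverage by frequency band can be computed. The band contents and the function name are hypothetical, and real profilers such as AntWordProfiler group tokens into word families, which this sketch does not attempt.

```python
def cumulative_coverage(tokens, frequency_bands):
    """Return the cumulative percentage of tokens covered after each band.

    tokens: list of word tokens from the corpus.
    frequency_bands: list of word collections, ordered from the most
        frequent band (e.g. the first 1,000 words) to the least frequent.
    """
    total = len(tokens)
    if total == 0:
        raise ValueError("empty corpus")
    covered = 0
    seen = set()
    results = []
    for band in frequency_bands:
        # Count each word only in the first band that lists it.
        new_words = set(band) - seen
        covered += sum(1 for t in tokens if t in new_words)
        seen |= new_words
        results.append(100.0 * covered / total)
    return results
```

The vocabulary size needed for a target such as 95% is then the first band at which the cumulative figure reaches that threshold.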

11:15
A Corpus-based Study on Vocabulary Coverage of Children Animations

ABSTRACT. Many EFL studies have pointed out that language learners acquire vocabulary through media exposure. Many such studies have focused on TV episodes, dramas, and online media. Few have looked into the use of movies, and very few have studied the possible impact of animations on learning English as a foreign language. This study examined the lexical sources provided by Disney animations and further investigated the vocabulary coverage needed to comprehend Disney animations.

The transcripts of twenty Disney animations were analyzed by two online instruments, VocabProfiler and N-Gram Extractor (Lextutor), looking into the vocabulary coverage, lexical bundles, and off-list words observed in Disney animations. The results showed that Disney animations provided vocabulary coverage from 2,000 to 16,000 words and generated 3- to 6-word lexical bundles. This study also suggested that Disney animations can be ideal materials for EFL learners.

11:40
A Corpus-based Study on the Use of Lexical Bundles by Taiwanese College EFL Students

ABSTRACT. The importance of lexical bundles in language teaching and learning has been receiving a growing amount of attention, especially in the field of academic writing. This corpus-based study examines the use of lexical bundles with an emphasis on the functional classification of Taiwanese college English majors’ essays. Three corpora of English academic essays were established based on the student writers’ proficiency levels. An online corpus analysis instrument, N-Gram Extractor (Lextutor), was used to identify 2- to 6-word lexical bundles among the essays. The lexical bundles were classified according to their discourse functions manually.

The results showed that lexical bundles were observable in students’ essays. The medium-level students used the most lexical bundles among the groups, whereas the lower-level students hardly produced any lexical bundles in essay writing. In terms of functions, medium-level and higher-level students used more stance bundles and referential bundles in their essays. This study calls for immediate attention to lexical bundles, particularly in the writing of college EFL students.
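
As an illustration of what bundle extraction involves, here is a minimal Python sketch that counts n-word sequences and keeps those meeting a frequency and a dispersion threshold. The thresholds, the whitespace tokenization, and the function name are assumptions made for this example; they are not the settings or the algorithm of N-Gram Extractor.

```python
from collections import Counter

def extract_bundles(texts, n, min_freq=2, min_texts=1):
    """Return {bundle: frequency} for n-word sequences that occur at
    least min_freq times overall and in at least min_texts texts."""
    freq = Counter()        # total occurrences of each n-gram
    text_range = Counter()  # number of texts containing each n-gram
    for text in texts:
        tokens = text.lower().split()  # simplistic tokenization
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        freq.update(ngrams)
        text_range.update(set(ngrams))
    return {
        " ".join(gram): count
        for gram, count in freq.items()
        if count >= min_freq and text_range[gram] >= min_texts
    }
```

The dispersion threshold guards against sequences that are frequent only because one writer repeats them; functional classification of the surviving bundles remains a manual step, as in the study.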

12:05
Arabic as a Second Language Acquisition For Non-Native Students in Indonesia

ABSTRACT. Learning and discussing the problem of mastering languages is always interesting. Acquisition is an important unconscious process. This contrasts with learning, which is a conscious process. The process of language acquisition and language learning is always influenced by several factors. The process of acquiring language can be seen from the perspective of psycholinguistics, sociolinguistics, and neurolinguistics. From the neurolinguistic perspective, renewal of the nervous system in the human brain has a very important role in the process of acquiring external language and internal human environmental factors. This article discusses the process of acquiring Arabic for non-native students in Indonesia.

10:50-12:30 Session 5D
10:50
When stylistics meets Corpus Linguistics: The Arabic-English Literary Parallel Corpus

ABSTRACT. Corpus-based stylistic studies have been on the rise in the last few decades. Such studies have to start with a corpus; however, in some cases a corpus is not easily available, for reasons related either to the language itself or to the type of texts. Parallel corpora which include Arabic as one of the languages and which include literary texts combine both challenges. Despite the advances in Arabic corpora and resources in the last few decades, Arabic still lags behind other languages and is even considered a ‘resource-poor’ language.

This study reports on a project to compile a parallel corpus of modern Arabic literature and its translation into English, supported by a research grant from the American University of Sharjah. The project aims to reach around 2 million words in its first stage, and it claims to be the first of its kind in the region to include complete novels and their translations. The project has the support of all the authors and translators involved, as well as the support of the AUC Press, one of the biggest publishers of translated Arabic literature in the world. The corpus, which uses the Sketch Engine software, aims to be a much-needed resource for several research goals, including: (1) corpus-based stylistic studies of various linguistic aspects (discourse analysis); (2) study of the impact of visualization tools on parallel corpora of literary texts; (3) sentiment analysis for literature; and (4) corpus-based teaching of literary translation.

11:15
Analysing opinion as a beyond-a-turn unit: A function-to-form approach

ABSTRACT. Being able to express opinion appropriately is an essential pragmatic skill that L2 speakers should develop (Cohen & Tarone, 1994; Iwasaki, 2009). The teaching and assessment of the skill has revolved around the use of grammatical forms such as softeners (e.g., sort of) (Bouton, Curry, & Bouton, 2010) or pragmatic routines (e.g., I agree but) (Bardovi-Harlig, Mossman, & Vellenga, 2015). Although the form-driven studies drew much attention to L2 speakers’ ability to state an opinion within a single utterance or turn, they have not been fruitful in capturing two properties of spoken interaction: 1) pragmatic meanings are not always marked overtly; and 2) they are shaped throughout the course of conversations (Grabowski, 2016). Hence there has been a call for exploring how to teach and assess opinion-giving skills in a more nuanced manner. This study attempts to respond to this need by approaching opinion from a discourse perspective. It will describe the discursive properties of (un)successful opinion-giving by L2 speakers, taking function rather than form as the starting point of analysis. The data set will be the Trinity Lancaster Corpus (Gablasova, Brezina, McEnery, & Boyd, 2015), which consists of language proficiency interviews between test-examiners and test-takers. The interactions will be analysed based on Eggins & Slade’s (1997) framework, further developed from Halliday (1994) for the purpose of analysing conversations. The findings will be discussed in relation to the strengths and challenges of taking a function-to-form approach to learner language, spoken discourse, and pragmatics.

11:40
Corpus Linguistics meets Historiography: Telling the history of academia through the analysis of corpora

ABSTRACT. The current study has two major goals: the first is descriptive, namely to identify the major periods in the history of Linguistics and Applied Linguistics; the second is methodological, namely to present a corpus-driven, bottom-up, multi-dimensional approach for Historiography. Two separate corpora were compiled, one for each field, including four flagship journals each. All of the available articles and reviews published in English in the journals were downloaded, amounting to more than 20,000 texts, covering a period of more than 70 years. The corpora were tagged for POS and lemmatized. A lemma subset of lexical words was selected and their counts normed. The normed counts were entered in a factor analysis, which identified the sets of correlated lexis. These were interpreted as dimensions that correspond to the major discourses shaping the fields. ANOVAs were conducted to measure the degree of variation captured by the dimensions, using both time of publication and journal as fixed factors. To test the strength of the MD models, a Discriminant Function Analysis was conducted which verified the extent to which the texts could be ascribed to the journals and time periods to which they belonged on the basis of their dimension scores. The final analysis was a Hierarchical Cluster Analysis (HCA), which was employed to derive the time-based classification of the texts. This grouped the different years into contiguous time periods based on the similarity of their dimensional profiles. Overall, this research illustrates a way in which Corpus Linguistics can engage with Historiography.
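
The norming step mentioned above can be sketched as follows in Python. The abstract does not give the norming base, so the rate per 1,000 tokens and the function name are assumptions made for illustration only.

```python
from collections import Counter

def normed_counts(lemmas, selected, per=1000):
    """Rate of each selected lemma per `per` tokens of one text.

    lemmas: list of lemmatized tokens for a single text.
    selected: the lemma subset whose counts are to be normed.
    per: norming base (assumed here; not stated in the abstract).
    """
    total = len(lemmas)
    if total == 0:
        raise ValueError("empty text")
    counts = Counter(lemmas)
    return {lemma: counts[lemma] * per / total for lemma in selected}
```

Norming makes texts of different lengths comparable, so that a statistical procedure applied to the resulting values reflects co-occurrence patterns rather than text length.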

12:05
Korean Metaphors of Emotion across Genres

ABSTRACT. Metaphor and metonymy have been studied extensively over the last two decades within a strand of cognitive linguistics research that seeks to understand conceptual mapping at both the conceptual and the linguistic level. Another rapidly developing strand of related research has explored the ways in which figurative language is used in particular contexts, genres, and registers (Steen et al., 2010) or discourse (Cameron, 2003; Charteris-Black, 2004, 2005, inter alia). This study fits into the second strand. While an earlier study (Türker, 2013) investigated the extent to which Korean metaphors of emotion are universal or culturally specific, this study analyzes conceptual emotion metaphors of Korean utilizing a data-driven corpus-based method to provide a fine-grained analysis of the use of Korean metaphorical expressions across genres. It uses data collected from newspapers, essays, novels, academic journals, children’s books, magazines, and nonfiction books published between 1980 and 2006. The study hypothesizes that the use of Korean emotion metaphors varies significantly across genres and changes over time. More specifically, the study investigates whether and how (i) types and frequency of Korean metaphoric expressions vary across different genres; (ii) Korean metaphoric expressions demonstrate change over time; (iii) subject matter (or discipline) influences frequency and choice of metaphoric expressions; and (iv) prominent metaphorical expressions from earlier texts are re-used in different genres to convey new meanings and serve new functions.

10:50-12:30 Session 5E
Location: Helinox Hall
10:50
A Comparative Study on The Morphological Word Embeddings for Korean Sentiment Analysis

ABSTRACT. This study compares the results of applying morphological word embeddings to a Korean sentiment analysis model using deep learning. The review corpora were scraped from Naver Movies and preprocessed for machine learning approaches. The construction of the dataset is based on the method described for the Large Movie Review Dataset (Maas et al., 2011). All reviews are shorter than 140 characters, and each sentiment class contains 100k reviews. Previous Korean sentiment analysis models generally used a sentiment dictionary. Word embedding has emerged as a deep learning approach to sentiment analysis. In the case of English, after BERT's breakthrough, performance improvements were made using a large number of pre-trained embeddings. Nevertheless, in resource-poor languages such as Korean, a detailed study of word embedding features remains a way to improve performance. In addition to the commonly used nouns, adjectives, and verbs, additional adjectives and adverbs are extracted and used as embedding features. The performance of each embedded model is verified against the accuracy of 0.827 achieved by the baseline model embedded with all parts of speech.

11:10
Universal Dependencies corpus of spoken Korean

ABSTRACT. This paper discusses a method of building a Universal Dependencies corpus of spoken Korean. Unlike written texts, spoken texts are not easy to parse because of fillers, disfluencies, and inversion. Developing an annotation scheme for these features will ease the difficulty of parsing spoken data. As a result, syntax-based content words and propositional information can be extracted.

Universal Dependencies (UD) is a project that is developing cross-linguistically consistent treebank annotation for many languages. Since UD focuses on cross-linguistic consistency, there is a limit to how far it accommodates the spoken features of Korean. UD has dependency relation tags concerned with speech, such as ‘discourse’ and ‘reparandum’. This study proposes language-specific relations according to the characteristics of Korean speech.

This study will be conducted on a sample of 2,000 sentences from the grammatically tagged spoken corpus of the 21st Century Sejong Project. It provides an annotation guideline and a mapping table between the spoken tag set of the Korean National Corpus and the UD tag set. The UD spoken corpus will be used for syntactic parsing and for applications such as AI speakers and chatbots.

11:30
Toward Abstract Meaning Representation for Korean

ABSTRACT. This paper discusses the planning and methodology for building a Korean AMR corpus. Abstract Meaning Representation (AMR) is a graph-based meaning representation framework that represents the meaning of single or multiple sentences as a traversable, single-rooted directed acyclic graph. Unlike constituent or dependency trees, AMR represents the propositional meaning of a sentence, so sentences with the same meaning can be represented by a single AMR graph. AMR provides the basis for semantic parsing and can be widely applied to natural language processing tasks such as machine reading comprehension and automatic summarization. AMR has been studied in various languages such as English, Chinese, Spanish, Brazilian Portuguese, and Vietnamese. However, some language-specific issues remain in adapting AMR to Korean because it was originally developed based on English. This study starts with the process of localizing existing guidelines for the Korean AMR corpus. In this phase, extending current annotation schemes to address Korean-specific issues can be a concern. The Korean AMR corpus must be constructed to make full use of existing language resources and to be compatible with various language resources abroad. The documents to be annotated are selected from the ETRI Exobrain Corpus, and the Korean PropBank (Palmer et al., 2006) was used to annotate verb frames when representing AMR. The Korean AMR corpus includes 2,144 sentences at an early stage, considering the problem of unregistered words, and will be released in 2020Q2. In this regard, the paper elaborates on annotation methodology, extended annotation guidelines in the localization, specification of the annotation corpus, and future applications.

11:50
A Study on the pattern of Pauses in Korean by comparing Announcer speaking and Public speaking

ABSTRACT. This study aims to identify the pattern of pauses in Korean. Since Korean has no grammatical accent, it is hard to decide the basic unit of intonation description. Generally, many researchers describe only the terminal contour or use the IP (intonation phrase) between pauses. Given this, research on the pattern of pauses in Korean is necessary. In this study, speech corpora were recorded using script reading and task-based free speech. Announcers and members of the public participated in the recording. The comparative analysis of the reading showed that the difference between the announcers and the public was not significant in speech rate or pause frequency. However, the announcers paused at more consistent positions than the public, and their speech rate did not change significantly across sections. Furthermore, even among the announcers, it was necessary to distinguish two groups, one of which produced many pauses. The groups were subdivided according to the frequency of pause occurrence to compare the patterns of free speech. (Work in progress)

12:10
Assigning the NOUN of Universal part-of-speech for Korean Universal dependencies treebank

ABSTRACT. This study discusses how to assign the Universal part-of-speech tag NOUN when constructing Korean treebanks based on Universal Dependencies. We will calculate the frequency and distribution of NOUN in public Korean treebanks and establish criteria for NOUN assignment that reflect the general characteristics of Korean nouns. In a parsing test with public Korean Universal Dependencies parsers, NOUN-assigned words accounted for the highest percentage, 44.86%, and 52.94% of annotation errors occurred in NOUN-assigned words. Therefore, it is expected that categorizing the types of NOUN-assigned words and establishing a consistent assignment criterion will improve parsing performance by increasing the consistency of the training data. For high-accuracy parsing, a deep learning-based dependency parser learns the features of accumulated training data. Therefore, the precision of training data preprocessing can affect parsing performance. It would also be meaningful to review the characteristics of the NOUN-assigned words with high-frequency errors and to re-examine the data after consistent re-preprocessing. The purpose of this study is to explore methods of data construction and refinement. We review previous studies related to Universal Dependencies, summarize the properties of Korean noun phrases, and examine the assignment of noun phrases in public Korean treebanks. Based on the language-specific part-of-speech tag information of Korean, we suggest guidelines for assigning NOUN in Korean Universal Dependencies treebanks.

14:00-15:00 Session 6: [Plenary 1] Douglas Biber. Using multi-dimensional analysis to study fictional style: variation among and within novels with respect to potential universal dimensions of register variation

Plenary Session

Location: Grand Ballroom
14:00
Using multi-dimensional analysis to study fictional style: Variation among and within novels with respect to potential universal dimensions of register variation

ABSTRACT. TBA

15:10-16:50 Session 7A
Location: Choi Young Hall
15:10
Lexical profile of academic spoken English in Anglophone and non-Anglophone settings

ABSTRACT. L2 learners have experienced difficulty in understanding academic spoken English due to vocabulary deficiency in English-medium-instruction (EMI) educational institutions (Goh, 2013). To help these learners improve vocabulary, lexical profiling research has been conducted to indicate the number of words learners need to learn to comprehend academic listening texts (e.g., Dang & Webb, 2014). However, previous research has mainly used native-speaker corpora for analysis, such as the British Academic Spoken English (BASE); and no study has investigated the lexical profile of academic spoken English in non-Anglophone settings, where the majority of speakers are non-native.

This study analyses and compares the lexical profiles of three corpora: the corpus of Spoken Academic English as a Lingua Franca (ELFA), the BASE corpus, and the Michigan Corpus of Academic Spoken English (MICASE), which represent academic spoken English in non-Anglophone, British and American settings respectively. The results show that if a learner knows the most frequent 3,000 words plus proper nouns and marginal words in Nation’s (2016) BNC/COCA lists, they can reach 95% coverage in all three corpora; the study also calculated the vocabulary sizes needed to reach 98% coverage of the three corpora. Additionally, both the Academic Spoken Word List (ASWL) (Dang, Coxhead, & Webb, 2017) and the Academic Word List (AWL) (Coxhead, 2000) have higher coverage in ELFA than in BASE and MICASE. Lastly, the study calculated the number of words needed to reach the threshold levels with the help of the ASWL and the AWL in the three corpora.
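As an illustration of how lexical coverage of this kind is typically computed, the following is a minimal sketch, assuming a tokenized corpus and frequency-ranked word-list bands. The function names, the toy tokens and the toy bands are hypothetical and are not taken from the study or from the BNC/COCA lists.

```python
def cumulative_coverage(tokens, bands):
    """Return cumulative coverage (%) after each successive word-list band.

    tokens: list of word tokens from a corpus (assumed already lowercased).
    bands:  list of sets, ordered from the most to the least frequent band.
    """
    total = len(tokens)
    if total == 0:
        return [0.0 for _ in bands]
    known = set()
    coverage = []
    for band in bands:
        known |= band  # a learner who knows this band also knows earlier ones
        covered = sum(1 for tok in tokens if tok in known)
        coverage.append(100.0 * covered / total)
    return coverage


def bands_needed(coverage, threshold):
    """Return the 1-based number of bands needed to reach threshold, or None."""
    for i, value in enumerate(coverage, start=1):
        if value >= threshold:
            return i
    return None


# Toy data for illustration only (hypothetical, not from any real corpus).
tokens = "the lecture covers the theory and the method of the analysis".split()
bands = [
    {"the", "and", "of"},
    {"lecture", "covers", "method"},
    {"theory", "analysis"},
]
cov = cumulative_coverage(tokens, bands)
print(cov, bands_needed(cov, 95.0))
```

With real data, each band would be one 1,000-word level of a frequency list, and the thresholds of interest would be the 95% and 98% coverage levels mentioned above.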

15:35
Corpus-based Lexical Elaboration and Sophistication: The Case of Swahili Adjective -kuu

ABSTRACT. Swahili is a Bantu language and the mother tongue of the Swahili people (Waswahili). The number of Swahili speakers is increasing at an exponential rate, making it one of the fastest-growing languages in the world. As Swahili has been vigorously promoted as a regional lingua franca and gains ground in all EAC member states, the issue of lexical expansion and elaboration has drawn special attention. Coining technical terms in a wide range of fields and disseminating them constitute an integral and indispensable part of promoting Swahili. This lexical expansion, elaboration and sophistication are not confined to nouns, specifically technical terms. The exact meaning of a word can be established only if it is put into actual use in real-life situations. This is particularly true of adjectives, be they attributive or predicative, because adjectives are words that describe or modify nouns. Adjectives remain more or less indeterminate or ambiguous if they are not used with nouns. Considering this complexity and indeterminacy in determining the meaning of words in a given text or utterance, this paper attempts corpus-based lexical elaboration, disambiguation and sophistication, taking as its case the Swahili adjective –kuu, whose meanings are not sufficiently and clearly defined. Some authoritative dictionaries that list only a limited number of meanings will be presented to emphasize the merit of corpus-based lexicography.

16:00
Boyle fullyche and seethe hem a lytil: Boil and Seethe as Verbs of Cooking

ABSTRACT. The culinary art is much more than simply the act of cooking. Often the preparation of the ingredients and the cooking process require very precise attention, and these subtle differences can be seen in the use of cooking verbs. Thus it is not so odd that cookery has diverse verbs to describe the act of cooking, and more specifically, the act of cooking in liquid, in other words, boiling. Likewise, the English dictionary seems to have more than just one simple verb ‘boil’ to describe the action; the seemingly synonymous ‘parboil’ and ‘seethe’ also rise to the surface. The principal aim of this study is therefore to present a general diachronic overview of these verbs as used for cooking in historical culinary recipes by looking at a small corpus of selected recipe collections from the fourteenth to the seventeenth centuries. I will give a general overview of the frequency of the three verbs and other linguistic features, and then move on to a more descriptive analysis with usage examples.

16:25
Examining the Suitability of TED Talks for Academic Listening: A Vocabulary Perspective

ABSTRACT. English teachers often use TED talks as authentic materials for improving English learners’ academic listening abilities in English for academic purposes (EAP) courses. Despite the common integration of such talks into EAP classrooms, the debate over whether they are suitable materials for academic listening remains unsettled. Previous studies have approached this issue by examining TED talks’ vocabulary profiles, but limitations remain, including the restricted number of TED talks investigated and the inappropriateness of using written academic word lists to analyze the coverage of academic vocabulary in TED talks, a spoken genre. As such, this study compares the representation of frequent academic spoken vocabulary in a 4.4-million-word TED talk corpus and a 4.4-million-word academic lecture corpus based on the Academic Spoken Word List (ASWL) (Dang, Coxhead, & Webb, 2017) to determine whether TED talks can be suitable academic listening materials. Results show that TED talks and academic lectures have similar ASWL representations, with approximately 90% coverage by the ASWL. This means that learners would encounter academic words in TED talks about as frequently as they would in academic lectures, suggesting that TED talks should be suitable materials for academic listening. Moreover, the high coverage of the ASWL over academic lectures and TED talks also suggests that the ASWL can serve as a helpful vocabulary guide in EAP courses. Pedagogical suggestions regarding the use of the ASWL for improving learners’ academic listening comprehension are also discussed.

15:10-16:50 Session 7B
15:10
A corpus-based analysis of the phraseology of the linking adverbial "besides"
PRESENTER: Sugene Kim

ABSTRACT. This study probes the phraseology of the linking adverbial "besides" to unravel why, unlike similar-meaning transitions such as "in addition", it sounds unnatural in some contexts, as in "She likes football. Besides, she likes tennis and basketball" (Oxford Advanced Learner’s Dictionary). A total of 154 corpus extracts from two written corpora of academic English (the Michigan Corpus of Upper-level Student Papers and the British Academic Written English corpus) and the academic sections of the British National Corpus and the Corpus of Contemporary American English were examined to identify the discourse environment in which "besides" is used to bind sentences together. The analysis suggested that the linking adverbial "besides" co-occurs frequently with pragmalinguistic features typical of argumentation. In 91.5% of the data, the use of "besides" was found to be regulated by the negativity-conditioned nature of what precedes it. In most cases, negation was achieved using an explicit negative word (e.g., "not" or "no"), a negative affix, or words with negative implication. Negative assertions were also made by means of constructions conveying a negative proposition, such as a rhetorical question, the subjunctive mood, or the comparative construction. When negativity in the preceding clause was not expressed syntactically or semantically, the besides-clause was shown to act as a rhetorical cue for treating previously stated arguments as premises for an inference of a de facto proposition, which takes a negative form without fail.

15:35
Use of Modal Verbs in EFL writing: Comparison between L1 English speakers and Chinese, Japanese, and Thai learners

ABSTRACT. This study examined the use of modal verbs (can, could, may, might, will, would) in the essays of 200 L1 English speakers and 1,200 EFL learners in China, Japan, and Thailand. A total of 2,800 essays drawn from the International Corpus Network of Asian Learners of English, 400 by L1 English speakers and 2,400 by EFL learners, were analyzed in terms of the distribution of each modal. The purpose of this study was to examine the wide variability in the usage of modal verbs among Asian EFL learners and to identify areas in which subgroups of EFL learners require pedagogical attention. The results revealed that the EFL learners use “can” more liberally, leading to its “overuse,” with its frequency reaching approximately double that of the L1 English speakers. Conversely, the EFL learners’ underuse of “would” is evident, suggesting a considerable limitation in expressing hypothetical situations. Further, the findings suggest that examining the qualitative aspects of modal verbs is important. For instance, although L1 English speakers and Japanese EFL learners use “may” with comparable frequency, the ways in which they use it are distinctively different. Based on the varied tendencies observed among the EFL learners, it is shown that different pedagogical approaches are necessary for each subgroup, considering the unique patterns of usage of each group.

16:00
To Annotate or not to Annotate: Challenges on Encoding Traditional Grammar Features in Morphological Annotation Resources of Indonesian

ABSTRACT. In the tradition of Indonesian grammar or morphology, it is common to describe a morpheme on the basis of syntactico-semantic analysis. Some affixes in Indonesian suggest syntactic features (passive, causative, applicative voices) or semantic features (such as agent, instrument, or patient). This is a challenge for automatic morphological annotation systems, as reflected by the existing Indonesian morphological analysers, such as those developed by Pisceldo et al. (2008) and Larasati & Kubon (2011). I here present how this problem is addressed in SANTI-Morf, a work in progress. SANTI-Morf is a collection of machine-readable dictionaries and finite-state graphs dedicated to the automatic morphological analysis of Indonesian texts. SANTI-Morf can run on finite-state-based corpus processing machines. Regarding the implementation, I will show that the incorporation of some syntactic-semantic values is feasible (e.g., passive voice and superlative adjective marker). However, some of them are better suited to morphosyntactic or syntactic annotation. Encoding these values in SANTI-Morf adds complexity to the morphological tags, causing a large number of ambiguities in morphological annotation that remain to be resolved. If these values are required by users for morphological retrieval, they can be encoded, but users must be warned to use these features with caution, as they are ideally disambiguated in morphosyntactic or syntactic annotation.

15:10-16:50 Session 7C
Location: IBK Hall
15:10
The Use of Modal Verbs in Government-Approved Japanese EFL Textbooks

ABSTRACT. This research concludes that government-approved EFL textbooks in Japan do not mirror actual use of the central modals (i.e. can, may, will, shall, must, and their preterit forms) on at least two levels: frequency of occurrence and the verb phrase structures (VPSs) in which each modal co-occurs. This conclusion was drawn from a comparison of the modals' use in the Corpus of Contemporary American English (COCA) with that in a junior high and a senior high textbook corpus respectively. I constructed these two textbook corpora myself from all the government-approved English textbooks, since there are no freely available corpora in Japan. The junior high textbook corpus contains 30,317 running words and the senior high textbook corpus contains 190,029 running words. Frequency analysis revealed gaps in frequency order between COCA and each of the two textbook corpora. A log-likelihood test uncovered the overuse of ‘can’ as well as the underuse of ‘would’, ‘could’, and ‘might’ in junior high textbooks; there are no occurrences of ‘might’, which is the mid-frequent modal in COCA. In senior high textbooks all the modals except for ‘shall’ are overused relative to COCA. In COCA and senior high textbooks, the modals co-occur with marked voice and aspect, whereas in junior high textbooks all the modals except for ‘will’ and ‘can’ co-occur with bare infinitives only. VPSs in which ‘will’ and ‘can’ co-occur are limited to the passive voice in addition to bare infinitives. In COCA, however, ‘must’ and ‘shall’ are more likely to co-occur with the passive voice than ‘will’ and ‘can’.
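The abstract does not say which formulation of the log-likelihood test was used. As a minimal sketch, the following assumes the two-corpus log-likelihood statistic commonly used for corpus frequency comparison (often attributed to Rayson and Garside). The function names are hypothetical, and apart from the 30,317-word junior high corpus size stated above, the counts and the reference corpus size are invented for illustration.

```python
import math


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Log-likelihood (G2) for one item's frequency in corpus A vs corpus B."""
    total_freq = freq_a + freq_b
    total_size = size_a + size_b
    expected_a = size_a * total_freq / total_size
    expected_b = size_b * total_freq / total_size
    g2 = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2


def usage_direction(freq_a, size_a, freq_b, size_b):
    """Label corpus A as over- or underusing the item relative to corpus B."""
    rel_a = freq_a / size_a
    rel_b = freq_b / size_b
    if rel_a > rel_b:
        return "overuse"
    if rel_a < rel_b:
        return "underuse"
    return "equal"


# Hypothetical counts: an item with 0 hits in a 30,317-word textbook corpus
# against 50 hits in an assumed 1,000,000-word reference sample.
g2 = log_likelihood(0, 30317, 50, 1_000_000)
print(round(g2, 2), usage_direction(0, 30317, 50, 1_000_000))
```

A G2 value above 3.84 corresponds to p < 0.05 with one degree of freedom; the direction of the difference has to be read separately from the relative frequencies, which is why the sketch reports it with a second function.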

15:35
Look into the Use of Top Phrasal Verbs in PHaVE list among Vietnamese EFL learners

ABSTRACT. Over the past decades, research into the acquisition of phrasal verbs (PVs) has become prominent in linguistics, whereas studies of this aspect among Vietnamese learners of English are still limited. This study analyzes the actual use of PVs by Vietnamese EFL learners in the Vietnamese component of the EF-Cambridge Open Language Database (EFCAMDAT), which consists of essays submitted to Englishtown, an online school of EF Education First. The essays were collected from learners of various proficiency levels (CEFR stages A1-B2). The top 30 PVs in the PHaVE List compiled by Garnier and Schmitt, and their meaning senses, were examined in the present corpus. The errors made by the learners when using PVs were also studied. The results show that 40% of the most frequent PVs were absent from the corpus. Among the PVs found, the majority have their most frequent meaning sense acquired by the learners (88.2%). However, the low frequency of other meaning senses suggests that more attention should be paid to the teaching and learning of polysemous phrasal verbs. It was also found that while beginners mostly made basic grammar mistakes, intermediate learners tended to make errors because of their incomplete understanding of the meaning of PVs when they tried to use more complex PVs or more diverse meaning senses. These findings suggest that idiomaticity and polysemy may be the factors that make PVs difficult for Vietnamese EFL learners to master. Finally, some recommendations for English language teaching and learning are proposed.

16:00
A usage-based model of second language production and comprehension: Frequency effects on verb-construction integration

ABSTRACT. This study tests the usage-based model of language learning by investigating how constructional frequency modulates second language (L2) learners’ integration of verb and argument structure construction (hereafter, construction) in English production and comprehension. Usage-based approaches posit that language learning depends on language experience, mainly composed of language input (N. Ellis & Collins, 2009). In this usage-driven process, input frequencies guide language learners to integrate a variety of verbs with a set of constructions. We examined the potential contribution of input frequencies of constructions to verb-construction integration by analyzing the L2 production and comprehension of two complex constructions (i.e., ditransitive and resultative), comparable in structural and semantic complexity but distinctive in input frequency. Given the higher input frequency of the ditransitive than the resultative construction, we predicted that L2 learners will be better able to integrate a verb with the ditransitive construction than with the resultative construction. A corpus-based analysis of L2 learner texts showed greater variability in verbal usage in the production of the ditransitive than the resultative construction. The results of an acceptability judgment task indicated that L2 learners accepted the ditransitive sentences regardless of whether they contained high-frequency or low-frequency verbs, whereas the learners accepted the resultative sentences significantly more when they read high-frequency than low-frequency verbs. Our findings suggest that high frequency of a construction facilitates L2 learners’ integration of the construction with verbs, supporting the main tenet of the usage-based model.

16:25
Using Linguistic Corpus in Learning Arabic at Indonesian Universities

ABSTRACT. This paper aims to describe corpus linguistics and its development and dynamics in the study of Arabic internationally. Specifically, it will illustrate the extent to which existing Arabic corpora and applications can be used for corpus processing and analysis. This could serve as a basis for academic discussion regarding the possibility of drafting a corpus model of Arabic in Indonesia as well as its use in the study and learning of Arabic at the university level.

15:10-16:50 Session 7D
15:10
A corpus-based analysis of Korean contrastive connective –ciman

ABSTRACT. This paper investigates how the contrastive connective –ciman in Korean is employed by ordinary speakers of Korean by drawing on the Sejong Corpus. The literature on the senses possible with the ending -ciman has suggested taxonomic categories including semantic contrast, pragmatic contrast, denial-of-expectation and speech-act contrast. However, such taxonomic approaches cannot exhaust all possible meanings. Nor do they provide clear-cut criteria to demarcate the boundaries between them. In addition, most studies have focused on constructed data. Departing from previous studies, this paper suggests a scalar representation of the senses of –ciman constructions emerging from the corpus data, which include explicit contrast, denial of expectation/implicature, speech act hedges and idiomatic expressions. The rationale behind this scalar representation is at least two-fold. First, in some cases, the category of an example is rather fuzzy, suggesting that it can serve multiple functions simultaneously. Hence, the boundaries between the categories in the current study will be analytical. Second, scalar representations can better reflect the relatedness among the various senses of -ciman constructions, abiding by the principle of parsimony of senses, spelled out by Grice (1989: 47) as “Modified Occam’s Razor” (i.e., senses are not to be multiplied beyond necessity). The so-called “pragmatic” senses of -ciman constructions, such as denial of expectations and speech acts, are accounted for by means of the different levels of representation at which the contrast occurs. Furthermore, the possibility of utilizing –ciman as a discourse/stance marker is pursued with reference to idiomatic expressions.

15:35
A Corpus Stylistic Analysis of Malaysian Online Columnists

ABSTRACT. Online media has created various platforms by which people can view and make sense of the world today. This is particularly true for explaining the rise of journalistic commentary and how events are reported through the lens of columnists. In this paper, two Malaysian columnists from two national English online portals, The Star Online and New Straits Times, were selected for a corpus-assisted discourse analysis. Frequency lists are first compared between the columnists to identify salient words used by each writer. Initial findings revealed that both shared similar frequent words (e.g. of, in, that). Using the comparing wordlists feature, stylistic comparisons are further explored, indicating that John Teo over-uses functional words to guide (e.g. 'therefore', 'today') and engage (e.g. 'may', 'seems') the reader, as well as lexical words that point to honorifics like 'Datuk', 'Tan' (Sri); specific locations ('Sarawak', 'Sabah'); and ways of addressing the readers as (a) 'nation' and/or 'Malaysians'. These results present each writer as idiosyncratic, which in turn shapes their rhetoric. The use of the first person pronoun ‘I’, which McNair (2008) claims is typical of commentary journalism, was also investigated. Results indicate further distinctiveness in that Syahredzan projects a more assertive stance (I have, know) as opposed to John Teo, who is more suggestive in style (I think, believe). In-depth analysis of specific topics discussed by both columnists suggests that the reportage and interpretation of these events are different, with implications for readers’ willingness to accept or reject these viewpoints.

16:00
Modern diachronic study of modals in SEN (Singapore English Newspaper) corpus

ABSTRACT. Diachronic studies have been carried out on British and American English, based on corpora reaching as far back as 850 CE. In contrast, corpus-based studies of the English language used in Asia are generally synchronic, most probably due to the short history of the English language in most Asian countries. Motivated by recent diachronic research on contemporary English by Leech et al. (2009) and Partington (2010), the present study explores changes that have taken place in Singapore English based on a controlled corpus of articles published in the national broadsheet The Straits Times from three different time periods: 1993, 2005 and 2016. Specifically, the study aims to determine whether the core modals (will, would, can, could, should and may) and semi-modals (BE going to, NEED to and HAVE to) in Singapore English exhibit frequency trends similar to those of the modals in the Brown family corpora (1960s to 1990s) and the SiBol-Guardian corpora (1993, 2005 and 2013). Additionally, the study investigates whether any of the modals have changed in semantic function and context of use over the years. The results indicate that the pattern of change is similar in British English and Singapore English in that the core modals mostly became less frequent, as did the semi-modals HAVE to and BE supposed to, while NEED to increased in frequency over the same period. Interesting differences emerge in the semantic functions and use contexts. The changes could be attributed to processes of democratization, colloquialization, and to a lesser extent, grammaticalization.

16:25
A Multi-Angled Corpus-Based Approach to Testing Bolinger’s Hypothesis

ABSTRACT. Bolinger’s (1968: 127) hypothesis is that ‘a difference in syntactic form always spells a difference in meaning’. Almost all of the previous studies of Bolinger’s hypothesis have been carried out mainly from a semantic standpoint as a single-angled approach. However, this talk shows that through such a single-angled approach, the distinction between semantically competing multi-verb sequences, shown in (1) and (2), cannot be achieved.

(1) a. Come have your dinner.
    b. Come and have your dinner.
(2) a. Go get me a drink!
    b. Go and get me a drink!

This talk supports one hypothesis: ‘the differences in meaning that different forms exhibit include functional and/or historical differences in meaning’. Corpus linguistics is pivotal in testing this hypothesis. This talk emphasizes the importance of testing Bolinger’s hypothesis from a functional standpoint and/or a historical standpoint as well as from a semantic standpoint. Based on data from ‘various kinds’ of synchronic and/or diachronic corpora, this talk also shows how the differences between the semantically competing multi-verb sequences are closely related to genres of language use. The genres of language use are effective in distinguishing between the semantically competing multi-verb sequences. If the distinction between them cannot be achieved through such a functional approach, a historical approach is necessary, one that covers current changes taking place over relatively short spans of time, over decades rather than centuries. Through differentiating between the semantically competing multi-verb sequences, this talk makes clear the value of the multi-angled corpus-based approach to testing Bolinger’s hypothesis.

15:10-16:50 Session 7E
Location: Helinox Hall
15:10
A study for the usage and error patterns in the connective ending '-고(-go)' by Japanese learners of Korean

ABSTRACT. In this paper, we analyze the error patterns of the connective ending '-고(-go)' made by learners of Korean from the viewpoint of interlanguage analysis. It is assumed that learners have difficulty acquiring Korean connective endings because of the unique features of the Korean grammar system. Learning the variety of connective endings is one of the difficult parts for learners of Korean, because connective endings are the most varied of the connective expressions, differ delicately in meaning, and are essential for delivering the speaker's message logically. Learner errors are characterized by appearing in close proximity to the target language system while changing continuously as the learner acquires the target language. Therefore, we treat the learner's language as an independent language system displayed in the process of target language acquisition and clarify the use and errors of '-고(-go)'. The focus of this paper is the semantic function of the connective ending '-고(-go)': we analyze its various meanings and the differences between usage patterns and error patterns according to each semantic function. We then try to understand how the usage and error patterns of '-고(-go)' change according to the Korean proficiency level of Japanese learners.

15:35
The semi-automatic extraction of Korean conventional synesthesia and its transfer directional pattern

ABSTRACT. In linguistics, synesthesia usually refers to the metaphorical transfer of experience from one sensory domain onto another, as in sweet music. The current study aims to extract Korean synesthetic metaphors semi-automatically from a corpus and then explore the directional pattern of synesthetic transfer according to source and target sensory domains. In order to retrieve linguistic synesthesia from the corpus, I first set up perception-related lexical items based on the five sensory domains of touch, taste, smell, sight, and sound, categorized by part of speech into nouns, adjectives, and verbs. Second, to extract synesthetic examples, a method that lists all the sentences containing at least two perception-related items was applied to the Sejong Corpus. Finally, the extracted candidate sentences were sorted manually to identify true synesthesia. Through this semi-automatic extraction method, I gathered 150 occurrences of Korean synesthetic metaphors, which would not be feasible through manual inspection of the corpus data. The directionality of synesthetic transfer in the results in general conformed to the so-called universal schemes from previous canonical studies, except for sight preceding sound in the directional order.
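The automatic step described above, listing every sentence that contains at least two perception-related items, can be sketched as follows. This is a minimal illustration under stated assumptions: the lexicon, its English entries and the function name are hypothetical, and whitespace tokenization stands in for the morphological analysis that Korean text would require.

```python
# Hypothetical toy lexicon mapping perception-related items to sensory domains.
SENSORY_LEXICON = {
    "sweet": "taste",
    "bitter": "taste",
    "music": "sound",
    "loud": "sound",
    "soft": "touch",
    "warm": "touch",
    "bright": "sight",
    "color": "sight",
    "fragrant": "smell",
}


def candidate_sentences(sentences, lexicon):
    """Return (sentence, hits) pairs for sentences with >= 2 lexicon items.

    hits is a list of (item, sensory_domain) tuples in order of appearance.
    The candidates still need manual sorting to identify true synesthesia.
    """
    candidates = []
    for sentence in sentences:
        tokens = [tok.strip(".,!?\"'").lower() for tok in sentence.split()]
        hits = [(tok, lexicon[tok]) for tok in tokens if tok in lexicon]
        if len(hits) >= 2:
            candidates.append((sentence, hits))
    return candidates


# Toy sentences for illustration only.
sentences = [
    "She heard sweet music.",
    "The soup was warm.",
    "A loud color filled the room.",
]
result = candidate_sentences(sentences, SENSORY_LEXICON)
for sentence, hits in result:
    print(sentence, hits)
```

The filter deliberately over-generates: it keeps any sentence with two sensory items, whether or not one modifies the other, which is why the manual sorting step remains necessary.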

16:00
Corpus-based contrastive analysis of hedges in Chinese and Korean for Korean teaching: focusing on the Chinese correspondences of 'ㄹ 것이다'

ABSTRACT. Although many previous studies have examined quantitative fuzzy semantics, the contrastive study of hedges in Chinese and Korean remains underexplored. Whether one can use hedge expressions accurately reflects the pragmatic competence of a foreign language student. The purpose of this paper is to investigate the expressions corresponding to the Korean hedge expression 'ㄹ 것이다' in official Chinese speech texts, and to apply the results, as a pragmatic approach, to teaching Korean as a foreign language to Chinese students. The author selected eleven official speeches by heads of the Chinese government and their Korean translations as a parallel corpus. Using the corpus, statistical methods and a pragmatic approach, we found 104 corpus pairs containing the Korean hedge expression 'ㄹ 것이다' in these eleven speeches. The results showed that 'ㄹ 것이다' is the translation of various Chinese expressions in the original speeches. Moreover, 51.92% of them are unmarked Chinese expressions. This shows that not only were suggestive expressions translated into Korean euphemisms, but the translators also used 'ㄹ 것이다' consciously to make the text suitable for Korean readers. It suggests that we need to present the hedge expression 'ㄹ 것이다' in Korean teaching contexts in order to improve Chinese students' pragmatic competence and increase the acceptance of Korean learners and readers.

16:25
Korean National Korean-Chinese Parallel Corpus in the 21st Century Sejong Project

ABSTRACT. The 21st Century Sejong Project, which aimed to prepare basic frames and resources to increase the capability of language research and technology, has released its results since 2008. The project is divided into two parts: 1) construction of the primary data of the Korean language, and 2) construction of the special (in time, region and language) data of the Korean language. For the latter, we constructed parallel corpora of the following language pairs: Korean-English, Korean-Japanese, Korean-Chinese, Korean-Russian, and Korean-French. However, the Sejong Korean-Chinese Parallel Corpus is not yet well known to researchers. To make the results of the pioneering Sejong Project more widely known, in this paper we introduce the Korean-Chinese Parallel Corpus of the Korean National Corpus (KNC) and discuss some issues concerning its building and application. The first part of this paper introduces the size and composition of the corpus. This is followed by the concrete compilation process of the corpus: design, collection of parallel texts and, especially, alignment at the sentence level. Finally, examples of how this parallel corpus can be (and has been) used in Korean-Chinese cross-linguistic study are presented. It is hoped that this paper will be helpful for those who are interested in the building and application of Korean-Chinese parallel corpora.

15:10-16:00 Session 7F: [Keynote 1] Michael Barlow. Corpus data and individual differences

Keynote Session

Location: Grand Ballroom
15:10
Corpus data and individual differences

ABSTRACT. TBA

16:00-16:50 Session 8: [Keynote 2] Jae-Woong Choe. Word embeddings and Korean semantics: some explorations

Keynote Session

Location: Grand Ballroom
16:00
Word embeddings and Korean semantics: some explorations

ABSTRACT. TBA

17:00-18:15 Session 9A
Location: Choi Young Hall
17:00
A Keyword Analysis of Test of English for Thai Engineers and Technologists

ABSTRACT. To meet the need for English language assessment in a field as specific as engineering and technology, the Test of English for Thai Engineers and Technologists (TETET) was established a decade ago by the School of Liberal Arts, King Mongkut’s University of Technology Thonburi (KMUTT), a leading engineering and science university in Thailand. This study aims to explore the linguistic features of TETET in comparison to TOEIC (Test of English for International Communication, which is widely used for general office work communication) through a corpus-based analysis. The keyword analysis was conducted with 20 TETET tests as the target corpus and available TOEIC tests as the reference corpus. The results show the unique content themes (e.g. computer/technology-related and factory-related words) and language style of TETET (e.g. close personal pronouns) against the general workplace test of TOEIC. This study sheds light on implications for material and test development for specific professions.

17:25
Using small corpora of critiques to set pedagogical goals in Business English

ABSTRACT. The final writing assignment for the cohort in the mandated first-year English writing course in the Bachelor of International Business Administration Program at Taiwan’s Feng Chia University is a critique including evaluation. The student cohort, typically two-thirds Chinese and one-third international, writes a 1,000-word paper containing at least three references on an article they propose will interest the next year’s cohort. In addition to increasing facility with handling sources, the assignment prepares them for a range of future business writings by increasing both their ability to summarize with accuracy and their comfort with evaluation. We follow Lee and Deakin (2016) in compiling three small, specialized corpora: 20 critiques from 2017; 20 critiques from 2018; and 20 critiques from a range of fields in the online Michigan MICUSP collection. Chan persuasively argues (2019) for student gains in awareness of relational uses of language and multiple genres as well as lexical and syntactic complexity. Accordingly, we look in this discussion at how Chinese and international (largely Western) students use interactional metadiscourse as well as grammatical features of written discourse such as prepositions (Staples et al. 2016) in their critiques.

17:00-18:15 Session 9B
Chair:
17:00
Error proofing the data flow

ABSTRACT. Research based on good data may lead to good results; research based on poor data surely leads to poor results. Data stored electronically is easy to track, but is equally easy to alter, copy and delete. It is therefore essential to eliminate or at least minimize data errors. To the best of my knowledge, there is no published research focussing on the data flow process in corpus studies. In this study, the data flow in an investigation of an annotated corpus of scientific texts is tracked through a number of quality management tools, such as flowcharts, spaghetti diagrams and poka yoke. Errors were discovered in each project phase, from data collection and cleaning to data extraction and statistical analysis. The results reveal ways to streamline processes and to opt for tools and techniques that are less error-prone. Practical advice on how to minimize data flow errors is provided.

17:25
Winograd Schema Challenge in Korean: knowledge-based approach

ABSTRACT. Ever since the dichotomy of syntax and semantics was suggested (e.g. “Colorless green ideas sleep furiously”), how to use encyclopedic knowledge and reasoning for the interpretation of felicitous sentential meaning has been an important issue. In this study, we investigate the issues surrounding the Winograd Schema Challenge (WSC) in Korean, as follows:

(1) Q: siuyhwoy uywontul-un PRO/kutul-i phoklyek-ul cwuchanghanta-nun
       City councilmen-Top PRO/they-Nom violence-Acc advocate-Rel
       iyu-lo siwitay-uy heka-lul kepuhayssta.
       reason-with demonstrator-Gen permission-Acc refused
       nwuka phoklyek-ul cwuchanghayss-ni?
       who violence-Acc advocated-Q
       ‘The city councilmen_i refused the demonstrators_j a permit because PRO_j/they_j advocated violence. Who advocated violence?’
    Answer 0: siuyhwoy uywontul ‘the city councilmen’
    Answer 1: siwitay ‘demonstrators’

Thus far, although a variety of statistical approaches based on lexical features have been suggested for WSC in NLP (parsing (Sharma 2014; Sharma et al. 2015, a.o.), assigning word sense (Peng et al. 2014, a.o.), integrating context (Liu et al. 2016, a.o.), pragmatic/semantic world-knowledge (Scheller 2014; Richard-Bollans 2018, a.o.)), no satisfactory account has been provided. Given that resolving the semantics underlying the WSC requires basic knowledge and common-sense reasoning, we proceed as follows. First, we construct 200 original Winograd sentences (100 pairs) extracted from a three-hundred-thousand dataset in Korean (the project of corpus integrated verification: Coreference resolution, 2019-2020). Second, using the knowledge hunting framework (Emami et al. 2018) with its three levels ((i) generating queries, (ii) acquiring relevant knowledge using Information Retrieval, (iii) reasoning on the gathered knowledge), we argue that the knowledge-based system provides a better alternative for Korean WSC.

17:50
Specific Syntactic Complexity Measures and Their Prediction of Chinese University Students’ Argumentative Writing Evaluation

ABSTRACT. This study examines how specific syntactic complexity measures (Lu, 2010, 2011) can predict Chinese university students’ argumentative writing evaluation. To achieve this purpose, a corpus of 400 argumentative essays on the same topic was constructed and each sample was automatically graded by an automated essay evaluation system (AES). The samples were then analyzed with the Tool for the Automatic Analysis of Syntactic Sophistication and Complexity (TAASSC, Kyle, 2016) to derive numerical values for these specific syntactic measures. To simulate different proficiency levels, the samples were divided into four groups on the basis of their assigned scores. Subsequent ANOVAs showed that 13 out of the 14 measures significantly distinguished writing proficiency, confirming results from previous studies that certain syntactic complexity measures can significantly distinguish writing by learners of different proficiency levels. To further check which measures best predict writing evaluation, a stepwise regression analysis was conducted, and the results indicated that three measures significantly predicted writing proficiency. This finding lends support to Plonsky and Oswald’s (2017) claim that multiple regression is a better statistical method for dealing with continuous data and can provide more information than ANOVA.
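The group comparison step can be illustrated with a one-way ANOVA F statistic computed from scratch. This is a minimal sketch only: the function name and the toy values are hypothetical, the study's actual software is not stated, and a p-value would additionally require the F distribution (available, for example, in SciPy).

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA over a list of groups of numeric values."""
    all_values = [value for group in groups for value in group]
    grand_mean = sum(all_values) / len(all_values)

    # Between-group sum of squares: group means around the grand mean.
    ss_between = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2
        for group in groups
    )
    # Within-group sum of squares: values around their own group mean.
    ss_within = sum(
        (value - sum(group) / len(group)) ** 2
        for group in groups
        for value in group
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Invented values of one syntactic complexity measure in three score bands.
bands = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
print(one_way_anova_f(bands))
```

A large F value indicates that the measure varies more between the score bands than within them, which is what "significantly distinguished writing proficiency" refers to.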

17:00-18:15 Session 9C
Chair:
Location: IBK Hall
17:00
Spontaneous Motions in L1 and L2 English Speaking: A Corpus-based Study

ABSTRACT. Spontaneous motion is one of the most basic events in every human language but is expressed through different patterns across languages. For example, English usually encodes path in prepositional phrases or adverbial particles and thus is typologically classified as a satellite-framed language, while Korean is a verb-framed language that maps path onto verbs (Talmy, 1989). In a comparison of English spontaneous motion expressions in a native speaker corpus (600 recordings) and a corpus of Korean L2 learners (400 recordings), the Korean learners were found to significantly underuse the satellite-framed patterns, but not the verb-framed patterns, compared with the native speakers, which suggests the role of L1 in L2 production. The satellite-framed patterns, however, accounted for the greatest portion of spontaneous motion expressions in the L2 production, suggesting the effect of dominant L2 input on L2 production.

17:25
Lexical coverage and bundles in TED talks and other speeches: Corpus findings for teaching

ABSTRACT. Nonnative speakers of English need quality presentations as models to learn about communicative skills. Recent TED talks (137 transcripts under 5 topics such as Global issues, released in 2017-2018, 240,098 tokens) as scripted speeches were the target of this corpus study, which examines their single- and multi-word vocabulary profiles. While initial findings on the lexical coverage (known single words) of TED talks are available, very few analyses have investigated their multi-word units (lexical bundles, or recurrent sequences of words), although mastering bundles has the advantage of reducing processing demand for efficient communication. For comparison, this exploratory study examined both the TED corpus and 44 important speeches (https://americanrhetoric.com/21stcenturyspeeches.htm, 82,442 tokens, speech corpus), publicly available with audio MP3 files, with respect to their lexical coverage and the frequency, structures, and discourse functions of their bundles (Biber et al., 2004) via automatic processing with AntConc/AntWordProfiler and manual examination. The results showed that knowledge of the most common 3,000 words is needed to reach 95% coverage, and of 5,000 words to reach the 98% level, for both corpora, which makes them easier than Nurmukhamedov’s TED corpus (2017). Structural and functional analyses of the bundles show that TED contains more verb phrasal bundles, while the speech corpus contains more referential bundles. Compared with the top 200 common spoken phrases in the Academic Formulas List (Simpson-Vlach & Ellis, 2010) as a reference of conversational English, TED talks contain over 60% of them, much higher than the speech corpus. Altogether, TED talks demonstrate better instructional value because they are closer to the registers of classroom teaching and conversation than the speech corpus.
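The notion of lexical coverage used here can be sketched in a few lines. The example is a simplification under stated assumptions: it counts plain word forms, whereas profilers such as AntWordProfiler typically work with word-family lists, and the function names and toy data are hypothetical.

```python
from collections import Counter


def coverage(tokens, known_words):
    """Proportion of running words (tokens) that appear in known_words."""
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in known_words)
    return known / len(tokens)


def words_needed(tokens, ranked_words, target=0.95):
    """Smallest k such that the top-k entries of ranked_words reach the target coverage.

    ranked_words is assumed to be a frequency-ranked list without duplicates.
    Returns None if the target is never reached.
    """
    counts = Counter(tokens)
    total = len(tokens)
    covered = 0
    for k, word in enumerate(ranked_words, start=1):
        covered += counts.get(word, 0)
        if total and covered / total >= target:
            return k
    return None


# Toy text and word list (invented for illustration).
tokens = "the cat sat on the mat".split()
print(coverage(tokens, {"the", "on"}))
print(words_needed(tokens, ["the", "on", "cat", "sat", "mat"], target=0.95))
```

Applied to a transcript corpus and a frequency-ranked vocabulary list, `words_needed` gives the vocabulary size at which a threshold such as 95% or 98% is reached.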

17:50
How corpus-assisted error correction contributes to fewer errors in L2 writing

ABSTRACT. Recently, an increasing number of studies have shown the effectiveness of Data-Driven Learning (DDL). However, the ways in which DDL can promote accurate error correction for second language (L2) students have not been sufficiently investigated. This study addresses this issue by demonstrating how data-driven error correction activities, conducted before essay writing, contribute to fewer errors in the writing of L2 students. The study focused on errors related to article and preposition usage. Thirty Japanese university students of English as a Foreign Language wrote essays before and after the data-driven error correction activity. The author compared the essays to examine whether fewer L2 errors were found in essays written after the activity. The participants attended thirteen corpus-assisted error correction sessions based on teacher feedback. The results showed that the new essays had fewer errors and that the participants’ overuse and underuse of certain prepositions were corrected. From the viewpoint of second language acquisition, the findings suggest that data-driven error correction favors the acquisition of correct and native-like usage of articles and prepositions. DDL is a useful resource for teaching L2 writing.

17:00-18:15 Session 9D
17:00
Collocation as confounder of language-based instruments: The case of the McGill Pain Questionnaire

ABSTRACT. The McGill Pain Questionnaire (MPQ) is a well-established tool used by medics in the diagnosis of pain. It is designed to capture both the quality and intensity of the pain that a patient experiences. Critically, it is a language-based instrument: the patient is presented with a list of single-word descriptors (mostly adjectives, including many participles) organised into 20 groups, and is asked to select one descriptor per group. It is assumed that, within each group, the descriptors label the same quality of pain, but with varying intensity, e.g. one group consists of ‘hot’, ‘burning’, ‘scalding’, and ‘searing’, in assumed order of increasing intensity.

From a linguistic point of view, however, numerous possible confounding factors (i.e. features promoting selection of a descriptor other than its pain intensity ‘score’) suggest themselves. One is familiarity, operationalisable as frequency. Another is paradigmatic/syntagmatic association with pain (as term or as concept). Our analysis probes the degree to which patient MPQ responses are explicable in terms of syntagmatic association with ‘pain’, assessed via the Oxford English Corpus and a bespoke asymmetric collocation measure. For many MPQ groups of descriptors, collocation with ‘pain’ is shown to explain patients’ choices (nearly) completely, frequency also playing a role. Little remains to be attributed to the intended variable, intensity, for these ‘types’ of pain. These findings indicate that, at the very least, instruments based on word selection should not be deployed (even outside linguistics) without due attention to collocation, discourse frequency, and allied phenomena.
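The authors' bespoke measure is not described in the abstract. To illustrate what "asymmetric" means for a collocation measure, the sketch below uses Delta P, a well-known directional association measure, as a stand-in; it is not the measure used in the study, and the counts are invented.

```python
def delta_p(a, b, c, d):
    """Delta P of an outcome given a cue, from a 2x2 co-occurrence table.

    a: cue present, outcome present
    b: cue present, outcome absent
    c: cue absent,  outcome present
    d: cue absent,  outcome absent
    """
    return a / (a + b) - c / (c + d)


# Invented counts for a descriptor such as 'burning' and the node word 'pain'.
both = 30          # 'burning' co-occurring with 'pain'
word_only = 70     # 'burning' without 'pain'
pain_only = 20     # 'pain' without 'burning'
neither = 880      # contexts containing neither

# How strongly the descriptor predicts 'pain' ...
word_to_pain = delta_p(both, word_only, pain_only, neither)
# ... versus how strongly 'pain' predicts the descriptor (rows and columns swapped).
pain_to_word = delta_p(both, pain_only, word_only, neither)

print(round(word_to_pain, 4), round(pain_to_word, 4))
```

The two directions generally differ, which is the property a symmetric measure such as mutual information cannot capture.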

17:25
Medical Uncertainty and the Art of Communication: Exploring Modality Applied in Medical Research Abstracts through a Cognitive Linguistic Perspective

ABSTRACT. Medicine is a science dealing with uncertainty and the art of probability. Since a body can change all the time, and the same type of treatment may work differently with different people, chance, evidence, and probability are the main components of medical messages. For effective communication, doctors, researchers, and health sciences writers need to master the use of modality, by means of which unreal situations can be discussed. Although many writers are familiar with modal verbs, the most commonly used means of expressing epistemic modality in written discourse, the way modals are used in general English can differ from their use in scientific contexts. Moreover, a modal verb may not be used in the same way in different scientific fields. Researchers in the fields of medicine and health sciences therefore need to be aware of how to choose the most effective modals to portray certain degrees of possibility, especially in abstracts, the first and probably only part of research assessed by readers. This paper aims to analyze the use of modal verbs in medical research abstracts from the perspective of cognitive linguistics (Maldonado, 2007; Portner, 2009; Radden & Dirven, 2007). As modality also relates to other aspects of the verb string, the tense, voice, reality status, and situation type of the same sentence will also be analyzed. The findings can benefit scientific writers and provide EAP and ESP teachers with material for communicative practice.

17:50
A study on the relationship between emotional expressions and the social attributes of characters in Korean drama corpus

ABSTRACT. The aim of this study is to investigate the relationship between emotional expressions and the social variables of characters in a Korean drama corpus. From a sociolinguistic point of view, language is considered to be non-independent, or social, and language use is closely related to the speakers’ social class, gender, and language pattern. Therefore, this study assumes that language is closely associated with the speakers’ use-related social attributes such as gender, age, nationality, social status, and social context. One of the main topics in recent AI-related research is artificial emotion. A lot of research has been conducted on developing artificial intelligence services that can understand and empathize with human emotions. This study examines human emotion sociolinguistically. For this purpose, a drama corpus, which includes actual utterances, background and context, and social information and relationships of the characters, is proposed. Additionally, the nested source and target of each utterance in the drama corpus were annotated and their emotions were divided into positive and negative categories. Based on the emotions revealed in the utterances, a statistical analysis was conducted to examine how they were related to speakers’ social characteristics such as gender, occupation, and age. Lastly, this study investigates whether the social relationships and characteristics of speakers are critical enough to affect the emotional patterns of their utterances.

17:00-18:15 Session 9E
Location: Helinox Hall
17:00
An English-Chinese Corpus-Based Study of Semantic Prosody's Role in Translation

ABSTRACT. Semantic prosody refers to the language phenomenon that when a word frequently co-occurs with positive or negative collocates, the collocational patterning conveys an attitudinal meaning, favorable or unfavorable. Though the elaboration and exploration of semantic prosody have had a profound influence on phraseological studies in Corpus Linguistics, its role in translation has hardly been touched upon. The present study attempts to unveil the role semantic prosody plays in translators’ reproduction of attitudinal meanings in the target text. By exploring semantic prosodies of the frequent Chinese perception verb daliang (打量) and its English translation equivalent LOOK up and down in the Multi-field Chinese Corpus of Beijing Language and Culture University Corpus Center, the British National Corpus and the Chinese-English Parallel Corpus of Hong Lou Meng (The Dream of Red Mansions), our study shows that semantic prosody has an important yet complex role to play in translators’ strategic consideration of attitudinal meanings. Translators either dramatically manipulate the original semantic prosody and create a new semantic prosody in the target text, or eschew the original semantic prosody, leading to an absence of attitudinal meaning in the target text. The manipulation and eschewal of semantic prosodies further suggest the creativity and unfaithfulness of translators.

17:25
The design of the parallel corpus between Korean and Chinese and its application in translation studies: exemplified by the information-asymmetrical Korean-Chinese idiom translation

ABSTRACT. The parallel corpus has become more and more popular: it is used not only for linguistic research but also widely in language engineering, promoting the combination of linguistic theory research and practical language engineering research. This research is based on the parallel corpus approach and sets up a normative Korean-Chinese parallel corpus by random sampling. Authoritative translators’ works are used as samples, and the Chinese idioms are labelled manually to form a corpus based on information-asymmetrical idiom translation. In this way, the translation types of information-asymmetrical idioms are categorized to further analyze translation principles and methods. To ensure its authority and representativeness, the corpus mainly selects news reports, novels, textbooks and texts on other non-political themes as data. Among these, there are 1.053 million characters’ worth of Chinese linguistic material and 379,500 characters’ worth of Korean. All of these have been labelled to serve as the parallel coded corpus. On this basis, the data are classified into different types such as bilingual correspondence, asymmetry and semi-symmetry, leading to a quantitative analysis and a discussion of the translation types of the information-asymmetrical idiom.

17:50
Polyfunctional recurrent phrases in translation: a corpus-based exploratory study

ABSTRACT. In this study, we explore how translators dealt with recurrent phrases functioning as textual discourse-organizing devices (e.g. at the end of the day, the question of whether), which were extracted from a sample of the Europarl corpus (Koehn 2005) included in Paralela, a parallel English-Polish corpus (Pęzik 2016). Apart from identifying typical and peripheral Polish equivalents, we also verified to what extent the discoursal functions realised by the phrases under scrutiny are "preserved" in translation. Focusing on the polyfunctional English phrase at the end of the day, we evaluated fragments of English-original texts and Polish translations. The inter-rater agreement metrics (raw agreement, Cohen’s Kappa and Krippendorff’s alpha) revealed that in 23.2% of cases (22 out of 95 occurrences), the discoursal functions were differently evaluated by 2 English and 2 Polish annotators who worked independently on English and Polish language data. The findings show that the senses of potentially polyfunctional recurrent phrases are not always transferred from the source text into the target text in a fixed and stable way. It may well be that the specific senses/functions of such recurrent phrases emerge in a particular situation of language use, and as such they may be either preserved or modified in the translation process (e.g. translators may overlook a figurative interpretation of a phrase and opt for a literal one). Hence, more research is required to explore the rationale behind the modification of the discourse functions in translation or behind the alternative renditions that could have been selected by translators.
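Two of the agreement metrics named above, raw agreement and Cohen's Kappa, can be sketched for a pair of annotators as follows. The labels and function names are hypothetical and the data are invented; the study's own annotation scheme and its handling of four annotators are not described in the abstract.

```python
from collections import Counter


def raw_agreement(labels_a, labels_b):
    """Proportion of items on which two annotators chose the same label."""
    matches = sum(1 for x, y in zip(labels_a, labels_b) if x == y)
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    observed = raw_agreement(labels_a, labels_b)

    # Chance agreement from each annotator's own label distribution.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(counts_a) | set(counts_b)
    )
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented function labels for four occurrences of a phrase.
annotator_1 = ["textual", "textual", "literal", "literal"]
annotator_2 = ["textual", "literal", "literal", "literal"]
print(raw_agreement(annotator_1, annotator_2))
print(cohens_kappa(annotator_1, annotator_2))
```

Kappa is lower than raw agreement here because part of the observed agreement would be expected by chance given how often each annotator uses each label.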