Automatically Enhancing Tagging Accuracy and Readability for Common Freeware Taggers
ABSTRACT. Part-of-Speech (PoS) tagging is still one of the most common and basic operations carried out in order to enrich corpora. Yet, especially for smaller, less well-funded projects, acquiring a licence for a commercial tagger, such as CLAWS (Garside & Smith 1997), is often not a reasonable option. Hence, such projects generally need to rely on freeware taggers, such as the Stanford PoS Tagger (Toutanova et al. 2003) or TreeTagger (Schmid 1994).
In using such freeware taggers, however, a number of issues tend to arise. First of all, they generally only come in the form of command-line tools many linguists are not familiar with. Thus, the ‘average linguist’ needs to resort to graphical user interfaces designed for them, such as TagAnt (Anthony 2015) or the one developed by Ó Duibhín for the TreeTagger, which may, however, offer relatively little control over what the final output looks like. Furthermore, even though these taggers by default tend to use a common tagset (Penn), this tagset is highly simplified and does not distinguish between many categories. Yet, it still requires ‘decoding’ relatively ‘cryptic’ tag names whose readability has never been improved since the early days of tagging, when, presumably, limited storage space constrained the expressiveness of tags. In addition, these taggers are generally based on some form of probabilistic tagging model that induces predictable errors (Manning 2011; Weisser 2016) due to having been trained on limited domains and without using linguistic rules, and which may require “patching” (Smith 1997: 145).
The current paper introduces a new tool, the Tagging Optimiser, which aims to overcome at least some of these issues by a) correcting the output automatically as far as possible, b) diversifying the tag categories, and c) making the resulting tags more readable.
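For illustration, the kind of post-processing described above can be sketched as follows. This is a minimal sketch under our own assumptions (a list of token/Penn-tag pairs as input, an invented readable-label mapping, and a single invented correction rule); it does not reproduce the Tagging Optimiser's actual rules or tagset.

```python
# Illustrative mappings only: a few Penn tags expanded into more readable,
# more fine-grained labels (not the Tagging Optimiser's actual tagset).
READABLE_TAGS = {
    "NN": "noun_singular",
    "NNS": "noun_plural",
    "VBZ": "verb_present_3rd",
    "JJ": "adjective",
    "IN": "preposition_or_subordinator",
}

def patch_and_relabel(tagged_tokens):
    """Apply a sample rule-based patch, then relabel tags for readability."""
    patched = []
    for i, (word, tag) in enumerate(tagged_tokens):
        nxt = tagged_tokens[i + 1][1] if i + 1 < len(tagged_tokens) else None
        # Example patch: "to" before a base-form verb is an infinitive marker,
        # not a preposition (a frequent probabilistic-tagger confusion).
        if word.lower() == "to" and tag == "IN" and nxt == "VB":
            tag = "TO"
        patched.append((word, READABLE_TAGS.get(tag, tag)))
    return patched

if __name__ == "__main__":
    sample = [("She", "PRP"), ("wants", "VBZ"), ("to", "IN"), ("go", "VB")]
    print(patch_and_relabel(sample))
```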
References:
Anthony, Laurence. 2015. TagAnt (Version 1.2.0) [Computer Software]. Tokyo, Japan: Waseda University. Available from http://www.antlab.sci.waseda.ac.jp/
Fligelstone, Steve, Rayson, Paul & Smith, Nicholas. 1996. Template analysis: bridging the gap between grammar and the lexicon. In J. Thomas & M. Short (Eds). Using corpora for language research. London: Longman. 181–207.
Garside, Roger & Smith, Nicholas. 1997. A hybrid grammatical tagger: CLAWS4. In R. Garside, G. Leech & A. McEnery (Eds.) Corpus Annotation: Linguistic Information from Computer Text Corpora. London: Longman. 102–121.
Manning, Christopher. 2011. Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics? In A. Gelbukh (Ed.). Proceedings of the 12th international conference on Computational linguistics and intelligent text processing (CICLing’11). 171–189.
Schmid, Helmut. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. Proceedings of International Conference on New Methods in Language Processing, Manchester, UK.
Smith, Nicholas. 1997. Improving a Tagger. In R. Garside, G. Leech & A. McEnery (Eds.) Corpus Annotation: Linguistic Information from Computer Text Corpora. London: Longman. 137–150.
Toutanova, Kristina, Klein, Dan, Manning, Christopher & Singer, Yoram. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of HLT-NAACL 2003. 252–259.
Weisser, Martin. 2016. Practical Corpus Linguistics: an Introduction to Corpus-Based Language Analysis. Malden, MA & Oxford: Wiley-Blackwell.
Using gestures in speaking a second language: A study based on a multi-modal corpus
ABSTRACT. This study explores the relationship between the frequency of co-speech gesture use and the fluency of second language speech, as well as the role of task type in this relationship. The development of electronic devices and corpus linguistics makes it possible to examine in detail gesture use and its relationship with speech. Research shows that the application of co-speech gestures facilitates speech by helping the speaker either to organize ideas in the conceptualization stage (Alibali, Kita, & Young, 2000) or to find the desired word when formulating linguistic forms (Rauscher, Krauss, & Chen, 1996). However, whether co-speech gestures have a similar influence on second language speech requires more examination. In addition, gesture use is very likely to be influenced by speech content. In this study, second language speech was elicited using three speaking tasks requiring different degrees of spatial thinking. Participants were sixty-one undergraduate students from a Chinese university for whom English is their second language. The whole process was videotaped and built into a multimedia corpus for analysis. The fluency of each participant’s answers in the interview and the gestures accompanying the answers were coded. Results of Spearman’s rho tests show that, across the three tasks involving different degrees of spatial thinking, gesture frequency correlates significantly and positively with the fluency of second language speech, although the correlation becomes weaker as the degree of spatial thinking involved in the task becomes lower. This finding suggests that gestures may be used as a pedagogical tool in L2 speaking, and that encouraging and instructing second language learners to use gestures may help them to speak a second language more fluently.
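For readers wishing to reproduce this kind of analysis, a minimal sketch of the correlation test follows, using scipy and invented per-participant values rather than the study's data.

```python
from scipy.stats import spearmanr

# Hypothetical per-participant values (not the study's data):
# gestures per minute and a fluency score for one task.
gesture_frequency = [4.2, 1.5, 3.8, 0.9, 2.7, 5.1, 2.0, 3.3]
fluency_score     = [6.5, 3.0, 5.8, 2.5, 4.9, 7.2, 3.8, 5.1]

# Spearman's rank correlation between gesture frequency and fluency.
rho, p_value = spearmanr(gesture_frequency, fluency_score)
print(f"Spearman's rho = {rho:.2f}, p = {p_value:.3f}")
```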
ABSTRACT. As corpora have grown in size, the task of checking that their component texts are suitable for research needs has become increasingly unwieldy. Increasingly, corpora are studied not simply for their grammar but also for meaning-centred research involving keyness, including diachronic keyness with a time-oriented text corpus. Where research centres on specific themes and involves downloading from standard text databases, ensuring that the downloaded texts genuinely relate to the search requirements is difficult, and the resulting corpora are often too large for human checking.
This presentation starts by considering general aspects of corpus clean-up, such as standardization of format and elimination of extraneous or corrupted text segments, and then homes in on techniques for identifying repeated meaning sections. If redundancy, repetition, or boilerplate text can be identified, it can be labelled, and this can help the researcher to choose whether to filter the corpus, extracting only the types of text required, or to estimate the amount of theme-specific text in relation to marginally related sections. All these issues are illustrated by reference to ongoing development within the WordSmith Tools software suite, which has developed over the last year or so to part-automate many of the processes needed, allowing users to see and choose how text sections of a corpus can be labelled for subsequent inclusion, exclusion or filtering. We recognise at the same time that identifying similarity or difference in text form and meaning is not at all a straightforward matter. Any results obtained should be regarded as provisional.
Scott, M. 2018. WordSmith Tools. Stroud: Lexical Analysis Software.
Acquisition of the Chinese Particle le by L2 Learners: A Corpus-based Approach
ABSTRACT. The Chinese particle le (了) has proven to be challenging for L2 learners to acquire, both because it may function as either a perfective aspect marker or a sentence-final modal particle and because its usage is subject to various semantic, syntactic, prosodic, and discourse constraints. Previous research into the development of knowledge of the uses of le and the order of acquisition of its functions and meanings has yielded inconsistent results. Furthermore, due to the limited data available, previous studies have usually employed written corpus data produced by a small number of learners from a specific L1 background. Utilizing the spoken subcorpus of the large-scale Guangwai-Lancaster Chinese Learner Corpus, this study closely examines the uses of le by learners of Chinese from diverse L1 backgrounds as well as the developmental pattern of their acquisition of this particle. Results demonstrate that learners generally use le in speech with low frequency and a high degree of accuracy. A significant increase in frequency of use is observed between beginner and intermediate learners, while a significant increase in accuracy is observed between intermediate and advanced learners. Evidence from the current study does not support a specific acquisition order for the basic functions of le. Learner errors primarily involve overuse of the particle in conjunction with statives, and may be largely attributed to learners’ deficient knowledge of the constraints on its usage. The findings of our investigation have useful implications for the instruction of the particle le.
A Study on the Use of Color Wheel for the Representation of Emotions
ABSTRACT. Recognition and representation of emotions expressed in documents has been a fascinating task in document understanding. The use of emoticons, smileys and tags is popular in emotion representation; however, representing emotion is not as simple as choosing a smiley, because emotion changes throughout a document and involves other emotions.
Extending research on emotion-color correspondence can offer a clue to solving this problem. Most existing emotion wheels are made up of discrete values which stand for so-called basic emotions. Joy and trust, for example, are placed next to each other in Plutchik’s emotion wheel, with love between them as the mixture of the two. Joy and sadness are opposites, facing each other across the center of the wheel, at the strongest level of the emotions: ecstasy and grief. Such discrete representations may not be easy for AI systems to handle.
Thus, in this paper we propose the use of a color wheel as a representation of emotions for AI-based recognition/representation systems. Our proposal is to recognize emotions as continuous values rather than clustering them into a countable number of basic emotion tags. We consider two axes and construct an emotion-color wheel on these two dimensions. Our emotion wheel contains six basic colors (red, orange, yellow, green, blue, violet) arranged as a gradient. The positive/negative axis, placed on the orange/blue line of the hue wheel, shows the type of emotion: happy emotions lie on the positive side, and depressing emotions on the negative side. The inward/outward axis, placed on the red-purple/yellow-green line, shows the impression of emotions: inward, introverted emotions such as trust and fear are on the yellow-green side, while outward, aggressive emotions such as anger and anticipation are on the red-purple side.
We share our empirical results on both emotion representation and user interest representation for tweets.
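One possible way to operationalise the proposed wheel is sketched below. The axis-to-hue orientation and the use of saturation for emotion strength are our own assumptions for illustration, not the authors' exact specification.

```python
import colorsys
import math

def emotion_to_rgb(positive_negative, inward_outward):
    """Map two continuous emotion axes (each roughly in [-1, 1]) to an RGB colour.

    Purely illustrative: the angle of the point is converted to a hue and its
    distance from the centre to saturation (emotion strength). The orientation
    of the axes relative to the colours is an assumption, not the paper's wheel.
    """
    angle = math.atan2(inward_outward, positive_negative)  # radians, -pi..pi
    hue = (angle / (2 * math.pi)) % 1.0                    # normalise to 0..1
    strength = min(1.0, math.hypot(positive_negative, inward_outward))
    return colorsys.hls_to_rgb(hue, 0.5, strength)         # lightness fixed at 0.5

if __name__ == "__main__":
    # A fairly positive, slightly outward emotion (hypothetical coordinates).
    print(emotion_to_rgb(0.8, -0.3))
```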
Acquisition of Tense/Aspect markers in Learner Corpora of English/Chinese/Japanese
ABSTRACT. We will argue that the learning of tense/aspect markers is affected by the cognitive typology of tense/aspect in the L1 and L2, based on a comparison of the following three groups of learner corpora.
(1) English Learners Corpora by Chinese L1 Learners and Japanese L1 Learners
(2) Chinese Learners Corpora by English L1 Learners and Japanese L1 Learners
(3) Japanese Learner’s Corpus by Chinese L1 Learners
First, the English Learner’s Corpus by Chinese L1 learners displays overuse of the future tense marker “will”, while Japanese L1 learners do not display such overuse. It is suggested that this overuse by Chinese L1 learners is due to the wrong analogy that the English future tense marker “will” corresponds to the Chinese imperfective auxiliary “HUI”.
Second, the Chinese Learner’s Corpus by Japanese L1 learners displays underuse of the imperfective auxiliary “HUI”, the perfective marker “-LE” and the perfective resultative complements “-DAO/-HAO/-SHANG/-CHU/-WAN/-CHENG”, while English L1 learners do not display such underuse.
We assume these phenomena are caused by the Japanese tense/aspect system: Japanese only has a “Past vs. Non-Past” distinction, with no future tense marker and no distinctive cognitive contrast between “Perfective vs. Imperfective”.
In contrast, the Japanese Learner’s Corpus by Chinese L1 learners displays overuse of the Japanese past tense marker “-TA” for perfective events. It is suggested that this typical overuse of past “-TA” is due to the wrong analogy that the Japanese past tense marker “-TA” corresponds to the Chinese perfective marker “-LE”.
In summary, Chinese L1 learners tend to overuse the future tense marker “will” and the Japanese past tense marker “-TA” to express the imperfective/perfective distinction, while Japanese L1 learners tend to underuse imperfective/perfective markers in Chinese, since Japanese has no distinctive cognitive contrast between “Perfective vs. Imperfective”. This contrast is due to the typological differences in the tense/aspect systems of English, Chinese and Japanese.
MWE in a small learner corpus: Lexical bundles, formulaic phrases, and light verbs across four levels of learners
ABSTRACT. Feng Chia University, in Central Taiwan, has over 20,000 students, most of whom elect to study business, various sciences and engineering. The Feng Chia Learner Corpus was created in Fall 2017 to identify the kinds of language skills learners bring to their first year on campus, in order for faculty at the Language Centre to develop learning-driven materials. It comprises 897 brief extemporaneous essays which fall into four levels, with level 4 the highest, keyed to student entrance exams. 98% of the first-year learners had recently graduated from high school and had taken English in middle and high school for roughly six years. Their lexical profiles show that all four levels depend on the 1,000 most used words in English, although a review of adjectives in Level 4 suggests that this group varies between CEFR B1 and B2. The prompt for the essays asked students to choose between recommending a gap year or immediate college entrance after high school graduation. Students’ reliance on the conversational light verb ‘make’ led us to use AntConc and WMatrix® to compare the range of causative ‘make’ constructions (Altenberg & Granger 2001; Butt 2013), multiword expressions, the top 12 keywords, and the top 12 four-word n-grams across Levels 1 and 4. Our purpose is to gain a better sense of learner retention of lexical bundles and formulaic phrases (Biber 2009; Chen & Baker 2010; Simpson-Vlach & Ellis 2010; Paquot & Granger 2012; Granger 2014) from their pre-college instruction in English, and to see if they are more habituated to using conversational than academic written language.
Altenberg B, Granger S. (2001). The grammatical and lexical patterning of MAKE in native and non-native student writing. Applied Linguistics, 22: 173-195.
Biber D. (2009). A corpus-driven approach to formulaic language in English: Multi-word patterns in speech and writing. International Journal of Corpus Linguistics, 14: 275-311.
Butt M. (2013). The light verb jungle: Still hacking away. In Complex predicates: cross-linguistic perspectives on event structure. Eds. Amberber M, Baker, B, Harvey, M. Cambridge: Cambridge University Press, pp 48-78.
Chen Y, Baker P. (2010). Lexical bundles in L1 and L2 academic writing. Language Learning and Technology 14: 30-49; http://llt.msu.edu/vol14num2/chenbaker.pdf
Granger S. (2014). A lexical bundle approach to comparing languages. Languages in Contrast, 14: 58-72.
Paquot M, Granger S. (2012). Formulaic language in learner corpora. Annual Review of Applied Linguistics, 32: 130-149.
Simpson-Vlach R, Ellis N. (2010). An academic formulas list: New methods in phraseology research. Applied Linguistics, 31: 487-512.
A Corpus-based Investigation of Chinese EFL Learners’ Use of Modal Verbs in Writing for Practical Purposes
ABSTRACT. In recent years, most of the English writing tasks in the College Entrance Examinations (CEE) held in different provinces in China have assessed test takers’ performance in writing for practical purposes. In such tasks, the use of modal verbs is an important indicator of test takers’ pragmatic competence as well as their grammatical competence. Most existing studies on the use of modal verbs investigate language learners’ writing for academic purposes or concentrate on language learners at college level. To fill this gap, this study compiles a learner corpus of essays from Chinese EFL (English as a Foreign Language) learners’ classroom tests at a high school to reveal features of their modal verb use. The study finds that Chinese students overuse the modals can, will, should and must, and that they use the pronoun you as the subject of these four modals more frequently than native speakers do. The findings provide insights into the use of modal verbs by Chinese EFL learners at intermediate level in writing for practical purposes and thus inform the teaching of modal verbs in English classrooms and the development of EFL learners’ pragmatic competence.
A qualitative analysis of the effect of reference corpora and choice of statistic on keyword analysis
ABSTRACT. Keyword analysis is a method of identifying words or phrases that occur statistically more frequently in a target corpus than in a reference corpus. While numerous studies discuss the methods for conducting keyword analysis, few qualitatively compare the results of changing different aspects of the analysis, such as the corpora or the statistics used. This study investigates the effect that using different reference corpora and different statistics for calculating keywords has on the content of keyword lists. Two case studies are reported, relating keyword analyses of two target corpora against three distinct reference corpora. The target corpora consist of published research from faculty at two PhD-granting programs in applied linguistics in North America. The reference corpora include a custom reference corpus of published research in applied linguistics as well as more general reference corpora: one of newspaper and magazine articles and one of fiction texts. Furthermore, we report on two commonly used statistics to generate keywords: log likelihood and odds ratio. The findings suggest that while there are common keywords in lists generated against all reference corpora and by both statistics, the different reference corpora and statistics result in qualitatively distinct keyword lists. Primarily, using a reference corpus of the same nature as the target corpus (i.e., research articles in applied linguistics) better highlights content specific to the target corpus, while using a more general reference corpus also uncovers words that represent the register of academic writing in addition to words representative of the discipline of applied linguistics. Additionally, log likelihood generates keywords that tend to be more evenly distributed throughout a corpus, while odds ratio seems to highlight words that are very specific to the target corpus. Implications for using keyword analysis in contexts such as English for Specific Purposes and register studies will be discussed.
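For reference, both statistics can be computed from a word's frequency in the target and reference corpora together with the corpus sizes. The sketch below uses invented counts and a simple smoothing constant for the odds ratio; it is not the study's implementation.

```python
import math

def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Dunning-style log-likelihood (G2) keyness for one candidate keyword."""
    total = size_target + size_ref
    expected_t = size_target * (freq_target + freq_ref) / total
    expected_r = size_ref * (freq_target + freq_ref) / total
    ll = 0.0
    if freq_target > 0:
        ll += freq_target * math.log(freq_target / expected_t)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_r)
    return 2 * ll

def odds_ratio(freq_target, size_target, freq_ref, size_ref, smoothing=0.5):
    """Odds ratio with a small smoothing constant to avoid division by zero."""
    a = freq_target + smoothing
    b = (size_target - freq_target) + smoothing
    c = freq_ref + smoothing
    d = (size_ref - freq_ref) + smoothing
    return (a / b) / (c / d)

# Illustrative counts for one word (not drawn from the corpora in the study).
print(log_likelihood(150, 500_000, 40, 1_000_000))
print(odds_ratio(150, 500_000, 40, 1_000_000))
```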
The effects of corpus use on correction of article- and preposition-omission errors
ABSTRACT. The strengths of corpora in L2 learning have been well documented in previous research (e.g., Flowerdew 2010). However, error correction in data-driven learning (DDL) settings has only been investigated in a few studies despite the fact that accurate error correction improves L2 writing. Satake (2018) examined the effects of corpus use on L2 error correction and found that it promoted accurate correction of article- and preposition-omission errors. Satake (2018) did not prepare an ad hoc pool of particular errors; however, focusing on article- and preposition-omission errors could promote more accurate correction of these errors. Thus, this study explored the effects of corpus use on the correction of article- and preposition-omission errors and compared the results to those of Satake (2018). The author used the same research methods as used in Satake (2018) and prepared an ad hoc pool of article- and preposition-omission errors.
The procedure was as follows: Participants wrote an essay for 25 minutes without consulting the Corpus of Contemporary American English (COCA) and/or dictionaries. Then they were given highlighted feedback from the author and peer students for article- and preposition-omission errors, and they corrected the highlighted errors for 15 minutes, consulting both the COCA corpus and dictionaries at least once. The author collected the essays and created error-annotated corpora to compare the results to those of Satake (2018).
The results showed that error correction with corpus use—focusing on article- and preposition-omission errors—contributed to more accurate correction of these errors than in Satake (2018). It also promoted more accurate error identification by peer students: fewer correct expressions were wrongly identified as errors than in Satake (2018). The findings suggest that effective corpus use for error correction requires teachers to focus on particular error types. Adjustments to DDL are needed for accurate error correction.
Errors and Beyond—A Corpus-based Stylistic Analysis of “Japanese English” Discourse
ABSTRACT. Since the number of users of English as a second or later language has exceeded the number of its native speakers, the traditional NS (native speaker) and NNS (non-native speaker) dichotomy has been replaced by the new perspective that the different varieties of English produced by NNS are considered not “deviant versions” of the dominant English language use, but “World Englishes (WE).” The idea of WE prompts many ESL/EFL teachers to rethink their pedagogical goal setting by raising the question, “What should be a ‘model’ of English in WE-oriented classrooms?” Furthermore, another dichotomy, that between language users and learners, can be questioned with the spread of the CEFR (Common European Framework of Reference for Languages), in which learners are also viewed as users of language (Council of Europe, 2001). The CEFR describes what language users across different proficiency levels are supposed to be able to do (CAN-DO). The CEFR’s CAN-DO descriptors can bring about a drastic change in the mindsets not only of language teachers but also of learner language analysts. More learner language studies would try to identify the developmental process mainly by focusing on what learners can do, rather than on what they cannot do correctly (= errors), which used to be the focus of many conventional analyses of learner language. In this study, we examine how “learner” language should be modeled from the perspective of WE, and how the model can be implemented in classroom practices. We do this mainly by identifying the style of “Japanese English” on the basis of several corpora of Japanese learner English. The “style” here means the particular English language use adopted by Japanese learners to fulfill their purposes. Since such a style can be exploited in relation to the dominant English language use, we will also examine to what extent we should/can tell “granted styles” from “errors.”
References
Council of Europe. (2001). Common European Framework of Reference for Languages: Learning, Teaching, Assessment. Cambridge: Cambridge University Press.
Corpus-based Error Detector for Computer Scientists
ABSTRACT. This study describes the design and development of a corpus-based error detector for short research articles produced by computer science majors. This genre-specific error detector provides automated pedagogic feedback on surface-level errors using rule-based pattern matching. The corpus development phase generated the data needed for the software development.
In the corpus phase, a learner corpus of all theses (n = 629) submitted over three academic years (2014 to 2017) was compiled. A held-out corpus of 50 theses was created for evaluation purposes. The remaining theses were added to the working corpus. Errors in the working corpus were identified manually and automatically. The first 50 theses were annotated using the UAM Corpus Tool. Errors were classified into one of five categories (i.e. accuracy, brevity, clarity, objectivity and formality), mirroring the content of the in-house thesis writing course. By the fiftieth thesis, saturation had been reached, viz. the number of new errors discovered had dropped considerably. Annotated errors were extracted into an error bank (an XML file). Each error was assigned values for severity, detectability and frequency. The weighted priority of each error was calculated from these values. For the remaining theses, only new errors were recorded and added directly into the error bank.
In the software phase, regular expressions were created to detect the errors, starting with those with the highest weighted priority. Easy-to-understand, actionable advice was written that could be displayed when an error was matched. A user-friendly interface was created and tested for accuracy and usability.
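A minimal sketch of this kind of rule-based detection is given below. The patterns, advice strings, and priority weights are invented placeholders, not entries from the actual error bank.

```python
import re

# Hypothetical entries modelled on the described error bank: each rule has a
# regular expression, actionable advice, and a weighted priority (here simply
# severity * detectability * frequency on an invented 1-5 scale).
ERROR_RULES = [
    {"pattern": re.compile(r"\bvery unique\b", re.I),
     "advice": "'unique' is absolute; write 'unique' or 'highly unusual'.",
     "priority": 5 * 4 * 3},
    {"pattern": re.compile(r"\bresearches\b", re.I),
     "advice": "'research' is usually uncountable; write 'research' or 'studies'.",
     "priority": 4 * 5 * 4},
]

def detect_errors(text):
    """Return matched errors, checking rules in order of weighted priority."""
    hits = []
    for rule in sorted(ERROR_RULES, key=lambda r: r["priority"], reverse=True):
        for match in rule["pattern"].finditer(text):
            hits.append((match.start(), match.group(), rule["advice"]))
    return hits

if __name__ == "__main__":
    sample = "Our researches show a very unique result."
    for position, span, advice in detect_errors(sample):
        print(f"{position}: '{span}' -> {advice}")
```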
This error detection tool focusses purely on the phraseologies unsuitable for computer science research articles and so provides an added layer of error detection in addition to generic grammar detectors. The error detection tool reduces the need for teachers to provide feedback on commonly-occurring surface-level errors.
Corpus-based Critical Discourse Analysis: A study of Political Ideology in Hindi Newspapers
ABSTRACT. Van Dijk (1991) notes that newspaper content has enormous power to serve as a medium expressing the ideological stance of newspapers and their readers. However, the study and analysis of newspaper discourse has mainly been attempted from a qualitative perspective, especially in the Indian context. Exploiting the potential of a huge corpus of real-world data to reveal ideology (Baker, 2010) requires an alternative analytical procedure which combines quantitative techniques with the theoretical framework of Critical Discourse Analysis (Hardt-Mautner, 1995). The present research explores a relatively recent method to study language patterns in the construction of political ideology. It focuses on a corpus-based discourse analysis to probe the representation of the ideological scenario in Hindi newspapers during the period 2014-2016, since when the present ruling political party, the Bhartiya Janta Party (BJP), has been in power. The analysis incorporates corpus methods, using corpus analysis tools such as AntConc and WordSmith to obtain lexical frequency counts, keyword analyses, collocational information and concordances. The analysis starts with identifying the keywords in the political news corpus when compared with the main news corpus. Some of the top occurring keywords are मोदी (Modi), भाजपा (BJP), कांग्रेस (Congress), सरकार (Government) and पार्टी (Political party). We have also calculated the collocates of the keywords, followed by a concordance analysis of these keywords. Both concordances and collocations help to identify lexical patterns in the political texts. These quantitative results are then interpreted and checked against some of the existing theories of Critical Discourse Analysis. We mainly follow Van Dijk’s (1995) suggestions for carrying out a critical discourse analysis for the study of political ideology.
Chinese-Arabic parallel corpus-based study of Chinese body language chengyu
ABSTRACT. Body Language Chengyu are a class of chengyu that use descriptions of a person’s facial expressions, body positions and movements to convey information or express emotions. In this paper we study 72 body language chengyu which can be found both in the Titaiyu Xiao Cidian and the Chengyu Da Cidian published by The Commercial Press, by carrying out the following steps:
1. We analysed the diachronic usage of the 72 chengyu in news articles published in the People’s Daily across a period of 70 years (1946-2015), finding that about 30 of them enjoyed stable, high-frequency usage, most of them expressing positive emotions.
2. We analysed the usage of chengyu in two Chinese-Arabic parallel corpora, comprising news and literature texts. We also investigated the strategies used in the Arabic translation of those chengyu. In the news corpus, chengyu describing body movements were most frequent, while in the literature corpus, chengyu describing facial expressions were more frequent. Liberal translation strategies were used most in both corpora, but in some texts Arabic idioms were used, i.e. loan translation. Literal translation strategies were also used.
3. We designed three questionnaires directed at students from the Chinese Department of Cairo University in order to investigate respectively how they understand, use and translate the 30 most frequent body language chengyu. The questionnaires included multiple choice, sentence-making, and translation tasks. The students were divided into two separate groups according to their proficiency levels, one for HSK 5 and another for HSK 6. Finally, we obtained a learner corpus of more than two thousand sentences and summarized the errors and mistakes the students committed according to the three above-mentioned aspects, giving some suggestions for second-language teaching and learning.
Stylistic Features in Corporate Disclosures and Their Predictive Power
ABSTRACT. We are concerned with the automatic processing of annual reports submitted to the U.S. SEC's EDGAR filing system. The filings consist of structured as well as unstructured information. One part of the filings, the 10-K forms, contains mostly free text, which is segmented into up to 20 items. Each of the items deals with a particular piece of information that is required to be disclosed, such as the risk factors faced by the company (item 1A) or the management's discussion of the financial condition of the company (item 7). We have built a corpus of all items found in 76,278 documents filed between January 2006 and December 2015 in HTML format; the nontrivial extraction algorithm will be made accessible on request to enable reproducibility.
In the paper at hand, we present the results of a first exploratory corpus analysis and provide descriptive statistical figures for a wide range of NLP annotations, including sentiment (Evert et al. 2014), emotion (Mohammad and Bravo-Marquez 2017), readability (Loughran and McDonald 2014), and stylistic features (Biber 1988; Nini 2015), for each item of the 10-K form. Our long-term goal is to facilitate interpretation of the text type at hand, which is pre-defined to a certain extent in content, yet not linguistically standardized.
We first find that the register differs significantly across items: the description of risk factors, for example, is written in a very subjective manner compared to all other items, whereas the conclusions of the company's principal officers are on average written in the most objective (and most complex) manner. Furthermore, we use the industry assignment of the filing company (following the standard industrial classification) and show that companies from different sectors use different linguistic registers: financial companies, for example, use the most objective and also the most complex language.
We conclude by applying a dimensionality reduction algorithm to the quantitative linguistic features. To begin with, a qualitative inspection of documents which are similar in terms of their quantitative features reveals that these filings vary greatly in content. The experiment thus shows that we do not just detect content-based near-duplicates, but that we are indeed able to identify filings that are similar in terms of their register. In turn, the experiment also shows that the quantitative linguistic features can be used to predict the company's standard industrial classification.
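The final prediction step could be approximated as below. This is a minimal sketch under our own assumptions (a PCA projection followed by a standard classifier, with random toy data standing in for the real feature matrix and sector labels), not the authors' actual pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-in for the real data: one row of quantitative linguistic
# features per filing item, with a sector label per row.
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 67))   # e.g. 67 Biber-style features
sectors = rng.integers(0, 8, size=500)  # e.g. 8 broad industry sectors

X_train, X_test, y_train, y_test = train_test_split(
    features, sectors, test_size=0.2, random_state=0)

# Dimensionality reduction followed by a simple multinomial classifier.
model = make_pipeline(PCA(n_components=10), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```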
Is the Continuation of The Mystery of Edwin Drood a Posthumous Work of Charles Dickens?: A Multivariate Analysis
ABSTRACT. Three years after Dickens’ death, Thomas Power James (henceforth James) added a continuation to The Mystery of Edwin Drood, claiming that it was written by the ‘spirit-pen of Charles Dickens, through a medium’. This study attempts to clarify whether the continuation can be considered a posthumous work of Dickens, as James suggested. Word preferences in the continuation are analyzed for similarity with those in The Mystery of Edwin Drood. The methods used are multivariate analyses of the frequencies of frequent words in three corpora: The Mystery of Edwin Drood, the continuation, and Our Mutual Friend, the third added as a reference. The analyses display distinct clustering of the sections included in the continuation, on the one hand, and those in the two Dickens works, on the other, highlighting differences in terms of word preferences. The results suggest that James’ claim regarding the continuation is dubious.
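A minimal sketch of the clustering side of such an analysis is shown below, assuming a matrix of word-frequency profiles per text section; the toy values are random stand-ins, not the Dickens data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Toy stand-in: rows are text sections, columns are relative frequencies of
# the most frequent words (the real study compares Drood, the continuation,
# and Our Mutual Friend).
rng = np.random.default_rng(1)
section_profiles = rng.random((30, 50))

# Ward clustering on Euclidean distances between word-frequency profiles.
distances = pdist(section_profiles, metric="euclidean")
tree = linkage(distances, method="ward")
clusters = fcluster(tree, t=2, criterion="maxclust")
print(clusters)  # one cluster label per section
```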
ABSTRACT. Over the past decade, more and more writers have used the present tense as the primary tense for their narratives. This paper shows that contemporary present-tense fiction shares more characteristics with spoken discourse than past-tense fiction does, by comparing lexis and structures in two corpora: a corpus consisting of present-tense narratives and a corpus of past-tense narratives.
The 88,751-word corpus of past-tense narrative (the PAST corpus) was created by using texts from the Fiction section of the Lancaster Speech, Writing and Thought Presentation corpus (Semino and Short 2004). The forty titles (twenty from serious fiction, twenty from popular fiction) were published between 1919 and 1999. A comparable corpus of present-tense narratives (the PREST corpus) also contains forty text samples (twenty each from serious and popular fiction), with a total word count of 87,441. The texts were chosen from novels published between 2000 and 2016. The corpus annotation of discourse presentation in the Lancaster Speech, Writing and Thought Presentation corpus and my own annotation of discourse presentation in the PREST corpus were referred to when characters’ speech and thought presentation were relevant to the issues under examination.
The comparison of the two corpora shows that the distinctive lexical and structural features of the PREST corpus make present-tense narrative stylistically more similar to spoken discourse than past-tense narrative in the following aspects: (1) the use of pronouns and the underuse of proper nouns, (2) the underuse of adjectives, (3) the use of phrasal verbs, and (4) the use of the progressive aspect. I will also discuss how the use of the present tense affects the management of viewpoint in narrative by relating its lexical and structural characteristics to the presentation of characters’ speech and thoughts.
Corpus-based Approaches to ELT: Current Situation and Future Implications in Pakistan
ABSTRACT. In recent decades, the use of online language corpora and computer tools has garnered tremendous attention from English language teachers and academics. Keeping in view modern trends and the needs of learners, this research focuses on the practical application of online corpora for English language teaching (ELT) and its utility in the Pakistani context. The Michigan Corpus of Academic Spoken English (MICASE) is used as a reference corpus for this research. MICASE is a collection of nearly 1.8 million words of transcribed speech (almost two hundred hours of recordings). The transcribed data of MICASE include a wide range of speech events such as seminars, lectures, advising sessions and lab sessions. This study concentrates on the utility of lexical items at two levels: first, the syntactic level, and second, their senses in different contexts. Furthermore, it explores the layers of meanings and uses of lexical items through an in-depth study of right and left collocates in the reference corpus. The results describe the multiple uses of a lexical item and the importance of right and left collocates in understanding the meanings and senses of lexical items in various contexts. This strategy can thus be fruitful for English language learners and the academic discourse community who are interested in understanding the versatile uses of lexical items and their contextual meanings. Although there are many advantages of corpus-based approaches to ELT, some limitations can be observed in the Pakistani context. In this study, we identify these limitations and provide possible solutions for implementing corpus-based approaches to English language teaching in Pakistan.
“When you don’t know what you know” – Use of Māori loanwords in a diachronic corpus of New Zealand English
ABSTRACT. Background We present a corpus-driven analysis of Māori loanwords in New Zealand English (NZE) within a quantitative, diachronic approach.
Methods Previous work on NZE suggests that loanword use is both increasing (Macalister 2006) and highly linked to discourse topic (Degani 2010) and author profile (Calude et al. 2017; De Bres 2006). To address these observations, we collected a topically-constrained diachronic corpus of New Zealand newspapers based on a key-term search of “Māori Language Week”, covering the years 2008-2017. Māori Language Week has been a well-established, annually celebrated event in New Zealand since 1975. Once the corpus (108,925 words) was compiled, we manually extracted all the Māori loanwords used in it and documented all non-proper nouns and their frequency (four proper nouns were also retained, namely Māori, Pākehā, Kiwi and Matariki).
Findings Our findings provide a comparison of two strands: (1) perceptions surrounding knowledge of Māori loanwords, and (2) their frequency of use. As regards (1), we distinguish marked and non-marked loanwords (following Kruger 2012), and explicit author perceptions (newspaper articles contained explicit information about loanwords which authors deemed to be familiar to the wider New Zealand public). Marked loanwords are words translated or explained (what we term textual markedness) or loanwords given in quotes, brackets or dashes (graphical markedness). With respect to (2), we report the frequency of use of the 187 distinct loanword types and 3,800 tokens found in the corpus (of which 1,653 uses came from the loan “Māori”, 1,008 from reo “language” and the remaining 1,139 from various other loans) and rank these according to semantic class. Finally, we provide comparisons with previous loanword studies of other language genres.
Implications We hope our results and methodology will have implications for other studies of loanword use and language change, and for theoretical debates regarding loanword integration and entrenchment.
References Calude, Andreea, Mark Pagel & Steven Miller. 2017. Modelling borrowing success – A quantitative study of Māori loanwords in New Zealand English. Corpus Linguistics and Linguistic Theory 15 (2). doi:10.1515/cllt-2017-0010.
De Bres, Julia. 2006. Maori lexical items in the mainstream television news in New Zealand. New Zealand English Journal 20. 17–34.
Degani, Marta. 2010. The Pakeha myth of one New Zealand/Aotearoa: An exploration in the use of Maori loanwords in New Zealand English. In Roberta Facchinetti, David Crystal & Barbara Seidlhofer (eds.), From International to Local English – and Back Again, 165–196. Frankfurt am Main: Peter Lang.
Kruger, Haidee. 2012. Postcolonial polysystems: The production and reception of translated children’s literature in South Africa. Amsterdam/Philadelphia: John Benjamins Publishing Company.
Macalister, John. 2006. The Māori presence in the New Zealand English lexicon, 1850-2000: Evidence from a corpus-based study. English World-Wide 27 (1). 1–24.
The construction of a Business Katakana Word List using an instant domain-specific web corpus and BCCWJ
ABSTRACT. The construction of a keyword list from a domain-specific corpus for pedagogical purposes depends both on the selector's expertise in language education and on knowledge of a specialized field, which language teachers often lack. Corpus analysis can be applied to make an objective, authentic word list through the examination of large volumes of text materials. Chujo and Utiyama (2006) established a statistical measure for identifying domain-specific vocabulary in the British National Corpus (BNC). Similarly, Matsushita (2011) created the Japanese Common Academic Words List with the use of the Balanced Corpus of Contemporary Written Japanese (BCCWJ). The primary purpose of this study is to present an easy-to-implement methodology for creating a Business Katakana Word List by combining the methods and procedures of Chujo and Utiyama (2006) and Matsushita (2011). The ultimate aim is to construct a Business Katakana Word List for pedagogical purposes. In this study, an instant domain-specific (Japanese for Business Purposes: JBP) corpus was compiled using the WebBootCat tool (Baroni et al. 2006) integrated in the Sketch Engine, which can create a corpus by crawling the internet in a relatively short period of time. The frequency of words in the JBP corpus was then compared with that in the BCCWJ using the log-likelihood ratio (LLR) as a statistical measure. The results show that it is effective to extract specialized vocabulary from an instant web corpus through statistical measures and that LLR is one of the most useful statistical measures for separating JBP katakana vocabulary from Japanese for General Purposes katakana vocabulary. By highlighting katakana keywords, the list can help learners increase their proficiency in business Japanese. It can also support JBP teachers in their course design and provide a practical example for researchers to further investigate the nature of Business Katakana Vocabulary.
References
Baroni, M., Kilgarriff, A., Pomikalek, J., and Rychly, P. (2006). WebBootCat: instant domain-specific corpora to support human translators. Proceedings of EAMT, Oslo, 247-252.
Chujo, K., and Utiyama, M. (2006). Selecting level-specific specialized vocabulary using statistical measures. System, 34 (2), 255-269.
Matsushita (2011). Extracting and validating the Japanese Academic World List, Proceedings of the Conference for Teaching Japanese as a Foreign Language, Spring 2011, 244-249. (published in Japanese)
Corpus-based analysis of semantic relations between two verbal constituents in lexical compound verbs
ABSTRACT. In this work we analyze semantic combinations of the two verbal constituents which make up a compound verb, the so-called “verb + verb” type of compound verb. In Japanese, a compound verb is treated as one word.
In terms of theoretical research on lexical semantics, Kageyama (1993) distinguished two kinds of compound verbs: “syntactic compound verbs” and “lexical compound verbs”. In this paper, we treat lexical compound verbs.
In our work we analyze the relations between the two verbal constituents in terms of whether they are combined by common lexical meanings or by contextual meanings, by statistically observing linguistic data extracted from a large corpus.
In our experiment, we used a web corpus containing 5 billion sentences. We calculated the similarity between a compound verb and its first verbal constituent (V1), and between a compound verb and its second verbal constituent (V2).
Based on the differences between their similarity values with the compound verb, we estimated which constituent verb should be considered the semantic head. We also observed whether or not the case markers of each verbal constituent changed, compared with its use as a single verb.
From our results, in terms of whether or not V1 and V2 are combined by common lexical meanings, we found three types: synonym combination, partial overlapping, and no overlapping. Among the relations without semantic overlapping between V1 and V2, there are idioms, “adverb and verb” combinations, “verb and negation” combinations and contextual combinations. Synonym combinations of V1 and V2 and idioms are tightly integrated. In the case of partial overlapping, the combination patterns for compound verbs are predictable from the meanings of V1 and V2 and are comparatively productive.
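The head-estimation step described above can be illustrated as follows; the vectors are toy stand-ins, and the decision rule (higher cosine similarity to the compound indicates the semantic head) is a simplified reading of the method.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors standing in for embeddings of a compound verb and its constituents
# (in the study these would be derived from the 5-billion-sentence web corpus).
vec_compound = np.array([0.9, 0.1, 0.3])
vec_v1       = np.array([0.5, 0.4, 0.2])   # first constituent verb (V1)
vec_v2       = np.array([0.8, 0.1, 0.4])   # second constituent verb (V2)

sim_v1 = cosine(vec_compound, vec_v1)
sim_v2 = cosine(vec_compound, vec_v2)
head = "V1" if sim_v1 > sim_v2 else "V2"
print(f"sim(V1)={sim_v1:.2f}, sim(V2)={sim_v2:.2f} -> estimated semantic head: {head}")
```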
Visualizing English Classroom Spoken Data on Multi-modal Interface to Create Versatile Linguistic Resources
ABSTRACT. This study attempts to visualize classroom spoken data on a single web-based window screen. The authors first compiled a classroom spoken corpus in three steps: (1) video-recording English lessons, (2) transcribing the lessons, and (3) annotating the transcriptions with a tag set designed to describe classroom interactions (Walsh, 2006). Next, the authors arranged the results of these three steps on a computer-run interface so that their outcomes can be used on a single display: (1) audio-visual data (the videotaped English lessons), (2) transcriptions of utterances by the teachers and the students, and (3) annotations ranging from language use, such as L1, L2, or a mixture of both, to metadata on the teacher-student interactions. The multi-modal interface manages these three elements as individual modules, which are usually utilized for separate pedagogical as well as linguistic purposes: (1) the video module for observing or reflecting on teaching, (2) the transcripts for quantitative analyses, and (3) the annotations for qualitative analyses. This multi-modal interface synthesizes the three modules and will enable researchers, teacher trainers, and novice teachers to overview all the elements via the control panel: (1) to watch the classroom with subtitles and metadata such as transactions and interaction modes, (2) to show the annotated transactions such as primary, medial, and consolidation (Sinclair & Coulthard, 1975) and interaction modes such as managerial and materials (Walsh, 2006), and (3) to export the transcriptions depending on the speakers. The authors will demonstrate the web-based multi-modal video interface on a MacBook Pro by showing a first-grade English lesson at a junior high school in Japan conducted by a non-native English teacher.
References
Sinclair, J., & Coulthard, M. (1975). Towards an analysis of discourse: the English used by teachers and pupils. Oxford: Oxford University Press.
Walsh, S. (2006). Investigating Classroom Discourse. New York: Routledge.
Predicting EFL learners’ oral proficiency levels in monologue tasks
ABSTRACT. This study aims to assess spoken English as a second language (L2) using automated scoring techniques. Automated scoring, in which computer technology evaluates and scores written or spoken content (Shermis and Burstein, 2003), aims to sort a large body of data, which it assigns to a small number of discrete proficiency levels. Objectively measurable features are used as exploratory variables to predict scores defined as criterion variables. For this study, we compiled a corpus of 360 Japanese EFL learners’ spoken utterances, each of which was coded with one of the nine oral proficiency levels used in the Telephone Standard Speaking Test, a monologue speaking test. The test-takers answered ten open-ended questions in a variety of contexts, while speaking over the phone for approximately 15 minutes. The nine levels, which were manually assessed by professional raters and pertained to such aspects of examinees’ speech as vocabulary, grammar, pronunciation, and fluency, were used as criterion variables, and 67 linguistic features analyzed in Biber (1988) were used as explanatory variables. The random forest algorithm (Breiman, 2001), a powerful machine learning method used in automated scoring, was employed to predict oral proficiency. As a result of using random forests with out-of-bag error estimates, correct prediction was achieved for 69.23% of L2 speeches. Predictors that can clearly discriminate oral proficiency levels were, in order of strength, frequency of preposition use, determiners, causative adverbial subordinators, existential there, third-person pronouns, and coordination of independent clauses. It is interesting to note that, compared to results from automated scoring of dialogue rather than monologue in L2 learners (Kobayashi and Abe, 2016), syntactic features play a more robust role in predicting oral proficiency in the monologue test. The results of this study can be applied to creating assessments that are more appropriate for scaling the oral performance of EFL learners.
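A minimal sketch of the modelling step is given below, assuming a matrix of the 67 linguistic features per speech and the nine proficiency levels as labels; the data here are random placeholders, not the study's corpus.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the study's data: 360 speeches x 67 linguistic features,
# each labelled with one of nine oral proficiency levels.
rng = np.random.default_rng(42)
X = rng.random((360, 67))
y = rng.integers(1, 10, size=360)

# Random forest with out-of-bag error estimation, as described in the abstract.
forest = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=42)
forest.fit(X, y)

print("out-of-bag accuracy:", forest.oob_score_)
# Feature importances indicate which linguistic features discriminate levels best.
top = np.argsort(forest.feature_importances_)[::-1][:5]
print("top predictor indices:", top)
```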
References
Biber, D. (1988). Variation across speech and writing. Cambridge: Cambridge University Press.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-23.
Kobayashi, Y., & Abe, M. (2016). Automated scoring of L2 spoken English with random forests. Journal of Pan-Pacific Association of Applied Linguistics, 20(1), 55-73.
Shermis, M. D., & Burstein, J. C. (Eds.) (2003). Automated essay scoring: A cross-disciplinary perspective. New York: Routledge.
Elucidation of the influence of the structure of lyrics on the ease of understanding lyrics
ABSTRACT. Lyrics are written in natural language, so their syntax basically follows natural language syntax rules. In lyrics, however, sentences that deviate from the syntactic rules of everyday sentences, such as inversions, are also frequently used. Statements that deviate from syntactic rules are usually difficult to understand; in lyrics set to music, however, such deviations often do not reduce the ease of semantic understanding. We believe that behind this lie certain conventions allowed only in lyrics, which are related to the special characteristics of lyrics. The purpose of this research is to elucidate the relationship between this specialty of lyrics and the ease of understanding them.
Factors that influence the ease of understanding lyrics might include the melody of the music, the rhythm of the song, the song title, and the structure of the lyrics. In this research, we focus on the song title and the structure of the lyrics, and try to find the relationship between them and the ease of understanding lyrics with the help of two kinds of corpora that we built in advance. Specifically, we first conducted subject experiments on whether being aware of the title of a song makes any difference in understanding its lyrics, or in other words, helps in understanding the meaning of the lyrics. From the results of the subject experiments, we confirmed that there might be some relationship between the song title and the ease of understanding lyrics. Next, we estimated the similarity between each fragment of the lyrics and the song title using word embedding representations built from the corpora mentioned above. We then found a correlation between the results of the similarity calculation and the results of the subject experiments.
Based on the above observations, we have reached the hypothesis that if the song title appears frequently in the lyrics, being conscious of the title might make the meaning of the lyrics more difficult to understand, and conversely, if the title appears only rarely in the lyrics, the meaning might become easier to grasp. As far as we know, our approach is the first attempt to elucidate the influence of the structure of lyrics on the ease of understanding them.
Multi-Domain Word Embeddings for Semantic Relation Analysis among Domains
ABSTRACT. In recent years, algorithms for word embeddings have proved to be useful for a large variety of natural language processing tasks. In most existing studies, word embeddings are obtained from large corpora of text data by finding patterns in existing written language. However, existing word embedding learning methods cannot capture the semantics of words across two or more domains. In this paper, we propose a word embedding learning method for three or more different domains and analyze the characteristics of word embeddings for three different domains. We use three domains, PB (books), PM (magazines) and PN (newspapers), in the Balanced Corpus of Contemporary Written Japanese (BCCWJ) and construct word embeddings for each domain using word2vec. For the word embeddings obtained in each domain, we extract the top-k most similar words to a target word and analyze similarities and differences among the three domains. As a result of the experiments on target words (nouns, verbs, adjectives and adverbs), we found that some words have meanings that differ across domains; adverbs tend to have different meanings in each domain, while verbs tend to have similar meanings across all domains. Moreover, the results show that the similar words extracted using word2vec tend to differ in each domain, such as katakana words in the PM, political words in the PN and many kinds of similar words in the PB.
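A minimal sketch of the per-domain comparison follows, assuming each register's sentences are available as pre-tokenised lists; the toy sentences and parameter values are illustrative only.

```python
from gensim.models import Word2Vec

# Tiny toy sentences per domain (pre-tokenised); the study trains one model
# per BCCWJ register (PB books, PM magazines, PN newspapers).
domains = {
    "PB": [["the", "old", "novel", "was", "long"], ["she", "read", "the", "novel"]],
    "PM": [["the", "new", "novel", "is", "popular"], ["buy", "the", "novel", "now"]],
    "PN": [["the", "novel", "won", "a", "prize"], ["the", "prize", "was", "announced"]],
}

# One word2vec model per domain (parameters chosen only for this toy data).
models = {
    name: Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50, seed=0)
    for name, sentences in domains.items()
}

# Compare the top-k nearest neighbours of the same target word across domains.
target, k = "novel", 3
for name, model in models.items():
    print(name, model.wv.most_similar(target, topn=k))
```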
Functional Classification of Lexical Bundles in TED Talks
ABSTRACT. Lexical bundles are defined as 'recurrent expressions, regardless of their idiomaticity, and regardless of their structural status' (Biber et al., 1999). In spoken discourse, they are considered to play important functions such as expressing stance, organizing discourse, and indicating reference. Over the last decade, TED Talks have been gaining popularity as a valuable source of authentic ELT materials for teaching oral presentation skills. The purpose of this study is to extract high-frequency lexical bundles in TED Talks and investigate their functions in spoken discourse. Transcripts of 101 TED Talks were collected to build a TED Talk mini corpus which contains approximately 270,000 words. Using the n-gram analysis tool of AntConc, the 100 most frequent four-word lexical bundles were extracted from the corpus. Functional classification of those lexical bundles was carried out based on the taxonomies employed in Biber (2006). The results of the analysis indicated that the speakers of TED Talks effectively used various types of lexical bundles in their talks. It is suggested that, since TED Talks can be regarded as a specific form of public speaking, the speakers elaborately composed their talk scripts using such lexical bundles in order to successfully persuade the audience and disseminate their own ideas or professional knowledge to lay people. Common functions of lexical bundles observed in TED Talks were 'Showing desire', 'Indicating intention/prediction', 'Topic introduction', and 'Identification/focus'. Actual examples of lexical bundles in the talks will be presented and their discourse functions discussed in this poster presentation.
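Outside AntConc, the bundle extraction itself can be sketched in a few lines, assuming the transcripts have been concatenated and tokenised; the sample text is invented.

```python
from collections import Counter

def four_word_bundles(tokens, top_n=100):
    """Count contiguous four-word sequences and return the most frequent ones."""
    grams = zip(tokens, tokens[1:], tokens[2:], tokens[3:])
    return Counter(" ".join(g) for g in grams).most_common(top_n)

# Toy input standing in for the ~270,000-word TED transcript corpus.
text = "what i want to do is show you what i want to say about this idea"
tokens = text.lower().split()
print(four_word_bundles(tokens, top_n=5))
```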
References
Anthony, L. (2016). AntConc (Version 3.4.4 ) [Computer Software]. Tokyo, Japan: Waseda University. Available from http://www.antlab.sci.waseda.ac.jp/
Biber, D., Johansson, S., Leech, G., Conrad, S., & Finegan, E. (1999). Longman grammar of spoken and written English. London: Longman.
Biber, D. (2006). University language: A corpus-based study of spoken and written registers. Amsterdam: John Benjamins.
Automatic Generation of Japanese Question-Answering Pairs
ABSTRACT. Large-scale English question-answer corpora are available; however, there is no large-scale Japanese question-answer corpus. Furthermore, various types of question-answer corpora are needed for each domain. Hence, an automatic method for generating question-answer pairs is valuable. Our approach is composed of three steps. The first step is to gather a target dataset. The second step is to extract question-answer pairs from the dataset. The third step is to have a recurrent neural network (RNN) learn the extraction method. After these three steps, new question-answer pairs can be generated automatically from another dataset using the RNN. In this research, we propose a method for generating Japanese question-answer pairs and report an evaluation of the generated question-answer pairs by human evaluators and by automatic evaluation metrics. The automatic evaluation includes BLEU, METEOR and sentence similarity. It is well known that the selection of appropriate case particles is difficult in Japanese; thus, we propose the dependency coincidence ratio as an additional automatic evaluation metric.
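The BLEU part of the automatic evaluation can be illustrated with NLTK's sentence-level BLEU, as sketched below with toy token lists; the proposed dependency coincidence ratio is not reproduced here.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Toy reference question and a generated question, as token lists
# (for Japanese these would be morphologically segmented tokens).
reference = [["where", "was", "the", "meeting", "held", "?"]]
generated = ["where", "was", "the", "meeting", "?"]

# Smoothing avoids zero scores when higher-order n-grams have no matches.
smooth = SmoothingFunction().method1
score = sentence_bleu(reference, generated, smoothing_function=smooth)
print(f"BLEU = {score:.3f}")
```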
Exploring word-formation in science fiction using a small corpus
ABSTRACT. This paper reports on a short preliminary study of new-word-formation in science-fiction writing. A small corpus of 7 novels, all written by the same author and set in the same fictional world, was compared against a 194,000-word English dictionary word list using AntWordProfiler (2013). Words not occurring in the dictionary were considered to be potential “new words” created for use in the futuristic setting of the novels. A list of those words occurring in the corpus with a raw frequency of 2 or more was first manually sorted into semantically differentiated categories (characters, races/creatures, groups/organizations, places, technology, jobs/occupations, food/drink, and other). Then the words in certain categories (technology, jobs/occupations, food/drink, and other) were classified based on method of word formation (acronym/initialism, affixation, backformation, blending, borrowing, clipping, coinage, compounding, hypocorism, other). It was found that the author favoured the use of complex clipping (the compounding of two words where one or both words are clipped) when creating words to label fictional items of technology. Furthermore, it was noted that one element of these complex clipped words was often carefully selected to associate with the morphology/phonology or meaning of a word more commonly occurring with the other element in the word outside of science-fiction. This is presumably due to the author’s desire to create new words whose meanings can be intuited by the reader relatively easily. It was also noted that the author tended towards technologisation, Latinisation, and archaisation, to help create the feeling of otherworldliness in the writing.
Anthony, L. (2013). AntWordProfiler (Version 1.4.0w) [Computer Software]. Tokyo, Japan: Waseda University. Available from http://www.laurenceanthony.net/software
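A minimal sketch of the screening step described above, assuming the dictionary is available as a plain one-word-per-line list and the novels as text files (both file names are hypothetical); the study itself used AntWordProfiler:

```python
# Minimal sketch of the screening step: words in the corpus that are absent
# from a dictionary word list and occur at least twice are kept as candidate
# "new words" (illustrative; the study itself used AntWordProfiler).
import re
from collections import Counter
from pathlib import Path

dictionary = set(
    Path("english_wordlist.txt").read_text(encoding="utf-8").lower().split()
)  # hypothetical dictionary list, one word per line

counts = Counter()
for path in Path("novels").glob("*.txt"):           # hypothetical corpus folder
    counts.update(re.findall(r"[a-z'-]+", path.read_text(encoding="utf-8").lower()))

candidates = {w: f for w, f in counts.items() if f >= 2 and w not in dictionary}
for word, freq in sorted(candidates.items(), key=lambda x: -x[1]):
    print(f"{freq}\t{word}")
```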
Comment Corpus for Study Abroad Program Evaluation
ABSTRACT. Corpus-linguistic techniques have been used for course evaluation in higher education (El-Halees 2011, Leong et al. 2012, Kaewyong et al. 2015). Our current goal is to extend the corpus-based course evaluation method to the evaluation of study abroad programs. Although study abroad programs have a long history in higher education, their assessment is still regarded as a new challenge (Savicki & Brewer 2015).
This study presents a comment corpus, a collection of comments from our satisfaction survey of a study abroad program. The corpus is part of an evaluation project conducted to investigate learning outcomes of the study abroad experience. Approximately 600 students participated in a study abroad program at an Asian institution, and they wrote comments on their satisfaction with the contents of courses (Academics), off-campus dormitories/houses (Accommodation), and interaction with local students (Interaction). Their primary language was English, and their home institutions were located in North America, Europe, the Oceania region, and Asian countries.
The corpus consists of 43,736 words (17,438 words in Academic comments, 13,767 words in Accommodation comments, and 12,531 words in Interaction comments). Linguistic analyses revealed properties of the Academic, Accommodation, and Interaction comments. For instance, the Academic comments were significantly longer than the other comments (p < .01). An n-gram analysis of the Academic comments showed the frequent use of a first-person singular pronoun subject as in “I would” and “I would like.”
The comments in this corpus were manually annotated for satisfaction polarity (positive or negative satisfaction) by a faculty member at the host university. The comments were also annotated for other properties using machine-learning classifiers, including sentiment polarity (positive or negative sentiment), mood polarity (happy or upset), and linguistic complexity in terms of readability.
A. El-Halees, “Mining opinions in user-generated contents to improve course evaluation,” In J. M. Zain, W. Mohd, W. Maseri, E.-Q. Eyas (eds.) Software Engineering and Computer Systems, Part II, Communications in Computer and Information Science, vol. 180, Berlin: Springer-Verlag Berlin Heidelberg. 2011.
P. Kaewyong, A. Sukprasert, N. Salim, and F. A. Phang, “The possibility of students’ comments automatic interpret using lexicon based sentiment analysis to teacher evaluation,” Proceeding of the 3rd International Conference on Artificial Intelligence and Computer Science, pp. 179-189, 2015.
C. K. Leong, Y. H. Lee, and W. K. Mak, “Mining sentiments in SMS texts for teaching evaluation,” Expert Systems with Applications, vol. 39, pp.2584-2589, 2012.
V. Savicki and E. Brewer, Assessing Study Abroad: Theory, Tools, and Practice. Sterling, Virginia: Stylus, 2015.
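A minimal sketch of the kind of machine-learning polarity classifier mentioned in the abstract, using a bag-of-words logistic regression on toy comments; the abstract does not specify the actual classifiers, features or training data, so everything below is an illustrative assumption:

```python
# Minimal sketch of a bag-of-words polarity classifier for survey comments
# (illustrative; the abstract does not specify the classifiers actually used).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_comments = [                      # toy labelled examples
    "The courses were excellent and well organised.",
    "The dormitory was noisy and far from campus.",
]
train_labels = ["positive", "negative"]

model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(train_comments, train_labels)

print(model.predict(["I really enjoyed the interaction with local students."]))
```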
Study on tour guide support integrating real examples and Web resources
ABSTRACT. Volunteer tour guides are people who guide visitors around an area in their own way, voluntarily and on an ongoing basis. In Japan, organizations of such tour guides have been established in many tourist spots since the Japan Tourism Agency was inaugurated.
Such organizations face the problems of improving the skills of volunteer tour guides and of training their successors, because many of the volunteers are elderly and are not professional tour guides but individuals who want to contribute to their area.
In this study, we aim to improve guide skills and to develop a successor training program. To do this, we first analyze data collected from two resources.
The first resource is questionnaire data collected with the cooperation of one tour guide organization. In the questionnaire, we asked the guides questions based on what tourists often ask and requested answers in free writing.
We analyze the answer texts for qualitative and quantitative differences among them and for differences in individual viewpoints. From the results, we consider guidelines for developing a successor training program and for sharing knowledge among the tour guides in the organizations.
As the second resource, we use Web resources and examine how to supplement information related to the tourist areas. Many volunteer tour guides have extensive knowledge about the historical background of traditional buildings and tourist spots. However, few also have knowledge about new tourist spots and recent topics related to existing ones, even though such knowledge would help more tourists leave satisfied with the areas they visit. We therefore try to extract this kind of knowledge from Web resources and present it to tour guides.
In this paper, we propose a tour guide support method integrating real examples and Web resources.
A Semi-Supervised Clustering Method for Dialogue Identification in Meeting Minutes Corpora
ABSTRACT. To analyze dialogue in corpora, we first need to collect an adequate dialogue dataset during preprocessing. However, for large-scale meeting minutes corpora gathered by web crawling, dialogue identification becomes extremely laborious, since a large amount of human labeling effort is needed to discard non-dialogue data. As a potential solution to this problem, clustering, an unsupervised machine learning approach, has shown strong problem-solving ability on a wide range of text analysis tasks. However, although clustering reduces labeling effort, it lacks supervision, which can lead to low precision and reliability. Consequently, aiming to balance precision against human labeling effort, we propose a semi-supervised clustering method which uses a small amount of labeled data to identify dialogue in meeting minutes corpora effectively.
First, we select a subset of samples from the whole dataset by standard random seeding, trying to guarantee that at least one seed corresponds to each partition of the dataset. Subsequently, with minimal human labeling work, we collect the dialogue markers which indicate where dialogue occurs inside the samples. For example, markers such as "Q:" and "A:" indicate that their lines contain dialogue Q&A content. We call these word groups explicit features, and each of their elements an implicit feature. Extending this to the entire dataset, we count the percentage of each document's tokens accounted for by explicit/implicit features, and assemble these percentages into feature vectors representing the meeting minutes. The vectors are then clustered with the k-means algorithm and divided into a non-dialogue cluster and a dialogue cluster. Experiments indicate that our proposal is effective and robust, substantially reducing human labeling effort on new data sets.
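A minimal sketch of the feature construction and two-cluster k-means step described above, with an assumed set of hand-labelled dialogue markers and toy documents:

```python
# Minimal sketch: represent each document by the percentage of tokens that
# are dialogue markers, then split documents into two clusters with k-means
# (an illustrative reading of the method described above).
import numpy as np
from sklearn.cluster import KMeans

MARKERS = {"Q:", "A:", "question:", "answer:"}   # assumed hand-labelled markers

def feature_vector(text):
    tokens = text.split()
    hits = sum(1 for t in tokens if t in MARKERS)
    return [100.0 * hits / len(tokens)] if tokens else [0.0]

documents = [
    "Q: When is the next meeting? A: Next Tuesday at ten.",
    "The committee reviewed the annual budget report in detail.",
]
X = np.array([feature_vector(d) for d in documents])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)        # one cluster should correspond to dialogue-like minutes
```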
How distinctive is your corpus? A metric for the degree of deviation from the norm
ABSTRACT. We propose a general metric of the divergence of one corpus from another. The metric takes the form of a difference in probabilities, representing the degree of divergence numerically. The metric also indicates in which respects the two corpora are similar or different. As concrete examples, we compute the metric for two pairs of corpora (standard and dialectal corpora of Japanese) and show in what respects and to what degree these pairs differ.
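The abstract does not give the exact formula, so the following is a minimal sketch of one plausible reading of a difference-in-probabilities comparison, using per-item relative frequencies in two toy corpora:

```python
# Minimal sketch of a difference-in-probabilities comparison of two corpora
# (one plausible reading; the abstract does not give the exact formula).
from collections import Counter

def probabilities(tokens):
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

corpus_a = "standard variety tokens go here ...".split()   # toy data
corpus_b = "dialectal variety tokens go here ...".split()

p, q = probabilities(corpus_a), probabilities(corpus_b)
diffs = {w: p.get(w, 0.0) - q.get(w, 0.0) for w in set(p) | set(q)}

# Items with the largest absolute difference show where the corpora diverge.
for word, d in sorted(diffs.items(), key=lambda x: -abs(x[1]))[:10]:
    print(f"{word}\t{d:+.4f}")
```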
16:00-16:20 Coffee Break
Entrance Lobby of Kagawa International Conference Hall, Tower building
6F, Takamatsu Symbol Tower
Can Neural Machine Translation System Create Training Data?
ABSTRACT. Research and development in natural language processing, including machine translation, requires huge amounts of data. It is important to collect real-world texts on a large scale and to develop an environment in which anyone can develop and demonstrate a system based on their own ideas by using the data freely. The first part of this paper describes our attempt to create and release a bilingual corpus that anyone can use for any purpose. We will distribute parallel data which we and our collaborators create, and we also gather already translated text and make it open to the public. We plan to create a huge parallel corpus by using a neural machine translation system, and we describe the usefulness of this type of corpus. We verify it by using both an existing human-translated parallel corpus and text translated by a machine translation system; in our experiments, we obtained better translation quality by using parallel text created by Google NMT than by using manually translated parallel text.
Once such data exist, new businesses can be created through their use. Anybody can develop services with real data and demonstrate them commercially. A wide range of research and development becomes possible, from practical R&D in large enterprises and universities to unique R&D by small and medium-sized enterprises and students.
Based on this idea, we have started to develop parallel corpora that will be open to the public and can be used not only for academic but also for business purposes. In this paper, we describe the framework for creating public bilingual data that we are considering. We also describe the effectiveness of parallel corpora created with a neural machine translation system, which play an important role in data collection and publication.
Mixing and Matching DIY and Ready-made Corpora: Introducing the DIY Text Tools built into The Prime Machine
ABSTRACT. From the early days of Data-Driven Learning, language learners have been encouraged to construct their own small corpora (Johns, 1986). In an overview of the use of corpora in teaching, Yoon (2011) reviews the success and benefits of self-compiled corpora, particularly for high level, highly motivated students. Charles (2012) has shown that DIY corpora activities can be very effective for doctoral students, helping them understand linguistic features of texts in their own disciplines.
The Prime Machine (Jeaco, 2017a) was developed as an ELT concordancing tool and it includes a range of search support and display features which try to make comparison processes easier. In Version 3, there are a number of DIY Text Tools which allow users to process their own texts and compare these with the pre-processed online corpora.
This paper presents a software demonstration of the DIY Text Tools in The Prime Machine. While its tools do not have the flexibility or range of features found in AntConc (Anthony, 2004), WordSmith Tools (Scott, 2010) or the corpus compilation utilities of Sketch Engine (Kilgarriff, Rychly, Smrz, & Tugwell, 2004), some of the “out-of-the-box” functionality includes:
• Being able to generate key words, key key words and key associates (cf. Scott & Tribble, 2006) using any of the online corpora as a reference corpus (see the keyness sketch after the reference list below).
• Being able to view concordance lines from two corpora side-by-side.
• Being able to view concordance cards with extended concordance lines with paragraphing (Jeaco, 2017b) and having the option to sort concordance lines using a collocation measure (cf. Collier, 1994).
There are also some experimental features allowing users to compare vocabulary wordlist matches (cf. Cobb, 2000) in the DIY corpus against an online reference corpus, and scores for matching MI collocations compared against an online reference corpus (cf. Bestgen & Granger, 2014; Leńko-Szymańska, 2016).
The software is hosted on a server in Suzhou, China, and is available from www.theprimemachine.net.
References
Anthony, L. (2004). AntConc: A learner and classroom friendly, multi-platform corpus analysis toolkit. Paper presented at the Interactive Workshop on Language e-Learning, Waseda University, Tokyo.
Bestgen, Y., & Granger, S. (2014). Quantifying the development of phraseological competence in L2 English writing: An automated approach. Journal of Second Language Writing, 26, 28-41. doi: 10.1016/j.jslw.2014.09.004
Charles, M. (2012). ‘Proper vocabulary and juicy collocations’: EAP students evaluate do-it-yourself corpus-building. [Article]. English for Specific Purposes, 31, 93-102. doi: 10.1016/j.esp.2011.12.003
Cobb, T. (2000). The Compleat Lexical Tutor, from http://www.lextutor.ca
Collier, A. (1994). A system for automating concordance line selection. Paper presented at the NeMLaP Conference, Manchester.
Jeaco, S. (2017a). Concordancing Lexical Primings. In M. Pace-Sigge & K. J. Patterson (Eds.), Lexical Priming: Applications and Advances (pp. 273-296). Amsterdam: John Benjamins.
Jeaco, S. (2017b). Helping Language Learners Put Concordance Data in Context: Concordance Cards in The Prime Machine. International Journal of Computer-Assisted Language Learning and Teaching, 7(2), 22-39.
Johns, T. (1986). Micro-concord: A language learner's research tool. System, 14(2), 151-162.
Kilgarriff, A., Rychly, P., Smrz, P., & Tugwell, D. (2004). The Sketch Engine. Paper presented at the 2003 International Conference on Natural Language Processing and Knowledge Engineering, Beijing.
Leńko-Szymańska, A. (2016). CollGram profiles and n-gram frequencies as gauges of phraseological competence in EFL learners at different proficiency levels. Paper presented at the Teaching and Language Corpora Conference, Giessen.
Scott, M. (2010). WordSmith Tools (Version 5.0). Oxford: Oxford University Press.
Scott, M., & Tribble, C. (2006). Textual Patterns: Key Words and Corpus Analysis in Language Education. Amsterdam: John Benjamins.
Yoon, C. (2011). Concordancing in L2 writing class: An overview of research and issues. Journal of English for Academic Purposes, 10(3), 130-139.
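The abstract does not state which keyness statistic The Prime Machine uses for the key word generation mentioned in its feature list; the sketch below therefore illustrates one common choice, the log-likelihood ratio, on toy data:

```python
# Minimal sketch of key word extraction with the log-likelihood ratio, a
# common keyness statistic (the abstract does not state which statistic
# The Prime Machine itself uses).
import math
from collections import Counter

def log_likelihood(a, b, total_a, total_b):
    """Keyness of a word with frequency a in the study corpus (size total_a)
    and frequency b in the reference corpus (size total_b)."""
    e1 = total_a * (a + b) / (total_a + total_b)
    e2 = total_b * (a + b) / (total_a + total_b)
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

study = Counter("the diy corpus tokens go here".split())          # toy data
reference = Counter("the online reference corpus tokens go here".split())
n_study, n_ref = sum(study.values()), sum(reference.values())

keyness = {w: log_likelihood(study[w], reference[w], n_study, n_ref) for w in study}
for word, ll in sorted(keyness.items(), key=lambda x: -x[1])[:10]:
    print(f"{ll:.2f}\t{word}")
```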
A case study on the features of Spoken Academic English: “a/an+N+of construction”
ABSTRACT. "Academic English" constitutes a variant of English with properties distinctive enough to describe it as a register (in theoretical linguistics) and to teach it as a special case of English for Specific Purposes (in applied linguistics) (cf. Brisk & Jeffries, 2008; Bailey, 2012). Although theoretical researchers are, on average, more inclusive of spoken language, researchers such as Biber and Gray still equate Academic English with written academic prose in their investigations (Biber & Gray, 2016). Where spoken language is included, the definition of "Academic English" is often too broad, covering any type of classroom discourse, for instance in the Corpus of English as Lingua Franca in Academic Settings or in the Michigan Corpus of Spoken Academic English. With the increasing popularity of Massive Open Online Courses (MOOCs), spoken content in Academic English is becoming more visible (see Bailey, 2012), making it a distinct variant of Academic English that deserves more attention in research. In this paper, an empirical study is conducted in which a MOOC corpus (11,259,772 tokens from 124 courses across 5 disciplines) is compared with two reference corpora, COCA Spoken and COCA Academic Journals, focusing on the 'a/an+N+of' construction. We also compare and determine which collocates are enclosed within this collocational framework (Renouf & Sinclair, 1991) and which of them constitute the distinctive properties of Spoken Academic English.
This approach shows that the 'a/an+N+of' construction in Spoken Academic English embraces both academic and colloquial lexis, used either in a narrow (domain-specific) sense (e.g. 'a flash of' and 'a factor of' in physical science, 'a disease of' and 'a release of' in life science, 'a function of' and 'a sum of' in computer science, 'a payoff of' and 'a breach of' in social science, 'a kind of' and 'a type of' in arts and humanities) or in a general sense (e.g. a function of, an example of, a set of, a variety of). Items in the former group, such as "flash," "factor," "disease," "sum," and "payoff," are more restricted to academic vocabulary; items in the latter group, such as "function," "example," and "set," tend to occur more in general language. Another interesting result is that Spoken Academic English tends to involve more vague lexis (e.g. a lot of, a couple of, a bunch of, a kind of), which is used surprisingly more often in arts and humanities than in the other disciplines, and more prevalently in the spoken than in the written academic context, since the 'a ... of' framework is highly useful for quantifying and categorizing (Kennedy, 1987; Channell, 1994). In conclusion, the research shows that Spoken Academic English can partly be analyzed as a mixture of features from Written Academic English on the one hand and general spoken English on the other; however, it is expected to have properties that are unique to it, due to the specific functional constraints of the contexts in which it is used.
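A minimal sketch of retrieving candidate 'a/an + N + of' frames with a regular expression over toy text; a full study such as the one above would additionally check the part of speech of the noun slot, which this sketch does not do:

```python
# Minimal sketch: retrieve candidate "a/an + N + of" frames with a regular
# expression and count the middle slot (illustrative; a full study would
# additionally check the part of speech of the middle word).
import re
from collections import Counter

text = """It gives us a sense of scale. That is a kind of shortcut,
and a function of the input. We ran a couple of experiments."""   # toy data

pattern = re.compile(r"\ban?\s+([a-z]+)\s+of\b", re.IGNORECASE)
slot_counts = Counter(m.group(1).lower() for m in pattern.finditer(text))

for noun, freq in slot_counts.most_common():
    print(f"{freq}\ta/an {noun} of")
```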
A Corpus-based Study on English Adjectives Ending with Suffix –ly
ABSTRACT. This study analyzes the usage, grammatical patterns and collocations of adjectives ending with the suffix -ly as used in intermediate-level textbooks, with the help of the corpus tool AntConc. It also draws on the Corpus of Contemporary American English (COCA) to gain insight into the overall usage, grammatical patterns, collocations and discourse functions of these adjectives, and considers ESL/EFL learners' perspectives on learning adjectives ending in -ly when grammatical competence is taught through textbooks. The study further aims to highlight the properties, semantic meanings, formation rules and productivity of adjectives ending with the suffix -ly. To achieve these objectives, the researcher compiled a corpus of the textbooks used for teaching English as a second/foreign language at the intermediate level in Pakistan. AntConc was used to explore the compiled corpus and identify the various properties of the -ly adjectives used in these textbooks, and COCA was consulted to further explore different aspects of these adjectives.
Lexical characterization of semi-popularization articles on agricultural topics
ABSTRACT. This paper reports on a lexical characterization of a corpus of semi-popularization articles (Muñoz, 2015) in agriculture, conducted as one component of a larger project aimed at making a keyword list for undergraduate agriculture students in Japan. Newsletter articles on a range of agriculture-related topics were collected from various universities' websites and science news websites for general readers. A million-word corpus was compiled from those articles whose topics could be categorized into any of six sub-disciplinary categories, corresponding to the six existing agricultural departments of a national university. The sorting was done manually by raters who were familiar with the six departments and had been trained through several author-conducted sessions. The coverage of the GSL, the AWL, and other words in the entire corpus was 76.6%, 7.0%, and 16.4%, respectively: AWL coverage was small, and there was a relatively large proportion of words not listed in the GSL or AWL, similar to Muñoz's (2015) results with semi-popularization articles on corn production. The word list of each departmental subcorpus included many words specific to one department as well as some words common to several departments. Furthermore, a qualitative examination of the corpus revealed that some words included in the GSL or AWL are used with a meaning specific to a departmental subcorpus, a phenomenon pointed out by Hyland and Tse (2007). Thus, even semi-popularization articles, written for general readers unlike research articles, showed many features specific to a particular department, which suggests a need for particular treatment of these words and their collocations when designing the keyword list. On the other hand, the commonality found among several agricultural departmental subcorpora indicates that some words may warrant treatment as basic vocabulary of some "departmental groups" or even of "agriculture as a whole". (Poster)
References
Hyland, K., & Tse, P. (2007). Is There an “Academic Vocabulary”? TESOL Quarterly, 41(2), 235–253. https://doi.org/10.1002/j.1545-7249.2007.tb00058.x
Muñoz, V. L. (2015). The vocabulary of agriculture semi-popularization articles in English: A corpus-based study. English for Specific Purposes, 39, 26–44. https://doi.org/10.1016/j.esp.2015.04.001
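A minimal sketch of the word-list coverage computation reported above, assuming the GSL and AWL are available as plain one-word-per-line files (file names are hypothetical) and ignoring the word-family and lemma handling that a full profiling tool would include:

```python
# Minimal sketch of word-list coverage: the share of corpus tokens covered by
# the GSL, by the AWL, and by neither list (illustrative; assumes the lists
# are available as plain one-word-per-line files).
import re
from pathlib import Path

def load_list(path):
    return set(Path(path).read_text(encoding="utf-8").lower().split())

gsl = load_list("gsl.txt")          # hypothetical file names
awl = load_list("awl.txt")

tokens = re.findall(r"[a-z'-]+",
                    Path("agri_corpus.txt").read_text(encoding="utf-8").lower())
total = len(tokens)

gsl_hits = sum(1 for t in tokens if t in gsl)
awl_hits = sum(1 for t in tokens if t not in gsl and t in awl)
other = total - gsl_hits - awl_hits

print(f"GSL {100*gsl_hits/total:.1f}%  AWL {100*awl_hits/total:.1f}%  "
      f"other {100*other/total:.1f}%")
```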
Using Corpus-derived Co-occurrence Information to Extract Terminology
ABSTRACT. We report an approach for extracting terminology from domain-specific corpora. Our approach was developed based on Yarowsky's (1995) 'one sense per collocation' notion. By comparing target words' co-occurring words in general-purpose and domain-specific corpora, our approach can effectively distinguish terms from non-terms, which offers valuable information to both lexicographers and ESP instructors.
Our approach works as follows. First, for a target word, we extract words likely to co-occur with it from both general-purpose and domain-specific corpora. These collocates are collected based on measures of co-occurrence frequency and mutual information (Church & Hanks, 1990). Next, using the collocates, we identify the target word's 'related words' in texts. Two words are regarded as related items if they share many collocates (e.g. 'result' and 'finding', which both collocate with 'contradictory', 'consistent', and 'confirm'). Finally, by comparing words' 'related words' in the two corpora, we calculate a CWRS (comparative word relatedness score) for each target. The word 'translation', for example, was found to be a medical term because it had different related words in medical and general-purpose texts and, consequently, a low CWRS. We evaluated our approach by examining nouns extracted from the BNC and our 5.5-million-token medical corpus. Fifty nouns with low CWRS were selected and checked in a medical term dictionary. Among them, 34 (approximately 70%) were confirmed to be medical terms. In particular, our approach was found to complement previous corpus comparison approaches (e.g. Chung, 2004), as terms that do not show relatively higher frequencies in medical texts could still be effectively identified (e.g. 'reconstruction' and 'bed'). In our paper we will also demonstrate a tool which can visually represent the general-purpose and domain-specific related words of any word that users target. The tool will hopefully enable lexicographers and ESP instructors to discover more domain-specific word senses effectively.
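The exact CWRS formula is not given in the abstract; the sketch below is a simplified proxy that ranks a target word's collocates by PMI in a general and a domain corpus and measures the overlap of the two collocate sets, with toy sentences standing in for the BNC and the medical corpus:

```python
# Minimal sketch: rank a target word's collocates by PMI in a general and a
# domain corpus, then use the overlap of the two collocate sets as a rough
# signal of sense stability (a simplified proxy, not the authors' CWRS).
import math
from collections import Counter

def pmi_collocates(sentences, target, top_n=20):
    word_freq, pair_freq, n = Counter(), Counter(), 0
    for sent in sentences:
        tokens = sent.lower().split()
        n += len(tokens)
        word_freq.update(tokens)
        if target in tokens:
            pair_freq.update(t for t in tokens if t != target)
    scores = {w: math.log((co * n) / (word_freq[target] * word_freq[w]))
              for w, co in pair_freq.items()}
    return {w for w, _ in sorted(scores.items(), key=lambda x: -x[1])[:top_n]}

general = ["the translation of the novel was praised", "a faithful translation"]
medical = ["protein translation is initiated at the ribosome",
           "translation of mrna into protein"]

g = pmi_collocates(general, "translation")
m = pmi_collocates(medical, "translation")
overlap = len(g & m) / len(g | m) if g | m else 0.0
print(f"collocate overlap = {overlap:.2f}  (low overlap suggests a domain-specific sense)")
```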
Dialogue Act Annotation and Identification in a Japanese Multi-party Conversation Corpus
ABSTRACT. In multi-party conversation understanding, dialogue acts provide key information that indicates a role of an utterance and a relationship between utterances. Therefore, many researchers have utilized dialogue acts for understanding multi-party conversations in various NLP tasks, such as dialogue systems and summarization.
In this paper, we describe the following three tasks: dialogue act annotation, dialogue act identification, and validation of our dialogue act annotation for another task.
We annotate dialogue act tags in the Kyutech corpus, which contains Japanese multi-party conversations for a decision-making task. We employ ISO 24617-2, a standard annotation scheme for dialogue acts, as the tag set.
Also, we propose a method that estimates a suitable dialogue act tag for each utterance using SVMs. We use not only linguistic features but also audio-visual features such as the acoustic information of utterances and the body pose information.
We also evaluate the effectiveness of the dialogue act tags through conversation summarization. We compare the performance of a summarization method using dialogue acts and a method without dialogue acts.
The contributions of this paper are as follows:
- We annotate dialogue acts for each utterance in the Kyutech corpus on the basis of ISO standard 24617-2. We will release our dialogue act tags for each utterance in the Kyutech corpus in the near future.
- We report a method with verbal and non-verbal features for the dialogue act classification. The experimental results show that the acoustic features such as pitch and power values of utterances improve the performance of our method.
- We show that our dialogue act annotation is effective for conversation summarization.
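A minimal sketch of combining verbal (bag-of-words) and non-verbal (acoustic) features in an SVM classifier, as described above; the utterances, acoustic values and simplified tag set are toy stand-ins, not the actual Kyutech corpus features:

```python
# Minimal sketch: concatenate simple linguistic (bag-of-words) and acoustic
# features (e.g. mean pitch, mean power) and train an SVM dialogue act
# classifier (illustrative; not the actual Kyutech corpus feature set).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

utterances = ["shall we start with the first topic",
              "yes i agree with that proposal"]
acoustic = np.array([[180.0, 0.62],     # toy [mean_pitch_hz, mean_power]
                     [150.0, 0.48]])
labels = ["question", "agreement"]      # simplified tag set for illustration

vectorizer = CountVectorizer()
X_text = vectorizer.fit_transform(utterances).toarray()
X = np.hstack([X_text, acoustic])       # verbal + non-verbal features

clf = SVC(kernel="linear").fit(X, labels)

new_text = vectorizer.transform(["do you agree with the plan"]).toarray()
new_X = np.hstack([new_text, np.array([[175.0, 0.55]])])
print(clf.predict(new_X))
```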
ABSTRACT. Through an extended case study, this paper reveals the metaphorical skeletons hidden in the statistical cupboards of selective reporting, casting a new light on inter-annotator agreement (IAA) measures. Simply put, annotation involves assigning labels to language items. Researchers cannot measure the correctness of annotations directly and so resort to reliability as a proxy variable. The reliability of annotations is evaluated through various IAA measures. The underlying assumptions are that a lack of IAA rules out validity and that high IAA implies validity. Therefore, developers of annotated corpora aim to achieve high levels of IAA.
Strategic decisions and their impacts on IAA were tracked in an extended corpus study of rhetorical functions in scientific research abstracts. A search of the research notes of the principal investigator returned 142 notes tagged with #IAA that were written between 2013 and 2017. The strategic decisions and their actual or perceived impacts on IAA were logged. A root cause analysis was also conducted to identify the causal factors that reduce IAA. The results show numerous strategic decisions, which, using template analysis, were grouped into three categories, namely methodological, statistical and rhetorical. High IAA may be attributed to sound or cogent methodological choices, but it could also be due to manipulating the statistical smoke and rhetorical mirrors.
With no standardized convention for reporting IAA in corpus linguistics, researchers can select statistics that portray IAA more or less positively. This bias is reflected in the rhetorical choices regarding factors such as granularity (e.g. categorization and ontological units) and statistical analysis (e.g. sampling fraction, tests reported and treatment of outliers). The metaphorical skeletons hidden in statistical cupboards of selective reporting will be revealed, casting a new light on IAA measures of agreement and disagreement.
Practical guidelines on best practice will be shared.
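A minimal sketch of one of the IAA statistics at stake, Cohen's kappa for two annotators labelling the same items; the toy labels below are illustrative only:

```python
# Minimal sketch of Cohen's kappa for two annotators over the same items
# (one of the IAA statistics whose reporting choices the abstract discusses).
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c]
                   for c in set(labels_a) | set(labels_b)) / n**2
    return (observed - expected) / (1 - expected)

ann_a = ["background", "method", "result", "result", "conclusion"]
ann_b = ["background", "method", "method", "result", "conclusion"]
print(f"kappa = {cohen_kappa(ann_a, ann_b):.2f}")
```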
Discourse function of après in French informal conversations
ABSTRACT. In this presentation, we describe the discursive functions of après (after in English) observed in an informal conversational corpus of native speakers of French, collected by Tokyo University of Foreign Studies in France in 2011. The corpus contains 20 hours (about 400,000 words) of free conversations between two students aged from 18 to 27 years old who know each other well.
In normative grammar, après is considered primarily a preposition (e.g. après le repas / after the meal), which can also be used as an adverb (e.g. je viens après / I come later). Its usages as a discourse marker are never mentioned in such grammars. In French linguistics, there are some diachronic studies examining the evolution of après (e.g. Fagard 2003; Amiot & Mulder 2015), but from a synchronic perspective its discursive functions have not yet been investigated sufficiently.
Having analyzed more than 1,300 occurrences of après attested in our corpus, we note that its usages as a discourse marker are very frequent in informal conversation and that they play an important role, especially in the discourse structuring of dialogues. We distinguish two different types. The first contributes to text coherence, appearing at the left periphery of utterances. The other, placed at the right periphery, expresses various interpersonal modalities. This result confirms the hypothesis of Beeching and Detges (2014: 11) on the functional asymmetry of the left and right peripheries. Based on real examples, we describe both types of usage in further detail in order to explain the specific character of après.
Finally, comparing formal written texts and informal conversation, we show that the discourse marker usages of après are specific to informal conversation and do not appear in formal written texts. For a better understanding of the polysemic character of après, we need to take into account the particularities of each type of data.
A Usage Based Analysis of Bangla Discourse Markers in Mass Media Texts
ABSTRACT. Discourse markers (DM) are functional units of language which are more relevant in pragmatic contexts, as they carry reduced semantic content (Levinson 1983; Zwicky 1985). In this presentation we examine DM (conjunctions, interjections, emphasizers and particles) in Bangla, a highly inflectional Indo-Aryan language. The analysis is based on a set of Bangla mass media texts published between 1981 and 1995, collected from different sources such as newspapers, magazines, advertisements and leaflets, and then compiled and developed by TDIL (Technology Development for Indian Languages). The collection includes different text types such as newspaper reporting, editorials, stories, advertisements and articles.
The Bangla word classification provided by H. R. Thompson (2012) is used as a reference to find DM in the corpus. The prime focus is on the overall discourse structure of the texts, after which the discourse-marking elements are investigated. Four categories in which discourse-marking elements can be found are considered, namely conjunctions, interjections, emphasizers and particles. Emphasizers and particles are more commonly used for discourse marking, whereas conjunctions and interjections also have other grammatical properties. By applying corpus techniques such as frequency counts and concordance analysis, the study aims to describe the frequency and usage (both structural and contextual) of Bangla DM. As no exhaustive research has been done on Bangla DM, this study will contribute to a better understanding of their usage and of how they are treated in the mass media genre.
Some of the frequent DM, present in the corpus, are shown in the table below:
Name of the Category | List of DM
Conjunction | Ar 'and, else, more', kintu 'but', tAi 'therefore', nAki 'or, alternatively', bale 'because of'
Interjection | AcchA 'yes, well', AhA 'wow, expression for consolation', bAh 'expression of praising', Are 'expression', omA 'expression of astonishment'
Extending Corpus-Based Discourse Analysis for Exploring Japanese Social Media
ABSTRACT. The Fukushima Daiichi nuclear disaster in March 2011 led to discussions about "energy transition" and the phasing out of nuclear energy throughout the world, a phenomenon called the Fukushima Effect (cf. Gono'i 2015). Different political camps take different stances towards the topic, and unsurprisingly it has been widely discussed in the run-up to political elections. Previous research (e.g. Yoshino 2013; Abe 2015) showed that Japanese newspapers can be categorized into pro-nuclear ones on the one hand (such as the conservative newspapers Sankei and Yomiuri) and anti-nuclear ones on the other (such as the left-leaning Asahi).
At the same time, there is no systematic corpus-based study of the impact that the Fukushima incident had on social media. The paper at hand thus provides an in-depth analysis of Japanese social media data shortly before and in the aftermath of 3/11. In particular, we triangulate the semantics and interplay of the topic nodes Fukushima (福島), nuclear phase-out (脱原発), and elections (選挙) on Twitter. The data is preprocessed to reduce the amount of noise omnipresent in social media data (see Schaefer et al. 2017); the final corpus consists of roughly 300,000,000 original posts.
Methodologically, we build on corpus-based discourse analysis (CDA) (Baker 2006), but extend it with two novel techniques first sketched in Heinrich et al. (2018) and refined here. Firstly, the bottleneck of CDA is the amount of hermeneutic interpretation necessary to analyze the whole corpus. In order to facilitate this task, we project high-dimensional word embeddings (Mikolov et al. 2013), which we created for the specific linguistic register at hand (Japanese social media data), into a two-dimensional, semantically structured space. Secondly, and more importantly, we triangulate the semantics of discourse nodes and their collocates by looking at higher-order collocates. This way we can explore the interplay of discourses, such as the phase-out of nuclear energy in the context of elections.
We demonstrate the validity of our approach by reporting and discussing several empirical findings: (1) the nuclear phase-out debate entered Japanese Twitter only several weeks after 3/11; (2) its salience is highly volatile and correlates, among other things, with elections; (3) discourse attitudes associated with the Fukushima incident spilled over into more general discussions.
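A minimal sketch of the higher-order collocation idea described above: collect the collocates of a discourse node within a co-occurrence window, then collect the collocates of those collocates. The toy tokenised posts are illustrative, and the embedding projection step is not reproduced here:

```python
# Minimal sketch of higher-order collocation: first collect the collocates of
# a discourse node within a co-occurrence window, then collect the collocates
# of those collocates (illustrative; the embedding projection step used in
# the study is not reproduced here).
from collections import Counter

def collocates(posts, node, window=4, top_n=5):
    counts = Counter()
    for tokens in posts:
        for i, tok in enumerate(tokens):
            if tok == node:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(t for t in tokens[lo:hi] if t != node)
    return [w for w, _ in counts.most_common(top_n)]

posts = [                                   # toy tokenised tweets
    "福島 の 原発 事故 と 脱原発 の 議論".split(),
    "脱原発 を 掲げる 候補 が 選挙 で 勝利".split(),
    "選挙 の 争点 は 脱原発 と 経済".split(),
]

first_order = collocates(posts, "脱原発")
second_order = {c: collocates(posts, c) for c in first_order}
print(first_order)
print(second_order)
```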
A Topic-aware Comparable Corpus of Chinese Variations
ABSTRACT. With its economic and cultural significance, the notion of 'World Chineses' has been widely recognized in recent years, and studies of the variation of World Chineses at different linguistic levels (e.g., lexical and grammatical) are beginning to unfold [1]. However, empirical research as well as computational linguistic applications are hindered by the lack of dynamically updated comparable corpora of different varieties of Mandarin Chinese. This study aims to fill the gap by constructing a comparable corpus of Mainland Chinese Mandarin and Taiwanese Mandarin (CCCV) from social media in Mainland China and Taiwan, respectively.
In addition to its size and its dynamic, longitudinal Web-as-Corpus sampling, this corpus features three aspects. (1) Short-text orientation: texts as well as meta-information are mainly crawled via available APIs from social media, which abound in short texts. (2) Hashtag as common topic: the prevalent use of hashtags among internet users lends itself well to being used as common ground for comparing discrete sentences, and we propose using this ubiquitous feature as a tool to study the variation exhibited between Mainland Chinese Mandarin and Taiwanese Mandarin. Two internet forums, each popular in its respective place, are used to find sentences that share a common hashtag; these hashtags are thus used as a means of grouping sentences together based on a common feature. (3) Machine alignment: after the initial pooling of sentences based on hashtags, a mixture of similarity measures is used to pair a sentence from the Mainland Chinese website with one from the Taiwanese social media site when the pair meets a certain similarity threshold [3]. The ultimate goal is to provide a corpus resource that, given a particular query, returns Mainland Chinese-Taiwanese Mandarin short-text pairs that meet a certain similarity threshold. This corpus can thus be used to compare language variation, and can also serve as training data for a short-text/phrase-level sequence-to-sequence neural network model.
References
[1] Lin, Jingxia, Dingxu Shi, Menghan Jiang and Chu-Ren Huang. Variations in World Chineses. In: Huang, Jing-Schmidt and Meisterernst (eds.), The Routledge Handbook of Chinese Applied Linguistics, 2018.
[2] Nguyen, Dong, Doğruöz, A. Seza, Rosé, Carolyn P. and de Jong, Franciska. Computational sociolinguistics: A survey. Computational Linguistics 42(3), 2016.
[3] Kenter, Tom and Maarten de Rijke. Short text similarity with word embeddings. In: Proceedings of the 24th ACM International Conference on Information and Knowledge Management. ACM, 2015.
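A minimal sketch of the alignment step described in the abstract: pool posts that share a hashtag, then keep cross-corpus pairs whose similarity exceeds a threshold. Bag-of-words cosine similarity is used here for simplicity, whereas the corpus itself relies on embedding-based measures [3]; the posts and threshold are toy assumptions:

```python
# Minimal sketch of the alignment step: pool posts that share a hashtag, then
# keep Mainland/Taiwan pairs whose bag-of-words cosine similarity exceeds a
# threshold (illustrative; the corpus itself uses embedding-based measures).
import math
import re
from collections import Counter

def cosine(a, b):
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def hashtags(post):
    return set(re.findall(r"#(\w+)", post))

mainland = ["#NBA 今晚 的 比赛 很 精彩", "#美食 这家 店 的 牛肉面 很 好吃"]
taiwan = ["#NBA 今晚 的 比賽 很 精彩", "#美食 這家 店 的 牛肉麵 很 好吃"]

THRESHOLD = 0.3
for m in mainland:
    for t in taiwan:
        if hashtags(m) & hashtags(t):          # shared hashtag as common topic
            sim = cosine(m.split()[1:], t.split()[1:])   # drop the hashtag token
            if sim >= THRESHOLD:
                print(f"{sim:.2f}\t{m}\t||\t{t}")
```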
Towards the First Online Indonesian National Corpus
ABSTRACT. The first corpus work on the Indonesian language took place in the 1960s. After that initial work, however, no further attempt was made for the next quarter of a century. It took even longer, up to half a century, for Badan Pengembangan dan Pembinaan Bahasa (the Language Development and Fostering Agency) to initiate work on an Indonesian language corpus: Badan Bahasa launched the project called KOIN (Korpus Indonesia) in 2016. In the first phase of this corpus project, the data were collected from only one genre, namely academic texts. This is because the Indonesian language is actually the second language of most Indonesian people, and Badan Pengembangan dan Pembinaan Bahasa has always been associated with products of standard language use. Consequently, this corpus is expected to favour standard language. In the future, however, other text genres will be included in the corpus. This paper explains the corpus data and the search functions available to online users.
Developing multilingual language learning resources using the CEFR-J
ABSTRACT. After the release of the Common European Framework of Reference for Languages (CEFR) in 2001, a group of researchers have worked on "reference level descriptions" (RLDs) for individual languages. The CEFR provides the common basis for designing teaching syllabuses, preparing teaching materials, and assessing the results of teaching. From learners' perspectives, the CEFR can serve as a common basis for autonomous learning. The crucial point here is how all the existing materials for teaching a particular language should be rearranged and fit into this coherent framework provided by the CEFR. The RLD work is the core of such implementation activities.
In Japan, the CEFR-J project was launched in 2008, and a set of can-do descriptors for 10 CEFR sub-levels (Pre-A1 to B2.2), together with related RLD work including the profiling of vocabulary and grammar, has been developed. In this study, the English resources created for the CEFR-J are applied to prepare teaching resources for other major European as well as Asian languages. To do this, a series of teaching/learning resources, including the CEFR-J Wordlist and Phrase List originally developed for English, was first translated into 26 other languages using Google Translate. Second, these translated word and phrase lists were manually corrected by a team of language experts. This automatic conversion from English into the other languages was evaluated against human judgements as well as frequency analyses from web corpora.
Three types of e-learning resources were created based on the wordlists and the phrase lists for teaching those languages to undergraduate students: (1) a flash-card app for learning vocabulary classified by thematic topics and CEFR levels; (2) a web-based sentence pattern writing tool for learning grammar and vocabulary, and (3) a web-based spoken and written production corpus collection tool.
A Developmental Study of Prepositional Phrases in English L1 Students’ Academic Writing
ABSTRACT. Academic writing has been documented as undergoing a change from clausal embedding to phrasal embedding, with the dense use of nominal phrasal features (Biber et al., 2011). Following this trend, most studies have investigated the use of these nominal phrasal features, such as nominalizations and nouns as nominal premodifiers, to explore L2 writing development. Among them, however, only a few put specific focus on prepositional phrases (PPs), with even fewer on their use in English L1 students' writing, which is hypothesized to show a trend of development over the school years (Biber et al., 2011). Based on a sub-corpus of BAWE (the British Academic Written English corpus), this study examines the development of the complexity of PPs in English L1 writing across four school years to test the validity of the hypothesized developmental stages. More specifically, the internal structures, syntactic functions and semantic functions of PPs are explored using various corpus analysis tools. Findings are expected to show that: 1) the internal structure of PPs becomes more complex, with more varied types of prepositions and prepositional complements; 2) PPs play a variety of different structural roles, extending from the usual adverbials to nominal post-modifiers; and 3) PPs are used to convey more semantic information, extending from concrete to abstract meaning. In sum, this study will set a baseline for the exploration of writing development in EFL environments and should therefore be of significance for English for Academic Purposes instruction.
Deviant Uses of Metadiscursive Nouns in NNS Essays
ABSTRACT. Non-native speaker (NNS) writing is often perceived as different from native speaker (NS) writing, but it is considered important for NNS students to follow first language norms in argumentation essays. In an attempt to identify sources of perceived differences between NNS and NS writing, the presenter compared NNS essays, drawn from the Japanese subcorpus of the International Corpus of Learner English (JICLE), and NS essays, drawn from the US subcorpus of the Louvain Corpus of Native English (US), using contrastive interlanguage analysis (Granger, 1996). The examination of the texts focused on the use of metadiscursive nouns, which are abstract and general meaning nouns. These can mark discourse in English texts and help to structure or comment on the discourse by referring to their meanings expressed in the text where they occur. Of the findings of the study, this presentation focuses on nouns that occurred with significant frequency differences, either significantly more or less, in JICLE and in US. Using the shell noun conceptual framework (Schmid, 2000), various aspects of the use of nouns that influenced these frequency differences are discussed through an examination of noun lexicalisation patterns in relation to syntactic patterns where the nouns occurred. The perceived differences in JICLE and US were mainly accounted for by such factors as: the lexicalisation of nouns in anaphoric functions; the inter-segmental meanings of nouns; and, preferred discourse construction types. The findings will be useful for designing syllabi and developing course materials for academic writing for advanced-level NNS students and specific instructional strategies for the use of metadiscursive nouns will also be given.
A Corpus-based Critical Discourse Analysis of Racial Stereotyping in American Newspapers
ABSTRACT. This study investigated the manipulative role of the media in developing and circulating racial and ethnic stereotypes. A corpus-based approach was deployed to analyse newspaper coverage of American mass shootings. A comparison was made between news reports on perpetrators of white descent and on perpetrators of other ethnic backgrounds, by exploring lexical items, particularly adjectives. Two corpora of approximately 27,000 words were compiled based on ten mass shooting incidents, five carried out by white shooters and five by non-white shooters. The corpora were built from news articles taken from major American news outlets such as The Guardian US Edition, The New York Times, New York Daily and The Washington Post. The data were analysed in light of the CDA approach presented by Fairclough and the concept of the "Other" put forth by Edward Said. The study combines two methodological approaches, Critical Discourse Analysis and corpus linguistics, in order to identify the discursive practices used to construct the identity of the perpetrators. The two corpora were analysed with two software packages, WordSmith Tools and LancsBox: through WordSmith, wordlists and concordances of the adjectives were located, and these adjectives were then extracted using LancsBox. The findings revealed that non-white shooters are almost always described through their ethnicity, with adjectives such as Korean, Afghan, Bosnian, Saudi and Vietnamese, whereas race is deliberately omitted when the perpetrator is a white man. The findings further revealed that lexical items with negative connotations such as "terrorist", "extremist" and "radical" are mostly used for Muslim perpetrators, and to some extent for perpetrators of immigrant status, but never for white mass murderers. White killers are instead justified or humanized by presenting mental health issues as the root of the problem.