
10:00-12:00 Session 1B: [Workshop] Taking DDL Online: A taste of a SPOC for data-driven learning


This workshop outlines the rationale, design and implementation of a small private online course (SPOC) on data-driven learning (DDL, Johns, 1991), focusing on L2 error correction in postgraduate academic writing and involving over 300 registered users. I aim to discuss the affordances of using a SPOC platform (namely EdX) for online DDL training, describing activities covering a range of useful strategies for DDL-led error detection and correction. In this workshop, attendees will complete a module on the advantages of corpora over e-dictionaries and translation websites, before trialling a few activities from a module introducing absolute beginners to the use of the SketchEngine corpus query platform. Following this, I will present learners’ usage of the SPOC platform and their quantitative and qualitative perceptions of the course, and the presenter and audience will discuss the conceptual and methodological challenges involved in taking DDL instruction online.


To save time on the day, please register for the online course prior to the workshop by following the link below:


Please click ‘register’ in the top right-hand corner to register your details, then ‘enroll now’.  If you have any trouble, please email me at p.cros@uq.edu.au.


Dr. Peter Crosthwaite is a senior lecturer in corpus linguistics and EAP/ESP at the University of Queensland, Australia.  He has published in 23 Web of Science-indexed journals, is the author of Learning the Language of Dentistry in the John Benjamins Studies in Corpus Linguistics series, and is the editor of Data-driven learning for the next generation: Corpora and DDL for younger learners with Routledge.

13:00-15:00 Session 2A: [Workshop] Corpus-based analyses of registers and register variation: Theoretical background; methodological issues; major research findings


This workshop will explore corpus-based approaches to the analysis of registers and register variation.  The class will begin with a brief discussion of theoretical concepts, including a comparison/contrast of the constructs of ‘register’, ‘genre’, ‘style’ and ‘text type’.   ‘Registers’ are defined as culturally-recognized categories that have distinctive situational and linguistic characteristics. The workshop will then discuss the ‘register triangle’ (Situation-Function-Linguistic form) and the functional basis of linguistic register variation.  Several examples of linguistic variation will be provided to illustrate the centrality of register as a predictor of linguistic variation, making the case that any analysis of linguistic variation will be incomplete without consideration of register as a possible predictor.  This part will end with discussion of a methodological issue:  the differences between text-linguistic vs variationist vs whole-corpus research designs.  This part of the workshop will focus on the types of research questions that can be appropriately investigated using each research design, and it will present several case studies illustrating the application – and misapplication – of each design type for the study of register variation.

The second part of the workshop will focus on three sets of methodological issues relating to Multi-Dimensional (MD) Analysis.  The first set of issues relates to the methodological decisions made by the researcher in carrying out an MD analysis, including the choice of linguistic features to include in the analysis; determining the number of factors to include in the final analysis; and functional interpretation of factors.  The second set of issues relates to the extent to which registers are well-delimited in their linguistic characteristics, and the possibility of analyzing ‘text-types’ (text categories that are well defined linguistically rather than being defined as culturally-recognized categories).  Quantitative methods for carrying out a text-type analysis will be discussed.  And finally, the workshop will discuss the possibility of analyzing situational variation in continuous, quantitative terms, leading to the possibility of situational text types that complement register descriptions.


Douglas Biber is Regents' Professor of English (Applied Linguistics) at Northern Arizona University.  His research efforts have focused on corpus linguistics, English grammar, and register variation (in English and cross-linguistic; synchronic and diachronic).  He has published over 240 research articles and 25 books and monographs, including primary research studies as well as textbooks.  He is widely known for his work on the corpus-based Longman Grammar of Spoken and Written English (1999) and for the development of ‘Multi-Dimensional Analysis’ (a research approach for the study of register variation), described in earlier books published by Cambridge University Press (1988, 1995, 1998).  More recently, he co-authored a textbook on Register, Genre, and Style [2nd edition] (Cambridge, 2019), co-edited the new Cambridge Handbook of English Corpus Linguistics (2015), and co-authored research monographs on grammatical complexity in written academic English (Cambridge, 2016) and register variation on the web (Cambridge, 2018).

Location: IBK Hall
13:00-15:00 Session 2C: [Workshop] Teaching academic writing with the help of MICUSP


In this workshop, we will discuss ways in which corpora can be beneficially used in the context of teaching English for academic purposes (EAP). We will start with a brief overview of relevant direct and indirect pedagogical corpus applications and then explore together how a freely available online corpus can be accessed by EAP instructors to create materials for the academic writing classroom.

In our explorations, we will focus on the Michigan Corpus of Upper-level Student Papers (MICUSP), a 2.6-million-word corpus of A-graded student writing samples from 16 different disciplines, but also refer to other related resources and their pedagogical usefulness along the way. We will explore the browse and search functions in the MICUSP Simple interface (http://eli-corpus.lsa.umich.edu/) and look at a few concrete examples of MICUSP-derived EAP teaching materials.


Ute Römer is an Associate Professor in the Department of Applied Linguistics and ESL at Georgia State University. Prior to this, she was director of the Applied Corpus Linguistics unit at the University of Michigan English Language Institute where she managed the Michigan Corpus of Academic Spoken English (MICASE) and Michigan Corpus of Upper-level Student Papers (MICUSP) projects. Her research interests include phraseology, second language acquisition, academic discourse analysis, and the application of corpora in language learning and teaching. She serves on a range of editorial boards of professional journals and is General Editor of the Studies in Corpus Linguistics book series.

Location: Helinox Hall
15:15-17:15 Session 3B: [Workshop] Using English-Corpora.org: helps for beginners, insights for advanced users


In this workshop, I will quickly review some of the basic functionality of the BYU corpora (now www.english-corpora.org), including search types (words, phrases, lemmas, part of speech, synonyms, customized wordlists, and more), and general functionality like word frequency, concordances, and collocates. But then I will discuss more advanced topics, such as comparing one section of the corpus to another (e.g. two dialects of English, multiple historical periods, or different genres) to examine variation – in ways that are probably not possible with any other set of corpora. I will also provide hands-on examples of how to create Virtual Corpora on almost any topic, which should be helpful for those who are interested in English for Specific Purposes (ESP).

Finally, I will discuss and provide training on the most recent update to the Corpus of Contemporary American English (COCA), which contains texts up through December 2019 and will be released in February 2020. The updated COCA contains a total of 1.2 billion words of data, including the previous genres of spoken, fiction, magazines, newspapers, and academic, as well as the new genres of TV subtitles, movie subtitles, soap opera transcripts, blogs, and general web pages (10 genres total, with 100-130 million words in each genre). The updated COCA also contains extensive “word-oriented” pages for the top 60,000 words in the corpus, including (for each word) detailed frequency data (including frequency in each of the 10 genres, range, and dispersion), definitions, frequency of each word form for the lemma, morphologically-related words, links to images, audio and video files, translations to 30+ languages (via Google Translate), WordNet data, collocates, related topics, clusters, and concordance lines.

An overarching topic of discussion throughout the different activities in the workshop will be finding the corpora that are best for particular types of research questions or teaching activities. Finally, I look forward to answering any questions that you might have regarding the corpora and their use.


Mark Davies is Professor of Linguistics at Brigham Young University in Provo, Utah, USA. He specializes in corpus design, construction and use, as well as research on historical, dialectal, and genre-based variation in language. He is the creator of the corpora from www.english-corpora.org, which are probably the most widely used corpora in existence.

15:15-17:15 Session 3C: [Workshop] The Prime Machine - DIY text tools for the exploration of style, register and genre in English language and literature


The Prime Machine (tPM) was originally developed to make it easy for English language learners at my institution to notice and explore patterns of language use in corpora (Jeaco, 2017a).  The software can also be used for linguistic research and includes a number of methodological and visual innovations.  Since Version 3, a number of DIY Text Tools are available, allowing users to process their own texts and compare these with the pre-processed online corpora. 

This workshop will provide an overview of the linguistic research-oriented features of tPM, introducing the software’s ease of use for some well-known corpus processes as well as some of its innovative features. 

Specifically, the participants will see and try out:

  • Viewing concordance cards - extended concordance lines with paragraphing (Jeaco, 2017b);
  • Sorting concordance lines using a collocation measure (cf. Collier, 1994) and a cohesion measure (cf. Hoey, 1991);
  • Exploring features of Lexical Priming (Hoey, 2005);
  • Exploring Key Labels (Jeaco, forthcoming);
  • Viewing concordance lines from a DIY corpus side-by-side with results from one of tPM’s online corpora;
  • Generating key words, key key words and key associates (cf. Scott & Tribble, 2006) using any of the online corpora as a reference corpus;
  • Comparing vocabulary wordlist matches (cf. Cobb, 2000) in the DIY corpus against an online reference corpus;
  • Comparing clusters in the DIY corpus against an online reference corpus;
  • Generating average collocation matches in DIY corpus texts, compared against an online reference corpus (cf. Bestgen & Granger, 2014; Leńko-Szymańska, 2016).

The workshop will be designed as a general introduction to corpus methods suitable for small and medium sized collections of texts.  As well as covering the practicalities of hands-on software use, the workshop will discuss some ways these tools relate to research topics in style – through exploring similarities and differences between authors (Mahlberg, 2013) –  and register/genre through exploring similarities and differences between text types and domains (Biber & Conrad, 2009).

Participants are encouraged to bring along small collections of their own plain text English files (collections between 5,000 and 1 million words), and they will also be able to try out the tools using ready-prepared collections of novels and business texts.

The Windows version has been publicly available since May 2018; the MacOS version was released on the website in November 2019.  The software is hosted on a server in Suzhou, China and is available from www.theprimemachine.net.


Bestgen, Y., & Granger, S. (2014). Quantifying the development of phraseological competence in L2 English writing: An automated approach. Journal of Second Language Writing, 26, 28-41.

Biber, D., & Conrad, S. M. (2009). Register, Genre, and Style. Cambridge: Cambridge University Press.

Cobb, T. (2000). The Compleat Lexical Tutor. Retrieved from http://www.lextutor.ca

Collier, A. (1994). A system for automating concordance line selection. Paper presented at the NeMLaP Conference, Manchester.

Hoey, M. (1991). Patterns of Lexis in Text. Oxford: Oxford University Press.

Hoey, M. (2005). Lexical Priming: A New Theory of Words and Language. London: Routledge.

Jeaco, S. (2017a). Concordancing Lexical Primings. In M. Pace-Sigge & K. J. Patterson (Eds.), Lexical Priming: Applications and Advances (pp. 273-296). Amsterdam: John Benjamins.

Jeaco, S. (2017b). Helping Language Learners Put Concordance Data in Context: Concordance Cards in The Prime Machine. International Journal of Computer-Assisted Language Learning and Teaching, 7(2), 22-39.

Jeaco, S. (forthcoming). Calculating and Displaying Key Labels: The texts, sections, authors and neighbourhoods where words and collocations are likely to be prominent. Corpora.

Leńko-Szymańska, A. (2016). CollGram profiles and n-gram frequencies as gauges of phraseological competence in EFL learners at different proficiency levels. Paper presented at the Teaching and Language Corpora Conference, Giessen.

Mahlberg, M. (2013). Corpus Stylistics and Dickens’s Fiction. New York: Routledge.

Scott, M., & Tribble, C. (2006). Textual Patterns: Key Words and Corpus Analysis in Language Education. Amsterdam: John Benjamins.


Stephen Jeaco is an Associate Professor at Xi’an Jiaotong-Liverpool University.  He received his PhD in Linguistics from the University of Liverpool, UK. He is the developer of The Prime Machine, a user-friendly corpus tool based on the theory of Lexical Priming. His main research interests are developing software applications for the construction, manipulation and display of corpus data, as well as evaluating corpus tools with learners across a variety of tasks.

Location: Helinox Hall