Learner corpus research has seen major development since its inception some 25 years ago. Nevertheless, theoretical, methodological and empirical advances have been summarized in the literature only rarely and, in such cases, selectively rather than systematically. To the authors’ knowledge, in fact, there is no meta-analysis to date that summarizes and synthesizes the body of knowledge resulting from learner corpus research in a specific area of study (e.g. English as a Foreign Language learners’ use of collocations or tense, aspect and modality in learner writing). Equally concerning is that relatively little attention has been paid to the state or development of the field’s methodological practices, an unfortunate circumstance given the empirical rigor needed to reliably and accurately make use of corpus data and analyse frequencies of (co-)occurrence (Gries, 2013; Gries, forthcoming; Gries & Deshors, 2014). Progress in any discipline, however, crucially “depends on sound research methods, principled data analysis, and transparent reporting practices” (Plonsky & Gass, 2011: 325). This study thus aims to provide the first empirical assessment of quantitative research methods and study quality in learner corpus research. Study quality is defined rather broadly as “(a) adherence to standards of contextually appropriate methodological rigor in research practices and (b) transparent and complete reporting of such practices” (Plonsky, 2013: 657). Specifically, we systematically review all quantitative, primary studies referenced in the Learner Corpus Bibliography (LCB), a representative bibliography of learner corpus research maintained by the Learner Corpus Association (http://learnercorpusassociation.org), which currently contains approximately 1,180 references. The techniques used to retrieve, code, and analyse this body of primary research are characteristic of research synthesis and meta-analysis.
Following Plonsky (2013), however, this study differs from those traditions of synthetic research in that the focus here is almost exclusively methodological (i.e. the “how” of learner corpus research) rather than substantive (i.e. the “what”). Each reference in the LCB is surveyed using a coding scheme inspired by the protocol developed and first used by Plonsky & Gass (2011) to assess methodological quality in second language acquisition and, more particularly, interaction research. The coding scheme is, however, revised and expanded to account for the methodological characteristics of corpus linguistics. Quantitative studies are coded for over 50 categories representing five dimensions: (a) publication type (i.e. conference paper, book chapter, journal article), (b) research focus (e.g. lexis, grammar), (c) methodological features (e.g. Contrastive Interlanguage Analysis, keyword analysis, error analysis, use of a reference corpus), (d) statistical analyses (e.g. χ², t-test, regression analysis), and (e) reporting practices (e.g. reliability coefficients, means). The 25-year span of research represented in the LCB provides a unique opportunity to examine the resulting data cumulatively and also permits analyses of changes taking place over time in the research and reporting practices of this domain. Preliminary results point to several systematic strengths as well as many flaws, such as the absence of research questions or hypotheses, incomplete and inconsistent reporting practices (e.g. means without standard deviations), and low statistical power (i.e. learner corpus studies generally over-rely on tests of statistical significance such as the χ² test, do not report effect sizes, rarely check or report whether statistical assumptions have been met, and rarely use multivariate analyses).
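One of the reporting flaws noted above (significance tests without effect sizes) is cheap to remedy. As a minimal sketch with invented frequencies, the following pure-Python snippet computes a Pearson χ² statistic for a 2×2 contingency table (e.g. occurrences of a target collocation vs. other bigrams in a learner and a reference corpus) and reports Cramér's V alongside it:

```python
import math

def chi_square(table):
    """Pearson chi-square for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n

def cramers_v(chi2, n, table):
    """Cramer's V effect size: sqrt(chi2 / (n * (min(rows, cols) - 1)))."""
    k = min(len(table), len(table[0]))
    return math.sqrt(chi2 / (n * (k - 1)))

# Invented frequencies for illustration only:
# target collocation vs. all other bigrams in each corpus.
table = [[120, 8800],   # learner corpus
         [310, 9100]]   # reference corpus
chi2, n = chi_square(table)
v = cramers_v(chi2, n, table)
print(f"chi2 = {chi2:.2f}, Cramer's V = {v:.3f}")
```

Note how a highly "significant" χ² can coexist with a small Cramér's V: with large corpora, significance alone says little about the magnitude of a frequency difference, which is precisely why reporting the effect size matters.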
Improvements over time are, however, clearly noted and there are signs that, like other related disciplines, learner corpus research is slowly “undergoing a change to becoming much more empirical, much more rigorous, and much more quantitative/statistical” (Gries, 2013: 287). In addition to providing direction for future research and research practices, the study’s findings will also be discussed and contextualized within the research cultures of corpus linguistics, second language acquisition, and applied linguistics more generally.

References
Gries, S. (2013). Statistical tests for the analysis of learner corpus data. In Diaz-Negrillo, A., Ballier, N. & Thompson, P. (Eds.), Automatic Treatment and Analysis of Learner Corpus Data. Amsterdam & Philadelphia: Benjamins.
Gries, S. (forthcoming). Statistics for learner corpus research. In Granger, S., Gilquin, G. & Meunier, F. (Eds.), The Cambridge Handbook of Learner Corpus Research. Cambridge: Cambridge University Press.
Gries, S., & Deshors, S. (2014). Using regressions to explore deviations between corpus data and a standard/target: Two suggestions. Corpora, 9(1), 109–136.
Plonsky, L. (2013). Study quality in SLA: An assessment of designs, analyses, and reporting practices in quantitative L2 research. Studies in Second Language Acquisition, 35, 655–687.
Plonsky, L., & Gass, S. (2011). Quantitative research methods, study quality, and outcomes: The case of interaction research. Language Learning, 61(2), 325–366.
The main objective of this article is to demonstrate, with the help of learner corpus data, the practical relevance of the phraseological dimension of language for writing assessment in higher education. Phraseological competence is now widely recognized as an important part of fluent and idiomatic language use, but its development has not received the attention it deserves in the CEFR. The study investigates the development of linguistic correlates of syntactic, lexical, and phraseological complexity in learner texts at CEFR levels B2, C1, and C2 and shows that while no measure of syntactic or lexical complexity seems to have an impact on human raters’ overall judgement of writing quality, two measures of phraseological complexity explain 25% of the variance in the data set. Results suggest that incorporating phraseological competence into the scoring rubrics of university entrance language tests would help language test developers strengthen the construct validity of language assessment in higher education. More generally, this study also shows the crucial role that Language for Specific Purposes learner corpora could play in language assessment.
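"Variance explained" here is the R² of a regression of rater judgements on the complexity measures. As a minimal pure-Python illustration (the data points and the single predictor are invented; the study itself used two phraseological measures in a multiple regression), R² can be computed as follows:

```python
def r_squared(x, y):
    """Proportion of variance in y explained by a simple linear fit on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx                 # slope of the least-squares line
    a = my - b * mx               # intercept
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Invented illustration: a phraseological-complexity measure (x)
# against holistic essay scores assigned by human raters (y).
complexity = [0.8, 1.1, 1.3, 1.6, 1.7, 2.0, 2.2, 2.4]
scores     = [3.0, 3.5, 3.2, 4.0, 4.4, 4.1, 4.8, 5.0]
r2 = r_squared(complexity, scores)
print(f"R^2 = {r2:.2f}")
```

With several predictors the same logic applies (R² = 1 − SS_res/SS_tot for the fitted model), though the fit itself then requires solving the full normal equations rather than the two-parameter closed form above.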
The Varieties of English for Specific Purposes dAtabase (VESPA first release) is the result of an international corpus compilation project that aims to address the lack of large-scale, open access, multi-L1, multi-discipline and multi-register learner corpora. This corpus report provides a detailed description of VESPA and illustrates possible uses of the corpus for register exploration of learner data. Specifically, it first offers an overview of the makeup of the corpus and the online interface that can be used to search and download the corpus. It then gives an illustrative example of a study where multi-dimensional analysis was used to investigate the relative importance of register vis-à-vis other factors in learner academic writing. In the concluding remarks, we identify priorities for future developments in the VESPA project, including the addition of more L1 components, more disciplines and more registers, as well as the compilation of a comparable corpus of native student writing.
Recent studies of proficiency measurement and reporting practices in applied linguistics have revealed widespread use of unsatisfactory practices such as the use of proxy measures of proficiency in place of explicit tests. Learner corpus research is one specific area affected by this problem: few learner corpora contain reliable, valid evaluations of text proficiency. This has led to calls for the development of new L2 writing proficiency measures for use in research contexts. Answering this call, a recent study by Paquot et al. (2022) generated assessments of learner corpus texts using a community-driven approach in which judges, recruited from the linguistic community, conducted assessments using comparative judgement. Although the approach generated reliable assessments, its practical use is limited because linguists are not always available to contribute to data collection. This paper, therefore, explores an alternative approach, in which judges are recruited through a crowdsourcing platform. We find that assessments generated in this way can reach near-identical levels of reliability and concurrent validity to those produced by members of the linguistic community.
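Concurrent validity of this kind is typically quantified as the correlation between the crowdsourced scale values and scores from an established proficiency measure. A minimal sketch, with invented ratings for ten learner texts (pure Python, Pearson's r):

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two score vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Invented scores for ten texts: comparative-judgement scale values
# from crowdsourced judges vs. an established proficiency measure.
crowd = [-1.8, -1.2, -0.7, -0.3, 0.0, 0.4, 0.9, 1.1, 1.5, 2.0]
bench = [ 2.0,  2.5,  3.0,  3.0, 3.5, 4.0, 4.0, 4.5, 5.0, 5.5]
r = pearson_r(crowd, bench)
print(f"r = {r:.2f}")
```

Comparative-judgement scale values are interval-like estimates (often from a Bradley–Terry-style model), so a product-moment correlation against the benchmark scores is a reasonable first look at concurrent validity.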
In our software demonstration, we describe a web-based English for Academic Purposes dictionary-cum-writing aid tool, the Louvain EAP Dictionary (LEAD). The dictionary is based on the analysis of c. 900 academic words and phrases in a large corpus of academic texts and EFL learner corpora representing a wide range of L1 populations. The dictionary contains a rich description of non-technical academic words, with particular focus on their phraseology (collocations and recurrent phrases). Its main originality is its customisability: the content is automatically adapted to users’ needs in terms of discipline and mother tongue background. Another key feature of the LEAD is that it makes full use of the capabilities afforded by the electronic medium in terms of multiplicity of access modes (Tarp 2009). The dictionary can be used as both a semasiological dictionary (from lexeme to meaning) and an onomasiological dictionary (from meaning/concept to lexeme) via a list of typical rhetorical or organisational functions in academic discourse (cf. Pecman 2008). It is also a semi-bilingual dictionary (cf. Laufer & Levitzky-Aviad 2006) as users who have selected a particular mother tongue background can search lexical entries via their translations into that language. The LEAD is designed as an integrated tool where the actual dictionary part is linked up to other language resources and learning tools. It is a hybrid dictionary (cf. Hartmann 2005) that includes both a dictionary-cum-corpus and a dictionary-cum-CALL component. As regards direct corpus access, the LEAD innovates by giving access to discipline-specific corpora rather than generic corpora. While the current version of the tool is restricted to some disciplines and mother tongue backgrounds, its flexible architecture allows for further customisation (other L1 background populations, other disciplines, other languages).