Native Language Identification on EFCAMDAT

Xiao Jiang, Yan Huang, Yufan Guo, Jeroen Geertzen, Theodora Alexopoulou, Lin Sun, Anna Korhonen

2017

Abstract

Native Language Identification (NLI) is the task of determining the native language (L1) of second language (L2) learners on the basis of their written texts. To date, research on NLI has focused on relatively small corpora. We apply NLI to EFCAMDAT, an L2 English learner corpus that is not only several times larger than previous L2 corpora but also provides pseudo-longitudinal data across several proficiency levels. Using accurate machine learning with a wide range of linguistic features, our investigation reveals patterns in the longitudinal data that are useful both for the further development of NLI and for its application to research on L2 acquisition.

Introduction

Native language identification (NLI) is the task of detecting the native language (L1) of writers on the basis of their second language (L2) production. NLI is important for natural language processing (NLP) applications, including language tutoring systems and authorship profiling. Moreover, NLI can offer useful empirical data for research on L2 acquisition. For example, NLI can shed light on how L1 background influences L2 learning, and on how the writing of L2 learners differs across L1 backgrounds.

To date, studies on NLI have focused on relatively small learner corpora. Furthermore, none of them has investigated the influence of L1s across L2 proficiency levels. Our work takes a first step toward addressing these problems. We apply NLI to EFCAMDAT, the EF-Cambridge Open Language Database (Geertzen, Alexopoulou, and Korhonen, 2013), an open-access L2 learner corpus. EFCAMDAT consists of writings that learners submitted to Englishtown, the online school of EF. EFCAMDAT stands out for its size, the diversity of its student backgrounds, and its coverage of proficiency levels. The first release of 2013 (Geertzen, Alexopoulou, and Korhonen, 2013), on which this paper is based, amounts to 30 million words, making it several times larger than any other available L2 corpus.

Using a standard machine learning–based methodology for NLI, we explore the optimal linguistic features for NLI on this data at different proficiency levels. We discover patterns that can be useful both for the further development of NLI and for its application to research on L2 acquisition. In this introductory section, we first review the history of research on NLI and introduce the data sets that have been used in earlier NLI research. We then briefly summarise our contribution.
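
To make the task concrete, NLI can be framed as supervised text classification: each learner text is mapped to features, and a classifier predicts the writer's L1. The following is only a minimal sketch under stated assumptions, not the methodology or feature set used in this paper. It assumes character trigram counts as features and a multinomial naive Bayes classifier with add-one smoothing; the class name `NaiveBayesNLI`, the labels `L1_A` and `L1_B`, and the toy sentences are all hypothetical and are not drawn from EFCAMDAT.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, space-padded text."""
    padded = " " + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayesNLI:
    """Multinomial naive Bayes over character n-gram counts (illustrative only)."""

    def fit(self, texts, labels):
        self.ngram_counts = defaultdict(Counter)  # n-gram counts per L1 label
        self.doc_counts = Counter(labels)         # number of texts per L1 label
        self.vocab = set()                        # all n-grams seen in training
        for text, label in zip(texts, labels):
            grams = char_ngrams(text)
            self.ngram_counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.doc_counts):
            counts = self.ngram_counts[label]
            # Add-one smoothing: unseen n-grams get a small non-zero probability.
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.doc_counts[label] / total_docs)
            for gram in char_ngrams(text):
                score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy data: two hypothetical L1 groups with different recurring patterns.
train_texts = [
    "I am agree with this opinion",
    "I am agree that the school is good",
    "He go to school every day",
    "She go to work every day",
]
train_labels = ["L1_A", "L1_A", "L1_B", "L1_B"]

model = NaiveBayesNLI().fit(train_texts, train_labels)
predicted = model.predict("They go to market every day")
```

In a realistic setting the toy lists would be replaced by learner texts with known L1 labels, and accuracy would be estimated on held-out texts, for example separately for each proficiency level.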
