Maximizing Classification Accuracy in Native Language Identification

Scott Jarvis, Yves Bestgen, Steve Pepper

2013

This paper reports our contribution to the 2013 NLI Shared Task. The purpose of the task was to train a machine-learning system to identify the native-language affiliations of 1,100 texts written in English by nonnative speakers as part of a high-stakes test of general academic English proficiency. We trained our system on the new TOEFL11 corpus, which includes 11,000 essays written by nonnative speakers from 11 native-language backgrounds. Our final system used an SVM classifier with over 400,000 unique features consisting of lexical and POS n-grams occurring in at least two texts in the training set. Our system identified the correct native-language affiliations of 83.6% of the texts in the test set, the highest classification accuracy achieved in the 2013 NLI Shared Task.
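The feature-selection rule mentioned in the abstract (keeping only n-grams that occur in at least two training texts) can be illustrated with a minimal sketch. This is not the authors' implementation: the whitespace tokenization, the lowercasing, the n-gram range of 1 to 2, the function names `word_ngrams` and `select_features`, and the toy texts are all assumptions made for illustration. The abstract does not specify these details, and the sketch covers lexical n-grams only, not POS n-grams or the SVM itself.

```python
from collections import Counter


def word_ngrams(text, n_max=2):
    """Return the set of word n-grams (n = 1..n_max) found in one text.

    Assumption: simple lowercased whitespace tokenization.
    """
    tokens = text.lower().split()
    grams = set()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.add(" ".join(tokens[i:i + n]))
    return grams


def select_features(texts, min_texts=2, n_max=2):
    """Keep n-grams that occur in at least `min_texts` different texts."""
    doc_freq = Counter()
    for text in texts:
        # A set is used per text, so each text counts at most once per n-gram.
        doc_freq.update(word_ngrams(text, n_max))
    return sorted(g for g, df in doc_freq.items() if df >= min_texts)


# Toy stand-ins for training essays (hypothetical, not TOEFL11 data).
toy_texts = [
    "i am agree with this statement",
    "i am agree that students learn more",
    "students learn more from experience",
]
features = select_features(toy_texts)
```

In this toy run, n-grams such as "am agree" and "students learn" survive because they appear in two texts, while "agree with" is dropped because it appears in only one. The retained n-grams would then serve as the feature columns passed to a classifier.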

Keywords:

Training set
Support vector machine
Natural language processing
Native-language identification
Classifier (linguistics)
Speech recognition
Test set
Artificial intelligence
Computer science
SVM classifier
English proficiency
