Evaluating the Detection of Aberrant Responses in Automated Essay Scoring

2015 
As automated essay scoring grows in popularity, the measurement issues associated with it take on greater importance. One such issue is the detection of aberrant responses. In this study, we considered aberrant responses to be those that were not suitable for machine scoring because they had characteristics that the scoring system could not process. Since no such system can yet understand language the way a human rater does, the detection of aberrant responses is important for all automated essay scoring systems. Aberrant responses can be identified either before or after machine scoring is attempted (i.e., pre-screening and post-hoc screening), and such identification is essential if the technology is to be used as the primary scoring method. In this study, we investigated the functioning of a set of pre-screening advisory flags that have been used in different automated essay scoring systems. In addition, we evaluated whether the size of the human–machine discrepancy could be predicted, as a precursor to developing a general post-hoc screening method. These analyses were conducted using one scoring system as a case example. Empirical results suggested that some pre-screening advisories operated more effectively than others. With respect to post-hoc screening, relatively little scoring difficulty was found overall, which reduced the ability to predict the human–machine discrepancy for responses that passed through pre-screening. Limitations of the study and suggestions for future research are also provided.
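As an illustration of the two screening stages the abstract describes, here is a minimal Python sketch. The flag names, thresholds, and the linear discrepancy predictor are hypothetical assumptions for exposition only; they are not the advisories or models evaluated in the study.

```python
import re

def prescreen(response: str, min_words: int = 50) -> list[str]:
    """Return illustrative pre-screening advisory flags (assumed, not the paper's).

    Each flag marks a response as potentially unsuitable for machine scoring.
    """
    flags = []
    words = response.split()
    if len(words) < min_words:
        flags.append("too_brief")  # too little text to score reliably
    if words and len({w.lower() for w in words}) / len(words) < 0.2:
        flags.append("excessive_repetition")  # mostly repeated tokens
    if re.search(r"[^\x00-\x7F]{10,}", response):
        flags.append("unexpected_characters")  # long runs of non-ASCII text
    return flags

def predict_discrepancy(features: list[float], weights: list[float], bias: float) -> float:
    """Post-hoc screening idea: predict |human score - machine score|.

    A plain linear model stands in for whatever predictor one might fit;
    responses with large predicted discrepancies would be routed to human raters.
    """
    return bias + sum(w * x for w, x in zip(weights, features))

if __name__ == "__main__":
    essay = "word " * 10  # degenerate, highly repetitive response
    print(prescreen(essay))  # -> ['too_brief', 'excessive_repetition']
```

In this framing, pre-screening filters responses before machine scoring is attempted, while post-hoc screening inspects already machine-scored responses for likely large human–machine disagreement.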