ABSTRACT Large-scale educational surveys are low-stakes assessments of educational outcomes conducted using nationally representative samples. In these surveys, students do not receive individual scores, and the outcome of the assessment is inconsequential for respondents. The low-stakes nature of these surveys, variations in average performance across countries, and other factors such as different testing traditions all contribute to the number of omitted responses in these assessments. While the underlying reasons for omissions are not completely understood, common practice in international assessments is to employ a deterministic treatment of omissions, either as missing observations or as responses that are considered wrong. Both approaches appear problematic. In this project, we analyzed the effects of treating omitted responses either as missing or as wrong, as is done in the majority of international studies, and compared these data-treatment solutions to model-based approaches. The two types of model-based approaches used in this study are (a) extensions of multidimensional item response theory (IRT) with an additional dimension based on response indicator variables that are defined and calibrated together with the set of items containing the observed responses, and (b) multidimensional, multiple-group IRT models with a grouping variable representing the within-country stratification of respondents by the number of omitted responses. These two model-based approaches were compared on the basis of simulated data and data from about 250,000 students from 30 Organisation for Economic Co-operation and Development (OECD) Member countries participating in an international large-scale assessment.
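As a rough illustration of how the two model-based treatments differ at the data-preparation stage, the sketch below constructs response-indicator variables for approach (a) and omission-rate strata for approach (b). The response matrix, cut points, and variable names are invented for illustration and are not taken from the study or its software.

```python
import numpy as np

# Hypothetical response matrix: rows = students, columns = items.
# np.nan marks an omitted response; 0/1 code incorrect/correct answers.
responses = np.array([
    [1.0, 0.0, np.nan, 1.0],
    [np.nan, np.nan, 1.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
])

# Approach (a): one response-indicator "item" per original item
# (1 = answered, 0 = omitted). The indicators are appended to the observed
# responses and calibrated jointly, with the indicators loading on an
# additional omission-propensity dimension of the IRT model.
indicators = (~np.isnan(responses)).astype(int)
augmented = np.hstack([responses, indicators])

# Approach (b): stratify respondents (within country) by their omission rate
# and use the strata as groups in a multiple-group IRT calibration.
omission_rate = np.isnan(responses).mean(axis=1)
group = np.digitize(omission_rate, bins=[0.25, 0.50])  # three illustrative strata
```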
Automated scoring of free drawings or images as responses has yet to be used in large-scale assessments of student achievement. In this study, we propose artificial neural networks to classify these types of graphical responses from a TIMSS 2019 item. We compare the classification accuracy of convolutional and feed-forward approaches. Our results show that convolutional neural networks (CNNs) outperform feed-forward neural networks in both loss and accuracy. The CNN models classified up to 97.53% of the image responses into the appropriate scoring category, which is comparable to, if not more accurate than, typical human raters. These findings were further strengthened by the observation that the most accurate CNN models correctly classified some image responses that had been incorrectly scored by the human raters. As an additional innovation, we outline a method for selecting human-rated responses for the training sample based on an application of the expected response function derived from item response theory. This paper argues that CNN-based automated scoring of image responses is a highly accurate procedure that could potentially take over the role of second human raters in international large-scale assessments (ILSAs), reducing workload and cost while improving the validity and comparability of scoring complex constructed-response items.
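As a minimal sketch of what a convolutional classifier for scored image responses could look like, the code below assumes a Keras/TensorFlow setup; the input size, layer widths, and number of scoring categories are illustrative assumptions and do not describe the architecture or the TIMSS item used in the study.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn(input_shape=(128, 128, 1), n_categories=3):
    """Small CNN that maps a grayscale image response to a score category."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(16, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(n_categories, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Usage with human-scored training images (arrays of shape (n, 128, 128, 1)):
# model = build_cnn()
# model.fit(train_images, train_scores, validation_split=0.1, epochs=10)
```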
Identifying and considering test-taking effort is of utmost importance for drawing valid inferences on examinee competency in low-stakes tests. Different approaches exist for doing so. The speed-accuracy+engagement model aims at identifying non-effortful test-taking behavior in terms of nonresponse and rapid guessing based on responses and response times. The model allows for identifying rapid-guessing behavior at the item-by-examinee level while jointly modeling the processes underlying rapid guessing and effortful responding. To assess whether the model indeed provides a valid measure of test-taking effort, we investigate (1) its convergent validity with previously developed behavioral as well as self-report measures of guessing behavior and effort, (2) its fit within the nomological network of test-taking motivation derived from expectancy-value theory, and (3) its ability to detect differences between groups that can be assumed to differ in test-taking effort. Results suggest that the model captures central aspects of non-effortful test-taking behavior. While it does not cover the whole spectrum of non-effortful behavior, it provides a measure of some aspects of it that is less subjective than self-reports. The article concludes with a discussion of implications for the development of behavioral measures of non-effortful test-taking behavior.
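For readers unfamiliar with behavioral effort measures, the sketch below flags rapid guesses with a simple normative response-time threshold (10% of the mean item response time, capped at 10 seconds). This is a cruder stand-in than the speed-accuracy+engagement model discussed here, since it does not model responses and response times jointly; the threshold rule is an assumption chosen for illustration.

```python
import numpy as np

def flag_rapid_guesses(response_times):
    """response_times: examinee-by-item matrix of response times in seconds."""
    rt = np.asarray(response_times, dtype=float)
    # Per-item threshold: 10% of the mean response time, at most 10 seconds.
    thresholds = np.minimum(0.10 * np.nanmean(rt, axis=0), 10.0)
    return rt < thresholds  # True = response flagged as a rapid guess

# Per-examinee response-time effort: proportion of items answered effortfully.
# rte = 1.0 - flag_rapid_guesses(rt_matrix).mean(axis=1)
```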
This article applies the approach of adding artificially created data to observations in order to stabilize estimates, here used to treat missing responses in cases where students choose to omit answers to questionnaire or achievement test items. This addition of manufactured data is known in the literature as Laplace smoothing or the method of data augmentation priors. It can be understood as a penalty added to a parameter's likelihood function. This approach is used to stabilize results in the National Assessment of Educational Progress (NAEP) analysis and is implemented in the MGROUP software program that plays an essential role in generating results files for NAEP. The modified data augmentation approach presented here aims to replace common missing-data treatments used in IRT, which can be understood as special deterministic cases of data augmentation priors that add fixed information to the observed data, either by conceptualizing these as adding a term of fixed form to the likelihood function to represent constant prior information, or by understanding the augmentation as a conjugate prior that ‘emulates’ non-random observations.
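A minimal numeric sketch of the underlying idea follows: Laplace smoothing stabilizes an observed proportion by adding a fixed number of artificial pseudo-observations, which is equivalent to the posterior mean under a conjugate Beta(alpha, alpha) prior. The counts and the value of alpha are made up, and the snippet is not the MGROUP implementation.

```python
def smoothed_proportion(correct, attempts, alpha=0.5):
    """Laplace-smoothed proportion: add alpha pseudo-successes and alpha pseudo-failures."""
    return (correct + alpha) / (attempts + 2 * alpha)

raw = 3 / 4                                      # 0.75, unstable with only 4 attempts
smoothed = smoothed_proportion(3, 4, alpha=0.5)  # (3 + 0.5) / (4 + 1) = 0.70
```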
International large-scale assessments (ILSAs) have transitioned from paper-based assessments to computer-based assessments (CBAs), facilitating the use of new item types and more effective data collection tools. This allows the implementation of more complex test designs and the collection of process and response time (RT) data. These new data types can be used to improve data quality and the accuracy of test scores obtained through latent regression (population) models. However, the move to a CBA also poses challenges for comparability and trend measurement, one of the major goals in ILSAs. We provide an overview of current methods used in ILSAs to examine and assure the comparability of data across different assessment modes, as well as methods that improve the accuracy of test scores by making use of the new data types provided by a CBA.