Understanding the "fit" of models designed to predict binary outcomes has been a long-standing problem. We propose a flexible, portable, and intuitive metric for quantifying the change in accuracy between two predictive systems in the case of a binary outcome: the InterModel Vigorish (IMV). The IMV is based on an analogy to weighted coins, well-characterized physical systems with tractable probabilities. The IMV is always a statement about the change in fit relative to some baseline model---which can be as simple as the prevalence---whereas other metrics are stand-alone measures that need to be further manipulated to yield indices related to differences in fit across models. Moreover, the IMV is consistently interpretable independent of baseline prevalence. We contrast this metric with alternatives in numerous simulations. The IMV is more sensitive to estimation error than many alternatives and also shows distinctive sensitivity to prevalence. We then showcase its flexibility across examples spanning the social, biomedical, and physical sciences. We also demonstrate how it can be used to provide straightforward interpretation of logistic regression coefficients. The IMV allows for precise answers to questions about changes in model fit in a variety of settings in a manner that will be useful for furthering research with binary outcomes.
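For concreteness, here is a minimal sketch of how an IMV-style comparison between two sets of predicted probabilities might be computed, assuming the weighted-coin mapping works by matching each model's geometric-mean likelihood to that of a Bernoulli coin with weight w >= 0.5 and then taking the vigorish (w1 - w0)/w0; the function names, the root-finding approach, and the toy data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import brentq

def coin_weight(a):
    """Find w in [0.5, 1) such that w**w * (1-w)**(1-w) = a,
    where a is a model's geometric-mean likelihood of the observed outcomes.
    Assumes a >= 0.5, i.e., the model is at least as good as a fair coin."""
    f = lambda w: w * np.log(w) + (1 - w) * np.log(1 - w) - np.log(a)
    return brentq(f, 0.5 + 1e-12, 1 - 1e-12)

def geometric_mean_likelihood(y, p):
    """Geometric mean of the predicted probabilities of the observed outcomes."""
    ll = y * np.log(p) + (1 - y) * np.log(1 - p)
    return np.exp(ll.mean())

def imv(y, p_baseline, p_enhanced):
    """IMV of an enhanced model relative to a baseline model (assumed form)."""
    w0 = coin_weight(geometric_mean_likelihood(y, p_baseline))
    w1 = coin_weight(geometric_mean_likelihood(y, p_enhanced))
    return (w1 - w0) / w0

# toy usage: baseline = prevalence only, enhanced = a hypothetical better model
rng = np.random.default_rng(0)
x = rng.normal(size=5000)
p_true = 1 / (1 + np.exp(-(0.5 + 1.2 * x)))
y = rng.binomial(1, p_true)
p0 = np.full_like(p_true, y.mean())   # prevalence-only baseline predictions
print(round(imv(y, p0, p_true), 3))
```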
The identification of causal relationships between specific genes and social, behavioral, and health outcomes is challenging due to environmental confounding from population stratification and dynastic genetic effects. Numerous existing methods leverage the random genetic differences between parents and their children induced by genetic recombination to estimate effects that are free from environmental confounding. However, such methods require dyadic genetic data within families (i.e., parent-child pairs and/or sibling pairs) and therefore can only be applied in relatively small and selected samples. We introduce the phenotype differences model to compare siblings and estimate the causal effect of genetic predictors using just a single individual's genotype. We show that, under plausible assumptions, the phenotype differences model provides unbiased and consistent estimates of genetic effects. We then utilize the phenotype differences model to estimate the effects of 40 polygenic scores on premature mortality using asymmetrically genotyped sibling pairs in the Wisconsin Longitudinal Study. We find that twelve polygenic scores related to self-rated health, body mass index, education, cognition, depression, life satisfaction, smoking behavior, and chronic obstructive pulmonary disease have a meaningful impact on mortality outcomes. When we combine information across multiple polygenic scores, the sibling in a pair who inherited more longevity-increasing DNA from their parents on average lived 9 months longer and was 7 percentage points (12%) more likely to survive until age 75 than their brother or sister.
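A stylized simulation of the core idea follows, under strong simplifying assumptions (random mating, an additive polygenic score with equal between-family and within-family variance, and confounding that is shared by both siblings); the factor-of-two rescaling and all variable names are illustrative and should not be read as the published estimator.

```python
import numpy as np

rng = np.random.default_rng(1)
n_fam = 200_000
delta = 0.30          # true causal effect of the polygenic score (PGS)

# family component of the PGS (parental expectation) plus Mendelian segregation
# noise; the two variances are set equal here, as under random mating
family = rng.normal(0, 1, n_fam)
pgs1 = family + rng.normal(0, 1, n_fam)   # genotyped sibling
pgs2 = family + rng.normal(0, 1, n_fam)   # ungenotyped sibling

# family-level environmental confounding loads on the family component
confound = 0.5 * family
y1 = delta * pgs1 + confound + rng.normal(0, 1, n_fam)
y2 = delta * pgs2 + confound + rng.normal(0, 1, n_fam)

# naive regression of y1 on pgs1 is biased upward by the shared confound
b_naive = np.cov(y1, pgs1)[0, 1] / np.var(pgs1)

# sibling-difference idea: regress the outcome difference on the genotyped
# sibling's PGS; the confound cancels, and under these assumptions the slope
# equals delta * var(segregation) / var(PGS) = delta / 2
b_diff = np.cov(y1 - y2, pgs1)[0, 1] / np.var(pgs1)

print(f"naive: {b_naive:.3f}, rescaled difference estimate: {2 * b_diff:.3f}, truth: {delta}")
```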
We develop an approach for parsing observed group differences in accuracy into differences due to ability and differences due to time usage. This approach first uses flexible models to identify conditional accuracy functions (CAFs) and then uses these CAFs to decompose group differences in accuracy. We first illustrate that this approach can reliably recover true differences in ability-related accuracy when observed differences in accuracy are confounded by time usage differences across groups. We then use this approach to probe gender differences in science performance in PISA 2018 for Chile and in reading fluency in PISA 2018 for 71 countries. For science in Chile, adjusting for time use increases the group score differential. For reading fluency, adjusting for time use decreases group score differentials across countries. This approach provides a method for estimating the impact of time use on observed score differences, offering additional evidence bearing on the validity of test score interpretations.
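A minimal sketch of this kind of decomposition, assuming the CAF is estimated as P(correct | response time) separately by group and that the ability (capacity) component is obtained by evaluating both groups' CAFs on a common time distribution; the logistic-in-log-time functional form and all names are illustrative stand-ins for the paper's flexible models.

```python
import numpy as np

def fit_caf(t, correct, degree=2):
    """Fit a polynomial-logistic conditional accuracy function of log response time
    via Newton-Raphson; an illustrative stand-in for a flexible CAF estimator."""
    X = np.column_stack([np.log(t) ** d for d in range(degree + 1)])
    beta = np.zeros(X.shape[1])
    for _ in range(50):
        p = 1 / (1 + np.exp(-X @ beta))
        W = p * (1 - p)
        beta += np.linalg.solve((X * W[:, None]).T @ X, X.T @ (correct - p))
    return lambda t_new: 1 / (1 + np.exp(-np.column_stack(
        [np.log(t_new) ** d for d in range(degree + 1)]) @ beta))

# simulated example: group B has higher ability-related accuracy but responds
# faster, which masks part of its advantage in raw accuracy
rng = np.random.default_rng(2)
t_a = rng.lognormal(0.4, 0.5, 20_000)
t_b = rng.lognormal(0.1, 0.5, 20_000)
y_a = rng.binomial(1, 1 / (1 + np.exp(-(-0.2 + 0.8 * np.log(t_a)))))
y_b = rng.binomial(1, 1 / (1 + np.exp(-( 0.2 + 0.8 * np.log(t_b)))))

caf_a, caf_b = fit_caf(t_a, y_a), fit_caf(t_b, y_b)

observed_gap = y_b.mean() - y_a.mean()
# ability-related gap: compare the two CAFs on a pooled time distribution
t_pool = np.concatenate([t_a, t_b])
ability_gap = caf_b(t_pool).mean() - caf_a(t_pool).mean()
time_use_gap = observed_gap - ability_gap
print(f"observed {observed_gap:.3f} = ability {ability_gap:.3f} + time use {time_use_gap:.3f}")
```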
While automatic essay scoring (AES) systems promise speed and objectivity, their usefulness is limited by the defensibility of the interpretations of the scores they generate. While AES systems have extremely high reliability, reliability is a necessary (but not sufficient) condition for validity. Any validity argument constructed around an AES system must consider how well scores reflect the construct, how scoring decisions are made, and to what extent the scores can be manipulated. We examine these key considerations within the context of feature-based AES systems and AES systems based on neural networks. While neural AES systems boast increased reliability compared to feature-based systems, they are significantly less transparent in how they make their decisions and what aspects of the essay they focus on. In addition, research from computer vision shows that the decisions of neural-network-based classifiers can be manipulated by maliciously crafted input. We investigate similar examples within natural language processing, propose a method for malicious manipulation of essay-length text inputs, and evaluate the threat to validity posed by essay manipulation.
Group differences in test scores are a key metric in education policy. Response time offers novel opportunities for understanding these differences, especially in low-stakes settings. Here, we describe how observed group differences in test accuracy can be attributed to group differences in latent response speed or group differences in latent capacity, where capacity is defined as expected accuracy for a given response speed. This article introduces a method for decomposing observed group differences in accuracy into these differences in speed versus differences in capacity. We first illustrate in simulation studies that this approach can reliably distinguish between group speed and capacity differences. We then use this approach to probe gender differences in science and reading fluency in PISA 2018 for 71 countries. In science, score differentials largely increase when males, who respond more rapidly, are the higher-performing group and decrease when females, who respond more slowly, are the higher-performing group. In reading fluency, score differentials decrease where females, who respond more rapidly, are the higher-performing group. This method can be used to analyze group differences, especially in low-stakes assessments where there are potential group differences in speed.
Can individual differences in visual processing predict individual differences in reading abilities? The potential role of visual processing deficits in reading difficulties is a longstanding and unresolved question. A major challenge in addressing this question empirically is developing reliable behavioral measures for developmental studies, because most tasks are not equally reliable across age groups and require iterative design changes to ensure that the measure indexes the intended construct across the developmental span. In this study, we show how to iteratively modify a behavioral task based on data, and how item response theory can be used to reduce task redundancies, making the measure reliable, fast, fun, and easily deployable to a diverse population of kindergarten and first grade children. Our results show that the ability to rapidly encode visual information - a string of letters or symbols - reliably correlates with reading outcomes at the end of the academic year in a large and diverse sample of kindergarten, first, and second grade children.
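A small sketch of how item response theory can be used to trim redundant items, assuming a two-parameter logistic (2PL) model with item parameters already estimated; the greedy selection rule, the 80% information threshold, and all parameter values are illustrative assumptions rather than the study's actual procedure.

```python
import numpy as np

def item_information_2pl(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = 1 / (1 + np.exp(-a * (theta - b)))
    return a**2 * p * (1 - p)

rng = np.random.default_rng(3)
n_items = 60
a = rng.uniform(0.5, 2.5, n_items)   # discriminations (assumed already estimated)
b = rng.normal(0, 1, n_items)        # difficulties (assumed already estimated)

# target ability range for the intended population (illustrative grid)
theta_grid = np.linspace(-2, 2, 41)
info = item_information_2pl(theta_grid[:, None], a, b)   # grid x items

# keep the smallest set of items whose summed information stays above a chosen
# fraction of the full task's information everywhere on the ability grid
order = np.argsort(-info.mean(axis=0))   # most informative items first
full_info = info.sum(axis=1)
for k in range(1, n_items + 1):
    keep = order[:k]
    if np.all(info[:, keep].sum(axis=1) >= 0.8 * full_info):
        break
print(f"kept {k} of {n_items} items while retaining ~80% of test information")
```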
An accurate model of the factors that contribute to individual differences in reading ability depends on data collection in large, diverse, and representative samples of research participants. However, that is rarely feasible due to the constraints imposed by standardized measures of reading ability, which require test administration by trained clinicians or researchers. Here we explore whether a simple, two-alternative forced-choice, time-limited lexical decision task (LDT), self-delivered through the web browser, can serve as an accurate and reliable measure of reading ability. We found that performance on the LDT is highly correlated with scores on standardized measures of reading ability such as the Woodcock-Johnson Letter Word Identification test (r = 0.91, disattenuated r = 0.94). Importantly, the LDT reading ability measure is highly reliable (r = 0.97). After optimizing the list of words and pseudowords based on item response theory, we found that a short experiment with 76 trials (2–3 min) provides a reliable (r = 0.95) measure of reading ability. Thus, the self-administered Rapid Online Assessment of Reading ability (ROAR) developed here overcomes the constraints of resource-intensive, in-person reading assessment and provides an efficient and automated tool for effective online research into the mechanisms of reading (dis)ability.
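A small sketch of the disattenuation step implied by the reported correlations, using the classical Spearman correction; the reliability value assumed here for the Woodcock-Johnson measure is a hypothetical placeholder, not a figure from the study.

```python
import math

def disattenuate(r_xy, rel_x, rel_y):
    """Spearman correction for attenuation: estimated true-score correlation
    given an observed correlation and the reliabilities of the two measures."""
    return r_xy / math.sqrt(rel_x * rel_y)

# observed LDT vs. Woodcock-Johnson correlation (0.91) and LDT reliability (0.97)
# come from the text; 0.97 for the Woodcock-Johnson is a hypothetical placeholder
print(round(disattenuate(0.91, 0.97, 0.97), 2))   # ~0.94
```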
The speed-accuracy tradeoff suggests that responses generated under time constraints will be less accurate. While the tradeoff has undergone extensive experimental verification, it is less clear whether it applies in settings where time pressure is not experimentally manipulated (but where respondents still vary in how they use their time). Using a large corpus of 29 response time datasets containing data from cognitive tasks without experimental manipulation of time pressure, we probe whether the speed-accuracy tradeoff holds across a variety of tasks using idiosyncratic within-person variation in speed. We find inconsistent relationships between marginal increases in time spent responding and accuracy; in many cases, marginal increases in time do not predict increases in accuracy. However, we do observe that time pressures (in the form of time limits) consistently reduce accuracy and that rapid responses typically show the anticipated relationship (i.e., they are more accurate when they are slower). We also consider analysis of items and individuals. We find substantial variation in item-level associations between speed and accuracy. On the person side, respondents who exhibit more within-person variation in response speed are typically of lower ability. Finally, we consider the predictive power of a person's response time for out-of-sample responses; it is generally a weak predictor. Collectively, our findings suggest that the speed-accuracy tradeoff may be limited as a conceptual model in non-experimental settings and, more generally, offer empirical results and an analytic approach that will be useful as more response time data are collected.
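A minimal sketch of one way to isolate idiosyncratic within-person variation in speed, assuming accuracy is contrasted across responses that are slower versus faster than each respondent's own typical response time; the data-generating step and all names are illustrative, not the corpus analysis itself.

```python
import numpy as np

rng = np.random.default_rng(4)
n_persons, n_items = 1000, 40

ability = rng.normal(0, 1, n_persons)
person_speed = rng.normal(0, 0.5, n_persons)          # stable between-person speed
log_rt = person_speed[:, None] + rng.normal(0, 0.4, (n_persons, n_items))

# generating model: within-person slow-downs yield only a small accuracy benefit
centered = log_rt - log_rt.mean(axis=1, keepdims=True)   # person-centered log time
p = 1 / (1 + np.exp(-(ability[:, None] + 0.15 * centered)))
correct = rng.binomial(1, p)

# within-person contrast: accuracy on responses slower vs. faster than that
# person's own median response time (removes between-person speed differences)
slower = centered > np.median(centered, axis=1, keepdims=True)
acc_slow = np.nanmean(np.where(slower, correct, np.nan), axis=1)
acc_fast = np.nanmean(np.where(~slower, correct, np.nan), axis=1)
gap = acc_slow - acc_fast
print(f"mean within-person accuracy gain from responding slower: {gap.mean():.3f}")
```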
This paper examines the processes through which education becomes valued as capital within the racialised and gendered political economy. Using the empirical case of venture capital (VC) investment in for-profit education companies, principally education technology companies and for-profit school chains, it puts forward the concept of educational capitalisation. We conceptualise it as the set of uneven processes, practices, and socio-spatial relationships through which value is extracted from educational processes and practices, and, thus, education is valued in terms of expected monetary return on investment. While our conceptualisation focuses on VC investment, this framework could be used to outline other processes of extraction and valuation in education, including private equity, investment banking, venture philanthropy, and public-private partnerships. We conclude with a co-formational feminist framework for guiding future research, policymaking, and educational decision making that considers the financial, socio-technological, learning & teaching, and political-legal aspects of educational capitalisation, with ethical considerations embedded within each of these domains.
Studies of interaction effects are of great interest because they identify crucial interplay between predictors in explaining outcomes. Previous work has considered several potential sources of statistical bias and substantive misinterpretation in the study of interactions, but less attention has been devoted to the role of the outcome variable in such research. Here, we consider bias and false discovery associated with estimates of interaction parameters as a function of the distributional and metric properties of the outcome variable. We begin by illustrating that, for a variety of noncontinuously distributed outcomes (i.e., binary and count outcomes), attempts to use the linear model to recover interaction effects lead to catastrophic levels of bias and false discovery. Next, focusing on transformations of normally distributed variables (i.e., censoring and noninterval scaling), we show that linear models again produce spurious interaction effects. We provide explanations offering geometric and algebraic intuition as to why interactions are a challenge for these incorrectly specified models. In light of these findings, we make two specific recommendations. First, a careful consideration of the outcome's distributional properties should be a standard component of interaction studies. Second, researchers should approach research focusing on interactions with heightened levels of scrutiny.
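A small simulation illustrating the binary-outcome case, assuming a logistic data-generating process with two main effects and no interaction; fitting a linear (OLS) probability model with a product term then tends to yield a clearly nonzero interaction estimate. Effect sizes and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000

x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# true model: logistic with main effects only, NO interaction
p = 1 / (1 + np.exp(-(-1.0 + 0.8 * x1 + 0.8 * x2)))
y = rng.binomial(1, p)

# misspecified linear probability model that includes a product term
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
print(f"OLS interaction coefficient (true value is 0): {beta[3]:.4f}")
```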