Pass rates are key assessment statistics which are calculated for nearly all high‐stakes examinations. In this article, we define the terminal, first attempt, total attempts, and repeat attempts pass rates, and discuss the uses of each statistic. We also explain why in many situations one should expect the terminal pass rate to be the highest, first attempt pass rate to be the second highest, total attempts pass rate to be the third highest, and repeat attempts pass rate to be the lowest when repeat attempts are allowed. Analyses of data from 14 credentialing programs showed that the expected relationship held for 13 out of 14 of the programs. Additional analyses of pass rates for educational programs in radiography in one state showed that the general relationship held at the state level, but only held for 6 out of 34 educational programs. It is suggested that credentialing programs need to clearly state their pass rate definitions and carefully consider how repeat examinees may influence pass rate statistics. It is also suggested that credentialing programs need to think carefully about the meaning and uses of different pass rate statistics when choosing which pass rates to report to stakeholders.
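The four statistics named above can be illustrated with a minimal sketch. The abstract does not give the formal definitions, so the ones used here are assumptions: the terminal pass rate is taken as the share of examinees who pass on any attempt, the first attempt rate as passes among first attempts, the total attempts rate as passes among all attempts, and the repeat attempts rate as passes among second and later attempts. The function name `pass_rates` and the records are hypothetical.

```python
from collections import defaultdict


def pass_rates(attempts):
    """Compute four pass rates from (examinee_id, attempt_number, passed) tuples.

    The definitions below are assumptions made for illustration; they are
    not quoted from the article.
    """
    first = [p for _, n, p in attempts if n == 1]
    repeat = [p for _, n, p in attempts if n > 1]
    everyone = [p for _, _, p in attempts]

    # An examinee counts as a terminal pass if any of their attempts passed.
    passed_ever = defaultdict(bool)
    for examinee, _, p in attempts:
        passed_ever[examinee] = passed_ever[examinee] or p

    def rate(outcomes):
        return sum(outcomes) / len(outcomes) if outcomes else float("nan")

    return {
        "terminal": rate(list(passed_ever.values())),
        "first_attempt": rate(first),
        "total_attempts": rate(everyone),
        "repeat_attempts": rate(repeat),
    }


# Hypothetical records: four examinees, seven attempts in total.
records = [
    ("A", 1, True),
    ("B", 1, False), ("B", 2, True),
    ("C", 1, False), ("C", 2, False), ("C", 3, False),
    ("D", 1, True),
]
rates = pass_rates(records)
```

With these made-up records the rates come out as 3/4, 2/4, 3/7, and 1/3, which follows the ordering the abstract describes as typical. Nothing forces that ordering in general, which is consistent with the program-level exceptions reported above.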
Differential item functioning (DIF) cancellation occurs when the cumulative effect of an item or set of items exhibiting DIF against one subgroup cancels with other items that exhibit DIF against the comparison group and hence results in non-existent DIF at the test level. This paper investigates DIF cancellation in the context of Rasch measurement. It is shown that this phenomenon is not a property of the Rasch model, but rather, a function of the manner in which item parameters are estimated and the way that DIF impacts these estimates. The conditions under which DIF cancellation would exist when using the Rasch model are suggested and a proof is provided to support this suggestion. Empirical examples are provided to refute prior suggestions that DIF cancellation always exists if the Rasch model is used.
This mixed-method study identifies influences on the literacy habits of adolescent boys. The study sought to answer the question: what factors influence adolescent boys to pursue or not pursue leisure reading? Leisure reading has been found to have a positive impact on academic success (Hughes-Hassell & Rodge, 2007), and boys tend to lag behind in engaging in it (Smith & Wilhelm, 2004). A Likert-scale survey was given to 137 students, boys and girls, from an upper-middle-class private Christian school in the Midwest. Questions focused on why students do or do not read and what could encourage them to increase their leisure reading frequency. An open-ended question at the end of the survey provided qualitative data. Informed consent was obtained from all participants and no compensation was given.
This paper introduces a simple and intuitive graphical display for transition table based accountability models that can be used to communicate information about students’ status and growth simultaneously. This graphical transition table includes the use of shading to convey year to year transitions and different sized letters for performance categories to depict yearly status. Examples based on Michigan’s transition table used on their Michigan Educational Assessment Program (MEAP) assessments are provided to illustrate the utility of the graphical transition table in practical contexts. Additional potential applications of the graphical transition table are also suggested.
An important part of test development is ensuring alignment between test forms and content standards. One common way of measuring alignment is the Webb (1997, 2007) alignment procedure. This article investigates (a) how well item writers understand components of the definition of Depth of Knowledge (DOK) from the Webb alignment procedure and (b) how consistent their DOK ratings are with ratings provided by other committees of educators across grade levels, content areas, and alternate assessment levels in a Midwestern state alternate assessment system. Results indicate that many item writers understand key features of DOK. However, some item writers struggled to articulate what DOK means and had some misconceptions. Additional analyses suggested some lack of consistency between the item writer DOK ratings and the committee DOK ratings. Some notable differences were found across alternate assessment levels and content areas. Implications for future item writing training and alignment studies are provided.
Construct maps are tools that display how the underlying achievement construct upon which one is trying to set cut-scores is related to other information used in the process of standard setting. This article reviews what construct maps are, uses construct maps to provide a conceptual framework to view commonly used standard-setting procedures (the Angoff, Bookmark, Mapmark, Briefing Book, Body of Work, Contrasting Groups, Borderline Groups, and Construct Mapping methods), and describes how construct maps can be applied to set cut-scores and provide feedback, evaluate standard-setting methods, and synthesize data from various standard-setting methods when deciding on cut-scores. Suggestions of how construct maps could help resolve several of the common criticisms of operational standard-setting procedures, including issues related to panelist inconsistency and score gaps, are also provided. An example from a large-scale state-testing program illustrates how construct maps may be applied in practice.
A key consideration when giving any computerized adaptive test (CAT) is how much adaptation is present when the test is used in practice. This study introduces a new framework to measure the amount of adaptation of Rasch‐based CATs based on looking at the differences between the selected item locations (Rasch item difficulty parameters) of the administered items and target item locations determined from provisional ability estimates at the start of each item. Several new indices based on this framework are introduced and compared to previously suggested measures of adaptation using simulated and real test data. Results from the simulation indicate that some previously suggested indices are not as sensitive to changes in item pool size and the use of constraints as the new indices and may not work as well under different item selection rules. The simulation study and real data example also illustrate the utility of using the new indices to measure adaptation at both a group and individual level. Discussion is provided on how one may use several of the indices to measure adaptation of Rasch‐based CATs in practice.
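The framework's basic quantity, the gap between each administered item's location and the target location at the moment the item was selected, can be sketched as follows. This is only an illustration under stated assumptions: the target location is taken to equal the provisional ability estimate (under the Rasch model, item information is greatest when difficulty equals ability), and the mean absolute gap is a hypothetical summary, not necessarily one of the indices the study proposes. The function name and data are made up.

```python
def mean_abs_targeting_gap(item_difficulties, provisional_thetas):
    """Average |b - theta| over the items one examinee was administered.

    item_difficulties: Rasch difficulty of each administered item, in order.
    provisional_thetas: provisional ability estimate before each item,
        used here as the assumed target location.
    Smaller values mean the items sat closer to their targets.
    """
    if len(item_difficulties) != len(provisional_thetas):
        raise ValueError("need one provisional estimate per administered item")
    gaps = [abs(b - t) for b, t in zip(item_difficulties, provisional_thetas)]
    return sum(gaps) / len(gaps)


# Hypothetical examinee: four items and the provisional estimates before each.
thetas = [0.0, 0.8, 0.3, 0.45]
adaptive_items = [0.0, 0.6, 0.2, 0.5]   # chosen near the provisional estimates
fixed_items = [-1.0, -0.5, 0.5, 1.0]    # same items for everyone, no adaptation

adaptive_gap = mean_abs_targeting_gap(adaptive_items, thetas)
fixed_gap = mean_abs_targeting_gap(fixed_items, thetas)
```

Computed per examinee, a summary like this speaks to individual-level adaptation; averaging it over examinees gives a group-level view, mirroring the two levels mentioned above.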
Is Small Really Better? Testing Some Assumptions about High School Size. Barbara Schneider, Adam E. Wyse, and Venessa Keesler. Several years ago, I was in a meeting with a group of Chicago public school coaches and physical education teachers who were discussing the negative implications of one of Chicago's recent reform initiatives, the construction of smaller high schools. Much like other urban areas, Chicago had begun dismantling some of its large high schools to form smaller entities, with an "optimal" enrollment of 600 students. The coaches were deeply concerned that the small school movement was fostering the elimination of school-sponsored athletic teams, which sometimes acted as a magnet for marginal students, encouraging them to complete high school and in some instances enroll in college. From their perspective, intramural teams were unable to fill the void left by school-sponsored teams, which had helped some students obtain postsecondary scholarships and promoted a high school identity that instilled pride in the student body. Reflecting on their comments, I was struck by how my work and that of others had championed small schools. Could we have been wrong? Small schools were generally viewed as places that fostered a strong sense of community and encouraged academic achievement and attainment. But many of us had not explored whether small schools were better for all types of students. More specifically, would the consequences of creating small-school environments prove to be detrimental, especially for low-income minority students enrolled in urban high schools?
The case for small schools has been made in educational research since the 1960s, when scholars such as Barker and Gump argued that smaller schools provided students with greater opportunities for participation in various extracurricular activities.1 Within a smaller student body, the average adolescent would have a better chance of being on a team, taking a leadership position in the school, and developing stronger relationships with teachers and other adults in the school. The value of small schools was further supported in the 1980s by research on public and private schools that showed that smaller religious schools produced higher graduation rates and lower dropout rates than public schools.2 Analyses of the National Education Longitudinal Study of 1988 in the 1990s also showed that smaller public schools produced substantial gains in mathematics achievement for high school students.3 By 2000, the results of those studies were often used as evidence by policymakers and school administrators to support proposals to decrease school size as a strategy for increasing student achievement. However, initial results from small-school reforms have been inconsistent.4 In light of those results and reviews of earlier work, serious questions are being raised regarding the methodological techniques used to study the effects of school size.5 Several concerns center on the use of inappropriate research designs for assessing causal effects, such as correlational analyses rather than random clinical trials. These concerns have led several educational researchers to revisit propensity score methods for using observational data to approximate experimental designs, methods formalized by Donald Rubin more than thirty years ago.6 As I reviewed my own work and that of my colleagues, it became increasingly clear to me that many of the reforms being advocated, particularly in today's high schools, had rarely been studied using Rubin's methods.
Many of the cornerstones of high school reform, including better academic preparation programs, peer tutoring, and use of mentors, often lacked a rigorous evaluation component. Uncertain of what we might find, my two colleagues, Adam Wyse and Venessa Keesler, and I decided to estimate the effects of high school size using Rubin's methods and conventional hierarchical models. This paper describes our efforts, using observational data from the National Education Longitudinal Study, to approximate an experiment on the effects of school size for several student outcomes: mathematics achievement, postsecondary expectations, college attendance plans after high school graduation, and number and type of...
This article discusses sample size and probability threshold considerations in the use of the tailored data method with the Rasch model. In the tailored data method, one performs an initial Rasch analysis and then reanalyzes the data after setting to missing the item responses that fall below a chosen probability threshold. A simple analytical formula is provided that can be used to check whether applying the tailored data method with a chosen probability threshold will leave enough item responses for the Rasch calibration to meet minimum sample size requirements. The formula is illustrated using a real data example from a medical imaging licensure exam with several different probability thresholds. It is shown that as the probability threshold increased, more item responses were set to missing, and the parameter standard errors and item difficulty estimates also tended to increase. It is suggested that some consideration should be given to the chosen probability threshold and how it interacts with potential examinee sample sizes and the accuracy of parameter estimates when calibrating data with the tailored data method.
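The tailoring step and the resulting per-item counts can be sketched as below. This is an illustration under assumptions, not the article's analytical formula, which the abstract does not state. It assumes that the probability compared with the threshold is the Rasch probability of a correct response, computed from the ability and difficulty estimates of the initial analysis. All names and values are hypothetical.

```python
import math


def rasch_prob(theta, b):
    """Rasch probability of a correct response for ability theta, difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def tailor(responses, thetas, difficulties, threshold):
    """Return a copy of the response matrix with low-probability cells set to None.

    responses: one list of 0/1 scores per examinee (rows), one entry per item.
    thetas, difficulties: estimates from an initial Rasch analysis (assumed given).
    threshold: cells with P(correct) below this value are treated as missing.
    """
    tailored = []
    for person, theta in zip(responses, thetas):
        row = [x if rasch_prob(theta, b) >= threshold else None
               for x, b in zip(person, difficulties)]
        tailored.append(row)
    return tailored


def remaining_per_item(tailored):
    """Count the non-missing responses left for each item (column)."""
    return [sum(x is not None for x in column) for column in zip(*tailored)]


# Hypothetical estimates and scores: three examinees, three items.
thetas = [-1.0, 0.0, 1.0]
difficulties = [-1.0, 0.0, 2.0]
responses = [[1, 0, 0], [1, 1, 0], [1, 1, 1]]

tailored = tailor(responses, thetas, difficulties, threshold=0.25)
counts = remaining_per_item(tailored)
```

In this toy case the hardest item keeps only one response after tailoring. Comparing `counts` with whatever minimum per-item sample size a program requires shows at a glance which items a given threshold would leave under-sampled.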