The anesthesiology department recently instituted an end-of-rotation examination for its medical students, and 50% of each student’s grade is based on the examination score. One of the students complained that the examination was unfair because the items he answered incorrectly did not reflect his true knowledge of anesthesiology. Which of the following components of testing theory would BEST demonstrate which individual questions accurately assess medical student knowledge?
- A) Difficulty
- B) Discrimination Index
- C) P-Value
- D) Standard error
Answer: B) Discrimination Index
The educator’s goal with any examination is to accurately assess knowledge. Testing theory can help elucidate the relationship between a student’s performance on an examination and their true knowledge.
In a perfect world, a test would measure precisely what it is designed to measure (validity) and would do so consistently (reliability). Reliability can be defined as the proportion of observed score variance that can be attributed to true score variance. It is, in effect, the correlation between the scores we have and the scores we would get if we gave another nearly identical test. In practice, the reliability of a single exam is often estimated with Cronbach’s coefficient alpha, a measure of internal consistency that estimates how much of the variance in students’ total scores is attributable to true score variance (i.e. not an artifact of the particular questions being tested).
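As a rough illustration of the calculation, the sketch below computes Cronbach’s alpha from a small matrix of right/wrong item scores using the standard formula, alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores). The function name `cronbach_alpha` and the five-student data set are hypothetical and chosen only for the example.

```python
from statistics import pvariance


def cronbach_alpha(responses):
    """Estimate Cronbach's alpha.

    responses: one list per student, each holding that student's
    item scores (here 1 = correct, 0 = incorrect).
    """
    k = len(responses[0])  # number of items on the exam
    # Variance of each item's scores across students
    item_vars = [pvariance([row[i] for row in responses]) for i in range(k)]
    # Variance of the students' total scores (assumed to be non-zero)
    total_var = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical data: 5 students, 4 items (illustration only)
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha: {alpha:.2f}")
```

For this made-up data set alpha comes out at approximately 0.80. Real examinations have many more items and students, and the estimate is only as good as the sample it is computed from.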
By combining a measure of reliability with the standard deviation of the scores on a test, the standard error of measurement can be calculated. It describes the range (analogous to a confidence interval) in which the student’s true score is likely to fall. While a large standard error may suggest that this student’s score is imprecise, it does not speak to whether specific questions are “unfair”.
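The combination described above is usually written as SEM = SD × √(1 − reliability). The sketch below applies that formula and builds an approximate 95% band around one observed score. The function names, the multiplier of 1.96 (which assumes roughly normally distributed measurement error), and all numeric values are assumptions made for illustration.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """SEM from the score standard deviation and a reliability estimate."""
    return sd * math.sqrt(1 - reliability)


def score_band(observed, sem, z=1.96):
    """Approximate band around an observed score (z = 1.96 for about 95%)."""
    return (observed - z * sem, observed + z * sem)


# Hypothetical exam: SD of 10 points, reliability (alpha) of 0.84
sem = standard_error_of_measurement(sd=10, reliability=0.84)
low, high = score_band(observed=70, sem=sem)
print(f"SEM: {sem:.2f} points; band for a score of 70: {low:.1f} to {high:.1f}")
```

With these assumed values the SEM is about 4 points, so a student who scored 70 would have a band of roughly 62 to 78. The band describes the precision of the whole score and says nothing about any single item.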
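Similarly, the P-value, or item difficulty (students who answer correctly / total students), identifies which questions are difficult (low P-value), but not which questions best assess true knowledge. This P-value is a simple proportion and is unrelated to the p-value used in statistical significance testing.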
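A minimal sketch of the item difficulty calculation follows: for each question, the number of students who answered correctly is divided by the total number of students. The function name `item_difficulty` and the response matrix are hypothetical.

```python
def item_difficulty(responses):
    """Return the P-value (proportion correct) for every item.

    responses: one list per student, 1 = correct, 0 = incorrect.
    """
    n_students = len(responses)
    n_items = len(responses[0])
    return [
        sum(row[i] for row in responses) / n_students
        for i in range(n_items)
    ]


# Hypothetical data: 5 students, 4 items (illustration only)
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

for number, p in enumerate(item_difficulty(responses), start=1):
    print(f"Item {number}: P-value = {p:.2f}")
```

In this example the last item has the lowest P-value and is therefore the hardest, but the number alone cannot show whether the students who missed it were the weaker ones.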
Discrimination indices are used to assess whether test questions can accurately discriminate between low- and high-scoring candidates. In other words, question X will have a high index of discrimination if the students who do well on the rest of the examination also get question X right, AND the students who do poorly on the rest of the examination get question X wrong. A low (or negative) discrimination index raises the suspicion that something is wrong with the question, because low performers got it right and high performers got it wrong (triggering a mountain of student complaints). However, the indices of discrimination are not without limitations. Care must be taken to apply these theories correctly, but when applied well, they may improve how we evaluate trainees.
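One common way to compute a discrimination index is the upper-lower group method: rank students by total score, then subtract the proportion of the lowest-scoring group who answered the item correctly from the proportion of the highest-scoring group who did. The sketch below assumes that method. The function name, the 27% group size (a conventional choice, not something specified in this article), and the data are all hypothetical.

```python
def discrimination_index(responses, item, fraction=0.27):
    """Upper-lower discrimination index for one item.

    responses: one list per student, 1 = correct, 0 = incorrect.
    item: zero-based index of the question being examined.
    fraction: share of students placed in each comparison group (assumed).
    """
    # Rank students from highest to lowest total score
    ranked = sorted(responses, key=sum, reverse=True)
    n_group = max(1, round(len(ranked) * fraction))
    upper, lower = ranked[:n_group], ranked[-n_group:]
    p_upper = sum(row[item] for row in upper) / n_group
    p_lower = sum(row[item] for row in lower) / n_group
    return p_upper - p_lower


# Hypothetical data: 10 students, 4 items; the last item is deliberately flawed
responses = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]

for number in range(4):
    d = discrimination_index(responses, number)
    print(f"Item {number + 1}: discrimination index = {d:+.2f}")
```

Here the first item separates the top and bottom groups perfectly, while the last item has a negative index because only lower-scoring students answered it correctly. That is the pattern that should prompt a review of the question. Other formulations, such as an item-total correlation, are also in use and may give different values.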
- De Champlain, A.F. A primer on classical test theory and item response theory for assessments in medical education. Medical Education. 2010 Jan;44(1):109-117.
- Devitt, J.H., Kurrek, M.M., Cohen, M.M., Cleave-Hogg, D. The validity of performance assessments using simulation. Anesthesiology. 2001 Jul;95(1):36-42.
- Hobsley, M. Counting apples with oranges: a limitation of the discrimination index. Medical Education. 1999 Mar;33(3):192-196.
David L. Stahl, MD
Assistant Professor, Clinical
Associate Residency Program Director
The Ohio State University
Dr. David Stahl is the Associate Program Director for the Anesthesiology Residency Program at The Ohio State University and a practicing anesthesiologist-intensivist. He has a clinical interest in obstetric critical care and an educational interest in learner-driven educational objectives and inducing resilience in trainees.