Senior Education Officer
April 21, 2014
By Steve Benton
I am the survivor of two open-heart surgeries, the first when I was 12 years old and the second midway through my 52nd year. I was congenitally defective. (If I had been a car, they would have recalled me.)
Across the years, health professionals have frequently used a sphygmomanometer (I can barely spell it, let alone say it) to measure my systolic and diastolic blood pressure (BP). In a single office visit I have sometimes had my BP measured by three different health professionals. Although it may vary somewhat across measurements, with multiple samplings one can see a fairly consistent pattern.
The consistency found in multiple observations of BP also applies to student ratings of instruction. The publication “Using IDEA Results for Administrative Decision-making” states that “the Center recommends using six to eight classes…that are representative of all of an instructor’s teaching responsibilities” when making decisions about teaching effectiveness (p. 1). More samples should be collected if class size is small (fewer than 10 students). This ensures that no single class has an undue effect on the overall evaluation of teaching effectiveness.
All measurements have error, including BP and student ratings. But across multiple samples, error (i.e., uncertainty) is reduced and the reliability of the data increases. For example, Figure 1 below plots the inter-class reliability coefficients for IDEA SRI Item #41, excellence of the instructor, as a function of the number of classes rated. As the number of classes rated for an instructor increases, the reliability coefficient increases. Once approximately six classes have been rated, the coefficient remains above .90, a very high reliability. The same is true for Figure 2, which shows the plot for Item #42, excellence of the course.¹
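The general pattern described above, reliability rising as ratings from more classes are combined, can be illustrated with the Spearman-Brown prophecy formula. This is only a sketch: the post does not say how IDEA computed its coefficients, and the single-class reliability of 0.60 used here is a hypothetical value chosen for illustration, not a figure from the IDEA data.

```python
def spearman_brown(single_class_reliability: float, n_classes: int) -> float:
    """Projected reliability of an average over n_classes ratings.

    Spearman-Brown prophecy formula: n * r / (1 + (n - 1) * r),
    where r is the reliability of a single class's rating.
    """
    r = single_class_reliability
    return n_classes * r / (1 + (n_classes - 1) * r)


# Hypothetical single-class reliability (an assumption, not an IDEA result).
ASSUMED_SINGLE_CLASS_RELIABILITY = 0.60

for n in (1, 2, 4, 6, 8):
    projected = spearman_brown(ASSUMED_SINGLE_CLASS_RELIABILITY, n)
    print(f"{n} classes: projected reliability = {projected:.2f}")
```

Under this assumed starting value, the projected coefficient reaches .90 at six classes and keeps rising more slowly after that, which mirrors the shape of the curves the post describes. A different single-class reliability would shift where the curve crosses .90.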
The plots in these figures provide empirical evidence to support IDEA’s recommendation that ratings be collected from several classes before making summative decisions about an instructor’s teaching effectiveness. Just as health professionals feel more confident drawing conclusions from multiple measurements than from a single one, so should those who evaluate teaching.
Figure 1. Inter-class Reliability Coefficients for Item 41 as a Function of Number of Classes Rated
Figure 2. Inter-class Reliability Coefficients for Item 42 as a Function of Number of Classes Rated
¹ Item 41 is worded, “Overall, I rate this instructor an excellent teacher.” Item 42 states, “Overall, I rate this course as excellent.” Students respond 1 = Definitely False, 2 = More False Than True, 3 = In Between, 4 = More True Than False, or 5 = Definitely True. Reliability coefficients are based on 2,500 classes.