When Rater Reliability Is Not Enough: Teacher Observation Systems and a Case for the Generalizability Study

Issue/Topic: Teaching Quality--Evaluation and Effectiveness
Author(s): Charalambous, Charalambos; Kraft, Matthew; Hill, Heather
Organization(s): American Educational Research Association
Publication: Educational Researcher
Published On: 2012

Researchers have documented the multiple sources of variance in observational scores due to the sampling of lessons, differences among raters, and the characteristics of the observational instrument itself. In an era that will undoubtedly see major expansion in the number and use of observational instruments, practitioners and researchers need to more carefully examine the sources of variation in observational scores and consider their implications for how these ratings are used.

Purpose:
To illustrate how observational systems might be developed and improved by including not only observational instruments but also scoring designs capable of producing reliable, cost-efficient scores and processes for rater recruitment, training, and certification.

Key Findings:

  • For classroom observation to succeed in its aims, improved observational systems must be developed. These systems should include not only observational instruments but also scoring designs capable of producing reliable and cost-efficient scores and processes for rater recruitment, training, and certification.

  • Contrary to common practice, it is misleading to talk about the reliability of specific instruments; instead, reliability inheres in the joint combination of instruments, rater training and certification systems, and specific scoring designs that constitute an observational system.

  • The analysis demonstrates empirically the hazard of using a common metric—80% interrater agreement—as a sole measure of the reliability of a classroom observation system. Some items below this threshold performed well in the G-study analysis, whereas other items that met this threshold performed poorly.

  • Although reaching high rater agreement levels is clearly preferable, it may not be feasible for some complex performance arenas within teaching, nor should it be used as the sole criterion for determining score reliability.
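The divergence between percent agreement and chance-corrected agreement can be shown with a toy numerical example (the score values and segment counts below are invented for illustration and are not drawn from the study's data): when one score category dominates, two raters can agree on 80% of segments while Cohen's kappa is at or below zero.

```python
from collections import Counter

def percent_agreement(a, b):
    """Proportion of segments on which two raters give identical scores."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e), where p_e is
    the agreement expected from each rater's marginal score frequencies."""
    n = len(a)
    p_o = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    p_e = sum(counts_a[k] * counts_b[k] for k in set(a) | set(b)) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# 20 hypothetical segments scored 1-3; both raters say "medium" (2)
# almost every time, so they agree on 16/20 = 80% of segments ...
rater1 = [2] * 16 + [2, 2, 1, 1]
rater2 = [2] * 16 + [1, 1, 2, 2]

# ... yet kappa is negative, because the skewed score distribution
# makes 82% agreement expected by chance alone.
```

Because raters rarely leave the dominant category, the 80% observed agreement carries almost no information about whether the instrument separates teachers, which is why the authors treat agreement thresholds as insufficient on their own.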

Policy Implications/Recommendations:
Developing reliable and litigation-proof observational systems takes time, expertise, and substantial financial resources. Given that the United States is moving toward national standards for content and curriculum, and that there is little reason to believe the basics of good teaching vary greatly from Mississippi to New York, the authors argue for focusing national efforts on developing a set of carefully tested classroom observation systems. Doing so, they contend, would build confidence that the scores and instructional feedback derived from such systems can become trusted inputs to teacher development and teacher evaluation systems.

Research Design:
Generalizability study (G-study) and a series of decision-type studies (D-studies) conducted for the Mathematical Quality of Instruction (MQI) instrument, a measure of mathematics instruction

From a pool of 24 middle school teachers participating in a related study, the authors sampled eight who represented different levels of mathematical knowledge for teaching. Nine graduate students and former teachers served as raters.


Data Collection and Analysis:
Each of the nine raters assigned scores of low, medium, or high (1, 2, or 3) for every item for each segment within a sample of 24 lessons (eight teachers with three lessons each). To analyze the data, the authors aggregated the segment scores to the lesson level and then conducted a G-study to estimate the variance components attributable to teachers, lessons, and raters; their two-way interactions; and the combination of the three-way interaction and measurement error.
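The variance decomposition described above can be sketched for a balanced, fully crossed teachers x lessons x raters design. This is a generic generalizability-theory sketch on simulated scores, not the authors' MQI analysis (their design and estimation details may differ); the function names and synthetic data are illustrative assumptions.

```python
import numpy as np

def g_study(X):
    """Estimate variance components for balanced scores X with shape
    (teachers, lessons, raters), fully crossed, via the standard
    expected-mean-square equations; negatives are truncated to zero."""
    n_t, n_l, n_r = X.shape
    grand = X.mean()
    m_t, m_l, m_r = X.mean(axis=(1, 2)), X.mean(axis=(0, 2)), X.mean(axis=(0, 1))
    m_tl, m_tr, m_lr = X.mean(axis=2), X.mean(axis=1), X.mean(axis=0)

    # Mean squares for main effects, two-way interactions, and residual
    ms_t = n_l * n_r * ((m_t - grand) ** 2).sum() / (n_t - 1)
    ms_l = n_t * n_r * ((m_l - grand) ** 2).sum() / (n_l - 1)
    ms_r = n_t * n_l * ((m_r - grand) ** 2).sum() / (n_r - 1)
    ms_tl = n_r * ((m_tl - m_t[:, None] - m_l + grand) ** 2).sum() / ((n_t - 1) * (n_l - 1))
    ms_tr = n_l * ((m_tr - m_t[:, None] - m_r + grand) ** 2).sum() / ((n_t - 1) * (n_r - 1))
    ms_lr = n_t * ((m_lr - m_l[:, None] - m_r + grand) ** 2).sum() / ((n_l - 1) * (n_r - 1))
    resid = (X - m_tl[:, :, None] - m_tr[:, None, :] - m_lr
             + m_t[:, None, None] + m_l[None, :, None] + m_r - grand)
    ms_e = (resid ** 2).sum() / ((n_t - 1) * (n_l - 1) * (n_r - 1))

    comps = {
        "tlr,e": ms_e,                      # three-way interaction + error
        "tl": (ms_tl - ms_e) / n_r,
        "tr": (ms_tr - ms_e) / n_l,
        "lr": (ms_lr - ms_e) / n_t,
        "t": (ms_t - ms_tl - ms_tr + ms_e) / (n_l * n_r),
        "l": (ms_l - ms_tl - ms_lr + ms_e) / (n_t * n_r),
        "r": (ms_r - ms_tr - ms_lr + ms_e) / (n_t * n_l),
    }
    return {k: max(v, 0.0) for k, v in comps.items()}

def d_study_g(comps, n_lessons, n_raters):
    """D-study: relative G coefficient for a hypothetical design that
    averages over n_lessons lessons and n_raters raters per teacher."""
    rel_err = (comps["tl"] / n_lessons + comps["tr"] / n_raters
               + comps["tlr,e"] / (n_lessons * n_raters))
    return comps["t"] / (comps["t"] + rel_err)

# Synthetic demonstration mirroring the study's sample sizes
# (8 teachers x 3 lessons x 9 raters); the scores are invented.
rng = np.random.default_rng(0)
teacher_effect = rng.normal(0.0, 2.0, size=8)   # true teacher-level signal
scores = teacher_effect[:, None, None] + rng.normal(0.0, 0.5, size=(8, 3, 9))
comps = g_study(scores)
g = d_study_g(comps, n_lessons=3, n_raters=2)
```

In an actual D-study one would vary `n_lessons` and `n_raters` to find the least costly scoring design that still reaches a target G coefficient, which is the sense in which reliability belongs to the whole observational system rather than to the instrument alone.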

