The Waive of the Future? School Accountability in the Waiver Era

Issue/Topic: Accountability--Measures/Indicators; No Child Left Behind--Reauthorization Issues/Waivers
Author(s): Polikoff, Morgan; McEachin, Andrew; Wrabel, Stephani; Duque, Matthew
Organization(s): American Educational Research Association
Publication: Educational Researcher
Published On: 12/11/2013

Given Congress’s inaction in reauthorizing the Elementary and Secondary Education Act (ESEA), the U.S. Department of Education implemented a waiver program in 2012 to reduce the law’s burden on states. The waiver program provides states the opportunity to implement their own accountability systems, often substantially reducing the number of schools identified for intervention.

Purpose:
To evaluate ESEA waivers against four standards of practice for the appropriate use of assessment data -- construct validity, reliability, fairness and transparency -- established by the American Psychological Association (APA), the American Educational Research Association (AERA) and the National Council on Measurement in Education (NCME), as well as by the measurement and assessment literature.


Construct Validity
  • There are two main ways in which the identification of priority and focus schools in the waiver plans is superior in construct validity to the way schools were identified under the No Child Left Behind Act (NCLB):

    • The inclusion of non-test-based measures -- the most common non-state-test measure is a college/career readiness indicator, but states also include attendance, test participation rates, educator effectiveness, school climate and opportunity-to-learn measures.
    • The use of test-based measures other than proficiency rates -- although no state uses achievement growth as a standalone criterion for identification, 20 states use a composite index that includes growth. These measures come much closer to identifying schools’ contributions to student learning than NCLB’s percent proficient.

  • Although these are promising signs, almost anything would have been an improvement over the construct validity of adequate yearly progress (AYP) classifications, which were based only on proficiency rates, or changes in proficiency rates, in math and English language arts (ELA). However, there are two main shortcomings with the construct validity of the proposed systems:

    • The ESEA guidelines did not require states to include science assessments in their accountability systems; thus, most states are still using only mathematics and ELA to identify schools, leaving strong incentives to focus on those two subjects at the expense of nontested or low-stakes subjects.
    • Although some states include creative nontest measures in their indices, these rarely account for a substantial proportion of the total and are mainly for high schools.

Reliability
  • AYP was highly reliable in identifying low-performing schools, because school-level proficiency rates are stable from year to year.

  • There are several reasons to think that the waiver classifications will be less reliable than AYP:

    • Priority and focus classifications are based on a fixed percentage of schools. This norm-referenced approach decreases reliability because of measurement error and imprecision around the cutoff.
    • The use of growth models in composite indices decreases reliability. With what is known about teacher-level measures of student growth, the year-to-year stability of school-level student growth measures is likely moderate at best. Although states could use multiple years of data for their accountability measures, only 12 states chose to do so, and some of these states used multiple years of data for their status measures only, which improves reliability marginally. Even with multiple years of data, composite indices that incorporate growth measures will be less stable than AYP.

  • The use of growth data comes with an important tradeoff: it enhances construct validity but decreases reliability. The most reliable systems will be those that rely on status measures of performance -- precisely the measures noted above as having weak construct validity. Among states with measures of stronger construct validity, those using multiple years of data will have greater reliability.

Fairness
  • The fairness of the waiver plans will likely improve on the current AYP system; however, they will still suffer from similar biases against schools serving more students from historically low-performing subgroups because of the heavy reliance on status-based measures of achievement.

  • Although diverse schools will be more likely to fail under any accountability system save one that explicitly controls for student demographics, there are some ways in which the approved waivers will decrease the diversity penalty. One way is in states using “super subgroups” rather than NCLB subgroups for priority or focus classification. Super subgroups generally take two forms: a combination of subgroups based on demographics or a subgroup of the lowest performing students in a school.

  • Composite indices incorporating growth models will also be fairer than those based more heavily on status measures, because the correlations of growth measures with student characteristics are smaller.

Transparency
  • Although NCLB’s AYP system was unfair and had weak construct validity, the use of percent proficient was reasonably transparent. On the surface, the new grading systems in place in most states are even more transparent. Many states use either an A-F or point-based index system, which condenses multiple measures of school performance into an aggregate. These indices, because of their familiar form, should be fairly interpretable by educators and the public. However, several problems limit the transparency of these indices, including that many states have composite indices but do not use them to identify priority or focus schools.

  • The transparency of the measures is further limited because many state indices apply seemingly arbitrary weights to unrelated measures to arrive at a composite score. In these cases, there is an apparent tradeoff between the increased construct validity of a composite index and its transparency.

Policy Implications/Recommendations:
  • The first and most important policy recommendation is to incorporate the lessons learned from NCLB into the implementation of the waiver applications. A number of the waiver applications propose policies that are known to pose specific problems. Moving away from unadjusted proficiency rates and adding more tested subjects to accountability systems would considerably improve the construct validity of classifications. Since all states are required to test science, they should include science results in priority and focus determinations.

  • To further improve the construct validity and fairness of accountability classifications, the U.S. Department of Education should allow states to create more refined comparison groups for schools by conditioning on student demographics in the construction of school performance measures. By excluding student demographics from performance measures, the system expects the same performance from all schools regardless of their student inputs, penalizing schools for factors they cannot control. Although political pressures push against the use of controls for student background, this unfairness may contribute to unintended consequences, such as teachers preferring to work in schools serving more affluent children.

  • Another way to improve construct validity is to move from within-state achievement gaps to within-school or within-district gaps. This would shift attention from statewide low-performing subgroups to reducing the gap within a school or district, sending a clear message that all students within a school deserve attention and effort.

  • To improve the reliability of performance classifications, states should use multiple years of data for school performance measures, especially measures incorporating student growth. By now, most states have the ability to use multiple years of data in the construction of school performance measures; there is no good reason not to.

  • It would also improve reliability if states moved away from the arbitrary norm-referenced approach to identifying low-performing schools encouraged by U.S. Department of Education guidelines. Although setting the bar at the bottom 5% or 10% creates a more manageable sample size and likely reduces total costs associated with state interventions, it also adds noise to the system. By design, the use of these cutoffs sends the message that a school at the 10th percentile is failing but one at the 11th is not, even though the two may not meaningfully differ. Rather, schools may benefit from a clear operational definition of a low-performing school that is based on a set of performance criteria.

  • To improve both transparency and construct validity of classification systems, states should reevaluate the construction of composite measures and their use for identifying schools. Although A-to-F systems are, on the surface, transparent, the underlying design of these systems involves a great deal of arbitrariness that makes it difficult for educators and parents to understand performance. Furthermore, keeping indicators separate allows for a more nuanced understanding of the strengths and weaknesses of schools that can be used to tailor interventions.

  • States should conduct short-term analyses of the implementation of their waiver systems and make adjustments.

Full text of study: http://edr.sagepub.com/content/43/1/45.full.pdf+html?ijkey=LoPEgefArEO0M&keytype=ref&siteid=spedr

Research Design:
Each of the 42 flexibility waiver applications approved by the U.S. Department of Education was analyzed for construct validity, reliability, fairness and transparency using a three-phase process.


Data Collection and Analysis:
Three-phase analysis of 42 approved NCLB flexibility requests: (1) waiver applications were read, outlined and condensed according to waiver principles; (2) the condensed outlines were used to code the accountability designs, including subgroup size, subjects tested, components and weights of composite indices, and growth measures; and (3) the four measurement criteria were applied to describe each waiver application.

