One can often hear opponents of value-added referring to these methods as “junk science.” The term is meant to express the argument that value-added is unreliable and/or invalid, and that its scientific “façade” is without merit.
Now, I personally am not opposed to using these estimates in evaluations and other personnel policies, but I certainly understand opponents’ skepticism. For one thing, there are some states and districts in which design and implementation have been somewhat careless, and, in these situations, I very much share the skepticism. Moreover, the common argument that evaluations, in order to be “meaningful,” must assign value-added measures a heavily-weighted role (e.g., 45-50 percent) is, in my view, unsupportable.
All that said, calling value-added “junk science” completely obscures the important issues. The real questions here are less about the merits of the models per se than how they’re being used. Read More »
** Reprinted here in the Washington Post
Former Florida Governor Jeb Bush has become one of the more influential education advocates in the country. He travels the nation armed with a set of core policy prescriptions, sometimes called the “Florida formula,” as well as “proof” that they work. The evidence that he and his supporters present consists largely of changes in average statewide test scores – NAEP and the state exam (FCAT) – since the reforms started going into place. The basic idea is that increases in testing results are the direct result of these policies.
Governor Bush is no doubt sincere in his effort to improve U.S. education, and, as we’ll see, a few of the policies comprising the “Florida formula” have some test-based track record. However, his primary empirical argument on their behalf – the coincidence of these policies’ implementation with changes in scores and proficiency rates – though common among both “sides” of the education debate, is simply not valid. We’ve discussed why this is the case many times (see here, here and here), as have countless others, in the Florida context as well as more generally.*
There is no need to repeat those points, except to say that they embody the most basic principles of data interpretation and causal inference. It would be wonderful if the evaluation of education policies – or of school systems’ performance more generally – were as easy as looking at raw, cross-sectional testing data. But it is not.
Luckily, one need not rely on these crude methods. We can instead take a look at some of the rigorous research that has specifically evaluated the core reforms comprising the “Florida formula.” As usual, it is a far more nuanced picture than supporters (and critics) would have you believe. Read More »
** Reprinted here in the Washington Post
2012 was another busy year for market-based education reform. The rapid proliferation of charter schools continued, while states and districts went about the hard work of designing and implementing new teacher evaluations that incorporate student testing data, and, in many cases, performance pay programs to go along with them.
As in previous years (see our 2010 and 2011 reviews), much of the research on these three “core areas” – merit pay, charter schools, and the use of value-added and other growth models in teacher evaluations – appeared rather responsive to the direction of policy making, but could not always keep up with its breakneck pace.*
Some lag time is inevitable, not only because good research takes time, but also because there’s a degree to which you have to try things before you can see how they work. Nevertheless, what we don’t know about these policies far exceeds what we know, and, given the sheer scope and rapid pace of reforms over the past few years, one cannot help but get the occasional “flying blind” feeling. Moreover, as is often the case, the only unsupportable position is certainty. Read More »
In a New York Times article a couple of weeks ago, reporter Michael Winerip discusses New York City’s school report card grades, with a focus on an issue that I have raised many times – the role of absolute performance measures (i.e., how highly students score) in these systems, versus that of growth measures (i.e., whether students are making progress).
Winerip uses the example of two schools – P.S. 30 and P.S. 179 – one of which (P.S. 30) received an A on this year’s report card, while the other (P.S. 179) received an F. These two schools have somewhat similar student populations, at least so far as can be determined using standard education variables, and their students are very roughly comparable in terms of absolute performance (e.g., proficiency rates). The basic reason why one received an A and the other an F is that P.S. 179 received a very low growth score, and growth is heavily weighted in the NYC grade system (representing 60 out of 100 points for elementary and middle schools).
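The arithmetic behind this contrast can be sketched in a few lines. The 60-point growth weight for elementary and middle schools is from the post; the split of the remaining 40 points between absolute performance and other components, and both schools’ subscores, are hypothetical numbers chosen only to illustrate how a heavy growth weight can send two otherwise similar schools to opposite ends of the scale.

```python
# Illustrative only: the 60-point growth weight is from the post; the
# 25/15 split of the remaining 40 points and all subscores below are
# hypothetical assumptions for this sketch.
WEIGHTS = {"growth": 60, "performance": 25, "environment": 15}

def composite_score(subscores):
    """Combine 0-1 normalized subscores into a 0-100 composite."""
    return sum(WEIGHTS[k] * subscores[k] for k in WEIGHTS)

# Two hypothetical schools with identical absolute performance but very
# different growth results, echoing the P.S. 30 / P.S. 179 contrast.
school_a = {"growth": 0.80, "performance": 0.50, "environment": 0.60}
school_b = {"growth": 0.10, "performance": 0.50, "environment": 0.60}

print(round(composite_score(school_a), 1))  # 69.5
print(round(composite_score(school_b), 1))  # 27.5
```

With the same performance and environment subscores, the 42-point gap between the two composites comes entirely from the growth component – which is the design choice the rest of this post examines.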
I have argued previously that unadjusted absolute performance measures such as proficiency rates are inappropriate for test-based assessments of schools’ effectiveness, given that they tell you almost nothing about the quality of instruction schools provide. Growth measures are the better option, albeit one with its own issues (e.g., greater instability) that must be used responsibly. In this sense, the weighting of the NYC grading system is much more defensible than most of its counterparts across the nation, at least in my view.
But the system is also an example of how details matter – each school’s growth portion is calculated using an unconventional, somewhat questionable approach, one that is, as yet, difficult to treat with a whole lot of confidence. Read More »
Without question, designing school and district rating systems is a difficult task, and Ohio was somewhat ahead of the curve in attempting to do so (and they’re also great about releasing a ton of data every year). As part of its application for ESEA waivers, the state recently announced a newly-designed version of its long-standing system, with the changes slated to go into effect in 2014-15. State officials told reporters that the new scheme is a “more accurate reflection of … true [school and district] quality.”
In reality, however, despite its best intentions, what Ohio has done is perpetuate a troubled system by making less-than-substantive changes that seem to serve the primary purpose of giving lower grades to more schools in order for the results to square with preconceptions about the distribution of “true quality.” It’s not a better system in terms of measurement – both the new and old schemes consist of mostly the same inappropriate components, and the ratings differentiate schools based largely on student characteristics rather than school performance.
So, whether or not the aggregate results seem more plausible is not particularly important, since the manner in which they’re calculated is still deeply flawed. And demonstrating this is very easy. Read More »
Anyone who wants to start a charter school must of course receive permission, and there are laws and policies governing how such permission is granted. In some states, multiple entities (mostly districts) serve as charter authorizers, whereas in others, there is only one or very few. For example, in California there are almost 300 entities that can authorize schools, almost all of them school districts. In contrast, in Arizona, a state board makes all the decisions.
The conventional wisdom among many charter advocates is that the performance of charter schools depends a great deal on the “quality” of authorization policies – how those who grant (or don’t renew) charters make their decisions. This is often the response when supporters are confronted with the fact that charter results are varied but tend to be, on average, no better or worse than those of regular public schools. They argue that some authorization policies are better than others – i.e., that bad processes allow poorly-designed schools to start, while failing to close others.
This argument makes sense on the surface, but there seems to be scant evidence on whether and how authorization policies influence charter performance. From that perspective, the authorizer argument might seem a bit like a tautology – i.e., there are bad schools because authorizers allow bad schools to open, and fail to close them. As I am not particularly well-versed in this area, I thought I would look into this a little bit. Read More »
In a previous post, I discussed the idea of “attracting the best candidates” to teaching by reviewing the research on the association between pre-service characteristics and future performance (usually defined in terms of teachers’ estimated effect on test scores once they get into the classroom). In general, this body of work indicates that, while far from futile, it’s extremely difficult to predict who will be an “effective” teacher based on their paper traits, including those that are typically used to define “top candidates,” such as the selectivity of the undergraduate institutions they attend, certification test scores and GPA (see here, here, here and here, for examples).
There is some very limited evidence that other, “non-traditional” measures might help. For example, a working paper, released last year, found a statistically discernible, fairly strong association between first-year math value-added and an index constructed from surveys administered to Teach for America candidates. There was, however, no association in reading (note that the sample was small), and no relationship in either subject during these teachers’ second years.*
A recently-published paper – which appears in the peer-reviewed journal Education Finance and Policy, originally released as a working paper in 2008 – represents another step forward in this area. The analysis, presented by the respected quartet of Jonah Rockoff, Brian Jacob, Thomas Kane, and Douglas Staiger (RJKS), attempts to look beyond the set of characteristics that researchers are typically constrained (by data availability) to examine.
In short, the results do reveal some meaningful, potentially policy-relevant associations between pre-service characteristics and future outcomes. From a more general perspective, however, they are also a testament to the difficulties inherent in predicting who will be a good teacher based on observable traits. Read More »
There is some controversy over the fact that Florida’s recently-announced value-added model (one of a class often called “covariate adjustment models”), which will be used to determine merit pay bonuses and other high-stakes decisions, doesn’t include a direct measure of poverty.
Personally, I support adding a direct income proxy to these models, if for no other reason than to avoid this type of debate (and to facilitate the disaggregation of results for instructional purposes). It does bear pointing out, however, that the measure that’s almost always used as a proxy for income/poverty – students’ eligibility for free/reduced-price lunch – is terrible as a poverty (or income) gauge. It tells you only whether a student’s family has earnings below (or above) a given threshold (usually 185 percent of the poverty line), and this masks most of the variation among both eligible and non-eligible students. For example, families with incomes of $5,000 and $20,000 might both be coded as eligible, while families earning $40,000 and $400,000 are both coded as not eligible. A lot of hugely important information gets ignored this way, especially when the vast majority of students are (or are not) eligible, as is the case in many schools and districts.
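The information loss described above is easy to see in code. The 185 percent cutoff is from the post; the poverty-line figure below is a hypothetical placeholder, not the actual federal guideline for any year or family size.

```python
# Sketch: free/reduced-price lunch eligibility collapses family income to a
# single cutoff (commonly 185 percent of the federal poverty line).
POVERTY_LINE = 22_000          # hypothetical annual poverty threshold
CUTOFF = 1.85 * POVERTY_LINE   # eligibility cutoff (~$40,700 here)

def frl_eligible(family_income):
    """Binary proxy: True if income is at or below 185% of the poverty line."""
    return family_income <= CUTOFF

# Very different incomes receive identical codes on either side of the cutoff.
for income in (5_000, 20_000, 50_000, 400_000):
    print(income, frl_eligible(income))
```

The first two incomes both code as eligible and the last two both code as not eligible, even though the within-group differences ($5,000 vs. $20,000; $50,000 vs. $400,000) dwarf the distance to the cutoff itself.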
That said, it’s not quite accurate to assert that Florida and similar models “don’t control for poverty.” The model may not include a direct income measure, but it does control for prior achievement (a student’s test score in the previous year[s]). And a student’s test score is probably a better proxy for income than whether or not they’re eligible for free/reduced-price lunch.
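To make the “controls for prior achievement” point concrete, here is a minimal sketch of a covariate adjustment model on simulated data: current scores are regressed on prior scores plus a teacher indicator, and the indicator’s coefficient is the value-added estimate. All numbers (sample size, the true 3-point teacher effect, noise levels) are invented for illustration; this is the general approach, not Florida’s actual specification.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: 1,000 students split between two teachers.
n = 1000
teacher = rng.integers(0, 2, n)           # teacher 0 or teacher 1
prior = rng.normal(50, 10, n)             # last year's score
true_effect = np.array([0.0, 3.0])        # teacher 1 adds 3 points (assumed)
current = 5 + 0.9 * prior + true_effect[teacher] + rng.normal(0, 5, n)

# Covariate adjustment model: regress current score on prior score and a
# teacher indicator; the indicator's coefficient is the value-added estimate.
X = np.column_stack([np.ones(n), prior, teacher])
coefs, *_ = np.linalg.lstsq(X, current, rcond=None)
print(round(float(coefs[2]), 2))  # estimated teacher-1 effect (true value: 3.0)
```

Notice that family income never appears, yet the model still adjusts for much of what income would capture, because prior achievement already reflects those out-of-school influences – which is the point made in the paragraph above.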
Even more importantly, however, the key issue about bias is not whether the models “control for poverty,” but rather whether they control for the range of factors – school and non-school – that are known to affect student test score growth, independent of teachers’ performance. Income is only one part of this issue, which is relevant to all teachers, regardless of the characteristics of the students that they teach. Read More »
Most of the controversy surrounding value-added and other test-based models of teacher productivity centers on the high-stakes use of these estimates. This is unfortunate – no matter what you think about these methods in the high-stakes context, they have a great deal of potential to improve instruction.
When supporters of value-added and other growth models talk about low-stakes applications, they tend to assert that the data will inspire and motivate teachers who are completely unaware that they’re not raising test scores. In other words, confronted with the value-added evidence that their performance is subpar (at least as far as tests are an indication), teachers will rethink their approach. I don’t find this very compelling. Value-added data will not help teachers – even those who believe in its utility – unless they know why their students’ performance appears to be comparatively low. It’s rather like telling a baseball player they’re not getting hits, or telling a chef that the food is bad – it’s not constructive.
Granted, a big problem is that value-added models are not actually designed to tell us why teachers get different results – i.e., whether certain instructional practices are associated with better student performance. But the data can be made useful in this context; the key is to present the information to teachers in the right way, and rely on their expertise to use it effectively. Read More »
Despite all the heated talk about how to identify and dismiss low-performing teachers, there’s relatively little research on how administrators choose whom to dismiss, whether various dismissal options might actually serve to improve performance, and other aspects in this area. A paper by economist Brian Jacob, released as a working paper in 2010 and published late last year in the journal Educational Evaluation and Policy Analysis, helps address at least one of these voids by providing one of the few recent glimpses into administrators’ actual dismissal decisions.
Jacob exploits a change in Chicago Public Schools (CPS) personnel policy that took effect for the 2004-05 school year, one which strengthened principals’ ability to dismiss probationary teachers, allowing non-renewal for any reason, with minimal documentation. He was able to link these personnel records to student test scores, teacher and school characteristics and other variables, in order to examine the characteristics that principals might be considering, directly or indirectly, in deciding who would and would not be dismissed.
Jacob’s findings are intriguing, suggesting a more complicated situation than is sometimes acknowledged in the ongoing debate over teacher dismissal policy. Read More »
A new report, commissioned by District of Columbia Mayor Vincent Gray and conducted by the Chicago-based consulting organization IFF, was supposed to provide guidance on how the District might act and invest strategically in school improvement, including optimizing the distribution of students across schools, many of which are either over- or under-enrolled.
Needless to say, this is a monumental task. Not only does it entail the identification of high- and low-performing schools, but plans for improving them as well. Even the most rigorous efforts to achieve these goals, especially in a large city like D.C., would be to some degree speculative and error-prone.
This is not a rigorous effort. IFF’s final report is polished and attractive, with lovely maps and color-coded tables presenting a lot of summary statistics. But there’s no emperor underneath those clothes. The report’s data and analysis are so deeply flawed that its (rather non-specific) recommendations should not be taken seriously. Read More »
In a new National Bureau of Economic Research working paper on teacher value-added, researchers Raj Chetty, John Friedman and Jonah Rockoff present results from their analysis of an incredibly detailed dataset linking teachers and students in one large urban school district. The data include students’ testing results between 1991 and 2009, as well as proxies for future student outcomes, mostly from tax records, including college attendance (whether they were reported to have paid tuition or received scholarships), childbearing (whether they claimed dependents) and eventual earnings (as reported on the returns). Needless to say, the actual analysis includes only those students for whom testing data were available, and who could be successfully linked with teachers (with the latter group of course limited to those teaching math or reading in grades 4-8).
The paper caused a remarkable stir last week, and for good reason: It’s one of the most dense, important and interesting analyses on this topic in a very long time. Much of the reaction, however, was less than cautious, specifically the manner in which the research findings were interpreted to support actual policy implications (also see Bruce Baker’s excellent post).
What this paper shows – using an extremely detailed dataset and sophisticated, thoroughly-documented methods – is that teachers matter, perhaps in ways that some didn’t realize. What it does not show is how to measure and improve teacher quality, which are still open questions. This is a crucial distinction, one which has been discussed on this blog numerous times (also here and here), as it is frequently obscured or outright ignored in discussions of how research findings should inform concrete education policy. Read More »
** Also posted here on Valerie Strauss’s ‘Answer Sheet’ in the Washington Post
If 2010 was the year of the bombshell in research in the three “major areas” of market-based education reform – charter schools, performance pay, and value-added in evaluations – then 2011 was the year of the slow, sustained march.
Last year, the landmark Race to the Top program was accompanied by a set of extremely consequential research reports, ranging from the policy-related importance of the first experimental study of teacher-level performance pay (the POINT program in Nashville) and the preliminary report of the $45 million Measures of Effective Teaching project, to the political controversy of the Los Angeles Times’ release of teachers’ scores from its commissioned analysis of Los Angeles testing data.
In 2011, on the other hand, as new schools opened and states and districts went about the hard work of designing and implementing new evaluation and compensation systems, the research almost seemed to adapt to the situation. There were few (if any) “milestones,” but rather a steady flow of papers and reports focused on the finer-grained details of actual policy.*
Nevertheless, a review of this year’s research shows that one thing remained constant: Despite all the lofty rhetoric, what we don’t know about these interventions outweighs what we do know by an order of magnitude. Read More »
Value-added and other types of growth models are probably the most controversial issue in education today. These methods, which use sophisticated statistical techniques to attempt to isolate a teacher’s effect on student test score growth, are rapidly assuming a central role in policy, particularly in the new teacher evaluation systems currently being designed and implemented. Proponents view them as a primary tool for differentiating teachers based on performance/effectiveness.
Opponents, on the other hand, including a great many teachers, argue that the models’ estimates are unstable over time, subject to bias and imprecision, and that they rely entirely on standardized test scores, which are, at best, an extremely partial measure of student performance. Many have come to view growth models as exemplifying all that’s wrong with the market-based approach to education policy.
It’s very easy to understand this frustration. But it’s also important to separate the research on value-added from the manner in which the estimates are being used. Virtually all of the contention pertains to the latter, not the former. Actually, you would be hard-pressed to find many solid findings in the value-added literature that wouldn’t ring true to most educators. Read More »
Using value-added and other types of growth model estimates in teacher evaluations is probably the most controversial and oft-discussed issue in education policy over the past few years.
Many people (including a large proportion of teachers) are opposed to using student test scores in their evaluations, as they feel that the measures are not valid or reliable, and that they will incentivize perverse behavior, such as cheating or competition between teachers. Advocates, on the other hand, argue that student performance is a vital part of teachers’ performance evaluations, and that the growth model estimates, while imperfect, represent the best available option.
I am sympathetic to both views. In fact, in my opinion, there are only two unsupportable positions in this debate: Certainty that using these measures in evaluations will work; and certainty that it won’t. Unfortunately, that’s often how the debate has proceeded – two deeply-entrenched sides convinced of their absolutist positions, and resolved that any nuance in or compromise of their views will only preclude the success of their efforts. You’re with them or against them. The problem is that it’s the nuance – the details – that determine policy effects.
Let’s be clear about something: I’m not aware of a shred of evidence – not a shred – that the use of growth model estimates in teacher evaluations improves performance of either teachers or students. Read More »