Teacher Selection: Smart Selection vs. Dumb Selection

I had a Twitter argument the other day about a blog posting that compared the current debate around “de-selection” of bad teachers to eugenics. It is perhaps a bit harsh to compare Hanushek (cited author of papers on de-selecting bad teachers) to Hitler, if that was indeed the intent. However, I did not take that as the intent of the posting by Cedar Riener. Offensive or not, I felt that the blog posting made three key points about errors of reasoning that apply both to eugenicists and to those promoting empirical de-selection of fixed shares of the teacher workforce. Here’s a quick summary of those three points:

  • The first error is a deterministic view of a complex and uncertain process.
  • The second common error becomes apparent once the need arises to concretely measure quality.
  • The third error is a belief that important traits are fixed rather than changeable.

These are critically important, and they help us delineate between smart selection and, well, dumb selection. These three errors of reasoning are the basis for dumb selection – a selection process that is, as the author explains, destined to fail. But I do not see this particular condemnation of dumb selection as a condemnation of selection more generally. By contrast, the reformy pundit with whom I was arguing continued to claim that Riener’s blog was condemning any and all forms of selection as doomed to fail – a seemingly absurd proposition (and not how I read it at all).

Clearly, selection can and should play a positive role in the formation of the teacher workforce or in the formation of that team of school personnel that can make a school great.

Smart Selection: Some form of selection exists in nearly every human endeavor and in any workforce or labor market activity: selection of individuals into specific careers, jobs and roles, and de-selection of individuals out of them. Selection in and of itself is clearly not a bad thing. In fact, the best organizations necessarily select the best available individuals over time to work within those organizations. And individuals attempt to select the best organizations, careers, jobs and roles to suit their interests, motivation and needs – that is, self-selection. Teacher selection, or any education system employee selection, is no different. And good teacher selection is obviously important for having good schools. Like any selection process in the labor market, teacher selection involves a two-sided match. On the one hand, there are the school leaders and existing employees (to the extent they play a role in recruitment and selection) who may determine, among a pool of applicants, which ones are the best fit for their school and the specific job in question. On the other hand, there are the signals sent out by the school (some within and some outside the control of existing staff and leaders) which influence the composition of the applicant pool and, for that matter, whether an individual who is selected decides to stay. These include signals about compensation, job characteristics and work environment. Managing this complex system well is key to having a great school. Sending the right signals. Creating the right environment. Making the right choices among applicants. Knowing when a choice was wrong. And handling difficult decisions with integrity.

There has also been much discussion of late about a recent publication by Brian Jacob of the University of Michigan, who found that, when given the opportunity to play a strong role in selecting which probationary teachers should continue in their schools, principals generally selected teachers who later proved to generate good statistical outcomes (test scores). Note that this approach to declaring successful decision making suffers from the circular logic I’ve frequently bemoaned on this blog. But, at the very least, Jacob’s findings suggest that decisions made by individuals – human beings considering multiple factors – are not counterproductive when measured against our current batch of narrow and noisy metrics. Specifically, Jacob found:

Principals are more likely to dismiss teachers who are frequently absent and who have previously received poor evaluations. They dismiss elementary school teachers who are less effective in raising student achievement. Principals are also less likely to dismiss teachers who attended competitive undergraduate colleges. It is interesting to note that dismissed teachers who were subsequently hired by a different school are much more likely than other first-year teachers in their new school to be dismissed again.

That to me seems like good selection. And it seems that principals are doing it reasonably well when given the chance. And this is why I also support using principals as the key leverage point in the process (with the caveat that principal quality itself is very unequally distributed, and must be improved).

Dumb “Selection”: Dumb selection, on the other hand – the kind of selection that is destined to fail if applied en masse in public schooling or any other endeavor – suffers from the three major flaws of reasoning addressed by Cedar Riener in his blog post. Now, you say to yourself, who is really promoting dumb selection, and what, more specifically, are its elements when it comes to the teacher workforce? Here are the elements:

  1. Heavy weight (especially a defined, fixed, large share) placed on value-added metrics in making teacher evaluation, compensation or dismissal decisions – metrics which can be corrupted, may suffer severe statistical bias, and are highly noisy and error prone.
  2. Explicit prior specification of the exact share of teachers who should be de-selected in any given year (or year after year over time), OR prior specification of exact scores or ratings (categories) derived from those scores that require action to be taken – including de-selection.

Sadly, several states have already adopted into policy the first of these dumb selection concepts – the mandate of a fixed weight to be placed on problematic measures. See this post by Matt Di Carlo at ShankerBlog for more on this topic.

Thus far, I do not know of states or districts that have, for example, required that the bottom scoring 5% of teachers in any given year be de-selected. But states and districts have established categorical rating systems for teachers, from high to low rating groups, based on arbitrary cut points applied to these noisy measures, and have required that dismissal, intervention and compensation decisions be based on where teachers fall in the fixed, arbitrary classification scheme in a given year or over a sequence of three years.

To some extent, the notion of de-selecting fixed shares of the teacher workforce based on noisy metrics comes more from economists’ simulations – built on the convenience of available measures – than from active policy conversations. But in the past year, the lines between these simulations and reality have become blurred, as policy conversations have indeed drifted toward actually using fixed values based on noisy achievement measures, in place of seniority, as a blunt tool to de-select teachers during times of budget cuts. If and when these simplified social science thought exercises are applied as public policy involving teachers, they do reek of the disturbingly technocratic, “value-neutral” mindset pervasive in eugenics as well.
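
For the statistically inclined, here is a toy simulation of what “de-select the bottom 5% on a noisy metric” implies. The reliability value below is my own assumption for illustration (roughly in line with the year-to-year correlations discussed later in this post), not a parameter taken from any of the simulations referenced above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_teachers = 10_000
reliability = 0.3  # assumed share of the metric's variance that reflects true effectiveness

true_effect = rng.normal(0, 1, n_teachers)
noise_sd = np.sqrt((1 - reliability) / reliability)   # yields var(true)/var(observed) = reliability
observed = true_effect + rng.normal(0, noise_sd, n_teachers)

# "Dumb selection": de-select the bottom 5% on the observed score.
deselected = observed <= np.quantile(observed, 0.05)

share_not_truly_bottom = np.mean(true_effect[deselected] > np.quantile(true_effect, 0.05))
share_above_average = np.mean(true_effect[deselected] > 0)
print(f"De-selected, but not in the true bottom 5%: {share_not_truly_bottom:.0%}")
print(f"De-selected, but truly above average:       {share_above_average:.0%}")
```

Under these assumptions, most of the teachers cut are not in the true bottom 5%, and a non-trivial share are actually above average – the deterministic-view error in action.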

One other recent paper that’s gotten attention applies this technocratic (my preference over eugenic) approach to determine whether using performance measures instead of seniority would result in a) different patterns of layoffs and b) different average “effectiveness” scores (again, that circular logic rears its ugly head). Now, of course, if you lay off based on effectiveness scores rather than seniority, the average effectiveness scores of those left should be higher. The deck is stacked in this reformy analysis. But even then, the authors find very small differences, largely because a) seniority-based layoffs seem to affect mainly first and second year teachers, and b) effectiveness scores tend to be lower for first and second year teachers. Overall, the authors find:

We next examine our value-added measures of teacher effectiveness and find that teachers who received layoff notices were about 5 percent of a standard deviation less effective on average than the average teacher who did not receive a notice. This result is not surprising given that teachers who received layoff notices included many first and second-year teachers, and numerous studies show that, on average, effectiveness improves substantially over a teacher’s first few years of teaching.

Perhaps most importantly, these thought experiments – not ready for policy implementation prime time (nor will they ever be?) – necessarily ignore the full complexity of the system to which they are applied, and as Riener noted, assume that individuals’ traits are fixed: how you are rated by the statistical model today is assumed to be correct (despite a huge chance it’s not) and assumed to be sufficient for classifying your usefulness as an employee, now and forever (be it a 1 or 3 year snapshot). In that sense, Riener’s comparison, while offensive to some, was right on target.

To summarize: Smart selection good. Dumb selection bad. Most importantly, selection itself is neither good nor bad. It all depends on how it’s done.

Passing Muster Fails Muster? (An Evaluation of Evaluating Evaluation Systems)

The Brookings Institution has now released its web-based version of Passing Muster, including a nifty calculation tool for rating teacher evaluation systems. Unfortunately, in my view, this rating system fails muster in at least two major ways.

First, the authors explain their (lack of) preferences for specific types of evaluation systems as follows:

“Our proposal for a system to identify highly-effective teachers is agnostic about the relative weight of test-based measures vs. other components in a teacher evaluation system.  It requires only that the system include a spread of verifiable and comparable teacher evaluations, be sufficiently reliable and valid to identify persistently superior teachers, and incorporate student achievement on standardized assessments as at least some portion of the evaluation system for teachers in those grades and subjects in which all students are tested.”

That is, a district’s evaluation system can consider student test scores to whatever extent the district wants, in balance with other approaches to teacher evaluation. The logic here is a bit contorted from the start. The authors explain what they believe are necessary components of the system, but then claim to be agnostic on how those components are weighted.

But, if you’re not agnostic on the components, then saying you’re agnostic on the weights is not particularly soothing.

Clearly, they are not agnostic on the components or their weight, because the system goes on to evaluate the validity of each and every component based on the extent to which that component correlates with the subsequent year value-added measure.  This is rather like saying, we remain agnostic on whether you focus on reading or math this year, but we are going to evaluate your effectiveness by testing you on math. Or more precisely, we remain agnostic on whether you emphasize conceptual understanding and creative thinking this year, but we are going to evaluate your effectiveness on a pencil and paper, bubble test of specific mathematics competencies and vocabulary and grammar.

Second, while hanging ratings of evaluation systems entirely on their correlation with “next year’s value added,” the authors choose to again remain agnostic on the specifics for estimating the value-added effectiveness measures. That is, as I’ve blogged in the past, the authors express a strong preference that the value-added measures be highly correlated from year to year, but remain agnostic as to whether those measures are actually valid, or instead are highly correlated mainly because the measures contain significant consistent bias – bias which disadvantages specific teachers in specific schools – and does so year after year after year!

Here are the steps for evaluating a teacher evaluation system as laid out in Passing Muster:

Step 1: Target Percentile of True Value Added

Step 2: Constant factor (tolerance)

Step 3: Correlation of teacher-level total evaluation score in the current year with next year’s value added

Step 4: Correlation of non-value added components with next year’s value added

Step 5: Correlation of this year’s value added with next year’s value added

Step 6: Number of teachers subject to the same evaluation system used to calculate correlation in step 3 (a correlation with next year’s value added!)

Step 7: Number of current teachers subject to only the non-value added system

In researchy terms, their system is all reliability and no validity (or, at least, inferring the latter from the former).
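
To see how little machinery is involved, here is a minimal sketch of the correlation computations that steps 3 through 5 boil down to, using made-up teacher-level scores. The variable names and the data-generating assumptions are mine, not Brookings’.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500  # hypothetical number of teachers observed in both years

# Made-up standardized teacher-level scores, for illustration only.
vam_this_year = rng.normal(size=n)                          # this year's value-added
non_vam_score = 0.2 * vam_this_year + rng.normal(size=n)    # e.g., observation ratings
total_eval = 0.45 * vam_this_year + 0.55 * non_vam_score    # composite evaluation score
vam_next_year = 0.3 * vam_this_year + rng.normal(size=n)    # next year's value-added

r_step3 = np.corrcoef(total_eval, vam_next_year)[0, 1]      # Step 3
r_step4 = np.corrcoef(non_vam_score, vam_next_year)[0, 1]   # Step 4
r_step5 = np.corrcoef(vam_this_year, vam_next_year)[0, 1]   # Step 5
print(r_step3, r_step4, r_step5)
```

Notice that next year’s value-added sits on the right-hand side of every one of these correlations; nothing in the calculation ever asks whether that benchmark is itself an unbiased measure of teaching.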

But, rather than simply having each district evaluate its own evaluation system by correlating its current year ratings with next year’s value-added, the Brookings report suggests that states should evaluate district teacher evaluation systems by measuring the extent to which district teacher evaluations correlate with a state standardized value-added metric for the following year.

But again, the authors remain agnostic on how that model should/might be estimated, favoring that the state level model be “consistent” year to year, rather than accurate. After all, how could districts consistently measure the quality of their evaluation systems if the state external benchmark against which they are evaluated was not consistent?

As a result, where a state chooses to adopt a consistently biased statewide standardized value-added model, and use that model to evaluate district teacher evaluation systems, the state in effect backs districts into adopting consistently biased year-to-year teacher evaluations… that have the same consistent biases as the state model.

The report does suggest that in the future, there might be other appropriate external benchmarks, but that:

“Currently value-added measures are, in most states, the only one of these measures that is available across districts and standardized.  As discussed above, value-added scores based on state administered end-of-year or end-of-course assessments are not perfect measures of teaching effectiveness, but they do have some face validity and are widely available.”

That is, value-added measures – however well or poorly estimated – should be the benchmark for whether a teacher evaluation system is a good one, simply because they are available and we think, in some cases, that they may provide meaningful information (though even that remains disputable – to quote Jesse Rothstein’s review of the Gates/Kane Measures of Effective Teaching study: “In particular, the correlations between value-added scores on state and alternative assessments are so small that they cast serious doubt on the entire value-added enterprise.” See: http://nepc.colorado.edu/files/TTR-MET-Rothstein.pdf).

I might find some humor in all of this strange logic and circular reasoning if the policy implications weren’t so serious.

The Perils of Favoring Consistency over Validity: Are “bad” VAMs more “consistent” than better ones?

This is another stat-geeky researcher post, but I’ll try to tease out the practical implications. This post comes about partly, though not directly, in response to a new Brown Center/Brookings report on evaluating teacher evaluation systems. From that report, by an impressive team of authors, one can tease out two apparent preferences for evaluation systems – or, more specifically, for any statistical component of those systems based on student assessment scores:

  1. A preference to isolate as precisely as statistically feasible, the influence of the teacher on student test score gains;
  2. A preference to have a statistical rating of teacher effectiveness that is relatively consistent from year to year (where the more consistent models still aren’t particularly consistent).

While there shouldn’t necessarily be a conflict between identifying the best model of teacher effects and having a model that is reliable over time, I would argue that the pressure to achieve the second objective above may lead researchers – especially those developing models for direct application in school districts – to make inappropriate decisions regarding the first objective.  After all, one of the most common critiques levied at those using value-added models to rate teacher effectiveness is the lack of consistency of the year to year ratings.

Further, even the Brown Center/Brookings report took a completely agnostic stance regarding the possibility that better and worse models exist, but played up the relative importance of consistency, or reliability, of the teacher’s persistent effect over time.

There are “better” and “worse” models

The reality is that there are better and worse value-added models (though even the better ones remain problematic). Specifically, there are better and worse ways to handle certain problems that emerge from using value-added modeling to determine teacher effectiveness. One of the biggest issues is how well the model corrects for problems of the non-random assignment of students to teachers across classrooms and schools. It is incredibly difficult to untangle teacher effects from peer group effects and/or any other factor within schooling at the classroom level (mix of students, lighting, heating, noise, class size). We can only better isolate the teacher effect from these other effects if each teacher is given the opportunity to work across varied settings and with varied students over time.

A fine example of taking an insufficient model (the LA Times/Buddin model) and raising it to a higher level with the same data is the set of alternative modeling exercises prepared by Derek Briggs & Ben Domingue of the University of Colorado. Among other things, Briggs and Domingue show that including classroom-level peer characteristics, in addition to student-level dummy variables for economic status and race, significantly reduces the extent to which teacher effectiveness ratings remain influenced by the non-random sorting of students across classrooms.

In our first stage we looked for empirical evidence that students and teachers are sorted into classrooms non-randomly on the basis of variables that are not being controlled for in Buddin’s value-added model. To do this, we investigated whether a student’s teacher in the future could have an effect on a student’s test performance in the past—something that is logically impossible and a sign that the model is flawed (has been misspecified). We found strong evidence that this is the case, especially for reading outcomes. If students are non-randomly assigned to teachers in ways that systemically advantage some teachers and disadvantage others (e.g., stronger students tending to be in certain teachers’ classrooms), then these advantages and disadvantages will show up whether one looks at past teachers, present teachers, or future teachers. That is, the model’s outputs result, at least in part, from this bias, in addition to the teacher effectiveness the model is hoping to capture.

Later:

The second stage of the sensitivity analysis was designed to illustrate the magnitude of this bias. To do this, we specified an alternate value-added model that, in addition to the variables Buddin used in his approach, controlled for (1) a longer history of a student’s test performance, (2) peer influence, and (3) school-level factors.

Clearly, it is important to include classroom-level and peer group covariates in an attempt to identify more precisely the “teacher effect,” and to remove the bias in teacher estimates that results from the non-random ways in which kids are sorted across schools and classrooms.
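
Here is a stripped-down illustration of the falsification logic, using simulated data rather than the Los Angeles data, with a single made-up “background” variable standing in for everything the model omits. It is a sketch of the idea, not the Briggs/Domingue model itself.

```python
import numpy as np

rng = np.random.default_rng(2)
n_teachers, class_size = 50, 25
n_students = n_teachers * class_size

# A persistent, unmeasured student/peer factor (family background, tracking, etc.).
background = rng.normal(size=n_students)

# Non-random sorting: students are tracked into classrooms by background in
# both years, so similar kinds of students reach the same teachers each year.
def tracked_assignment(sorting_noise):
    order = np.argsort(background + rng.normal(0, sorting_noise, n_students))
    assignment = np.empty(n_students, dtype=int)
    assignment[order] = np.repeat(np.arange(n_teachers), class_size)
    return assignment

teacher_now = tracked_assignment(0.3)
teacher_next = tracked_assignment(0.3)   # next year's (different) teachers, same tracking

true_effect = rng.normal(0, 0.15, n_teachers)
gain_now = true_effect[teacher_now] + 0.5 * background + rng.normal(0, 1, n_students)

def spread_of_class_means(groups):
    """Standard deviation of classroom-mean gains under a given assignment."""
    return np.array([gain_now[groups == t].mean() for t in range(n_teachers)]).std()

# Falsification test: next year's teacher cannot cause this year's gains, so any
# excess spread relative to random assignment reflects sorting, not teaching.
print(f"by future teacher:        {spread_of_class_means(teacher_next):.2f}")
print(f"under random reshuffling: {spread_of_class_means(rng.permutation(teacher_now)):.2f}")
```

When the spread by future teacher clearly exceeds what random reshuffling produces – as it does here by construction, and as Briggs and Domingue found in the LA data – the teacher estimates are absorbing sorting, not just teaching.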

Two levels of the non-random assignment problem

To clarify, there may be at least two levels to the non-random assignment problem, and both may be persistent problems over time for any given teacher or group of teachers under a single evaluation system. In other words: Persistent non-random assignment!

As I mentioned above, we can only untangle the classroom level effects, which include different mixes of students, class sizes and classroom settings, or even time of day a specific course is taught, if each teacher to be evaluated has the opportunity to teach different mixes of kids, in different classroom settings and at different times of day and so on. Otherwise, some teachers are subjected to persistently different teaching conditions.

Focusing specifically on the importance of student and peer effects, it is more likely than not that, rather than having totally different groups and types of kids year after year, some teachers:

  • persistently work with children coming from the most disadvantaged family/household background environments;
  • persistently take on the role of trying to serve the most disruptive children.

At the very least, statistical modeling efforts must attempt to correct for the first of these peer effects with comprehensive classroom-level measures of peer composition (and a longer trail of lagged test scores for each student). Briggs and Domingue showed that doing so made significant improvements to the LA Times model. They also showed that the original model contained substantial biases, and failed specific falsification tests used to identify those biases: specifically, the effectiveness of a student’s subsequent teacher could be used to predict the effectiveness of their previous teacher. Briggs and Domingue note:

These results provide strong evidence that students are being sorted into grade 4 and grade 5 classrooms on the basis of variables that have not been included in the LAVAM (p. 11)

That is, a persistent pattern of non-random sorting which affects teachers’ effectiveness ratings. And, a persistent pattern of bias in those ratings that was significantly reduced by Briggs’ improved models.

At this point, you’re probably wondering why I keep harping on this term “persistent.”

Persistent Teacher Effect vs Persistent Model Bias?

So, back to the original point, and the conflict between those two objectives, reframed:

  1. Getting a model consistent enough to shut up those VAM naysayers;
  2. Estimating a statistically more valid VAM, by including appropriate levels of complexity (and accepting the reduced numbers of teachers who can be evaluated as data demands are increased).

Put this way, it’s a battle between REFORMY and RESEARCHY. Obviously, I favor the RESEARCHY perspective, mainly because it favors a BETTER MODEL! And a BETTER MODEL IS A FAIRER MODEL!  But sadly, I think that REFORMY will too often win this epic battle.

Now, about that word “persistent.” Ever since the Gates/Kane teaching effectiveness report, there has been new interest in identifying the “persistent effect of teachers” on student test score gains. That is, an obsession with focusing public attention on that tiny sapling of explained variation in test scores that persists from year to year, while making great effort to divert public attention away from the forest of variance explained by other factors. “Persistent” is also the term du jour for the Brown/Brookings report.

A huge leap in those reports is to expand the phrase “persistent effect” from the persistent classroom-level variance explained to the “persistent year to year contribution of teachers to student achievement” (p. 16, Brown/Brookings). It is assumed that any “persistent effect” estimated from any value-added model – regardless of the features of that model – represents a persistent “teacher effect.”

But the persistent effect likely contains two components – persistent teacher effect & persistent bias – and the balance of weight of those components depends largely on how well the model deals with non-random assignment. The “persistent teacher effect” may easily be dwarfed by the “persistent non-random assignment bias” in an insufficiently specified model (or one dependent on crappy data).

AND, the persistently crappy model – by failing to reduce the persistent bias – is actually quite likely to be much more stable over time.  In other words, if the model fails miserably at correcting for non-random assignment, a teacher who gets stuck with the most difficult kids year after year is much more likely to get a consistently bad rating. More effectively correct for non-random sorting, and the teacher’s rating likely jumps around at least a bit more from year to year.
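
A toy simulation makes the point concrete. The parameter values below are my own assumptions chosen for illustration, not estimates from any real model: a modest true teacher effect, a persistent classroom-composition factor, and ordinary year-to-year noise.

```python
import numpy as np

rng = np.random.default_rng(3)
n_teachers = 2000

true_effect = rng.normal(0, 0.10, n_teachers)     # modest, stable true teacher effects
composition = rng.normal(0, 1.00, n_teachers)     # persistent classroom composition:
                                                  # the same teachers get the same kinds
                                                  # of classrooms year after year

def yearly_rating(adjust_for_composition):
    noise = rng.normal(0, 0.20, n_teachers)       # noise in one year's classroom mean gain
    observed = true_effect - 0.15 * composition + noise
    if adjust_for_composition:
        slope = np.polyfit(composition, observed, 1)[0]
        return observed - slope * composition      # "better" model: regress out composition
    return observed                                # "crappy" model: no adjustment

for adjust in (False, True):
    year1, year2 = yearly_rating(adjust), yearly_rating(adjust)
    stability = np.corrcoef(year1, year2)[0, 1]
    validity = np.corrcoef(year1, true_effect)[0, 1]
    label = "adjusted model" if adjust else "naive model   "
    print(f"{label} year-to-year r = {stability:.2f}   r with true effect = {validity:.2f}")
```

In this setup the naive model looks more “consistent” from year to year precisely because it keeps reproducing the same bias, while the adjusted model is less stable but tracks the true teacher effect more closely. Favor consistency alone and you reward the worse model.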

And we all know that in the current conversations, model consistency trumps model validity. That must change! Above and beyond all of the MAJOR TECHNICAL AND PRACTICAL CONCERNS I’ve raised repeatedly in this blog, there exists little or no incentive, and little or no pressure from researchers (who should know better), for state policy makers or local public school districts to actually try to produce more valid measures of effectiveness. In fact, too many incentives and pressures exist to use bad measures rather than better ones.

NOTE:

The Brookings method for assessing the validity of comprehensive evaluations works best – arguably only works – with a more stable VAM. This means that their system provides an incentive for using a more stable model at the expense of accuracy. As a result, they’ve built into their system – which is supposed to measure the accuracy of evaluations – an incentive for less accurate VAM models. It’s kind of a vicious circle.


Student Test Score Based Measures of Teacher Effectiveness Won’t Improve NJ Schools

Op-Ed from: http://www.northjersey.com

The recent Teacher Effectiveness Task Force report recommended basing teacher evaluation significantly on student test scores. A few weeks earlier, Education Commissioner Cerf recommended that teacher tenure, dismissal and compensation decisions be based largely on student assessment data.

Implicit in these recommendations is that the state and local districts would design a system for linking student assessment data to teachers for purposes of estimating teacher effectiveness. The goal of statistical “teacher effectiveness” measurement systems, including the most common approach called value-added modeling (VAM), is to estimate the extent to which a specific teacher contributes to the learning gains of a group (or groups) of students assigned to that teacher in a given year.

Unfortunately, while this all sounds good, it just doesn’t work, at least not well enough to even begin considering using it for making high stakes decisions about teacher tenure, dismissal or compensation. Here’s a short list (my full list is much longer) of reasons why:

  1. It is not possible to equate the difficulty of moving a group of children 5 points (or rank and percentile positions) at one end of a test scale to moving children 5 points at the other end (a toy illustration follows this list). Yet that is precisely what the proposed evaluations endeavor to accomplish. In such a system, the only fair way to compare one teacher to another would be to ensure that each has a randomly assigned group of children whose initial achievement is spread similarly across the testing scale. Real schools and districts don’t work that way. It is also not possible to compare a 5 point gain in reading to a 5 point gain in math. These limitations undermine the entire proposed system.
  2. Even with the best models and data, teacher ratings are highly inconsistent from year to year, and have very high rates of misclassification. According to one recent major study, there is a 35% chance of identifying an average teacher as poor given one year of data, and a 25% chance given three years. Getting a good rating is a statistical crap shoot.
  3. If we rate the same teacher with the same students, but with two different tests in the same subject, we get very different results. UC Berkeley economist Jesse Rothstein, re-evaluating the findings of the much touted Gates Foundation Measures of Effective Teaching (MET) study, noted that more than 40% of teachers who placed in the bottom quarter on one test (the state test) were in the top half when using the other test (the alternative assessment). That is, teacher ratings based on the state assessment were only slightly better than a coin toss for identifying which teachers did well on the alternative assessment.
  4. No matter how hard statisticians try, and no matter how good the data and statistical model, it is very difficult to separate a teacher’s effect on student learning gains from other classroom effects, like peer effects (the race and poverty composition of the peer group). New Jersey schools are highly segregated, hampering our ability to make valid comparisons across teachers who work in vastly different settings. Statistical models attempt to adjust away these differences, but usually come up short.
  5. Kids learn over the summer too and higher income kids learn more than their lower income peers over the summer. As a result, annual testing data aren’t very useful for measuring teacher effectiveness. Annual (rather than fall-spring) testing data significantly disadvantage teachers serving children whose summer learning lags. Setting aside all of the un-resolvable problems above, this one can be fixed with fall-spring assessments. But it cannot be resolved in any fast-tracked plan involving current New Jersey assessments, which are annual. The task force report irresponsibly ignores this HUGE AND OBVIOUS concern, recommending fast-tracked use of current assessment data.
  6. As noted by the task force, only those teachers responsible for reading and math in grades 3 to 8 could readily be assigned ratings (less than 20% of teachers). Testing everything else is a foolish and expensive endeavor. This means school districts will need separate contracts for separate classes of teachers and will have limited ability to move teachers from one contract type to another (from second to fourth grade). Further, pundits have been arguing that a) we should be using effectiveness measures instead of experience to implement layoffs due to budget cuts, and b) we shouldn’t be laying off core, classroom teachers in grades 3 to 8. But those are the only teachers for whom “effectiveness” measures would be available?
  7. Basing teacher evaluations, tenure decisions and dismissal decisions on scores that may be influenced by which students a teacher serves provides a substantial disincentive for teachers to serve kids with the greatest needs, disruptive kids, or kids with disruptive family lives. Many of these factors are not, and cannot be, captured by variables in the best models. Some have argued that including value-added metrics in teacher evaluation reduces the ability of school administrators to arbitrarily dismiss a teacher. Rather, use of these metrics provides new opportunities to sabotage a teacher’s career through creative student assignment practices.
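
To illustrate the first point with a toy example – an assumed, normally scaled test with a mean of 100 and a standard deviation of 20; no real assessment is implied – the same five scale points represent very different amounts of movement depending on where a student starts.

```python
import math

def percentile(score, mean=100.0, sd=20.0):
    """Percentile rank of a score on a hypothetical, normally scaled test."""
    z = (score - mean) / sd
    return 100 * 0.5 * (1 + math.erf(z / math.sqrt(2)))

for start in (100, 130):   # a student in the middle vs. a student near the top
    jump = percentile(start + 5) - percentile(start)
    print(f"A 5-point gain starting at {start} moves a student {jump:.1f} percentile ranks")
```

A system that treats those two gains as equivalent evidence of teacher effectiveness is comparing apples to oranges, and no re-weighting of an evaluation rubric fixes that.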

In short, we may be able to estimate a statistical model that suggests that teacher effects vary widely across the education system – that teachers matter. But we would be hard pressed to use that model to identify with any degree of certainty which individual teachers are good teachers and which are bad.

Contrary to education reform wisdom, adopting such problematic measures will not make the teaching profession a more desirable career option for America’s best and brightest college graduates. In fact, it will likely make things much worse. Establishing a system in which achieving tenure or getting a raise becomes a roll of the dice – and in which a teacher’s career can be ended by that same roll – is no way to improve the teacher workforce.

Contrary to education reform wisdom, using these metrics as a basis for dismissing teachers will NOT reduce the legal hassles associated with removal of tenured teachers.  As the first rounds of teachers are dismissed by random error of statistical models alone, by manipulation of student assignments, or when larger shares of minority teachers are dismissed largely as a function of the students they serve, there will likely be a new flood of lawsuits like none ever previously experienced. Employment lawyers, sharpen your pencils and round up your statistics experts.

Authors of the task force report might argue that they are putting only 45% of the weight of evaluations on these measures. The rest will include a mix of other objective and subjective measures. The reality of an evaluation that includes a single large, or even significant weight, placed on a single quantified factor is that that specific factor necessarily becomes the tipping point, or trigger mechanism. It may be 45% of the evaluation weight, but it becomes 100% of the decision, because it’s a fixed, clearly defined (though poorly estimated) metric.
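
A simple simulation shows how this happens. The numbers below are my own assumptions for illustration only: a test-based score that spreads teachers out, other measures that cluster near the top (as observation ratings typically do), and a rule that flags the bottom decile of the composite.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000

growth = rng.normal(50, 15, n).clip(0, 100)   # test-based score: widely spread
other = rng.normal(85, 5, n).clip(0, 100)     # everything else: clustered near the top
composite = 0.45 * growth + 0.55 * other

flagged = composite < np.quantile(composite, 0.10)          # bottom decile triggers action
flagged_by_growth = growth < np.quantile(growth, 0.10)      # what the test score alone says

overlap = np.mean(flagged_by_growth[flagged])
print(f"Share of flagged teachers who are flagged by the growth score alone: {overlap:.0%}")
```

Even at 45% of the paper weight, the test-based score does most of the deciding under these assumptions, because it is the component with the spread – and the built-in trigger.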

Self-proclaimed “reformers” make the argument that the present system of teacher evaluation is so bad as to be non-existent. Reformers argue that the current system has a 100% error rate (assuming current evaluations label all teachers as good, when all are actually bad)!

From the “reformer” viewpoint, something is always better than nothing.

Value added is something.

We must do something.

Therefore, we must do value-added.

Reformers also point to studies showing that teachers’ value-added scores are the best predictor (albeit a weak and error-prone predictor) of their future value-added scores – a self-fulfilling prophecy. These arguments are incredibly flimsy.

In response, I often explain that if we lived in a society that walked everywhere, and a new automotive invention came along, but had the tendency to burst into a ball of flames on every third start, I think I’d walk. Now is a time to walk! Some innovations just aren’t ready for broad public adoption – and some may never be. Some, like this one, may not be a very good idea to begin with. That said, improving teacher evaluation is not a simple either/or and now may be a good time to step back from this false dichotomy and discuss more productive alternatives.

Expanded gambling okay in NJ, but only if it involves gambling on teachers’ jobs!

I may be the only one in New Jersey who had a twisted enough view of today’s news stories to pick up on this connection. In a move seemingly irrelevant to my blog, the Governor of New Jersey today vetoed a bill that would have approved online gambling. At the same time, the Governor’s teacher effectiveness task force released its long-awaited report. And it did not disappoint. Well, I guess that’s a matter of expectations. I had very low expectations to begin with – fully expecting a poorly written, ill-conceived rant about how to connect teacher evaluations to test scores – growth scores – and how it is imperative that a large share of teacher evaluation be based on growth scores. And I got all of that and more!!!!!

I have written about this topic on multiple occasions.

For the full series on this topic, see: https://schoolfinance101.wordpress.com/category/race-to-the-top/value-added-teacher-evaluation/

And for my presentation slides on this topic, including summaries of the relevant research, see: https://schoolfinance101.com/wp-content/uploads/2010/10/teacher-evaluation_general.pdf

When it comes to critiquing the Task Force Report, I’m not even sure where to begin. In short, the report proposes the most ill-informed toxic brew of policy recommendations that one can imagine. The centerpiece, of course, is heavy… very heavy reliance on statewide student testing measures yet to be developed… yet to be evaluated for their statistical reliability … or their meaningfulness of any sort (including predictive validity of future student success). As Howard Wainer explains here, even the best available testing measures are not up to the task of identifying more and less effective teachers: http://www.njspotlight.com/ets_video2/

But who cares what the testing and measurement experts think anyway. This is about the kids… and we must fix our dreadful system and do it now… we can’t wait! The children can’t wait!

So then, what does this have to do with the online gambling veto? Well, it struck me as interesting that, on the one hand, the Governor vetoes a bill that would approve online gambling, but the Governor’s Task Force proposes a teacher evaluation plan that would make teachers’ year to year job security and teacher evaluations largely a game of chance. Yes, a roll of the dice. Roll a 6 and you’re fired! Damn hard to get 3 in a row (positive evaluations) to get tenure. Exponentially easier to get 2 in a row (bad evals) and get fired. No online gambling for sure, but gambling on the livelihood of teachers? That’s absolutely fine!

Interestingly, one of the only external sources even cited (outside of the comparably problematic Washington DC IMPACT contract, and think tanky schlock like the New Teacher Project’s “Teacher Evaluation 2.0“) was the Gates Foundation’s Measures of Effective Teaching (MET) project. Of course, the task force report fails to mention that the Gates Foundation MET report does not make a very compelling statistical case that using test scores as a major factor for evaluating teachers is a good idea. Actually, it fails to mention anything substantive about the MET reports at all. I wrote about the MET report here. And economist Jesse Rothstein took a closer look at the Gates MET findings here! Rothstein concluded:

In particular, the correlations between value-added scores on state and alternative assessments are so small that they cast serious doubt on the entire value-added enterprise. The data suggest that more than 20% of teachers in the bottom quarter of the state test math distribution (and more than 30% of those in the bottom quarter for ELA) are in the top half of the alternative assessment distribution. Furthermore, these are “disattenuated” estimates that assume away the impact of measurement error. More than 40% of those whose actually available state exam scores place them in the bottom quarter are in the top half on the alternative assessment.
In other words, teacher evaluations based on observed state test outcomes are only slightly better than coin tosses at identifying teachers whose students perform unusually well or badly on assessments of conceptual understanding.

Yep that’s right. It’s little more than a coin toss or a roll of the dice! Online gambling (personally, I don’t care one way or the other about it), not okay. Gambling on teachers’ livelihoods with statistical error? Absolutely fine. After all, it’s those damn teachers that have sucked the economy dry with their high salaries and gold-plated benefits packages! And after all, it is the only profession in the world where you can do a really crappy job year after year after year… and you’re totally protected, right? Of course it’s that way. Say it loud enough and enough times, over and over again, and it must be true.

Here are a few random thoughts I have about the report:

  • So… as I understand it, they want to base 45% of a teacher’s evaluation on measures that have a 35% chance of misclassifying an average teacher as ineffective – and these are measures that only apply to about 15 to 20% of the teacher workforce? That doesn’t sound very well thought out to me.
  • Forcing reading and math teachers to be evaluated by measures over which they have limited control, and measures that jump around significantly from year to year and disadvantage teachers in more difficult settings isn’t likely to make New Jersey’s best and brightest jump at the chance to teach in Newark, Camden or Jersey City.
  • Even if the current system of teacher evaluation is less than ideal, it doesn’t mean that we should jump to adopt metrics that are as problematic as these. Promoters of these options would have the public believe that it’s either the status quo – which is necessarily bad – or test-score based evaluation – which is obviously good. This is untrue at many levels. First, New Jersey’s status quo is pretty good. Second, New Jersey’s best public and private schools don’t use test scores as a primary or major source of teacher evaluation. Yet somehow, they are still pretty darn good. So, using or not using test scores to hire and fire teachers is not likely the problem nor is it the solution. It’s an absurd false dichotomy.
  • Authors of the report might argue that they are putting only 45% of the weight of evaluations on these measures. The rest will include a mix of other objective and subjective measures. The reality of an evaluation that includes a single large, or even significant weight, placed on a single quantified factor is that that specific factor necessarily becomes the tipping point, or trigger mechanism. It may be 45% of the evaluation weight, but it becomes 100% of the decision, because it’s a fixed, clearly defined (though poorly estimated) metric.

Here’s a quick run-down on some of the issues associated with using student test scores to evaluate teachers:

[from a forthcoming article on legal issues associated with using test scores to evaluate, and dismiss teachers]

Most VAM teacher ratings attempt to predict the influence of the teacher on the student’s end-of-year test score, given the student’s prior test score and descriptive characteristics – for example, whether the student is poor, has a disability, or is limited in her English language proficiency.[1] These statistical controls are designed to account for the differences that teachers face in serving different student populations.  However, there are many problems associated with using VAM to determine whether teachers are effective.  The remainder of this section details many of those problems.
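
For readers who want to see what such a model looks like mechanically, here is a minimal sketch of the generic form – a regression of current scores on prior scores, a few student characteristics, and one indicator per teacher – using simulated data. It is not any particular state’s or vendor’s model, and every value in it is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)
n_students, n_teachers = 3000, 100

# Simulated student records (all values are assumptions for illustration).
teacher = rng.integers(0, n_teachers, n_students)
prior_score = rng.normal(0, 1, n_students)
poverty = rng.binomial(1, 0.4, n_students)
true_teacher_effect = rng.normal(0, 0.15, n_teachers)
current_score = (0.7 * prior_score - 0.2 * poverty
                 + true_teacher_effect[teacher] + rng.normal(0, 0.6, n_students))

# Design matrix: prior score, poverty indicator, and one dummy per teacher.
teacher_dummies = np.zeros((n_students, n_teachers))
teacher_dummies[np.arange(n_students), teacher] = 1.0
X = np.column_stack([prior_score, poverty, teacher_dummies])

# OLS; the coefficients on the teacher dummies are the "value-added" estimates.
coefs, *_ = np.linalg.lstsq(X, current_score, rcond=None)
value_added = coefs[2:]
print(np.corrcoef(value_added, true_teacher_effect)[0, 1])
```

Even in this best case – random assignment and a correctly specified model – the estimates correlate only imperfectly with the true effects; every problem discussed below makes that correlation worse.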

Instability of Teacher Ratings

The assumption in value-added modeling for estimating teacher “effectiveness” is that if one uses data on enough students passing through a given teacher each year, one can generate a stable estimate of the contribution of that teacher to those children’s achievement gains.[2] However, this assumption is problematic because of the concept of inter-temporal instability: that is, the same teacher is highly likely to get a very different value-added rating from one year to the next.  Tim Sass notes that the year-to-year correlation for a teacher’s value-added rating is only about 0.2 or 0.3 – at best a very modest correlation.  Sass also notes that:

About one quarter to one third of the teachers in the bottom and top quintiles stay in the same quintile from one year to the next while roughly 10 to 15 percent of teachers move all the way from the bottom quintile to the top and an equal proportion fall from the top quintile to the lowest quintile in the next year.[3]

Further, most of the change or difference in a teacher’s value-added rating from one year to the next is unexplained – not attributable to differences in observed student characteristics, peer characteristics or school characteristics.[4]

Similarly, preliminary analyses from the Measures of Effective Teaching Project, funded by the Bill and Melinda Gates Foundation, found:

When the between-section or between-year correlation in teacher value-added is below .5, the implication is that more than half of the observed variation is due to transitory effects rather than stable differences between teachers. That is the case for all of the measures of value-added we calculated.[5]

While some statistical corrections and multi-year analysis might help, it is hard to guarantee or even be reasonably sure that a teacher would not be dismissed simply as a function of unexplainable low performance for two or three years in a row.
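
To see what a year-to-year correlation of 0.2 or 0.3 means in practice, here is a small simulation, assuming (purely for illustration) that ratings in the two years are jointly normal with the stated correlation:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
r = 0.3   # Sass's more generous year-to-year correlation

year1 = rng.normal(size=n)
year2 = r * year1 + np.sqrt(1 - r**2) * rng.normal(size=n)

edges1 = np.quantile(year1, [0.2, 0.4, 0.6, 0.8])
edges2 = np.quantile(year2, [0.2, 0.4, 0.6, 0.8])
q1 = np.digitize(year1, edges1)   # quintile (0 = bottom, 4 = top) in year 1
q2 = np.digitize(year2, edges2)   # quintile in year 2

bottom = q1 == 0
print(f"bottom-quintile teachers still in the bottom quintile next year: {np.mean(q2[bottom] == 0):.0%}")
print(f"bottom-quintile teachers in the TOP quintile next year:          {np.mean(q2[bottom] == 4):.0%}")
```

Under this simplifying assumption, the shares that come out of the exercise sit in the same neighborhood as the figures Sass reports: a rating in the bottom quintile this year tells you surprisingly little about where the same teacher will land next year.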

Classification & Model Prediction Error

Another technical problem of VAM teacher evaluation systems is classification and/or model prediction error. Researchers at Mathematica Policy Research, in a study funded by the U.S. Department of Education, carried out a series of statistical tests and reviews of existing studies to determine the identification “error” rates for ineffective teachers when using typical value-added modeling methods.[6] The report found:

Type I and II error rates for comparing a teacher’s performance to the average are likely to be about 25 percent with three years of data and 35 percent with one year of data. Corresponding error rates for overall false positive and negative errors are 10 and 20 percent, respectively.[7]

Type I error refers to the probability that, based on a certain number of years of data, the model will find that a truly average teacher performed significantly worse than average.[8] That means there is about a 25% chance (using three years of data) or a 35% chance (using one year of data) that a teacher who is “average” would be identified as “significantly worse than average” and potentially be fired. Of particular concern is the likelihood that a “good teacher” is falsely identified as a “bad” teacher – in this case, a “false positive” identification. According to the study, this occurs one in ten times (given three years of data) and two in ten times (given only one year of data).

Same Teachers, Different Tests, Different Results

Determining whether a teacher is effective may depend on the assessment used for a specific subject area, and not simply on whether that teacher is generally effective in that subject area. For example, Houston uses two standardized tests each year to measure student achievement: the state Texas Assessment of Knowledge and Skills (TAKS) and the nationally-normed Stanford Achievement Test.[9] Corcoran and colleagues used Houston Independent School District (HISD) data from each test to calculate separate value-added measures for fourth and fifth grade teachers.[10] The authors found that a teacher’s value-added can vary considerably depending on which test is used.[11] Specifically:

among those who ranked in the top category (5) on the TAKS reading test, more than 17 percent ranked among the lowest two categories on the Stanford test.  Similarly, more than 15 percent of the lowest value-added teachers on the TAKS were in the highest two categories on the Stanford.[12]

Similar issues apply to tests on different scales – tests with different possible ranges of scores, or with different statistical modification or treatment of raw scores: for example, whether student test scores are first converted into standardized scores relative to an average score, or expressed on some other scale such as percentile rank (which is done in some cases but would generally be considered inappropriate). For instance, if a teacher is typically assigned higher performing students and the scaling of a test is such that it becomes very difficult for students with high starting scores to improve over time, that teacher will be at a disadvantage. But another test of the same content, or simply with different scaling of scores (so that smaller gains are adjusted to reflect the relative difficulty of achieving those gains), may produce an entirely different rating for that teacher.
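
A small simulation illustrates why this happens even when both tests partly measure the same thing. The parameters are assumptions made for illustration and are not calibrated to the Houston data:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50_000

# One shared "true" component per teacher, plus test-specific content and noise.
shared = rng.normal(0, 1, n)
rating_test_a = 0.6 * shared + 0.4 * rng.normal(0, 1, n) + rng.normal(0, 1, n)
rating_test_b = 0.6 * shared + 0.4 * rng.normal(0, 1, n) + rng.normal(0, 1, n)

top_on_a = rating_test_a >= np.quantile(rating_test_a, 0.80)        # top category on test A
bottom_two_on_b = rating_test_b < np.quantile(rating_test_b, 0.40)  # bottom two categories on test B
print(f"top category on test A but bottom two categories on test B: "
      f"{np.mean(bottom_two_on_b[top_on_a]):.0%}")
```

When the two ratings share only part of their variance, a teacher’s category on one test is a weak guide to their category on the other – the qualitative pattern Corcoran and colleagues found.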

Difficulty in Isolating Any One Teacher’s Influence on Student Achievement

It is difficult if not entirely infeasible to isolate one specific teacher’s contribution to student’s learning, leading to situations where a teacher might be identified as a bad teacher simply because her colleagues are ineffective. This is called a spillover effect. [13] For students who have more than one teacher across subjects (and/or teaching aides/assistants), each teacher’s value-added measures may be influenced by the other teachers serving the same students. Kirabo Jackson and Elias Bruegmann, for example, found in a study of North Carolina teachers that students perform better, on average, when their teachers have more effective colleagues.[14] Cory Koedel found that reading achievement in high school is influenced by both English and math teachers.[15] These spillover effects mean that teachers assigned to weaker teams of teachers might be disadvantaged, through no fault of their own.

Non-Random Assignment of Students Across Teachers, Schools And Districts

The fact that teacher value-added ratings cannot be disentangled from patterns of student assignment across schools and districts leads to the likelihood that teachers serving larger shares of one population versus another are more likely to be identified as effective or ineffective, through no fault of their own. Non-random assignment, like inter-temporal instability, is a seemingly complicated statistical issue. The non-random assignment problem relates not to the error in the measurement (test scores) but to the complications of applying a statistical model to real world conditions. The fairest comparisons between teachers would occur in a case where teachers could be randomly assigned to comparable classrooms with comparable resources, and where exactly the same number of students could be randomly assigned to those teachers, so that each teacher would have the same number of children, and children of similar family backgrounds, prior performance, personal motivation and other characteristics. Obviously, this does not happen in reality.

Students are not sorted randomly across schools, across districts, or across teachers within schools. And teachers are not randomly assigned across school settings, with equal resources. It is certainly likely that one fourth grade teacher in a school is assigned more difficult students year after year than another. This may occur by choice of that teacher – a desire to try to help out these students – or through other factors, including the desire of a principal to make a teacher’s work more difficult. While most value-added models contain some crude indicators of poverty status, language proficiency and disability classification, few if any sufficiently mitigate the bias that occurs from non-random student assignment. That bias arises from such apparently subtle forces as the influence of peers on one another, and the inability of value-added models to sufficiently isolate the teacher effect from the peer effect, both of which occur at the same level of the system – the classroom.[16]

Jesse Rothstein notes that “[r]esults indicate that even the best feasible value-added models may be substantially biased, with the magnitude of the bias depending on the amount of information available for use in classroom assignments.”[17]

Value-added modeling has more recently been at the center of public debate after the Los Angeles Times contracted RAND Corporation economist Richard Buddin to estimate value-added scores for Los Angeles teachers, and the Times reporters then posted the names of individual teachers classified as effective or ineffective on their web site.[18] The model used by the Los Angeles Times, estimated by Buddin, was a fairly typical one, and the technical documentation proved rich with evidence of the types of model bias described by Rothstein and others. For example:

  • 97% of children in the lowest performing schools are poor, and 55% in higher performing schools are poor;
  • The number of gifted children a teacher has affects their value-added estimate positively – the more gifted children the teacher has, the higher the effectiveness rating;
  • Black teachers have lower value-added scores for both English Language Arts and Math than white teachers, and these are some of the largest negative correlates with effectiveness ratings provided in the report – especially for MATH;
  • Having more black students in your class is negatively associated with teacher’s value-added scores, though this effect is relatively small;
  • Asian teachers have higher value-added scores than white teachers for Math, with the positive association between being Asian and math teaching effectiveness being as strong as the negative association for black teachers.

Some of these associations above are explained by related research by Hanushek and Rivkin, which shows measurable effects of the racial composition of peer groups on individual students’ outcomes and explains the difficulty in distilling these effects from teacher effects.[19] Note that it is also likely that the associations with teacher race above are entangled with student race, where black teachers are more likely to be in classrooms with larger shares of black students.[20]

All value-added comparisons are relative. They can be used for comparing one teacher to another in a school, teachers in one school to teachers in another school, or teachers in one district to those in other districts. The reference group becomes critically important when determining the potential for disparate impact of negative teacher ratings resulting from model bias. For example, if one were to employ a district-wide performance-based dismissal (or retention) policy in Los Angeles using the Los Angeles Times model, one would likely lay off disproportionate numbers of teachers in poor schools and black teachers of black students, while disproportionately retaining Asian teachers. But if one adopted the layoff policy relative to within-school rather than district-wide norms, then because children are largely segregated by neighborhoods and schools, the disparate effect might be lessened. The policy may be neither fairer nor better in terms of educational improvement, but racially disparate dismissals might be reduced.
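
The reference-group point can be illustrated with a simulation of two segregated schools and a value-added measure that is biased against teachers in the higher-poverty school. The size of the bias, and everything else here, is assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(9)
n_per_school = 500

true_effect = rng.normal(0, 1, 2 * n_per_school)
school = np.repeat([0, 1], n_per_school)              # school 1 serves the poorer neighborhood
bias = np.where(school == 1, -0.8, 0.0)               # assumed model bias against its teachers
rating = true_effect + bias + rng.normal(0, 1, 2 * n_per_school)

# District-wide rule: dismiss the bottom 10% of ratings across the whole district.
district_cut = rating < np.quantile(rating, 0.10)
# Within-school rule: dismiss the bottom 10% of ratings within each school.
school_cut = np.zeros_like(district_cut)
for s in (0, 1):
    in_school = school == s
    school_cut[in_school] = rating[in_school] < np.quantile(rating[in_school], 0.10)

for label, cut in (("district-wide", district_cut), ("within-school", school_cut)):
    share_from_poor_school = np.mean(school[cut] == 1)
    print(f"{label}: share of dismissals falling on the poorer school = {share_from_poor_school:.0%}")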

Finally, because teacher value-added ratings cannot be disentangled entirely from patterns of student assignment across teachers within schools, principals may manipulate assignment of difficult and/or unmotivated students in order to compromise a teacher’s value-added ratings, increasing the principal’s ability to dismiss that teacher. This concern might be mitigated by requirements for lottery-based student assignment and teacher assignments. However, such requirements could create cumbersome student assignment processes and processes that interfere with achieving the best teacher match for each child.

Whereas the problems of rating instability and error rates above are issues of “statistical error,” the problem of non-random assignment is one of “model bias.” Many value-added ratings of “teacher effectiveness” suffer from both large degrees of error and severe levels of model bias. The two are cumulative problems, not overlapping ones. In fact, the extent of error in the measures may partially mask the full extent of bias. In other words, we might not even know how prodigious the bias is.

In The Best Possible Case, About 20% of Contracted Certified Teachers in a District Might Have Value-Added Scores

Setting aside the substantial concerns above over “measurement error” and “model bias,” which severely compromise the reliability and validity of value-added ratings of teachers, in most public school districts fewer than 20% of certified teaching staff could be assigned any type of value-added assessment score. Existing standardized assessments typically focus on reading or language arts, and math performance between grades three and eight. Because baseline scores are required – and ideally multiple prior scores, to limit model bias – it becomes difficult to fairly rate third grade teachers. By middle school or junior high, students are interacting with many more teachers and it becomes more difficult to assign value-added scores to any one teacher. When considering the various support staff roles, specialist teachers, and teachers of elective and/or advanced secondary courses, value-added measures are generally applicable to only a small minority of teachers in any school district (<20%). Thus, in order to make value-added measures a defined element of teacher evaluation in teacher contracts, one must have separately negotiated contracts for those teachers to whom these measures apply – and this is administratively cumbersome and potentially expensive for districts in these difficult economic times.

Washington DC’s IMPACT teacher evaluation system is one example that differentiates classes of teachers by whether or not they have value-added measures.[21] While contractually feasible, this approach creates separate classes of teachers in schools and may have unintended consequences for educational practices, including increased tension when non-value-added-rated teachers wish to pull students of value-added-rated teachers out of class for special projects or activities.


[1] Value-added ratings of teachers are generally not based on a simple subtraction of each student’s spring test score and previous fall test score for a specific subject. Such an approach would clearly disadvantage teachers who happen to serve less motivated groups of students, or students with more difficult home lives and/or fewer family resources to support their academic progress through the year. It would be even more problematic to simply use the spring test score from the prior year as the baseline score, and the spring of the current year to evaluate the current year teacher, because the teacher had little control over any learning gain or loss that may have occurred during the prior summer. And these gains and losses tend to be different for students from higher and lower socio-economic status.  See Karl L. Alexander et al., Schools, Achievement, and Inequality: A Seasonal Perspective, 23 Educ. Eval. and Pol’y Analysis 171 (2001). Recent findings from a study funded by the Bill and Melinda Gates Foundation confirm these “seasonal” effects: “The norm sample results imply that students improve their reading comprehension scores just as much (or more) between April and October as between October and April in the following grade. Scores may be rising as kids mature and get more practice outside of school.” Bill & Melinda Gates Foundation, Learning about Teaching: Initial Findings from the Measures of Effective Teaching Project 8, available at http://www.metproject.org/downloads/Preliminary_Findings-Research_Paper.pdf.

[2] Tim R. Sass, The Stability of Value-Added Measures of Teacher Quality and Implications for Teacher Compensation Policy, Urban Institute (2008), available at http://www.urban.org/UploadedPDF/1001266_stabilityofvalue.pdf. See also Daniel F. McCaffrey et al., The Intertemporal Variability of Teacher Effect Estimates, 4 Educ. Fin. & Pol’y, 572 (2009).

[3] Sass, supra note 2.

[4] Id.

[5] Bill & Melinda Gates Foundation, supra note 1.

[6] Peter Z. Schochet & Hanley S. Chiang, Error Rates in Measuring Teacher and School Performance Based on Student Test Score Gains (NCEE 2010-4004). Washington, DC: National Center for Education Evaluation and Regional Assistance, Institute of Education Sciences, U.S. Department of Education (2010).

[7] Id.

[8] Id. at 12.

[9] Sean P. Corcoran, Jennifer L. Jennings & Andrew A. Beveridge, Teacher Effectiveness on High- and Low-Stakes Tests, Paper presented at the Institute for Research on Poverty summer workshop, Madison, WI (2010).

[10] Id.

[11] Id.

[12] Id.

[13] Cory Koedel, An Empirical Analysis of Teacher Spillover Effects in Secondary School, 28 Econ. of Educ. Rev. 682 (2009).

[14] C. Kirabo Jackson & Elias Bruegmann, Teaching Students and Teaching Each Other: The Importance of Peer Learning for Teachers, 1 Am. Econ. J.: Applied Econ. 85 (2009).

[15] Koedel, supra note 13.

[16] There exist at least two different approaches to control for peer group composition. One approach, used by Caroline Hoxby and Gretchen Weingarth, involves constructing measures of the average entry level of performance for all other students in the class. C. Hoxby & G. Weingarth, Taking Race Out of the Equation: School Reassignment and the Structure of Peer Effects, available at http://www.hks.harvard.edu/inequality/Seminar/Papers/Hoxby06.pdf. Another involves constructing measures of the average racial and socioeconomic characteristics of classmates, as done by Eric Hanushek and Steven Rivkin. E. Hanushek & S. Rivkin, School Quality and the Black-White Achievement Gap, available at http://www.nber.org/papers/w12651.pdf?new_window=1.

[17] Jesse Rothstein, Teacher Quality in Educational Production: Tracking, Decay, and Student Achievement, 25 Q. J. Econ. (2008). See also Jesse Rothstein, Student Sorting and Bias in Value Added Estimation: Selection on Observables and Unobservables, available at http://gsppi.berkeley.edu/faculty/jrothstein/published/rothstein_vam2.pdf. Many advocates of value-added approaches point to a piece by Thomas Kane and Douglas Staiger as downplaying Rothstein’s concerns. Thomas Kane & Douglas Staiger, Estimating Teacher Impacts on Student Achievement: An Experimental Evaluation, available at http://www.nber.org/papers/w14607.pdf?new_window=1. However, Eric Hanushek and Steve Rivkin explain, regarding the Kane and Staiger analysis: “the possible uniqueness of the sample and the limitations of the specification test suggest care in interpretation of the results.” Eric A. Hanushek & Steven G. Rivkin, Generalizations about Using Value-Added Measures of Teacher Quality 8, available at http://www.utdallas.edu/research/tsp-erc/pdf/jrnl_hanushek_rivkin_2010_teacher_quality.pdf.

[18] Richard Buddin, How Effective Are Los Angeles Elementary Teachers and Schools?, available at http://www.latimes.com/media/acrobat/2010-08/55538493.pdf.

[19] Eric Hanushek & Steve Rivkin, School Quality and the Black-White Achievement Gap, Educ. Working Paper Archive, Univ. of Ark., Dep’t of Educ. Reform (2007).

[20] Charles T. Clotfelter et al., Who Teaches Whom? Race and the Distribution of Novice Teachers, 24 Econ. of Educ. Rev. 377 (2005).

Dice Rolling Activity for New Jersey Teachers

Yesterday, New Jersey’s Education Commissioner announced his plans for how teachers should be evaluated, what teachers should have to do to achieve tenure, and on what basis a teacher could be relieved of tenure. In short, Commissioner Cerf borrowed from the Colorado teacher tenure and evaluation plan which includes a few key elements (Colorado version outlined at end of post):

1. Evaluations based 50% on teacher effectiveness ratings generated with student assessment data – or value-added modeling (though not stated in those specific terms)

2. Teachers must receive 3 positive evaluations in a row in order to achieve tenure.

3. Teachers can lose tenure status or be placed at risk of losing tenure status if they receive 2 negative evaluations in a row.

This post is intended to illustrate just how ill-conceived – how poorly thought out – the above parameters are. This all seems logical on its face to anyone who knows little or nothing about the fallibility of measuring teacher effectiveness, or about probability and statistics more generally. Of course we only want to tenure “good” teachers, and we want a simple mechanism to get rid of bad ones. If only it were that easy to set up simple parameters of goodness and badness and put such a system into place. Well, it’s not.

Here’s an activity for teachers to try today. It may take more than a day to get it done.

MATERIALS: DICE (well, really just one Die)! That’s all you need!

STEP 1: Roll Die. Record result. Roll again. Record result. Keep rolling until you get the same number 3 times in a row. STOP. Write down the total number of rolls.

STEP 2: Roll Die. Record result. Roll again. Record result. Keep rolling until you get the same number 2 times in a row. STOP. Write down the total number of rolls.

Post your results in the comments section below.

Now, what the heck does this all mean? Well, as I’ve written on multiple occasions, the year to year instability of teacher ratings based on student assessment scores is huge. Alternatively stated, the relationship between a teacher’s rating in one year and the next is pretty weak. The likelihood of getting the same rating two straight years is pretty low, and three straight years is very low. The year to year correlation, whether we are talking about the recent Gates/Kane studies or previous work, is about .2 to .3. There’s about a 35% chance that an average teacher in any year is misidentified as poor, given one year of data, and a 25% chance given three years of data. That’s a very high error rate and a very low year-to-year relationship. This is noise. Error. Teachers – this is not something over which you have control! Teachers have little control over whether they can get 3 good years in a row. AND IN THIS CASE, I’M TALKING ONLY ABOUT THE NOISE IN THE DATA, NOT THE BIAS RESULTING FROM WHICH STUDENTS YOU HAVE!

What does this mean for teachers being tenured and de-tenured under the above parameters? Given the random error – instability alone – it could take quite a long time, a damn long time, for any teacher to actually string together 3 good years of value-added ratings. And even if one does, we can’t be that confident that he/she is really a good teacher. The dice rolling activity above may actually provide a reasonable estimate of how long it would take a teacher to get tenure (depending on how high or low teacher ratings have to be to achieve or lose tenure). In that case, you’ve got a 1/6 chance with each roll that you get the same number you got on the previous roll. Of course, getting the same number as your first roll two more times is a much lower probability than getting that number only one more time. You can play it more conservatively by just seeing how long it takes to get 3 rolls in a row where you get a 4, 5 or 6 (above average rating), and then how long it takes to get only two in a row of a 1, 2, or 3.

What does that mean? That means that it could take a damn long time to string together the ratings to get tenure, and not very long to be on the chopping block for losing it. Try the activity. Report your results below.

Each roll above is one year of experience. How many rolls did it take you to get tenure? And how long to lose it?

Now, I’ve actually given you a break here, because I’ve assumed that when you got the first of three in a row, the number you got was equivalent to a “good” teacher rating. It might have been a bad rating, or just an average one. So, when you got three in a row, those three in a row might get you fired instead of tenured. So, let’s assume a 5 or a 6 represents a good rating. Try the exercise again and see how long it takes to get three 5s or three 6s in a row. (Or increase your odds of either success or failure by lumping together any 5 or 6 as successful and any 1 or 2 as unsuccessful, or counting any roll of 1-3 as unsuccessful and any roll of 4-6 as successful.)

Of course, this change has to work both ways too. See how long it takes to get two 1s or two 2s in a row, assuming those represent bad ratings.
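For teachers (or anyone else) who would rather let a computer do the rolling, here is a minimal Monte Carlo sketch of the activity in Python. This is my own illustration of the dice analogy, not part of the original exercise; it simulates the basic version (same number three or two times in a row) and the stricter version where only a 5 or 6 counts as a “good” year and only a 1 or 2 counts as a “bad” year.

```python
import random

def rolls_until_run(is_hit, run_length, rng):
    """Roll a die until `run_length` consecutive rolls satisfy is_hit(); return total rolls."""
    streak, total = 0, 0
    while streak < run_length:
        roll = rng.randint(1, 6)
        total += 1
        streak = streak + 1 if is_hit(roll) else 0
    return total

def rolls_until_same_number_run(run_length, rng):
    """Roll until the same face comes up `run_length` times in a row; return total rolls."""
    last, streak, total = None, 0, 0
    while streak < run_length:
        roll = rng.randint(1, 6)
        total += 1
        streak = streak + 1 if roll == last else 1
        last = roll
    return total

def average(trials, simulate):
    return sum(simulate() for _ in range(trials)) / trials

rng = random.Random(101)
TRIALS = 20_000

print("Avg rolls (years) until the same number comes up 3x in a row:",
      round(average(TRIALS, lambda: rolls_until_same_number_run(3, rng)), 1))
print("Avg rolls (years) until the same number comes up 2x in a row:",
      round(average(TRIALS, lambda: rolls_until_same_number_run(2, rng)), 1))
print("Avg rolls until three 'good' years (5 or 6) in a row:",
      round(average(TRIALS, lambda: rolls_until_run(lambda r: r >= 5, 3, rng)), 1))
print("Avg rolls until two 'bad' years (1 or 2) in a row:",
      round(average(TRIALS, lambda: rolls_until_run(lambda r: r <= 2, 2, rng)), 1))
```

Under these rules, the “path to tenure” (three good years in a row, counting a 5 or 6 as good) should average somewhere around 39 rolls, while two bad years in a row shows up after roughly a dozen – which is exactly the asymmetry this post is pointing to.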

Now, defenders of this approach will likely argue that they are putting only 50% of the weight of evaluations on these measures. The rest will include a mix of other objective and subjective measures. But the reality of an evaluation that places a single large – or even merely significant – weight on one quantified factor is that this factor necessarily becomes the tipping point, or trigger mechanism. It may be 50% of the evaluation weight, but it becomes 100% of the decision, because it’s a fixed, clearly defined (though poorly estimated) metric.

In short, based on the instability of measures alone, the average time to tenure will be quite long, and highly unpredictable. And, those who actually get tenure may not be much more effective – or any more effective at all – than those who don’t. It’s a crap shoot. Literally!

Then, losing tenure will be pretty easy… also a crap shoot… but your odds of losing are much greater than your odds were of winning.

And who’s going to be lining up for these jobs?

Summary of research on “intertemporal instability” and “error rates”

The assumption in value-added modeling for estimating teacher “effectiveness” is that if one uses data on enough students passing through a given teacher each year, one can generate a stable estimate of the contribution of that teacher to those children’s achievement gains.[1] However, this assumption is problematic because of the concept of inter-temporal instability: that is, the same teacher is highly likely to get a very different value-added rating from one year to the next.  Tim Sass notes that the year-to-year correlation for a teacher’s value-added rating is only about 0.2 or 0.3 – at best a very modest correlation.  Sass also notes that:

About one quarter to one third of the teachers in the bottom and top quintiles stay in the same quintile from one year to the next while roughly 10 to 15 percent of teachers move all the way from the bottom quintile to the top and an equal proportion fall from the top quintile to the lowest quintile in the next year.[2]

Further, most of the change or difference in a teacher’s value-added rating from one year to the next is unexplainable – not attributable to differences in observed student characteristics, peer characteristics or school characteristics.[3]

Similarly, preliminary analyses from the Measures of Effective Teaching Project, funded by the Bill and Melinda Gates Foundation found:

When the between-section or between-year correlation in teacher value-added is below .5, the implication is that more than half of the observed variation is due to transitory effects rather than stable differences between teachers. That is the case for all of the measures of value-added we calculated.[4]

While some statistical corrections and multi-year analysis might help, it is hard to guarantee or even be reasonably sure that a teacher would not be dismissed simply as a function of unexplainable low performance for two or three years in a row.
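To make this concrete, here is a small simulation I put together – an illustration under assumed parameters (a year-to-year correlation of 0.3, a “bad rating” defined as the bottom 20% of all ratings), not a re-analysis of any of the studies cited here. It asks how often a truly average teacher gets flagged anyway, purely because of transitory noise.

```python
import random

# Illustrative assumptions (not estimates from the cited studies):
R = 0.3           # assumed year-to-year correlation in observed value-added ratings
YEARS = 10        # length of the simulated career window
CUTOFF = -0.8416  # 20th percentile of a standard normal: ratings below this are "bottom quintile"
TEACHERS = 100_000

rng = random.Random(7)

def simulate_career(true_effect):
    """Observed rating = stable component + transitory noise, scaled so that the
    year-to-year correlation equals R and the overall rating variance is 1."""
    return [true_effect * R ** 0.5 + rng.gauss(0, (1 - R) ** 0.5) for _ in range(YEARS)]

def has_run(flags, length):
    run = 0
    for flagged in flags:
        run = run + 1 if flagged else 0
        if run >= length:
            return True
    return False

single_year = two_in_a_row = three_in_a_row = 0
for _ in range(TEACHERS):
    # A truly average teacher: stable component exactly at the mean (zero).
    flags = [score < CUTOFF for score in simulate_career(0.0)]
    single_year += sum(flags)
    two_in_a_row += has_run(flags, 2)
    three_in_a_row += has_run(flags, 3)

print("Chance an average teacher lands in the bottom quintile in any given year:",
      round(single_year / (TEACHERS * YEARS), 3))
print(f"Chance of at least one 2-year bottom-quintile streak over {YEARS} years:",
      round(two_in_a_row / TEACHERS, 3))
print(f"Chance of at least one 3-year bottom-quintile streak over {YEARS} years:",
      round(three_in_a_row / TEACHERS, 3))
```

Under these assumed parameters, even a dead-average teacher lands in the bottom quintile roughly 15 percent of the time in any single year, and has a non-trivial chance of stringing together two such years over a decade – without anything about the teacher changing at all.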

Classification & Model Prediction Error

Another technical problem of VAM teacher evaluation systems is classification and/or model prediction error.  Researchers at Mathematica Policy Research, in a study funded by the U.S. Department of Education, carried out a series of statistical tests and reviews of existing studies to determine the identification “error” rates for ineffective teachers when using typical value-added modeling methods.[5] The report found:

Type I and II error rates for comparing a teacher’s performance to the average are likely to be about 25 percent with three years of data and 35 percent with one year of data. Corresponding error rates for overall false positive and negative errors are 10 and 20 percent, respectively.[6]

Type I error refers to the probability that, based on a certain number of years of data, the model will find that a truly average teacher performed significantly worse than average.[7] That means there is about a 25% chance (using three years of data) or a 35% chance (using one year of data) that a teacher who is “average” would be identified as “significantly worse than average” and potentially be fired.  Of particular concern is the likelihood that a “good teacher” is falsely identified as a “bad” teacher, in this case a “false positive” identification. According to the study, this occurs one in ten times (given three years of data) and two in ten times (given only one year of data).

Same Teachers, Different Tests, Different Results

A teacher’s measured effectiveness may vary depending on the assessment used for a specific subject area, rather than reflecting whether that teacher is generally effective in that subject area.  For example, Houston uses two standardized tests each year to measure student achievement: the state Texas Assessment of Knowledge and Skills (TAKS) and the nationally-normed Stanford Achievement Test.[8] Corcoran and colleagues used Houston Independent School District (HISD) data from each test to calculate separate value-added measures for fourth and fifth grade teachers.[9] The authors found that a teacher’s value-added can vary considerably depending on which test is used.[10] Specifically:

among those who ranked in the top category (5) on the TAKS reading test, more than 17 percent ranked among the lowest two categories on the Stanford test.  Similarly, more than 15 percent of the lowest value-added teachers on the TAKS were in the highest two categories on the Stanford.[11]

Similar issues apply to tests on different scales – different possible ranges of scores, or different statistical modification or treatment of raw scores; for example, whether student test scores are first converted into standardized scores relative to an average score, or expressed on some other scale such as percentile rank (which is done in some cases but would generally be considered inappropriate).  For instance, if a teacher is typically assigned higher performing students and the scaling of a test is such that it becomes very difficult for students with high starting scores to improve over time, that teacher will be at a disadvantage. But another test of the same content, or simply with different scaling of scores (so that smaller gains are adjusted to reflect the relative difficulty of achieving those gains), may produce an entirely different rating for that teacher.
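As a rough illustration of the Corcoran-style pattern quoted above, here is a sketch that simulates two tests measuring exactly the same underlying teacher effectiveness, with independent noise. The between-test correlation of 0.5 is an assumption chosen for illustration, not the actual Houston estimate.

```python
import bisect
import random

rng = random.Random(11)
N = 100_000
RHO = 0.5  # assumed correlation between value-added ratings from the two tests (illustrative)

# Each rating = shared "true effectiveness" signal + test-specific noise, scaled so that
# corr(test A, test B) = RHO and each rating has unit variance.
teachers = []
for _ in range(N):
    true_eff = rng.gauss(0, 1)
    a = RHO ** 0.5 * true_eff + (1 - RHO) ** 0.5 * rng.gauss(0, 1)
    b = RHO ** 0.5 * true_eff + (1 - RHO) ** 0.5 * rng.gauss(0, 1)
    teachers.append((a, b))

a_sorted = sorted(a for a, _ in teachers)
b_sorted = sorted(b for _, b in teachers)

def quintile(value, sorted_values):
    """Return 1 (lowest) through 5 (highest) based on position in the sorted distribution."""
    rank = bisect.bisect_left(sorted_values, value)
    return min(5, rank * 5 // len(sorted_values) + 1)

top_on_a = [(a, b) for a, b in teachers if quintile(a, a_sorted) == 5]
bottom_two_on_b = sum(1 for _, b in top_on_a if quintile(b, b_sorted) <= 2)

print("Share of teachers in the top quintile on test A who land in the bottom two",
      "quintiles on test B:", round(bottom_two_on_b / len(top_on_a), 3))
```

Even though both simulated tests measure exactly the same underlying effectiveness, a meaningful share of “top” teachers on one test end up in the bottom two categories on the other – the same flavor of result Corcoran and colleagues report for Houston.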

Brief Description of Colorado Model

Colorado, Louisiana, and Tennessee have proposed teacher evaluation systems that will require 50% or more of the evaluations to be based on their students’ academic growth.  This section summarizes the evaluation systems in these states as well as the procedural protections that are provided for teachers.

Colorado’s statute creates a state council for educator effectiveness that advises the state board of education.[12] A major goal of this council is to aid in the creation of teacher evaluation systems ensuring that “every teacher is evaluated using multiple fair, transparent, timely, rigorous, and valid methods.”[13] Considerations of student academic growth must comprise at least 50% of each evaluation.[14] Quality measures for teachers must include “measures of student longitudinal academic growth” such as “interim assessments results or evidence of student work, provided that all are rigorous and comparable across classrooms and aligned with state model content standards and performance standards.”[15] These quality standards must take diverse factors into account, including “special education, student mobility, and classrooms with a student population in which ninety-five percent meet the definition of high-risk student.”[16]

Colorado’s statute also calls for school districts to develop appeals procedures.  A teacher or principal who is deemed ineffective must receive written notice, documentation used for making this determination, and identification of deficiency.[17] Further, the school district must ensure that a tenured teacher who disagrees with this designation has “an opportunity to appeal that rating, in accordance with a fair and transparent process, where applicable, through collective bargaining.”[18] If no collective bargaining agreement is in place, then the teacher may request a review “by a mutually agreed-upon third party.”[19] The school district or board for cooperative services must develop a remediation plan to correct these deficiencies, which will include professional development opportunities that are intended to help the teacher achieve an effective rating in her next evaluation.[20] The teacher or principal must receive a reasonable amount of time to correct such deficiencies.[21]


[1] Tim R. Sass, The Stability of Value-Added Measures of Teacher Quality and Implications for Teacher Compensation Policy, Urban Institute (2008), available at http://www.urban.org/UploadedPDF/1001266_stabilityofvalue.pdf. See also Daniel F. McCaffrey et al., The Intertemporal Variability of Teacher Effect Estimates, 4 Educ. Fin. & Pol’y, 572 (2009).

[2] Sass, supra note 1.

[3] Id.

[4] Bill & Melinda Gates Foundation, Learning about Teaching: Initial Findings from the Measures of Effective Teaching Project, available at http://www.metproject.org/downloads/Preliminary_Findings-Research_Paper.pdf.

[5] Peter Z. Schochet & Hanley S. Chiang, Error Rates in Measuring Teacher and School Performance Based on Student Test Score Gains (NCEE 2010-4004). Washington, DC: National Center for Education Evaluation and Regional Assistance, Institute of Education Sciences, U.S. Department of Education (2010).

[6] Id.

[7] Id. at 12.

[8] Sean P. Corcoran, Jennifer L. Jennings & Andrew A. Beveridge, Teacher Effectiveness on High- and Low-Stakes Tests, Paper presented at the Institute for Research on Poverty summer workshop, Madison, WI (2010).

[9] Id.

[10] Id.

[11] Id.

[12] Colo. Rev. Stat. § 22-9-105.5(2)(a) (2010).

[13] Id. § 22-9-105.5(3)(a).

[14] Id.

[15] Id.

[16] Id.  The statute also calls for the creation of performance evaluation councils that advise school districts.  Id. § 22-9-107(1).  The performance evaluation councils also help school districts develop teacher evaluation systems that must be based on the same measures as that developed by the state council for educator effectiveness.  Id. § 22-9-106(1)(e)(II).  However, the performance evaluation councils lose their authority to set standards once the state board has promulgated rules and the initial phase of statewide implementation has been completed.  Id. § 22-9-106(1)(e)(I).

[17] Id. § 22-9-106(3.5)(b)(II).

[18] Id.

[19] Id.

[20] Id.

[21] Id.

Reformy Disconnect: “Quality Based” RIF?

I addressed this point previously in my post on cost-effectiveness of quality based layoffs, but it was buried deep in the post.

Reformers are increasingly calling for quality based layoffs versus seniority based layoffs, as if it were a simple dichotomy. Sounds like a no-brainer when framed in these distorted terms.

I pointed out in the previous post that if the proposal on the table is really about using value-added teacher effect estimates instead of years of service, then we’re talking about a choice between significantly biased and error prone – largely random – layoffs and layoffs based on years of service. It doesn’t sound as much like a no-brainer when put in those terms, does it? While reformers might argue that seniority based layoffs are still more “error prone” than effectiveness rating layoffs, it is actually quite difficult to determine which, in this case, is more error prone. Existing simulation studies identifying value-added estimates as the less bad option use value-added estimates themselves to determine which option is better. Circular logic (as I previously wrote)?

We’re having this policy conversation about layoffs now because states are choosing (yes choosing, not forced, not by necessity) to slash aid to high need school districts that are highly dependent on state aid, and those districts will likely be implementing reduction in force (RIF) policies. That is, laying off teachers. So, reformy pundits argue that they should be laying off those dead wood teachers – those with bad effectiveness ratings – instead of those young, energetic, highly qualified ones.

So, here are the basic parameters for quality-based RIF:

1. We must mandate test-score based teacher effectiveness ratings as a basis for teacher layoffs.

2. But, we acknowledge that those effectiveness ratings can at best be applied to less than 20% of teachers in our districts, specifically teachers of record – classroom teachers – responsible for teaching math and reading in grades 3 to 8 (4 to 8 if only annual assessment data are available).

3. Districts are going to be faced with significant budget cuts which may require laying off around 5% or somewhat more of their total staff, including teaching staff.

4. But, districts should make efforts to lay off staff (teachers) not responsible for the teaching of core subject areas.

Is anyone else seeing the disconnect here? Yeah, there are many levels of it, some more obvious than others. Let’s take this from the district administrator’s/local board of education perspective:

“Okay, so I’m supposed to use effectiveness measures to decide which teachers to lay off. But, I only have effectiveness measures for those teachers who are supposed to be last on my list for layoffs? Those in core areas. The tested areas. How is that supposed to work?”

Indeed, the point of the various “quality based layoff” simulations that have been presented (the logic of which is problematic) is to lay off teachers in core content areas and rely on improved average quality of core content teachers over time to drive system wide improvements. These simulations rely on heroic assumptions of a long waiting list of higher quality teacher applicants just frothing at the mouth to take those jobs from which they too might be fired within a few years due to random statistical error (or biased estimates) alone.

That aside, reduction in force isn’t about choosing which teachers to dismiss so that you can replace them with better ones. It’s about budgetary crisis mode and reduction of total staffing costs. And reduction in force is not implemented in a synthetic scenario where the choice only exists to lay off either core classroom teachers based on seniority, or core classroom teachers based on effectiveness ratings (the constructed reality of the layoff simulations). Reduction in force is implemented with consideration for the full array of teaching positions that exist in any school or district. “Last in, first out,” or LIFO as reformy types call it, does not mean ranking all teachers systemwide by experience and RIF-ing the newest teachers regardless of what they teach, or the program they are in. Specific programs and positions can be cut, and typically are.

And it is unlikely that local district administrators in high need districts would, or even should, look first to cut deeply into core content area teachers. So, a 5% staffing cut might be accomplished before ever cutting a single teacher for whom an effectiveness rating even exists – or at most very few. So, in the context of RIF, layoffs actually based on effectiveness ratings are a drop in the bucket.

So now I’m confused. Why is this such a pressing policy issue here and now? Does chipping away at seniority based provisions really have much to do with improving the implementation of RIF policies? Perhaps some are using the current economic environment and reformy momentum to achieve other long-run objectives?

Thinking through cost-benefit analysis and layoff policies


If you’re running a school district or a private school and you are deciding on what to keep in your budget and what to discard, you are making trade-offs. You are making trade-offs as to whether you want to spend money on X or on Y, or perhaps a more complicated mix of many options. How you come to your decision depends on a number of factors:

  1. The cost – the total costs of the various ingredients that go into providing X and providing Y. That is, how many people, at what salary and benefits, how much space at what overhead cost (per time used) and how much stuff (materials, supplies and equipment) and at what market prices?
  2. The benefits – the potential dollar return to doing X versus doing Y. For example, how much dollar savings might be generated in operating cost savings from reorganizing our staffing and use of space, if we spend up front (capital expenses) to reorganize and consolidate our elementary schools where they have become significantly imbalanced over time?
  3. The effects – the relative effectiveness of doing X versus doing Y. For example, in the simplest case, if we are choosing between two reading programs, what are the reading achievement gains, or effects, from each program? Or, more pertinent to the current conversation (but far more complex to estimate), what are the relative effects of reducing class size by 2 students compared to keeping a “high quality” teacher?
  4. The utility – The utility of each option refers to the extent that the option in question addresses a preferred outcome goal. Utility is about preferences, or tastes. For example, in the current accountability context, one might be pressured to place greater “utility” on improving math or reading outcomes in grades 3 through 8. If the costs of a preferred program are comparable to the costs of a less preferred program… well… the preferred program wins. There are many ways to determine what’s “preferred,” and more often than not, public input plays a key role, especially in smaller, more affluent suburban school districts. As noted above, federal and state policy have played a significant role in defining utility in the past decade (and arguably, in distorting resource allocation to a point of significant imbalance in resource-constrained districts). A minimal numeric sketch of this framework follows the list.
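Here is that sketch of what applying the four pieces might look like. The program names, costs, effect sizes and the utility weight are entirely made up for illustration; the point is only to show how the pieces fit together, not to evaluate any actual program.

```python
# Hypothetical options: names, costs, and effect sizes are invented for illustration only.
options = {
    "Reading program X": {"cost_per_pupil": 250.0, "effect_sd": 0.10},
    "Reading program Y": {"cost_per_pupil": 400.0, "effect_sd": 0.12},
}

# Utility (item 4): a policy preference expressed crudely as how many dollars per pupil
# one standard deviation of reading gain is "worth" to the district.
DOLLARS_PER_SD = 4000.0

for name, option in options.items():
    # Cost-effectiveness (items 1 and 3): dollars spent per standard deviation of gain.
    cost_effectiveness = option["cost_per_pupil"] / option["effect_sd"]
    # Crude net benefit (item 2): valued gains minus cost, per pupil.
    net_value = option["effect_sd"] * DOLLARS_PER_SD - option["cost_per_pupil"]
    print(f"{name}: ${cost_effectiveness:,.0f} per SD of reading gain; "
          f"net value ${net_value:,.0f} per pupil under the assumed utility weight")
```

Under these made-up numbers, the cheaper program wins on cost-effectiveness even though the pricier one produces the larger raw effect – exactly the kind of trade-off the framework is meant to surface before anyone starts slashing line items.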

This basic cost analysis framework, laid out by Henry Levin back in 1983 and revisited by Levin and McEwan since, should provide the basis for important trade-off decisions in school budgeting and should provide the conceptual basis for arguments like those made by Petrilli and Roza in their recent policy brief. But such a framework is noticeably absent, likely because most of the proposals made by Petrilli and Roza:

  1. are not sufficiently precise to apply such a framework, largely because little is known about the likely outcomes (which may in fact be quite harmful); and
  2. have failed entirely to consider in detail the related costs of proposed options, especially the up-front costs of many of the options (like school reorganization or developing teacher evaluation systems). Note that the full-length book (from which the brief comes) is no more thoughtful or rigorous.

Back of the Napkin Application to Layoff Options

Allow me to provide a back-of-the-napkin example of some of the pieces that might go into determining the savings and/or benefits from the BIG suggestion made by Petrilli and Roza – which is to use quality based layoffs in place of seniority based layoffs when cutting budgets. This one would seem to be a no-brainer. Clearly, if we lay off based on quality, we’ll have better teachers left (greater effectiveness) and we’ll have saved a ton of money or a ton of teachers. That is, if we are determined to lay off X teachers, it will save more money to lay off more senior, more expensive teachers than to lay off novice teachers. However, that’s not the likely what-if scenario. More likely is that we are faced with cutting X% of our staffing budget, so the difference will be in the number of teachers we need to lay off in order to achieve that X%, and the benefit difference might be measured in terms of the change in average class size resulting from laying off teachers by “quality” measures versus laying off teachers by seniority.

Let’s lay out some of the pieces of this cost benefit analysis to show its complexity.

First of all, let’s consider how to evaluate the distribution of the different layoff policies.

Option 1 – Layoffs based on seniority

This one is relatively easy and involves starting from the bottom in terms of experience and laying off as many junior teachers as necessary to achieve 5% savings to our staffing budget.

Option 2 – Layoffs based on quality

Here’s the tricky part. Budget cuts and layoffs are here and now. Most districts do not have in place rigorous teacher evaluation systems that would allow them to make high stakes decisions based on teacher quality metrics. AND, teacher quality metrics, where they do exist (NY, DC, LA), are very problematic. So, if districts rush to immediately implement “quality” based layoffs, they will likely revert to relying heavily on some form of student test score driven teacher effectiveness rating, modeled crudely (like the LA Times model).  Recall that even in better models of this type, we are looking at a 35% chance of identifying an average teacher as “bad” and a 20% chance of identifying a good teacher as “bad.”

In general, the good and bad value-added ratings fall somewhat randomly across the experience distribution. So, for simplicity in this example, I will assume that quality based firings are essentially random. That is, they would result in dismissals randomly distributed across the experience range. Arguably, value-added based layoffs are little more than random, given that a) there is huge year to year error even when comparing on the same test and b) there are huge differences when rating teachers using one test, versus using another.

Testing this out with Newark Public Schools – Elementary Classroom Teachers 2009-10

At the very least, one would think that randomly firing our way to a 5% personnel budget cut would create a huge difference when compared to firing our way to a 5% personnel budget cut by eliminating the newest and cheapest teachers. I’m going to run these numbers using salaries only, for illustrative purposes (one can make many fun arguments about how to parse out fixed vs. variable benefits costs, or deferred benefits vs. short run cost differences for pensions and deferred sick pay, etc.).

We start with just over 1,000 elementary classroom teachers in Newark Public Schools, and assume an average class size of 25 for simplicity. The number of teachers is real (at least according to state data) but the class sizes are artificially simplified. We are also assuming all students and classroom space to be interchangeable.  A 5% cut is about $3.7 million. Let’s assume we’ve already done our best to cut elsewhere in the district budget, perhaps more than 5% across other areas, but we are left with the painful reality of cutting 5% from core classroom teachers in grades K-8. In any case, we’re hoping for some dramatic saving here – or at least benefits revealed in terms of keeping class sizes in check.

Figure 1: Staffing Cut Scenarios for Newark Public Schools using 2009-10 Data

If we lay off only the least experienced teachers to achieve the 5% cut, we lay off only teachers with 3 or fewer years of experience when using the Newark data.  The average experience of those laid off is 1.8 years. And we end up laying off 72 teachers (a sucky reality no matter how you cut it).

If we use a random number generator to determine layoffs (really, a small difference from using value-added modeling), we end up laying off only 54 teachers instead of 72. We save 18 teachers, or 1.7% of our elementary classroom teacher workforce.

What’s the class size effect of saving these 18 teachers? Well, under the seniority based layoff policy, class size rises from 25 to 26.86. Under the random layoff policy, class size rises from 25 to 26.37. That is, class size is affected by about half a student per class. This may be important, but it still seems like a relatively small effect for a BIG policy change. This option necessarily assumes no downside to the random loss of experienced teachers. Of course, the argument is that more of those classes now have a good teacher in front of them. But again, doing this here and now with the type of information available means relying not even on the “best” of teacher effectiveness models, but relying on expedited, particularly sloppy, not thoroughly vetted models. I would have continued concerns even with richer models, like those explored in the recent Gates/Kane report, which still prove insufficient.
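For readers who want to poke at the arithmetic, here is a back-of-the-napkin version of the comparison in code. The roster is synthetic – roughly 1,000 teachers with a made-up salary schedule – so it will not reproduce the Newark figures above exactly; it just shows how the seniority-based and random (“quality-based”) scenarios are computed.

```python
import random

rng = random.Random(2010)

# Synthetic roster: ~1,000 elementary classroom teachers with a made-up salary schedule.
# These are NOT Newark's actual salaries; they only mimic a typical experience-pay gradient.
N_TEACHERS = 1_000
teachers = []
for _ in range(N_TEACHERS):
    experience = rng.randint(0, 30)
    salary = 48_000 + 1_800 * min(experience, 20) + rng.gauss(0, 1_500)
    teachers.append({"experience": experience, "salary": salary})

STUDENTS = 25 * N_TEACHERS          # assume an average class size of 25, as in the post
total_payroll = sum(t["salary"] for t in teachers)
target_cut = 0.05 * total_payroll   # 5% cut to the classroom-teacher salary budget

def layoffs_to_hit_target(ordered):
    """Walk down the ordered list, cutting teachers until the savings target is met."""
    saved, cut = 0.0, []
    for teacher in ordered:
        if saved >= target_cut:
            break
        cut.append(teacher)
        saved += teacher["salary"]
    return cut

seniority_cut = layoffs_to_hit_target(sorted(teachers, key=lambda t: t["experience"]))
random_cut = layoffs_to_hit_target(rng.sample(teachers, len(teachers)))  # stand-in for noisy "quality" ratings

for label, cut in [("Seniority-based (last in, first out)", seniority_cut),
                   ("Random (proxy for noisy 'quality' ratings)", random_cut)]:
    remaining = N_TEACHERS - len(cut)
    print(f"{label}: {len(cut)} layoffs, class size goes from 25.0 to {STUDENTS / remaining:.2f}")
```

With this synthetic roster, the two scenarios land in the same general neighborhood as the Newark figures above: dozens of layoffs either way, and a class-size difference on the order of half a student per classroom.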

Perhaps most importantly, how does this new policy affect the future teacher workforce in Newark – the desirability for up-and-coming teachers to pursue a teaching career in Newark, where their career might be cut off at any point, by random statistical error? And how does that tradeoff balance with a net difference of about half a student per classroom?

What about other costs?

Petrilli and Roza, among others, ignore entirely any potential downside for the teacher workforce – including for those who might choose to enter that workforce – if school districts or states all of a sudden decide to rely heavily on error prone and biased measures of teacher effectiveness to implement layoff policies.  This downside might be counterbalanced by increased salaries, on average and especially on the front end. That is, achieving equal incoming teacher quality over time, given the new uncertainty, might require higher front end salaries. This cost is ignored entirely (or simply assumed to come from somewhere else, like cutting benefits… simply negating step increments, or supplements for master’s degrees, each of which has other unmeasured consequences).

I have assumed above that districts would rely heavily on available student testing data, creating error-prone, largely random layoffs, while ignoring the cost of applying the evaluation system to achieve the layoffs. Arguably, even contracting an outside statistician to run the models and identify the teachers to be laid off would cost another $50,000 to $75,000, leading to the reduction of at least one more teacher position under the “quality based” layoff model.

And then there are the legal costs of fighting the due process claims that the dismissals were arbitrary and the potential legal claims over racially disparate firings. Forthcoming law review article to be posted soon.

Alternatively, developing a more rigorous teacher evaluation system that might more legitimately guide layoff policies requires significant up-front costs, ignored entirely in the current overly simplistic, misguided rhetoric.

How can we implement quality based layoffs when we’re supposed to be laying off teachers NOT teaching math and reading in elementary grades?

Here’s another issue that Petrilli, Roza and others seem to totally ignore. They argue that we must a) dismiss teachers based on quality and b) make sure we don’t compromise class sizes in core instructional areas, like reading and math in the elementary grades.

Let’s ponder this for a moment. The only teachers to whom we can readily assign (albeit deeply flawed) effectiveness ratings are those teaching math and reading between grades 3 and 8. So, the only teachers whom we could conceivably lay off based on preferred “reformy” quality metrics are teachers who are directly responsible for teaching math and reading between grades 3 and 8.

That is, in order to implement quality based layoffs, as reformers suggest, we must be laying off math and reading teachers between grades 3 and 8, except that we are supposed to be laying off other teachers, not those teachers. WOW… didn’t think that one through very well… did they?

Am I saying seniority layoffs are great?

No. Clearly seniority layoffs are imperfect and arguably there is no perfect answer to layoff policies. Layoffs suck and sometimes that sucky option has to be implemented. Sometimes that sucky option has to be implemented with a blunt and convenient instrument, one that is easily defined, such as years of service. It is foolish to argue that teaching is the only profession where those who’ve been around for a while – those who’ve done their time – have greater protection when the axe comes down. Might I suggest that paying one’s dues even plays a significant role in many private sector jobs? Really? And it is equally foolish to argue that every other profession EXCEPT TEACHING necessarily makes precise quality decisions regarding employees when that axe comes down.

The tradeoff being made in this case is a tradeoff  NOT between “keeping quality teachers” versus “keeping old, dead wood” as Petrilli, Roza and others would argue, but rather the tradeoff between laying off teachers on the unfortunately crude basis of seniority only, versus laying off teachers on a marginally-better-than-random, roll-of-the-dice basis. I would argue the latter may actually be more problematic for the future quality of the teaching workforce!  Yes, pundits seem to think that destabilizing the teaching workforce can only make it better. How could it possibly get worse, they argue? Substantially increasing the uncertainty of career earnings for teachers can certainly make it worse.

Bad Teachers Hurt Kids, but Salary Cuts Have no Down Side?

The assumption constantly thrown around in these policy briefs is that putting a bad teacher in front of the kids is the worst possible thing you could do. We have to fire those teachers. They are bad for kids. They hurt kids.

But, the same pundits argue that we should cut pay for teachers in any number of ways (including making them pay more for their benefits) and subject teachers to layoff policies that are little more than random. Since so many teachers are bad teachers – and simply bad people – these policies are, of course, not offensive. Right? Kids good. Teachers bad. Treat kids well. Take it out on teachers. No harm to kids. Easy!

I’m having a hard time swallowing that. That’s just not a reasonable way to treat a workforce (if you want a good workforce), much less a reasonable way to treat a workforce charged with educating children. In fact, it’s bad for the kids, and just plain ignorant, to assert that one can treat teachers badly, lower their pay, morale and ultimately the quality of the teacher workforce, and expect there to be no downside for the kids.

Petrilli and Roza make the assumption that there are big savings to be found from cutting teacher salaries directly, and also indirectly by passing along benefits costs to teachers.  That’s a salary cut! Or at least a cut to the total compensation package – and it’s a package deal! This argument seems to be coupled with an assumption that there is absolutely no loss of benefit or effectiveness from pursuing this cost-cutting approach (because we’ll be firing all of the sucky teachers anyway). That is, teacher quality will remain constant even if teacher salaries are cut substantially.  A substantial body of research questions that assumption:

  • Murnane and Olson (1989) find that salaries affect the decision to enter teaching and the duration of the teaching career;
  • Figlio (1997, 2002) and Ferguson (1991) find that higher salaries are associated with better qualified teachers;
  • Figlio and Reuben (2001) “find that tax limits systematically reduce the average quality of education majors, as well as new public school teachers in states that have passed these limits;”
  • Ondrich, Pas and Yinger (2008) “find that teachers in districts with higher salaries relative to non-teaching salaries in the same county are less likely to leave teaching and that a teacher is less likely to change districts when he or she teaches in a district near the top of the teacher salary distribution in that county.”

To mention a few.

That is, in the aggregate, higher salaries (and better working conditions) can attract a stronger teacher workforce, and at a local level, having more competitive teaching salaries compared either to non-teaching jobs in the same labor market or compared to teaching jobs in other districts in the same labor market can help attract and especially retain teachers.

Allegretto, Corcoran and Mishel, among others, have shown that teacher wages have lagged over time – fallen behind non-teaching professions. AND, they have shown that the benefits differences are smaller than many others argue and certainly do not make up the difference in the wage deficit over time. I have shown previously on my blog that teacher wages in New Jersey have similarly lagged behind!

So, let’s assume we believe that teacher quality necessarily trumps reduced class size, for the same dollar spent. Sadly, this has been a really difficult trade-off to untangle in empirical research, and while reformers boldly assume it, the evidence is not clear. But let’s accept that assumption anyway. And let’s also accept the evidence that overall wages and local wage advantages lead to a stronger teacher workforce.

If that’s the case, then the appropriate decision to make at the district level would be to lay off teachers and marginally increase class sizes, while making sure to keep salaries competitive. After all, the aggregate data seem to suggest that over the past few decades we’ve increased the number of personnel more than we’ve increased the salaries of those personnel. That is, cut numbers of staff before cutting or freezing salaries. In fact, one might even choose to cut more staff and pay even higher salaries to gain competitive advantage in tough economic times. Some have suggested as much.  I’m not sold on that either, especially when we start talking about increasing class sizes to 30, 35 or even 50.  Note that class size may also affect the competitive wage that must be paid to a teacher in order to recruit and retain teachers of constant quality. Nonetheless, it is important to understand the role of teacher compensation in ensuring the overall quality of the teacher workforce and it is absurd to assume no negative consequences of slashing teacher pay across-the-board.

Take home point!

In summary, we should be providing thoughtful decision frameworks for local public school administrators to make cost-effective decisions regarding resource allocation rather than spewing laundry lists of reformy strategies for which no thoughtful cost-effectiveness analysis has ever been conducted.

Further, now is not the time to act in panic and haste to adopt these unfounded strategies without appropriate consideration of the up-front costs of making truly effective reforms.

A few references

Richard J. Murnane and Randall Olsen (1989) The effects of salaries and opportunity costs on length of stay in teaching: Evidence from Michigan. Review of Economics and Statistics 71 (2) 347-352

David N. Figlio (1997) Teacher Salaries and Teacher Quality. Economics Letters 55 267-271.

David N. Figlio (2002) Can Public Schools Buy Better-Qualified Teachers? Industrial and Labor Relations Review 55, 686-699.


Ronald Ferguson (1991) Paying for Public Education: New Evidence on How and Why Money Matters. Harvard Journal on Legislation. 28 (2) 465-498.

Figlio, D.N., Reuben, K. (2001) Tax limits and the qualifications of new teachers Journal of Public Economics 80 (1) 49-71

Ondrich, J., Pas, E., Yinger, J. (2008) The Determinants of Teacher Attrition in Upstate New York.  Public Finance Review 36 (1) 112-144

A few comments on the Gates/Kane value-added study


(My apologies in advance for an excessively technical, research geeky post, but I felt it necessary in this case)

Take home points

1) As I read it, the new Gates/Kane value-added findings are NOT by any stretch of the imagination an endorsement of using value-added measures of teacher effectiveness for rating individual teachers as effective or not or for making high-stakes employment decisions. In this regard, the Gates/Kane findings are consistent with previous findings regarding stability, precision and accuracy of rating individual teachers.

2) Even in the best of cases, measures used in value-added models remain insufficiently precise or accurate to account for the differences in children served by different teachers in different classrooms (see discussion of poverty measure in first section, point #2 below).

3) Too many of these studies, including this one, adopt the logic that value-added outcomes can be treated both as a measure of effectiveness to be investigated (independent variable) and as the true measure of effectiveness (the dependent measure). That is, this study, like others, evaluates the usefulness of both value-added measures and other measures of teacher quality by their ability to predict future (or different group) value-added measures. Certainly, the deck is stacked in favor of value-added measures under such a model. See value-added as a predictor of itself below.

4) Value-added measures can be useful for exploring variations in student achievement gains across classroom settings and teachers, but I would argue that they remain of very limited use for identifying, more precisely or accurately, the quality of individual teachers.  Among other things, the most useful findings in the new Gates/Kane study apply to very few teachers in the system (see final point below).

Detailed discussion

Much has been made of the preliminary findings of the Gates Foundation study on teacher effectiveness. Jason Felch of the LA Times has characterized the study as an outright endorsement of the use of value-added measures as the primary basis for determining teacher effectiveness. Mike Johnston, the Colorado State Senator behind that state’s new teacher tenure law, which requires that 50% of teacher evaluation be based on student growth (and tenure and removal of tenure based on the evaluation scheme), also seemed thrilled – via twitter – that the Gates study found that value-added scores in one year predict value-added scores in another – seemingly assuming this finding unproblematically endorses his policies (?) (via Twitter, SenJohnston [Mike Johnston]: “New Gates foundation report on effective teaching: value added on state test strongest predictor of future performance”).

But, as I read it, the new Gates study is – even setting aside its preliminary nature – NOT AN OUTRIGHT ENDORSEMENT OF USING VALUE-ADDED MEASURES AS A SIGNIFICANT BASIS FOR MAKING HIGH STAKES DECISIONS ABOUT TEACHER DISMISSAL/RETENTION, AS IS MANDATED VIA STATE POLICIES LIKE THOSE ADOPTED IN COLORADO – OR AS SUGGESTED BY THE ABSURDLY NARROW APPROACH FOR “OUTING” TEACHERS TAKEN BY MR. FELCH AND THE LA TIMES.

Rather, the new Gates study tells us that we can use value-added analysis to learn about variations in student learning (or at least in test score growth) across classrooms and schools and that we can assume that some of this variation is related to variations in teacher quality. But, there remains substantial uncertainty in the capacity to estimate whether any one teacher is a good teacher or a bad one.

Perhaps the most important and interesting aspects of the study are its current and proposed explorations of the relationship between value-added measures and other measures, including student perceptions, principal perceptions and external evaluator ratings.

Gates Report vs. LA Times Analysis

In short, data quality and modeling matter, but you can only do so much.

For starters, let’s compare some of the features of the Gates study value-added models to the LA Times models. These are some important differences to look for when you see value-added models being applied to study student performance differences across classrooms – especially where the goal is to assign outcome effects to teachers.

  1. The LA Times model, like many others, uses annual achievement data (as far as I can tell) to determine teacher effectiveness, whereas the Gates study at least explores the seasonality of learning – or more specifically, how much achievement change occurs over the summer (which is certainly outside of the teacher’s control AND differs across students by their socioeconomic status). One of the more interesting findings of the Gates study is that from 4th grade on: “The norm sample results imply that students improve their reading comprehension scores just as much (or more) between April and October as between October and April in the following grade. Scores may be rising as kids mature and get more practice outside of school.” This means that if there exist substantial differences in summer learning by students’ family income level and/or other factors, as has been found in other studies, then using annual data could significantly and inappropriately disadvantage teachers who are assigned students whose reading skills lagged over the summer. The existing blunt indicator of low income status is unlikely to be sufficiently precise to correct for summer learning differences.
  2. The LA Times model did include such blunt measures for poverty status and language proficiency, as well as disability status (single indicator), but later analyses found shares of gifted children, along with student race, to be associated with differences in teacher ratings. The Gates study includes similarly crude indicators of socioeconomic status, but does include in its value-added model whether individual children are classified as gifted. It also includes student race and the average characteristics of students in each classroom (peer group effect). This is a much richer and more appropriate model, but still likely insufficient to fully account for the non-random distribution of students.  That is, the Gates study models at least attempt to correct for the influence of peers in the classroom in addition to individual characteristics of students, but even this may be insufficient. One particular concern of mine is the use of a single dichotomous measure of child poverty – whether the child qualifies for free or reduced price lunch – and the share of children in each class who do. The reality is that in many urban public schooling settings like those involved in the Gates study, several elementary/middle schools have over 80% of children qualifying for free or reduced lunch, but this apparent similarity is no guarantee of similar poverty conditions among the children in one school or classroom compared to another. One classroom might be filled 80% with children whose family income is at or below the 100% income threshold for poverty, whereas another classroom might be filled with 80% children whose family incomes are 85% higher (at the threshold for “reduced” price lunch). This is a big difference that is not captured with this crude measure.
  3. The LA Times analysis uses a single set of achievement measures. Other studies, like the work of Sean Corcoran (see below) using data from Houston, TX, have shown us the relatively weak relationship between value-added ratings of teachers produced by one test and value-added ratings of teachers produced by another test. Thankfully, the Gates foundation analysis takes steps to explore this question further, but, I would argue, it overstates the relationship found between tests, or states that relationship in a way that might be misinterpreted by pundits seeking to advance the use of value-added for high stakes decisions (more later).

Learning about Variance vs. Rating Individual Teachers with Precision and Accuracy

If we are talking about using the value-added method to classify individual teachers as effective or ineffective and to use this information as the basis for dismissing teachers or for compensation, then we should be very concerned with the precision and accuracy of the measures as they apply to each individual teacher. In this context, one can characterize precision and accuracy as follows.

  • Precision – That there exists little error in our estimate that a teacher is responsible for producing good or bad student value-added on the test instrument used.  That is, we have little chance of classifying a good teacher as bad, an average teacher as bad, or vice versa.
  • Accuracy – That the test instrument and our use of it to measure teacher effectiveness is really measuring “true” effectiveness of the teacher – or truly how good that teacher is at doing all of the things we expect that teacher to do.

If, instead of classifying individual teachers as good or bad (and firing them, or shaming them in the newspaper or on milk cartons), we are actually interested in learning about variations in “effectiveness” across many teachers and many sections of students over many years, and whether student perceptions, supervisor evaluations, classroom conditions and teaching practices are associated with differences in effectiveness, we are less concerned about precise and accurate classification of individuals and more concerned about the relationships between measures, across many individuals (measured with error).  That is, do groups of teachers who do more of “X” seem to produce better value-added gains? Do groups of teachers prepared in this way seem to produce better outcomes? We are not concerned about whether a given teacher is accurately “scored.” Instead, we are concerned about general trends and averages.

The Gates study, like most previous studies, finds what I would call relatively weak correlations between the value-added score an individual teacher receives for one section of students in math or reading compared to another, and from one year to the next. The Gates research report noted:

“When the between-section or between-year correlation in teacher value-added is below .5, the implication is that more than half of the observed variation is due to transitory effects rather than stable differences between teachers. That is the case for all of the measures of value-added we calculated.”

(The full set of those correlations appears in Table 5 of the Gates report.)

Unfortunately, summaries of the Gates study seem to obsess over how relatively high the correlation is from year to year for teachers rated by student performance on the state math test (.404) and largely ignore how much lower many of the other correlations are. Why is the correlation for the ELA test under .20, and what does that say about the high-stakes usefulness of the approach? Like other studies evaluating the stability of value-added ratings, the correlations seem to run between .20 and .40, with some falling below .20. That’s not a very high correlation – which then suggests not a very high degree of precision in figuring out which individual teacher is a good teacher versus which one is bad. BUT THAT’S NOT THE POINT EITHER!

Now, the Gates study rightly points out that lower correlations do not mean that the information is entirely unimportant. The study focuses on what it calls “persistent” effects or “stable” effects, arguing that if there’s a ton of variation across classrooms and teachers, being able to explain even a portion of that variation is important – a portion of a lot is still something. A small slice of a huge pie may still provide some sustenance. The report notes:

“Assuming that the distribution of teacher effects is “bell-shaped” (that is, a normal distribution), this means that if one could accurately identify the subset of teachers with value-added in the top quartile, they would raise achievement for the average student in their class by .18 standard deviations relative to those assigned to the median teacher. Similarly, the worst quarter of teachers would lower achievement by .18 standard deviations. So the difference in average student achievement between having a top or bottom quartile teacher would be .36 standard deviations.” (p.19)

The language here is really, really important, because it speaks to a theoretical and/or hypothetical difference between high and low performing teachers drawn from a very large analysis of teacher effects (across many teachers, classrooms, and multiple years). THIS DOES NOT SPEAK TO THE POSSIBILITY THAT WE CAN PRECISELY AND ACCURATELY IDENTIFY WHETHER ANY SINGLE TEACHER FALLS IN THE TOP OR BOTTOM GROUP! It’s a finding that makes sense when understood correctly but one that is ripe for misuse and misunderstanding.
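For what it’s worth, the .18 figure is just a property of the bell curve the report assumes. Working backwards (my own back-of-envelope calculation, not a number taken from the report), the mean of the top quartile of a normal distribution sits about 1.27 standard deviations above the median, so a .18 gap implies a persistent teacher-effect spread of roughly .18/1.27 ≈ .14 student-level standard deviations.

```python
import math

# 75th percentile of the standard normal (z such that Phi(z) = 0.75)
Z75 = 0.6744897501960817

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

# Mean of the top quartile of a standard normal, relative to the median (= mean):
# E[Z | Z > Z75] = pdf(Z75) / P(Z > Z75) = pdf(Z75) / 0.25
top_quartile_mean = normal_pdf(Z75) / 0.25
print(f"Mean of the top quartile: {top_quartile_mean:.3f} SDs above the median")      # about 1.27

# If that gap corresponds to 0.18 student-level SDs of achievement (the report's figure),
# the implied spread of persistent teacher effects is:
implied_teacher_sd = 0.18 / top_quartile_mean
print(f"Implied SD of persistent teacher effects: about {implied_teacher_sd:.2f}")    # about 0.14

# The report's top-vs-bottom-quartile comparison is just twice the one-sided gap:
print(f"Top quartile minus bottom quartile: {2 * 0.18:.2f} student-level SDs")
```

None of which, again, tells us anything about whether we can reliably figure out which actual teachers sit in those quartiles.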

Yes, in probabilistic terms, this does suggest that if we implement mass layoffs in a system as large as NYC and base those layoffs on value-added measures, we have a pretty good chance of increasing value-added in later years – assuming our layoff policy does not change other conditions (class size, average quality of those in the system – replacement quality). But any improvements can be expected to be far, far, far less than the .18 figure used in the passage above. Even assuming no measurement error – that the district is laying off the “right” teachers (a silly assumption) – the newly hired teachers can be expected to fall, at best, across the same normal curve. But I’ve discussed my taste for this approach to collateral damage in previous posts. In short, I believe it’s unnecessary and not that likely to play out as we might assume. (see discussion of reform engineers at bottom)

A Few more Technical Notes

Persistent or Stable Effects: The Gates report focuses on what it terms "persistent" effects of teachers on student value-added – assuming that these persistent effects represent the consistent, over-time or across-section influence of a specific teacher on his/her students' achievement gains. The report focuses on such "persistent" effects for a few reasons. First, the report uses this discussion to, I would argue, overplay the persistent influence teachers have on student outcomes – as in the quote above, which is later used in the report to explain the share of the black-white achievement gap that could be closed by highly effective teachers. The assertion is that even if teacher effects explain a small portion of the variation in student achievement gains, if the variation in those gains is huge, then explaining a portion is important. Nonetheless, the persistent effects remain a relatively small portion (at most a "modest" portion in some cases) – which dramatically reduces the precision with which we can identify the effectiveness of any one teacher (taking as given that the tests are the true measure of effectiveness – the validity concern).

AND, I would argue that it is a stretch to assume that the persistent effects within teachers are entirely a function of teacher effectiveness. The persistent effect of teachers may also include the persistent characteristics of students assigned to that teacher – that the teacher, year after year, and across sections is more likely to be assigned the more difficult students (or the more expert students). Persistent pattern yes. Persistent teacher effect? Perhaps partially (How much? Who knows?).

Like other studies, the identification of persistent effects from year to year, or across sections, in the new Gates study merely reinforces that with more sections and/or more years of data (more students passing through) for any given teacher, we can gain a more stable value-added estimate and a more precise indication of the value-added associated with the individual teacher. Again, the persistent effect may be a measure of the persistence of something other than the teacher's actual effectiveness (teacher X always has the most disruptive kids, larger classes, the noisiest/hottest/coldest – generally worst – classroom).  The Gates study does not (BECAUSE IT WASN'T MEANT TO) assess how the error rate of identifying a teacher as "good" or "bad" changes with each additional year of data, but given that its findings are so consistent with those of other studies, I would suspect the error rates to be similar as well.
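
One common way to formalize the "more data, more stability" point is the Spearman-Brown style calculation below (a sketch under classical test theory assumptions, not a computation from the Gates report): if a single year of value-added correlates r with another single year, the average over k years behaves like a measure with reliability kr / (1 + (k - 1)r).

```python
def reliability_of_k_year_average(r_single_year: float, k: int) -> float:
    """Spearman-Brown: reliability of the mean of k equally noisy, parallel measures."""
    return k * r_single_year / (1 + (k - 1) * r_single_year)

for r in (0.2, 0.4):
    print(r, [round(reliability_of_k_year_average(r, k), 2) for k in range(1, 6)])
# 0.2 -> [0.2, 0.33, 0.43, 0.5, 0.56]
# 0.4 -> [0.4, 0.57, 0.67, 0.73, 0.77]
```

Of course, as noted above, whatever is persistent in the estimate gets averaged in as well, whether that is the teacher or the persistently difficult classroom assignment.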

Differences Between Tests: The Gates study provides some useful comparisons of teachers' value-added ratings on one test with the same teachers' ratings on another test – (a) for kids in the same section in the same year, and (b) for kids in different sections of classes with the same teacher.

Note that in a similar analysis, Corcoran, Jennings and Beveridge found:

“among those who ranked in the top category (5) on the TAKS reading test, more than 17 percent ranked among the lowest two categories on the Stanford test. Similarly, more than 15 percent of the lowest value-added teachers on the TAKS were in the highest two categories on the Stanford.”

Corcoran, Sean P., Jennifer L. Jennings, and Andrew A. Beveridge. 2010. “Teacher Effectiveness on High- and Low-Stakes Tests.” Paper presented at the Institute for Research on Poverty summer workshop, Madison, WI.

That is, analysis of teacher value-added ratings on two separate tests called into question the extent to which individual teachers might accurately be classified as effective when using a single testing instrument. If we assume both tests to measure how effective a teacher is at teaching "math," or a specific subject within "math," then both tests should tell us the same thing about each teacher – which ones are truly effective math teachers and which ones are not. Corcoran's findings raise serious questions about accuracy in this regard.

The Gates study argues that comparing teacher value-added across two math tests – where one is more conceptual – allows the authors to check whether doing well on the state test comes at the expense of conceptual learning: if results on the state test are correlated with results on the more conceptual test, then conceptual learning was not compromised. That seems reasonable enough, to the extent that the testing instruments are being appropriately described (and to the extent they are valid instruments).  In terms of value-added ratings, the Gates study, like the Corcoran study, finds only a modest relationship between ratings of teachers based on one test and ratings based on the other:

“the correlation between a teacher’s value-added on the state test and their value-added on the Balanced Assessment in Math was .377 in the same section and .161 between sections.”

But the Gates study also explores the relationships between “persistent” components across tests – which must be done across sections taking the test in the same year (until subsequent years become available). They find:

“we estimate the correlation between the persistent component of teacher impacts on the state test and on BAM is moderately large, .54.”

“The correlation in the stable teacher component of ELA value-added and the Stanford 9 OE was lower, .37.”

I’m uncomfortable with the phrasing here that says – “persistent component of teacher impacts” – in part because there exist a number of other persistent conditions or factors that may be embedded in the persistent effect, as I discuss above. Setting that aside, however, what the authors are exploring is whether the correlated component – the portions of student performance on any given test that are assumed to represent teacher effectiveness – is similar between tests.

In any case, however, these correlations, like the others in the Gates analysis, are telling us how highly associated – or not – the assumed persistent component is across tests, across many teachers teaching many sections of the same class.  This allows the authors to assert that, across all of these teachers and the various sections they teach, there is a "moderately" large relationship between student performance on the two different tests, supporting the authors' argument that one test somewhat validates the other. But again, this analysis, like the others in the report, does not suggest by any stretch of the imagination that either one test or the other will allow us to precisely identify the good teacher versus the bad one. There is still a significant amount of reshuffling going on in teacher ratings from one test to the next, even with the same students in the same class sections in the same year. And, of course, good teaching is not synonymous with raising a student's test scores.
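
For readers wondering how an observed between-section, cross-test correlation of .161 can correspond to a "persistent component" correlation of .54, the mechanics are presumably something like the classical disattenuation correction sketched below. The two "reliability" inputs are hypothetical placeholders I chose so that the arithmetic lands near the reported value; they are not figures from the report.

```python
from math import sqrt

def disattenuate(r_observed: float, r_aa: float, r_bb: float) -> float:
    """Correct an observed correlation for noise in both measures (classical disattenuation)."""
    return r_observed / sqrt(r_aa * r_bb)

# Hypothetical between-section "reliabilities" of each test's value-added measure
r_state = 0.38   # placeholder, not from the report
r_bam = 0.23     # placeholder, not from the report

print(round(disattenuate(0.161, r_state, r_bam), 2))   # ~0.54 with these placeholder inputs
```

Whether or not this is exactly the report's procedure, the logic is the same: strip out the noise in each measure and ask how the remaining "stable" parts line up. And, as argued above, that stable part is stable about the classroom, not necessarily about the teacher.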

This analysis does suggest that we might – by using several tests – get a more accurate picture of student performance and how it varies across teachers, and does at least suggest that across multiple tests – if the persistent component is correlated – just like across multiple years – we might get a more stable picture of which teachers are doing better/worse.  Precise enough for high stakes decisions (and besides, how much more testing can we/they handle?)? I’m still not confident that’s the case.

Value-added is the best predictor of itself

This seems to be one of the findings that gets the most media-play (and was the basis of Senator Johnston's proud tweets). Of course value-added is a better predictor of future value-added (on the same test and with the same model) than other factors are of future value-added – even if value-added is only a weak predictor of future (or different section) value-added. Amazingly, however, many of the student survey responses on factors related to things like "Challenge" seem almost as related to value-added as value-added to itself. That is a surprising finding, and I'm not sure yet what to make of it. [Note that the correlations between student ratings and VAM were for the same class and year, whereas VAM predicting VAM is (a) across sections and (b) across years.]

Again, the main problem with this VAM-predicts-VAM argument is that it assumes value-added ratings in the subsequent year to be THE valid measure of the desired outcome. But that's the part we just don't yet know. Perhaps the student perceptions are actually a more valid representation of good teaching than the value-added measure? Perhaps we should flip the question around? It does seem reasonable enough to assume that we want to see students improve their knowledge and skills in measurable ways on high-quality assessments. Whether our current batch of assessments, as we are currently using them and as they are being used in this analysis, accomplishes that goal remains questionable.

What is perhaps most useful about the Gates study and future research questions is that it begins to explore with greater depth and breadth the other factors that are – and are not – associated with student achievement gains.

Findings apply to a relatively small share of teachers

I have noted in other blog posts on this topic that in the best of cases (or perhaps worst if we actually followed through with it), we might apply value added ratings to somewhat less than 20% of teachers – those directly responsible and solely responsible for teaching reading or math to insulated clusters of children in grades 3 to 8 – well… 4-8, actually … since many VA models use annual data and the testing starts with grade 3. Even for the elementary school teachers who could be rated, the content of the ratings would exclude a great deal of what they teach. Note that most of the interesting findings in the new Gates study are those which allow us to evaluate the correlations of teachers across different sections of the same course in addition to subsequent years. These comparisons can only be made at the middle school level (and/or upper elementary, if taught by section). Further, many of the language arts correlations were very low, limiting the more interesting discussions to math alone. That is, we need to keep in mind that in this particular study, most of the interesting findings apply to no more than 5% to 10% of teachers – those involved in teaching math in the upper elementary and middle grades – specifically those teaching multiple sections of the same math content each year.


The Circular Logic of Quality-Based Layoff Arguments

Many pundits are responding enthusiastically to the new LA Times article on quality-based layoffs – or how dismissing teachers based on value-added scores rather than on seniority would have saved LAUSD many of its better teachers, rather than simply saving its older ones.

Some are pointing out that this new LA Times report is the "right" way to use value-added, as compared with the "wrong" way the LA Times had used the information earlier this year.

Recently, I explained the problematic circular logic being used to support these "quality-based layoff" arguments. Obviously, if we dismiss teachers based on "true" quality measures, rather than experience, which is, of course, not correlated with "true" quality measures, then we save the jobs of good teachers and get rid of bad ones. Simple enough? Not so. Here's my explanation, once again.

This argument draws on an interesting thought piece and simulation posted at http://www.caldercenter.org (Teacher Layoffs: An Empirical Illustration of Seniority vs. Measures of Effectiveness), which was later summarized in a (less thoughtful) recent Brookings report (http://www.brookings.edu/~/media/Files/rc/reports/2010/1117_evaluating_teachers/1117_evaluating_teachers.pdf).

That paper demonstrated that if one dismisses teachers based on VAM, future predicted student gains are higher than if one dismisses teachers based on experience (or seniority). The authors point out that less experienced teachers are scattered across the full range of effectiveness – based on VAM – and therefore, dismissing teachers on the basis of experience leads to dismissal of both good and bad teachers – as measured by VAM. By contrast, teachers with low value-added are invariably – low value-added – BY DEFINITION. Therefore, dismissing on the basis of low value-added leaves more high value-added teachers in the system – including more teachers who show high value-added in later years (current value added is more correlated with future value added than is experience).

It is assumed in this simulation that VAM (based on a specific set of assessments and model specification) produces the true measure of teacher quality both as basis for current teacher dismissals and as basis for evaluating the effectiveness of choosing to dismiss based on VAM versus dismissing based on experience.

The authors similarly dismiss principal evaluations of teachers as ineffective because they too are less correlated with value-added measures than value-added measures with themselves.

Might I argue the opposite? – Value-added measures are flawed because they only weakly predict which teachers we know – by observation – are good and which ones we know are bad? A specious argument – but no more specious than its inverse.

The circular logic here is, well, problematic. Of course if we measure the effectiveness of the policy decision in terms of VAM, making the policy decision based on VAM (using the same model and assessments) will produce the more highly correlated outcome – correlated with VAM, that is.

However, it is quite likely that if we simply used different assessment data or a different VAM model specification to evaluate the results of the alternative dismissal policies, we might find neither VAM-based dismissal nor experience-based dismissal better or worse than the other.

For example, Corcoran and Jennings conducted an analysis of the same teachers on two different tests in Houston, Texas, finding:

…among those who ranked in the top category (5) on the TAKS reading test, more than 17 percent ranked among the lowest two categories on the Stanford test. Similarly, more than 15 percent of the lowest value-added teachers on the TAKS were in the highest two categories on the Stanford.

  • Corcoran, Sean P., Jennifer L. Jennings, and Andrew A. Beveridge. 2010. “Teacher Effectiveness on High- and Low-Stakes Tests.” Paper presented at the Institute for Research on Poverty summer workshop, Madison, WI.

So, what would happen if we did a simulation of "quality-based" layoffs versus experience-based layoffs using the Houston data, where the quality-based layoffs were based on a VAM model using the Texas Assessments (TAKS), but we then evaluated the effectiveness of the layoff alternatives using a value-added model of Stanford achievement test data? Arguably, the odds would still be stacked in favor of VAM predicting VAM – even with different VAM measures (and perhaps different model specifications). But I suspect the results would be much less compelling than in the original simulation.
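
Here is a minimal sketch of that thought experiment with entirely hypothetical numbers (it is not the Houston data): each teacher gets a "TAKS-type" and a "Stanford-type" true effect that correlate imperfectly, a noisy VAM estimate on the TAKS side, and an experience value unrelated to either. We then compare who survives VAM-based versus experience-based layoffs under each yardstick.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000
layoff_share = 0.05

# Hypothetical true effects on two different tests, imperfectly correlated
r_true = 0.6
true_taks, true_stanford = rng.multivariate_normal(
    [0, 0], [[1, r_true], [r_true, 1]], size=n).T

# Noisy TAKS-based VAM estimate (assumed "reliability" of about .4)
vam_taks = np.sqrt(0.4) * true_taks + np.sqrt(0.6) * rng.normal(size=n)

# Experience, assumed unrelated to effectiveness for this illustration
experience = rng.normal(size=n)

def mean_effect_of_survivors(layoff_score, outcome):
    """Lay off the bottom `layoff_share` by `layoff_score`; return survivors' mean outcome."""
    keep = layoff_score > np.quantile(layoff_score, layoff_share)
    return outcome[keep].mean()

for name, outcome in [("TAKS value-added", true_taks),
                      ("Stanford value-added", true_stanford)]:
    print(f"{name}: VAM-based layoffs -> {mean_effect_of_survivors(vam_taks, outcome):.3f}, "
          f"experience-based -> {mean_effect_of_survivors(experience, outcome):.3f}")
```

With these assumed parameters the VAM-based policy still edges out the experience-based one on the second test, just by a smaller margin; shrink the cross-test correlation or add more measurement noise and the advantage shrinks toward the roll of the dice described in the next paragraph.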

The results under this alternative approach may, however, be reduced entirely to noise – meaning that the VAM-based layoffs, as opposed to the experience-based firings, would be the equivalent of random firings – drawn from a hat and poorly correlated, if at all, with the outcome measure estimated by a different VAM. Neither would be a much better predictor of future value-added.  But for all their flaws, I'd take the experience-based dismissal policy over the roll-of-the-dice, randomized firing policy any day.

In the case of the LA Times analysis, the situation is particularly disturbing if we look back on some of the findings in their own technical report.

I explained in a previous post that the LA Times value-added model had potentially significant bias in its estimates of teacher quality. For example, in that post I explained that:

Buddin finds that black teachers have lower value-added scores for both ELA and MATH. Further, these are some of the largest negative effects in the second level analysis – especially for MATH. The interpretation here (for parent readers of the LA Times web site) is that having a black teacher for math is worse than having a novice teacher. In fact, it’s the worst possible thing! Having a black teacher for ELA is comparable to having a novice teacher.

Buddin also finds that having more black students in your class is negatively associated with a teacher's value-added scores, but writes off the effect as small. Teachers of black students in LA are simply worse? There is NO discussion of the potentially significant overlap between black teachers, novice teachers and serving black students, concentrated in black schools (as addressed by Hanushek and Rivkin in the link above).

By contrast, Buddin finds that having an Asian teacher is much, much better for MATH. In fact, Asian teachers are as much better (than white teachers) for math as black teachers are worse! Parents – go find yourself an Asian math teacher in LA? Also, having more Asian students in your class is associated with higher teacher ratings for Math. That is, you’re a better math teacher if you’ve got more Asian students, and you’re a really good math teacher if you’re Asian and have more Asian students?????

One of the more intriguing arguments in the new LA Times article is that under the seniority based layoff policy:

Schools in some of the city’s poorest areas were disproportionately hurt by the layoffs. Nearly one in 10 teachers in South Los Angeles schools was laid off, nearly twice the rate in other areas. Sixteen schools lost at least a fourth of their teachers, all but one of them in South or Central Los Angeles.

http://articles.latimes.com/2010/dec/04/local/la-me-1205-teachers-seniority-20101204/2

That is, new teachers who were laid off based on seniority preferences were concentrated in high need schools. But so too were teachers with low value-added ratings?

While arguing that "far fewer" teachers would be laid off in high need schools under a quality-based layoff policy, the LA Times does not, however, offer up how many teachers would have been dismissed from these schools had its biased value-added measures been used instead. Recall that, from the original LA Times analysis:

97% of children in the lowest performing schools are poor, and 55% in higher performing schools are poor.

Combine this finding with the findings above regarding the relationship between race and value-added ratings, and it is difficult to conceive how VAM-based layoffs of teachers in LA would not also fall disparately on high poverty and high minority schools. The disparate effect may be partially offset by statistical noise, but that simply means that some teachers in lower poverty schools would be dismissed on the basis of random statistical error, instead of race-correlated statistical bias (which leads to a higher rate of dismissals in higher poverty, higher minority schools).

Further, the seniority based layoff policy leads to more teachers being dismissed in high poverty schools because the district placed more novice teachers in high poverty schools, whereas the value-added based layoff policy would likely lead to more teachers being dismissed from high poverty, high minority schools, experienced or not, because they were placed in high poverty, high minority schools.

So, even though we might make a rational case that seniority based layoffs are not the best possible option, because they may not be highly correlated with true (not "true") teaching quality, I fail to see how the currently proposed alternatives are much, if any, better.  They only appear to be better when we measure them against themselves as the "true" measure of success.