Two Persistent Reformy Misrepresentations regarding VAM Estimates


I have written much on this blog about problems with the use of value-added estimates of teacher effect (used loosely) on student test score gains. I have addressed problems with both the reliability and validity of VAM estimates, and I have pointed out how SGP-based estimates of student growth are invalid on their face for determining teacher effectiveness.

But I keep hearing two common refrains from the uber-reformy crowd (those completely oblivious to the statistics and research on VAM, while also lacking any depth of understanding of the complexities of the social systems [schools] into which they propose to implement VAM as a de-selection tool). Sadly, these are the people who seem to be drafting policies these days.

Here are the persistent misrepresentations:

Misrepresentation #1: That this reliability and error stuff only makes it hard for us to distinguish among all those teachers clustered in the middle of the distribution. BUT… we can certainly be confident about those at the extremes of the distribution.  We know who the really good and really bad teachers are based on their VAM estimates.

WRONG!

This would possibly be a reasonable assertion if reliability and error rates were the only problem. But this statement entirely ignores the issue of omitted-variables bias (other factors that affect teacher effect estimates but may have been missed by the model), and just how much those observations in the tails jump around when we tweak the VAM by adding or removing variables or by rescaling measures.

A recent paper by Dale Ballou & colleagues illustrates this problem:

“In this paper, we consider the impact of omitted variables on teachers’ value-added estimates, and whether commonly used single-equation or two-stage estimates are preferable when possibly important covariates are not available for inclusion in the value-added model. The findings indicate that these modeling choices can significantly influence outcomes for individual teachers, particularly those in the tails of the performance distribution who are most likely to be targeted by high-stakes policies.” (Ballou et al., 2012) [emphasis added]

The problem is that we can never know when we’ve got that model specification just right. Further, while we might be able to run checks as to whether the model estimates display bias with respect to measurable external factors, we can’t know if there is bias with respect to stuff we can’t measure, nor can we always tell if there are clusters of teachers in our model whose effectiveness estimates are biased in one direction and other clusters in another direction (also in relation to stuff unmeasured).  That is, we can only test this omitted variables bias stuff when we can add in and take out measures that we have. We simply don’t know how much bias remains due to all sorts of other unmeasured stuff, nor do we know just how much that bias may affect many of those distributions in the tails!
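
To make that concrete, here is a quick, purely illustrative simulation. Every number in it is invented (200 teachers, 25 kids per class, made-up effect sizes, and a made-up "family/peer" factor that drives both sorting and growth); it is a sketch of the general omitted-variables problem, not a reproduction of the Ballou et al. models. One run estimates teacher effects with the family factor left out, the other with it included, and we compare who lands in the "bottom decile."

```python
# A rough sketch for illustration only: invented effect sizes, an invented
# sorting process, not any state's actual VAM specification.
import numpy as np

rng = np.random.default_rng(0)
n_teachers, class_size = 200, 25

true_effect = rng.normal(0, 0.15, n_teachers)    # the "real" teacher effects

teacher_ids, prior, family, gain = [], [], [], []
for t in range(n_teachers):
    sorting = rng.normal()                       # how this teacher's class is sorted
    fam = rng.normal(sorting, 0.5, class_size)   # unmeasured family/peer factor
    pri = rng.normal(0.5 * fam, 1.0)             # prior-year score (measured)
    g = 0.3 * pri + 0.3 * fam + true_effect[t] + rng.normal(0, 1.0, class_size)
    teacher_ids += [t] * class_size
    prior.extend(pri); family.extend(fam); gain.extend(g)

teacher_ids = np.array(teacher_ids)
prior, family, gain = map(np.array, (prior, family, gain))

def vam_estimates(covariates):
    """OLS of gains on covariates; teacher 'effect' = mean residual per teacher."""
    X = np.column_stack([np.ones(len(gain))] + covariates)
    beta, *_ = np.linalg.lstsq(X, gain, rcond=None)
    resid = gain - X @ beta
    return np.array([resid[teacher_ids == t].mean() for t in range(n_teachers)])

est_without = vam_estimates([prior])             # family factor omitted
est_with = vam_estimates([prior, family])        # family factor included

k = n_teachers // 10
bottom_without = set(np.argsort(est_without)[:k])
bottom_with = set(np.argsort(est_with)[:k])
print(f"'Bottom decile' teachers flagged under BOTH specifications: "
      f"{len(bottom_without & bottom_with)} of {k}")
```

On runs like this, the two specifications disagree on a meaningful share of which teachers fall in the "bottom decile," even though the only thing that changed was one covariate. That is the tail instability described above, in miniature.
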

Misrepresentation #2: We may be having difficulty in these early stages of estimating and using VAM models to determine teacher effectiveness, but these are just early development problems that will be cleared up with better models, better data and better tests.

WRONG AGAIN!

Quite possibly, what we are seeing now is as good as it gets. Keep in mind that many of the often-cited papers applying the value-added methodology date back to the mid-1990s. Yeah… we've been at this for a while, and we've got what we've got!

Consider the sources of the problems with the reliability and validity of VAM estimates, or in other words:

The sources of random error and/or noise in VAM estimates

Random error in testing data can be a function of undetected and uncorrected flaws in test items (such as items with no correct response, or with more than one correct response), testing conditions and disruptions, and kids being kids – making goofy errors such as filling in the wrong bubble (or toggling the wrong box in computerized testing) or simply having a brain fart on stuff they probably otherwise knew quite well. We're talking about large groups of 8- and 9-year-old kids in some cases, in physically uncomfortable settings, under stress, with numerous potential distractions.

Do we really think all of these sources of noise are going to go away? Substantively improve over time? Testing technology gains only have a small chance at marginally improving some of these. I hope to see those improvements. But it’s a drop in the bucket when it comes to the usefulness, reliability and validity of VAM estimates.
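
For a feel of how much damage plain random noise alone does, here is another toy simulation with invented numbers (the same 200 teachers and 25 students per class, with a true-effect spread and a student-level noise level I simply made up). The same teachers, with the same true effects, are "measured" in two different years; the only thing that changes between years is the noise.

```python
# Toy illustration only: how much year-to-year churn student-level noise
# alone produces in classroom-average gains. All parameters are invented.
import numpy as np

rng = np.random.default_rng(1)
n_teachers, class_size = 200, 25
true_effect = rng.normal(0, 0.15, n_teachers)   # stable "true" teacher effects
noise_sd = 1.0                                  # student-level noise (tests, kids, conditions)

def estimated_effects():
    """Classroom mean gain = true effect + the average of the student-level noise."""
    noise = rng.normal(0, noise_sd, (n_teachers, class_size))
    return true_effect + noise.mean(axis=1)

year1, year2 = estimated_effects(), estimated_effects()
print(f"Correlation of the two years' estimates: {np.corrcoef(year1, year2)[0, 1]:.2f}")

k = n_teachers // 10
stayed = len(set(np.argsort(year1)[:k]) & set(np.argsort(year2)[:k]))
print(f"'Bottom decile' teachers flagged in both years: {stayed} of {k}")
```

With these made-up numbers the year-to-year correlation comes out modest, and a sizable share of the "bottom decile" in year one is not in it in year two, purely because of noise. Better tests can shrink the noise term somewhat, but they cannot make the averaging-over-25-kids problem go away.
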

The factors other than the teacher which may influence the average test score gain of students linked to that teacher

First and foremost, kids simply aren’t randomly sorted across teachers and the various ways in which kids aren’t randomly sorted (by socioeconomic status, by disability status, by parental and/or child motivation level) substantively influence VAM estimates. As mentioned above, we can never know how much the unmeasured stuff influences the VAM estimates.  Why? It’s unmeasured!

Second, teachers aren’t randomly sorted among teaching peers and VAM studies have shown what appear to be spillover effects – where teachers seem to get higher VAM estimates when other teachers serving the same students get higher VAM estimates.  Teacher aides, class sizes, lighting/heating/cooling aren’t randomly distributed and all of this stuff may matter.

And you know what?  This stuff isn't going to change in the near future.  In fact, the more time we waste obsessing over the future of VAM-based de-selection policies instead of equitably and adequately financing our school systems, the more that equity of schooling conditions is going to erode across children, teachers, schools and districts – in ways that are very much non-random [uh… that means certain kids will get more screwed than others].  So perhaps our time would be much better spent trying to improve the equity of those conditions across children: providing more parity in teacher compensation and working conditions, and better integrating/distributing student populations.

Look – if we were trying to set up an experiment or a program evaluation in which we wanted our VAM estimates to be most useful – least likely to be biased by unmeasured stuff – we would take whatever steps we could to achieve the “all else equal” requirement.  Translated to the non-experimental setting – applied in the real world – this all else equal requirement means that we actually have to concern ourselves with equality of teaching conditions – equality of the distribution of students by race, SES and other factors.  Yeah… that actually means equitable access to financial resources – equitable access to all sorts of stuff (including peer group).

In other words, if we were simply comparing program effectiveness for academic publication, we would be required to exercise more care in establishing equality of conditions (or in explaining why we couldn't) than the current reformy crowd is willing to exercise when deciding which teachers to fire. [Then again, the problem is that they don't seem to know the difference. Heck, some of them are still hanging their hopes on measures that aren't even designed for the purpose!]

But this conversation is completely out-of-sight, out-of-mind for the uber-reformy crowd. That's perhaps the most ludicrous part of all of this reformy VAM-pocrisy!  They ignore the substantive changes to the education system that could actually improve the validity of VAM estimates, while asserting that VAM estimates alone will do the job, a job those estimates couldn't possibly do if we continue to ignore all this stuff!

Finally, one more reason why VAM estimates are unlikely to become more valid or more useful over time? Once we start using these models with high stakes attached, the tendency for the data to become more corrupted and less valid escalates exponentially!

By the way, VAM estimates don’t seem to be very useful for evaluating a) the effectiveness of teacher preparation programs [due to the non-random geographic distributions of graduates] or b) principals either! More on this at another point.

Note on VAM-based de-selection: Yeah… the uber-reformy types will argue that no one is saying that VAM should be used 100% for teacher de-selection, and further that no one is really even arguing for de-selection.  WRONG! AGAIN! As I discussed in a previous post, the standard reformy legislation template includes three basic features which essentially amount to using VAM (or, even worse, SGPs) as the primary basis for teacher de-selection – yes, de-selection. First, use of VAM estimates in a parallel weighting system with other components requires that VAM be considered even in the presence of a likely false positive. NY legislation prohibits a teacher from being rated highly if their test-based effectiveness estimate is low. Further, where VAM estimates vary more than other components, they will quite often be the tipping point – nearly 100% of the decision even if only 20% of the weight – and even where most of that variation is NOISE or BIAS… not even “real” effect (effect on test score growth). Second, the reformy template often requires (as does the TEACHNJ bill in NJ) that teachers be de-selected (or at least have their tenure revoked) after any two years in a row of falling on the wrong side of an arbitrary cut point rammed through these noisy data.
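
The "tipping point" arithmetic in that note is easy to check for yourself. Here is a stylized sketch: the 20%/80% split comes from the example above, while the spreads of the two components are invented, chosen only to reflect the situation described (observation ratings clustered tightly across teachers, test-based estimates spread widely, much of that spread being noise).

```python
# Stylized arithmetic only: invented spreads, real point. A component with
# 20% of the weight but most of the spread ends up driving the composite.
import numpy as np

rng = np.random.default_rng(2)
n = 10_000  # hypothetical teachers

observation = rng.normal(0, 0.2, n)   # 80% of the weight, little spread across teachers
test_based = rng.normal(0, 2.0, n)    # 20% of the weight, lots of spread (much of it noise)

composite = 0.8 * observation + 0.2 * test_based

var_share = (0.2 * test_based).var() / composite.var()
print(f"Share of composite variance coming from the test-based piece: {var_share:.2f}")

# Rank (Spearman) correlation between the composite and the test-based piece alone.
ranks = lambda x: x.argsort().argsort()
rho = np.corrcoef(ranks(composite), ranks(test_based))[0, 1]
print(f"Rank correlation of composite with test-based component: {rho:.2f}")
```

With these invented spreads, the component carrying 20% of the nominal weight accounts for the large majority of the variation in the composite, and hence most of the rank ordering; dial the spreads up or down and the share moves with them. That is the sense in which 20% of the weight can become nearly 100% of the decision.
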

Finally, don’t give me the “anything is better than the status quo” crap!

9 thoughts on “Two Persistent Reformy Misrepresentations regarding VAM Estimates”

  1. Great stuff, Bruce! What makes your work so valuable is that you not only understand the statistics, but you understand the social system to which the statistics are applied. Some proponents understand the statistics, but many of the advocates for VAM understand neither.
    Diane Ravitch

  2. Dr. Baker:
    This is a great breakdown of the misconceptions people have of VAM. My district is one of the first in NJ to pilot the new evaluation using student test scores in our summative. The teachers are deathly afraid of what may come of this and how it may be used against them.
    Thanks for sharing. I’m always using your research to back my claims against VAM! Thank you!!

    1. Allow me to clarify that NJ is actually not using VAM, but rather something called student growth percentiles. They (NJDOE) might argue to you that this is somehow better than VAM… but they are absolutely, 100% WRONG in this argument. SGPs are not even designed to distill a teacher effect on student achievement growth: https://schoolfinance101.wordpress.com/2012/03/31/firing-teachers-based-on-bad-vam-versus-wrong-sgp-measures-of-effectiveness-legal-note/

  3. Dr. Baker,
    They still don’t get it! Please explain to Okaikor that NJ is not using VAM but SGP. That using SGP is equivalent to rolling the dice. That under proposed legislation the lower 15% will not be proficient, and two in a row (approximately equivalent to rolling snake eyes) will cause loss of tenure!

    1. Interestingly, on the surface, the results of SGP might appear more stable than VAM estimates. They might appear to be less of a roll of the dice. BUT… that’s largely because SGPs don’t even attempt to control for the other factors that might be affecting student test score growth. They simply leave all of the potential biases in there! In other words, SGPs might be more consistent – more consistently wrong/biased, that is!

    2. I’m sorry. Yes, I did mean to say Student Growth %tile. I should have corrected myself, but didn’t.

  4. What I think is remarkable is that this is all driven by a need to (a) pick a group of teachers to fire (that will solve everything!) and (b) to pick those teachers without having to do anything icky like having to meet them or watch them teach or even (heaven forfend) visit their schools.

    There’s also an obsession with the statistics of how many teachers are “fired” without recognition of how many teachers leave the system without a formal firing action.

    And finally, it seems to me that, using value-added measurements, it’s obvious that Arne Duncan has clearly failed to deliver.

  5. Thanks Bruce. Great piece.
    I have argued also that the classroom composition effects, or peer effects as they are sometimes called, determine VAM and SGP scores to a degree the “reformers” don’t want to admit. These many peer effects are almost always unmeasured, say the % of girls vs. boys in middle school (which means management problems or management heaven, depending on the imbalance of one sex or the other). Hoxby, a black economist, shows that the % of blacks and whites in a class has large effects on test scores. In Chile the % of indigenous kids in a class has effects. Europeans have shown that PISA scores vary by school composition effects. The number of endogenous variables is endless and, as you say, unmeasured. But even when a good number of them are measured, their interactions with other variables are not easy to understand statistically, though logically we would expect interactions to have effects of some magnitude. It may not be the % of girls that makes a difference in a classroom, though that is what we find, but the % of middle-class girls among those girls that makes the difference. Or the % of college graduates among the fathers of the girls that makes the difference. Etc. Once you get into interactions, you enter a hall of mirrors and it doesn’t take you long to discover that everything is really hard to figure out because you simply cannot measure everything. As you note, you simply cannot build equations with enough variables to control for all the main effects and the interactions.
    But the reformers are VAMboozled, as my colleague Audrey Amrein-Beardsley says, and it is so disheartening.
