It’s good to be King: More Misguided Rhetoric on the NY State Eval System

Very little time to write today, but I must comment on this NY Post article on the bias I’ve been discussing in the NY State teacher/principal growth percentile ratings. Sociologist Aaron Pallas of TC and economist Sean Corcoran of NYU express appropriate concerns about the degrees of bias found and reported in the technical report provided by the state’s own consultant developing the models. And this article overall raises concern that these problems were simply blown off. I would/and have put it more bluntly. Here’s my replay of events – quoting the parties involved:

First, the state’s consultants designing their teacher and principal effectiveness measures find that those measures are substantively biased:

Despite the model conditioning on prior year test scores, schools and teachers with students who had higher prior year test scores, on average, had higher MGPs. Teachers of classes with higher percentages of economically disadvantaged students had lower MGPs. (p. 1) https://schoolfinance101.files.wordpress.com/2012/11/growth-model-11-12-air-technical-report.pdf

But instead of questioning their own measures, they decide to give them their blessing and pass them along to the state as being “fair and accurate.”

The model selected to estimate growth scores for New York State provides a fair and accurate method for estimating individual teacher and principal effectiveness based on specific regulatory requirements for a “growth model” in the 2011-2012 school year. p. 40 https://schoolfinance101.files.wordpress.com/2012/11/growth-model-11-12-air-technical-report.pdf

The next step was for the Chancellor to take this misinformation and polish it up as pure spin as part of the power play against the teachers in New York City (who’ve already had the opportunity to scrutinize what is arguably a better but still substantially flawed set of metrics). The Chancellor proclaimed:

The student-growth scores provided by the state for teacher evaluations are adjusted for factors such as students who are English Language Learners, students with disabilities and students living in poverty. When used right, growth data from student assessments provide an objective measurement of student achievement and, by extension, teacher performance. http://www.nypost.com/p/news/opinion/opedcolumnists/for_nyc_students_move_on_evaluations_EZVY4h9ddpxQSGz3oBWf0M

Then send in the enforcers…. This statement came from a letter sent to a district that did decide to play ball with the state on the teacher evaluation regulations. The state responded that… sure… you can adopt the system of multiple measures you propose – BUT ONLY AS LONG AS ALL OF THOSE OTHER MEASURES ARE SUFFICIENTLY CORRELATED WITH OUR BIASED MEASURES… AND ONLY AS LONG AS AT LEAST SOMEONE GETS A BAD RATING.

The department will be analyzing data supplied by districts, BOCES and/or schools and may order a corrective action plan if there are unacceptably low correlation results between the student growth subcomponent and any other measure of teacher and principal effectiveness… https://schoolfinance101.wordpress.com/2012/12/05/its-time-to-just-say-no-more-thoughts-on-the-ny-state-tchr-eval-system/

So… what’s my gripe today? Well, in this particular NY Post article we have some rather astounding quotes from NY State Commissioner John King, given the information above. Now, last I talked about John King, he was strutting about NY with this handy new graph of completely fabricated information on how to improve educational productivity. So, what’s King up to now? Here’s how John King explained the potential bias in the measures and how that bias a) is possibly not bias at all, and b) even if it is, it’s not that big a problem:

“It’s a question of, is this telling you something descriptive about where talent is placed? Or is it telling you something about the classroom effect [or] school effect of concentrations of students?” said King.

“This data alone can’t really answer that question, which is one of the reasons to have multiple measures — so that you have other information to inform your decision-making,” he added. “No one would say we should evaluate educators on growth scores alone. It’s a part of the picture, but it’s not the whole picture.”

So, in King’s view, the bias identified in the AIR technical report might just be a signal as to where the good teachers really are. Kids in schools with lower poverty – kids in schools with higher average starting scores and kids in schools with fewer children with disabilities simply have the better teachers. While there certainly may be some patterned sorting of teachers by their actual effect on test scores a) this proposition is less likely than the expectation of classroom effect and b) making this assumption when not really being able to tease out cause is a highly suspect approach to teacher evaluation (reformy thinking at its finest!).

The kicker is in how King explains why the potential bias isn’t a problem. King argues that the multiple measures approach buffers against over-reliance on the growth percentiles. As he states so boldly – “it’s part of the picture, but it’s not the whole picture.”

The absurdity here is that KING HAS DECLARED TO LOCAL OFFICIALS THAT ALL OTHER MEASURES THEY CHOOSE TO INCLUDE MUST BE SUFFICIENTLY CORRELATED WITH THESE GROWTH PERCENTILE MEASURES! That’s precisely what the letter quoted above and sent to one local official says! Even this wasn’t the case, the growth percentiles which may wrongly classify teachers for factors outside their control, might carry disproportionate weight in determining teacher ratings (merely as a function of the extent of variation – most of which is noise & much of the remainder is biased). But, when you require that all other measures be correlated with this suspect measure – you’ve stacked the deck to be substantially if not entirely built on a flawed foundation.

THIS HAS TO STOP. STATE OFFICIALS MUST BE CALLED OUT ON THIS RIDICULOUS CONTORTED/DECEPTIVE & OUTRIGHT DISHONEST RHETORIC!

Note: King also tries to play up the fact that at any level of poverty, there are some teachers getting higher or lower ratings. This explanation ignores the fact that much of the remaining variation in teacher estimates is noise. Some will get higher or lower ratings in a given year simply because of the noise/instability in the measures. These variations may be entirely meaningless.

It’s time to just say NO! More thoughts on the NY State Tchr Eval System

This post is a follow up on two recent previous posts in which I first criticized consultants to the State of New York for finding substantial patterns of bias in their estimates of principal (correction: School Aggregate) and teacher (correction: Classroom aggregate) median growth percentile scores but still declaring those scores to be fair and accurate, and next criticized the Chancellor of the Board of Regents for her editorial attempting to strong-arm NYC to move forward on an evaluation system adopting those flawed metrics – and declaring the metrics to be “objective” (implying both fair and accurate).

Let’s review. First, the AIR report on the median growth percentiles found, among other biases:

Despite the model conditioning on prior year test scores, schools and teachers with students who had higher prior year test scores, on average, had higher MGPs. Teachers of classes with higher percentages of economically disadvantaged students had lower MGPs. (p. 1)

In other words… if you are a teacher who so happens to have a group of students with higher initial scores, you are likely to get a higher rating, whether that difference is legitimately associated with your teaching effectiveness or not. And, if you are a teacher with more economically disadvantaged kids, you’re likely to get a lower rating. That is, the measures are biased – modestly – on these bases.

Despite these findings, the authors of the technical report chose to conclude:

The model selected to estimate growth scores for New York State provides a fair and accurate method for estimating individual teacher and principal effectiveness based on specific regulatory requirements for a “growth model” in the 2011-2012 school year. p. 40

I provide far more extensive discussion here! But even a modest bias across the system as a whole can indicate the potential for substantial bias for underlying clusters of teachers serving very high poverty populations or very high or very low prior scoring students. In other words, THE MEASURE IS NOT ACCURATE – AND BY EXTENSION – IS NOT FAIR!!!!! Is this not obvious enough?

The authors of the technical report were wrong – technically wrong – and I would argue morally and ethically wrong in providing NYSED their endorsement of these measures! You just don’t declare outright, when your own analyses show otherwise, that a measure [to be used for labeling people] is fair and accurate! [setting aside the general mischaracterization that these are measures of “teacher and principal effectiveness”]

Within a few days after writing this post, I noticed that Chancellor Merryl Tisch of the NY State Board of Regents had posted an op-ed in the NY POST attempting to strong-arm an agreement on a new teacher evaluation system between NYC teachers and the city. In the op-ed, the Chancellor opined:

The student-growth scores provided by the state for teacher evaluations are adjusted for factors such as students who are English Language Learners, students with disabilities and students living in poverty. When used right, growth data from student assessments provide an objective measurement of student achievement and, by extension, teacher performance.

As I noted in my post the other day, one might quibble that Chancellor Tisch has merely stated that the measures are “adjusted for” certain factors and she has not claimed that those adjustments actually work to eliminate bias – which the technical report indicates THEY DO NOT. Further, she has merely declared that the measures are “objective” and not that they are accurate or precise. Personally, I don’t find this deceitful propaganda at all comforting! Objective or not – if the measures are biased, they are not accurate and if they are not accurate they, by extension are not fair.

Sadly, the story of misinformation and disinformation doesn’t stop here. It only gets worse! I received a copy of a letter yesterday from a NY school district that had its teacher evaluation plan approved by NYSED. Here is a portion of the approval letter:

Now, I assume this language to be boilerplate. Perhaps not. I’ve underling the good stuff. What we have here is NYSED threatening that they may enforce a corrective action plan on the district if the district uses any other measures of teacher or principal effectiveness that are not sufficiently correlated WITH THE STATE’S OWN BIASED MEASURES OF PRINCIPAL AND TEACHER EFFECTIVENESS!

This is the icing on the cake! This is sick- warped- wrong! Consultants to the state find that the measures are biased, and then declare they are “fair and accurate.” The Chancellor spews propaganda that reliance on these measures must proceed with all deliberate speed! (or ELSE!!!!!!!). Then the Chancellor’s enforcers warn individual district officials that they will be subjected to mind control – excuse me – departmental oversight – if they dare to present their own observational or other ratings of teachers or principals that don’t correlate sufficiently with the state imposed, biased measures.

I really don’t even know what to say anymore??????????

But I think it’s time to just say no!

AIR Pollution in NY State? Comments on the NY State Teacher/Principal Rating Models/Report

I was immediately intrigued the other day when a friend passed along a link to the recent technical report on the New York State growth model, the results of which are expected/required to be integrated into district level teacher and principal evaluation systems under that state’s new teacher evaluation regulations. I did as I often do and went straight for the pictures – in this case- the scatterplots of the relationships between various “other” measures and the teacher and principal “effect” measures. There was plenty of interesting stuff there, some of which I’ll discuss below.

But then I went to the written language of the report – specifically the report’s (albeit in DRAFT form) conclusions. The conclusions were only two short paragraphs long, despite much to ponder being provided in the body of the report. The authors’ main conclusion was as follows:

The model selected to estimate growth scores for New York State provides a fair and accurate method for estimating individual teacher and principal effectiveness based on specific regulatory requirements for a “growth model” in the 2011-2012 school year. p. 40

http://engageny.org/wp-content/uploads/2012/06/growth-model-11-12-air-technical-report.pdf

13-Nov-2012 20:54

Updated Final Report: http://engageny.org/sites/default/files/resource/attachments/growth-model-11-12-air-technical-report_0.pdf

Local copy of original DRAFT report: growth-model-11-12-air-technical-report

Local copy of FINAL report: growth-model-11-12-air-technical-report_FINAL

Unfortunately, the multitude of graphs that immediately preceded this conclusion undermine it entirely. but first, allow me to address the egregious conceptual problems with the framing of this conclusion.

First Conceptually

Let’s start with the low hanging fruit here. First and foremost, nowhere in the technical report, nowhere in their data analyses, do the authors actually measure “individual teacher and principal effectiveness.” And quite honestly, I don’t give a crap if the “specific regulatory requirements” refer to such measures in these terms. If that’s what the author is referring to in this language, that’s a pathetic copout. Indeed it may have been their charge to “measure individual teacher and principal effectiveness based on requirements stated in XYZ.” That’s how contracts for such work are often stated. But that does not obligate the author to conclude that this is actually what has been statistically accomplished. And I’m just getting started.

So, what is being measured and reported? At best, what we have are:

An estimate of student relative test score change on one assessment each for ELA and Math (scaled to growth percentile) for students who happen to be clustered in certain classrooms.

THIS IS NOT TO BE CONFLATED WITH “TEACHER EFFECTIVENESS”

Rather, it is merely a classroom aggregate statistical association based on data points pertaining to two subjects being addressed by teachers in those classrooms, for a group of children who happen to spend a minority share of their day and year in those classrooms.

An estimate of student relative test score change on one assessment each for ELA and Math (scaled to growth percentile) for students who happen to be clustered in certain schools.

THIS IS NOT TO BE CONFLATED WITH “PRINCIPAL EFFECTIVENESS”

Rather, it is merely a school aggregate statistical association based on data points pertaining to two subjects being addressed by teachers in classrooms that are housed in a given school under the leadership of perhaps one or more principals, vps, etc., for a group of children who happen to spend a minority share of their day and year in those classrooms.

Now Statistically

Following are a series of charts presented in the technical report, immediately preceding the above conclusion.

Classroom Level Rating Bias

School Level Rating Bias

And there are many more figures displaying more subtle biases, but biases that for clusters of teachers may be quite significant and consequential.

Based on the figures above, there certainly appears to be, both at the teacher, excuse me – classroom, and principal – I mean school level, substantial bias in the Mean Growth Percentile ratings with respect to initial performance levels on both math and reading. Teachers with students who had higher starting scores and principals in schools with higher starting scores tended to have higher Mean Growth Percentiles.

This might occur for several reasons. First, it might just be that the tests used to generate the MGPs are scaled such that it’s just easier to achieve growth in the upper ranges of scores. I came to a similar finding of bias in the NYC value added model, where schools having higher starting math scores showed higher value added. So perhaps something is going on here. It might also be that students clustered among higher performing peers tend to do better. And, it’s at least conceivable that students who previously had strong teachers and remain clustered together from year to year, continue to show strong growth. What is less likely is that many of the actual “better” teachers just so happen to be teaching the kids who had better scores to begin with.

That the systemic bias appears greater in the school level estimates than in the teacher level estimates is suggestive that the teacher level estimates may actually be even more bias than they appear. The aggregation of otherwise less biased estimates should not reveal more bias.

Further, as I’ve mentioned on several times on this blog previously, even if there weren’t such glaringly apparent overall patterns of bias their still might be underlying biased clusters. That is, groups of teachers serving certain types of students might have ratings that are substantially WRONG, either in relation to observed characteristics of the students they serve or their settings, or of unobserved characteristics.

Closing Thoughts

To be blunt – the measures are neither conceptually nor statistically accurate. They suffer significant bias, as shown and then completely ignored by the authors. And inaccurate measures can’t be fair. Characterizing them as such is irresponsible.

I’ve now written 2 articles and numerous blog posts in which I have raised concerns about the likely overly rigid use of these very types of metrics when making high stakes personnel decisions. I have pointed out that misuse of this information may raise significant legal concerns. That is, when district administrators do start making teacher or principal dismissal decisions based on these data, there will likely follow, some very interesting litigation over whether this information really is sufficient for upholding due process (depending largely on how it is applied in the process).

I have pointed out that the originators of the SGP approach have stated in numerous technical documents and academic papers that SGPs are intended to be a descriptive tool and are not for making causal assertions (they are not for “attribution of responsibility”) regarding teacher effects on student outcomes. Yet, the authors persist in encouraging states and local districts to do just that. I certainly expect to see them called to the witness stand the first time SGP information is misused to attribute student failure to a teacher.

But the case of the NY-AIR technical report is somewhat more disconcerting. Here, we have a technically proficient author working for a highly respected organization – American Institutes for Research – ignoring all of the statistical red flags (after waiving them), and seemingly oblivious to gaping conceptual holes (commonly understood limitations) between the actual statistical analyses presented and the concluding statements made (and language used throughout).

The conclusion are WRONG – statistically and conceptually. And the author needs to recognize that being so damn bluntly wrong may be consequential for the livelihoods of thousands of individual teachers and principals! Yes, it is indeed another leap for a local school administrator to use their state approved evaluation framework, coupled with these measures, to actually decide to adversely affect the livelihood and potential career of some wrongly classified teacher or principal – but the author of this report has given them the tool and provided his blessing. And that’s inexcusable.

And a video with song!

==================

Note: In the executive summary, the report acknowledges these biases:

Despite the model conditioning on prior year test scores, schools and teachers with students who had higher prior year test scores, on average, had higher MGPs. Teachers of classes with higher percentages of economically disadvantaged students had lower MGPs.

But then blows them off throughout the remainder of the report, and never mentions that this might be important.

Local copy of report: growth-model-11-12-air-technical-report

On the Stability (or not) of Being Irreplaceable

This is just a quick note with a few pictures in response to the TNTP “Irreplaceables” report that came out a few weeks back – a report that is utterly ridiculous at many levels (especially this graph!)… but due to the storm I just didn’t get a chance address it. But let’s just entertain for the moment the premise that teachers who achieve a value-added rating in the top 20% in a given year are… just plain freakin’ awesome…. and that districts should take whatever steps they can to focus on retaining this specific momentary slice of teachers. At the same time, districts might not want concern themselves with all of those other teachers that range only from okay… all the way down to those that simply stink!

The TNTP report focuses on teachers who were in the top 14% in Washington DC based on aggregate IMPACT ratings, which do include more than value-added alone, but are certainly driven by the Value-added metric. TNTP compares DC to other districts, and explains that the top 20% by value-added are assumed to be higher performers.

For the other four districts we studied, we used teacher value-added scores or student academic growth measures to identify high- and low-performing teachers—those whose students made much more or much less academic progress than expected. These data provided us with a common yardstick for teacher performance. Teachers scoring in approximately the top 20 percent were identified as Irreplaceables. While teachers of this caliber earn high ratings in student surveys and have been shown to have a positive impact that extends far beyond test scores, we acknowledge that such measures are limited to certain grades and subjects and should not be the only ones used in real-world teacher evaluations. http://tntp.org/assets/documents/TNTP_DCIrreplaceables_2012.pdf

Let’s take a stab at this with the NYC Teacher Value-added Percentiles which I played around with in some previous posts.

The following graphs play out the premise of “irreplaceables” with NYC value-added percentile data. I start by identifying those teachers that are in the top 20% in 2005-06 and then see where they land in each subsequent year through 2009-10.

NOTE: IT’S REALLY NOT A GREAT IDEA TO MAKE SCATTERPLOTS OF THE RELATIONSHIP BETWEEN PERCENTILE RANKS – BETTER TO USE THE ACTUAL VAM SCORES. BUT THIS IS ILLUSTRATIVE… THE POINT BEING TO SEE WHERE ALL OF THOSE DOTS THAT ARE “IRREPLACEABLE” IN YEAR 1 (2005-06) STAY THAT WAY YEAR AFTER YEAR!

I’ve chosen to focus on the MATHEMATICS ratings here… which were actually the more stable ratings from year to year (but were stable potentially because the were biased!)

See: https://schoolfinance101.wordpress.com/2012/02/28/youve-been-vam-ified-thoughts-graphs-on-the-nyc-teacher-data/

Figure 1 – Who is irreplaceable in 2006-07 after being irreplaceable in 2005-06?

Figure 1 shows that there are certainly more “irreplaceables” (awesome teachers) that remain above the median the following year than fall below it… but there sure are one heck of a lot of those irreplaceables that are below the median the next year… and a few that are near the 0%ile! This is not, by any stretch to condemn those individuals for being falsely rated as irreplaceable but actually sucking. Rather, this is to point out that there is comparable likelihood that these teachers were wrongly classified each year (potentially like nearly every other teacher in the mix).

Figure 2 – Among those 2005-06 Irreplaceables, how do they reshuffle between 2006-07 & 2007-08?

Hmm… now they’re moving all over the place. A small cluster do appear to stay in the upper right. But, we are dealing with a dramatically diminishing pool of the persistently awesome here. And I’m not even pointing out the number of cases in the data set that are simply disappearing from year to year. Another post – another day.

I provide an analysis along these lines here: https://schoolfinance101.wordpress.com/2012/03/01/about-those-dice-ready-set-roll-on-the-vam-ification-of-tenure/

Figure 3 – How many of those teachers who were totally awesome in 2005-06 were still totally awesome in 2009-10?

The relationship between ratings from year to year is even weaker when one looks at the endpoints of the data set, comparing 2005-06 ratings to 2009-10 ones. Again, we’ve got teachers who were supposedly “irreplaceable” in 2005-06 who are at the bottom of the heap in 2009-10.

Yes, there is still a cluster of teachers who had a top 20% rating in 2005-06 and have one again in 2009-10. BUT… many… uh… most of these had a much lower rating for at least one of the in between years!

Of the thousands of teachers for whom ratings exist for each year, there are 14 in math and 5 in ELA that stay in the top 20% for each year! Sure hope they don’t leave!

====

Note: Because the NYC teacher data release did not provide unique identifiers for matching teachers from year to year, for my previous analyses I had constructed a matching identifier based on teacher name, subject and grade level within school. So, my year to year comparisons include only those teachers who are teaching the same subject and grade level in the same school from one year to the next. Arguably, this matching approach might lead to greater stability than might be expected if I included teachers who moved to different schools serving different students and/or changed subject areas or levels.

Two Persistent Reformy Misrepresentations regarding VAM Estimates

I have written much on this blog about problems with the use of Value-added Estimates of teacher effect (used loosely) on student test score gains on this blog. I have addressed problems with both the reliability and validity of VAM estimates, and I have pointed out how SGP based estimates of student growth are invalid on their face for determining teacher effectiveness.

But, I keep hearing two common refrains from the uber-reformy (those completely oblivious to the statistics and research of VAM while also lacking any depth of understanding of the complexities of the social systems [schools] into which they propose to implement VAM as a de-selection tool) crowd. Sadly, these are the people who seem to be drafting policies these days.

Here are the persistent misrepresentations:

Misrepresentation #1: That this reliability and error stuff only makes it hard for us to distinguish among all those teachers clustered in the middle of the distribution. BUT… we can certainly be confident about those at the extremes of the distribution. We know who the really good and really bad teachers are based on their VAM estimates.

WRONG!

This would possibly be a reasonable assertion if reliability and error rates were the only problem. But this statement ignores entirely the issue of omitted variables bias (other stuff that affects teacher effect estimates that may have been missed in the model), and just how much those observations in the tails jump around when we tweak the VAM by adding or removing variables, or rescaling measures.

A recent paper by Dale Ballou & colleagues illustrates this problem:

“In this paper, we consider the impact of omitted variables on teachers’ value-added estimates, and whether commonly used single-equation or two-stage estimates are preferable when possibly important covariates are not available for inclusion in the value-added model. The findings indicate that these modeling choices can significantly influence outcomes for individual teachers, particularly those in the tails of the performance distribution who are most likely to be targeted by high-stakes policies.” (Ballou et al., 2012) [emphasis added]

The problem is that we can never know when we’ve got that model specification just right. Further, while we might be able to run checks as to whether the model estimates display bias with respect to measurable external factors, we can’t know if there is bias with respect to stuff we can’t measure, nor can we always tell if there are clusters of teachers in our model whose effectiveness estimates are biased in one direction and other clusters in another direction (also in relation to stuff unmeasured). That is, we can only test this omitted variables bias stuff when we can add in and take out measures that we have. We simply don’t know how much bias remains due to all sorts of other unmeasured stuff, nor do we know just how much that bias may affect many of those distributions in the tails!

Misrepresentation #2: We may be having difficulty in these early stages of estimating and using VAM models to determine teacher effectiveness, but these are just early development problems that will be cleared up with better models, better data and better tests.

WRONG AGAIN!

Quite possibly, what we are seeing now is as good as it gets. Keep in mind that many of the often cited papers applying the value-added methodology date back to the mid-1990s. Yeah…. we’ve been at this for a while and we’ve got what we’ve got!

Consider the sources of the problems with the reliability and validity of VAM estimates, or in other words:

The sources of random error and/or noise in VAM estimates

Random error in testing data can be a function of undetected and uncorrected poorly designed test items, such as items with no correct response or more than one correct response, testing conditions/disruptions, and kids being kids – making goofy errors such as filling in the wrong bubble (or toggling the wrong box in computerized testing) or simply having a brain fart on stuff they probably otherwise knew quite well. We’re talking about large groups of 8 and 9 year old kids in some cases, in physically uncomfortable settings, under stress, with numerous potential distractions.

Do we really think all of these sources of noise are going to go away? Substantively improve over time? Testing technology gains only have a small chance at marginally improving some of these. I hope to see those improvements. But it’s a drop in the bucket when it comes to the usefulness, reliability and validity of VAM estimates.

The factors other than the teacher which may influence the average test score gain of students linked to that teacher

First and foremost, kids simply aren’t randomly sorted across teachers and the various ways in which kids aren’t randomly sorted (by socioeconomic status, by disability status, by parental and/or child motivation level) substantively influence VAM estimates. As mentioned above, we can never know how much the unmeasured stuff influences the VAM estimates. Why? It’s unmeasured!

Second, teachers aren’t randomly sorted among teaching peers and VAM studies have shown what appear to be spillover effects – where teachers seem to get higher VAM estimates when other teachers serving the same students get higher VAM estimates. Teacher aides, class sizes, lighting/heating/cooling aren’t randomly distributed and all of this stuff may matter.

And you know what? This stuff isn’t going to change in the near future. In fact, the more time we waste obsessing on the future of VAM-based de-selection policies instead of equitably and adequately financing our school systems, the more that equity of schooling conditions is going to erode across children, teachers, schools and districts – in ways that are very much non-random [uh… that means certain kids will get more screwed than others]. So perhaps our time would be much better spent trying to improve the equity of those conditions across children. Provide more parity in teacher compensation and working conditions, and better integrating/distributing student populations.

Look – if we were trying to set up an experiment or a program evaluation in which we wanted our VAM estimates to be most useful – least likely to be biased by unmeasured stuff – we would take whatever steps we could to achieve the “all else equal” requirement. Translated to the non-experimental setting – applied in the real world – this all else equal requirement means that we actually have to concern ourselves with equality of teaching conditions – equality of the distribution of students by race, SES and other factors. Yeah… that actually means equitable access to financial resources – equitable access to all sorts of stuff (including peer group).

In other words, we’d be required to exercise more care in establishing equality of conditions or explaining why we couldn’t if we were simply comparing program effectiveness for academic publication than the current reformy crowd is willing to exercise when deciding which teachers to fire. [then again, the problem is that they don’t seem to know the difference. Heck, some of them are still hanging their hopes on measures that aren’t even designed for the purpose !]

But this conversation is completely out-of-sight, out-of-mind for the uber-reformy crowd. That’s perhaps the most ludicrous part of all of this reformy VAM-pocrisy ! Ignoring the substantive changes to the education system that could actually improve the validity of VAM estimates by asserting that VAM estimates alone will do the job, which they couldn’t possibly do if we continue to ignore all this stuff!

Finally, one more reason why VAM estimates are unlikely to become more valid or more useful over time? Once we start using these models with high stakes attached, the tendency for the data to become more corrupted and less valid escalates exponentially!

By the way, VAM estimates don’t seem to be very useful for evaluating a) the effectiveness of teacher preparation programs [due to the non-random geographic distributions of graduates] or b) principals either! More on this at another point.

Note on VAM-based de-selection: Yeah… the uber-reformy types will argue that no-one is saying that VAM should be used 100% for teacher de-selection, and further that no-one is really even arguing for de-selection. WRONG! AGAIN! As I discussed in a previous post, the standard reformy legislation template includes three basic features which essentially amount to using VAM (or even worse SGPs) as the primary basis for teacher de-selection – yes, de-selection. First, use of VAM estimates in a parallel weighting system with other components requires that VAM be considered even in the presence of a likely false positive. NY legislation prohibits a teacher from being rated highly if their test-based effectiveness estimate is low. Further, where VAM estimates vary more than other components, they will quite often be the tipping point – nearly 100% of the decision even if only 20% of the weight – and even where most of that variation is NOISE or BIAS… not even “real” effect (effect on test score growth). Second, the reformy template often requires (as does the TEACHNJ bill in NJ) that teachers be de-selected (or at least have their tenure revoked) after any two years in a row of falling on the wrong side of an arbitrary cut point rammed through these noisy data.

Finally, don’t give me the anything is better than the status quo crap!

Video Thoughts on Test Scores, VAM, SGP & Teacher Evaluation

Recent Bank Street College of Education Symposium on teacher evaluation

Additional video clips from legislative forum at the New Jersey Principals and Supervisors Association

General Issues in Teacher Evaluation: Where to Start in New Jersey

http://www.youtube.com/watch?v=5B7gAkB5-QU&feature=player_detailpage#t=1208s

Pilots versus Expedited Legislated Evaluation Models (Rigidity of Legislation)

http://www.youtube.com/watch?v=5B7gAkB5-QU&feature=player_detailpage#t=1878s

Complete Forum Video:

If it’s not valid, reliability doesn’t matter so much! More on VAM-ing & SGP-ing Teacher Dismissal

This post includes a few more preliminary musings regarding the use of value-added measures and student growth percentiles for teacher evaluation, specifically for making high-stakes decisions, and especially in those cases where new statutes and regulations mandate rigid use/heavy emphasis on these measures, as I discussed in the previous post.

========

The recent release of New York City teacher value-added estimates to several media outlets stimulated much discussion about standard errors and statistical noise found in estimates of teacher effectiveness derived from the city’s value-added model. But lost in that discussion was any emphasis on whether the predicted value-added measures were valid estimates of teacher effects to begin with. That is, did they actually represent what they were intended to represent – the teacher’s influence on a true measure of student achievement, or learning growth while under that teacher’s tutelage. As framed in teacher evaluation legislation, that measure is typically characterized as “student achievement growth,” and it is assumed that one can measure the influence of the teacher on “student achievement growth” in a particular content domain.

A brief note on the semantics versus the statistics and measurement in evaluation and accountability is in order.

At issue are policies involving teacher “evaluation” and more specifically evaluation of teacher effectiveness, where in cases of dismissal the evaluation objective is to identify particularly ineffective teachers.

In order to “evaluate” (assess, appraise, estimate) a teacher’s effectiveness with respect to student growth, one must be able to “infer” (deduce, conjecture, surmise…) that the teacher affected or could have affected that student growth. That is, for example, given one year’s bad rating, the teacher had sufficient information to understand how to improve her rating in the following year. Further, one must choose measures that provide some basis for such inference.

Inference and attribution (ascription, credit, designation) are not separable when evaluating teacher effectiveness. To make an inference about teacher effectiveness based on student achievement growth, one must attribute responsibility for that growth to the teacher.

In some cases, proponents of student growth percentiles alter their wording [in a truly annoying & dreadfully superficial way] for general public appeal to argue that:

SGPs are a measure of student achievement growth.
Student achievement growth is a primary objective of schooling.
Therefore, teachers and schools should obviously be held accountable for student achievement growth.

Where accountable is a synonym for responsible, to the extent that SGPs were designed to separate the measurement of student growth from attribution of responsibility for it, then SGPs are also invalid on their face for holding teachers accountable. For a teacher to be accountable for that growth it must be attributable to them and one must be using a method which permits such inference.

Allow me to reiterate this quote from the authors of SGP:

“The development of the Student Growth Percentile methodology was guided by Rubin et al’s (2004) admonition that VAM quantities are, at best, descriptive measures.” (Betebenner, Wenning & Briggs, 2011)

I will save for another day a discussion of the nuanced differences between statistical causation and inference and causation and inference as might be evaluated more broadly in the context of litigation over determination of teacher effectiveness. The big problem in the current context, as I have explained in my previous post, is created by legislative attempts to attach strict timelines, absolute weights and precise classifications to data that simply cannot be applied in this way.

Major Validity Concerns

We identify[at least ] 3 categories of significant compromises to inference and attribution and therefore accountability for student achievement growth:

The value-added estimate (or SGP) was influenced by something other than the teacher alone
The value-added (or SGP) estimate given one assessment of the teacher’s content domain produces a different rating than the value-added estimate given a different assessment tool
The value-added estimate (or SGP) is compromised by missing data and/or student mobility, disrupting the link between teacher and students. [the actual data link required for attribution]

The first major issue compromising attribution of responsibility for or inference regarding teacher effectiveness based on student growth is that some other factor or set of factors actually caused the student achievement growth or lack thereof. A particularly bothersome feature of many value-added models is that they rely on annual testing data. That is, student achievement growth is measured from April or May in one year to April or May in the next, where the school year runs from September to mid or late June. As such, for example, the 4^th grade teacher is assigned a rating based on children who attended her class from September to April (testing time), or about 7 months, where 2.5 months were spent doing any variety of other things, and another 2.5 months were spent with their prior grade teacher. Let alone the different access to resources each child has during their after school and weekend hours during the 7 months over which they have contact with their teacher of record.

Students with different access to summer and out-of-school time resources may not be randomly assigned across teachers within a given school or across schools within a district. And students who had prior year teachers who may have checked out versus the teacher who delved into the subsequent year’s curriculum during the post-testing month of the prior year may also not be randomly distributed. All of these factors go unobserved and unmeasured in the calculation of a teacher’s effectiveness, potentially severely compromising the validity of a teacher’s effectiveness estimate. Summer learning varies widely across students by economic backgrounds (Alexander, Entwisle & Olsen, 2001) Further, in the recent Gates MET Studies (2010), the authors found: “The norm sample results imply that students improve their reading comprehension scores just as much (or more) between April and October as between October and April in the following grade. Scores may be rising as kids mature and get more practice outside of school.” (p. )

Numerous authors have conducted analyses revealing the problems of omitted variables bias and the non-random sorting of students across classrooms (Rothstein, 2011, 2010, 2009, Briggs & Domingue, 2011, Ballou et al., 2012). In short, some value-added models are better than others, in that by including additional explanatory measures, the models seem to correct for at least some biases. Omitted variables bias is where any given teacher’s predicted value is influenced partly by factors other than the teacher herself. That is, the estimate is higher or lower than it should be, because some other factor has influenced the estimate. Unfortunately, one can never really know if there are still additional factors that might be used to correct for that bias. Many such factors are simply unobservable. Others may be measurable and observable but are simply unavailable, or poorly measured in the data. While there are some methods which can substantially reduce the influence of unobservables on teacher effect estimates, those methods can typically only be applied to a very small subset of teachers within very large data sets.[2] In a recent conference paper, Ballou and colleagues evaluated the role of omitted variables bias in value-added models and the potential effects on personnel decisions. They concluded:

“In this paper, we consider the impact of omitted variables on teachers’ value-added estimates, and whether commonly used single-equation or two-stage estimates are preferable when possibly important covariates are not available for inclusion in the value-added model. The findings indicate that these modeling choices can significantly influence outcomes for individual teachers, particularly those in the tails of the performance distribution who are most likely to be targeted by high-stakes policies.” (Ballou et al., 2012)

A related problem is the extent to which such biases may appear to be a wash, on the whole, across large data sets, but where specific circumstances or omitted variables may have rather severe effects on predicted values for specific teachers. To reiterate, these are not merely issues of instability or error. These are issues of whether the models are estimating the teacher’s effect on student outcomes, or the effect of something else on student outcomes. Teachers should not be dismissed for factors beyond their control. Further, statutes and regulations should not require that principals dismiss teachers or revoke their tenure in those cases where the principal understands intuitively that the teacher’s rating was compromised by some other cause. [as would be the case under the TEACHNJ Act]

Other factors which severely compromise inference and attribution, and thus validity, include the fact that the measured value-added gains of a teacher’s peers – or team members working with the same students – may be correlated, either because of unmeasured attributes of the students or because of spillover effects of working alongside more effective colleagues (one may never know) (Koedel, 2009, Jackson & Bruegmann, 2009). Further, there may simply be differences across classrooms or school settings that remain correlated with effectiveness ratings that simply were not fully captured by the statistical models.

Significant evidence of bias plagued the value-added model estimated for the Los Angeles Times in 2010, including significant patterns of racial disparities in teacher ratings both by the race of the student served and by the race of the teachers (see Green, Baker and Oluwole, 2012). These model biases raise the possibility that Title VII disparate impact claims might also be filed by teachers dismissed on the basis of their value-added estimates. Additional analyses of the data, including richer models using additional variables mitigated substantial portions of the bias in the LA Times models (Briggs & Domingue, 2010).

A handful of studies have also found that teacher ratings vary significantly, even for the same subject area, if different assessments of that subject are used. If a teacher is broadly responsible for effectively teaching in their subject area, and not the specific content of any one test, different results from different tests raise additional validity concerns. Which test better represents the teacher’s responsibilities? [must we specify which test counts/matters/represents those responsibilities in teacher contracts?] If more than one, in what proportions? If results from different tests completely counterbalance, how is one to determine the teacher’s true effectiveness in their subject area? Using data on two different assessments used in Houston Independent School District, Corcoran and Jennings (2010) find:

[A]mong those who ranked in the top category (5) on the TAKS reading test, more than 17 percent ranked among the lowest two categories on the Stanford test. Similarly, more than 15 percent of the lowest value-added teachers on the TAKS were in the highest two categories on the Stanford.

The Gates Foundation Measures of Effective Teaching Project also evaluated consistency of teacher ratings produced on different assessments of mathematics achievement. In a review of the Gates findings, Rothstein (2010) explained:

The data suggest that more than 20% of teachers in the bottom quarter of the state test math distribution (and more than 30% of those in the bottom quarter for ELA) are in the top half of the alternative assessment distribution.(p. 5)

And:

In other words, teacher evaluations based on observed state test outcomes are only slightly better than coin tosses at identifying teachers whose students perform unusually well or badly on assessments of conceptual understanding.(p. 5)

Finally, student mobility, missing data, and algorithms for accounting for that missing data can severely compromise inferences regarding teacher effectiveness. Corcoran (2010) explains that the extent of missing data can be quite large and can vary by student type:

Because of high rates of student mobility in this [Houston] population (in addition to test exemption and absenteeism), the percentage of students who have both a current and prior year test score – a prerequisite for value-added – is even lower (see Figure 6). Among all grade four to six students in HISD, only 66 percent had both of these scores, a fraction that falls to 62 percent for Black students, 47 percent for ESL students, and 41 percent for recent immigrants.” (Corcoran, 2010, p.20- 21)

Thus, many teacher effectiveness ratings would be based on significantly incomplete information, and further, the extent to which that information is incomplete would be highly dependent on the types of students served by the teacher.

One statistical resolution to this problem is imputation. In effect, imputation creates pre-test or post-test scores for those students who weren’t there. One approach is to use the average score for students who were there, or more precisely for otherwise similar students who were there. On its face imputation is problematic when it comes to attribution of responsibility for student outcomes to the teacher, if some of those outcomes are statistically generated for students who were not even there. But not using imputation may lead to estimates of effectiveness that are severely biased, especially when there is so much missing data. Howard Wainer (2011) esteemed statistician and measurement expert formerly with Educational Testing Service (ETS) explains somewhat mockingly how teachers might game imputation of missing data by sending all of their best students on a field trip during fall testing days, and then, in the name of fairness, sending the weakest students on a field trip during spring testing days.[3] Clearly, in such a case of gaming, the predicted value-added assigned to the teacher as a function of the average scores of low performing students at the beginning of the year (while their high performing classmates were on their trip), and high performing ones at the end of the year (while their low performing classmates were on their trip), would not be correctly attributed to the teacher’s actual teaching effectiveness, though it might be attributable to the teacher’s ability to game the system.

In short, validity concerns are at least as great as reliability concerns, if not greater. If a measure is simply not valid, it really doesn’t matter whether it is reliable or not.

If a measure cannot be used to validly infer teacher effectiveness, cannot be used to attribute responsibility for student achievement growth to the teacher, then that measure is highly suspect as a basis for high stakes decisions making when evaluating teacher (or teaching) effectiveness or for teacher and school accountability systems more generally.

References & Additional Readings

Alexander, K.L, Entwisle, D.R., Olsen, L.S. (2001) Schools, Achievement and Inequality: A Seasonal Perspective. Educational Evaluation and Policy Analysis 23 (2) 171-191

Ballou, D., Mokher, C.G., Cavaluzzo, L. (2012) Using Value-Added Assessment for Personnel Decisions: How Omitted Variables and Model Specification Influence Teachers’ Outcomes. Annual Meeting of the Association for Education Finance and Policy. Boston, MA. http://aefpweb.org/sites/default/files/webform/AEFP-Using%20VAM%20for%20personnel%20decisions_02-29-12.docx

Ballou, D. (2012). Review of “The Long-Term Impacts of Teachers: Teacher Value-Added and Student Outcomes in Adulthood.” Boulder, CO: National Education Policy Center. Retrieved [date] from http://nepc.colorado.edu/thinktank/review-long-term-impacts

Baker, E.L., Barton, P.E., Darling-Hammong, L., Haertel, E., Ladd, H.F., Linn, R.L., Ravitch, D., Rothstein, R., Shavelson, R.J., Shepard, L.A. (2010) Problems with the Use of Student Test Scores to Evaluate Teachers. Washington, DC: Economic Policy Institute. http://epi.3cdn.net/724cd9a1eb91c40ff0_hwm6iij90.pdf

Betebenner, D., Wenning, R.J., Briggs, D.C. (2011) Student Growth Percentiles and Shoe Leather. http://www.ednewscolorado.org/2011/09/13/24400-student-growth-percentiles-and-shoe-leather

Boyd, D.J., Lankford, H., Loeb, S., & Wyckoff, J.H. (July, 2010). Teacher layoffs: An empirical illustration of seniority vs. measures of effectiveness. Brief 12. National Center for Evaluation of Longitudinal Data in Education Research. Washington, DC: The Urban Institute.

Briggs, D., Betebenner, D., (2009) Is student achievement scale dependent? Paper presented at the invited symposium Measuring and Evaluating Changes in Student Achievement: A Conversation about Technical and Conceptual Issues at the annual meeting of the National Council for Measurement in Education, San Diego, CA, April 14, 2009. http://dirwww.colorado.edu/education/faculty/derekbriggs/Docs/Briggs_Weeks_Is%20Growth%20in%20Student%20Achievement%20Scale%20Dependent.pdf

Briggs, D. & Domingue, B. (2011). Due Diligence and the Evaluation of Teachers: A review of the value-added analysis underlying the effectiveness rankings of Los Angeles Unified School District teachers by the Los Angeles Times. Boulder, CO: National Education Policy Center. Retrieved [date] from http://nepc.colorado.edu/publication/due-diligence.

Budden, R. (2010) How Effective Are Los Angeles Elementary Teachers and Schools?, Aug. 2010, available at http://www.latimes.com/media/acrobat/2010-08/55538493.pdf.

Braun, H, Chudowsky, N, & Koenig, J (eds). (2010) Getting value out of value-added. Report of a Workshop. Washington, DC: National Research Council, National Academies Press.

Braun, H. I. (2005). Using student progress to evaluate teachers: A primer on value-added models. Princeton, NJ: Educational Testing Service. Retrieved February, 27, 2008.

Chetty, R., Friedman, J., Rockoff, J. (2011) The Long Term Impacts of Teachers: Teacher Value Added and Student outcomes in Adulthood. NBER Working Paper # 17699 http://www.nber.org/papers/w17699

Clotfelter, C., Ladd, H.F., Vigdor, J. (2005) Who Teaches Whom? Race and the distribution of Novice Teachers. Economics of Education Review 24 (4) 377-392

Clotfelter, C., Glennie, E. Ladd, H., & Vigdor, J. (2008). Would higher salaries keep teachers in high-poverty schools? Evidence from a policy intervention in North Carolina. Journal of Public Economics 92, 1352-70.

Corcoran, S.P. (2010) Can Teachers Be Evaluated by their Students’ Test Scores? Should they Be? The Use of Value Added Measures of Teacher Effectiveness in Policy and Practice. Annenberg Institute for School Reform. http://annenberginstitute.org/pdf/valueaddedreport.pdf

Corcoran, S.P. (2011) Presentation at the Institute for Research on Poverty Summer Workshop: Teacher Effectiveness on High- and Low-Stakes Tests (Apr. 10, 2011), available at https://files.nyu.edu/sc129/public/papers/corcoran_jennings_beveridge_2011_wkg_teacher_effects.pdf.

Corcoran, Sean P., Jennifer L. Jennings, and Andrew A. Beveridge. 2010. “Teacher Effectiveness on High- and Low-Stakes Tests.” Paper presented at the Institute for Research on Poverty summer workshop, Madison, WI.

D.C. Pub. Sch., IMPACT Guidebooks (2011), available at http://dcps.dc.gov/portal/site/DCPS/menuitem.06de50edb2b17a932c69621014f62010/?vgnextoid=b00b64505ddc3210VgnVCM1000007e6f0201RCRD.

Education Trust (2011) Fact Sheet- Teacher Quality. Washington, DC. http://www.edtrust.org/sites/edtrust.org/files/Ed%20Trust%20Facts%20on%20Teacher%20Equity_0.pdf

Hanushek, E.A., Rivkin, S.G., (2010) Presentation for the American Economic Association: Generalizations about Using Value-Added Measures of Teacher Quality 8 (Jan. 3-5, 2010), available at http://www.utdallas.edu/research/tsp-erc/pdf/jrnl_hanushek_rivkin_2010_teacher_quality.pdf

Working with Teachers to Develop Fair and Reliable Measures of Effective Teaching. MET Project White Paper. Seattle, Washington: Bill & Melinda Gates Foundation, 1. Retrieved December 16, 2010, from http://www.metproject.org/downloads/met-framing-paper.pdf.

Learning about Teaching: Initial Findings from the Measures of Effective Teaching Project. MET Project Research Paper. Seattle, Washington: Bill & Melinda Gates Foundation. Retrieved December 16, 2010, from http://www.metproject.org/downloads/Preliminary_Findings-Research_Paper.pdf.

Jackson, C.K., Bruegmann, E. (2009) Teaching Students and Teaching Each Other: The Importance of Peer Learning for Teachers. American Economic Journal: Applied Economics 1(4): 85–108

Kane, T., Staiger, D., (2008) Estimating Teacher Impacts on Student Achievement: An Experimental Evaluation. NBER Working Paper #16407 http://www.nber.org/papers/w14607

Koedel, C. (2009) An Empirical Analysis of Teacher Spillover Effects in Secondary School. 28 (6 ) 682-692

Koedel, C., & Betts, J. R. (2009). Does student sorting invalidate value-added models of teacher effectiveness? An extended analysis of the Rothstein critique. Working Paper.

Jacob, B. & Lefgren, L. (2008). Can principals identify effective teachers? Evidence on subjective performance evaluation in education. Journal of Labor Economics. 26(1), 101-36.

Sass, T.R., (2008) The Stability of Value-Added Measures of Teacher Quality and Implications for Teacher Compensation Policy. National Center for Analysis of Longitudinal Data in Educational Research. Policy Brief #4. http://eric.ed.gov/PDFS/ED508273.pdf

McCaffrey, D. F., Lockwood, J. R, Koretz, & Hamilton, L. (2003). Evaluating value-added models for teacher accountability. RAND Research Report prepared for the Carnegie Corporation.

McCaffrey, D. F., Lockwood, J. R., Koretz, D., Louis, T. A., & Hamilton, L. (2004). Models for value-added modeling of teacher effects. Journal of Educational and Behavioral Statistics, 29(1), 67.

Rothstein, J. (2011). Review of “Learning About Teaching: Initial Findings from the Measures of Effective Teaching Project.” Boulder, CO: National Education Policy Center. Retrieved [date] from http://nepc.colorado.edu/thinktank/review-learning-about-teaching.

Rothstein, J. (2009). Student sorting and bias in value-added estimation: Selection on observables and unobservables. Education Finance and Policy, 4(4), 537–571.

Rothstein, J. (2010). Teacher Quality in Educational Production: Tracking, Decay, and Student Achievement. Quarterly Journal of Economics, 125(1), 175–214.

Sanders, W. L., Saxton, A. M., & Horn, S. P. (1997). The Tennessee Value-Added Assessment System: A quantitative outcomes-based approach to educational assessment. In J. Millman (Ed.), Grading teachers, grading schools: Is student achievement a valid measure? (pp. 137-162). Thousand Oaks, CA: Corwin Press.

Sanders, William L., Rivers, June C., 1996. Cumulative and residual effects of teachers on future student academic achievement. Knoxville: University of Tennessee Value- Added Research and Assessment Center.

Sass, T.R. (2008) The Stability of Value-Added Measures of Teacher Quality and Implications for Teacher Compensation Policy. Urban Institute http://www.urban.org/UploadedPDF/1001266_stabilityofvalue.pdf

McCaffrey, D.F., Sass, T.R., Lockwood, J.R., Mihaly, K. (2009) The Intertemporal Variability of Teacher Effect Estimates. Education Finance and Policy 4 (4) 572-606

McCaffrey, D.F., Lockwood, J.R. (2011) Missing Data in Value Added Modeling of Teacher Effects. Annals of Applied Statistics 5 (2A) 773-797

Reardon, S. F. & Raudenbush, S. W. (2009). Assumptions of value-added models for estimating school effects. Education Finance and Policy, 4(4), 492–519.

Rubin, D. B., Stuart, E. A., and Zanutto, E. L. (2004). A potential outcomes view of value-added assessment in education. Journal of Educational and Behavioral Statistics, 29(1):103–116.

Schochet, P.Z., Chiang, H.S. (2010) Error Rates in Measuring Teacher and School Performance Based on Student Test Score Gains. Institute for Education Sciences, U.S. Department of Education. http://ies.ed.gov/ncee/pubs/20104004/pdf/20104004.pdf.

The Toxic Trifecta, Bad Measurement & Evolving Teacher Evaluation Policies

This post contains my preliminary thoughts in development for a forthcoming article dealing with the intersection between statistical and measurement issues in teacher evaluation and teachers’ constitutional rights where those measures are used for making high stakes decisions.

The Toxic Trifecta in Current Legislative Models for Teacher Evaluation

A relatively consistent legislative framework for teacher evaluation has evolved across states in the past few years. Many of the legal concerns that arise do so because of inflexible, arbitrary and often ill-conceived yet standard components of this legislative template. There exist three basic features of the standard model, each of which is problematic on its own regard, and those problems become multiplied when used in combination.

First, the standard evaluation model proposed in legislation requires that objective measures of student achievement growth necessarily be considered in a weighting system of parallel components. Student achievement growth measures are assigned, for example, a 40 or 50% weight alongside observation and other evaluation measures. Placing the measures alongside one another in a weighting scheme assumes all measures in the scheme to be of equal validity and reliability but of varied importance (utility) – varied weight. Each measure must be included, and must be assigned the prescribed weight – with no opportunity to question the validity of any measure. [1] Such a system also assumes that the various measures included in the system are each scaled such that they can vary to similar degrees. That is, that the observational evaluations will be scaled to produce similar variation to the student growth measures, and that the variance in both measures is equally valid – not compromised by random error or bias. In fact, however, it remains highly likely that some components of the teacher evaluation model will vary far more than others if by no other reasons than that some measures contain more random noise than others or that some of the variation is attributable to factors beyond the teachers’ control. Regardless of the assigned weights and regardless of the cause of the variation (true or false measure) the measure that varies more will carry more weight in the final classification of the teacher as effective or not. In a system that places differential weight, but assumes equal validity across measures, even if the student achievement growth component is only a minority share of the weight, it may easily become the primary tipping point in most high stakes personnel decisions.

Second, the standard evaluation model proposed in legislation requires that teachers be placed into effectiveness categories by assigning arbitrary numerical cutoffs to the aggregated weighted evaluation components. That is, a teacher in the 25%ile or lower when combining all evaluation components might be assigned a rating of “ineffective,” whereas the teacher at the 26%ile might be labeled effective. Further, the teacher’s placement into these groupings may largely if not entirely hinge on their rating in the student achievement growth component of their evaluation. Teachers on either side of the arbitrary cutoff are undoubtedly statistically no different from one another. In many cases as with the recently released teacher effectiveness estimates on New York City teachers, the error ranges for the teacher percentile ranks have been on the order of 35%ile points (on average, up to 50% with one year of data). Assuming that there is any real difference between the teacher at the 25%ile and 26%ile (as their point estimate) is a huge unwarranted stretch. Placing an arbitrary, rigid, cut-off score into such noisy measures makes distinctions that simply cannot be justified especially when making high stakes employment decisions.

Third, the standard evaluation model proposed in legislation places exact timelines on the conditions for removal of tenure. Typical legislation dictates that teacher tenure either can or must be revoked and the teacher dismissed after 2 consecutive years of being rated ineffective (where tenure can only be achieved after 3 consecutive years of being rate effective).[2] As such, whether a teacher rightly or wrongly falls just below or just above the arbitrary cut-offs that define performance categories may have relatively inflexible consequences.

The Forced Choice between “Bad” Measures and “Wrong” Ones

Two separate camps have recently emerged in state policy regarding development and application of measures of student achievement growth to be used in newly adopted teacher evaluation systems. The first general category of methods is known as value-added models and the second as student growth percentiles. Among researchers it is well understood that these are substantively different measures by their design, one being a possible component of the other. But these measures and their potential uses have been conflated by policymakers wishing to expedite implementation of new teacher evaluation policies and pilot programs.

Arguably, one reason for the increasing popularity of the student growth percentile (SGP) approach across states is the extent of highly publicized scrutiny and large and growing body of empirical research over problems with using value-added measures for determining teacher effectiveness (See Green, Baker and Oluwole, 2012). Yet, there has been little such research on the usefulness of student growth percentiles for determining teacher effectiveness. The reason for this vacuum is not that student growth percentiles are simply not susceptible to the problems of value-added models, but that researchers have chosen not to evaluate their validity for this purpose – estimating teacher effectiveness – because they are not designed to infer teacher effectiveness.

A value added estimate uses assessment data in the context of a statistical model (regression analysis), where the objective is to estimate the extent to which a student having a specific teacher or attending a specific school influences that student’s difference in score from the beginning of the year to the end of the year – or period of treatment (in school or with teacher). The most thorough of VAMs attempt to account for several prior year test scores (to account for the extent that having a certain teacher alters a child’s trajectory), classroom level mix of students, individual student background characteristics, and possibly school characteristics. The goal is to identify most accurately the share of the student’s or group of students’ value-added that should be attributed to the teacher as opposed to other factors outside of the teachers’ control.

By contrast, a student growth percentile is a descriptive measure of the relative change of a student’s performance compared to that of all students and based on a given underlying test or set of tests. That is, the individual scores obtained on these underlying tests are used to construct an index of student growth, where the median student, for example, may serve as a baseline for comparison. Some students have achievement growth on the underlying tests that is greater than the median student, while others have growth from one test to the next that is less. That is, the approach estimates not how much the underlying scores changed, but how much the student moved within the mix of other students taking the same assessments, using a method called quantile regression to estimate the rarity that a child falls in her current position in the distribution, given her past position in the distribution.[3] Student growth percentile measures may be used to characterize each individual student’s growth, or may be aggregated to the classroom level or school level, and/or across children who started at similar points in the distribution to attempt to characterize collective growth of groups of students.

Many, if not most value-added models also involve normative rescaling of student achievement data, measuring in relative terms how much individual students or groups of students have moved within the large mix of students. The key difference is that the value-added models include other factors in an attempt to identify the extent to which having a specific teacher contributed to that growth, whereas student growth percentiles are simply a descriptive measure of the growth itself. A student growth percentile measure could be used in a value-added model.

As described by the authors of the Colorado Growth Model:

A primary purpose in the development of the Colorado Growth Model (Student Growth Percentiles/SGPs) was to distinguish the measure from the use: To separate the description of student progress (the SGP) from the attribution of responsibility for that progress.” (Betebenner, Wenning & Briggs, 2011)

Unlike value-added teacher effect estimates, student growth percentiles are not intended for attribution of responsibility for student progress to either the teacher or the school. But if that is so clearly the case (as recently stated as Fall, 2011) is it plausible that states or local school districts will actually choose to use the measures to make inferences? Below is a brief explanation from a Q&A section of the New Jersey Department of Education web site regarding implementation of pilot teacher evaluation programs:

Standardized test scores are not available for every subject or grade. For those that exist (Math and English Language Arts teachers of grades 4-8), Student Growth Percentages (SGPs), which require pre- and post-assessments, will be used. The SGPs should account for 35%-45% of evaluations. The NJDOE will work with pilot districts to determine how student achievement will be measured in non-tested subjects and grades.[4]

This explanation clearly indicates that student growth percentile data are to be used for “evaluation” of teacher effectiveness. In fact, the SGPs alone, as they stand, as descriptive measures “should be used to account for 35% to 45% of evaluations.” Other states including Colorado have already adopted (pioneered) the use of Student Growth Percentiles as a statewide accountability measure and have concurrently passed high stakes teacher evaluation legislation. But it remains to be seen how the SGP data will be used in district specific contexts in guiding high stakes decisions.

While value-added models are intended estimate teacher effects on student achievement growth, they fail to do so in any accurate or precise way (see Green, Oluwole & Baker, 2012). By contrast, student growth percentiles make no such attempt.[5] Specifically, value-added measures tend to be highly unstable from year to year, and have very wide error ranges when applied to individual teachers, making confident distinctions between “good” and “bad” teachers difficult if not impossible. Further, while value-added models attempt to isolate that portion of student achievement growth that is caused by having a specific teacher they often fail to do so and it is difficult if not impossible to discern a) how much they have failed and b) in which direction for which teachers. That is, the individual teacher estimates may be biased by factors not fully addressed in the models, and we may not know how much. We also know that when different tests are used for the same content, teacher receive widely varied ratings raising additional questions about the validity of the measures.

While we do not have similar information from existing research on student growth percentiles, it stands to reason that since they are based on the same types of testing data, they will be similarly susceptible to error and noise. But more problematically, since student growth percentiles make no attempt (by design) to consider other factors that contribute to student achievement growth, the measures have significant potential for omitted variables bias. SGPs leave the interpreter of the data to naively infer (by omission) that all growth among students in the classroom of a given teacher must be associated with that teacher. Even subtle changes to explanatory variables in value-added models change substantively the ratings of individual teachers (Ballou et al., 2012, Briggs & Domingue, 2010). Excluding all potential explanatory variables, as do SGPs, takes this problem to the extreme. As a result, it may turn out that SGP measures at the teacher level appear more stable from year to year than value-added estimates, but that stability may be entirely a function of teachers serving similar populations of students from year to year. That is, the measures may contain stable omitted variables bias, and thus may be stable in their invalidity.

In defense of Student Growth Percentiles as accountability measures but with no mention of their use for teacher evaluation, Betebenner, Wenning and Briggs (2011) explain that one school of thought is that value-added estimates are also most reasonably interpreted as descriptive measures, and should not be used to infer teacher or school effectiveness:

“The development of the Student Growth Percentile methodology was guided by Rubin et al’s (2004) admonition that VAM quantities are, at best, descriptive measures.” (Betebenner, Wenning & Briggs, 2011)

Rubin et al explain:

“Value-added assessment is a complex issue, and we appreciate the efforts of Ballou et al. (2004), McCaffrey et al. (2004) and Tekwe et al. (2004). However, we do not think that their analyses are estimating causal quantities, except under extreme and unrealistic assumptions. We argue that models such as these should not be seen as estimating causal effects of teachers or schools, but rather as providing descriptive measures.” (Rubin et al., 2004)

Arguably, these explanations do less to validate the usefulness of Student Growth Percentiles as accountability measures (inferring attribution and/or responsibility to schools and teachers) and far more to invalidate the usefulness of both Student Growth Percentiles and Value-Added Models for these purposes.

New Jersey’s TEACHNJ: At The Intersection of the Toxic Trifecta and “Wrong” Measures

A short while back, John Mooney over at NJ Spotlight provided an overview of a pending bill in the New Jersey legislature which just so happens to contain explicitly at least two out of three of the elements of the Toxic Trifecta and contains the third implicitly by granting deference to the NJ Department of Education to approve the quantitative measures used in evaluation systems.

Text of the Bill: http://www.njleg.state.nj.us/2012/Bills/S0500/407_I1.PDF

First, the bill throughout refers to the creation of performance categories as discussed above, implicitly if not explicitly declaring those categories to be absolute, clearly defined and fully differentiable from one another.

Second, while the bill is not explicit in its requirement of specific quantified performance metrics the bill grants latitude on this matter to the NJ Department of Education (to approve local plans) which a) is developing a student growth percentile model to be used for these purposes, and b) under its pilot plan is suggesting (if not requiring) that districts use the student growth percentile data for 35 to 45% of evaluations, as noted above.

Third, the bill places an absolute and inflexible timeline on dismissal:

Notwithstanding any provision of law to the contrary, the principal, in consultation with the panel, shall revoke the tenure granted to an employee in the position of teacher, assistant principal, or vice-principal if the employee is evaluated as ineffective in two consecutive annual evaluations. (p. 10)

The key word here is “shall” which indicates a statutory obligation to revoke tenure. It does not say “may,” or “at the principal’s discretion.” It says shall.

The principal shall revoke tenure if a teacher is unlucky enough to land below an arbitrary cut-point, using a measure not designed for such purposes, for two years in a row. (even if the teacher was lucky enough to achieve an “awesome” rating every other year of her career!)

The kicker is that the bill goes one step further to attempt to eliminate any due process right a teacher might have to challenge the basis for the dismissal:

The revocation of the tenure status of a teacher, assistant principal, or vice-principal shall not be subject to grievance or appeal except where the ground for the grievance or appeal is that the principal failed to adhere substantially to the evaluation process. (p. 10)

In other words, the bill attempts to establish that teachers shall have no basis (no procedural due process claim) for grievance as long as the principal has followed their evaluation plan, ignoring the possibility – the fact – that these evaluation plans themselves, approved or not, will create scenarios and cause personnel decisions which violate due process rights. Further, the attempt at restricting due process rights laid out in the bill itself is a threat to due process and would likely be challenged.

Declaring any old process to constitute due process does not make it so! Especially where the process is built on not only “bad” but “wrong” measures used in a framework that forces dismissal decisions on at least 3 completely arbitrary and capricious bases (2 consecutive years in isolation, fixed weight on wrong measure, arbitrary cut-points for performance categories).

So this raises the big question of what’s behind all of this. Clearly, one thing that’s behind all of this is an astonishing ignorance of statistics and measurement among state legislators favoring the toxic trifecta – either that or a willful neglect of their legislative duty to respect constitutional protections including due process (or both!).

[1] A more reasonable alternative being to use the statistical information as a preliminary screening tool for identifying potential problem areas, and then using more intensive observations and additional evaluation tools as follow-up. This approach acknowledges that the signals provided by the statistical information may in fact be false either as a function of reliability problems or lacking validity (other conditions contributed to the rating), and therefore in some if not many cases, should be discarded. The parallel consideration more commonly used requires that the student growth metric be considered and weighted as prescribed, reliable, valid or not.

[2] For example, at the time of writing this draft, the bill introduced in New Jersey read: “Notwithstanding any provision of law to the contrary, the principal shall revoke the tenure granted to an employee in the position of teacher, assistant principal, or vice-principal, regardless of when the employee acquired tenure, if the employee is evaluated as ineffective or partially effective in one year’s annual summative evaluation and in the next year’s annual summative evaluation the employee does not show improvement by being evaluated in a higher rating category. The only evaluations which may be used by the principal for tenure revocation are those evaluations conducted in the 2013-2014 school year and thereafter which use the rubric adopted by the board and approved by the commissioner. The school improvement panel may make recommendations to the principal on a teacher’s tenure revocation.” http://www.njspotlight.com/assets/12/0203/0158

[3] For more precise explanations, see: http://dirwww.colorado.edu/education/faculty/derekbriggs/Docs/Briggs_Weeks_Is%20Growth%20in%20Student%20Achievement%20Scale%20Dependent.pdf

[4] http://www.state.nj.us/education/EE4NJ/faq/

[5] Briggs and Betebenner (2009) explain: “However, there is an important philosophical difference between the two modeling approaches in that Betebenner (2008) has focused upon the use of SGPs as a descriptive tool to characterize growth at the student-level, while the LM (layered model) is typically the engine behind the teacher or school effects that get produced for inferential purposes in the EVAAS.” (Briggs & Betebenner, 2009, p. )

References