Thought for the day…

Many will consider this blasphemy, but I’ve been pondering lately:

If our best public and private schools are pretty good (perhaps even better than Finland?),

And, if the majority (not all, but most) of our best AND our worst public (and private) schools use salary schedules which base teacher compensation primarily on degrees/credits/credentials obtained and years of experience or service…

Can we really attribute the failures of our worst schools to these features of teacher compensation?

Yeah… there might be a better (more efficient and effective) way, but is this really the main problem?

Teacher “effectiveness” ratings and Freedom of Information Requests

Andy Rotherham over at Eduwonk posted an Irony Alert yesterday as many media outlets poised themselves to start “outing” ineffective teachers by publicly posting those teachers’ value-added effectiveness scores. Rotherham argued:

In light of this blow up about value-added in New York City, in a lot of places if the teachers unions would actually get serious about actually using value-add data as part of teacher evaluations it could be shielded from “Freedom of Information” requests that identify teachers, just as many aspects of personnel evaluations are. They’re caught in their own mousetrap here. My take on the larger issue from a few weeks ago and LA.

I thought…. hmmm… really? That doesn’t seem right. Is this just a clever argument intended to dupe teachers into getting those scores into their evaluations on some false assumption that the information would then be protected? Are these issues even transferable from state to state? Is the raw data used for generating the teacher effectiveness ratings actually considered part of the personnel file? I’m somewhat of an amateur on this school law stuff, but have enough background to start asking these questions when such arguments are tossed out there. So I did. I asked a handful of legal scholars in education policy, each of whom deals regularly with legal questions over personnel records under state law and with student record information.

Justin Bathon over at Ed Jurist has now posted his conversation starter for the legal community.

This is good stuff, and the very kind of conversation we should be having when such questions are raised. Ask the experts. Much of the argument hinges on when the raw data is translated into a measure that actually becomes part of the personnel file (at least with regard to the “shield” issue posed by Rotherham). Here’s Justin Bathon’s summary:

Anyway, summarizing, I think the raw data is generally going to be made publicly open following FOIA requests. I think New York City is currently correct in their assessment that no exemption exists under New York’s Freedom of Information Law. However, this is just my analysis after considering this issue for a single day and I want to caution against over reliance on my initial assumptions. A thorough analysis needs to be conducted of all 50 state policies, interpreting regulations, attorney general opinions, and previous case-law. Further, data experts such as Bruce must assist the analysis with a complete understanding of each state’s dataset and the possible links to both teachers and their evaluations within the datasets. Thus, there is still a lot of work left to be done.

This is a legal frontier (another one of those enabled by technology) that most legislatures would not have contemplated as possible in enacting their open records laws. Thus, it is a great topic for us to debate further to inform future policy actions on open records personnel evaluation exemptions.

Please, read the rest of his well-thought-out, albeit preliminary, post.

Here are my follow-up comments (cross-posted at edjurist) on data/data structures and their link to teacher evaluations:

Here are some data scenarios:

A. The district has individual student test score data that are linkable to individual teachers, but the district doesn’t use those data to generate any estimates of individual teacher “effectiveness,” has not adopted any statistical method for doing so, and therefore does not include any such estimates as part of personnel records. Individual students’ identities can be masked, but with IDs matched over time and specific characteristics attached (race, low-income status).
B. The district has individual student test score data that are linkable to individual teachers just as above, and the district does have an adopted statistical model/method for generating teacher value-added “effectiveness” scores, but uses those estimates only for district-level evaluation/analysis and not for individual teacher evaluation.
C. The district has individual student test score data that are linkable to individual teachers as above, the district has an adopted statistical method/model for generating teacher value-added “effectiveness” scores, and it has negotiated a contractual agreement with teachers (or is operating under a state policy framework) which requires inclusion of the “effectiveness” scores in the formal evaluation of the teacher.

Under option C above, sufficient technical documentation should be available such that “effectiveness” estimates could be checked/replicated/audited by an outside source. That is, while there should be materials that provide sufficiently understandable explanations such that teachers can understand their own evaluations and the extent to which their “effectiveness” ratings are, or are not, under their own control, there should also be a detailed explanation of the exact variables used in the model, the scaling of those variables, etc., and the specification of the regression equation that is used to estimate teacher effects. There should be sufficient detail to replicate district-generated teacher effectiveness scores.
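To make concrete what “sufficient detail to replicate” might look like, here is a minimal sketch of the kind of specification a district would need to document. It is not Buddin’s model or any district’s actual model; the column names and covariates are hypothetical placeholders.

```python
# Minimal sketch of a documented, replicable value-added specification.
# Column names (score, prior_score, low_income, ell, teacher_id) are
# hypothetical; a real district would need to document the exact variables,
# their scaling, and the estimation method used.
import pandas as pd
import statsmodels.formula.api as smf

def estimate_teacher_effects(df: pd.DataFrame):
    """Regress current score on prior score, student covariates, and teacher indicators."""
    model = smf.ols(
        "score ~ prior_score + low_income + ell + C(teacher_id)",
        data=df,
    ).fit()
    # Coefficients on the teacher indicators are the raw "effectiveness"
    # estimates, each measured relative to an omitted reference teacher.
    effects = {name: coef for name, coef in model.params.items()
               if name.startswith("C(teacher_id)")}
    return effects, model
```

Anyone handed the same data file and this level of documentation could rerun the estimation and check whether the published ratings match.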

That aside, a few different scenarios arise.

1. The LA Times scenario, as I understand it, falls under the first condition (A) above. The data existed in raw form. The district was not using those data for “effectiveness” ratings. The LAT got the data and handed them over to Richard Buddin of RAND. Buddin then estimated the most reasonable regression equation he could with the available data and, for that matter, produced a sufficiently detailed technical report – such that anyone accessing the same data could replicate his findings. I suspect that individual student names were masked, but the students were clearly matched to identifiable teachers, and the student data included specific identifiers of race, poverty, etc. and participation in programs such as gifted programs (an indicator that the child is labeled as gifted). I’m not sure what, if any, issues are raised by detailed descriptive information on child-level data. In this case, the data requested by the LAT and handed over to Buddin were not linked to teacher evaluation by the district itself, in any way, as I understand it.

2. As I understood the recent NYC media flap, the city itself was looking to report/release the value-added ratings, and the city itself also intends to use those “value-added” ratings for personnel evaluation. It sounded to me that Charleston, SC was proposing roughly the same. Each teacher would have a nifty little report card showing his or her “relative” effectiveness rating compared to other teachers. This effectiveness rating is essentially a “category” labeling a teacher as “better than average” or “worse than average.” These categories are derived from more specific “estimates” which come from a statistical model that generates a coefficient for each teacher’s “effect” on the students who have passed through that teacher’s classroom (these coefficients having substantial uncertainty and embedded bias, which I have discussed previously… but that’s not the point here). So, the effectiveness profile of the teacher is an aggregation of these “effects” into larger categories – but it is nonetheless directly drawn from the effect estimates generated by the district itself for teacher evaluation purposes (even if subcontracted by the district to a statistician). I would expect that the specific estimate and the profile aggregation would be part of the teacher’s personnel record.

So, now that the city’s official release of effectiveness profiles is on hold, what if a local newspaper requested a) the raw student data linkable to teachers, with student names masked but with sufficient demographic detail on each student and with identifiable information on teachers, and b) the detailed technical documentation on the statistical model and the specific variables used in that model? The newspaper could then contract a competent statistician to generate his/her own estimates of teacher effectiveness using the same data used by the district and the same method. These would not be “official” effectiveness estimates, nor could the media outlet claim them to be. But they would be a best attempt at a replication. Heck, it might be more fun if they used a slightly different model, because the ratings might end up substantially different from the district’s own estimates. But whether or not they replicated the district’s own methods, and whether they produced roughly the same or very different ratings for teachers, these estimates would still not be the official ones. Given the noise and variation in such estimates at the teacher level, it might actually be pretty hard to get estimates that correlate substantially with the district’s own estimates – and one would never know, because the district’s official effectiveness estimates for teachers would still be private.

Under these circumstances – partly because the “official” personnel-file estimates would remain unknown, and partly because the independent estimates produced by the media outlet, even if intended as a replication, might vary wildly from the district’s – I would assume the media outlet could get the data, estimate the model and report their results – their unofficial results. On the one hand, the media outlet could rely on the uncertainty of the estimates to justify that what they produce should not be considered “official” estimates. And on the other hand… in bold print in the paper… they could argue, as the LA Times Jasons have, that these estimates are good and reliable estimates of actual teacher effectiveness!
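As a rough illustration of why an independent re-estimate might not line up with the district’s own numbers, here is a toy simulation. Every number is made up for illustration; nothing here reflects the actual LA or NYC data.

```python
# Toy illustration: two independent, noisy estimates of the same "true"
# teacher effects can correlate only modestly with one another when the
# estimation error is large relative to the true variation across teachers.
# All parameters below are assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(0)
n_teachers = 500
true_effects = rng.normal(0.0, 0.10, n_teachers)   # assumed spread of true effects
noise_sd = 0.15                                     # assumed estimation error per analysis

district_estimates = true_effects + rng.normal(0, noise_sd, n_teachers)
newspaper_estimates = true_effects + rng.normal(0, noise_sd, n_teachers)

# With these assumed values the expected correlation is roughly 0.3, far from 1.0.
print(np.corrcoef(district_estimates, newspaper_estimates)[0, 1])
```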

The conversation continues over at EdJurist: http://www.edjurist.com/blog/value-added-evaluation-data-and-foia-state-versions-that-is.html?lastPage=true#comment10260419

Interesting follow-up point from Scott Bauries over at Ed Jurist:

Thus, from the legal perspective, I am left with one question: if the data and conclusions are being used as reflected in option “c,” but the media only gets the conclusions and not the raw data, then does the law allow a teacher to protect his or her reputation from unfair damage due to the publishing of a conclusion based on a noisy equation?

This is a very complicated question, involving both defamation law and the First Amendment. For example, is a public school teacher a “public official” or “public figure” for First Amendment purposes, such that the standard for proving defamation per se is increased? If so, then is the relevant statistical analysis illustrating the noisy nature of the conclusion enough to show falsehood for the purposes of a defamation claim? I think probably not in both instances, but I don’t think this precise issue has ever come up.

When reformy ideologies clash…

(note: lots of ideas here that I wanted to start writing about… but not yet well organized or articulated. It will come, with time, I hope.)

Summary of Reformy Ideology

Bluntly stated, the two major components of education reform ideology are as follows:

  • Reformy Ideology #1: Teacher quality is the one single factor that has the greatest effect on a child’s life chances. Get a bad teacher or two in a row, and you’re screwed for life. The “best possible” way to measure teacher quality is by estimating the teacher’s influence on student test scores (value-added). Hiring, retention and dismissal decisions must, that is, MUST be based primarily on this information. This information may be supplemented, but value-added must play the dominant single role.
  • Reformy Ideology #2: Charter schools are the answer to most of the problems of poor urban school districts. Take any poor, failing urban school district, close its dreadfully failing schools and replace them as quickly as possible with charter schools and children in the urban core will have greatly expanded high-quality educational opportunities.

Now, let me do a bit of clarification here. These are the representations of reform ideology at the extremes – but these views are not uncommon in the “reform” community. Let me also clarify that item #2 above isn’t about the broader issue of charter schooling, the origins of charter schooling, purposes of individual charter schools or research on the effectiveness of specific charter school models. Item #2 above is specifically about the argument that large urban districts can and should be replaced with an open market of charter schools – that charter schools should not just be used to try out new and interesting ideas which may be replicated in other schools – charter or not – but rather that charters should become dominant providers of urban education.

In my framing of item #1 above, I do not by any means intend to discredit the importance of high-quality teachers. I’m with the “reformers” on that idea. But it is certainly an overstatement to attribute all student gains and life chances to teacher quality alone. And, as I have discussed previously on many occasions on this blog, it is very problematic to assume that we presently have sufficient tools for measuring precisely and accurately the true effectiveness of any single teacher.

So, this brings me to some recent, completely unrelated events and media items on education reform that raise some interesting points of conflict.

Part I: Ideology Clashing with… Ideology

The first example of the clash of reformy ideologies comes from upstate New York as the state begins the process of implementing the “reforms” that got that state Race to the Top funding. In short, charter school operators really don’t seem to want to be compelled to adopt the first prong of reformy ideology. What? Charters don’t want to be compelled to use student test scores as the primary, or even a major basis for personnel decisions? Blasphemy!

In recent weeks, in casual conversations and at symposia, I’ve actually heard a number of charter school operators raise serious questions about being compelled to adopt Reform Ideology #1 above. Charter operators appreciate their autonomy, and while most do enjoy wider latitude over personnel decisions than large urban school districts that serve as their hosts, most do not necessarily base the majority of their hiring, firing and compensation decisions on student test scores and test scores alone. And most don’t seem very interested in being compelled to do so – adopting a one size fits all evaluation model. Apparently, they are also relatively uninterested in disclosing how they evaluate faculty. Here’s a quote from the Albany Times Union.

Carroll, one of the most prominent education reformers in the state, helped write the state’s original charter laws. He said if the charter schools accepted the money, they would lose their current flexibility in the firing and hiring of teachers. He also said charter schools would be forced to disclose their teacher evaluation process, which is now confidential, and that it could become harder to fire an educator deemed ineffective.

http://www.timesunion.com/default/article/No-to-cash-with-a-catch-714008.php

So then, if expanding charters is a major component of reform, and making sure teachers are evaluated by student test scores is a major component of reform, how can this apparent clash be reconciled? It can’t! It seems hypocritical at best to force public school districts to play by a seriously flawed set of teacher evaluation rules and then let charters off the hook. This is especially true if one of the supposed benefits of charter schools is to experiment with creative strategies that may be emulated by traditional public schools, and if traditional public schools are expected to improve by competing with charters on a level playing field. I’m with the charter leaders on this one.

UPDATE: Tom Carroll has clarified his comments here: http://www.nyfera.org/?p=2827, where he attempts to explain that charters are opposed to the teacher evaluation requirements not because they oppose the idea of using data to evaluate teachers, but because they oppose having the state education department mandate how those data should be used in evaluations:

SED simply has no authority to set thresholds for the use of data in teacher evaluations in charter schools.  Nor do they have the authority to require us to group teachers by four categories, or require such annual evaluations to be “a significant factor” for “promotion, retention, tenure determination and supplemental compensation.”  Nor do they have the authority to require charters to pursue “the removal of teachers and principals receiving two consecutive annual ratings of ‘ineffective’ after receiving supports from improvement plans.”

Carroll’s clarification, coupled with his unsubstantiated claim that charters are already doing these things, really doesn’t change my point above – that it remains hypocritical to foist these deeply problematic policies on traditional public schools while letting charters off the hook on the basis that charters should be given the flexibility to experiment with alternative strategies and should be exempt from full disclosure regarding those strategies.

Part II: Ideology Clashing with Research

Now, this one is really not an obvious, major clash, but rather a more subtle clash between ideology and research embedded in Eric Hanushek’s WSJ editorial the other day. The editorial included comments/claims that I found at least a little disingenuous.

The point of Hanushek’s op-ed was to explain that there is no war on teachers, and that this is really about getting teachers’ unions to negotiate for reasonable changes in contracts that would allow for more expeditious dismissal of the worst teachers – the bottom 5% or so. I’ll admit that I really haven’t seen Hanushek himself outright attacking teachers in his work of late, especially in his actual research on teacher labor markets. Much of it is very good and very useful stuff. That said, it’s hard to deny that many major public figures – talking-head tweeters, bloggers for think tanks, etc. – have engaged in an all-out attack on teachers, the teaching profession and teachers’ unions.

Setting that broader issue aside, Hanushek’s op-ed rubbed me the wrong way because of the examples he chose to advance his argument, and the extent to which those examples clash – and clash quite significantly – with his own best recent research. Hanushek summarized his arguments about the non-war on teachers as follows:

What’s really going on is different. President Obama states that we can’t tolerate bad teachers in classrooms, and he has promoted rewarding the most effective teachers so they stay in the classroom. The Los Angeles Times published data identifying both effective and ineffective teachers. And “Waiting for ‘Superman'” (in which I provide commentary) highlighted exceptional teachers and pointed out that teachers unions don’t focus enough on teacher quality.

This is not a war on teachers en masse. It is recognition of what every parent knows: Some teachers are exceptional, but a small number are dreadful. And if that is the case, we should think of ways to change the balance.

http://online.wsj.com/article_email/SB10001424052748703794104575546502615802206-lMyQjAxMTAwMDEwODExNDgyWj.html

So, part of his claim is that it was unjustified for teachers’ unions – not teachers, mind you, but their unions – to object so loudly when the LA Times merely, in the public interest, revealed data which (it is implied above, by the absence of disclaimers) validly identified, labeled and named effective and ineffective teachers.

Wait… isn’t it Eric Hanushek’s own research and writing that highlights problems with using value-added measurement to evaluate teachers where non-random student assignment occurs (which is pretty much anywhere)? For me, it was my familiarity with his work that led me to explore the biases in the LAT model that I’ve written about previously on this blog.  In that same post, I explain why I am more inclined to accept Jesse Rothstein’s concerns over the problems of non-random assignment of students than to brush those concerns aside based on the findings of Kane and Staiger.  Hanushek provides a compelling explanation for why he too places more weight on Rothstein’s findings.

Correct me if I’m wrong, but isn’t it possible that teachers and their union were in an uproar at least partly because the LA Times released highly suspect, potentially error ridden and extremely biased estimates of teacher quality? And that the LA Times misrepresented those estimates to the general public as good, real estimates of actual teacher effectiveness?

Yes, much of Eric Hanushek’s recent writing does advocate for some reasonable use of value-added estimates for determining teacher effectiveness, but he usually does so while giving appropriate attention to the various caveats and while emphasizing that value-added estimates should likely not be a single determining factor. He notes:

Potential problems certainly suggest that statistical estimates of quality based on student achievement in reading and mathematics should not constitute the sole component of any evaluation system.

http://edpro.stanford.edu/hanushek/admin/pages/files/uploads/HanushekRivkin%20AEA2010.CALDER.pdf

Yet, that’s just what the LA Times did, and without even mentioning the caveats!

Isn’t it a bit of an unfair assertion, given Hanushek’s own research and writing on value-added estimates, to claim that LA teachers and their union were completely unjustified in their response to the LA Times?

Part III: Ideology Clashing with Reality

There exists at least one segment of the truly reformy crowd that believes deeply in the second major ideology laid out at the beginning of this post – that if we can simply close failing urban schools (the whole district if we have to!) and let charters proliferate, children in the urban core will have many more opportunities to attend truly good schools. Yes, these reformers throw in the caveat that we must let only “good,” “high-performing” charters start up in place of the failing urban schools. And when viewing the situation retrospectively, these same reformy types will point out that if we look only at the upper half of the charters, they are doing better than average. Yeah… yeah… whatever.

One long-term research project that has interested me of late is to look in depth at those “failing” urban school districts that over the past decade have had the largest shares of their student population shift to charter schools – that is, the districts with the largest charter market share. Here is a link to the charter market share report from the National Alliance for Public Charter Schools: http://www.publiccharters.org/Market_Share_09

It would seem that, if we adopt the reformy ideology above and identify the districts with the largest charter market shares, those districts should now be models of high-quality, equitably distributed educational opportunities. We should eventually see sizeable effects on the achievement and attainment of children growing up in these cities; we should see quality of life increasing dramatically and housing values improving with an influx of families with school-aged children – a variety of interesting, empirically testable hypotheses, which I hope to explore in the future.

In the meantime, however, we have new and interesting descriptive information from a report by the Ewing Marion Kauffman Foundation focused on educational opportunities in Kansas City. Kansas City is #4 on charter market share, according to the National Alliance report, and rose to that position much earlier in the charter proliferation era than other cities. As a result, by reformy logic, Kansas City should be a hotbed of educational opportunity for school-aged children – after years of supposedly throwing money down the drain in the Kansas City Missouri Public School District (many of these claims actually being urban legend).

In Kansas City, the reality of charter expansion has clashed substantially with the reformy ideology. Arthur Benson, in a recent Kansas City Star op-ed, noted:

Charters have subtle means for selecting or de-selecting students to fit their school’s model. The Kansas City School District keeps its doors open to non-English speakers and all those kids sent back from the charter schools. In spite of those hurdles, Kansas City district schools across the board out-perform charter schools. That is not saying much. We have until recently failed 80 percent of our kids, but most charters fail more.

I was initially curious about the claims by Benson (a district board member and attorney) that charters have done so poorly in Kansas City. Could it really be that the massive expansion of charter schools in Kansas City has done little to improve, and may have aided in the erosion of, high-quality educational opportunities for Kansas City children?

The recent Kauffman Foundation report draws some similar conclusions, and the Kauffman Foundation has generally been an advocate for charter schools. The report classifies district and charter schools into groups by performance, with Level IV being the lowest and Level I being the only acceptable group.

  • Level I – A school that met or exceeded the state standard on the MAP Communication Arts and Mathematics exams in 2008-2009.
  • Level II – A school that scored between 75 and 99 percent of the state standard on the MAP Communication Arts and Mathematics exams in 2008-2009.
  • Level III – A school that scored between 50 and 74 percent of the state standard on the MAP Communication Arts and Mathematics exams in 2008-2009.
  • Level IV – A school that scored below 50 percent of the state standard on the MAP Communication Arts and Mathematics exams in 2008-2009.

Among other things, the report found that charter operators had avoided opening schools in the neediest neighborhoods. Rather, they set up shop in lower need neighborhoods, potentially exacerbating disparities in opportunities across the city’s zip codes. The report recommended:

A strategy for charter school growth should be developed by Kansas City education leaders. Charter schools should only be approved by DESE if they can demonstrate how they intend to fill a geographic need or a specific void in the communities they intend to serve.

Regarding charter performance more generally, the report noted:

In many communities charter schools are a model that increases students’ access to better public schools, but the majority of charter school students (5,490 or 64.7 percent) are in a Level IV school. Many of Kansas City’s charters have existed for 10 years and are still not able to reach even half of state standard.

Now, I’m not sure I accept their premise that in many communities this actually works – and that it just went awry for some strange reason in Kansas City. That said, the reality in Kansas City, by the authors’ own acknowledgment, is in sharp contrast with the reality the authors believe exists in other cities.

One implication (not tested directly) of this report is that the massive charter school expansion that occurred in Kansas City may have done little or nothing to improve the overall availability or distribution of educational opportunities for children in that city and may have actually made things worse.

Isn’t it strange how we hear so little about these things as we look to replicate these models of great reformy success in other cities of comparable scale such as Newark, NJ?

On False Dichotomies and Warped Reformy Logic

Pundit Claim 1 – Value added modeling is necessarily better than the “status quo”

There exists this strange perspective that we are faced with a simple choice in teacher evaluation – a choice between using student test scores and value-added modeling, or continuing with the status quo. This is a false dichotomy, or false dilemma – a logical fallacy. In other words, it’s a really stupid argument in which we are forced to assume that only two choices exist. This argument is usually coupled with an implicit assumption that one of the two must be superior.

“Reformers” continue to press the argument that current teacher evaluations are so bad, so unreliable, that anything is better than this “status quo.”

Expressed mathematically:

Anything > Status Quo

Bear with me while I use the “greater than” symbol to imply “really freakin’ better than… if not totally awesome… wicked awesome in fact,” but since it’s relative, it would have to be “wicked awesomer.”

Because value-added modeling exists and purports to measure teacher effectiveness, it counts as “something,” which is a subset of “anything,” and therefore it is better than the “status quo.” That is:

Value-added modeling = “something”

Something ⊆ Anything (something is a subset of anything)

Something > Status Quo

Value-added modeling > Current Teacher Evaluation

Again, where “>” means “awesomer” even though we know that current teacher evaluation is anything but awesome.

It’s just that simple!

After all, you can’t even measure the error rate in current principal and supervisor evaluations of teachers, can you? And if you can’t measure the error rate, it must be higher than any error rate you can measure? More really basic reformy logic! That is, the unobserved error rate in one system is necessarily greater than the observed error rate of another – even if we have no way to quantify it – in fact, because we have no way to quantify it?

Unobserved error rate of ‘status quo’ > measured error rate of VAM

Let’s be really blunt here. Both are patently stupid arguments.

And both of these arguments bring to mind one of my favorite analogies related to this issue. If we were in a society that still walked pretty much everywhere, and some tech genius invented a new cool thing – called the automobile – but the automobile would burst into a superheated fireball on every fifth start, I think I’d keep walking until they worked out that little kink. If they never worked out that little kink, I’d probably still be walking. I’ve written previously about how this relates to likely error rates in teacher dismissal (misclassifying truly effective teachers as ineffective) as would occur when using typical value-added modeling approaches.

Pundit Claim 2 – If we get rid of the bad teachers, the system will necessarily be better

The assumption of many pundits is that replacing existing teachers necessarily improves the teaching workforce – that the average potential applicant for any/all available teaching jobs will be better than the average person already there, or at least better than the person we dismiss as ineffective. Now, recall that we have a pretty high chance of misclassifying truly effective teachers and dismissing them.

Now, the math here is similar to that above. The basic premise is that:

Anything > Status Quo

First of all, we know already that schools with more difficult working conditions have a much more difficult time recruiting and retaining quality teachers. Working conditions play a significant role in teacher sorting in initial job matches and in teacher moves over time.

We also know, just by looking at such information as the patterns of higher and lower “effectiveness” scores in the LA Times analysis, that if we dismiss teachers on the basis of their value added scores, we will be dismissing larger shares of teachers in higher poverty, higher minority schools. Or, we can just take the Central Falls, RI approach and declare the entire school failing based on its average performance over time (setting aside demographics and resources) and just fire everyone. Surely the replacements will be better. How could we do worse? Right?

Here’s the thing – even if we assume that some of the lower performance of teachers in poorer LA schools or the lower performance of Central Falls HS is a function of a weaker, less effective teacher workforce, we can only make things “better” by replacing that workforce with “better” teachers.

It is completely arrogant to take the reformy attitude of “how can we possibly do worse?” How could we possibly get a worse pool of teachers than the lazy slugs already in the system?

If the teacher pool in these schools is in fact less effective, and doesn’t just look that way statistically because of other factors, it may just be that these schools had a difficult time recruiting and retaining teachers to begin with. If we introduce our “game changing” policies – firing all of the teachers for low school performance, or firing individual teachers for bad effectiveness ratings – we will likely make things even worse.

Any teacher wishing to step in line to replace the previous cohort of “failures” will have to consider not only the difficult working conditions but also the disproportionate likelihood that she/he will be fired a few years down the line, for factors well beyond his/her control (e.g., that pesky non-random assignment problem). That’s a significant change in working conditions – job risk. Without either changing other working conditions or substantially increasing compensation to offset this new risk, the applicant pool is not likely to get better – especially when risk is not increased similarly in other, “more desirable” school districts. All else equal, the applicant pool is likely to get worse. The disparity in the quality of applicants for teaching positions is likely to increase dramatically, and the average quality of applicants to high-poverty, high-minority-concentration districts may decline significantly.

Bonus video with thanks to Sherman Dorn:

Value-Added and “Favoritism”

Kevin Carey from Ed Sector has done it again. He’s come up with yet another argument that fails to pass even the most basic smell test. A few weeks ago, I picked on Kevin for making the argument that while charter schools, on average, are average, really good charter schools are better than average. Or, as he himself phrased it:

reasonable people acknowledge that the best charter schools–let’s call them “high-quality” charter schools–are really good

I myself am reasonable on occasion and fully accept this premise. Some schools are really good, and some not so good. And that applies to charter schools and non-charters alike, as I show in my recent post Searching for Superguy.

Well, last week Kevin Carey did it again – made a claim that simply doesn’t even pass the most basic smell test. In the New York Times Room for Debate series on value-added measurement of teachers, Carey argued that value-added measures would protect teachers from favoritism. Principals would no longer be able to go after certain teachers based on their own personal biases. Teachers would be able to back up their “real” performance with hard data. Here’s a quote:

“Value-added analysis can protect teachers from favoritism by using hard numbers and allow those with unorthodox methods to prove their worth.” (Kevin Carey, here)

The reality is that value-added measures simply create new opportunities to manipulate teacher evaluations through favoritism. In fact, it might even be easier to get a teacher fired by making sure the teacher has a weak value-added scorecard. Because value-added estimates are sensitive to non-random assignment of students, principals can easily manipulate the distributions of disruptive students, students with special needs, students with weak prior growth and other factors which, if not fully accounted for by the VA model, will bias teacher ratings. And some factors – like disruptive students, or those who simply don’t give a $#*! – won’t (and can’t) be addressed in the VA models. That is, a clever principal can use the VA non-random assignment bias to create a statistical illusion that a teacher is a bad teacher. One might argue that some principals likely already engage in a practice of assigning more “difficult” students to certain teachers – those less favored by the principal. So, even if the principal is less clever and merely spiteful, the same effect can occur.
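Here is a toy simulation of that statistical illusion: two teachers with identical (zero) true effects, one of whom is assigned far more students with an unmeasured drag on their gains. Every number is hypothetical; it is a sketch of the mechanism, not an estimate of its size in any real district.

```python
# Toy sketch of the favoritism mechanism: the "disfavored" teacher is handed
# more students with an unmeasured drag on gains (disruption, disengagement),
# so a naive value-added comparison makes an identical teacher look worse.
# All parameters are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(42)

def naive_value_added(n_difficult, n_total=25, drag=0.4, noise_sd=0.2):
    """Mean adjusted gain for one class; the teacher's true effect is zero."""
    difficult = np.zeros(n_total)
    difficult[:n_difficult] = 1.0
    gains = -drag * difficult + rng.normal(0, noise_sd, n_total)
    return gains.mean()

favored = naive_value_added(n_difficult=3)      # the principal's favorite
disfavored = naive_value_added(n_difficult=12)  # the teacher the principal dislikes
print(favored, disfavored)  # the disfavored teacher tends to "look" less effective
```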

I wrote in an earlier post about the types of contractual protections teachers should argue for, in order to protect against such practices:

The language in the class size/random assignment clause will have to be pretty precise to guarantee that each teacher is treated fairly – in a purely statistical sense. Teachers should negotiate for a system that guarantees “comparable class size across teachers – not to deviate more than X” and that year to year student assignment to classes should be managed through a “stratified randomized lottery system with independent auditors to oversee that system.” Stratified by disability classification, poverty status, language proficiency, neighborhood context, number of books in each child’s home setting, etc. That is, each class must be equally balanced with a randomly (lottery) selected set of children by each relevant classification.

This may all sound absurd, but sadly, under policies requiring high stakes decisions such as dismissal to be based on value added measures, this stuff will likely become necessary. And, it will severely constrain principals who wish to work closely with teachers on making thoughtful, individualized classroom assignments for students. I address the new incentives of teachers to avoid taking on the “tough” cases in this post: https://schoolfinance101.wordpress.com/2010/09/01/kids-who-don%E2%80%99t-give-a-sht/
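As a rough sketch of what the stratified lottery clause quoted above could mean in practice: shuffle students within each stratum, then deal them round-robin across classrooms so every class draws a balanced, random mix. The field names and strata below are illustrative assumptions, not language from any actual contract.

```python
# Rough sketch of a stratified random lottery for classroom assignment:
# shuffle within each stratum, then deal students round-robin across classes
# so each class receives a balanced, randomly drawn mix. Strata and student
# fields are illustrative assumptions.
import random
from collections import defaultdict

def stratified_lottery(students, n_classes, stratum_of, seed=None):
    """students: list of dicts; stratum_of: function mapping a student to a stratum label."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for s in students:
        strata[stratum_of(s)].append(s)
    classes = [[] for _ in range(n_classes)]
    slot = 0
    for group in strata.values():
        rng.shuffle(group)
        for s in group:                     # round-robin deal keeps each stratum balanced
            classes[slot % n_classes].append(s)
            slot += 1
    return classes

# Hypothetical usage, stratifying on disability, poverty and English proficiency:
# assignments = stratified_lottery(students, n_classes=4,
#                                  stratum_of=lambda s: (s["iep"], s["poverty"], s["ell"]),
#                                  seed=2010)
```

An independent auditor, as the quoted clause imagines, could verify the assignment simply by rerunning the lottery with the published seed.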

Technical follow-up: I noticed that Kevin Carey claims that VA measures “level the playing field for teachers who are assigned students of different ability.” This statement, as a general conclusion, is wrong.

a) VA measures do account for the initial performance level of individual students, or they would not be VA measures. Even this becomes problematic when measures are annual rather than fall/spring, so that summer learning loss is included in the year-to-year gain. An even more thorough approach for reducing model bias is to have multiple years of lagged scores on each child in order to estimate the extent to which a teacher can change a child’s trajectory (growth curve). That makes it more difficult to evaluate 3rd or 4th grade teachers, for whom many lagged scores aren’t yet available. The LAT model may have had multiple years of data on each teacher, but it didn’t have multiple lagged scores on each child. All that the LAT approach does is generate a more stable measure for a teacher, even if it is merely a stable measure of the bias arising from which students that teacher typically has assigned to him/her.

b) VA measures might crudely account for socio-economic status, disability status or language proficiency status, which may also affect learning gains. But typical VA models, like the LA Times model by Buddin, tend to use relatively crude, dichotomous proxies/indicators for these things. They don’t effectively capture the range of differences among kids. They don’t capture numerous potentially important, unmeasured differences. Nor do they typically capture the classroom composition – peer group – effect, which has been shown to be significant in many studies, whether measured by the racial/ethnic/socioeconomic composition of the peer group or by the average performance of the peer group.

c) For students who have more than one teacher across subjects (and/or teaching aides/assistants), each teacher’s VA measures may be influenced by the other teachers serving the same students.

I could go on, but recommend revisiting my previous posts on the topic where I have already addressed most of these concerns.

Value-added and the non-random sorting of kids who don’t give a sh^%t

Last week, this video from The Onion (asking whether tests are biased against kids who don’t give a sh^%t) was going viral among the education social networking geeks like me. At the same time, the conversations continued on the Los Angeles Times Value-Added story, with LAT releasing the scores for individual teachers.

I’ve written many blog posts in recent weeks on this topic. Lately, it seems that the emphasis of the conversation has turned toward finding a middle ground – discussing the appropriate role for VAM (value-added modeling), if any, in teacher evaluation. But there is also renewed rhetoric defending VAM. Most of that rhetoric most directly takes on the concern over error rates in VAM – and the lack of strong year-to-year correlation in which teachers are rated high or low.

The new rhetoric points out that we’re only having this conversation about VAM error rates because we can measure the error rate in VAM, but can’t even do that for peer or supervisor evaluation – which might be much worse (argue the pundits). The new rhetoric argues that VAM is still the “best available” method for evaluating teacher “performance.” Let me point out that if the “best available” automobile burst into flames on every fifth start, I think I’d walk or stay home instead. I’d take pretty significant steps to avoid driving. Now, we’re not talking about death by VAM here, but the idea that random error alone – under an inflexible VAM based policy structure – could lead to wrongfully firing a teacher is pretty significant.

Again, this current discussion pertains only to the “error rate” issue. Other major – perhaps even bigger – issues include the problem that so few teachers could even have test scores attached to them, creating a whole separate sub-class (<20%) of teachers in each school system and increasing divisions among teachers – creating significant tension, for example, between teachers under the VAM (math/reading) rating system and teachers who might want to meet with some of their students for music, art or other enrichment endeavors.

Perhaps most significantly, there still exists that pesky little problem of VAM not being able to sufficiently account for the non-random sorting of students across schools and teachers. For those who wish to use Kane and Staiger as their out on this (without reference to broader research on this topic), see my previous post on the LAT analysis. Their findings are interesting, but not the single definitive source on this issue. Note also that the LAT analysis itself reveals some bias likely associated with non-random assignment (the topic of my post).

So then, what the heck does this have to do with The Onion video about testing and kids who don’t give a sh^%t?

I would argue that the non-random assignment of kids who don’t give a sh^%t presents a significant concern for VAM. Consider any typical upper elementary school. It is quite possible that kids who don’t give a sh^%t are more likely to be assigned to one fourth grade teacher year after year than to another. This may occur because that fourth grade teacher really wants to try to help these kids out, and has some, though limited, success in doing so. This may also occur because the principal has it in for one teacher – and really wants to make his/her life difficult. Or, it may occur because all of the parents of kids who do give a sh^%t (in part because their parents give a sh^%t) consistently request the same teacher year after year.

In all likelihood, whether the kids give a sh^%t about doing well – and specifically about doing well on the tests used for generating VA estimates – matters, and may matter a lot. Teachers with disproportionate numbers of kids who don’t give a sh^%t may, as a result, receive systematically lower VA scores, and if the sorting mechanisms above are in place, this may occur year after year.

What incentive does this provide for the teacher who wanted to help – to help kids give a sh^%t? Statistically, even if that teacher made some progress in overcoming the give-a-sh^%t factor, the teacher would get a low rating because the give-a-sh^%t factor would not be accounted for in the model. Buddin’s LAT model includes dummy variables for kids who are low income and kids who are limited in their English language proficiency. But there’s no readily available indicator for kids who don’t give a sh^%t. So we can’t effectively compare one teacher with 10 (of 25) kids who don’t give a sh^%t to another with 5 (of 25) who don’t. We can hope that giving a sh^%t, or not, is picked up by the child’s prior year performance, and even better, by the prior multiple years of value-added estimates on that child. But do we really know whether giving a sh^%t is a stable student characteristic over time? Many VAM models, like the LAT one, don’t capture multiple prior years of value-added for each student.
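To see how an unmeasured give-a-sh^%t factor ends up inside the teacher estimate, here is a toy example in the same spirit as the 10-of-25 versus 5-of-25 comparison above. Everything is simulated with made-up numbers; it simply shows that a model omitting the indicator pushes the difference into the teacher coefficient, while a model that could (hypothetically) include it would not.

```python
# Toy example: two teachers with identical true effects (zero). Teacher B gets
# 10 of 25 unmotivated kids per class, Teacher A gets 5 of 25. A model that
# omits the (unmeasurable) indicator attributes the gap to Teacher B.
# All numbers are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
rows = []
for teacher, n_unmotivated in [("A", 5), ("B", 10)]:
    for year in range(4):                      # four classes of 25 per teacher
        for i in range(25):
            unmotivated = 1 if i < n_unmotivated else 0
            prior = rng.normal(0, 1)
            score = 0.7 * prior - 0.8 * unmotivated + rng.normal(0, 0.3)
            rows.append(dict(teacher=teacher, prior=prior,
                             unmotivated=unmotivated, score=score))
df = pd.DataFrame(rows)

omitting = smf.ols("score ~ prior + C(teacher)", data=df).fit()
including = smf.ols("score ~ prior + unmotivated + C(teacher)", data=df).fit()
# Teacher B's coefficient tends to be pulled well below zero when the factor is
# omitted, and sits near zero when it is (hypothetically) controlled for.
print(omitting.params["C(teacher)[T.B]"], including.params["C(teacher)[T.B]"])
```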

I noted in previous posts that the peer effect is among those factors that compromise (bias) teacher VAM ratings. Buddin’s LAT model, as far as I can tell, doesn’t try to capture differences in peer group when attempting to “isolate” the teacher effect (though this is very difficult to accomplish). Unlike racial characteristics or child poverty, whether 1 or 10 kids in a class give a sh^%t might rub off on others in the class. Or, the disruptive behavior of kids who don’t give a sh^%t might significantly compromise the learning (and value-added estimates) of others. Yet all of this goes unmeasured in even the best VAMs.

Once again, just pondering…

NEW: BONUS VIDEO

http://www.youtube.com/watch?v=OMivkYJbcAk&feature=player_embedded

LA Times Study: Asian math teachers better than Black ones

The big news over the weekend involved the LA Times posting of value-added ratings of LA public school teachers.

Here’s how the Times spun their methodology:

Seeking to shed light on the problem, The Times obtained seven years of math and English test scores from the Los Angeles Unified School District and used the information to estimate the effectiveness of L.A. teachers — something the district could do but has not.

The Times used a statistical approach known as value-added analysis, which rates teachers based on their students’ progress on standardized tests from year to year. Each student’s performance is compared with his or her own in past years, which largely controls for outside influences often blamed for academic failure: poverty, prior learning and other factors.

This spin immediately concerned me, because it appears to assume that simply using a student’s prior score erases, or controls for, any and all differences among students by family backgrounds as well as classroom level differences – who attends school with whom.

Thankfully (thanks to the immediate investigative work of Sherman Dorn), the analysis was at least marginally better than that and conducted by a very technically proficient researcher at RAND named Richard Buddin. Here’s his technical report:

The problem is that even someone as good as Buddin can only work with the data he has. And there are at least 3 major shortcomings of the data that Buddin appeared to have available for his value-added models. I’m setting aside here the potential quality of the achievement measures themselves. Calculating (estimating) a teacher’s effect on their students’ learning and, specifically, identifying the differences across teachers where those students are not randomly assigned (with the same class size, comparable peer group, same air quality, lighting, materials, supplies, etc.) requires that we do a pretty damn good job of accounting for the measurable differences across the children assigned to teachers. This is especially true if our plan is to post names on the wall (or web)!

Here’s my quick-read short list of shortcomings in Buddin’s data that, I would suspect, lead to significant problems in precisely determining differences in quality across teachers:

  1. While Buddin’s analysis includes student characteristics that may (and in fact appear to) influence student gains, Buddin – likely due to data limitations – includes only a simple classification variable for whether a student is a Title I student or not, and a simple classification variable for whether a student is limited in English proficiency. These measures are woefully insufficient for a model being used to label teachers on a website as good or bad. Buddin notes that 97% of children in the lowest performing schools are poor, and 55% in higher performing schools are poor. Identifying children simply as poor or not poor misses entirely the variation among the poor to very poor children in LA public schools – which is most of the variation in family background in LA public schools. That is, the estimated model does not control at all for the difference between one teacher teaching a class of children who barely qualify for Title I programs and another teaching a classroom of children from destitute homeless families or multigenerational poverty. I suspect Buddin, himself, would have liked to have had more detailed information. But you can only use what you’ve got. When you do, however, you need to be very clear about the shortcomings. Again, most kids in LA public schools are poor and the gradients of poverty are substantial. Those gradients are neglected entirely. Further, the model includes no “classroom”-related factors such as class size or student peer group composition (either a Hoxby-style measure of the average ability level of the peer group, or the racial composition of the peer group as used by Hanushek and Rivkin – though, then again, it’s nearly if not entirely impossible to fully correct for classroom-level factors in these models).
  2. It would appear that Buddin’s analysis uses annual testing data, not fall-spring assessments. This means that the year-to-year gains interpreted as “teacher effects” include summer learning and/or summer learning lag. That is, we are assigning blame, or praise to teachers based on what kids learned, or lost over the summer. If this is true of the models, this is deeply problematic. Okay, you say, but Buddin accounted for whether a student was a Title I student and summer opportunities are highly associated with Poverty Status. But, as I note above, this very crude indicator is far from sufficient to differentiate across most LA public school students.
  3. Finally, researchers like Jesse Rothstein, among others, have suggested that having multiple years of prior scores on students can significantly reduce the influence of non-random assignment of students to teachers on the ratings of teachers. Rothstein speaks of using 3 years of lagged scores (http://gsppi.berkeley.edu/faculty/jrothstein/published/rothstein_vam2.pdf) so as to sufficiently characterize the learning trajectories of students entering any given teacher’s class. It does not appear that Buddin’s analysis includes multiple lagged scores. (A schematic contrast between a single-lag and a multi-lag specification is sketched just after this list.)
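Here is that schematic contrast, referenced in item 3 above. It is only a sketch of the general difference between a single-lag and a multi-lag specification; the column names are hypothetical placeholders, and neither formula is Buddin’s or Rothstein’s actual model.

```python
# Schematic contrast between a single-lag value-added specification and one
# that conditions on several prior years of scores (closer in spirit to what
# Rothstein recommends). Column names are hypothetical placeholders.
import statsmodels.formula.api as smf

def single_lag_model(df):
    # One prior score plus crude dichotomous controls, roughly the data
    # situation described in items 1-3 above.
    return smf.ols("score ~ score_lag1 + title1 + ell + C(teacher_id)",
                   data=df).fit()

def multi_lag_model(df):
    # Requires three prior scores per student, so it cannot be estimated for
    # early grades where those lags do not yet exist.
    return smf.ols("score ~ score_lag1 + score_lag2 + score_lag3 + title1 + ell + C(teacher_id)",
                   data=df).fit()
```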

So then what are some possible effects of these problems, where might we notice them, and why might they be problematic?

One important effect, which I’ve blogged about previously, is that the value-added teacher ratings could be substantially biased by the non-random sorting of students – or in more human terms – teachers of children having characteristics not addressed by the models could be unfairly penalized, or for that matter, unfairly benefited.

Buddin is kind enough in his technical paper to provide for us various teacher characteristics and student characteristics that are associated with the teacher value-added effects – that is, what kinds of teachers are good, and which ones are more likely to suck? Buddin shows some of the usual suspects, like the fact that novice (first 3 years) teachers tended to have lower average value-added scores. Now, this might be reasonable if we also knew that novice teachers weren’t necessarily clustered with the poorest of students in the district. But we don’t know that.

Strangely, Buddin also shows us that the number of gifted children a teacher has affects their value-added estimate – the more gifted children you have, the better teacher you are??? That seems a bit problematic, and raises the question of why “gifted” was not used as a control measure in the value-added ratings. Statistically, this could be problematic if giftedness was defined by the outcome measure – test scores (making it endogenous). Nonetheless, the finding that having more gifted children is associated with the teacher effectiveness rating raises at least some concern over that pesky little non-random assignment issue.

Now here’s the fun, and most problematic part:

Buddin finds that black teachers have lower value-added scores for both ELA and MATH. Further, these are some of the largest negative effects in the second level analysis – especially for MATH. The interpretation here (for parent readers of the LA Times web site) is that having a black teacher for math is worse than having a novice teacher. In fact, it’s the worst possible thing! Having a black teacher for ELA is comparable to having a novice teacher.

Buddin also finds that having more black students in your class is negatively associated with a teacher’s value-added scores, but writes off the effect as small. Teachers of black students in LA are simply worse? There is NO discussion of the potentially significant overlap between black teachers, novice teachers and serving black students concentrated in black schools (as addressed by Hanushek and Rivkin in the link above).

By contrast, Buddin finds that having an Asian teacher is much, much better for MATH. In fact, Asian teachers are as much better (than white teachers) for math as black teachers are worse! Parents – go find yourself an Asian math teacher in LA? Also, having more Asian students in your class is associated with higher teacher ratings for Math. That is, you’re a better math teacher if you’ve got more Asian students, and you’re a really good math teacher if you’re Asian and have more Asian students?????

Talk about some nifty statistical stereotyping.

It makes me wonder if there might also be some racial disparity in the “gifted” classification variable, with more Asian students and fewer black students district-wide being classified as “gifted.”

IS ANYONE SEEING THE PROBLEM HERE? Should we really be considering using this information to either guide parent selection of teachers or to decide which teachers get fired?

I discussed the link between non-random assignment and racially disparate effects previously here:

https://schoolfinance101.wordpress.com/2010/06/02/pondering-legal-implications-of-value-added-teacher-evaluation/

Indeed there may be some substantive differences in the average academic (undergraduate & high school) preparation in math of black and Asian teachers in LA. And these differences may translate into real differences in the effectiveness of math teaching. But sadly, we’re not having that conversation here. Rather, the LA Times is putting out a database, built on insufficient underlying model parameters, that produces these potentially seriously biased results.

While some of these statistically significant effects might be “small” across the entire population of teachers in LA, the likelihood that these “biases” significantly affect specific individual teachers’ value-added ratings is much greater – and that’s what’s so offensive about the use of this information by the LA Times. The “best possible,” still questionable, models estimated are not being used to draw simple, aggregate conclusions about the degree of variance across schools and classrooms; rather, they are being used to label individual cases from a large data set as “good” or “bad.” That is entirely inappropriate!

Note: On Kane and Staiger versus Rothstein and non-random assignment

Finally, a comment on references to two different studies on the influence of non-random assignment. Those wishing to write off the problems of non-random assignment typically refer to Kane and Staiger’s analysis using a relatively small, randomized sample. Those wishing to raise concerns over non-random assignment typically refer to Jesse Rothstein’s work. Eric Hanushek, in an exceptional overview article on value-added assessment, summarizes these two articles, and his own work, as follows:

An alternative approach of Kane and Staiger (2008) of using estimates from a random assignment of teachers to classrooms finds little bias in traditional estimation, although the possible uniqueness of the sample and the limitations of the specification test suggest care in interpretation of the results.

A compelling part of the analysis in Rothstein (2010) is the development of falsification tests, where future teachers are shown to have significant effects on current achievement. Although this could be driven in part by subsequent year classroom placement based on current achievement, the analysis suggests the presence of additional unobserved differences.

In related work, Hanushek and Rivkin (2010) use alternative, albeit imperfect, methods for judging which schools systematically sort students in a large Texas district. In the “sorted” samples, where random classroom assignment is rejected, this falsification test performs like that in North Carolina, but this is not the case in the remaining “unsorted” sample where random assignment is not rejected.

http://edpro.stanford.edu/hanushek/admin/pages/files/uploads/HanushekRivkin%20AEA2010.CALDER.pdf

Rolling Dice: If I roll a “6” you’re fired!

Okay… Picture this… I’m rolling dice… and each time I roll a “6” some loud-mouthed, tweet-happy pundit who just loves value-added assessment for teachers gets fired. Sound fair? It might happen to someone who sucks at their job… or it might just be someone who is rather average. Doesn’t matter. They lost on the roll of the dice. A 1 in 6 chance. Not that bad. A 5 in 6 chance of keeping their job. Can’t you live with that?

This report was just released the other day from the National Center for Education Statistics:

http://ies.ed.gov/ncee/pubs/20104004/pdf/20104004.pdf

The report carries out a series of statistical tests to determine the identification “error” rates for “bad teachers” when using typical value added statistical methods. Here’s a synopsis of the findings from the report itself:

Type I and II error rates for comparing a teacher’s performance to the average are likely to be about 25 percent with three years of data and 35 percent with one year of data. Corresponding error rates for overall false positive and negative errors are 10 and 20 percent, respectively.

Where:

Type I error rate (α) is the probability that based on c years of data, the hypothesis test will find that a truly average teacher (such as Teacher 4) performed significantly worse than average. (p. 12)

So, that means that there is about a 25% chance (if using three years of data) or a 35% chance (if using one year of data) that a teacher who is “average” would be identified as “significantly worse than average” and potentially be fired. So, what I really need are some 4-sided dice. I gave the pundits odds that are too good! Admittedly, this is the likelihood of identifying an “average” teacher as well below average. The likelihood of identifying an above average teacher as below average would be lower. Here’s the relevant definition of a “false positive” error rate from the study:

the false positive error rate, FPR(q), is the probability that a teacher (such as Teacher 5) whose true performance level is q SDs above average is falsely identified for special assistance. (p. 12)

From the first quote above, even this occurs 1 in 10 times (given three years of data, and 2 in 10 given only one year). And here’s the definition of the “false negative error” rate:

false negative error rate is the probability that the hypothesis test will fail to identify teachers (such as Teachers 1 and 2 in Figure 2.1) whose true performance is at least T SDs below average.

…which also occurs 1 in 10 times (given three years of data and 2 in 10 given only one year).
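To make the mechanics concrete – and to update my dice analogy – here is a minimal simulation sketch of my own. The noise level, the “flag the bottom 20% of estimates” rule, and everything else in it are illustrative assumptions, not the report’s actual procedure, so the numbers will not match the report’s. The point is simply how sampling noise flags truly average teachers, and how more years of data helps only somewhat.

```python
import numpy as np

rng = np.random.default_rng(0)

def flag_rates(years, n_teachers=200_000,
               noise_sd_per_year=1.5, flag_share=0.20):
    """Simulate one round of value-added estimates and report
    (a) how often a truly *average* teacher lands in the flagged group and
    (b) what share of flagged teachers are actually at or above average.
    True effects are in teacher-SD units; the noise level and flag rule
    are illustrative assumptions, not the report's setup."""
    true = rng.normal(0.0, 1.0, n_teachers)                 # true teacher effects
    se = noise_sd_per_year / np.sqrt(years)                 # noise shrinks with more years of data
    est = true + rng.normal(0.0, se, n_teachers)            # noisy value-added estimates
    cutoff = np.quantile(est, flag_share)                   # flag the bottom 20% of estimates

    # chance that a teacher whose true effect is exactly average gets flagged
    avg_teacher_est = rng.normal(0.0, se, n_teachers)
    p_flag_average = np.mean(avg_teacher_est < cutoff)

    flagged = est < cutoff
    share_flagged_not_bad = np.mean(true[flagged] >= 0.0)   # flagged, but at or above average
    return p_flag_average, share_flagged_not_bad

for years in (1, 3):
    p_avg, p_wrong = flag_rates(years)
    print(f"{years} year(s): average teacher flagged {p_avg:.0%} of the time; "
          f"{p_wrong:.0%} of flagged teachers are actually at or above average")
```

Run it and the flagged group is far from a clean list of bad teachers, especially with only one year of data.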

These concerns are not new. In a previous post, I discuss various problems with using value added measures for identifying good and bad teachers, such as temporal instability: http://www.urban.org/UploadedPDF/1001266_stabilityofvalue.pdf.

The introduction of this new report notes:

Existing research has consistently found that teacher- and school-level averages of student test score gains can be unstable over time. Studies have found only moderate year-to-year correlations—ranging from 0.2 to 0.6—in the value-added estimates of individual teachers (McCaffrey et al. 2009; Goldhaber and Hansen 2008) or small to medium-sized school grade-level teams (Kane and Staiger 2002b). As a result, there are significant annual changes in teacher rankings based on value-added estimates.
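Here is a quick way to see what a year-to-year correlation in that 0.2 to 0.6 range means for rankings. This is my own illustrative sketch – simulated ratings, not any of the cited data sets:

```python
import numpy as np

rng = np.random.default_rng(1)

def quintile_churn(r, n=100_000):
    """Draw two years of ratings with year-to-year correlation r and return
    (a) the share of teachers whose quintile changes and
    (b) the share of bottom-quintile teachers in year 1 who leave the
    bottom quintile in year 2."""
    cov = [[1.0, r], [r, 1.0]]
    y1, y2 = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    q1 = np.digitize(y1, np.quantile(y1, [0.2, 0.4, 0.6, 0.8]))  # quintile in year 1
    q2 = np.digitize(y2, np.quantile(y2, [0.2, 0.4, 0.6, 0.8]))  # quintile in year 2
    moved = np.mean(q1 != q2)
    bottom_escape = np.mean(q2[q1 == 0] != 0)
    return moved, bottom_escape

for r in (0.2, 0.4, 0.6):
    moved, escape = quintile_churn(r)
    print(f"r = {r}: {moved:.0%} of teachers change quintiles; "
          f"{escape:.0%} of bottom-quintile teachers move out of the bottom")
```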

In my first post on this topic (and subsequent ones), I point out that the National Academies have already cautioned that:

“A student’s scores may be affected by many factors other than a teacher — his or her motivation, for example, or the amount of parental support — and value-added techniques have not yet found a good way to account for these other elements.”

http://www8.nationalacademies.org/onpinews/newsitem.aspx?RecordID=1278

And again, this new report provides a laundry list of factors that affect value-added assessment beyond the scope of the analysis itself:

However, several other features of value-added estimators that have been analyzed in the literature also have important implications for the appropriate use of value-added modeling in performance measurement. These features include the extent of estimator bias (Kane and Staiger 2008; Rothstein 2010; Koedel and Betts 2009), the scaling of test scores used in the estimates (Ballou 2009; Briggs and Weeks 2009), the degree to which the estimates reflect students’ future benefits from their current teachers’ instruction (Jacob et al. 2008), the appropriate reference point from which to compare the magnitude of estimation errors (Rogosa 2005), the association between value-added estimates and other measures of teacher quality (Rockoff et al. 2008; Jacob and Lefgren 2008), and the presence of spillover effects between teachers (Jackson and Bruegmann 2009).

In my opinion, the most significant problem here is the non-random assignment problem. The noise problem is significant and important, but much less significant than the non-random assignment problem. It just happens to be the topic of the day.

But alas, we continue to move forward… full steam ahead.

As I see it there are two groups of characters pitching fast-track adoption of value-added teacher evaluation policies.

Statistically Inept Pundits (who really don’t care anyway): The statistically inept pundits are those we see on Twitter every day, applauding the mass firing of DC teachers, praising the Colorado teacher evaluation bill and thinking that RttT is just AWESOME, regardless of the mixed (at best) evidence behind the reforms promoted by RttT (like value-added teacher assessment). My take is that they have no idea what any of this means… have little capacity to understand it anyway… and probably don’t much care. To them, I’m just a curmudgeonly academic throwing a wet blanket on their teacher-bashing party. After all, who… but a wet blanket could really be against making sure all kids have good teachers… making sure that we fire and/or lay off the bad teachers, not just the inexperienced ones? These teachers are dangerous after all. They are hurting kids. We must stop them! Can’t argue with that. Or can we? The problem is, we just don’t have ideal, or even reasonably good, methods for distinguishing between those good and bad teachers. And school districts that are all of a sudden facing huge budget deficits and laying off hundreds of teachers don’t retroactively have in place an evaluation system with sufficient precision to weed out the bad – nor could they. Implementing “quality-based layoffs” here and now is among the most problematic suggestions currently out there. The value-added assessment systems yet to be implemented aren’t even up to the task. I’m really confused why these pundits who have so little knowledge about this stuff are so convinced that it is just so AWESOME.

Reform Engineers: Reform engineers view this issue in purely statistical and probabilistic terms – setting legal, moral and ethical concerns aside. I can empathize with that somewhat, until I try to make it actually work in schools and until I let those moral, ethical and legal concerns creep into my head. Perhaps I’ve gone soft. I’d have been all for this no more than 5 years ago. The reform engineer assumes first that it is the test scores that we want to improve as our central objective – and only the test scores. Test scores are the be-all and end-all measure. The reform engineer is okay with the odds above because more than 50% of the time they will fire the right person. That may be good enough – statistically. And, as long as they have decent odds of replacing the low-performing teacher with at least an average teacher – each time – then the system should move gradually in a positive direction. All that matters is that we have the potential for a net positive quality effect on replacing the 3/4 of fired teachers who were correctly identified and at least breaking even on the 1/4 who were falsely fired. That’s a pretty loaded set of assumptions though. Are we really going to get the best applicants to a school district where they know they might be fired for no reason – on a 25% chance (if using 3 years of data) or a 35% chance (if using one year)? Of course, I didn’t even factor into this the number of bad teachers identified as good.
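For what it’s worth, here is the reform engineer’s bet written out as a back-of-the-envelope calculation. Every quality number in it is a hypothetical placeholder of mine, not an estimate from any study:

```python
# A toy version of the "reform engineer" bet. All quality numbers below are
# hypothetical placeholders (in SDs of teacher effectiveness), not estimates.

def expected_quality_change(false_positive_rate,
                            truly_bad_effect=-1.0,     # true effect of a correctly fired teacher
                            average_effect=0.0,        # true effect of a wrongly fired average teacher
                            replacement_effect=-0.2):  # assumed true effect of the replacement hire
    """Expected change in true teacher quality per dismissal, given the
    share of dismissals that hit the wrong (average) teacher."""
    correct = (1 - false_positive_rate) * (replacement_effect - truly_bad_effect)
    wrong = false_positive_rate * (replacement_effect - average_effect)
    return correct + wrong

for fpr in (0.25, 0.35):
    print(f"false positive rate {fpr:.0%}: "
          f"expected gain per dismissal = {expected_quality_change(fpr):+.2f} SDs")
```

Under rosy assumptions the bet comes out positive; the answer is driven almost entirely by the assumed quality of the replacement pool and the false positive rate – which is exactly the labor market question raised above.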

I guess that one could try to dismiss those moral, ethical and legal concerns regarding wrongly dismissing teachers by arguing that if it’s better for the kids in the end, then wrongly firing 1 in 4 average teachers along the way is the price we have to pay. I suspect that’s what the pundits would argue – since it’s about fairness to the kids, not fairness to the teachers, right? Still, this seems like a heavy toll to pay, an unnecessary toll, and quite honestly, one that’s not even that likely to work in the best of engineered circumstances.

========

Follow-up notes: A few comments I have received have argued from a reform engineering perspective that if we a) use the maximum number of years of data possible, and b) focus on identifying the bottom 10% or fewer of teachers, based on the analysis in the NCES/Mathematica report, we might significantly reduce our error rate – down to, say, 10% of teachers being incorrectly fired. Further, it is more likely that those incorrectly identified as failing are closer to failing anyway. That is not, however, true in all cases. This raises the interesting ethical question: what is the tolerable threshold for randomly firing the wrong teacher? Or for keeping the wrong teacher?

Further, I’d like to emphasize again that there are many problems that seriously undermine the application of value-added assessment for teacher hiring/firing decisions. This issue probably ranks about 3rd among the major problem categories. And this issue has many dimensions. First there is the statistical and measurement issue of having statistical noise result in wrongful teacher dismissal. There are also the litigation consequences that follow. There are also the questions over how the use of such methods will influence individuals thinking about pursuing teaching as a career, if pay is not substantially increased to counterbalance these new job risks. It’s not just about tweaking the statistical model and cut-points to bring the false positives into a tolerable zone. This type of shortsightedness is all too common in the types of technocratic solutions I, myself, used to favor.

Here’s a quick synopsis of the two other major issues undermining the usefulness of value-added assessment for teacher evaluation & dismissal (on the assumption that majority weight is placed on value-added assessment):

1) That students are not randomly assigned across teachers and that this non-random assignment may severely bias estimates of teacher quality. The fact that non-random assignment of students may bias estimates of teacher quality will also likely have adverse labor market effects, making it harder to get the teachers we need in the classrooms where we need them most – at least without a substantial increase to their salaries to offset the risk.

2) That only a fraction of teachers can even be evaluated this way in the best of possible cases (generally less than 20%), and even their “teacher effects” are tainted – or enhanced – by one another. As I discussed previously, this means establishing different contracts for those who will versus those who will not be evaluated by test scores, creating at least two classes of teachers in schools and likely leading to even greater tensions between them. Further, there will likely be labor market effects with certain types of teachers either jockeying for position as a VAM evaluated teacher, or avoiding those positions.

More can be found on my entire blog thread on this topic: https://schoolfinance101.wordpress.com/category/race-to-the-top/value-added-teacher-evaluation/

Negotiating Points for Teachers on Value-Added Evaluations

A short time back I posted an explanation of how using value-added student testing data could lead to a series of legal problems for school districts and states.  That post can be found here:

https://schoolfinance101.wordpress.com/2010/06/02/pondering-legal-implications-of-value-added-teacher-evaluation/

We had some interesting follow-up discussion over on www.edjurist.com.

My concerns regarding legal issues arose from statistical problems and some practical problems associated with using value-added assessment to reliably and validly measure teacher effectiveness. The main issue is to protect against wrongly firing teachers on the basis of statistical noise, or on the basis of factors that influenced value-added scores but were unrelated to teacher effectiveness.

Among other things, I pointed out problems associated with the non-random assignment of students, and how non-random assignment of students across classrooms of teachers can significantly influence – that is, bias – value-added estimates of teacher effectiveness. Non-random assignment could, under certain state policies or district contracts, lead to the “de-tenuring” and/or dismissal of a teacher simply on the basis of the students assigned to that teacher. Links to research and a more detailed explanation of the non-random assignment problem are provided in the previous post above.

Of course, this also means that school principals or superintendents – anyone with sufficient authority to influence teacher and student assignment – could intentionally stack classes against the interests of specific teachers. A principal could assign students to a teacher with the intent of harming that teacher’s value-added estimates.

To protect against this possibility, I suggest that teachers unions or individual teachers argue for language in their contracts which requires that students be randomly assigned and that class sizes be precisely the same – along with the time of day when courses are taught, lighting, room temperature, nutrition and any other possible factors that could compromise a teacher’s value-added score and could be manipulated against a teacher.

The language in the class size/random assignment clause will have to be pretty precise to guarantee that each teacher is treated fairly – in a purely statistical sense. Teachers should negotiate for a system that guarantees “comparable class size across teachers – not to deviate more than X” and that year to year student assignment to classes should be managed through a “stratified randomized lottery system with independent auditors to oversee that system.” Stratified by disability classification, poverty status, language proficiency, neighborhood context, number of books in each child’s home setting, etc. That is, each class must be equally balanced with a randomly (lottery) selected set of children by each relevant classification.  This gets out of hand really fast.
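Just to underscore how quickly this gets out of hand, here is a minimal sketch of what a “stratified randomized lottery” might look like if we stratified on only two of the characteristics above. The roster, the field names and the shares are all hypothetical:

```python
import random
from collections import defaultdict

random.seed(2)

def stratified_lottery(students, n_classes, strata_keys=("disability", "poverty")):
    """Shuffle students within each stratum (the lottery step), then deal each
    stratum round-robin across classes so every class gets an approximately
    equal share of each student type. A hypothetical sketch only."""
    classes = defaultdict(list)
    strata = defaultdict(list)
    for s in students:
        strata[tuple(s[k] for k in strata_keys)].append(s)
    for group in strata.values():
        random.shuffle(group)                       # randomize within the stratum
        for i, student in enumerate(group):
            classes[i % n_classes].append(student)  # deal evenly across classes
    return classes

# Hypothetical roster: 120 students, only two stratifying characteristics.
students = [{"id": i,
             "disability": random.random() < 0.15,
             "poverty": random.random() < 0.60}
            for i in range(120)]

for c, roster in sorted(stratified_lottery(students, n_classes=4).items()):
    print(f"class {c}: {len(roster)} students, "
          f"{sum(s['poverty'] for s in roster)} poverty, "
          f"{sum(s['disability'] for s in roster)} disability")
```

Even this toy version only roughly balances class sizes and covers just two stratifying characteristics; add language proficiency, neighborhood context, books in the home and the independent auditors, and the contract language writes itself into absurdity.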

KEEP IN MIND THAT THIS SPECIAL CONTRACT STILL APPLIES TO ONLY SOMEWHAT FEWER THAN 20% OF TEACHERS – THOSE WHO COULD EVEN REASONABLY BE LINKED TO SPECIFIC STUDENTS’ READING AND MATH ACHIEVEMENT.

I welcome suggestions for other clauses that should be included.

Just pondering the possibilities.

A recent summary of state statutes regarding teacher evaluation can be found here: http://www.ecs.org/clearinghouse/86/21/8621.pdf

See also: http://www.caldercenter.org/upload/CALDER-Research-and-Policy-Brief-9.pdf

This is a thoughtful read from a general supporter of using VA assessments to create better incentives to improve teacher quality. Read the “Policy Uses” section on pages 3-4.

Pondering Legal Implications of Value-Added Teacher Evaluation

I’m going out on a limb here. I’m a finance guy. Not a lawyer. But, I do have a reasonable background on school law thanks to colleagues in the field like Mickey Imber at U. of Kansas and my frequent coauthor Preston Green at Penn State. That said, any screw-ups in my legal analysis below are my own and not attributable to either Preston or Mickey. In any case, I’ve been wondering about the validity of the claim that some pundits seem to be making that these new teacher evaluation policies are going to make it easier and less expensive to dismiss teachers.

=====

A handful of states have now adopted legislation which mandates that teacher evaluation be linked to student test data. Specifically, legislation adopted in states like Colorado, Louisiana and Kentucky and legislation vetoed in Florida follow a template of requiring that teacher evaluation for pay increase, for retaining tenure and ultimately for dismissal must be based 50% or 51% on student “value-added” or “growth” test scores alone. That is, student test score data could make or break a salary increase decision, but could also make or break a teacher’s ability to retain tenure. Pundits backing these policies often highlight provisions for multi-year data tracking on teachers so that a teacher would not lose tenure status until he/she shows poor student growth for 2 or 3 years running. These provisions are supposed to eliminate the possibility that random error or a “bad crop of students” alone could determine a teacher’s future.

Pundits are taking the position that these new evaluation criteria will make it easier to dismiss teachers and will reduce the costs of dismissing a teacher that result from litigation. Oh, how foolish!

The way I see it, this new crop of state statutes and regulations – which mandate the arbitrary use of questionable data, applied in questionably appropriate ways – will most likely lead to a flood of litigation like none that has ever been witnessed.

Why would that be? How can a teacher possibly sue the school district for being fired because he/she was a bad teacher? Simply writing into state statute or department regulations that one’s “property interest” to tenure and continued employment must be primarily tied to student test scores does not by any stretch of the legal imagination guarantee that dismissal based on student test scores will stand up to legal challenges – good and legitimate legal challenges.

There are (at least) two very likely legal challenges that will occur once we start to experience our first rounds of teacher dismissal based on student assessment data.

Due Process Challenges

Removing a teacher’s tenure status is a denial of the teacher’s property interest, and doing so requires “due process.” That’s not an insurmountable barrier, even under typical teacher contracts that don’t require dismissal based on student test scores. Simply declaring that “a teacher will be fired if he/she shows 2 straight years of bad student test scores (growth or value-added)” and then firing a teacher on that basis does not mean that the teacher necessarily was provided due process. Under a policy requiring that 51% of the employment decision be based on student value-added test scores, a teacher could be wrongly terminated due to:

a) Temporal instability of the value-added measures

http://www.urban.org/UploadedPDF/1001266_stabilityofvalue.pdf

Ooooh… Temporal instability… what’s that supposed to mean? What it means is that teacher value-added ratings, which are averages of individual student gains, tend not to be that stable over time. The same teacher is highly likely to get a totally different value-added rating from one year to the next. The above link points to a policy brief which explains that the year-to-year correlation for a teacher’s value-added rating is only about .2 or .3. Further, most of the change in a teacher’s value-added rating from one year to the next cannot be explained by differences in observed student characteristics, peer characteristics or school characteristics. 87.5% (elementary math) to 70% (8th grade math) noise! While some statistical corrections and multi-year measures might help, it’s hard to guarantee or even be reasonably sure that a teacher wouldn’t be dismissed simply as a function of unexplainably low performance for 2 or 3 years in a row. That is, simply due to noise, and not the more troublesome issue of how students are clustered across schools, districts and classrooms.
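To put a rough number on that risk, here is a small simulation sketch of my own – the noise level and the bottom-quintile cutoff are illustrative assumptions, not figures from the brief – showing how often a teacher whose true effectiveness never changes, and is exactly average, lands below a dismissal cutoff two or three years running on noise alone:

```python
import numpy as np

rng = np.random.default_rng(3)

def consecutive_low_years(noise_sd=1.5, cutoff_pct=0.20, n_years=3, n_sims=200_000):
    """For a teacher whose TRUE effect is exactly average, simulate n_years of
    noisy ratings and return the chance of falling below the district's
    bottom-quintile cutoff 2 and 3 years in a row. The noise level (in
    teacher-SD units) and cutoff are illustrative assumptions."""
    # District-wide distribution of single-year estimates: true effects plus noise.
    cutoff = np.quantile(rng.normal(0, 1, 500_000) + rng.normal(0, noise_sd, 500_000),
                         cutoff_pct)
    ratings = rng.normal(0.0, noise_sd, (n_sims, n_years))   # average teacher, noise only
    low = ratings < cutoff
    p2 = np.mean(low[:, 0] & low[:, 1])
    p3 = np.mean(low.all(axis=1))
    return p2, p3

p2, p3 = consecutive_low_years()
print(f"average teacher below the cutoff 2 years running: {p2:.1%}")
print(f"average teacher below the cutoff 3 years running: {p3:.2%}")
print(f"out of 5,000 truly average teachers, expect ~{5000 * p2:.0f} "
      f"with 2 straight 'bad' years on noise alone")
```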

b) Non-random assignment of students

The only fair way to compare teachers’ ability to produce student value-added is to randomly assign all students, statewide, to all teachers… and then, of course, to have all students live in exactly comparable settings with exactly comparable support structures outside of school, etc., etc., etc. That’s right. We’d have to send all of our teachers and all of our students to a single boarding school location somewhere in the state and make sure, absolutely sure, that we randomly assigned students – the same number of students – to each and every teacher in the system.

Obviously, that’s not going to happen. Students are not randomly sorted and the fact that they are not has serious consequences for comparing teachers’ ability to produce student value-added. See: http://gsppi.berkeley.edu/faculty/jrothstein/published/rothstein_vam2.pdf
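The mechanism is easy to illustrate. In the stylized sketch below (all of it assumption – the “unobserved” variable just stands in for everything the model cannot see), two teachers with identical true effectiveness get very different “value-added” simply because of who walks through their doors:

```python
import numpy as np

rng = np.random.default_rng(4)

def naive_value_added(unobserved_class_mean, true_teacher_effect=0.0,
                      n_students=25, n_years=20):
    """Average student 'gain' attributed to a teacher when an unobserved
    student characteristic (shifted by non-random assignment) is omitted
    from the model. Everything here is a stylized assumption."""
    gains = []
    for _ in range(n_years):
        unobserved = rng.normal(unobserved_class_mean, 1.0, n_students)
        noise = rng.normal(0.0, 1.0, n_students)
        gain = true_teacher_effect + 0.5 * unobserved + noise  # true data-generating process
        gains.append(gain.mean())   # the naive model credits the entire mean gain to the teacher
    return np.mean(gains)

# Two equally effective teachers (true effect = 0), non-randomly assigned classes.
teacher_A = naive_value_added(unobserved_class_mean=+0.5)   # systematically advantaged students
teacher_B = naive_value_added(unobserved_class_mean=-0.5)   # systematically disadvantaged students
print(f"teacher A naive value-added: {teacher_A:+.2f}")
print(f"teacher B naive value-added: {teacher_B:+.2f}")
```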

c) Student manipulation of test results

As she travels the nation on her book tour, Diane Ravitch raises another possibility for how a teacher might find him/herself out of a job without actually being a bad teacher. As she puts it, this approach to teacher evaluation puts the teacher’s job directly in the students’ hands. And the students can, if they wish, choose to consciously abuse that responsibility. That is, the students could actually choose to bomb the state assessments to get a teacher fired, whether it’s a good teacher or a bad one. This would most certainly raise due process concerns.

d) A whole bunch of other uncontrollable stuff

A recent National Academies report noted:

“A student’s scores may be affected by many factors other than a teacher — his or her motivation, for example, or the amount of parental support — and value-added techniques have not yet found a good way to account for these other elements.”

http://www8.nationalacademies.org/onpinews/newsitem.aspx?RecordID=1278

This report generally urged caution regarding overemphasis of student value-added test scores in teacher evaluation – especially in high stakes decisions. Surely, if I were an expert witness testifying on behalf of a teacher who had been wrongly dismissed, I’d be pointing out that the National Academies said that using student assessment data in this way is not a good idea.

Title VII of the Civil Rights Act Challenges

The non-random assignment of students leads to the second likely legal claim that will flood the courts as student-testing-based teacher dismissals begin – claims of racially disparate teacher dismissal under Title VII of the Civil Rights Act of 1964. Given that students are not randomly assigned, that poor and minority – specifically black – students are densely clustered in certain schools and districts, and that black teachers are much more likely to be working in schools with classrooms of low-income black students, it is highly likely that teacher dismissals will occur in a racially disparate pattern. Black teachers of low-income black students will be several times more likely to be dismissed on the basis of poor value-added test scores. This is especially true where a fixed, rigid statewide requirement is adopted and where a teacher must be de-tenured and/or dismissed if he/she shows value-added below some fixed threshold on state assessments.

So, here’s how this one plays out. For every 1 white teacher dismissed on a value-added basis, 10 or more black teachers are dismissed – relative to the overall proportions of black and white teachers. This gives the black teachers the argument that the policy has a racially disparate effect. No, it doesn’t end there. A policy doesn’t violate Title VII merely because it has a racially disparate effect. That just starts the ball rolling – gets the argument into court.
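Here is that hypothetical as simple arithmetic. The workforce and dismissal counts are made-up numbers scaled to the 10-to-1 relative dismissal rate above; the “four-fifths” figure is the EEOC’s rough rule-of-thumb screen for adverse impact, offered only as a point of reference:

```python
# Hypothetical workforce and dismissal counts, scaled to a 10-to-1 relative
# dismissal rate. These are illustrative numbers, not data from any district.

workforce = {"black": 1_000, "white": 4_000}   # hypothetical teacher counts
dismissed = {"black": 100,   "white": 40}      # hypothetical value-added dismissals

rates = {g: dismissed[g] / workforce[g] for g in workforce}
impact_ratio = rates["white"] / rates["black"]  # less-affected group's rate over the more-affected group's

for g, r in rates.items():
    print(f"{g}: dismissal rate {r:.1%}")
print(f"impact ratio (white rate / black rate): {impact_ratio:.2f}")
print("a ratio below 0.80 fails the EEOC's rough 'four-fifths' screen for adverse impact")
```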

The state gets to defend itself – by claiming that producing value-added test scores is a legitimate part of a teacher’s job and then explaining how the use of those scores is, in fact, neutral with respect to race. It just happens to have the disparate effect. Right? But, as the state would argue, that’s a good thing because it ensures that we can put better teachers in front of these poor minority kids, and get rid of the bad ones.

But the problem is that the significant body of research on non-random assignment of students and its effect on value-added scores indicates that it’s not necessarily differences in the actual effectiveness of black versus white teachers; rather, the black teachers are concentrated in the poor black schools, and it is student clustering, not teacher effectiveness, that is driving the disparate rates of teacher dismissal. So they weren’t fired because they were measurably ineffective – they were fired because they had classrooms of poor minority students year after year? At the very least, it is statistically problematic to distill one effect from the other! As a result, it’s statistically problematic to argue that the teacher should be dismissed! There is at least equal likelihood that the teacher is wrongly dismissed as there is that the teacher is rightly dismissed. I suspect a court might be concerned by this.

Reduction in Force

Note that many of these same concerns apply to all of the recent rhetoric over teacher layoffs and the need to base those layoffs on effectiveness rather than seniority. It all sounds good, until you actually try to go into a school district of any size and identify the 100 “least effective” teachers given the current state of data for teacher evaluation. Simply writing into a reduction in force (RIF) policy a requirement of dismissal based on “effectiveness” does not instantly validate the “effectiveness” measures. And even the best “effectiveness” measures, as discussed above, remain really problematic, providing tenured teachers who are laid off on grounds of “ineffectiveness” multiple options for legal action.

Additional Concerns

These two legal arguments ignore the fact that school districts and states will have to establish two separate types of contracts for teachers to begin with, since even in the best of statistical cases, only about 1/5 of teachers (those directly responsible for teaching math or reading in grades three through eight) might possibly be evaluated via student test scores (see: https://schoolfinance101.wordpress.com/2009/12/04/pondering-the-usefulness-of-value-added-assessment-of-teachers/)
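Here is a rough, purely hypothetical head-count of why that fraction is so small; the staffing shares below are placeholders, not actual district data:

```python
# Back-of-the-envelope share of teachers who could even get a value-added score.
# All shares below are hypothetical placeholders for a typical K-12 district.

total_teachers = 1_000
share_in_grades_4_to_8 = 0.35            # grades with a prior-year test available for growth
share_teaching_math_or_ela = 0.60        # of those, the share teaching the tested subjects
share_cleanly_linkable = 0.85            # of those, the share cleanly linkable to their own students

evaluable = (total_teachers * share_in_grades_4_to_8
             * share_teaching_math_or_ela * share_cleanly_linkable)
print(f"teachers with a usable value-added score: ~{evaluable:.0f} of {total_teachers} "
      f"({evaluable / total_teachers:.0%})")
```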

I’ve written previously about the technical concerns over value-added assessment of teachers and my concern that pundits are seemingly completely ignorant of the statistical issues. I’m also baffled that few others in the current policy discussion seem even remotely aware of just how few teachers might – in the best possible case – be evaluated via student test scores, and the need for separate contracts. But, I am perhaps most perplexed that no one seems to be acknowledging the massive legal mess likely to ensue when (or if) these poorly conceived policies are put into action.

I’ll save for another day the discussion of just who will be waiting in line to fill those teaching vacancies created by rigid use of test scores for disproportionately dismissing teachers in poor urban schools. Will they, on average, be better or perhaps worse than those displaced before them? Just who will wait in this line to be unfairly judged?

For a related article on the use of certification exams for credentialing teachers, see:

Green, P.C., Sireci, S.G. (2005) Legal and Psychometric Criteria for Evaluating Teacher Certification Tests.  Educational Measurement: Issues and Practice. Volume 19 Issue 1, Pages 22 – 31