LA Times Study: Asian math teachers better than Black ones

The big news over the weekend involved the LA Times posting of value-added ratings of LA public school teachers.

Here’s how the Times spun their methodology:

Seeking to shed light on the problem, The Times obtained seven years of math and English test scores from the Los Angeles Unified School District and used the information to estimate the effectiveness of L.A. teachers — something the district could do but has not.

The Times used a statistical approach known as value-added analysis, which rates teachers based on their students’ progress on standardized tests from year to year. Each student’s performance is compared with his or her own in past years, which largely controls for outside influences often blamed for academic failure: poverty, prior learning and other factors.

This spin immediately concerned me, because it appears to assume that simply using a student’s prior score erases, or controls for, any and all differences among students in family background, as well as classroom-level differences – who attends school with whom.

Thankfully (thanks to the immediate investigative work of Sherman Dorn), the analysis was at least marginally better than that, and it was conducted by a very technically proficient RAND researcher named Richard Buddin. Here’s his technical report:

The problem is that even someone as good as Buddin can only work with the data he has. And there are at least three major shortcomings of the data that Buddin appeared to have available for his value-added models. I’m setting aside here the potential quality of the achievement measures themselves. Calculating (estimating) a teacher’s effect on their students’ learning – and, specifically, identifying differences across teachers where students are not randomly assigned (with the same class size, comparable peer group, same air quality, lighting, materials, supplies, etc.) – requires that we do a pretty damn good job of accounting for the measurable differences across the children assigned to teachers. This is especially true if our plan is to post names on the wall (or web)!
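For readers who want to see what such a model looks like on paper, here is a minimal, generic sketch of a value-added specification – an illustration of the general approach, not necessarily Buddin’s exact model:

```latex
% Generic value-added specification (illustration only, not necessarily Buddin's exact model)
% A_{ijt} = achievement of student i, assigned to teacher j, in year t
% X_{it}  = observed student characteristics (here, little more than Title I and LEP indicators)
% \theta_j = the estimated "teacher effect" that becomes the published rating
% \varepsilon_{ijt} = everything the model does not capture
A_{ijt} = \beta A_{i,t-1} + \gamma' X_{it} + \theta_j + \varepsilon_{ijt}
```

Everything that follows is about what ends up in that error term: if unmeasured differences among students or classrooms are correlated with which teacher a child is assigned to, those differences get loaded into the estimated teacher effect.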

Here’s my quick-read short list of shortcomings in Buddin’s data that, I suspect, lead to significant problems in precisely determining differences in quality across teachers who serve very different students:

  1. While Buddin’s analysis includes student characteristics that may (and in fact appear to) influence student gains, Buddin – likely due to data limitations – includes only a simple classification variable for whether a student is a Title I student or not, and a simple classification variable for whether a student is limited in English proficiency. These measures are woefully insufficient for a model being used to label teachers on a website as good or bad. Buddin notes that 97% of children in the lowest performing schools are poor, and 55% in higher performing schools are poor. Identifying children simply as poor or not poor misses entirely the variation among the poor to very poor children in LA public schools – which is most of the variation in family background in LA public schools. That is, the estimated model does not distinguish at all between one teacher with a class of children who barely qualify for Title I programs and another with a classroom of children from destitute, homeless families or multigenerational poverty. I suspect Buddin himself would have liked more detailed information. But you can only use what you’ve got. When you do, however, you need to be very clear about the shortcomings. Again, most kids in LA public schools are poor and the gradients of poverty are substantial. Those gradients are neglected entirely. Further, the model includes no “classroom” factors such as class size or peer group composition (either a Hoxby-style measure of average peer ability or the racial composition of the peer group, as in Hanushek and Rivkin). Then again, it’s nearly if not entirely impossible to fully correct for classroom-level factors in these models.
  2. It would appear that Buddin’s analysis uses annual testing data, not fall-spring assessments. This means that the year-to-year gains interpreted as “teacher effects” include summer learning and/or summer learning lag. That is, we are assigning blame or praise to teachers based in part on what kids learned, or lost, over the summer. If this is true of the models, it is deeply problematic. Okay, you say, but Buddin accounted for whether a student was a Title I student, and summer opportunities are highly associated with poverty status. But, as I note above, this very crude indicator is far from sufficient to differentiate among most LA public school students.
  3. Finally, researchers like Jesse Rothstein, among others, have suggested that having multiple years of prior scores on students can significantly reduce the influence of non-random assignment of students to teachers on the ratings of teachers. Rothstein speaks of using three years of lagged scores (http://gsppi.berkeley.edu/faculty/jrothstein/published/rothstein_vam2.pdf) so as to sufficiently characterize the learning trajectories of students entering any given teacher’s class. It does not appear that Buddin’s analysis includes multiple lagged scores. (A rough sketch of what a specification with multiple lags and richer controls might look like follows this list.)
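Here, purely for illustration, is a minimal sketch of the kind of specification the points above argue for – multiple lagged scores, gradations of poverty and English proficiency, and some classroom-level measures. The data file and every column name are hypothetical, and this is emphatically not Buddin’s actual model:

```python
# Illustrative sketch only -- hypothetical data and column names, not Buddin's model.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("student_year_panel.csv")  # hypothetical student-by-year panel

vam = smf.ols(
    "score ~ score_lag1 + score_lag2 + score_lag3"     # multiple lagged scores (Rothstein)
    " + poverty_depth + lep_level"                     # gradations of poverty and English proficiency
    " + class_size + peer_mean_lag_score"              # classroom-level factors
    " + C(teacher_id)",                                # teacher fixed effects = the 'value-added' ratings
    data=df,
).fit()

# The coefficients on the teacher indicators are what would get published as "teacher effects."
teacher_effects = {k: v for k, v in vam.params.items() if k.startswith("C(teacher_id)")}
```

Even a specification like this leaves plenty in the error term; the point is simply how far the crude Title I and LEP indicators are from it.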

So then what are some possible effects of these problems, where might we notice them, and why might they be problematic?

One important effect, which I’ve blogged about previously, is that the value-added teacher ratings could be substantially biased by the non-random sorting of students – or in more human terms – teachers of children having characteristics not addressed by the models could be unfairly penalized, or for that matter, unfairly benefited.

Buddin is kind enough in his technical paper to provide various teacher characteristics and student characteristics that are associated with the teacher value-added effects – that is, what kinds of teachers are good, and which ones are more likely to suck? Buddin shows some of the usual suspects, like the fact that novice (first 3 years) teachers tended to have lower average value-added scores. Now, this might be reasonable if we also knew that novice teachers weren’t disproportionately clustered with the poorest students in the district. But we don’t know that.

Strangely, Buddin also shows us that the number of gifted children a teacher has affects their value-added estimate – the more gifted children you have, the better teacher you are??? That seems a bit problematic, and raises the question of why “gifted” was not used as a control measure in the value-added ratings. Statistically, including it could be problematic if giftedness is defined by the outcome measure – test scores – making it endogenous. Nonetheless, the finding that having more gifted children is associated with a teacher’s effectiveness rating raises at least some concern over that pesky little non-random assignment issue.

Now here’s the fun, and most problematic part:

Buddin finds that black teachers have lower value-added scores for both ELA and MATH. Further, these are some of the largest negative effects in the second level analysis – especially for MATH. The interpretation here (for parent readers of the LA Times web site) is that having a black teacher for math is worse than having a novice teacher. In fact, it’s the worst possible thing! Having a black teacher for ELA is comparable to having a novice teacher.

Buddin also finds that having more black students in your class is negatively associated with teachers’ value-added scores, but writes off the effect as small. Teachers of black students in LA are simply worse? There is NO discussion of the potentially significant overlap among black teachers, novice teachers, and teachers serving black students concentrated in predominantly black schools (as addressed by Hanushek and Rivkin in the link above).

By contrast, Buddin finds that having an Asian teacher is much, much better for MATH. In fact, Asian teachers are as much better (than white teachers) for math as black teachers are worse! Parents – go find yourself an Asian math teacher in LA? Also, having more Asian students in your class is associated with higher teacher ratings for Math. That is, you’re a better math teacher if you’ve got more Asian students, and you’re a really good math teacher if you’re Asian and have more Asian students?????

Talk about some nifty statistical stereotyping.

It makes me wonder if there might also be some racial disparity in the “gifted” classification variable, with more Asian students and fewer black students district-wide being classified as “gifted.”

IS ANYONE SEEING THE PROBLEM HERE? Should we really be considering using this information to either guide parent selection of teachers or to decide which teachers get fired?

I discussed the link between non-random assignment and racially disparate effects previously here:

https://schoolfinance101.wordpress.com/2010/06/02/pondering-legal-implications-of-value-added-teacher-evaluation/

Indeed there may be some substantive differences in the average academic (undergraduate & high school) preparation in math of black and Asian teachers in LA. And these differences may translate into real differences in the effectiveness of math teaching. But sadly, we’re not having that conversation here. Rather, the LA Times is putting out a database, built on a model with insufficient underlying controls, that produces these potentially seriously biased results.

While some of these statistically significant effects might be “small” across the entire population of teachers in LA, the likelihood that these “biases” significantly affect specific individual teachers’ value-added ratings is much greater – and that’s what’s so offensive about the use of this information by the LA Times. The “best possible,” still questionable, models estimated here are not being used to draw simple, aggregate conclusions about the degree of variance across schools and classrooms; rather, they are being used to label individual cases from a large data set as “good” or “bad.” That is entirely inappropriate!

Note: On Kane and Staiger versus Rothstein and non-random assignment

Finally, a comment on references to two different studies on the influence of non-random assignment. Those wishing to write off the problems of non-random assignment typically refer to Kane and Staiger’s analysis using a relatively small, randomized sample. Those wishing to raise concerns over non-random assignment typically refer to Jesse Rothstein’s work. Eric Hanushek, in an exceptional overview article on value-added assessment, summarizes these two articles and his own work as follows:

An alternative approach of Kane and Staiger (2008) of using estimates from a random assignment of teachers to classrooms finds little bias in traditional estimation, although the possible uniqueness of the sample and the limitations of the specification test suggest care in interpretation of the results.

A compelling part of the analysis in Rothstein (2010) is the development of falsification tests, where future teachers are shown to have significant effects on current achievement. Although this could be driven in part by subsequent year classroom placement based on current achievement, the analysis suggests the presence of additional unobserved differences.

In related work, Hanushek and Rivkin (2010) use alternative, albeit imperfect, methods for judging which schools systematically sort students in a large Texas district. In the “sorted” samples, where random classroom assignment is rejected, this falsification test performs like that in North Carolina, but this is not the case in the remaining “unsorted” sample where random assignment is not rejected.

http://edpro.stanford.edu/hanushek/admin/pages/files/uploads/HanushekRivkin%20AEA2010.CALDER.pdf

Video Blog: Thoughts on School Funding and the Big Race

Filmed back in the Spring of 2010.

Produced by the National Education Policy Center

http://www.youtube.com/user/NEPCVideos

Rolling Dice: If I roll a “6” you’re fired!

Okay… Picture this… I’m rolling dice… and each time I roll a “6” some loud-mouthed, tweet-happy pundit who just loves value-added assessment for teachers gets fired. Sound fair? It might happen to someone who sucks at their job… or it might just be someone who is rather average. Doesn’t matter. They lost on the roll of the dice. A 1 in 6 chance. Not that bad. A 5 in 6 chance of keeping their job. Can’t you live with that?

This report was just released the other day from the National Center for Education Statistics:

http://ies.ed.gov/ncee/pubs/20104004/pdf/20104004.pdf

The report carries out a series of statistical tests to determine the identification “error” rates for “bad teachers” when using typical value added statistical methods. Here’s a synopsis of the findings from the report itself:

Type I and II error rates for comparing a teacher’s performance to the average are likely to be about 25 percent with three years of data and 35 percent with one year of data. Corresponding error rates for overall false positive and negative errors are 10 and 20 percent, respectively.

Where:

Type I error rate (α) is the probability that based on c years of data, the hypothesis test will find that a truly average teacher (such as Teacher 4) performed significantly worse than average. (p. 12)

So, that means that there is about a 25% chance (if using three years of data) or a 35% chance (if using one year of data) that a teacher who is “average” would be identified as “significantly worse than average” and potentially be fired. So, what I really need are some 4-sided dice. I gave the pundits odds that are too good! Admittedly, this is the likelihood of identifying an “average” teacher as well below average. The likelihood of identifying an above-average teacher as below average would be lower. Here’s the relevant definition of a “false positive” error rate from the study:

the false positive error rate, FPR(q), is the probability that a teacher (such as Teacher 5) whose true performance level is q SDs above average is falsely identified for special assistance. (p. 12)

From the first quote above, even this occurs 1 in 10 times (given three years of data, and 2 in 10 times given only one year). And here’s the definition of the “false negative” error rate:

false negative error rate is the probability that the hypothesis test will fail to identify teachers (such as Teachers 1 and 2 in Figure 2.1) whose true performance is at least T SDs below average.

…which also occurs 1 in 10 times (given three years of data and 2 in 10 given only one year).
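To make these error rates concrete, here is a minimal simulation of a ranking-based system, with every parameter invented for illustration (it is not the NCES/Mathematica calculation): teachers have true effects, we only observe noisy estimates, and we flag the bottom 10 percent of the estimates.

```python
# Minimal illustration with invented parameters -- NOT the NCES/Mathematica computation.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_sd = 0.15                   # assumed spread of true teacher effects (student SD units)
noise_sd = 0.25 / np.sqrt(3)     # assumed noise in a 3-year average estimate

true_effect = rng.normal(0.0, true_sd, n)
estimate = true_effect + rng.normal(0.0, noise_sd, n)

flagged = estimate < np.quantile(estimate, 0.10)            # flag the bottom 10% of estimates
truly_bottom = true_effect < np.quantile(true_effect, 0.10)

false_positive = flagged & ~truly_bottom   # flagged, but not actually in the true bottom 10%
false_negative = ~flagged & truly_bottom   # truly bottom 10%, but never flagged

print(f"Flagged teachers who are not truly bottom-10%: {false_positive.sum() / flagged.sum():.1%}")
print(f"Truly bottom-10% teachers who escape the flag: {false_negative.sum() / truly_bottom.sum():.1%}")
```

Change the assumed noise and the shares move around, but some share of the flagged group is always there by bad luck rather than bad teaching – which is the report’s point.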

These concerns are not new. In a previous post, I discuss various problems with using value added measures for identifying good and bad teachers, such as temporal instability: http://www.urban.org/UploadedPDF/1001266_stabilityofvalue.pdf.

The introduction of this new report notes:

Existing research has consistently found that teacher- and school-level averages of student test score gains can be unstable over time. Studies have found only moderate year-to-year correlations—ranging from 0.2 to 0.6—in the value-added estimates of individual teachers (McCaffrey et al. 2009; Goldhaber and Hansen 2008) or small to medium-sized school grade-level teams (Kane and Staiger 2002b). As a result, there are significant annual changes in teacher rankings based on value-added estimates.

In my first post on this topic (and subsequent ones), I point out that the National Academies have already cautioned that:

“A student’s scores may be affected by many factors other than a teacher — his or her motivation, for example, or the amount of parental support — and value-added techniques have not yet found a good way to account for these other elements.”

http://www8.nationalacademies.org/onpinews/newsitem.aspx?RecordID=1278

And again, this new report provides a laundry list of factors that affect value-added assessment beyond the scope of the analysis itself:

However, several other features of value-added estimators that have been analyzed in the literature also have important implications for the appropriate use of value-added modeling in performance measurement. These features include the extent of estimator bias (Kane and Staiger 2008; Rothstein 2010; Koedel and Betts 2009), the scaling of test scores used in the estimates (Ballou 2009; Briggs and Weeks 2009), the degree to which the estimates reflect students’ future benefits from their current teachers’ instruction (Jacob et al. 2008), the appropriate reference point from which to compare the magnitude of estimation errors (Rogosa 2005), the association between value-added estimates and other measures of teacher quality (Rockoff et al. 2008; Jacob and Lefgren 2008), and the presence of spillover effects between teachers (Jackson and Bruegmann 2009).

In my opinion, the most significant problem here is non-random assignment. The noise problem is real and important, but it is far less serious than the non-random assignment problem; it just happens to be the topic of the day.

But alas, we continue to move forward… full steam ahead.

As I see it there are two groups of characters pitching fast-track adoption of value-added teacher evaluation policies.

Statistically Inept Pundits (who really don’t care anyway): The statistically inept pundits are those we see on Twitter every day, applauding the mass firing of DC teachers, praising the Colorado teacher evaluation bill and thinking that RttT is just AWESOME, regardless of the mixed (at best) evidence behind the reforms promoted by RttT (like value-added teacher assessment). My take is that they have no idea what any of this means… have little capacity to understand it anyway… and probably don’t much care. To them, I’m just a curmudgeonly academic throwing a wet blanket on their teacher-bashing party. After all, who but a wet blanket could really be against making sure all kids have good teachers… making sure that we fire and/or lay off the bad teachers, not just the inexperienced ones? These teachers are dangerous, after all. They are hurting kids. We must stop them! Can’t argue with that. Or can we? The problem is, we just don’t have ideal, or even reasonably good, methods for distinguishing between those good and bad teachers. And school districts that are all of a sudden facing huge budget deficits and laying off hundreds of teachers don’t retroactively have in place an evaluation system with sufficient precision to weed out the bad – nor could they. Implementing “quality-based layoffs” here and now is among the most problematic suggestions currently out there. The value-added assessment systems yet to be implemented aren’t even up to the task. I’m really confused why these pundits who have so little knowledge about this stuff are so convinced that it is just so AWESOME.

Reform Engineers: Reform engineers view this issue in purely statistical and probabilistic terms – setting legal, moral and ethical concerns aside. I can empathize with that somewhat, until I try to make it actually work in schools and until I let those moral, ethical and legal concerns creep into my head. Perhaps I’ve gone soft. I’d have been all for this no more than 5 years ago. The reform engineer assumes first that it is the test scores that we want to improve as our central objective – and only the test scores. Test scores are the be-all and end-all measure. The reform engineer is okay with the odds above because more than 50% of the time they will fire the right person. That may be good enough – statistically. And, as long as they have decent odds of replacing the low-performing teacher with at least an average teacher – each time – then the system should move gradually in a positive direction. All that matters is that we have the potential for a net positive quality effect from replacing the 3/4 of fired teachers who were correctly identified, and at least break even on the 1/4 who were falsely fired. That’s a pretty loaded set of assumptions though. Are we really going to get the best applicants to a school district where they know they might be fired for no reason on a 25% chance (if using 3 years of data) or a 35% chance (if using one year)? Of course, I haven’t even factored in the number of bad teachers identified as good.
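Spelled out, the reform engineer’s back-of-the-envelope logic is just an expected-value calculation. Here is a tiny version of it, with every number invented purely for illustration:

```python
# The reform engineer's arithmetic, with invented numbers purely for illustration.
p_correct = 0.75       # assumed probability the fired teacher really was below average
gain_if_right = 0.10   # assumed quality gain from replacing a truly weak teacher (student SD units)
gain_if_wrong = 0.00   # engineer's assumption: replacing an average teacher is a wash
churn_cost = 0.02      # assumed cost of the turnover itself (disruption, hiring, induction)

expected_gain = p_correct * gain_if_right + (1 - p_correct) * gain_if_wrong - churn_cost
print(f"Expected quality change per dismissal: {expected_gain:+.3f} student SDs")
```

Whether that number comes out positive depends entirely on assumptions the engineer cannot verify – and it leaves out the labor-market, moral, and legal costs discussed throughout this post.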

I guess that one could try to dismiss those moral, ethical and legal concerns regarding wrongly dismissing teachers by arguing that if it’s better for the kids in the end, then wrongly firing 1 in 4 average teachers along the way is the price we have to pay. I suspect that’s what the pundits would argue – since it’s about fairness to the kids, not fairness to the teachers, right? Still, this seems like a heavy toll to pay, an unnecessary toll, and quite honestly, one that’s not even that likely to work in the best of engineered circumstances.

========

Follow-up notes: A few comments I have received have argued from a reform engineering perspective that if we a) use the maximum number of years of data possible and b) focus on identifying only the bottom 10% or fewer of teachers, then, based on the analysis in the NCES/Mathematica report, we might significantly reduce our error rate – down to, say, 10% of teachers being incorrectly fired. Further, it is more likely that those incorrectly identified as failing are closer to failing anyway. That is not, however, true in all cases. This raises the interesting ethical question: what is the tolerable threshold for randomly firing the wrong teacher? Or for keeping the wrong teacher?

Further, I’d like to emphasize again that there are many problems that seriously undermine the application of value-added assessment to teacher hiring/firing decisions. This issue probably ranks about 3rd among the major problem categories, and it has many dimensions. First, there is the statistical and measurement issue of statistical noise resulting in wrongful teacher dismissal. Then there are the litigation consequences that follow, and the questions over how the use of such methods will influence individuals thinking about pursuing teaching as a career if pay is not substantially increased to counterbalance these new job risks. It’s not just about tweaking the statistical model and cut-points to bring the false positives into a tolerable zone. This type of shortsightedness is all too common in the types of technocratic solutions I, myself, used to favor.

Here’s a quick synopsis of the two other  major issues undermining the usefulness of value-added assessment for teacher evaluation & dismissal (on the assumption that majority weight is placed on value-added assessment):

1) Students are not randomly assigned across teachers, and this non-random assignment may severely bias estimates of teacher quality. That bias will also likely have adverse labor market effects, making it harder to get the teachers we need into the classrooms where we need them most – at least without a substantial increase in salaries to offset the risk.

2) Only a fraction of teachers can even be evaluated this way in the best of cases (generally less than 20%), and even their “teacher effects” are tainted – or enhanced – by one another. As I discussed previously, this means establishing different contracts for those who will versus those who will not be evaluated by test scores, creating at least two classes of teachers in schools and likely leading to even greater tensions between them. Further, there will likely be labor market effects, with certain types of teachers either jockeying for position as VAM-evaluated teachers or avoiding those positions.

More can be found on my entire blog thread on this topic: https://schoolfinance101.wordpress.com/category/race-to-the-top/value-added-teacher-evaluation/

Negotiating Points for Teachers on Value-Added Evaluations

A short time back I posted an explanation of how using value-added student testing data could lead to a series of legal problems for school districts and states.  That post can be found here:

https://schoolfinance101.wordpress.com/2010/06/02/pondering-legal-implications-of-value-added-teacher-evaluation/

We had some interesting follow-up discussion over on www.edjurist.com.

My concerns regarding legal issues arose from statistical problems and some practical problems associated with using value-added assessment to reliably and validly measure teacher effectiveness. The main issue is to protect against wrongly firing teachers on the basis of statistical noise, or on the basis of factors that influenced the value-added scores that were not related to teacher effectiveness.

Among other things, I pointed out problems associated with the non-random assignment of students, and how non-random assignment of students across teachers’ classrooms can significantly influence – that is, bias – value-added estimates of teacher effectiveness. Non-random assignment could, under certain state policies or district contracts, lead to the “de-tenuring” and/or dismissal of a teacher simply on the basis of the students assigned to that teacher. Links to research and a more detailed explanation of the non-random assignment problem are provided in the previous post above.

Of course, this also means that school principals or superintendents – anyone with sufficient authority to influence teacher and student assignment – could intentionally stack classes against the interest  of specific teachers. A principal could assign students to a teacher with the intent of harming that teacher’s value-added estimates.

To protect against this possibility, I suggest that teachers’ unions or individual teachers argue for language in their contracts which requires that students be randomly assigned and that class sizes be precisely the same – along with the time of day when courses are taught, lighting, room temperature, nutrition and any other possible factors that could compromise a teacher’s value-added score and could be manipulated against a teacher.

The language in the class size/random assignment clause will have to be pretty precise to guarantee that each teacher is treated fairly – in a purely statistical sense. Teachers should negotiate for a system that guarantees “comparable class size across teachers – not to deviate more than X” and that year-to-year student assignment to classes be managed through a “stratified randomized lottery system with independent auditors to oversee that system.” Stratified by disability classification, poverty status, language proficiency, neighborhood context, number of books in each child’s home, etc. That is, each class must be equally balanced with a randomly (lottery) selected set of children from each relevant classification. This gets out of hand really fast.
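To see how quickly it gets out of hand, here is a minimal sketch of what even a toy version of that lottery might look like, with only three strata and every data field hypothetical:

```python
# Toy sketch of a stratified random lottery for class assignment.
# All column names and the roster file are hypothetical; a real contract would
# require far more strata -- which is exactly why this gets out of hand.
import pandas as pd

def stratified_lottery(students: pd.DataFrame, n_classes: int, seed: int = 0) -> pd.DataFrame:
    """Within each stratum, shuffle students and deal them out round-robin to classes."""
    assigned = []
    for _, stratum in students.groupby(["disability", "poverty_band", "lep_status"]):
        shuffled = stratum.sample(frac=1.0, random_state=seed)
        shuffled = shuffled.assign(class_id=[i % n_classes for i in range(len(shuffled))])
        assigned.append(shuffled)
    return pd.concat(assigned)

# Hypothetical usage:
# roster = pd.read_csv("roster.csv")
# assignments = stratified_lottery(roster, n_classes=4)
```

Even this toy version quietly assumes that equal class sizes fall out cleanly, and it ignores mid-year mobility, parent requests, scheduling, and everything else a real school has to juggle.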

KEEP IN MIND THAT THIS SPECIAL CONTRACT STILL APPLIES TO ONLY SOMEWHAT FEWER THAN 20% OF TEACHERS – THOSE WHO COULD EVEN REASONABLY BE LINKED TO SPECIFIC STUDENTS’ READING AND MATH ACHIEVEMENT.

I welcome suggestions for other clauses that should be included.

Just pondering the possibilities.
A recent summary of state statutes regarding teacher evaluation can be found here: http://www.ecs.org/clearinghouse/86/21/8621.pdf

See also: http://www.caldercenter.org/upload/CALDER-Research-and-Policy-Brief-9.pdf

This is a thoughtful read from a general supporter of using VA assessments to create better incentives to improve teacher quality. Read the “Policy Uses” section on pages 3-4.

Pondering Legal Implications of Value-Added Teacher Evaluation

I’m going out on a limb here. I’m a finance guy. Not a lawyer. But, I do have a reasonable background on school law thanks to colleagues in the field like Mickey Imber at U. of Kansas and my frequent coauthor Preston Green at Penn State. That said, any screw ups in my legal analysis below are my own and not attributable to either Preston or Mickey. In any case, I’ve been wondering about the validity of the claim that some pundits seem to be making that these new teacher evaluation policies are going to make it easier and less expensive to dismiss teachers.

=====

A handful of states have now adopted legislation which mandates that teacher evaluation be linked to student test data. Specifically, legislation adopted in states like Colorado, Louisiana and Kentucky, and legislation vetoed in Florida, follows a template requiring that teacher evaluation for pay increases, for retaining tenure and ultimately for dismissal be based 50% or 51% on student “value-added” or “growth” test scores alone. That is, student test score data could make or break a salary increase decision, but could also make or break a teacher’s ability to retain tenure. Pundits backing these policies often highlight provisions for multi-year data tracking on teachers, so that a teacher would not lose tenure status until he/she shows poor student growth for 2 or 3 years running. These provisions are supposed to eliminate the possibility that random error or a “bad crop of students” alone could determine a teacher’s future.

Pundits are taking the position that these new evaluation criteria will make it easier to dismiss teachers and will reduce the costs of dismissing a teacher that result from litigation. Oh, how foolish!

The way I see it, this new crop of state statutes and regulations – which mandate arbitrary use of questionable data, applied in questionably appropriate ways – will most likely lead to a flood of litigation like none that has ever been witnessed.

Why would that be? How can a teacher possibly sue the school district for being fired because he/she was a bad teacher? Simply writing into state statute or department regulations that one’s “property interest” in tenure and continued employment must be primarily tied to student test scores does not by any stretch of the legal imagination guarantee that dismissal based on student test scores will stand up to legal challenges – good and legitimate legal challenges.

There are (at least) two very likely legal challenges that will occur once we start to experience our first rounds of teacher dismissal based on student assessment data.

Due Process Challenges

Removing a teacher’s tenure status is a denial of the teacher’s property interest, and doing so requires “due process.” That’s not an insurmountable barrier, even under typical teacher contracts that don’t require dismissal based on student test scores. But simply declaring that “a teacher will be fired if he/she shows 2 straight years of bad student test scores (growth or value-added)” and then firing a teacher for as much does not mean that the teacher was necessarily provided due process. Under a policy requiring that 51% of the employment decision be based on student value-added test scores, a teacher could be wrongly terminated due to:

a) Temporal instability of the value-added measures

http://www.urban.org/UploadedPDF/1001266_stabilityofvalue.pdf

Ooooh… temporal instability… what’s that supposed to mean? What it means is that teacher value-added ratings, which are averages of individual student gains, tend not to be that stable over time. The same teacher is highly likely to get a totally different value-added rating from one year to the next. The above link points to a policy brief which explains that the year-to-year correlation for a teacher’s value-added rating is only about .2 or .3. Further, most of the change in a teacher’s value-added rating from one year to the next is unexplainable – not explained by differences in observed student, peer or school characteristics. That’s 87.5% (elementary math) to 70% (8th grade math) noise! While some statistical corrections and multi-year measures might help, it’s hard to guarantee, or even be reasonably sure, that a teacher wouldn’t be dismissed simply as a function of unexplainable low performance for 2 or 3 years in a row. That is, simply due to noise, and not the more troublesome issue of how students are clustered across schools, districts and classrooms.
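Here is a minimal simulation of what a year-to-year correlation of about .3 means in practice – assumed numbers, not the brief’s actual data: draw two years of ratings correlated at .3 and see how often a teacher in the bottom fifth in year one is still in the bottom fifth in year two.

```python
# Illustration only (assumed correlation, not actual district data):
# how stable are bottom-quintile ratings when the year-to-year correlation is ~0.3?
import numpy as np

rng = np.random.default_rng(0)
n, r = 100_000, 0.3

year1 = rng.normal(size=n)
year2 = r * year1 + np.sqrt(1 - r**2) * rng.normal(size=n)   # ratings correlated at r

bottom1 = year1 < np.quantile(year1, 0.20)
bottom2 = year2 < np.quantile(year2, 0.20)

stay_rate = (bottom1 & bottom2).sum() / bottom1.sum()
print(f"Bottom-quintile teachers in year 1 still bottom-quintile in year 2: {stay_rate:.0%}")
```

Run it and most of the year-one bottom quintile has moved out of the bottom quintile by year two – which is exactly why a “2 or 3 bad years in a row” rule is not the safeguard it sounds like.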

b) Non-random assignment of students

The only fair way to compare teachers’ ability to produce student value-added is to randomly assign all students, statewide, to all teachers… and then, of course, to have all students live in exactly comparable settings with exactly comparable support structures outside of school, etc., etc., etc. That’s right. We’d have to send all of our teachers and all of our students to a single boarding school somewhere in the state and make sure, absolutely sure, that we randomly assigned students – the same number of students – to each and every teacher in the system.

Obviously, that’s not going to happen. Students are not randomly sorted and the fact that they are not has serious consequences for comparing teachers’ ability to produce student value-added. See: http://gsppi.berkeley.edu/faculty/jrothstein/published/rothstein_vam2.pdf

c) Student manipulation of test results

As she travels the nation on her book tour, Diane Ravitch raises another possibility for how a teacher might find him/herself out of a job through no real fault of actual bad teaching. As she puts it, this approach to teacher evaluation puts the teacher’s job directly in the students’ hands. And the students can, if they wish, choose to consciously abuse that responsibility. That is, the students could actually choose to bomb the state assessments to get a teacher fired, whether it’s a good teacher or a bad one. This would most certainly raise due process concerns.

d) A whole bunch of other uncontrollable stuff

A recent National Academies report noted:

“A student’s scores may be affected by many factors other than a teacher — his or her motivation, for example, or the amount of parental support — and value-added techniques have not yet found a good way to account for these other elements.”

http://www8.nationalacademies.org/onpinews/newsitem.aspx?RecordID=1278

This report generally urged caution regarding overemphasis on student value-added test scores in teacher evaluation – especially in high-stakes decisions. Surely, if I were an expert witness testifying on behalf of a teacher who had been wrongly dismissed, I’d be pointing out that the National Academies said that using student assessment data in this way is not a good idea.

Title VII of the Civil Rights Act Challenges

The non-random assignment of students leads to the second likely legal claim that will flood the courts as test-based teacher dismissals begin – claims of racially disparate teacher dismissal under Title VII of the Civil Rights Act of 1964. Given that students are not randomly assigned, that poor and minority – specifically black – students are densely clustered in certain schools and districts, and that black teachers are much more likely to be working in schools with classrooms of low-income black students, it is highly likely that teacher dismissals will occur in a racially disparate pattern. Black teachers of low-income black students will be several times more likely to be dismissed on the basis of poor value-added test scores. This is especially true where a statewide, fixed, rigid requirement is adopted and where a teacher must be de-tenured and/or dismissed if he/she shows value-added below some fixed threshold on state assessments.

So, here’s how this one plays out. For every 1 white teacher dismissed on a value-added basis, 10 or more black teachers are dismissed – relative to the overall proportions of black and white teachers. This gives the black teachers the argument that the policy has a racially disparate effect. No, it doesn’t end there. A policy doesn’t violate Title VII merely because it has a racially disparate effect. That just starts the ball rolling – gets the argument into court.
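To be concrete about what “relative to the overall proportions” means, here is the simple rate comparison a plaintiff’s expert would put on a slide – every number below is invented purely for illustration:

```python
# Hypothetical numbers, purely to illustrate a disparate-impact rate comparison.
black_teachers, black_dismissed = 2_000, 100
white_teachers, white_dismissed = 8_000, 40

black_rate = black_dismissed / black_teachers   # 5.0% dismissal rate
white_rate = white_dismissed / white_teachers   # 0.5% dismissal rate
print(f"Relative dismissal rate (black vs. white teachers): {black_rate / white_rate:.0f}x")
```

A ratio like that is what gets the disparate-effect argument into court; the fight is then over whether the measure behind it actually reflects the job.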

The state gets to defend itself – by claiming that producing value-added test scores is a legitimate part of a teacher’s job and then explaining how the use of those scores is, in fact, neutral with respect to race. It just happens to have a disparate effect. Right? But, as the state would argue, that’s a good thing because it ensures that we can put better teachers in front of these poor minority kids, and get rid of the bad ones.

But the problem is that the significant body of research on non-random assignment of students and its effect on value-added scores indicates that it’s not necessarily differences in the actual effectiveness of black versus white teachers at work; rather, black teachers are concentrated in poor black schools, and student clustering – not teacher effectiveness – is what leads to the disparate rates of teacher dismissal. So they weren’t fired because they were precisely, measurably ineffective; they were fired because they had classrooms of poor minority students year after year? At the very least, it is statistically problematic to distill one effect from the other! As a result, it’s statistically problematic to argue that the teacher should be dismissed! The likelihood that the teacher was wrongly dismissed is at least as great as the likelihood that the teacher was rightly dismissed. I suspect a court might be concerned by this.

Reduction in Force

Note that many of these same concerns apply to all of the recent rhetoric over teacher layoffs and the need to base those layoffs on effectiveness rather than seniority. It all sounds good, until you actually try to go into a school district of any size and identify the 100 “least effective” teachers given the current state of data for teacher evaluation. Simply writing into a reduction in force (RIF) policy a requirement of dismissal based on “effectiveness” does not instantly validate the “effectiveness” measures. And even the best “effectiveness” measures, as discussed above, remain really problematic, providing tenured teachers laid off on grounds of ineffectiveness with multiple options for legal action.

Additional Concerns

These two legal arguments ignore the fact that school districts and states will have to establish two separate types of contracts for teachers to begin with, since even in the best of statistical cases, only about 1/5 of teachers (those directly responsible for teaching math or reading in grades three through eight) might possibly be evaluated via student test scores (see: https://schoolfinance101.wordpress.com/2009/12/04/pondering-the-usefulness-of-value-added-assessment-of-teachers/)

I’ve written previously about the technical concerns over value-added assessment of teachers and my concern that pundits are seemingly completely ignorant of the statistical issues. I’m also baffled that few others in the current policy discussion seem even remotely aware of just how few teachers might – in the best possible case – be evaluated via student test scores, and of the need for separate contracts. But I am perhaps most perplexed that no one seems to be acknowledging the massive legal mess likely to ensue when (or if) these poorly conceived policies are put into action.

I’ll save for another day the discussion of just who will be waiting in line to fill those teaching vacancies created by rigid use of test scores for disproportionately dismissing teachers in poor urban schools. Will they, on average, be better or perhaps worse than those displaced before them? Just who will wait in this line to be unfairly judged?

For a related article on the use of certification exams for credentialing teachers, see:

Green, P.C., & Sireci, S.G. (2005). Legal and psychometric criteria for evaluating teacher certification tests. Educational Measurement: Issues and Practice, 19(1), 22–31.

And the (RttT) winners are…

In a previous post, I bemoaned the list of Race to the Top Nominees:

https://schoolfinance101.wordpress.com/2010/03/04/and-the-rttt-nominees-are/

Today, we have our winners – Delaware and Tennessee. Here’s my own summary of where these states stand on a number of key indicators. See previous post for discussion.

A helpful colleague offered the following summary bullet points for the above table (which I just didn’t have time to do myself when I first posted this). It’s a little hard to quote a table, so here’s the bottom line:

  • Delaware is dead last in the nation in terms of its effort to fund public education, despite that state having the nation’s greatest fiscal capacity (largest per capita GDP).
  • Delaware is also dead last in the nation in terms of its public schools serving school-aged kids:  21% of its school-aged kids do not attend the public schools.
  • Tennessee is ranked 4th from last in states’ efforts to fund public education.  Tennessee is also among the lowest scoring states on the NAEP assessments.
    • (“Effort” is here defined as state and local spending relative to state fiscal capacity, with “fiscal capacity” measured as per capita GDP.)

So then, who cares? Or why should we? Many have criticized me for raising these issues, arguing “that’s not the point of RttT. It’s (RttT) not about equity or adequacy of funding, or how many kids get that funding. That’s old school – stuff of the past – get over it! This… This is about INNOVATION! And RttT is based on the ‘best’ measures of states’ effort to innovate… to make change… to reach the top!”

My response is that the above indicators measure Essential Pre-Conditions! One cannot expect successful innovation without first meeting these essential preconditions. If you want to buy the “business-minded” rhetoric of innovation, which I wrote about here, you also need to buy into the reality that the way in which businesses achieve innovation also involves investment in both R&D and production (coupled with monitoring production quality). You can have all of the R&D and quality monitoring systems in the world, but if you go cheap on production and make a crappy product – you haven’t gotten very far. On average, it does cost more to produce higher quality products.

This also relates to my post on common standards and the capacity to achieve them. It’s great to set high standards, but if you don’t allocate the resources to achieve those standards, you haven’t gotten very far! It costs more to achieve high standards than low ones. Tennessee provides a striking example in the maps from this post! (Its low spending seems generally sufficient to achieve its even lower outcome standards!)

With that in mind, should states automatically be disqualified from RttT for doing so poorly on these Essential Preconditions? Perhaps not. After all, these are states which may need to race to the top more than others (assuming the proposed RttT strategies actually have anything to do with improving schools). But states doing so poorly on key indicators like effort and overall resources, or even the share of kids using the public school system, should at least have to explain themselves – and show how they will do their part to rectify these concerns.

And the (RTTT) Nominees are…

Not much time today to analyze, but I can’t pass up the opportunity for some quick comments on the Race to the Top finalists announced today. The list is indeed a mixed bag (CO, DC, DE, FL, GA, IL, KY, LA, MA, NY, NC, OH, PA, RI, SC, TN).

And yes, the list does include three of the most talked-about early heavy favorites – and my favorites, of course – Louisiana, Tennessee and Illinois. (And there are many more comments on these states and their RTTT prospects throughout my earlier blog posts.)

Here’s my rap sheet on these states in particular, and why I find it so completely absurd that simply a) removing caps on numbers of charter schools coupled with b) removing firewalls between teacher and student data are the primary criteria (or at least seem to be) for the big race.

It’s not just that some of these states have mildly problematic policies from a critical academic perspective. Rather, these three states in particular have compiled a record of education policies – both on the fiscal input end and on the outcome, standards and accountability end – that is outright disgraceful.

The only thing going for Tennessee’s education system – beyond its data quality – is the fact that funding is relatively equitable within the state (compared to many states). But that’s only because everyone has next to nothing! Tennessee currently maintains the least well-funded education system in the nation, overall, after correcting for costs associated with a) poverty, b) economies of scale and sparsity and c) regional competitive wage variation.

And not only is Tennessee dead last in overall funding, but it is also dead last in the rigor of its testing standards when compared against NAEP proficiency standards. So, can the data really be that good if the standards are so low – if the proficiency rates on state assessments are so high even though the state ranks near the bottom on NAEP proficiency?

So, Tennessee spends little and expects little, but measures it well! In addition, Tennessee’s low spending appears to be largely a function of lack of effort, not lack of wealth. Tennessee is 4th lowest in the nation on the percent of gross state product spent on schools. Further Tennessee has the largest income gap between children not in the public schools and children in the public schools.

I’ve written more about Louisiana’s prospects in the past. Louisiana, like Tennessee, has mainly itself to blame for its low spending. Louisiana is 3rd lowest in the nation on the percent of GSP allocated to public schools. Coupled with that, Louisiana has the 3rd smallest share of 6 to 16 year olds in the public school system and the 3rd largest income gap between those in and not in the system. Louisiana’s own state testing standards are relatively average, but its NAEP outcomes are right there at the bottom (okay… 3rd from bottom across math and reading, grades 4 and 8 in 2007).

So, these two standout RTTT finalists are states that have pretty much chosen to throw their public education systems under the bus. Yet, they are somehow racing to the top!??

So, how does Illinois fit into this mess? Instead of throwing its entire system under the bus, Illinois has merely chosen to sacrifice the education of poor and minority children. Illinois maintains one of the least equitable state school funding systems in the nation, with among the largest funding gaps between wealthy and poor, minority and non-minority districts. And, as it turns out, Illinois also has very low testing standards when mapped to NAEP standards.

Slides from recent presentation to National Urban League.

National Urban League Presentation

Common Standards and the Capacity to Achieve Them

It would appear that the Common Standards movement has picked up some momentum this week, with the administration’s pitch that Title I aid should be tied to states adopting common college readiness standards. This is all good talk, but standards alone, on paper and/or in state policies or proclamations don’t achieve themselves. It is inappropriate for state policymakers, federal policymakers, pundits or the general public to simply assume that local public school districts all have sufficient resources to achieve any reasonable common standards.

Perhaps if those standards are set obscenely low they will be broadly attainable at current state and local spending levels. Even then, there will be significant inequities in the ease with which those standards are attained.

Noticeably absent in the current policy conversations is any discussion of the relative capacity of state education systems and local school systems  to achieve any reasonable common standards. It would be far more logical for the federal government to tie Title I funding not to some vacuous statement of endorsement of toothless common standards, but rather to a guarantee that the state will ensure that all local public school districts (and charter schools) have sufficient financial resources to achieve common standards – whatever they are.  In this paper, I, along with Lori Taylor, explain how we approach the measurement of cost and its implications for common standards.

To see just how far our nation has to go in order to move toward common capacity to achieve common standards, let’s take a look at some national maps. Let’s start with a map of the projected relative state and local revenue per pupil levels across states, corrected for a variety of “cost” factors (regional wage variation, economies of scale, population density, poverty):

After correcting for a variety of factors, some states like Tennessee, Mississippi, Utah and Oklahoma simply spend far less than most others on schools – only slightly above half as much as some states.

Here’s a different view, down to the district level, based on an alternative set of cost adjustments. This second map shows that not only are some states much lower-spending overall, but within those states, after adjusting for various cost differences, there also exist significant differences in spending (in this case, the map uses current operating expenditures per pupil including Title I funding). Again, Tennessee and Mississippi have overall very low spending. So do many areas of east-central Washington, much of California, and Texas’s major urban centers. Estimates are not provided for non-unified districts (the large expanses of white background).

So, by this point, you’re probably saying – yeah… but money doesn’t really matter that much. It’s how you use it. Maybe Tennessee, for example, is just really, really efficient at producing great outcomes on little expenditure.

Let’s now take a look at state assessment outcomes by district, nationally. In this map, I’ve taken the proficiency levels for each district, based on the 3-year data set compiled by the New America Foundation (thanks, NAF), and expressed them as standard deviations from the national mean proficiency rate. Blue areas are those with relatively high proficiency rates and brown areas have relatively low proficiency rates. Check it out:

Wow, Tennessee does do great, despite its low spending! So does Oklahoma. These are model states, right? Low spending, yet really high performance on their own state tests! Check out Missouri. What’s going on there? Well, as it turns out, Tennessee is doing great on its own self-validation exercise – state tests – because it has really easy state tests – or, in other words, really low proficiency cut-points for its state tests. This is the game that states have been playing since the adoption of NCLB. We don’t have to spend much, or actually fix our education system as long as we set low enough standards to make it look like we’re awesome. This is well documented in a series of NCES reports which map state cut-points to NAEP cut-points. By the way – Missouri has a very hard test (in contrast with Kansas, right next door, which has relatively easy tests.)

Finally, here is a scatterplot of the relationship between an overall index of the relative equity and adequacy of state and local revenues per pupil, and state mean NAEP scores (4th and 8th grade math and reading) for 2007. The funding equity/adequacy ratings are based on national, district-level data from 2005 to 2007, and account for a) relative effort by states to fund schools (% of gross state product), b) shares of children in the public school system (and the ratio of family income of those not in the system to those in the system), c) predicted state and local revenue level at average poverty, and d) the extent to which funding is targeted based on poverty differences across districts.

If we really plan to get serious about Common Standards, then states like Louisiana, Tennessee, Alabama and Oklahoma are going to need to step things up a bit. Notably, low fiscal capacity states like Mississippi and Alabama will need significant federal assistance to pull this off. But, Tennessee and Louisiana are two states which spend less by choice – having among the lowest “effort” among states (% of Gross State Product allocated to schools).

Alternatively, we could just set standards as low as Tennessee standards, spend as little as Tennessee and pat ourselves on the back for a job well done. I don’t believe that this is the intent of the common standards movement, but I may be wrong. Nonetheless, if we continue to throw around the rhetoric of common standards without ever discussing the capacity to achieve them, then we should not expect much to ever come of this movement. Without sufficient capacity, there can be no substantive reform.

For more on whether school finance reforms actually can help, see: https://schoolfinance101.wordpress.com/2009/12/14/finance_reforms/

Disg-RACE to the TOP?

Here’s how Dems for Ed Reform characterizes Louisiana’s education reform efforts in relation to the federal Race to the Top competition:

Louisiana. The state passed legislation by Rep. Walt Leger III (D-New Orleans) lifting its charter school cap in June at the end of its legislative session. Louisiana is also pioneering an accountability system that tracks graduates of teacher training programs so that they can be held accountable for the performance of the teachers they train and so that their programs can be improved and/or revamped. A “unified group” of education and community-based organizations launched a statewide RttT effort in August.

http://www.dfer.org/2009/12/who_would_have.php

Note that I’m merely using this description as an example. DFER is far from the biggest offender when it comes to heaping praise on Louisiana.

Most pundits seem to agree that Louisiana is a front-runner to receive race to the top funding primarily because of its efforts to increase data and link student data to teachers (for practical issues on this point, see: https://schoolfinance101.wordpress.com/2009/12/04/pondering-the-usefulness-of-value-added-assessment-of-teachers/) and for the state’s lack of caps on numbers of new charters which can be granted per year.

I continue to argue, however, that even if these two factors are signs of “innovation” or an environment to support “innovation,” innovation without real investment or true commitment is doomed to fail. Louisiana is the perfect example of the insanity that is Race to the Top. I pick on Louisiana here because it is such an absurd case, and because it is illustrative of the myopic and misguided criteria being used to evaluate innovation, and even more so, illustrative of the utter lack of critical thinking and analysis by pundits and ill-informed media-junkies, ed-writers and twitterers (who seem to lack any ability to critically evaluate… anything… but will re-tweet anything that praises Louisiana’s RttT application).

Let’s take a look at Louisiana’s education system. Yes, their system needs help, but the reality is that Louisiana politicians have never attempted to help their own system. In fact they’ve thrown it under the bus and now they want an award? Here’s the rundown:

  • 3rd lowest (behind Delaware & South Dakota) % of gross state product spent on elementary and secondary schools (American Community Survey of 2005, 2006, 2007)
  • 2nd lowest percent of 6- to 16-year-old children attending the public system, at about 80% (tied with Hawaii, behind Delaware) (American Community Survey of 2005, 2006, 2007). The national average is about 87%.
  • 2nd largest (behind Mississippi) racial gap between the % white in private schools (82%) and the % white in public schools (52%) (American Community Survey of 2005, 2006, 2007). The national average gap is about 13 percentage points, compared to 30 in Louisiana.
  • 3rd largest income gap between publicly and privately schooled children at about a 2 to 1 ratio. (American Community Survey of 2005, 2006, 2007)
  • 4th highest percent of teachers who attended non-competitive or less competitive (bottom 2 categories) undergraduate colleges based on Barrons’ ratings (NCES Schools and Staffing Survey of 2003-04). Almost half of Louisiana teachers attended less or non-competitive colleges, compared to 24% nationally.
  • Negative relationship between per pupil state and local revenues and district poverty rates, after controlling for regional wage variation, economies of scale, population density (poor get less).
  • 46th (of 52) on NAEP 8th Grade Math in 2009. 38th of 41 in 2000. http://nces.ed.gov/nationsreportcard/statecomparisons/
  • 49th (of 52) on NAEP 4th Grade Math in 2009. 35th of 42 in 2000.

So, this is a state where 20% abandon the public system, and 82% of those who leave are white and have incomes twice those of the families left in the public system, half of whom are non-white. While the racial gap is large in Mississippi, a much smaller share of Mississippi children abandon the public system, and Mississippi is average on the percent of GSP allocated to public education. Mississippi simply lacks the capacity to do better. Louisiana doesn’t even try. And they deserve an award?

I read an article the other day that was uncritically tweeted (http://www.washingtonpost.com/wp-dyn/content/article/2009/12/12/AR2009121202631.html), explaining how Louisiana has adopted this great new teacher evaluation system. But, hey, look above. Louisiana ranks right near the top of the pack on the percent of all public school teachers who attended the least competitive colleges (which matters). Why worry about a dysfunctional supply pipeline for teachers? You wouldn’t want to consider the possibility that improved teacher wages and working conditions and investment in higher education could possibly improve that pipeline? A good teacher evaluation system will wash that supply problem away!

Quite simply, if you’ve got the academically weakest teachers to begin with, and you’ve got a system where 20% of students, almost entirely white and from households with twice the average income, leave the system, and where you’re putting about the lowest share of your state productivity into schools, and where your kids continue to score near the bottom on national assessments, all the data and supposed accountability in the world is not going to make much difference. Throwing RttT money into this mess isn’t likely to help much either. Applying a business investment mindset, Louisiana schools are certainly not a product line in which I’d invest my own hard-earned money (but wait, RttT is ours, isn’t it?). That is, if I bother to think critically for a minute or two.

While I sympathize with the 80% of children left in Louisiana public schools, it is not the federal gov’t via RttT that is going to begin to dig them out of the hole in which they’ve been buried for decades by their own political leadership. The state of Louisiana must step up first, and big-time. The state must invest sufficiently in public schools to improve quality to the point where some of the wealthier and whiter families might actually opt back into the public system. At the very least, the state should be required to put up “average” fiscal effort (% of GSP to schools) if it wants an award, and should be required to show that it has targeted money to the highest need schools and children. Louisiana needs a stick, not a carrot!

Heaping mindlessly tweeted and re-tweeted praise on Louisiana is incredibly unhelpful and, quite honestly, a bit embarrassing! State data systems and charter caps cannot alone solve the world’s problems, and they certainly can’t solve Louisiana’s self-inflicted ailments.

Let’s hope the federal government can see through the smokescreen that it is at least partially responsible for creating, and make good use of RttT funding. Dumping that funding into states such as Louisiana, Delaware, Colorado, or Illinois is probably not the best use. See: https://schoolfinance101.wordpress.com/2009/12/14/racingwhere/

I have written previously about Louisiana among other states, here: https://schoolfinance101.wordpress.com/2009/12/15/why-do-states-with-best-data-systems/

And here: https://schoolfinance101.wordpress.com/2009/02/25/public-schooling-in-louisiana-and-mississippi/

Why do states with the “best” data systems have the worst schools?

Okay, so the title of this blog is a bit over the top and potentially inflammatory, but let’s take a look at those states which, according to the Data Quality Campaign, have achieved the best possible state data systems by having all 10 elements recognized by the campaign. I should note that I appreciate the 10 data elements, especially as a data geek myself. It’s good stuff, and this post is not intended to criticize the Data Quality Campaign. Rather, this post is intended to question whether our recent focus, or obsession, with rating the quality of state education systems by two criteria alone (a. whether they have certain data linked to certain other data, and b. whether they have caps on charter schools) has created an unfortunate diversion. This obsession has caused us to take our eye off the ball: to applaud states that have, in reality, put little or no effort into improving their education systems, states that have, over time, dreadfully under-supplied public schooling, and states that have consistently produced the lowest educational outcomes (not merely as a function of the disadvantages of their student populations).

So, here’s a quick run-down. First, let’s begin with a look at the number of data quality elements compiled by states in relation to the percent of Gross State Product (Gross Domestic Product by State) allocated, in the form of state and local revenue per pupil, to local public schools. There’s no especially tight relationship here, but as we can see, Delaware, Louisiana and Tennessee are three states which now have all 10 data elements (HOORAY) but have very low educational effort. Utah and Washington also have low educational effort.

This might be inconsequential if it was… well… inconsequential. That is, if there were also no relationship to educational outcomes. Here’s a plot of the mean NAEP Math and Reading Grades 4 and 8 for 2007 (% Proficient) along with the number of Data Quality Elements. In this case, there actually is some relationship. Yep, states with better data have lower outcomes. Maybe having better data will increase the likelihood that they figure this out. That’s a somewhat unfair argument given that many of these states are relatively poor, but it’s not all about poverty (in fact, higher poverty would require greater effort to improve outcomes, but it doesn’t play out that way for these states; see this post for a discussion of poverty variation across states). Low effort, low performing, but high data quality states include Louisiana and Tennessee. Yet, somehow, when viewed through a data quality lens alone, these states become superstars!
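
As a rough check on that pattern, one could simply correlate the count of data quality elements with each state’s mean NAEP proficiency. A minimal sketch, assuming a merged state-level file with hypothetical column names:

```python
# Hypothetical sketch: correlation between Data Quality Campaign element counts
# and mean NAEP % proficient across states. The merged file and its column names
# are assumptions for illustration.
import pandas as pd

df = pd.read_csv("state_dqc_naep.csv")   # assumed columns: state, dqc_elements, naep_pct_prof
print(df[["dqc_elements", "naep_pct_prof"]].corr())
# A negative correlation is the pattern described above: states with more data
# elements tend to post lower NAEP proficiency rates.
```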

This next figure looks at the predicted per pupil state and local revenue in each state for a district having 10% poverty (relatively average for U.S. Census poverty rates). The point here is to compare a truly comparable state and local revenue figure, corrected for poverty variation, regional wage variation, economies of scale and population density. Here, we see that Utah and Tennessee (again) are standouts, having the lowest state and local revenue per pupil. Recall that both also put up low to very low effort. Their revenue to districts is not low because they are poor, but rather because they don’t put up the effort. But hey, they’ve got great data!!!!!
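
For readers curious about the mechanics, the “predicted revenue at 10% poverty” comparison works by fitting a district-level model within each state and then predicting revenue for a standardized district. Here is a minimal sketch of that logic; the data file, column names, and functional form are assumptions for illustration, not the model actually estimated for the figure.

```python
# Illustrative sketch: predicted state & local revenue per pupil for a district
# at 10% poverty, holding wage costs, enrollment scale, and density at state means.
# The file, columns, and functional form are assumptions for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

districts = pd.read_csv("district_finance.csv")    # assumed: all states, one row per district

predicted = {}
for state, grp in districts.groupby("state"):
    fit = smf.ols(
        "rev_pp ~ poverty_rate + np.log(enrollment) + wage_index + pop_density",
        data=grp,
    ).fit()
    ref = pd.DataFrame({
        "poverty_rate": [0.10],                    # the standardized 10% poverty district
        "enrollment":   [grp["enrollment"].mean()],
        "wage_index":   [grp["wage_index"].mean()],
        "pop_density":  [grp["pop_density"].mean()],
    })
    predicted[state] = fit.predict(ref).iloc[0]

# States with the lowest predicted revenue for a comparable district
print(sorted(predicted.items(), key=lambda kv: kv[1])[:5])
```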

Another relevant “effort” related point to consider is just how many children of school age in the state are actually even served by the public system. If we were discussing child health care across states or even pre-school, we would most certainly consider the extent of “coverage.” We tend to ignore “coverage” in k-12 education because we too often assume near universal coverage. But that’s not the case. And coverage varies widely across states. Here, I measure coverage by the % of 6 to 16 year olds (American Community Survey of 2007) enrolled in public schools.
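
The coverage measure itself is easy to compute from ACS person-level microdata. A minimal sketch, assuming an IPUMS-style extract; the variable names and the code for public school enrollment are assumptions that should be checked against the actual codebook.

```python
# Hypothetical sketch: share of 6 to 16 year olds enrolled in public schools, by state,
# from ACS person-level microdata. Variable names follow IPUMS-style conventions but are
# assumptions here; verify codes against the extract's codebook.
import pandas as pd

acs = pd.read_csv("acs_2007_persons.csv")          # assumed columns: STATEFIP, AGE, SCHLTYPE, PERWT
kids = acs[(acs["AGE"] >= 6) & (acs["AGE"] <= 16)].copy()
kids["in_public"] = (kids["SCHLTYPE"] == 2)        # assumed: 2 = public school

coverage = kids.groupby("STATEFIP").apply(
    lambda g: (g["in_public"] * g["PERWT"]).sum() / g["PERWT"].sum()
)
print(coverage.sort_values().head())               # lowest-coverage states
```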

Not only are Louisiana and Delaware very low in their effort for schools, and Louisiana low on outcomes, both are also very low on coverage. They don’t even serve 80% of 6 to 16 year olds in their public school system (remember, charter schools are part of the public system)!!!! Yet somehow, having good data on those who remain in the public system is treated as a substitute for effort, making the state worthy of praise!!!!!

One might speculate that these differences are mainly about the wealth of states, especially when it comes to the ability of states to spend on their schools and the outcomes achieved in those schools. This is indeed true to a significant extent. But, as it turns out, the effort a state puts up toward public school spending is actually more strongly related to predicted state and local revenues per pupil than is state wealth (per capita gross state product). That is, states which put up more effort do raise more per pupil for their schools. Yes, states like Mississippi are at a disadvantage because they lack wealth. Tennessee and Utah have much less excuse! Delaware’s unique economic position allows it to raise significant revenue with little effort.
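
One simple way to see which matters more, wealth or effort, is to standardize both and compare their coefficients in a single regression predicting per-pupil revenue. A sketch under assumed column names, not the analysis actually run here:

```python
# Illustrative sketch: comparing standardized effects of effort and wealth on
# predicted per-pupil state & local revenue across states. Column names are assumptions.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("state_effort_wealth.csv")   # assumed: state, pred_rev_pp, effort, gsp_per_capita
for col in ["pred_rev_pp", "effort", "gsp_per_capita"]:
    df[col + "_z"] = (df[col] - df[col].mean()) / df[col].std()

fit = smf.ols("pred_rev_pp_z ~ effort_z + gsp_per_capita_z", data=df).fit()
print(fit.params)   # a larger coefficient on effort_z matches the pattern described above
```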

Finally, the effort-to-revenue relationship would be of little consequence if it were not also the case that the predicted state and local revenue differences across states are associated with those pesky NAEP outcomes. Yes, there does exist a modest relationship (with many entangled underlying factors) between state and local revenues and NAEP outcomes.

There is indeed a lot tangled up in the various relationships presented above. But one thing is clear: DATA QUALITY ALONE PROVIDES LITTLE USEFUL INFORMATION ABOUT THE QUALITY OF A STATE’S EDUCATION SYSTEM! Our obsession with comparing states on this basis has caused us and policymakers to take our eyes off the ball (former tennis coach speaking here!). Applauding states and financially rewarding them (RttT) merely for collecting better data, with little attention to the actual school systems and children served (or not served) by those systems, is, at best, disingenuous.

To quote John McEnroe – You cannot be serious!