Friday Thoughts: In my own words (recent media commentary)

Interview for In These Times:

[I]t’s much easier to point blame at those working within the system–like teachers–than to actually raise the revenues to provide the resources necessary to really improve the system–to pay sufficient wages to attract and retain top college graduates and to provide the working conditions that would make teaching more appealing–including smaller total student loads… and higher quality infrastructure, materials, supplies, equipment and other supports.

http://www.inthesetimes.com/working/entry/12618/teachers_and_communities_overshadowed_by_corporate_fixes_for_schools/

In my interview with Geoff Mulvihill of AP:

In response to a question about which reforms are needed most in New Jersey:

From a research angle, if you looked at the high-performing and the low-performing schools and you asked yourself what’s different about them, well, our highest-performing schools also have step-structured pay scales, collectively bargained agreements, tenure, and union contracts, as do our low-performing schools. That’s not a differentiating factor.

These things that we’re talking about like merit pay, disrupting union contracts and collective bargaining don’t tend to be the things that the high-performing schools are doing.

http://www.courierpostonline.com/article/20120103/NEWS02/301030016/Educating-New-Jersey-s-urban-kids-costs-more-scholar-says?odyssey=nav|head

Follow-up to a similar question:

If you look at the biggest differences between the schools that are doing well and the schools that are doing poorly, there may be differences in teaching quality. There may be differences in skill-set of the teachers who are sorting themselves among the more and less desirable schools.

It may be that we’ve got some inequities in teaching quality. But to suggest that those inequities are a function of not having merit pay or they’re a function of having collective bargaining and a union presence doesn’t seem to fit when those structures also exist in the highly successful and affluent districts.

http://www.courierpostonline.com/article/20120103/NEWS02/301030016/Educating-New-Jersey-s-urban-kids-costs-more-scholar-says?odyssey=nav|head

On where to go from here:

I think we’ve got to keep up the effort of targeting resources toward the high-need districts, and the key is that equitable and adequate funding — and this is my big punchline — is the necessary condition for everything. If you want to run a good charter school, if you want to run a good public school, you’ve got to have enough money to do a good job.

Beneath the Veil of Inadequate Cost Analyses: What do Roland Fryer’s School Reform Studies Really Tell Us? (if anything)

Here’s a short section from one of my papers currently in progress (part of the summary of existing literature on alternative models/strategies, and marginal expenditures).

A series of studies from Roland Fryer and colleagues have explored the effectiveness of specific charter school models and strategies, including the Harlem Children’s Zone (Dobbie & Fryer, 2009), “no excuses” charter schools in New York City (Dobbie & Fryer, 2011), schools within the Houston public school district (Apollo 20) mimicking no excuses charter strategies (Fryer, 2011; Fryer, 2012), and an intensive urban residential schooling model in Baltimore, MD (Curto & Fryer, 2011). In each case, the models in question involve resource-intensive strategies, including substantially lengthening school days and years, providing small-group (2 or 3 on 1) intensive tutoring, providing extensive community-based wrap-around services (Harlem Children’s Zone), or providing student housing and residential support services (Baltimore).

The broad conclusion across these studies is that charter schools or traditional public schools can produce dramatic improvements in student outcomes by implementing no excuses strategies and perhaps wrap-around services, and that these strategies come at relatively modest marginal cost. Regarding the benefits of the most expensive alternative explored – residential schooling in Baltimore (at a reported $39,000 per pupil) – the authors conclude that no excuses strategies of extended day and year, and intensive tutoring, are likely more cost effective.

But, each of these studies suffers from poorly documented and often ill-conceived comparisons of costs and/or marginal expenditures.

In their study on the effectiveness of no excuses New York City charter schools, Dobbie and Fryer (2011) use data on 35 [those responding to their survey] charter schools to generate an aggregate index based on five policies: teacher feedback, use of data to guide instruction, high-dosage tutoring, increased instructional time, and high expectations.[i] They then correlate this index with their measures of school effectiveness across the 35 schools, finding a significant relationship. Separately, the authors report weak or no correlations between “traditional” measures of school resources – including per pupil spending and class size – and their effectiveness measures, concluding that these measures are not correlated with effectiveness. In short, Dobbie and Fryer argue that potentially costly strategies matter, but money doesn’t. [or so the headlines went]

First, if potentially costly strategies matter (even if those costs are never measured), then so too does money itself. Second, the authors’ analysis and documentation of the financial data is woefully inadequate.[ii] The authors fail entirely to consider that the majority (55 to 60%) of per pupil spending differences across New York City charter schools are explained by grade ranges served and total enrollments (and/or enrollment per grade level, economies of scale), where enrollment is to some extent a function of institutional maturation (scaling up) (Baker and Ferris, 2011, p. 33).[iii] To the extent that expenditure variation is largely a function of uncontrollable structural differences across these schools, it is unlikely that one will find a simple correlation between spending variation and student outcomes (without finding some way to control for the structural differences). The authors also fail to report the source of, or descriptive statistics on, their expenditure measure.

In earlier work on the Harlem Children’s Zone, Dobbie and Fryer[iv] similarly argued that the substantial benefits they found for children participating in HCZ charter schools could be obtained at what they [feebly attempt to] characterize as negligible marginal expense. They arrive at this conclusion via the following [haphazard] cost calculation and [bogus] comparison:

The total per-pupil costs of the HCZ public charter schools can be calculated with relative ease. The New York Department of Education provided every charter school, including the Promise Academy, $12,443 per pupil in 2008-2009. HCZ estimates that they added an additional $4,657 per-pupil for in school costs and approximately $2,172 per pupil for after-school and “wrap-around” programs. This implies that HCZ spends $19,272 per pupil. To put this in perspective, the median school district in New York State spent $16,171 per pupil in 2006, and the district at the 95th percentile cutpoint spent $33,521 per pupil (Zhou and Johnson, 2008).[v]

Accepting the additional costs of the Harlem Children’s Zone as adding up to $19,000 per pupil, and accepting as a relevant comparison basis that this figure lies somewhere between the New York statewide median and statewide 95%ile of district spending, then the marginal expense for the Harlem Children’s Zone might just be trivial. But the marginal expense calculation for HCZ is not clearly documented and is highly suspect, and the comparison basis is misleading.

Baker and Ferris (2011) discuss the difficulties of deriving comparable spending-per-pupil figures for Harlem Children’s Zone schools, pointing out that reported total revenues based on IRS filings vary from $6,000 to $60,000 per pupil (p. 13), depending on the year of data and which children are counted in the denominator (charter students or all school-aged residents in the zone).

Further, it makes little sense to contextualize the HCZ total figure by placing it between the statewide median and 95%ile district, where affluent suburban Westchester County and Long Island districts far outpace per pupil spending in New York City (Baker and Welner, 2010, p. 10).[vi] Rather, more meaningful comparisons might use relevant budget components for all schools in New York City, or schools serving similar student populations in the same area of the city. Using the city Independent Budget Office (2010b) figure for 2008-09 of $15,672, and accepting the authors’ total cost figure of $19,000 per pupil, the marginal expense for HCZ would be 21%. Comparing against nearby school site budgets for select schools (see Baker and Ferris, p. 24), the marginal expense is 36 to 60%.
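The 21% figure is simple arithmetic; here is a quick sketch against the IBO baseline, using the rounded $19,000 total accepted above:

```python
# Marginal expense of HCZ relative to the NYC Independent Budget Office
# per-pupil figure, as computed in the passage above.
hcz_total = 19_000   # accepted (rounded) HCZ total per-pupil figure
nyc_base = 15_672    # IBO 2008-09 per-pupil figure for NYC

marginal_pct = (hcz_total - nyc_base) / nyc_base
print(f"HCZ marginal expense vs. NYC baseline: {marginal_pct:.0%}")  # -> 21%
```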

Similar imprecision plagues Fryer’s analysis of transfer of “no excuses” strategies from the charter school context to traditional public schools in Houston, Texas. Fryer explains in his study of Apollo 20 schools in Texas:

The marginal costs are $1,837 per student, which is similar to the marginal costs of other high-performing charter schools. While this may seem to be an important barrier, a back of the envelope cost-benefit exercise reveals that the rate of return on this investment is roughly 20 percent – if one takes the point estimates at face value. Moreover, there are likely lower cost ways to conduct our experiment. For instance, tutoring cost over $2,500 per student. Future experiments can inform whether three-on-one (reducing costs by a third) or even online tutoring may yield similar effects.

Among other things, it is important to understand that this $1,837 figure is derived in a Houston, TX context (as opposed to an NYC context), where the average middle school operating expenditure per pupil is $7,911, for an average marginal expense of 1,837/7,911 = 23.2%. While no documentation is provided for the $1,837 figure in Fryer’s paper, that figure is quite close to the average difference in current operating expenditure between the 5 Apollo 20 middle schools in Houston and all schools in Houston. But when comparing only to other Houston middle schools, that figure rises to $2,392, or 30%. In our view, a 23% to 30% increase in cost is substantial, but further exploration of the true costs of scaling the various reform strategies presented is warranted. [data available here: http://ritter.tea.state.tx.us/perfreport/aeis/2010/DownloadData.html]
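The same back-of-the-envelope percentages, spelled out:

```python
# Marginal expense percentages for the Apollo 20 figures discussed above.
avg_ms_spend = 7_911     # avg middle school operating expenditure per pupil, Houston

fryer_marginal = 1_837   # Fryer's (undocumented) marginal cost figure
apollo_vs_ms = 2_392     # Apollo 20 middle schools vs. other Houston middle schools

pct_fryer = fryer_marginal / avg_ms_spend   # ~23.2%
pct_ms = apollo_vs_ms / avg_ms_spend        # ~30%
```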

In short, across Fryer’s various studies, we find a range of marginal expenses for preferred models and strategies from 21% to 60% above average expenditures of other schools not using the preferred models and strategies. So, what are these studies really saying?

Setting aside the exceptionally poor documentation behind the marginal expenditure and cost estimates provided in each and every one of these studies, Roland Fryer and colleagues – throughout their various attempts to downplay the importance of financial resources for improving student outcomes – have made a compelling case for spending between 20 and 60% more on public schooling in poor urban contexts, including New York City and Houston, TX.

I suspect there are more than a few urban superintendents and principals out there who would appreciate seeing an infusion of resources of this magnitude. And many might even be happy to allocate the bulk of those resources to adopting such strategies as increasing teacher compensation in order to extend school days and years, and implementing intensive tutoring supports (surprisingly non-reformy strategies).

I should also point out that 20% to 60% more funding, while marginally improving student outcomes in these districts, likely still falls well short of providing children attending poor urban districts equal opportunity to achieve outcomes commonly achieved by their more affluent suburban counterparts, and may fall well short of providing adequate resources for these children to gain access to and succeed in higher education and the labor market beyond. Estimating the true costs of these more lofty outcome objectives is a topic for another day.

NOTE: I would caution, however, that we have little basis for asserting that a 20 to 60% increase in per pupil spending would be more efficiently spent on these strategies than on such alternatives as class size reduction and/or expansion of early childhood programs. These comparisons simply haven’t been made, and Fryer’s attempt at such a comparison (the NYC “no excuses” study) is woefully inadequate. Pundits who argue that class size reduction is an especially expensive and inefficient alternative seem willing to ignore outright the substantial additional costs of the strategies promoted in Fryer’s work, arriving at the erroneous conclusion (with Fryer’s full support) that class size reduction is ineffective and costly, while extended school time and intensive tutoring are costless and highly effective.


[ii] For a discussion of methods used for evaluating the relationship between fiscal inputs and student outcomes, see Baker, B.D. (2012) Revisiting the Age-Old Question: Does Money Matter in Education. Shanker Institute. http://www.shankerinstitute.org/images/doesmoneymatter_final.pdf

[iii] Baker, B.D. & Ferris, R. (2011). Adding Up the Spending: Fiscal Disparities and Philanthropy among New York City Charter Schools. Boulder, CO: National Education Policy Center. Retrieved [date] from http://nepc.colorado.edu/publication/NYC-charter-disparities.

[iv] Dobbie, W. & Fryer, R. G. (2009). Are High-Quality Schools Enough to Close the Achievement Gap? Evidence from a Bold Social Experiment in Harlem. Unpublished manuscript, Harvard University, 5.

[v] Dobbie, W. & Fryer, R. G. (2009). Are High-Quality Schools Enough to Close the Achievement Gap? Evidence from a Bold Social Experiment in Harlem. Unpublished manuscript, Harvard University, 5. http://www.economics.harvard.edu/files/faculty/21_HCZ_Nov2009_NBERwkgpaper.pdf

[vi] Baker, B. D., & Welner, K. G. (2010). “Premature celebrations: The persistence of interdistrict funding disparities” Educational Policy Analysis Archives, 18(9). Retrieved [date] from http://epaa.asu.edu/ojs/article/view/741

Jay Greene (Inadvertently?) Argues for a 23% Funding Increase for Texas Schools

I was intrigued by this post from Jay Greene today, in which he points out that public schools can learn from charter schools and perhaps can implement some of their successes. Specifically, Greene is referring to KIPP-like “no excuses” charter schools as a model, and their strategies for improving outcomes, including much-extended school time (longer day/year). As the basis for his argument, Greene refers specifically to Roland Fryer’s updated analysis of Houston’s Apollo 20 schools – which are, in effect, models of no excuses charters applied in the traditional public district. Greene opines:

Traditional public schools can get results like a KIPP school without having to actually become KIPP schools.  They just have to imitate a few of the key features employed by KIPP and other successful charter schools.  This is incredibly encouraging news.

Greene does acknowledge that pesky little issue of potentially higher costs, but seems to go along with Fryer’s downplaying of the additional costs, given the amazing benefits.

Cost is another barrier to bringing this reform strategy to scale, but he notes that the marginal cost is only $1,837 per student and the rate of return on that investment would be roughly 20%. (emphasis added)

Those of you who read Jay’s work regularly probably realize that he’s not generally one to argue that more money matters, at all, for improving public schools.  After all, here’s the intro to a synopsis of his book on Education Myths:

How can we fix our floundering public schools? The conventional wisdom says that schools need a lot more money, that poor and immigrant children can’t do as well as most other American kids, that high-stakes tests just produce “teaching to the test,” and that vouchers do little to help students while undermining our democracy. But what if the conventional wisdom is wrong?

Alternatively, what if Jay Greene is wrong and he just realized it – without even realizing it?  Perhaps he’s turning over a new leaf here. Perhaps he’s accepting that a little extra funding, if used on simple things like small group tutoring and additional time can help. Heck, if it’s such a small amount of money – ONLY $1,837 per pupil – we can likely find that somewhere already squandered in school budgets.

Really, what’s an additional $1,837 per Houston middle school student anyway? Let’s wrap some context around that number. Well, it’s about 23% higher than the average 2010 current operating expenditure per middle school pupil in Houston Independent School District (based on school site current operating expenditure data for Houston ISD, which can be downloaded here: http://ritter.tea.state.tx.us/perfreport/aeis/2010/DownloadData.html)

Now, in Houston ISD alone, there are about 36,000 middle schoolers, with somewhat under 4,000 (3,657) in 5 Apollo 20 Middle Schools (applying this list of middle schools – Attucks, Dowling, Fondren, Key, and Ryan – to the TEA school site data on enrollments). So let’s say we want to add about $2,000 per pupil to the budgets of the other middle schools serving about 32,000 pupils. Oh, that’s about $64 million.

Of course, it’s quite likely that an additional 23% in funding could also do some good toward expanding school time, providing intensive tutoring, and implementing other no excuses strategies in elementary and secondary schools as well. Houston elementary schools serve over 100,000 kids, and high schools nearly 50,000. Rounding it off at an additional $2k each for those 150,000 kids, well, we’re talking about a substantial increase in expenditure for Houston ISD.

Even if one could hypothetically re-allocate about 3 to 5% of existing funding toward these strategies, we’re still looking at an approximately 18 to 20% increase in funding required to round out the programs/services.
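Pulling the scale-up arithmetic above into one place (the district-wide total is my own extension of the same rounding):

```python
# Back-of-the-envelope scale-up cost for Houston ISD, per the passage above.
per_pupil_add = 2_000                      # rounded add-on per pupil

ms_total = 36_000                          # approx. HISD middle schoolers
apollo_ms = 3_657                          # pupils in the 5 Apollo 20 middle schools
other_ms_cost = per_pupil_add * (ms_total - apollo_ms)   # ~$64-65 million

districtwide = 150_000                     # elementary + high school pupils, rounded
districtwide_cost = per_pupil_add * districtwide         # $300 million
```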

Personally, I’m glad to see Jay Greene come around to the realization that a substantial infusion of additional funding, used wisely, might lead to substantial improvement in traditional public schools.

Jay also points out some concern about whether, when scaling up these strategies, a sufficient supply of high-quality teachers will be readily available. Fryer’s analysis doesn’t provide much insight into the competitive wages for the “no excuses” charter school teacher. Actually, Fryer’s analysis doesn’t even provide any real documentation of the $1,837 figure[1], but I’ll set that aside for now, since I’ve complained about Fryer’s haphazard, back-of-the-napkin cost analyses in nearly every one of his other papers in a previous blog post.

Here’s a brief preview, from ongoing research, of the competitive wage structure of KIPP and other charter school teachers in Houston, and teachers in Houston ISD. These comparisons are based on a wage model using teacher-level data, in which I estimate the base salary of full-time teachers as a function of degree levels and experience levels for teachers in each type of charter school listed and in Houston ISD. I then project teacher salaries holding other factors constant.
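A minimal sketch of what such a wage model looks like, on entirely synthetic data (the real analysis uses teacher-level administrative records and charter-type indicators; the data and coefficients below are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic teacher-level data: years of experience and an MA indicator.
experience = rng.integers(0, 30, n).astype(float)
masters = rng.integers(0, 2, n).astype(float)
salary = 42_000 + 1_100 * experience + 3_000 * masters + rng.normal(0, 2_000, n)

# Estimate base salary as a function of degree and experience (OLS).
X = np.column_stack([np.ones(n), experience, masters])
coef, *_ = np.linalg.lstsq(X, salary, rcond=None)

# Project a salary holding other factors constant (10 years, MA degree).
projected = coef @ np.array([1.0, 10.0, 1.0])
```

Running the same regression separately for each school type, then projecting at a common experience/degree profile, is what allows an apples-to-apples wage comparison across KIPP, other charters, and Houston ISD.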

Not surprisingly, KIPP in particular pays a significant premium for their teachers (with Harmony schools as a stark contrast, but see this story for additional context). Perhaps wages matter here, and that certainly needs to figure into the future scalability of these strategies, if we truly expect to hold teacher quality at least constant (if not improve it over time).

Here’s how Houston KIPP middle school operating expenditures per pupil stack up against Houston ISD middle schools (by special ed population share – which happens to be the most consistent predictor of school site spending differences, along with grade level served).

Paying teachers more to recruit and retain high quality candidates, and to find candidates willing to work more hours and days? Offering more time by extending school days and school years? Providing small group tutoring? This kind of stuff appears to make sense. And, it costs money. And if this stuff matters, then money matters. Sometimes it really is that simple.

Welcome aboard Jay. Perhaps money really does (or at least can) matter after all!

[1] The average difference in current operating expenditure per pupil between the five Apollo middle schools and all other Houston ISD schools (all grades) in 2010 appears to be about $1,839, surprisingly close to Fryer’s undocumented estimate.  But, the average difference between Apollo middle schools and Houston ISD middle schools was $2,392.

Follow up on Fire First, Ask Questions Later

Many of us have had extensive ongoing conversation about the Big Study (CFR) that caught media attention last week. That conversation has included much thoughtful feedback from the authors of the study.  That’s how it should be. A good, ongoing discussion delving into technical details and considering alternative policy implications.  I received the following kind note from one of the study authors, John Friedman, in which he addresses three major points in my critique:

Dear Bruce,

Thank you very much for your thorough and well-reasoned comment on our paper.  You raise three major concerns with the study in your post which we’d like to address.  First, you write that “just because teacher VA scores in a massive data set show variance does not mean that we can identify with any level of precision or accuracy which individual teachers … are “good” and which are “bad.”  You are certainly correct that there is lots of noise in the measurement of quality for any individual teacher.  But I don’t think it is right that we cannot identify individual teachers’ quality with any precision.  In fact, our value-added estimates for individual teachers come with confidence intervals that exactly quantify the degree of uncertainty, as we discuss in Section 6.2 of the paper.  For instance, a teacher who after 3 average-sized classrooms had a VA of -0.2, which is 2 standard deviations below the mean, would have a confidence interval of approximately [-0.41, 0.01].  This range implies that there is an 80% chance that the teacher is among the worst 15% in the system, and less than a 5% chance that the teacher is better than average.  Importantly, we take account of this teacher-level uncertainty in our calculations in Figure 10.  Even taking account of this uncertainty, replacing this teacher with an average one would generate $190K in NPV future earnings for the students per classroom.  Thus, even taking into account imprecision, value-added still provides useful information about individual teachers.  The imprecision does imply that we should use other measures (such as principal ratings or student feedback) in combination with VA (more on this below).

Your second concern is about the policy implications of the study, in particular the quotations given by my co-author and me for the NYT article, which give the impression that we view dismissing low-VA teachers as the best solution.  These quotes were taken out of context and we’d like to clarify our actual position.  As we emphasize in our executive summary and paper, the policy implications of the study are not completely clear.  What we know is that great teachers have great value and that test-score based VA measures can be useful in identifying such teachers.  In the long run, the best way to improve teaching will likely require making teaching a highly prestigious and well rewarded profession that attracts top talent.  Our interpretation of the policy implications of the paper is better reflected in this article we wrote for the New York Times.

Finally, you suggest to your readers that the earnings gains from replacing a bottom-5% teacher with an average one are small — only $250 per year.  This is an arithmetic error due to not adjusting for discounting. We discount all gains back to age 12 at a 5% interest rate in order to put everything in today’s dollars, which is standard practice in economics. Your calculation requires the undiscounted gain (i.e. summing the cumulative earnings impact), which is $50,000 per student for a 1 SD better teacher (84th pctile vs 50th pctile) in one grade. Discounted back to age 12 at a 5% interest rate, $50K is equivalent to about $9K.  $50,000 over a lifetime – around $1,000 per year – is still only a moderate amount, but we think it would be implausible that a single teacher could do more than that on average. So the magnitudes strike us as reasonable yet important.  It sounds like many readers make this discounting mistake, so it might be helpful to correct your calculation so that your readers have the facts right (the paper itself also provides these calculations in Appendix Table 14).

Thank you again for your thoughtful post; we look forward to reading your comments on our work and others’ in the future.

Best,

John Friedman

I do have comments in response to each of these points, as well as a few additional thoughts. And I certainly welcome any additional response from John or the other authors.

On precision & accuracy

The first point above addresses only the confidence interval around the VA estimate for a teacher estimated to be in the bottom 15%. Even then, if we were to use the VA estimate as a blunt instrument for deselection (acknowledging that the paper does not make such a recommendation – but does simulate it as an option), this would result in a 20% chance of dismissing teachers who are not legitimately in the bottom 15% (including 5% who are actually above average), given three years of data. Yes, that’s far better than break-even (after waiting three full years), and it permits one to simulate a positive effect of replacing the bottom 15% (in purely hypothetical terms, holding lots of stuff constant). But acting on this information – accepting a 1-in-5 misfire rate to generate a small marginal benefit – might still have a chilling effect on future teacher supply (given that the error is entirely out of teachers’ control).
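Friedman’s probability claims can be roughly reconstructed from the reported confidence interval. A sketch, assuming normality and treating the estimate’s standard error as implied by the [-0.41, 0.01] interval (these distributional assumptions are mine, not CFR’s):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

est = -0.2                                 # teacher's VA estimate (2 SD below mean)
se = (0.01 - (-0.41)) / (2 * 1.96)         # SE implied by the reported 95% CI (~0.107)

sigma_va = 0.10                            # SD of true teacher VA (since -0.2 is "2 SD")
cutoff_15 = -1.036 * sigma_va              # 15th-percentile cutoff of true VA (~-0.104)

p_bottom15 = phi((cutoff_15 - est) / se)   # chance teacher is truly bottom-15% (~0.8)
p_above_avg = 1 - phi((0.0 - est) / se)    # chance teacher is truly above average (<0.05)
p_misfire = 1 - p_bottom15                 # the ~20% misfire rate discussed above
```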

But the confidence interval is only one piece of the puzzle. It is the collective pieces of that puzzle that have led me to believe that the VA estimates are of limited if any value as a human resource management tool, as similarly concluded by Jesse Rothstein in his review of the first round of Gates MET findings.

We also know that if we were to use a different test of the supposedly same content, we would be quite likely to get different effectiveness ratings for teachers (following either the Gates MET findings or the Corcoran & Jennings findings). That is, the present analysis tells us only whether there exists a certain level of confidence in the teacher ratings on a single instrument, which may or may not be the best assessment of teaching quality for that content area. Further, test-to-test differences in teacher ratings may be caused by any number of factors. I would expect that test-scaling differences, as much as subtle content and question-format differences, along with differences in the stakes attached, lead to the differences in ratings across the same teachers when different tests are used. Given that the tests changed at different points in the CFR study, and that there are likely at least some teachers who maintained constant assignments across those changes, CFR could explore shifts in VA estimates across different tests for the same teachers. Next paper? (The current paper is already 5 or 6 rolled into one.)

Also, as the CFR paper appropriately acknowledges, the VA estimates – and any resulting assumptions that they are valid – are contingent on the fact that they were estimated retrospectively, using assessments to which no stakes were attached – most importantly, where high-stakes personnel decisions were not based on the tests.

And one final technical point: just because the model across all cases does not reveal any systematic patterns of bias does not mean that significant numbers of teacher cases within the mix would not have their ratings compromised by various types of biases (associated with either observables or unobservables). Yes, the bias, on average, is either a wash or drowned out by the noise. But there may be clusters of teachers serving clusters of students, and/or in certain types of settings, where the bias cuts one way or the other. This may be a huge issue if school officials are required to place heavy emphasis on these measures, and where some schools are affected by biased estimates (in any direction) and others are not.

On the limited usefulness of VAM estimates

I do not deny – though I’m increasingly skeptical – that these models produce useful information at the individual level. They do indeed, as CFR explain, produce a prediction – with error – of the likelihood that a teacher produces higher or lower gains across students on a specific test or set of tests (for what that test is worth). That may be useful information. But it’s a very small piece of a much larger human resource puzzle. First of all, it’s a very limited piece of information on a very small subset of teachers in schools.

While pundits often opine about the potential cost-effectiveness of these statistical estimates for use in teacher evaluation versus more labor-intensive observation protocols, we must consider in that cost-effectiveness analysis that the VA estimates capture effectiveness only with respect to a) the specific tests in question (since other tests may yield very different results) and b) a small share of our staff districtwide.

I do appreciate, and did recognize that the CFR paper doesn’t make a case for deselection with heavy emphasis on VA estimates. Rather, the paper ponders the policy implications in the typical way in which we academically speculate. That doesn’t always play well in the media – and certainly didn’t this time.

The problem – and a very big one – is that states (and districts) are actually mandating rigid use of these metrics, including proposing that they be used in layoff protocols (quality-based RIF) – essentially deselection. Yes, most states are saying “use test-score based measures for 50%” and use other stuff for the other half. And political supporters are arguing that “no one is saying to use test scores as the only measure.” The reality is that when you put a rigid metric (and policymakers will ignore those error bands) into an evaluation protocol and combine it with less rigid, less quantified other measures, the rigid metric will invariably become the tipping factor. It may be 50% of the protocol, but it will drive 100% of the decision.

Also, state policymakers and local decision makers for the most part do not know the difference between a well-estimated VAM, with appropriate checks for bias, and a Student Growth Percentile score – which is being pitched to many state policymakers as a viable alternative and has now been adopted in many states – with no covariates and no published statistical evaluation of its properties, biases, etc.

Further, I would argue that there are actually perverse incentives for state policymakers and local district officials to adopt bad and/or severely biased VAMs, because those VAMs are likely to appear more stable (less noisy) over time (because they will, year after year, inappropriately disadvantage the same teachers).

State policymakers are more than willing to make that completely unjustified leap that the CFR results necessarily indicate that Student Growth Percentiles – just like a well estimated (though still insufficient) VAM – can and should be used as blunt deselection tools (or tools for denying and/or removing tenure).

In short, even the best VAMs provide us with little more than noisy estimates of teaching effectiveness, measured by a single set of assessments, for a small share of teachers.

Given the body of research, now expanded with the CFR study, while I acknowledge that these models can pick up seemingly interesting variance across teachers, I stand by my perspective that that information is of extremely limited use for characterizing individual teacher effectiveness.

On the $250 calculation (and my real point)

My main point regarding the breakdown from $266k to $250 was that the $266k was generated for WOW effect from an otherwise non-startling number (be it $1,000 or $250). It’s the intentional exaggeration by extrapolation that concerns me, like stretching the Y axis in the NY Times story (theirs, not yours). True, I simplified and didn’t discount (via an arbitrary 5%), and instead did a simple back-of-the-napkin calculation that would reconcile, for readers, with the related graph – which shows about a $250 shift in earnings at age 28 (but stretches the Y axis to exaggerate the effect). It is perhaps more reasonable to point out that this is about a $250 shift over $20,500, or slightly greater than 1.2%.
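
The back-of-the-napkin arithmetic can be written out explicitly. The class size (28) and working-year horizon (38) below are hypothetical round numbers chosen purely to show how repeated multiplication inflates a modest annual figure into a headline one; the paper’s own calculation involves discounting, which this sketch deliberately omits.

```python
# Illustrative round figures only, not the paper's own (discounted) numbers.
annual_shift = 250      # approximate per-student earnings shift at age 28
baseline = 20_500       # approximate baseline earnings at that age
class_size = 28         # hypothetical
working_years = 38      # hypothetical

relative_shift = annual_shift / baseline                     # ~1.2% of baseline
headline_number = annual_shift * class_size * working_years  # 266,000
```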

I agree that when we see shifts even this seemingly subtle, in large data sets and in this type of analysis, they may be meaningful. And I recognize that researchers try to find alternative ways to illustrate the magnitude of those shifts. But in the context of the NY Times story, this one came off as stretching the meaningfulness of the estimate – multiplying it just enough times (by the whole class, then by lifetime) to make it seem much bigger, and therefore much more meaningful. That was easy blog fodder. But again, I put it down in that section of my critique focused on the presentation, not the substance.

If I were a district personnel director, would I want these data? Would I use them? How?

This is one that I’ve thought about quite a bit.

Yes, probably. I would want to be able to generate a report of the VA estimates for teachers in the district. Ideally, I’d like to be able to generate a report based on alternative model specifications (option to leave in and take out potential biases) and on alternative assessments (or mixes of them). I’d like the sensitivity analysis option in order to evaluate the robustness of the ratings, and to see how changes to model specification affect certain teachers (to gain insights, for example, regarding things like peer effect vs. teacher effect).

If I felt, when poring over the data, that they were telling me something about some of my teachers (good or bad), I might then use these data to suggest to principals how to distribute their observation efforts through the year. Which classes should they focus on? Which teachers? It would be a noisy pre-screening tool, and would not dictate any final decision. It might start the evaluation process, but would certainly not end it.

Further, even if I did decide that I have a systematically underperforming middle school math teacher (for example), I would only be likely to try to remove that teacher if I was pretty sure I could replace him or her with someone better. It is utterly foolish, from a human resource perspective, to simply assume that I will be able to replace this “bad” teacher with an “average” one. Fire now, then wait to see what the applicant pool looks like and hope for the best?

Since the most vocal VAM advocates love baseball analogies – pointing out the supposed connection between VAM teacher deselection arguments and Moneyball – consider that statistical advantage in baseball is achieved by trading for players with better statistics – trading up (based on which statistics a team prefers/needs). You don’t just unload the bottom 5% or 15% of your players by on-base percentage and hope that players with on-base percentages equal to your team average will show up on your doorstep. (And I acknowledge that the baseball statistics analogies for using VAM in teacher evaluation are, to begin with, completely stupid.)

Unfortunately, state policymakers are not viewing it this way – not seeking reasonable introduction of new information into a complex human resource evaluation process. Rather, they are rapidly adopting excessively rigid mandates regarding the use of VA estimates or Student Growth Percentiles as the major component of teacher evaluation, determination of teacher tenure and dismissal. And unfortunately, they are misreading and misrepresenting (in my view) the CFR study to drive home their case.

NJ Charter Data Round-up

Note: I will be making updates to this post in the coming days/weeks.

As we once again begin discussing and debating the appropriate role for charter schools in New Jersey’s education reform “mix,” here’s a round-up of the New Jersey charter school numbers: demographic comparisons to all other public and charter schools in the same “city,” and proficiency rates (across all grades) compared to all others in the same “city.”

Key Findings:

Many NJ charter schools, especially those most often touted in the media as great success stories, continue to serve student populations that differ dramatically from the populations of surrounding schools in the same city (see note *). These charters differ in terms of the percentage of children who qualify for free lunch, the percentage classified as having disabilities, and the percentage with limited English language proficiency.

On average, given their demographics, NJ charter schools continue to have proficiency rates around where one would expect. Demographically advantaged charter schools have higher average proficiency than other schools around them. Demographically disadvantaged charter schools have lower average proficiency rates than others around them. Not tricky/heavy statistics here. Just a comparison of relative proficiency and relative demography.

When one estimates what I would call a “descriptive regression” model characterizing the differences in proficiency rates across district and charter schools in the same cities, one finds that, compared against schools of similar demography and on the same grade level and subject area tests, charter proficiency rates, on average, are no different from those of their traditional public school counterparts. In this particular regression model, charters did have higher proficiency in Science (a charter-by-science interaction). More descriptive stuff to come when I get a chance. Not sure when that will be.

Note: The model includes a fixed effect for CITY location for each traditional public and charter school, such that each charter is compared against other schools in the same CITY.
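
A sketch of what such a descriptive regression might look like, on entirely synthetic data (all variable names and values here are hypothetical, not the actual NJ data): city fixed effects, a charter indicator, a demographic control, and a charter-by-science interaction, estimated by ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2_000  # hypothetical school-by-test observations

# Entirely synthetic stand-ins for the variables described above.
city = rng.integers(0, 5, n)                   # 5 cities -> city fixed effects
charter = rng.integers(0, 2, n).astype(float)  # charter indicator
free_lunch = rng.uniform(0.2, 0.95, n)         # share free lunch
science = rng.integers(0, 2, n).astype(float)  # science test indicator

# Simulated proficiency: demography matters; charter status (net of
# demography) does not, except for a small charter-science interaction.
prof = 90 - 40 * free_lunch + 3 * charter * science + rng.normal(0, 5, n)

# Design matrix: intercept, charter, free lunch, science, charter x science,
# and city dummies (one city dropped as the reference category).
city_dummies = np.eye(5)[city][:, 1:]
X = np.column_stack([np.ones(n), charter, free_lunch, science,
                     charter * science, city_dummies])
beta, *_ = np.linalg.lstsq(X, prof, rcond=None)

charter_effect = beta[1]  # ~0: no charter difference net of demography
interaction = beta[4]     # positive: higher charter proficiency in science
```

The city dummies play the role of the CITY fixed effect in the note above, so each charter is effectively compared against other schools in the same city.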

But to be absolutely clear, this particular analysis misses the point entirely in two ways. First, it is merely descriptive of the average proficiency rates of charter and non-charter schools across tests, subjects and grades. It is not, by any means, a test of the comparative effectiveness of schools. Second, as I explain below, comparisons of charterness vs. non-charterness are not particularly helpful for informing policy.

Policy Perspectives:

Issue 1: The relevant policy question is not whether charters, on average, perform better than traditional public schools, and therefore whether we should simply replace more traditional public schools with more charter schools. The relevant questions are “what works? For whom? And under what circumstances?” Charter schools, traditional public schools and private schools all vary widely in quality and in their ability to serve different populations well. Some schools of each organizational type do well (at least for some kids) while others, quite bluntly, suck, no matter who they try to serve. Further, I’ve written previously about these arguments that charters or private schools “do more (than traditional public schools) with less money.” However, rarely are those money comparisons rigorously or accurately conducted. Often, the assertion of “more with less” isn’t backed by any analysis at all of the “with less” part of that equation (and sometimes not the “more” part either). But these are the types of issues we need to be exploring, including, specifically, the resource implications of the models being offered by those “successful” schools, be they charters, traditional public schools, or other alternatives.

Issue 2: It may not be that the only appropriate role for charters in the mix is for them all to try to serve the most representative population – a population mirroring that of the city as a whole or their zip codes. But for those that don’t – for those that serve a niche – we need to recognize them as such, and we need to monitor the extent to which their demographic selection may have adverse effects on the system as a whole. We also need to recognize that their demographic difference may play a significant role in explaining either their apparent success or their apparent failure. We should recognize, for example, that schools like Robert Treat or North Star Academy may be showing high outcomes but are doing so largely as a function of serving very different populations than others around them. Further, there may be nothing wrong with that if they are truly doing well by the kids they serve. That may just be their appropriate niche. We just can’t pretend that this model of success can be spread citywide or statewide. And it may be inappropriate to encourage these schools to serve more representative populations. Perhaps they should stick with what they are good at. As a result, it may be more reasonable for charters like North Star or Robert Treat to establish similar niche schools in other New Jersey cities rather than pretending they can expand dramatically in the same cities and still maintain their current levels of achievement.

Issue 3: We also need to remember that NJ’s large urban districts themselves operate a wide variety of schools and segment their own student populations at the secondary level through such options as magnet schools. Charters aren’t the only segmenting force. Charters including those that are demographically representative and those that aren’t have simply become a part of that mix. And we need to recognize where each fits into that mix and consider very seriously the implications for the system as a whole.

Issue 4: Finally, as I so often point out, policy perspectives and parental interests may differ sharply when it comes to “elite” charter schools. From a policy perspective, elite charter schools provide limited implications for scalability (and for charters as a broad-based policy “solution”) because their benefits are derived from concentrating motivated, often less poor (non-disabled and fluent-English-speaking), self-selected students with the staying power to endure “no excuses” charter models. From a parental perspective, this public policy limitation often provides the strongest personal incentive to pursue a specific school for one’s own children. Again, it comes down to that “other” strongest in-school factor driving student success – peer effect. Peer effect is a limitation (confounding factor) in public policy (unless we can find clever strategies to optimize peer distribution). But peer effect may be a legitimate quality indicator for parental choices.

Data notes:

As I’ve noted numerous times on this blog, my goal here is to access and report on publicly available data from widely recognized and/or official government sources. These are the most recent data of that type available. And here are the sources:

District and Charter School Location Information: http://nces.ed.gov/ccd/bat (2009-2010)

District special education classification rates: http://www.nj.gov/education/specialed/data/ADR/2010/classification/distclassification.xls

School level % LEP/ELL & % Free Lunch: http://www.nj.gov/education/data/enr/enr11/enr.zip

Combined Demographic Data: Charter Demographics 2011

*Note: City and zip code averages were constructed by summing all students, all free lunch students, and all LEP/ELL students across schools in each “city” and each “zip code” (as identified by school location in the NCES Common Core of Data), then dividing the city-wide (or zip-code-wide) LEP/ELL and free lunch counts by the city-wide (or zip-code-wide) total enrollment, for traditional public schools and charters combined (that is, charters are part of the city-wide, or zip-wide, average). For special education, to estimate the citywide (and zip) average for schools, the district overall rate was applied to district schools. This would not be an appropriate way to compare individual city schools to charter schools, since special education populations are not evenly distributed across city schools (or throughout a zip code), but it is a more reasonable approach for generating citywide aggregates. Again, charters are included in the citywide and zip-code-level averages.
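
The aggregation described in the note amounts to summing counts, then dividing, rather than averaging school-level percentages. A minimal sketch on made-up records (school names and counts below are hypothetical):

```python
# Hypothetical per-school records; the aggregation mirrors the note above:
# sum student counts across all schools in a city (charters included), then
# divide by the citywide total enrollment.
schools = [
    {"city": "Newark", "enroll": 500, "free_lunch": 400, "lep": 60},
    {"city": "Newark", "enroll": 300, "free_lunch": 120, "lep": 10},  # e.g. a charter
    {"city": "Camden", "enroll": 450, "free_lunch": 405, "lep": 45},
]

totals = {}
for s in schools:
    t = totals.setdefault(s["city"], {"enroll": 0, "free_lunch": 0, "lep": 0})
    for k in ("enroll", "free_lunch", "lep"):
        t[k] += s[k]

citywide = {
    c: {"pct_free_lunch": t["free_lunch"] / t["enroll"],
        "pct_lep": t["lep"] / t["enroll"]}
    for c, t in totals.items()
}
# Newark: 520 / 800 = 65% free lunch; 70 / 800 = 8.75% LEP/ELL
```

Summing before dividing weights each school by its enrollment, which is why the citywide figure is not simply the mean of the school-level percentages.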

Differentiating “cost savings” from “expenditure reduction”

Today, it’s time for a little School Finance 101, clarifying the difference between what is a “cost savings” versus what is an “expenditure reduction.”

Cost savings means finding ways to reduce expenditure while still addressing the same range of objectives (goals, intended outcomes) and while still achieving the same level or quality of outcomes with respect to each objective.

Expenditure reduction typically means choosing not to address some objectives, goals or intended outcomes. That is, to cut back the scope of production (drop a product line, eliminate product features, cut curricular offerings, address fewer objectives/goals). Further, expenditure reduction might also mean simply choosing not to shoot for as high outcomes on specific objectives.

Note that an expenditure reduction can also be achieved by realizing actual cost savings. But, any old expenditure reduction cannot necessarily be classified as cost savings.

Here’s an example of potential “cost savings.” If you are running a small, remote rural district and questioning whether you can continue to maintain advanced placement calculus as an in-house, district staffed course for 3 to 5 students per year, you might consider the alternative of having those students take the course online. It may just be that student outcomes, for your particular students and that particular course are not measurably adversely affected by the change and that the change results in a substantial reduction in expenditure per child on AP calculus. That is, expenditure was reduced and the outcome held constant. Cost savings were achieved.

The cost savings example above is considerably different from comparing the total per pupil expense on brick and mortar schooling and all of the programs and services included in that, with the per pupil expense needed to offer an online curriculum of core required academic courses (or any subset that differs in scope of goals/objectives).  This is the critical flaw in the interpretations and presentations of (some though not all of) the findings in the recent Fordham Institute study on the supposed “cost” of online versus blended, versus brick and mortar schooling.

One might clean up this aggregate brick and mortar to online schooling comparison by attempting to isolate the per pupil costs of offering those same courses/programs/services within the brick and mortar structure and measuring and comparing the outcomes of those same courses offered each way (in house versus online, as above with AP Calculus).[1]

In these times of tight local school district budgets and rhetoric about “stretching the school dollar” and the “new normal,” paying close attention to the distinctions above is critically important.

The recent Fordham Report on online and blended learning provides some interesting new data, but provides no insights (yet) regarding “cost savings.” Again, there’s some potentially useful stuff in there, but comparisons like those made in Figure 1, p. 4  (comparing total brick and mortar per pupil spending to the other two options) are very deceptive and do much to undermine the rest of the report.

Similarly, the “stretching the dollar” brief released last year by the Fordham Institute provides little or no valuable information regarding “cost savings” but does provide a laundry list of ideas for cutting services (with no evidence or measure of the results of such cuts), such as cutting off services to limited English speaking children after two years or cutting total funding to special education (by capping and redistributing those funds uniformly across districts). Kevin Welner and I address in greater detail the various expenditure reduction strategies cast as “cost savings” by Petrilli and Roza in the “stretching the dollar” brief in a recent NEPC report.

Further, it’s important to understand that it’s not necessarily even an expenditure reduction when a school district cuts from its budget something that it then expects someone else, such as parent, to pay for (like cutting district funding for athletic travel and either replacing it with fees, or expecting local sports booster clubs to raise the money). It may be a school budget reduction, and a reduction to the school’s expenditure, but the expenditure is still there.

I’m not saying that schools or districts should never simply cut expenditures by reducing the scope of their services, or shooting lower on some goals. Some goals/objectives may no longer be (as) important, or may need to be traded off to direct scarce resources toward other, more “important” (importance being measured in any number of ways) goals/objectives.

Rather, I’m saying that if it’s an expenditure cut, it’s an expenditure cut.

If it’s really just a transfer of responsibility for the expenditure, acknowledge that.

And, if it really is an attempt at “cost savings” then it’s legit to call it that.

So, when presented with these quick and easy, off the shelf school finance solutions for supposed cost savings, please ask yourself whether the authors/presenters really have evaluated cost savings or merely expenditure reductions.

And to those authors/presenters – who I’m not always sure understand the difference themselves – please make at least some effort to differentiate between real “cost savings” and simple “expenditure reduction.”

[1] Alternatively, one might argue that the singular goal of any of the three options is a high school diploma, and that the different estimates are of the “costs” under each model of achieving that singular goal. However, in this case, it becomes important to evaluate the “quality” of the outcome – high school diploma – when obtained these very different ways (perhaps by evaluating preparedness for higher education, access, persistence, 6-year graduation, for otherwise similar students).

Misunderstanding & Misrepresenting the “Costs” & “Economics” of Online Learning

The Fordham Institute has just released its report titled “The Costs of Online Learning” in which they argue that it is incrementally cheaper to move from a) brick and mortar schooling to b) blended learning and then c) fully online learning.

http://www.edexcellencemedia.net/publications/2012/20120110-the-costs-of-online-learning/20120110-the-costs-of-online-learning.pdf

Accompanying this report is a blog post titled “Understanding the Economics of Online Learning” from the Quick and the Ed. http://www.quickanded.com/2012/01/understanding-the-economics-of-online-learning.html/comment-page-1#comment-78690

At first glance, both the report itself and especially the blog post from Quick & Ed display basic misunderstandings of “cost,” a very basic economic concept. I find this particularly disturbing in a blog post titled “understanding the economics of online learning.”

“Cost” refers to the cost of providing a service of specific quality, which in education might be measured in terms of student outcomes. That includes all costs of achieving those outcomes, whether covered by public subsidy or passed along to other participants in the system. A really good guide for understanding “costs” in this type of analysis is Hank Levin’s book on Cost-Effectiveness Analysis.

By contrast an “expense” is that which is expended toward providing some given level or portion of service. You can spend less and get less. You can spend more and get more. But getting Y quality service will cost you X, and no less than X (where X represents the minimum amount you would need to spend, given the most efficient production technology for achieving Y quality of service).  You can conceivably spend more than X for Y quality of service, but that would be, shall we say, inefficient.

Often to cover the full cost of any particular service, like public schooling for example, several parties incur expenses. It is assumed that the majority of the cost of brick and mortar schooling is covered at government expense. But, we all know that there are also fees for many things in some states (and districts), such as participation fees for sports, personal expense on school lunch, or transportation fees. Assuming attendance is compulsory, transportation fees are necessarily part of the cost of the education system whether covered by parents through fees (a tax by another name) or covered by the local public school district.

The “cost” of brick and mortar schools doesn’t change if we simply decide to cut transportation services while maintaining compulsory attendance laws. Rather, we pass along that expense to someone else – the parents. That expense is still there, and it may have even increased if we add in the cumulative parental expense on transportation (in effect, a tax for school participation).

What’s being compared in the online learning report is not “cost” but expenses on varied levels of service provision.

We might be generous here and set aside the thorniest issue by assuming that the measured academic outcomes addressed by each option are the same regardless of model type or student served (likely a huge, unsupported assumption). But the outcomes of brick and mortar schooling include not only the measured academic outcomes, but any and all outcomes derived from the total expenses on brick and mortar schooling (those used in the study), including the outcomes of athletic and arts participation, physical education, etc. If the range of outcomes covered by brick and mortar schooling is broader, that should be taken into account in this type of analysis. That is, if brick and mortar schooling is providing more than just the core academic programs – including sports, clubs, arts, phys ed – and online services are not, the analysis should either add these costs to the online service costs (what these things would cost if privately supplemented) or subtract them from the brick and mortar cost. Otherwise this is a rather pointless apples-to-five-course-meal comparison (unless we also throw in a utility analysis and assume all of that other stuff to have zero utility… a suspect assumption).

One might argue: so what’s the big deal? The kid goes to school in the kitchen of their house, and the parent is simply in the next room working from home, as opposed to the child being in a brick and mortar school for the day. Well, even that’s not a $0-expense endeavor. To nitpick, it’s likely that the increased monitoring role of the parent in this case would reduce the parent’s work productivity to some extent – an opportunity cost. The opportunity costs become potentially much larger if the parent’s productivity depends on not being at home, but they can no longer be away from home. Then there’s the marginal increase in utilities associated with having the child at home and online, potential increased food expense (a little hard to judge), additional computer hardware, etc. This kind of “little” stuff adds up across large numbers of kids.

I do not see anywhere in this study (on quick glance) or in the post above, any discussion of the varied amount of expense (portion of cost) that would be passed along to someone else (parents) under each model in order to achieve the same outcomes.  This has to be accounted for in order to have a thoughtful conversation on public policy implications. In other words, the present study does little to advance thoughtful conversation on public policy implications of online and blended learning models. But with some additional work, perhaps it might.

It may not be feasible to construct a full tally of all of the “costs” passed along to someone else under each model, but it’s at least worth listing out what some/many of those things might be and the likely range of costs being passed along.

It may still be reasonable to make the argument that government expense can be reduced, but it’s not necessarily a reduction in the cost of the service, but rather a transfer of responsibility for covering that cost. It may be… though I’m not entirely sure… that the total cost is also reduced. But taking that next step in the analysis also involves evaluating the full costs of inputs and full range of outcomes achieved.

Spending less to get less doesn’t reduce costs. It reduces only expenditures and that distinction is important.

Fire first, ask questions later? Comments on Recent Teacher Effectiveness Studies

Please also see follow-up discussion here: https://schoolfinance101.wordpress.com/2012/01/19/follow-up-on-fire-first-ask-questions-later/

Yesterday was a big day for big new studies on teacher evaluation. First, there was the New York Times report on the new study by Chetty, Friedman and Rockoff. Second, there was the release of the second part of the Gates Foundation’s Measures of Effective Teaching project.

There’s still much to digest. But here’s my first shot, based on first impressions of these two studies (with very little attention to the Gates study).

The second – the Gates MET study – didn’t have much of a punchline to it, but rather spent a great deal of time exploring alternative approaches to teacher evaluation and the correlations of those approaches with a) each other and b) measured student outcome gains. The headline that emerged from that study, in the Washington Post and in brief radio blurbs, was that teachers ought to be evaluated by multiple methods and certainly by more than a single observation once a year or every few years. That’s certainly a reasonable headline and a reasonable set of assertions. Though, in reality, after reading the full study, I’m not convinced that the study validates the usefulness of the alternative evaluation methods beyond showing that they are marginally correlated with one another and, to some extent, with student achievement gains, or that the study tells us much if anything about what schools should do with the evaluation information to improve instruction and teaching effectiveness. I have a few (really just one for now) nitpicky concerns regarding the presentation of this study, which I will address at the end of this post.

The BIG STUDY of the day… with BIG findings… at least in terms of news headline fodder, was the Chetty, Friedman & Rockoff (CFR) study. For this study, the authors compile a massive freakin’ data set for tech-data-statistics geeks to salivate over. The authors used data back to the early 1990s on children in a large urban school district, including a subset of children for whom the authors could gather annual testing data on math and language arts assessments. Yes, the tests changed at different points between 1991 and 2009, and the authors attempt to deal with this by standardizing yearly scores (likely a partial fix at best). The authors use these data to retrospectively estimate value-added scores for those (limited) cases where teachers could be matched to intact classrooms of kids (this would seem to be a relatively small share of teachers in the early years of the data, increasing over time… but still limited to grades 3 to 8 math and language arts). Some available measures of student characteristics also varied over time. The authors take care to include in their value-added model the full extent of available student characteristics (though they remove some later) and also include classroom-level factors to try to tease out teacher effects. Those who’ve read my previous posts understand that this is important, though quite likely insufficient!
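
The “standardizing yearly scores” step amounts to putting each year’s test on a common z-score metric. A minimal sketch on synthetic scores (the scales and numbers below are hypothetical) also shows why this is only a partial fix:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical raw scores from two testing regimes with different scales.
scores_1995 = rng.normal(500, 100, 1_000)   # e.g., a 200-800 scale test
scores_2005 = rng.normal(50, 10, 1_000)     # e.g., a 0-100 scale test

def standardize(x):
    """Z-score within a year/grade/subject cell."""
    return (x - x.mean()) / x.std()

z_1995 = standardize(scores_1995)
z_2005 = standardize(scores_2005)
# Both cohorts now sit on a mean-0, sd-1 metric, but cross-year
# comparability still rests on the assumption that the two tests measure
# the same construct equally well -- hence "a partial fix at best."
```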

The next big step the authors take is to use IRS tax record data of various types and link it to the student data. IRS data are used to identify earnings, to identify the numbers and timing of dependent children (e.g., did an individual 20 years of age claim a 4-year-old dependent?) and to identify college enrollment. Let’s be clear about what these measures are, though. The authors use reported earnings data for individuals in years following when they would likely have completed college (excluding incomes over $100k). The authors determine college attendance from tax records (actually from records filed by colleges/universities) on whether individuals paid tuition or received scholarships. This is a proxy measure – not a direct one. The authors use data on reported dependents and the birth date of the female reporting those dependents to create a proxy for whether the female gave birth as a teenager.[1] Again, a proxy, not a direct measure. More later on this one.

Tax data are also used to identify parent characteristics. All of these tax data are matched to student data by applying a thoroughly-documented algorithm based on names, birth dates, etc. to match the IRS filing records to school records (see their Appendix A).

And in the end, after 1) constructing this massive data set[2], 2) retrospectively estimating value-added scores for teachers and 3) determining the extent to which these value added scores are related to other stuff, the authors find…. well… that they are.

The authors find that teacher value added scores in their historical data set vary. No surprise. And they find that those variations are correlated to some extent with “other stuff” including income later in life and having reported dependents for females at a young age. There’s plenty more.

These are interesting findings. It’s a really cool academic study. It’s a freakin’ amazing data set! But these findings cannot be immediately translated into what the headlines have suggested – that immediate use of value-added metrics to reshape the teacher workforce can lift the economy and increase wages across the board! The headlines and media spin have been dreadfully overstated and deceptive. Other headlines and editorial commentary have been simply ignorant and irresponsible. (No, Mr. Moran, this one study did not, does not, and cannot negate the vast array of concerns that have been raised about using value-added estimates as blunt, heavily weighted instruments in personnel policy in school systems.)

My 2 Big Points

First and perhaps most importantly, just because teacher VA scores in a massive data set show variance does not mean that we can identify with any level of precision or accuracy, which individual teachers (plucking single points from a massive scatterplot) are “good” and which are “bad.” Therein exists one of the major fallacies of moving from large scale econometric analysis to micro level human resource management.

Second, much of the spin has been on the implications of this study for immediate personnel actions. Here, two of the authors of the study bear some responsibility for feeding the media misguided interpretations. As one of the study’s authors noted:

“The message is to fire people sooner rather than later,” Professor Friedman said. (NY Times)

This statement is not justified from what this study actually tested/evaluated and ultimately found. Why? Because this study did not test whether adopting a sweeping policy of statistically based “teacher deselection” would actually lead to increased likelihood of students going to college (a half of one percent increase) or increased lifelong earnings. Rather, this study showed retrospectively that students who happened to be in classrooms that gained more, seemed to have a slightly higher likelihood of going to college and slightly higher annual earnings. From that finding, the authors extrapolate that if we were to simply replace bad teachers with average ones, the lifetime earnings of a classroom full of students would increase by $266k in 2010 dollars. This extrapolation may inform policy or future research, but should not be viewed as an absolute determinant of best immediate policy action.

This statement is equally unjustified:

Professor Chetty acknowledged, “Of course there are going to be mistakes — teachers who get fired who do not deserve to get fired.” But he said that using value-added scores would lead to fewer mistakes, not more. (NY Times)

It is unjustified because the measurement of “fewer mistakes” is not compared against a legitimate, established counterfactual – an actual alternative policy. Fewer mistakes than by what method? Is Chetty arguing that if you measure teacher performance by value-added and then dismiss on the basis of low value-added, you will have selected on the basis of value-added? Really? No kidding! That is, you will have dumped more low value-added teachers than you would have if you had randomly dumped teachers (since you selected on that basis)? That’s not a particularly useful insight if the value-added measures weren’t a good indicator of true teacher effectiveness to begin with. And we don’t know, from this study, whether other measures of teacher effectiveness might have been equally correlated with reduced pregnancy, college attendance or earnings.

These two quotes by authors of the study were unnecessary and inappropriate. Perhaps it’s just how the NYT spun it… or simply what the reporter latched on to. I’ve been there. But these quotes, in my view, undermine a study that has a lot of interesting stuff and cool data embedded within.

These quotes are unfortunately illustrative of the most egregiously simpleminded, technocratic, dehumanizing and disturbing thinking about how to “fix” teacher quality.

Laundry list of other stuff…

Now on to my laundry list of what this new study adds and what it doesn’t add to what we presently know about the usefulness of value-added measures for guiding personnel policies in education systems. In other words, which, if any, of my previous concerns are resolved by these new findings?

Issue #1: Isolating Teacher Effect from “other” classroom effects (removing “bias”)

The authors do provide some additional useful tests for determining the extent to which bias resulting from the non-random sorting of kids across classrooms might affect teacher ratings. In my view the most compelling additional test involves evaluating the value-added changes that result from teacher moves across classrooms and schools. The authors also take advantage of their linked economic data on parents from tax returns to check for bias. And in their data set, comparing the results of these tests with other tests that involve using lagged scores (Rothstein’s falsification test), the authors appear to find some evidence of bias but, in their view, not enough to compromise the teacher ratings. I’m not yet fully convinced, but I’ve got a lot more digging to do. (I find Figure 3, p. 63, quite interesting.)

But more importantly, this finding is limited to the data and underlying assessments used by these authors in this analysis in whatever school system was used for the analysis. To their credit, the authors provide not only guidance, but great detail (and share their Stata code) for others to replicate their bias checks on other value added models/results in other contexts.

All of this stuff about bias is really about isolating the teacher effect from the classroom effect, and doing so by linking teachers (a classroom-level variable) to student assessment data with all of the underlying issues of those data (the test scaling, equating moves from x to x+10 on one test to another, and on one region of the scale on one test to another region of the scale on the same test).

Howard Wainer explains the heroic assumptions necessary to assert a causal effect of teachers on student assessment gains here: http://www.njspotlight.com/ets_video2/

When it comes to linking the teacher value-added estimates to lifelong outcomes like student earnings, or teen pregnancy, the inability to fully isolate teacher effect from classroom effect could mean that this study shows little more than the fact that students clustered in classrooms which do well over time eventually end up less likely to have dependents while in their teens, more likely to go to college (.5%) and earn a few more dollars per week.[3]

These are (or may be) shockingly unsurprising findings.

Issue #2. Small Share of Teachers that Can Be Rated

This study does nothing to address the fact that relatively small shares of teachers can be assigned value-added scores. This study, like others, merely uses what it can – those teachers in grades 3 to 8 who can be attached to student test scores in math and language arts. More here.

Issue #3: Policy implications/spin from media assume an endless supply of better teachers?

This study, like others, makes assertions about how great it would all turn out – how many fewer teen girls would get pregnant, how much more money everyone would earn – if we could simply replace all of those bad teachers with average ones, or average ones with really good ones. But, as I noted above, these assertions are all contingent on an endless supply of “better” teachers standing in line to take those jobs. And this assertion is contingent upon there being no adverse effect on teacher supply quality if we were to all of a sudden implement mass deselection policies. The authors did not, nor can they in this analysis, address these complexities. I discuss deselection arguments in more detail in this previous post.

A few final comments on Exaggerations/Manipulations/Clarifications

I’ll close with a few things I found particularly annoying:

  • Use of super-multiplicative-aggregation to achieve a number that seems really, really freakin’ important (like it could save the economy!).

One of the big quotes in the New York Times article is that “Replacing a poor teacher with an average one would raise a single classroom’s lifetime earnings by about $266,000, the economists estimate.” This comes straight from the research paper. BUT… let’s break that down. It’s a whole classroom of kids. Let’s say… for rounding purposes, 26.6 kids if this is a large urban district like NYC. Let’s say we’re talking about earnings careers from age 25 to 65, or about 40 years. So, 266,000/26.6 = 10,000 in lifetime additional earnings per individual. Hmmm… no longer catchy headline stuff. Now, per year? 10,000/40 = 250. Yep, about $250 per year (in constant 2010 [I believe] dollars, which does mean a higher nominal total over time, as the value of the dollar declines with inflation). And that is about what the NYT graph shows: http://www.nytimes.com/interactive/2012/01/06/us/benefits-of-good-teachers.html?ref=education
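The back-of-envelope arithmetic above, written out (the class size and career length are the same rough assumptions I used, not figures from the paper):

```python
# Headline figure from the paper: lifetime earnings gain for one
# whole classroom, in 2010 dollars
classroom_lifetime_gain = 266_000

class_size = 26.6   # assumed, chosen for easy rounding (large urban district)
career_years = 40   # roughly ages 25 to 65

per_student_lifetime = classroom_lifetime_gain / class_size  # ~ $10,000
per_student_per_year = per_student_lifetime / career_years   # ~ $250

print(f"≈ ${per_student_lifetime:,.0f} per student over a career, "
      f"≈ ${per_student_per_year:,.0f} per year")
```

Same number, two framings: $266,000 makes a headline; $250 a year does not.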

  • The super-elastic, super-extra-stretchy Y axis

Yeah… the NYT graph shows an increase of annual income from about $20,750 to $21,000. But they use the usual news-reporting strategy of having the Y axis run only from $20,250 to $21,250… so the $250 increase looks like a big jump upward. That said, the authors’ own Figure 6 in the working paper does much the same!
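A quick calculation with the approximate numbers above shows the mismatch between the actual change and how the truncated axis presents it:

```python
baseline_income = 20_750   # approximate pre-increase annual income shown
increase = 250             # the estimated annual gain
axis_min, axis_max = 20_250, 21_250  # the graph's truncated Y-axis range

# The change relative to actual earnings...
true_relative_change = increase / baseline_income        # ~1.2%

# ...versus the change relative to the visible axis range
visual_share_of_axis = increase / (axis_max - axis_min)  # 25%

print(f"{true_relative_change:.1%} of baseline income, "
      f"but {visual_share_of_axis:.0%} of the plotted axis")
```

A roughly one percent change rendered as a quarter of the chart’s height – that is the stretchiness of the Y axis.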

  • Discussion/presentation of “proxy” measure as true measure (by way of convenient language use)

Many have pounced on the finding that having higher value-added teachers reduces teen pregnancy, and many have asked – okay… how did they get the data to show that? I explained above that they used a proxy measure based on the age of the female filer and the existence of dependents. It’s a proxy, and likely an imperfect one. But pretty clever. That said, in my view I’d rather the authors say throughout “reported dependents at a young age” (or a specific age) rather than “teen pregnancy.” While clever, and likely useful, it seems a bit of a stretch, and more accurate language would avoid the confusion. But again, that doesn’t generate headlines.

  • Gates study gaming of stability correlations

I’ve spent my time here on the GFR paper and pretty much ignored the Gates study. It didn’t have those really catchy findings or big headlines. And that’s actually a good thing. I did find one thing in the Gates study that irked me (I may find more on further reading). In a section starting on page 39, the report acknowledges that a common concern about using value-added models to rate teachers is the year-to-year volatility of the effectiveness ratings. That volatility is often displayed with correlations between teachers’ scores in one year and the same teachers’ scores the next year, or across different sections of classes in the same year. Typically these correlations have fallen between .15 and .5 (.2 and .48 in the previous MET study). These low correlations mean that it’s hard to pin down, from year to year, who really is a high or low value-added teacher. The previous MET report made a big deal of identifying the “persistent effect” of teachers, an attempt to ignore the noise (something which in practical terms can’t be ignored), and they were called out by Jesse Rothstein in this critique: http://nepc.colorado.edu/thinktank/review-learning-about-teaching

The current report doesn’t focus as much on the value-added metrics, but this one section goes to yet another length to boost the correlation and argue that value-added metrics are more stable and useful than they likely are. In this case, the authors propose that instead of looking at the year-to-year correlations between these annually noisy measures, we should correlate any given year with the teacher’s career-long average, on the theory that the average is a better representation of “true” effectiveness. But this is not an apples-to-apples comparison with the previous correlations, and it is not a measure of “stability.” This is merely a statistical attempt to make one measure in the correlation more stable (not actually more “true,” just less noisy, by aggregating and averaging over time) and thereby inflate the correlation to make it seem more meaningful/useful. Don’t bother! For teachers with a relatively short track record in a given school, grade level and specific assignment, and for schools with many such teachers, this statistical twist has little practical application, especially in the context of annual teacher evaluation and personnel decisions.
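A small simulation – again my own illustration with assumed parameters, not the MET data – shows why correlating a noisy annual score against a multi-year average mechanically produces a bigger number than the year-to-year correlation, even though the annual measure is no more reliable than before:

```python
import numpy as np

rng = np.random.default_rng(42)
n_teachers, n_years = 50_000, 5

# Hypothetical stable "true" effectiveness plus independent annual noise,
# with the noise SD chosen so the year-to-year correlation lands in the
# .15-.5 range reported in the literature.
true_effect = rng.normal(0.0, 1.0, (n_teachers, 1))
annual_scores = true_effect + rng.normal(0.0, 1.2, (n_teachers, n_years))

# The usual stability measure: one year's scores vs. the next year's
year_to_year = np.corrcoef(annual_scores[:, 0], annual_scores[:, 1])[0, 1]

# The proposed alternative: one year's scores vs. the career-long average
# (averaging shrinks the noise, so this correlation is mechanically larger)
career_average = annual_scores.mean(axis=1)
year_vs_average = np.corrcoef(annual_scores[:, 0], career_average)[0, 1]

print(f"year-to-year: {year_to_year:.2f}, "
      f"year vs. career average: {year_vs_average:.2f}")
```

The averaged measure correlates much more strongly with any single year by construction – nothing about the single year’s score, the thing actually used in an annual personnel decision, has gotten any less noisy.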


[1] “We first identify all women who claim a dependent when filing their taxes at any point before the end of the sample in tax year 2010. We observe dates of birth and death for all dependents and tax filers until the end of 2010 as recorded by the Social Security Administration. We use this information to identify women who ever claim a dependent who was born while the mother was a teenager (between the ages of 13 and 19 as of 12/31 the year the child was born).”

[2] There are 974,686 unique students in our analysis dataset; on average, each student has 6.14 subject-school year observations.

[3] Note that the authors actually remove their student-level demographic characteristics in the value-added model in which they associate teacher effect with student earnings. The authors note: “When estimating the impacts of teacher VA on adult outcomes using (9), we omit the student-level controls Xigt.” (p. 22) Tables in appendices do suggest that these student-level covariates may not have made much difference. But this may be evidence that the student-level covariates themselves were too blunt to capture real variation across students.

6 Things I’m Still Waiting for in 2012 (and likely will be for some time!)

I start this new year with reflections on some unfinished business from 2011 – Here are a few bits of information I anxiously await for 2012. Some are likely within reach. Others, well, not so much.

  1. A thoroughly documented (rigorously vetted) study by Harvard economist Roland Fryer, which actually identifies and breaks out in sufficient detail (& with appropriate rigor & thorough documentation) the costs of delivering in whole and in part (and costs of bringing to scale), no excuses curriculum/models/strategies and comprehensive wrap-around services.
  2. The long since promised rigorous New Jersey charter school evaluation – or even better – improved student level data in New Jersey such that researchers can actually conduct reasonable analyses of charter schooling and reforms/strategies more generally across New Jersey public & charter schools.
  3. That long list of all of those other average to below average paying professions – professions other than teaching – where compensation is entirely merit based and based substantially on (noisy) multiple regression estimates of employee effectiveness determined by the behavior of children as young as 8 years old [generously assuming 3rd grade test scores to represent the lower end of the value-added grade range],  AND where the top college graduates just can’t wait to sign up!
  4. That long list of highly successful market-based charter and/or independent private schools – schools not bound by the shackles of union negotiated agreements – where teacher compensation is not strongly predicted by (or directly a function of) experience and/or academic credentials,  AND where the top college graduates just can’t wait to sign up (or stick around)! (see also: https://schoolfinance101.wordpress.com/2010/10/09/the-research-question-that-wasn%E2%80%99t-asked/)
  5. Evidence that there really is enough money tied up in (wasted on) cheerleading and ceramics to be reallocated to provide sufficient class size reduction in core content areas and increased classroom teacher wages (toward improving teacher quality) to make substantive improvements to the quality of high poverty schools!
  6. Evidence that  the differences in student outcomes between high performing affluent suburban public school districts and lower performing poor urban and inner urban fringe school districts are somehow explained by substantial differences in personnel policies, merit-based teacher compensation, teacher benefits and negotiated agreements as opposed to substantive differences in family backgrounds and available resources.

For elaboration on a few of these issues, see my recent AP interview with Geoff Mulvihill: http://www.mycentraljersey.com/article/20120101/NJNEWS10/301010003

And so the new year of education policy research and blogging begins. A year in which I, myself, will be engaged in additional, more extensive analyses of the finances of charter schools – revenue raising and expenditure patterns by location and by network affiliation. A year in which I also expect to be digging deeper into the distribution and effects of cuts in state aid and funding constraints on school and district resource allocation, and exploring across multiple states (and districts and schools within states) the causes and consequences of inequities and inadequacies in public education funding.