Third Way’s “Revisionist Analysis” [Bold-faced lie!]

I know I said I’d stop addressing the Third Way report on Middle Class Schools, but I do have one more thing to point out. Third Way issued a memo in which it aggressively attacked my assertion that they had used district level data to characterize middle class schools. Again, this assertion was relevant to showing the absurdity of their classification scheme, but there were numerous other problems with the report.

My NEPC Review

My NEPC Response to Third Way Memo regarding Methods

Third Way claims my analyses to be “fatally flawed” because, as they claim in their follow-up memo, their analyses were actually at the school level and did not, as I show in tables in my review, contain all schools in poor cities including Detroit, Philadelphia or Chicago. Allow me to point out that what I actually said in my review was:

That is, these large urban districts are counted in any Third Way district-level analyses as middle-class districts.

I was very clear in my review that the table of large cities pertained specifically to “district-level” analyses in the Third Way report. I further explained extensively the problems with their continued mixing of school, individual family and district units.

But here’s the kicker, based on one last check of their original report and the follow-up memo. In the follow-up memo, the authors include this footnote to explain their methods – focusing on how they collected school level data from the NCES Common Core (school level data that never actually show up in any form, in any table, in their original report). Note the part in this footnote where they explain selecting “school” as the unit of analysis:

Footnote in Memo

http://content.thirdway.org/publications/446/Third_Way_Memo_-_A_Response_to_the_National_Education_Policy_Center_.pdf

Footnote #8 Third Way calculations based on data from the following source: United States, Department of Education, Institute of Education Statistics, National Center for Education Statistics, Common Core of Data. Accessed September 22, 2011. Available at: http://nces.ed.gov/ccd/bat/. The Common Core of Data includes data from the “2008-09 Public Elementary/Secondary School Universe Survey,” “2008-09 Local Education Agency Universe Survey,” and “2000 School District Demographics” from the U.S. Census Bureau. To generate data from the Common Core of Data, in the “select rows” drop down box, select “School.” Then select next. On the following page, in the “select columns” drop down box, choose the “Students in Special Programs” option. Select the box next to “Total Free and Reduced Lunch Students.” Then in the drop down box, select “Contact Information” option. Then select the box next to “Location City.” Then go back to the “select columns” drop down box and select the “Enrollment by Grade” option.  Then select the box next to “11th Grade enrollment.”  Then go more time to the “select columns” drop down box, choose “Total enrollment.” Then select the box next to “Total students.” Then select next. On the next page, choose “Illinois.” Then click the “view table” option. Once the table is compiled, download the table into Excel.csv by clicking that option at the top of the page. To calculate the number of high schools in Chicago with a student population of between 26-75% eligible for NSLP, we performed the following steps: 1) We first sorted by schools based on % NSLP (number of students eligible for free or reduced lunch divided by total number of students enrolled). 2) We then pulled out the schools that had enrollment in 11th grade. 3) We then sorted the schools based on location city, and pulled out the schools located in the City of Chicago.

Now, check out the two related (copied and pasted) footnotes from their original report. Each indicates using DISTRICT level data.

In short, the follow-up memo was simply a lie – a flat-out lie – and included revisionist analysis completely unrelated to any information actually presented in the original report.

I have retained copies of the originals, if the authors should choose to now go back and edit/change these footnotes.

Doing crappy analysis is one thing. Trying to cover it up by lying and revising while leaving the trail behind really doesn’t help.

Original Report

http://content.thirdway.org/publications/435/Third_Way_Report_-_Incomplete_How_Middle_Class_Schools_Aren_t_Making_the_Grade_-_PRINT.pdf

Footnote #40 Third Way calculations based on data from the following source: United States, Department of Education, Institute of Education Statistics, National Center for Education Statistics, Common Core of Data. Accessed July 25, 2011. Available at: http://nces.ed.gov/ccd/bat/. The Common Core of Data includes data from the “2008-09 Public Elementary/Secondary School Universe Survey,” “2008-09 Local Education Agency Universe Survey,” and “2000 School District Demographics” from the U.S. Census Bureau. To generate data from the Common Core of Data, in the “select rows” drop down box, select “District.” Then select next. On the following page, in the “select columns” drop down box, choose the “Census 2000 – Household Income, Occupancy and Size” option. Then check the box next to “Median Family Income.” Then go back to the “select columns” drop down box, choose the “Students in Special Programs” option. Select the box next to “Total Free and Reduced Lunch Students.” Then go back one more time to the “select columns” drop down box, choose “total enrollment.” Then select the box next to “total students.” Then select next. On the next page, choose the “Select 50 States + DC” filter from the drop down box. Then click the “view table” option. Once the table is compiled, download the table into Excel.csv by clicking that option at the top of the page. To calculate average household income by school district, we performed the following steps: 1) We first sorted school districts based on % NSLP (number of students eligible for free or reduced lunch divided by total number of students enrolled). 2) Using CPI for 2009, we adjusted the incomes for inflation. 3) We then found the median household income, based on the following groupings: 0-25.44%, 25.45-75.44%, 75.45-100% NSLP.

Footnote #88 Third Way calculations based on data from the following source: United States, Department of Education, Institute of Education Statistics, National Center for Education Statistics, Common Core of Data. Accessed July 25, 2011. Available at: http://nces.ed.gov/ccd/bat/. The Common Core of Data includes data from the “2008-09 Public Elementary/Secondary School Universe Survey”, “2008-09 Local Education Agency Universe Survey,” and “2000 School District Demographics” from the Census Bureau. To generate data from the Common Core of Data, in the “select rows” drop down box, select “District.” Then select next. On the following page, in the “select columns” drop down box, choose the “Census 2000 – Household Income, Occupancy and Size” option. Then check the box next to “Median Family Income.” Then go back to the “select columns” drop down box, choose the “Students in Special Programs” option. Select the box next to “Total Free and Reduced Lunch Students.” Then go back one more time to the “select columns” drop down box, choose “total enrollment.” Then select the box next to “total students.” Then select next. On the next page, choose the “Select 50 States + DC” filter from the drop down box. Then click the “view table” option. Once the table is compiled, download the table into Excel.csv by clicking that option at the top of the page. To calculate average household income by school district, we performed the following steps: 1) We first sorted school districts based on % NSLP (number of students eligible for free or reduced lunch divided by total number of students enrolled). 2) Using CPI for 2009, we adjusted the incomes for inflation. 3) We then found the median household income, based on the following groupings: 0-25.44%, 25.45-50.44%, 50.45-75.44%, 75.45-100% NSLP.
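To make concrete what the district-level steps in Footnotes #40 and #88 amount to, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the district rows are invented placeholders (not Common Core of Data values), the field and function names are hypothetical, and the CPI adjustment step is left out. The band edges follow the three groupings listed in Footnote #40.

```python
from statistics import median

# Hypothetical district-level rows. These are placeholders, NOT real CCD values.
districts = [
    {"name": "District A", "frl": 100, "enrollment": 1000, "median_income": 90000},
    {"name": "District B", "frl": 400, "enrollment": 1000, "median_income": 60000},
    {"name": "District C", "frl": 700, "enrollment": 1000, "median_income": 40000},
    {"name": "District D", "frl": 900, "enrollment": 1000, "median_income": 30000},
]

# Band edges taken from Footnote #40: 0-25.44%, 25.45-75.44%, 75.45-100% NSLP.
# The labels are my own shorthand, not the report's.
BANDS = [
    ("low-poverty", 0.0, 25.45),
    ("middle", 25.45, 75.45),
    ("high-poverty", 75.45, 100.01),
]


def pct_nslp(row):
    """Percent of enrolled students eligible for free or reduced lunch."""
    return 100.0 * row["frl"] / row["enrollment"]


def band_of(pct):
    """Return the label of the band containing pct."""
    for label, lo, hi in BANDS:
        if lo <= pct < hi:
            return label
    raise ValueError(f"percent out of range: {pct}")


def median_income_by_band(rows):
    """Group whole districts by band, then take the median income per band."""
    groups = {}
    for row in rows:
        groups.setdefault(band_of(pct_nslp(row)), []).append(row["median_income"])
    return {label: median(values) for label, values in groups.items()}


result = median_income_by_band(districts)
print(result)
```

The point to notice is that each row is a whole district: once “District” is the selected unit, there is no way for part of a district to land in one band and part in another.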

Newsflash! “Middle Class Schools” score… uh…in the middle. Oops! No news here!

I’ve already beaten the issue of the various flaws, misrepresentations and outright data abuse in the Third Way middle class report into the ground on this blog. And it’s really about time for that to end. Time to move on. But here is one simple illustration which draws on the same NAEP data compiled and aggregated in the Middle Class report. For anyone reading this post who has not already read my others on the problems with the definition of “Middle Class,” and related data abuse & misuse, please start there:

My NEPC Review

My NEPC Response to Third Way Memo regarding Methods

My blog response to the argument that I’m simply a Status-quo-er

Again, the entire basis of the Third Way report is that our nation’s middle class schools are under-performing… not meeting expectations… dismal…dreadful… failures!  Now, setting aside the absurd methods used for classifying “middle class” and setting aside that the report mixes units of analysis illogically throughout (districts vs. schools vs. individual families, regardless of district or school attended) and mixes data across generations of high school graduates, how did they really expect middle class schools to perform? Did they expect them NOT to be IN THE MIDDLE? That seems rather foolish. No, wait, it is entirely foolish!

Here’s one very simple example showing the NAEP 8th grade math mean scale scores of children in 2009 by the percent of children in their school who qualify for the National School Lunch Program:

Rather amazingly, what we see here is that as school level % low income increases, NAEP mean scale scores decrease. Interestingly, the NAEP reporting tool chooses to include anomalous categories of 0% and 100%, which, not surprisingly, don’t fall right in line. Across the low income brackets, but for the anomalous endpoints, the relationship is nearly linear – with mean scale scores declining incrementally from the 1 to 5% low income group to the 76 to 99% category. Note also that, consistent with my previous explanations, the supposed “middle class” is actually to the right hand side – the poorer side – of the distribution.

Most importantly… and really no freakin’ surprise… in fact something I shouldn’t ever even have to graph in order to validate it – THE SUPPOSED “MIDDLE CLASS” SCHOOLS FALL WHERE? RIGHT IN LINE! RIGHT IN THE DAMN MIDDLE OF THE CATEGORIES ON EITHER SIDE OF THEM? HOW THE HECK IS THAT PERFORMING UNDER EXPECTATIONS? THAT, MY FRIENDS, IS LUDICROUS! IT’S RIGHT ON EXPECTATIONS – STATISTICALLY!
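What “right on expectations, statistically” means can be sketched in a few lines of Python: fit a straight line through the brackets on either side of the “middle class” group, then see how far the middle brackets sit from that line. The numbers below are PLACEHOLDERS chosen to be exactly linear for illustration; they are not the NAEP values from the figure, and the bracket midpoints and names are my own assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line through the points; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# (bracket midpoint in % low income, mean scale score). PLACEHOLDER values only.
outer_brackets = [(3.0, 298.5), (15.0, 292.5), (88.0, 256.0)]
middle_brackets = [(40.0, 280.0), (60.0, 270.0)]

slope, intercept = fit_line(
    [x for x, _ in outer_brackets],
    [y for _, y in outer_brackets],
)

# Gap between each middle bracket's score and what the outer brackets predict.
gaps = [y - (slope * x + intercept) for x, y in middle_brackets]
print(slope, gaps)
```

A gap near zero means the middle brackets score what their neighbors predict. Under-performance would show up as clearly negative gaps, which is the claim one would need the real data to support.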

Whether we as a country are, or whether I specifically am, happy with the level or distribution of outcomes in the above figure is an entirely different issue. I might want to see higher outcomes across the board. Personally, I’d love to see the resources leveraged to begin to raise the outcomes on the right hand side of the graph – to reduce the clear linear relationship between low income concentrations and student outcomes. But I also understand that the national aggregate relationship shown in the figure above has underlying it the embedded disparities of 50 unique state education systems, some where states are making legitimate efforts to provide resources to improve equity in educational outcomes, and others, quite honestly, that have done little or nothing for decades and in some cases have systematically eroded the equity and adequacy of resources over time (well before the current fiscal crisis)!

Fixing these disparities is a large and complex task and one that is not aided by small minded rhetoric and flimsy oversimplified analyses.

Insult of insults from Third Way – Baker, You… You… Status Quo…er!

I gotta admit that my favorite part of the Third Way memo responding to my critique of their “Middle Class” report is the end of the memo.

Here are the two concluding paragraphs from the Third Way memo in reply to my rather harsh critique of their report:

 There are 52,860 public and charter schools that fall within our definition of middle-class schools, and they educate 25.7 million students. The message from Dr. Baker and the NEPC seems to be—let’s ignore them. In fact, let’s not even define them. Our view is that there is immense potential out there. These schools are failing in their basic mission—to become college factories.

From our perspective, college graduation rates of 31% and 23% in the second and third NSLP groupings, respectively—as our report presents—are unacceptable for America’s economic future. Clearly, the NEPC and Dr. Baker disagree and are satisfied with the status quo. We are not.

Yes, there it is. The insult of insults in reformyland! I am, as a result of critiquing their near criminal abuse of data, a… a… Status Quo-er!

Obviously, anyone (like me) who might take offense at such egregious representation of data must be a defender of the status quo. That is the worst offense in today’s reform debate. Especially if the egregious abuse of data was done with good intentions? Right? Done with the good intentions of letting the American public understand just how awful their schools are!  They need to know. America needs to know! And now! This can’t wait! Even if we have to classify information illogically or draw conclusions that don’t even match our data?

Look, bad data analyses and bombastic conclusions about our supposed education apocalypse do little or nothing to start a genuine conversation about either the true current conditions of our schools or whether we should be considering systemic changes.

Often, such crisis mode reporting has as its central objective, encouraging the public and policymakers to act in haste and adopt ill-conceived (often self-serving) policy before they know what’s really going on. That is, let’s get in a panic and adopt something really stupid and fast.  Any reader should be wary of and evaluate critically crisis-mode reports like the Third Way middle class report. Some such reports may ultimately reveal important issues and some even with a degree of immediacy. Third Way’s report reveals neither.

Third Way Responds but Still Doesn’t Get It!

Third Way has posted a response to my critique in which they argue that their analyses do not suffer the egregious flaws my review indicates. Specifically, they bring up my reference to the fact that whenever they are using a “district” level of analysis, they include the Detroit City Schools in their entirety in their sample of “middle class.” They argue that they did not do this, but rather only included the middle class schools in Detroit.

The problems with this explanation are many. First, several of their methodological explanations specifically refer to doing computations based on selecting “district” not school level data. For example, Footnote #8 in their report explains:

Third Way calculation based on the following source: New America Foundation, “Federal Education Budget Project,” Accessed on April 22, 2011. Available at: http://febp.newamerica.net/k12

The New America data set provides data at either the state, or DISTRICT level (see lower right hand section of page from link in footnote), not school level. And financial data of this type are not available nationally at the school level. You couldn’t select some and not all schools for financial data. My tabulations of who is in, or out of the sample are based on the district level data from the link in that web site.

Further, the authors later explain to their readers, in Footnote #40, in great detail, how to construct a data set to identify the middle class schools, using the NCES Common Core of Data Build a Table Function. Specifically, the instructions refer to selecting “district” to construct the data set. That selection creates a file of district level, not school level data. As such, a district is in or out in its entirety.

Third Way calculations based on data from the following source: United States, Department of Education, Institute of Education Statistics, National Center for Education Statistics, Common Core of Data. Accessed July 25, 2011. Available at: http://nces.ed.gov/ccd/bat/. The Common Core of Data includes data from the “2008-09 Public Elementary/Secondary School Universe Survey,” “2008-09 Local Education Agency Universe Survey,” and “2000 School District Demographics” from the U.S. Census Bureau. To generate data from the Common Core of Data, in the “select rows” drop down box, select “District.”

In my review, I explain thoroughly that Third Way mixes units of analysis throughout their report, sometimes referring to district level data from the New America Foundation data set, sometimes referring to NCES tabulations of data based on the Schools and Staffing Survey (not even their own original analyses of SASS data), and in some cases referring to data on individual children from the high school graduating class of 1992. In fact, the title of a section of the review is “mixing and matching data sources.” I explained in my review:

The authors seem to have overlooked the fact that NCES tables based on Schools and Staffing Survey data typically report characteristics based on school-level subsidized lunch rates. As such, within a large, relatively diverse district like New York City, several schools would fall into the authors’ middle-class grouping, while others would be considered high-poverty, or low-income, schools. But, many other of the authors’ calculations are based on district-level data, such as the financial data from New America Foundation. When using district-level data, a whole district would be included or excluded from the group based on the district-wide percentage of children qualifying for free or reduced-price lunch. What this means is that the Third Way report is actually comparing different groups of schools and districts from one analysis to another, and within individual analyses.

When referring to district level data, the district of Detroit would be included in its entirety. When referring to aggregations from tables based on the Schools and Staffing Survey, as I explain, some would be in and some would be out.
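The difference between the two units can be shown with a small Python sketch. The four schools below are invented placeholders for a single hypothetical district, and the 25% to 75% cutoffs are the ones described in the report; nothing here is drawn from actual Detroit or Common Core data.

```python
# Hypothetical schools in ONE district (placeholder counts, not real data).
schools = [
    {"school": "School 1", "frl": 90, "enrollment": 100},
    {"school": "School 2", "frl": 95, "enrollment": 100},
    {"school": "School 3", "frl": 60, "enrollment": 100},
    {"school": "School 4", "frl": 60, "enrollment": 200},
]


def is_middle(frl, enrollment, lo=25.0, hi=75.0):
    """True if the percent eligible for free or reduced lunch is within [lo, hi]."""
    pct = 100.0 * frl / enrollment
    return lo <= pct <= hi


# District-level unit: one rate for the whole district, so it is in or out entirely.
district_frl = sum(s["frl"] for s in schools)
district_enrollment = sum(s["enrollment"] for s in schools)
district_is_middle = is_middle(district_frl, district_enrollment)
students_counted_district = district_enrollment if district_is_middle else 0

# School-level unit: each school is classified on its own rate.
middle_schools = [s for s in schools if is_middle(s["frl"], s["enrollment"])]
students_counted_school = sum(s["enrollment"] for s in middle_schools)

print(district_is_middle, students_counted_district)
print([s["school"] for s in middle_schools], students_counted_school)
```

In this toy case the district as a whole sits at 61%, so a district-level analysis counts all 500 students as “middle class,” while a school-level analysis counts only the 300 students in Schools 3 and 4. Mixing the two units across analyses means comparing different groups.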

Further, the authors refer throughout to the groupings by subsidized lunch rates as quartiles. They are not. Quartiles would include even distributions – quarters – of either children, schools or districts. The selected cutoffs of 25% and 75% qualified for free or reduced lunch do not yield quartiles, as shown by their own data.
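A short Python sketch makes the distinction concrete. The twelve percentages are invented placeholders, and the function names are hypothetical; the only things taken from the report are the 25% and 75% cutoffs. True quartile cut points are computed from the data with the standard library’s `statistics.quantiles`.

```python
from statistics import quantiles

# Hypothetical % free or reduced lunch for 12 schools (placeholders, not real data).
pct_frl = [5, 12, 20, 28, 33, 41, 47, 52, 58, 64, 70, 88]


def counts_fixed_cutoffs(values, low_cut=25, high_cut=75):
    """Count schools below, between, and above fixed cutoffs."""
    low = sum(v < low_cut for v in values)
    high = sum(v > high_cut for v in values)
    return low, len(values) - low - high, high


def counts_true_quartiles(values):
    """Count schools in each quarter, using cut points computed from the data."""
    q1, q2, q3 = quantiles(values, n=4)
    return [
        sum(v <= q1 for v in values),
        sum(q1 < v <= q2 for v in values),
        sum(q2 < v <= q3 for v in values),
        sum(v > q3 for v in values),
    ]


fixed = counts_fixed_cutoffs(pct_frl)
quartile = counts_true_quartiles(pct_frl)
print(fixed)     # groups of unequal size
print(quartile)  # four groups of equal size
```

With fixed cutoffs the group sizes depend entirely on where schools happen to fall, so the “middle” group can swallow most of the distribution. Quartiles, by construction, split the schools into equal quarters.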

The bottom line, however, is that the arbitrary, broad and imbalanced subsidized lunch cutoffs chosen by the authors work well for neither district nor school level analysis, much less an inconsistent mix of the two. And the authors fail to understand that applying the same income thresholds across states and regions of the U.S. yields vastly different populations. Having income below 185% of the income level for poverty provides for a very different quality of life in New York versus New Mexico (for some discussion, see: https://schoolfinance101.wordpress.com/2011/09/13/revisiting-why-comparing-naep-gaps-by-low-income-status-doesnt-work/).

But, in their response, the Third Way authors also downplay the importance of any analyses that might have been done with district level data, stating that their most significant conclusions were not drawn from these data.

As I explain in my review, it would appear that their boldest conclusions were actually drawn from data on a completely different measure at a completely different unit of analysis, and for a completely different generation. Most of their conclusions about college graduation rates appear to be based on individuals who graduated from high school in 1992 (by my tracking of their Footnote #90). Further, when evaluating individual family income based data, the measure of middle class is entirely different, and we don’t know whether those children attend “middle class” schools or districts at all. That is, students are identified by a family income measure and placed into quartiles, regardless of the income levels of their schools. We don’t know which of them attended “middle class” schools and which did not. But, we do know that they graduated about 20 years ago, reducing their relevance for the analysis quite substantially.

For these reasons, the reply by the authors does little to help explain or redeem the report. Readers should also note that these (the issues discussed above) were only a subset of the problems with the report, which included, among other things, claims about middle class under-performance refuted by their own tables on the same page.

These are severe methodological flaws of a type one does not see regularly in “high profile” reports making bold claims about the state of American public education. In my view, the Third Way’s bold proclamation about the dreadful failures of our middle class schools, supported only by severely flawed analyses, was worthy of a bold response.

A few additional comments & data clarifications:

In their reply memo, the authors list the total numbers of schools in Detroit and other cities that fall above and below their subsidized lunch cut off points, arguing that these are the actual numbers of schools in each city which they included in their “middle class” group and arguing that this clarification negates entirely my concern as to which districts are and are not included. Again, whether the illogical and unfounded cut points were applied to school or district level data doesn’t actually matter that much. It’s bad analysis either way.

But, the tabulation they provide in the memo, which is likely drawn from school level data from the NCES Common Core, Public School Universe Survey, does not actually relate to the vast majority of tables and analyses reported in their original document. Either the authors simply don’t understand this, or the memo is a knowingly false representation of their analyses. Here’s a quick rundown:

  1. Financial data used in the report, for per pupil expenditure calculations are not available at the school level.
  2. Teacher salary and all teacher characteristics comparisons were based on pre-made tables based on Schools and Staffing Survey data, which is a SAMPLE of about 8,000 or so schools out of 100,000 or so nationally. I point out in my review that these pre-made NCES tables reporting on SASS data would have schools within districts falling on either side of the cut off lines. The authors do not appear to have actually used SASS data themselves, which would provide much more flexibility in the analysis. Rather, the authors performed calculations based on tables in NCES reports using SASS data.
  3. NAEP (National Assessment of Educational Progress) data simply can’t be parsed by school within district in any way that would represent all schools within each district that fall above and/or below the cut points used (as implied in their memo). NAEP data could be reported (or drawn from reports) based on average school characteristics, or based on child characteristics. Third Way appears to have used this easy table creator tool from NAEP (see their FN#52). So, yes, the NAEP tabulations would split schools within large districts. But, to be clear, these would not match the school counts they report in their memo because NAEP is based on sample data. Further, the problem here is that their report infers a relationship between the students’ scores on NAEP and the financial data when there is only partial overlap between the two because different units are used for each. Nonetheless, the BIG takeaway regarding the tables of NAEP data is that students who attend the middle brackets of schools score… in the middle! Suggesting that these data reveal dreadful failures of middle class schools is delusional (in a purely statistical sense, that is)!
  4. The data on college matriculation and on graduation by age 26 (their boldest conclusions) are cited to reports done by others, most significantly to the Bowen book Crossing the Finish Line, which in its early sections (Chapter 2) includes family income quartile data based on the National Education Longitudinal Study of the 8th grade class of 1988 (NELS 88); other data in the Bowen book (as I explain in the review) are on select states only. It is entirely inappropriate to extrapolate either the NELS 88 findings or select state findings to the national population in “middle class” schools. We may know individual family income quartile, but we do not know their schools’ characteristics. Arguably, it is entirely inappropriate for Third Way, on page 5 of their reply memo, to claim regarding the completion rates of 26 year olds that “This is the major finding of our paper,” when it is, in fact, not their finding at all, but rather a citation to a finding in a book by someone else!

While the authors seem to wish to argue that my criticism over the poverty classification applied to district level data does not undermine their major conclusions, that is clearly not the case. Given these concerns that exist across a) financial input data, b) teacher characteristics data, c) achievement outcome measures and d) college completion data, and the misalignment of units across all measures, not a single conclusion of the Third Way report remains intact.

One difference between Playin’ Jazz and Policy Research: Comments on the Third Way “Middle Class” Reply

Occasionally on this blog, I slip in some jazz references. I often see commonalities between jazz improvisation and policy analysis. But I think I’ve finally found one thing that is very different.

A lot of jazz teachers will joke around with students about what to do when you’re improvising a solo over chord changes, perhaps to a standard tune, and you happen to land unintentionally on a dissonant note.  Somethin’ with a really sour sound!  The usual advice is if you hit such a note, play it even louder a few more times! Make it sound intentional. Of course, you eventually want to resolve the dissonance, not end on it. But work it until then.

Well, I’m not sure that this principle applies well to policy research. Here’s why. I just completed a review of a report by Third Way, a think tank I’d never heard of previously. Third Way released a report on what it called “Middle Class” schools, and argued that these schools aren’t making the grade. Methodologically, this report was about the most god-awful thing I’ve ever had to read.  Here is the abstract of my review:

Incomplete: How Middle Class Schools Aren’t Making the Grade is a new report from Third Way, a Washington, D.C.-based policy think tank. The report aims to convince parents, taxpayers and policymakers that they should be as concerned about middle-class schools not making the grade as they are about the failures of the nation’s large, poor, urban school districts. But, the report suffers from egregious methodological flaws invalidating nearly every bold conclusion drawn by its authors. First, the report classifies as middle class any school or district where the share of children qualifying for free or reduced-priced lunch falls between 25% and 75%. Seemingly unknown to the authors, this classification includes as middle class some of the poorest urban centers in the country, such as Detroit and Philadelphia. But, even setting aside the crude classification of middle class, none of the report’s major conclusions are actually supported by the data tables provided. The report concludes, for instance, that middle-class schools perform much less well than the general public, parents and taxpayers believe they do. But, the tables throughout the report invariably show that the schools they classify as “middle class” fall precisely where one would expect them to—in the middle—between higher- and lower-income schools.

http://nepc.colorado.edu/thinktank/review-middle-class

In short, the layers of problems with the report were baffling. Among those layers of problems was a truly absurd definition of “middle class” schools, which, when I went to some of the data sources cited in order to evaluate the membership of “middle class” schools, I found school districts including Detroit, Philadelphia and numerous other large poor urban centers. Yet, throughout, the authors suggested that they were characterizing stereotypical “middle class” schools.

So, here’s the fun part. In response to my critique, did the Third Way authors consider at all the possibility that they had not done a very methodologically strong report? That their definition of “middle class” districts might have a few problems? Hell no. What did they do with that dissonant note? They took the advice of jazz instructors, and decided to defend that note, and play it loudly a few more times!

In their own words:

Let us be clear: Our decision to use this criteria was a deliberate choice, grounded in established procedures and data. http://perspectives.thirdway.org/?p=1173

But really. Let’s be more clear. While you might claim to have played this sour note deliberately, or might be trying to convince us as much, it just doesn’t cut it in policy research. Maybe sometimes it doesn’t really work in Jazz that well either. I don’t really like to see people in the front row cringe while I’m playin’ or encourage them to cringe a few more times before I provide them relief.

Please, don’t make me cringe anymore by defending indefensible criteria and shoddy analyses. It’s time to go back to the woodshed. Go home. Do some practicing. Learn the tunes. Learn the changes. It takes time and discipline and we all play those dissonant notes sometimes. I’ve certainly played my share over time. Sometimes we make ’em work. A lot of the time it can’t be done. Perhaps in this way, the discipline of good policy analysis and the discipline of solid jazz improv are quite similar.

A related parable from Jazz history: http://www.guardian.co.uk/music/2011/jun/17/charlie-parker-cymbal-thrown

Oh, and a few more comments. The “middle class” definition issue is but one of many egregious flaws in the report. Among other things, the authors repeatedly refer to quartiles which are not in fact quartiles. The authors make repeated claims implying that today’s middle class schools are only getting ¼ of their graduates through college by age 26, but a little detective work shows that this claim is actually cited back to a source using data on the high school class of 1992 (20 freakin’ years ago). The report confuses individuals from middle class families with students who attended schools that, on average, are middle class (not the same). Finally, the report constantly notes that middle class schools do not meet expectations, while providing tables showing that the middle class students, on average, perform where? In the middle. Right where expected!

Piloting the Plane on Musical Instruments & using SGPs to Evaluate Teachers

I’ve posted a few blogs recently on the topic of Student Growth Percentile Scores, or SGPs and how many state policymakers have moved to adopt these measures and integrate them into new evaluation systems for teachers. In my first post, I argued that SGPs are simply not designed to make inferences about teacher effectiveness.

The designers of SGP replied to my first post, suggesting that I was conflating the measures with their use by suggesting that these measures can’t and shouldn’t be used to infer teacher effectiveness.  And in their response (more below), they explained in greater detail, what was essentially my main point – that SGPs are not designed or intended to infer teacher effectiveness from student achievement growth. They also argued that the policy makers they have advised on adopting SGPs understood that.

Well, let’s review what’s going on in New Jersey. In New Jersey, a handful of districts have signed on to the department of education’s Pilot teacher evaluation program, explained here: http://www.state.nj.us/education/EE4NJ/faq/

Specifically, here’s how NJDOE responds to the question over how standardized testing data, and SGPs based on those data would be used within the pilot evaluations:

From NJDOE

Q:  How much weight do standardized test scores get in the evaluations?

A:  Standardized test scores are not available for every subject or grade. For those that exist (Math and English Language Arts teachers of grades 4-8), Student Growth Percentages (SGPs), which require pre- and post-assessments, will be used. The SGPs should account for 35%-45% of evaluations.  The NJDOE will work with pilot districts to determine how student achievement will be measured in non-tested subjects and grades.

Now, here is a quote from Betebenner and colleagues’ response to my criticism of policymakers’ proposed uses of SGPs in teacher evaluation.

From Damian Betebenner & colleagues

A primary purpose in the development of the Colorado Growth Model (Student Growth Percentiles/SGPs) was to distinguish the measure from the use: To separate the description of student progress (the SGP) from the attribution of responsibility for that progress.

(emphasis added)

But, you see, using these data to “evaluate teachers” necessarily implies “attribution of responsibility for that progress.” Attribution of responsibility to the teacher!  If one cannot use these measures to attribute responsibility to the teacher, then how can one possibly use these measures to “evaluate” the teacher? One can’t. You can’t. No-one can. No-one should!

Perhaps in an effort to preserve proprietary interests, Betebenner and colleagues in their reply to my original criticism also note:

To be clear about our own opinions on the subject: The results of large-scale assessments should never be used as the sole determinant of education/educator quality.

No state or district that we work with intends them to be used in such a fashion. That, however, does not mean that these data cannot be part of a larger body of evidence collected to examine education/educator quality.

But this statement stands in direct conflict with the first above. If the tool is insufficient for – simply not even designed to – ATTRIBUTE RESPONSIBILITY FOR PROGRESS to either teachers or schools, then it simply can’t and SHOULDN’T BE USED THAT WAY! Be it for 10% or 90%.

The reality is that even though Betebenner and colleagues explain that they believe that the policymakers with whom they have consulted “get it” and would never consider misusing the measures in the ways I explained in my original post, that is precisely what is going on.

Also, I noted previously that this paragraph from their response is a complete cop out. I explained:

What the authors accomplish with this point, is permitting policymakers to still assume (pointing to this quote as their basis) that they can actually use this kind of information, for example, for a fixed 90% share of high stakes decision making, regarding school or teacher performance, and  certainly that a fixed 40% or 50% weight would be reasonable. Just not 100%. Sure, they didn’t mean that. But it’s an easy stretch for a policymaker.

If the measures aren’t meant to isolate system, school or teacher effectiveness, or if they were meant to but simply can’t, they should NOT be used for any fixed, defined, inflexible share of any high stakes decision making.  In fact, even better, more useful measures shouldn’t be used so rigidly.

[Also, as I’ve pointed out in the past, when a rigid indicator is included as a large share (even 40% or more) in a system of otherwise subjective judgments, the rigid indicator might constitute 40% of the weight but drive 100% of the decision.]
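Here is a minimal sketch of how that can happen. Every number below is invented for illustration, and the two component names (`observation`, `test_based`) are hypothetical. The only assumption that matters is that the subjective ratings cluster tightly while the rigid indicator spreads widely.

```python
# Hypothetical ratings for five teachers (all values invented for illustration).
# Observation scores cluster near 3; the test-based indicator ranges from 1 to 5.
teachers = {
    "A": {"observation": 3.1, "test_based": 1.0},
    "B": {"observation": 3.0, "test_based": 4.0},
    "C": {"observation": 3.2, "test_based": 2.0},
    "D": {"observation": 2.9, "test_based": 5.0},
    "E": {"observation": 3.0, "test_based": 3.0},
}

TEST_WEIGHT = 0.40  # fixed share assigned to the rigid indicator


def composite(rating):
    """Weighted evaluation score: 40% test-based, 60% observation."""
    return TEST_WEIGHT * rating["test_based"] + (1 - TEST_WEIGHT) * rating["observation"]


# Rank teachers from lowest to highest under each scheme.
rank_by_composite = sorted(teachers, key=lambda t: composite(teachers[t]))
rank_by_test = sorted(teachers, key=lambda t: teachers[t]["test_based"])
rank_by_observation = sorted(teachers, key=lambda t: teachers[t]["observation"])

print("composite:  ", rank_by_composite)
print("test only:  ", rank_by_test)
print("observation:", rank_by_observation)
```

With these made-up numbers, the composite ranking is identical to the ranking on the test-based indicator alone, even though that indicator carries only 40% of the nominal weight. The component with the most variation decides who lands at the bottom.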

Look. It’s pretty simple. If you want to pilot an airplane effectively, the plane needs to have the right instruments – flight instruments. If you’re coming in for a landing in dense fog in mountainous terrain, you look down to where your flight instruments should be, http://www.b737.org.uk/images/fltinsts_panel_nonefis.jpg, and there sits an alto saxophone instead (albeit a fine, Selmer Mark VI w/serial # in the 180s), you’re screwed. You might have a few minutes left to blow through the changes to Foggy Day, but your chances of successfully piloting the plane to a safe landing are severely diminished.

Okay, this analogy is a bit of a stretch. But it is not a stretch to acknowledge that SGPs were simply not designed to attribute responsibility for student progress to teachers. Meanwhile, VAM models try, but are unable to effectively, accurately or precisely attribute student progress to teachers. So, we have a choice of piloting the plane with either a) the wrong instruments (SGP) or b) instruments that don’t work very well (have high error rates & comparable problems of inference).  When faced with choices this bad, it may be wise to take another course entirely. Don’t pilot the damn plane! It would be a shame to crash it with such a beautiful saxophone on board!

On ignorance & impartiality: A comment on the Monmouth U. Poll on Ed. Policy

Some Twitter followers may have noticed the ongoing back and forth regarding the validity of the recent Monmouth University Poll on education reform. I’d certainly rather spend my time on more substantive discussion.

As I’ve noted on many occasions, polls are what they are. They ask what they ask. And the responses to the questions must always be evaluated only with respect to what was asked. Questions about specific policies in particular require that the policies in question be described correctly. This is a point raised the other day by Matt Di Carlo about the Monmouth Poll here.

Yesterday, Patrick Murray, director of the polling institute posted a response to some of the criticisms levied against the recent Monmouth poll. Unfortunately, I found his response to be much less fulfilling and in many ways far more disturbing than the poll itself. Quite honestly, I’d have left this issue alone if not for some particularly troublesome assertions made by the polling institute director Patrick Murray.

First, here is my response regarding the substantive issue raised by Matt Di Carlo:

Mr. Murray points out that he, as many pollsters do, chose to use colloquial language to describe “tenure.” The problem, as explained by Matt Di Carlo here http://shankerblog.org/?p=3695, is that the colloquial characterization was factually incorrect, and that it would be possible to achieve a colloquial characterization that is not factually incorrect. The factual error in the characterization of tenure leads to a clear bias in the question. This is the most obvious example, but there are numerous more subtle cases where questions do not accurately represent existing or proposed legislation or regulations.

Here are a few additional points regarding content in Mr. Murray’s response:

Specifically, Mr. Murray contends that critics were simply unhappy with the results, and offered no substantive criticism of the methods.

On Twitter, I have criticized the title of the press release for the poll, which claims that the poll results indicate broad support for New Jersey reforms, implying that responses to the specific questions regarding policies can be taken as supporting the specific policies being proposed. That is, it implies a close relationship between the policies framed in the questions and actual policy proposals on the table. Usually, it is the media that make such misguided leaps. In this case, the polling institute provided them with the misleading headline.

Mr. Murray’s response not only defends the headline, but he actually makes even less justified statements (slightly more specific) to the same effect. Mr. Murray claims that the poll results provide “broad, general support” for the “Governor’s proposals”, which happen to be rather specific proposals (many of which are not actually the governor’s proposals, but proposals for which he has offered support).  But, very few (if any) of the questions in the poll accurately represent the specific proposals (like mischaracterizing what tenure is).  The questions are broad, and imprecise (if intended to discern support for existing proposals). They are general. Some are outright incorrect. As a researcher, I can assure you that a response to one question, referring to one type of policy (a hypothetical policy that is substantively different from the actual proposals) should not be interpreted as relating to another (without careful statistical validation, which would involve asking the other question).  That is a methodological concern. Not a concern with the findings. It is a concern largely over the representation of findings (press release titles matter), as opposed to the usual quibbling over sampling issues.

After defending the wording of the tenure question, Mr. Murray goes on to discuss the follow up questions to the tenure question – specifically those about how the general public would like to see tenure changed. The problem is that each of these questions about how to “change” tenure is invalid because “change” in the mind of the respondent (at least the uninformed respondent) is measured against an incorrectly defined baseline of what tenure is. That is, Mr. Murray has provided a prompt in the first tenure question that incorrectly describes tenure, asserting that tenure means that a teacher can only be fired for “serious misconduct.” Then he asks in a series of questions whether that should be changed and how. If the baseline condition – existing policy – is described incorrectly, arguably biased – then responses to subsequent questions are influenced by this. That is either biased, or simply sloppy.

Which brings up a related issue. Mr. Murray notes that many if not most poll respondents were unaware of policies, or details of reforms. Because of that, the phrasing of the questions, the colloquial explanations of the policies are of even greater importance, having even greater potential to shape the response. That phrasing can be the basis of grossly misinforming the otherwise uninformed respondent. And it just may have been.

The most significant and most disturbing point:

Setting aside this methodological quibbling, I take issue with Mr. Murray’s point that academic researchers might come at these issues with normative values – as I admittedly do – and that having normative values (based on years of extensive research on these topics) somehow invalidates someone’s ability to critique the poll. Mr. Murray explains:

 To start, most of the criticism has come from people without expertise in the field of survey research.  Some has, which I will treat more seriously.  But it’s important to note that all of these critics, including some who are academic researchers, have taken very public normative positions on education policy.  Normative is one of those great social science words.  It simply means they already have a clear opinion about how things ought to be.  When normative values get applied in a research setting, they lead to bias.

So, in other words. If you don’t have expertise in opinion research, your criticisms should not be taken seriously. And, if you have far too much knowledge and expertise in the substance of the poll (education law, policy and reform), you are too biased for your opinion to carry any weight. This argument is patently absurd.

As Mr. Murray frames it, only through blissful ignorance  on issues of substance can anyone be sufficiently impartial to be involved in, or make claims or arguments regarding either substance or method.  Those with knowledge and opinions derived from that knowledge are necessarily too biased to have valid concerns. I’ll admit that I have biases for rigorous research methodologies.

Like Dr. Di Carlo (who holds a Ph.D. in Sociology from Cornell), I’m not a pollster. I’m a researcher and perhaps that alters my view on how research is conducted and what kinds of conclusions can be reasonably drawn from survey responses to questions with specific wording.  I generally don’t care much for polls or polling results, but I am a stickler for methods.

This poll was about policies, not politicians. And as someone who studies policies I am particularly sensitive to the details of policy design & implementation. This poll was clearly not sensitive to those details and was exceptionally sloppy in its characterization of policies and policy design. And that’s a methodological problem, and one that is so glaringly apparent because of my academic expertise in this area – not because of some normative bias – but, because of actual details, including statutes and regulations.

Perhaps I’m being too picky, and that’s just how the polling industry works. Perhaps the normative values of pollsters allow for imprecise colloquial descriptions and drawing broad unsubstantiated conclusions. That seems to be the gist of Patrick Murray’s argument, and one I find distasteful enough to require a response.

Inkblots and Opportunity Costs: Pondering the Usefulness of VAM and SGP Ratings

I spent some time the other day, while out running, pondering the usefulness of student growth percentile estimates and value added estimates of teacher effectiveness for the average school or district level practitioner. How would they use them? What would they see in them? How might these performance snapshots inform practice?

Let’s just say I am skeptical that either VAMs (Value Added Models) or SGPs (Student Growth Percentiles) can provide useful insights to anyone who doesn’t have a pretty good understanding of the nuances of these kinds of data/estimates & the underlying properties of the tests. If I was a principal, would I rather have the information than not? Perhaps. But I’m someone whose primary collecting hobby is, well, collecting data. That doesn’t mean it all has meaning, or more specifically, that it has sufficient meaning to influence my thinking or actions. Some does. Some doesn’t. Keeping some of the data that doesn’t have much meaning actually helps me to delineate. But I digress.

It seems like we are spending a great deal of time and money on these things for questionable return. We are investing substantial resources in simply maximizing the links in our data systems between individual students’ records and their classroom teachers of record, hopefully increasing our coverage to, oh, somewhere between 10% and 20% of teachers (those with intact, single teacher classrooms, serving children who already have a track record of prior tests – e.g. upper elementary classroom teachers).

At the outset of this whole “statistical rating of teachers” endeavor, it was perhaps assumed by some economists that we would just ram these things through as large scale evaluation tools (statewide and in large urban districts) and use them to prune the teacher workforce and that would make the system better. We’d shoot first… ask questions later (if at all). We’d make some wrong decisions, hopefully statistically more “right” than wrong, and we’d develop a massive model and data set for large enough numbers of teachers that the cost per unit (cost per bad teacher correctly fired, counterbalanced by the cost per good teacher wrongly fired) would be relatively low. We’d bring it all to scale, and scale would mean efficiency.

Now, I find this whole version of the story to be too offensive to really dig into here and now. I’ve written previously about “smart selection” versus “dumb selection” regarding personnel decisions in schools. And this would be what I called “dumb selection.”

But, it also hasn’t necessarily played out this way… thankfully… except perhaps for some large city systems like Washington, DC, and a few more rigidly mandated state systems (though we’re mostly in wait-and-see mode there as well). Instead, we are now attempting to be more “thoughtful” about how we use this stuff and asking teachers to ponder their statistical ratings for insights into how they interact with children? How they teach? And we are asking administrators to ponder teachers’ statistical estimates for any meaning they might find.

In my current role, as a researcher of education policy, I love equations like this: http://graphics8.nytimes.com/images/2011/03/07/education/07winerip_graphic/07winerip_graphic-articleLarge-v2.jpg

I like to see the long lists of coefficients (estimates of how some measure in the model relates to the dependent variable) spit out in my Stata logs and ponder what they might mean, with full consideration of what I’ve chosen to include or exclude in the model, and whether I’m comfortable that the measures on both sides of the equation are of sufficient quality to really tell me anything… or at least something.

The other evening, I thought back to my teaching days (considered a liability as an education policy researcher) and wondered whether it would have been useful to me to simply have some rating of my aggregate effectiveness – simply relative to other teachers. Nothing specific about the performance of my students on specific content/concepts. Just some abstract number… like the relative rarity that my students scored X at the end of my class given that they scored X-Y at the end of last year’s class? Or, some generalized “effectiveness” rating category based on whether my coefficient in the model surpassed a specific cut score to call me “exceptional” or merely “adequate?” Something like this.
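To make that “relative rarity” number concrete, here is a deliberately simplified sketch. It is not the Colorado Growth Model's actual procedure, which estimates conditional percentiles with quantile regression. This toy version just ranks a student's current score among peers who had exactly the same prior score. The function name and all scores are hypothetical.

```python
def simple_growth_percentile(prior, current, cohort):
    """Percent of same-prior-score peers whose current score is lower.

    cohort is a list of (prior_score, current_score) pairs.
    This is a toy stand-in for a growth percentile, not the real estimator.
    """
    peers = [c for p, c in cohort if p == prior]
    if not peers:
        raise ValueError("no peers with that prior score")
    below = sum(1 for c in peers if c < current)
    return round(100 * below / len(peers))


# Invented cohort: five students who scored 200 last year, five who scored 250.
cohort = [
    (200, 195), (200, 205), (200, 210), (200, 220), (200, 230),
    (250, 240), (250, 255), (250, 260), (250, 270), (250, 290),
]

low_start = simple_growth_percentile(200, 220, cohort)   # 3 of 5 peers scored lower
high_start = simple_growth_percentile(250, 255, cohort)  # 1 of 5 peers scored lower
print(low_start, high_start)
```

The student who ends at 220 gets a higher growth percentile than the student who ends at 255, because each is compared only with students who started in the same place. Notice what the number is built from: the student's two scores and the peers' scores. Nothing in the calculation refers to a teacher.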

Would that be useful to me? to the principal? if I was the principal?

Given that I typically taught 2 sections of 7th grade life science and 2 of 8th grade physical science (yeah… cushy private school job), with class sizes of about 18 students each, which rotated through different times of day, I might also find it fun to compare growth of my various classes. Did the disruptive distraction kid really cause my ratings in one life science section to crash (you know who you are!)? Was the same kid able to bring her 8th grade teacher down the next year (hopefully not me again!)?

I asked myself… would those ratings actually tell me anything about what I should do next year (accepting that the data would come on a yearly cycle)? Should I go watch teachers who got better ratings? Could I? Would they protect their turf? Would that even tell me a damn thing? Besides, knowing what I do now, I also know that large shares of the teachers who got a better rating likely got that rating either because of a) random error/noise in the data or b) some unmeasured attribute of the students they serve (bias). Of course, I didn’t know that then, so what would I think?

My gut instinct is that any of these aggregate indicators of a teacher’s relative effectiveness, generated from complex statistical models, with, or without corrections for other factors, are little more than ink blots to most teachers and administrators. And I’m not convinced they’ll ever be anything more than that. They possess many of the same attributes of randomness or fuzziness of an ink blot. And while the most staunch advocate might wish them to appear as an impressionist painting, I expect they are still most often seen as ink blots – not even a Jackson Pollock. More random than pattern. And even if/when there is a pattern, the average viewer may never pick it up.

I anxiously (though skeptically) await well crafted qualitative studies exploring stakeholders’ interpretations of these inkblots.

But these aren’t just any ink blots. They are rather expensive ink blots if and when we start trying to use them in more comprehensive and human resource intensive ways through local public schools and districts and if we weigh on them the burden that we MUST use them not merely to inform, but rather to DRIVE our decisions – and must find significant meaning in them to justify doing so.  That is, if we really expect teachers and principals to log significant hours trying to derive meaning from them, after consultants, researchers, central office administrators and state department officials have labored over data system design, linking teachers to students, and deciding on the most aesthetically pleasing representation of teacher performance classifications for the individual reporting system. Using these tools as quick screening, blunt instruments is certainly a bad idea. But is this – staring at them for endless hours in search of meaning that may not be there – much better?

It strikes me that there are a lot more useful things we could/should/might be spending our time looking at in order to inform and improve educational practice or evaluate teachers. And that the cumulative expenditure on these ink blots, including the cost of time spent musing over them, might be better applied elsewhere.

More on the SGP debate: A reply

This new post from Ed News Colorado is in response to my critique of Student Growth Percentiles here: https://schoolfinance101.wordpress.com/2011/09/02/take-your-sgp-and-vamit-damn-it/

I must say that I agree with almost everything in this response to my post, except for a few points. First, they argue:

Unfortunately Professor Baker conflates the data (i.e. the measure) with the use. A primary purpose in the development of the Colorado Growth Model (Student Growth Percentiles/SGPs) was to distinguish the measure from the use: To separate the description of student progress (the SGP) from the attribution of responsibility for that progress.

No, I do not conflate the data and measures with their proposed use. Policy makers are doing that and doing that based on ill advisement from other policymakers who don’t see the important point – the primary purpose – as Betebenner, Briggs and colleagues explain.  This is precisely why I use their work in my previous post – because it explains their intent and provides their caveats.

Policymakers, by contrast, are pitching the direct use of SGPs in teacher evaluation. Whether they intended this or not, that’s what’s happening. Perhaps this is because they are not explaining as bluntly as they do here what the actual intent/design was.

Further, I should point out that while I have marginally more faith that a VAM could, in theory be used to parse out teacher effect than an SGP, which isn’t even intended to, I do not have any more faith than they do that a VAM actually can accomplish this objective. They interpret my post as follows:

Despite Professor Baker’s criticism of VAM/SGP models for teacher evaluation, he appears to hold out more hope than we do that statistical models can precisely parse the contribution of an individual teacher or school from the myriad of other factors that contribute to students’ achievement.

I’m not, as they would characterize, a VAM supporter over SGP, and any reader of this blog certainly realizes that. However, it is critically important that state policymakers be informed that SGP is not even intended to be used in this way. I’m very pleased they have chosen to make this the central point of their response!

And while SGP information might reasonably be used in another way, if used as a tool for ranking and sorting teacher or school effectiveness, SGP results would likely be more biased even than VAM results… and we may not even know or be able to figure out to what extent.

I agree entirely with their statement (but for the removal of “freakin”):

We would add that it is a similar “massive … leap” to assume a causal relationship between any VAM quantity and a causal effect for a teacher or school, not just SGPs. We concur with Rubin et al (2004) who assert that quantities derived from these models are descriptive, not causal, measures. However, just because measures are descriptive does NOT imply that the quantities cannot and should not be used as part of a larger investigation of root causes.

The authors of the response make one more point, that I find objectionable (because it’s a cop out!):

To be clear about our own opinions on the subject: The results of large-scale assessments should never be used as the sole determinant of education/educator quality.

What the authors accomplish with this point, is permitting policymakers to still assume (pointing to this quote as their basis) that they can actually use this kind of information, for example, for a fixed 90% share of high stakes decision making, regarding school or teacher performance, and  certainly that a fixed 40% or 50% weight would be reasonable. Just not 100%. Sure, they didn’t mean that. But it’s an easy stretch for a policymaker.

If the measures aren’t meant to isolate system, school or teacher effectiveness, or if they were meant to but simply can’t, they should NOT be used for any fixed, defined, inflexible share of any high stakes decision making.  In fact, even better, more useful measures shouldn’t be used so rigidly.

[Also, as I’ve pointed out in the past, when a rigid indicator is included as a large share (even 40% or more) in a system of otherwise subjective judgments, the rigid indicator might constitute 40% of the weight but drive 100% of the decision.]

So, to summarize, I’m glad we are, for the most part, on the same page. I’m frustrated that I’m the one who had to raise this issue in part because it was pretty clear to me from reading the existing work on SGP’s that many were conflating the measure with its use. I’m still concerned about the use, and especially concerned in the current policy context. I hope in the future that the designers and promoters of SGP will proclaim more loudly and clearly their own caveats – their own cautions – and their own guidelines for appropriate use.

Simply handing off the tool to the end user and then walking away in the face of misuse and abuse would be irresponsible.

Addendum: By the way, I do hope the authors will happily testify on behalf of the first teacher who is wrongfully dismissed or “de-tenured” on the basis of 3 bad SGPs in a row. That they will testify that SGPs were never intended to assume a causal relationship to teacher effectiveness, nor can they be reasonably interpreted as such.

Revisiting why comparing NAEP gaps by low income status doesn’t work

This is a compilation of previous posts, in response to the egregious abuse of data presented on Page 3, here: http://www.scribd.com/fullscreen/64717249:

Pundits love to make cross-state comparisons and rank states on a variety of indicators, something I’m guilty of as well.[1] A favorite activity is comparing NAEP test scores across subjects, including comparing which states have the biggest test score gaps between children who qualify for subsidized lunch and children who don’t. The simple conclusion – States with big gaps are bad – inequitable – and states with smaller gaps must be doing something right!

It is generally assumed by those who report these gaps and rank states on achievement gaps that these gaps are appropriately measured – comparably measured – across states. That a low-income child in one state is similar to a low-income child in another. That the average low-income child or the average of low-income children in one state is comparable to the average of low-income children in another, and that the average of non-low income children in one state is comparable to the average of non-low income children in another.  Unfortunately, however, this is a deeply flawed assumption.

Let’s review the assumption. Here’s the basic framing adopted by most who report on this stuff:

Non-Poor Child Test Score – Poor Child Test Score = Poverty Achievement Gap

Non-Poor Child in State A = Non-Poor Child in State B

Poor Child in State A = Poor Child in State B

These conditions have to be met for there to be any validity to rankings of achievement gaps.

Now, here’s the problem.

Poor = child from a family with income below 185% of the poverty income threshold

Therefore, the measurement of an achievement gap between “poor” and “non-poor” is:

Average NAEP of children above 185% poverty threshold – Average NAEP of children below 185% poverty threshold = “Poverty” achievement Gap

But, the income level for poverty is not varied by state or region.[2]

As a result, the distribution of children and their families above and below the specified threshold varies widely from state to state, and comparing the average performance of the groups of children above that threshold and below it is not particularly meaningful.  Comparing those gaps across states is really problematic.
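A small sketch can show why. The two “states” below are invented, as are the income values and the scoring rule. I assume every child's score depends on family income in exactly the same way in both states, so any difference in the measured gap comes purely from where families sit relative to the single 185% cut point.

```python
from statistics import mean

CUT_POINT = 185  # income as a percent of the poverty level (100 = poverty level)


def score(poverty_index):
    """Hypothetical rule: identical income-score relationship in every state."""
    return 240 + poverty_index / 10


def poverty_gap(poverty_indexes):
    """Average score above the cut point minus average score below it."""
    above = [score(x) for x in poverty_indexes if x >= CUT_POINT]
    below = [score(x) for x in poverty_indexes if x < CUT_POINT]
    return mean(above) - mean(below)


# Invented income distributions (poverty index per child).
state_a = [50, 100, 150, 200, 300, 400, 500, 600]  # long right tail of high incomes
state_b = [50, 100, 120, 150, 170, 180, 200, 250]  # most families just under the line

gap_a = poverty_gap(state_a)
gap_b = poverty_gap(state_b)
print(f"State A gap: {gap_a:.1f}")
print(f"State B gap: {gap_b:.1f}")
```

State A shows a gap roughly three times the size of State B's, yet by construction neither state is doing anything differently for its children. The difference reflects only how high the highs are and how the population is spread around the cut point.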

Here are graphs of the poverty distributions (using a poverty index where 100 = 100%, or income at the poverty level) for families of 5 to 17 year olds in New Jersey and in Texas. These graphs are based on data from the 2008 American Community Survey (from http://www.ipums.org). They include children attending either/both public and private school.

Figure 1

Poverty Distribution (Poverty Index) and Reduced Price Lunch Cut-Point

 

Figure 2

Poverty Distribution (Poverty Index) and Reduced Price Lunch Cut-Point

 

To put it really simply, comparing the above the line and below the line groups in New Jersey means something quite different from comparing the above the line and below the line groups in Texas, where the majority are actually below the line… but where being below the line may not by any stretch of the imagination be associated with comparable economic deprivation. Further, in New Jersey, much larger shares of the population are distributed toward the right hand end of the distribution – the distribution is overall “flatter.” These distributional differences undoubtedly have significant influence on the estimation of achievement gaps. As I often point out, the size of an achievement gap is as much a function of the height of the highs as it is a function of the depth of the lows.[3]

How does this matter when comparing poverty achievement gaps?

In the above charts, while I show how different the poverty and income distributions were in Texas and New Jersey as an example, those charts don’t explain how/why these distribution differences thwart comparisons of low-income vs. non-low income achievement gaps. Yes, it should be clear enough that the above the line and below the line groups just aren’t similar across these two states and/or nearly every other.

A logical extension of the analysis in that previous post would be to look at the relationship between:

Gap in average family total income between those above and below the free or reduced price lunch cut-off

AND

Gap in average NAEP scores between children from families above and below the free or reduced price lunch cut-off

If there is much (or any) of a relationship between the income gaps and the NAEP gaps – that is, states with larger income gaps between the poor and non-poor groups also have larger achievement gaps – such a finding would call into question the usefulness of state comparisons of these gaps.
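The check described here amounts to a correlation across states between two gaps. A minimal sketch follows, using placeholder numbers rather than the actual ACS or NAEP figures. The state labels and every value are hypothetical, and the Pearson correlation is one reasonable choice of summary, not necessarily the one behind the figures in this post.

```python
from math import sqrt

# (label, income gap in $1000s, NAEP score gap in scale points).
# Placeholder values only; substitute real state-level estimates.
states = [
    ("S1", 40, 20),
    ("S2", 55, 24),
    ("S3", 70, 27),
    ("S4", 85, 31),
    ("S5", 100, 33),
]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


income_gaps = [s[1] for s in states]
naep_gaps = [s[2] for s in states]
r = pearson(income_gaps, naep_gaps)
print(f"correlation between income gap and NAEP gap: {r:.2f}")
```

A correlation near zero would suggest the NAEP gaps can be compared across states without worrying much about income distributions. A strong positive correlation would suggest the gap rankings largely restate differences in income distributions.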

So, let’s walk through this step by step.

First, Figure 3 shows the relationship across states between the NAEP Math Grade 8 scores and family total income levels for children in families ABOVE the free or reduced cutoff:

Figure 3

There is a modest relationship between income levels of non-low income children and NAEP scores. Higher income states generally have higher NAEP scores. No adjustments are applied in this analysis to the value of income from one location to another, mainly because no adjustments are applied in the setting of the poverty thresholds. Therein lies at least some of the problem. The rest lies in using a simple ABOVE vs. BELOW a single cut point approach.

Second, Figure 4 shows the relationship between the average income of families below the free or reduced lunch cut point and the average NAEP scores on 8th Grade Math (2009).

Figure 4

 

This relationship is somewhat looser than the previous relationship and for logical reasons – mainly that we have applied a single low-income threshold to every state and the average income of individuals below that single income threshold does not vary as widely across states as the average income of individuals above that threshold. Further, the income threshold is arbitrary and not sensitive to the differences in the value of any given income level across states.  But still, there is some variation, with some states having much larger clusters of very low-income families below the free or reduced price lunch threshold (Mississippi).

But, here’s the most important part. Figure 5 shows the relationship between income gaps estimated using the American Community Survey data (www.ipums.org) from 2005 to 2009 and NAEP Gaps. This graph addresses directly the question posed above – whether states with larger gaps in income between families above and below the arbitrary low-income threshold also have larger gaps in NAEP scores between children from families above and below the arbitrary threshold.

Figure 5

In fact, they do. And this relationship is stronger than either of the two previous relationships. As a result, it is somewhat foolish to try to make any comparisons between achievement gaps in states like Connecticut, New Jersey and Massachusetts versus states like South Dakota, Idaho or Wyoming. It is, for example, more reasonable to compare New Jersey and Massachusetts to Connecticut, but even then, other factors may complicate the analysis.

How does this affect state ranking gaps? Re-ranking New Jersey

New Jersey’s current commissioner of education seems to rest much of his case for the urgency of implementing reform strategies on the claim that, while New Jersey ranks high on average performance, it ranks 47th in the achievement gap between low-income and non-low-income children (video here: http://livestre.am/M3YZ).

And just yesterday, a New Jersey Governor’s Task Force report used New Jersey’s egregious poverty achievement gap as the primary impetus for the immediate need for reform: http://www.scribd.com/fullscreen/64717249 (In my view, all that follows in this report is severely undermined by the fact that those who drafted the report clearly do not have even the most basic understanding of data on poverty and achievement!)

To be fair, this is classic political rhetoric with few or no partisan boundaries.

To review, comparisons of achievement gaps across states between children in families above the arbitrary 185% income level and below that income level are very problematic. In my last post on this topic, I showed that in states where there is a larger gap in income between these two groups (the above and below the line groups), there is also a larger gap in achievement. That is, the size of the achievement gap is largely a function of the income distribution in each state.

Let’s take this one last step and ask: if we correct for the differences in income between low and higher income families, how do the achievement gap rankings change? And, let’s do this with an average achievement gap for 2009 across NAEP Reading and Math for Grades 4 and 8.
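For concreteness, here is a minimal sketch of what averaging the gap across the four tests could look like for one state. The test labels and scores are invented placeholders, and `average_gap` is a hypothetical helper; I am assuming a simple unweighted mean of the four above-minus-below differences.

```python
def average_gap(scores_above, scores_below):
    """Unweighted mean of (above - below) score differences across tests."""
    tests = sorted(scores_above)
    gaps = [scores_above[t] - scores_below[t] for t in tests]
    return sum(gaps) / len(gaps)

# Placeholder scores for one hypothetical state (not real NAEP results).
above = {"math_g4": 250, "math_g8": 295, "reading_g4": 232, "reading_g8": 275}
below = {"math_g4": 228, "math_g8": 268, "reading_g4": 206, "reading_g8": 250}

state_gap = average_gap(above, below)
print(state_gap)
```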

Figure 6 shows the differences in income for lower and higher income children, with states ranked by the income gap between these groups:

Figure 6

Massachusetts, Connecticut and New Jersey have the largest income gaps between families above and below the arbitrary Free or Reduced Price Lunch income cut off.

Now, let’s take a look (Figure 7) at the raw achievement gaps averaged across the four tests:

Figure 7

New Jersey has a pretty large raw gap, coming in 5th among the lower 48 states (note there are other difficulties in comparing the income distributions in Alaska and Hawaii, in relation to free/reduced lunch cut points). Connecticut and Massachusetts also have very large achievement gaps.

One can see here, anecdotally, that states with larger income gaps in Figure 6 are generally those with larger achievement gaps.

Here’s the relationship between the two (Figure 8):

Figure 8

In this graph, a state that falls on the diagonal line is a state where the achievement gap is right on target for the expected achievement gap, given the difference in income for those above and below the arbitrary free or reduced price lunch cut-off. New Jersey falls right on that line. States falling on the line have relatively “average” (or expected) achievement gaps.

One can take this the next step and rank the “adjusted” achievement gaps based on how far above or below the line a state falls. States below the line have achievement gaps smaller than expected, and states above the line have achievement gaps larger than expected. At this point, I’m not totally convinced that this adjustment is capturing enough about the differences in income distributions and their effects on achievement gaps. But it makes for some fun adjustments/comparisons nonetheless. In any case, the raw achievement gap comparisons typically used in political debate are pretty meaningless.
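The adjustment just described can be sketched as follows. This assumes the diagonal line is an ordinary least squares fit of achievement gap on income gap, which is my assumption for illustration; the function names and the three placeholder states are hypothetical, and the numbers are not real data.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def adjusted_gap_ranking(income_gap, achievement_gap):
    """Rank states by distance above the fitted line (largest first).

    A positive residual means a gap larger than expected given the
    state's income gap; a negative residual means smaller than expected.
    """
    names = sorted(income_gap)
    xs = [income_gap[s] for s in names]
    ys = [achievement_gap[s] for s in names]
    slope, intercept = fit_line(xs, ys)
    residuals = {
        s: achievement_gap[s] - (intercept + slope * income_gap[s])
        for s in names
    }
    ranked = sorted(names, key=lambda s: residuals[s], reverse=True)
    return residuals, ranked

# Placeholder inputs (hypothetical states, invented units and values).
income_gap = {"State A": 1.0, "State B": 2.0, "State C": 3.0}
achievement_gap = {"State A": 1.0, "State B": 3.0, "State C": 3.0}

residuals, ranked = adjusted_gap_ranking(income_gap, achievement_gap)
print(ranked)
```

A state with a residual near zero sits on the line, which is where New Jersey falls in Figure 8.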

Here are adjusted achievement gap rankings (Figure 9):

Figure 9

Here, New Jersey comes in 27th in achievement gap, counting from the largest. That is, New Jersey’s adjusted achievement gap between higher and lower-income students, when correcting for the size of the income gap between those students, is smaller than the gap in the average state.


[3] For further explanation of the problems with poverty measurement across states, using constant thresholds, and proposed solutions see: Renwick, Trudi. Alternative Geographic Adjustments of U.S. Poverty Thresholds: Impact on State Poverty Rates. U.S. Census Bureau, August 2009. https://xteam.brookings.edu/ipm/Documents/Trudi_Renwick_Alternative_Geographic_Adjustments.pdf