
It’s good to be King: More Misguided Rhetoric on the NY State Eval System

Very little time to write today, but I must comment on this NY Post article about the bias I’ve been discussing in the NY State teacher/principal growth percentile ratings. Sociologist Aaron Pallas of TC and economist Sean Corcoran of NYU express appropriate concerns about the degree of bias found and reported in the technical report provided by the state’s own consultants who developed the models. The article as a whole raises the concern that these problems were simply blown off. I would put it, and have put it, more bluntly. Here’s my replay of events, quoting the parties involved:

First, the state’s consultants designing their teacher and principal effectiveness measures find that those measures are substantively biased:

Despite the model conditioning on prior year test scores, schools and teachers with students who had higher prior year test scores, on average, had higher MGPs. Teachers of classes with higher percentages of economically disadvantaged students had lower MGPs. (p. 1) https://schoolfinance101.com/wp-content/uploads/2012/11/growth-model-11-12-air-technical-report.pdf

But instead of questioning their own measures, they decide to give them their blessing and pass them along to the state as being “fair and accurate.”

The model selected to estimate growth scores for New York State provides a fair and accurate method for estimating individual teacher and principal effectiveness based on specific regulatory requirements for a “growth model” in the 2011-2012 school year. p. 40 https://schoolfinance101.com/wp-content/uploads/2012/11/growth-model-11-12-air-technical-report.pdf

The next step was for the Chancellor to take this misinformation and polish it up as pure spin as part of the power play against the teachers in New York City (who’ve already had the opportunity to scrutinize what is arguably a better but still substantially flawed set of metrics). The Chancellor proclaimed:

The student-growth scores provided by the state for teacher evaluations are adjusted for factors such as students who are English Language Learners, students with disabilities and students living in poverty. When used right, growth data from student assessments provide an objective measurement of student achievement and, by extension, teacher performance. http://www.nypost.com/p/news/opinion/opedcolumnists/for_nyc_students_move_on_evaluations_EZVY4h9ddpxQSGz3oBWf0M

Then send in the enforcers…. This statement came from a letter sent to a district that did decide to play ball with the state on the teacher evaluation regulations. The state responded that… sure… you can adopt the system of multiple measures you propose – BUT ONLY AS LONG AS ALL OF THOSE OTHER MEASURES ARE SUFFICIENTLY CORRELATED WITH OUR BIASED MEASURES… AND ONLY AS LONG AS AT LEAST SOMEONE GETS A BAD RATING.

The department will be analyzing data supplied by districts, BOCES and/or schools and may order a corrective action plan if there are unacceptably low correlation results between the student growth subcomponent and any other measure of teacher and principal effectiveness… https://schoolfinance101.wordpress.com/2012/12/05/its-time-to-just-say-no-more-thoughts-on-the-ny-state-tchr-eval-system/

So… what’s my gripe today? Well, in this particular NY Post article we have some rather astounding quotes from NY State Commissioner John King, given the information above. Now, last I talked about John King, he was strutting about NY with this handy new graph of completely fabricated information on how to improve educational productivity. So, what’s King up to now? Here’s how John King explained the potential bias in the measures and how that bias a) is possibly not bias at all, and b) even if it is, it’s not that big a problem:

“It’s a question of, is this telling you something descriptive about where talent is placed? Or is it telling you something about the classroom effect [or] school effect of concentrations of students?” said King.

“This data alone can’t really answer that question, which is one of the reasons to have multiple measures — so that you have other information to inform your decision-making,” he added. “No one would say we should evaluate educators on growth scores alone. It’s a part of the picture, but it’s not the whole picture.”

So, in King’s view, the bias identified in the AIR technical report might just be a signal as to where the good teachers really are: kids in schools with lower poverty, higher average starting scores and fewer children with disabilities simply have the better teachers. While there may certainly be some patterned sorting of teachers by their actual effect on test scores, a) this proposition is less plausible than a classroom composition effect, and b) making this assumption when one cannot really tease out cause is a highly suspect approach to teacher evaluation (reformy thinking at its finest!).

The kicker is in how King explains why the potential bias isn’t a problem. King argues that the multiple measures approach buffers against over-reliance on the growth percentiles. As he states so boldly – “it’s part of the picture, but it’s not the whole picture.”

The absurdity here is that KING HAS DECLARED TO LOCAL OFFICIALS THAT ALL OTHER MEASURES THEY CHOOSE TO INCLUDE MUST BE SUFFICIENTLY CORRELATED WITH THESE GROWTH PERCENTILE MEASURES! That’s precisely what the letter quoted above, sent to one local official, says! Even if this weren’t the case, the growth percentiles – which may wrongly classify teachers based on factors outside their control – might carry disproportionate weight in determining teacher ratings (merely as a function of the extent of their variation, most of which is noise and much of the remainder of which is bias). But when you require that all other measures be correlated with this suspect measure, you’ve stacked the deck: the overall rating is substantially, if not entirely, built on a flawed foundation.
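To see why this matters statistically, here’s a minimal simulation sketch – entirely my own toy numbers, not the state’s data or AIR’s model – of what happens when you force every other component of a rating to correlate with one biased component: the bias leaks into measures that would otherwise be clean.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000  # simulated teachers

# Hypothetical setup: true effectiveness is unrelated to classroom poverty,
# but the growth score is contaminated by a poverty-related bias plus noise.
true_effect = rng.normal(0, 1, n)
poverty = rng.uniform(0, 1, n)  # share of disadvantaged students
growth_score = true_effect - 1.0 * poverty + rng.normal(0, 1, n)

# An independent observation rating tracks true effectiveness...
observation = true_effect + rng.normal(0, 1, n)
# ...versus one pulled toward the growth score until it "correlates acceptably."
aligned_obs = 0.5 * observation + 0.5 * growth_score

for name, measure in [("growth score", growth_score),
                      ("independent observation", observation),
                      ("'aligned' observation", aligned_obs)]:
    r = np.corrcoef(measure, poverty)[0, 1]
    print(f"{name:25s} correlation with poverty: {r:+.2f}")
```

The independent observation shows roughly zero correlation with poverty; once it is forced toward the growth score, it inherits the penalty for teaching poor kids.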

THIS HAS TO STOP. STATE OFFICIALS MUST BE CALLED OUT ON THIS RIDICULOUS CONTORTED/DECEPTIVE & OUTRIGHT DISHONEST RHETORIC!


Note: King also tries to play up the fact that at any level of poverty, some teachers get higher ratings and some get lower ones. This explanation ignores the fact that much of the remaining variation in teacher estimates is noise. Some will get higher or lower ratings in a given year simply because of the noise/instability in the measures. These variations may be entirely meaningless.

Forget the $300m Deal! Let’s talk $3.4 billion (or more)!

Sometime last week or so, Sockpuppets for Ed Reform marched on City Hall in NY, demanding that the city and teachers union come to a deal on a teacher evaluation system compliant with the state’s new regulations, so that the district could receive an approximately $300 million grant payment associated with implementing that system. Well, actually, it was more about trying to enrage the public – casting the evil teachers union in particular as the party at fault for holding hostage, and potentially losing, this supposedly massive sum of funding.

As one can see from the signs the SFER protesters were displaying, the protest was much less clearly articulated than I’ve described above. One would think, from looking at stuff like this: http://nyulocal.com/wp-content/uploads/2012/11/DSC_0841.jpg that this protest was actually about obtaining funding for the district – funding that would provide for substantive and sustained improvement to district programs/services.

But hey, far be it from SFER to actually carry placards that are in any way accurate or precise (or to have any clue what they are talking about). At this particular event in NYC, they even convinced a 15-year-old that the fight was really about funding.

So, we’ve got a protest that is presented as being about funding, but is really about a teacher evaluation system driven by student test scores, being carried out by a group that clearly has little or no understanding of either.

You know, I would typically give a group of undergrads a break on stuff like this. Hey, they’re undergrads, and they have time to learn and develop the discipline/understanding of these complex topics. Heck, I was anything but a disciplined undergrad myself. But unfortunately, this group has thus far displayed to me the worst attributes of the most intellectually lazy of today’s college students – a persistent pattern of copying and pasting low-quality content from websites and presenting it as novel content of their own. It’s as if their placards, and their entire website, were generated by lifting content from “reformy-pedia.”

So then, what is the real story on what’s goin’ on with Teacher Evaluation and School Funding in New York State?

The State Evaluation System/Guidelines

I’ve written several posts recently about the state metrics for teacher evaluation and the state education department’s push to get districts on board. I also wrote about the letter from the Chancellor of the Board of Regents, which appeared in the NY Post, encouraging NYC in particular to get on board with that $300m RAW deal!

In my humble opinion, no-one should sign on to a deal to implement a teacher evaluation system under the current NYSED guidelines, given the evidence I’ve laid out over the past few weeks. No-one. Just say NO.

First, the state’s consultants designing their teacher and principal effectiveness measures find that those measures are substantively biased:

Despite the model conditioning on prior year test scores, schools and teachers with students who had higher prior year test scores, on average, had higher MGPs. Teachers of classes with higher percentages of economically disadvantaged students had lower MGPs. (p. 1) https://schoolfinance101.com/wp-content/uploads/2012/11/growth-model-11-12-air-technical-report.pdf

But instead of questioning their own measures, they decide to give them their blessing and pass them along to the state as being “fair and accurate.”

The model selected to estimate growth scores for New York State provides a fair and accurate method for estimating individual teacher and principal effectiveness based on specific regulatory requirements for a “growth model” in the 2011-2012 school year. p. 40 https://schoolfinance101.com/wp-content/uploads/2012/11/growth-model-11-12-air-technical-report.pdf

The next step was for the Chancellor to take this misinformation and polish it up as pure spin as part of the power play against the teachers in New York City (who’ve already had the opportunity to scrutinize what is arguably a better but still substantially flawed set of metrics). The Chancellor proclaimed:

The student-growth scores provided by the state for teacher evaluations are adjusted for factors such as students who are English Language Learners, students with disabilities and students living in poverty. When used right, growth data from student assessments provide an objective measurement of student achievement and, by extension, teacher performance. http://www.nypost.com/p/news/opinion/opedcolumnists/for_nyc_students_move_on_evaluations_EZVY4h9ddpxQSGz3oBWf0M

Then send in the enforcers…. This statement came from a letter sent to a district that did decide to play ball with the state on the teacher evaluation regulations. The state responded that… sure… you can adopt the system of multiple measures you propose – BUT ONLY AS LONG AS ALL OF THOSE OTHER MEASURES ARE SUFFICIENTLY CORRELATED WITH OUR BIASED MEASURES… AND ONLY AS LONG AS AT LEAST SOMEONE GETS A BAD RATING.

The department will be analyzing data supplied by districts, BOCES and/or schools and may order a corrective action plan if there are unacceptably low correlation results between the student growth subcomponent and any other measure of teacher and principal effectiveness… https://schoolfinance101.wordpress.com/2012/12/05/its-time-to-just-say-no-more-thoughts-on-the-ny-state-tchr-eval-system/

This is a raw deal, whether attached to what appears to be a pretty big bribe or not. And quite honestly, while $300 million is nothing to sneeze at, it pales in comparison to what the city schools are actually owed under the state’s own proposal for how it would fund its schools to comply with a court order of nearly a decade ago.

THE REAL ISSUE in NY State

Meanwhile, at the other end of the state – well, sort of – a different protest was going on. This protest, in Albany, actually was about funding: the fact that the state of New York has repeatedly cut state aid to local public school districts in each of the past few years, has systematically cut more per-pupil funding from districts serving needier student populations, and has never once come close to providing the funding levels that the state’s own funding formula suggests are needed (actually, were needed back in 2007!).

Here’s a quick run-down on the state of school funding in New York:

  1. New York continues to maintain one of the least equitable school finance systems in the country, where districts serving higher concentrations of children in poverty have systematically less state and local revenue per pupil.
  2. New York State accomplishes these patterns of egregious disparity not merely by lack of effort, but by actually allocating substantial state resources – disproportionate state resources – toward buying down the tax rates of the state’s wealthiest districts and making other politically convenient state aid allocations to economically advantaged districts, at the expense of children in poverty.
  3. Even though the state was ordered by the NY court of appeals nearly a decade ago to provide adequate resources to children attending high need districts, and even though the court accepted the state’s own proposed funding formula to meet that goal (which was much lower than more rigorously determined spending targets), the state has chosen to not even come close to funding those targets and in recent years has systematically cut more funding from children with greater needs.

So, how does this all affect districts across New York State, and NYC in particular? I’m going to set a really low bar here for my comparisons. In response to the court order in the Campaign for Fiscal Equity case, the state of New York proposed a new school finance formula – a foundation aid formula – to begin implementation in 2007. It was actually a pretty lame, relatively low-balled funding formula to begin with, as explained here!

But even that low-balled estimate of what districts were supposed to get has never come close to being fully funded. Several large districts – Albany, for example – receive less than half of the state aid in 2012-13 that they would receive if the formula were fully implemented.

The formula provides a target level of funding for each district based on student needs and regional costs. Then the formula determines the share of that target that should come from the state. But the formula as actually implemented ignores all of that, providing a marginal increase or decrease over what districts have historically received – thereby maintaining the persistent inequities of the system.
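For readers who want the mechanics, here’s a stylized sketch of the foundation-aid arithmetic with entirely made-up parameters (these are NOT the actual NYSED formula values), including the back-of-envelope math behind the $3.4 billion figure discussed below.

```python
# Stylized foundation-aid arithmetic with hypothetical numbers --
# these are NOT the actual NYSED formula parameters.

foundation = 6_500          # base cost per pupil ($)
need_index = 1.6            # pupil need index (higher = needier)
regional_cost = 1.2         # regional cost adjustment
expected_local = 4_000      # expected local contribution per pupil ($)

target = foundation * need_index * regional_cost   # what the formula says is needed
state_share = target - expected_local              # portion owed by the state

prior_aid = 5_000           # what the district has historically received
actual_aid = prior_aid * 1.02                      # a marginal 2% bump instead

gap_per_pupil = state_share - actual_aid
print(f"formula target aid: ${state_share:,.0f} per pupil")
print(f"actual aid:         ${actual_aid:,.0f} per pupil")
print(f"shortfall:          ${gap_per_pupil:,.0f} per pupil")

# Back-of-envelope for NYC: ~$3,400 per pupil short x ~1 million pupils
print(f"NYC-scale gap:      ${3_400 * 1_000_000:,.0f} per year")
```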

The first figure below shows the difference between actual state foundation aid per pupil (after applying the trick they refer to as the “gap elimination adjustment”) and the aid calculated to be needed according to THE STATE’S OWN FORMULA for addressing regional costs and student needs. Districts are organized from low need (left) to high need (right) using the state’s own pupil need index. Bubble size indicates district enrollment. NYC is the BIG ONE! And we can see, by eyeballing the middle of that bubble, that NYC is being shorted between $3,000 and $4,000 per pupil. At 1 million kids, that’s about $3.4 billion… each year… every year… over time. No, not a $300m implementation grant, but $3.4 billion in annual operating funds. Yeah… the stuff that actually provides for smaller class sizes, decent teacher pay, up-to-date materials, supplies and equipment, and arts, music and all that other stuff!

[Figure: gap between actual and formula-calculated state aid per pupil, districts arrayed from low to high need; bubble size = enrollment]

The table below provides a closer look at districts with the largest funding gap between what the formula calculates is needed and what districts actually receive in state aid.

[Table: districts with the largest gaps between formula-calculated and actual state aid]

So, instead of talking about a one-shot $300m bribe to implement a bad system based on bad data – at a cost that may exceed the amount of the grant to begin with – perhaps it would make more sense to focus on that $3.4 billion deal! You know, the one state officials themselves promised in response to that court order all those years ago.

And when we do start taking more seriously this much bigger funding issue, don’t forget to send me a cool lookin’ knit protest hat!

Readings

Policy Brief on State Aid in New York (Summer 2011) NY Aid Policy Brief_Fall2011_DRAFT6

Baker, B.D., Welner, K.G. (2012) Evidence and Rigor: Scrutinizing the Rhetorical Embrace of Evidence-based Decision-making. Educational Researcher 41 (3) 98-101

Baker, B.D., Welner, K. (2011) School Finance and Courts: Does Reform Matter, and How Can We Tell? Teachers College Record 113 (11) p. –

Baker, B.D., Corcoran, S.P. (2012) The Stealth Inequities of School Funding: How Local Tax Systems and State Aid Formulas Undermine Equality. Washington, DC: Center for American Progress. http://www.americanprogress.org/wp-content/uploads/2012/09/StealthInequities.pdf

Baker, B.D., Sciarra, D., Farrie, D. (2012) Is School Funding Fair? Second Edition, June 2012. http://schoolfundingfairness.org/National_Report_Card_2012.pdf

Baker, B.D. (2012) Revisiting the Age-Old Question: Does Money Matter in Education? Shanker Institute. http://www.shankerinstitute.org/images/doesmoneymatter_final.pdf

Baker, B.D., Welner, K.G. (2011) Productivity Research, the U.S. Department of Education, and High-Quality Evidence. Boulder, CO: National Education Policy Center. http://nepc.colorado.edu/publication/productivity-research

Friday Thoughts on Data, Assessment & Informed Decision Making in Schools

Some who read this blog might assume that I am totally opposed, in any and all circumstances, to using data in schools to guide decision-making. Despite my frequent public cynicism, I assure you that I believe much of the statistical information we collect on and in schools and school systems can provide useful signals regarding what’s working and what’s not, and may provide more ambiguous signals warranting further exploration – through both qualitative information gathering (observation, etc.) and additional quantitative information gathering.

My personal gripe is that thus far – especially in public policy – we’ve gone about it all wrong.  Pundits and politicians seem to have this intense desire to impose certainty where there is little or none and impose rigid frameworks with precise goals which are destined to fail (or make someone other than the politician look as if they’ve failed).

Pundits and politicians also feel the intense desire to over-sample the crap out of our schooling system – taking annual measurements on every child over multiple weeks of the school year when strategic sampling of selected testing items across samples of students and settings might provide more useful information at lower cost and be substantially less invasive (NAEP provides one useful example). To protect the health of our schoolchildren, we don’t make them all walk around all day with rectal thermometers hanging out of…well… you know?  Nor do political pollsters attempt to poll 100% of likely voters.  Nor should we feel the necessity to have all students take all of the assessments, all of the time, if our goal is to ensure that the system is getting the job done/making progress.
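For the curious, here’s a rough simulation sketch of the matrix-sampling idea (hypothetical numbers, loosely NAEP-inspired, my own toy setup): give a modest sample of students a small random block of items and you recover essentially the same system-level estimate as testing every student on every item.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population and item pool (made-up parameters).
n_students, n_items = 100_000, 60
ability = rng.normal(0.6, 0.15, n_students).clip(0, 1)

# Census testing: every student answers every item.
census_scores = rng.binomial(n_items, ability) / n_items
print("census mean:", census_scores.mean().round(4),
      "| item administrations:", n_students * n_items)

# Matrix sampling: 5,000 sampled students each answer a random 12-item block.
sample = rng.choice(n_students, 5_000, replace=False)
block = 12
sampled_scores = rng.binomial(block, ability[sample]) / block
print("sampled mean:", sampled_scores.mean().round(4),
      "| item administrations:", 5_000 * block)
```

Both estimates land within a few thousandths of each other, at about one percent of the testing burden.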

In my view, a central reason for testing and measurement in schools is what I would refer to as system monitoring,  where system monitoring is best conducted in the least intrusive and most cost-effective way – such that the monitoring itself does not become a major activity of the system!  We just need enough sampling density in our assessments to generate sufficient estimates at each relevant level of the system.

I know there are those who would respond that testing everyone every year ensures that no kids fall through the cracks. If we did it my less intrusive way… kids who weren’t given all test questions in math in a given year might fall through some hypothetical math crack somewhere. But it is foolish to assume that NCLB-every-student-every-year testing regimes actually solve that problem. Further, high stakes testing with specific cut scores, either for graduation or grade promotion, violates one of the most basic tenets of statistical measurement of student achievement – these measures are not perfectly precise. They can’t identify exactly where that crack is, or which kid actually fell through it! One can’t select a cut score and declare that the child one point above that score (who got one more question correct on that given day) is ready (with certainty) for the next grade (or to graduate) and the child one point below is not. In all likelihood these two children are not different at all in their actual “proficiency” in the subject in question. We might be able to say – by thoughtful and rigorous analysis – that on average, students who got around this score in one year were likely to get a certain score in a later year, and perhaps even more likely to make it beyond remedial course work in college. And we might be able to determine whether students attending a particular school or participating in a particular program are more or less likely (yeah… probability again) to succeed in college.
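A quick simulation sketch of the cut-score problem (my own illustrative numbers, not any actual state scale or SEM): children observed a point or two on either side of the cut are, in truth, essentially the same kids.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical scale: true proficiency plus test-day measurement error.
true_score = rng.normal(650, 40, n)
observed = true_score + rng.normal(0, 15, n)  # assume an SEM of 15 points

cut = 650  # the "proficient" cut score
just_above = (observed >= cut) & (observed < cut + 2)
just_below = (observed < cut) & (observed >= cut - 2)

print("mean TRUE score, 1-2 pts above cut:", true_score[just_above].mean().round(1))
print("mean TRUE score, 1-2 pts below cut:", true_score[just_below].mean().round(1))

# Among children who are truly proficient, how many does a single
# administration label as below the cut?
truly_proficient = true_score >= cut
print("truly proficient but labeled below cut:",
      (observed[truly_proficient] < cut).mean().round(3))
```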

Thoughtful analysis, and more importantly thoughtful USE of testing data in schools, requires a healthy respect for what those numbers can and cannot tell us… and a nuanced understanding that the numbers typically include a mix of non-information (noise: unexplainable, non-patterned variation), good information (true signal) and perhaps misinformation (false signal, or bias: variation caused by something other than what we think caused it).

These issues apply generally to our use of student assessment data in schools and also apply specifically to an area I discuss often on this blog – statistical evaluation of teacher influence on tested student outcomes.

I was pleased to see the Shankerblog column by Doug Harris a short while back in which Doug presented a more thoughtful approach to integrating value-added estimates into human resource management in the schooling context. Note that Doug’s argument is not new at all, nor is it really his own unique view. I first heard this argument in a presentation by Steve Glazerman (of Mathematica) at Princeton a few years ago. Steve also used the noisy medical screening comparison to explain the use of known-to-be-noisy information to assist in making more efficient decisions/taking more efficient steps in diagnosis. That is, with appropriate respect for the non-information in the data, we might actually find ways to use that information productively.
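Here’s a minimal sketch of that screening logic as I understand it (my own toy parameters, not Glazerman’s actual analysis): treat a noisy value-added estimate as a triage device that tells you where to look first, not as a verdict.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2_000  # classrooms in a hypothetical large district

true_effect = rng.normal(0, 1, n)
va_estimate = true_effect + rng.normal(0, 1.5, n)  # a very noisy signal

# Screen: flag the bottom 10% of VA estimates for more frequent observation.
flagged = va_estimate <= np.quantile(va_estimate, 0.10)
truly_low = true_effect <= np.quantile(true_effect, 0.10)

print("hit rate among flagged classrooms:", truly_low[flagged].mean().round(2))
print("hit rate observing at random:     ", truly_low.mean().round(2))
```

The noisy flag is several times better than random at pointing observers toward genuinely struggling classrooms, yet most flagged classrooms are NOT truly low performers – which is exactly why the observation, not the estimate, should get the final word.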

Last spring, I submitted an article (still under review) in which I, along with my coauthors Preston Green and Joseph Oluwole, explained:

As we have explained herein, value-added measures have severe limitations when attempting even to answer the narrow question of the extent to which a given teacher influences tested student outcomes. Those limitations are sufficiently severe that it would be foolish to impose on these measures rigid, overly precise, high-stakes decision frameworks. One simply cannot parse point estimates to place teachers into one category versus another, and one cannot necessarily assume that any one individual teacher’s estimate is valid (non-biased). Further, we have explained how the student growth percentile measures being adopted by states for use in teacher evaluation are, on their face, invalid for this particular purpose. Overly prescriptive, overly rigid teacher evaluation mandates, in our view, are likely to open the floodgates to new litigation over teacher due process rights, despite much of the policy impetus behind these new systems supposedly being the reduction of legal hassles involved in terminating ineffective teachers.

This is not to suggest that any and all forms of student assessment data should be considered moot in thoughtful management decision making by school leaders and leadership teams. Rather, that incorrect, inappropriate use of this information is simply wrong – ethically and legally (a lower standard) wrong. We accept the proposition that assessments of student knowledge and skills can provide useful insights both regarding what students know and potentially regarding what they have learned while attending a particular school or class. We are increasingly skeptical regarding the ability of value-added statistical models to parse any specific teacher’s effect on those outcomes. Further, the relative weight in management decision-making placed on any one measure depends on the quality of that measure and likely fluctuates over time and across settings. That is, in some cases, with some teachers and in some years, assessment data may provide leaders and/or peers with more useful insights.  In other cases, it may be quite obvious to informed professionals that the signal provided by the data is simply wrong – not a valid representation of the teacher’s effectiveness.

Arguably, a more reasonable and efficient use of these quantifiable metrics in human resource management might be to use them as a knowingly noisy pre-screening tool to identify where problems might exist across hundreds of classrooms in a large district. Value-added estimates might serve as a first step toward planning which classrooms to observe more frequently. Under such a model, when observations are completed, one might decide that the initial signal provided by the value-added estimate was simply wrong. One might also find that it produced useful insights regarding a teacher’s (or group of teachers’) effectiveness at helping students develop certain tested algebra skills.

School leaders or leadership teams should clearly have the authority to make the case that a teacher is ineffective and that the teacher, even if tenured, should be dismissed on that basis. It may also be the case that the evidence would include data on student outcomes – growth, etc. The key, in our view, is that the leaders making the decision – as indicated by their presentation of the evidence – would show that they have used information reasonably to make an informed management decision. Their reasonable interpretation of relevant information would constitute due process, as would their attempts to guide the teacher’s improvement on measures over which the teacher actually had control.

By contrast, due process is violated where administrators/decision makers place blind faith in the quantitative measures, assuming them to be causal and valid (attributable to the teacher) and applying arbitrary and capricious cutoff-points to those measures (performance categories leading to dismissal).   The problem, as we see it, is that some of these new state statutes require these due process violations, even where the informed, thoughtful professional understands full well that she is being forced to make a wrong decision. They require the use of arbitrary and capricious cutoff-scores. They require that decision makers take action based on these measures even against their own informed professional judgment.

My point is that we can have thoughtful, data informed (NOT DATA DRIVEN) management in schools. We can and should! Further, we can likely have thoughtful data informed management (system monitoring) through far less intrusive methods than currently employed – taking advantage of advancements in testing and measurement, sampling design etc. But we can only take these steps if we recognize the limits of data and measurement in our education systems.

Unfortunately, as I see it, current policy efforts enforcing the misuse of assessment data (as illustrated here, here and here) and the misuse of estimates of teacher effectiveness based on those data (as illustrated here) will likely do far more harm than good. And I don’t see things turning the corner any time soon.

Until then, I may just have to stick to my current message of Just say NO!

It’s time to just say NO! More thoughts on the NY State Tchr Eval System

This post is a follow-up to two recent posts, in which I first criticized consultants to the State of New York for finding substantial patterns of bias in their estimates of principal (correction: school aggregate) and teacher (correction: classroom aggregate) median growth percentile scores yet still declaring those scores to be fair and accurate, and next criticized the Chancellor of the Board of Regents for her editorial attempting to strong-arm NYC into moving forward on an evaluation system adopting those flawed metrics – and declaring the metrics to be “objective” (implying both fair and accurate).

Let’s review. First, the AIR report on the median growth percentiles found, among other biases:

Despite the model conditioning on prior year test scores, schools and teachers with students who had higher prior year test scores, on average, had higher MGPs. Teachers of classes with higher percentages of economically disadvantaged students had lower MGPs. (p. 1)

In other words… if you are a teacher who happens to have a group of students with higher initial scores, you are likely to get a higher rating, whether that difference is legitimately associated with your teaching effectiveness or not. And if you are a teacher with more economically disadvantaged kids, you’re likely to get a lower rating. That is, the measures are biased – modestly – on these bases.

Despite these findings, the authors of the technical report chose to conclude:

The model selected to estimate growth scores for New York State provides a fair and accurate method for estimating individual teacher and principal effectiveness based on specific regulatory requirements for a “growth model” in the 2011-2012 school year. p. 40

I provide far more extensive discussion here!  But even a modest bias across the system as a whole can indicate the potential for substantial bias for underlying clusters of teachers serving very high poverty populations or very high or very low prior scoring students. In other words, THE MEASURE IS NOT ACCURATE – AND BY EXTENSION – IS NOT FAIR!!!!! Is this not obvious enough?
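To make this concrete, here’s a toy simulation (my numbers, not AIR’s): a bias that looks “modest” as a statewide correlation still shifts entire clusters of teachers by enough points to matter at a rating cutoff.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000  # hypothetical teachers statewide

poverty = rng.uniform(0, 1, n)  # share of disadvantaged students
# A "modest" bias: poverty explains only ~2% of the variance in MGP scores.
mgp = rng.normal(50, 15, n) - 8 * (poverty - 0.5)

print("overall corr(MGP, poverty):", np.corrcoef(mgp, poverty)[0, 1].round(2))
print("mean MGP, highest-poverty decile:", mgp[poverty > 0.9].mean().round(1))
print("mean MGP, lowest-poverty decile: ", mgp[poverty < 0.1].mean().round(1))
```

A correlation of roughly -0.15 sounds ignorable, yet the teachers in the highest-poverty classrooms start about seven MGP points behind their lowest-poverty peers before they’ve taught a single lesson.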

The authors of the technical report were wrong – technically wrong – and I would argue morally and ethically wrong in providing NYSED their endorsement of these measures!  You just don’t declare outright, when your own analyses show otherwise, that a measure [to be used for labeling people] is fair and accurate!  [setting aside the general mischaracterization that these are measures of “teacher and principal effectiveness”]

Within a few days after writing this post, I noticed that Chancellor Merryl Tisch of the NY State Board of Regents had posted an op-ed in the NY POST attempting to strong-arm an agreement on a new teacher evaluation system between NYC teachers and the city. In the op-ed, the Chancellor opined:

The student-growth scores provided by the state for teacher evaluations are adjusted for factors such as students who are English Language Learners, students with disabilities and students living in poverty. When used right, growth data from student assessments provide an objective measurement of student achievement and, by extension, teacher performance.

As I noted in my post the other day, one might quibble that Chancellor Tisch has merely stated that the measures are “adjusted for” certain factors; she has not claimed that those adjustments actually work to eliminate bias – which the technical report indicates THEY DO NOT. Further, she has merely declared that the measures are “objective,” not that they are accurate or precise. Personally, I don’t find this deceitful propaganda at all comforting! Objective or not – if the measures are biased, they are not accurate, and if they are not accurate, they are, by extension, not fair.

Sadly, the story of misinformation and disinformation doesn’t stop here. It only gets worse! I received a copy of a letter yesterday from a NY school district that had its teacher evaluation plan approved by NYSED. Here is a portion of the approval letter:

[Image: excerpt of the NYSED plan-approval letter, key language underlined]

Now, I assume this language to be boilerplate. Perhaps not. I’ve underlined the good stuff. What we have here is NYSED threatening to enforce a corrective action plan on the district if the district uses any other measures of teacher or principal effectiveness that are not sufficiently correlated WITH THE STATE’S OWN BIASED MEASURES OF PRINCIPAL AND TEACHER EFFECTIVENESS!

This is the icing on the cake! This is sick – warped – wrong! Consultants to the state find that the measures are biased, and then declare them “fair and accurate.” The Chancellor spews propaganda that reliance on these measures must proceed with all deliberate speed! (or ELSE!!!!!!!) Then the Chancellor’s enforcers warn individual district officials that they will be subjected to mind control – excuse me – departmental oversight – if they dare to present their own observational or other ratings of teachers or principals that don’t correlate sufficiently with the state-imposed, biased measures.

I really don’t even know what to say anymore??????????

But I think it’s time to just say no!


When Dummy Variables aren’t Smart Enough: More Comments on the NJ CREDO Study

This is a brief follow-up on the NJ CREDO study, which I wrote about last week when it was released. The major issues with that study were addressed in my previous post, but here I raise an additional, non-trivial issue that plagues much of our education policy research. The problem I raise today not only afflicts the CREDO study (largely through no real fault of their own… but they do need to recognize the problem), but also plagues many if not most state and city level models of teacher and school effectiveness.

We’re all likely guilty at some point or another – guilty of using dummy variables that just aren’t precise enough to capture what it is we are really trying to measure. We use these variables because, well, they are available, and often greater precision is not. But the stakes can be high if using these variables leads to the misclassification of schools for closure, of teachers for dismissal, or of supposed policy solutions as deserving greater investment/expansion.

So… what is a dummy variable? Well, a dummy variable is what we get when we classify students as poor or non-poor using a simple, single income cut-off, assigning, for example, the non-poor a value of “0” and the poor a value of “1.” Clearly, we’re losing much information when we take the entire range of income variation and lump it into two categories. And this can be consequential, as I’ve discussed on numerous previous occasions. For example, we might be estimating a teacher effectiveness model and comparing teachers who each have a class loaded with 1s and a few 0s. But there’s likely a whole lot of variation across those classes full of 1s – variation between classrooms with large numbers of very low income, single parent & homeless families versus classrooms where those 1s are marginally below the income threshold.
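A tiny sketch of the information thrown away (hypothetical incomes, not CREDO’s data): two groups of kids can be “identical” on the dummy while living in very different economic worlds.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical family incomes, expressed as a percent of the poverty line.
income_pct = rng.lognormal(mean=5.0, sigma=0.6, size=100_000)

# The dummy variable: 1 if below the 185% reduced-price-lunch threshold.
poor = income_pct < 185
print("share coded 1 ('poor'):", poor.mean().round(2))

# Two groups of kids, both coded 1 on the dummy, very different realities:
deep_poverty = income_pct[income_pct < 80]  # far below the poverty line
near_threshold = income_pct[(income_pct >= 150) & (income_pct < 185)]
print("median income, group A: {:.0f}% of poverty line".format(np.median(deep_poverty)))
print("median income, group B: {:.0f}% of poverty line".format(np.median(near_threshold)))
```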

For those who’ve not really pondered this, consider 2011 NAEP 8th grade math performance in New Jersey: the gap between non-low-income kids and reduced-price lunch kids (family income up to 185% of the poverty threshold) is about the same as the gap between reduced-price lunch kids and free lunch kids (income up to 130%)!

[Figure: 2011 NAEP 8th grade math scores in New Jersey by lunch status (non-low-income, reduced-price, free)]

The NJ CREDO charter school comparison study is just one example. CREDO’s method involves identifying matched students who attend charter schools and district schools based on a set of dummy variables. In their NJ study, the indicators included one for special education status and one for children qualified for free or reduced-price lunch (as far as one can tell from the rather sketchy explanation provided). If the dummy variables match, the students are considered matched – empirically THE SAME. Or, as stated in the CREDO study:

…all candidates are identical to the individual charter school student on all observable characteristics, including prior academic achievement.

Technically correct – identical on the measures used – but identical? Not likely!

The study also matched on prior test score, which does help substantially in providing additional differentiation within these ill-defined categories. But it is important to understand that annual learning gains – as well as initial scores/starting points – are affected by a child’s family income status. Lower income, even among the low-income population, is associated with increased mobility (induced by housing instability). Quality of life during all those hours kids spend outside of school (including nutrition, health, sleep, etc.) affects children’s ability to fully engage in their homework and also likely affects summer learning/learning loss (access to summer opportunities varies by income, parental involvement, etc.). So – NO – it’s not enough to only control for prior scores. Continued deprivation influences continued performance and performance growth. As such, this statement in the CREDO report is quite a stretch (but is typical, boilerplate language for such a study):

The use of prior academic achievement as a match factor encompasses all the unobservable characteristics of the student, such as true socioeconomic status, family background, motivation, and prior schooling.

Prior scores DO NOT capture persistent differences in unobservables that affect the ongoing conditions under which children live, which clearly affect their learning growth!
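Here’s a simulation sketch of that point (all parameters invented for illustration): let income affect both the starting score and the subsequent gain, match students exactly on the prior score, and a growth gap remains that has nothing to do with the schools.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 200_000

# Hypothetical: family income (z-score) affects both the starting point
# and the ongoing learning gain; the school effect is set to ZERO.
income = rng.normal(0, 1, n)
prior = 0.5 * income + rng.normal(0, 1, n)  # prior-year test score
gain = 0.3 * income + rng.normal(0, 1, n)   # year's learning gain

# "Match" two groups exactly on prior score (same narrow band)...
band = (prior > -0.1) & (prior < 0.1)
low_income = band & (income < -1)
high_income = band & (income > 1)

# ...and a gain gap remains, driven entirely by ongoing deprivation.
print("mean gain, low-income matched group: ", gain[low_income].mean().round(2))
print("mean gain, high-income matched group:", gain[high_income].mean().round(2))
```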

Now, one problem with the CREDO study is that we really don’t know which schools are involved, so I’m unable to compare the demographics of the charter schools actually included with those of district schools. But, for illustrative purposes, here are a few figures that raise significant questions about the usefulness of matching charter students and district students on the basis of “special education” as a single indicator, and “free AND reduced” lunch qualification as a single indicator.

First, here are the characteristics of special education populations in Newark district and charter schools.

[Figure: disability classifications of special education students in Newark district vs. charter schools]

As I noted in my previous post, nearly all special education students in Newark charter schools have mild specific learning disabilities, and the bulk of the rest have speech impairments. Yet students in district schools who may have received the same dummy variable coding are far more likely to have multiple disabilities, mental retardation, emotional disturbance, etc. It seems rather insufficient to code these groups with a single dummy variable… even if the classifications of the test-taker population were more similar than those of the total enrolled population (assuming many of the most severely disabled children were not in that test-taker sample?).

Now, here are the variations by income status – first for district and charter schools in the aggregate:

[Figure: shares of students in free (below 130%) and reduced-price (130-185%) lunch categories, Newark district vs. charter schools]

Here, charters in Newark, as I’ve noted previously, generally have fewer low-income students, and their shortfall relative to the district is far greater among students below the 130% income threshold than among those between the 130% and 185% thresholds. It would be particularly interesting to be able to parse the blue regions even further, as I suspect that charters serve an even smaller share of those below the 100% threshold. Using a single dummy variable, any child in either the red or blue region was assigned a 1 and assumed to be the same (excuse me… “IDENTICAL?”). But, as it turns out, there is about twice the likelihood that the child with a 1 in a charter school was in a family between the 130% and 185% income thresholds. And that may matter quite a bit, as would additional differences within the blue region.

Here’s the distribution of free vs. reduced price lunch across NJ charter schools – among their free/reduced populations.

[Figure: free vs. reduced-price lunch shares among the low-income populations of Newark charter schools and NPS]

While less than 10% of the free/reduced population in NPS is in the upper income bracket, a handful of Newark charter schools – including high flyers like Greater Newark, Robert Treat and North Star – have 20% to 30% of their (relatively small) low-income populations in the upper bracket. That is, the “matched child” who attended Treat, North Star or Greater Newark had a 2 to 3 times greater chance than their “peer” in NPS of being from the higher end of the low-income group.

Again… CREDO likely worked with the data they had. However, I do find inexcusable the repeated sloppy use of the term “poverty” to refer to children qualified for free or reduced-price lunch, and the failure of the CREDO report to a) address any caveats regarding the use of these measures or b) provide any useful comparisons of the differences in overall demographic context between charter schools and district schools.

Ed Schools – The Sequel: Rise of the Intellectually Dead

Warning: The following post contains the elitist musings of an ivory tower professor who has only professed at major research universities, who attended a selective liberal arts college & received his doctorate from an Ivy league institution (well… a branch of one… Teachers College at Columbia).

A while back, I wrote a post on “ed schools,” the point of which was to show the shift in production of degrees that had occurred between the early 1990s and late 2000s. When I wrote that first post, ed schools were coming under fire from DC think tanks like the National Council on Teacher Quality (NCTQ), which seemed largely unable to understand the most basic issues of degree production in education (I’m unsure they’ve learned much since then!). And now, it would appear that our esteemed U.S. Secretary of Education has decided that ed schools and teacher preparation will be of primary interest in the second term of this administration.

The problem, as I previously indicated, is that most of this rhetoric about ed schools – their supposed failure of society, their production of generations of ill-equipped American youth – assumes a static definition of “ed school”: one rooted in a 1950s-to-1970s characterization of the regional public teachers college, and built on an assumption that teachers obtain their training and a teaching credential – for the one thing they teach – through a single institution as the core of their undergraduate education. Being “teachers colleges,” these schools are obviously lax on admission standards, have curricula that are neither academically rigorous nor practical, etc. etc. etc. (the conflicting rhetoric in this regard is fun to follow – too much theory… no practical application… but not academically rigorous, etc.), and well… simply must be replaced by a vast set of alternative routes/pathways/programs!

In short, the vast majority of the critique of teacher education assumes this monolithic AND STATIC entity of teacher preparation housed in state colleges and universities. Emporia State in Kansas – that’s you! Montclair in NJ – that’s you! West Georgia – you too! And those state flagships with teacher prep programs? Damn you Rutgers, Michigan, Illinois for producing increasing numbers of underqualified teachers! The wrath of NCTQ and now Arne Duncan will be upon you!

But degree & credential production in education has not entirely been static over time. In fact, anything but! There are clearly emerging trends. And if we believe that there really has been a decline in the academic quality of those receiving credentials in education, it would behoove us to take a close look at those trends. But since no-one else seems to be doing that – especially not NCTQ – I figured I should take another shot at it.

A couple of key points are in order. FIRST – it is important to understand that these days, many initial teaching credentials are already granted through alternate routes outside of undergraduate programs, and to individuals with degrees in fields other than education. In addition to non-degree alternate routes, which I cannot even capture with the data in this post, many initial teaching credentials are granted through graduate programs at the masters degree level – and an even larger share of additional (second/third) credentials received by practicing teachers are obtained through graduate programs. Individual teachers may have collected a handful of different credentials, all from different institutions.

So, let’s take a look at undergraduate and masters degree production trends.

Undergraduate Training

Undergraduate degree production in “education” fields generally (most of which involves teacher preparation) has been mostly stable over time. Using 1994 Carnegie Classifications (the most stratified system of Carnegie classifications of the past few decades: see end of post for definitions), we see that what were the public “teachers colleges” (Comprehensive 1… as opposed to those labeled as “Teachers Colleges”) still hold the lion’s share of degrees produced, but that share has declined over time. Research Universities, which produced around 14% in 1990, now produce closer to 10% (those are your state flagships & major private universities). So… the major traditional public college and university role is declining slightly in market share.

That loss is being picked up by what is actually a very small subset of colleges – colleges that also tend to be relatively small and not so prestigious. These are the “Liberal Arts 2” (LA2) colleges. It’s quite striking that growth in this subset is sufficient to shift the market shares of major state universities and comprehensive regional colleges. Incidentally, LA2s were among the first to rapidly expand their production of online and distance MBAs… around the same time they started tapping the ed market. (This period overlaps with a trend among financially strapped, less selective colleges of changing their names to “university.”)

[Figure: shares of undergraduate education degrees by 1994 Carnegie classification, over time]

Patterns are also relatively stable by the Barron’s competitiveness ratings. Notably, colleges right in the middle of the competitiveness ratings have the largest market share. I know this conflicts with reformy ideas that all ed degrees are produced by the worst colleges – but at the undergrad level, it’s a pretty normal distribution. Competitive colleges have a consistent 50% market share. Indeed, they are not the top third. They are also not the bottom! They are… the middle… as one would expect for a profession with modest (at best) earnings expectations.

The next two categories out from there – one up (very competitive) and one down (less competitive) – have just under 20% each. But the “less competitive” group seems to be showing an uptick (it is also heavy on those LA2s!). Highly competitive and non-competitive colleges are also relatively comparable, with non-competitive slightly outpacing highly competitive.


[Figure: shares of undergraduate education degrees by Barron’s competitiveness rating, over time]

Masters Degrees

It’s in the production of masters degrees where the real fun stuff is happening. First, let’s take a look at what’s been happening across institutions by type. Note that Comprehensive colleges were, in large part, designed to deliver bachelors and masters degree programs and many from early on had large education programs and teacher preparation programs in particular. But we see in the figure below that the market share of masters degree production for Comp1s has declined over time. So too has the market share for masters degrees for Research Universities (including state flagship universities).

Amazingly, it’s those LA2s again that have risen dramatically in degree production. These lower-tier liberal arts colleges (we’re not talkin’ Williams, Haverford, etc.… which are LA1s. Those schools aren’t crankin’ up masters in Ed… and they’re also not changing their name to Williams University, etc.) have become the second largest producers of masters degrees in education. Bear in mind that liberal arts colleges, as classified in the 1990s, were never really intended to be handing out graduate degrees – much less massive numbers of them. LA2s have gone from only about 1% of ed masters production in 1990 to over 10% by 2011.

[Figure: shares of masters degrees in education by 1994 Carnegie classification, over time]

The next figure reclassifies these schools by the competitiveness of their undergraduate programs (since we lack competitiveness measures for graduate programs). What we see here is that masters programs housed in “LESS COMPETITIVE” undergraduate colleges are the ones that are creeping up in market share. To a significant extent, these are online, credential granting programs run through LA2s.

[Figure: shares of masters degrees in education by Barron’s competitiveness of the host institution’s undergraduate program, over time]

So, what we have here is a rather dramatic expansion of graduate credentials in education being handed out by what some (including myself) might characterize as relatively low quality, non-selective undergraduate institutions that were never meant to be handing out graduate degrees to begin with. But perhaps that’s just my ivory tower, Research I perspective.

Now let’s take a look at the top 20 masters degree producers in the early 1990s and then in the most recent three years. In the early 1990s, the largest producers were crankin’ out a few thousand over a three-year period. These included some early entrants – pre-online era – to the degree mass-production game, like Lesley College and National Louis U. But there were also many brick-and-mortar universities in the mix, including both state flagships (UT Austin, Ohio State) and other pretty solid academic schools (Harvard, Columbia/TC). Arguably, these [the public colleges in particular] are the schools now taking the brunt of the blame for the state of teacher preparation – Northern Arizona, Northern Colorado, Eastern Michigan, etc.

[Table: top 20 producers of masters degrees in education, early 1990s (three-year totals)]

But who has actually been crankin’ out the masters degrees and credentials in recent years? And, if there is a decline and pending crisis in education training/preparation, who might instead be to blame? Below is the more recent production of graduate degrees/credentials. First and foremost, we’ve now got schools crankin’ out over 3,000 per year – or 9k per three years. Phoenix, Walden and Grand Canyon together produce more masters degrees than many of the next several combined. There is a substantial gap in production before one reaches the first traditional teacher preparation program on the list.

Is it possible that the emphasis on traditional “ed schools” within state boundaries as the obvious source of our problems is misplaced?

[Table: top producers of masters degrees in education, most recent three years]

Graduate Degree Production in Educational Leadership/Administration

I’ve got one last bit to address here, and that’s training in educational leadership/administration, a topic I’ve written about in my academic publications (see below). Degree production in educational leadership has followed many of the same trends we see in education more generally. And there has been a comparable push to provide more “alternatives” for gaining access to principal, supervisor and district leadership credentials. NOTE – if you think some of what I’m displaying here makes education grad degree production look like a cesspool, I assure you that when it comes to the production of MBAs, the picture is equally if not even more ugly! (One can buy an MBA almost anywhere… perhaps even more easily than a degree in ed admin… and in many cases which I have observed directly, the level of academic rigor, even within major universities, is hardly different!)

The figure below shows that major research universities have played a declining role in the production of graduate degrees (all levels) in educational administration. Again, it’s those entrepreneurial LA2s that are crankin’ up the production – moving into 2nd place among institution types.

[Figure: shares of graduate degrees in educational administration by 1994 Carnegie classification, over time]

Now let’s take a look specifically at doctoral degrees. One can almost kind of understand the mass production of masters degrees, which in education are often tied to obtaining specific certifications, perhaps in additional fields of specialization (special education, etc.). Yes, in many states, administration degrees are structured such that the masters is coupled with building-level certification and the doctorate with district-level certification. Even then, how many doctorates does any one institution need to be cranking out? And who should be granting that level of degree?

By 1990s Carnegie classifications, doctorates should be (and have been) largely granted by Research and Doctoral Universities. Comprehensive colleges were generally masters-producing schools, not doctoral-granting institutions. These strata were, in fact, intended to reflect the capacity of institutions to grant certain types/levels of degrees.

Already by the early 1990s, Nova Southeastern had pioneered mass production of the education doctorate. But outside of the Nova model, most major producers of doctorates were actual universities (okay… a bit harsh… since NOVA actually is a university, and has a pretty well defined, conventional curriculum for their graduate programs).

[Table: top producers of doctorates in educational administration, early 1990s]

In the most recent years, Nova Southeastern has remained strong… but now right up there are such stellar academic powerhouses as Walden, Capella and Phoenix! (and Argosy)… many of which probably occasionally show up as side-bar advertisements on my blog! (as they do when I log into facebook).

A notable change in the past few years is the entrance of USC and Penn to this mix, with their new practitioner preparation programs, which apparently crank out a sizable number of doctorates per year. This raises the interesting question of whether leading universities should try to get into the mass production game. Is the system overall better for it, even if those institutions have to sacrifice some quality in order to mass produce? We’ll have to see if they can keep up with the Waldens and Capellas over the next several years.

[Table: top producers of doctorates in educational administration, recent years]

Closing Thoughts

To me, these trends are pretty astounding, and serious consideration of these trends must play into any discussion that alarmists might have about the supposed decline in the quality of teacher and administrator preparation (to the extent these alarmists give serious consideration to anything).  Those ringing these alarm bells seem more than happy to suggest that the obvious problem lies with traditional “ed schools” (read, regional and state flagship public colleges and universities) and that the obvious solution is to provide more alternative routes, online options – teacher preparation by MOOC…  (and likely not a MOOC delivered by Stanford U. faculty… but rather through Walden, Capella and the like) & expansion of schools relying on imported, short term labor supply.

I also find it strange, to say the least, that those who argue that the problem is that our teachers don’t come from the upper third of college graduates seem to believe that the solution is to expand the types of programs that tend to grow most rapidly among colleges that cater to the bottom third (less & non-competitive). To those reformy alarmists who feel they’ve identified the obvious problems and logical solutions, the above data should make sufficiently clear that we’ve already gone down that road.

Further, I’m thoroughly unconvinced that new models purporting to be more selective in the teachers they prepare, but relying largely on a self-credentialing model (we use our teachers to credential our teachers… and only accept as graduate students those who work in our schools?) focused primarily on ideological & cultural indoctrination, are a step in the right direction. I have little doubt they’ll find a captive audience to self-credential and maintain a viable “business model” (by requiring their own teachers to take courses delivered by their peers & bosses to achieve the credentials needed to keep their jobs), but this endogenous, back-patting, self-validating model is no way to train the future teacher workforce.*

All of this begs the question: what next? Where do we go from here? How do we achieve integrity and quality in the production of degrees and credentials, and more broadly in the training and preparation of future teachers and administrators? I really don’t have any answers for these questions right now. But I’m pretty sure that the last two decades have taken us in the wrong direction!

Related Research

Baker, B.D, Orr, M.T., Young, M.D. (2007) Academic Drift, Institutional Production and Professional Distribution of Graduate Degrees in Educational Administration. Educational Administration Quarterly  43 (3)  279-318

Baker, B.D., & Fuller, E. The Declining Academic Quality of School Principals and Why it May Matter. [Local copy: Baker.Fuller.PrincipalQuality.Mo.Wi_Jan7]

Baker, B.D., Wolf-Wendel, L.E., & Twombly, S.B. (2007). Exploring the Faculty Pipeline in Educational Administration: Evidence from the Survey of Earned Doctorates 1990 to 2000. Educational Administration Quarterly 43(2), 189-220.

Wolf-Wendel, L., Baker, B.D., Twombly, S., Tollefson, N., & Mahlios, M. (2006). Who’s Teaching the Teachers? Evidence from the National Survey of Postsecondary Faculty and Survey of Earned Doctorates. American Journal of Education 112(2), 273-300.

1994 Carnegie Classifications

  • Research Universities I: These institutions offer a full range of baccalaureate programs, are committed to graduate education through the doctorate, and give high priority to research. They award 50 or more doctoral degrees each year. In addition, they receive annually $40 million or more in federal support.
  • Research Universities II: These institutions offer a full range of baccalaureate programs, are committed to graduate education through the doctorate, and give high priority to research. They award 50 or more doctoral degrees each year. In addition, they receive annually between $15.5 million and $40 million in federal support.
  • Doctoral Universities I: These institutions offer a full range of baccalaureate programs and are committed to graduate education through the doctorate. They award at least 40 doctoral degrees annually in five or more disciplines.
  • Doctoral Universities II: These institutions offer a full range of baccalaureate programs and are committed to graduate education through the doctorate. They award annually at least ten doctoral degrees (in three or more disciplines) or 20 or more doctoral degrees in one or more disciplines.
  • Master’s (Comprehensive) Universities and Colleges I: These institutions offer a full range of baccalaureate programs and are committed to graduate education through the master’s degree. They award 40 or more master’s degrees annually in three or more disciplines. [Includes typical regional, within-state public normal schools/teachers colleges]
  • Master’s (Comprehensive) Universities and Colleges II: These institutions offer a full range of baccalaureate programs and are committed to graduate education through the master’s degree. They award 20 or more master’s degrees annually in one or more disciplines.
  • Baccalaureate (Liberal Arts) Colleges I: These institutions are primarily undergraduate colleges with major emphasis on baccalaureate degree programs. They award 40 percent or more of their baccalaureate degrees in liberal arts fields and are restrictive in admissions.
  • Baccalaureate Colleges II: These institutions are primarily undergraduate colleges with major emphasis on baccalaureate degree programs. They award less than 40 percent of their baccalaureate degrees in liberal arts fields or are less restrictive in admissions. [Includes many cash-strapped, relatively non-selective, smaller private liberal arts colleges]

*I still like to believe that the most important background attribute of a “good teacher” or school leader is enthusiasm for one’s own learning – constantly seeking intellectual growth and challenge – and that this attribute is often revealed in the types of advanced studies an individual chooses to pursue. To me, even if the Relay model does tap into a set of graduates of more selective colleges, if the Relay program itself is little more than a workshop on “no excuses” classroom disciplinary practices and typical inspiring edu-guru staff development fodder, then the Relay model is antithetical to developing truly good teachers. A workshop or two and perhaps some practical guidance from peers or teacher leaders – okay. But a graduate degree based on this stuff? Are you kidding? (just watch the Relay GSE videos here: http://www.relayschool.org/videos?vidid=5)

When Disinformation is Fueled by Misinformation! CHANCELLOR TISCH, YOU ARE WRONG!

Very recently, I posted a critique of the recent technical report on New York State median growth percentiles to be used in that state’s teacher and principal evaluation system.

Today, I read this piece in the NY Post – an editorial by NY State Board of Regents Chancellor Merryl Tisch, and well, MY HEAD ALMOST EXPLODED!

The point of the editorial is to encourage NY City’s teachers and DOE to agree to a teacher evaluation system based on supposedly objective measures – where “objective measures” seems largely to be code language for estimates of teacher effectiveness derived from student assessment data.

First, I have written several previous posts on the usefulness of NYC’s value-added model for determining teacher effectiveness, in which I’ve pointed out that:

  1. the NYC VAM model retains some persistent biases
  2. the NYC VAM model is highly unstable from year to year
  3. the NYC VAM results capture only a handful of teachers per school and their results tend to jump all over the place
  4. adopting the TNTP “irreplaceables” logic, the NYC VAM data are so noisy that few if any teachers are persistently irreplaceable
  5. for various reasons, it is unlikely that these are just early glitches in the system that will get better with time

Setting aside this long list of concerns about the NYC VAM results, I now turn to the NYSED state median growth percentile data (which actually seem inferior to the NYC VAM model/estimates). In her editorial, Chancellor Tisch proclaims:

The student-growth scores provided by the state for teacher evaluations are adjusted for factors such as students who are English Language Learners, students with disabilities and students living in poverty. When used right, growth data from student assessments provide an objective measurement of student achievement and, by extension, teacher performance.

Let me be blunt here. CHANCELLOR TISCH – YOU ARE WRONG! FLAT OUT WRONG! IRRESPONSIBLY & PERHAPS NEGLIGENTLY WRONG!

[now, one might quibble that Chancellor Tisch has merely stated that the measures are “adjusted for” certain factors and she has not claimed that those adjustments actually work to eliminate bias. Further, she has merely declared that the measures are “objective” and not that they are accurate or precise. Personally, I don’t find this deceptive language at all comforting!]

Indeed, the measures attempt – but fail – to sufficiently adjust for key factors. They retain substantial biases, as identified in the state’s own technical report. And they are subject to many of the same error concerns as the NYC VAM model. Given the findings of the state’s own technical report, it is irresponsible to suggest that these measures can and should be immediately considered for making personnel and compensation decisions.

Finally, as I laid out in my previous blog post, to suggest that “growth data from student assessments provide an objective measure of student achievement, and, by extension, teacher performance” IS A HUGE, UNWARRANTED STRETCH!

While I might concur with the follow-up statement from Chancellor Tisch that “We should never judge an educator solely by test scores, but we shouldn’t completely disregard student performance and growth either,” I would argue that school leaders/peer teachers/personnel managers should absolutely have the option to completely disregard data that have high potential to be sending false signals, whether as a function of persistent bias or of error. Requiring action based on biased and error-prone data (rather than permitting those data to be reasonably mined to the extent they may, OR MAY NOT, be useful) is a toxic formula for public schooling quality.

The one thing I can’t quite figure out here is which is the misinformation and which is the disinformation. In any case, both are wrong!

The rest of what I have to say, I’ve already said. But, so readers don’t have to click through to the previous post, I’ve pasted the entire thing below. Enjoy!

COMPLETE PREVIOUS POST!

I was immediately intrigued the other day when a friend passed along a link to the recent technical report on the New York State growth model, the results of which are expected/required to be integrated into district-level teacher and principal evaluation systems under that state’s new teacher evaluation regulations. I did as I often do and went straight for the pictures – in this case, the scatterplots of the relationships between various “other” measures and the teacher and principal “effect” measures. There was plenty of interesting stuff there, some of which I’ll discuss below.

But then I went to the written language of the report – specifically the report’s (albeit DRAFT) conclusions. The conclusions were only two short paragraphs long, despite much to ponder being provided in the body of the report. The authors’ main conclusion was as follows:

The model selected to estimate growth scores for New York State provides a fair and accurate method for estimating individual teacher and principal effectiveness based on specific regulatory requirements for a “growth model” in the 2011-2012 school year. p. 40

http://engageny.org/wp-content/uploads/2012/06/growth-model-11-12-air-technical-report.pdf

Update (13-Nov-2012, 20:54) – Final Report: http://engageny.org/sites/default/files/resource/attachments/growth-model-11-12-air-technical-report_0.pdf

Local copy of original DRAFT report: growth-model-11-12-air-technical-report

Local copy of FINAL report: growth-model-11-12-air-technical-report_FINAL

Unfortunately, the multitude of graphs that immediately precede this conclusion undermine it entirely. But first, allow me to address the egregious conceptual problems with the framing of this conclusion.

First Conceptually

Let’s start with the low hanging fruit here. First and foremost, nowhere in the technical report, nowhere in their data analyses, do the authors actually measure “individual teacher and principal effectiveness.” And quite honestly, I don’t give a crap if the “specific regulatory requirements” refer to such measures in these terms. If that’s what the author is referring to in this language, that’s a pathetic copout.  Indeed it may have been their charge to “measure individual teacher and principal effectiveness based on requirements stated in XYZ.” That’s how contracts for such work are often stated. But that does not obligate the author to conclude that this is actually what has been statistically accomplished. And I’m just getting started.

So, what is being measured and reported?  At best, what we have are:

  • An estimate of student relative test score change on one assessment each for ELA and Math (scaled to growth percentile) for students who happen to be clustered in certain classrooms.

THIS IS NOT TO BE CONFLATED WITH “TEACHER EFFECTIVENESS”

Rather, it is merely a classroom aggregate statistical association based on data points pertaining to two subjects being addressed by teachers in those classrooms, for a group of children who happen to spend a minority share of their day and year in those classrooms.

  • An estimate of student relative test score change on one assessment each for ELA and Math (scaled to growth percentile) for students who happen to be clustered in certain schools.

THIS IS NOT TO BE CONFLATED WITH “PRINCIPAL EFFECTIVENESS”

Rather, it is merely a school aggregate statistical association based on data points pertaining to two subjects being addressed by teachers in classrooms that are housed in a given school under the leadership of perhaps one or more principals, vps, etc., for a group of children who happen to spend a minority share of their day and year in those classrooms.

Now Statistically

Following are a series of charts presented in the technical report, immediately preceding the above conclusion.

Classroom Level Rating Bias

School Level Rating Bias

And there are many more figures displaying more subtle biases, but biases that for clusters of teachers may be quite significant and consequential.

Based on the figures above, there certainly appears to be, both at the teacher, excuse me – classroom, and principal – I mean school level, substantial bias in the Mean Growth Percentile ratings with respect to initial performance levels on both math and reading. Teachers with students who had higher starting scores and principals in schools with higher starting scores tended to have higher Mean Growth Percentiles.

This might occur for several reasons. First, it might just be that the tests used to generate the MGPs are scaled such that it’s just easier to achieve growth in the upper ranges of scores. I came to a similar finding of bias in the NYC value-added model, where schools having higher starting math scores showed higher value-added. So perhaps something is going on here. It might also be that students clustered among higher performing peers tend to do better. And it’s at least conceivable that students who previously had strong teachers, and who remain clustered together from year to year, continue to show strong growth. What is less likely is that many of the actual “better” teachers just so happen to be teaching the kids who had better scores to begin with.
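For the skeptical reader, here is a toy simulation of just the peer-clustering story – entirely my own construction with made-up parameters, NOT AIR’s model or data. Every “teacher” in this toy world is identical, yet because current scores get a small boost from classroom mean prior achievement, classroom mean growth percentiles still end up correlated with prior scores:

```python
# Toy simulation (my own sketch; all parameters invented, not AIR's model).
import numpy as np

rng = np.random.default_rng(0)
n_classes, class_size = 500, 25

# Classrooms differ in mean prior achievement (student sorting);
# every "teacher" in this toy world is identical.
class_mean = rng.normal(0, 1, n_classes)
prior = class_mean[:, None] + rng.normal(0, 1, (n_classes, class_size))

# Current-year score: own prior, plus a peer effect of classroom mean prior,
# plus noise, and no teacher effect whatsoever.
current = prior + 0.3 * class_mean[:, None] + rng.normal(0, 1, (n_classes, class_size))

# Crude growth percentile: rank each student's current score among all
# students with similar priors (20 quantile bins on the prior score).
prior_flat, current_flat = prior.ravel(), current.ravel()
edges = np.quantile(prior_flat, np.linspace(0, 1, 21))[1:-1]
bins = np.digitize(prior_flat, edges)
sgp = np.empty_like(current_flat)
for b in np.unique(bins):
    mask = bins == b
    sgp[mask] = 100 * current_flat[mask].argsort().argsort() / (mask.sum() - 1)

# Aggregate to a classroom "MGP" and check for bias against prior scores.
mgp = sgp.reshape(n_classes, class_size).mean(axis=1)
print("corr(classroom mean prior, classroom MGP):",
      round(float(np.corrcoef(class_mean, mgp)[0, 1]), 2))
```

Run it and the correlation comes out strongly positive – with zero differences in teacher effectiveness anywhere in the data.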

That the systemic bias appears greater in the school-level estimates than in the teacher-level estimates suggests that the teacher-level estimates may actually be even more biased than they appear. The aggregation of otherwise less biased estimates should not reveal more bias.

Further, as I’ve mentioned several times on this blog, even if there weren’t such glaringly apparent overall patterns of bias, there still might be underlying biased clusters. That is, groups of teachers serving certain types of students might have ratings that are substantially WRONG, either in relation to observed characteristics of the students they serve or their settings, or in relation to unobserved characteristics.

Closing Thoughts

To be blunt – the measures are neither conceptually nor statistically accurate. They suffer significant bias, as shown and then completely ignored by the authors. And inaccurate measures can’t be fair. Characterizing them as such is irresponsible.

I’ve now written two articles and numerous blog posts in which I have raised concerns about the likely overly rigid use of these very types of metrics when making high-stakes personnel decisions. I have pointed out that misuse of this information may raise significant legal concerns. That is, when district administrators do start making teacher or principal dismissal decisions based on these data, there will likely follow some very interesting litigation over whether this information really is sufficient for upholding due process (depending largely on how it is applied in the process).

I have pointed out that the originators of the SGP approach have stated in numerous technical documents and academic papers that SGPs are intended to be a descriptive tool and are not for making causal assertions (they are not for “attribution of responsibility”) regarding teacher effects on student outcomes. Yet, the authors persist in encouraging states and local districts to do just that. I certainly expect to see them called to the witness stand the first time SGP information is misused to attribute student failure to a teacher.

But the case of the NY-AIR technical report is somewhat more disconcerting. Here, we have a technically proficient author working for a highly respected organization – American Institutes for Research – ignoring all of the statistical red flags (after waving them), and seemingly oblivious to gaping conceptual holes (commonly understood limitations) between the actual statistical analyses presented and the concluding statements made (and the language used throughout).

The conclusions are WRONG – statistically and conceptually. And the author needs to recognize that being so damn bluntly wrong may be consequential for the livelihoods of thousands of individual teachers and principals! Yes, it is indeed another leap for a local school administrator to use their state-approved evaluation framework, coupled with these measures, to actually decide to adversely affect the livelihood and potential career of some wrongly classified teacher or principal – but the author of this report has given them the tool and provided his blessing. And that’s inexcusable.

The Secrets to Charter School Success in Newark: Comments on the NJ CREDO Report

Today, with much fanfare, we finally got our New Jersey Charter School Report. The unsurprising findings of that report are that charter schools in Newark in particular seem to be providing students with greater average annual achievement gains than those of similar (matched) students attending district schools. Elsewhere around the state charter schools are pretty much average.

Link to report: http://credo.stanford.edu/pdfs/nj_state_report_2012_FINAL11272012.pdf

So then, the big question is: what exactly is behind the apparent success of Newark charter schools – or at least of enough of them to influence the analysis as a whole? Further, and perhaps more importantly, is there something about these schools that makes them successful that can be replicated?

The General Model

Allow me to start by pointing out that the CREDO study uses its usual approach – a reasonable one given data and system constraints – of identifying matched sets of students from feeder schools (or areas) who end up in district schools and in charter schools. CREDO then compares (estimates) the year-to-year test score gains of students in the charter and district schools.

The CREDO approach, while reasonable, simply can’t sort out which component of student achievement gain is created by “school factors” (such as teacher quality, length of day/year, etc.) and which component is largely a function of concentrating non-low-income, non-ELL, non-disabled females in charter schools while concentrating the “others” in district schools.

School Effect = Controllable School Factors + Peer Group & Other Factors

In other words, we simply don’t know what component of the effect has to do with school quality issues that might be replicated and what component has to do with clustering kids together in a more advantaged peer group. Yes, the study controls for the students’ individual characteristics, but no, it cannot sort out whether the clustering of students with more or less advantaged peers affects their outcomes (which it certainly does). Lottery-based studies suffer the same problem when lotteried-in and lotteried-out students end up in very different peer contexts. Yes, the sorting mechanism is random, but the placement is not. The peer selection effect may be exacerbated by selective attrition (shedding weaker and/or disruptive students over time). And Newark’s highest flying charter schools certainly have some issues with attrition.
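To make the identification problem concrete, here is a stylized sketch – my own toy setup, not CREDO’s code or data – in which the true “charter practice” effect is exactly zero and gains are driven entirely by peer composition, yet the standard sector comparison finds a healthy “charter effect”:

```python
# Stylized sketch (my own toy setup; invented parameters, not CREDO's model).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2000

# Sector and peer advantage move together by construction (student sorting):
charter = rng.binomial(1, 0.3, n).astype(float)
peer = 0.8 * charter + rng.normal(0, 0.3, n)

# TRUE data-generating process: zero school-practice effect; gains are
# entirely a peer-composition effect plus noise.
gain = 0.5 * peer + rng.normal(0, 1, n)

# The standard sector comparison (no peer measure) attributes the
# peer effect to "chartering":
naive = sm.OLS(gain, sm.add_constant(charter)).fit()
print("estimated charter 'effect':", round(naive.params[1], 2))  # ~0.4, not 0
```

In this toy setup, adding the peer measure to the regression would recover a charter coefficient near zero. But in real data, peer composition and “charterness” are so entangled that no regression on these data can cleanly split them.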

Given my numerous previous posts, I would suggest Figure 1 as the general model of the secrets of Newark Charter School success.

Figure 1. The General Model

Put simply, while resource use – additional time, compensation, etc. – may be part of the puzzle (the scalable part), the strong sorting patterns of students into charter and district schools clearly play some role – a substantial role – and one that constrains our ability to use “chartering” as a broad-based public policy solution.

One Part Segregation

Let’s start by taking a look at the most recent available data on the segregation of students by disability status, free lunch status, gender and language proficiency. Now, the CREDO report is careful to point out that charter school enrollments match the demographics of their feeder schools – and uses this finding as an indication that charter schools therefore aren’t cream-skimming. That’s all well and good… EXCEPT that for some reason (actually, many reasons), charter schools themselves end up having far fewer of the lowest-income students. See Figure 2.

Figure 2. % Free Lunch

Now, one technical quibble I have with the CREDO report is that it relies on the free/reduced-price lunch indicator to identify economic disadvantage (and then, sloppily, refers to this throughout as “poverty”). I have shown on numerous previous occasions that Newark charters tend to serve larger shares of the less poor children and smaller shares of the poorer children. So, it is quite likely that the CREDO matched groups of students actually include disproportionate shares of “reduced lunch” children for charters and “free lunch” children sorted into district schools. This is a non-trivial difference! [Achievement gaps between free-lunch and reduced-lunch students tend to be comparable to gaps between reduced-lunch and non-qualified students.]

Here are the other sorting issues:

Figure 3. % ELL/LEP

Figure 4. % Female


Figure 5 shows that not only do charter schools in Newark tend to serve far fewer children with disabilities, they especially serve few or no students with more severe disabilities. In fact, they serve mainly students with Specific Learning Disabilities and Speech-Language Impairment. Given the data in Figure 5, it is actually quite humorous – if not strangely disturbing – that the CREDO study attempted to parse the relative effectiveness of district and charter schools at producing outcomes for children with disabilities using only a single broad classification. [Student matching was based on a single classification, creating the possibility that children with speech-language impairment in charters were being compared with children with mental retardation and autistic children in district schools. It is likely that, in both cases, most students who took the assessments were those with less severe disabilities.]

Figure 5. Special Education Distributions

Here are some related findings from (and links to) previous posts

Newark Charter Effects on NPS School Enrollments

New Jersey Charter School Special Education

Newark Charter School Attrition Rates

Here are just a few visuals of how the free lunch shares and female student test-taker shares relate to general education proficiency rates on 8th grade math. Both are relatively strong determinants of cross-school proficiency. And with respect to both gender balance and free lunch balance, Newark charter schools are substantively different from their district school counterparts. (A sketch of how one might rebuild these plots follows the figures.)

Figure 6: 8th Grade Math & % Free Lunch

Figure 7: 8th Grade Math & % Female
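Figures 6 and 7 are simple cross-school scatterplots; something along these lines would reproduce them. The file and column names here are hypothetical placeholders for NJ assessment and enrollment data, not my actual code:

```python
# Sketch only: hypothetical file and column names standing in for NJ data.
import pandas as pd
import matplotlib.pyplot as plt

schools = pd.read_csv("newark_gr8_math.csv")
# assumed columns: sector ("NPS"/"Charter"), pct_free_lunch, pct_female, pct_proficient

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, xvar in zip(axes, ["pct_free_lunch", "pct_female"]):
    for sector, marker in [("NPS", "o"), ("Charter", "^")]:
        sub = schools[schools["sector"] == sector]
        ax.scatter(sub[xvar], sub["pct_proficient"], marker=marker, label=sector)
    ax.set_xlabel(xvar)
    ax.set_ylabel("% proficient, grade 8 math")
    ax.legend()
plt.tight_layout()
plt.show()
```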


Now, these are performance level differences, which are not the same as the gain measures estimated in the CREDO study. I’ve chosen the 8th grade scores because that is when the charter scores tend to pull away from the district school scores (that is, these are the score levels at the tail end of achieving greater gains). Still, the contexts of the gains for charter students are so substantially different from the contexts of achievement gains for district school students that scalability is highly questionable.

As I’ve said before – There just aren’t enough non-disabled, non-poor, fluent English speaking females in Newark to fully replicate district-wide the successes of the city’s highest flying charters.

One Part Compensation

Now, I’ve also written many posts which address the resource advantages and some resource allocation issues for high-flying New York City charter schools, which a) also promote substantial student population segregation and b) have been shown in numerous studies to yield positive achievement gains.

I do not intend to imply by my above critique that the peer group effect is necessarily the ONLY effect driving Newark charters’ supposed success. The problem is that, because high-flying Newark charters in particular serve such uncommon student populations, we can never really sort out the peer group versus school quality effects.

It is certainly reasonable to assume that the additional time and effort spent with these students in some schools – even though they are a more advantaged (less disadvantaged) group – makes a difference. “No excuses” charters in Newark, like those in New York City, tend to provide longer school days and longer school years, and importantly, they compensate their teachers for the additional time & effort. Here’s a simple chart of the average teacher compensation for early-career teachers in NPS and Newark charters. NPS teachers catch back up in later years, but as I’ve pointed out in numerous previous posts, a handful of Newark charters have adopted the reasonable (smart) competitive strategy of leveraging higher salaries and salary growth at the front end to improve teacher retention and recruitment.

Figure 9: Newark Teacher Compensation

Below is a more precise comparison that teases out the differences that aren’t so apparent in Figure 9. For Figure 10, I have used three years of data on teachers to estimate a regression model of teacher salaries as a function of experience, degree level and data year. (A sketch of this kind of model follows.)
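For the technically inclined, the model is of roughly this form. This is a sketch only: the file and column names are hypothetical, and the school indicator is my own addition for illustration so that predicted profiles can be drawn by school – it is not a reproduction of my actual estimation code:

```python
# Sketch of a salary model of the kind described above (hypothetical names).
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file: one row per teacher-year, pooled over three years.
staff = pd.read_csv("newark_staff_3yrs.csv")
# assumed columns: salary, experience, degree, year, school

model = smf.ols(
    "salary ~ experience + I(experience ** 2) + C(degree) + C(year) + C(school)",
    data=staff,
).fit()

# Predicted salary profiles over the first ten years of experience, by
# school, are the sort of curves plotted in a figure like Figure 10.
print(model.params.filter(like="experience"))
```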

Some of Newark’s “high-flying charters” [North Star, Gray, TEAM] tend to substantially outpace salaries of NPS teachers over the first ten years of a teacher’s career. Few of these schools have any teachers with more than 10 years of experience. Other Newark charter schools maintain at least relatively competitive salaries with NPS.

Now, a critical point here is that, as I’ve shown above, teaching in many of these schools comes with the perk of working with a much more advantaged student population. As such, it is conceivable that even a comparable wage provides a recruitment advantage – given the student population difference. Clearly, a higher wage provides a significant recruitment advantage – though in the case of the highest paying school(s), the elevated salary comes with substantial additional obligations.

Figure 10. Modeled Teacher Salary Variation by Experience

Closing Thoughts

So, when all is said and done, this new “charter school” report, like many that have come before it, leaves us sadly unfulfilled, at least with respect to its potential to provide important policy insights. Most cynically, one might argue the main finding of the report is simply that cream-skimming works – it generates a solid peer effect that provides important academic advantages to a few – and serving a few is better than serving none at all (assuming the latter is really the alternative?). Keep it up! Don’t worry ’bout the rest of those kids who get shuffled off into district schools. Quite honestly, given the huge, persistent differences in student populations between high-flying Newark charters and district schools, and given the relative consistency of research on peer group effects, it would be shocking if the CREDO report had not found that Newark charters outperform district schools.

While it is likely that there exist some strategies employed by some charters (as well as some strategies employed by some district schools) that are working quite well – THE CREDO REPORT PROVIDES ABSOLUTELY NO INSIGHT IN THIS REGARD. It’s a classic “charter v. district” comparison – where it is assumed that “chartering” represents one set of educational/programmatic strategies and “districting” represents another – when in fact neither is true (see the scatter of dots in my plots above for the variation within each group!).

AIR Pollution in NY State? Comments on the NY State Teacher/Principal Rating Models/Report

I was immediately intrigued the other day when a friend passed along a link to the recent technical report on the New York State growth model, the results of which are expected/required to be integrated into district-level teacher and principal evaluation systems under that state’s new teacher evaluation regulations. I did as I often do and went straight for the pictures – in this case, the scatterplots of the relationships between various “other” measures and the teacher and principal “effect” measures. There was plenty of interesting stuff there, some of which I’ll discuss below.

But then I went to the written language of the report – specifically the report’s (albeit DRAFT) conclusions. The conclusions were only two short paragraphs long, despite much to ponder being provided in the body of the report. The authors’ main conclusion was as follows:

The model selected to estimate growth scores for New York State provides a fair and accurate method for estimating individual teacher and principal effectiveness based on specific regulatory requirements for a “growth model” in the 2011-2012 school year. p. 40

http://engageny.org/wp-content/uploads/2012/06/growth-model-11-12-air-technical-report.pdf

Update (13-Nov-2012, 20:54) – Final Report: http://engageny.org/sites/default/files/resource/attachments/growth-model-11-12-air-technical-report_0.pdf

Local copy of original DRAFT report: growth-model-11-12-air-technical-report

Local copy of FINAL report: growth-model-11-12-air-technical-report_FINAL

Unfortunately, the multitude of graphs that immediately precede this conclusion undermine it entirely. But first, allow me to address the egregious conceptual problems with the framing of this conclusion.

First Conceptually

Let’s start with the low hanging fruit here. First and foremost, nowhere in the technical report, nowhere in their data analyses, do the authors actually measure “individual teacher and principal effectiveness.” And quite honestly, I don’t give a crap if the “specific regulatory requirements” refer to such measures in these terms. If that’s what the author is referring to in this language, that’s a pathetic copout.  Indeed it may have been their charge to “measure individual teacher and principal effectiveness based on requirements stated in XYZ.” That’s how contracts for such work are often stated. But that does not obligate the author to conclude that this is actually what has been statistically accomplished. And I’m just getting started.

So, what is being measured and reported?  At best, what we have are:

  • An estimate of student relative test score change on one assessment each for ELA and Math (scaled to growth percentile) for students who happen to be clustered in certain classrooms.

THIS IS NOT TO BE CONFLATED WITH “TEACHER EFFECTIVENESS”

Rather, it is merely a classroom aggregate statistical association based on data points pertaining to two subjects being addressed by teachers in those classrooms, for a group of children who happen to spend a minority share of their day and year in those classrooms.

  • An estimate of student relative test score change on one assessment each for ELA and Math (scaled to growth percentile) for students who happen to be clustered in certain schools.

THIS IS NOT TO BE CONFLATED WITH “PRINCIPAL EFFECTIVENESS”

Rather, it is merely a school aggregate statistical association based on data points pertaining to two subjects being addressed by teachers in classrooms that are housed in a given school under the leadership of perhaps one or more principals, vps, etc., for a group of children who happen to spend a minority share of their day and year in those classrooms.

Now Statistically

Following are a series of charts presented in the technical report, immediately preceding the above conclusion.

Classroom Level Rating Bias

School Level Rating Bias

And there are many more figures displaying more subtle biases, but biases that for clusters of teachers may be quite significant and consequential.

Based on the figures above, there certainly appears to be, both at the teacher, excuse me – classroom, and principal – I mean school level, substantial bias in the Mean Growth Percentile ratings with respect to initial performance levels on both math and reading. Teachers with students who had higher starting scores and principals in schools with higher starting scores tended to have higher Mean Growth Percentiles.

This might occur for several reasons. First, it might just be that the tests used to generate the MGPs are scaled such that it’s just easier to achieve growth in the upper ranges of scores. I came to a similar finding of bias in the NYC value-added model, where schools having higher starting math scores showed higher value-added. So perhaps something is going on here. It might also be that students clustered among higher performing peers tend to do better. And it’s at least conceivable that students who previously had strong teachers, and who remain clustered together from year to year, continue to show strong growth. What is less likely is that many of the actual “better” teachers just so happen to be teaching the kids who had better scores to begin with.

That the systemic bias appears greater in the school-level estimates than in the teacher-level estimates suggests that the teacher-level estimates may actually be even more biased than they appear. The aggregation of otherwise less biased estimates should not reveal more bias.
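This point is easy to verify with a toy model – mine, with invented parameters, not the state’s data. Give each teacher-level estimate a small bias tied to school prior scores plus a lot of idiosyncratic noise, average to the school level, and the correlation with prior scores strengthens considerably. That is, the noisy teacher-level scatter understates the underlying bias:

```python
# Toy check of the aggregation argument (invented parameters, my own sketch).
import numpy as np

rng = np.random.default_rng(2)
n_schools, teachers_per = 400, 15
school_prior = rng.normal(0, 1, n_schools)

# Each teacher estimate: a SMALL bias tied to school prior scores,
# swamped by idiosyncratic noise.
bias, noise_sd = 0.2, 1.0
teacher_est = bias * school_prior[:, None] + rng.normal(0, noise_sd, (n_schools, teachers_per))
school_est = teacher_est.mean(axis=1)  # aggregate to the school level

print("teacher-level corr:",
      round(float(np.corrcoef(np.repeat(school_prior, teachers_per),
                              teacher_est.ravel())[0, 1]), 2))
print("school-level corr: ",
      round(float(np.corrcoef(school_prior, school_est)[0, 1]), 2))
# Noise averages out in the school means, so the same underlying bias shows
# up much more strongly at the school level (roughly 0.6 vs 0.2 here).
```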

Further, as I’ve mentioned several times on this blog, even if there weren’t such glaringly apparent overall patterns of bias, there still might be underlying biased clusters. That is, groups of teachers serving certain types of students might have ratings that are substantially WRONG, either in relation to observed characteristics of the students they serve or their settings, or in relation to unobserved characteristics.

Closing Thoughts

To be blunt – the measures are neither conceptually nor statistically accurate. They suffer significant bias, as shown and then completely ignored by the authors. And inaccurate measures can’t be fair. Characterizing them as such is irresponsible.

I’ve now written two articles and numerous blog posts in which I have raised concerns about the likely overly rigid use of these very types of metrics when making high-stakes personnel decisions. I have pointed out that misuse of this information may raise significant legal concerns. That is, when district administrators do start making teacher or principal dismissal decisions based on these data, there will likely follow some very interesting litigation over whether this information really is sufficient for upholding due process (depending largely on how it is applied in the process).

I have pointed out that the originators of the SGP approach have stated in numerous technical documents and academic papers that SGPs are intended to be a descriptive tool and are not for making causal assertions (they are not for “attribution of responsibility”) regarding teacher effects on student outcomes. Yet, the authors persist in encouraging states and local districts to do just that. I certainly expect to see them called to the witness stand the first time SGP information is misused to attribute student failure to a teacher.

But the case of the NY-AIR technical report is somewhat more disconcerting. Here, we have a technically proficient author working for a highly respected organization – American Institutes for Research – ignoring all of the statistical red flags (after waving them), and seemingly oblivious to gaping conceptual holes (commonly understood limitations) between the actual statistical analyses presented and the concluding statements made (and the language used throughout).

The conclusions are WRONG – statistically and conceptually. And the author needs to recognize that being so damn bluntly wrong may be consequential for the livelihoods of thousands of individual teachers and principals! Yes, it is indeed another leap for a local school administrator to use their state-approved evaluation framework, coupled with these measures, to actually decide to adversely affect the livelihood and potential career of some wrongly classified teacher or principal – but the author of this report has given them the tool and provided his blessing. And that’s inexcusable.

And a video with song!

==================

Note: In the executive summary, the report acknowledges these biases:

Despite the model conditioning on prior year test scores, schools and teachers with students who had higher prior year test scores, on average, had higher MGPs. Teachers of classes with higher percentages of economically disadvantaged students had lower MGPs.

But the report then blows them off throughout the remainder, never mentioning that this might be important.

Local copy of report: growth-model-11-12-air-technical-report

On the Stability (or not) of Being Irreplaceable

This is just a quick note with a few pictures in response to the TNTP “Irreplaceables” report that came out a few weeks back – a report that is utterly ridiculous at many levels (especially this graph!)… but due to the storm I just didn’t get a chance to address it. But let’s just entertain for the moment the premise that teachers who achieve a value-added rating in the top 20% in a given year are… just plain freakin’ awesome… and that districts should take whatever steps they can to focus on retaining this specific momentary slice of teachers. At the same time, districts might not want to concern themselves with all of those other teachers that range from merely okay… all the way down to those that simply stink!

The TNTP report focuses on teachers who were in the top 14% in Washington DC based on aggregate IMPACT ratings, which do include more than value-added alone but are certainly driven by the value-added metric. TNTP compares DC to other districts, and explains that the top 20% by value-added are assumed to be higher performers.

For the other four districts we studied, we used teacher value-added scores or student academic growth measures to identify high- and low-performing teachers—those whose students made much more or much less academic progress than expected. These data provided us with a common yardstick for teacher performance. Teachers scoring in approximately the top 20 percent were identified as Irreplaceables. While teachers of this caliber earn high ratings in student surveys and have been shown to have a positive impact that extends far beyond test scores, we acknowledge that such measures are limited to certain grades and subjects and should not be the only ones used in real-world teacher evaluations. http://tntp.org/assets/documents/TNTP_DCIrreplaceables_2012.pdf

Let’s take a stab at this with the NYC teacher value-added percentiles, which I played around with in some previous posts.

The following graphs play out the premise of “irreplaceables” with NYC value-added percentile data. I start by identifying those teachers that are in the top 20% in 2005-06 and then see where they land in each subsequent year through 2009-10.
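Mechanically, the exercise is simple. Here is a sketch of it with hypothetical column names (recall from the note at the end of this post that the released NYC files require name/subject/grade matching rather than a clean teacher ID):

```python
# Sketch only: hypothetical file and column names for the NYC percentile data.
import pandas as pd

vam = pd.read_csv("nyc_math_percentiles.csv")
# assumed columns: teacher_id (the constructed match key), year, percentile

wide = vam.pivot(index="teacher_id", columns="year", values="percentile")

irr = wide[wide["2005-06"] >= 80]  # "irreplaceable" = top 20% in the base year
for yr in ["2006-07", "2007-08", "2008-09", "2009-10"]:
    # Teachers missing a rating in a later year count as not persisting here.
    print(f"{yr}: {(irr[yr] >= 80).mean():.0%} of 2005-06 'irreplaceables' still top 20%")
```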

NOTE: IT’S REALLY NOT A GREAT IDEA TO MAKE SCATTERPLOTS OF THE RELATIONSHIP BETWEEN PERCENTILE RANKS – BETTER TO USE THE ACTUAL VAM SCORES. BUT THIS IS ILLUSTRATIVE… THE POINT BEING TO SEE WHETHER ALL OF THOSE DOTS THAT ARE “IRREPLACEABLE” IN YEAR 1 (2005-06) STAY THAT WAY YEAR AFTER YEAR!

I’ve chosen to focus on the MATHEMATICS ratings here… which were actually the more stable ratings from year to year (but were stable potentially because they were biased!)

See: https://schoolfinance101.wordpress.com/2012/02/28/youve-been-vam-ified-thoughts-graphs-on-the-nyc-teacher-data/

Figure 1 – Who is irreplaceable in 2006-07 after being irreplaceable in 2005-06?

Figure 1 shows that there are certainly more “irreplaceables” (awesome teachers) that remain above the median the following year than fall below it… but there sure are one heck of a lot of those irreplaceables that are below the median the next year… and a few that are near the 0%ile! This is not, by any stretch, to condemn those individuals as having been falsely rated irreplaceable when they actually suck. Rather, it is to point out that there is a comparable likelihood that these teachers were wrongly classified each year (as is true of nearly every other teacher in the mix).

Figure 2 – Among those 2005-06 Irreplaceables, how do they reshuffle between 2006-07 & 2007-08?

Hmm… now they’re moving all over the place. A small cluster does appear to stay in the upper right. But we are dealing with a dramatically diminishing pool of the persistently awesome here. And I’m not even pointing out the number of cases in the data set that simply disappear from year to year. Another post – another day.

I provide an analysis along these lines here: https://schoolfinance101.wordpress.com/2012/03/01/about-those-dice-ready-set-roll-on-the-vam-ification-of-tenure/

Figure 3 – How many of those teachers who were totally awesome in 2005-06 were still totally awesome in 2009-10?

The relationship between ratings from year to year is even weaker when one looks at the endpoints of the data set, comparing 2005-06 ratings to 2009-10 ones. Again, we’ve got teachers who were supposedly “irreplaceable” in 2005-06 who are at the bottom of the heap in 2009-10.

Yes, there is still a cluster of teachers who had a top 20% rating in 2005-06 and have one again in 2009-10. BUT… many… uh… most of these had a much lower rating for at least one of the in-between years!

Of the thousands of teachers for whom ratings exist for each year, there are 14 in math and 5 in ELA that stay in the top 20% for each year! Sure hope they don’t leave!
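A quick back-of-envelope check puts those counts in perspective. If the ratings were pure noise – independent draws each year, an assumption that overstates the noise since real ratings are modestly correlated across years – then roughly 0.2^5, or about 0.03%, of teachers would land in the top 20% all five years by luck alone. With several thousand rated teachers, that is a handful right there:

```python
# Back-of-envelope only: assumes year-to-year independence of ratings.
p = 0.2 ** 5  # probability of five straight top-20% finishes under pure chance
for n in (3000, 5000, 10000):
    print(f"N = {n:>6}: expect ~{n * p:.1f} five-year 'irreplaceables' by chance alone")
```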

====

Note: Because the NYC teacher data release did not provide unique identifiers for matching teachers from year to year, for my previous analyses I had constructed a matching identifier based on teacher name, subject and grade level within school. So, my year to year comparisons include only those teachers who are teaching the same subject and grade level in the same school from one year to the next. Arguably, this matching approach might lead to greater stability than might be expected if I included teachers who moved to different schools serving different students and/or changed subject areas or levels.
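For anyone attempting to replicate the matching, the key construction looks something like this – a sketch of the composite key described in the note above, with hypothetical column names:

```python
# Sketch of the composite match key (hypothetical column names).
import pandas as pd

def add_match_key(df: pd.DataFrame) -> pd.DataFrame:
    """Build a composite key from name, school, subject and grade,
    since no stable teacher ID was released with the NYC data."""
    df = df.copy()
    df["match_key"] = (
        df["teacher_name"].str.strip().str.upper()
        + "|" + df["school"].astype(str)
        + "|" + df["subject"].astype(str)
        + "|" + df["grade"].astype(str)
    )
    return df

# Year-to-year comparisons then merge two years' files on match_key, which by
# construction keeps only stayers teaching the same subject and grade in the
# same school, exactly the caveat described in the note above.
```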