Blog

Class Size & Funding Inequity in NY State & NY City

New York State repeatedly lands at the top of our list of the most INEQUITABLE state school finance systems in our National Report Card.

http://www.schoolfundingfairness.org/National_Report_Card_2012.pdf

Further, NY State was singled out in a report I prepared last year with colleague Sean Corcoran of NYU, which identified states that are both generally inequitable and that actually exacerbate those inequities through their state school finance systems.

http://www.americanprogress.org/wp-content/uploads/2012/09/StealthInequities.pdf

As I’ve explained in several previous blog posts, NY State’s primary policy response has been to brush aside the issue – blame school districts for being inefficient – pretend that the state already spends way too much – and ram ill-conceived policies down the throats of local district officials.

Here’s one choice quote from the Governor:

“The problem with education in New York is not money,” Cuomo said. “We have one of the highest spending rates in the nation. Our performance isn’t where our money is.”[1]

Of course, it’s not about money. Couldn’t possibly be.  If Scarsdale has enough, they must all have enough. If New York districts spend, on average, more than New Mexico districts, it has to be more than enough.

But these are meaningless comparisons. Cost pressures in education are primarily local/regional. Education is a labor intensive industry. Salaries must be competitive on the local/regional labor market to recruit and retain quality teachers. And for children to have access to higher education, they must be able to compete with peers in their region. And within any region, children with greater needs and schools serving higher concentrations of children with greater needs require more resources – more resources to recruit and retain even comparable numbers of comparable teachers – and more resources to provide smaller class sizes and more individual attention.

More information on how and why money matters can be found here:

http://www.shankerinstitute.org/images/doesmoneymatter_final.pdf

Here’s one snapshot of spending disparities in downstate/NYC/Long Island (identified by Regional Cost Index Region).  Remember – on average – higher need districts generally require more financial support per pupil to move toward comparable outcomes.  Here, along the horizontal axis, we have district rates of children qualified for free or reduced lunch compared to the average district in their “core based statistical area” (basically, metro area). Along the vertical axis, we have the Approved Operating Expense per Pupil, relative to the average district in the CBSA.  Now, in a “fair” system, the pattern would tilt upward from left to right. Not in NY.
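For concreteness, here is a minimal sketch of how these relative measures can be computed, assuming a pandas DataFrame with hypothetical column names (the actual NYSED source files and variable names differ):

```python
# A minimal sketch of the Figure 1 construction; data and column names
# are hypothetical stand-ins for the actual NYSED/Census sources.
import pandas as pd

districts = pd.DataFrame({
    "district": ["A", "B", "C", "D"],
    "cbsa":     ["NY-Metro"] * 4,
    "pct_frl":  [0.85, 0.12, 0.40, 0.25],     # share free/reduced lunch
    "aoe_pp":   [16500, 24000, 19000, 21000], # Approved Operating Expense per pupil
})

# Express each district's poverty rate and spending relative to the average
# district in its core based statistical area (CBSA).
cbsa_avg = districts.groupby("cbsa")[["pct_frl", "aoe_pp"]].transform("mean")
districts["rel_frl"] = districts["pct_frl"] / cbsa_avg["pct_frl"]
districts["rel_aoe"] = districts["aoe_pp"] / cbsa_avg["aoe_pp"]

# In a "fair" system, rel_aoe would rise with rel_frl. Figure 1 shows it does not.
print(districts[["district", "rel_frl", "rel_aoe"]])
```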

Figure 1


What we see here is that many districts with several times the average low income concentration of their surrounding districts also spend much less per pupil than the average of their surrounding districts.

Now, state policymakers have been patting themselves on the back of late for their supposedly generous increases to state aid for the coming year. The following graph puts those increases into context. As I’ve explained in previous posts, NY State has failed for years to even come close to funding its own general aid formula. The graph below organizes districts into quintiles by student need – from low need districts to high need districts – based on the state’s own pupil need index (driven largely by low income population shares). The blue bars show how much less state aid than promised each group is receiving, on average. The red and green bars show the aid increases for the coming year. They barely make a dent in the current underfunding of the state’s own formula (which sets a very low bar to begin with).
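As a rough sketch of how that grouping works (the input file and column names here are hypothetical, not the state’s actual data layout):

```python
# A minimal sketch of the Figure 2 grouping: districts sorted into need
# quintiles, with the average per-pupil aid shortfall computed for each.
import pandas as pd

df = pd.read_csv("ny_district_aid.csv")  # hypothetical input file

# Quintiles of the state's pupil need index: 1 = lowest need, 5 = highest
df["need_quintile"] = pd.qcut(df["need_index"], 5, labels=[1, 2, 3, 4, 5])

# Per-pupil gap between formula aid owed and aid actually received
df["shortfall_pp"] = (df["formula_aid"] - df["actual_aid"]) / df["enrollment"]

print(df.groupby("need_quintile")["shortfall_pp"].mean())
```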

Figure 2


So then, what are the consequences of this? Michael Rebell and colleagues at Teachers College put out a series of papers last Fall/Winter discussing essential resources, where the essential resource benchmarks are derived from an earlier court order in Campaign for Fiscal Equity v. State.

Although there is no specific maximum class size number beyond which children cannot learn, the Court of Appeals has indicated that classes of about the sizes listed below are appropriate and that larger class sizes may lead to unsatisfactory results. For schools and classes with large concentrations of students below grade level, and for AIS and RTI services, smaller class sizes may be necessary.

Kindergarten-grade 3: 20 students

Grades 4-6: 21-23 students

Middle and High School: 21-23 students (pp. 13-14)

The next two figures use the recently released NYSED 2012 school report cards to characterize the percent of children attending schools with average class sizes above certain thresholds – specifically, those identified in Rebell’s essential resources analysis.  I have put schools into quintiles, statewide, using percent free or reduced price lunch.
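A minimal sketch of that calculation follows, using one of the essential resources thresholds; the input file and column names are hypothetical:

```python
# Enrollment-weighted share of children in schools with average class sizes
# above a threshold, by school poverty quintile (hypothetical column names).
import pandas as pd

schools = pd.read_csv("nysed_2012_report_cards.csv")  # hypothetical input

schools["frl_quintile"] = pd.qcut(
    schools["pct_frl"], 5, labels=["Q1 (low)", "Q2", "Q3", "Q4", "Q5 (high)"]
)

THRESHOLD = 23  # e.g., the grades 4-6 benchmark of 21-23 students

def pct_students_above(group, threshold=THRESHOLD):
    # Weight by enrollment so the result is the share of children, not schools
    over = group.loc[group["avg_class_size"] > threshold, "enrollment"].sum()
    return 100 * over / group["enrollment"].sum()

print(schools.groupby("frl_quintile").apply(pct_students_above))
```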
Figure 3. Elementary Class Sizes
Figure 4. Middle School Class Sizes
Two findings here are notable.
  • First, high poverty schools – which most need smaller class sizes – tend to have higher proportions of children in larger classes.
  • Second, it’s getting worse, not better, as a function of the state’s failure to fund high poverty schools appropriately.
What’s most offensive here is that – while this has occurred during a time of recession – it is presently occurring in a context where the Governor continues to proclaim that schools – all schools – any schools – certainly these schools – have more than enough money to get the job done.  Further, these patterns are occurring in a context where the state continues to squander billions in state funding on lower need districts, as explained in our Stealth Inequities report.
Conditions are particularly dire in New York City. The next several graphs present the frequency distributions of the percent of children in schools by average class size – for elementary and middle schools. Here, I also present the class size distributions of schools in districts in the top wealth/income quintile in the NYC metro area. Red vertical lines indicate the essential resources thresholds.
Figure 5. Elementary Class Sizes in NYC
Figure 6. Elementary Class Sizes in Wealthy Districts in Region
Figure 7. Middle School Class Sizes in NYC

Figure 8. Middle School Class Sizes in Wealthy Districts in Region

What is notable here is that:
  • class sizes in NYC schools continue to increase over time.
  • class sizes in NYC schools are much larger than those in the top wealth/income quintile
  • Further, compared to essential resources thresholds, class sizes in NYC are freakin’ huge! Yeah… that’s a technical term for you… freakin’ huge!!!!!

For those hack pundits who’ve latched on to the “uncertainty” or “narrowness” of research on the effectiveness of class size reduction (& bogus characterizations of “cost effectiveness”), there remains little if any justification for permitting class sizes of 30 or more in high poverty settings. Further, class size and total student load are relevant working conditions influencing teacher recruitment/retention.

In simpler terms, there is certainly little basis for the inequity here. From a simple fairness standpoint, it makes little sense that children in the top 20% of districts by wealth and income should have access to so much smaller classes than children in New York City, or that these disparities should, year after year, be a byproduct of the state’s dysfunctional, inequitable school finance system and the overblown false claims that serve to maintain that status quo!

Civics 101? Center for Ed Reform’s Bizarre Understanding of Civics & the Law

PDF of original CER response to LA court ruling: Louisiana High Court Violates Parent Rights (in case they revise/retract)

Now I know that the last thing reformy types really want to think about – to bother themselves with – is a basic understanding of law, civics and the structure of American government. All  that stuff is just an annoyance – an impediment to reformy awesomeness.

As such, it comes as no surprise to me that Jeanne Allen of the Center for Education Reform, in the wake of today’s Louisiana Supreme Court decision overturning that state’s private school voucher program, has issued perhaps the most over-the-top ignorant response I’ve seen in quite some time. Here are a few choice quotes from the CER press release:

“If indeed the Louisiana constitution, as suggested by the majority court opinion, prohibits parents from directing the course of the funds allocated to educate their child, then the Louisiana constitution needs to be reviewed by the nation’s highest court,” said Center for Education Reform President Jeanne Allen.

Allen added: “I urge Governor Jindal to file an appeal to the US Supreme Court, and ask for the justices’ immediate review of the decision. The Louisiana justices actions today violate the civil rights of parents and children who above all are entitled to an education that our Founders repeated time and time again is the key to a free, productive democracy.”

Allen’s response, in her view, is based on her understanding (or lack thereof) of the 2002 U.S. Supreme Court decision in Zelman v. Simmons-Harris, which involved an establishment clause challenge to the Cleveland, Ohio voucher program.

So, what does that mean? Well, Ohio had adopted a voucher program that permitted children in Cleveland to apply taxpayer-funded vouchers toward attending either a public or private school, including religious schools. Because religious schools dominated the marketplace of alternatives in Cleveland, most voucher recipients applied their vouchers to religious schools. Taxpayers sued over this use of their funds, claiming that their tax dollars were being used – against their conscience – to promote or advance religion. So, here’s a quick summary of what the Court decided in that case:

In a 5-4 opinion delivered by Chief Justice William H. Rehnquist, the Court held that the program does not violate the Establishment Clause. The Court reasoned that, because Ohio’s program is part of Ohio’s general undertaking to provide educational opportunities to children, government aid reaches religious institutions only by way of the deliberate choices of numerous individual recipients and the incidental advancement of a religious mission, or any perceived endorsement, is reasonably attributable to the individual aid recipients not the government. Chief Justice Rehnquist wrote that the “Ohio program is entirely neutral with respect to religion. It provides benefits directly to a wide spectrum of individuals, defined only by financial need and residence in a particular school district. It permits such individuals to exercise genuine choice among options public and private, secular and religious. The program is therefore a program of true private choice.” http://www.oyez.org/cases/2000-2009/2001/2001_00_1751

So then, what are the policy implications? Well, this finding really just means that the establishment clause of the First Amendment of the U.S. Constitution DOES NOT PROHIBIT a voucher scheme like the one adopted in Ohio.

That, by no stretch of reformy logic (otherwise known as the imagination), means that the U.S. Constitution MANDATES that states must permit such voucher schemes.

It may come as a surprise to some reformers that states have their own constitutions. Indeed, states cannot adopt constitutional provisions that attempt to deprive citizens of rights guaranteed under the U.S. Constitution. Civics 101!

Notably, the U.S. Constitution does not include a right to a voucher for private schooling. Rather, the question in Zelman was whether taxpayers’ rights (under the First Amendment) were being violated by the voucher scheme.

State Constitutions may establish additional rights which then apply within their boundaries and some of those rights might lead to a declaration of a voucher program being unconstitutional (in state court). Among other things, many state constitutions prohibit the use of public funds for religious entities, including schools (expanding the protection of the establishment clause). These provisions may be interpreted as prohibiting a voucher program which includes religious schooling.

Many states also have provisions that require that the state provide for a “uniform system of PUBLIC schools,” a phrase that has been interpreted in some states as meaning that the state legislature must, in fact, provide education funding – toward a goal of uniformity – through a “system of public schools,” which would mean that doing so through private schooling might not comply.

Some reformers might say – pshaw… that’s just liberal activist courts reading into their constitutions stuff that’s really not there. How can one possibly interpret the phrase “the state shall provide for a uniform system of public schools” as meaning that the state shall actually provide for a system of schools that is both uniform and public? (note that there exist important delineations in legal/governance terms between “public” schools, voucher receiving private schools, and charter schools).

It is indeed permissible for state constitutions to include such requirements. And quite honestly, it is ridiculous to assume that Zelman, which merely declares that vouchers don’t violate the establishment clause of the U.S. Constitution, somehow negates these provisions of state constitutions.

How ridiculous is this logic? Let’s apply it to another well-known U.S. Supreme Court case on education. In 1973, the U.S. Supreme Court found (as a step toward its final ruling related to funding inequities) in San Antonio ISD v. Rodriguez that education is not a fundamental right under the U.S. Constitution. It’s just not there. There’s no education article/amendment in the U.S. Constitution.

But state constitutions make reference to the state’s obligation toward providing schools/education, etc. These are education articles/clauses. And in many states, high courts have determined that these education articles in fact provide a fundamental right – under the state constitution – to some form of education (uniform system, thorough and efficient system, sound basic education, etc.). But ooohhhh no… not by Jeanne Allen’s logic. How dare states grant a right to education, or obligate their legislatures to provide a system of public schools?

Applying Jeanne Allen’s logic, if the U.S. Constitution does not grant this right, then states should not be permitted to do so either (which doesn’t mean states can’t provide schools; they just can’t guarantee a right to them? which I guess might suit the CER agenda…). In other words, states should absolutely not be permitted to guarantee a fundamental right to an education, because the Rodriguez court said that the U.S. Constitution included no such right.

No rights, no problem. After all, it’s the reformy way.

Follow up Question Guide for Ed Writers (on Teacher Evaluation)

I was reviewing the past few days of news coverage on NJ teacher evaluations and came across the following quote, which was not-so-amazingly left unchallenged:

Cerf said research shows test scores are “far and away” the best gauge of teacher effectiveness, and to not use test score data would be “very anti-child.”

http://www.nj.com/news/index.ssf/2013/05/state_board_of_education_adjus.html

Here’s a reporter’s guide to follow-up questions….

Mr. Cerf… can you show me exactly what research comes to that conclusion? (this should always be the immediate follow up to the ambiguous “research shows” comment)

Exactly how is “far and away” measured in that research?

And what is meant by “best gauge of effectiveness?”

That is, what is the valid measure of effectiveness against which test scores are gauged? (answer… uh… test scores themselves)

So, Mr. Cerf, are you trying to tell me that the Gates MET study proved that test scores are “far and away” the best gauge of teacher effectiveness? (seems to be most common reference point of late)

Can you show me where they said that?

And how did they measure what was the best predictor of effectiveness? In other words… what did they use as the true measure of effectiveness?

So… you’re telling me that the Gates study found – not far and away, mind you – that test score based measures are, well… the best predictor of themselves a year later? Is that right?

That’s what you mean by best gauge? Right? That they are the best predictor of themselves… if we also use other measures to try to predict test scores? Right? Seems a bit circular doesn’t it?

And how well did test-score based measures predict themselves a year later, if we accept that validity test as logical?

Well, that seems like a rather modest relationship from year to year, doesn’t it?

How does that make test scores the best predictor of actual effectiveness if actual effectiveness is broader than test scores themselves?

Okay… moving on… Since we’re leaning on those Gates foundation findings as providing the basis for placing heavy weight on test scores in NJ teacher evaluation… I note here (pointing to NJDOE documents on SGPs) that New Jersey has chosen an approach called Growth Percentiles to measure teacher effectiveness…. Can you show me in the Gates studies where the authors find this approach to be appropriate – or even more specifically “far and away” the best approach for measuring teacher effectiveness?

I don’t see any reference to SGPs or MGPs in the Gates studies… why is that?

Are SGPs and VAMs the same thing? I’ve been told there are substantive differences.

Isn’t one of these, SGPs, not even designed to isolate the effect the teacher has on test score gains?

In which case, how can they possibly be “the best gauge of teacher effectiveness?”

Let’s save the “very anti-child” stuff for another day!

Deconstructing Disinformation on Student Growth Percentiles & Teacher Evaluation in New Jersey

CROSS-POSTED FROM: http://njedpolicy.wordpress.com/

Deconstructing Disinformation on Student Growth Percentiles & Teacher Evaluation in New Jersey (Printable Policy Brief): SGP_Disinformation_BakerOluwole

Bruce D. Baker, Rutgers University, Graduate School of Education

Joseph Oluwole, Montclair State University


Introduction

This brief addresses problems with, and disinformation about, New Jersey’s Student Growth Percentile (SGP) measures, which New Jersey Department of Education officials propose to use for evaluating teachers and principals and for rating local public schools. Specifically, the New Jersey Department of Education has proposed that the student growth percentile measures be used as a major component for determining teacher effectiveness:

“If, according to N.J.A.C. 6A:10-4.2(b), a teacher receives a median student growth percentile, the student achievement component shall be at least 35 percent and no more than 50 percent of a teacher’s evaluation rubric rating.” [1]

Yet those ratings of teacher effectiveness may have consequences for employment. Specifically, under proposed regulations, school principals are obligated to notify teachers:

“…in danger of receiving two consecutive years of ineffective or partially effective ratings, which may trigger tenure charges to be brought pursuant to TEACHNJ and N.J.A.C. 6A:3.”[2]

In addition, proposed regulations require that school principals and assistant principals be evaluated based on school aggregate growth percentile data:

“If, according to N.J.A.C. 6A:10-5.2(b), the principal, vice-principal, or assistant principal receives a median student growth percentile measure as described in N.J.A.C. 6A:10-5.2(c) below, the measure shall be at least 20 percent and no greater than 40 percent of evaluation rubric rating as determined by the Department.” [3]

The regulations thus imply that the median student’s achievement growth in any school may be causally attributed to the principal and/or vice principal.

But, as we explain in this brief, student growth percentile data are not up to this task. Specifically, we explain that:

  • Student Growth Percentiles are not designed for inferring teacher influence on student outcomes.
  • Student Growth Percentiles do not control for various factors outside of the teacher’s control.
  • Student Growth Percentiles are not backed by research on estimating teacher effectiveness. By contrast, research on SGPs has shown them to be poor at isolating teacher influence.
  • New Jersey’s Student Growth Percentile measures, at the school level, are significantly statistically biased with respect to student population characteristics and average performance level.

Understanding Student Growth Measures

Two broad categories of methods and models have emerged in state policy regarding development and application of measures of student achievement growth to be used in newly adopted teacher evaluation systems. The first general category of methods is known as value-added models (VAMs) and the second as student growth percentiles (SGPs or MGPs, for “median growth percentile”). Several large urban school districts including New York City and Washington, DC have adopted value-added models and numerous states have adopted student growth percentiles for use in accountability systems. Among researchers it is well understood that these are substantively different measures by design, one being a possible component of the other. But these measures and their potential uses have been conflated by policymakers wishing to expedite implementation of new teacher evaluation policies and pilot programs.[4]

Arguably, one reason for the increasing popularity of the SGP approach across states is the extent of highly publicized scrutiny of, and the large and growing body of empirical research on, problems with using VAMs for determining teacher effectiveness.[5] Yet there has been far less research on using student growth percentiles for determining teacher effectiveness. The reason for this vacuum is not that student growth percentiles are simply immune to the problems of value-added models, but that researchers have until recently chosen not to evaluate their validity for this purpose – estimating teacher effectiveness – because they are not designed to infer teacher effectiveness.

Two recent working papers compare SGP and VAM estimates for teacher and school evaluation and both raise concerns about the face validity and statistical properties of SGPs. Goldhaber and Walch (2012) conclude: “For the purpose of starting conversations about student achievement, SGPs might be a useful tool, but one might wish to use a different methodology for rewarding teacher performance or making high-stakes teacher selection decisions” (p. 30).[6] Ehlert and colleagues (2012) note: “Although SGPs are currently employed for this purpose by several states, we argue that they (a) cannot be used for causal inference (nor were they designed to be used as such) and (b) are the least successful of the three models [Student Growth Percentiles, One-Step VAM & Two-Step VAM] in leveling the playing field across schools” (p. 23).[7]

A value-added estimate uses assessment data in the context of a statistical model (regression analysis), where the objective is to estimate the extent to which a student having a specific teacher or attending a specific school influences that student’s difference in score from the beginning of the year to the end of the year – or period of treatment (in school or with teacher). The most thorough VAMs, more often used in research than practice, attempt to account for: (a) the student’s prior multi-year gain trajectory, by using several prior year test scores (to isolate the extent to which having a certain teacher alters that trajectory), (b) the classroom level mix of student peers, (c) individual student background characteristics, and (d) possibly school level characteristics. The goal is to identify as accurately as possible the share of the student’s or group of students’ value-added that should be attributed to the teacher as opposed to other factors outside of the teacher’s control. Corrections such as using multiple years of prior student scores dramatically reduce the number of teachers who may be assigned ratings. For example, when Briggs and Domingue (2011) apply alternative models to the LA Times (Los Angeles Unified School District) data using additional prior scores, the number of teachers rated drops from about 8,000 to only 3,300, because estimates can only be determined for teachers in grade 5 and above.[8] As such, these important corrections are rarely included in models used for actual teacher evaluation.
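To make the general approach concrete, here is a stripped-down illustration of a value-added regression – a sketch of the method described above, not any state’s or study’s actual model, and all column names are hypothetical:

```python
# A bare-bones VAM sketch: regress the current score on prior scores,
# student characteristics, peer composition, and teacher indicators.
# The teacher fixed-effect coefficients serve as "value-added" estimates.
import pandas as pd
import statsmodels.formula.api as smf

students = pd.read_csv("student_scores.csv")  # hypothetical input

model = smf.ols(
    "score_2013 ~ score_2012 + score_2011"  # prior achievement trajectory
    " + frl + iep + ell"                    # student background characteristics
    " + class_mean_prior"                   # classroom peer composition
    " + C(teacher_id)",                     # teacher fixed effects
    data=students,
).fit()

vam_estimates = model.params.filter(like="C(teacher_id)")
print(vam_estimates.sort_values().head())
```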

By contrast, a student growth percentile is a descriptive measure of the relative change of a student’s performance compared to that of all students. That is, the individual scores obtained on the underlying tests are used to construct an index of student growth, where the median student, for example, may serve as a baseline for comparison. Some students have achievement growth on the underlying tests that is greater than the median student’s, while others have growth from one test to the next that is less. That is, the approach estimates not how much the underlying scores changed, but how much the student moved within the mix of other students taking the same assessments. It uses a method called quantile regression to estimate how rare it is that a child falls at her current position in the distribution, given her past position in the distribution (Briggs & Betebenner, 2009).[9] Student growth percentile measures may be used to characterize each individual student’s growth, or may be aggregated to the classroom or school level, and/or across children who started at similar points in the distribution, to characterize the collective growth of groups of students.
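The following is a rough sketch of that idea, using a single prior score and simple linear quantile regressions; operational SGP implementations (e.g., the R “SGP” package) fit many conditional quantiles with B-spline quantile regression over multiple priors, so this is illustrative only, and the column names are hypothetical:

```python
# Sketch of a student growth percentile: fit conditional quantiles of the
# current score given the prior score, then record the highest percentile
# whose fitted value each student's actual score exceeds.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

students = pd.read_csv("student_scores.csv")  # hypothetical input

fitted = {}
for q in np.arange(1, 100):
    res = smf.quantreg("score_2013 ~ score_2012", students).fit(q=q / 100)
    fitted[q] = res.predict(students)

pred = pd.DataFrame(fitted)  # rows = students, columns = percentiles 1..99
students["sgp"] = pred.lt(students["score_2013"], axis=0).sum(axis=1)
```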

Many, if not most value-added models also involve normative rescaling of student achievement data, measuring in relative terms how much individual students or groups of students have moved within the large mix of students. The key difference is that the value-added models include other factors in an attempt to identify the extent to which having a specific teacher contributed to that growth, whereas student growth percentiles are simply a descriptive measure of the growth itself.

SGPs can be hybridized with VAMs by conditioning the descriptive student growth measure on student demographic characteristics. New York State has adopted such a model. However, the state’s own technical report found: “Despite the model conditioning on prior year test scores, schools and teachers with students who had higher prior year test scores, on average, had higher MGPs. Teachers of classes with higher percentages of economically disadvantaged students had lower MGPs” (p. 1).[10]

Value-added models, while intended to estimate teacher effects on student achievement growth, largely fail to do so in any accurate or precise way, whereas student growth percentiles make no such attempt.[11] Specifically, value-added measures tend to be highly unstable from year to year and have very wide error ranges when applied to individual teachers, making confident distinctions between “good” and “bad” teachers difficult if not impossible.[12] Furthermore, while value-added models attempt to isolate the portion of student achievement growth that is caused by having a specific teacher, they often fail to do so, and it is difficult if not impossible to discern a) how much the estimates have failed and b) in which direction for which teachers. That is, the individual teacher estimates may be biased by factors not fully addressed in the models, and researchers have no clear way of knowing how much. We also know that when different tests are used for the same content, teachers receive widely varying ratings, raising additional questions about the validity of the measures.[13]

While we have substantially less information from existing research on student growth percentiles, it stands to reason that since they are based on the same types of testing data, they will be similarly susceptible to error and noise. But more troubling, since student growth percentiles make no attempt (by design) to consider other factors that contribute to student achievement growth, the measures have significant potential for omitted variables bias. SGPs leave the interpreter of the data to naively infer (by omission) that all growth among students in the classroom of a given teacher must be associated with that teacher. Research on VAMs indicates that even subtle changes to explanatory variables in value-added models substantively change the ratings of individual teachers.[14] Omitting key variables can lead to bias, and including them can reduce that bias. Excluding all potential explanatory variables, as SGPs do, takes this problem to the extreme by simply ignoring the possibility of omitted variables bias. As a result, it may turn out that SGP measures at the teacher level appear more stable from year to year than value-added estimates, but that stability may be entirely a function of teachers serving similar populations of students from year to year. The measures may contain stable omitted variables bias, and thus may be stable in their invalidity. Put bluntly, SGPs may be more consistent by being more consistently wrong.
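A toy simulation makes the point, under loudly artificial assumptions: every teacher’s true effect is set to zero, but each classroom’s poverty level persists from year to year and depresses measured growth.

```python
# "Stable in their invalidity": with zero true teacher effects, a naive
# classroom growth measure still correlates with itself across years,
# driven entirely by persistent classroom poverty (simulated data).
import numpy as np

rng = np.random.default_rng(0)
n_teachers, n_students = 500, 25

# Persistent classroom poverty share (same communities each year)
poverty = rng.uniform(0, 1, n_teachers)

def classroom_growth(poverty, rng):
    # True teacher effect = 0; growth = poverty drag + averaged student noise
    noise = rng.normal(0, 1, (len(poverty), n_students)).mean(axis=1)
    return -1.0 * poverty + noise

year1 = classroom_growth(poverty, rng)
year2 = classroom_growth(poverty, rng)

# Sizeable year-to-year correlation, none of it attributable to teachers
print(np.corrcoef(year1, year2)[0, 1])
```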

In defense of Student Growth Percentiles as accountability measures, Betebenner, Wenning and Briggs (2011) explain that one school of thought is that value-added estimates are also most reasonably interpreted as descriptive measures, and should not be used to infer teacher or school effectiveness: “The development of the Student Growth Percentile methodology was guided by Rubin et al’s (2004) admonition that VAM quantities are, at best, descriptive measures”.[15] Rubin, Stuart, and Zanutto (2004) explain:

Value-added assessment is a complex issue, and we appreciate the efforts of Ballou et al. (2004), McCaffrey et al. (2004) and Tekwe et al. (2004). However, we do not think that their analyses are estimating causal quantities, except under extreme and unrealistic assumptions. We argue that models such as these should not be seen as estimating causal effects of teachers or schools, but rather as providing descriptive measures (Rubin et al., 2004, p. 18).[16]

Arguably, these explanations do less to validate the usefulness of Student Growth Percentiles as accountability measures (inferring attribution and/or responsibility to schools and teachers) and far more to invalidate the usefulness of both Student Growth Percentiles and Value-Added Models for these purposes.

Do Growth Percentiles Fully Account for Student Background?

New Jersey has recently released its new regulations for implementing teacher evaluation policies, with heavy reliance on student growth percentile scores, aggregated to the teacher level as median growth percentiles (using the growth percentile of the median student in any class as representing the teacher effect). When recently challenged about whether those growth percentile scores will accurately represent teacher effectiveness, specifically for teachers serving kids from different backgrounds, NJ Commissioner Christopher Cerf explained:

“You are looking at the progress students make and that fully takes into account socio-economic status,” Cerf said. “By focusing on the starting point, it equalizes for things like special education and poverty and so on.”[17] (emphasis added)

There are two issues with this statement. First, comparisons of individual students don’t actually explain what happens when a group of students is aggregated to their teacher and the teacher is assigned the median student’s growth score to represent his/her effectiveness – because teachers don’t all have an evenly distributed mix of kids who started at similar points (relative to other teachers). So, in one sense, this statement doesn’t even address the issue.
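For illustration, the aggregation step itself is trivial – the teacher simply inherits the median student’s percentile (the data below are hypothetical):

```python
# Median growth percentile (MGP): each teacher is assigned the growth
# percentile of his/her median student, whatever the mix of students.
import pandas as pd

students = pd.DataFrame({
    "teacher_id": ["T1", "T1", "T1", "T2", "T2", "T2"],
    "sgp":        [70, 45, 20, 55, 50, 52],
})

mgp = students.groupby("teacher_id")["sgp"].median()
print(mgp)  # T1: 45, T2: 52 -- similar numbers can mask very different mixes
```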

Second, this statement is simply factually incorrect, even regarding the individual student. The statement is not supported by research on estimating teacher effects, which largely finds that sufficiently precise student, classroom, and school level factors do relate to variations not only in initial performance level but also in performance gains. Those cases where covariates have been found to have only small effects are likely those in which effects are drowned out by particularly noisy outcome measures, problems resulting from underlying test scaling (or re-scaling), or poorly measured student characteristics. Re-analysis of teacher ratings from the Los Angeles Times analysis, using richer data and more complex value-added models, yielded substantive changes to teacher ratings.[18] The Los Angeles Times model already included far more attempts to capture student characteristics than New Jersey’s Growth Percentile Model – which includes none.

At a practical level, it is relatively easy to understand how and why student background characteristics affect not only students’ initial performance level but also their achievement growth. Consider that one year’s assessment is given in April. The school year ends in late June. The next year’s test is given the next April. First, there are approximately two months of instruction given by the prior year’s teacher that are attributed to the current year’s teacher. Beyond that, there is a multitude of things that go on outside of the few hours a day when the teacher has contact with a child that influence any given child’s “gains” over the year, and those things that go on outside of school vary widely by children’s economic status. Further, children with certain life experiences on a continued daily, weekly and monthly basis are more likely to be clustered with each other in schools and classrooms.

With annual test scores, differences in summer experiences, which vary by student economic background, matter. Lower income students experience much lower achievement gains than their higher income peers over the summer.[19] Even the recent Gates Foundation Measures of Effective Teaching Project, which used fall and spring assessments, found that “students improve their reading comprehension scores as much (or more) between April and October as between October and April in the following grade” (p. 8).[20] That is, gains and/or losses may be as great during the time period when children have no direct contact with their teachers or schools. Thus, it is rather absurd to assume that teachers can and should be evaluated based on these data.

Even during the school year, differences in home settings and access to home resources matter, and differences in access to outside-of-school tutoring and other family subsidized supports may matter and depend on family resources.[21] Variations in kids’ daily lives more generally matter (neighborhood violence, etc.), and many of those variations exist as a function of socio-economic status. Variations in the peer group with whom children attend school also matter,[22] and vary by neighborhood structure and conditions, and by the socioeconomic status of not just the individual child but of the group of children.

In short, it is inaccurate to suggest that using the same starting point “fully takes into account socio-economic status.” It’s certainly false to make such a statement about aggregated group comparisons – especially while never actually conducting or producing publicly any analysis to back such a claim.

Did the Gates Foundation Measures of Effective Teaching Study Validate Use of Growth Percentiles?

Another claim used in defense of New Jersey’s growth percentile measures is that a series of studies conducted with funding from the Bill and Melinda Gates Foundation provide validation that these measures are indeed useful for evaluating teachers. In a recent article by New Jersey journalist John Mooney in his online publication NJ Spotlight, state officials were asked to respond to some of the above challenges regarding growth percentile measures. Along with perpetuating the claim that the growth percentile model takes fully into account student background, state officials also issued the following response:

“The Christie administration cites its own research to back up its plans, the most favored being the recent Measures of Effective Teaching (MET) project funded by the Gates Foundation, which tracked 3,000 teachers over three years and found that student achievement measures in general are a critical component in determining a teacher’s effectiveness.”[23]

The Gates Foundation MET project did not study the use of Student Growth Percentile Models. Rather, the Gates Foundation MET project studied the use of value-added models, applying those models under the direction of leading researchers in the field, testing their effects on fall to spring gains, and on alternative forms of assessments. Even with these more thoroughly vetted value-added models, the Gates MET project uncovered, though largely ignored, numerous serious concerns regarding the use of value-added metrics. External reviewers of the Gates MET project reports pointed out that while the MET researchers maintained their support for the method, the actual findings of their report cast serious doubt on its usefulness.[24]

The Gates Foundation MET project results provide no basis for arguing that student growth percentile measures should have a substantial place in teacher evaluation.  The Gates MET project never addressed student growth percentiles. Rather, it attempted a more thorough, more appropriate method, but provided results which cast serious doubt on the usefulness of even that method.  Those who have compared the relative usefulness of growth percentiles and value-added metrics have found growth percentiles sorely lacking as a method for sorting out teacher influence on student gains.[25]

What do We Know about New Jersey’s Growth Percentile Measures?

Unfortunately, the New Jersey Department of Education has a) not released any detailed teacher-level growth percentile data for external evaluation or review, b) unlike other states pursuing value-added and/or growth metrics, chosen not to convene a technical review panel, and c) unlike other states pursuing these methods, chosen not to produce any detailed technical documentation or analysis of their growth percentile data. Yet they have chosen to issue regulations regarding how these data must be used directly in consequential employment decisions. This is unacceptable.

The state has released, as part of its school report cards, school aggregate median growth percentile data, which shed some light on the possible extent of the problems with its current measures. A relatively straightforward statistical check on the distributional characteristics of these measures is to evaluate the extent to which they relate to measures of student population characteristics. That is, to what extent do we see that higher poverty schools have lower growth percentiles, or that schools with higher average performing peer groups have higher average growth percentiles? Evidence of correlation with either might be indicative of statistical bias – specifically, omitted variables bias.
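A sketch of that check (file and column names hypothetical):

```python
# Correlate school-level median growth percentiles with demographic shares;
# statistically significant correlations are consistent with omitted
# variables bias in the growth measures.
import pandas as pd
from scipy import stats

schools = pd.read_csv("nj_report_card_merged.csv")  # hypothetical input

for outcome in ["math_mgp", "ela_mgp"]:
    for predictor in ["pct_free_lunch", "pct_black_hispanic"]:
        r, p = stats.pearsonr(schools[outcome], schools[predictor])
        print(f"{outcome} vs {predictor}: r = {r:.4f}, p = {p:.4f}")
```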

Table 1 below uses school level data from the recently released New Jersey School Report Cards databases, combined with student demographic data from the New Jersey Fall Enrollment Reports. For both ELA and Math growth percentile measures, there exist modest, negative, statistically significant correlations with school level % free lunch and with school level % black or Hispanic.

Higher shares of low income children and higher shares of minority children are each associated with lower average growth percentiles. This finding confirms that the growth percentile measures – which on their face fail to take into account student background characteristics – also fail statistically to remove the bias associated with those characteristics.

To simplify, there exist three types of variation in the growth percentile measures: 1) variation that may in fact be associated with a given teacher or school, 2) variation that may be associated with factors other than the school or teacher (omitted variables bias) and 3) statistical/measurement noise.

The difficulty here is our inability to determine which type of variation is “true effect”, which is the “effect of some other factor” and which is “random noise.” Here, we can see that a sizeable share of the variance in growth is associated with school demographics (“Some other factor”). One might assert that this pattern occurs simply because good teachers sort into schools with fewer low income and minority children, who are then left with bad teachers unable to produce gains. Such an assertion cannot be supported with these data, given the equal (if not greater) likelihood that these patterns occur as a function of omitted variables bias – where all possible variables have actually been omitted (proxied only with a single prior score).

Pursuing a policy of dismissing or detenuring teachers in high poverty schools at higher rates because of their lower growth percentiles would be misguided. Doing so would create more instability and disruption in settings already disadvantaged, and may significantly reduce the likelihood that these schools could then recruit “better” teachers as replacements.

Table 1. Correlations among school level demographic and median growth percentile (MGP) measures

                        % Free Lunch    % Black or Hispanic
% Black or Hispanic        0.9098*
Math MGP                  -0.3703*          -0.3702*
ELA MGP                   -0.4828*          -0.4573*

*p<.05

Figure 1 shows the clear pattern of relationship between school average proficiency rates (for 7th graders) and growth percentiles. Here we see that schools with higher average performance also have significantly higher growth percentiles. This may occur for a variety of reasons. First, it may just be that along higher regions of the underlying test scale, higher gains are more easily attainable. Second, it may be that peer group average initial performance plays a significant role in influencing gains. Third, it may be that, to some extent, higher performing schools do have some higher value-added teachers.

The available data do not permit us to fully distinguish which of these three factors most drives this pattern, and the first two have little or nothing to do with teacher or teaching quality. This uncertainty raises issues of fairness and reliability, particularly in an evaluation system that has implications for teacher tenure and other employment decisions.

Figure 1


Figure 2 elaborates on the negative relationship between student low income status and school level growth percentiles, showing that among very high poverty schools, growth percentiles tend to be particularly low.

Figure 2


Figure 3 shows that the share of special education children scoring below proficient (partially proficient) also seems to be a drag on school level growth percentiles. Schools with larger shares of partially proficient special education students tend to have lower median growth percentiles.

Figure 3


An important note here is that these are school level aggregations, and much of the intent of state policy is to apply these growth percentiles to evaluating teachers. School growth percentiles are merely aggregations across the handful of teachers for whom ratings exist in any school. Bias that appears at the school level is not created by the aggregation. It may be clarified by the aggregation. But if the school level data are biased, then so too are the underlying teacher level data.

What Incentives and Consequences Result from these Measures?

The consequences of adopting these measures for high stakes use in policy and practice are significant.

Rating Schools for Intervention

Growth measures are generally assumed to be better indicators of school performance and less influenced by student background than status measures. Status measures include proficiency rates commonly adopted for compliance with the Federal No Child Left Behind Act.  Using status measures disparately penalizes high poverty, high minority concentration schools, increasing the likelihood that these schools face sanctions including disruptive interventions such as closure or reconstitution. While less biased than status measures, New Jersey’s Student Growth Percentile measures appear to retain substantial bias with respect to student population characteristics and with respect to average performance levels, calling into question their usefulness for characterizing school (and by extension school leader) effectiveness. Further, the measures simply aren’t designed for making such assertions.

Further, if these measures are employed to impose disruptive interventions on high poverty, minority concentration schools, this use will exacerbate the existing disincentive for teachers or principals to seek employment in these schools. If the growth percentiles systematically disadvantage schools with more low income children and non-proficient special education children, relying on these measures will also reinforce current incentives for high performing charter schools to avoid low income children and children with disabilities.

Employment Decisions

First and foremost, SGPs are not designed for inferring the teacher’s effect on student test score change and as such they should not be used that way. It simply does not comport with fairness to continue to use SGPs for an end for which they were not designed. Second, New Jersey’s SGPs retain substantial bias at the school level, indicating that they are likely a very poor indicator of teacher influence. The SGP creates a risk that a teacher will be erroneously deprived of a property right in tenure, consequently creating due process problems. In essence, these two concerns about SGP raise serious issues of validity and reliability. Continued reliance on an invalid/unreliable model borders on arbitrariness.

These biases create substantial disincentives for teachers and/or principals to seek employment in settings with a) low average performing students, b) low income students, and c) high shares of non-proficient special education students. Creating such a disincentive is more likely to exacerbate disparities in teacher quality across settings than to reduce them.

Teacher Preparation Institutions

While to date the New Jersey Department of Education has made no specific movement toward rating teacher preparation institutions using the aggregate growth percentiles of recent graduates in the field, such a movement seems likely. The new Council for the Accreditation of Educator Preparation (CAEP) standards require that teacher preparation institutions employ their state metrics for evaluative purposes:

4.1. The provider documents, using value-added measures where available, other state-supported P-12 impact measures, and any other measures constructed by the provider, that program completers contribute to an expected level of P-12 student growth.[26]

The patterns of bias in SGPs being relatively clear, it would be disadvantageous for colleges of education to place their graduates in high poverty, low average performing schools, or schools with higher percentages of non-proficient special education students.

The Path Forward

Given what we know about the original purpose and design of student growth percentiles and what we have learned specifically about the characteristics of New Jersey’s Growth Percentile measures, we propose the following:

(i) An immediate moratorium on attaching any consequences – job action, tenure action, or compensation – to these measures. Given the available information, failing to do so would be reckless and irresponsible and, further, is likely to lead to exorbitant legal expenses incurred by local public school districts obligated to defend the indefensible.

(ii) A general rethinking – back to square one – on how to estimate school and teacher effect, with particular emphasis on better models from the field. It may or may not, in the end, be a worthwhile endeavor to attempt to estimate teacher and principal effects using student assessment data. At the very least, the statistical strategy for doing so, along with the assessments underlying these estimates, require serious rethinking.

(iii) A general rethinking/overhaul of how data may be used to inform thoughtful administrative decision making, rather than dictate decisions.  Data including statistical estimates of school, program, intervention or teacher effects can be useful for guiding decision making in schools. But rigid decision frameworks, mandates and specific cut scores violate the most basic attributes of statistical measures. They apply certainty to that which is uncertain. At best, statistical estimates of effects on student outcomes may be used as preliminary information – a noisy pre-screening tool – for guiding subsequent, more in-depth exploration and evaluation.

Perhaps most importantly, NJDOE must reposition itself as an entity providing thoughtful, rigorous technical support for assisting local public school districts in making informed decisions regarding programs and services. Mandating decision frameworks absent sound research support is unfair and sends the wrong message to educators who are in the daily trenches. At best, the state’s failure to understand the disconnect between existing research and current practices suggests a need for critical technical capacity. At worst, endorsing policy positions through a campaign of disinformation raises serious concerns.


[4] Goldhaber, D., & Walch, J. (2012). Does the model matter? Exploring the relationship between different student achievement-based teacher assessments. University of Washington Bothell, Center for Education Data & Research. CEDR Working Paper 2012-6. Ehlert, M., Koedel, C., Parsons, E., & Podgursky, M. (2012). Selecting growth measures for school and teacher evaluations. National Center for Analysis of Longitudinal Data in Education Research (CALDER). Working Paper #80.

[5] Baker, E.L., Barton, P.E., Darling-Hammond, L., Haertel, E., Ladd, H.F., Linn, R.L., Ravitch, D., Rothstein, R., Shavelson, R.J., & Shepard, L.A. (2010). Problems with the use of student test scores to evaluate teachers. Washington, DC: Economic Policy Institute.  Retrieved June 4, 2012, from http://epi.3cdn.net/724cd9a1eb91c40ff0_hwm6iij90.pdf. Corcoran, S.P. (2010). Can teachers be evaluated by their students’ test scores? Should they be? The use of value added measures of teacher effectiveness in policy and practice. Annenberg Institute for School Reform. Retrieved June 4, 2012, from http://annenberginstitute.org/pdf/valueaddedreport.pdf.

[6] Goldhaber, D., & Walch, J. (2012). Does the model matter? Exploring the relationship between different student achievement-based teacher assessments. University of Washington Bothell, Center for Education Data & Research. CEDR Working Paper 2012-6.

[7] Ehlert, M., Koedel, C., Parsons, E., & Podgursky, M. (2012). Selecting growth measures for school and teacher evaluations. National Center for Analysis of Longitudinal Data in Education Research (CALDER). Working Paper #80.

[8] See Briggs & Domingue’s (2011) re-analysis of LA Times estimates pages 10 to 12. Briggs, D. & Domingue, B. (2011). Due diligence and the evaluation of teachers: A review of the value-added analysis underlying the effectiveness rankings of Los Angeles Unified School District Teachers by the Los Angeles Times. Boulder, CO: National Education Policy Center. Retrieved June 4, 2012 from http://nepc.colorado.edu/publication/due-diligence.

[9] Briggs, D. & Betebenner, D., (2009, April). Is student achievement scale dependent? Paper presented at the invited symposium Measuring and Evaluating Changes in Student Achievement: A Conversation about Technical and Conceptual Issues at the Annual Meeting of the National Council for Measurement in Education, San Diego, CA. Retrieved June 4, 2012, from http://dirwww.colorado.edu/education/faculty/derekbriggs/Docs/Briggs_Weeks_Is%20Growth%20in%20Student%20Achievement%20Scale%20Dependent.pdf.

[10] American Institutes for Research. (2012). 2011-12 growth model for educator evaluation technical report: Final. November, 2012. New York State Education Department.

[11] Briggs and Betebenner (2009) explain: “However, there is an important philosophical difference between the two modeling approaches in that Betebenner (2008) has focused upon the use of SGPs as a descriptive tool to characterize growth at the student-level, while the LM (layered model) is typically the engine behind the teacher or school effects that get produced for inferential purposes in the EVAAS” (p. 30).

[12] McCaffrey, D.F., Sass, T.R., Lockwood, J.R., & Mihaly, K. (2009). The intertemporal variability of teacher effect estimates. Education Finance and Policy, 4,(4) 572-606. Sass, T.R. (2008). The stability of value-added measures of teacher quality and implications for teacher compensation policy. Retrieved June 4, 2012, from http://www.urban.org/UploadedPDF/1001266_stabilityofvalue.pdf. Schochet, P.Z. & Chiang, H.S. (2010). Error rates in measuring teacher and school performance based on student test score gains. Institute for Education Sciences, U.S. Department of Education. Retrieved May 14, 2012, from http://ies.ed.gov/ncee/pubs/20104004/pdf/20104004.pdf.

[13] Corcoran, S.P., Jennings, J.L., & Beveridge, A.A. (2010). Teacher effectiveness on high- and low-stakes tests. Paper presented at the Institute for Research on Poverty Summer Workshop, Madison, WI.

Gates Foundation (2010). Learning about teaching: Initial findings from the measures of effective teaching project. MET Project Research Paper. Seattle, Washington: Bill & Melinda Gates Foundation. Retrieved December 16, 2010, from http://www.metproject.org/downloads/Preliminary_Findings-Research_Paper.pdf.

[14] Ballou, D., Mokher, C.G., & Cavaluzzo, L. (2012, March). Using value-added assessment for personnel decisions: How omitted variables and model specification influence teachers’ outcomes. Paper presented at the Annual Meeting of the Association for Education Finance and Policy. Boston, MA.  Retrieved June 4, 2012, from http://aefpweb.org/sites/default/files/webform/AEFP-Using%20VAM%20for%20personnel%20decisions_02-29-12.docx.

Briggs, D. & Domingue, B. (2011). Due diligence and the evaluation of teachers: A review of the value-added analysis underlying the effectiveness rankings of Los Angeles Unified School District Teachers by the Los Angeles Times. Boulder, CO: National Education Policy Center. Retrieved June 4, 2012 from http://nepc.colorado.edu/publication/due-diligence.

[15] Betebenner, D., Wenning, R.J., & Briggs, D.C. (2011). Student growth percentiles and shoe leather. Retrieved June 5, 2012, from http://www.ednewscolorado.org/2011/09/13/24400-student-growth-percentiles-and-shoe-leather.

[16] Rubin, D. B., Stuart, E. A., & Zanutto, E. L. (2004). A potential outcomes view of value-added assessment in education. Journal of Educational and Behavioral Statistics, 29(1), 103–16.

[17] http://www.wnyc.org/articles/new-jersey-news/2013/mar/18/everything-you-need-know-about-students-baked-their-test-scores-new-jersy-education-officials-say/

[18] Briggs, D. & Domingue, B. (2011). Due Diligence and the Evaluation of Teachers: A review of the value-added analysis underlying the effectiveness rankings of Los Angeles Unified School District teachers by the Los Angeles Times. Boulder, CO: National Education Policy Center. Retrieved [date] from http://nepc.colorado.edu/publication/due-diligence.

[19] Alexander, K. L., Entwisle, D. R., & Olson, L. S. (2001). Schools, achievement, and inequality: A seasonal perspective. Educational Evaluation and Policy Analysis, 23(2), 171-191.

[20] Gates Foundation (2010). Learning about teaching: Initial findings from the Measures of Effective Teaching project. MET Project Research Paper. Seattle, Washington: Bill & Melinda Gates Foundation. Retrieved December 16, 2010, from http://www.metproject.org/downloads/Preliminary_Findings-Research_Paper.pdf.

[21] Lubienski, S. T., & Crane, C. C. (2010) Beyond free lunch: Which family background measures matter? Education Policy Analysis Archives, 18(11). Retrieved [date], from http://epaa.asu.edu/ojs/article/view/756

[22] For example, even value-added proponent Eric Hanushek finds in unrelated research that “students throughout the school test score distribution appear to benefit from higher achieving schoolmates.” See: Hanushek, E. A., Kain, J. F., Markman, J. M., & Rivkin, S. G. (2003). Does peer ability affect student achievement?. Journal of applied econometrics, 18(5), 527-544.

[24] Rothstein, J. (2011). Review of “Learning About Teaching: Initial Findings from the Measures of Effective Teaching Project.” Boulder, CO: National Education Policy Center. Retrieved May 2, 2013, from http://nepc.colorado.edu/thinktank/review-learning-about-teaching.

[25] Ehlert, M., Koedel, C., Parsons, E., & Podgursky, M. (2012). Selecting Growth Measures for School and Teacher Evaluations. http://ideas.repec.org/p/umc/wpaper/1210.html

The Principal’s Dilemma as Mock Trial: Ed Law Colleagues Please Provide Your Opinions!

The following is a hypothetical case I am using as the culminating activity in Public School Law this semester.

The Dismissal of Principal X

Principal X is the principal of a local public middle school in a state that has recently adopted – through legislation, articulated with greater precision in state department of education regulations – a new teacher evaluation scheme. The teacher evaluation laws and regulations now require that:

  • Any teacher who receives two sequential evaluations of less than “satisfactory” shall have his/her tenure status revoked;
  • Teacher evaluations shall consist of 40 to 50% measures of student growth, where the majority shall be based on state provided metrics;
  • By regulatory decree of the State Commissioner of Education, any other measures selected by local district officials for inclusion in evaluations must be proven correlated with state approved and provided measures of student achievement growth.

Further, the state now conditions receipt of “any and all increases to state aid for local public school districts” on full compliance with statutes and regulations pertaining to teacher evaluation.

On September 20th of 2013, Principal X was provided with growth percentile data on her teachers from the prior year. Of the approximately 40 certified staff in her school, 8 received growth percentile data. Two of those 8 achieved unsatisfactory growth percentile estimates for their students, and one – Teacher Y – received her second unsatisfactory rating in a row.

In keeping with the requirement that any and all other measures used in the state approved teacher evaluations be correlated with the growth percentile measures, the principal was compelled to assign this teacher a second unsatisfactory rating, and thus compelled to revoke the tenure status of Teacher Y. Teacher Y was a 10 year veteran perceived by the principal and many others in the school to be one of the school’s most valuable human resources. In fact, over the past several years, the principal had relied on this teacher to take on the most difficult students – including playing a more significant role than her peers in the inclusion of children with disabilities in her classroom – and the teacher had not only willingly but eagerly complied.

Frustrated with the outcome of the new state teacher evaluation laws, Principal X took her case to the public and to state officials simultaneously. Without specific reference to the case in question – but via stylized example – the principal used the case of Teacher Y to illustrate how strict requirements of job action based largely on limited and problematic measures could lead to damaging decisions – decisions that she argued were neither in the best interest of the teachers nor the children they served, and decisions likely to negatively affect the quality of education statewide.

The principal made the case for returning discretion on issues of teacher evaluation and human resource management to local officials, including school principals. The principal’s letter led to a sympathetic uprising from community members and parents, who were quick to catch on as to which teacher was actually the basis of the principal’s hypothetical. Parents of that teacher’s students were outraged, and expressed their outrage at local board of education meetings. During this time, the local board of education maintained quiet support of the principal.

The principal had also begun to engage other principals statewide, establishing a network of principals publicly proclaiming their opposition to the newly adopted state teacher evaluation statutes and regulations. A web site was created, a non-profit (political action) organization was formed, and the original letter of opposition to the new state policies was posted on the site, along with a petition for other school principals to show support for the group’s cause and/or become official members.

State officials were less supportive, unamused by the principal’s apparent disrespect for their authority and by the “willful disobedience of existing statutes and regulations” reflected in Principal X’s stalling on submitting the evaluation information necessary to revoke Teacher Y’s tenure status. Further, state officials were less than thrilled with the mounting insurrection initiated by the publicly posted letter outlining problems with the state teacher evaluation laws.

State officials released a letter to the local board of education indicating that their state aid would be frozen for the coming school year if, in fact, their rogue principal continued to stall and refuse compliance with the teacher evaluation laws. Under pressure from the board, Principal X agreed to initiate procedures that would lead to tenure revocation for Teacher Y. Instead of waiting out this process, Teacher Y chose to resign and pursue employment elsewhere.

But with mounting pressure from state department officials on the local board of education to control the growing statewide movement among principals against the teacher evaluation laws – a movement initiated by one of the district’s most respected principals, who had received only glowing evaluations in prior years – the district board chose to dismiss Principal X. The board cited that the principal’s activities had distracted her from doing the job required, substantively compromised her effectiveness as a principal, and significantly interfered with the ability of district officials to efficiently and effectively carry on district operations (including through the uncertainty created over the district’s future state aid receipts).

Principal X is now suing the district for wrongful dismissal, arguing that the dismissal violates her First Amendment right to express herself to the public on issues of public interest – issues on which she, as an informed public school employee, has relevant information.


Required Reading

Key Cases

Pickering v. Board of Education: http://www.oyez.org/cases/1960-1969/1967/1967_510

Connick v. Myers: http://www.oyez.org/cases/1980-1989/1982/1982_81_1251

Garcetti v. Ceballos: http://www.oyez.org/cases/2000-2009/2005/2005_04_473

Blogs

EdJurist: Garcetti & Schools: http://www.edjurist.com/garcetti-and-schools

EdJurist: Academic Blogging & Garcetti: http://www.edjurist.com/blog/2008/5/9/academic-freedom-garcetti-blogging.html

Law Reviews

Oluwole, J. O. (2007). On the Road to Garcetti: Unpick’erring Pickering and Its Progeny. Capital University Law Review, 36, 967.


The Perils of Economic Thinking about Human Behavior

Behavioral economics is an interesting and potentially useful field of academic inquiry. At its best, real behavioral economics attempts to address some of the concerns I raise here. But many if not most assumptions about human behavior and response to incentives are not representative of behavioral economics at its best.

Specifically, I’m increasingly concerned with what I see as the simple-minded projection of economic thinking onto everyone and anyone else, leading to ridiculous policy recommendations that – amazingly – get taken seriously, at least by the media and punditocracy.

See, for example, Roland Fryer’s experiment on loss aversion as a strategy for incenting teachers to make sure that their students gain a few extra test score points in limited content areas. Indeed, if we pay you up front and threaten to take your salary away if you don’t get those test score points out of those kids, the data suggest a greater likelihood of squeezing the kids for a few more points. Whether that tells us anything about the motivation and morals of teachers, or of the economists framing this argument, is an entirely different question. It tells us little or nothing about the appropriate policy response. Thankfully, the policy implications of this paper were sufficiently absurd that they gained little traction.

Let’s assume classic economic assumptions about human behavior really do hold steadfast and can be grossly simplified to an “anything for an extra buck, or not to lose one” position. I would argue that it is perhaps economists themselves who are most stereotypical in this regard – at least as represented in the thinking they project onto others. In fact, I would argue that many, born of a culture that self-selects into the economics profession, are simply going out of their way to project their own thinking onto others.

Further, many of these economists operate in a world where they can influence/control public policy and they too have an incentive in how they behave in this system. They are not impartial observers by any stretch of the imagination. Their goal is to use their economic research to shape public policy to their own advantage.

Put simply, just because the average morally bankrupt economist might do pretty much anything for an extra buck (or a billion), doesn’t mean the average teacher, doctor, nurse, fireman or police officer would!

This issue has been on my mind for some time, but recently came to a head when I read this completely ridiculous Washington Post article on health care policy – specifically, on how to remove the financial incentive that surgical complications supposedly create for hospitals and physicians.

I should note, I come from a medical family, so some of my arguments herein are drawn from dinner table conversations (across generations), coupled with my tendency to read health policy research out of personal interest in exploring connections with education policy.

It was implied in the WaPo article… well… actually, it was explicitly stated, that hospitals and physicians have a big financial incentive for their patients to have serious complications, leading to extended hospital stays and additional procedures.

Now, the average economist might be so morally bankrupt that, if he/she were in an operating room (OR) weighing the implications of complications for potential earnings, he/she might intentionally introduce infection or other complications. But thankfully the average economist is not in the OR. Thankfully, they self-selected into economics and not medicine (likely foreseeing greater opportunity to earn more for much less work and upfront investment).

The WaPo article does make the following statement, to head off this argument:

The study does not imply that hospitals intentionally complicate surgeries to bring in more revenue.

But, I would argue that this is actually a rather half-hearted disclaimer (to a half-assed argument) for an article that very much implies just that.

Certainly the economists’ policy response – how to employ crude economic assumptions of human behavior to fix this dreadful perverse incentive – implies that cutting off this financial benefit for malpractice would improve hospital and physician behavior [meanwhile conflating the hospital and physician incentives & roles in the various related processes]. Here is the policy solution recommended by the economists cited in the WaPo article:

If hospitals receive a set amount for every heart surgery they perform, for example, they suddenly have an incentive to reduce complications — they know the extra medical spending will come out of their own budget.

Lost in the economists’ reasoning here are a) the potential longer term financial and career implications for the physician repeatedly entangled in litigation over post-surgical complications, and b) the stress/mental toll on the physician arising from managing complications in tense moments in the OR.

Indeed this is anecdotal, but I’ve not met a physician – surgeon or anesthesiologist – who prefers a day when things go bad in the OR – or would be likely to see dollar signs in those moments of stress. What kind of sick bastard even thinks that way? Well, perhaps the average economist does.

Economists rarely – uh… NEVER – face professional stress comparable to managing a patient’s life on the edge, even when they make a massively stupid spreadsheet error that stimulates economic turmoil across the globe. Nor do they pay hefty malpractice premiums to shield themselves from such egregious malpractice (despite measurable financial damages). I would assert that the economist never faces the stress of having to care for a classroom of 20 to 40 kids, aged 5 to 15, whose immediate safety and well-being, as well as their long term futures, are on the line. This is, in part, why they get away with such ludicrous thinking.

It’s all a freakin’ game (freakin’ used here in a technical freakonomics sense)… a game of playing with big data – several layers removed from reality – from people – from real human consequences.

Perhaps that’s the central issue… even more so than economists’ financial self-interests?

Taken in perspective, it’s a fun game and a pretty cushy lifestyle to have the opportunity to ponder the policy implications of big data – as long as we don’t start thinking that what we do is so freakin’ important and indispensable, and as long as we understand where we sit in this big messy puzzle of human behavior and incentives.

Of course, the other interesting piece here is the leap we often see these days between what the study behind the headlines actually said, and the resulting spin in the media headlines. We also often see the economists themselves engaging in the spin. This was equally true in the famed Chetty, Rockoff, Friedman Fire Teachers First, Ask Questions Later study.

For example, here’s what the original study – in the Journal of the American Medical Association – on reimbursements associated with complications actually said:

Depending on payer mix, many hospitals have the potential for adverse near-term financial consequences for decreasing postsurgical complications.

It takes one hell of a leap of logic to get from this measured finding to the policy recommendation above. It takes projecting economists’ thinking – amoral greed – onto all actors involved. It also takes ignoring entirely a multitude of contextual factors and perverse consequences (economist thinking – first, we assume none of that stuff exists). Indeed, many complications relate to preexisting conditions and/or the overall health of the incoming patient. Do we really want to incent risk aversion (avoiding those far more likely to have complications)? Well, if it leads to lower premiums for, and taxes paid by, economists, then perhaps?

Tangentially (or not?), there is an equally ill-conceived movement afoot to apply to healthcare management the brilliance of what we have supposedly learned from measuring teacher effectiveness with value-added models, as explained in this policy brief from Mathematica. Notably, I tend to think Mathematica does pretty good work on education policy (better than most – see here; for a more critical perspective, see here). Put in its best light, this policy brief is merely Mathematica researchers engaging in another “I’ve got a hammer… where’s the freakin’ nail?” exercise.

Put in the light of economic thinking about human behavior – which many economists prefer to project on all others – the incentive here is for Mathematica to broaden its market, gaining contracts to develop value-added metrics for health care systems and for state and federal government – metrics ultimately to be used in reducing payments for healthcare, and thereby reducing the tax burden and healthcare premiums paid by Mathematica researchers, their funders and their peers. It’s a win/win. More contracts and higher income, and lower taxes and health benefits expenses (not costs, but expenses*).

That is, as long as they are never in need of surgery.

====

*Cost reduction implies that quality of service remains constant, whereas expenditure reduction may lead to service quality reduction.

Revisiting the Complexities of Charter Funding Comparisons

This Education Week post today rather uncritically summarized a recently published article based on an earlier report on charter school spending “gaps.” I’ve not had a chance to dig into this updated study yet, but the Ed Week post also referred to an earlier study from Ball State University, which I have critiqued on multiple occasions. Importantly, my previous critiques of that study point to the complexities of making these comparisons appropriately. Here is one version of my critique of the Ball State study, which appears in Footnote 22, page 49, of this study: http://nepc.colorado.edu/files/rb-charterspending_0.pdf

A study frequently cited by charter advocates, authored by researchers from Ball State University and Public Impact, compared the charter versus traditional public school funding deficits across states, rating states by the extent that they under-subsidize charter schools. The authors identify no state or city where charter schools are fully, equitably funded.

But simple direct comparisons between subsidies for charter schools and public districts can be misleading because public districts may still retain some responsibility for expenditures associated with charters that fall within their district boundaries or that serve students from their district. For example, under many state charter laws, host districts or sending districts retain responsibility for providing transportation services, subsidizing food services, or providing funding for special education services. Revenues provided to host districts to provide these services may show up on host district financial reports, and if the service is financed directly by the host district, the expenditure will also be incurred by the host, not the charter, even though the services are received by charter students.

Drawing simple direct comparisons thus can result in a compounded error: Host districts are credited with an expense on children attending charter schools, but children attending charter schools are not credited to the district enrollment. In a per-pupil spending calculation for the host districts, this may lead to inflating the numerator (district expenditures) while deflating the denominator (pupils served), thus significantly inflating the district’s per pupil spending. Concurrently, the charter expenditure is deflated.
Correct budgeting would reverse those two entries, essentially subtracting the expense from the budget calculated for the district, while adding the in-kind funding to the charter school calculation. Further, in districts like New York City, the city Department of Education incurs the expense of providing facilities to several charters. That is, the City’s budget, not the charter budgets, incurs another expense that serves only charter students. The Ball State/Public Impact study errs egregiously on all fronts, assuming in each and every case that the revenue reported by charter schools versus traditional public schools provides the same range of services, and provides those services exclusively for the students in that sector (district or charter).
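To make the direction of this accounting error concrete, here is a minimal numeric sketch. All figures are hypothetical, chosen only to illustrate how the naive and corrected calculations diverge:

```python
# Hypothetical host district that spends $10M on services (transportation,
# food, special education) received by charter students.
district_spending = 110_000_000  # total reported district expenditures
charter_services = 10_000_000    # portion actually serving charter students
district_pupils = 10_000         # pupils enrolled in district schools
charter_pupils = 1_000           # resident pupils enrolled in charters

# Naive comparison: charter service expenses stay in the district numerator,
# but charter pupils are missing from the district denominator.
naive_per_pupil = district_spending / district_pupils  # $11,000

# Corrected comparison: subtract the expense from the district side and
# credit it to the charter side as in-kind funding.
corrected_per_pupil = (district_spending - charter_services) / district_pupils  # $10,000
charter_in_kind = charter_services / charter_pupils  # $10,000 per charter pupil

print(naive_per_pupil, corrected_per_pupil, charter_in_kind)
```

In this stylized case, the naive method overstates district per pupil spending by $1,000 while crediting charters with none of the in-kind services they receive.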

Charter advocates often argue that charters are most disadvantaged in financial comparisons because charters must often cover, from their annual operating expenses, the expenses associated with leasing facilities space. Indeed, it is true that charters are not afforded the ability to levy taxes to carry public debt to finance construction of facilities. But it is incorrect to assume, when comparing expenditures, that traditional public school facilities are already paid for and carry no associated costs while charter schools must bear the burden of leasing at market rates – essentially an “all versus nothing” comparison. First, public districts do have ongoing maintenance and operations costs for facilities, as well as payments on debt incurred for capital investment, including new construction and renovation. Second, charter schools finance their facilities through a variety of mechanisms, with many in New York City operating in space provided by the city, many charters nationwide operating in space fully financed with private philanthropy, and many holding lease agreements for privately or publicly owned facilities.

New York City is not alone in its choice to provide full facilities support for some charter school operators (http://www.thenotebook.org/blog/124517/district-cant-say-how-many-millions-its-spending-renaissance-charters). Thus, the common characterization that charter schools front 100% of facilities costs from operating budgets with no public subsidy, while traditional public school facilities are “free” of any costs, is wrong in nearly every case; in some cases there exists no facilities cost disadvantage whatsoever for charter operators.

Baker and Ferris (2011) point out that while the Ball State/Public Impact study claims that charter schools in New York State are severely underfunded, the New York City Independent Budget Office (IBO), in a more refined analysis focusing only on New York City charters (the majority of charters in the state), finds that charter schools housed within Board of Education facilities are comparably subsidized relative to traditional public schools (2008-09). In revised analyses, the IBO found that co-located charters (in 2009-10) actually received more than city public schools, while charters housed in private space continued to receive less (after discounting occupancy costs). That is, the funding picture around facilities is more nuanced than is often suggested.

Batdorff, M., Maloney, L., May, J., Doyle, D., & Hassel, B. (2010). Charter School Funding: Inequity Persists. Muncie, IN: Ball State University.

NYC Independent Budget Office (2010, February). Comparing the Level of Public Support: Charter Schools versus Traditional Public Schools. New York: Author, 1.

NYC Independent Budget Office (2011). Charter Schools Housed in the City’s School Buildings get More Public Funding per Student than Traditional Public Schools. New York: Author. Retrieved April 24, 2012, from http://ibo.nyc.ny.us/cgi-park/?p=272.

NYC Independent Budget Office (2011). Comparison of Funding Traditional Schools vs. Charter Schools: Supplement. New York: Author. Retrieved April 24, 2012, from http://www.ibo.nyc.ny.us/iboreports/chartersupplement.pdf.

Note: The average “capital outlay” expenditure of public school districts in 2008-09 was over $2,000 per pupil in New York State, nearly $2,000 per pupil in Texas and about $1,400 per pupil in Ohio. Based on enrollment weighted averages generated from the U.S. Census Bureau’s Fiscal Survey of Local Governments, Elementary and Secondary School Finances 2008-09 (variable tcapout): http://www2.census.gov/govs/school/elsec09t.xls
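For clarity, here is roughly how such an enrollment weighted average can be computed. The variable tcapout is named in the note above; the enrollment and state column names are my own placeholder assumptions about the file layout:

```python
# Enrollment weighted average per pupil capital outlay for one state.
# 'tcapout' is the Census variable named in the note above; 'enroll' and
# 'state' are assumed column names -- check the actual spreadsheet layout.
import pandas as pd

df = pd.read_excel("elsec09t.xls")
ny = df[df["state"] == "NY"]

# Weighting each district's per pupil outlay by its enrollment reduces to
# total capital outlay divided by total enrollment.
weighted_avg = ny["tcapout"].sum() / ny["enroll"].sum()
print(round(weighted_avg))
```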

Friday AM Graphs: Just how biased are NJ’s Growth Percentile Measures (school level)?

New Jersey finally released the data set of its school level growth percentile metrics. I’ve been harping on a few points on this blog this week.

SGP data here: http://education.state.nj.us/pr/database.html

Enrollment data here: http://www.nj.gov/education/data/enr/enr12/stat_doc.htm

First, that the commissioner’s characterization that the growth percentiles necessarily fully take into account student background is a completely bogus and unfounded assertion.

Second, that it is entirely irresponsible and outright reckless that they’ve chosen not even to produce technical reports evaluating this assertion.

Third, that growth percentiles are merely individual student level descriptive metrics that simply have no place in the evaluation of teachers, since they are not designed (by their creator’s acknowledgement) for attribution of responsibility for that student growth.

Fourth, that the Gates MET studies provide absolutely no validation of New Jersey’s choice to use SGP data in the way proposed regulations mandate.

So, this morning I put together four quick graphs: the relationship between school level percent free lunch and median SGPs in language arts and math, and between school level 7th grade proficiency rates and median SGPs in language arts and math. Just how bad is the bias in the New Jersey SGP/MGP data? Well, here it is! (Actually, it was bad enough to shock me.)
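For anyone who wants to replicate these checks from the files linked above, here is a rough sketch. The column names are placeholders of my own invention – the actual field names in the NJDOE releases will differ:

```python
# Merge the school level SGP file with the enrollment/demographic file
# and compute the correlations graphed below. Column names are assumptions.
import pandas as pd

sgp = pd.read_csv("nj_school_sgp.csv")         # median growth percentiles by school
enr = pd.read_csv("nj_school_enrollment.csv")  # includes free lunch shares

merged = sgp.merge(enr, on="SCHOOL_CODE")

# In an unbiased growth measure, these correlations should hover near zero.
for subject in ("MEDIAN_SGP_MATH", "MEDIAN_SGP_LAL"):
    r = merged["PCT_FREE_LUNCH"].corr(merged[subject])
    print(subject, round(r, 3))
```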

First, if you are a middle school with a higher percent free lunch, you are, on average, likely to have a lower growth percentile rating in math. Notably, the math ASK assessment has a significant ceiling effect leading into the middle grades, perhaps weakening this relationship (more on this at a later point).

Slide1

If you are a middle school with a higher percent free lunch, you are, on average, likely to have a lower growth percentile rating in English Language Arts. This relationship is actually even more biased than the math relationship (uncommon for this type of analysis), likely because the ELA assessment suffers less of a ceiling effect problem.

Slide2

As with many if not most SGP data, the relationship is actually even worse when we look at the correlation with the average performance level of the school, or peer group. If your school has higher proficiency rates to begin with, your school will quite likely have a higher growth percentile ranking:

Slide3

The same applies for English Language Arts:

Slide4

Quite honestly, these are the worst – most biased – school level growth data I think I’ve ever seen.

They are worse than New York State.

They are much worse than New York City.

And they are worse than Ohio.

And this is just a first cut at them. I suspect that if I had actual initial scores, or even school level scale scores, the relationship between those scores and growth percentiles would be even stronger. But I will test that when the opportunity presents itself.

Further, because the bias is so strong at the school level – it is likely also quite strong at the teacher level.

New Jersey’s school level MGPs are highly unlikely to provide any meaningful indicator of the actual effectiveness of the teachers, administrators and practices of New Jersey schools. Rather, by the conscious choice to ignore the contextual factors of schooling (be it the vast variations in the daily lives of individual children, the difficult to measure power of peer group context, or various other social contextual factors), New Jersey’s growth percentile measures fail miserably.

No school can be credibly rated as effective or not based on these data, nor can any individual teacher be cast as necessarily effective or ineffective.

And this is not at all unexpected.

Additional Graphs: Racial Bias

Slide5

Slide6

Just for fun, here’s a multiple regression model identifying additional factors statistically associated with school level MGPs. First and foremost, these factors explain over 1/3 of the variation in Language Arts MGPs. That is, Language Arts MGPs seem heavily contingent upon a) student demographics, b) location and c) grade range of school. In other words, if we start using these data as a basis for de-tenuring teachers, we will likely be de-tenuring teachers quite unevenly with respect to a) student demographics, b) location and c) grade range… despite having little evidence that we are actually validly capturing teacher effectiveness – and with substantial indication here that we are, in fact, NOT.

Patterns for math aren’t much different. Less variance is explained – again, I suspect, because of the strong ceiling effect on math assessments in the upper elementary/middle grades. There appears to be a positive charter school effect in this regression, but I remain too suspicious of these data to attach any meaningful conclusions to it. Besides, if we assert this charter effect to be true on the premise that these MGPs are somehow valid, then we’d have to accept that charters like Robert Treat in Newark are doing a particularly poor job (very low MGP compared either with schools of similar demographics or with schools of similar average performance levels).

School Level Regression of Predictors of Variation in MGPs

school mgp regression

*p<.05, **p<.10
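For readers who want to attempt something similar, here is a sketch of the kind of school level regression reported in the table above. The file and variable names are hypothetical placeholders, not actual NJDOE field names:

```python
# Sketch of a school level regression of MGPs on demographics, location,
# and grade range. All names below are placeholder assumptions.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("nj_school_mgp_merged.csv")

model = smf.ols(
    "mgp_lal ~ pct_free_lunch + pct_black + pct_hispanic"
    " + charter + middle_school + C(region)",
    data=df,
).fit()

print(model.summary())  # coefficients with significance levels
print(model.rsquared)   # share of MGP variation explained (over 1/3 for LAL above)
```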

At this point, I think it’s reasonable to request that the NJDOE turn over masked versions (with student identifiers removed) of the student level SGP data – with all relevant demographic indicators, matched to teachers, attached to school IDs, and including the certifying institution of each teacher. These data require thorough vetting, as it would certainly appear that they are suspect as a school evaluation tool. Further, any bias this apparent at the school level – which is merely an aggregation of teacher/classroom level data – indicates that the same problems exist in the teacher level data. Given the employment consequences here, it is imperative that NJDOE make these data available for independent review.

Until these data are fully disclosed (not just their own analyses of them, which I expect to be cooked up any day now), NJDOE and the Board of Education should immediately cease moving forward on using these data for any consequential decisions, whether for schools or for individual teachers. And if they do not, school administrators, local boards of education, individual teachers and teacher preparation institutions (which are also to be rated by this shoddy information) should JUST SAY NO!

A few more supplemental analyses

Slide1

Slide2

Slide3

Slide4


Briefly Revisiting the Central Problem with SGPs (in the creator’s own words)

When I first criticized the use of SGPs for teacher evaluation in New Jersey, the creator of the Colorado Growth Model responded with the following statement:

Unfortunately Professor Baker conflates the data (i.e. the measure) with the use. A primary purpose in the development of the Colorado Growth Model (Student Growth Percentiles/SGPs) was to distinguish the measure from the use: To separate the description of student progress (the SGP) from the attribution of responsibility for that progress.

http://www.ednewscolorado.org/voices/student-growth-percentiles-and-shoe-leather

I responded here.

Let’s parse this statement one more time. The goal of the SGP approach, as applied in the Colorado Growth Model and subsequently in other states, is to:

…separate the description of student progress (the SGP) from the attribution of responsibility for that progress.

To evaluate the effectiveness of a teacher in influencing student progress, one must certainly be able to attribute responsibility for that progress to the teacher. If SGPs aren’t designed to attribute that responsibility, then they aren’t designed for evaluating teacher effectiveness, and thus aren’t a valid factor for determining whether a teacher should have his/her tenure revoked on the basis of ineffectiveness.

It’s just that simple!

Employment lawyers: save the quote and link above for cross examination of Dr. Betebenner when teachers start losing their tenure status and/or are dismissed primarily on the basis of his measures – measures which, by his own recognition, are not designed to attribute responsibility for student growth to the teachers or to any other home, school or classroom factor that may be affecting that growth.

(Reiterating again that while value added models do attempt to isolate teacher effects, they just don’t do a very good job of it.)
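For the statistically inclined, here is a stylized sketch of that distinction at the student level. This is not any state’s actual implementation, and the data file and column names are made-up placeholders:

```python
# A stylized contrast of the two model families discussed above.
# Both are sketches under simplified assumptions; 'student_scores.csv'
# and all column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("student_scores.csv")

# SGP-style: a conditional quantile of current score given prior score only.
# It describes where a student's growth falls relative to academic peers,
# and attributes responsibility for that growth to no one.
sgp_like = smf.quantreg("score ~ prior_score", data=df).fit(q=0.5)

# VAM-style: current score on prior score plus student covariates and a
# teacher fixed effect -- an attempt (imperfect, as noted above) to isolate
# the teacher's contribution.
vam_like = smf.ols(
    "score ~ prior_score + free_lunch + ell + sped + C(teacher_id)",
    data=df,
).fit()

print(sgp_like.params)
print(vam_like.params["prior_score"])
```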


On Misrepresenting (Gates) MET to Advance State Policy Agendas

In my previous post I chastised state officials for their blatant mischaracterization of metrics to be employed in teacher evaluation. This raised (in Twitter conversation) the issue of the frequent misrepresentation of findings from the Gates Foundation Measures of Effective Teaching (MET) project. Policymakers frequently invoke the Gates MET findings as providing broad based support for however they might choose to use, whatever measures they might choose to use (such as growth percentiles).

Here is one example in a recent article from NJ Spotlight (John Mooney) regarding proposed teacher evaluation regulations in New Jersey:

New academic paper: One of the most outspoken critics has been Bruce Baker, a professor and researcher at Rutgers’ Graduate School of Education. He and two other researchers recently published a paper questioning the practice, titled “The Legal Consequences of Mandating High Stakes Decisions Based on Low Quality Information: Teacher Evaluation in the Race-to-the-Top Era.” It outlines the teacher evaluation systems being adopted nationwide and questions the use of SGPs specifically, saying the percentile measures are not designed to gauge teacher effectiveness and “thus have no place” in determining, especially, a teacher’s job fate.

The state’s response: The Christie administration cites its own research to back up its plans, the most favored being the recent Measures of Effective Teaching (MET) project funded by the Gates Foundation, which tracked 3,000 teachers over three years and found that student achievement measures in general are a critical component in determining a teacher’s effectiveness.

I asked colleague Morgan Polikoff of the University of Southern California for his comments. Note that Morgan and I aren’t entirely on the same page on the usefulness of even the best possible versions of teacher effect (on test score gain) measures… but we’re not that far apart either. It’s my impression that Morgan believes better estimated measures can be more valuable in policy decision making – more valuable than I perhaps think they can be. My perspective is presented here (and Morgan is free to provide his). My skepticism arises in part from my perception that there is neither interest among nor incentive for state policymakers to actually develop better measures (as evidenced in my previous post), and in part from my doubt that some of the major issues can ever be resolved.

That aside, here are Morgan Polikoff’s comments regarding misrepresentation of the Gates MET findings – in particular, as applied to states adopting student growth percentile measures:

As a member of the Measures of Effective Teaching (MET) project research team, I was asked by Bruce to pen a response to the state’s use of MET to support its choice of student growth percentiles (SGPs) for teacher evaluations. Speaking on my behalf only (and not on behalf of the larger research team), I can say that the MET project says nothing at all about the use of SGPs. The growth measures used in the MET project were, in fact, based on value-added models (VAMs) (http://www.metproject.org/downloads/MET_Gathering_Feedback_Research_Paper.pdf). The MET project’s VAMs, unlike student growth percentiles, included an extensive list of student covariates, such as demographics, free/reduced-price lunch, English language learner, and special education status.

Extrapolating from these results and inferring that the same applies to SGPs is not an appropriate use of the available evidence. The MET results cannot speak to the differences between SGP and VAM measures, but there is both conceptual and empirical evidence that VAM measures that control for student background characteristics are more conceptually and empirically appropriate (link to your paper and to Cory Koedel’s AEFP paper). For instance, SGP models are likely to result in teachers teaching the most disadvantaged students being rated the poorest (cite Cory’s paper). This may result in all kinds of negative unintended consequences, such as teachers avoiding teaching these kinds of students.

In short, state policymakers should consider all of the available evidence on SGPs vs. VAMs, and they should not rely on MET to make arguments about measures that were not studied in that work.

Morgan

Citations:

Baker, B.D., Oluwole, J., Green, P.C. III (2013) The legal consequences of mandating high stakes decisions based on low quality information: Teacher evaluation in the race-to-the-top era. Education Policy Analysis Archives, 21(5). This article is part of EPAA/AAPE’s Special Issue On Value-Added: What America’s Policymakers Need to Know and Understand, Guest Edited by Dr. Audrey Amrein-Beardsley and Assistant Editors Dr. Clarin Collins, Dr. Sarah Polasky, and Ed Sloat. Retrieved [date], from http://epaa.asu.edu/ojs/article/view/1298

Ehlert, M., Koedel, C., Parsons, E., & Podgursky, M. (2012). Selecting Growth Measures for School and Teacher Evaluations. http://ideas.repec.org/p/umc/wpaper/1210.html

(Updated alternate version: http://economics.missouri.edu/working-papers/2012/WP1210_koedel.pdf)