Thursday, October 01, 2009

Problems with the New Barrymore Awards: Part I

Next Monday, the Theatre Alliance of Greater Philadelphia will host the 2009 Barrymore Awards for Excellence in Theatre. However, the unprecedented clustering of nominations in this year’s awards exposes the problems with the new nominating method. See Part II for how these problems render the Awards unable to fulfill their stated goal of recognizing excellence.

In the past five seasons (2004-2008) of the Barrymore Awards, only five productions earned 10 or more nominations.

This year alone, four productions garnered more than 10 nominations, even though a greater number of participating companies made more shows eligible than ever before. Two of them, Cinderella and Something Intangible, equaled the total of 13 given to Sweeney Todd in 2005. The Producers and Scorched scored 12 apiece, bringing the total for the top four vote-getters to 50 out of 113 possible nominations. In the musical theatre categories, two productions captured 25 nods, and five took 44 of the 51 nominations possible in the genre.

Furthermore, this clustering of nominations extended to whole award categories: the Wilma’s production of Scorched and People’s Light’s staging of the musical Cinderella each saw four female performers nominated for Outstanding Supporting Actress (in a play and musical, respectively); likewise, the Arden’s production of Something Intangible raked in three best actor nods.

Something doesn’t add up. While some might contend that a handful of shows emerged as clearly superior candidates in a mediocre season (despite notable oversights like Blackbird and Hamlet, among others), I’d argue that the clustering effect around these (and a few other) productions resulted from changes implemented this year to the Barrymore Awards voting system.

Out with the Old: How the nominating used to work

To understand what happened requires some background on the Barrymore Awards’ history. The Theatre Alliance of Greater Philadelphia started the awards during the 1994-95 season and first used nominators selected from the theatre community to decide them. In 2000, the Alliance switched from this simple system to a two-tiered approach of 40 to 50 nominators and 10 to 17 judges, the latter handpicked theatre professionals with hundreds of years of combined theatre-producing and theatre-going experience.

This now-discarded two-tiered system randomly assigned six nominators to see each eligible production within three days of its opening night. Within 24 hours, each filled out a ballot, giving either a “thumbs-up” or a “thumbs-down” for every applicable category (such as “outstanding music direction”).

If a minimum of three of the six nominators gave a thumbs-up in any one category, the production became eligible for nomination in every category. To determine which aspects of a show (if any) should receive a nomination, all of the judges then went to see that particular production. At the end of the season, the judges, who had seen every eligible production, voted on the awards. The top five ballot-getters received nominations, with the winner determined by which show, performer, or designer garnered the most of the judges’ votes.
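
For concreteness, the old filter-then-judge logic can be sketched in a few lines of Python (the thresholds come from the post; the function names and data shapes are mine):

```python
# A sketch of the old two-tier process; thresholds from the post, names mine.
THUMBS_UP_NEEDED = 3      # out of six nominators, in any one category
NOMINEES_PER_AWARD = 5

def production_eligible(thumbs_up_by_category):
    """thumbs_up_by_category: {category: thumbs-up count among six nominators}.
    Clearing the bar in any one category made the whole show eligible."""
    return any(n >= THUMBS_UP_NEEDED for n in thumbs_up_by_category.values())

def award_nominees(judge_votes):
    """judge_votes: {candidate: votes from judges who saw every eligible show}.
    The top five vote-getters were nominated; the most votes won the award."""
    ranked = sorted(judge_votes, key=judge_votes.get, reverse=True)
    return ranked[:NOMINEES_PER_AWARD]
```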

In with the New: From differential expertise to random voters

For the 2008-09 season, Margie Silvante, the Theatre Alliance’s new Executive Director, decided to eliminate the two-tiered system of nominators and judges and replace it with a single cadre of “voters”. Her new, metrics-based system randomly assigned eight voters (out of a pool of 62) to see each show, with each voter weighing in on 12 to 20 of the 130 productions eligible for consideration.

Within 24 hours of seeing an eligible show, each voter logged onto a website to post scores for each of the applicable awards (for instance, “outstanding actor in a play”). The ballot’s scale ran from 0-20 (“poor”) up to 86-100 (“outstanding”), and each voter cast a specific number score for each possible award, using the band labels as rough-and-ready standards to guide the scoring. Under this new system, the top five scores in any award determined the nominations, with the top point-scorer ultimately winning the award (to be announced at the ceremony on October 5).
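
In rough code, the new tally might look like the sketch below. Only the 0-100 ballot, its “poor” and “outstanding” bands, and the top-five rule come from the post; combining the eight assigned voters’ scores by a simple average is my assumption.

```python
# A sketch of the new metrics-based tally. The 0-100 ballot and the top-five
# rule come from the post; averaging the eight voters' scores is an assumption.
def award_nominees(scores_by_candidate, top_n=5):
    """scores_by_candidate: {candidate: the eight assigned voters' 0-100 scores}.
    The five highest aggregate scores become the nominations; the highest wins."""
    avg = {c: sum(s) / len(s) for c, s in scores_by_candidate.items()}
    return sorted(avg, key=avg.get, reverse=True)[:top_n]
```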

Silvante announced these changes in early 2008 at a mid-season meeting of nominators and judges, stressing her desire to restore integrity to the process and eliminate the prejudice of some judges. I had witnessed this bias at an earlier meeting, when then-judge Alan Blumenthal admitted to Walnut Street Theatre’s Artistic Director Bernard Havard that the judges had been prejudiced against the Walnut’s productions in the past.

Silvante hoped that her new metrics-based system would eliminate this unfairness and enable greater rigor by introducing a method of quantification that could (in theory) draw upon the commonalities of judgment from a larger and more diverse pool of voters.

Considered Judgment versus The Wow Effect

But rather than producing greater integrity and rigor, the new process yielded a clustering of nominations unseen in previous years. Two competing hypotheses can explain this phenomenon; neither has anything to do with artistic merit.

To understand what happened, consider the new system’s process of assigning voters. With eight voters drawn from a pool of 62, the odds against any one particular group of eight being sent to a single show come to roughly 1 in 136 trillion. The odds against any single group of eight voters reuniting to see another production come to roughly 1 in 1.8 × 10²⁷. (The actual numbers are slightly smaller because of the cap put on the number of shows assigned to each individual voter.)
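
For readers who want to check the arithmetic, here is a minimal sketch in Python. The pool and panel sizes come straight from the post; treating each show’s panel as an independent, uniform random draw is my simplification. Note that the 136-trillion figure counts ordered draws; counted as unordered groups, the odds run to about 1 in 3.4 billion, which leaves the point intact.

```python
# A quick check of the assignment combinatorics (Python 3.8+ standard library).
# Pool and panel sizes come from the post; the uniform-draw model is mine.
import math

POOL, PANEL = 62, 8

# Ordered draws of eight voters: matches the post's "1 in 136 trillion".
ordered = math.perm(POOL, PANEL)    # 136,325,893,334,400

# As unordered groups (order can't matter to a panel), the count is smaller.
groups = math.comb(POOL, PANEL)     # 3,381,098,545, i.e. ~1 in 3.4 billion

# Chance that one pre-specified group lands on a given show, then on a second.
p_once = 1 / groups
p_twice = p_once ** 2               # ~1 in 1.1e19, counting unordered groups

print(f"ordered draws: 1 in {ordered:,}")
print(f"unordered groups: 1 in {groups:,}")
print(f"same group on two given shows: 1 in {1 / p_twice:.3g}")
```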

Under the old system, the chance that any grouping of judges not only saw the same productions but saw every eligible production: 100 percent. The judges could compare performances across the whole season, ensuring a level of measured reflection and quality control that the new system lacks.

The new system, by contrast, requires that each voter post a score within 24 hours, without recourse to reflection and without the frame of reference that seeing every other eligible production affords. As such, each voter’s evaluation must contend with his or her first impression of a performance and whatever overwhelming emotions, whether positive or negative, the production has elicited.

Because of this time constraint, I would assert that voters, taken as a whole, will tend to over-value an excellent production and fall victim to the “wow effect” just like anyone else in the audience. (Other critics have cited this as the number one reason to postpone writing a review until one can fully collect one’s thoughts.)

Certain plays, those heavily indebted to spectacle or capable of inducing powerful emotions in the audience, could take much greater advantage of this wow effect. The final unraveling of the mystery in Scorched packs an emotionally stunning revelation that few plays equal; walking out of the theatre, and even for the next 24 hours, its conclusion would still leave one reeling. But a magnificent moment doesn’t necessarily make a magnificent show. And a common error, the fallacy of division, would see voters according greater weight to each individual performance in a show that elicited that effect.

The old two-tiered system of judges and nominators could actually put this “wow factor” to work. The judges would see the plays the nominators had overvalued, and because they didn’t have to decide the ultimate merit of each production element on the spot, they could temper their observations against their evaluations of other performances. What appeared overwhelmingly “outstanding” after a single viewing could come into better perspective against everything the community offered over an entire season. (Don’t believe the “wow effect” exists? Judges have told me on more than one occasion that they “can’t believe the nominators sent them to see such-and-such a show.”)

The Second Hypothesis: Mediocrity rears its nondescript head

Allowing the awards to be determined by a random distribution of voters who each see only a handful of shows enables another likely, though far more insidious, explanation for the clustering of awards, which I’ll call the “mediocrity effect.”

While the new system hinges on a set of commonalities distributed evenly among its 62 voters that could help standardize their choices, a rough-and-ready metric of five scoring bands cannot eliminate personal judgment in assigning the scores.

Take any two critics seeing the same show. Presumably, Philadelphia Weekly’s J. Cooper Robb and I bring a common background of qualifications to our roles as theatre critics. Yet in his best-of-the-season roundup, he called Geoff Sobelle’s performance in Hamlet the year’s best; I thought it decorated with frills but lacking a central unifying quality. In the Barrymore voting system, Robb might have scored Sobelle’s performance a 95, where I would’ve chalked up a 70.

However, under the new scoring system, Sobelle’s unique interpretation would have lost to any performance that consistently earned a vote of 83, a score that falls below the cutoff for “outstanding.” To give another indication of how this could happen: when I was a nominator, actors (whom I won’t name) told me that they had auditioned for the role they now had to vote upon, didn’t agree with the choices made by the performer who was cast, and for that reason didn’t think it worthy of Barrymore consideration. Rather than eliminate the bias of the judges and restore integrity, the new system makes it possible for disgruntled voters to trash a performer’s rankings entirely.
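
The arithmetic of that first scenario is worth spelling out. In the toy example below, the scores are invented and the simple average is my assumption about how the eight ballots combine, but the mechanism is the one described above: a bold, polarizing interpretation loses to safe consistency.

```python
# Invented scores for illustration; assumes the eight ballots are averaged.
def average(scores):
    return sum(scores) / len(scores)

# A polarizing, singular interpretation: admirers and doubters split the panel.
polarizing = [95, 70, 92, 68, 90, 74, 88, 72]    # average 81.125

# A safe performance that every voter rates "very good, not outstanding".
consistent = [83] * 8                             # average 83.0

print(average(polarizing) < average(consistent))  # True: the bold reading loses
```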

Moreover, statistics predicts that most rankings will cluster around a norm. (And even if the Awards process threw out the highest and lowest scores, as Olympic diving does, that trimming would actually further encourage regression toward an average score.) Unfortunately, this new system of voting makes it possible for that “norm” to enshrine mediocrity at the expense of more superlative work.
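
A quick sketch shows why trimming cuts against exceptional work (scores invented again): the ballot a trim removes is often precisely the enthusiasm an extraordinary moment generates.

```python
# Invented scores; drops one highest and one lowest ballot, as in diving.
def mean(xs):
    return sum(xs) / len(xs)

def trimmed_mean(xs):
    return mean(sorted(xs)[1:-1])   # discard the single high and low scores

wow_show = [98, 82, 80, 81, 83, 80, 79, 84]    # one voter floored by the show
print(mean(wow_show))           # 83.375
print(trimmed_mean(wow_show))   # ~81.67: the trim removes the enthusiasm
```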

What the new system ultimately makes possible

I don’t write these comments to discredit any of the voters, many of whose opinions I respect, but to point out what kinds of outcomes a particular set of boundaries makes more likely. All systems of measurement possess flaws that mandate trade-offs, and I won’t pretend that any set of voters can completely avoid well-established observational biases. The sensible course is to select the system that minimizes the impact of each bias in turn.

And this all goes back to the way the voters are assigned. The new system yields only a 1 in 1.8 × 10²⁷ chance that the same eight voters ever reunited to evaluate another show. In all likelihood, the voters who cast their votes for Something Intangible never evaluated another show as a unit. Furthermore, the parameters of the new system encourage the “wow effect” and the “mediocrity effect” in a way that not only makes each error possible but makes both more likely to occur.

Because the new system lacks the method of self-correction and quality control that the judges provided in years past, it compounds the effects of each error. Hence you get clustering: either around shows that wowed the voters, or around shows that contained enough reasonably good elements to ensure a high, though not outstanding, average.

In either case, every award this year was decided by the less-than-24-hour reflections of eight individuals who hadn’t seen all the contenders (and not even the same eight individuals across the contenders for a single award).

In a system with two levels of quality control, and with the far richer basis for comparison provided by judges who saw every eligible production, this clustering effect would not be a statistical artifact; it would only happen when a show was truly phenomenal. Hence, under the old system, only five shows in five years garnered 10 or more nominations, as opposed to the four productions that received that many this year alone. The new system, by contrast, encourages the clustering of awards not for any reason of artistic merit, but out of sheer probability.

Oh well, back to the drawing board.

See Parts II and III for more.
