Thursday, October 01, 2009

Problems with the New Barrymore Awards: Part II

In my first article on the problems with the new Barrymore Awards voting process, I pointed out how the new system’s assignment of voters enabled clustering of awards around certain productions to a degree unseen in seasons past.

Here, I will show how the new process itself cannot fulfill its stated goal of recognizing the best performance, design element, or production over an entire season. And I would say that this holds true even if everything I wrote in the first article proves false.

First things first: I realize that the Barrymores do not—in name—designate the “best” anything (e.g., best director), but instead give awards signifying “outstanding” sound design, “outstanding” performance by a leading actress, etc. But this circumlocution merely equivocates on a term.

Under the old system of voting, only one performance or design element received the top number of votes from the judges, and similarly, the new system yields a “highest score” from the voters. In each process, someone is (or will be) collectively regarded as the “best” of the season. Of course, people can always pretend otherwise.

However, I would argue that only the old system could legitimately recognize the best performances and design elements of a season. By contrast, the new process cannot even convey a standard of excellence, let alone reward the most outstanding anything of the season.

Who Decides and How?

This year’s new system of voting sent eight randomly assigned voters out of 62 to see each show, with each voter seeing 12 to 20 shows over the season. Their instructions encouraged them to treat each show on its own merits and rank each performance or design element on a scale of 0 to 100, with rough-and-ready categories (like “poor: 0-20”) guiding their scores.

The judging of figure skating in the Olympics attempts something similar, assigning point values to each performer taken in individual consideration. But there, the judges possess pre-determined objective criteria (difficulty of routine, number of specific movements performed) that form part of their scoring.

However, because theatre lacks any such observer-independent objective criteria, the new Barrymore system more resembles trying to determine the fastest runner by taking each competitor in isolation, letting a handful of people watch him run, and then selecting another, different batch of observers to evaluate the next sprinter. Imagine this process without a stopwatch and you understand how they determined this year’s awards.

As such, this quantified system can only encourage thinking about excellence; without a frame of reference or cross-comparison, it cannot possibly measure it. As with obscenity, we must simply trust the voters to “know it when they see it.”

How the old system of judges solved this problem

When it comes to art, this might be the best any of us can do, and the judges of the old system operated similarly. However, unlike the judges, the voters do not see every eligible show, which, in a qualitative analysis, is the only thing that could give them a frame of reference to properly vote for the “most outstanding X of an entire season.” Instead, they cast a once-and-done fixed vote that they cannot later rescind or alter.

The old system of judges who had seen every eligible production could—no matter how flawed otherwise—at least introduce a frame of reference for cross-comparison. Yes, they also lacked “objective criteria,” but unlike runners viewed by rotating sets of observers, the judges at least possessed the advantage of seeing and evaluating every show. At the end of the year, after marshalling a continually refined set of theatre-evaluating experiences, they could then confidently cast a vote for excellence.

But now, the new system has transferred the power of the judges to an even smaller group while losing the one advantage of cross-comparison that the judges conferred. Even assuming bias on the part of all judges, that they had seen every eligible show still gave the old system a level of quality control that the new process lacks.

A sports analogy clarifies the problem

So rather than 10 to 17 judges deciding all the awards after a period of reflection, this season the first (and isolated) impressions of eight individuals decided each and every award. And because of the random distribution of the voters, no two of those decisions were even made by the same group of voters.

To borrow another analogy from sports, the new process resembles allowing a different set of judges to decide each of the gold, silver, and bronze medals in the same event. Whoever thought that spreading the responsibility for choosing each award—while leaving no group responsible for any award as a whole—onto ever-new random groups actually increased the rigor and integrity of the Barrymore process needs to take a course in qualitative analysis.

In order to rank something as “the most outstanding X” of the year, one needs a large sample, not of voters seeing isolated shows, but of total number of shows seen.

By contrast, pretending that the voters should treat a show only on its own merits means asking them to ignore every single show or theatre experience any of them ever had. But each voter can only know excellence through past exposure to it. And since no one can ever ignore the totality of their experience when making judgments about excellence, why wouldn’t Silvante want to buttress the system’s ability to truly reward it by ensuring that every person who votes on the awards possesses the same theatre-going experiences that season?

Qualitative analysis versus quantifiable metrics

Qualitative notions like “best” and “outstanding” must involve a comparison. But the elimination of a group of judges that could make these comparisons eliminated the possibility of the new system rendering such judgments. At best, the new awards can only stipulate which performance, production, or design element earned the highest score from a randomly assigned group of voters who never again voted on another production as a unit. Perhaps they should change the name of each award from “Outstanding Actor” to “Highest Voted Upon Performance,” a meaningless moniker to signify a process that could not otherwise ensure that it rewarded the quality of excellence.

Stay tuned for Part III in this series, where I discuss the potential for using quantitative analysis to judge art.

Problems with the New Barrymore Awards: Part I

Next Monday, the Theatre Alliance of Greater Philadelphia will host the 2009 Barrymore Awards for Excellence in Theatre. However, the unprecedented clustering of nominations for this year's awards points out the problems with the new method of nominating. See Part II for how these problems render the Awards unable to fulfill their stated goal of recognizing excellence.

In the past five seasons (2004-2008) of the Barrymore Awards, only five productions earned 10 or more nominations.

This year alone, four productions garnered more than 10 nominations, even though a greater number of participating companies made more shows eligible than ever before. Two of them—Cinderella and Something Intangible—equaled the total of 13 given to Sweeney Todd in 2005. The Producers and Scorched scored 12 apiece, bringing the total for the top four vote-getters to 50 out of 113 possible nominations. In the musical theatre categories, two productions captured 25 nods, and five took 44 of the 51 nominations possible in this genre.

Furthermore, this clustering of nominations extended to whole award categories: the Wilma’s production of Scorched and People’s Light’s staging of the musical Cinderella each saw four female performers nominated for Outstanding Supporting Actress (in a play and musical, respectively); likewise, the Arden’s production of Something Intangible raked in three best actor nods.

Something doesn’t add up. While some might contend that a handful of shows emerged as clearly superior candidates in a mediocre season (despite notable oversights like Blackbird and Hamlet, among others), I’d argue that the clustering effect around these (and a few other) productions resulted from changes implemented this year to the Barrymore Awards voting system.

Out with the Old: How the nominating used to work

To understand what happened requires some background on the Barrymore Awards’ history. Started by the Theatre Alliance of Greater Philadelphia during the 1994-95 season, the Alliance first used nominators selected from the theatre community to decide the awards. In 2000, the Alliance switched from this simple system to a two-tiered approach of 40 to 50 nominators and 10 to 17 judges, the latter handpicked theatre professionals who formed a unit possessing hundreds of years of theatre-producing and theatre-going experience amongst them.

This now-discarded two-tiered system randomly assigned six nominators to see each eligible production within the first three days of its opening night. Within 24 hours, each filled out a ballot, giving either a “thumbs-up” or “thumbs-down” for every applicable category (such as “outstanding music direction”).

If a minimum of three out of the six nominators gave a thumbs-up in any one category, then that production became eligible for nomination in every category. To determine which aspects of a show (if any) should receive a nomination, all of the judges went and viewed that particular production. At the end of the season, the judges—who had seen every eligible production—then voted on the awards. The top five ballot-getters received nominations, with the winner determined by which show/performer/designer garnered the most of the judges’ votes.
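
To make those mechanics concrete, here is a minimal sketch of the old two-step logic in Python. The function names, ballot format, and data shapes are my own assumptions for illustration, not the Alliance’s actual procedure:

    from collections import Counter

    def production_is_eligible(nominator_ballots):
        """One dict per nominator, mapping a category (e.g. "outstanding music
        direction") to True (thumbs-up) or False (thumbs-down). A production
        becomes eligible in every category if at least three of the six
        nominators gave a thumbs-up in any one category."""
        categories = {cat for ballot in nominator_ballots for cat in ballot}
        return any(
            sum(ballot.get(cat, False) for ballot in nominator_ballots) >= 3
            for cat in categories
        )

    def decide_award(judge_votes, num_nominees=5):
        """judge_votes: the candidate each judge voted for in one category.
        The top five vote-getters become the nominees; the top vote-getter wins."""
        tally = Counter(judge_votes)
        nominees = [name for name, _ in tally.most_common(num_nominees)]
        return nominees, nominees[0]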

In with the New: From differential expertise to random voters

For the 2008-09 season, Margie Silvante, the Theatre Alliance’s new Executive Director, decided to eliminate the two-tiered system of nominators and judges, and replace it with a cadre of “voters.” Armed with a metrics-based standard of quantification, her new system randomly assigned eight voters (out of a pool of 62) to see each show, with each voter weighing in upon 12 to 20 productions out of the 130 eligible for consideration.

Within 24 hours after seeing an eligible show, each voter logged onto a website to post their scores for each of the applicable awards (for instance, “outstanding actor in a play”). The website’s ballot ranged from 0-20 (poor) to 86-100 (outstanding), and each voter cast a specific number score for each possible award, using categories like “poor” as rough-and-ready standards to guide their scoring. Under this new system, the top five scores in any award determined the nominations, with the top point-scorer ultimately winning the award (to be announced at the ceremony on October 5).
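
Based on that description, the tallying appears to reduce to something like the following sketch (Python). How the eight posted scores for a given candidate are combined isn’t specified above, so the simple average here is my assumption:

    from statistics import mean

    def rank_category(posted_scores, num_nominees=5):
        """posted_scores: dict mapping each candidate in one award category to
        the list of 0-100 scores posted by the voters assigned to that show.
        Combining the scores by simple averaging is an assumption."""
        averages = {cand: mean(scores) for cand, scores in posted_scores.items()}
        ranked = sorted(averages, key=averages.get, reverse=True)
        nominees = ranked[:num_nominees]     # top five scores become the nominations
        return nominees, nominees[0]         # top scorer wins the award

Note that nothing in this tally requires any two candidates to have been scored by the same eight people, a fact that drives much of what follows.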

In early 2008, Silvante announced these changes at a mid-season meeting of nominators and judges and stressed her desire to reintroduce integrity into the process and eliminate the prejudice of some judges. I had witnessed this bias at an earlier meeting when then-judge Alan Blumenthal admitted to Walnut Street Theatre’s Artistic Director Bernard Havard the judges’ past prejudice against the Walnut’s productions.

Silvante hoped that her new metrics-based system would eliminate this unfairness and enable greater rigor by introducing a method of quantification that could (in theory) draw upon the commonalities of judgment from a larger and more diverse pool of voters.

Considered Judgment versus The Wow Effect

But rather than produce greater integrity and rigor, the new process instead yielded a clustering of nominations unseen in previous years. Two competing hypotheses can explain this phenomenon; neither has anything to do with artistic merit.

To understand what happened, consider the new system’s process of assigning voters. Out of 62 randomly assigned voters, the chance that any eight of them saw a single show comes to 1 in 136 trillion. The chance that any single group of eight voters reunited to see another production amounts to 1 in 1.8 × 10^27. (The actual number is slightly less because of the cap put on the possible number of shows assigned to each individual voter.)
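
For anyone who wants to check the arithmetic, here is the calculation under two simplifying assumptions: the eight voters for a show are drawn uniformly at random from the pool of 62, and the per-voter cap is ignored. (The derivation of the figures above isn’t spelled out; 136 trillion matches the count of ordered draws of eight voters, while the number of distinct unordered panels of eight comes to about 3.4 billion.)

    # Back-of-the-envelope combinatorics; assumes uniform random assignment, no per-voter cap.
    from math import comb, perm

    ordered_draws   = perm(62, 8)    # 62*61*...*55, roughly 1.36e14 ("136 trillion")
    distinct_panels = comb(62, 8)    # unordered groups of eight, roughly 3.38e9

    # Chance that two independently assigned shows receive the identical panel of eight:
    p_same_panel_twice = 1 / distinct_panels
    # Chance that one pre-named group of eight is assigned to both of two given shows:
    p_named_panel_both = 1 / distinct_panels ** 2    # roughly 1 in 1.1e19

    print(f"{ordered_draws:.3e}  {distinct_panels:.3e}")
    print(f"{p_same_panel_twice:.3e}  {p_named_panel_both:.3e}")

However you slice the count, the point stands: the odds that the group deciding one award ever reconvened as a unit to decide another are vanishingly small.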

Under the old system, the chance that any grouping of judges not only all saw the same productions but saw every eligible production: 100 percent. The judges could compare performances, and thereby ensure a level of measured reflection and quality control that this new system lacks.

The new system, by contrast, requires that each voter post a score within 24 hours, without recourse to reflection, and without the frame of reference that seeing every other eligible production affords. As such, the evaluative process each voter employs must contend with his or her first impression of a performance and whatever overwhelming emotions—whether positive or negative—the production has elicited.

Because of this time constraint, I would assert that voters, taken as a whole, will tend to over-value an excellent production and fall victim to the “wow effect” just like anyone in the audience. (Other critics have cited this as the number one reason to postpone writing a review until one can fully collect his or her thoughts.)

Certain plays—those heavily indebted to spectacle, or capable of inducing powerful emotions in the audience—could take much greater advantage of this wow effect. The final unraveling of the mystery in Scorched packs an emotionally stunning revelation that few plays equal; walking out of the theatre, and even for the next 24 hours, the show’s conclusion would still leave one reeling. But a magnificent moment doesn’t necessarily make a magnificent show. And a common error—the fallacy of division, which assumes that what holds for the whole must hold for each of its parts—would see voters according greater weight to each performance in a show that elicited that effect.

The old two-tiered system of judges and nominators could actually put this “wow effect” bias to use. The judges would see plays overvalued by the nominators, and by not having to decide the ultimate merit of each production element on the spot, could temper their observations through evaluations of other performances. For the judges, what may have appeared overwhelmingly “outstanding” after a single viewing could, in a broader sense of what the community offered over an entire season, come into better perspective. (Don’t believe the “wow effect” exists? Judges have said to me on more than one occasion that they “can’t believe the nominators sent them to see such-and-such a show.”)

The Second Hypothesis: Mediocrity rears its nondescript head

Allowing the awards to be determined by the random distribution of voters who only see a handful of shows enables another likely—though far more invidious—possibility for the clustering of awards, which I’ll call the “mediocrity effect.”

While the new system hinges on a set of commonalities distributed evenly among 62 voters that could help quantify their choices, a rough-and-ready metric of five categories cannot eliminate personal judgments in assigning the scores.

Take any two critics seeing the same show. Presumably, Philadelphia Weekly’s J. Cooper Robb and I bring a commonality of background qualities to our roles as theatre critics. Yet, in his best-of-the-season roundup, he called Geoff Sobelle’s performance in Hamlet the year’s best; I thought it decorated with frills and lacking a central unifying quality. In the Barrymore voting system, Robb might have scored Sobelle’s performance a 95, where I would’ve chalked up a 70.

However, under the new system of scoring, Sobelle’s unique interpretation would have lost to any performance that consistently earned a vote of 83 (a score that itself falls below the 86-point cutoff for “outstanding”), since the average of a 95 and a 70 comes to only 82.5. To give another indication of how this could happen: when I was a nominator, actors (whom I won’t name) told me that they had auditioned for the role they now had to vote upon, didn’t agree with the choices made by the performer who was cast, and for that reason didn’t think it worthy of Barrymore consideration. And rather than eliminate the bias of the judges and restore integrity, this new system makes it possible for disgruntled voters to trash a performer’s rankings entirely.
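
A toy illustration of how those averages shake out (the 95 and 70 come from the Robb comparison above; the remaining numbers are invented):

    from statistics import mean

    # Two voters split sharply on a distinctive interpretation...
    print(mean([95, 70]))                              # 82.5 -- already below a steady 83

    # ...and on an eight-person panel, a single disgruntled score drags down an
    # otherwise strong consensus.
    print(mean([90, 90, 90, 90, 90, 90, 90, 25]))      # 81.875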

Moreover, statistics predicts that most rankings will cluster around a norm. (And even if the Awards process eliminated the highest and lowest score—as the Olympics adjusts the points for diving—this would actually further encourage regression to an average score.) Unfortunately, this new system of voting actually makes it possible that this “norm” enshrines mediocrity at the expense of more superlative work.
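
A quick simulation illustrates the point (the score distribution is invented for illustration: most voters land near 80, with an occasional wowed or disgruntled outlier). Dropping the highest and lowest of the eight scores pulls every panel’s result even more tightly toward the middle:

    # Toy simulation with an invented score distribution, not real Barrymore data.
    import random
    from statistics import mean, pstdev

    random.seed(0)

    def panel_average(trim=False):
        scores = []
        for _ in range(8):
            r = random.random()
            if r < 0.10:
                scores.append(random.uniform(92, 100))   # the "wow" vote
            elif r < 0.20:
                scores.append(random.uniform(25, 50))    # the disgruntled vote
            else:
                scores.append(random.gauss(80, 5))       # the unremarkable middle
        if trim:
            scores = sorted(scores)[1:-1]                # drop the highest and lowest
        return mean(scores)

    raw     = [panel_average(trim=False) for _ in range(10_000)]
    trimmed = [panel_average(trim=True)  for _ in range(10_000)]

    print("spread of raw panel averages:    ", round(pstdev(raw), 2))
    print("spread of trimmed panel averages:", round(pstdev(trimmed), 2))

Less spread means more panels landing on the same middling number, which is exactly the “mediocrity effect” at work.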

What the new system ultimately makes possible

I don’t write these comments to discredit any of the voters, many of whose opinions I respect, but to point out what types of outcomes a particular set of boundaries will make more likely. And knowing that all systems of measurement possess flaws that mandate trade-offs, I will not pretend that all of the voters can completely avoid well-established observational biases. I would instead opt to select systems that minimize the impact of each bias in turn.

And this all goes back to the way the voters are assigned. The new system produces only a 1 in 1.8 × 10^27 chance that the same eight voters ever reunited to evaluate another show. In all likelihood, the voters who cast their vote for Something Intangible never evaluated another show as a unit. Furthermore, the parameters of this new system encourage the “wow effect” and the “mediocrity effect” in such a way that not only makes each error possible but increases the likelihood of each occurring.

Because the new system lacks the method of self-correction and quality control that the judges provided in years past, it further compounds the effects of each error. Hence, you get clustering: either around shows that wowed the voters or around shows that contained enough reasonably good elements to ensure a high average, though not an outstanding one.

In either case, the less-than-24-hour reflections of eight individuals who hadn’t seen all the contenders (and not the same eight people for any single award) decided each and every award this year.

In a system with dual levels of quality control and far greater numbers of variables provided by the judges seeing every eligible production, this clustering effect would not be a statistical probability but would only happen for a show that was truly phenomenal. Hence, under the old system, only five shows in five years garnered 10 or more nominations, as opposed to four productions this year alone receiving that many. By contrast, the new system encourages the clustering of awards not out of any reason of artistic merit, but out of sheer probability alone.

Oh well, back to the drawing board.

See Parts II and III for more.