Just under a week ago my childhood friend Chris Wilson called me to ask how I thought one might go about probabilistically predicting which Oscar nominee would win best picture in 2019 (award show this Sunday) based on the 381 films that have been nominated for the award between 1950 and 2018. CW is the director of data journalism at TIME.com, and he had just assembled a data set that includes 47 candidate predictors in the form of other awards and nominations for all historical nominees.
Unfortunately, the deadline to explain a modeling approach and furnish a prediction was about 4 days from our first discussion.
Under the time constraint posed, I suggested a simple off-the-shelf logistic regression approach with ad-hoc two-stage model selection to choose predictors. First, we ran a series of univariate logistic regressions and grabbed the top several contenders (chosen based on odds ratios), followed by an exhaustive model search in the top set using BIC as the selection metric. The modeling exercise (and a few variants) were fairly unequivocal that there is no lock, but, relative to the historical body of data, a film with Roma’s accolades was the best bet. You can check out the article here.
The exercise got me thinking about the various time scales on which statistical activity is of interest. For a methodological development to be published in a statistics journal, we expect to take months, devise optimal procedures to address novel problems, and fear the timeline will extend to years due to the extensive peer review process. (An analysis like ours would be desk-rejected by this sort of venue.) A responsible data analysis using best existing practices for a study outside of statistics likely takes weeks in my experience, and the total time to acceptance for publication is probably only months (although this varies widely by field). I have consulted with businesses and other groups to answer data-analytic questions for internal use, which sometimes takes only a few weeks.
But data journalism has a window of days. I have some specific ideas about how to exploit the unique structure of the Oscar prediction problem, and these will take time to develop. But the Oscars are imminent. Every day we postpone publishing drastically reduces the value of the article. Nobody cares if we correctly predict a past winner. And next week, the world (which moves more quickly than the publication cycle of statistics journals) will surely provide a new data challenge with a short timeline. So, without time to invent the perfect power trimmer for the task, we instead tended our data hedge with the logistic regression machete. Maybe I can get some better ideas together for the next award show. But for now, Godspeed, Roma. I have never cared more about the fate of a film that I have not yet had the chance to see.