
‘Small Data’ Enabled Prediction of Obama’s Win, Say Economists

"Big data" from crowdsourcing resulted in more complex predictions.
Patrick Hummel and David Rothschild accurately predicted the vote shares President Obama would receive in all 50 states in the 2012 U.S. presidential election.

Nine months before the 2012 U.S. presidential election last November, economists knew the election’s outcome. They attribute their forecasting success not to the social network-sourced "big data" some in the media have suggested, but to "small data" that, they say, could have been worked out just as accurately with pencil and paper as with a computer.

The electoral process in the U.S. is somewhat complex, which complicates forecasting its outcome. The president and vice president are not elected directly by the voters; instead, the Electoral College, an institution made up of "electors" chosen by popular vote on a state-by-state basis, elects the nation’s top executives every four years. Each of the 50 U.S. states is entitled to as many electors as it has members of Congress (two senators plus a number of representatives proportional to the state’s population), and those electors are supposed to cast their votes for the winner of the popular vote in their state.

With all that in mind, economists say the 2012 presidential election simply came too early to rely on unproven data from sources like Facebook, Twitter, and search results. Instead, they used social media input to make more complex combination predictions—just as they anticipate doing for the next presidential election in 2016.

How data was used to predict the election’s results depended entirely on the forecasting methodology chosen; economists employed four basic types in 2012, says Justin Wolfers, a professor of economics and public policy at the University of Michigan.

He describes them this way:

  • The fundamentals method. This includes such factors as GDP growth, unemployment rate, and whether the candidate is an incumbent. "The model is a very simplistic, stripped-down one," says Wolfers, "and that is its virtue. Because we only get one presidential election every four years, we have a limited number of observations from which to draw. A more complicated model would require additional observations we just don’t have."
  • The polls method. This takes two broad forms: the simpler tracks a single prominent poll, such as the Gallup Poll, while a more sophisticated approach aggregates results from many pollsters (a rough sketch of that aggregation idea appears after this list). Statistician Nate Silver employed the latter method to correctly predict the winner in all 50 states and the District of Columbia. "It involves looking at every single poll, thinking about each poll’s biases, and realizing that the average of many polls will do better than any individual poll," says Wolfers. "Many regard this method as state of the art."
  • The prediction market method. Economists watch speculative markets in which current market prices can be interpreted as predictions of the election outcome.
  • The hybrid of polls and prediction market method. "The latest twist in election forecasting in which, instead of asking people who they intend to vote for, they are asked who they think will win," says Wolfers. "This method of polling—which, like a prediction market, aggregates the latent wisdom that exists in the broader population—worked so spectacularly well in 2012 that I expect it will play a much more prominent role in 2016."
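
As a rough illustration of the aggregation idea in the polls method above, the following sketch computes a precision-weighted average of several polls. The polls, sample sizes, and house-effect adjustments are all hypothetical; this is not a reconstruction of any particular forecaster’s model.

```python
import math

# Hypothetical state polls: (candidate share in %, sample size, assumed house effect in points).
polls = [
    (51.0, 800, +0.5),
    (49.5, 1200, -1.0),
    (50.5, 600, 0.0),
    (52.0, 1000, +1.5),
]

weighted_sum, weight_total = 0.0, 0.0
for share, n, house_effect in polls:
    adjusted = share - house_effect              # remove the pollster's assumed bias
    p = adjusted / 100.0
    se = 100.0 * math.sqrt(p * (1.0 - p) / n)    # sampling error in percentage points
    weight = 1.0 / se ** 2                       # precision weighting: larger, cleaner polls count more
    weighted_sum += weight * adjusted
    weight_total += weight

aggregate = weighted_sum / weight_total
combined_se = math.sqrt(1.0 / weight_total)
print(f"Aggregate estimate: {aggregate:.1f} +/- {1.96 * combined_se:.1f} points")
```

The combined standard error shrinks as more polls are added, which is why the average of many polls tends to beat any single poll.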

In Wolfers’ opinion, a year in advance of the election, the fundamentals approach works well while polls do not, because people have not started thinking about the election yet. Polls do a good job three months before the election, he says, but prediction markets do the best job regardless of when they are employed.

None of these methods involve what is commonly known as "big data," says Patrick Hummel, currently a research scientist at Google who developed a model for forecasting elections with David Rothschild, an economist at Microsoft Research and a Fellow of the Applied Statistics Center at Columbia University, during their time at Yahoo! Research.

Hummel describes the way they used data in their 2012 presidential prediction as simple linear regression: they first gathered historical data from earlier elections, such as economic indicators, presidential approval ratings, which party was in power and for how long, and biographical information about the candidates. Then he and Rothschild examined how the kinds of data that would be available nine months before the 2012 election had correlated with the results of those earlier elections.
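
A minimal sketch of that kind of regression appears below. The feature columns, numbers, and fitted coefficients are hypothetical stand-ins, not Hummel and Rothschild’s actual data or specification.

```python
import numpy as np

# Hypothetical historical rows: one per past election, with "fundamentals"
# known months before election day (all values are illustrative).
# Columns: GDP growth (%), incumbent approval (%), incumbent-party flag.
X_hist = np.array([
    [2.1, 49.0, 1.0],
    [0.8, 41.0, 1.0],
    [3.4, 55.0, 0.0],
    [1.6, 47.0, 1.0],
])
# Incumbent-party share of the two-party vote (%) in those elections.
y_hist = np.array([51.2, 46.5, 53.8, 50.1])

# Ordinary least squares: fit coefficients on the earlier elections, then
# apply them to the fundamentals observed nine months before the new one.
A = np.column_stack([np.ones(len(X_hist)), X_hist])   # add an intercept term
coef, *_ = np.linalg.lstsq(A, y_hist, rcond=None)

x_new = np.array([1.0, 1.9, 48.0, 1.0])               # intercept + this cycle's fundamentals
print(f"Predicted incumbent vote share: {x_new @ coef:.1f}%")
```

With only a handful of past elections to fit on, keeping the feature set this small is what makes the approach workable, which is the point Wolfers makes about the fundamentals model.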

In February 2012, they predicted President Obama would win the Electoral College with 303 electoral votes to Romney’s 235, forecasting every state correctly except for Florida, where they predicted Obama would lose (in fact, Obama won Florida with just 50.01% of the vote).

Hummel and Rothschild also accurately predicted the vote shares that President Obama would receive in all 50 states and, after the election, determined their median error in that prediction was 1.6 points.

"We are aware of 23 different polling organizations that made predictions of statewide vote shares in various states the day before the election," Hummel says, "and of those 23, there was only one that ended up with an average error that was less than 1.6 points."

Hummel and Rothschild’s dataset included hundreds of historical elections—the outcomes in all 50 states for each election over the last several decades—which totaled approximately 100,000 unique pieces of data.

"I wouldn’t classify that as big data," Hummel says, "which can involve as many as tens of billions of data points in one analysis. Our particular analysis, which could be done with pencil and paper, doesn’t come anywhere close to that."


While "big data" might not have been appropriate for predicting the presidency in 2012, it was exactly what was needed for making complex combination predictions, says David Pennock, principal researcher and assistant managing director at Microsoft Research New York City.

"If you’re just trying to predict the national winner, the small data model with just a few features—like economic variables, approval ratings, and so on—is the right way to go," Pennock explains. "But you really do need computer science when you’re dealing with computations between things—like the chance that the same party wins both Ohio and Pennsylvania, or the chance that whoever wins Ohio will win the whole country. That’s when you’ve got not just 50 things to predict, but two to the 50th, which is something like one quadrillion, if you count up all the combinations."

The 2012 election was the first in which economists were able to make such complex combination predictions, thanks not only to the increase in available computational power, but also to recently developed algorithms.

"For instance, we can ask what’s the likelihood of gas prices going up if Obama wins—or if Romney wins," Pennock says. "And what is the likelihood of taxes rising. Or, if Candidate A wins, is it likely we’ll be involved in more wars … or that the stock market will drop. These are the kinds of interesting predictions that help voters determine how the election’s outcome will affect them personally."

Pennock admits being skeptical about the ability of social media to predict elections, especially when there are better ways to forecast a winner. Instead, he—and other economists—used social media during the 2012 election to understand more subjective information, like people’s sentiments or reactions—the sort of information, he says, that cannot easily be obtained from other sources.

That is why Rothschild is gathering "tons of data from social media like Twitter and Facebook and search results" that he is not applying immediately, but intends to use in forthcoming research in 2013 and 2014.

"I just didn’t have the confidence yet to use social media to answer in any meaningful way the questions people are asking," he says. "It’s just too new and too complicated and there isn’t much of a historical track record for it."

Rothschild says he is excited about the promise of the new social media data that, come the next presidential election, he foresees economists will be able to use to answer very complex questions—and to provide the results in real time.

"The current data doesn’t have the granularity to allow us to do that yet," he says. "But the direction we’re moving in is to determine, say, what are the five things people want to know most—and then to be able to provide answers. Or, perhaps build a model where each person can input their own specific questions and then output the answers. That’s the promise of social media."

Currently, Rothschild is working with teams at both Bing (Microsoft’s Web search engine) and Xbox 360 (Microsoft’s video-game console) to prepare for what he calls "the next generation of actively collecting data to answer questions."

"People have being doing telephone polling for years," he says, "but that is becoming more and more expensive, especially since many people don’t have standard phones anymore and tend not to answer their phones when pollsters call."

Indeed, poll response rates are down from over 40% 20 years ago to less than 10% today.

Instead, he says, Xbox has a huge audience of engaged users who are eager to supply information—and so he has been working on new methods of polling them, particularly by making the process more enjoyable to participate in.

In late September, he and his team launched a test run on Xbox, averaging 20,000 five-question interviews daily, especially during the presidential debates.

"What we concluded was that we can make the product engaging so that people want to be in it and supply us information," he says, "and that we can make the results meaningful, even today in 2012."

Yet the real promise for forecasting in both the 2016 and 2020 presidential elections, says Rothschild, is being able to answer new questions in a way that is both quick and, perhaps, more personalized.

Indeed, agrees Pennock, "there’ll be more complex predictions, since that seems to have captured the public’s imagination. We won’t just be talking about the horse race and who will win, but what will happen to each voter regardless of who wins. Which is actually more relevant, I think."

Further Reading

Rothschild, D. and Wilson, C.
Obama poised to win 2012 election with 303 electoral votes: The Signal Forecast, February 16, 2012; http://news.yahoo.com/blogs/signal/obama-poised-win-2012-election-303-electoral-votes-202543583.html#OWqyiSA

Rothschild, D. and Wolfers, J.
Forecasting Elections: Voter Intentions versus Expectations, November 1, 2012; http://www.brookings.edu/research/papers/2012/11/01-voter-expectations-wolfers

Pennock, D.
A toast to the number 303: A redemptive election night for science, and The Signal, November 10, 2012; http://blog.oddhead.com/2012/11/10/signal-redeemed/

Hummel, P. and Rothschild, D.
Fundamental Models for Forecasting Elections, http://researchdmr.com/HummelRothschild_FundamentalModel.pdf

Nate Silver: Forecasting the 2012 Election, April 12, 2012; https://www.youtube.com/watch?v=P9dyDZsPPOE

Pres. Obama on Top in First Presidential Election Forecast, July 13, 2012; https://www.youtube.com/watch?v=t_I2VpS2JMY
