How can you tell where in the United States people are most satisfied with their lives?
You can ask residents across a range of counties, as the Centers for Disease Control and Prevention (CDC) does as part of a national phone survey. You can also make a crude guess based on easy-to-find proxies: an area’s income and education levels, for example, tend to go up and down with life satisfaction. These days, you can also try picking up on what people in different places are saying day in and day out through social media like Twitter. This last approach poses a natural language processing challenge, yet using techniques from machine learning, scientists recently managed to get surprisingly good readings from nearly a billion tweets and, what’s more, to derive new insights that no phone survey can provide.
The researchers, led by computer scientist Andrew Schwartz and psychology doctoral student Johannes Eichstaedt of the University of Pennsylvania’s World Well-Being Project, used what they call an "open-vocabulary" approach to extracting life satisfaction information from Twitter. That approach contrasts with prevailing methods, which deploy canned lists of words that clearly relate to happiness. Sentiment analysis, for example, has traditionally taken this dictionary-based approach. "What a lot of people do is they simply count the relative frequency of positive and negative emotion words over time or space," Eichstaedt explains, "and they say ‘look—this is the happiest state.’" However, what if there’s more to be learned from other, non-obvious words that correlate with happiness?
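To make the dictionary-based baseline concrete, here is a minimal Python sketch of the kind of counting Eichstaedt describes. The two word lists, the state names, and the sample tweets are invented for illustration; real sentiment lexicons are far larger, and this is not the code any particular study used.

```python
import re

# Hypothetical mini-lexicons; real sentiment dictionaries are far larger.
POSITIVE = {"happy", "great", "love", "fun"}
NEGATIVE = {"sad", "hate", "bored", "tired"}

def emotion_score(tweets):
    """Return (positive - negative) emotion-word count as a share of all words."""
    words = [w for t in tweets for w in re.findall(r"[a-z']+", t.lower())]
    if not words:
        return 0.0
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / len(words)

# Hypothetical tweets grouped by place.
tweets_by_state = {
    "State A": ["Love this great hike", "so happy today"],
    "State B": ["tired and bored again", "I hate Mondays"],
}
scores = {state: emotion_score(tweets) for state, tweets in tweets_by_state.items()}
happiest = max(scores, key=scores.get)
```

Anything outside the two lists contributes nothing to the score, which is exactly the limitation the open-vocabulary approach is meant to address.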
To find candidates for such words, the researchers began by running a machine-learning algorithm called latent Dirichlet allocation (LDA) on a large set of Facebook status updates to generate a list of 2,000 topics that show up in social media. An LDA topic is a list of semantically similar words, Schwartz says, whether or not these have anything to do with life satisfaction. For example, among the 2,000 topics, LDA identified a tooth topic (consisting of words like teeth, dentist, and mouth) and a Super Mario topic that includes the words playing, bros, Wii, and Nintendo, among others. All these topics can then be used on any data set, Schwartz says, so the researchers could go on to extract the topic words from Twitter; since many of the nearly billion tweets they sampled were tagged with location data, the scientists could see how often each of these topics appears in a given county.
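Once topics exist, measuring how often each one appears in a county comes down to counting. The Python sketch below uses the two example topics named above and treats every topic word equally; an actual LDA topic assigns a probability to each word, which this simplification ignores. The county names and tweets are made up.

```python
import re
from collections import Counter

# Two topics from the article, reduced to plain word sets.
# Real LDA topics weight each word; this sketch does not.
TOPICS = {
    "tooth": {"teeth", "dentist", "mouth"},
    "super_mario": {"playing", "bros", "wii", "nintendo"},
}

def topic_usage(tweets):
    """Share of a county's words that belong to each topic."""
    words = [w for t in tweets for w in re.findall(r"[a-z]+", t.lower())]
    counts = Counter(words)
    total = len(words)
    return {
        name: (sum(counts[w] for w in vocab) / total if total else 0.0)
        for name, vocab in TOPICS.items()
    }

# Hypothetical geotagged tweets.
county_tweets = {
    "County X": ["Playing Mario Bros on the Wii", "dentist at noon"],
    "County Y": ["my teeth hurt", "mouth is numb after the dentist"],
}
usage = {county: topic_usage(tweets) for county, tweets in county_tweets.items()}
```

Each county ends up with one number per topic, and those numbers are the features a later model can relate to life satisfaction.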
Because LDA generates topics automatically, it brings no preconceptions about which topics might be related to life satisfaction, making it a good starting point for the discovery of latent variables. For example, if Super Mario words had consistently shown up more often in tweets from counties with high life satisfaction (sadly for gamers, they didn’t), the researchers could have concluded that this topic held a clue to subjective well-being. (Several outdoor and nature-related topics, on the other hand, did ultimately correlate with life satisfaction.)
To discover such hidden associations, the researchers used a training set consisting of tweets from 75% of the counties (setting aside the rest for later testing) to produce the optimal function relating a county’s use of language topics to the county’s life satisfaction score. First, their program extracted topic words from all the counties. Then it ran repeated correlation tests between these words and each county’s life satisfaction. Finally, to create a model that most accurately predicts life satisfaction at the county level, the researchers used a machine-learning algorithm that performs a Lasso regression to find the best set of weighting coefficients for all the topics and all the counties at once. The resulting function, Schwartz says, "works for one given county, but to produce it we tried to optimize it so it works best across all counties."
To see how well their function actually predicts life satisfaction, the researchers tested it on the tweets from the remaining 25% of counties.
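The train-then-test procedure can be sketched in a few dozen lines of Python. Everything here is an assumption made for illustration: the counties are synthetic, there are only three topic features, the penalty `lam` is picked by hand, and the coordinate-descent solver is a textbook version of Lasso rather than the researchers' own software. Only the 75/25 split mirrors the article.

```python
import random

def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso_fit(X, y, lam, n_iter=100):
    """Minimize (1/2n)*||y - b - Xw||^2 + lam*||w||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centered features
    resid = [yi - y_mean for yi in y]  # residual when all weights are zero
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            col = [Xc[i][j] for i in range(n)]
            z = sum(c * c for c in col) / n
            if z == 0.0:
                continue
            # Covariance of feature j with the residual that excludes feature j
            rho = sum(c * (r + w[j] * c) for c, r in zip(col, resid)) / n
            new_w = soft_threshold(rho, lam) / z
            resid = [r - (new_w - w[j]) * c for c, r in zip(col, resid)]
            w[j] = new_w
    b = y_mean - sum(wj * mj for wj, mj in zip(w, x_mean))
    return w, b

def pearson(a, b):
    """Pearson correlation between two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

# Synthetic counties: three topic-usage features, one of them irrelevant.
random.seed(0)
counties = []
for _ in range(40):
    x = [random.random() for _ in range(3)]
    score = 2.0 * x[0] - 1.0 * x[2] + random.gauss(0, 0.1)
    counties.append((x, score))

random.shuffle(counties)
split = int(0.75 * len(counties))  # 75% of counties for training
train, test = counties[:split], counties[split:]

w, b = lasso_fit([x for x, _ in train], [s for _, s in train], lam=0.01)
predicted = [b + sum(wj * xj for wj, xj in zip(w, x)) for x, _ in test]
r = pearson(predicted, [s for _, s in test])  # accuracy on held-out counties
```

The L1 penalty is what makes Lasso a natural fit for a problem with thousands of topics: it tends to push the weights of uninformative topics to exactly zero, leaving a smaller set of topics that carry the prediction. Scoring the model only on counties it never saw is what guards against a function that merely memorizes its training data.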
The function performed remarkably well. Using LDA topics alone, it outperformed the dictionary-based approach. And though the topics alone didn’t predict county-level life satisfaction as well as did demographic and socioeconomic variables, something extraordinary happened when the researchers combined their topics data with demographic and socioeconomic statistics: the predictive accuracy was higher than for either measure alone, suggesting that Twitter contains meaningful information that statistical measures don’t fully capture. Indeed, the color-coded U.S. life satisfaction map that emerged from this combination looks astoundingly similar to the map derived from the CDC data. The correlation (of .535) is surprisingly strong, according to Cornell University psychologist Jeffrey Hancock, who works with computer scientists to study large data sets in his research on detecting deception. "I wouldn’t even hazard to guess how much it cost to collect the CDC data, and here people are talking for free about their daily lives and you’re getting these incredibly strong correlations with survey data," he says.
This level of predictive accuracy is especially remarkable given that Twitter users, unlike CDC survey respondents, tend to skew young. As Schwartz puts it, "Even though we’re not capturing a representative sample, the signal from this sample does seem to be capturing what the signal would be from a representative sample."
Even more interesting are the insights coming from those LDA topics that correlated highly with life satisfaction. For example, one might expect money to be part of the picture, particularly given the known relationship between county-level income and life satisfaction. Yet it turns out it is not language related to money per se that’s correlated with happiness; as LDA showed, it was donating, support, charity, and similar philanthropic words.
"There’s a popular perception that Twitter tweets are superficial and mundane—that it’s people talking about what they had for breakfast," Hancock says. "This study demonstrates that that’s not the case—or if it is, the mundane is deeply connected to our happiness."
Based in San Francisco, Marina Krakovsky is the co-author of Secrets of the Moneylab: How Behavioral Economics Can Improve Your Business.