In announcing David Blei as the latest recipient of the ACM-Infosys Foundation Award in the Computing Sciences, ACM president Vint Cerf said Blei's contributions provided a basic framework "for an entire generation of researchers." Blei's seminal 2003 paper on latent Dirichlet allocation (LDA, co-authored with Andrew Ng and Michael Jordan while still a graduate student at the University of California, Berkeley) presented a way to uncover document topics within large bodies of data; it has become one of the most influential in all of computer science, with more Google Scholar citations than, for example, the earlier PageRank paper that launched Google. Blei's approach and its extensions have found wide-ranging uses in fields as varied as e-commerce, legal pretrial discovery, literary studies, history, and more. At Princeton since 2006, Blei begins this fall as a professor at Columbia University, with joint appointments in the two departments his work bridges, computer science and statistics.
The idea behind topic modeling is that we can take a big collection of documents and learn that there are topics inside that collectionlike sports or health or businessand that some documents exhibit a pattern of words around that topic and others don't. That idea really began in the late '80s and early '90s with latent semantic analysis (LSA), from people like Susan Dumais at Microsoft Research. Then Thomas Hoffman developed probabilistic latent semantic analysis (PLSA), taking the ideas of LSA and embedding them in a probability model.
This story is going to sound anticlimactic: I went to an internship to COMPAQ Research in Boston, and there I was working with Pedro Moreno; we were running speech recognizers on talk-radio programs, and then using PLSA on that noisy text output. I came back to Berkeley, where my officemate was Andrew Ng, and I told him I'd worked with the PLSA-based model. He said, "Something has always bothered me about PLSA." It was a technical thing, but as a consequence we wrote down three models, and LDA was one of them. In going from PLSA to LDA, the main difference is LDA could be used in a larger, more complicated model.
Did you have a sense that this was going to be big?
No. I did have an immediate sense that I enjoyed working on it: I remember fitting a model and seeing the topics pop out, and noticing qualitatively that they were very good, and being amazed by it. But we could never have foreseen the success.
What about in hindsightwhat contributed to its success?
I can see a couple things. In our first attempt at developing LDA, we had a complicated algorithm, and after maybe two years of work, we figured out an easier way, which made everything much faster, so we could handle much larger datasets. We were also working on it at a time when probabilistic models from a machine learning perspective were starting to become more mainstream, and LDA became an example that people could latch onto of how someone can use probability models to analyze data. All this coincided with the rise of a lot of unstructured dataand the sudden need to organize all that data.
What are some of your favorite applications of LDA?
I find it amazing where people are using it. It's used in some recommendation systems. I heard from a friend that it's used in detecting credit-card fraud. One of my favorite uses is in the digital humanities, which have picked up on LDA and topic modeling in general. These are historians and English professors and folklore scholars, and they want to find patternsrecurring themesthat they wouldn't have otherwise seen.
So topic modeling isn't just for analyzing big data
That's right. Imagine you're a historian, and you're studying 500 documents about a period of timethat's a modest number, but that's still a large number of documents to hold in your mind at a time. You can't ask the historian to give you a list of co-occurring words, because to a human it's not obvious, but it is to LDA. That's not to say the algorithm is doing a better job than a person, it's just an alternative view.
You also work on Bayesian nonparametric methods and approximate posterior inference
They're all related. Let's go back to topic modeling: LDA assumes that there are a fixed number of topics and that documents exhibit multiple topics. We look at those assumptions, get a big document collection, and we ask: under those assumptions, what are the topics that best describe how this collection came about?
"Bayesian nonparametrics expands the language of modeling to make it much more flexible."
What a Bayesian nonparametric model does, it expands what we can do with models. The number of topics is no longer held fixed, but grows in a natural way with the data: if there are a million documents, the Bayesian nonparametric model might find a thousand topics, and if there are more documents, it can add more topics if it needs them.
Bayesian nonparametrics can also be expanded to say that I think my documents have a tree of topics, which starts with general things like sports, and expands to specific things like hockey, baseball, etc. Now the problem of choosing the number of topics is much more difficult. Which tree structure do you choose? You have to try each possible tree structure, and it's an impossible task. But with Bayesian nonparametric methods we can say, "There's a tree, and I don't know what it looks like, but it's somehow responsible for explaining the data, so find me the tree structure as well." Bayesian nonparametrics expands the language of modeling to make it much more flexible.
Matt Connelly, a historian at Columbia, studies the history of diplomacy, and he has a dataset of all the cables sent between different diplomatic stations in the '70s. He could use LDA to analyze it, or he could say, "I know something about this," and I can sit down with him and build a topic model based on what he knows about the data and what he wants to uncover in the data. Finding the hidden structure that this data exhibits is a computational problem called inferencethe problem of computing the conditional distribution of the hidden variables given the observations. If you use LDA, the way the hidden variables look is different for each datasetif you use LDA on The New York Times, the hidden topics look like "war," and "sports," and "weather," and so onand if you use LDA on scientific articles, then you get "genetics" and "neuroscience" and "biology."
In the last few years in my group, we've been working on generic and scalable ways of doing inference. If Matt Connelly comes up with a model and it takes me two years to come up with the algorithm, he's going to lose interest. But if there's a procedure that works on lots of different models and quickly gives answers, then we'd be in business. That's approximate posterior inference.
We're also working on an interesting set of applied problems, often involving text but also other kinds of data. I'm collaborating with social scientists to study newspaper articles and build models of how perspectives on funding for culture change over time; working with biologists to study genetic data to understand the hidden ancestral populations exhibited in our genome, working with neuroscientists to study fMRI data to understand how text is somehow reflected in measurements of the brain while people are looking at words and texts.
Another big push in my group is user behavior data; we want to take data about which scientific articles people read and use that to understand how the articles are organized.
It's an extremely exciting time for machine learning because of the deluge of data. Statistics and machine learning haven't dealt with data of this size and this complexity, so I'm excited by the potential for doing exploratory data analysis.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from email@example.com or fax (212) 869-0481.