Probabilistic sequence models have become so pervasive in science and engineering that we often act as if the sequence modeling problem is basically solved. On the Internet, such models power the large-scale systems for statistical machine translation, such as Google Translate, that receive millions of queries each day. These systems use a probabilistic sequence modela language modelto select among candidate translations in the target language. For example, consider translating the English phrases "to show interest" and "to charge interest" into Portuguese. In the first phrase, the word "interest" should be translated as "interesse" (curiosity) while in the second it should be translated as "juro" (return on lending). In each case, a Portuguese language model would assign higher probability to the correct translation based on the context established by the preceding verb. Today, such language models are derived from the statistics of large corpora with as many as 1010 sentences of text. Moreover, in practical systems for machine translation or speech recognition, they may assign probabilities to sequences that contain as many as 106 distinct words.
The history of probabilistic sequence models, too long and complex to review here, dates back to Markov at the turn of the last century. To a first approximation, today's practice is to memorize a subset of cw patterns occurring in the training data, where w is a token (for example, Unicode point, DNA base, word) and c is a context of preceding tokens. The statistics of these patterns are then used to estimate the probability of a token appearing after a given sequence of tokens. For modelers, the critical modeling questions are how to select which patterns to store, and from these patterns, how to estimate the probability that a token w follows a context c when w has never or rarely been seen in that context. Though informed by decades of research, current practices are still something of a black art. They often reflect the particular data structures and algorithms used to create sequence models and, for large-scale systems, the need for distribution implementations. We practitioners seem for the most part resigned to these complications, although naturally we wonder if there are more elegant ways to manage the balance between memorizing old patterns and generalizing to new ones.
The Sequence Memoizer (SM) detailed in the following paper addresses this balance in a novel way by extending several previous ideas: suffix trees for storing prediction contexts of unbounded length;2 factorable priors for integrating contexts of different lengths into a final probability estimate;5 and nonparametric Bayesian methods for incorporating the uncertainty over which contexts best predict the next token.4 The authors had the critical insight to combine a particular form of hierarchical Pitman-Yor process with standard techniques for linear-time, context-tree creation. This combination yields the first linear-time, unbounded-context sequence model based on principled Bayesian techniques.
How well does it work? On standard test collections using standard metrics, the SM matches or outperforms all previous methods that store the same context statistics. Why does it work so well? One likely reason is that the hierarchical Pitman-Yor process is better at modeling the power law statistics of natural sequence data: in many types of sequences, patterns that are individually rare still account for a substantial fraction of all observed patterns. By modeling these statistics, the SM is better able to integrate the predictions from contexts of different lengths.
The SM is an important advance in probabilistic sequence modeling that finally makes nonparametric Bayesian methods practical for large data sets. Is sequence modeling therefore solved? Hardly. First, there remains a bit of black art in the SM, whose probability estimates cannot be computed in closed form but must instead be approximated with stochastic (Monte Carlo) methods for inference. Not only do these methods require some empirical tuning, but they also create a trade-off between predictive power and computational complexity. We lack any theoretical understanding of this trade-off in SMs when stochastic inference is taken into account. Finally, for many applications, long contiguous patterns of tokens seem a very inefficient representation of context. For instance, natural language text often has a hierarchical structure, where relatively distant words (say, the main verb and the head noun of the direct object) may be strongly correlated while the words in between are less correlated to either;1 similarly, DNA sequences can exhibit strong correlations between bases in two consecutive exons but much weaker correlations with the bases in the intervening intron. Furthermore, in certain contexts it seems a reasonable approximation to interchange sets of tokens or token sequences that belong to particular syntactic or semantic classes.3 An interesting question is whether more refined nonparametric Bayesian models can capture these statistical properties.
3. Lin, D. An information-theoretic definition of similarity. In Proceedings of the 15th International Conference on Machine Learning (1998), 296304; http:dblp.unitrier.de/.
©2011 ACM 0001-0782/11/0200 $10.00
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from email@example.com or fax (212) 869-0481.
The Digital Library is published by the Association for Computing Machinery. Copyright © 2011 ACM, Inc.
No entries found