Information discovery has been transformed by the Web, and the impact on how information is gathered, published, and delivered has been profound. Web search is one form of discovery, and is preferred when a user has a specific objective, for example, finding directions to an address. It is less effective when users simply want to be informed about news that may be relevant to them, or to learn more about topics of interest. In these latter scenarios, the user experience depends crucially upon the quality of content recommendations. In contrast to search, the explicit signal about what the user wishes to see is much weaker, and the importance of a broad range of complementary indicators increases greatly. Novel techniques are required to best leverage a broad array of weak signals.
In this article, we present an overview of the content recommendation problem, namely how to recommend a small set of items to a user from an underlying pool of content items, in settings where the user does not explicitly express what is desired. In contrast to traditional media such as newspapers, every item shown on a page can be dynamically selected to reflect which items are currently the most engaging. In fact, the choice can be made taking into account a given user's interests, and even what they are looking at in the current session. Also, data-driven dashboards allow editors to program new stories to reflect trending topics and changes in the demographics of users visiting the site, and to monitor performance metrics in near real time. These changes are fundamentally altering how websites are designed, and indeed, how journalists and editors work.
The following issues must be considered in addressing the content recommendation problem:
Input signals. In building machine-learned models of what items a user is likely to engage with in a given context, we can draw upon many signals, including the content and source of each article, a user's interest profile (reflecting both long-term interests based on prior visits and short-term interests as reflected in the current session), and "popularity" indicators such as observed click-through rates or CTRs (the fraction of time the item is clicked on when a link to it is presented to users) and extent of social-sharing (for example, the number of times the item is tweeted or shared or "liked").
Objective to optimize. There are many objectives a website could choose to optimize for, including near-term objectives such as CTR and revenue per click, as well as long-term metrics such as increased time spent on the site, higher return and user retention rates, increase in social actions, and many others.
Algorithmic techniques. Algorithms must be developed to address a number of tasks.
Editorial tools: A complete recommendation framework must additionally provide tools for editors and journalists to: (1) observe in real time what items are most interesting (to which user segments on which parts of the website, at what times, from what locations, and on what devices), (2) quickly identify emerging trends in order to create additional content to meet these information needs, (3) constrain the algorithmic recommendations to follow guidelines associated with the site, and (4) tune the objective function and balance trade-offs (for example, maximize revenues while still keeping CTRs high and delivering personalized user experiences).
While creating user and item profiles and providing editorial tools are important aspects of content recommendation, we focus on scoring and ranking in this article. We argue that since user intent in content recommendation is significantly weaker than in applications like Web search (where a user-specified query serves as a strong signal of intent), expected click-through rate or CTR (appropriately weighted) is a more fundamental measure than semantic relevance in scoring items for a user in a given context. Also, ranking is not merely sorting items by scores: we must balance considerations like diversity (ensuring that users see a range of topics over time), serendipity (ensuring that we do not overfit our recommendations to the user, thereby limiting the discovery of new interests), and editorial voice (the general tone of the content associated with a given portal). Further, the objective function for ranking might be based on several criteria; for example, we may want to maximize the number of article shares while not sacrificing more than 10% of the achievable CTR.
In Table 1, we identify four types of portals: general, personal, domain-specific, and social network. General portals publish a broad range of content; the home page of a content network typically falls in this category. A personal portal allows a user to customize the page with desired content; a domain-specific portal publishes content related to a specific topic or domain, for example, sports; and a social network portal allows users to connect to other users and disseminate information through their network. Content modules published on these portals broadly fall into one of the following three categories:
Content recommendation techniques are important and in use by almost all major Web portals. Instead of providing a review of the specific techniques used by various portals, we review a broad framework that consists of two main technical components, scoring and ranking, in various application settings. We believe this provides a crisp mathematical formulation of the various use cases faced by Web portals in practice. The actual methods used can be adequately described by a combination of techniques within this framework.
Technical solutions to scoring and ranking in content recommendation modules depend on the goals of the application and the nature of available signals; Table 2 lists a number of typical scenarios. We start our review with a simple application setting, where the goal is to maximize clicks in a recommendation module that has a relatively small content pool and makes no use of user or context information. Although simple, this setting poses the challenge of finding the right balance between exploration and exploitation when estimating click-through rates (CTRs) of items using near real-time user click feedback. It also serves as a strong baseline method for non-personalized featured modules (FMs), especially for portals that employ editors to select or create a small set of high-quality content items. We then extend the application setting to personalized recommendation with a large content pool, which poses a data-sparsity challenge because the amount of click feedback at detailed user-interest levels is too little to support exploration of even a modest number of items. The key is to reduce dimensionality by leveraging user features and users' past activities on the portal. The techniques reviewed here are also useful for implementing personalized related-content modules (RMs) and network-update modules (NMs). After tackling the data-sparsity challenge, we discuss multiobjective ranking, which is important for RMs (where semantic relevance to the context page and CTR estimates need to be blended), and more generally, to optimize multiple objectives (for example, clicks, social actions, revenue, time spent on the portal).
The fundamental technical challenge in scoring is to estimate the "value" of an item for a given user in a given context. Although one could use semantic relevance between query-document pairs as the score (in content recommendation, a user-context pair plays the role of a query), a more appropriate measure is the expected click-through rate (CTR), since the main signal from users in content recommendation is whether they click the recommended items. Further, knowing the expected CTR of an item opens the door to a principled approach to ranking: weighting the CTR by the utility of a click on that item gives the expected utility. Hence, CTR (appropriately weighted) is the primary scoring function we consider in this article. While scores based on other signals like explicit feedback (for example, like/dislike) are useful in some applications and can be estimated by some of the reviewed techniques, they are not our main focus. If item CTRs were known, we could maximize clicks by always recommending the item with the highest CTR. But since CTRs are not known, a key task is to accurately estimate the CTR of each candidate item in the item pool, perhaps through an exploration process (displaying each item to some number of user visits). However, exploration has an opportunity cost of not showing items that are empirically better; balancing these two aspects constitutes the explore/exploit trade-off. Dynamic item pools and the non-stationarity of CTRs over time add further complexity.
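CTR-weighted scoring reduces ranking to computing an expected utility per item. A minimal sketch (the item names, CTRs, and per-click utilities below are invented for illustration):

```python
def expected_utility(ctr, utility_per_click):
    """Score an item as its expected CTR weighted by the value of a click."""
    return ctr * utility_per_click

# Hypothetical candidates: item -> (estimated CTR, utility per click).
candidates = {"a": (0.05, 1.0), "b": (0.02, 4.0)}

# Item "b" wins despite its lower CTR, because each click is worth more.
best = max(candidates, key=lambda j: expected_utility(*candidates[j]))
```

Note how the utility weight lets the same machinery optimize clicks (utility = 1 for every item), revenue, or time spent.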
The Explore/Exploit Trade-off. To obtain further intuition on the explore/exploit problem, consider a simplified setting with a content pool consisting of two items, where the goal is to obtain an optimal algorithm to maximize overall expected CTR for the next 100 user visits. Note that the solution space here is astronomical: there are 2^100 different possible recommendation sequences (more than 10^30!). This is similar in spirit to the classical multi-armed bandit problem, which considers how to dynamically allocate a single resource to alternative projects.32 Remarkably, an optimal solution exists and involves adaptively changing future decisions based on past feedback, as discussed by Gittins.11 It illustrates the fundamental "explore/exploit" trade-off between "exploit" (display the item that appears to be doing well so far) and "explore" (display the item that may appear inferior, but perhaps only due to an unlucky streak, to determine its true popularity). For instance, suppose the estimated CTRs after 20 visits for items 1 and 2 are 1/3 and 1/5, respectively. It is tempting to abandon item 2 and persist with item 1 for the remaining 80 visits, but that may not be optimal, since the true CTR of item 2 could be higher than the estimate, which is noisy due to the small sample size. The following references offer more details on multi-armed bandit problems.3,7,11
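The trade-off can be made concrete with a small simulation. The sketch below applies Thompson sampling (one standard bandit heuristic, not the Gittins-optimal policy) to a two-item pool, maintaining a Beta posterior over each item's unknown CTR; the true CTRs (1/3 and 1/5, echoing the example above) are hypothetical:

```python
import random

def thompson_choice(clicks, views):
    """Sample a plausible CTR for each item from its Beta posterior and
    pick the item whose sample is highest; uncertain items occasionally
    win the draw, which is what drives exploration."""
    samples = [random.betavariate(1 + clicks[j], 1 + views[j] - clicks[j])
               for j in range(len(clicks))]
    return max(range(len(samples)), key=lambda j: samples[j])

def simulate(true_ctrs, n_visits, seed=0):
    random.seed(seed)
    k = len(true_ctrs)
    clicks, views = [0] * k, [0] * k
    total = 0
    for _ in range(n_visits):
        j = thompson_choice(clicks, views)
        clicked = random.random() < true_ctrs[j]
        views[j] += 1
        clicks[j] += clicked
        total += clicked
    return total, views

# Two items whose CTRs (1/3 and 1/5) are unknown to the algorithm.
total_clicks, views = simulate([1/3, 1/5], 100)
```

Over the 100 visits, the posterior for the weaker item narrows enough that most traffic flows to the stronger one, without ever abandoning item 2 outright.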
For content recommendation, several assumptions required to obtain Gittins' optimal solution are violated. The item pool is not static and changes over time, the CTRs themselves could be non-stationary, and the click feedback is delayed (due both to the delay between display and a user's click, and to the delay in data transmission from Web servers to backend machines). But perhaps the most difficult issue is the curse of dimensionality introduced by the need to provide personalized recommendation from a large, dynamic content pool, and hence the dearth of experimental budget to estimate popularity at fine resolutions. Hence, content recommendation problems require different solutions and cannot be solved through classical multi-armed bandit schemes.
Here, we discuss technical approaches to solving the explore/exploit problem for content recommendation assuming CTR to be our score function.
The most popular content recommendation. We begin with the problem of recommending the most popular item on a single slot to all users to maximize the total number of clicks. Although simple, this problem of most popular recommendation includes the basic ingredients of content recommendation and also provides a strong baseline for more sophisticated techniques.
Estimating item popularity (CTR) involves several nuances. Ideally, the popularity of each item should be estimated by displaying the item to a representative sample of the current user population. For instance, serving the most popular items at night with popularity estimated using data from the morning may not perform well, due to differences in the user populations. Several other sources of bias can also affect popularity estimates when using retrospective data collected from existing systems. To protect against such bias, it is useful to update popularity estimates rapidly in an adaptive fashion through procedures like a Kalman filter,30 using data obtained through randomization, that is, randomly assigning some fraction of visits in a given time epoch to each item in the content pool. The optimal amount of randomization for each item can be computed by using extensions of bandit schemes.3
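One simple way to track a non-stationary CTR adaptively, in the spirit of (though much cruder than) the Kalman-filter updates cited above, is to exponentially discount old click and view counts from the randomized traffic before folding in each new epoch. A sketch, with invented counts and an illustrative discount factor:

```python
def update_ctr(prior_clicks, prior_views, new_clicks, new_views, discount=0.95):
    """One epoch of adaptive CTR tracking: decay the old counts so the
    estimate can follow a drifting popularity, then add the new epoch's
    randomized-traffic clicks and views."""
    c = discount * prior_clicks + new_clicks
    v = discount * prior_views + new_views
    return c, v, c / v

c, v = 1.0, 20.0          # smoothed starting counts (a weak prior)
for epoch_clicks, epoch_views in [(5, 100), (2, 100), (30, 100)]:
    c, v, ctr = update_ctr(c, v, epoch_clicks, epoch_views)
```

After the third epoch the estimate has moved most of the way toward the recent 30% click rate, whereas a plain running average would still be dragged down by the earlier epochs.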
Personalization and large content pools. A natural extension to most popular recommendation is to classify users into coarse segments based on attributes like demographics and geographic location, and then apply most popular recommendation techniques to each segment. Such an approach works well if segments are coarse and item affinities for users within a segment are relatively homogeneous. Techniques like clustering and decision trees13 can be used to identify such segments.
However, this segmented most popular recommendation approach only works when the number of user visits per segment available to explore each candidate item is "sufficiently" large to ensure reliable identification of the highest CTR items. It is therefore challenging to provide personalized recommendations from a large pool of items, since it may not even be possible to show each item to each user even once!
Explore/Exploit with Data Sparsity. Several applications require personalization with a large and dynamic content pool, and this contributes to data sparsity. Approaches to tackling sparsity include:
In the following, we first review dimensionality reduction techniques and then discuss how to apply bandit schemes to the resulting low dimensional user and item representations. Subsequently, we review methods for constructing such representations using online updates, and how to effectively initialize these online methods.
Dimensionality reduction techniques. A proper survey of dimensionality reduction goes beyond the scope of this article; we only cover a few representative approaches.
Grouping through hierarchies: In some scenarios, users and/or items are hierarchically organized. For instance, the city in which a user lives is nested within a state, which in turn is nested within a country (Figure 5). If such hierarchies do not exist, hierarchical clustering or decision tree learning13 may be applied to automatically create hierarchies from data.
Instead of exploring/exploiting individual items for each individual user, one can start with exploring/exploiting coarse content categories for coarse user segments and gradually transition to fine-grained user segments and items as we obtain more data.17,29
Dimensionality reduction through linear projections: Another popular approach is to work in a generalized linear model framework.28 In many applications, we have a rich set of user features such as demographics, geolocation, and behavioral activities such as searches. Let us denote by x_i = (x_{1i}, ..., x_{Mi}) the feature vector for user i. The number M of such features is typically large (thousands or even millions). A generalized linear model assumes the CTR of item j for user i is a monotone function of the linear combination x_i'β_j, where the unknown item coefficient vector β_j is estimated by using item interaction data across users. Although such a model reduces the CTR estimation problem from estimating a CTR for each (user, item) pair to estimating M coefficients for each item, it is still daunting when M is large. One approach is to project β_j into a low-dimensional space through a linear projection matrix B; that is, β_j = Bθ_j, where θ_j is low dimensional. The projection B can be estimated in an unsupervised fashion by using principal component analysis (PCA)13 on user features; a supervised approach that uses additional click-feedback information provides better performance.4
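The payoff of projecting the item coefficient vector β_j into a low-dimensional θ_j (with β_j = Bθ_j) can be seen directly: once the user's features are projected, scoring needs only the d coefficients of θ_j. A sketch (the dimensions, the random projection B, and all vectors are placeholders; in practice B would come from PCA or a supervised fit):

```python
import numpy as np

def sigmoid(z):
    """Monotone link function of the generalized linear model."""
    return 1.0 / (1.0 + np.exp(-z))

M, d = 1000, 10                  # raw feature dimension vs. reduced dimension
rng = np.random.default_rng(0)

B = rng.normal(size=(M, d)) / np.sqrt(M)   # stand-in projection matrix
theta_j = rng.normal(size=d)               # low-dimensional item coefficients
beta_j = B @ theta_j                       # implied full coefficient vector

x_i = rng.normal(size=M)                   # user i's raw feature vector
# Two equivalent ways to score: project the user, or use the full beta.
ctr_low = sigmoid((B.T @ x_i) @ theta_j)
ctr_full = sigmoid(x_i @ beta_j)
```

Only d parameters per item need to be learned (and explored), instead of M, which is what makes the bandit problem tractable.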
Grouping through collaborative filtering. In typical content recommendation applications, a certain fraction of users tends to interact frequently with the recommendation module. For such users, it is possible to derive features that capture item affinity based on past user interactions alone. It is also possible to derive such features for users who are less frequent by using techniques like collaborative filtering. For instance, based on data from the entire user population, one can estimate associations of the form: "users who like item A also like item B." The stronger these associations, the smaller the effective dimensionality, because they induce soft constraints on the joint distributions of CTR of all user-item pairs. Such approaches work well in practice even in the absence of other user and item features.1,8
Collaborative filtering approaches based on factor models currently provide state-of-the-art performance.19 The idea is to map user i and item j into the same Euclidean space as factor vectors u_i and v_j, respectively; user-item affinity is then given by the inner product u_i'v_j of u_i and v_j, which is a measure of similarity between the two. Unlike explicit interest categories, these groups are latent and are estimated from retrospective data; a small number (a few tens to hundreds) of factors usually provides good performance in applications.
Explore/exploit and dimension reduction. As we discussed earlier, obtaining an optimal explore/exploit solution for personalized recommendation using dimension reduction is non-trivial, and hence heuristics are often used in practice. The simplest solution is called ε-greedy. It serves an item selected at random with probability ε to each user visit and, with probability 1 - ε, it serves the item having the highest estimated CTR for that user (see Kakade16 and Langford20 for more details). The Upper-Confidence-Bound (UCB) scheme,7,21 which ranks items based on an overestimate of the mean (for example, mean + k·std, where k is some positive number and std is the estimated standard deviation), and Thompson sampling,37 which samples parameters from the posterior distribution, are other commonly used schemes.
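The ε-greedy and UCB selection rules can each be sketched in a few lines; the ε value, the multiplier k, and the binomial standard-error formula below are illustrative choices rather than prescriptions:

```python
import math
import random

def eps_greedy(ctr_hat, eps=0.1):
    """With probability eps explore uniformly at random;
    otherwise exploit the item with the highest estimated CTR."""
    if random.random() < eps:
        return random.randrange(len(ctr_hat))
    return max(range(len(ctr_hat)), key=lambda j: ctr_hat[j])

def ucb(clicks, views, k=2.0):
    """Rank items by an optimistic overestimate: mean + k * std error,
    so rarely shown items get a larger uncertainty bonus."""
    def score(j):
        if views[j] == 0:
            return float("inf")        # unseen items are tried first
        mean = clicks[j] / views[j]
        std = math.sqrt(mean * (1 - mean) / views[j])
        return mean + k * std
    return max(range(len(clicks)), key=score)
```

With ε = 0, ε-greedy degenerates to pure exploitation; UCB instead explores implicitly, since an item's bonus shrinks as its view count grows.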
Online CTR estimation. CTR estimation models for time-sensitive items need to be updated frequently using the most recent user feedback. Such online models start with an initial estimate of the model parameters and continuously update the parameters as new data becomes available. For instance, consider the factor model that predicts the CTR (on the log-odds scale) of item j for user i by u_i'v_j. Assume user factor estimates are more stable than item factor estimates; thus, only v_j needs to be updated in an online manner. From a Bayesian perspective, without any click feedback on item j, we postulate a prior distribution for v_j with mean μ_j and some variance. After receiving feedback (click or no-click), we update the prior to obtain the posterior distribution of v_j through Bayes' rule; the resulting posterior serves as the prior for subsequent updates. See Agarwal5 for an application of this technique to a Web portal.
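Exact Bayesian updating is intractable for a logistic click model, so practical systems use Gaussian approximations. The sketch below is one crude variant, assumed for illustration only: a diagonal-Gaussian posterior over the item factor v_j whose mean takes precision-scaled gradient steps and whose precision accumulates curvature. It illustrates the prior-to-posterior flow, not the exact method of Agarwal5:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def online_update(v_mean, v_prec, u, y):
    """One approximate Bayesian update of item factor v_j after observing
    click y (1/0) from a user with factor u. The posterior is kept
    Gaussian: the (diagonal) precision grows with the local curvature,
    and the mean takes a precision-scaled gradient step."""
    p = sigmoid(u @ v_mean)                 # predicted CTR before the event
    grad = (y - p) * u                      # gradient of the log-likelihood
    v_prec = v_prec + p * (1 - p) * u * u   # curvature accumulates
    v_mean = v_mean + grad / v_prec         # smaller steps as certainty grows
    return v_mean, v_prec

d = 5
v_mean = np.zeros(d)           # prior mean (e.g., from item features)
v_prec = np.ones(d)            # prior precision
u = np.ones(d) / np.sqrt(d)    # a user's factor vector
for y in [1, 1, 0, 1]:         # a short stream of click feedback
    v_mean, v_prec = online_update(v_mean, v_prec, u, y)
```

After the mostly positive feedback stream, the predicted CTR for this user rises above the 0.5 prior prediction, and the growing precision makes future swings progressively smaller.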
Initializing online models. For an item j that is new or has received little click feedback, CTR prediction depends crucially on the prior mean μ_j, which serves as the initial estimate. One may naively set μ_j to the same value for all items and users, say 0. This can be improved by initializing the item factors v_j through the feature vector z_j of item j as v_j = D(z_j) + η_j, where D is a multivariate regression function and the η_j are correction terms learning item-specific idiosyncrasies not captured through item features. Similar considerations apply to user factors. We refer the reader to Agarwal2 and Stern33 for more details.
After obtaining scores of some particular type (for example, CTR estimates) for each user-item pair, one could simply rank items by sorting scores. However, different types of scores (for example, CTR and semantic relevance) may have to be combined, and a number of competing recommendation objectives may have to be balanced. We first present two ranking scenarios to illustrate these nuances, and then discuss how to combine measures for multi-objective ranking.
One potential approach to combining multiple objectives is to compute a score quantifying the value of an item with respect to each of the aspects of interest and then combine these scores through a weighted average. A more general framework of multi-objective optimization can also be used.
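A weighted average of per-objective scores might look as follows; the objective names, score values, and weights are invented for illustration:

```python
def combined_score(scores, weights):
    """Weighted average of per-objective scores for one item.
    Both arguments are dicts keyed by objective name."""
    total_w = sum(weights.values())
    return sum(weights[k] * scores[k] for k in scores) / total_w

# Hypothetical per-objective scores for one item, and business weights.
item = {"ctr": 0.04, "relevance": 0.7, "revenue": 0.2}
w = {"ctr": 0.5, "relevance": 0.3, "revenue": 0.2}
s = combined_score(item, w)
```

In practice the individual scores would be normalized to comparable scales before averaging; the weighted average is the simple baseline that the multi-objective programming formulations discussed next generalize.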
Multi-objective optimization. Balancing multiple objectives has broad applicability beyond the two scenarios we just presented. For example, it is often desirable to consider various weighted CTR objectives that measure different expected post-click utilities, such as ad revenue, time spent, and likelihood of sharing. Multi-objective optimization provides an attractive mathematical framework for achieving such trade-offs. Although there is a rich and mature literature on multi-objective programming,34 its application to content optimization is fairly new. For instance, Agarwal et al.6 use this approach to simultaneously optimize CTR and time spent reading an article to recommend content links on the Yahoo! homepage. Two recent studies14,35 applied multi-objective optimization in the area of collaborative filtering.
As discussed earlier, an optimal explore/exploit solution for personalized content recommendation is still elusive. Existing approaches are based on heuristics, so no single strategy necessarily dominates, and the actual solutions are application dependent. As such, it is essential to have a rigorous and principled approach to empirically evaluate the quality of solutions.
Strategies to evaluate the performance of an algorithm depend on the aspect we are interested in quantifying. In general, we evaluate two aspects of algorithms used in content recommendation: out-of-sample prediction accuracy and overall recommendation performance.
Evaluating predictive accuracy. A prediction algorithm predicts an unobserved quantity. For example, algorithms that try to predict the CTR of an item by a user fall into this category, including methods discussed earlier in this article. Standard machine-learning or statistical evaluation methods can be used to evaluate the accuracy of predictions. A common approach is to collect a click log from the system that records whether users clicked items shown to them, and then split the log data into a training set and a test set. The training set is used to build a model, which is then evaluated using the test set. Commonly used metrics include the log-likelihood of the model on the test data, the precision-recall curve, the ROC curve, and the area under the ROC curve (AUC).38 We note that algorithms that predict quantities other than CTR (for example, semantic relevance, time spent) can also be similarly evaluated as long as ground-truth data is obtainable (for example, by human judgment or by processing system logs).
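AUC, for instance, can be computed directly from a held-out click log as the probability that a randomly chosen clicked impression is scored above a randomly chosen unclicked one (ties counting half). A small self-contained sketch, with made-up labels and predicted CTRs:

```python
def auc(labels, scores):
    """Area under the ROC curve via its rank-statistic form: the
    probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Held-out click log: 1 = click, 0 = no click, with predicted CTRs.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.2, 0.6, 0.65, 0.1]
result = auc(labels, scores)   # 5 of 6 positive/negative pairs are ordered correctly
```

This quadratic-time form is fine for illustration; production evaluations use the equivalent sort-based computation over millions of impressions.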
Evaluating recommendation performance. Overall recommendation performance is typically measured by using a notion of "reward" (for example, total number of clicks) that the system receives from its users. Explore/exploit algorithms and ranking algorithms (both of which use the output of some prediction algorithms to decide which items to show to each user) fall into this category. As long as the reward is measurable, algorithms can be compared through online experiments, in which random sub-populations of users are subjected to different serving algorithms to adjust for population bias in measuring performance improvements. Such experiments are frequently conducted by Web portals, and are usually referred to as bucket tests or A/B tests. See Kohavi18 for a survey.
Bucket tests are expensive and may be risky, since an inferior serving algorithm may degrade the user experience. It is therefore desirable to evaluate an algorithm using retrospective data collected from the system, instead of experimenting with live traffic. Offline evaluation reduces experimental cost and also facilitates research by the broader community, many of whom may not have access to experimental infrastructure. However, it can be difficult to obtain an unbiased estimate of the reward from retrospective data. For example, if an item has never been shown to a user in the retrospective data, estimating the number of clicks that would be generated when the algorithm recommends the item is difficult. If we base the evaluation only on the items that have been shown to the user by the current "serving algorithm," then the evaluation is biased. Fortunately, as long as the historical algorithm has a non-zero probability of showing each item to each user, the bias can be removed by using importance-sampling techniques. For example, Li et al.22 developed such an unbiased evaluation method for explore/exploit algorithms, and empirically showed that the reward of an algorithm estimated using retrospective data correlates well with its actual reward in an online bucket test.
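A minimal sketch of such an importance-sampling estimator (inverse-propensity scoring over a uniformly randomized log; a simplification for illustration, not the exact replay method of Li et al.22):

```python
def offline_reward(policy, log, item_prob):
    """Unbiased offline estimate of a policy's per-visit reward from a
    log collected under a randomized serving policy. Each log entry is
    (context, shown_item, click); item_prob(context, item) is the
    logging policy's probability of having shown that item. Events where
    the candidate policy agrees with the log are reweighted by 1/p."""
    total, n = 0.0, 0
    for context, shown, click in log:
        n += 1
        if policy(context) == shown:
            total += click / item_prob(context, shown)
    return total / n

# Toy log from a uniform-random logger over items {0, 1} (p = 0.5 each).
log = [("u1", 0, 1), ("u1", 1, 0), ("u2", 0, 0), ("u2", 1, 1)]
always_item_1 = lambda ctx: 1
est = offline_reward(always_item_1, log, lambda ctx, item: 0.5)
```

The 1/p reweighting compensates for the fact that only about half the logged visits happen to match the candidate policy's choice, which is what removes the serving bias.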
Data for research purposes. To facilitate further research in this area, it is important to have benchmark datasets available for evaluating new algorithmic approaches. Several such datasets are available to test predictive accuracy of new approaches,12,15,40 while not many datasets are available to evaluate the overall recommendation performance. Fortunately, Yahoo!39 provides the first such dataset.
More such datasets are required to facilitate participation by a large number of researchers in this area. This is a challenging task since these datasets are most easily generated by large Web companies but such companies are often constrained by concerns over user privacy. Further collaboration between academic and industrial research is necessary.
Content recommendation techniques have been adopted by every major Web portal to increase user engagement.27 We have presented a rigorous technical framework covering all real use cases that we are aware of, rather than a survey of individual methods used by various portals. The actual methods used can be adequately described by a combination of techniques within this framework.
Several portals use methods that are close to the related-item recommendation techniques described earlier; typically, they use a combination of collaborative filtering and semantic similarity. Such techniques have been successfully used in e-commerce (for example, Amazon24) and entertainment (for example, deezer.com). One of the earliest uses of these techniques for content recommendation was by Findory23 to recommend news articles based on semantic similarity between a user's content profile (obtained through historical reads) and article content, blended with a collaborative filtering component that recommends articles read by a user to others with similar profiles. More recently, Das et al.9 describe an online collaborative filtering framework to recommend news on Google News, using pure collaborative filtering (semantic relevance is not incorporated). In a subsequent paper,25 content-based user profiles are used to enhance the collaborative filtering algorithm and popularity estimates. The YouTube video recommendation system10 scores a candidate pool of videos in a multi-objective framework with respect to video quality, user specificity, and diversity, via a linear combination. The candidate pool of videos for a user is generated by considering related videos (based on co-visitation patterns within the same session) in relatedness graphs that are at most n hops from the seed set (which consists of videos previously watched, commented on, shared, or liked by the user).
Several portals recommend articles using some estimates of article popularity. For instance, Szabo and Huberman36 describe methods to estimate long-term article popularity based on historical data. Agarwal et al.5 showed that this approach suffered from the presence of serving bias in historical data, and used popularity estimation methods based on randomized data to recommend articles on the Yahoo! front page. Several other portals including AOL, LinkedIn and MSN have modules on their home page that rely on methods to estimate article popularity. Factor models to personalize recommendations are known to be used by Netflix19 for movie recommendations and Yahoo! for personalized content recommendations.2
In this article, we described the content recommendation problem and observed that it is sufficiently different from search and ad targeting to merit careful study and distinct technical approaches. We presented a rigorous framework that allows us to capture current recommendation approaches, used widely on several Web portals.
However, challenges and open problems remain, including the following:
1. Adomavicius, G. and Tuzhilin, A. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Trans. on Knowledge and Data Engineering 17 (June 2005), 734–749.
10. Davidson, J., Liebald, B., Liu, J., Nandy, P., Van Vleet, T., Gargi, U., Gupta, S., He, Y., Lambert, M., Livingston, B. and Sampath, D. The YouTube video recommendation system. In ACM Conference on Recommender Systems (2010). ACM Press, New York, 293–296.
12. GroupLens Research. Movielens Data Sets; http://www.grouplens.org/node/73.
15. Kaggle. http://www.kaggle.com.
18. Kohavi, R., Longbotham, R., Sommerfield, D. and Henne, R. Controlled experiments on the Web: Survey and practical guide. Data Mining and Knowledge Discovery 18 (2009), 140–181; 10.1007/s10618-008-0114-1.
21. Li, L., Chu, W., Langford, J. and Schapire, R. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web (2010). ACM Press, New York, 661–670.
22. Li, L., Chu, W., Langford, J. and Wang, X. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the 4th ACM International Conference on Web Search and Data Mining (2011). ACM Press, New York, 297–306.
23. Linden, G. http://glinden.blogspot.com/2008/01/brief-history-of-findory.html.
24. Linden, G., Smith, B. and York, J. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing 7, 1 (2003), 76–80.
27. Mondaynote. http://www.mondaynote.com/2009/02/15/recommendation-engines-a-must-for-news-sites, 2009.
35. Svore, K.M., Volkovs, M.N. and Burges, C.J. Learning to rank with multiple objective functions. In Proceedings of the 20th International Conference on World Wide Web (2011). ACM Press, New York, 367–376.
39. Yahoo! Academic Relations. R6A - Yahoo! Front Page Today Module User Click Log Dataset, Version 1.0; http://webscope.sandbox.yahoo.com, 2012.
40. Yahoo! Academic Relations. Yahoo! webscope rating datasets; http://webscope.sandbox.yahoo.com/catalog.php?datatype=r, 2012.
©2013 ACM 0001-0782/13/06