As the distribution of video over the Internet is becoming mainstream, user expectation for high quality is constantly increasing. In this context, it is crucial for content providers to understand if and how video quality affects user engagement and how to best invest their resources to optimize video quality. This paper is a first step toward addressing these questions. We use a unique dataset that spans different content types, including short video on demand (VoD), long VoD, and live content from popular video content providers. Using client-side instrumentation, we measure quality metrics such as the join time, buffering ratio, average bitrate, rendering quality, and rate of buffering events. We find that the percentage of time spent in buffering (buffering ratio) has the largest impact on the user engagement across all types of content. However, the magnitude of this impact depends on the content type, with live content being the most impacted. For example, a 1% increase in buffering ratio can reduce user engagement by more than 3 min for a 90-min live video event.
Video content already constitutes a dominant fraction of Internet traffic today, and several analysts forecast that this contribution is set to increase in the next few years.1, 15 This trend is fueled by the ever-decreasing cost of content delivery and the emergence of new subscription- and ad-based business models. Premier examples are Netflix, which has now reached 20 million US subscribers, and Hulu, which distributes over one billion videos per month.
As Internet video goes mainstream, users' expectations for quality have dramatically increased; for example, when viewing content on TV screens anything less than "SD" quality (i.e., 480p) is not acceptable. In the spirit of Herbert Simon's articulation of attention economics, the overabundance of video content increases the onus on content providers to maximize their ability to attract users' attention.18 Thus, it becomes critical to systematically understand the interplay between video quality and user engagement. This knowledge can help providers to better invest their network and server resources toward optimizing the quality metrics that really matter.2 However, our understanding of many key questions regarding the impact of video quality on user engagement "in the wild" is limited on several fronts:
This paper is a first step toward answering these questions. We do so using a unique dataset of client-side measurements obtained over 2 million unique video viewing sessions from over 1 million viewers across popular content providers. Using this dataset, we analyze the impact of video quality on user engagement along three dimensions:
To identify the critical quality metrics and to understand the dependencies among these metrics, we employ well-known techniques such as correlation and information gain. We also augment this qualitative analysis with regression techniques to quantify the impact. Our main observations are
These results have important implications on how content providers can best use their resources to maximize user engagement. Reducing the buffering ratio can increase the engagement for all content types, minimizing the rate of buffering events can improve the engagement for long VoD and live content, and increasing the average bitrate can increase the engagement for live content. However, there are also trade-offs between the buffering and the bitrate that we should take into account. Our ultimate goal is to use such measurement-driven insights so that content providers, delivery systems, and end users can objectively evaluate and improve Internet video delivery. The insights we present are a small, but significant, first step toward realizing this vision.
We begin this section with an overview of how our dataset was collected. Then, we scope the three dimensions of the problem space: user engagement, video quality metrics, and types of video content.
2.1. Data collection
We have implemented a highly scalable and available real-time data collection and processing system. The system consists of two parts: (a) a client-resident instrumentation library in the video player and (b) a data aggregation and processing service that runs in data centers. Our client library gets loaded when Internet users watch video on our affiliates' sites and monitors fine-grained events and player statistics. This library collects high fidelity raw data to generate higher level information on the client side and transmits these in real time with minimal overhead. We collect and process 0.5TB of data on average per day from various affiliates over a diverse spectrum of end users, video content, Internet service providers, and content delivery networks.
Video player instrumentation: Figure 1 illustrates the lifetime of a video session as observed at the client. The video player goes through multiple states (connecting and joining, playing, paused, buffering, stopped). For example, the player goes to paused state if the user presses the pause button on the screen, or if the video buffer becomes empty then the player goes into buffering state. By instrumenting the client, we can observe all player states and events and also collect statistics about the playback quality.
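The player lifecycle described above can be sketched as a small state machine. This is a simplified illustration: the state names follow the text, but the event names and transition table are our own, not the actual instrumentation API.

```python
from enum import Enum, auto

class PlayerState(Enum):
    CONNECTING = auto()
    JOINING = auto()
    PLAYING = auto()
    PAUSED = auto()
    BUFFERING = auto()
    STOPPED = auto()

# Simplified transition table: (current state, observed event) -> next state.
# Event names are illustrative, not the real player's event vocabulary.
TRANSITIONS = {
    (PlayerState.CONNECTING, "connected"): PlayerState.JOINING,
    (PlayerState.JOINING, "first_frame_rendered"): PlayerState.PLAYING,
    (PlayerState.PLAYING, "pause_pressed"): PlayerState.PAUSED,
    (PlayerState.PAUSED, "play_pressed"): PlayerState.PLAYING,
    (PlayerState.PLAYING, "buffer_empty"): PlayerState.BUFFERING,
    (PlayerState.BUFFERING, "buffer_refilled"): PlayerState.PLAYING,
    (PlayerState.PLAYING, "stop_pressed"): PlayerState.STOPPED,
}

def step(state: PlayerState, event: str) -> PlayerState:
    """Advance the state machine by one event; unknown events keep the state."""
    return TRANSITIONS.get((state, event), state)
```

Timestamping each transition is enough to derive session-level quality statistics: time spent in BUFFERING versus PLAYING, the number of buffering events, and so on.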
2.2. Engagement and quality metrics
Qualitatively, engagement is a reflection of user involvement and interaction. While there are many ways in which we can define engagement (e.g., user-perceived satisfaction with the content or willingness to click advertisements), in this study we focus on objectively measurable metrics of engagement at two levels:
For completeness, we briefly describe the five industry-standard video quality metrics we use in this study2: the join time (JoinTime), the time from player initialization to the start of playback; the buffering ratio (BufRatio), the fraction of the session time spent in buffering; the rate of buffering events (RateBuf), the number of buffering events per minute of playback; the average bitrate (AvgBitrate), the time-averaged bitrate played during the session; and the rendering quality (RendQual), the ratio of the rendered frame rate to the encoded frame rate.
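As an illustration, these metrics can be derived from per-session summaries. The input fields and exact normalizations below are our own simplifications (a hypothetical reporting format), not the actual client library's schema:

```python
def session_quality_metrics(join_time_s, play_time_s, buffer_events_s,
                            bitrate_timeline, rendered_fps, encoded_fps):
    """Compute the five session-level quality metrics.

    Hypothetical per-session summaries a client library might report:
      join_time_s      - seconds from player init to first rendered frame
      play_time_s      - seconds spent in the playing state
      buffer_events_s  - list of buffering-event durations (seconds)
      bitrate_timeline - list of (seconds_played, kbps) segments
      rendered_fps / encoded_fps - average rendered vs. encoded frame rate
    """
    buf_time = sum(buffer_events_s)
    total = play_time_s + buf_time
    return {
        "JoinTime": join_time_s,
        # BufRatio: fraction of (playing + buffering) time spent buffering
        "BufRatio": buf_time / total if total else 0.0,
        # RateBuf: buffering events per minute of playback
        "RateBuf": len(buffer_events_s) / (play_time_s / 60.0),
        # AvgBitrate: time-weighted average of the bitrates played
        "AvgBitrate": sum(t * b for t, b in bitrate_timeline)
                      / sum(t for t, _ in bitrate_timeline),
        # RendQual: rendered frame rate relative to the encoded frame rate
        "RendQual": rendered_fps / encoded_fps,
    }
```

For example, a 10-min session with two buffering events of 10s and 20s has a BufRatio of 30/630 (about 4.8%) and a RateBuf of 0.2 events/min.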
We collect close to 4TB of data each week. On average, 1 week of our data captures measurements of over 300 million views watched by about 100 million unique viewers across all of our affiliate content providers. The analysis in this paper is based on the data collected from five of our affiliates in the fall of 2010. These providers appear in the Top-500 most popular sites and serve a large volume of video content, thus providing a representative view of Internet video quality and engagement.
We organize the data into three content types and within each content type we choose two datasets from different providers. We choose diverse providers in order to rule out biases induced by the particular provider or the player-specific optimizations and algorithms they use. For live content, we use additional data from the largest live streaming sports event of 2010: the FIFA World Cup. Table 1 summarizes the number of unique videos and viewers for each dataset, described below. To ensure that our analysis is statistically meaningful, we only select videos that have at least 1000 views over the week-long period.
In this section, we show preliminary measurements to motivate the types of questions that we want to answer and briefly describe the analysis techniques we use.
Overview: Figure 2 shows the cumulative distribution functions (CDF) of four quality metrics for dataset LvodA. We see that most viewing sessions experience very good quality, that is, have low BufRatio, low JoinTime, and relatively high RendQual. At the same time, however, the number of views that suffer from quality issues is not trivial: 7% experience BufRatio larger than 10%, 5% have JoinTime larger than 10s, and 37% have RendQual lower than 90%. Finally, only a small fraction of views receive the highest bitrate. Given that a significant number of views experience quality issues, content providers would naturally like to know if (and by how much) improving their quality could have potentially increased user engagement.
As an example, we consider one video object each from LiveA and LvodA, bin the different sessions based on the quality metrics, and calculate the average play time for each bin in Figures 3 and 4. These figures visually confirm that quality matters. At the same time, these initial visualizations also give rise to several questions:
To address the first two questions, we use the well-known concepts of correlation and information gain. To measure the quantitative impact, we also use linear-regression-based models for the most important metric(s). Finally, we use domain-specific insights and controlled experiments to explain the anomalous observations. Next, we briefly describe the statistical techniques we employ.
Correlation: To avoid making assumptions about the nature of the relationships between the variables, we choose the Kendall correlation instead of the Pearson correlation. The Kendall correlation is a rank correlation that does not make any assumption about the underlying distributions, noise, or the nature of the relationships. (Pearson correlation assumes that the noise in the data is Gaussian and that the relationship is roughly linear.)
Given the raw data (a vector of (x, y) values, where each x is the measured quality metric and y the engagement metric, i.e., play time or number of views), we bin it based on the value of the quality metric. We choose bin sizes that are appropriate for each quality metric of interest: for JoinTime, we use 0.5s intervals; for BufRatio and RendQual, we use 1% bins; for RateBuf, we use 0.01/min sized bins; and for AvgBitrate, we use 20 kbps-sized bins. For each bin, we compute the empirical mean of the engagement metric across the sessions/viewers that fall in the bin.
We compute the Kendall correlation between the mean-per-bin vector and the values of the bin indices. We use this binned correlation metric for two reasons. First, we observed that the correlation coefficient was biased by a large mass of users that had high quality but very low play time, possibly because of low user interest. Our goal in this paper is not to study user interest. Rather, we want to understand how the quality impacts user engagement. To this end, we look at the average value for each bin and compute the correlation on the binned data. The second reason is scale. Computing the rank correlation is expensive at the scale of analysis we target; binned correlation retains the qualitative properties at much lower computation cost.
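The binned correlation can be sketched as follows. This is a minimal tau-a implementation that ignores ties; in practice, a library routine such as scipy.stats.kendalltau over the per-bin means would serve:

```python
from collections import defaultdict

def binned_kendall(quality, engagement, bin_width):
    """Kendall correlation between quality-bin index and mean engagement per bin.

    quality, engagement: per-session values (e.g., BufRatio vs. PlayTime).
    Sessions are grouped into fixed-width quality bins, and tau is computed
    over the (bin index, mean engagement) pairs, as described in the text.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for q, e in zip(quality, engagement):
        b = int(q // bin_width)
        sums[b] += e
        counts[b] += 1
    pairs = sorted((b, sums[b] / counts[b]) for b in sums)
    # Tau-a over the per-bin means; with only a handful of bins, O(n^2) is cheap.
    concordant = discordant = 0
    for i in range(len(pairs)):
        for j in range(i + 1, len(pairs)):
            d = (pairs[j][0] - pairs[i][0]) * (pairs[j][1] - pairs[i][1])
            concordant += d > 0
            discordant += d < 0
    n = len(pairs)
    return (concordant - discordant) / (n * (n - 1) / 2)
```

A perfectly monotone decreasing relationship (e.g., mean play time falling with every BufRatio bin) yields a coefficient of -1.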
Information Gain: Correlation is useful when the relationship between the variables is roughly monotone increasing or decreasing. As Figure 3(c) shows, this may not hold. Furthermore, we want to move beyond analyzing a single quality metric. First, we want to understand if a pair (or a set) of quality metrics are complementary or if they capture the same effects. As an example, consider RendQual in Figure 3; RendQual could reflect either a network issue or a client-side CPU issue. Because BufRatio is also correlated with PlayTime, we may suspect that RendQual is mirroring the same effect. Identifying and uncovering these hidden relationships, however, is tedious. Second, content providers may want to know the top-k metrics that they should optimize to improve user engagement.
To this end, we augment the correlation analysis using information gain,16 which is based on the concept of entropy. Intuitively, this metric quantifies how our knowledge of a variable X reduces the uncertainty in another variable Y; for example, what does knowing the AvgBitrate or BufRatio "inform" us about the PlayTime distribution? We use a similar strategy to bin the data and for the PlayTime, we choose different bin sizes depending on the duration of the content.
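The relative information gain over binned data can be sketched as follows (a minimal version, assuming quality and engagement values have already been discretized into bin indices as described above):

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete bin labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def relative_info_gain(x_bins, y_bins):
    """Relative information gain of engagement Y given quality X:
    (H(Y) - H(Y|X)) / H(Y), i.e., the fraction of uncertainty in Y
    removed by knowing X's bin."""
    h_y = entropy(y_bins)
    groups = defaultdict(list)
    for x, y in zip(x_bins, y_bins):
        groups[x].append(y)
    n = len(y_bins)
    h_y_given_x = sum(len(g) / n * entropy(g) for g in groups.values())
    return (h_y - h_y_given_x) / h_y
```

A gain of 1 means the quality bin completely determines the engagement bin; a gain of 0 means knowing it tells us nothing.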
Note that these analysis techniques are complementary. Correlation provides a first-order summary of monotone relationships between engagement and quality. The information gain can corroborate the correlation or augment it when the relationship is not monotone. Further, it extends our understanding to analyze interactions across quality metrics.
Regression: Kendall correlation and information gain are largely qualitative measures. It is also useful to understand the quantitative impact; for example, what is the expected increase in engagement if we improve a specific quality metric by a given amount? Here, we rely on regression. However, as the visualizations show, the relationships between the quality metrics and the engagement are not obvious and many metrics have intrinsic dependencies. Thus, directly applying regression techniques may not be meaningful. As a simpler and more intuitive alternative, we use linear regression to quantify the impact of specific ranges of the most critical quality metric. However, we do so only after visually confirming that the relationship is roughly linear over this range so that the linear data fit is easy to interpret.
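The restricted-range fit can be sketched with ordinary least squares. Here BufRatio is expressed as a fraction and PlayTime in minutes, so a slope of -300 means each additional 1% of buffering costs about 3 min of play time (a minimal OLS; np.polyfit on the filtered points would serve equally well):

```python
def fit_slope_over_range(bufratio, playtime, lo=0.0, hi=0.10):
    """OLS slope and intercept of PlayTime vs. BufRatio, restricted to
    sessions whose BufRatio falls in [lo, hi] (the visually linear region)."""
    pts = [(x, y) for x, y in zip(bufratio, playtime) if lo <= x <= hi]
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    num = sum((x - mx) * (y - my) for x, y in pts)
    den = sum((x - mx) ** 2 for x, _ in pts)
    slope = num / den
    intercept = my - slope * mx
    return slope, intercept
```

Restricting the fit to the linear region keeps the slope directly interpretable as "minutes of engagement lost per unit of buffering."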
We begin by analyzing engagement at the per-view level, where our metric of interest is PlayTime. We begin with long VoD content, then proceed to live and short VoD content. In each case, we compute the binned correlation and information gain per video and then look at the distribution of the coefficients across all videos. Having identified the most critical metric(s), we quantify the impact of improving this quality using a linear regression model over a specific range of the quality metric.
At the same time, content providers also want to understand if good quality improves customer retention or if it encourages users to try more videos. Thus, we also analyze the user engagement at the viewer level by considering the number of views per viewer and the total play time across all videos watched by the viewer in a 1-week interval.
4.1. Long VoD content
Figure 5 shows the absolute and signed values of the correlation coefficients for LvodA to show the magnitude and the nature (increasing or decreasing) of the correlation. We summarize the median values for both datasets in Table 2 and find that the results are consistent for the common quality metrics BufRatio, JoinTime, and RendQual, confirming that our observations are not unique to a specific provider.
The result shows that BufRatio has the strongest correlation with PlayTime. Intuitively, we expect a higher BufRatio to decrease PlayTime (i.e., more negative correlation) and a higher RendQual to increase PlayTime (i.e., a positive correlation). Figure 5(b) confirms this intuition regarding the nature of these relationships. We also notice that JoinTime has little impact on the play duration.
Next, we use the univariate information gain analysis to corroborate and complement the correlation results. In Figure 6, the relative order between RateBuf and BufRatio is reversed compared to Figure 5. The reason is that most of the probability mass is in the first bin (0-1% BufRatio) and the entropy here is the same as the overall distribution (not shown). Consequently, the information gain for BufRatio is low; RateBuf does not suffer this problem and has higher information gain. Curiously, we see that AvgBitrate has high information gain even though its correlation with PlayTime is very low; we revisit this later in the section.
So far we have looked at each quality metric in isolation. A natural question then is whether two or more metrics, when combined, yield new insights that a single metric does not provide. However, this may not be the case if the metrics are themselves interdependent. For example, BufRatio and RendQual may be correlated with each other; thus, knowing that both are correlated with PlayTime does not add new information. Thus, we consider the distribution of the bivariate relative information gain values in Figure 7. For clarity, rather than showing all combinations, for each metric we include the bivariate combination with the highest relative information gain. We see that the combination with AvgBitrate provides the highest bivariate information gain. Even though BufRatio, RateBuf, and RendQual had strong correlations in Figure 5, combining them does not increase the information gain, suggesting that they are interdependent.
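The bivariate gain simply conditions on the joint bin pair: if a second metric is redundant with the first, grouping on the pair refines nothing and the joint gain stays at the univariate level, whereas a complementary metric raises it. A minimal sketch over binned values:

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete bin labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def bivariate_rel_gain(x1_bins, x2_bins, y_bins):
    """Relative information gain of Y given the joint (X1, X2) bin pair:
    (H(Y) - H(Y | X1, X2)) / H(Y)."""
    h_y = entropy(y_bins)
    groups = defaultdict(list)
    for x1, x2, y in zip(x1_bins, x2_bins, y_bins):
        groups[(x1, x2)].append(y)
    n = len(y_bins)
    h_cond = sum(len(g) / n * entropy(g) for g in groups.values())
    return (h_y - h_cond) / h_y
```

In the extreme, two metrics that are individually uninformative can jointly determine engagement (an XOR-like pattern), which univariate analysis would miss entirely.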
Surprising behavior in AvgBitrate: We noticed that AvgBitrate has low correlation but high information gain in both the univariate and bivariate analysis. This is related to our earlier observation in Figure 3. The relationship between PlayTime and AvgBitrate is not monotone; it peaks between 800 and 1000 kbps, is low on either side of this region, and increases slightly at the highest rate. Because of this non-monotone relationship, the correlation is low.
However, knowing the value of AvgBitrate allows us to predict the PlayTime and thus there is a non-trivial information gain. This still leaves open the issue of low PlayTime in the 1000-1600 kbps band. This range corresponds to clients that observe many bitrate switches because of buffering induced by poor network conditions. Thus, the PlayTime is low here as a result of buffering, which we already observed to be the most critical factor.
4.2. Live content
Figure 8 shows the distribution of the correlation coefficients for dataset LiveA, and we summarize the median values for the two datasets in Table 3. We notice one key difference with respect to the LvodA results: AvgBitrate is more strongly correlated for live content. Similar to dataset LvodA, BufRatio is strongly correlated, while JoinTime is weakly correlated.
For both long VoD and live content, BufRatio is a critical metric. Interestingly, for live, we see that RateBuf has a much stronger negative correlation with PlayTime. This suggests that live users are more sensitive to each buffering event than the long VoD audience. Investigating this further, we find that the average buffering duration is much smaller for long VoD (3 s) compared to live (7 s). That is, each buffering event in the case of live content is more disruptive. Because the buffer sizes in long VoD are larger, the system fares better in the face of fluctuations in link bandwidth. Furthermore, the system can be more proactive in predicting buffering, and hence preventing it, by switching to another server or switching bitrates. Consequently, there are fewer and shorter buffering events for long VoD.
Information gain analysis reconfirms that AvgBitrate is a critical metric and that JoinTime is less critical for Live content (not shown). The bivariate results (not shown for brevity) mimic the same effects as those depicted in Figure 7, where the combination with AvgBitrate has the largest information gains.
Surprising behavior with RendQual: Figure 4(d) shows the counter-intuitive effect where RendQual was negatively correlated with PlayTime for live content. The above results for the LiveA and LiveB datasets confirm that this is not an anomaly specific to one video but a more pervasive phenomenon. Investigating this further, we found a surprisingly large fraction of viewers with low rendering quality and high play time. Furthermore, the BufRatio values for these users were also very low. In other words, these users see a drop in RendQual even without any network issues but continue to view the video.
We hypothesized that this effect arises from a combination of user behavior and player optimizations. Unlike long VoD viewers, live video viewers may run the video player in the background or minimize the browser (perhaps listening to the commentary). In this case, the player may try to reduce CPU consumption by decreasing the frame rendering rate. To confirm this hypothesis, we replicated this behavior in a controlled setup and found that the player drops the RendQual to 20%. Interestingly, the PlayTime peak in Figure 4(d) also occurs at 20%. This suggests that the anomalous relationship is due to player optimizations when users play the video in the background.
Case study with high impact events: One concern for content providers is whether the observations from typical videos apply to "high impact" events (e.g., the Olympics10). To address this concern, we consider the LiveWC dataset. We focus here on BufRatio and AvgBitrate, which we observed to be the most critical metrics for live content in the previous discussion. Figure 9 shows that the results for LiveWC1 roughly match the results for LiveA and LiveB. We also confirmed that the coefficients for LiveWC2 and LiveWC3 are similar. These results suggest that our observations apply to such events as well.
4.3. Short VoD content
Finally, we consider the short VoD category. For both datasets SvodA and SvodB, the player uses a discrete set of 2-3 bitrates without switching and was not instrumented to gather buffering event data. Thus, we do not show the AvgBitrate (correlation over so few points is not meaningful) or RateBuf. Table 4 summarizes the median values for both datasets. We notice similarities between long and short VoD: BufRatio and RendQual are the most critical metrics. As before, JoinTime is weakly correlated. The information gain results for short VoD largely mirror the results from the correlation analysis, and we do not show them.
4.4. Quantitative analysis
As our measurements show, the interaction between the PlayTime and the quality metrics can be quite complex. Thus, we avoid black-box regression models and restrict our analysis to the most critical metric (BufRatio), applying regression only to the 0-10% range of BufRatio after visually confirming that this is roughly a linear relationship.
We notice that the distribution of the linear-fit slopes is very similar within the same content type in Figure 10. The median magnitudes of the slopes are one for long VoD, two for live, and close to zero for short VoD. That is, BufRatio has the strongest quantitative impact on live, then on long VoD, and then on short VoD. Figure 9 also includes linear data fits on the 0-10% subrange of BufRatio for the LiveWC data. These show that, within the selected subrange, a 1% increase in BufRatio can reduce the average play time by more than 3 min (assuming a game duration of 90 min). In other words, providers can increase the average user engagement by more than 3 min by investing resources to reduce BufRatio by 1%. Note that the 3-min drop is not relative to the 90-min content time but to the expected view time, which is around 40 min; that is, engagement drops by roughly 7.5% (3/40).
4.5. Viewer-level engagement
At the viewer level, we look at the aggregate number of views and play time per viewer across all objects irrespective of that video's popularity. For each viewer, we correlate the average value of each quality metric across different views with these two aggregate engagement metrics.
Figure 11 visually confirms that the quality metrics also impact the number of views. One interesting observation with JoinTime is that the number of views increases in the range 1-15s before starting to decrease. We also see a similar effect for BufRatio, where the first few bins have fewer total views. This effect does not, however, occur for the total play time. We speculate that this is an effect of user interest. Many users have very good quality but little interest in the content; they sample the content and leave without returning. Users who are actually interested in the content are more tolerant of longer join times (and buffering). However, the tolerance drops beyond a certain point (around 15s for JoinTime).
The values of the correlation coefficients are qualitatively consistent across the different datasets (not shown) and also similar to the trends we observed at the view level. The key difference is that while JoinTime has relatively little impact at the view level, it has a more pronounced impact at the viewer level. This has interesting system design implications. For example, a provider may decide to increase the buffer size to alleviate buffering issues. However, increasing buffer size can increase JoinTime. The above result shows that doing so without evaluating the impact at the viewer level may reduce the likelihood of a viewer visiting the site again.
Content popularity: There is an extensive literature on modeling content popularity and its implications for caching (e.g., Cheng et al.,6 Yu et al.,12 and Huang et al.14). While our analysis of the impact of quality on engagement is orthogonal, one interesting question is whether the impact of quality differs across popularity segments; for example, is niche content more likely to be affected by poor quality?
User behavior: Yu et al. observe that many users have small session times as they "sample" a video and leave.12 Removing this potential bias was one of the motivations for our binned correlation analysis. Other researchers have studied channel switching in IPTV (e.g., Cha et al.8) and seek-pause-forward behaviors in streaming systems (e.g., Costa et al.7). These highlight the need to understand user behavior to provide better context for the measurements similar to our browser minimization scenario for live content.
Measurements of video delivery systems: The research community has benefited immensely from measurement studies of deployed VoD and streaming systems using both "black-box" inference (e.g., Gill et al.,4 Hei et al.,13 and Saroiu et al.17) and "white-box" measurements (e.g., Chang et al.,9 Yin et al.,10 and Sripanidkulchai et al.19). Our work follows in this rich tradition of measuring real deployments. At the same time, we have taken a significant first step to systematically analyze the impact of the video quality on user engagement.
User perceived quality: Prior work has relied on controlled user studies to capture user perceived quality indices (e.g., Gulliver and Ghinea11). The difference in our work is simply an issue of timing and scale. Internet video has only recently attained widespread adoption; revisiting user engagement is ever more relevant now than before. Also, we rely on real-world measurements with millions of viewers rather than small-scale controlled experiments with a few users.
Engagement in other media: Analyses of user engagement appear for other content delivery mechanisms as well: the impact of Website loading times on user satisfaction (e.g., Bouch et al.5) and the impact of quality metrics such as bitrate, jitter, and delay on call duration in VoIP (e.g., Chen et al.3), among others. Our work is a step toward obtaining similar insights for Internet video delivery.
The findings presented in this paper are the result of an iterative process that included more false starts and misinterpretations than we care to admit. We conclude with two cautionary lessons we learned that apply more broadly to future studies of this scale.
The need for complementary analysis: For the long VoD case, we observed that the correlation coefficient for the average bitrate was weak, but the univariate information gain was high. The process of trying to explain this discrepancy led us to visualize the behaviors. In this case, the correlation was weak because the relationship was not monotone. The information gain, however, was high because the intermediate bins near the natural modes had significantly lower engagement and consequently low entropy in the play time distribution. This observation guided us to a different phenomenon, sessions that were forced to switch rates because of poor network quality. If we had restricted ourselves to a purely correlation-based analysis, we might have missed this effect and incorrectly inferred that AvgBitrate was not important. This highlights the value of using multiple views from complementary analysis techniques in dealing with large datasets.
The importance of context: Our second lesson is that while statistical techniques are excellent tools, they need to be used with caution and we need to take the results of these analyses together with the context of the user and system-level factors. For example, naively acting on the observation that the RendQual quality is negatively correlated for live content can lead to an incorrect understanding of its impact on engagement. As we saw, this is an outcome of user behavior and player optimizations. This highlights the importance of backing the statistical analysis with domain-specific insights and controlled experiments to replicate the observations.
2. Driving engagement for online video. http://registration.digitallyspeaking.com/akamai/mddec10/registration.html?b=videonuze.
The original version of this paper with the same title was published in ACM SIGCOMM, 2011.
Table 2. Median values of the Kendall rank correlation coefficients for LvodA and LvodB. We do not show AvgBitrate and RateBuf for LvodB because the player did not switch bitrates or gather buffering event data. For the remaining metrics, the results are consistent with dataset LvodA.
Table 3. Median values of the Kendall rank correlation coefficients for LiveA and LiveB. We do not show AvgBitrate and RateBuf because they do not apply for LiveB. For the remaining metrics the results are consistent with dataset LiveA.
©2013 ACM 0001-0782/13/03