Understanding the Impact of Video Quality on User Engagement


As the distribution of video over the Internet is becoming mainstream, user expectation for high quality is constantly increasing. In this context, it is crucial for content providers to understand if and how video quality affects user engagement and how to best invest their resources to optimize video quality. This paper is a first step toward addressing these questions. We use a unique dataset that spans different content types, including short video on demand (VoD), long VoD, and live content from popular video content providers. Using client-side instrumentation, we measure quality metrics such as the join time, buffering ratio, average bitrate, rendering quality, and rate of buffering events. We find that the percentage of time spent in buffering (buffering ratio) has the largest impact on the user engagement across all types of content. However, the magnitude of this impact depends on the content type, with live content being the most impacted. For example, a 1% increase in buffering ratio can reduce user engagement by more than 3 min for a 90-min live video event.


1. Introduction

Video content already constitutes a dominant fraction of Internet traffic today, and several analysts forecast that this contribution is set to increase in the next few years.1, 15 This trend is fueled by the ever decreasing cost of content delivery and the emergence of new subscription- and ad-based business models. Premier examples are Netflix, which has now reached 20 million US subscribers, and Hulu, which distributes over one billion videos per month.

As Internet video goes mainstream, users’ expectations for quality have dramatically increased; for example, when viewing content on TV screens anything less than “SD” quality (i.e., 480p) is not acceptable. In the spirit of Herbert Simon’s articulation of attention economics, the overabundance of video content increases the onus on content providers to maximize their ability to attract users’ attention.18 Thus, it becomes critical to systematically understand the interplay between video quality and user engagement. This knowledge can help providers to better invest their network and server resources toward optimizing the quality metrics that really matter.2 However, our understanding of many key questions regarding the impact of video quality on user engagement “in the wild” is limited on several fronts:

  1. Does poor video quality reduce user engagement? And by how much?
  2. Do different quality metrics vary in the degree to which they impact user engagement?
  3. Does the impact of the quality metrics differ across content genres and across different granularities of user engagement?

This paper is a first step toward answering these questions. We do so using a unique dataset of client-side measurements obtained over 2 million unique video viewing sessions from over 1 million viewers across popular content providers. Using this dataset, we analyze the impact of video quality on user engagement along three dimensions:

  • Different quality metrics: We capture characteristics of the startup latency, the rate at which the video was encoded, how much and how frequently the user experienced buffering, and the observed quality of the video rendered to the user.
  • Multiple timescales of engagement: We quantify the user engagement at two levels: per-view (i.e., a single video being watched) and per-viewer (i.e., aggregated over all views for a specific user).
  • Different types of content: We partition our data based on video type and length into short VoD, long VoD, and live, to represent the three broad types of video content being served today.

To identify the critical quality metrics and to understand the dependencies among these metrics, we employ well-known techniques such as correlation and information gain. We also augment this qualitative analysis with regression techniques to quantify the impact. Our main observations are:

  • The percentage of time spent in buffering (buffering ratio) has the largest impact on the user engagement across all types of content. However, this impact is quantitatively different for different content types, with live content being the most impacted. For a highly popular 90-min soccer game, for example, an increase in the buffering ratio of only 1% can lead to more than 3 min of reduction in the user engagement.
  • The average bitrate at which the content is streamed has a significantly higher impact for live content than for VoD content.
  • The quality metrics affect not only the per-view engagement but also the number of views watched by a viewer over a time period. Further, the join time has a greater impact at the viewer level than at the view level.

These results have important implications on how content providers can best use their resources to maximize user engagement. Reducing the buffering ratio can increase the engagement for all content types, minimizing the rate of buffering events can improve the engagement for long VoD and live content, and increasing the average bitrate can increase the engagement for live content. However, there are also trade-offs between the buffering and the bitrate that we should take into account. Our ultimate goal is to use such measurement-driven insights so that content providers, delivery systems, and end users can objectively evaluate and improve Internet video delivery. The insights we present are a small, but significant, first step toward realizing this vision.


2. Preliminaries and Datasets

We begin this section with an overview of how our dataset was collected. Then, we scope the three dimensions of the problem space: user engagement, video quality metrics, and types of video content.

*  2.1. Data collection

We have implemented a highly scalable and available real-time data collection and processing system. The system consists of two parts: (a) a client-resident instrumentation library in the video player and (b) a data aggregation and processing service that runs in data centers. Our client library gets loaded when Internet users watch video on our affiliates’ sites and monitors fine-grained events and player statistics. This library collects high fidelity raw data to generate higher level information on the client side and transmits these in real time with minimal overhead. We collect and process 0.5TB of data on average per day from various affiliates over a diverse spectrum of end users, video content, Internet service providers, and content delivery networks.

Video player instrumentation: Figure 1 illustrates the lifetime of a video session as observed at the client. The video player goes through multiple states (connecting and joining, playing, paused, buffering, stopped). For example, the player moves to the paused state if the user presses the pause button on the screen, and it moves to the buffering state if the video buffer becomes empty. By instrumenting the client, we can observe all player states and events and also collect statistics about the playback quality.
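To make the instrumentation concrete, below is a minimal sketch of a client-side monitor that tracks state transitions and accumulates per-state time. The state names mirror Figure 1, but the PlayerMonitor class and its summary format are illustrative assumptions, not our actual library.

```python
import time

# Player states, mirroring Figure 1 (illustrative).
STATES = {"connecting", "joining", "playing", "paused", "buffering", "stopped"}

class PlayerMonitor:
    """Hypothetical client-side monitor: accumulates time spent in each player state."""

    def __init__(self):
        self.state = "connecting"
        self.entered_at = time.monotonic()
        self.time_in_state = {s: 0.0 for s in STATES}
        self.buffering_events = 0

    def on_state_change(self, new_state):
        # Charge the elapsed time to the state we are leaving.
        now = time.monotonic()
        self.time_in_state[self.state] += now - self.entered_at
        if new_state == "buffering":
            self.buffering_events += 1
        self.state, self.entered_at = new_state, now

    def summary(self):
        # A record like this would be shipped to the aggregation service.
        return {"time_in_state": dict(self.time_in_state),
                "buffering_events": self.buffering_events}
```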

*  2.2. Engagement and quality metrics

Qualitatively, engagement is a reflection of user involvement and interaction. While there are many ways in which we can define engagement (e.g., user-perceived satisfaction with the content or willingness to click advertisements), in this study we focus on objectively measurable metrics of engagement at two levels:

  1. View level: A user watching a single video continuously is a view. For example, this could be watching a movie trailer clip, an episode of a TV serial, or a football game. The view-level engagement metric of interest is the play time—the duration of the viewing session. Note that we do not count ads in the stream as separate views; they are part of the actual content.
  2. Viewer level: To capture the aggregate experience of a single viewer (an end user identified by a unique system-generated client ID), we study the viewer-level engagement metrics for each unique viewer. The two metrics we use are the number of views and the total play time across all videos watched by the viewer.

For completeness, we briefly describe the five industry-standard video quality metrics we use in this study2 (a sketch of how these metrics might be computed from per-session summaries appears after the list):

  1. Join time (JoinTime): This represents the duration from the time at which the player initiates a connection to a video server until the time at which sufficient video has been buffered and the player starts rendering frames and moves to the playing state. In Figure 1, this is the length of the joining state.
  2. Buffering ratio (BufRatio): This is the fraction of the total session time (i.e., playing plus buffering time) spent in buffering. This is an aggregate metric that can capture periods of long video “freeze” observed by the user. As illustrated in Figure 1, the player goes into a buffering state when the video buffer becomes empty and moves out of buffering (back to playing state) when the buffer is replenished.
  3. Rate of buffering events (RateBuf): BufRatio does not capture the frequency of induced interruptions. For example, a video session that experiences “video stuttering,” where each interruption is small but the total number of interruptions is high, might not have a high buffering ratio, but may be just as annoying to a user. Thus, we use the rate of buffering events, that is, the number of buffering events divided by the session duration.
  4. Average bitrate (AvgBitrate): A single video session can have multiple bitrates if the video player can switch between different bitrate streams. Such bitrate adaptation logic is widely deployed in commercial players today. This metric is simply the average of the bitrates played, weighted by the duration each bitrate is played.
  5. Rendering quality (RendQual): Rendering rate (frames per second) is central to the user’s visual perception. This rate may drop because of either CPU effects (e.g., the player may drop frames if the CPU is overloaded) or network effects (e.g., congestion causes the buffer to become empty). To normalize the metric across videos that have different encoded frame rates, we define rendering quality as the ratio of the rendered frames per second to the encoded frames per second of the stream.
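A minimal sketch of how these five metrics could be derived from a per-session summary, assuming the illustrative field names below (the actual instrumentation output differs):

```python
def quality_metrics(session):
    """session: per-view summary with illustrative (assumed) fields."""
    playing = session["playing_sec"]            # seconds spent in playing state
    buffering = session["buffering_sec"]        # seconds spent in buffering state
    session_time = playing + buffering          # total session time, per the BufRatio definition

    join_time = session["join_sec"]                                   # JoinTime
    buf_ratio = buffering / session_time                              # BufRatio
    rate_buf = session["num_buffer_events"] / (session_time / 60.0)   # RateBuf, events per minute
    # AvgBitrate: average of the bitrates played, weighted by time played at each.
    periods = session["bitrate_periods"]        # list of (bitrate_kbps, seconds) pairs
    avg_bitrate = sum(kbps * sec for kbps, sec in periods) / sum(sec for _, sec in periods)
    rend_qual = session["rendered_fps"] / session["encoded_fps"]      # RendQual
    return {"JoinTime": join_time, "BufRatio": buf_ratio, "RateBuf": rate_buf,
            "AvgBitrate": avg_bitrate, "RendQual": rend_qual}
```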

*  2.3. Dataset

We collect close to 4TB of data each week. On average, 1 week of our data captures measurements of over 300 million views watched by about 100 million unique viewers across all of our affiliate content providers. The analysis in this paper is based on the data collected from five of our affiliates in the fall of 2010. These providers appear in the Top-500 most popular sites and serve a large volume of video content, thus providing a representative view of Internet video quality and engagement.

We organize the data into three content types and within each content type we choose two datasets from different providers. We choose diverse providers in order to rule out biases induced by the particular provider or the player-specific optimizations and algorithms they use. For live content, we use additional data from the largest live streaming sports event of 2010: the FIFA World Cup. Table 1 summarizes the number of unique videos and viewers for each dataset, described below. To ensure that our analysis is statistically meaningful, we only select videos that have at least 1000 views over the week-long period.

  • Long VoD: Long VoD clips are videos that are at least 35 min and at most 60 min in duration. They are typically full episodes of TV shows. The two long VoD datasets are labeled as LvodA and LvodB.
  • Short VoD: We categorize video clips as short VoD if the video length is 2–5 min. These are often trailers, short interviews, short skits, and news clips. The two short VoD datasets are labeled as SvodA and SvodB.
  • Live: Sports events and news feeds are typically delivered as live video streams. There are two key differences between VoD-type content and live streams. First, the client buffers in this case are sized such that the viewer does not lag more than a few seconds behind the video source. Second, all viewers are roughly synchronized in time. The two live datasets are labeled LiveA and LiveB. As a special case study, the dataset LiveWC corresponds to three of the final World Cup games, with almost a million viewers per game on average.


3. Analysis Techniques

In this section, we show preliminary measurements to motivate the types of questions that we want to answer and briefly describe the analysis techniques we use.

Overview: Figure 2 shows the cumulative distribution functions (CDF) of four quality metrics for dataset LvodA. We see that most viewing sessions experience very good quality, that is, have low BufRatio, low JoinTime, and relatively high RendQual. At the same time, however, the number of views that suffer from quality issues is not trivial—7% experience BufRatio larger than 10%, 5% have JoinTime larger than 10s, and 37% have RendQual lower than 90%. Finally, only a small fraction of views receive the highest bitrate. Given that a significant number of views experience quality issues, content providers would naturally like to know if (and by how much) improving their quality could have potentially increased the user engagement.

As an example, we consider one video object each from LiveA and LvodA, bin the different sessions based on the quality metrics, and calculate the average play time for each bin in Figures 3 and 4. These figures visually confirm that quality matters. At the same time, these initial visualizations also give rise to several questions:

  • How do we identify which metrics matter the most?
  • Are these quality metrics independent or is the observed relationship between the engagement and the quality metric M due to a hidden relationship between M and a more critical metric M’?
  • How do we quantify how important a quality metric is?
  • What causes the seemingly counterintuitive behaviors? For example, RendQual is negatively correlated with the play time for live content (Figure 4(d)), while AvgBitrate shows a non-monotone relationship (Figure 3(c)).

To address the first two questions, we use the well-known concepts of correlation and information gain. To measure the quantitative impact, we also use linear-regression-based models for the most important metric(s). Finally, we use domain-specific insights and controlled experiments to explain the anomalous observations. Next, we briefly describe the statistical techniques we employ.

Correlation: To avoid making assumptions about the nature of the relationships between the variables, we choose the Kendall correlation instead of the Pearson correlation. The Kendall correlation is a rank correlation that does not make any assumption about the underlying distributions, noise, or the nature of the relationships. (Pearson correlation assumes that the noise in the data is Gaussian and that the relationship is roughly linear.)

Given the raw data—a vector of (x, y) values where each x is the measured quality metric and y the engagement metric (play time or number of views)—we bin it based on the value of the quality metric. We choose bin sizes that are appropriate for each quality metric of interest: for JoinTime, we use 0.5s intervals, for BufRatio and RendQual we use 1% bins, for RateBuf we use 0.01/min sized bins, and for AvgBitrate we use 20 kbps-sized bins. For each bin, we compute the empirical mean of the engagement metric across the sessions/viewers that fall in the bin.

We compute the Kendall correlation between the mean-per-bin vector and the values of the bin indices. We use this binned correlation metric for two reasons. First, we observed that the correlation coefficient was biased by a large mass of users that had high quality but very low play time, possibly because of low user interest. Our goal in this paper is not to study user interest. Rather, we want to understand how the quality impacts user engagement. To this end, we look at the average value for each bin and compute the correlation on the binned data. The second reason is scale. Computing the rank correlation is expensive at the scale of analysis we target; binned correlation retains the qualitative properties at much lower computation cost.
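As a sketch of the binned-correlation procedure described above (using scipy's Kendall tau; the bin width is passed per metric, e.g., 0.01 for 1% BufRatio bins when the ratio is expressed as a fraction):

```python
import numpy as np
from scipy.stats import kendalltau

def binned_kendall(quality, engagement, bin_width):
    """Bin sessions by quality value, average the engagement metric per bin,
    then compute the Kendall rank correlation between bin index and bin mean."""
    quality = np.asarray(quality, dtype=float)
    engagement = np.asarray(engagement, dtype=float)
    bin_idx = np.floor(quality / bin_width).astype(int)

    bins, means = [], []
    for b in np.unique(bin_idx):
        bins.append(b)
        means.append(engagement[bin_idx == b].mean())
    tau, _p = kendalltau(bins, means)
    return tau

# Example: BufRatio (as a fraction) vs. play time in minutes, 1% bins.
buf_ratio = [0.00, 0.01, 0.02, 0.05, 0.08, 0.11]
play_time = [42.0, 40.0, 35.0, 20.0, 15.0, 8.0]
print(binned_kendall(buf_ratio, play_time, bin_width=0.01))  # close to -1: strong negative correlation
```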

Information Gain: Correlation is useful when the relationship between the variables is roughly monotone increasing or decreasing. As Figure 3(c) shows, this may not hold. Furthermore, we want to move beyond analyzing a single quality metric. First, we want to understand if a pair (or a set) of quality metrics are complementary or if they capture the same effects. As an example, consider RendQual in Figure 3; RendQual could reflect either a network issue or a client-side CPU issue. Because BufRatio is also correlated with PlayTime, we may suspect that RendQual is mirroring the same effect. Identifying and uncovering these hidden relationships, however, is tedious. Second, content providers may want to know the top-k metrics that they should optimize to improve user engagement.

To this end, we augment the correlation analysis using information gain,16 which is based on the concept of entropy. Intuitively, this metric quantifies how our knowledge of a variable X reduces the uncertainty in another variable Y; for example, what does knowing the AvgBitrate or BufRatio “inform” us about the PlayTime distribution? We use a similar strategy to bin the data and for the PlayTime, we choose different bin sizes depending on the duration of the content.
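A sketch of the (relative) information gain computation on binned data; the entropy helper and labels are generic, and a bivariate gain can be obtained by encoding a pair of quality bins as a single label:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete bin labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def relative_information_gain(quality_bins, playtime_bins):
    """(H(PlayTime) - H(PlayTime | quality)) / H(PlayTime), on binned data."""
    quality_bins = np.asarray(quality_bins)
    playtime_bins = np.asarray(playtime_bins)
    h_y = entropy(playtime_bins)
    h_y_given_x = 0.0
    for x in np.unique(quality_bins):
        mask = quality_bins == x
        h_y_given_x += mask.mean() * entropy(playtime_bins[mask])
    return (h_y - h_y_given_x) / h_y

# Bivariate gain: combine two binned metrics into one label and reuse the function, e.g.
# joint = [f"{b}-{r}" for b, r in zip(bufratio_bins, bitrate_bins)]
```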

Note that these analysis techniques are complementary. Correlation provides a first-order summary of monotone relationships between engagement and quality. The information gain can corroborate the correlation or augment it when the relationship is not monotone. Further, it extends our understanding to analyze interactions across quality metrics.

Regression: Kendall correlation and information gain are largely qualitative measures. It is also useful to understand the quantitative impact; for example, what is the expected increase in engagement if we improve a specific quality metric by a given amount? Here, we rely on regression. However, as the visualizations show, the relationships between the quality metrics and the engagement are not obvious and many metrics have intrinsic dependencies. Thus, directly applying regression techniques may not be meaningful. As a simpler and more intuitive alternative, we use linear regression to quantify the impact of specific ranges of the most critical quality metric. However, we do so only after visually confirming that the relationship is roughly linear over this range so that the linear data fit is easy to interpret.
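A sketch of this restricted linear fit, with numpy's polyfit standing in for whatever fitting routine is actually used; the slope is interpretable as minutes of play time per percentage point of BufRatio:

```python
import numpy as np

def bufratio_slope(buf_ratio_pct, play_time_min, lo=0.0, hi=10.0):
    """Fit PlayTime (minutes) vs. BufRatio (percent) on the [lo, hi]% subrange only."""
    x = np.asarray(buf_ratio_pct, dtype=float)
    y = np.asarray(play_time_min, dtype=float)
    mask = (x >= lo) & (x <= hi)
    slope, intercept = np.polyfit(x[mask], y[mask], deg=1)
    return slope, intercept
```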


4. Engagement Analysis

We begin by analyzing engagement at the per-view level, where our metric of interest is PlayTime. We start with long VoD content, then proceed to live and short VoD content. In each case, we compute the binned correlation and information gain per video and then look at the distribution of the coefficients across all videos. Having identified the most critical metric(s), we quantify the impact of improving this quality using a linear regression model over a specific range of the quality metric.

At the same time, content providers also want to understand if good quality improves customer retention or if it encourages users to try more videos. Thus, we also analyze the user engagement at the viewer level by considering the number of views per viewer and the total play time across all videos watched by the viewer in a 1-week interval.

*  4.1. Long VoD content

Figure 5 shows the absolute and signed values of the correlation coefficients for LvodA to show the magnitude and the nature (increasing or decreasing) of the correlation. We summarize the median values for both datasets in Table 2 and find that the results are consistent for the common quality metrics BufRatio, JoinTime, and RendQual, confirming that our observations are not unique to a specific provider.

The result shows that BufRatio has the strongest correlation with PlayTime. Intuitively, we expect a higher BufRatio to decrease PlayTime (i.e., more negative correlation) and a higher RendQual to increase PlayTime (i.e., a positive correlation). Figure 5(b) confirms this intuition regarding the nature of these relationships. We also notice that JoinTime has little impact on the play duration.

Next, we use the univariate information gain analysis to corroborate and complement the correlation results. In Figure 6, the relative order between RateBuf and BufRatio is reversed compared to Figure 5. The reason is that most of the probability mass is in the first bin (0–1% BufRatio) and the entropy here is the same as the overall distribution (not shown). Consequently, the information gain for BufRatio is low; RateBuf does not suffer this problem and has higher information gain. Curiously, we see that AvgBitrate has high information gain even though its correlation with PlayTime is very low; we revisit this later in the section.

So far we have looked at each quality metric in isolation. A natural question then is whether two or more metrics, when combined, yield new insights that a single metric does not provide. However, this may not be the case if the metrics are themselves interdependent. For example, BufRatio and RendQual may be correlated with each other; thus knowing that both are correlated with PlayTime does not add new information. Thus, we consider the distribution of the bivariate relative information gain values in Figure 7. For clarity, rather than showing all combinations, for each metric we include the bivariate combination with the highest relative information gain. We see that the combination with AvgBitrate provides the highest bivariate information gain. Even though BufRatio, RateBuf, and RendQual had strong correlations in Figure 5, combining them does not increase the information gain, suggesting that they are interdependent.

Surprising behavior in AvgBitrate: We noticed that AvgBitrate has low correlation but high information gain in both the univariate and bivariate analysis. This is related to our earlier observation in Figure 3. The relationship between PlayTime and AvgBitrate is not monotone; it peaks between 800 and 1000 kbps, is lower on either side of this region, and increases slightly at the highest rate. Because of this non-monotone relationship, the correlation is low.

However, knowing the value of AvgBitrate allows us to predict the PlayTime and thus there is a non-trivial information gain. This still leaves open the issue of low PlayTime in the 1000–1600 kbps band. This range corresponds to clients that observe many bitrate switches because of buffering induced by poor network conditions. Thus, the PlayTime is low here as a result of buffering, which we already observed to be the most critical factor.

*  4.2. Live content

Figure 8 shows the distribution of the correlation coefficients for dataset LiveA, and we summarize the median values for the two datasets in Table 3. We notice one key difference with respect to the LvodA results: AvgBitrate is more strongly correlated for live content. Similar to dataset LvodA, BufRatio is strongly correlated, while JoinTime is weakly correlated.

For both long VoD and live content, BufRatio is a critical metric. Interestingly, for live, we see that RateBuf has a much stronger negative correlation with PlayTime. This suggests that live users are more sensitive to each buffering event than the long VoD audience. Investigating this further, we find that the average buffering duration is much smaller for long VoD (3 s) than for live (7 s). That is, each buffering event in the case of live content is more disruptive. Because the buffer sizes in long VoD are larger, the system fares better in the face of fluctuations in link bandwidth. Furthermore, the system can be more proactive in predicting buffering and hence preventing it by switching to another server or switching bitrates. Consequently, there are fewer and shorter buffering events for long VoD.

Information gain analysis reconfirms that AvgBitrate is a critical metric and that JoinTime is less critical for Live content (not shown). The bivariate results (not shown for brevity) mimic the same effects as those depicted in Figure 7, where the combination with AvgBitrate has the largest information gains.

Surprising behavior with RendQual: Figure 4(d) shows the counter-intuitive effect where RendQual was negatively correlated with PlayTime for live content. The above results for the LiveA and LiveB datasets confirm that this is not an anomaly specific to one video but a more pervasive phenomenon. Investigating this further, we found a surprisingly large fraction of viewers with low rendering quality and high play time. Furthermore, the BufRatio values for these users were also very low. In other words, these users see a drop in RendQual even without any network issues but continue to view the video.

We hypothesized that this effect arises from a combination of user behavior and player optimizations. Unlike long VoD viewers, live video viewers may run the video player in the background or minimize the browser (perhaps still listening to the commentary). In this case, the player may try to reduce CPU consumption by decreasing the frame rendering rate. To confirm this hypothesis, we replicated this behavior in a controlled setup and found that the player drops the RendQual to 20%. Interestingly, the PlayTime peak in Figure 4(d) also occurs at 20%. This suggests that the anomalous relationship is due to player optimizations when users play the video in the background.

Case study with high impact events: One concern for content providers is whether the observations from typical videos can be applied to “high impact” events (e.g., the Olympics10). To address this concern, we consider the LiveWC dataset. We focus here on BufRatio and AvgBitrate, which we observed to be the most critical metrics for live content in the previous discussion. Figure 9 shows that the results for LiveWC1 roughly match the results for LiveA and LiveB. We also confirmed that the coefficients for LiveWC2 and LiveWC3 are identical. These results suggest that our observations apply to such events as well.

*  4.3. Short VoD content

Finally, we consider the short VoD category. For both datasets SvodA and SvodB, the player uses a discrete set of 2–3 bitrates without switching and was not instrumented to gather buffering event data. Thus, we do not show the AvgBitrate (correlation is not meaningful on two points) and RateBuf. Table 4 summarizes the median values for both datasets. We notice similarities between long and short VoD: BufRatio and RendQual are the most critical metrics. As before, JoinTime is weakly correlated. The information gain results for short VoD largely mirror the results from the correlation analysis and we do not show these.

*  4.4. Quantitative analysis

As our measurements show, the interaction between the PlayTime and the quality metrics can be quite complex. Thus, we avoid black-box regression models and restrict our analysis to the most critical metric (BufRatio) and only apply regression to the 0–10% range of BufRatio after visually confirming that this is roughly a linear relationship.

We notice that the distribution of the linear-fit slopes is very similar within the same content type in Figure 10. The median slope magnitudes (minutes of play time lost per 1% increase in BufRatio) are roughly one for long VoD, two for live, and close to zero for short VoD. That is, BufRatio has the strongest quantitative impact on live, then on long VoD, and then on short VoD. Figure 9 also includes linear data fits over the 0–10% subrange of BufRatio for the LiveWC data. These show that, within the selected subrange, a 1% increase in BufRatio can reduce the average play time by more than 3 min (assuming a game duration of 90 min). In other words, providers can increase the average user engagement by more than 3 min by investing resources to reduce BufRatio by 1%. Note that the 3-min drop is relative not to the 90-min content length but to the expected view time, which is around 40 min; that is, engagement drops by roughly 7.5% (3/40).
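A back-of-the-envelope check of the numbers above (the slope and expected view time are the values reported in the text, not recomputed here):

```python
slope_min_per_pct = -3.0     # reported fit for LiveWC: minutes of play time per 1% of BufRatio
expected_view_min = 40.0     # expected view time for a 90-min game, per the text

loss_min = abs(slope_min_per_pct) * 1.0        # a 1% increase in BufRatio -> ~3 min less viewing
relative_drop = loss_min / expected_view_min   # 0.075, i.e., roughly 7.5%
print(loss_min, relative_drop)
```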

*  4.5. Viewer-level engagement

At the viewer level, we look at the aggregate number of views and play time per viewer across all objects irrespective of that video’s popularity. For each viewer, we correlate the average value of each quality metric across different views with these two aggregate engagement metrics.

Figure 11 visually confirms that the quality metrics also impact the number of views. One interesting observation with JoinTime is that the number of views increases in the range 1–15s before starting to decrease. We also see a similar effect for BufRatio, where the first few bins have fewer total views. This effect does not, however, occur for the total play time. We speculate that this is an effect of user interest. Many users have very good quality but little interest in the content; they sample the content and leave without returning. Users who are actually interested in the content are more tolerant of longer join times (and buffering). However, the tolerance drops beyond a certain point (around 15s for JoinTime).

The values of the correlation coefficients are qualitatively consistent across the different datasets (not shown) and also similar to the trends we observed at the view level. The key difference is that while JoinTime has relatively little impact at the view level, it has a more pronounced impact at the viewer level. This has interesting system design implications. For example, a provider may decide to increase the buffer size to alleviate buffering issues. However, increasing buffer size can increase JoinTime. The above result shows that doing so without evaluating the impact at the viewer level may reduce the likelihood of a viewer visiting the site again.


5. Related Work

Content popularity: There is an extensive literature on modeling content popularity and its implications for caching (e.g., Cheng et al.,6 Yu et al.,12 and Huang et al.14). While our analysis of the impact of quality on engagement is orthogonal, one interesting question is whether the impact of quality differs across popularity segments; for example, is niche content more likely to be affected by poor quality?

User behavior: Yu et al. observe that many users have small session times as they “sample” a video and leave.12 Removing this potential bias was one of the motivations for our binned correlation analysis. Other researchers have studied channel switching in IPTV (e.g., Cha et al.8) and seek-pause-forward behaviors in streaming systems (e.g., Costa et al.7). These studies highlight the need to understand user behavior to provide better context for the measurements, similar to our browser-minimization scenario for live content.

Measurements of video delivery systems: The research community has benefited immensely from measurement studies of deployed VoD and streaming systems using both “black-box” inference (e.g., Gill et al.,4 Hei et al.,13 and Saroiu et al.17) and “white-box” measurements (e.g., Chang et al.,9 Yin et al.,10 and Sripanidkulchai et al.19). Our work follows in this rich tradition of measuring real deployments. At the same time, we have taken a significant first step to systematically analyze the impact of the video quality on user engagement.

User perceived quality: Prior work has relied on controlled user studies to capture user perceived quality indices (e.g., Gulliver and Ghinea11). The difference in our work is simply an issue of timing and scale. Internet video has only recently attained widespread adoption; revisiting user engagement is ever more relevant now than before. Also, we rely on real-world measurements with millions of viewers rather than small-scale controlled experiments with a few users.

Engagement in other media: Analysis of understanding user engagement appears in other content delivery mechanisms as well: impact of Website loading times on user satisfaction (e.g., Bouch et al.5), impact of quality metrics such as bitrate, jitter, and delay on call duration in VoIP (e.g., Chen et al.3), among others. Our work is a step toward obtaining similar insights for Internet video delivery.


6. Reflections

The findings presented in this paper are the result of an iterative process that included more false starts and misinterpretations than we care to admit. We conclude with two cautionary lessons we learned that apply more broadly to future studies of this scale.

The need for complementary analysis: For the long VoD case, we observed that the correlation coefficient for the average bitrate was weak, but the univariate information gain was high. The process of trying to explain this discrepancy led us to visualize the behaviors. In this case, the correlation was weak because the relationship was not monotone. The information gain, however, was high because the intermediate bins near the natural modes had significantly lower engagement and consequently low entropy in the play time distribution. This observation guided us to a different phenomenon, sessions that were forced to switch rates because of poor network quality. If we had restricted ourselves to a purely correlation-based analysis, we might have missed this effect and incorrectly inferred that AvgBitrate was not important. This highlights the value of using multiple views from complementary analysis techniques in dealing with large datasets.

The importance of context: Our second lesson is that while statistical techniques are excellent tools, they need to be used with caution, and we need to interpret the results of these analyses in the context of user- and system-level factors. For example, naively acting on the observation that RendQual is negatively correlated with play time for live content can lead to an incorrect understanding of its impact on engagement. As we saw, this is an outcome of user behavior and player optimizations. This highlights the importance of backing the statistical analysis with domain-specific insights and controlled experiments that replicate the observations.


Figures

F1 Figure 1. An illustration of a video session lifetime and associated video player events.

F2 Figure 2. Cumulative distribution functions for four quality metrics for dataset LvodA.

F3 Figure 3. Qualitative relationships between four quality metrics and the play time for one video in LvodA.

F4 Figure 4. Qualitative relationships between four quality metrics and the play time for one video in LiveA.

F5 Figure 5. Distribution of the Kendall rank correlation coefficients between quality metrics and play time for LvodA.

F6 Figure 6. Distribution of the univariate gain between the quality metrics and play time for LvodA.

F7 Figure 7. Distribution of the bivariate (relative) information gain for LvodA. For brevity, we only show the best bivariate combinations.

F8 Figure 8. Distribution of the Kendall rank correlation coefficient between the quality metrics and play time, for dataset LiveA.

F9 Figure 9. Impact of two quality metrics for LiveWC1, one of the three final games from the 2010 FIFA World Cup. A linear data fit is shown over the 0–10% subrange of BufRatio.

F10 Figure 10. CDF of the linear-fit slopes between PlayTime and the 0–10% subrange of BufRatio.

F11 Figure 11. Visualizing the impact of JoinTime and BufRatio on the number of views and play time for LvodA.


Tables

T1 Table 1. Summary of the datasets in our study.

T2 Table 2. Median values of the Kendall rank correlation coefficients for LvodA and LvodB. We do not show AvgBitrate and RateBuf for LvodB because the player did not switch bitrates or gather buffering event data. For the remaining metrics, the results are consistent with dataset LvodA.

T3 Table 3. Median values of the Kendall rank correlation coefficients for LiveA and LiveB. We do not show AvgBitrate and RateBuf because they do not apply for LiveB. For the remaining metrics the results are consistent with dataset LiveA.

T4 Table 4. Median values of the Kendall rank correlation coefficients for SvodA and SvodB. We do not show AvgBitrate and RateBuf because they do not apply here.

References

    1. Cisco forecast, http://blogs.cisco.com/sp/comments/cisco_visual_networking_index_forecast_annual_update/.

    2. Driving engagement for online video. http://registration.digitallyspeaking.com/akamai/mddec10/registration.html?b=videonuze.

    3. Chen, K., Huang, C., Huang, P., Lei, C. Quantifying Skype user satisfaction. In Proceedings of SIGCOMM (2006).

    4. Gill, P., Arlitt, M., Li, Z., Mahanti, A. YouTube traffic characterization: A view from the edge. In Proceedings of IMC (2007).

    5. Bouch, A., Kuchinsky, A., Bhatti, N. Quality is in the eye of the beholder: Meeting users' requirements for Internet quality of service. In Proceedings of CHI (2000).

    6. Cheng, B., Liu, X., Zhang, Z., Jin, H. A measurement study of a peer-to-peer video-on-demand system. In Proceedings of IPTPS (2007).

    7. Costa, C., Cunha, I., Borges, A., Ramos, C., Rocha, M., Almeida, J., Ribeiro-Neto, B. Analyzing client interactivity in streaming media. In Proceedings of WWW (2004).

    8. Cha, M., Rodriguez, P., Crowcroft, J., Moon, S., Amatriain, X. Watching television over an IP network. In Proceedings of IMC (2008).

    9. Chang, H., Jamin, S., Wang, W. Live streaming performance of the Zattoo network. In Proceedings of IMC (2009).

    10. Yin, H., Liu, X., Qiu, R., Xia, N., Lin, C., Zhang, H., Sekar, V., Min, G. Inside the bird's nest: Measurements of large-scale live VoD from the 2008 Olympics. In Proceedings of IMC (2009).

    11. Gulliver, S.R., Ghinea, G. Defining user perception of distributed multimedia quality. ACM Trans. Multimed. Comput. Comm. Appl. 2, 4 (Nov. 2006).

    12. Yu, H., Zheng, D., Zhao, B.Y., Zheng, W. Understanding user behavior in large-scale video-on-demand systems. In Proceedings of Eurosys (2006).

    13. Hei, X., Liang, C., Liang, J., Liu, Y., Ross, K.W. A measurement study of a large-scale P2P IPTV system. IEEE Trans. Multimed. 9 (2007).

    14. Huang, Y., Fu, T.Z.J., Chiu, D.M., Lui, J.C.S., Huang, C. Challenges, design and analysis of a large-scale P2P-VoD system. In Proceedings of SIGCOMM (2008).

    15. Cho, K., Fukuda, K., Esaki, H. The impact and implications of the growth in residential user-to-user traffic. In Proceedings of SIGCOMM (2006).

    16. Mitchell, T. Machine Learning, McGraw-Hill, 1997.

    17. Saroiu, S., Gummadi, K.P., Dunn, R.J., Gribble, S.D., Levy, H.M. An analysis of Internet content delivery systems. In Proceedings of OSDI (2002).

    18. Simon, H.A. Designing organizations for an information-rich world. In Computers, Communication, and the Public Interest. M. Greenberger, ed. The Johns Hopkins Press, 1971.

    19. Sripanidkulchai, K., Ganjam, A., Maggs, B., Zhang, H. The feasibility of supporting large-scale live streaming applications with dynamic application end-points. In Proceedings of SIGCOMM (2004).

    The original version of this paper with the same title was published in ACM SIGCOMM, 2011.
