Although the nominal topic of the following paper is managing crash reports from an installed software base, the paper’s greatest contributions are its insights about managing large-scale systems. Kinshumann et al. describe how the Windows error reporting process became almost unmanageable as the scale of Windows deployment increased. They then show how an automated reporting and management system (Windows Error Reporting, or WER) not only eliminated the existing problems, but capitalized on the scale of the system to provide features that would not be possible at smaller scale. WER turned scale from enemy to friend.
Scale has been the single most important force driving changes in system software over the last decade, and this trend will probably continue for the next decade. The impact of scale is most obvious in the Web arena, where a single large application today can harness 1,000 to 10,000 times as many servers as the largest pre-Web applications of 10 to 20 years ago and support 1,000 times as many users. However, scale also impacts developers outside the Web; in this paper, scale comes from the large installed base of Windows and the correspondingly large number of error reports it generates.
Scale creates numerous problems for system developers and managers. Manual techniques that are sufficient at small scale become unworkable at large scale. Rare corner cases that are unnoticeable at small scale become common occurrences that impact overall system behavior at large scale. It would be easy to conclude that scale offers nothing to developers except an unending parade of problems to overcome.
Microsoft, like most companies, originally used an error reporting process with a significant manual component, but that process gradually broke down as the scale of Windows deployment increased. As the number of Windows installations skyrocketed, so did the rate of error reports. In addition, the size and complexity of the Windows system increased, making it more difficult to track down problems. For example, a buggy third-party device driver could cause crashes that were difficult to distinguish from problems in the main kernel.
In reading this paper and observing other large-scale systems, I have noticed four common steps by which scale can be converted from enemy to friend. The first and most important step is automation: humans must be removed from the most important and common processes. In any system of sufficiently large scale, automation is not only necessary, but it is cheap: it’s much less expensive to build tools than to manage a large system manually. WER automated the process of detecting errors, collecting information about them, and reporting that information back to Microsoft.
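To make the first step concrete, here is a minimal Python sketch of an automated crash reporter: an exception hook that captures a stack trace and would upload it to a collection service. The endpoint name and report format are placeholders of mine, and WER's actual client is far more elaborate; the sketch only illustrates the principle of taking humans out of the detect-collect-report loop.

```python
import sys
import traceback

# Hypothetical endpoint; WER's real transport and report format differ.
REPORT_ENDPOINT = "https://example.com/error-reports"

def report_uncaught(exc_type, exc_value, exc_tb):
    """Automatically capture an uncaught exception: no human in the loop."""
    report = {
        "exception": exc_type.__name__,
        "message": str(exc_value),
        "stack": traceback.format_exception(exc_type, exc_value, exc_tb),
    }
    # A real client would queue the report and upload it to REPORT_ENDPOINT;
    # printing keeps this sketch self-contained.
    print("would upload to", REPORT_ENDPOINT, ":", report["exception"])
    # Chain to the default handler so the crash is still visible locally.
    sys.__excepthook__(exc_type, exc_value, exc_tb)

# Install once at startup; every uncaught exception is then detected,
# summarized, and (in principle) reported without user intervention.
sys.excepthook = report_uncaught
```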
The second step in capitalizing on scale is to maintain records; this is usually easy once the processes have been automated. In the case of WER the data consists of information about each error, such as a stack trace. The authors developed mechanisms for categorizing errors into buckets, such that all the errors in a bucket probably share the same root cause. Accurate and complete data enables the third and fourth steps.
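A rough sketch of the bucketing idea: derive a bucket key from the top stack frames of a crash, on the assumption that crashes with the same failing frames share a root cause. The frame names and the hashing heuristic below are illustrative choices of mine; WER's real bucketing applies many more heuristics.

```python
import hashlib

def bucket_id(stack_frames, top_n=3):
    # Heuristic: hash the module and function of the top few frames, assuming
    # that crashes sharing those frames share a root cause.
    signature = "|".join(f"{mod}!{func}" for mod, func in stack_frames[:top_n])
    return hashlib.sha1(signature.encode()).hexdigest()[:12]

# Two crashes whose top frames match land in the same bucket; the module and
# function names here are illustrative.
crash_a = [("driverX.sys", "DispatchIoctl"), ("ntoskrnl.exe", "IofCallDriver")]
crash_b = [("driverX.sys", "DispatchIoctl"), ("ntoskrnl.exe", "IofCallDriver")]
assert bucket_id(crash_a) == bucket_id(crash_b)
```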
The third step is to use the data to make better decisions. At this point the scale of the system becomes an asset: the more data, the better. For example, WER analyzes error statistics to discover correlations with particular system configurations (a particular error might occur only when a particular device driver is present). WER also identifies the buckets with the most reports so they can be addressed first.
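As a hypothetical illustration of this step, the sketch below ranks buckets by report volume and checks how often a particular driver appears on the machines reporting a given bucket. The bucket names, driver names, and report records are invented; WER's statistical analysis is far richer.

```python
from collections import Counter

# Invented report records: (bucket, set of drivers present on the machine).
reports = [
    ("bucket_42", {"driverX.sys", "usbhub.sys"}),
    ("bucket_42", {"driverX.sys"}),
    ("bucket_42", {"driverX.sys", "nvdisp.sys"}),
    ("bucket_07", {"usbhub.sys"}),
]

# Rank buckets by report volume so the most painful bugs are addressed first.
by_volume = Counter(bucket for bucket, _ in reports).most_common()

def driver_rate(bucket, driver):
    """Fraction of a bucket's reports that came from machines with `driver`."""
    hits = [drivers for b, drivers in reports if b == bucket]
    return sum(driver in d for d in hits) / len(hits)

print(by_volume)                                # bucket_42 dominates
print(driver_rate("bucket_42", "driverX.sys"))  # 1.0: a suspicious correlation
```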
The fourth and final step is that processes change in fundamental ways to capitalize on the level of automation and data analysis. For example, WER allows a bug fix to be associated with a particular error bucket; when the same error is reported in the future, WER can offer the fix to the user at the time the error happens. This allows fixes to be disseminated much more rapidly, which is crucial in situations such as virus attacks.
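A minimal sketch of that feedback loop, with invented bucket names and fix text: once a fix has been associated with a bucket, any future report that maps to the same bucket can return the fix to the user immediately.

```python
# Invented mapping from bucket to a published fix.
fixes = {
    "bucket_42": "Update driverX.sys to version 2.1",
}

def handle_report(bucket):
    """When a fresh report arrives, offer an existing fix right away if one exists."""
    fix = fixes.get(bucket)
    if fix is not None:
        return f"A solution is available: {fix}"
    return "Report recorded; no fix is available yet."

print(handle_report("bucket_42"))   # user is pointed at the fix immediately
print(handle_report("bucket_99"))   # unknown bucket: just record the report
```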
Other systems besides WER are also taking advantage of scale. For example, Web search engines initially kept independent caches of index data in the main memory of each server. As the number of servers increased, their operators discovered that the combined size of the caches exceeded the total size of the index; by reorganizing the servers to eliminate this duplication, they were able to keep the entire index in DRAM. This enabled higher performance and new features. Another example is that many large-scale Web sites use an incremental release process to test new features on a small subset of users before exposing them to the full user base.
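The reorganization of the search-index caches can be sketched in a few lines (with made-up cluster sizes of my own): assign each index shard to exactly one server rather than letting every server cache index data independently, so the cluster's combined DRAM holds the index exactly once.

```python
# Made-up cluster sizes; the point is the placement rule, not the numbers.
N_SERVERS = 100
N_SHARDS = 10_000

def owner(shard_id: int) -> int:
    """Assign each index shard to exactly one server (no duplication)."""
    return shard_id % N_SERVERS

# With independent per-server caches, hot index entries are duplicated across
# many machines; with partitioning, the cluster's combined DRAM holds every
# shard exactly once, so the entire index can live in memory.
assignment = {shard: owner(shard) for shard in range(N_SHARDS)}
assert all(0 <= server < N_SERVERS for server in assignment.values())
```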
I hope you enjoy reading this paper, as I did, and that it will stimulate you to think about scale as an opportunity, not an obstacle.