Errors in dynamic random access memory (DRAM) are a common form of hardware failure in modern compute clusters. Failures are costly in terms of both hardware replacement and service disruption. While a large body of work exists on DRAM reliability in laboratory conditions, little has been reported on real DRAM failures in large production clusters. In this paper, we analyze measurements of memory errors in a large fleet of commodity servers over a period of 2.5 years. The collected data covers multiple vendors, DRAM capacities and technologies, and comprises many millions of dual in-line memory module (DIMM) days.
The goal of this paper is to answer questions such as the following: How common are memory errors in practice? What are their statistical properties? How are they affected by external factors, such as temperature and utilization, and by chip-specific factors, such as chip density, memory technology, and DIMM age?
We find that DRAM error behavior in the field differs in many key aspects from commonly held assumptions. For example, we observe DRAM error rates that are orders of magnitude higher than previously reported, with 25,000–70,000 errors per billion device hours per Mb and more than 8% of DIMMs affected by errors per year. We provide strong evidence that memory errors are dominated by hard errors, rather than soft errors, which previous work suspects to be the dominant error mode. We find that temperature, known to strongly impact DIMM error rates in lab conditions, has a surprisingly small effect on error behavior in the field, when taking all other factors into account. Finally, contrary to common fears, we do not observe any indication that newer generations of DIMMs have worse error behavior.
1. Introduction
Errors in dynamic random access memory (DRAM) devices have been a concern for a long time.3, 11, 15–17, 22 A memory error is an event that leads to the logical state of one or multiple bits being read differently from how they were last written. Memory errors can be caused by electrical or magnetic interference (e.g., due to cosmic rays), can be due to problems with the hardware (e.g., a bit being permanently damaged), or can be the result of corruption along the data path between the memories and the processing elements. Memory errors can be classified into soft errors, which randomly corrupt bits but do not leave physical damage; and hard errors, which corrupt bits in a repeatable manner because of a physical defect.
The consequence of a memory error is system-dependent. In systems using memory without support for error correction and detection, a memory error can lead to a machine crash or applications using corrupted data. Most memory systems in server machines employ error correcting codes (ECC),6 which allow the detection and correction of one or multiple bit errors. If an error is uncorrectable, i.e., the number of affected bits exceeds the limit of what the ECC can correct, typically a machine shutdown is forced. In many production environments, including ours, a single uncorrectable error (UE) is considered serious enough to replace the dual in-line memory module (DIMM) that caused it.
Memory errors are costly in terms of the system failures they cause and the repair costs associated with them. In production sites running large-scale systems, memory component replacements rank near the top of component replacements19 and memory errors are one of the most common hardware problems to lead to machine crashes.18 There is also a fear that advancing densities in DRAM technology might lead to increased memory errors, exacerbating this problem in the future.3, 12, 13
Despite the practical relevance of DRAM errors, very little is known about their prevalence in real production systems. Existing studies (see, for example, Baumann, Borucki et al., Johnston, May and Woods, Normand, and Ziegler and Lanford3, 4, 9, 11, 16, 22) are mostly based on lab experiments using accelerated testing, where DRAM is exposed to extreme conditions (such as high temperature) to artificially induce errors. It is not clear how such results carry over to real production systems. The few prior studies that are based on measurements in real systems are small in scale, such as recent work by Li et al.,10 who report on DRAM errors in 300 machines over a period of 3–7 months. Moreover, existing work is not always conclusive in its results. Li et al. cite error rates in the 200–5,000 FIT per Mb range from previous lab studies, and themselves found error rates of <1 FIT per Mb.
This paper provides the first large-scale study of DRAM memory errors in the field. It is based on data collected from Google’s server fleet over a period of more than 2 years, making up many millions of DIMM days. The DRAM in our study covers multiple vendors, DRAM densities and technologies (DDR1, DDR2, and FBDIMM).
The goal of this paper is to answer the following questions: How common are memory errors in practice? How are they affected by external factors, such as temperature, and system utilization? How do they vary with chip-specific factors, such as chip density, memory technology, and DIMM age? What are their statistical properties?
2. Background and Data
Our data covers the majority of machines in Google’s fleet and spans nearly 2.5 years, from January 2006 to June 2008. Each machine comprises a motherboard with some processors and memory DIMMs. We study six different hardware platforms, where a platform is defined by the motherboard and memory generation. We refer to these platforms as platforms A to F throughout the paper.
The memory in these systems covers a wide variety of the most commonly used types of DRAM. The DIMMs come from multiple manufacturers and models, with three different capacities (1GB, 2GB, 4GB), and cover the three most common DRAM technologies: Double Data Rate 1 (DDR1), Double Data Rate 2 (DDR2), and Fully-Buffered (FBDIMM).
Most memory systems in use in servers today are protected by error detection and correction codes. Typical error codes today fall in the single error correct double error detect (SECDED) category. That means they can reliably detect and correct any single-bit error, but they can only detect and not correct multiple bit errors. More powerful codes can correct and detect more error bits in a single memory word. For example, a code family known as chip-kill7 can correct up to four adjacent bits at once, thus being able to work around a completely broken 4-bit wide DRAM chip. In our systems, Platforms C, D, and F use SECDED, while Platforms A, B, and E rely on error protection based on chipkill. We use the terms correctable error (CE) and uncorrectable error (UE) in this paper to generalize away the details of the actual error codes used. Our study relies on data collected by low-level daemons running on all our machines that directly access hardware counters on the machine to obtain counts of correctable and uncorrectable DRAM errors.
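To make the distinction between detection and correction concrete, the sketch below implements a toy extended Hamming code over 4-bit words in Python: it corrects any single-bit error and detects, but cannot correct, double-bit errors, mirroring the CE/UE distinction used throughout this paper. It is purely illustrative; real ECC DIMMs use wider codes (e.g., a Hamming(72,64)-style code protecting 64-bit words), and the platforms in our study implement ECC entirely in hardware.

```python
# A minimal SECDED sketch: extended Hamming(8,4) code over 4 data bits.
# Illustrative only; not the actual ECC used on the platforms in this study.

def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    word = [0] * 8                             # index 0 = overall parity bit
    word[3], word[5], word[6], word[7] = d     # data bit positions
    word[1] = word[3] ^ word[5] ^ word[7]      # parity over positions with bit 0 set
    word[2] = word[3] ^ word[6] ^ word[7]      # parity over positions with bit 1 set
    word[4] = word[5] ^ word[6] ^ word[7]      # parity over positions with bit 2 set
    word[0] = sum(word) % 2                    # overall parity (SECDED extension)
    return word

def decode(word):
    """Return (data, status); status is 'ok', 'corrected', or 'uncorrectable'.
    Corrects `word` in place when a single-bit error is found."""
    syndrome = 0
    for pos in range(1, 8):                    # syndrome = XOR of set positions
        if word[pos]:
            syndrome ^= pos
    parity_ok = sum(word) % 2 == 0
    if syndrome == 0 and parity_ok:
        status = "ok"
    elif not parity_ok:                        # odd number of flips: single-bit error
        word[syndrome if syndrome else 0] ^= 1
        status = "corrected"
    else:                                      # even flips, nonzero syndrome: DED only
        return None, "uncorrectable"
    data = word[3] | (word[5] << 1) | (word[6] << 2) | (word[7] << 3)
    return data, status
```

Flipping one bit of a codeword yields status "corrected" (a CE); flipping two bits yields "uncorrectable" (a UE).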
If done well, the handling of correctable memory errors is largely invisible to application software. In contrast, UEs typically lead to a catastrophic failure. Either there is an explicit response action (such as a machine reboot), or there is risk of a data-corruption-induced failure, such as a kernel panic. In the systems we study, all UEs are considered serious enough to shut down the machine and replace the DIMM at fault.
Memory errors can be soft errors, which randomly corrupt bits, but do not leave any physical damage; or hard errors, which corrupt bits in a repeatable manner because of a physical defect (e.g., stuck bits). Our measurement infrastructure captures both hard and soft errors, but does not allow us to reliably distinguish these types of errors. All our numbers include both hard and soft errors.
In order to avoid the accumulation of single-bit errors in a memory array over time, memory systems can employ a hardware scrubber14 that scans through the memory, while the memory is otherwise idle. Any memory words with single-bit errors are written back after correction, thus eliminating the single-bit error if it was soft. Three of our hardware platforms (Platforms C, D, and F) make use of memory scrubbers. The typical scrubbing rate in those systems is 1GB every 45 min. In the other platforms (Platforms A, B, and E) errors are only detected on access.
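As a back-of-the-envelope illustration of what this scrubbing rate implies, the short sketch below computes the scrub bandwidth and the time for a full memory sweep; the 8GB machine size is an assumed example, not a measured value from our fleet.

```python
# Arithmetic implied by the scrubbing rate quoted above (1GB every 45 min).

GB_PER_PASS = 1
MINUTES_PER_PASS = 45
machine_gb = 8                                     # hypothetical machine size

bandwidth_mb_s = GB_PER_PASS * 1024 / (MINUTES_PER_PASS * 60)
full_sweep_hours = machine_gb * MINUTES_PER_PASS / 60

print(f"scrub bandwidth: {bandwidth_mb_s:.2f} MB/s")              # ~0.38 MB/s
print(f"full sweep of {machine_gb}GB: {full_sweep_hours:.1f} h")  # 6.0 h
```

On an otherwise idle machine of this size, a soft error can thus sit undetected for up to about six hours before the scrubber reaches it.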
3. How Common Are Errors?
The analysis of our data shows that CEs are not rare events: We find that about a third of all machines in Google’s fleet, and over 8% of individual DIMMs, saw at least one CE per year. Figure 1 (left) shows the average number of CEs across all DIMMs in our study per year of operation broken down by hardware platform. Figure 1 (middle) shows the fraction of DIMMs per year that experience at least one CE. Consistently across all platforms, errors occur at a significant rate, with a fleet-wide average of nearly 4,000 errors per DIMM per year. The fraction of DIMMs that experience CEs varies from around 3% (for Platforms C, D, and F) to around 20% (for Platforms A and B). Our per-DIMM rates of CEs translate to an average of 25,000–75,000 FIT (failures in time per billion hours of operation) per Mb and a median FIT range of 778–25,000 per Mb (median for DIMMs with errors). We note that this rate is significantly higher than the 200–5,000 FIT per Mb reported in previous studies, and we discuss reasons for the differences in results later in the paper.
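For readers who want to reproduce the unit conversion, the following sketch shows how a per-DIMM error rate translates into FIT per Mb; the 4,000 errors per year and 1GB capacity are example inputs, not fleet constants.

```python
# FIT = failures (here: errors) per 10^9 device-hours, normalized per Mbit.

def fit_per_mbit(errors, device_hours, mbit):
    """Errors per billion device-hours per Mbit of capacity."""
    return errors / (device_hours * mbit) * 1e9

HOURS_PER_YEAR = 8760
capacity_mbit = 1 * 8 * 1024                 # a 1GB DIMM = 8,192 Mbit

# Example: the fleet-wide average of ~4,000 CEs per DIMM per year.
print(fit_per_mbit(4000, HOURS_PER_YEAR, capacity_mbit))   # ~55,700 FIT per Mb
```

The result falls squarely within the 25,000–75,000 FIT per Mb range quoted above.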
We also analyzed the rate of UEs and found that across the entire fleet 1.3% of machines are affected by UEs per year, with some platforms seeing as many as 2%–4% of machines affected. Figure 1 (right) shows the fraction of DIMMs that see UEs in a given year, broken down by hardware platform. We note that, while the rate of CEs was comparable across platforms (recall Figure 1 (left)), the incidence of UEs is much more variable, ranging from 0.05% to 0.4%. In particular, Platforms C and D have a 3–6 times higher probability of seeing a UE than Platforms A and E.
The differences in the rates of UEs between different platforms bring up the question of what factors impact the frequency of UEs. We investigated a number of factors that might explain the difference in memory error rates across platforms, including temperature, utilization, DIMM age, capacity, DIMM manufacturer, and memory technology (detailed tables included in the full paper20). While some of these affect the frequency of errors, they are not sufficient to explain the differences we observe between platforms.
While we cannot be certain about the cause of the differences between platforms, we hypothesize that the differences in UEs are due to differences in the error correction codes in use. In particular, Platforms C, D, and F are the only platforms that do not use a form of chip-kill.7 Chip-kill is a more powerful code that can correct certain types of multiple bit errors, while the codes in Platforms C, D, and F can only correct single-bit errors.
While the above discussion focused on descriptive statistics, we also studied the statistical distribution of errors in detail. We observe that for all platforms the distribution of the number of CEs per DIMM per year is highly variable. For example, when looking only at those DIMMs that had at least one CE, there is a large difference between the mean and the median number of errors: the mean ranges from 20,000 to 140,000, while the median numbers are between 42 and 167.
When plotting the distribution of CEs over DIMMs (see Figure 2), we find that for all platforms the top 20% of DIMMs with errors make up over 94% of all observed errors. The shape of the distribution curve provides evidence that it follows a power-law distribution. Intuitively, the skew in the distribution means that a DIMM that has seen a large number of errors is likely to see more errors in the future. This is an interesting observation as this is not a property one would expect for soft errors (which should follow a random pattern) and might point to hard (or intermittent) errors as a major source of errors. This observation motivates us to take a closer look at correlations in Section 5.
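The concentration statistic above can be checked with a few lines of Python; the sketch below uses synthetic heavy-tailed (Pareto) data rather than our production measurements.

```python
# Checking a "top 20% of DIMMs hold most errors" skew on synthetic data.

import random

random.seed(0)
# Heavy-tailed per-DIMM annual CE counts (Pareto-distributed, alpha = 1.1).
counts = [int(random.paretovariate(1.1)) for _ in range(10_000)]

counts.sort(reverse=True)
top20 = counts[: len(counts) // 5]
share = sum(top20) / sum(counts)
print(f"top 20% of DIMMs hold {share:.1%} of all errors")
```

With a heavy-tailed distribution such as this, the top quintile dominates the total count, just as we observe in the field data.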
4. Impact of External Factors
In this section, we study the effect of various factors, including DIMM capacity, temperature, utilization, and age. We consider all platforms, except for Platform F, for which we do not have enough data to allow for a fine-grained analysis, and Platform E, for which we do not have data on CEs.
4.1. Temperature
Temperature is considered to (negatively) affect the reliability of many hardware components due to the strong physical changes on materials that it causes. In the case of memory chips, high temperature is expected to increase leakage current,2, 8 which in turn leads to a higher likelihood of flipped bits in the memory array. In the context of large-scale production systems, understanding the exact impact of temperature on system reliability is important, since cooling is a major cost factor. There is a trade-off to be made between increased cooling costs and increased downtime and maintenance costs due to higher failure rates.
To investigate the effect of temperature on memory errors, we plot in Figure 3 (left) the monthly rate of CEs as a function of temperature, as measured by a temperature sensor on the motherboard of each machine. Since temperature information is considered confidential, we report relative temperature values, where a temperature of x on the X-axis means the temperature was x°C higher than the lowest temperature observed for a given platform. For better readability of the graphs, we normalize CE error rates for each platform by the platform’s average CE rate, i.e., a value of y on the Y-axis refers to a CE rate that was y times higher than the average CE rate.
Figure 3 (left) shows that for all platforms higher temperatures are correlated with higher CE rates. For all platforms, the CE rate increases by at least a factor of 2 for an increase of temperature by 20°C; for some it nearly triples.
It is not clear whether this correlation indicates a causal relationship, i.e., higher temperatures inducing higher error rates. Higher temperatures might just be a proxy for higher system utilization, i.e., the utilization increases leading independently to higher error rates and higher temperatures. In Figure 3 (right), we therefore isolate the effects of temperature from the effects of utilization. We divide the utilization measurements (CPU utilization) into deciles and report for each decile the observed error rate when temperature was “high” (above median temperature) or “low” (below median temperature). We observe that when controlling for utilization, the effects of temperature vanish. We also repeated these experiments with higher differences in temperature, e.g., by comparing the effect of temperatures above the 9th decile to temperatures below the 1st decile. In all cases, for the same utilization levels the error rates for high versus low temperature are very similar.
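The decile-based analysis can be expressed compactly; the sketch below illustrates the approach on a hypothetical table of per-machine-month records (the column names are ours, not those of our measurement infrastructure).

```python
# Controlling for utilization: bin monthly records by CPU utilization decile,
# then compare mean CE rates for above- vs below-median temperature per bin.

import pandas as pd

def ce_rate_by_util_decile(df):
    """df columns (assumed): 'cpu_util', 'temp', 'ce_count'."""
    df = df.copy()
    df["util_decile"] = pd.qcut(df["cpu_util"], 10, labels=False,
                                duplicates="drop")
    df["temp_class"] = (df["temp"] > df["temp"].median()).map(
        {True: "high", False: "low"})
    # One row per utilization decile, one column per temperature class.
    return df.groupby(["util_decile", "temp_class"])["ce_count"].mean().unstack()
```

If temperature had an independent effect, the "high" and "low" columns would differ within each decile; in our data they are very similar.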
The results presented above were achieved by correlating the number of errors observed in a given month with the average temperature in that month. In our analysis, we also experimented with different measures of temperature, including temperatures averaged over different time scales (ranging from 1 h, to 1 day, to 1 month, to a DIMM’s lifetime), variability in temperature, and number of temperature excursions (i.e., number of times the temperature went above some threshold). We could not find significant levels of correlations between errors and any of the above measures for temperature when controlling for utilization.
4.2. Utilization
The observations in Section 4.1 point to system utilization as a major contributing factor in the observed memory error rates. Ideally, we would like to study specifically the impact of memory utilization (i.e., number of memory accesses). Unfortunately, obtaining data on memory utilization requires the use of hardware counters, which our measurement infrastructure does not collect. Instead, we study two signals that we believe provide indirect indication of memory activity: CPU utilization and memory allocated. CPU utilization is the load activity on the CPU(s), measured instantaneously as a percentage of total CPU cycles used out of the total CPU cycles available, and averaged per machine for each month. For lack of space, we include here only results for CPU utilization. Results for memory allocated are similar and provided in the full paper.20
Figure 4 (left) shows the normalized monthly rate of CEs as a function of CPU utilization. We observe clear trends of increasing CE rates with increasing CPU utilization. Averaging across all platforms, the CE rates grow roughly logarithmically as a function of utilization levels (based on the roughly linear increase of error rates in the graphs, which have log scales on the X-axis).
One might ask whether utilization is just a proxy for temperature, where higher utilization leads to higher system temperatures, which then cause higher error rates. In Figure 4 (right), we therefore isolate the effects of utilization from those of temperature. We divide the observed temperature values into deciles and report for each range the observed error rates when utilization was “high” or “low.” High utilization means the utilization (CPU utilization and allocated memory, respectively) is above median, and low means the utilization was below median. We observe that even when keeping temperature fixed and focusing on one particular temperature decile, there is still a huge difference in the error rates, depending on the utilization. For all temperature levels, the CE rates are higher by a factor of 2–3 for high utilization compared to low utilization.
One might argue that the higher error rate for higher utilization levels might simply be due to a higher detection rate of errors: In systems, where errors are only detected on application access, higher utilization increases the chance that an error will be detected and recorded (when an application accesses the affected cell). However, we also observe a correlation between utilization and error rates for Platforms C and D, which employ a memory scrubber. For these systems, any error will eventually be detected and recorded, if not by an application access then by the scrubber (unless it is overwritten before it is being read).
Our hypothesis is that the correlation between error rates and utilization is due to hard errors, such as a defective memory cell. In these cases, even if an error is detected and the system tries to correct it by writing back the correct value, the next time the memory cell is accessed it might again trigger a memory error. In systems with high utilization, chances are higher that the part of the hardware that is defective will be exercised frequently, leading to increased error rates. We will provide more evidence for our hard error hypothesis in Section 5.
4.3. Aging
Age is one of the most important factors in analyzing the reliability of hardware components, since increased error rates due to early aging/wear-out limit the lifetime of a device. As such, we look at changes in error behavior over time for our DRAM population, breaking it down by age, platform, technology, and error type (correctable versus uncorrectable).
Figure 5 shows normalized CE rates as a function of age for all platforms that have been in production for long enough to study age-related effects. We find that age clearly affects the CE rates for all platforms, and we observe similar trends if we break the data further down by platform, manufacturer, and capacity (graphs included in the full paper20).
For a more fine-grained view of the effects of aging and to identify trends, we study the mean cumulative function (MCF) of errors. While our full paper20 includes several MCF plots, for lack of space we only summarize the results here. In short, we find that age severely affects CE rates: We observe an increasing incidence of errors as DIMMs get older, but only up to a certain point, when the incidence becomes almost constant (few DIMMs start to have CEs at very old ages). The age when errors first start to increase and the steepness of the increase vary per platform, manufacturer, and DRAM technology, but is generally in the 10–18 month range. We also note the lack of infant mortality for almost all populations. We attribute this to the weeding out of bad DIMMs that happens during the burn-in of DIMMs prior to putting them into production.
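For reference, the sketch below shows a simplified MCF computation; it assumes every DIMM is observed from age zero, whereas the analysis in the full paper properly accounts for staggered deployment and censoring.

```python
# Simplified mean cumulative function (MCF): average cumulative CE count per
# DIMM as a function of age in months. `events` maps a DIMM id to the list of
# ages (in months) at which that DIMM saw errors. Assumes all DIMMs are
# observed over the whole age range, which real MCF estimation does not.

def mcf(events, max_age):
    n = len(events)
    curve = []
    for age in range(max_age + 1):
        total = sum(sum(1 for a in ages if a <= age)
                    for ages in events.values())
        curve.append(total / n)
    return curve

# Hypothetical example: aging onset around month 12 shows as a knee.
example = {"dimm1": [13, 14, 15], "dimm2": [12, 16], "dimm3": []}
print(mcf(example, 18))
```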
4.4. DIMM capacity and chip size
Since the amount of memory used in typical server systems keeps growing from generation to generation, a commonly asked question when projecting for future systems is how an increase in memory affects the frequency of memory errors. In this section, we focus on one aspect of this question: how error rates change when the capacity of individual DIMMs is increased.
To answer this question we consider all DIMM types (type being defined by the combination of platform and manufacturer) that exist in our systems in two different capacities. Typically, the capacities of these DIMM pairs are either 1GB and 2GB, or 2GB and 4GB. Figure 6 shows for each of these pairs the factor by which the monthly probability of CEs, the CE rate, and the probability of UEs changes, when doubling capacity.
Figure 6 indicates a trend toward worse error behavior for increased capacities, although this trend is not consistent. While in some cases the doubling of capacity has a clear negative effect (factors larger than 1 in the graph), in others it has hardly any effect (factor close to 1 in the graph). For example, for Platform A, Mfg-1, doubling the capacity increases UEs, but not CEs. Conversely, for Platform D, Mfg-6, doubling the capacity affects CEs, but not UEs.
The difference in how scaling capacity affects errors might be due to differences in how larger DIMM capacities are built, since a given DIMM capacity can be achieved in multiple ways. For example, a 1GB DIMM with ECC can be manufactured with 36 256-Mbit chips, 18 512-Mbit chips, or 9 1-Gbit chips.
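The arithmetic behind this example, assuming the standard ECC organization of one check byte per eight data bytes (12.5% overhead), is easy to verify:

```python
# A 1GB ECC DIMM stores 8Gbit of data plus 1Gbit of ECC, i.e., 9Gbit total,
# which each of the chip configurations named above provides.

TOTAL_MBIT = 9 * 1024                        # 8Gbit data + 1Gbit ECC

for chips, mbit_per_chip in [(36, 256), (18, 512), (9, 1024)]:
    assert chips * mbit_per_chip == TOTAL_MBIT
    print(f"{chips} x {mbit_per_chip}Mbit chips = {chips * mbit_per_chip}Mbit")
```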
We studied the effect of chip sizes on correctable and uncorrectable errors, controlling for capacity, platform (DIMM technology), and age. The results are mixed. When two chip configurations were available within the same platform, capacity, and manufacturer, we sometimes observed an increase in average CE rates and sometimes a decrease. This either indicates that chip size does not play a dominant role in influencing CEs or that there are other, stronger confounders in our data that we did not control for.
In addition to a correlation of chip size with error rates, we also looked for correlations of chip size with the incidence of correctable and uncorrectable errors. Again, we observe no clear trends. We also repeated the study of chip size effect without taking information on the manufacturer and/or age into account, again without any clear trends emerging.
The best we can conclude therefore is that any chip size effect is unlikely to dominate error rates given that the trends are not consistent across various other confounders, such as age and manufacturer.
5. A Closer Look at Correlations
The goal of this section is to study correlations between errors. Understanding correlations might help identify DIMMs that are likely to produce a large number of errors in the future, so that they can be replaced before they start to cause serious problems.
We begin by looking at correlations between CEs within the same DIMM. Figure 7 shows the probability of seeing a CE in a given month, depending on whether there were CEs in the same month (group of bars on the left) or the previous month (group of bars on the right). As the graph shows, for each platform the monthly CE probability increases dramatically in the presence of prior errors. In more than 85% of the cases a CE is followed by at least one more CE in the same month. Depending on the platform, this corresponds to an increase in probability from 13× to more than 90×, compared to an average month. Seeing CEs in the previous month also significantly increases the probability of seeing a CE: The probability increases by factors of 35× to more than 200×, compared to the case when the previous month had no CEs.
We also study correlations over time periods longer than a month and correlations between the number of errors in 1 month and the next, rather than just the probability of occurrence. Our study of the autocorrelation function for the number of errors observed per DIMM per month shows that even at lags of up to 7 months the level of correlation is still significant. When looking at the number of errors observed per month, we find that the larger the number of errors experienced in a month, the larger the expected number of errors in the following month. For example, in the case of Platform C, if the number of CEs in a month exceeds 100, the expected number of CEs in the following month is more than 1,000. This is a 100× increase compared to the CE rate for a random month. Graphs illustrating the above findings and more details are included in the full paper.20
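The autocorrelation computation is standard; the sketch below shows one way to compute it for a single DIMM's monthly error counts (the series shown is hypothetical, not field data).

```python
# Autocorrelation of a monthly CE-count series at lags 1..max_lag.

import numpy as np

def autocorr(series, max_lag):
    x = np.asarray(series, dtype=float)
    x = x - x.mean()                          # center the series
    denom = np.dot(x, x)
    return [np.dot(x[:-lag], x[lag:]) / denom
            for lag in range(1, max_lag + 1)]

monthly_ces = [0, 0, 3, 120, 95, 240, 180, 0, 210, 160, 300, 220]  # hypothetical
print(autocorr(monthly_ces, 7))
```

Persistently positive values out to lag 7 would correspond to the long-range correlation we observe in the fleet data.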
While the above observations let us conclude that CEs are predictive of future CEs, maybe the more interesting question is how CEs affect the probability of future uncorrectable errors. Since UEs are simply multiple bit corruptions (too many for the ECC to correct), one might suspect that the presence of CEs increases the probability of seeing a UE in the future. This is the question we focus on next.
The three left-most bars in Figure 8 show how the probability of experiencing a UE in a given month increases if there are CEs in the same month. For all platforms, the probability of a UE is significantly larger in a month with CEs compared to a month without CEs. The increase in the probability of a UE ranges from a factor of 27× (for Platform A) to more than 400× (for Platform D). While not quite as strong, the presence of CEs in the preceding month also affects the probability of UEs. The three right-most bars in Figure 8 show that the probability of seeing a UE in a month following a month with at least one CE is larger by a factor of 9× to 47× than if the previous month had no CEs.
We find that not only the presence, but also the rate of observed CEs in the same month affects the probability of a later UE. Higher rates of CEs translate to a higher probability of UEs. We see similar, albeit somewhat weaker trends when plotting the probability of UEs as a function of the number of CEs in the previous month. The UE probabilities are about 8× lower than if the same number of CEs had happened in the same month, but still significantly higher than in a random month.
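Conceptually, these conditional probabilities can be computed as in the sketch below; the record format and toy data are ours, for illustration only.

```python
# P(UE in a month | CEs in the same month) vs the unconditional probability.
# `records` is a hypothetical list of per-DIMM-month observations.

def ue_prob(records, condition=lambda r: True):
    subset = [r for r in records if condition(r)]
    return sum(r["ue"] for r in subset) / len(subset) if subset else float("nan")

records = [
    {"ce_count": 0,   "ue": False},
    {"ce_count": 150, "ue": True},
    {"ce_count": 40,  "ue": False},
    {"ce_count": 0,   "ue": False},
]  # toy data

p_any = ue_prob(records)
p_given_ce = ue_prob(records, lambda r: r["ce_count"] > 0)
print(f"P(UE) = {p_any:.2f}, P(UE | CEs same month) = {p_given_ce:.2f}")
```

Conditioning on the previous month works the same way, with the condition applied to the prior month's CE count.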
Given the above observations, one might want to use CEs as an early warning sign for impending UEs. Another interesting view is therefore what fraction of UEs are actually preceded by a CE, either in the same month or the previous month. We find that 65%–80% of UEs are preceded by a CE in the same month. Nearly 20%–40% of UEs are preceded by a CE in the previous month. These probabilities are significantly higher than those in an average month.
The above observations lead to the idea of early replacement policies, where a DIMM is replaced once it experiences a significant number of CEs, rather than waiting for the first UE. However, while UE probabilities are greatly increased after observing CEs, the absolute probabilities of a UE are still relatively low (e.g., 1.7%–2.3% in the case of Platform C and Platform D, see Figure 8).
We also experimented with more sophisticated methods for predicting UEs, including CART (classification and regression trees) models based on parameters such as the number of CEs in the same and previous month, CEs and UEs in other DIMMs in the machine, DIMM capacity and model, but were not able to achieve significantly better prediction accuracy. Hence, replacing DIMMs solely based on CEs might be worth the price only in environments where the cost of downtime is high enough to outweigh the cost of the relatively high rate of false positives.
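As an indication of the kind of model we experimented with, the sketch below fits a decision tree (CART) using scikit-learn on synthetic features; it is our reconstruction of the approach, not the actual model, features, or data used in the study.

```python
# A CART-style UE predictor on hypothetical per-DIMM-month features.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Features (assumed): [CEs this month, CEs previous month, DIMM capacity (GB)].
X = rng.integers(0, 200, size=(1000, 3))
# Synthetic labels: UEs made more likely when recent CE counts are high.
y = (X[:, 0] + X[:, 1] > 250).astype(int) & rng.integers(0, 2, size=1000)

model = DecisionTreeClassifier(max_depth=4).fit(X, y)
print(model.predict([[180, 120, 4]]))        # predicted UE risk class
```

As noted above, even models of this kind did not yield prediction accuracy high enough to overcome the false-positive cost in most environments.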
Our study of correlations and the presented evidence of correlations between errors, both on short and on longer time scales, might also shed some light on the common nature of errors. In simple terms, our results indicate that once a DIMM starts to experience errors, it is likely to continue to have errors. This observation makes it more likely that most of the observed errors are due to hard errors, rather than soft errors. The occurrence of hard errors would also explain the correlation between utilization and errors that we observed in Section 4.1.
6. Summary and Discussion
This paper studied the incidence and characteristics of DRAM errors in a large fleet of commodity servers. Our study is based on data collected over more than 2 years and covers DIMMs of multiple vendors, generations, technologies, and capacities. Below, we briefly summarize our results and discuss their implications.
Conclusion 1: We found the incidence of memory errors and the range of error rates across different DIMMs to be much higher than previously reported.
A third of machines and over 8% of DIMMs in our fleet saw at least one CE per year. Our per-DIMM rates of CEs translate to an average of 25,000–75,000 FIT (failures in time per billion hours of operation) per Mb, while previous studies report 200–5,000 FIT per Mb. The number of CEs per DIMM is highly variable, with some DIMMs experiencing a huge number of errors, compared to others. The annual incidence of UEs was 1.3% per machine and 0.22% per DIMM.
Conclusion 2: More powerful error codes (chip-kill versus SECDED) can reduce the rate of UEs by a factor of 3–8.
We observe that platforms with more powerful error codes (chip-kill versus SECDED) were able to significantly reduce the rate of UEs (from 0.25%–0.4% per DIMM per year for SECDED-based platforms, to 0.05%–0.08% for chip-kill-based platforms). Nonetheless, the remaining incidence of UEs makes a crash-tolerant application layer indispensable for large-scale server farms.
Conclusion 3: There is no evidence that newer generation DIMMs have worse error behavior (even when controlling for DIMM age). There is also no evidence that one technology (DDR1, DDR2, FB-DIMM) or one manufacturer consistently outperforms the others.
There has been much concern that advancing densities in DRAM technology will lead to higher rates of memory errors in future generations of DIMMs. We study DIMMs in six different platforms, which were introduced over a period of several years, and observe no evidence that CE rates increase with newer generations. In fact, the DIMMs used in the three most recent platforms exhibit lower CE rates than the two older platforms, despite generally higher DIMM capacities. This indicates that improvements in technology are able to keep up with adversarial trends in DIMM scaling.
Conclusion 4: Within the range of temperatures our production systems experience in the field, temperature has a surprisingly low effect on memory errors.
Temperature is well known to increase error rates. In fact, artificially increasing the temperature is a commonly used tool for accelerating error rates in lab studies. Interestingly, we find that differences in temperature in the range they arise naturally in our fleet’s operation (a difference of around 20°C between the 1st and 9th temperature decile) seem to have a marginal impact on the incidence of memory errors, when controlling for other factors, such as utilization.
Conclusion 5: Error rates are strongly correlated with utilization.
We find that DIMMs in machines with high levels of utilization, as measured by CPU utilization and the amount of memory allocated, see, on average, 4–10 times higher rates of CEs, even when controlling for other factors, such as temperature.
Conclusion 6: DIMM capacity tends to be correlated with CE and UE incidence.
When considering DIMMs of the same type (manufacturer and hardware platform) that differ only in their capacity, we see a trend of increased CE and UE rates for higher-capacity DIMMs. Based on our data we do not have conclusive results on the effect of chip size and chip density, but we are in the process of conducting a more detailed study that includes these factors.
Conclusion 7: The incidence of CEs increases with age.
Given that DRAM DIMMs are devices without any mechanical components, unlike for example hard drives, we see a surprisingly strong and early effect of age on error rates. For all DIMM types we studied, aging in the form of increased CE rates sets in after only 10–18 months in the field.
Conclusion 8: Memory errors are strongly correlated.
We observe strong correlations among CEs within the same DIMM. A DIMM that sees a CE is 13–228 times more likely to see another CE in the same month, compared to a DIMM that has not seen errors. Correlations exist at short time scales (days) and long time scales (up to 7 months).
We also observe strong correlations between CEs and UEs. Most UEs are preceded by one or more CEs, and the presence of prior CEs greatly increases the probability of later UEs. Still, the absolute probabilities of observing a UE following a CE are relatively small, between 0.1% and 2.3% per month, so replacing a DIMM solely based on the presence of CEs would be attractive only in environments where the cost of downtime is high enough to outweigh the cost of the expected high rate of false positives.
Conclusion 9: Error rates are unlikely to be dominated by soft errors.
The strong correlation between errors in a DIMM at both short and long time scales, together with the correlation between utilization and errors, leads us to believe that a large fraction of the errors are due to hard errors.
Conclusion 9 is an interesting observation, since much previous work has assumed that soft errors are the dominating error mode in DRAM. Some earlier work estimates hard errors to be orders of magnitude less common than soft errors21 and to make up about 2% of all errors.1 Conclusion 9 might also explain the significantly higher rates of memory errors we observe compared to previous studies.
Acknowledgments
We would like to thank Luiz Barroso, Urs Hoelzle, Chris Johnson, Nick Sanders, and Kai Shen for their feedback on drafts of this paper. We would also like to thank those who contributed directly or indirectly to this work: Kevin Bartz, Bill Heavlin, Nick Sanders, Rob Sprinkle, and John Zapisek. Special thanks to the System Health Infrastructure team for providing the data collection and aggregation mechanisms. Finally, the first author would like to thank the System Health Group at Google for hosting her during the summer of 2008.
Figures
Figure 1. Frequency of errors: The average number of correctable errors (CEs) per year per DIMM (left), the fraction of DIMMs that see at least one CE in a given year (middle) and the fraction of DIMMs that see at least one uncorrectable error (UE) in a given year (right). Platforms C, D, and F use SECDED, while platforms A, B, and E rely on error protection based on chipkill.
Figure 2. The distribution of correctable errors over DIMMs: The graph plots the fraction Y of all errors that is made up by the fraction X of DIMMs with the largest number of errors.
Figure 3. The effect of temperature: The left graph shows the normalized monthly rate of experiencing a correctable error (CE) as a function of the monthly average temperature, in deciles. The right graph shows the monthly rate of experiencing a CE as a function of CPU utilization, depending on whether the temperature was high (above median temperature) or low (below median temperature). We observe that when isolating the effects of temperature by controlling for utilization, it has much less of an effect.
Figure 4. The effect of utilization: The normalized monthly CE rate as a function of CPU utilization (left), and while controlling for temperature (right).
Figure 5. The effect of age: the normalized monthly rate of experiencing a CE as a function of age by platform.
Figure 6. Memory errors and DIMM capacity: the graph shows for different Platform-Manufacturer pairs the factor increase in CE rates, CE probabilities and UE probabilities, when doubling the capacity of a DIMM.
Figure 7. Correlations between correctable and uncorrectable errors: The graph shows the probability of seeing a CE in a given month, depending on whether there were previously CEs observed in the same month (three left-most bars) or in the previous month (three right-most bars). The numbers on top of each bar show the factor increase in probability compared to the CE probability in a random month (three left-most bars) and compared to the CE probability when there was no CE in the previous month (three right-most bars).
Figure 8. Correlations between correctable and uncorrectable errors: The graph shows the UE probability in a month depending on whether there were CEs earlier in the same month (three left-most bars) or in the previous month (three right-most bars). The numbers on top of the bars give the increase in UE probability compared to a month without CEs (three left-most bars) and the case where there were no CEs in the previous month (three right-most bars).