This paper provides the first large-scale study of DRAM
memory errors in the field. It is based on data collected
from Google’s server fleet over a period of more than two
years making up many millions of DIMM days. The DRAM
in our study covers multiple vendors, DRAM densities and
technologies (DDR1, DDR2, and FBDIMM).
The paper addresses the following questions: How common are memory errors in practice? What are their statistical properties? How are they affected by external factors, such as temperature, and system utilization? And how do they vary with chip-specific factors, such as chip density, memory technology and DIMM age?
We find that in many aspects DRAM errors in the field behave very differently than commonly assumed. For example, we observe DRAM error rates that are orders of magnitude
higher than previously reported, with FIT rates (failures in time per billion device hours) of 25,000 to 70,000 per Mbit and more than 8% of DIMMs affected per year. We provide strong evidence that memory errors are dominated by hard errors, rather than soft errors, which most previous work focuses on. We find that, out of all the factors that impact a DIMM’s error behavior in the field, temperature has a surprisingly small effect. Finally, unlike commonly feared, we don’t observe any indication that per-DIMM error rates increase with newer generations of DIMMs.
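To get a feel for what those FIT figures mean per module, here is a rough back-of-the-envelope conversion. It is only a sketch: the 1 GiB DIMM size and the function name are my own assumptions, not numbers from the paper.

```python
HOURS_PER_YEAR = 24 * 365  # 8760

def errors_per_dimm_year(fit_per_mbit: float, dimm_gib: float) -> float:
    """Mean errors per DIMM per year implied by a FIT-per-Mbit rate.

    FIT = failures per billion (1e9) device hours.
    dimm_gib is an assumed DIMM capacity, not a figure from the study.
    """
    mbit = dimm_gib * 1024 * 8          # GiB -> MiB -> Mbit
    failures_per_hour = fit_per_mbit * mbit / 1e9
    return failures_per_hour * HOURS_PER_YEAR

low = errors_per_dimm_year(25_000, 1)
high = errors_per_dimm_year(70_000, 1)
print(f"{low:.0f} to {high:.0f} errors per 1 GiB DIMM per year")
```

That works out to very roughly 1,800 to 5,000 errors per year for a 1 GiB module, but it is a fleet-wide mean. Since the abstract says only about 8% of DIMMs are affected per year, the errors must be concentrated in a minority of modules rather than spread evenly.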
Interesting that most memory errors were hard -- hard memory errors are unrecoverable, meaning the memory has to be physically replaced as failed, whereas soft memory errors can be fixed by overwriting the memory with the correct value. This indicates to me the value of ECC is fairly limited.
There are two kinds of errors that can typically occur in a memory system. The first is called a repeatable or hard error. In this situation, a piece of hardware is broken and will consistently return incorrect results. A bit may be stuck so that it always returns "0" for example, no matter what is written to it. Hard errors usually indicate loose memory modules, blown chips, motherboard defects or other physical problems. They are relatively easy to diagnose and correct because they are consistent and repeatable.
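The difference between the two kinds of error can be sketched with a toy model of a single memory cell. Everything here (the `Cell` class, `stuck_at`, `survives_rewrite`) is made up for illustration and does not correspond to any real hardware interface.

```python
class Cell:
    """One bit of memory; `stuck_at` (hypothetical) models a hard fault."""

    def __init__(self, stuck_at=None):
        self.stuck_at = stuck_at   # None = healthy, 0 or 1 = stuck bit
        self._value = 0

    def write(self, bit):
        self._value = bit

    def read(self):
        # A stuck cell ignores whatever was written to it.
        return self._value if self.stuck_at is None else self.stuck_at

    def flip(self):
        """Simulate a one-off (soft) upset of the stored value."""
        self._value ^= 1


def survives_rewrite(cell, bit=1):
    """Write `bit` and report whether it reads back correctly."""
    cell.write(bit)
    return cell.read() == bit


healthy = Cell()
healthy.write(1)
healthy.flip()                          # soft error: the stored 1 now reads 0
soft_seen = healthy.read() != 1
soft_fixed = survives_rewrite(healthy)  # rewriting restores the value

broken = Cell(stuck_at=0)
hard_fixed = survives_rewrite(broken)   # rewriting does not help

print(soft_seen, soft_fixed, hard_fixed)
```

Note that the stuck-at-0 cell still passes if you only ever write 0 to it, which is why memory testers write several different bit patterns.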
Sounds like all the servers in the study used ECC though, so we can't know ECC vs. non-ECC error rates.
This paper studied the incidence and characteristics of
DRAM errors in a large fleet of commodity servers. Our
study is based on data collected over more than 2 years and
covers DIMMs of multiple vendors, generations, technologies, and capacities. All DIMMs were equipped with error
correcting logic (ECC) to correct at least single bit errors.
ECC RAM can recover from small bit errors by using extra check bits. Since servers are a shared resource where up-time and reliability are important, ECC RAM is generally used there, at only a modest price premium. ECC RAM is also used in CAD/CAM workstations, where small bit errors could cause calculation mistakes that become more significant problems when a design goes to manufacturing.
ECC has several advantages over parity. For one, it can detect and repair single-bit errors and do so without having to stop the whole system. Multiple-bit errors will still return a parity error, but the odds of this happening are astronomically low during the lifetime of a PC unless the memory itself is defective. ECC is like auto insurance: It covers you for the majority of things that can go wrong, but it can't prevent a multi-car pileup.
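To illustrate "correct single-bit errors, detect multi-bit errors", here is a toy SECDED (single error correct, double error detect) code: Hamming(7,4) plus one overall parity bit. This is a sketch of the general technique on 4 data bits, not the code any particular memory controller uses; real ECC DIMMs typically protect 64 data bits with 8 check bits. The function names are my own.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit SECDED codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8            # index 0 = overall parity, 1..7 = Hamming(7,4)
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes total parity even
    return code


def decode(code):
    """Return (status, nibble); nibble is None when uncorrectable."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos    # XOR of set positions is 0 for a valid word
    overall = sum(code) % 2    # 0 when overall parity still checks out
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: treat as a single error and repair it.
        # syndrome == 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Even number of flips with a nonzero syndrome: detected only.
        return "uncorrectable", None
    d = [code[3], code[5], code[6], code[7]]
    return status, sum(b << i for i, b in enumerate(d))


word = encode(0b1011)
one_bad = list(word)
one_bad[5] ^= 1                # single-bit error
two_bad = list(one_bad)
two_bad[2] ^= 1                # second error in the same word

print(decode(word))            # ('ok', 11)
print(decode(one_bad))         # ('corrected', 11)
print(decode(two_bad))         # ('uncorrectable', None)
```

The double-bit case is the "multi-car pileup" from the quote: the code can tell something is wrong but cannot tell which bits to fix.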
Electrical or magnetic interference inside a computer system can cause a single bit of DRAM to spontaneously flip to the opposite state. It was initially thought that this was mainly due to alpha particles emitted by contaminants in chip packaging material, but research [5] has shown that the majority of one-off ("soft") errors in DRAM chips occur as a result of background radiation
...
This problem can be mitigated by using DRAM modules that include extra memory bits and memory controllers that exploit these bits. These extra bits are used to record parity or to use an error-correcting code
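For comparison, plain parity is the simpler of the two options mentioned in the quote. A minimal sketch, assuming one even-parity bit per byte (the helper names are hypothetical):

```python
def parity_bit(byte):
    """Even-parity bit for an 8-bit value: 1 if the count of 1 bits is odd."""
    return bin(byte).count("1") % 2


def check(byte, stored_parity):
    """True when the byte still matches its stored parity bit."""
    return parity_bit(byte) == stored_parity


data = 0b1011_0010
p = parity_bit(data)

one_flip = data ^ 0b0000_0100    # one bit flipped
two_flips = data ^ 0b0001_0100   # two bits flipped

print(check(data, p))        # True: clean data passes
print(check(one_flip, p))    # False: single flip is detected
print(check(two_flips, p))   # True: double flip goes unnoticed
```

Parity can only say that some odd number of bits changed. It cannot say which bit, so it cannot repair anything, and it misses any even number of flips.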
Excellent real-world study:
DRAM Errors in the Wild: A Large-Scale Field Study (pdf)
More detail here: ECC memory: A must for servers, not for desktop PCs
To make things simple, quoting from Wikipedia: