Given that many server-class systems are equipped with ECC RAM, is it necessary or useful to burn in memory DIMMs prior to deployment?
I've encountered an environment where all server RAM is put through a lengthy burn-in/stress-testing process. This has delayed system deployments on occasion and impacts hardware lead time.
The server hardware is primarily Supermicro, so the RAM is sourced from a variety of vendors rather than directly from the system manufacturer, as it would be with a Dell PowerEdge or HP ProLiant.
Is this a useful exercise? In my past experience, I simply used vendor RAM out of the box. Shouldn't the POST memory tests catch DOA memory? I've responded to ECC errors long before a DIMM actually failed, as the ECC thresholds were usually the trigger for warranty replacement (see the sketch after the list below for the kind of counter check I mean).
- Do you burn-in your RAM?
- If so, what method(s) do you use to perform the tests?
- Has it identified any problems ahead of deployment?
- Has the burn-in process resulted in any additional platform stability versus not performing that step?
- What do you do when adding RAM to an existing running server?
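For reference, this is a minimal sketch of the ECC counter check I'm referring to, assuming a Linux host with an EDAC driver loaded; the sysfs paths are the common layout but can vary by kernel and platform, and the threshold is a made-up illustration rather than any vendor's actual warranty policy:

```python
#!/usr/bin/env python3
"""Minimal sketch: read Linux EDAC ECC counters.

Assumes a Linux host with an EDAC driver loaded; the sysfs layout below is the
common one, but it can vary by kernel and platform. The alert threshold is a
made-up illustration - use whatever your vendor's warranty policy defines.
"""
from pathlib import Path

EDAC_MC = Path("/sys/devices/system/edac/mc")
CE_ALERT_THRESHOLD = 100  # hypothetical corrected-error count that triggers a ticket

if not EDAC_MC.is_dir():
    raise SystemExit("no EDAC memory controllers exposed (driver not loaded?)")

for mc in sorted(EDAC_MC.glob("mc[0-9]*")):
    ce = int((mc / "ce_count").read_text())  # corrected (single-bit) errors
    ue = int((mc / "ue_count").read_text())  # uncorrected (multi-bit) errors
    flag = "  <-- investigate / RMA" if ce >= CE_ALERT_THRESHOLD or ue > 0 else ""
    print(f"{mc.name}: corrected={ce} uncorrected={ue}{flag}")
```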
No.
The goal of burning in hardware is to stress it to the point of catalyzing a failure in a component.
Doing this with mechanical hard drives will get some results, but it's just not going to do a lot for RAM. The nature of the component is such that environmental factors and age are far more likely to be the cause of failures than reading and writing to the RAM (even at its maximum bandwidth for a few hours or days) would ever be.
Assuming your RAM is high enough quality that the solder won't melt the first time you really start to use it, a burn-in process won't help you find defects.
I found a document by Kingston detailing how they test server memory; I believe this process is broadly the same for most major manufacturers. Memory chips, like all semiconductor devices, follow a particular reliability/failure pattern known as the bathtub curve:
Time is represented on the horizontal axis, beginning with the factory shipment and continuing through three distinct time periods:
- Early Life Failures: Most failures occur during the early usage period. However, as time goes on, the number of failures diminishes quickly. The Early Life Failure period, shown in yellow, is approximately 3 months.
- Useful Life: During this period, failures are extremely rare. The Useful Life period is shown in blue and is estimated to be 20+ years.
- End-of-Life Failures: Eventually, semiconductor products wear out and fail. The End-of-Life period is shown in green.
Because Kingston noted that most failures occur in the first three months (after which the unit is considered good until end of life, roughly 15-20 years later), they designed a test using a unit called the KT2400, which brutally tests the server memory modules for 24 hours at 100 degrees Celsius and elevated voltage, during which every cell of every DRAM chip is continuously exercised. This level of stress testing effectively ages the modules by at least three months, pushing them past the critical period in which most failures appear.
The results were:
So why is burning in memory not useful for server memory? Simply, because it's already done by your manufacturer!
We buy blades, and we generally buy reasonably large blocks of them at a time, so we get them in and install them over DAYS before our network ports are ready/secure. We use that time to run memtest for around 24hrs, sometimes longer if it goes over a weekend. Once that's done we spray down the basic ESXi and IP it so it's ready for its host profile to be applied once the network's up. So yeah, we test it, more out of opportunity than necessity, but it's caught a few DOA DIMMs before now, and it's not me physically doing it so it takes me no effort. I'm for it.
Well, I guess it depends on exactly what your process is. I ALWAYS run MemTest86 on memory before I put it in a system (server or otherwise). After you have a system up and running, problems caused by faulty memory can be hard to troubleshoot.
As for actually "stress-testing" the memory, I have yet to see why this would be useful unless you are testing for overclocking purposes.
I don't, but I've seen people who do. I never saw them gain anything from it, though; I think it might be a hangover from earlier practices, or perhaps superstition.
Personally, I'm like you in that the ECC error rates are more useful to me - assuming the RAM isn't DOA, but then you'd know that anyhow.
For non-ECC RAM, running 30 minutes of memtest86+ is useful, as there is usually no reliable method of detecting bit errors while the system is running.
Blue-screening is not considered to be a reliable method...
And slightly flaky RAM often doesn't show up immediately, only after the system has seen some full-memory load, and then only if the data in that RAM was code that got executed and crashed. Data corruption can go unnoticed for long periods of time.
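To make the idea concrete, here is a toy sketch of the write-pattern/read-back approach that memory testers use. It's a minimal illustration only, assuming a machine with a few hundred MB of free RAM; it is not a substitute for memtest86+, which runs from boot media and can reach memory the OS keeps for itself:

```python
#!/usr/bin/env python3
"""Toy userspace memory pattern test.

A minimal sketch of the write-pattern/read-back idea behind memtest86+. It is
NOT a substitute for it: this only touches memory the OS hands this process,
and caching and virtual memory can mask real faults.
"""

SIZE_MB = 256                         # amount of RAM to exercise; adjust to taste
PATTERNS = [0x00, 0xFF, 0xAA, 0x55]   # all-zeros, all-ones, alternating-bit patterns


def test_block(size_bytes: int) -> bool:
    for pattern in PATTERNS:
        buf = bytearray([pattern]) * size_bytes     # write the pattern to a large block
        if buf.count(pattern) != size_bytes:        # read back: every byte should match
            print(f"bit error detected with pattern {pattern:#04x}")
            return False
    return True


if __name__ == "__main__":
    print("PASS" if test_block(SIZE_MB * 1024 * 1024) else "FAIL")
```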
For ECC RAM it won't do anything the memory controller itself isn't already doing, so it really doesn't make sense. It's just a waste of time.
In my experience, people who insist on burning in are usually old guys who have always done it like this and who keep doing it out of habit without really thinking things through.
Or they are young guys following the prescribed procedure written by those old guys.
It depends.
If you are deploying 50,000 new RAM modules, and you know that this particular hardware has a failure rate of 0.01% within the first day of operation, then statistically several of them are bound to fail on their first day; burn-in is meant to catch that. With deployments on that scale, failure is expected, not an exceptional situation.
If you're deploying only a couple hundred items, though, the statistics are most likely on your side: you would have to be quite unlucky to get a failed part.
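As a back-of-the-envelope check of that argument, here is a small Python sketch using the 0.01% first-day failure rate quoted above (an illustrative figure, not a measured one):

```python
# Back-of-the-envelope check, assuming the 0.01% first-day failure rate quoted
# above (an illustrative figure, not a measured one).
def expected_failures(n_modules: int, p_fail: float = 0.0001) -> float:
    return n_modules * p_fail

def prob_at_least_one(n_modules: int, p_fail: float = 0.0001) -> float:
    return 1.0 - (1.0 - p_fail) ** n_modules

for n in (200, 50_000):
    print(f"{n:>6} modules: expect {expected_failures(n):.2f} early failures, "
          f"P(at least one) = {prob_at_least_one(n):.1%}")
# ->    200 modules: expect 0.02 early failures, P(at least one) = 2.0%
# ->  50000 modules: expect 5.00 early failures, P(at least one) = 99.3%
```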
For one server, it's potentially a waste of time, depending on the context.
But if you install 2,000 servers at a time and don't do a valid stress test, you are almost certain to find at least one server that behaves badly. And it's not only RAM; the same goes for network, CPU, hard drives, etc. A stress test is also a good thing when a DIMM is replaced, just to be sure the right DIMM was swapped (sometimes it isn't you who replaces it); launching a stress test will tell you whether it's actually fixed.
In my experience on large-scale clusters, HPL is a good tool for spotting DIMM errors. Single-node HPL runs are enough, though larger runs can help too. If the system performs as expected and doesn't throw MCE errors (which Linux can catch in the logs), then you're good!
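To make the "catch it in the logs" part concrete, here is a minimal sketch of the kind of post-run log sweep I mean, assuming a Linux node where `dmesg` is readable; exact kernel message formats vary by version and RAS tooling, so the patterns are only a starting point:

```python
#!/usr/bin/env python3
"""Minimal sketch: scan the kernel ring buffer for machine-check / EDAC noise
after a stress run (e.g. HPL).

Shells out to `dmesg` (may need root if kernel.dmesg_restrict=1). Message
formats vary by kernel version and RAS tooling, so treat the patterns as a
starting point; informational EDAC lines from boot will also match.
"""
import re
import subprocess

SUSPECT = re.compile(r"machine check|hardware error|mce|edac", re.IGNORECASE)

log = subprocess.run(["dmesg"], capture_output=True, text=True, check=True).stdout
hits = [line for line in log.splitlines() if SUSPECT.search(line)]

if hits:
    print(f"{len(hits)} suspicious kernel message(s):")
    print("\n".join(hits))
else:
    print("no machine-check or EDAC messages found")
```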