We have 500+ servers built with Supermicro motherboards and Kingston memory, and we regularly see alerts like the following:
# fmdump -v
TIME UUID SUNW-MSG-ID
Oct 27 15:49:44.9379 108510ec-b4e1-c94b-dd9f-f7b2969a4725 INTEL-8001-94
100% fault.memory.intel.dimm_ce
Problem in: hc://:product-id=X7DB8:chassis-id=0123456789:server-id=hostname:serial=180104092839051c6a:part=KINGSTON:revision=C1/motherboard=0/memory-controller=1/dram-channel=3/dimm=0/rank=1
Affects: hc://:product-id=X7DB8:chassis-id=0123456789:server-id=hostname:serial=180104092839051c6a:part=KINGSTON:revision=C1/motherboard=0/memory-controller=1/dram-channel=3/dimm=0/rank=1
FRU: hc://:product-id=X7DB8:chassis-id=0123456789:server-id=hostname:serial=180104092839051c6a:part=KINGSTON:revision=C1/motherboard=0/memory-controller=1/dram-channel=3/dimm=0
Location: DIMM4A
My question is: how trustworthy are these faults when you are running on non-Oracle hardware?
We have tried almost everything (short of never using these components again), but the faults come back at random. For example, we replace DIMM4A and a few months later DIMM1B reports a fault; we replace all the memory and the motherboard, and another fault shows up after a few days.
The memory we replace is tested for days with memtest and we can never find a problem. Other teams using the same hardware with Windows & Linux don't see it. Is Solaris being too sensitive?
Right now we are going through another round of memory replacements, but it's becoming a pain. We couldn't find anything wrong with the servers either; they've been working just fine, but the randomly appearing memory faults are scary. Should we ignore them?
OS: OpenSolaris 2009.06 (b111)
I can only guess, but from what I've read, this fault is raised when the number of correctable ECC errors within a given time window exceeds a threshold. That is certainly a problem and should be addressed.
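To see whether the diagnoses keep landing on the same slots or wander around the board, you could tally them per location. Here is a minimal Python sketch, assuming the saved output of `fmdump -v` has the same line layout as the listing in the question. The function name `tally_dimm_ce` and the sample text are made up for illustration.

```python
import re
from collections import Counter

# Matches a diagnosis line such as "100% fault.memory.intel.dimm_ce".
FAULT_RE = re.compile(r"^\d+%\s+(\S+)")
DIMM_CE = "fault.memory.intel.dimm_ce"


def tally_dimm_ce(fmdump_text):
    """Count dimm_ce diagnoses per 'Location:' label in fmdump -v output."""
    counts = Counter()
    in_dimm_ce = False
    for raw in fmdump_text.splitlines():
        line = raw.strip()
        match = FAULT_RE.match(line)
        if match:
            # Remember whether the current record is a correctable-error fault.
            in_dimm_ce = match.group(1) == DIMM_CE
        elif line.startswith("Location:") and in_dimm_ce:
            counts[line.split(":", 1)[1].strip()] += 1
            in_dimm_ce = False
    return counts


# Illustrative input only (not real log data); UUIDs are placeholders.
SAMPLE = """\
Oct 27 15:49:44.9379 uuid-placeholder-1 INTEL-8001-94
100% fault.memory.intel.dimm_ce
Location: DIMM4A
Nov 02 09:10:11.0000 uuid-placeholder-2 INTEL-8001-94
100% fault.memory.intel.dimm_ce
Location: DIMM1B
"""

for location, n in sorted(tally_dimm_ce(SAMPLE).items()):
    print(location, n)
```

If the same slot shows up again after the DIMM has been swapped, that would point more at the slot, channel or board than at the module itself.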
If, however, your other team runs Windows on these boxes and doesn't see any issues, that may be because Windows simply corrects the correctable ECC errors and stays silent, whereas OpenSolaris (through FMA) raises a warning.
These faults should definitely not be ignored. If I were you, I'd take the time to investigate the Windows machines further and see whether there is a way to check for those corrected ECC errors.
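The question also mentions Linux machines on the same hardware. There the kernel's EDAC subsystem exposes corrected-error counters under sysfs, so a quick check could look like the sketch below. It assumes an EDAC driver for the memory controller is actually loaded (otherwise the directory is empty or missing and the function returns an empty result); the function name `read_edac_ce_counts` is just for illustration.

```python
from pathlib import Path


def read_edac_ce_counts(root="/sys/devices/system/edac/mc"):
    """Return {memory controller name: corrected error count} from EDAC sysfs."""
    counts = {}
    base = Path(root)
    if not base.is_dir():
        # No EDAC driver loaded (or not a Linux box): nothing to report.
        return counts
    for ce_file in sorted(base.glob("mc*/ce_count")):
        try:
            counts[ce_file.parent.name] = int(ce_file.read_text().strip())
        except (OSError, ValueError):
            # Skip counters that cannot be read or parsed.
            continue
    return counts


for controller, ce in read_edac_ce_counts().items():
    print(controller, "corrected errors:", ce)
```

Non-zero counters on the Linux boxes would suggest the hardware produces correctable errors there too, and that Solaris is merely the one reporting them.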