About a week ago I ran into a very interesting situation. I had a workstation: an old desktop with an Asus P5LD2 motherboard and 4 x 1 GB non-registered DDR2 Kingston memory. That same machine was the victim of a power surge quite some time ago, IIRC 12-14 months ago. The surge fried the PSU and killed the HDD. I replaced both and ran tests, including memtest, and everything seemed fine. The user worked happily on it until one day last week, when he found recent data "corruption" in some of his files. I investigated the issue and managed to narrow it down to a motherboard fault. However, the "data corruption" was rather interesting and reproducible:
- copying text files from one local directory to another and running diff on both versions showed only 1 bit changed, at a random position in the file;
- this bit was always the 6th out of 8 (counting from the least significant bit) when viewed in a hex editor, i.e. hex 19 becomes hex 39;
- the issue was reproducible while accessing NFS mounts and local mounts. Same exact tests repeated from other clients produced no differences;
- while copying from this machine over the network with rsync -av, the command failed with "Corrupted MAC on input. Disconnecting: Packet corrupt";
- tried the same MB with a different memory set - differences again;
- old memory set on another Asus P5LD2 MB - no differences;
- memtest ran for more than 24 hours - not a single error reported.
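The single-bit pattern from the list above can be checked programmatically instead of by eye in a hex editor. Below is a minimal Python sketch, assuming the original and the copy have the same length. The function name flipped_bits is made up for illustration.

```python
def flipped_bits(original: bytes, copy: bytes):
    """Return (offset, original_byte, copied_byte, bit_indexes) per differing byte.

    Bit indexes are zero-based from the least significant bit, so index 5
    is the "6th bit out of 8" (mask 0x20).
    """
    if len(original) != len(copy):
        raise ValueError("inputs must have the same length")
    report = []
    for offset, (a, b) in enumerate(zip(original, copy)):
        mask = a ^ b  # XOR leaves a 1 only where the two bytes disagree
        if mask:
            bits = [i for i in range(8) if mask & (1 << i)]
            report.append((offset, a, b, bits))
    return report


if __name__ == "__main__":
    # The example from the list above: hex 19 turning into hex 39.
    for offset, a, b, bits in flipped_bits(b"\x19", b"\x39"):
        print(f"offset {offset}: {a:#04x} -> {b:#04x}, flipped bits {bits}")
    # prints: offset 0: 0x19 -> 0x39, flipped bits [5]
```

Feeding it the contents of a source file and its copy shows whether every corruption really is the same bit, which would point at one stuck or noisy data line rather than random damage.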
Conclusion from the tests - the bit flipping occurs only on this exact machine, regardless of the memory set used and the data location (local or NFS).
Based on all my tests, the only components left in the equation are the motherboard and the CPU.
My questions are:
- what causes the bit flipping, and how exactly does it happen?
- is there a way to detect it?
- how can I test/probe for it when memtest fails to catch it?
I still have the troublesome machine in-house and am willing to run any tests to learn more about this.
The OS is Ubuntu Lucid 10.04, 64-bit.
Edit: I forgot to mention that most (if not all) capacitors on the MB were bulged on top instead of flat.
Sounds like a problem with the CPU accessing peripherals like the disk controller and network card. It could be the northbridge overheating. When the CPU is hot, the northbridge gets hotter than otherwise. It could also be the CPU overheating.
During memtest, there's minimal I/O and minimal CPU work, so it may never put the board under the conditions that trigger the fault.
The bulged capacitors you mention in your edit are most likely failing. That will cause the DC power supplied to components like the RAM, CPU, and northbridge to get noisy as load goes up. That could easily be the cause of your problem. I'd say the motherboard should be retired.
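Since memtest does not exercise the I/O path, one way to probe for this kind of fault is a copy-and-verify loop. What follows is a rough Python sketch, not a proper diagnostic. The name copy_and_verify, the 1 MiB default size, and the round count are arbitrary choices made for illustration. Reads may be served from the page cache rather than the disk, which should be acceptable here because plain local copies already showed the problem.

```python
import hashlib
import os
import shutil
import tempfile


def copy_and_verify(rounds: int = 10, size: int = 1 << 20, workdir=None):
    """Copy a random file repeatedly; return the rounds whose copy differed.

    An empty list means every copy matched the hash of the source data.
    """
    data = os.urandom(size)
    expected = hashlib.sha256(data).hexdigest()  # hash of the in-memory buffer
    bad_rounds = []
    with tempfile.TemporaryDirectory(dir=workdir) as tmp:
        src = os.path.join(tmp, "src.bin")
        with open(src, "wb") as f:
            f.write(data)
        for n in range(rounds):
            dst = os.path.join(tmp, f"copy_{n}.bin")
            shutil.copyfile(src, dst)
            with open(dst, "rb") as f:
                actual = hashlib.sha256(f.read()).hexdigest()
            if actual != expected:
                bad_rounds.append(n)
            os.remove(dst)
    return bad_rounds


if __name__ == "__main__":
    failures = copy_and_verify(rounds=20)
    print("mismatching rounds:", failures or "none")
```

Passing workdir as a directory on an NFS mount would exercise the network path as well. Running the loop while the CPU is busy with something else should load the board more than memtest does, though a clean run still does not prove the hardware is healthy.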