We are working on a project that involves different hardware all hosted in a single rack. The machines are mainly IBM servers: 2 x206 (SCSI), 1 x226 (SCSI), 2 x3400 (SATA) and another assembled machine with SATA controllers. We are using several RAID controllers. Some machines have a single ServeRAID controller; others have one or more controllers, not always Adaptec ones. All firmware and BIOS versions are up to date. All the servers and connected devices are on a UPS.
Over the last four months we have experienced several strange behaviours in our hardware. Suddenly and randomly we lose 2 or 3 drives and the RAID volumes stop working. It can happen about once a week, but never at the same time of day or on the same day of the week.
Most of the time a rebuild fixes the problem; sometimes we lose the data. Very often we just need to unplug the RAID controllers, restart the server, and the problem is fixed.
At first we thought it was due to firmware bugs, but we have carefully updated every machine and RAID controller, and there is nothing more we can do on the hardware side. We really have no clue what is causing all this trouble.
We are starting to think it is an environmental problem, but we don't know what could be interfering with our hardware. Have you ever heard of anything like this? Any ideas on how to investigate the problem?
This can easily be due to firmware bugs, not on the controllers but on the drives themselves. I've seen that far too often to count.
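One quick way to rule that in or out is to check whether all your drives are even on the same firmware level. Here is a minimal sketch that pulls model and firmware revision from SMART data; it assumes Linux with smartmontools installed and drives visible as /dev/sd? (drives hidden behind a hardware RAID controller often are not, and need a vendor-specific `-d` option or the controller's own CLI instead):

```python
#!/usr/bin/env python3
# Sketch: list model and firmware revision for every visible drive.
# Assumes Linux + smartmontools; drives behind a hardware RAID controller
# may not appear as /dev/sd? and need the controller's own tools instead.
import glob
import re
import subprocess

for dev in sorted(glob.glob("/dev/sd?")):
    out = subprocess.run(["smartctl", "-i", dev],
                         capture_output=True, text=True).stdout
    # "Device Model"/"Firmware Version" are ATA fields; "Product"/"Revision"
    # are the SCSI equivalents.
    model = re.search(r"(?:Device Model|Product):\s*(.+)", out)
    fw = re.search(r"(?:Firmware Version|Revision):\s*(.+)", out)
    print(dev,
          model.group(1).strip() if model else "unknown model",
          fw.group(1).strip() if fw else "unknown firmware")
```

If the same drive model shows up with several different firmware revisions across the machines that drop out, that is worth chasing before anything environmental.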
If I had drives from different vendors on RAID controllers from different vendors in servers from different vendors failing at an abnormal rate, I'd start looking at high temperatures and poor airflow in the server room as a potential cause of the problem.
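If you want data rather than a guess, log drive temperatures over a week or two and see whether the dropouts line up with heat spikes. A rough sketch below, assuming Linux, smartmontools, and /dev/sd? device names; the log path and interval are placeholders, and drives behind hardware RAID may again need a `-d` option or the controller CLI:

```python
#!/usr/bin/env python3
# Sketch: append a timestamped drive-temperature sample to a log file
# every 5 minutes. Handles ATA-style SMART output and the SCSI
# "Current Drive Temperature" line; anything it can't parse shows as "?".
import glob
import re
import subprocess
import time

LOGFILE = "/var/log/drive-temps.log"   # placeholder path, pick your own
INTERVAL = 300                          # seconds between samples

def temperature(dev):
    out = subprocess.run(["smartctl", "-A", dev],
                         capture_output=True, text=True).stdout
    # ATA: raw value of attribute 194 (Temperature_Celsius)
    m = re.search(r"Temperature_Celsius(?:\s+\S+){7}\s+(\d+)", out)
    if not m:
        # SCSI drives report it differently
        m = re.search(r"Current Drive Temperature:\s*(\d+)", out)
    return m.group(1) if m else "?"

while True:
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    readings = [f"{dev}={temperature(dev)}C"
                for dev in sorted(glob.glob("/dev/sd?"))]
    with open(LOGFILE, "a") as log:
        log.write(f"{stamp} {' '.join(readings)}\n")
    time.sleep(INTERVAL)
```

Cross-reference the log against the times the arrays drop drives; if the failures cluster around the hottest readings, the server room environment is your prime suspect.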