Here's a weird one I've been fighting for a while. I've got a old out-of-warranty Dell PowerEdge 6650 server with a PERC 3/DC RAID controller controlling four newer (maybe a year old) Fujitsu 136GB U320 SCSI disks in a RAID5 array.
Maybe once a month or so one of these disks will randomly "fail." By fail, that means the PERC decides that they've failed and it starts beeping and blasting alerts. All I have to do to resolve the issue is remove and reseat the "failed" disk and it starts resyncing the array. Once the resync is complete, the bezel light on the front of the machine goes back to blue from orange and the beeping stop.
My main question is what is causing these disks to "fail," when in fact they're perfectly fine. At first I thought it might be a firmware issue, so I reflashed every flashable component in the system. BIOS, PERC firmware, disk firmware, everything.
There doesn't seem to be a cause or event that triggers one of the non-failures, it just happens at random.
It's not exactly a huge issue, but it's definitely something I'd like to resolve. Dell won't provide support since the machine is out of warranty, and their website/forums are useless as always.
I like running old hardware as long as possible, but I'd get the machine replaced. You're going to have a tough time making any headway in resolving this issue.
My suspicion would be subtle interaction between the firmware on the "failing" drives, possibly the hot-swap backplane, and the RAID controller. No one at either Dell or Fujitsu is testing those drives what that controller anymore, and you're unlikely to get anyone at either company interested.
You're puting the array at risk each time this happens, since the array is becoming degraded and being rebuilt. If a legitimate failure happens on another disk during the rebuild process you're going to be in an array failure scenario. Hopefully you've got good backups.
It's frustrating because adding disks really should work fine, but with something this age you're really better off biting the bullet and getting something with active manufacturer support.
First thing I would have said would be to update the firmware, as this happens fairly often with PE servers with PERC controllers.
Just because the array is able to rebuild when you re-seat the disk, I don't think that means that the drive is okay, it could be on its way out and that is why it keeps dropping out of the array. That is why when Dell tells me just to reseat it I try to get them to send me a new one (even though they are probably just sending me one that someone sent back :-/ ).
I had the same issues with a Power Edge 2650, in fact, it was a PERC's problem, if you have some spare, try to swap it.
You said you already flashed the firmware of the raid card. Did you update the driver for it at the same time? In previously support calls with Dell about failed drives, they've always been annoyingly adamant that we were using both the latest firmware and driver for the raid card.
One of them even suggested that I needed to re-build the array from scratch after updating the firmware to make the drive stop failing. Fortunately, I got them to replace the drive before I resorted to doing that (which was the problem). So I can't confirm or deny whether his suggestion would've worked.
I had one last thought and only because you didn't mention it explicitly. Have you checked for a firmware update for the actual drives?