I have a drive, part of a RAID 1 mirror, that has two bad blocks. Adaptec Storage Manager e-mailed me when it detected the blocks. It shows 4 medium errors for that drive, but the state is still “optimal”.
This is my first time using Adaptec RAID controllers. I don’t know if an occasional bad block is normal, or if I should immediately replace that drive.
Update: The drive failed later the same day!
The disk subsystem is:
- Adaptec 6405 with ZMM
- (2) Seagate near-line SAS drives (ST31000424SS)
The other drive hasn’t reported any bad blocks yet. I am running a consistency check.
When drives are used in an array, the controller will set Time Limited Error Recovery (TLER). This causes the disks to report medium errors if they can't read the data immediately. It doesn't mean they won't recover from the read error, or that the sector is completely unreadable.
(Cheap SATA drives do not support TLER; instead, the read operation hangs while the drive tries to recover the data. That's just one of many reasons cheaper SATA drives shouldn't be used in arrays, though it doesn't apply to this particular question.)
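If you want to check whether a given SATA drive supports TLER (the ATA spec calls it SCT Error Recovery Control), smartctl can query it. A minimal sketch, assuming Linux with smartmontools installed; the device name /dev/sda is a placeholder for your drive:

```python
# Minimal sketch: query SCT Error Recovery Control (the ATA name for TLER).
# Assumes Linux with smartmontools installed; /dev/sda is a placeholder.
import subprocess

result = subprocess.run(
    ["smartctl", "-l", "scterc", "/dev/sda"],
    capture_output=True, text=True,
)
print(result.stdout)
# A TLER-capable drive prints its read/write recovery timeouts;
# a drive without support reports that the SCT Error Recovery Control
# command is not supported.
```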
If the disk determines that a sector is unreadable, it will remap the sector. The original bad sector is not reported up the chain, so software running on the OS has no way of knowing. The only thing you can do is look up the SMART report and see whether, and how many, sectors have been remapped. Many remapped sectors are a good indication of bad things to come. SMART may also report how many times the disk has experienced a soft error vs. a hard error.
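As an illustration, here is a minimal sketch of pulling those counters out of smartctl output. It assumes Linux with smartmontools, and /dev/sdb is a placeholder; SATA drives expose attribute 5 (Reallocated_Sector_Ct), while SAS drives such as the ST31000424SS report "Elements in grown defect list" instead. (For disks behind a RAID controller, smartctl's -d pass-through options may be needed to reach the physical drives.)

```python
# Minimal sketch: extract remapped-sector counts from smartctl output.
# Assumes Linux with smartmontools; /dev/sdb is a placeholder device name.
import subprocess

out = subprocess.run(
    ["smartctl", "-a", "/dev/sdb"],
    capture_output=True, text=True,
).stdout

for line in out.splitlines():
    if "Reallocated_Sector_Ct" in line:            # SATA attribute table
        print("Remapped sectors:", line.split()[-1])
    elif "Elements in grown defect list" in line:  # SAS log page
        print("Grown defects:", line.split(":")[-1].strip())
```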
In any case, SMART pre-failure prediction has been less than helpful; Google's large-scale study of disk failures backs that up.
Large drives have lots of spare space for remapping bad sectors. I've seen hundreds of sectors remapped over the course of two weeks, and the drive then kept going for another month (it was RAID 6, so we didn't rush).
If it keeps alerting you each day with a few more remapped sectors, I'd replace the drive before it fails. One burst of bad sectors when you first use the drive isn't scary at all, but a continuing condition usually means particulates in the enclosure or a damaged read/write head.
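Since the continuing condition is what matters, a trivial daily comparison of the remap count against the previous day's is enough to catch it. This is a sketch only; the state-file path and alert wording are arbitrary choices, and the count would come from the smartctl parsing sketched earlier:

```python
# Minimal sketch: warn when the remap count keeps growing day over day.
# The state-file location is an arbitrary choice for illustration.
import pathlib

STATE = pathlib.Path("/var/tmp/remap_count")

def check(current: int) -> None:
    previous = int(STATE.read_text()) if STATE.exists() else current
    STATE.write_text(str(current))
    if current > previous:
        # New remaps since the last run: the "continuing condition" above.
        print(f"WARNING: remapped sectors grew from {previous} to {current}; "
              "replace the drive before it fails.")

check(current=42)  # wire this up to the smartctl output parsed earlier
```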
I have not used SAS drives, but I have had regular SCSI and IDE drives develop a few bad blocks and then work for years without any other problems. The S.M.A.R.T. status should tell you when a drive is declining and at risk of failure.
Also, as long as you are using a RAID level other than RAID 0, you are protected against a single drive failure.
I don’t usually answer my own question, but in this case I have a definitive answer: replace the drive ASAP. The drive in question failed later the same day.
In the early AM hours I had received three e-mails that looked like the following. That's how I knew the drive had bad blocks; those e-mails were the only warning:
By the end of the day, it had failed.