So, let's say your server has six healthy hard drives. A drive fails (won't mount or detect, drops out of the RAID with errors) or is failing (SMART numbers getting worse, etc.). You need to swap out the bad drive. When you open the case you see... six identical hard drives.
How can you tell which one is no longer healthy/mounting/functioning?
The system would be Linux, most likely Ubuntu Server, using at most simple software RAID. The hard drives would be SATA and connected directly to the motherboard (no RAID controller).
I don't want to randomly disconnect drives until I pick the correct one. The drives all appear identical to me; I imagine there is some common way to identify which drive is which that I am unaware of. Does anyone have any pointers/tips/best practices? Thanks!
EDIT: I had wanted this to be 'generalized' in a hand-wavy sort of way, but it just came off as 'incomplete' and 'horrible'. My bad!
I had this exact problem on a (tower) server just like you describe, and it was easy:
smartctl will output the serial number of the drive.
hdparm -I will do the same, and vendors sometimes ship their own drive-specific tools as well.
So output the serial of the bad drive, then use a dentist's mirror and a flashlight to find it.
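For example, pulling the serial in software (assuming the suspect drive shows up as /dev/sdc; adjust the device name for your system):

# smartctl -i /dev/sdc | grep -i serial
# hdparm -I /dev/sdc | grep -i serial

Either command prints the drive's Serial Number line, which you can match against the label printed on the drive itself.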
On a rackmount you'll usually have indicator lights, like other people have said, but I bet the same trick would apply.
Putting stickers on the drives may not be feasible (depending on the design of the tray), and by the time a drive dies the stickers could have dried up and fallen off.
ledctl (from package ledmon) is really the way to go with this.
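For example (the device paths here are illustrative):

# ledctl locate=/dev/sda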
or
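# ledctl locate=/dev/disk/by-id/ata-<model>_<serial>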
Either command will illuminate the drive fail light on your chassis for the specified drive. I provided two examples to illustrate that it doesn't matter HOW you identify the drive: you can use the serial, the device name, etc. Whatever information is available to you can be used, since the drives are referenced multiple ways under the /dev/ and /dev/disk/ paths.
To turn the light back off, just execute it again, changing locate to locate_off, like so:
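# ledctl locate_off=/dev/sda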
Usually you would have to hope that the connections are labeled in some fashion, then work from the identity of the failed device. For example (and someone should correct me if I'm wrong): if you have two IDE channels with up to two drives on each, you could have sda, sdb, sdc, and sdd. If sdd failed, it would be the second drive on the cable of the second IDE channel.
If it's SATA then, like the system I have in the back room, the ports on the board are labeled for each of the SATA drives. Again, drive lettering goes from a up through however many drives you have, starting at port 0 of the SATA connectors and moving up.
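On most modern distros, udev will also do this port mapping for you: the symlinks under /dev/disk/by-path/ name each disk by the controller port it hangs off. A quick sketch (the exact PCI address and port numbers will differ on your board):

# ls -l /dev/disk/by-path/    # symlinks like pci-0000:00:1f.2-ata-3 -> ../../sdc tie ports to device names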
If there are any manufacturing differences, dmesg | grep sd (or dmesg | grep hd for IDE drives) should yield some clues.
If you have the serial numbers available, I think the hdparm command might give them to you in software so you can trace the drive that way. You might want to label the drives while everything is healthy so you don't have to worry about it when you find there's an issue.
...I knew there was another reason I preferred hardware RAID over software RAID...blinky lights. Really like the blinky lights.
EDIT: smartctl definitely gives the serial number (hdparm -I does too on ATA/SATA drives).
If you have no locate light and can't easily see the serial numbers on the outside of the drives, this cheesy technique can sometimes help: create a LOT of activity on that specific drive, then look for the drive whose activity LED is on solid. It's best to follow up with a more detailed check of the serial number, but this can help narrow the search.
E.g.:
# while true; do dd if=/dev/disk/by-id/scsi-drive-that-is-dying of=/dev/null; sleep 1; done
(The while loop is not strictly needed, but it will keep things moving while you head to the data center. The "sleep 1" avoids the high CPU usage of a fast loop if the "dd" fails due to, say, the drive being disconnected.)
Some drives expose a locate "file" under /sys into which you can echo a 1 to turn the locate indicator light on, or a 0 to turn it off (see the sketch below).

Short answer: "lsscsi". Detailed answer: "lshw -c disk" will show you the drives and the SATA ports they are connected to.
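A minimal sketch of that /sys approach, assuming an SES-capable enclosure (the enclosure address 0:0:8:0 and slot name Slot01 are hypothetical; list the directory to see what your hardware actually exposes):

# ls /sys/class/enclosure/0:0:8:0/                      # see which slots exist
# echo 1 > /sys/class/enclosure/0:0:8:0/Slot01/locate   # locate light on
# echo 0 > /sys/class/enclosure/0:0:8:0/Slot01/locate   # locate light off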
Six internal HDDs? If they are external, hot-swap drives, the hot-swap carrier likely has an error light to help you identify the bad drive. Many RAID management programs also have an option to flash the light on a particular drive to determine which is which. If they are all internal with no lights, then you are down to your RAID software telling you which IDs are good and looking at the SCSI IDs, etc., to figure it out. If they are set to auto, your RAID controller doc should tell you what order in the SCSI chain the IDs are assigned. Good luck, and take a backup now while things are still running!
At the very least, the RAID software/controller that told you about the failed drive should tell you which drive has failed (its ID number). Drive 0 is usually the one at the top left, moving down, then to the right (if there are two or more columns). The ports are probably labeled.
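With Linux software RAID, for example, mdadm names the failed member directly (a sketch, assuming the array is /dev/md0):

# cat /proc/mdstat           # failed members are flagged with (F)
# mdadm --detail /dev/md0    # per-device state; look for "faulty"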
When all else fails, you can identify the not-failed drives and work backwards.
Whichever drive's activity light does NOT come on is likely the bad one (and hopefully it's just one). Note that if you have hot spares configured, those won't light up either.
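One way to light up all the healthy members at once is to read through the array itself (a sketch, assuming the array is /dev/md0):

# dd if=/dev/md0 of=/dev/null bs=1M count=4096    # ~4 GiB of reads spread across the active members

The active drives' LEDs should go solid for the duration, while the failed member and any hot spares stay dark because the array no longer reads from them.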
scsirastools has a set of tools that let you run various diagnostic tests on SCSI disks. You can also use sgmon to power down a disk under software control. This would at least let you identify the physical disk, or you could locate it with the diagnostics.
If you have a hardware RAID controller, the controller's BIOS or management software should have a facility that lets you identify bad disks.
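For example, with an LSI/Broadcom MegaRAID controller and the MegaCli utility, you can blink a specific drive bay (a sketch; the [252:0] enclosure:slot address is hypothetical, and other vendors' tools differ):

# MegaCli -PdLocate -start -physdrv[252:0] -aALL    # blink the bay LED
# MegaCli -PdLocate -stop -physdrv[252:0] -aALL     # stop blinking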