I have an external drive bay with 4 eSATA disks in it. My system has a 4-port eSATA card, as well as a pair of internal hardware RAID1 drives. The external drives are in software RAID1 pairs as /dev/md0 and /dev/md1, and both have been configured as LVM physical volumes to create my storagevg LVM volume group. Recently, a single drive went offline (I suspect cables), but there does not seem to be a good way to physically identify which drive I need to check, especially since initialization order isn't the same between boots. How can I find the disk needing attention?
Disk Utility (under System -> Administration) will give you the serial numbers for all your disks; the serial is shown at the top right of each drive's page. It will also tell you when a drive is part of an mdadm RAID array, and it can see through the array for raw disk access.
I have six disks of the same model in my PC, so I drew a little diagram showing each one's position in the case and its serial number, letting me locate any of them quickly by serial in an emergency. The reverse also works: if a disk dies, I just check which serials are still showing up and eliminate them until I know which one is missing.
Edit: I'm trying to improve my bash-fu, so I wrote this command-line version to just give you a list of the disk serial numbers currently in your machine. fdisk may chuck out some errors, but that doesn't taint the list. (And you can crumble it into one line if you need to; I've broken it up for readability.)
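A sketch of such a listing that skips fdisk entirely and reads the persistent symlinks in /dev/disk/by-id instead; the ata-MODEL_SERIAL naming convention is an assumption (some systems expose scsi-* or wwn-* names):

```shell
#!/bin/sh
# Sketch: print "device  serial" pairs from the persistent symlinks in
# /dev/disk/by-id. The ata-MODEL_SERIAL link naming is an assumption;
# some systems use scsi-* or wwn-* prefixes instead. The directory is a
# parameter only so the sketch can be tried against any path.
list_serials() {
    byid=${1:-/dev/disk/by-id}
    for link in "$byid"/ata-*; do
        [ -e "$link" ] || continue
        case $link in *-part*) continue ;; esac   # skip partition symlinks
        # the serial is the part after the last underscore in the link name
        printf '%s  %s\n' "$(readlink -f "$link")" "${link##*_}"
    done
}

list_serials
```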
Edit 2: ls /dev/disk/by-id/ is somewhat easier ;)

If you have trouble matching the drive serial number or port indication with your disks' spatial locations, you can run cat /dev/sdz >/dev/null (where sdz is the failed drive) and locate the drive by its LED, or by ear if you aren't in a noisy server room. If the drive won't even power up, that alone should be enough to tell you which one it is. Be sure to put a visible label on the disks for next time.

The info that udisks gives (either on the command line or in the GNOME Disk Utility) includes the disk serial number. On the disks I have, the serial number is printed on the top and on the front (the side opposite the one with the connectors), both as digits and as a barcode. Unfortunately, most PC cases make it impossible to read those serials without pulling the disk out. You can also find the serial numbers in /dev/disk/by-id/.

As your disk is offline, I assume it isn't currently "seen" by the kernel? In that case, you might have to go by elimination: you want the disk whose serial number is not listed.
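The elimination can be scripted as a set difference: save the by-id listing while all drives are healthy, then compare after a failure. A sketch (the baseline path in the usage comment is hypothetical):

```shell
#!/bin/sh
# Sketch of elimination by set difference. Record the by-id listing while
# all drives are healthy; after a failure, the lines present only in the
# baseline belong to the missing disk. Paths are arguments so the sketch
# can be tried anywhere; comm requires both inputs to be sorted.
missing_disks() {
    baseline=$1        # saved listing taken while everything was healthy
    current=$2         # listing taken after the failure
    # comm -23 prints lines only in the baseline, i.e. the vanished serials
    comm -23 "$baseline" "$current"
}

# Typical use (the baseline path is hypothetical):
#   ls /dev/disk/by-id/ | sort > /root/by-id.baseline    # while healthy
#   ls /dev/disk/by-id/ | sort > /tmp/by-id.now          # after failure
#   missing_disks /root/by-id.baseline /tmp/by-id.now
```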
With software RAID this is a common issue. Hardware RAID controllers tend to have a feature that lets you blink the LED associated with a drive, assuming your hardware supports it. With software RAID, though, each drive carries some unique metadata, which you can read with mdadm -E /dev/sda1, run against each drive in the array (adjust the device names to match your environment). So if a drive is giving you trouble and is currently offline, I would run this on each drive that is online, recording the minor number for each one. Then, booting a live CD that supports MD (SystemRescueCd is a good one) with only one drive connected at a time, run the same command to find the culprit. This probably isn't as straightforward as you'd like, but it should work.

lsscsi will list your SCSI devices; if a disk is not in state "running", that's a pretty good sign. /proc/mdstat will tell you which member failed. Assuming you don't have a nice drive cage, you'll have to drill down by serial number; sg_inq should help with that.
If you do have a good drive cage, you should be able to enable the disk beacon to help identify the faulty member.
http://www.mail-archive.com/[email protected]/msg07307.html
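Spotting the failed member in /proc/mdstat can be scripted, since mdadm flags failed members with an "(F)" suffix. A sketch (the path argument exists only so the function can be tried against sample data):

```shell
#!/bin/sh
# Sketch: pull failed members out of /proc/mdstat, where the kernel marks
# them with an "(F)" suffix, e.g. "md0 : active raid1 sdc1[1] sdb1[0](F)".
failed_members() {
    mdstat=${1:-/proc/mdstat}
    grep -o '[a-z]*[0-9]*\[[0-9]*\](F)' "$mdstat" | sed 's/\[.*//'
}

# Usage on a live system: failed_members   (inspects the real /proc/mdstat)
```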
It's simple to get the serial codes of all your hard disks. On my PC, for example, I can see that /dev/sdh1 and /dev/sdg1 are joined in /dev/md0.
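One way to tie the array members back to serials is to match each partition named in /proc/mdstat against the /dev/disk/by-id symlinks. A sketch; both paths are parameters only so it can be tried against sample data:

```shell
#!/bin/sh
# Sketch: for each member partition of an md array, print the matching
# /dev/disk/by-id symlink, whose name carries the model and serial.
members_with_serials() {
    mdstat=$1      # e.g. /proc/mdstat
    byid=$2        # e.g. /dev/disk/by-id
    for part in $(grep -o 'sd[a-z][0-9]*' "$mdstat"); do
        for link in "$byid"/*; do
            # a symlink pointing at /dev/<part> identifies that member
            [ "$(readlink -f "$link" 2>/dev/null)" = "/dev/$part" ] \
                && echo "$part  ${link##*/}"
        done
    done
}

# Usage on a live system: members_with_serials /proc/mdstat /dev/disk/by-id
```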
Since your array doesn't have SES smarts and the disk activity LED isn't directly drivable (you need firmware support for that), the only other thing you can do is quiesce the I/O as best you can and then use something like dd or sg_read on the members themselves to stride a pattern of reads across the disk that creates a uniquely identifiable blink on the activity LED: a poor man's beacon, if you will. It's really your only alternative, unless bringing the array down is an option.

This kind of serviceability is what differentiates external storage arrays. Since you didn't plan ahead by scribbling down the serial numbers and their positions, you can't do the simple set difference to identify the faulty drive. It's the price you pay for the solution you deployed, whether you realize it or not; but hey, live and learn.
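The dd read-pattern trick can be sketched like this; the device, burst size, and timing here are assumptions to tune for your hardware:

```shell
#!/bin/sh
# Sketch of a poor man's beacon: burst-read the suspect member so its
# activity LED blinks in a recognizable on/off rhythm. Burst size, pause
# length, and repeat count are assumptions; tune them for your drives.
blink() {
    disk=$1
    for i in 1 2 3 4 5; do
        dd if="$disk" of=/dev/null bs=1M count=32 2>/dev/null   # LED on
        sleep 2                                                 # LED off
    done
}

# e.g. blink /dev/sdc   (quiesce other I/O first so the pattern stands out)
```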