So the latest configuration of our Nova compute nodes uses raw /dev/sdX devices (no labels or partitions) as components of an md0 (raid0) array, on which they host an XFS filesystem. When one of the underlying hard disks fails, the raid remains blissfully unaware of this.
This is confirmed by other cases, such as "mdadm did not notice a failed disk in raid0".
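For reference, here is roughly how I'd expect one of these arrays to have been built and checked; the device names and the Nova instances mount point below are placeholders, not necessarily our exact layout:

    # Hypothetical reconstruction of the setup (raw, unpartitioned members):
    mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc
    mkfs.xfs /dev/md0
    mount /dev/md0 /var/lib/nova/instances

    # Even with one member failed, raid0 has no degraded state to report:
    cat /proc/mdstat
    mdadm --detail /dev/md0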
The question then arises: after we replace a failed hard disk, how do we reassemble this array without being forced to perform a new mkfs? Or would it be sufficient to fsck the filesystem and have it rediscover the (no-longer-bad) blocks? Is that even a thing? (If the OS tries to use the blocks on the failed device, I presume the drivers simply have to return "bad blocks" for that entire range. Traditionally in Unix filesystems, bad blocks are forever ... you never try to reclaim them.) Is there a switch to xfs_repair to force it to re-evaluate bad blocks?
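For what it's worth, XFS has no traditional fsck pass, so the checks I would naively reach for are these (illustrative only; /dev/md0 and the mount point are assumptions from my own setup):

    umount /var/lib/nova/instances   # xfs_repair requires an unmounted filesystem
    xfs_repair -n /dev/md0           # dry run: report problems without changing anything
    xfs_repair /dev/md0              # actual repair attempt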
Am I misunderstanding the underlying mechanics here?
You cannot. As was said before, raid0 provides no redundancy, regardless of whether it is interleaved (striped) or sequential (linear). Having one raid0 member still functioning while the other has failed is essentially the same as wiping out the [second] half of a non-raid disk: you can still read, and probably write, some of the sectors because they still contain valid data, but as soon as you try to do anything with the others, the OS will fail.
So, if you insist on continuing to rely on the undocumented side effects of a raid0 failure, presenting them as design benefits, you have two choices: write some additional software layers yourself, or meet your doom, because there are no ready-made methods.
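If it helps, the only workable recipe after swapping the disk is to rebuild everything from scratch and restore from backup; a minimal sketch, assuming the surviving member is /dev/sdb and the replacement is /dev/sdc:

    # Stop the broken array and clear any stale metadata on the survivor
    mdadm --stop /dev/md0
    mdadm --zero-superblock /dev/sdb

    # Recreate the raid0 array with the new disk and put a fresh XFS on it
    mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc
    mkfs.xfs -f /dev/md0
    mount /dev/md0 /var/lib/nova/instances

    # Restore the data from backup: nothing on the old array is recoverable,
    # since every other stripe chunk lived on the failed disk.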