I would like to know your strategies for what to do when a disk fails on one of the Hadoop servers.
Let's say I have multiple (>15) Hadoop slave servers and one namenode, and one of the six disks on a slave stops working; the disks are connected via SAS. I don't care about retrieving data from this disk; I'm asking about general strategies for keeping the cluster running.
What do you do?
We deployed Hadoop. You can specify a replication factor for files, i.e. how many times each file gets replicated. Hadoop's single point of failure is the namenode. If you are worried about disks going out, increase the replication factor to 3 or more.
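If it helps, here is a rough sketch of where that lives, assuming a standard install (the /data path below is just a placeholder):

    # Default replication for new files is the dfs.replication property in
    # conf/hdfs-site.xml (3 is the stock default).

    # Raise the replication factor of files that already exist, e.g. everything
    # under /data; -w waits until the target replication is actually reached.
    hadoop fs -setrep -R -w 3 /data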
Then if a disk goes bad, it's very simple: throw it out and reformat. Hadoop will adjust automatically. In fact, as soon as the namenode notices the disk's blocks are missing, it will start re-replicating them to maintain the replication factor.
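A minimal sketch of that disk-swap routine, assuming the stock scripts and default config locations (adjust paths to your install):

    # After swapping the disk: remove (or fix) the failed directory listed in
    # dfs.data.dir in conf/hdfs-site.xml, then restart the datanode on that slave.
    bin/hadoop-daemon.sh stop datanode
    bin/hadoop-daemon.sh start datanode

    # Watch re-replication progress: fsck reports under-replicated/missing blocks,
    # dfsadmin -report shows per-datanode capacity and any dead nodes.
    hadoop fsck /
    hadoop dfsadmin -report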
I am not sure why you have such a large bounty. You said you don't care about retrieving the data, and Hadoop's only single point of failure is the namenode; all other nodes are expendable.
You mentioned this system was inherited (possibly not up to date) and that the load shoots up, which suggests a possible infinite loop. Does this bug report describe your situation?
https://issues.apache.org/jira/browse/HDFS-466
If so, it's been reported as fixed in the latest HDFS 0.21.0 (just released last week):
http://hadoop.apache.org/hdfs/docs/current/releasenotes.html
Disclaimer: To my disappointment I have yet to have the need to use Hadoop/HDFS :)