I have an HDFS/Hadoop cluster set up and am looking into tuning.
I wonder whether raising the HDFS replication factor (default: 3) would improve mapper performance, at the obvious expense of using more disk space.
My reasoning is that if the data is already replicated to more nodes, map tasks can run on more nodes in parallel without any data streaming/copying.
Anyone got any opinions?
Conceptually your conclusions are correct: with blocks available in more places, the scheduler has more freedom to allocate node-local tasks (on the same machine as the input block), and less data will be streamed.
However, before taking that step, are you sure block streaming is actually the source of the slowdown? Unless a small subset of HDFS nodes is hosting the blocks your workload needs, increasing the replication factor isn't really going to help. In other words, if the relevant blocks are already well distributed across the cluster, placing them on additional nodes won't speed up execution very much, because streaming is not your bottleneck.
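One way to check that balance is to ask the NameNode where the replicas of a given input file actually live. A rough sketch using the standard HDFS Java API (the path here is just a placeholder, and this isn't tested against your setup):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    import java.util.HashMap;
    import java.util.Map;

    public class BlockDistribution {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();          // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);
            Path input = new Path("/data/input/part-00000");   // placeholder: one of your input files

            FileStatus status = fs.getFileStatus(input);
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

            // Count how many block replicas each datanode hosts for this file.
            Map<String, Integer> replicasPerHost = new HashMap<String, Integer>();
            for (BlockLocation block : blocks) {
                for (String host : block.getHosts()) {
                    Integer n = replicasPerHost.get(host);
                    replicasPerHost.put(host, n == null ? 1 : n + 1);
                }
            }
            for (Map.Entry<String, Integer> e : replicasPerHost.entrySet()) {
                System.out.println(e.getKey() + "\t" + e.getValue() + " block replicas");
            }
        }
    }

If a handful of hosts dominate that output, locality is likely your problem; if the counts are roughly even, more replicas probably won't buy you much.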
One quick check would be the node-local vs rack-local stats on the JobTracker web interface for the given job.
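If you'd rather pull those numbers programmatically than read them off the web UI, the job counters expose the same information. A sketch using the new mapreduce API; on older JobTracker-era releases the counters sit under a different group name, so check what your version reports:

    import org.apache.hadoop.mapreduce.Counters;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.JobCounter;

    public class LocalityStats {
        // Call this with the Job handle after job.waitForCompletion(true).
        public static void printLocality(Job job) throws Exception {
            Counters counters = job.getCounters();
            long dataLocal = counters.findCounter(JobCounter.DATA_LOCAL_MAPS).getValue();
            long rackLocal = counters.findCounter(JobCounter.RACK_LOCAL_MAPS).getValue();
            long launched  = counters.findCounter(JobCounter.TOTAL_LAUNCHED_MAPS).getValue();

            System.out.println("node-local maps : " + dataLocal);
            System.out.println("rack-local maps : " + rackLocal);
            // Maps that were neither node- nor rack-local had to stream their input.
            System.out.println("other maps      : " + (launched - dataLocal - rackLocal));
        }
    }

A high proportion of node-local maps means the scheduler is already placing tasks next to their data, and extra replicas won't change much.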
If streaming really is the slowdown, is disk I/O or network I/O the bottleneck? One alternative to a permanent replication increase is to temporarily raise the block replication (say, to 4) and then lower it back (to 3), which should leave a more even distribution of blocks across the cluster; see the sketch below. Another option is to unload and reload the files.
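The raise-then-lower trick looks roughly like this with the HDFS Java API (hadoop fs -setrep does the same from the command line); the path and the 4/3 values are just placeholders mirroring the numbers above:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RebalanceByReplication {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path input = new Path("/data/input/part-00000");   // placeholder: one of your input files

            // Step 1: bump replication so the NameNode schedules extra copies
            // on less-used datanodes.
            fs.setReplication(input, (short) 4);

            // ...wait for the extra replicas to be created (watch hdfs fsck or
            // the NameNode UI) before the next step...

            // Step 2: drop back to the normal factor; the NameNode removes the
            // surplus replicas, which tends to leave a more even spread.
            fs.setReplication(input, (short) 3);
        }
    }

For a whole directory tree, hadoop fs -setrep -R is the easier route, since setReplication applies per file.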
If you share some more details about why you think streaming is the bottleneck, there may be other, more appropriate solutions.