Hortonworks says this: "Most often performance of a Hadoop cluster will not be constrained by disk speed – I/O and RAM limitations will be more important." *
How is disk speed not related to I/O limitations?
The comment is technically correct, but it's nuanced. You have to understand what your MapReduce jobs are doing.
While disk rotational speed is important, it's arguably less important than network speed, both off-system and off-switch, especially when your jobs generate large amounts of intermediate data feeding the reduce phase, since reducers cannot take advantage of data locality.
More often than not, you're going to find clusters that use 7200rpm drives configured as JBOD, because that's the general recommendation from the Hadoop community for balancing cost, performance, and reliability. In most configurations you're not going to find more than 1-2 readers/writers hitting each spindle (think: 1-2 tasks per spindle), so improving rotational latency won't buy you much performance (and I'm deliberately sidestepping the issue of SSDs here).
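As a rough illustration of the tasks-per-spindle point, here's a back-of-envelope sketch; the disk count and slot numbers are hypothetical, not taken from any particular cluster.

```python
# Rough tasks-per-spindle estimate for a hypothetical worker node.
# All numbers are illustrative assumptions.
data_disks = 12      # JBOD data disks per node
map_slots = 16       # concurrent map tasks per node (hypothetical)
reduce_slots = 8     # concurrent reduce tasks per node (hypothetical)

concurrent_tasks = map_slots + reduce_slots
tasks_per_spindle = concurrent_tasks / data_disks
print(f"~{tasks_per_spindle:.1f} tasks per spindle")   # ~2.0
```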
With modern 7200rpm drives, you're going to get between 100-200 MBytes/s per drive ... the equivalent of roughly 1-2 Gbit/s. My clusters are built to do about 25 Gbps of disk I/O per node ... but in order to utilize that performance during the shuffle and reduce phase, I need to have at least that much bandwidth available at the network just to get the data off the system.
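To make that arithmetic explicit, here's a small sketch; the drive count and per-drive throughput are assumptions chosen to land near the numbers above, not measurements.

```python
# Back-of-envelope conversion from per-drive throughput to the network
# bandwidth needed to ship that data off the node during shuffle.
# All figures are illustrative assumptions.
MB_PER_DRIVE = 150       # sustained MB/s for a 7200rpm drive (100-200 is typical)
DRIVES_PER_NODE = 16     # hypothetical JBOD layout

gbps_per_drive = MB_PER_DRIVE * 8 / 1000            # ~1.2 Gbit/s per drive
node_disk_gbps = gbps_per_drive * DRIVES_PER_NODE   # ~19 Gbit/s aggregate

print(f"{gbps_per_drive:.1f} Gbit/s per drive")
print(f"{node_disk_gbps:.1f} Gbit/s of disk I/O per node")
# To keep those disks busy during shuffle, the NIC(s) need comparable capacity.
```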
That just gets me (almost) 1:1 oversubscription if I only have to communicate to other nodes on-switch. If my cluster bridges multiple switches, I now have to make sure I have sufficient capacity to handle the significant amounts of east-west traffic that can occur during the shuffle, as data is moved from the mappers into the reducers.
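Extending the same sketch to the multi-switch case: the rack size, uplink count, and off-rack fraction below are hypothetical, purely to show how quickly east-west shuffle demand can exceed uplink capacity.

```python
# Hypothetical rack: how much shuffle traffic leaves the rack, and how
# oversubscribed the uplinks end up. All numbers are assumptions.
NODES_PER_RACK = 20
NODE_SHUFFLE_GBPS = 25    # per-node figure from the disk estimate above
UPLINKS = 4
UPLINK_GBPS = 40          # per-uplink capacity (e.g. 40GbE)

# If reducers are spread evenly across racks, most shuffle data leaves the rack.
off_rack_fraction = 0.75
east_west_demand = NODES_PER_RACK * NODE_SHUFFLE_GBPS * off_rack_fraction
uplink_capacity = UPLINKS * UPLINK_GBPS

print(f"demand {east_west_demand:.0f} Gbit/s vs uplinks {uplink_capacity} Gbit/s "
      f"-> {east_west_demand / uplink_capacity:.1f}:1 oversubscription")
```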
All the disk I/O in the world isn't helpful if you can't get the data where it's needed in the cluster. Data locality and rack awareness help, but only during certain portions of the whole MR process (primarily the map-side read of input data).
I'm totally with you on this; they are linked, especially for Hadoop. I've just finished designing a new pair of clusters, and disk speed was definitely an important aspect of that.
Possible interpretations: