Hortonworks says this: "Most often performance of a Hadoop cluster will not be constrained by disk speed – I/O and RAM limitations will be more important." *
How is disk speed not related to I/O limitations?
The comment is technically correct, but it's nuanced. You have to understand what your MapReduce jobs are doing.
While disk rotational speed is important, it's arguably less important than network speed, both off-system and off-switch, especially when your jobs generate large amounts of intermediate data feeding the reduce phase, since reducers cannot take advantage of data locality.
More often than not, you're going to find clusters that use 7200rpm drives configured as JBOD, because that's the general recommendation from the Hadoop community for balancing cost, performance, and reliability. In most configurations you're not going to find more than 1-2 readers/writers hitting each spindle (think: 1-2 tasks per spindle), so improving rotational latency won't buy you much performance (and I'm deliberately sidestepping the issue of SSDs here).
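As a rough illustration of the tasks-per-spindle point, here's a back-of-envelope sketch; the disk count and slot numbers are hypothetical, not taken from any particular cluster.

```python
# Rough tasks-per-spindle estimate for a hypothetical worker node.
# All numbers are illustrative assumptions.
data_disks = 12      # JBOD data disks per node
map_slots = 16       # concurrent map tasks per node (hypothetical)
reduce_slots = 8     # concurrent reduce tasks per node (hypothetical)

concurrent_tasks = map_slots + reduce_slots
tasks_per_spindle = concurrent_tasks / data_disks
print(f"~{tasks_per_spindle:.1f} tasks per spindle")   # ~2.0
```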
With modern 7200rpm drives, you're going to get between 100-200 MBytes/s per drive ... the equivalent of roughly 1-2 Gbit/s. My clusters are built to do about 25 Gbps of disk I/O per node ... but in order to utilize that performance during the shuffle and reduce phase, I need to have at least that much bandwidth available at the network just to get the data off the system.
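To make that arithmetic explicit, here's a small sketch; the drive count and per-drive throughput are assumptions chosen to land near the numbers above, not measurements.

```python
# Back-of-envelope conversion from per-drive throughput to the network
# bandwidth needed to ship that data off the node during shuffle.
# All figures are illustrative assumptions.
MB_PER_DRIVE = 150       # sustained MB/s for a 7200rpm drive (100-200 is typical)
DRIVES_PER_NODE = 16     # hypothetical JBOD layout

gbps_per_drive = MB_PER_DRIVE * 8 / 1000            # ~1.2 Gbit/s per drive
node_disk_gbps = gbps_per_drive * DRIVES_PER_NODE   # ~19 Gbit/s aggregate

print(f"{gbps_per_drive:.1f} Gbit/s per drive")
print(f"{node_disk_gbps:.1f} Gbit/s of disk I/O per node")
# To keep those disks busy during shuffle, the NIC(s) need comparable capacity.
```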
That just gets me (almost) 1:1 oversubscription if I only have to communicate to other nodes on-switch. If my cluster bridges multiple switches, I now have to make sure I have sufficient capacity to handle the significant amounts of east-west traffic that can occur during the shuffle, as data is moved from the mappers into the reducers.
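Extending the same sketch to the multi-switch case: the rack size, uplink count, and off-rack fraction below are hypothetical, purely to show how quickly east-west shuffle demand can exceed uplink capacity.

```python
# Hypothetical rack: how much shuffle traffic leaves the rack, and how
# oversubscribed the uplinks end up. All numbers are assumptions.
NODES_PER_RACK = 20
NODE_SHUFFLE_GBPS = 25    # per-node figure from the disk estimate above
UPLINKS = 4
UPLINK_GBPS = 40          # per-uplink capacity (e.g. 40GbE)

# If reducers are spread evenly across racks, most shuffle data leaves the rack.
off_rack_fraction = 0.75
east_west_demand = NODES_PER_RACK * NODE_SHUFFLE_GBPS * off_rack_fraction
uplink_capacity = UPLINKS * UPLINK_GBPS

print(f"demand {east_west_demand:.0f} Gbit/s vs uplinks {uplink_capacity} Gbit/s "
      f"-> {east_west_demand / uplink_capacity:.1f}:1 oversubscription")
```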
All the disk I/O in the world isn't helpful if you can't get the data where it's needed in the cluster. Data locality and rack awareness help, but only during certain portions of the whole MR process (primarily the map-side read of input data).
I'm totally with you on this; they are linked, especially for Hadoop. I've just finished designing a new pair of clusters, and disk speed was definitely an important aspect of that.
Possible interpretations: