I have a situation where I'd like to run Hadoop spread across 2 clusters. The first cluster (ClusterA) is normal and all nodes are publicly accessible. The second cluster (ClusterB) is behind a NAT.
Nodes in ClusterA will be running both Mapred and HDFS, while nodes in ClusterB will be running Mapred without HDFS and will not be allowed to run Reduce Tasks. The master node (jobtracker, namenode, secondary namenode) will be in ClusterA.
My question is: if I start the ClusterB TaskTrackers independently, without using bin/start-all.sh from the JobTracker, will this setup work? TaskTrackers in ClusterB will open their own command-and-control (heartbeat) connection to the JobTracker and should receive MapTask assignments over that connection. HDFS will live entirely in ClusterA, so all nodes should be able to read blocks fine.
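For reference, a minimal sketch of the ClusterB side under these assumptions (the hostname and port are placeholders): point each ClusterB TaskTracker at the ClusterA master in its mapred-site.xml, and set its reduce slots to zero so the JobTracker only offers it map tasks.

```xml
<!-- mapred-site.xml on each ClusterB node (sketch; hostname is a placeholder) -->
<configuration>
  <!-- Heartbeat/C&C endpoint: the JobTracker in ClusterA -->
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.clusterA.example.com:9001</value>
  </property>
  <!-- Zero reduce slots: this node will never be assigned Reduce tasks -->
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>0</value>
  </property>
</configuration>
```

With that in place, each ClusterB node can start its daemon locally with `bin/hadoop-daemon.sh start tasktracker` rather than being launched from the master, and it will open the outbound heartbeat connection itself.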
The only issue I can think of is Reduce tasks running in ClusterA attempting to get intermediate data stored on ClusterB nodes. Is this a push or a pull operation? Are there any other scenarios where the NAT will cause problems?
The answer: the shuffle is a pull operation. Reduce tasks fetch map output from the TaskTrackers that ran the maps, so reducers in ClusterA need inbound access through the NAT to the ClusterB nodes. Otherwise the setup appears to work.
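Concretely, each reducer pulls intermediate data with HTTP requests against the small HTTP server embedded in the TaskTracker that ran the map, so that port on every ClusterB node must be reachable from ClusterA. A sketch of the relevant setting, assuming you forward the default port through the NAT:

```xml
<!-- mapred-site.xml: the shuffle port reducers must be able to reach -->
<property>
  <name>mapred.task.tracker.http.address</name>
  <!-- Default is 0.0.0.0:50060; whatever port you choose here must be
       forwarded through the NAT for every ClusterB node -->
  <value>0.0.0.0:50060</value>
</property>
```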