I have Nutch/Hadoop running fine in pseudo-distributed mode. I want to add processing capacity by adding new nodes that are smaller than the master (disks three times smaller) and, of course, cheaper.
Since the default HDFS replication factor is 3, I assume I will not gain more space after balancing the data, but space is not my first concern.
Do I still get more processing power?
I don't understand how map/reduce tasks interact with replication. How is it decided which node gets the work, among the nodes holding the different replicas?
You will have to move from your pseudo-distributed setup to a fully distributed cluster setup. Once you do, adding nodes will indeed give you more processing capacity, i.e. you will be able to run more map and reduce tasks at the same time. As you would expect, the increase is roughly linear in the number of nodes you add.
Replication determines the number of replicas that are present in your cluster for each HDFS block. So let's assume that you have a file that is split into 6 blocks: with a replication factor of 3, 18 block replicas will be spread out across your cluster. The more nodes you have, the more widely those replicas are spread, and thus less data has to be transferred between datanodes when the map phase starts. To answer your final question, Hadoop will always try to assign map tasks to nodes that serve as datanodes for the input to those map tasks. So in this case the replicas make the job easier, since there is a larger pool of tasktrackers to choose from.
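To make the locality idea concrete, here is a small Python sketch. It is a toy model, not Hadoop's actual scheduler: the host names, the random replica placement, the two map slots per node and the function names are all assumptions for illustration. It places 6 blocks with replication 3 and gives each map task to a node that already holds a replica of its block whenever such a node has a free slot.

```python
import random

def place_replicas(num_blocks, nodes, replication=3, seed=0):
    """Put each block's replicas on `replication` distinct nodes, chosen at random.

    Simplification: real HDFS placement also considers racks and free space.
    """
    rng = random.Random(seed)
    return {block: rng.sample(nodes, replication) for block in range(num_blocks)}

def assign_map_tasks(placement, slots_per_node):
    """Assign one map task per block, preferring a node that holds a replica.

    Returns {block: (node, is_data_local)}. Models a single wave of tasks,
    so it needs at least as many free slots as there are blocks.
    """
    if sum(slots_per_node.values()) < len(placement):
        raise ValueError("not enough map slots for a single wave")
    load = {node: 0 for node in slots_per_node}
    assignment = {}
    for block, holders in placement.items():
        free = [n for n in load if load[n] < slots_per_node[n]]
        local = [n for n in free if n in holders]
        # Prefer a data-local node; otherwise fall back to any node with a free slot.
        candidates = local or free
        node = min(candidates, key=lambda n: load[n])
        load[node] += 1
        assignment[block] = (node, bool(local))
    return assignment

nodes = ["master", "small1", "small2", "small3"]  # hypothetical host names
placement = place_replicas(6, nodes)              # 6 blocks, replication 3
total_replicas = sum(len(holders) for holders in placement.values())
assignment = assign_map_tasks(placement, {n: 2 for n in nodes})
print(total_replicas, sum(is_local for _, is_local in assignment.values()))
```

With four nodes, every block is missing from only one of them, so in this toy setup each of the 6 map tasks can run next to a copy of its input.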
Your question is a bit confusing. In pseudo-distributed mode, all four processes (JobTracker, NameNode, DataNode, TaskTracker) are launched on the same (typically development) system.
The Hadoop xxx-site.xml configuration for pseudo-distributed mode has everything talking to localhost, so adding new nodes won't work as is.
Leaving that aside, if you are adding more nodes, and these are running both DataNodes and TaskTrackers, then you will get added storage and CPU capacity.
Note that as HDFS fills up, 3x replication eventually becomes impossible once all of the smaller nodes are at capacity, so you'll start getting warnings/errors.
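To put rough numbers on that limit, here is a back-of-the-envelope Python sketch. It is an idealized estimate, not something Hadoop computes for you: the capacities are made-up units (a master of 300 and small nodes of 100, matching "3 times smaller"), `usable_capacity` is a hypothetical helper, and it ignores non-DFS disk usage and block granularity. The only rule it encodes is that the replicas of a block must sit on different nodes, so no single node can hold more than one copy of the data set.

```python
def usable_capacity(capacities, replication=3):
    """Largest amount of data that can be stored with full replication.

    Each block needs `replication` copies on distinct nodes, so a node can
    store at most one copy of the whole data set, however large its disk is.
    Returns 0.0 if there are fewer nodes than the replication factor.
    """
    caps = sorted(capacities, reverse=True)
    if len(caps) < replication:
        return 0.0
    for k in range(replication):
        # Assume the k largest nodes each hold one replica of every block;
        # the remaining nodes share the other (replication - k) replicas.
        data = sum(caps[k:]) / (replication - k)
        if caps[k] <= data:
            # The largest remaining node is not asked to hold more than it can.
            return data
    return 0.0  # not reached when len(caps) >= replication

print(usable_capacity([300, 100, 100]))       # master + 2 small nodes
print(usable_capacity([300, 100, 100, 100]))  # master + 3 small nodes
```

In this model, a master plus two small nodes can hold only as much fully replicated data as the smallest node, and most of the master's disk stays unused. Adding a third small node raises the limit, because the two replicas that are not on the master can then be spread over three disks.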