I'm working on a team building a system for creating Hadoop clusters on EC2 with minimal effort on the part of the user. Ideally, we would like slave instances to require only the hostname of the master instance as user data on boot. The slaves would then rsync their configurations from the master instance and start their TaskTracker and DataNode daemons automatically.
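To make that concrete, here is roughly the kind of slave bootstrap script we have in mind. This is only a sketch: the /usr/local/hadoop layout, the paths, and the assumption of passwordless rsync-over-SSH are placeholders, not our actual implementation.

    #!/bin/bash
    # Hypothetical slave bootstrap (paths are examples, not our real layout).
    # The master hostname is the only piece of user data passed at boot.
    MASTER=$(curl -s http://169.254.169.254/latest/user-data)

    # Pull the master's Hadoop config; assumes passwordless rsync-over-SSH is set up.
    rsync -az "$MASTER:/usr/local/hadoop/conf/" /usr/local/hadoop/conf/

    # Start the worker daemons; they locate the NameNode/JobTracker from that config.
    /usr/local/hadoop/bin/hadoop-daemon.sh start datanode
    /usr/local/hadoop/bin/hadoop-daemon.sh start tasktracker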
My question is this: is it necessary for the hostnames of the slave instances to be listed in the master instance's conf/slaves file? The only time I have ever seen this file used in the Hadoop code is by the start-{dfs,mapred}.sh scripts, which SSH into all the machines listed and start the daemons. If the daemons on the slave nodes start automatically, and if they know the location of the JobTracker and NameNode (through the configuration), can they connect to the JobTracker/NameNode on their own and be treated like "normal" slaves?
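The settings I have in mind are fs.default.name and mapred.job.tracker. As a rough sketch (the hostname, ports, and paths below are placeholder examples only), the slaves' config would contain something like this:

    # Hypothetical: point a slave at the master by hostname. Ports 9000/9001
    # are just common examples; adjust to whatever your cluster actually uses.
    MASTER=ec2-xx-xx-xx-xx.compute-1.amazonaws.com

    cat > /usr/local/hadoop/conf/core-site.xml <<EOF
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://$MASTER:9000</value>
      </property>
    </configuration>
    EOF

    cat > /usr/local/hadoop/conf/mapred-site.xml <<EOF
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>$MASTER:9001</value>
      </property>
    </configuration>
    EOF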
I suppose the best way to find out is to try it, but we are wondering about the time investment and complexity of such a system, so I thought I would see if anyone here has experience with this problem. I'll edit if I find an answer myself.
EDIT: I tested this out, and the whole system seems to work fine without any slaves listed in conf/slaves. The JobTracker shows the slave's TaskTracker in its nodes list, and I have run a test job successfully.
ANOTHER EDIT: It is worth noting that this will not work if you use the DFS host whitelist (conf/dfs.hosts), a feature available at least in Cloudera's distribution.
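For anyone who hits the same thing: dfs.hosts points the NameNode at a whitelist file, and DataNodes not listed there are rejected when they try to register. A rough sketch of that setup (hostnames and paths below are just examples):

    # Hypothetical whitelist: only hosts listed in this file may register as DataNodes.
    cat > /usr/local/hadoop/conf/dfs.hosts <<EOF
    ip-10-0-0-12.ec2.internal
    ip-10-0-0-13.ec2.internal
    EOF

    # hdfs-site.xml (or hadoop-site.xml on older versions) then points at that file:
    #   <property>
    #     <name>dfs.hosts</name>
    #     <value>/usr/local/hadoop/conf/dfs.hosts</value>
    #   </property>

    # Tell the NameNode to re-read the list without a restart.
    /usr/local/hadoop/bin/hadoop dfsadmin -refreshNodes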
The slaves file is only used by the bin/start-*.sh and stop-*.sh scripts. If you're running on EC2, you should check out the EC2 scripts, e.g. "hadoop-ec2 update-slaves-file".
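For what it's worth, a simplified sketch of what those start/stop scripts do with the slaves file (this is not the actual bin/slaves.sh source, just the general shape of it):

    # Roughly what the start scripts do: SSH into each host listed in conf/slaves
    # and launch the daemon there, in parallel.
    for slave in $(cat "${HADOOP_CONF_DIR:-/usr/local/hadoop/conf}/slaves"); do
      ssh "$slave" "/usr/local/hadoop/bin/hadoop-daemon.sh start datanode" &
    done
    wait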