I've recently been playing with Hadoop. I have a six-node cluster up and running with HDFS, and have run a number of MapReduce jobs on it. So far, so good. However, I'm now looking to do this more systematically and with a larger number of nodes. Our base system is Ubuntu, and the current setup has been administered using apt (to install the correct Java runtime) and ssh/scp (to propagate the various conf files). This clearly will not scale over time.
Does anyone have experience with good systems for administering (possibly slightly heterogeneous: different disk sizes, different numbers of CPUs on each node) Hadoop clusters automagically? I would consider diskless boot, but I imagine that with a large cluster, bringing the cluster up might be bottlenecked on the machine serving the OS. Or is there some form of distributed Debian apt to keep the machines' native environments synchronised? And how do people successfully manage the conf files across a number of (potentially heterogeneous) machines?
Thanks very much in advance,
Alex
I would recommend keeping your nodes as similar as possible. As you have found out, different setups for each node make life difficult.
In the clusters I currently run, every box is exactly the same, which means the configuration for every node is the same. Configuration is stored on an NFS homedir. The machines are installed as standard CentOS, and then a CFengine policy is applied that handles the installation of the CDH Hadoop/HBase packages, set up to use the shared config. Once the daemons have been started, the machine automatically becomes part of the cluster.
In general, I strongly recommend using CFengine, Puppet, Chef, or one of the other configuration management systems. This makes life a lot simpler, especially when configurations differ between nodes. It also means you can just have a standard base install of an operating system, and then apply the policies to handle all the installation and configuration. No network boots needed.
The slightly frustrating thing with different configs is that config files like hdfs-site.xml and mapred-site.xml cannot use inheritance: you cannot provide a generic file plus a few node-specific settings, such as the data directory or the number of map slots. So what you would probably have to do is keep a generic file, merge it with the specific settings for each node, and then push the result out as that node's configuration.
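To make the merge step concrete, here is a minimal sketch in Python. It is only an illustration, not what my CFengine policy does: the function names, the node name "node07", and the example values are all made up, and it assumes the files contain only plain name/value properties. It reads a generic Hadoop-style config, overlays a node's specific settings, and produces the XML you would push to that node.

```python
import xml.etree.ElementTree as ET

# Generic settings shared by every node (example values, not a recommendation).
GENERIC_XML = """
<configuration>
  <property><name>dfs.replication</name><value>3</value></property>
  <property><name>dfs.data.dir</name><value>/data/1/dfs</value></property>
</configuration>
"""

# Hypothetical per-node overrides, keyed by an invented hostname.
NODE_OVERRIDES = {
    "node07": {"dfs.data.dir": "/data/1/dfs,/data/2/dfs"},
}


def parse_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop-style config."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value")
            for p in root.findall("property")}


def merge_config(generic_xml, overrides):
    """Overlay node-specific settings on the generic ones and return XML text."""
    props = parse_properties(generic_xml)
    props.update(overrides)  # node-specific values win over generic ones
    root = ET.Element("configuration")
    for name in sorted(props):
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(props[name])
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Print the merged config that would be pushed to the example node.
    print(merge_config(GENERIC_XML, NODE_OVERRIDES["node07"]))
```

In practice you would let your configuration management tool generate the file from a template instead, but the idea is the same: one generic source, a small set of per-node overrides, and a generated file per node.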