We're going to purchase some new hardware to use just for a Hadoop cluster and we're stuck on what we should buy. Say we have a budget of $5k: should we buy two really nice machines at $2,500 each, four at around $1,200 each, or eight at around $600 each? Will Hadoop work better with more, slower machines or with fewer, much faster machines? Or, as with most things, is it "it depends"? :-)
If you can, I would look at using a cloud infrastructure service like Amazon Web Services (AWS) Elastic Compute Cloud (EC2), at least until you determine that it makes sense to invest in your own hardware. It's easy to get caught up in buying the shiny gear (I have to resist daily). By trying before you buy in the cloud, you can learn a lot and answer the question: does my company's software X or map/reduce framework, run against this data set, best match a small, medium, or large set of servers? I ran a number of combinations on AWS, scaling up, down, in, and out for pennies on the dollar within a few days (a rough cost sketch follows the pricing below). We were so happy with our testing that we decided to stay with AWS and forgo buying a large cluster of machines that we would have to cool, power, maintain, and so on. Instance types range from:
Standard Instances

High-CPU Instances

High-CPU Medium Instance: 1.7 GB of memory, 5 EC2 Compute Units (2 virtual cores with 2.5 EC2 Compute Units each), 350 GB of instance storage, 32-bit platform

High-CPU Extra Large Instance: 7 GB of memory, 20 EC2 Compute Units (8 virtual cores with 2.5 EC2 Compute Units each), 1690 GB of instance storage, 64-bit platform

One EC2 Compute Unit (ECU) provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.

On-demand pricing (per instance-hour):

Standard On-Demand Instances    Linux/UNIX Usage    Windows Usage
Small (Default)                 $0.10 per hour      $0.125 per hour
Large                           $0.40 per hour      $0.50 per hour
Extra Large                     $0.80 per hour      $1.00 per hour

High-CPU On-Demand Instances    Linux/UNIX Usage    Windows Usage
Medium                          $0.20 per hour      $0.30 per hour
Extra Large                     $0.80 per hour      $1.20 per hour
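To give a feel for how cheap the experimentation is, here's a minimal back-of-the-envelope sketch in Python using the Linux/UNIX on-demand rates quoted above. The candidate cluster sizes and the six-hour test window are assumptions for illustration, not measurements of any real job:

```python
# Rough cost estimate for "try before you buy" benchmarking on EC2.
# Hourly rates are the Linux/UNIX on-demand prices listed above;
# cluster sizes and test duration are hypothetical assumptions.

HOURLY_RATE = {            # $/hour per instance (Linux/UNIX, on-demand)
    "standard.small": 0.10,
    "standard.large": 0.40,
    "standard.xlarge": 0.80,
    "highcpu.medium": 0.20,
    "highcpu.xlarge": 0.80,
}

# Candidate test clusters: (instance type, number of instances)
test_clusters = [
    ("standard.large", 4),
    ("standard.xlarge", 2),
    ("highcpu.medium", 8),
    ("highcpu.xlarge", 2),
]

HOURS_OF_TESTING = 6  # assumed length of one benchmarking session

for instance_type, count in test_clusters:
    cost = HOURLY_RATE[instance_type] * count * HOURS_OF_TESTING
    print(f"{count} x {instance_type:<16} for {HOURS_OF_TESTING}h -> ${cost:.2f}")
```

Even running all four of those configurations for an afternoon comes to well under $100, versus committing the whole $5k budget to hardware up front.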
Sorry to make an answer sound like a vendor pitch, but if your environment allows you to go this route, I think you'll be happy and make a much better purchase decision should you buy your own hardware in the future.
I don't think you should be thinking in terms of the number of servers, but in terms of the number of CPU cores and the amount of memory. From what I remember, Hadoop loves memory. The more cores you have, the more task processes you can run at the same time.
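As a rough illustration of that way of thinking, here's a small Python sketch that totals cores and memory for the three purchase options in the question. The per-machine specs are hypothetical placeholders, not quotes for real hardware:

```python
# Compare the three $5k purchase options by total cores and total memory
# rather than by machine count. The per-machine specs are hypothetical.

options = [
    # (description,        machines, cores_each, ram_gb_each)
    ("2 x $2500 machines", 2, 8, 32),
    ("4 x $1200 machines", 4, 4, 16),
    ("8 x $600 machines",  8, 2, 8),
]

for name, machines, cores, ram in options:
    total_cores = machines * cores
    total_ram = machines * ram
    print(f"{name}: {total_cores} cores, {total_ram} GB RAM total")
```

Whichever option wins on total cores and memory (after leaving headroom on each node for the OS and the Hadoop daemons) is usually the stronger buy.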
I think it's going to be dependent on your workload. How well do your jobs partition? Fewer, bigger chunks will probably favour a few fast servers, whereas many smaller jobs might favour more, slower machines.
It totally depends on your workload. Is your task highly parallel, or does it have a large serial component? If it scales well, you should try to get as many cores as possible for your money. If it doesn't scale well, then find the point at which the scaling breaks down (there's a quick Amdahl's-law sketch below) and try to buy the most powerful CPUs you can for that number of cores.
This is just a general guideline, but I don't think there's anything specific about Hadoop that gives it any kind of special requirements beyond any other parallelization framework.
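To make the scaling point concrete, here's a minimal Python sketch of Amdahl's law. The parallel fractions and core counts below are arbitrary illustrative values, not measurements of any particular job:

```python
# Amdahl's law: speedup(n) = 1 / ((1 - p) + p / n),
# where p is the fraction of the job that parallelizes and n is the core count.
# The values of p and n below are illustrative only.

def speedup(p: float, n: int) -> float:
    return 1.0 / ((1.0 - p) + p / n)

for p in (0.90, 0.99):
    for n in (8, 16, 32):
        print(f"parallel fraction {p:.2f}, {n:2d} cores -> speedup {speedup(p, n):.1f}x")
```

With a 10% serial portion, 32 cores only buy you about a 7.8x speedup; that flattening-out is the break-even point to look for before spending on more cores.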
Keep in mind also that very small Hadoop clusters just don't work very well, especially under failure scenarios. The problem is that many heuristics are tuned with the assumption that the cluster will have >20 machines. Some of these heuristics simply fail on very small clusters.
A good example (which may still not have been fixed even in the most recent releases) is what happens when you write a block. Assuming replication = 3, three nodes are chosen at random to host the replicas. If one of those nodes fails during the write, the namenode is asked for a different set of three random nodes. On a large cluster, the probability that the new set contains the failing node is negligible, but on a very small cluster of, say, 6 nodes, there is a high chance that the failed node will be in the new list. The write will fail again, and possibly again after that, which is enough to tank the job. The fix is obvious, but the scenario is too low-probability for most committers to prioritize, so it hasn't been integrated quickly.
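A quick back-of-the-envelope calculation shows how much worse this retry gets on a small cluster; the cluster sizes below are just illustrative:

```python
# Probability that a fresh random pick of 3 replica nodes includes one
# particular (failed) node on a cluster of N nodes:
#   P = 1 - C(N-1, 3) / C(N, 3) = 3 / N
# Cluster sizes chosen purely for illustration.

from math import comb

def p_retry_hits_failed_node(n_nodes: int, replication: int = 3) -> float:
    return 1 - comb(n_nodes - 1, replication) / comb(n_nodes, replication)

for n in (6, 10, 20, 100):
    print(f"{n:3d}-node cluster: {p_retry_hits_failed_node(n):.1%} chance the retry "
          f"picks the failed node again")
```

On 6 nodes, every retry has a coin-flip (50%) chance of landing on the dead node again, while at 100 nodes it drops to 3%, which is why the heuristic is harmless on the cluster sizes it was tuned for.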
Hadoop really doesn't have an enterprise-level distribution yet that addresses the full range of scalability, up and down. Perhaps soon, but not yet.
The recommendation to use EC2/EMR until you are clear on your needs is an excellent one. Not only will it let you understand your constraints and needs better, but it will let you run considerably larger clusters than the ones you are talking about buying.