Our company runs thousands of e-commerce web sites on two clusters in two separate data centers.
Basically, all we require to operate are rack-mountable server nodes. Each node needs:
1.) 4 or 8 cores
2.) 32 GB RAM
3.) One 250 GB SATA disk
4.) 2-port Gigabit Ethernet adapters
5.) Ability to boot Windows XP Pro
That's it. We run about 40 such nodes in a fully redundant, always-up (hopefully!) cluster. We wrote the clustering part ourselves.
Previously, we bought our systems whiteboxed: a small shop custom-built our servers (Supermicro) to our specs.
This scheme was working well up to our last round of node purchases. The nodes from that last round have had a super, super high failure rate (30% failed in 6 months). There is no single cause: bad PSU, bad memory, fried motherboard, etc.
My questions are these:
Will we have more consistent reliability if we purchase from a name brand vendor (IBM/DELL/HP) or are we basically in the same crap shoot of reliability we were in before? Remember, these are low end servers. We are not going to transition to a mainframe or anything exotic.
Will our reliability vary with the form factor of the servers? That is to say, will 2U servers be any more reliable than high-density servers with 2 nodes in a 1U box?
Has anybody out there transitioned from white box servers to name brand servers (or changed form factors) and have a tale to tell?
The brand names, in general, tend to be more reliable than whiteboxes (although Supermicro doesn't count as "white box" in my world). However, you will still have the occasional run of bad luck with hardware from the name brands. What you do tend to get, though, if you've got a large purchasing volume and history with one of the bigger kids, is a quick turnaround on fixing those sorts of problems. If you get a dud batch of motherboards from a whitebox vendor, there's limited chance that they'll have a pile of spares sitting around to replace them with, whereas a big name will have spares out their ears -- and long-term, loyal customers (i.e. "cash cows") will get that stock first.
Ultimately, though, it's computer hardware, and this sort of thing is why we run extensive burn-in tests on all hardware received. This stuff happens with alarming regularity once you get into large-scale management, and having it fail on the test rack is a far better option than having it fail in production (even if you do have massively redundant systems).
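To make the burn-in idea concrete, here is a minimal sketch of a test loop. It is an illustration only, not the tooling we actually run: the function names, byte patterns, buffer size and cycle count are all assumptions chosen for the example, and a real burn-in would run dedicated memory, disk and CPU stress tools for many hours on the test rack.

```python
import hashlib
import os
from typing import List


def memory_pattern_pass(size_bytes: int, pattern: int) -> bool:
    """Fill a buffer with one byte pattern and check it reads back intact."""
    buf = bytearray([pattern]) * size_bytes
    return buf.count(pattern) == size_bytes


def cpu_consistency_pass(rounds: int, payload: bytes) -> bool:
    """Hash the same payload repeatedly; a differing digest suggests a fault."""
    expected = hashlib.sha256(payload).hexdigest()
    return all(
        hashlib.sha256(payload).hexdigest() == expected for _ in range(rounds)
    )


def burn_in(cycles: int = 3, size_bytes: int = 1 << 20) -> List[str]:
    """Run a few memory and CPU passes; return a list of failure descriptions."""
    failures = []
    payload = os.urandom(4096)
    for cycle in range(cycles):
        # Hypothetical pattern set: all zeros, all ones, alternating bits.
        for pattern in (0x00, 0xFF, 0xAA, 0x55):
            if not memory_pattern_pass(size_bytes, pattern):
                failures.append(f"cycle {cycle}: memory pattern {pattern:#04x}")
        if not cpu_consistency_pass(1000, payload):
            failures.append(f"cycle {cycle}: cpu hash mismatch")
    return failures


if __name__ == "__main__":
    problems = burn_in()
    print("PASS" if not problems else "\n".join(problems))
```

The point is the shape of the process: loop the box hard before it goes anywhere near production, log every failure, and send back anything that does not come through clean.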
Also, "runs XP Pro" -- are you serious?
Change the builder but keep the brand.
Supermicro hardware is really good. If you're getting such high failure rates, I'd first suspect that the build guys are messing it up.
Supermicro is a very reliable brand, from the motherboards to their full solutions.
A good builder should stand behind their work, and should help you out however possible. Going with a major brand like Dell or HP will get you the same thing.
As for the configuration type: the more heat you have in one spot, the higher the failure rate could be. So 2 nodes in a 1U box are going to put off more heat than 1 node in a 2U box. If you have enough cooling in your rack, though, this shouldn't be a factor at all.
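As a rough illustration of the heat-density point, the sketch below compares watts dissipated per rack unit for the two layouts. The 250 W per node figure is purely an assumption for the example, not a measurement of any particular server; plug in your own numbers.

```python
# 1 watt of dissipated power is roughly 3.412 BTU per hour.
WATTS_TO_BTU_PER_HOUR = 3.412


def heat_per_rack_unit(nodes: int, watts_per_node: float, rack_units: int) -> float:
    """Watts dissipated per rack unit for a chassis of the given size."""
    return nodes * watts_per_node / rack_units


# Hypothetical draw of 250 W per node, identical in both layouts.
twin_1u = heat_per_rack_unit(nodes=2, watts_per_node=250.0, rack_units=1)
single_2u = heat_per_rack_unit(nodes=1, watts_per_node=250.0, rack_units=2)

if __name__ == "__main__":
    for name, watts in (("2 nodes in 1U", twin_1u), ("1 node in 2U", single_2u)):
        btu = watts * WATTS_TO_BTU_PER_HOUR
        print(f"{name}: {watts:.0f} W/U, about {btu:.0f} BTU/h per U")
```

With the same node in both chassis, the twin 1U layout concentrates four times the heat per rack unit, which is why the rack cooling matters more than the form factor itself.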
One nice thing about Dell is that they build your servers to spec, and they do this in a very clean environment, which adds to the longevity of their servers. In my experience, never ever opening a server adds to longevity. I'd say that if the server works after the first year, it's likely to keep working for a long time. Further, you want to keep your servers in a good datacentre that provides a good environment both electrically and physically. Steady temperatures matter: varying temperatures kill hardware much faster.
As for form factor, any decent supplier like the well-known brand names constructs their systems in such a manner as to negate the majority of effects due to form factor. Personally I'd say it doesn't matter, although that isn't entirely true. Dell, HP and IBM are well known for flaming each other's blade center designs. :-) But I dare say they are all pretty darn good anyway, so at the end of the day it's their hardware replacement plans and TCO that matter, as long as it's a serious company.
We stick with Dell because they're cheaper than IBM and HP, and in my experience they have very low failure rates because of the way they distribute their stuff (build to spec and ship). This also saves me a bunch of time. Last time I shopped HP I bought some 30 blades with assorted disks, storage, etc. It was delivered as some 316 boxes. Dell would ship it as more like 10. :-) I don't like spending three hours unboxing hardware and then having to drag it into the datacenter and get it into racks (because that's the only safe place to leave hardware anyway).
As far as temperature goes, I'd look into the 55xx series Xeon CPUs, especially the L variants. They are highly energy efficient, usually running at 60 watts or thereabouts.
And, hehe, yes, what's with the XP? Are you running your webservers on XP Pro? :-)
The selling point for me when buying hardware from large OEMs is that, as opposed to smaller vendors, large OEMs build thousands of machines every day and have their manufacturing/assembly process fine-tuned to a science. They have parts manufacturers and engineers at their beck and call, and have parts depots and service technicians in every major metro area. Not only is the equipment "road tested" before it's delivered to you, it comes with thousands of man-hours of experience and engineering behind it. IMHO this translates into reliability, stability, and consistency.
One thing I don't like about lower end hardware is ventilation. With high-density 1 or 2U servers, fans and lots of them are critical, and so are thermal zones. The IBM/HP/Dell servers have this down to a science, and they also have numerous temperature/fan speed sensors, and management software that will alert you if something is out of whack.
If you already have all of this covered, I wouldn't focus on switching hardware brands.
Most good servers are rated up to about 95 degrees F (35 degrees C) inlet temp, but it can quickly get much hotter than that in a rack or enclosure with poor ventilation.
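If you can pull inlet readings from your servers' sensors, a check against that rating is simple to script. This is a minimal sketch: the node names and readings are made up for illustration, and how you actually collect the temperatures depends on your vendor's management tooling.

```python
from typing import Dict, List

# Rated inlet limit quoted above, in degrees Fahrenheit.
INLET_LIMIT_F = 95.0


def fahrenheit_to_celsius(temp_f: float) -> float:
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


def nodes_over_limit(readings_f: Dict[str, float],
                     limit_f: float = INLET_LIMIT_F) -> List[str]:
    """Return the names of nodes whose inlet reading exceeds the limit."""
    return sorted(name for name, temp in readings_f.items() if temp > limit_f)


if __name__ == "__main__":
    # Hypothetical readings; real values would come from the hardware sensors.
    sample = {"node01": 78.0, "node02": 96.5, "node03": 95.0}
    for name in nodes_over_limit(sample):
        temp_c = fahrenheit_to_celsius(sample[name])
        print(f"{name}: inlet {sample[name]:.1f} F ({temp_c:.1f} C) is over the limit")
```

A node sitting exactly at the limit is not flagged here; tighten the comparison or lower the threshold if you want some headroom.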