Let's say I'm setting up a single-machine server. Without knowing the specific components in it (and so without being able to look up their MTBFs), what are the typical relative failure rates of the hardware components in the server?
Equivalently, what are the rankings of the most often-replaced components across all the servers in corporate use?
Regarding hard disks: many people misunderstand MTBF and think a drive with an MTBF of 100,000 hours will last, on average, for 11.5 years. What the manufacturer means is that in a collection of a large number of drives, N, all within their service lifetime, one drive will fail, on average, every 100,000/N hours. If you have 100,000 drives that each have an MTBF of 100,000 hours, then you should expect a drive to fail -- on average -- every hour.
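A minimal sketch of that arithmetic, assuming independent failures and the usual constant-failure-rate reading of MTBF; the function name and the 8-drive example are illustrative only, not from any vendor's spec:

```python
# Expected interval between failures across a fleet of drives,
# under the simple constant-failure-rate interpretation of MTBF.
def expected_hours_between_failures(mtbf_hours: float, num_drives: int) -> float:
    """With num_drives drives each rated at mtbf_hours, expect roughly one
    failure every mtbf_hours / num_drives hours while the drives are
    within their service life."""
    return mtbf_hours / num_drives

# The example from the text: 100,000 drives rated at 100,000 hours MTBF.
print(expected_hours_between_failures(100_000, 100_000))  # -> 1.0 hour between failures

# A hypothetical small server with 8 drives rated at 1,000,000 hours MTBF.
print(expected_hours_between_failures(1_000_000, 8))      # -> 125,000 hours (~14 years)
```

The point is that MTBF describes a population, not the service life of any individual drive you own.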
Hard drives fail more often than people expect. Back up, back up, back up.
Anything with moving parts can fail, including tape drives, floppy drives, fans, and so on. I've had the fans on graphics cards die, causing the death of the graphics cards. I've had a power supply fan die, causing most of the parts of the computer to die. (Since then I've never built a system without extra fans.) Tape drives require extra care, or their lifetimes will be significantly shortened, because in addition to the moving parts, the tape head makes physical contact with the tape media -- at least in many kinds of tape drives. Cleaning the drive too often with ordinary tape-cleaning media will wear away the tape heads.
I've had the built-in chipset fans die, but so far without any effect. So far I've never had a CPU fan die, but I tend to upgrade often enough that I probably avoid this via upgrades. (grin)
I replace my disk drives every several years (mostly because available capacity increases so rapidly), so I have experienced relatively few hard drive failures. I've had many power supplies fail -- many more than I would have naively expected for a component with no moving parts other than the fan. I assume power irregularities are the cause of many power supply failures.
So far, in a few decades of computing, I have never had a CPU, RAM, or motherboard fail without an identifiable cause, such as overheating from a dead fan. However, a few brands of motherboards over the years have had much shorter lifetimes than expected due to sub-par parts -- often badly manufactured capacitors in the circuitry where power enters the motherboard.
Anywhere you have a plugged-in connection is a point of failure. I've had computers fail (mostly long ago) due to cheap tin-plated connectors. The tin oxidized, and over time the connection became less and less reliable. Eventually I unplugged everything, took an eraser to the tin connectors to remove the oxidation, plugged everything back in, and was up and running for a while longer. Gold-plated connectors are the connector of choice for a reason.
From what I've seen in a corporate environment, with my home experience mixed in, components seem to fail in this order, from most to least frequent.
Not mentioned above, but you should expect all flash memory sticks/cards to eventually die, depending on how heavily they are used. Flash memory "wears out" with use: each cell tolerates only a limited number of write/erase cycles, and cells will eventually fail. Given the average use of most such cards, though, it will take a long time.
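As a rough illustration of why it takes so long, here is a back-of-the-envelope endurance estimate. All the numbers are illustrative assumptions (and it idealizes wear leveling), not specs for any particular card:

```python
# Rough flash endurance estimate, assuming perfect wear leveling:
# total writable data = capacity * program/erase cycles per cell.
def flash_lifetime_years(capacity_gb: float,
                         pe_cycles: int,
                         writes_gb_per_day: float) -> float:
    total_writable_gb = capacity_gb * pe_cycles
    return total_writable_gb / writes_gb_per_day / 365

# e.g. a 32 GB card assumed to handle ~3,000 P/E cycles, written 5 GB/day
print(flash_lifetime_years(32, 3000, 5))  # -> roughly 53 years at that light usage
```

Heavy write workloads (logging, swap, databases) shrink that figure dramatically, which is why wear matters for servers even though it rarely matters for a camera card.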
Anecdotally, batteries.
I have no hard data, but I have replaced more failed or under-performing batteries in my life than any other component. This includes uninterruptible power supplies, laptops/notebooks, controller batteries, mobile phone batteries, and probably a lot of others.
This has led me to always stock an extra battery pack for a server room's UPS.
Anything that moves, which in a server is basically hard drives and fans, will fail much more often than solid-state components. Power supplies are a distant, but notable, second. Everything else (CPU, memory, etc.) is pretty reliable -- which is not to say immune to failure, but it should only be worried about after you've got your disk/fan/PSU bases covered.
Best to keep spares of everything on-site, though, unless you're OK with whatever downtime your hardware vendor decides to give you.
Just researching this for my company today, I found a summary of one of Microsoft's whitepapers at ExtremeTech.com with this chart covering an 8-month period:
The rated column was a decent reference for my calculations of the value of Dell's hardware warranties (we're just going to invest in extra hardware instead).
The full whitepaper is here: http://research.microsoft.com/apps/pubs/default.aspx?id=144888
You will see more problems with the firmware and drivers for the hardware than you will actually see physical failures (at least early in the device's lifetime), so make sure those are up to date and tested first.
SATA drives will usually be the first to go; SAS tends to be more reliable (although I've heard good things about the latest SATA 2 drives).
Once upon a time, CPU fans also used to be on the list; lately, I can't remember the last time I saw one stop working, but it's a possibility, especially in a dusty environment.
Google has published a paper, "Failure Trends in a Large Disk Drive Population", on failure statistics for a large population of drives. The main takeaway is that disks fail above and beyond what their MTBF would suggest. Disks are easily the most failure-prone component in the server room.