I'm trying to figure out what to do for a small business that has been plagued by ridiculous hardware problems. Right now, this business runs on five or six desktop machines; no server infrastructure is in place. On top of that, and I'm not embellishing this, they have seen four hardware failures this year to date, and it's got them bordering on madness.
I've already discussed with them the notion of putting a Small Business Server in place (they're a Microsoft shop), and they're receptive to the idea. I also plan on getting my feet wet with System Center Essentials to keep an eye on things. The focus then becomes ensuring that this server remains available.
Also, I've just read through this other high availability thread. Much like the guy in that thread, I'm very new to IT, coming from a programming background instead.
Some ideas come to mind:
- Simple RAID-5 with hot-swap drives (edit: and a hot spare)
- Get two cheaper server machines and configure them to run one virtualized server with live migration (I've done some reading, but sadly I can't tell whether SBS Standard and SCE will support this)
- Failover clustering? I got this term from the other thread but haven't been exposed to it in the past.
Is there a best practice when it comes to this? The business owner is willing to dig into his pockets a little for this because he's becoming terrified of downtime, but I've got no experience with these to lead me in one direction over the other.
I'd appreciate your wisdom!
edit: To provide some additional detail on the problems they've experienced, it's been a weird mix of inexplicable failures.
- The chassis power switch fails to power on the system. The motherboard had an onboard switch, which provided a stop-gap solution; however, swapping out the case didn't fix the problem, and later, swapping out the motherboard didn't fix it either.
- Two identical machines have both suffered drive failures in their RAID-1 arrays, and both machines were assembled no more than 5 months ago.
- Boot failure issues: one system running RAID-1 fails to boot at all. Unfortunately I didn't write down the original error message, but my notes say that a "Failed to save startup options" error in Windows Repair & Recovery led me to this thread, which supported my suspicion that it was a hardware-related issue.
edit: Also, the machines are running in a collection of home offices, so residential-grade electrical is at play. I guess this may be more of a contributing factor than I'd given it credit for. However, the machines are all run on desks (literally desktops!) and not on the floor; I don't believe dustiness is involved.
First of all, SCE is overkill for 5-6 desktop machines. WSUS is probably a better option and is free.
You haven't said much about what exactly has failed. Was it a part in the machine? Is this a dusty environment? My primary support environment is approximately 40 users with approximately 10 servers (not including virtualized ones). We buy Dell machines (OptiPlexes) and we have had maybe 4 hardware failures in the last 5 years on ALL of that stuff. So what you're seeing on the workstations isn't normal.
Do they have a proper server room/location for the server (with cooling and not a lot of dust, at least)?
RAID-5 with hot swap is an inexpensive way to go on this server and provides some protection against hard drive failure. I would also add redundant power supplies (inexpensive) and a UPS.
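For intuition on why RAID-5 tolerates a single drive failure: each stripe stores a parity block that is the XOR of the data blocks, so any one lost block can be rebuilt from the survivors. Here's a minimal sketch of that idea (toy byte strings standing in for disk blocks; real controllers do this per-stripe in hardware):

```python
def parity(blocks):
    """XOR a list of equal-length blocks together byte-by-byte."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            result[i] ^= b
    return bytes(result)

# Three data blocks on three disks, parity written to a fourth.
d0, d1, d2 = b"aaaa", b"bbbb", b"cccc"
p = parity([d0, d1, d2])

# The disk holding d1 fails; rebuild its block from the survivors,
# since d0 ^ d2 ^ (d0 ^ d1 ^ d2) == d1.
rebuilt = parity([d0, d2, p])
assert rebuilt == d1
```

The same property is why a second concurrent failure (or an unreadable sector during a rebuild) loses data, which is where the hot spare and, above all, backups come in.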
Failover clustering? You are starting to enter a realm that is both costly and complex for such a small environment. While uptime is important, in an environment this small it's also important to keep things as simple as possible.
As for the workstations, address the problem (which you haven't been extremely clear about). Perhaps you could purchase an "extra" workstation with your base image on it that just sits there taking all of your updates from WSUS, which you could use as a swap-out machine if one of their workstations dies (which is what we do). We also have a shitload of parts on hand to replace the most common failures (power supplies, RAM, hard drives) until the warranty part arrives.
Backups. No amount of redundancy is a substitute for good backups. You have numerous options here. With such a small environment you could look at one of the many over-the-wire solutions (Mozy and Carbonite come to mind), which take care of offsite and automated backup at the same time for a reasonable cost. You could also put in a tape solution and use a service like Iron Mountain to vault the tapes off-site. Whatever you do, do not take tapes home with you, especially if they have valuable information on them (SSNs, etc.).
From my experience, SBS has its own set of problems, especially if you set it up clustered. The maintenance effort is way too big for such a small shop.
Set up a proper little server: 4 disks, RAID (5, 10, or 6), a PCIe RAID controller, a basic file server, and a UPS (thanks tomtom).
Mail for just a few people is probably best handled by an external provider.
Stay away from SCE and similar overkill, since you would have to set up VPN, Active Directory, and the like. Setting all this up is a major effort, and perhaps not in the best interest of your customer.
By guiding your small customer to a simple, yet efficient and reliable solution, you will make them and yourself happy.
Teach them to look into the event logs, and maybe give them a simple script that checks for disk warnings. Visit them regularly, if they want that, and check the logs for them. Deal with the problems one at a time.
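A "simple script that checks for disk warnings" could be as small as this sketch (the drive list and the 10% free-space threshold are assumptions; adjust both for the actual machines, and note that checking the event log itself would need Windows-specific tooling beyond this example):

```python
import shutil

def check_disks(paths, min_free=0.10):
    """Return a warning string for each path with less than min_free free space."""
    warnings = []
    for path in paths:
        usage = shutil.disk_usage(path)
        free_fraction = usage.free / usage.total
        if free_fraction < min_free:
            warnings.append(f"{path}: only {free_fraction:.0%} free")
    return warnings

# Example: check_disks(["C:\\", "D:\\"]) on the office machines,
# run from Task Scheduler, with the output logged or emailed somewhere visible.
```

Something this small is easy for a non-IT owner to understand and run, which fits the "keep it simple" advice above.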
This is not primarily a hardware issue. Get a UPS - NOW. One that is ONLINE (i.e., double-conversion, so it filters the electricity).
This is either comical - VERY rare - or caused by, for example, fluctuating power or something else the machines didn't handle well. This is NOT normal, and the chance of it "just" happening is EXTREMELY low. Like lottery-winning low. I have seen similar behavior - but it was caused by either CRAP power supplies or unstable power with spikes, partially induced within the building itself (I've seen servers die when you turn on the lights, thanks to a very bad light switch where you could see sparks).
Just some additional insights:
I don't understand what problem the server is supposed to be solving.
If all four machines came from the same vendor, and there's nothing unusual about your location (very high humidity/dust, static electricity, lightning, or very unreliable power) you need a new hardware vendor. Whatever Dell, HP, and IBM did to get on the owner's bad side, the supplier for these machines is worse, at least from a hardware point of view. You'd get better reliability buying the cheapest machines you can find at Wal-Mart.
It may be that it's not wholly the vendor's fault - maybe someone specified particular hardware and/or insisted on some very low-spec gear - but they still should have refused to build machines that badly configured, or else done something heroic to replace the bad machines.
I suggest you buy some middle-of-the-road PCs from Dell/HP/Lenovo (or kick the butt of the current supplier to support what they sold), sign up for some paid Dropbox accounts (or box.net, or NetDocuments) to share files, and have your ISP or Google handle the mail and web serving.
[* Yes, "cloud" services are theoretically less secure than owning your own server - but if this is running in a bunch of home offices, the data is at risk if any of those homes are burglarized, or if someone's family member uses the work machine to run random malicious software from the internet when the employee's not home or on vacation. The biggest danger of downtime will come from consumer-grade net connections, not the cloud provider's downtime.]
It sounds like you need less hardware and simpler hardware if you want reliability, not more complicated and more expensive hardware/software.