I'm looking at putting together a support SLA. As a baseline I'd like to know roughly what sort of percentage availability I should expect from a non-clustered Windows 2003 Server.
Assumptions are that the server is comfortably specced for the application it's running (so it won't be labouring) and that by uptime I mean the server is available. It still needs to undergo reasonable general maintenance (security patching and the like).
What would people expect?
When drafting an SLA, it's more important to agree with the customer on what they expect (and can afford) than on what you're willing to support within the constraints of the equipment and budget you have.
For example: a single non-clustered server is not suitable for a customer who wants 99.999% uptime, 24-hour on-call support, and a 1-hour "return to operations" on a major failure. It's not technically reasonable to accept that (99.999% allows barely five minutes of downtime per year), and the customer needs to understand that.
Yes, Windows 2003 Server is reliable and can perform very nicely. Brand name servers come with proven reliability and rock solid warranties. Regular monitoring and TLC on a server can keep it going for many years.
You need to "hope for the best, but plan for the worst".
You'll also need to accurately calculate your availability statistics and have the calculation agreed with the customer (1 hour downtime at 2am is a different "cost" to 11am on a Tuesday).
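One way to make that agreed calculation concrete is to weight downtime by when it happens. This is just a sketch; the business-hours window and the weights here are assumptions you'd negotiate with the customer, not a standard:

```python
# Hypothetical weighted-downtime cost: an outage during business hours
# "costs" more than one overnight. Window and weights are assumed values.

BUSINESS_HOURS = range(9, 17)   # 09:00-17:00 local time (assumption)
WEIGHT_BUSINESS = 3.0           # an hour at 11am counts as 3 cost units
WEIGHT_OFF_PEAK = 1.0           # an hour at 2am counts as 1 cost unit

def outage_cost(start_hour, duration_hours):
    """Sum the weighted cost of an outage, hour by hour."""
    cost = 0.0
    for h in range(start_hour, start_hour + duration_hours):
        hour_of_day = h % 24
        if hour_of_day in BUSINESS_HOURS:
            cost += WEIGHT_BUSINESS
        else:
            cost += WEIGHT_OFF_PEAK
    return cost

# The same 1-hour outage at 2am vs 11am on a Tuesday:
print(outage_cost(2, 1))    # 1.0
print(outage_cost(11, 1))   # 3.0
```

With something like this written into the SLA, both sides agree up front that a patch-window reboot at 3am is not the same event as an unplanned outage at midday.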
You'll need to incorporate all the additional equipment that is required to keep a server alive (networking, switches, firewalls, operator time, backups).
Finally, you'll need to test your contingency plans, and keep your infrastructure flexible so you can solve the fault in several different ways.
There's not really a standard figure we can quote you. By itself, Server 2003 is a very stable system, but the uptime you can expect depends on a number of variable factors.
In theory the only thing you should need to take the server offline for is applying updates, which should be at most once a week. You can work out the downtime for these by timing how long your server takes to reboot.
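The arithmetic for that theoretical best case is simple. The 10-minute reboot time below is just an assumed figure; measure it on your own hardware:

```python
# Availability from scheduled reboots alone, assuming one patch window
# per week. The reboot time is an assumption: measure yours.

MINUTES_PER_WEEK = 7 * 24 * 60       # 10,080 minutes
reboot_minutes = 10                  # assumed; time your own server
reboots_per_week = 1                 # weekly patch window

downtime = reboot_minutes * reboots_per_week
availability = 100 * (MINUTES_PER_WEEK - downtime) / MINUTES_PER_WEEK
print(round(availability, 3))        # about 99.901
```

So even in the ideal case, a weekly 10-minute reboot already caps you just below 99.9%, before any unplanned outages are counted.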
That's all fine in theory, but we all know servers go offline for other reasons too: hardware failures, network problems, software hangs. These aren't something you can easily predict, but it's advisable to budget time for unpredictable events.
Finally, you're going to want to factor in time for planned upgrades or changes. Is the use of the server likely to increase over time, and will it need upgrades to cope with the change?
With all these things factored in, you'll have your predicted uptime. Your actual uptime may turn out better if you have no faults, no upgrades, etc., but it's better to be cautious.
From my experience with Server 2003 Standard R2, I can tell you it's high if you don't have any hardware or network troubles.
The 2 servers I've got running Server 2003 have never once crashed on their own. One server has a record uptime of 240 days! Do note: that's because updates were never installed on the system.
It would take a lot to crash 2003 from normal operation.
You can plan an SLA for updates, i.e. down once per week for up to 2 hours, assuming everything goes fine. But unless you start doing cluster stuff with failovers, there's really no way you can write an SLA for everything else. What happens if you do the updates, reboot the server, and it doesn't come up? Or it gets a virus, or the drive controller dies? The issues could be endless.
You'd be better off specifying one SLA for applying updates and another for responding to issues that come up.
Word the SLA as "I will respond to an outage within 1 hour", but note that the time to resolve or work around that outage will vary, as it cannot be anticipated.
Windows, any version, benefits from regular reboots. The operating system itself has memory leaks, without even taking applications and services into account. Updates require reboots as well. You can easily combine the two operations and have a downtime each week of just the time it takes to reboot.
After trying a whole bunch of ways to apply updates and do regular reboots, I've learned the best way is to script the updates but not allow the updating process to reboot the machine. I've experienced multiple cases of servers either not shutting down properly or not coming back up properly when the reboot is triggered by a scripted update. Have the reboot performed separately. I schedule updates to start installing at 11PM on Saturday night, with reboots staggered across the servers between 3AM and 4AM Sunday morning.
The monitoring system suppresses alerts during that period to prevent unnecessary noise. Additionally, the servers send me an email after rebooting. When I wake up on Sunday morning I check my emails. If there are any alerts, or I don't have an email from each rebooted server, I know I have a problem. Hasn't happened yet, though.
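The Sunday-morning check above amounts to a set difference between the servers scheduled to reboot and the ones that reported in. A minimal sketch, with hypothetical server names and the email-parsing step left out (in practice you'd build the second set from the senders in your mailbox):

```python
# Hypothetical check: which scheduled servers did NOT send a
# "rebooted OK" email? Server names here are made up for illustration.

scheduled = {"web01", "web02", "sql01"}   # rebooted 3-4AM Sunday
reported = {"web01", "sql01"}             # senders of the OK emails

missing = scheduled - reported            # set difference
if missing:
    print("Investigate:", ", ".join(sorted(missing)))   # Investigate: web02
else:
    print("All servers reported in")
```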
In a 30-day month there are 43,200 minutes. 99.75% uptime is 43,092 minutes, which gives you 108 minutes of downtime to perform any scheduled maintenance. That should be more than enough, although I think it's fine to write into the SLA that major maintenance (including but not limited to upgrades) planned in advance is excluded from the SLA.
The harder part is an emergency: how long will it take you to get to the server, identify the problem, and fix it? In that case four hours might not be enough, and a single four-hour outage alone drops you to 99.44% for the month.
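The arithmetic above generalises to any SLA percentage. A small sketch (30-day month assumed, as in the figures above):

```python
# Convert an SLA percentage into a monthly downtime budget,
# assuming a 30-day month as in the figures above.

MINUTES_PER_MONTH = 30 * 24 * 60   # 43,200 minutes

def downtime_budget(sla_percent):
    """Minutes of allowed downtime per 30-day month."""
    return MINUTES_PER_MONTH * (100 - sla_percent) / 100

print(downtime_budget(99.75))              # 108.0 minutes
print(downtime_budget(99.5))               # 216.0 minutes
print(round(downtime_budget(99.999), 1))   # 0.4 minutes: hopeless on one box
```

Running it for a few values makes the point of the whole thread: 99.5% buys you a realistic 3.6 hours a month, while five nines allows less than a minute.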
I've been looking after colocated Windows servers since 2000, and in all that time I can recall 4 outages caused by the firewall failing (separate hardware, twice catastrophically), 1 DoS attack on the network (not against our servers, but we were impacted), and a couple of significant scheduled maintenance windows required by the data center. The Windows servers themselves... other than applying patches or service packs, I can't think of any. (Quickly touches wood.)
What would I expect? A minimum of 99.5% (which sounds low), but that at least gives you a chance. Most months you'll be near 100.00%. Don't go higher than 99.5 if the client isn't paying for it...