I work for a small development house who are increasingly being asked to put together formal SLAs for our products based on particular configurations.
From a development side of things I'm comfortable with this, however there's no point in my saying that we'll meet particular targets from a software perspective if they're not realistic from a hardware/platform perspective - the clients only care about overall system availability.
What should I be looking at from a platform perspective? What sort of metrics and levels?
Also, what are the gotchas (for instance from a software perspective I'd never commit to a fix time - I have no idea whether I'm going to have to rewrite the whole product to correct something so saying that we can fix it in 5 days is potentially impossible - what should I avoid committing to from a hardware/OS/platform point of view)?
I have extensive experience in this space; I do a lot of work for a couple fortune-5 companies who operate their data centers like an ISP would to the various company departments needing hosting & support services.
They typically have two metrics called an SLA (Service Level Agreement) and an OLA (Operational Level Agreement).
SLAs are met through the type of hardware in use. When talking about SLAs we use levels to describe them. SLA-1 being zero down time, SLA-2 is something like up to 1 hour of downtime, SLA-3 is 8 hours, etc... SLAs are met through the use of redundant equipment. At one company we use a lot of Cisco to create high availability (Cisco CSMs and GSS gear). When talking about SLA levels we generally talk about HA (High Availability) and DR (Disaster Recovery). In situations where a company has multiple data centers, the HA component is usually a per data center attribute while the DR is an across data center attribute; both measured in terms of RPO (Recovery Point Objective) and RTO (Recovery Time Objective) to meant the SLA level.
OLAs are, in real basic terms, how quickly someone (a human) responds to an event requiring manual intervention/corrective action. OLAs are typically measured in terms of response times too; they use the same RTO/RPO objectives. One company I consult for uses 6 levels for their OLA metrics. The first 3 levels here are an example of this:
OLA-1: RTO 0 < 2 Hours OLA-2: RTO >= 2 & <= 4 hours OLA-3: RTO >= 24 hours & <= 30 days if not a data center failure, if dc failure > 30 days.
The things that drive OLA and SLA metrics is something called a CIA rating. CIA = Confidentiality, Integrity, and Availability. Data for an application should be classified by the business unit paying for said application. The CIA will help drive what the OLA and SLA should be. Each part of the CIA level is given a number from 1 to 3. So, for example, a CIA rating of 1-1-1 would be Highly Confidential, Highest Integrity level, and Highest Availability level. A CIA rating of 3-3-3 is the lowest you can go. Thus, a CIA rating of 3-3-3 typically maps to an SLA & OLA level of 6 where an SLA-6 & OLA-6 is the lowest (longest response time) guaranteed.
How you derive a CIA rating usually amounts to figuring out how much money a business will loose if the data is stolen (Confidentiality), compromised (Integrity), or when systems are down (Availability). So a company that stands to loose $10M if confidential data is stolen may have a C rating of 1 or if that lost of data is not critical and would only cost the company, say, $1,000 then you may have a C rating of 3 instead.
This is typically how big companies that I've consulted for handle such things.
I'd be slow to commit to a fix time on hardware issues, the same as on software. You never know when you'll be waiting for a vendor to fix a critical bug in something. In terms of SLA levels, I've found that they tend to be of the form that "someone will be working on your problem within X hours". X if of course dependent on how much they pay, but somewhere between 1 and 8 hours would seem normal, in my experience.
If you're being asked to provide an SLA for restoration of hardware issues where your software happens to be installed, the answer is "no". You could commit to a response time, but without controlling the whole hardware/os/software stack you cannot commit to a resolution time.
Maybe your customer is telling you in an awkward way that they really need a hosted offering for your product? That way they can avoid whatever internal problems they are worried about and just cut you a check.
One thing to consider when contracting an SLA is that SLA by itself means absolutely nothing, has to be observed together with the penalties in case SLA is not fulfiled.
For example, our ISP gives us 100% SLA on the network, but the maximum amount we can get back is our monthly bill which is really low as nowadays the bandwidth is cheap and nowhere near the amount of money we lose when the network is down.
Also, what is usually written in the contracts is how quickly someone will respond to the problem, never how long it will actually take to fix it. So if they make you commit to short response times, just put an intern in the night shift to shuffle tickets for you until you wake up and there ya go.
In my experience all this SLA business practically means very, very little, if anything.