I work for a Fortune 500 company that struggles with accurately measuring performance and availability for high-availability applications (i.e., apps that are expected to be up 99.5% of the time with 5-second page-to-page navigation). We factor in both scheduled and unscheduled downtime to determine this availability number. However, we recently added a CDN into the mix, which complicates our metrics a bit. The CDN now handles about 75% of our traffic, while the remainder goes to our own servers.
We attempt to measure what we call the "true user experience" (i.e., our testing scripts emulate a typical user clicking through the application). These monitoring scripts sit outside our network, which means we're hitting the CDN about 75% of the time.
Management has decided that we take the worst-case scenario to measure availability. So if our origin servers are having problems but the CDN is serving content just fine, we still take a hit on availability. The same is true the other way around. My thought is that as long as the "user experience" is successful, we should not unnecessarily punish ourselves. After all, a CDN is there to improve performance and availability!
I'm just wondering if anyone has any knowledge of how other Fortune 500 companies calculate their availability numbers? Look at apple.com, for instance: a storefront that uses a CDN and never seems to be down (unless there is about to be a major product announcement). It would be great to have some hard, factual data, because I don't believe we need to unnecessarily hurt ourselves on these metrics. We are making business decisions based on these numbers.
I can say, however, that because these metrics are visible to management, issues get addressed and resolved pretty fast (read: we cut through the red tape pretty quickly). Unfortunately, as a developer, I don't want management to think the application is up or down because some external factor (i.e., the CDN) is influencing the numbers.
Thoughts?
(I mistakenly posted this question on StackOverflow, sorry in advance for the cross-post)
In the abstract, I would say you should sharply define what constitutes "available" vs. "unavailable" and measure yourself against it. For example, you could have a client-side performance SLA for the site of 1 second to the "fold" and 3 seconds for a completely rendered page. When you don't meet the performance SLA, you should count that as an availability failure for that time period. It shouldn't matter whether you're hitting the CDN or not; the user experience is what matters.
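Purely to illustrate that idea, here is a minimal sketch that counts any sample missing the performance SLA as an availability failure. The thresholds match the SLA above, but the sample data and pass/fail logic are made up and not how any particular monitoring vendor reports:

```python
# Sketch: treat any sample that misses the performance SLA as an availability
# failure for that interval. Sample data is illustrative only.

SLA_FOLD_SECONDS = 1.0   # time to content "above the fold"
SLA_FULL_SECONDS = 3.0   # time to a completely rendered page

# Each sample: (seconds_to_fold, seconds_to_full_render, hard_error)
samples = [
    (0.8, 2.5, False),
    (1.4, 2.9, False),   # misses the fold SLA -> counts as unavailable
    (0.7, 3.6, False),   # misses the full-render SLA -> counts as unavailable
    (None, None, True),  # request failed outright
    (0.9, 2.2, False),
]

def meets_sla(fold, full, hard_error):
    """A sample is 'available' only if it succeeded and met both thresholds."""
    return (not hard_error
            and fold is not None and fold <= SLA_FOLD_SECONDS
            and full is not None and full <= SLA_FULL_SECONDS)

available = sum(1 for f, r, e in samples if meets_sla(f, r, e))
print(f"Availability: {available / len(samples):.1%}")  # 40.0% for this toy data
```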
However, since you're only taking measurements every 5 minutes, it seems reasonable to measure hits to the CDN and the master site separately, and calculate that 75% of availability comes from the CDN and 25% from the master. The difficulty here is that 75% is just an average. To apportion blame accurately for a given time period, you need to know when one site or the other is not actually customer-facing, e.g., during a planned change or after manual action when a problem is detected. You also need to factor in what happens when either the master site or the CDN is down. Does the customer get an HTTP 500, or do they transparently fail over to the working site? A lot depends on your load-balancing solution. The "worst-case" metric you described seems too simplistic. Ask yourself, "What are our customers experiencing?"
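A minimal sketch of that apportionment, assuming the 75/25 traffic split and some made-up per-interval states, compared against the worst-case approach:

```python
# Sketch: apportion each interval's availability by traffic share rather than
# taking the worst case. Traffic split and interval states are illustrative.

CDN_SHARE, ORIGIN_SHARE = 0.75, 0.25

# One entry per 5-minute interval: (cdn_up, origin_up)
intervals = [
    (True, True),
    (True, False),   # origin down: worst-case says 0%, weighted says 75%
    (False, True),   # CDN down: weighted says 25%
    (True, True),
]

def interval_availability(cdn_up, origin_up):
    return CDN_SHARE * cdn_up + ORIGIN_SHARE * origin_up

weighted = sum(interval_availability(c, o) for c, o in intervals) / len(intervals)
worst_case = sum(1 for c, o in intervals if c and o) / len(intervals)

print(f"Weighted availability:   {weighted:.1%}")    # 75.0%
print(f"Worst-case availability: {worst_case:.1%}")  # 50.0%
```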
As far as whether you should take "blame" when the CDN is down: absolutely. If 75% of your hits are going to the CDN, then 75% of your customer experience is dependent on them. You're responsible for providing a good experience to your customers, so if the CDN is having issues, you need to use your engineering resources to prove it and follow up with the provider.
One other thing to think about is what happens when the master site is unavailable for an extended period of time. As you've described it, it sounds like the CDN is a static copy of the content on the master site. If the master site is down for a long time, the CDN could start to get stale. So maybe part of your SLA should be freshness: 1 second to the "fold" and 3 seconds for a completely rendered page, with content no more than 15 minutes old.
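If you did add a freshness clause, the probe for it could be as simple as the sketch below, assuming the CDN passes through a usable Last-Modified header (or you embed a generation timestamp in the page); the URL and threshold are placeholders:

```python
# Sketch of a freshness probe, assuming the CDN passes through a usable
# Last-Modified header. The URL and threshold are illustrative.
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

import requests  # third-party; pip install requests

MAX_AGE = timedelta(minutes=15)
URL = "https://www.example.com/"   # hypothetical CDN-fronted page

resp = requests.get(URL, timeout=10)
resp.raise_for_status()

last_modified = resp.headers.get("Last-Modified")
if last_modified is None:
    print("No Last-Modified header; embed a timestamp in the page instead.")
else:
    age = datetime.now(timezone.utc) - parsedate_to_datetime(last_modified)
    print(f"Content age: {age}, within SLA: {age <= MAX_AGE}")
```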
I agree with user44700: it is better to separate the availability testing for your servers from the CDN and track the two independently. Your true availability will be Server Avail * CDN Avail, since if either goes down, you consider the page/site down. This will also cost you less with any of the monitoring vendors.
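A quick worked example of that formula, with made-up monthly numbers:

```python
# Worked example of the Server Avail * CDN Avail formula (numbers are made up).
server_avail = 0.998   # origin, measured independently
cdn_avail = 0.9995     # CDN, measured independently

true_avail = server_avail * cdn_avail
print(f"True availability: {true_avail:.4%}")  # 99.7501%
```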
I would not go the route of creating one browser test and looking at which items failed. While it could work, and some companies like Catchpoint have the concept of "content availability," it might not be exactly what you want for this case. Say, for example, your webpage makes a call to the CDN for a file that returns a 404. Most monitoring solutions will tell you this is a failure, but was it really the CDN that failed? Was that file even important? Maybe someone just forgot to remove a relic reference that no user notices.
You can read this blog post for some more ideas: http://blog.catchpoint.com/2010/07/21/true-availability-of-a-webpage/
The SLA reporting should accurately reflect reality. If you are measuring availability from a user perspective and only the server doing the measuring is experiencing issues, reporting that issue within your SLA would not reflect the user experience.
I can understand wanting to hold the source information to a high standard, perhaps always reporting it even when it is inaccurate, but with a note identifying why.
If you cannot come to agreement, perhaps there is a technical solution to make the measuring server less fallible.
If the information is reported as an outage and it was not, what value does the reporting provide?
In my environment, we report from multiple sources: an external monitoring methodology to report availability from an external perspective, as well as our internal outage-recording system, which is human-entered and considers multiple factors to most accurately reflect the situation.
Gomez and Keynote are enterprise-accepted solutions for gathering the types of metrics you mentioned. Gomez also has a service that monitors your end-user experience by sourcing a Google Analytics-esque JavaScript file.
Pingdom is good: http://www.pingdom.com/
We're a Fortune 500 with a CDN-enabled site, and we use several things. You have correctly determined that you need to measure different things if you want to detect different things. It's not clear to me what you specifically want - availability numbers to help you determine when an app is actually down, or numbers that get management off your back. Anyway...
To get the "CDN out of it" you could take another Keynote/Gomez monitor and point it at your apps not through the CDN using an alternate DNS name or whatnot. But since it still has static assets, it's more useful for performance than availability. And it keeps internet outages, agent outages, etc. in the loop, which is appropriate for some purposes and not others.
Real user monitoring. There's network-based (Coradiant, Tealeaf) and tag-based (Jiffy, Gomez). We use Coradiant as a network sniffer: it determines the real user-seen performance of assets hosted here at our data center, in other words the actual applications and not all the static junk on the CDN. We then wrote reports to determine app error rates and performance, and used Apdex (apdex.org) as a derived metric. In some cases you can't use the network-based approach (too much traffic, or your assets aren't hosted where you can get at the network), and tag-based isn't that reliable. Real user monitoring has the immense benefit of actually seeing end-user response times and errors; it's easy to set up a synthetic monitor that doesn't error in all the cases a real user does.
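For reference, Apdex is just (satisfied + tolerating/2) / total, where satisfied means at or under the target time T and tolerating means under 4T. A tiny sketch with an assumed target:

```python
# Minimal Apdex sketch: score = (satisfied + tolerating/2) / total,
# where satisfied <= T and tolerating <= 4T. T here is an assumed target.

T = 2.0  # target response time in seconds (assumed)

response_times = [0.9, 1.5, 2.4, 3.1, 6.0, 9.5, 1.2]  # illustrative samples

satisfied = sum(1 for t in response_times if t <= T)
tolerating = sum(1 for t in response_times if T < t <= 4 * T)
total = len(response_times)

apdex = (satisfied + tolerating / 2) / total
print(f"Apdex({T:.0f}s) = {apdex:.2f}")  # 0.64 for this toy data
```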
Local synthetic monitoring. Nagios/Zabbix/SiteScope/a hundred others. Point a monitor at your app locally (don't go through the CDN). For actionable (as in, send a page to wake someone up) availability monitoring, this is the gold standard. It doesn't take network issues into account.
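For illustration, here's a bare-bones check in the usual Nagios plugin style (exit 0 for OK, 2 for CRITICAL), pointed at a hypothetical internal origin URL rather than the CDN hostname:

```python
#!/usr/bin/env python3
# Bare-bones local synthetic check in the style of a Nagios plugin:
# hit the origin directly (not the CDN hostname) and exit 0/2 for OK/CRITICAL.
# The internal URL and expected string are hypothetical.
import sys

import requests  # third-party; pip install requests

ORIGIN_URL = "https://origin.internal.example.com/storefront/health"
EXPECTED = "OK"          # string the health page is expected to contain
TIMEOUT_SECONDS = 5

try:
    resp = requests.get(ORIGIN_URL, timeout=TIMEOUT_SECONDS)
    if resp.status_code == 200 and EXPECTED in resp.text:
        print(f"OK - origin responded in {resp.elapsed.total_seconds():.2f}s")
        sys.exit(0)
    print(f"CRITICAL - status {resp.status_code} or missing '{EXPECTED}'")
    sys.exit(2)
except requests.RequestException as exc:
    print(f"CRITICAL - request failed: {exc}")
    sys.exit(2)
```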
Log monitoring. In a sense, this is poor man's real user monitoring. But if you really just want to see what errored and when, it's pretty handy. It has the "no, really, that's what happened" benefit of real user monitoring. It's often availability only, unless you're logging time-taken on the web tier, in which case it shows you how long your server side took: not helpful for a user-facing SLA, but very helpful for "what code do we need to work on." Use Splunk.
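A rough sketch of that kind of log crunching, assuming a W3C/IIS-style log where the status code and time-taken (in milliseconds) are the last two fields; adjust the parsing to whatever your tier actually logs:

```python
# Rough sketch: error rate and server-side latency from access logs.
# Assumes the status code and time-taken (ms) are the last two fields per line;
# adjust to your actual log format.

def summarize(log_lines):
    statuses, times_ms = [], []
    for line in log_lines:
        if line.startswith("#"):           # skip W3C header lines
            continue
        fields = line.split()
        try:
            statuses.append(int(fields[-2]))
            times_ms.append(int(fields[-1]))
        except (IndexError, ValueError):
            continue                        # malformed line
    errors = sum(1 for s in statuses if s >= 500)
    p95 = sorted(times_ms)[int(0.95 * (len(times_ms) - 1))] if times_ms else None
    return {
        "requests": len(statuses),
        "error_rate": errors / len(statuses) if statuses else 0.0,
        "p95_time_ms": p95,
    }

# Illustrative lines: "... <status> <time-taken-ms>"
sample = [
    "2010-08-01 00:00:01 GET /cart 200 180",
    "2010-08-01 00:00:02 GET /cart 500 950",
    "2010-08-01 00:00:03 GET /checkout 200 320",
]
print(summarize(sample))
```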
It's not an either/or; we use all of these, because you want both the "end user story" and "which programmer do we need to lean on."
BrowserMob is great