What is the most significant server outage/downtime that occurred within the last decade due to performance issues, bottlenecks and scalability problems?
Two such examples are the constant problems Twitter has had ever since it became popular and the Google downtime early in 2009.
What other such incidents are you aware of that you believe caused the most havoc and affected the largest number of users? What is there to be learned from such incidents? How have those companies publicly responded to their downtime?
Northeast Blackout of 2003
The Northeast Blackout of 2003 was a massive, widespread power outage that occurred throughout parts of the Northeastern and Midwestern United States and Ontario, Canada, on Thursday, August 14, 2003, at approximately 4:15 p.m. EDT (UTC−4). At the time, it was the second most widespread electrical blackout in history, after the 1999 Southern Brazil blackout.[1][2] The blackout affected an estimated 10 million people in Ontario and 45 million people in eight U.S. states.
My money is on Amazon, June 6th, 2008.
At approximately 10:25am PST the Amazon retail site became unreachable. All other Amazon servers and services functioned properly. Furthermore, HTTPS access to the site was still available.
The site was down for ~2 hours.
Estimates are that Amazon lost potential income of $31,000/minute and a lot of credibility (Amazon's stock went down 2.7% that day).
The root cause is assumed to have been a faulty definition in the load-balancing layer, but no one from Amazon will confirm or deny it.
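To put those two figures together, here is a rough back-of-the-envelope sketch. It only multiplies the ~2 hours of downtime by the estimated $31,000/minute quoted above; the function name `downtime_cost` is my own, and the result is an estimate built on estimates, not a number Amazon has published.

```python
def downtime_cost(minutes: float, revenue_per_minute: float) -> float:
    """Estimated revenue lost during an outage of the given length."""
    return minutes * revenue_per_minute


# Figures quoted above: roughly 2 hours down at an estimated $31,000 per minute.
estimated_loss = downtime_cost(minutes=2 * 60, revenue_per_minute=31_000)
print(f"Estimated lost revenue: ${estimated_loss:,.0f}")
```

That works out to roughly $3.7 million for the two hours, before counting any longer-term loss of credibility.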
There was a 3-hour outage of the Amazon S3 and EC2 services in 2008 that affected thousands of websites, including Twitter (storage) and 37signals, for example. According to Amazon this was due to scalability problems (ref link).
The Akamai outage that affected Microsoft, Google, Yahoo, Apple and the antivirus update services from Symantec and Trend Micro has to be a significant outage.
Akamai later reported that the outage was the result of a DoS attack from a botnet of zombified home PCs.
How about the T-Mobile Sidekick data loss a few weeks ago?
I'd say when McColo was shut down in November of last year, which dramatically reduced the amount of spam being sent out, by 50-75% according to some reports.
What about when a2b2.com, fsck, cheapvps, vaserv, etc. all went down for days and days and days and days a few months ago?
This is going back, but the MS outage in 2001 was pretty glamorous. MS had set up their DNS servers on one subnet and when a router took a dive so did, well, pretty much all of their stuff...
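The lesson there (don't put every name server behind the same router) is easy to check for. A minimal sketch, assuming you already have the list of name server IPs: the addresses below are made up from the RFC 5737 documentation ranges, the /24 prefix is an arbitrary choice, and the function names are mine. It only uses Python's standard `ipaddress` module.

```python
import ipaddress
from collections import defaultdict


def group_by_subnet(addresses, prefix_len=24):
    """Group IPv4 addresses by the subnet (of the given prefix length) that contains them."""
    groups = defaultdict(list)
    for addr in addresses:
        # strict=False lets us pass a host address and get its enclosing network.
        network = ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
        groups[str(network)].append(addr)
    return dict(groups)


def single_subnet(addresses, prefix_len=24):
    """True if every address falls into one and the same subnet."""
    return len(group_by_subnet(addresses, prefix_len)) == 1


# Hypothetical name server addresses, not anyone's real ones.
nameservers = ["192.0.2.10", "192.0.2.11", "192.0.2.12", "192.0.2.13"]

if single_subnet(nameservers):
    print("Warning: all name servers share one subnet (single point of failure)")
else:
    print("Name servers are spread across:", sorted(group_by_subnet(nameservers)))
```

Sharing a subnet is only a proxy for sharing a router or an upstream link, so treat a clean result as necessary rather than sufficient.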
London Stock Exchange! http://www.theregister.co.uk/2009/11/26/lse_crash_again/
Thanks to Microsoft.
Anything that makes the Risks List along with lots of comments and discussion.