Background
One of my clients is an IT-dependent, workflow-driven law firm with about 50 seats. They have been audited by one of their clients (an FSA-regulated mortgage lender) and told that their single-site setup is a threat to business continuity. I have proposed that we partition their business into two parts:
Client-side - PCs, monitors, chairs, desks, LAN switches and router and firewall
Server-side - The Virtual Machines running Active Directory, Exchange, SQL, SharePoint and other line-of-business apps, "robot" worker machines and Remote Desktop Services (around 14 VMs in total)
The idea is that we can store equipment and put arrangements in place to quickly reproduce at least a reduced-capacity client-side environment at an alternative location, or even have users connect from their homes if required.
The Server-side represents a greater challenge, as it includes services published from their (currently ADSL, soon to be 100 Mbps fibre) IP connection and about 3 TB of data, not including backups. I have proposed that we move the entire server-side environment out of their current self-hosted on-site server rooms and into a hosted facility. I still want the same level of privacy: this has to be firewalled off from the internet except for the small number of published services, and those would be best served from a web-server VM in a DMZ.
Currently there are two server rooms, each containing one node of a replicated SAN and one Hyper-V cluster host. Coupled with redundant fibre-channel and Ethernet links this means the system will keep running even if a whole server room is lost. I want the hosted server-side environment to be similarly resilient to single data-centre loss.
Basically, I want the security, availability and control I get from local self-hosting, but in the cloud, with geographical diversity of at least 30 km. I also don't really want to be buying kit, racking it myself, and worrying about hardware lifespan, replacement, backups etc.
Questions
Is the replicating SAN and Hyper-V Cluster something I should try to replicate in the data-centre, or do large hosters & cloud providers have other ways of ensuring availability?
It looks like Amazon AWS has all the bits necessary (EC2, EBS, S3, VPC, VPN etc), but only one EU data centre. What kind of availability can I expect? For example, if they have a major outage in their Ireland data centre (imagine an aircraft landing on it), what will happen to the services hosted there? And what about general reliability issues?
Can this be done at all using Windows Azure, Rackspace Cloud, or any other cloud service provider?
Thanks for considering my question.
I suggest keeping your primary operations inside and duplicating your servers and data outside as a backup.
EC2 is pretty awesome for this. Build machine images of each of the servers that you need and keep your data separate from them. Whenever you patch software on an internal machine, schedule to make the corresponding patch on your EC2 box. This will keep your costs low for the backup resource because you won't need the machine running most of the time, so you'll just be paying for storage, not the machine cost.
Push your data across the network as well. Your initial move will take you more than 3 days, but the incrementals should be a lot smoother.
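As a rough check on that estimate, here is a back-of-the-envelope sketch in Python. It uses the 3 TB and 100 Mbps figures from the question and assumes decimal units and that the full link is available for upload. The function name and the 70% efficiency factor are illustrative assumptions, not measurements.

```python
def transfer_days(data_tb: float, link_mbps: float, efficiency: float = 1.0) -> float:
    """Days needed to push data_tb terabytes over a link_mbps link.

    efficiency is the (assumed) fraction of the line rate that is
    actually usable after protocol overhead and other traffic.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    bits = data_tb * 1e12 * 8                  # decimal terabytes -> bits
    usable_bps = link_mbps * 1e6 * efficiency  # megabits/s -> usable bits/s
    return bits / usable_bps / 86_400          # seconds -> days


ideal = transfer_days(3, 100)
realistic = transfer_days(3, 100, efficiency=0.7)
print(f"ideal: {ideal:.1f} days, at 70% efficiency: {realistic:.1f} days")
```

Even at full line rate the initial copy takes just under three days, so any real-world overhead pushes it past that. On the current ADSL line, where upload is far slower, the same calculation gives a much larger number.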
By keeping EC2 as your backup, you'll avoid / minimize hardware costs, avoid reliance on a remote site and internet connection for day-to-day business, and provide yourself with the ability to quickly spin up services in an outage.
Direct Q&A
They have their methods of ensuring reliability. You can pay for service with a higher availability SLA. Always have backups anyway.
If it goes boom, it goes boom. If you're relying just on them, replicate to other data centers. Personally, given my suggestion to use them only as your backup, I wouldn't worry about it too much. If the EU goes boom enough that your company and EC2 EU both go offline, then life happens. For a 50-person firm, I wouldn't spread that kind of risk across more than 2 distant sites (your office and one EC2 data center).
Probably, but I've only kept familiar with Amazon services.
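To put the availability SLA from the first answer above into perspective, this small Python sketch converts an availability percentage into allowed downtime per year and shows how two sites combine. The 99.95% figure is an illustrative assumption, not a quoted SLA from any provider, and the combination assumes the sites fail independently, which real outages often don't.

```python
HOURS_PER_YEAR = 365 * 24


def downtime_hours_per_year(availability: float) -> float:
    """Hours per year a service may be down at the given availability (0..1)."""
    return (1 - availability) * HOURS_PER_YEAR


def combined_availability(*sites: float) -> float:
    """Availability when the service is up as long as any one site is up.

    Assumes site failures are independent (an idealisation).
    """
    all_down = 1.0
    for a in sites:
        all_down *= 1 - a  # probability that every site is down at once
    return 1 - all_down


single = 0.9995  # hypothetical SLA figure
pair = combined_availability(single, single)
print(f"one site:  {downtime_hours_per_year(single):.2f} hours/year")
print(f"two sites: {downtime_hours_per_year(pair) * 3600:.1f} seconds/year")
```

This is why going from one site to two distant sites matters far more than adding a third: the remaining risk after two is already dominated by things the arithmetic doesn't capture, such as failover time and shared dependencies.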
Going to the cloud is not just a matter of taking all the servers and moving those instances somewhere else. Your infrastructure must be built to work in the cloud. Otherwise you won't see resiliency anywhere near the levels you had in your own server room. Those are completely different environments.
Read about Chaos Monkey from Netflix and also from Coding Horror.
If you have a DR site with a replica of your infrastructure and an acceptable RPO/RTO, in some cases the DR provider, given its operational level and service coverage, may be better suited to host production as well, so that you use its datacenter and infrastructure entirely for both the prod and DR scenarios.
Scaling the Active Directory site: can do.
Thin clients, hosted model with Citrix server is a recommended best practice.
Use an MPLS connection to the provider and multiple zones plus a DMZ, while satisfying the requirements for privacy, security and audits. Validate that the provider offers Safe Harbor, SAS 70 (now SSAE 16) and PCI compliance.
Can do, depending on factors such as the database architecture, the licensed edition (Standard/Enterprise), the required RPO/RTO, and more insight into the data flow.
For security: advanced change control, log management, intrusion detection... Typical response times between datacenters within six time zones of each other should be under 70 ms.
Questions
I don't recommend block-level replication between facilities; it can get expensive, and there are many other options in the application/database software stack to handle this.
If you're at one host and it's down, then your apps are down. There are other companies that can offer this as well, and some work well with Amazon; see Datapipe.
Stratosphere is an interesting approach you might want to look into. Ping me if you'd like to discuss.
I'm sorry this is a little late to the party, but Jeff Ferland is right about the boom.
Your question about a plane landing on the Ireland data centre could just as well be translated to what happens if a plane lands on your client's office or server room. Both are catastrophic situations that are out of anyone's control and will result in data loss for your client.
If you are worried about that sort of thing happening to your client, you should already be taking measures to get a copy of your client's on-premises servers offsite.
If you're concerned about protecting your business, your client agreement should have a clause protecting you from being held responsible for events outside your control, and possibly some things that are in your control.