After a recent power outage, we've been re-evaluating how we can most effectively provide, maintain, and support our IT resources.
One idea we've been considering is to move our non-critical infrastructure into the cloud. For example, we could maintain only the following services on-campus:
- DNS (perhaps just a forwarder)
- DHCP (with failover)
- Directory and Kerberos (for network authentication)
...and that's all. With a plan like this, even in the case of a disaster, we'd simply focus on keeping our Internet connection up, and all of our other services would still be available.
I was looking into Amazon EC2, but I'm not committed to anything yet. How have other colleges or businesses gone about doing something like this? Is there a "catch" or hurdle that we should know about? Are there any blogs/forums that might detail an enterprise-level cloud migration?
Potential catches/hurdles:
Bandwidth: When you talk about moving bandwidth-intensive services like database and file servers offsite, you need to take a long look at how fat a pipe you would need to maintain your existing level of performance. Depending on the type of institution that you represent, you may or may not already have gigabit or 10-gigabit fiber to your campus. If you need guaranteed bandwidth and can afford the investment, a private connection via AWS Direct Connect may be something to consider.
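To put a number on it, here's the kind of back-of-envelope arithmetic I'd start with, sketched in Python. Every figure below is a made-up assumption; plug in your own peak usage numbers:

```
# Back-of-envelope estimate of the WAN pipe you'd need if file traffic
# moved off-campus. Every input here is a hypothetical assumption.

concurrent_users = 300        # users hitting the file server at peak
opens_per_user_min = 2        # file opens per user per minute at peak
avg_file_mb = 2.5             # average document size in megabytes

# Peak throughput in megabits per second (MB -> Mb is x8; per minute -> per second is /60)
peak_mbps = concurrent_users * opens_per_user_min * avg_file_mb * 8 / 60
print(f"Peak file traffic: ~{peak_mbps:.0f} Mbps")  # ~200 Mbps with these numbers
```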
Latency: Your users may have database-driven applications that do clever things like executing multiple queries on the fly to populate drop-down lists (worse still if they're running against .mdb files instead of a SQL server). Or, they may have a Windows file share containing 10,000 strategically named Word documents, and it might suddenly take minutes instead of seconds to browse. These issues can be mitigated only in part by ensuring that you have a clean, low-latency/low-jitter connection to your hosted environment. However, the ultimate solution is to use applications that are properly designed for the cloud (e.g. Google Apps instead of MS Office, or a web-based database frontend instead of an Access database).
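Here's a toy sketch of why a chatty application hurts over a WAN even when bandwidth is plentiful. The query counts and round-trip times are illustrative assumptions, not measurements:

```
# Why chatty applications suffer on a WAN: per-query round-trip time
# dominates. These figures are illustrative, not measurements.

queries_per_screen = 40   # e.g. one query per drop-down list
lan_rtt_ms = 0.5          # typical campus LAN round trip
wan_rtt_ms = 40.0         # plausible round trip to a hosted environment

for name, rtt in [("LAN", lan_rtt_ms), ("WAN", wan_rtt_ms)]:
    seconds = queries_per_screen * rtt / 1000
    print(f"{name}: {seconds:.2f} s in round trips per screen load")
# LAN: 0.02 s vs. WAN: 1.60 s -- an 80x slowdown before any data even moves
```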
Cost: You will need to assess carefully whether the cost of leasing virtual servers is comparable to the cost of maintaining your on-site infrastructure. There are numerous variables, from capital expenses to electricity to maintenance costs, and all would need to be considered in an apples-to-apples comparison. Don't forget that EC2 bills for outbound bandwidth, and that services that traditionally live "on-premise" may consume an awful lot of it.
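A crude comparison might look like the sketch below. Every figure is a placeholder, not real pricing; substitute your own capital, power, and staffing numbers and the provider's current price list:

```
# Apples-to-apples monthly cost sketch. Every figure is a placeholder;
# substitute your own numbers and current provider pricing.

# On-site: hardware amortized over its useful life, plus running costs
server_capex = 6000.0          # purchase price of the box
amortize_months = 48           # useful life
power_cooling = 80.0           # per month
admin_overhead = 150.0         # per month, your share of care and feeding
onsite_monthly = server_capex / amortize_months + power_cooling + admin_overhead

# EC2: instance-hours plus the outbound bandwidth that's easy to forget
instance_hourly = 0.34         # hypothetical on-demand rate
hours_per_month = 730
gb_out = 500                   # outbound transfer per month
per_gb_out = 0.12              # hypothetical egress rate
ec2_monthly = instance_hourly * hours_per_month + gb_out * per_gb_out

print(f"On-site: ${onsite_monthly:.2f}/mo   EC2: ${ec2_monthly:.2f}/mo")
```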
Reliability: Is your internet connectivity more reliable than your existing server infrastructure? If not, how much redundancy would you need to add, and at what cost? How would an EC2 outage like those experienced in April and August of 2011 affect your operations?
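One way to frame the reliability question: once the services are off-site, your users need both your Internet link and the provider to be up at the same time, so the availabilities multiply. A toy calculation, with uptime figures that are assumptions rather than quoted SLAs:

```
# Once services are hosted, an outage of EITHER your Internet link OR
# the provider takes users down, so availabilities multiply.
# Uptime figures below are assumptions, not quoted SLAs.

internet_uptime = 0.995      # a single, somewhat flaky link
provider_uptime = 0.9995     # the hosted environment
combined = internet_uptime * provider_uptime

hours_per_year = 24 * 365
print(f"Combined uptime: {combined:.4%}")
print(f"Expected downtime: {(1 - combined) * hours_per_year:.1f} h/yr")

# Redundant, independent links change the picture dramatically:
dual_links = 1 - (1 - internet_uptime) ** 2
print(f"With two independent links: {dual_links * provider_uptime:.4%}")
```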
Consider the following as points of further discussion; there are no concrete answers here, just things off the top of my head that you should be thinking about and hashing out with your peers:
You can probably rule out moving VoIP, print serving, and file serving to the cloud. I suspect the latency for VoIP wouldn't be acceptable without a lot of work, though your telco/SIP provider may offer a hosted service of their own that's worth looking into. File and print are probably tied into legacy software (image management, for example) that relies on LAN speed/latency to work properly. I'm sure you'll hit a number of stumbling blocks the further you look into this.
Databases might be OK, but again, the latency may not be acceptable for the applications that use them.
Web-based services (WordPress, Moodle, etc.) would likely work fine in a cloud-based environment out of the box, but if they depend on core services that currently reside only within the campus network, you'll need to look at replication, or at securely accessing those services remotely (which you'll have to do anyway, in the form of a private cloud and VPN tunnelling).
And a big cloud misconception is that people expect their software and services to behave differently just because they're "in the cloud." Unless your service stack has been designed to take advantage of the distributed architecture the cloud offers, you will still have single points of failure whenever an instance goes down.
Of course, there's always software licensing, which may or may not pose an additional hurdle.
And obviously there's the privacy and safeguarding of whatever information lives on these "non-critical" services, which might get your legal counsel in a tizzy.
And you still have to improve the recovery/continuity of your core services: that's the real root issue here, isn't it? Add to that the fact that your core services will now include redundant, prioritized, high-quality Internet connectivity, because without that, your cloud services are effectively useless.
The amount of work involved just in determining whether any of these "non-critical" services would be a successful cloud candidate is non-trivial. (And what's non-critical to you may be another department's most important service; service-level agreements might be a good thing here if you don't have any in place already.)
The good thing is that you'll be doing a healthy inventory/discovery of your IT infrastructure (which is always beneficial) to determine what you can or cannot do, or at least pilot. And since you're only spending money on cloud run-time fees, your pilot doesn't require a big outlay of cash.
I really like EC2 for disaster recovery, though I'm always skeptical about moving big chunks of infrastructure onto it. DR is where it shines: keeping a copy of everything critical. If you really lay the planning groundwork, you can have your operating systems both on EC2 and locally. If your IT building goes up in a glorious blaze, you can turn on a bunch of EC2 instances, repoint your DNS, and start serving. The beauty is that when they're offline, they cost nothing more than static storage fees.
If you really do it right, you can turn on systems in either location as needed, which also lets you experiment freely and run on whichever systems are least expensive.
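To give a flavor of what "turn on a bunch of EC2 instances and repoint your DNS" looks like when scripted, here's a minimal sketch using boto3, the AWS SDK for Python (my choice of tooling, not anything implied by your setup). It assumes your DNS is hosted in Route 53, and the instance IDs, hosted-zone ID, and hostname are all hypothetical placeholders:

```
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
route53 = boto3.client("route53")

# Placeholder IDs for pre-staged, normally-stopped DR instances
dr_instances = ["i-0123456789abcdef0"]

# Start the DR copies and wait until they're actually running
ec2.start_instances(InstanceIds=dr_instances)
ec2.get_waiter("instance_running").wait(InstanceIds=dr_instances)

# Grab the new public address (assumes the instance gets a public IP)
desc = ec2.describe_instances(InstanceIds=dr_instances)
public_ip = desc["Reservations"][0]["Instances"][0]["PublicIpAddress"]

# Repoint DNS at the DR copy; zone ID and hostname are hypothetical
route53.change_resource_record_sets(
    HostedZoneId="Z0000000EXAMPLE",
    ChangeBatch={
        "Comment": "Fail www over to the EC2 DR copy",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "www.example.edu.",
                "Type": "A",
                "TTL": 60,
                "ResourceRecords": [{"Value": public_ip}],
            },
        }],
    },
)
```

Run against stopped instances, something like this costs nothing until the day you need it; just keep the record's TTL low ahead of time so the repoint actually takes effect quickly.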
Since you're a campus, you can go really big with the DR tactic. I don't know the specifics of arranging it with Amazon, but many colleges are toting around a Class B address space. You could advertise a BGP route for a Class C reserved for mirrored services: if your web server lives in that address space, anything on campus or directly connected routes to the campus machine, and anything that finds Amazon closer routes there. (As of March 2010 this wasn't possible on EC2; it was highly requested, and may or may not be possible now.) In fact, you could establish the same thing strictly on campus, without BGP announcements, by adjusting only your internal routing tables.
There's no doubt that this adds complexity, but it will definitely make you very resilient.