I've been tasked with figuring out how to do a large-scale deployment of a high-availability, high-traffic, dynamic, Flash-based web application. There is a good chance this application could grow to 2 million users, or possibly a lot more.
Of course when the time comes to actually do it, I will likely bring in an expert to help me. For now, this is all theoretical, and I just wanted to get a good idea about how to do it.
Here is my current thinking:
- Get a 1 or 2 gigabit internet connection.
- 2x HAProxy load balancers.
- 12x physical servers (2x quad-core, 32GB of RAM each) running 25 Apache web server VMs apiece - 300 VMs total.
- 2x memcached servers (2x quad-core, 32GB of RAM)
- 2x master MySQL DB servers, 3x read-only slave MySQL DB servers
- 2x 6TB SAN RAIDs
- All 10GbE or fiber connected...
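To lay out the raw arithmetic on that web tier (a quick sanity check using only the numbers in the list above):

```python
# Back-of-the-envelope math for the proposed web tier, using only the
# figures listed above (12 hosts, 2x quad-core, 32 GB RAM, 25 VMs each).
physical_hosts = 12
cores_per_host = 2 * 4          # 2x quad-core
ram_gb_per_host = 32
vms_per_host = 25

total_vms = physical_hosts * vms_per_host          # 300 VMs
total_cores = physical_hosts * cores_per_host      # 96 physical cores
total_ram_gb = physical_hosts * ram_gb_per_host    # 384 GB

print(f"{total_vms} VMs on {total_cores} cores / {total_ram_gb} GB RAM")
print(f"~{cores_per_host / vms_per_host:.2f} cores and "
      f"~{ram_gb_per_host / vms_per_host:.1f} GB RAM per VM")
```

That works out to roughly a third of a physical core and a bit over 1 GB of RAM per Apache VM.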
My job is mostly to figure out the hardware, and how to deploy and maintain it.
As for maintaining this beast, I'm looking into Slack, RightScale, Landscape, and VMware.
I am also looking into just deploying it on Amazon EC2, but I would prefer to rack it myself, as it looks like it would be much cheaper in the long run. I might start out with my own rack and add EC2 instances if more capacity is needed.
What do you think, is this overkill? Not enough? Have I got it figured out, and is there anything I'm missing?
Thanks to anyone who reads and responds!
Develop it on EC2. If/when you want to move it in-house, set up your own Eucalyptus cluster and move it there; that should be straightforward, since Eucalyptus is an open-source clone of the EC2 infrastructure.
From what I've read on High Scalability, your best strategy is simply to work on the current bottlenecks and start planning for the next ones. Unless you're using exactly the same software and hardware someone else has already used, and unless the usage patterns are identical as you scale, you can't plan your entire growth now. You need to plan for the current phase and the next one, not the 2-million-user mark.
I think you've seriously overestimated your requirements. As a point of reference: two IPVS load balancers in a primary/primary setup, in front of a pair of dual quad-core Xeons running nginx + Tornado, handle 286 million ~2400-byte iframe ad blocks plus full ad-stream logging for the 7 elements within the iframe. Utilization of each box is under 40%, which allows a single node to fail with no 'overload'.
The second issue: you say the application is dynamic and 'flash-based', which means the majority of your bandwidth will go to serving the initial Flash application itself - a prime candidate for a CDN or a cache.
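As a minimal illustration of that point (the file name and port here are hypothetical; any web server or CDN origin can do the same job with its own cache settings), the idea is just to serve the static Flash binary with long-lived cache headers so the edge absorbs most of the bandwidth:

```python
# Sketch: serve static assets and mark the .swf as long-lived so a CDN or
# browser cache can hold it, leaving the origin to handle only dynamic traffic.
from http.server import HTTPServer, SimpleHTTPRequestHandler

class CachedAssetHandler(SimpleHTTPRequestHandler):
    def end_headers(self):
        # Let edge caches keep the Flash binary for a year; version the
        # filename (e.g. app-v42.swf) to invalidate it on deploys.
        if self.path.endswith(".swf"):
            self.send_header("Cache-Control", "public, max-age=31536000, immutable")
        super().end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), CachedAssetHandler).serve_forever()
```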
Your dynamic Flash application will likely send small messages back and forth using gevent or some similar technology. Your application logic will have to parse, log, and respond to them, but 2 million users a day at 80 interactions each is 160 million interactions per day. At peak times, you would probably have to handle around 3,600 tps. If you need 300 servers to handle that, you need to rewrite your code.
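Spelling out that arithmetic (the peak-to-average factor is an assumption on my part, not a measurement):

```python
# 2M users/day * 80 interactions each, with peak traffic assumed to be
# about twice the daily average, gives the transaction rate the app tier
# actually has to sustain.
users_per_day = 2_000_000
interactions_per_user = 80
total_interactions = users_per_day * interactions_per_user    # 160,000,000/day

seconds_per_day = 24 * 60 * 60
average_tps = total_interactions / seconds_per_day            # ~1,850 tps
peak_factor = 2                                               # assumed peak-to-average ratio
peak_tps = average_tps * peak_factor                          # ~3,700 tps

print(f"average: {average_tps:.0f} tps, assumed peak: {peak_tps:.0f} tps")
```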
Virtualizing your web servers, as mentioned above, is probably a waste of performance. Even the lightest-weight virtualization won't run as fast as bare iron.
However, without knowing your architecture or system design, any opinions offered here are guesswork based on loosely defined requirements.
Why would you put your web servers in a VM environment? It just adds overhead you don't need. They should all serve data from the same storage and have identical configurations.
I can speak to sizing the haproxy servers. Use a recent Linux kernel (>= 2.6.27) to benefit from TCP splicing, which will save you a lot of CPU at high bandwidth. Go for the highest CPU frequency you can find; you don't need many cores, so a 3.6 GHz dual-core is a better choice than a 2 GHz 8-core. For RAM, count 1GB per 20k concurrent connections.

Use two VIPs for the LBs (active/active), managed by the keepalived daemon, and announce both in your DNS. That way both LBs carry traffic at the same time, and if either fails, the other one takes over its IP. One very large site I know of runs more than 300k concurrent connections on only two properly tuned LBs, with quite a bit of margin to spare.
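A rough sizing sketch based on those rules of thumb (the connection count is just an example figure, and this only covers RAM; keepalived handles the failover itself):

```python
# Rules of thumb from above: ~1 GB of RAM per 20,000 concurrent connections,
# two active/active LBs where either one must be able to carry the full load
# if the other fails.
concurrent_connections = 300_000        # example total across both VIPs
connections_per_gb = 20_000

# Size each box for the whole load, not half of it, so a failover doesn't overload the survivor.
ram_needed_gb = concurrent_connections / connections_per_gb   # 15 GB
print(f"RAM per LB for {concurrent_connections:,} connections: ~{ram_needed_gb:.0f} GB")
```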
Do not underestimate the network cards. The lower latency of some NICs, such as Myricom's Myri10GE, compared to other 1 GbE or 10 GbE solutions can bring nice savings in packet-processing cost and in the number of concurrent connections the servers have to hold.
Concerning the Apache servers, haproxy can concentrate incoming connections into fewer outgoing ones, which reduces the number of servers you need and protects them from traffic surges. However, that only works for short connections; don't rely on it for long-polling requests, chat, or large file uploads/downloads.
Take a look at systemimager
http://wiki.systemimager.org/index.php/Main_Page
As far as reading on this topic goes, I highly recommend Scalable Internet Architectures by Theo Schlossnagle. It covers HA deployments and horizontal scalability in rather exhaustive detail.
http://www.amazon.com/Scalable-Internet-Architectures-Theo-Schlossnagle/dp/067232699X
Also, High Performance Web Sites and Cloud Application Architectures might be worth taking a look at, though the latter might be slightly outdated.