I have been tasked with scoping out infrastructure requirements for a site that will be pulling over 10 million unique visitors per month. The site will be several gigabytes in size, content-wise. I know right off that all of the interactive content will be placed on a CDN, but what about the back end? The site will also have a CMS attached, which means that any dual-server setup would need to be clustered, and I'm guessing load balanced as well. Just wanting any suggestions you may have.
To add more detail: most likely we will be using a WebMux load balancer.
Sadly, you won't know what to fix until you go live. It's very hard to put money in the right place without having some data to back up your decisions. I'd recommend The Art of Capacity Planning to get an idea of what you should be doing to plan your capacity. The general rule, though, is to monitor everything. You want graphs galore. If you can't see where things are having problems, you have no chance of fixing them. Do not leave monitoring until the last minute. I cannot stress enough how important it is to have an idea of how your site is currently performing and how it has performed over the last day, month, or year. We use Munin for our graphing, as it's very quick to get up and running. Other people use Ganglia and Cacti to great effect.
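To show how cheap a new graph is, here's a minimal sketch of a Munin plugin in Python. It follows Munin's plugin protocol (print the graph description when called with "config", otherwise print the current value). The metric used here, 1-minute load average, is only a stand-in; Munin already ships plugins for the basics, and the real win is graphing application-specific numbers like queue depth or cache hit rate.

```python
#!/usr/bin/env python
# Minimal Munin plugin sketch. Drop it in /etc/munin/plugins/ and make it
# executable. The metric (1-minute load average) is only illustrative.
import os
import sys

if len(sys.argv) > 1 and sys.argv[1] == "config":
    # Munin asks each plugin to describe its graph once.
    print("graph_title Load average (example plugin)")
    print("graph_vlabel load")
    print("graph_category system")
    print("load.label 1-minute load")
else:
    # On a normal poll, just print the current value.
    print("load.value %.2f" % os.getloadavg()[0])
```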
However, there are various things you can do to improve your chances of surviving.
1) Duplicate everything. Lots. You want to be able to add more hardware to the places where you're having problems, not to be buying bigger hardware to replace hardware that is too slow. Look at load balancing your application servers. Look at using a master/slave database setup, where reads come from your slaves and writes go to your master (there's a small read/write-splitting sketch after this list). You've said you're storing most media on a CDN. Good.
2) Avoid storing anything transient in your database. Databases are too slow for temporary data, and you want them free to serve other requests.
3) Avoid server-side state if possible. With server-side state you either need some sort of shared session replication between web servers, which limits your ability to add more hardware, or sticky sessions, which will work but can cause uneven load and lost sessions when a server dies. (One stateless alternative, signed cookies, is sketched after this list.)
4) Cache everything. Use memcached to cache data between your database and your application; it's most effective when you cache data that is the result of multiple queries (see the cache-aside sketch after this list). Put a cache in front of your web tier as well, something like Apache's mod_cache or Squid in reverse-proxy mode.
5) Profile your site. Find where it's slow.
6) Profile your HTML. A large proportion of user-perceived slowness on the web is in the front end. High Performance Web Sites covers a lot of useful techniques, and Yahoo's YSlow Firefox extension is also worth a look.
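For point 1, here is a minimal sketch of read/write splitting at the application level, assuming MySQL with the MySQLdb driver; the host names and credentials are placeholders, and a real version would also need to cope with replication lag and dead slaves:

```python
import random
import MySQLdb

MASTER = {"host": "db-master", "user": "app", "passwd": "secret", "db": "site"}
SLAVES = [
    {"host": "db-slave1", "user": "app", "passwd": "secret", "db": "site"},
    {"host": "db-slave2", "user": "app", "passwd": "secret", "db": "site"},
]

def connection(for_write=False):
    """Writes go to the master; reads are spread across the slaves."""
    return MySQLdb.connect(**(MASTER if for_write else random.choice(SLAVES)))

# e.g. connection().cursor().execute("SELECT ...") for reads,
#      connection(for_write=True) for anything that modifies data.
```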
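For point 3, one way to stay stateless is to keep the session in a signed cookie, so any web server can handle any request. A rough sketch, where the secret and the payload format are assumptions (and anything genuinely secret should still live server-side):

```python
import base64
import hashlib
import hmac
import json

SECRET = b"change-me"  # must be identical on every web server

def encode_session(data):
    # Serialise the session data and sign it so the client can't tamper with it.
    payload = base64.urlsafe_b64encode(json.dumps(data).encode("utf-8"))
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return payload.decode("ascii") + "." + sig

def decode_session(cookie):
    payload, sig = cookie.rsplit(".", 1)
    expected = hmac.new(SECRET, payload.encode("ascii"), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None  # tampered with, or signed by a different deployment
    return json.loads(base64.urlsafe_b64decode(payload).decode("utf-8"))
```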
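And for point 4, the usual cache-aside pattern with the python-memcached client. build_homepage_data() stands in for whatever multi-query work your pages actually do, and the 60-second TTL is an arbitrary example:

```python
import memcache

mc = memcache.Client(["127.0.0.1:11211"])

def build_homepage_data():
    # Placeholder: imagine several SELECTs assembled into one structure here.
    return {"articles": [], "top_comments": []}

def homepage_data():
    key = "homepage:v1"
    data = mc.get(key)
    if data is None:                  # cache miss: do the expensive work once
        data = build_homepage_data()
        mc.set(key, data, time=60)    # then serve it from memcached for 60s
    return data
```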
I can recommend Building Scalable Web Sites and the High Scalability blog.
There are a lot of options. Some technologies I am using in similar situations: HAProxy for load balancing, nginx and lighttpd to serve static content, Varnish as a caching reverse proxy, and Heartbeat for high availability between servers. I still keep Apache for the dynamic content, with the CMS publishing static HTML files so the front end avoids database connections wherever possible.
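A rough sketch of that publish-to-static-HTML step, just to make the idea concrete: when an editor saves a page, the CMS renders it once and writes it under the static web root for nginx/lighttpd to serve with no database involved. The paths and render_article() are placeholders, not anything from a particular CMS:

```python
import os

STATIC_ROOT = "/var/www/static"

def render_article(article):
    # Placeholder for the CMS template engine.
    return "<html><body><h1>%s</h1>%s</body></html>" % (
        article["title"], article["body"])

def publish_article(article):
    path = os.path.join(STATIC_ROOT, "articles", "%d.html" % article["id"])
    directory = os.path.dirname(path)
    if not os.path.isdir(directory):
        os.makedirs(directory)
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(render_article(article))
    os.rename(tmp, path)  # atomic swap: readers never see a half-written file
```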
You know, depending on exactly what you're doing... a somewhat simplistic approach might be to leverage something like Amazon's CloudFront service:
http://aws.amazon.com/cloudfront
With load balancing I would also highly recommend: