I'm setting up a large Drupal (Pressflow) site and this is my current plan. Have I gone and done anything blatantly stupid? Does anyone have any experience hosting a large, multi-server Drupal installation like this?
I'd be tempted to have a pair of Varnish nodes behind HAProxy to deliver an HA Varnish cluster.
You could easily run 2+ Varnish nodes on their own, without HAProxy, but then you can only load balance HTTP traffic. With HAProxy in front, you've got a TCP load balancer too.
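As a rough illustration of what the balancer in front of the Varnish nodes is doing, here is a minimal Python sketch of round-robin selection that skips nodes marked as down. The class name, host names and port are made up for the example, and this is not how HAProxy is implemented; it only shows the selection logic.

```python
from itertools import cycle


class RoundRobinPool:
    """Rotate requests across backends, skipping any marked unhealthy."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._ring = cycle(self.backends)

    def mark_down(self, backend):
        # A real balancer would do this after failed health checks.
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self):
        # Try each backend at most once per call.
        for _ in range(len(self.backends)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends available")


# Hypothetical Varnish nodes.
pool = RoundRobinPool(["varnish1:6081", "varnish2:6081"])
print([pool.pick() for _ in range(4)])
# ['varnish1:6081', 'varnish2:6081', 'varnish1:6081', 'varnish2:6081']

pool.mark_down("varnish1:6081")
print([pool.pick() for _ in range(2)])
# ['varnish2:6081', 'varnish2:6081']
```

With one node down, all traffic goes to the survivor, which is the behaviour you want from an HA pair.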
What will the edge of your network look like? Do you plan to have an HA pair of hardware firewalls? Do you need edge routing, BGP and multiple transits?
Another thing to consider is how your file server works. You could probably benefit from having a pair of file servers running a distributed storage system such as GlusterFS or MogileFS. That way you can ensure redundancy all the way through the infrastructure.
Adding multiple Memcached nodes is also trivial, and gives you more redundancy and resilience against traffic spikes and hardware failure.
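One detail worth understanding when you run several Memcached nodes is how keys get spread across them. Many Memcached clients offer a consistent hashing mode, so that losing a node only invalidates the keys that lived on it. Here is a minimal Python sketch of that idea. The class, the node names and the key names are hypothetical, and real clients differ in the details.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent hash ring mapping cache keys to nodes."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas  # virtual points per node
        self._points = []         # sorted hash points
        self._owner = {}          # hash point -> node
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(value):
        return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

    def add(self, node):
        for i in range(self.replicas):
            point = self._hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node):
        for i in range(self.replicas):
            point = self._hash(f"{node}#{i}")
            del self._owner[point]
            self._points.remove(point)

    def node_for(self, key):
        if not self._points:
            raise RuntimeError("ring is empty")
        # First point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._points, self._hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


# Hypothetical Memcached nodes and cache keys.
ring = HashRing(["mc1:11211", "mc2:11211", "mc3:11211"])
keys = [f"cache:node:{n}" for n in range(1000)]

before = {k: ring.node_for(k) for k in keys}
ring.remove("mc2:11211")  # simulate a hardware failure
after = {k: ring.node_for(k) for k in keys}

moved = sum(1 for k in keys if before[k] != after[k])
on_failed_node = sum(1 for k in keys if before[k] == "mc2:11211")
print(moved == on_failed_node)  # True: only keys on the failed node moved
```

With plain modulo hashing (hash of key mod number of nodes), removing one node would remap most keys and empty most of the cache at once, which is exactly what you don't want during a traffic spike.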
Make sure that you take steps to optimize your front-end delivery of content, especially if you anticipate high traffic. Keep all media on a separate media domain, ideally a cookieless one, as Stack Overflow does with sstatic.net (see http://blog.stackoverflow.com/2009/08/a-few-speed-improvements/).
You might also want to consider using a CDN to cache static content, such as CSS and non-changing JS. This multi-level cache infrastructure will even out the Slashdot effect, and also give you more resilience to failure.
This is because a large proportion of browser requests are for static content, which can be served effectively from the CDN PoP nearest to the requester. The other advantage of caching on multiple layers (browser, CDN, Varnish, Memcache) is that after a while, everything is cached multiple times, in multiple places. This is what gives you the resilience against failures.
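To make the "cached multiple times, in multiple places" point concrete, here is a toy Python model of the four layers. It treats each layer as a simple key-value store, which is a big simplification of how browsers, CDNs, Varnish and Memcache really behave; the class, function and URL are made up for the example.

```python
class Layer:
    """One cache layer, modelled as a plain dictionary."""

    def __init__(self, name):
        self.name = name
        self.store = {}


def fetch(url, layers, origin):
    """Return (body, served_by), checking the outermost layer first."""
    for depth, layer in enumerate(layers):
        if url in layer.store:
            body = layer.store[url]
            # Refill the outer layers that missed on the way in.
            for outer in layers[:depth]:
                outer.store[url] = body
            return body, layer.name
    # Every layer missed, so the backend has to do the work.
    body = origin(url)
    for layer in layers:
        layer.store[url] = body
    return body, "origin"


origin_calls = []


def origin(url):
    origin_calls.append(url)
    return f"<contents of {url}>"


browser, cdn, varnish, memcache = (
    Layer("browser"), Layer("cdn"), Layer("varnish"), Layer("memcache")
)
layers = [browser, cdn, varnish, memcache]
url = "/css/site.css"  # hypothetical static asset

print(fetch(url, layers, origin)[1])  # origin  (cold caches)
print(fetch(url, layers, origin)[1])  # browser (repeat visit)

browser.store.clear()                 # a new visitor arrives
print(fetch(url, layers, origin)[1])  # cdn

browser.store.clear()
cdn.store.clear()                     # the CDN PoP loses its cache
print(fetch(url, layers, origin)[1])  # varnish

print(len(origin_calls))              # 1
```

Even after two layers lose their copy, the backend was only hit once. That is the resilience you get from layering.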
A large Drupal site is really no different to a large anything site. Just ensure you have multiple levels of redundancy on every layer of the network.
As for the specification of the actual servers, you probably want more than 8 GB of RAM on the Varnish nodes.
I'd recommend Intel server NICs on the load balancer boxes, and either Cisco or HP ProCurve switches for the core of your network.
Your database nodes should be fast multi-processor servers with 15k SAS disks for speed. For redundancy, put 4+ disks in a RAID10 array.
I wouldn't recommend doing this in a shared hosting environment. Dedicated servers might be OK, but for peace of mind, I'd be specifying a 1/4 rack in a carrier-neutral datacenter. This way, you get the most freedom for the actual configuration and management of the servers.
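If you're sizing the array, the RAID10 arithmetic is simple enough to sketch. This small Python helper assumes standard two-way mirrors striped together and identical disks; the function name and the 300 GB disk size are only for illustration.

```python
def raid10_summary(disk_count, disk_size_gb):
    """Usable capacity and failure tolerance for RAID10 with two-way mirrors."""
    if disk_count < 4 or disk_count % 2:
        raise ValueError("RAID10 needs an even number of disks, at least 4")
    pairs = disk_count // 2
    return {
        # Half the raw capacity goes to the mirror copies.
        "usable_gb": pairs * disk_size_gb,
        # Any single disk can fail without data loss.
        "guaranteed_failures_survived": 1,
        # One disk per mirror pair can fail, if you are lucky about which.
        "best_case_failures_survived": pairs,
    }


print(raid10_summary(4, 300))
# {'usable_gb': 600, 'guaranteed_failures_survived': 1, 'best_case_failures_survived': 2}
```

The thing to keep in mind is the gap between the guaranteed and best-case numbers: losing both disks of the same mirror pair loses the array, so replace failed disks promptly.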
Added:
Do you absolutely need to run Apache?
For the servers hosting the media files on the cookieless domain, you'd probably be better off using a lighter-weight HTTP server; Nginx is a fantastic solution for this.
Apache is probably more suited to hosting Drupal itself, but there's no real reason you couldn't use Nginx and FastCGI, for example.
Something worth mentioning is that if you plan on using HTTPS, you need something in front of your load balancer to handle the HTTPS connections. I am not sure if Varnish can handle that, but I'd recommend using either Nginx or stunnel for that job.
Can I just ask how you plan to implement a separate file server? This is something I am really after, but standard Drupal does not seem to support it.