I have a Facebook app written in PHP. It currently gets 150 page views per minute and will grow to as much as 300 page views per minute by the end of this year. As traffic grows I am starting to have scalability problems, so I would like to ask for advice on how to scale so I can successfully handle 300 PV/minute.
My application is a quiz-like app hosted on a VPS that can use:
- 100% of one 2.6 GHz core
- 500 MB of guaranteed RAM, burstable up to 2 GB (cat /proc/user_beancounters shows privvmpages equivalent to 500 MB, while free -m shows 2 GB)
My VPS is configured like this:
- Centos 5
- Lighttpd
- Memcached
- APC
- MySQL
- PHP using FastCGI
Over the last few months I have optimized the MySQL, Lighttpd and PHP configuration using tutorials found on the internet. I've made extensive use of Memcached, so many requests now take around 1 ms, and those not served from the cache take up to 300 ms. I've also added good indexes to MySQL, so the database is no longer directly in the line of fire.
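The caching is basically a cache-aside pattern, something like the sketch below (the pecl Memcache class and the get_quiz_from_db() helper here are illustrative, not my exact code):

<?php
// Illustrative cache-aside sketch; get_quiz_from_db() is a hypothetical
// stand-in for the real query code.
$memcache = new Memcache();
$memcache->connect('127.0.0.1', 11211);

function get_quiz($memcache, $quizId) {
    $key  = 'quiz:' . $quizId;
    $quiz = $memcache->get($key);
    if ($quiz === false) {                   // cache miss: hit MySQL once...
        $quiz = get_quiz_from_db($quizId);   // ...via the hypothetical DB helper...
        $memcache->set($key, $quiz, 0, 300); // ...and keep the result for 5 minutes
    }
    return $quiz;
}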
For some time these optimizations were enough to handle the new requests, but lately, with the application's growing popularity, I've noticed that some requests take longer than 3 seconds, and during critical bursts Lighttpd simply gives up and users get a 500 Internal Server Error.
I think I've found a way to fix the 500 errors (I'll know for sure today) by setting:
"PHP_FCGI_MAX_REQUESTS" => "500"
But the scalability issue is still not resolved: I need to handle twice as many requests as now, and I've been thinking about how to do that. Here are the options I came up with today:
- Upgrade the VPS to 3.3 GHz on 2 cores
- Buy another VPS and move the database there
- Ask someone for help (which is what I'm doing now)
My VPS provider offers a bigger plan with 3.3 GHz on two cores instead of the single 2.6 GHz core I have now. It costs more, but will it help? How do I calculate whether it will handle 300 PV/minute?
My second idea is to buy another VPS and move the database there. That should free up CPU and memory for the FastCGI processes on one machine and the database process on the other. But how do I know whether it's better to add another server or to buy a bigger plan for the one I have now?
Which brings me to the third option: asking someone. So here I am, a programmer rather than an administrator, with a serious scalability problem, asking for your help.
First, I would like to know how to calculate how many PV per minute my current VPS can handle; that would help me decide. If 300 PV is beyond what my current VPS can do, I can move straight to another solution instead of spending more time on configuration.
Secondly, if my VPS can in fact handle more requests and this is a configuration issue, then I need help from someone with more knowledge in this area to get the configuration right. I can post the config here or send it by email, and I hope to hear from someone with the time and knowledge to help. I don't have time for more experiments on this.
Lastly, if it is beyond my VPS's abilities, how do I decide whether to upgrade the VPS or add another server? Which solution is better for the 300 PV/minute target?
If you've read this far, thank you very much in advance. Your help, advice, or contacts to people who can help with this issue will be greatly appreciated!
The killer bottleneck for reasonably specced VPSs is usually disk I/O, as all the VMs running on a given host share the same disk (or array of disks; good VPS hosts will have your VMs on a RAID10 array or similar). In fact, sometimes several hosts' worth of VMs share the same array if they are set up with an external drive array. This is particularly obvious when memory becomes short, as your database queries will always hit disk because there is no RAM to cache even a core working set of the data.
You might find that getting your own low-spec dedicated server improves matters simply because your workload can monopolise the raw I/O bandwidth, and you'll see less I/O latency because the drive heads are only flipping back and forth for your I/O requests, not several other machines' worth of requests too. This might even end up costing less than the "run two VPSs" solution, particularly when you consider that in many cases data transfer between VMs counts against the bandwidth quotas for both machines (check with your host; this is not always the case, but unless you are explicitly told otherwise it is safer to assume it is), so you may have increased bandwidth-related costs. You might be surprised how cheaply you can rent a small P4-based machine, and from your description I doubt CPU power is your bottleneck (memory and I/O contention are the more likely culprits).
500 MB of memory may be a limitation, so going back to the two-VPS idea, splitting into two VMs so your database isn't competing with your FastCGI and memcached processes may help. Similarly, it might just be worth getting more fixed RAM allocated; I've never had any faith in the idea of "burstable RAM allocation", as I assume each OS will try to use as much RAM as it can for I/O efficiency (though I've never used a host with burstable RAM allocation, so I have no direct evidence to back up my lack of faith!). What does the rest of
free -m
show? Also, how big are your databases? Getting more fixed RAM allocated may help more than moving to a cheap dedicated server (most of the cheaper options come with only 512 MB of physical RAM, though most can be upgraded for extra cost), depending on how cramped 512 MB actually is for your needs. Sorry this isn't a particularly straight answer...
To test how RAM-dependent your performance is, you could set up a VM of similar spec on your local machine, duplicate your setup in it, and throw some benchmarking software at it (http://httpd.apache.org/docs/1.3/programs/ab.html is a place to start), then increase the RAM allocated to the VM to see what difference it makes to where the errors start to kick in. You can simulate bad I/O contention too by running a couple of other simple VMs, each performing some sort of I/O benchmark like bonnie++.
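A minimal ab run against such a test VM might look like this (the URL and request counts are only examples; 5 concurrent requests roughly matches the 300 PV/minute target):

# roughly 1000 requests, 5 at a time, against a page on the test VM
ab -n 1000 -c 5 http://test-vm.local/index.php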
Sorry, but are you sure you're talking about page views per MINUTE, not per SECOND? 300 pages per minute is only 5 pages per second, which any mobile phone should be able to deliver without breaking a sweat, so I really can't imagine a 2.6 GHz CPU failing to do that!
If you're really sure you're talking about minutes, then monitor your disk I/O, CPU and memory. A properly designed application cannot run that slowly, so you must have a huge tuning issue somewhere. Maybe you're doing thousands of accesses to the MySQL database or to memcache and you are very sensitive to I/O latency (in that case the CPU will remain almost unused). If your CPU is constantly maxed out, then something is wrong in the code, and it's pointless to try to optimize I/O and other components; the only viable solution then is to fix the code.
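To check disk I/O, CPU and memory, something like this is enough (iostat comes from the sysstat package on CentOS; the intervals are arbitrary):

iostat -x 5     # per-device utilisation and wait times, every 5 seconds
free -m         # how much RAM is actually left for the MySQL and OS caches
top             # which processes are eating CPU or memory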
I tend to agree with David Spillett's response. I would add that putting your app and database on the same node is also a major bottleneck, because databases are memory-hungry in general. I have hosted several high-traffic sites just as busy as the one you describe, and we never put our database tier on VMs nor on the same tier as our web and app tiers; our databases always run on real, dedicated hardware.
Our front ends, depending on the architecture, are load balanced with Cisco CSMs, but you can do similar load balancing with Apache.
If you're a Linux shop, there are tons of ways to tackle this without the expensive Cisco hardware.
Have a look at this: http://haproxy.1wt.eu/
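A minimal HAProxy setup for two PHP front ends could look something like this (the IPs, ports and health-check URL are placeholders):

listen webfarm 0.0.0.0:80
    mode http
    balance roundrobin
    option httpchk GET /
    server web1 10.0.0.11:80 check
    server web2 10.0.0.12:80 check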
That's very, very difficult. It's hard to predict what impact optimizations will have, or how they will interact with other parts of your system.
You need to experiment.
If you can't experiment, all anyone can do is guess blindly. That might work, if the guess happens to be particularly lucky and accurate.
You should profile and examine your running system. Someone "guessed" above that you are hitting swap, and that may well be a good guess. First use top, vmstat and sar to get a picture of what the box is doing. Is your CPU pegged? Are you doing massive I/O? Are you swapping? The answers will give you a reasonable idea of where the problem is.
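For example (sar comes from the sysstat package; the intervals are arbitrary):

vmstat 5        # watch the "si"/"so" columns for swapping and "wa" for I/O wait
sar -u 5 5      # CPU utilisation sampled over 25 seconds
top             # press Shift+M to sort processes by memory usage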
Your problem might lie anywhere among lighttpd, PHP, memcache and MySQL, but the usual suspects are the three just mentioned: a pegged CPU, heavy I/O, or swapping. You should be able to pinpoint the problem to one of those three.
300 page views per minute is not much; that's 5 page views per second, so something wonky seems to be going on.