For the past few weeks I've been getting more and more reports about lag on one of my sites. I've finally been experiencing it firsthand over the last week, but I haven't been able to pinpoint the problem.
The server load never goes above about 0.5 on a 16-core machine, and memory usage tops out at around 12-13%. The database isn't the issue, since the lag also happens on static resources. About 1 in 10 page views gets a 502 error, and about 1 in 5 pages takes 5-20 seconds to load. Chrome's network tab shows "waiting" for almost all of that time.
I rebooted the server last night, and it seemed okay for a few hours, but less than 12 hours later it was back to the normal lag issues. Anyone have any tips on where I can look to try and figure out the problem?
I would do a few things.
Hit a "laggy" resource using curl from the box itself and get timings. That tells you whether the problem is the network between browser and server or the server itself.
Use something like Firebug + YSlow, PageSpeed, or KITE to get a waterfall diagram of your Web page. These tools should show whether the problem is slow downloads, DNS, or Web site response time (aka "time to first byte"), which will also help localize the problem.
Make sure you're logging time-taken in your Apache log (%D, which records the time taken to serve each request in microseconds) and see what that tells you.
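Once %D is in the log, a small script can pull out the slow requests and tally status codes. This is only a sketch: it assumes the access log uses the common log format with %D appended as the last field (something like LogFormat "%h %l %u %t \"%r\" %>s %b %D"), and the function name and the 5-second threshold are my own choices, not anything Apache provides.

```python
import re
from collections import Counter

# Assumed line layout: %h %l %u %t "%r" %>s %b %D
# Adjust the pattern if your LogFormat differs.
LINE_RE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ (?P<micros>\d+)$'
)


def summarize(lines, slow_seconds=5.0):
    """Return (status_counts, slow_requests) for an iterable of log lines.

    slow_requests is a list of (seconds, request_line) tuples, slowest first.
    """
    statuses = Counter()
    slow = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not fit the assumed format
        statuses[match.group("status")] += 1
        seconds = int(match.group("micros")) / 1_000_000  # %D is in microseconds
        if seconds >= slow_seconds:
            slow.append((seconds, match.group("request")))
    slow.sort(reverse=True)
    return statuses, slow
```

Pass it an open file handle for the access log. If the slow list is empty while browsers still wait 5-20 seconds, Apache is answering quickly and the delay is somewhere in front of it.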
Just a hunch but this sounds networky. Do a netstat at least and see if you've got a billion connections running.
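To make the netstat check a bit more concrete, here is a rough sketch that counts TCP connections per state. It assumes Linux-style `netstat -ant` output, where each TCP row has the state in the sixth column; the function names are made up for illustration.

```python
import subprocess
from collections import Counter


def count_tcp_states(netstat_output):
    """Count TCP connections per state in `netstat -ant` style output."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # TCP rows: proto, recv-q, send-q, local address, foreign address, state
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            states[fields[5]] += 1
    return states


def current_tcp_states():
    """Run netstat locally (assumes a netstat binary is installed)."""
    result = subprocess.run(
        ["netstat", "-ant"], capture_output=True, text=True, check=True
    )
    return count_tcp_states(result.stdout)
```

Calling current_tcp_states() on the server during a laggy period and looking at the ESTABLISHED, TIME_WAIT, and SYN_RECV counts should show quickly whether connections are piling up.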
502s are not the usual timeout response; a 502 means "Bad Gateway" (a gateway timeout would be a 504). They tend to occur if there's some proxy or gateway having trouble with your site. Might be an app going bad behind mod_proxy on your site? I'd try to eliminate that by hitting static content from local and then expanding from there.
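If you would rather script the local test than run curl by hand, something along these lines would do. It is a minimal sketch using Python's standard http.client; the function name is mine, and "time to first byte" here is approximated as the time until the status line and headers have been read, including connection setup.

```python
import http.client
import time


def time_request(host, path="/", port=80, timeout=30):
    """Return (status, seconds_to_first_byte, total_seconds) for one GET."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    start = time.perf_counter()
    try:
        conn.request("GET", path)      # connects, then sends the request
        response = conn.getresponse()  # returns once status line and headers are read
        first_byte = time.perf_counter() - start
        response.read()                # drain the body
        total = time.perf_counter() - start
        return response.status, first_byte, total
    finally:
        conn.close()
```

Run it on the server against 127.0.0.1, first for a static file and then for an application URL, a few dozen times each. If static content is consistently fast locally but the app URLs return 502s or stall, the backend behind the proxy is the likely suspect; if both are fast locally, look at the network path between the browser and the box.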