Correction: response time (%D) is μs, not ms!
This doesn't change anything about the weirdness of the pattern, but it does mean it is far less devastating in practice.
Why is response time inversely correlated to request frequency?
Shouldn't the server respond faster when it is less busy handling requests?
Any suggestion how to make Apache "take advantage" of less load?
This pattern is periodic: it shows up whenever the request rate drops below about 200 requests per minute, which (due to natural user activity) happens from late night to early morning.
The requests are very simple POSTs sending a JSON of less than 1000 characters - this JSON is stored (appended to a text file) - that's it. The reply is just "-".
The data shown in the graphs was logged with Apache itself:
LogFormat "%{%Y-%m-%d+%H:%M:%S}t %k %D %I %O" performance
CustomLog "/var/log/apache2/performance.log" performance
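For reference, the fields in that format string (from mod_log_config, with %I/%O provided by mod_logio) are:
# %{%Y-%m-%d+%H:%M:%S}t   request timestamp in the given strftime format
# %k                      number of keepalive requests handled on this connection
# %D                      time taken to serve the request, in microseconds
# %I                      bytes received, including request line and headers (mod_logio)
# %O                      bytes sent, including headers (mod_logio)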
This is common behavior in data centers. The times when your response time is slow correspond to what is commonly called the Batch Window. This is a period when user activity is expected to be low and batch processes can be run. Backups are also done during this period. These activities can strain the resources of servers and networks, causing performance issues like the ones you are seeing.
There are a few resources that can cause issues like this: CPU, memory, disk I/O, and network.
I use sar to investigate issues like this. atsar can be used to gather sar data into daily data files. These can be examined to see what the system behavior is like during the daytime when performance is normal, and overnight when performance is variable.
If you are monitoring the system with munin or some other system that gathers and graphs resource utilization, you may find some indicators there. I still find sar more precise.
There are tools like nice and ionice that can be applied to batch processes to minimize their impact. They are only effective for CPU or I/O issues; they are unlikely to resolve issues with memory or network activity.
Moving backup activity to a separate network can reduce network contention. Some backup software can be configured to limit the bandwidth it will use. This could resolve or reduce network contention issues.
Depending on how the batch processes are triggered you may be able to limit the number of batch processes running in parallel. This may actually improve the performance of the batch processes as they are likely experiencing the same resource contention.
This relation may happen in the other direction if the request senders wait for a previous request to complete before submitting a new one. In that case, traffic drops as request times grow (for whatever reason), due to client-side queuing.
Or it can be an artifact of your measurement - if the graph above shows completed requests, as opposed to arriving requests, the rate will drop as request processing time grows (assuming finite capacity :D).
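As a rough illustration of the client-side queuing effect (all numbers below are made up purely for illustration): with a fixed set of closed-loop clients, the completed-request rate is roughly clients / (response time + think time), so the measured rate falls as latency grows even though no client has gone away.

# Toy model of client-side queuing: each of N clients waits for its previous
# request to complete (plus some think time) before sending the next one.
# All numbers are invented for illustration only.

def observed_rate_per_min(clients, response_ms, think_ms):
    """Requests per minute completed by 'clients' closed-loop senders."""
    cycle_ms = response_ms + think_ms      # one request "cycle" per client
    return clients * 60_000 / cycle_ms

for response_ms in (5, 50, 500):
    rate = observed_rate_per_min(clients=10, response_ms=response_ms, think_ms=100)
    print(f"response={response_ms:3d}ms -> ~{rate:6.0f} requests/min")
# As response time grows, the completed-request rate drops even though no
# client has gone away - the causality can run from latency to rate.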
Though @BillThor's answer may be correct, it seems unlikely that the period of low load is entirely taken up by backup processes (i.e. that the periods match precisely).
An alternative explanation is simply caching. If a given script / database / whatever has not been used recently, the relevant cached data may have been dropped to free up memory for the rest of the operating system. This might be indexes on a database, O/S buffers relating to a file, or anything else similar. A query will then have to reconstitute this information if it has been a while since the last query. In busy periods this won't occur, because the last query will have been recent. This would also explain why you see low response times during the busy period and high response times during the quiet period.
What you're seeing there looks, to me, like it could be a statistical issue. It might not be (@BillThor's answer could well be right), but I'll post this for completeness.
The response time graphs are percentile-based. A sample pool of 800-1000 requests is a good sample count for this; a pool of 50-100 requests maybe not so much.
If you assume that the number of slow requests isn't a linear function of request volume, such that an order of magnitude increase in requests doesn't result in an order of magnitude increase in slow requests, then higher volumes of requests will result in lower average request time.
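A quick way to convince yourself of this is a toy simulation (the latencies below are invented, not your real data): keep the number of slow requests per minute fixed and vary only the total volume, and the reported percentiles move dramatically even though nothing about the slow requests has changed.

import random
import numpy as np

random.seed(1)

def minute_sample(total, slow_count, fast_us=600, slow_us=60_000):
    """One minute of requests: 'slow_count' slow ones regardless of volume,
    the rest fast. Latencies in microseconds, purely invented for illustration."""
    fast = [random.gauss(fast_us, 100) for _ in range(total - slow_count)]
    slow = [random.gauss(slow_us, 10_000) for _ in range(slow_count)]
    return fast + slow

for total in (1000, 100, 50):
    sample = minute_sample(total, slow_count=10)
    p50, p95 = np.percentile(sample, [50, 95])
    print(f"{total:5d} req/min -> p50={p50:8.0f}us  p95={p95:8.0f}us")
# With 1000 req/min the 10 slow requests sit above the 95th percentile and are
# invisible; with 50 req/min those same 10 requests are 20% of the sample, so
# the 95th percentile lands squarely in the slow group.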
There are lies, big lies and statistics.
My hypothesis: you've got three distinct categories of requests: the ordinary fast ones that make up the bulk of daytime traffic, a smaller stream of slow ones, and a small stream of very slow ones, with the latter two arriving at roughly constant rates regardless of time of day.
At night, the 50 requests per minute split correspondingly into 20+20+10. As a result, the 50th percentile now depends strongly on stream 2, and the 95th percentile depends on stream 3, so it never even shows on the graph.
During the day, streams 2+3 are well hidden above the 95th percentile.
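To make that concrete, here is a small sketch of which stream each percentile lands in. The night split of 20+20+10 comes from the hypothesis above; the daytime split is invented purely to stand in for "mostly fast requests".

def stream_at_percentile(counts, pct):
    """Which stream (ordered fast -> slow) contains the given percentile rank?
    'counts' are requests per minute per stream."""
    total = sum(counts)
    rank = pct / 100 * total
    running = 0
    for i, c in enumerate(counts, start=1):
        running += c
        if rank <= running:
            return i
    return len(counts)

day   = [960, 20, 10]   # hypothetical daytime split across the three streams
night = [20, 20, 10]    # the 20+20+10 night split suggested above

for label, counts in (("day", day), ("night", night)):
    print(label, "p50 -> stream", stream_at_percentile(counts, 50),
          " p95 -> stream", stream_at_percentile(counts, 95))
# day:   p50 -> stream 1, p95 -> stream 1  (streams 2+3 hidden above the 95th percentile)
# night: p50 -> stream 2, p95 -> stream 3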
The more I look at it, the more I'm inclined to think that there's a problem with the data collection.
First off, there is something really weird going on with your TPS. While the overall pattern looks normal, there is a very sharp break occurring at about 9pm, and then again at about 7am. A normal chart will be much smoother during the transition to off-peak hours.
That suggests that there is a change in the profile, and you possibly have 2 distinct types of clients: one that is only active between roughly 7am and 9pm, and another that is active around the clock.
The second hint is around 18:00. Most of the time before and after, we have the high volume profile--high TPS and low latency. But at around 18:00 there is a sudden drop from 800-1000 RPM to less than 400 RPM. What could possibly cause that?
The third hint is the step-down in the 5th-percentile response times. I actually prefer to look at the min response times (but the 5th percentile is possibly better) for two reasons: It tells me service time (i.e. response time minus queueing), and response times tend to follow a Weibull distribution which means that the mode (or the most common value) is just above the minimum.
So the step-down in the 5th percentile says to me that there is a sudden break in the series, and the service time has actually dropped even though both the variance and the average response times have greatly increased.
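If you want to see why the low percentiles track service time, here is a toy illustration; the shape, scale, and offset are invented values, not anything fitted to your data.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical response times: a fixed service time plus Weibull-distributed
# queueing/processing variation (shape and scale are made-up values).
service_us = 500
samples = service_us + 2000 * rng.weibull(1.1, size=100_000)

p5 = np.percentile(samples, 5)
hist, edges = np.histogram(samples, bins=200)
mode_estimate = edges[hist.argmax()]

print(f"min  = {samples.min():7.0f}us")
print(f"p5   = {p5:7.0f}us")
print(f"mode ~ {mode_estimate:7.0f}us")
print(f"mean = {samples.mean():7.0f}us")
# The minimum, 5th percentile and mode cluster near the underlying service time,
# while the mean is pulled up by the long tail - which is why the low percentiles
# are a reasonable proxy for service time.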
Next steps
At this stage I would do a deep dive into the logs to find out what is different about the 18:00 low-volume samples compared to the high-volume samples before and after it.
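For that deep dive, a small script along these lines could give per-minute counts and %D percentiles to compare the 18:00 window against its neighbours. It assumes the log path and the five whitespace-separated fields from the LogFormat shown in the question (timestamp, %k, %D, %I, %O); adjust if your layout differs.

#!/usr/bin/env python3
"""Rough sketch: per-minute request counts and %D percentiles from performance.log."""
from collections import defaultdict
import numpy as np

per_minute = defaultdict(list)

with open("/var/log/apache2/performance.log") as f:
    for line in f:
        fields = line.split()
        if len(fields) < 3:
            continue
        ts = fields[0].strip("[]")                # e.g. 2023-05-01+17:59:42 (brackets, if any, stripped)
        minute = ts[:16]                          # truncate to YYYY-MM-DD+HH:MM
        per_minute[minute].append(int(fields[2])) # %D in microseconds

for minute in sorted(per_minute):
    us = per_minute[minute]
    p50, p95 = np.percentile(us, [50, 95])
    print(f"{minute}  n={len(us):5d}  p50={p50:9.0f}us  p95={p95:9.0f}us")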
I would look for:
BTW, the 18:00 "event" is enough evidence for me that it's nothing to do with data center congestion/activity. For that to be true, the congestion would have to cause a drop in TPS, which is possible at 18:00 but extremely unlikely to be causing a sustained, and smoothly curving drop in TPS for 10 hours between 9pm and 7am.