Correction: response time (%D) is μs, not ms!
This doesn't change anything about the weirdness of the pattern, but it does mean it is far less devastating in practice.
Why is response time inversely correlated to request frequency?
Shouldn't the server respond faster when it is less busy handling requests?
Any suggestion how to make Apache "take advantage" of less load?
This pattern is periodic: it shows up whenever the request rate drops below about 200 requests per minute, which (due to natural user activity) happens from late night to early morning.
The requests are very simple POSTs sending a JSON of less than 1000 characters - this JSON is stored (appended to a text file) - that's it. The reply is just "-".
The data shown in the graphs was logged with Apache itself:
LogFormat "%{%Y-%m-%d+%H:%M:%S}t %k %D %I %O" performance
CustomLog "/var/log/apache2/performance.log" performance
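For reference, the fields in that format string (from mod_log_config, with %I/%O provided by mod_logio) are:
# %{%Y-%m-%d+%H:%M:%S}t   request timestamp in the given strftime format
# %k                      number of keepalive requests handled on this connection
# %D                      time taken to serve the request, in microseconds
# %I                      bytes received, including request line and headers (mod_logio)
# %O                      bytes sent, including headers (mod_logio)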
This is common behavior in data centers. The times when your response time is slow correspond to what is commonly called the Batch Window. This is a period when user activity is expected to be low and batch processes can be run. Backups are also done during this period. These activities can strain the resources of servers and networks, causing performance issues like the ones you are seeing.
There are a few resources that can cause issues like this: CPU, memory, disk I/O, and network.
I use sar to investigate issues like this. atsar can be used to gather sar data into daily data files. These can be examined to see what the system behavior is like during the daytime when performance is normal, and overnight when performance is variable.
If you are monitoring the system with munin or some other system that gathers and graphs resource utilization, you may find some indicators there. I still find sar more precise.
There are tools like nice and ionice that can be applied to batch processes to minimize their impact. They are only effective for CPU or I/O issues; they are unlikely to resolve issues with memory or network activity.
Moving backup activity to a separate network can reduce network contention. Some backup software can be configured to limit the bandwidth it will use. This could resolve or reduce network contention issues.
Depending on how the batch processes are triggered you may be able to limit the number of batch processes running in parallel. This may actually improve the performance of the batch processes as they are likely experiencing the same resource contention.
This relation may happen in the other direction if the request senders wait for a previous request to complete before submitting a new one. In that case, traffic drops as request times grow (for whatever reason), due to client-side queuing.
Or it can be an artifact of your measurement - if the graph above shows completed requests, as opposed to arriving requests, the rate will drop as request processing time grows (assuming finite capacity :D).
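As a rough illustration of the client-side queuing effect (all numbers below are made up purely for illustration): with a fixed set of closed-loop clients, the completed-request rate is roughly clients / (response time + think time), so the measured rate falls as latency grows even though no client has gone away.

# Toy model of client-side queuing: each of N clients waits for its previous
# request to complete (plus some think time) before sending the next one.
# All numbers are invented for illustration only.

def observed_rate_per_min(clients, response_ms, think_ms):
    """Requests per minute completed by 'clients' closed-loop senders."""
    cycle_ms = response_ms + think_ms      # one request "cycle" per client
    return clients * 60_000 / cycle_ms

for response_ms in (5, 50, 500):
    rate = observed_rate_per_min(clients=10, response_ms=response_ms, think_ms=100)
    print(f"response={response_ms:3d}ms -> ~{rate:6.0f} requests/min")
# As response time grows, the completed-request rate drops even though no
# client has gone away - the causality can run from latency to rate.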
Though @BillThor's answer may be correct, it seems unlikely that the period of low load is entirely taken up by backup processes (i.e. that the periods match precisely).
An alternative explanation is simply caching. If a given script / database / whatever has not been used recently, the relevant cached data may have been dropped to free up memory for the rest of the operating system. This might be indexes on a database, O/S buffers relating to a file, or anything else similar. A query will then have to reconstitute this information if it has been a while since the last query. In busy periods this won't occur, because the last query will have been recent. This would also explain why you see low response times during the busy period and high response times during the quiet period.
What you're seeing there looks, to me, like it could be a statistical issue. It might not be (@BillThor's answer could well be right), but I'll post this for completeness.
The response time graphs are percentile-based. A sample pool of 800-1000 requests is a good sample count for this; a pool of 50-100 requests maybe not so much.
If you assume that the number of slow requests isn't a linear function of request volume, such that an order of magnitude increase in requests doesn't result in an order of magnitude increase in slow requests, then higher volumes of requests will result in lower average request time.
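A quick way to convince yourself of this is a toy simulation (the latencies below are invented, not your real data): keep the number of slow requests per minute fixed and vary only the total volume, and the reported percentiles move dramatically even though nothing about the slow requests has changed.

import random
import numpy as np

random.seed(1)

def minute_sample(total, slow_count, fast_us=600, slow_us=60_000):
    """One minute of requests: 'slow_count' slow ones regardless of volume,
    the rest fast. Latencies in microseconds, purely invented for illustration."""
    fast = [random.gauss(fast_us, 100) for _ in range(total - slow_count)]
    slow = [random.gauss(slow_us, 10_000) for _ in range(slow_count)]
    return fast + slow

for total in (1000, 100, 50):
    sample = minute_sample(total, slow_count=10)
    p50, p95 = np.percentile(sample, [50, 95])
    print(f"{total:5d} req/min -> p50={p50:8.0f}us  p95={p95:8.0f}us")
# With 1000 req/min the 10 slow requests sit above the 95th percentile and are
# invisible; with 50 req/min those same 10 requests are 20% of the sample, so
# the 95th percentile lands squarely in the slow group.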
There are lies, big lies and statistics.
My hypothesis: you've got three distinct categories of requests: the ordinary fast ones that make up the bulk of daytime traffic, a smaller stream of slow ones, and a small stream of very slow ones, with the latter two arriving at roughly constant rates regardless of time of day.
At night, the 50 requests per minute split correspondingly into 20+20+10. As a result, the 50th percentile now depends strongly on stream 2, and the 95th percentile depends on stream 3, so it never even shows on the graph.
During the day, streams 2+3 are well hidden above the 95th percentile.
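To make that concrete, here is a small sketch of which stream each percentile lands in. The night split of 20+20+10 comes from the hypothesis above; the daytime split is invented purely to stand in for "mostly fast requests".

def stream_at_percentile(counts, pct):
    """Which stream (ordered fast -> slow) contains the given percentile rank?
    'counts' are requests per minute per stream."""
    total = sum(counts)
    rank = pct / 100 * total
    running = 0
    for i, c in enumerate(counts, start=1):
        running += c
        if rank <= running:
            return i
    return len(counts)

day   = [960, 20, 10]   # hypothetical daytime split across the three streams
night = [20, 20, 10]    # the 20+20+10 night split suggested above

for label, counts in (("day", day), ("night", night)):
    print(label, "p50 -> stream", stream_at_percentile(counts, 50),
          " p95 -> stream", stream_at_percentile(counts, 95))
# day:   p50 -> stream 1, p95 -> stream 1  (streams 2+3 hidden above the 95th percentile)
# night: p50 -> stream 2, p95 -> stream 3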
The more I look at it, the more I'm inclined to think that there's a problem with the data collection.
First off, there is something really weird going on with your TPS. While the overall pattern looks normal, there is a very sharp break occurring at about 9pm, and then again at about 7am. A normal chart will be much smoother during the transition to off-peak hours.
That suggests that there is a change in the profile, and you possibly have 2 distinct types of clients: one that is only active between roughly 7am and 9pm, and another that is active around the clock.
The second hint is around 18:00. Most of the time before and after, we have the high volume profile--high TPS and low latency. But at around 18:00 there is a sudden drop from 800-1000 RPM to less than 400 RPM. What could possibly cause that?
The third hint is the step-down in the 5th-percentile response times. I actually prefer to look at the min response times (but the 5th percentile is possibly better) for two reasons: It tells me service time (i.e. response time minus queueing), and response times tend to follow a Weibull distribution which means that the mode (or the most common value) is just above the minimum.
So the step-down in the 5th percentile says to me that there is a sudden break in the series, and the service time has actually dropped even though both the variance and the average response times have greatly increased.
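If you want to see why the low percentiles track service time, here is a toy illustration; the shape, scale, and offset are invented values, not anything fitted to your data.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical response times: a fixed service time plus Weibull-distributed
# queueing/processing variation (shape and scale are made-up values).
service_us = 500
samples = service_us + 2000 * rng.weibull(1.1, size=100_000)

p5 = np.percentile(samples, 5)
hist, edges = np.histogram(samples, bins=200)
mode_estimate = edges[hist.argmax()]

print(f"min  = {samples.min():7.0f}us")
print(f"p5   = {p5:7.0f}us")
print(f"mode ~ {mode_estimate:7.0f}us")
print(f"mean = {samples.mean():7.0f}us")
# The minimum, 5th percentile and mode cluster near the underlying service time,
# while the mean is pulled up by the long tail - which is why the low percentiles
# are a reasonable proxy for service time.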
Next steps
At this stage I would do a deep dive into the logs to find out what is different about the 18:00 low-volume samples compared to the high-volume samples before and after it.
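For that deep dive, a small script along these lines could give per-minute counts and %D percentiles to compare the 18:00 window against its neighbours. It assumes the log path and the five whitespace-separated fields from the LogFormat shown in the question (timestamp, %k, %D, %I, %O); adjust if your layout differs.

#!/usr/bin/env python3
"""Rough sketch: per-minute request counts and %D percentiles from performance.log."""
from collections import defaultdict
import numpy as np

per_minute = defaultdict(list)

with open("/var/log/apache2/performance.log") as f:
    for line in f:
        fields = line.split()
        if len(fields) < 3:
            continue
        ts = fields[0].strip("[]")                # e.g. 2023-05-01+17:59:42 (brackets, if any, stripped)
        minute = ts[:16]                          # truncate to YYYY-MM-DD+HH:MM
        per_minute[minute].append(int(fields[2])) # %D in microseconds

for minute in sorted(per_minute):
    us = per_minute[minute]
    p50, p95 = np.percentile(us, [50, 95])
    print(f"{minute}  n={len(us):5d}  p50={p50:9.0f}us  p95={p95:9.0f}us")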
I would look for:
BTW, the 18:00 "event" is enough evidence for me that it's nothing to do with data center congestion/activity. For that to be true, the congestion would have to cause a drop in TPS, which is possible at 18:00 but extremely unlikely to be causing a sustained, and smoothly curving drop in TPS for 10 hours between 9pm and 7am.