We have a new client for whom we're reviewing our server infrastructure.
I know the web API pretty well because I helped build it, and now I'm maintaining it and pushing it forward on my own, so it's a big challenge and very interesting.
It runs on an Amazon m1.large instance with nginx (+ SSL), Django, Amazon RDS (MySQL), and, for now, a self-hosted memcached.
The thing is, our client told us to expect a maximum of about 2,500 users connecting to the API over a four-hour window, at least twice a day.
We have no idea exactly when those connections will arrive, and we shouldn't make assumptions, so I concluded that our server has to be able to support 2,500 concurrent connections at any point in time.
I've been playing around with ApacheBench, sending 2,500 concurrent connections while enabling/disabling memcached or tweaking nginx settings, just to see how performance changes.
The best I got was around 100 requests per second, but the slowest requests take more than 20 seconds (at 2,500 concurrent connections; with only 100, requests take at most 1 s). From a user's point of view, I wouldn't want to wait more than 1 or 2 seconds for a result...
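As a sanity check, Little's Law (concurrency = throughput × latency) ties those benchmark numbers together; a quick sketch using the figures above:

```python
# Little's Law: L = lambda * W, where L is the number of requests in the
# system (concurrency), lambda is throughput (req/s), and W is the average
# time each request spends in the system (seconds).

def average_latency(concurrency, throughput_rps):
    """Average latency implied by a given concurrency and throughput."""
    return concurrency / throughput_rps

def required_throughput(concurrency, target_latency_s):
    """Throughput needed to hold latency at a target for that concurrency."""
    return concurrency / target_latency_s

# At ~100 req/s with 2500 connections in flight, the average time in the
# system is about 25 s -- consistent with the 20+ s worst cases observed.
print(average_latency(2500, 100))      # -> 25.0

# To serve 2500 concurrent users at ~2 s per response, you'd need on the
# order of 1250 req/s (or fewer truly simultaneous requests than 2500).
print(required_throughput(2500, 2.0))  # -> 1250.0
```

So the 20-second tail isn't mysterious: it follows directly from pushing 2,500 concurrent requests through a ~100 req/s pipe.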
I'd like to keep playing with all the settings I can tune in nginx, Django, MySQL, or memcached, but at this point I think I need a methodology, and more than a methodology, a goal to reach.
Searching the web, I see blog posts about services that handle several hundred requests per second. I'm far from that.
All those numbers coming out of ApacheBench give me the impression that I'm launching tests and seeing results, but that I don't really understand them and don't know what to do with them to improve our API.
So what would be a good methodology, a good approach, to get a web API able to cope with this number of connections as fast as possible?
If you need more details just ask!
I have never worked with a Django setup, so I may not be able to get into Django specifics. It would be great if you could provide the CPU, IO, and memory stats for the point where you hit 100 requests per second. The 20-second delays could have varied causes depending on the nature of your resource crunch, and you won't be able to make sense of the performance statistics without knowing the health of your system under stress. A good place to start would be Amazon CloudWatch metrics, and/or enabling monitoring with Munin, Nagios, or similar, together with an appropriate graphing tool such as Graphite or Ganglia. Even tracking `vmstat` output could reveal a lot.

The key to identifying your problem is to gather enough data about your system's health and follow it. You could simply graph your traffic trend in Graphite along with other stats such as CPU usage, IO waits, context switches, number of interrupts, and available memory, and try to correlate that data. You could even split your request cycle into database, middleware, and render phases and track the time spent in each phase.
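To illustrate the phase-splitting idea, here is a minimal sketch of a timing middleware following Django's middleware protocol (a callable wrapping `get_response`), written as plain Python so it runs standalone; in a real deployment you would ship the timings to Graphite or statsd rather than keep them in a list:

```python
import time

class RequestTimingMiddleware:
    """Django-style middleware that records total wall time per request.

    This matches Django's middleware protocol (__init__ takes get_response,
    __call__ takes the request), but is shown with plain objects so the
    timing logic itself is self-contained and testable.
    """

    def __init__(self, get_response):
        self.get_response = get_response
        self.timings = []  # stand-in for a Graphite/statsd client

    def __call__(self, request):
        start = time.monotonic()
        response = self.get_response(request)  # rest of the stack runs here
        self.timings.append(time.monotonic() - start)
        return response

# Usage sketch with a fake view standing in for the rest of the stack:
def slow_view(request):
    time.sleep(0.05)  # pretend the DB + render phases took 50 ms
    return "response for %s" % request

middleware = RequestTimingMiddleware(slow_view)
middleware("/api/items/")
print("request took %.3f s" % middleware.timings[0])
```

The same wrapping trick, applied separately around your database calls and template rendering, gives you the per-phase breakdown.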
One thing to watch for is a high number of context switches in the `vmstat` output. This may happen if you run more worker processes than you have available cores. The server may also be waiting on block IO; some have experienced block IO latency on EBS, although I haven't had problems with it myself.

Hope this helps.
First you need to establish what the bottleneck is for this web service. It's probably slow DB queries and/or poor Django performance. Note that most frameworks for rapid web application development, Django included, are not really optimized for speed. Unless you can afford many servers and load balancing, you can't expect great performance.
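One quick way to check whether the database is the culprit: with `DEBUG = True`, Django records every query on `django.db.connection.queries` as a list of dicts with `'sql'` and `'time'` keys. A sketch of summarizing that list, shown here with hand-made sample data in that shape rather than a live connection:

```python
def summarize_queries(queries):
    """Summarize a django.db.connection.queries-style list of
    {'sql': ..., 'time': ...} dicts: count, total time, slowest query."""
    total = sum(float(q["time"]) for q in queries)
    slowest = max(queries, key=lambda q: float(q["time"])) if queries else None
    return {"count": len(queries), "total_time": total, "slowest": slowest}

# Sample data in the shape Django produces when DEBUG = True:
sample = [
    {"sql": "SELECT * FROM auth_user WHERE id = 1", "time": "0.002"},
    {"sql": "SELECT * FROM api_item WHERE owner_id = 1", "time": "0.850"},
]
stats = summarize_queries(sample)
print("%d queries, %.3f s total" % (stats["count"], stats["total_time"]))
print("slowest:", stats["slowest"]["sql"])
```

If one request fires dozens of queries, or one query dominates the total, that query is the one worth EXPLAINing, indexing, or caching before touching any nginx settings.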
Anyway... for starters I'd try to: