My company is developing a web-based data viewer application that needs a fair amount of bandwidth to function well. However, we have recently been changing a lot of things. For example, we changed our internal network infrastructure so that data can be hosted on separate machines connected by Gigabit Ethernet. On top of that, the application itself keeps getting new versions, since we are still in alpha and beta testing.
Recently we made some changes that are causing poorer performance, and we want to try to identify where the problem is before we start tearing things apart. It is a very small network, and I have limited experience as an IT admin. I have a few ideas for where to start, but I would like to harvest a little wisdom from the pros first: How do you tackle/avoid similar problems? What are the most useful (Windows) tools you have used?
I always follow this approach: Try to test one thing at a time.
The trusty scientific method works really well for troubleshooting:
For a webapp this might mean:
Also, running basic benchmarks to test CPU, memory, and disk speed can help rule one of those things out before you go any further.
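As one illustration of a quick disk benchmark, here is a minimal Python sketch that times a sequential write and read of a temp file. It is only a rough indicator (OS caching and other activity skew the numbers), but it is often enough to spot a server whose disks are dramatically slower than expected:

```python
import os
import tempfile
import time

def disk_throughput_mb_s(size_mb=64, chunk_mb=4):
    """Rough sequential write/read throughput in MB/s using a temp file.
    Results are indicative only; OS page caching inflates the read number."""
    chunk = os.urandom(chunk_mb * 1024 * 1024)
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # Sequential write, fsync'd so the data actually reaches the disk
        start = time.perf_counter()
        with open(path, "wb") as f:
            for _ in range(size_mb // chunk_mb):
                f.write(chunk)
            f.flush()
            os.fsync(f.fileno())
        write_s = time.perf_counter() - start

        # Sequential read back
        start = time.perf_counter()
        with open(path, "rb") as f:
            while f.read(chunk_mb * 1024 * 1024):
                pass
        read_s = time.perf_counter() - start
    finally:
        os.remove(path)
    return size_mb / write_s, size_mb / read_s

if __name__ == "__main__":
    w, r = disk_throughput_mb_s()
    print(f"write: {w:.1f} MB/s, read: {r:.1f} MB/s")
```

Run it on the old and new servers and compare; you care about the ratio between machines more than the absolute numbers.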
I see things like this all the time:
But no one did a basic disk benchmark to find out that the older server had twice as many spindles as the new server does... or a network benchmark to find out that the new server's Gigabit Ethernet had only negotiated at 100 Mbit/s.
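A crude network benchmark along those lines can be done with a plain TCP stream: run a receiver on one host and a sender on the other, and measure MB/s. The sketch below runs both ends on loopback purely to show the mechanics (split it across two machines to test a real link); as a rule of thumb, a healthy Gigabit link sustains roughly 100-110 MB/s, so seeing ~11 MB/s suggests the link negotiated at 100 Mbit/s:

```python
import socket
import threading
import time

def loopback_throughput_mb_s(total_mb=32, chunk_kb=64):
    """Stream bytes over a TCP socket and return the send rate in MB/s.
    Both ends run on loopback here for illustration; for a real test,
    run the receiver on the remote host instead."""
    chunk = b"\x00" * (chunk_kb * 1024)
    total = total_mb * 1024 * 1024

    server = socket.socket()
    server.bind(("127.0.0.1", 0))  # port 0 = let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    def receiver():
        conn, _ = server.accept()
        while conn.recv(65536):  # drain until the sender closes
            pass
        conn.close()

    t = threading.Thread(target=receiver)
    t.start()

    sender = socket.create_connection(("127.0.0.1", port))
    start = time.perf_counter()
    sent = 0
    while sent < total:
        sender.sendall(chunk)
        sent += len(chunk)
    sender.close()
    t.join()
    server.close()
    return total_mb / (time.perf_counter() - start)

if __name__ == "__main__":
    print(f"{loopback_throughput_mb_s():.0f} MB/s")
```

Established tools like iperf do this better, but a twenty-line script you can run anywhere is often enough to catch a link running an order of magnitude slow.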
All that said, if this is a custom web application, the framework you are using almost certainly has a way to dump performance information to a log file, but that is more of a question for Stack Overflow.
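As one illustration of that kind of logging (the details depend entirely on your framework), a minimal request-timing middleware for a Python WSGI app might look like this:

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("perf")

def timing_middleware(app):
    """Wrap a WSGI app and log how long each request takes."""
    def wrapper(environ, start_response):
        start = time.perf_counter()
        try:
            return app(environ, start_response)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            log.info("%s %s took %.1f ms",
                     environ.get("REQUEST_METHOD"),
                     environ.get("PATH_INFO"), elapsed_ms)
    return wrapper

# Toy app, just for demonstration
def app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]

app = timing_middleware(app)
```

A log of per-request timings quickly shows whether the slowdown is in one endpoint or across the board, which tells you whether to suspect the code or the infrastructure.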
I subscribe to the "Sherlock Holmes" method of troubleshooting, a.k.a. the binary search troubleshooting method:
In my experience, you sometimes get lucky by trying some obvious things first, but once you exhaust the truly quick fixes, you need to get methodical quickly.
This method is compatible with Scientific Method and Test One Thing At A Time.
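The binary-search idea can be sketched in code. Here `is_bad(i)` is a hypothetical stand-in for "deploy the app as of change i and re-run your performance test"; the method assumes the property is monotonic (once a change makes things bad, every later state is also bad), which is the same assumption git bisect makes:

```python
def first_bad(changes, is_bad):
    """Binary-search an ordered list of changes (oldest -> newest) for
    the first one that introduced the problem. Assumes the oldest state
    is good, the newest is bad, and is_bad() is monotonic."""
    lo, hi = 0, len(changes) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(mid):
            hi = mid        # problem was introduced at mid or earlier
        else:
            lo = mid + 1    # problem was introduced after mid
    return changes[lo]

# Toy example: pretend everything from revision r103 onward is slow
changes = ["r100", "r101", "r102", "r103", "r104"]
print(first_bad(changes, lambda i: i >= 3))  # -> r103
```

With n changes you only need about log2(n) test deployments to pin down the culprit, which is why getting methodical beats guessing once the quick fixes are exhausted.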
The sum of the answers above is 90% of what I would say; here's the other 10%:
Some of the best tools to be found for Windows troubleshooting are from Microsoft's Sysinternals. And some of the best info on how to use them (and Windows technical info in general) can be found on Mark Russinovich's blog and webcasts. His book on Windows Internals is also full of good information.
With the above, I would suggest starting with Process Explorer and Process Monitor to take a look at whatever web service you have running and see what's going on. Both programs can display a large amount of info about running processes, configurable by right-clicking the column headings.
What was changed that introduced the performance problem? If only the code was changed, then I'd start my troubleshooting there.
Compare the Problem State to a Known Good State and look for the discrepancies.
A Known Good State can be an actual documented state. It can also be based on a standard of expected behavior, such as known expected behavior of networking protocols or such as rules of thumb about appropriate average CPU usage.
Examples:
Using Wireshark or another network sniffer, you repeatedly see duplicate packets. Now you can dig in and try to figure out why you are seeing the same IP packet on the wire twice. Perhaps you have a "local router" scenario, or perhaps something is fragmenting IP packets.
Average CPU usage is at 90%. An average that high means the server is probably pegging the CPU frequently, causing everything to back up.
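The known-good-state comparison above can be sketched as a simple diff of current metrics against a documented baseline. All of the metric names and thresholds below are illustrative, not standards:

```python
# Hypothetical baseline captured while the system was known to be healthy
KNOWN_GOOD = {
    "avg_cpu_pct": 35,
    "disk_read_mb_s": 120,
    "retransmits_per_min": 2,
}

def find_discrepancies(current, baseline=KNOWN_GOOD, tolerance=0.25):
    """Flag any metric that deviates from the known-good state by more
    than `tolerance` (25% by default). Returns {name: (good, now)}."""
    flagged = {}
    for name, good in baseline.items():
        now = current.get(name)
        if now is None:
            continue  # metric not collected this time; skip it
        if abs(now - good) > tolerance * good:
            flagged[name] = (good, now)
    return flagged

problem_state = {"avg_cpu_pct": 90, "disk_read_mb_s": 118,
                 "retransmits_per_min": 40}
print(find_discrepancies(problem_state))
# CPU and retransmits stand out; disk reads are within tolerance
```

Even a baseline this crude narrows the search: the metrics that moved tell you which subsystem to investigate first.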
At the recommendation of John T, I have been enjoying using dstat with gnuplot.