My employer has over a thousand servers (running SQL Server 2005 x64 and a couple of other apps) all across the country, and in my opinion they are all massively underpowered for what they need to do.
Specifically, I feel that the servers simply do not have enough RAM for the volume of work they are asked to handle. All the servers currently have 6GB of RAM. The users are pretty much always complaining about performance (mostly because, IMO, the server dips into the paging file quite often).
I finally convinced the powers that be to at least try out a memory upgrade on one box and see the results. However, they want before and after metrics, so that they can see that the expense will be justified.
My question is: what metrics should I collect to see whether the performance truly improves on the box? I am a dev, so I am not sure how and what to collect (I have a passing knowledge of Perfmon).
EDIT: I guess I am looking for specific counters to test for.
I would recommend that you do load tests on the box before and after the memory upgrade, via the application. Simulate the load that causes the degradation in performance from the user's perspective, and then show the improvements after the memory upgrade (something like JMeter could do this for a web app). If you can't do it with load testing of the application, maybe you can simulate the queries, as in the sketch below.
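If you do end up simulating queries directly, even a crude T-SQL loop run from a few Management Studio connections can stand in for a load generator. A minimal sketch (dbo.usp_RepresentativeQuery and its parameter are hypothetical; substitute something your users actually run):

    -- Crude load generator: replay a representative call repeatedly.
    -- Run this from several connections at once for concurrent load.
    -- The procedure name and parameter below are hypothetical.
    DECLARE @i int;
    SET @i = 0;
    WHILE @i < 1000
    BEGIN
        EXEC dbo.usp_RepresentativeQuery @CustomerId = 42;
        SET @i = @i + 1;
    END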
Then, while doing this, you can also run the counters Farseeker recommends. The reason I think you should do it via the front end is that these are business people, and they probably won't get the whole pagefile explanation or query times etc. They should, however, understand application response time, since that is what everyone is looking to improve.
However, if the tests cost more than the memory itself (making the test plans, setting up servers to generate the load, etc), maybe you should just ask them to trust your judgment, or do the best tests you can.
Checking whether you need a memory upgrade is usually pretty simple. Some Perfmon counters will tell you how often the OS is dipping into the page file, as well as memory utilisation, page rates, etc. Also, as it's SQL Server, you can use Profiler to see how many disk reads are being done for certain queries. If the memory utilisation is anything under 90%, then the SQL Server is not configured optimally. Don't use Task Manager for this, as its "free" memory column includes the amount allocated to prefetch.

You need to be able to convince them (and yourself) that this is necessary through these metrics before you even bother doing before/after tests. The before/after tests usually just back up your original proof, and if your metrics don't suggest you need more RAM, then this saves the egg on your face.
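For the specific counters the question asks about, note that SQL Server also exposes its own Perfmon counters through a system view, so you can sample the memory-related ones with plain T-SQL. A sketch (assumes SQL Server 2005 or later, where sys.dm_os_performance_counters exists):

    -- Sample SQL Server's buffer-memory counters before and after the upgrade.
    -- 'Buffer cache hit ratio' must be divided by its 'base' counter to get
    -- an actual percentage; 'Page life expectancy' is in seconds, and low
    -- values mean pages are being evicted from RAM quickly.
    SELECT [object_name], counter_name, cntr_value
    FROM sys.dm_os_performance_counters
    WHERE [object_name] LIKE '%Buffer Manager%'
      AND counter_name IN ('Page life expectancy',
                           'Buffer cache hit ratio',
                           'Buffer cache hit ratio base');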
However, for before/after tests, I would take a commonly-used query (not too simple; something real-life), throw it into SQL Server Management Studio, turn on the execution plan (so you can make sure it's running the same plan each time, and thus you're getting valid results), and time how long it takes.
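SET STATISTICS TIME/IO gives you repeatable numbers for that without a stopwatch. A sketch (the query shown is hypothetical; substitute your real-life one):

    -- Optionally clear the buffer cache first to measure cold-cache timings
    -- (test boxes only, never production):
    -- DBCC DROPCLEANBUFFERS;

    SET STATISTICS TIME ON;  -- reports CPU and elapsed time per statement
    SET STATISTICS IO ON;    -- reports logical and physical reads per table

    -- Hypothetical representative query; substitute a real one.
    SELECT o.OrderId, o.OrderDate, c.CustomerName
    FROM dbo.Orders o
    JOIN dbo.Customers c ON c.CustomerId = o.CustomerId
    WHERE o.OrderDate >= '20090101';

    SET STATISTICS TIME OFF;
    SET STATISTICS IO OFF;

If the extra RAM is doing its job, physical reads should drop towards zero on repeat runs while logical reads stay the same.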
It may also be worth gathering some Perfmon stats, e.g. \Memory\Pages/sec, \Paging File(_Total)\% Usage, \PhysicalDisk(_Total)\Avg. Disk Queue Length, and \SQLServer:Buffer Manager\Page life expectancy.
If you do go ahead with the upgrade, remember that not all RAM is produced perfectly. Even ECC RAM can be produced with defects, although serious defects will be rarer. If possible, do an initial verification of the memory using something like Memtest86+ before installing it in the servers. Even better if you can run the same test after installation, but that means more downtime.
Your client won't be happy if your "upgrade" leads to unstable servers.
Performance monitor counters are all well and good, but they don't always tell the whole story. I think you also need to measure this in terms of changes to user perception of app performance.
Do you have an SLA that defines acceptable performance for this app in certain tasks/scenarios (and if not, why not)?
Either you will see a real-terms improvement in the app's responsiveness, quantifiable as a drop in performance complaints and/or the app doing a better job of meeting its SLA requirements, or you won't.
Have the services been tuned correctly for the system? Could it be that the SQL Server process is using a lot of memory, as it likes to do, and hasn't had a limit defined on what it can use, and this is impacting performance for other components outside of the SQL-based part of the app?
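If that turns out to be the issue, checking and capping it is a quick sp_configure change (the 5120 MB figure below is purely illustrative, not a recommendation):

    -- Show advanced options so 'max server memory (MB)' is visible.
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'max server memory (MB)';   -- show the current setting
    -- Example cap; leave headroom for the OS and anything else on the box.
    EXEC sp_configure 'max server memory (MB)', 5120;
    RECONFIGURE;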
Have you established that this isn't a disk or network bottleneck?
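One quick way to check that from T-SQL is the aggregate wait statistics: a lot of PAGEIOLATCH_* waits points at disk I/O (often itself a symptom of too little buffer cache), while ASYNC_NETWORK_IO points at the network or a slow client. A sketch:

    -- Top waits since the last service restart (DMV available from SQL Server 2005).
    SELECT TOP 10 wait_type, waiting_tasks_count, wait_time_ms
    FROM sys.dm_os_wait_stats
    ORDER BY wait_time_ms DESC;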