We use WhatsUp Gold to monitor all of our web servers. On our Linux servers (and, to much the same degree, our FreeBSD servers) I'm having a bit of an issue with the memory monitors. We're using SNMP with WUG to grab the data from the servers. The memory counter that the SNMP daemon returns is the combined value (used + cached + buffers). Right now one of my servers looks like this:
[admin@stgwww snmp]$ free -m
             total       used       free     shared    buffers     cached
Mem:          7872       1656       6216          0        143       1107
-/+ buffers/cache:        404       7467
Swap:         4867          0       4867
The value being returned via SNMP to WUG is 1656. From what I understand, cached RAM is essentially free RAM with the added benefit of hanging on to data that previously occupied it in case it's needed again. So for our purpose of knowing how much RAM is actually being actively used, the value we're getting back is misleading. If we go off what's being graphed by WUG, we're led to believe that more RAM is in use, and less is available, than there really is.
So what's the best way to go about monitoring this? WUG allows me to write SSH scripts, which can SSH into the server every 5 minutes or so, execute a script, and return the value (as long as it's a single numeric value). With this I've written a script that pulls the "404" number from the example above and divides it by the total amount, giving me a percent-used value which I return to WUG and graph on a chart that scales from 0 to 100. But this seems like way too much of a hack.
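That kind of script can be sketched as follows (a minimal sketch, not the author's actual script; the `mem_pct` name is made up, and the awk fields assume the older procps `free -m` layout shown above):

```shell
# mem_pct: read `free -m` output on stdin and print the percentage of
# RAM actually in use -- the "used" column of the "-/+ buffers/cache"
# line divided by the "Mem:" total column.
mem_pct() {
  awk '/^Mem:/           {total = $2}
       /buffers\/cache/  {used  = $3}
       END               {print int(used * 100 / total)}'
}

# With the numbers from the question (404 MB used of 7872 MB total):
printf 'Mem: 7872 1656 6216 0 143 1107\n-/+ buffers/cache: 404 7467\n' | mem_pct
# prints 5
```

On a live server you would pipe `free -m` into the function instead of the printf stub.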
Am I better off monitoring the free+buffers+cached value? Is there a better way to do this in WUG? Thoughts?
Go and take a look at linuxatemyram.com. WUG is telling you what Linux thinks is used (used+buffers+cache). What you have decided to monitor (used/total) seems reasonable to me especially for a graph as it requires no knowledge of the system specifics.
Free RAM is free RAM, and buffers/cache are RAM that can be reclaimed. Most of the monitoring tools I've used present this difference in a cumulative (stacked) area graph, with at least used, cached, and inactive memory stacked under the 100% level and swap above them. The only way to really know how a server is performing is to view all of them.
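If you do want the "really available" figure (free + buffers + cached) as a single number, it can be summed straight out of /proc/meminfo — a sketch, with a made-up helper name:

```shell
# available_mb: read /proc/meminfo-style "Key:  value kB" lines on
# stdin and print free + buffers + cached in MB -- i.e. memory that
# is either free or reclaimable.
available_mb() {
  awk '/^(MemFree|Buffers|Cached):/ {kb += $2}
       END                          {print int(kb / 1024)}'
}

# Using the question's numbers converted to kB (6216, 143 and 1107 MB):
printf 'MemFree: 6365184 kB\nBuffers: 146432 kB\nCached: 1133568 kB\n' | available_mb
# prints 7466 (matching the ~7467 "free" figure on the -/+ line)
```

On the server itself you would feed it the real file: `available_mb < /proc/meminfo`.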
If you can only graph one value, I'd recommend graphing the used memory and considering the rest 'free'. Oh, and I'd also recommend switching monitoring tools; even Munin with the default config has a decent memory graph.
I recommend ganglia: http://ganglia.sourceforge.net/
It does memory monitoring and divides it into its constituent parts. There is almost zero configuration: you install a daemon on each Linux box and then designate one central box to record the RRDs.
Here's an example memory graph:
For those (Renan) interested in the solution I came up with:
I've been using a custom bash script to retrieve the memory usage (used / total) and convert it to a percentage.
I then use a custom SNMP counter to execute that script and return the value. In the snmpd.conf file, it looks like this:
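The actual entry didn't survive in the post, but a net-snmp `exec` line generally takes this shape (the monitor name and script path here are my own placeholders, not the author's):

```
# snmpd.conf: run a script and expose its name, exit status and output
# under the extensible-commands (ucdavis exec) subtree.
exec memPercent /usr/local/bin/mem_percent.sh
```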
Each exec script returns a few OIDs with things like the name of the script, the exit status, the return value, and so on. The unfortunate part is that the return value is a string, not an integer, so WUG has some problems graphing it (it still graphs it, but the real-time graphs won't work). Since in this case we know the value will always be under 100, I return it as the exit status and poll that OID instead.
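The exit-status trick can be sketched like this: the percentage is always 0–100, safely inside the 0–255 exit-code range, so awk's `exit N` can carry it out of the script (a sketch; on the server the printf stub would be `free -m`):

```shell
# Exit with the used-RAM percentage: awk's `exit N` sets the process
# exit status, which snmpd then exposes via the exec script's
# exit-status OID as a pollable integer.
printf 'Mem: 7872 1656 6216 0 143 1107\n-/+ buffers/cache: 404 7467\n' \
  | awk '/^Mem:/ {t = $2} /buffers\/cache/ {u = $3} END {exit int(u * 100 / t)}'
status=$?
echo "$status"
# prints 5
```

One caveat of this approach: a genuine script failure and a 0% reading are indistinguishable, since both produce exit status 0.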
To monitor it in WUG, create a custom SNMP performance monitor and monitor the OID of that exec script's exit status. You can then create custom alerts and so on.
We've been using it for a while now and it works great. Hope that helps!