Is there a way to log the amount of memory used by each process in Linux? One of our servers has a memory usage spike that brings down the machine from time to time, and logging would really help us find the root cause.
Here follow a couple of suggestions on how to diagnose a memory spike that causes thrashing on a server, using a tool I wrote for Linux process analysis, Procpath.
The most obvious ad-hoc diagnostic is to record the metrics of all processes and, when the issue occurs again, take a look at the recordings. This can be done, for example, in a terminal multiplexer like Byobu (the least-effort way to keep it running in the background).
procpath record -i 60 -d all.sqlite
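If a terminal multiplexer isn't an option, one alternative is to background the recorder with nohup; a minimal sketch (the discarded output redirect is an arbitrary choice):
nohup procpath record -i 60 -d all.sqlite > /dev/null 2>&1 &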
This will produce an SQLite database, recording Procfs metrics for each process every minute. Post-mortem you can query the database, say for the PIDs that consumed more than 10% of available main memory (RSS is measured in pages).
SELECT DISTINCT stat_pid
FROM record
WHERE 1.0 * stat_rss / (SELECT value FROM meta WHERE key = 'physical_pages') > 0.1
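For instance, the query can be run with the sqlite3 command-line shell, assuming it is installed:
sqlite3 all.sqlite "SELECT DISTINCT stat_pid FROM record WHERE 1.0 * stat_rss / (SELECT value FROM meta WHERE key = 'physical_pages') > 0.1"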
And plot a PID's RSS growth chart.
procpath plot -d all.sqlite -q rss -p $PID
Alternatively, if you want to take metrics more often (say because the spike is quick) and to avoid ending up with a big SQLite database, you can do one of the following:
record only relevant process subtree(s) (e.g. $ROOT_PID and its descendants):
procpath record -i 10 -d tree.sqlite "$..children[?(@.stat.pid == $ROOT_PID)]"
You may be able to ask your process supervisor (systemd, supervisord, etc.) for the PID.
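For example, with systemd the main PID of a unit can be looked up like this (myapp.service is a placeholder for your unit's name):
ROOT_PID=$(systemctl show --property MainPID --value myapp.service)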
record only processes that consume more than X MiB (say 256 MiB, which typically equals 65536 memory pages with the usual 4 KiB page size):
procpath record -i 10 -d rss.sqlite '$..children[?(@.stat.rss > 65536 and @.pop("children", 1))]'
@.pop("children", 1) is there to get rid of descendants of the matched
process unless they themselves match.
If either produces a few PIDs in the database, you may be able to analyse them visually with procpath plot right away.
If your needs aren't very complex, perhaps you could make do with ps and a while loop. This logs the top ten processes (by memory use) every minute.
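A minimal sketch of such a loop, assuming a log file at /var/log/memory.log (both the path and the 60-second interval are arbitrary choices):
while true; do
    date >> /var/log/memory.log
    ps aux --sort=-rss | head -n 11 >> /var/log/memory.log
    sleep 60
done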
The cleanest solution would be to put in a cron job which does something similar to the sketch below. This would log ps output through your logging daemon to the local0 facility (which can be changed to whatever you want).
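One possible crontab entry, assuming GNU ps and the standard logger utility (the memlog tag is an arbitrary choice):
* * * * * ps aux --sort=-rss | head -n 11 | logger -t memlog -p local0.info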