Ubuntu 18.04 LTS. I'm trying to figure out the best way to diagnose what is happening when an Amazon EC2 instance (free tier) hangs.
There are experimental services running, and there may be a memory leak.
For quality of life I'm using a utility called lnav to browse the system logs, and I've installed monitorix to visualise what is happening.
Can I identify the specific process causing the problem from the system logs, and if so, how? Which log might help me? (/var/log/syslog does not.)
The monitorix charts show high CPU load as system swap space is consumed, until catastrophic failure occurs.
But they do not point to a specific process. How can I find it from the terminal?
Is there some other process monitoring I could configure?
Any help appreciated...
Edit: thanks to the hint from @Rinzwind, sar
is now installed and cron runs it every 2 minutes, but it doesn't give process-level information. So, with help from this other answer:
pidstat 5 > pidhist.log
pipes the output to a text file, and running it in a persistent session will aid diagnosis when the event happens again.
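A minimal sketch of that setup, assuming pidstat from the sysstat package and a writable pidhist.log in the current directory; the flags, interval, and sample count are illustrative:

```shell
# Sample per-process memory (-r) and disk I/O (-d) every 5 seconds,
# appending to a log that survives the SSH session. 720 samples ~ 1 hour;
# drop the count to run indefinitely. Guarded so it is a no-op on
# machines where sysstat is not installed.
if command -v pidstat >/dev/null 2>&1; then
    nohup pidstat -r -d 5 720 >> pidhist.log 2>&1 &
    MSG="pidstat logging to pidhist.log (pid $!)"
else
    MSG="pidstat not found; install it with: sudo apt install sysstat"
fi
echo "$MSG"
```

A screen or tmux session works just as well as nohup; the point is that the sampler keeps running after you disconnect, so the log already covers the window when the next hang occurs.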
@heynnema suggested iotop. Running
iotop -P -a
which is top for file I/O, with -a accumulating totals, indicated that the experimental process (a mono service) was the one consuming the most swap (see the SWAPIN column).
In monitorix we can see the same pattern of consumption, then a return to normal (~20%) after restarting the process.
The system is stable for weeks on end between these random events. The evidence from iotop
shows the underlying issue is within the experimental process!
Yet this is still a run-time diagnosis. Is there a way to determine from existing logs which process was at fault after the fact, without preemptive monitoring and logging?
That proof of what went wrong is the critical issue to be resolved. How can we get it without waiting for the event to recur, if no logging was enabled? Kernel logs?
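On the after-the-fact question specifically: if the kernel's OOM killer fired, it logged which process it killed, and those lines survive in /var/log/kern.log (or the kernel journal) with no setup in advance. A sketch of what to grep for, demonstrated against a fabricated sample log excerpt (the host name, PID, process, and sizes below are invented):

```shell
# On a real Ubuntu 18.04 host you would search the existing logs directly:
#   grep -i "out of memory\|oom" /var/log/kern.log /var/log/kern.log.1
#   journalctl -k | grep -i "oom"
#   dmesg -T | grep -i "killed process"   # only if the box has not rebooted
#
# Demo against a made-up kern.log excerpt:
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
Jan 10 03:12:45 host kernel: Out of memory: Kill process 1234 (mono) score 912 or sacrifice child
Jan 10 03:12:45 host kernel: Killed process 1234 (mono) total-vm:3145728kB, anon-rss:2097152kB
EOF
MATCHES=$(grep -ci "out of memory\|killed process" "$LOG")
grep -i "killed process" "$LOG"    # this line names the killed process
```

The caveat: the OOM killer only fires on a true out-of-memory condition. If the instance merely thrashes swap until it becomes unreachable, there may be no such line at all, which is why preemptive sampling with pidstat/sar remains the reliable route.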
Thanks for any help.
From the comments...
We looked at the output of free -h, sysctl vm.swappiness, and cat /etc/fstab, and installed iotop to determine why swap is used so much. There are a few reasons why the system may be thrashing:
you don't have enough RAM
you don't have enough swap
vm.swappiness has been modified incorrectly
The fix...
add more RAM
increase /swapfile space
set vm.swappiness to 60-90 (60 is the default)
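A sketch of the last two fixes on an 18.04 box, assuming the swap lives in /swapfile as listed in /etc/fstab; the 2G target size is an assumption, so size it to your workload:

```shell
# Grow the swap file (requires root; the instance briefly runs with no swap).
sudo swapoff /swapfile
sudo fallocate -l 2G /swapfile        # assumed target size
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Check and, if it was modified, restore swappiness (Ubuntu default is 60).
cat /proc/sys/vm/swappiness
sudo sysctl vm.swappiness=60          # temporary, until reboot
# For a persistent change, set "vm.swappiness=60" in /etc/sysctl.conf.
```

Afterwards, free -h should show the new swap total.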
We don't add RAM to solve this issue.
Identifying the process causing a memory leak has nothing to do with the system configuration.
iotop -P -a
helped identify the process consuming swap during a recurrence of the event. Steps for a digital-forensic log investigation would be a better solution.