Ubuntu 18.04 LTS: I'm trying to figure out the best way to diagnose what's happening when my Amazon EC2 instance (free tier) hangs.
There are experimental services running, and there may be a memory leak.
For quality of life I'm using a utility called lnav to browse the system logs, and I've installed a utility called monitorix to visualise what's happening.
How do I identify the specific process causing the problem from the system logs? Which log might help me? (/var/log/syslog does not.)
These charts show high CPU load as system swap space is consumed, until catastrophic failure occurs.
But they do not identify the specific process. How can I do this through the terminal?
Is there some other process monitoring I could configure?
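One low-cost option worth considering is a cron job that periodically snapshots the top memory consumers, so there is a process-level record to consult after the next event. A minimal sketch (the script path and log path are just examples, not anything already on your system):

```shell
#!/bin/sh
# memhist.sh - append a timestamped snapshot of the top 10 processes
# by resident memory (RSS) to a log file.
LOG=/var/log/memhist.log
{
  date '+%Y-%m-%d %H:%M:%S'
  # Header line plus the 10 largest processes by RSS
  ps -eo pid,comm,rss,%mem --sort=-rss | head -n 11
  echo
} >> "$LOG"
```

Run it from cron, e.g. `* * * * * /usr/local/bin/memhist.sh`, and the last snapshots before a hang will show which process was growing.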
Any help appreciated.
Edit: Thanks to the hint from @Rinzwind, sar
is now installed and cron runs it every 2 minutes, but it doesn't give process-level info. So, with help from this other answer:
pidstat 5 > pidhist.log
pipes out to a text file, and running it in a persistent session will aid diagnosis when the event happens again.
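Rather than keeping an interactive session open, the same thing can be done with nohup so the collection survives logout. A sketch (the log path is an example; -u adds CPU stats and -r adds memory stats, including major page faults, to each 5-second report):

```shell
# Collect per-process CPU (-u) and memory (-r) stats every 5 seconds,
# detached from the terminal, appended to a log file.
nohup pidstat -u -r 5 >> /var/log/pidhist.log 2>&1 &
```

pidstat timestamps each report itself, so the log can later be lined up against the monitorix charts.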
@heynnema suggested iotop.
Running iotop -P -a
, which is top
for file I/O with accumulated totals, indicated that the experimental process (a Mono service) was the one consuming the most swap, per the SWAPIN column.
We can see the same pattern of consumption in monitorix, and after restarting the process it returns to normal (~20%).
The system is stable for weeks on end between these random events. The evidence from iotop
points to the underlying issue being within the experimental process!
Yet this is still a run-time diagnosis. Is there a way to determine from existing logs, after the fact, which process was at fault, without pre-emptive monitoring and logging?
That proof of what went wrong is the critical issue to be resolved. How can we establish it without waiting for the event to recur, if no logging was enabled? Kernel logs?
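One case where the kernel logs do answer this after the fact: if the exhaustion ended with the kernel's OOM killer firing, the kernel log records which process was killed and a memory summary of every process at that moment, with no prior setup needed. A sketch of where to look on Ubuntu:

```shell
# Search the current and rotated kernel logs for OOM-killer activity.
grep -i -E 'out of memory|oom-killer|killed process' \
  /var/log/kern.log /var/log/kern.log.1 2>/dev/null

# Or via the kernel ring buffer (only covers the current boot):
dmesg -T | grep -i -E 'oom-killer|killed process'
```

The caveat: if the instance merely thrashed in swap and was rebooted before the OOM killer ever fired, the kernel logs may show nothing, which is why the sar/pidstat-style logging above is still worth keeping enabled.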
Thanks for any help.