As shown in the picture below, my Linux machine hung and I couldn't log in.
How do I identify the reason for the hang from the messages on the console?
I searched /var/log/messages for more information, but I got lost in there and couldn't find anything useful. I also don't know where to find core files.
What other files can I check for information in this situation?
First, try looking at your sar logs for resource usage around the time this error occurred:
CPU:
sar -u
The two main columns you will want to review are %iowait and %idle. High %iowait and low %idle are good indicators of a CPU bottleneck.

Memory:
sar -r
Check %memused, but more importantly check %commit.

Load:
sar -q
Check the load averages against the number of processors (cat /proc/cpuinfo | grep proc).

Secondly, and most importantly, this error occurred because there is a time limit of 120 seconds to flush outstanding data to the disk. Linux, by default, uses up to 40% of available memory for file system caching. The outstanding data is all data past this 40% mark. Once the cache moves past the 40% mark, it switches from writing asynchronously (a non-blocking background operation that lets the process continue) to writing synchronously (which blocks and makes the process wait until the I/O is committed to disk). If the I/O subsystem cannot keep up and fails to flush the data within 120 seconds, this error occurs.
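To see where a given machine currently stands, the thresholds and the amount of pending dirty data can be read from /proc. The following is only a rough sketch: dirty_limit_kb is a hypothetical helper name, and it estimates the thresholds from MemTotal, whereas the kernel works from the memory it considers available, so the real limits will be somewhat lower. If vm.dirty_bytes is set instead of the ratio, the ratio reads as 0.

```shell
#!/bin/sh
# Sketch: show the writeback thresholds and how much dirty data is pending.
# dirty_limit_kb is a hypothetical helper, not a standard tool.

# Print PERCENT of TOTAL_KB, i.e. an approximate dirty-page ceiling in kB.
# Usage: dirty_limit_kb TOTAL_KB PERCENT
dirty_limit_kb() {
    echo $(( $1 * $2 / 100 ))
}

# Only read live values on a Linux system that exposes them.
if [ -r /proc/meminfo ] && [ -r /proc/sys/vm/dirty_ratio ]; then
    ratio=$(cat /proc/sys/vm/dirty_ratio)
    bg_ratio=$(cat /proc/sys/vm/dirty_background_ratio)
    mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)

    echo "vm.dirty_ratio=$ratio vm.dirty_background_ratio=$bg_ratio"
    # Estimates based on MemTotal (an assumption; see the note above).
    echo "approx. blocking threshold:   $(dirty_limit_kb "$mem_kb" "$ratio") kB"
    echo "approx. background threshold: $(dirty_limit_kb "$mem_kb" "$bg_ratio") kB"

    # Dirty = waiting to be written, Writeback = being written right now.
    grep -E '^(Dirty|Writeback):' /proc/meminfo
fi
```

If the Dirty figure sits close to the blocking threshold while the disks are busy, that would be consistent with the explanation above.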
One popular solution is to force the system to flush sooner.
You can add the following to /etc/sysctl.conf:
vm.dirty_ratio=10
(absolute maximum amount of system memory, 10% in this case, that can be filled with dirty pages before flushing to disk)
vm.dirty_background_ratio=5
(percentage of system memory that can be filled with dirty pages before the kernel starts flushing them in the background)

I hope this helps you out!
You can view older sar entries using the files in the /var/log/sa directory. Use the same commands as normal, but add -f /var/log/sa/sa${day} (I suppose 22 in your case).
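As a small sketch of that, the three reports can be replayed for one day in a single call. The helper names sa_file and sar_day are made up for illustration, and the script assumes the data files are named saDD inside /var/log/sa; some distributions keep them elsewhere, so adjust SA_DIR if needed.

```shell
#!/bin/sh
# Sketch: replay the CPU, memory and load reports for one day of the month.
# sa_file and sar_day are hypothetical helpers; SA_DIR is an assumption.
SA_DIR=${SA_DIR:-/var/log/sa}

# Build the data file path for a day of the month (22 -> $SA_DIR/sa22).
sa_file() {
    day=${1#0}    # drop one leading zero so "08" is not parsed as octal
    printf '%s/sa%02d\n' "$SA_DIR" "$day"
}

# Run sar -u, -r and -q against that day's file, if it is readable.
sar_day() {
    file=$(sa_file "$1")
    if [ ! -r "$file" ]; then
        echo "no sar data file at $file" >&2
        return 1
    fi
    for report in -u -r -q; do
        sar "$report" -f "$file"
    done
}
```

Calling sar_day 22 would then print all three reports for the 22nd. sar also accepts -s and -e with a time (for example -s 10:00:00 -e 11:00:00) if you want to narrow the output to the window around the hang.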