It's happened again! I have 4 servers which are crashing periodically, and there is no information printed to the system logs or the serial console.
In addition, the Linux kdump service isn't writing core dumps to the default location of /var/crash.
- Can you help me figure out why?
- Does it matter if my root filesystem is an LVM volume?
Here is what I've tried.
My system is Scientific Linux 6.5 with the latest kernel.
    [root@host1 ~]# uname -r
    2.6.32-431.11.2.el6.x86_64
    [root@host1 ~]# cat /etc/issue
    Scientific Linux release 6.5 (Carbon)
The file /etc/kdump.conf is the vanilla file containing the default settings. Most lines are commented out; there are only two active lines, for path and core_collector.
    #net my.server.com:/export/tmp
    #net [email protected]
    path /var/crash
    core_collector makedumpfile -c --message-level 1 -d 31
    #core_collector scp
I ensure that the kdump service is running, and that kdump doesn't need to rebuild my initrd.
    [root@host1 ~]# chkconfig --list kdump
    kdump 0:off 1:off 2:off 3:on 4:on 5:on 6:off
    [root@host1 ~]# /etc/init.d/kdump restart
    Stopping kdump: [ OK ]
    Starting kdump: [ OK ]
    [root@host1 ~]#
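A running service is only part of the picture: the capture kernel also needs reserved memory and has to be loaded. As a rough sketch of that extra check (the helper names `has_crashkernel` and `crashkernel_value` are made up for illustration and are not part of kexec-tools), something like this can be sourced into a shell:

```shell
# Sketch: sanity-check the crashkernel= reservation on the kernel command line.
# has_crashkernel and crashkernel_value are hypothetical helper names.

# Succeeds if the given kernel command line contains a crashkernel= parameter.
has_crashkernel() {
    case " $1 " in
        *" crashkernel="*) return 0 ;;
        *) return 1 ;;
    esac
}

# Prints the value of crashkernel= from a command line string (nothing if absent).
crashkernel_value() {
    for word in $1; do
        case "$word" in
            crashkernel=*) printf '%s\n' "${word#crashkernel=}" ;;
        esac
    done
}

# Usage on a live system (as root):
#   has_crashkernel "$(cat /proc/cmdline)" && echo "crashkernel= is set"
#   crashkernel_value "$(cat /proc/cmdline)"
#   cat /sys/kernel/kexec_crash_loaded    # 1 means a capture kernel is loaded
```

If `crashkernel=` is missing or `kexec_crash_loaded` reads 0, the capture kernel would not start at all, which is a different failure from the one shown further down, where the capture kernel boots but fails to save the dump.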
Then, I force a kernel crash using these commands borrowed from the RHEL6 Deployment Guide: Chapter 29. The kdump Crash Recovery Service:
Then type the following commands at a shell prompt:
    echo 1 > /proc/sys/kernel/sysrq
    echo c > /proc/sysrq-trigger
This will force the Linux kernel to crash
The system crashes. I can view the progress on my serial console. I see the message "Saving to the local filesystem UUID=e7abcdeb-1987-4c69-a867-fabdceffghi2", but immediately after that I see the strange message "Usage: fsck.ext4", which sort of looks like something is accidentally calling fsck instead of whatever it should be doing. I see no mention of an out-of-memory error or anything.
    host1.example.org login: SysRq : Trigger a crash
    BUG: unable to handle kernel NULL pointer dereference at (null)
    ...
    ... skipping 50 lines of output ...
    Creating block device ram8
    Creating block device ram9
    Creating Remain Block Devices
    Making device-mapper control node
    Scanning logical volumes
    Reading all physical volumes. This may take a while...
    No volume groups found
    No volume groups found
    Activating logical volumes
    No volume groups found
    No volume groups found
    Free memory/Total memory (free %): 58272 / 116616 ( 49.9691 )
    Saving to the local filesystem UUID=e7abcdeb-1987-4c69-a867-fabdceffghi2
    Usage: fsck.ext4 [-panyrcdfvtDFV] [-b superblock] [-B blocksize]
                    [-I inode_buffer_blocks] [-P process_inode_size]
                    [-l|-L bad_blocks_file] [-C fd] [-j external_journal]
                    [-E extended-options] device
    Emergency help:
     -p Autom
And then the system reboots (which is the default).
When the system comes back online, there is nothing in /var/crash. I assume that the crash dump was not written.
    [root@host1 ~]# ls -lA /var/crash/
    total 0
    [root@host1 ~]#
I know that crash dumps can work in general. If I tell kdump to copy the core dump to another system with the following configuration, kdump will successfully write the core dump to another host:
    path vmcore
    ssh [email protected]
    sshkey /root/.ssh/kdump_id_rsa
If I set "default shell" in /etc/kdump.conf and rebuild the initrd, and then crash the system again, I get a slightly more informative error: "mount: can't find /mnt in /etc/fstab".
    Free memory/Total memory (free %): 58272 / 116616 ( 49.9691 )
    Saving to the local filesystem UUID=e720481b-1987-4c69-a867-f2b4cba3b312
    Usage: fsck.ext4 [-panyrcdfvtDFV] [-b superblock] [-B blocksize]
                    [-I inode_buffer_blocks] [-P process_inode_size]
                    [-l|-L bad_blocks_file] [-C fd] [-j external_journal]
                    [-E extended-options] device
    Emergency help:
     -p                   Automatic repair (no questions)
     -n                   Make no changes to the filesystem
     -y                   Assume "yes" to all questions
     -c                   Check for bad blocks and add them to the badblock list
     -f                   Force checking even if filesystem is marked clean
     -v                   Be verbose
     -b superblock        Use alternative superblock
     -B blocksize         Force blocksize when looking for superblock
     -j external_journal  Set location of the external journal
     -l bad_blocks_file   Add to badblocks list
     -L bad_blocks_file   Set badblocks list
    mount: can't find /mnt in /etc/fstab
    dropping to initramfs shell
    exiting this shell will reboot your system
    /sys/block #
But now, I'm stuck.
A little late to the game, but if you need to configure kdump for the future:
I think the path directive designates a path relative to the partition or file system chosen as the dump target. By default this is the root FS. If you have a separate partition in /etc/fstab for /var, it will hide the crash directory when your system is booted normally, i.e. if you were to boot normally and run "umount /var", you would see the crash/[UniqCoreDir]. You can adjust this by adding an "ext4 /PATH/TO/DEVICE" directive in kdump.conf. Also, you could use a different path that won't be mounted over.
Just a guess, but you might have a number of vmcores buried under /var.
Pull apart your kdump initrd in /boot/ and check to see the final path that it's trying to dump to.
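A rough sketch of pulling the image apart, under some assumptions: that the kdump initrd is a gzip-compressed cpio archive, and that it is named /boot/initrd-<kernel version>kdump.img (check with `ls /boot/*kdump*` first). The function names here are made up for illustration.

```shell
# Sketch: unpack the kdump initrd into a scratch directory for inspection.
# Assumptions: gzip-compressed cpio image, named initrd-<kver>kdump.img.
# kdump_initrd_path and unpack_kdump_initrd are hypothetical helper names.

# Prints the assumed image path for a kernel version (e.g. from `uname -r`).
kdump_initrd_path() {
    printf '/boot/initrd-%skdump.img\n' "$1"
}

# Extracts the image into a directory.
# $1 = absolute path to the image, $2 = scratch directory to extract into
unpack_kdump_initrd() {
    mkdir -p "$2" || return 1
    ( cd "$2" && zcat "$1" | cpio -idm --quiet )
}

# Usage (as root):
#   img=$(kdump_initrd_path "$(uname -r)")
#   unpack_kdump_initrd "$img" /tmp/kdump-initrd
#   grep -rIn "Saving to" /tmp/kdump-initrd   # find the script that does the dump
```

Reading the script that prints "Saving to the local filesystem" should show which device it tries to fsck and mount, and which directory it then writes into.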
I think the "path" option is a little weird; I'd probably leave it at the default or set it explicitly to /var/crash.
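Since where the dump lands depends on which filesystem the path sits on, it may help to check which mount actually owns /var/crash on the running system. This is only a sketch: `mount_point_for` is a made-up helper that does a longest-prefix match over /proc/mounts-style input, and it does not handle mount points containing spaces.

```shell
# Sketch: report which mount point a path lives on.
# mount_point_for is a hypothetical helper name.
# $1 = absolute path, $2 = mounts file (defaults to /proc/mounts)
mount_point_for() {
    awk -v p="$1" '
        {
            m = $2
            # A mount point matches if it is "/", equals the path, or is a
            # directory prefix of the path; keep the longest match seen.
            if (m == "/" || p == m || index(p, m "/") == 1) {
                if (length(m) >= length(best)) best = m
            }
        }
        END { print best }
    ' "${2:-/proc/mounts}"
}

# Usage:
#   mount_point_for /var/crash
# If this prints /var rather than /, dumps written relative to the root FS
# would be hidden while /var is mounted. A plain bind mount of / does not
# carry submounts along, so it exposes the underlying directory:
#   mkdir -p /mnt/rootfs && mount --bind / /mnt/rootfs
#   ls -lA /mnt/rootfs/var/crash
#   umount /mnt/rootfs
```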
Do you have some kind of watchdog rebooting the machine? This may also prevent the core from being created, by rebooting the machine before the dump is started.