On a semi-regular basis, I've seen GCE instances freezing with the following error message (from the serial console):
g[1375589.784755] BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
g[1375589.786206] IP: [<ffffffff810a67d9>] check_preempt_wakeup+0xd9/0x1d0
g[1375589.787341] PGD 5da04067 PUD db83067 PMD 0
g[1375589.788607] Oops: 0000 [#1] SMP
g[1375589.788705] Modules linked in: veth xt_addrtype xt_conntrack iptable_filter ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 ip_tables x_tables nf_nat nf_conntrack bridge stp llc aufs(C) softdog crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 processor psmouse parport_pc parport i2c_piix4 i2c_core thermal_sys lrw virtio_net evdev pcspkr serio_raw gf128mul glue_helper ablk_helper cryptd button ext4 crc16 mbcache jbd2 sd_mod crc_t10dif crct10dif_common virtio_scsi scsi_mod virtio_pci virtio virtio_ring
g[1375589.788705] CPU: 1 PID: 1515 Comm: docker Tainted: G C 3.16.0-0.bpo.4-amd64 #1 Debian 3.16.7-ckt9-3~deb8u1~bpo70+1
g[1375589.788705] Hardware name: Google Google, BIOS Google 01/01/2011
g[1375589.788705] task: ffff88006fffc110 ti: ffff880003ac4000 task.ti: ffff880003ac4000
g[1375589.788705] RIP: 0010:[<ffffffff810a67d9>] [<ffffffff810a67d9>] check_preempt_wakeup+0xd9/0x1d0
g[1375589.788705] RSP: 0018:ffff880003ac7e30 EFLAGS: 00010002
g[1375589.788705] RAX: 0000000000000001 RBX: ffff880073112ec0 RCX: 0000000000000002
g[1375589.788705] RDX: 0000000000000001 RSI: ffff880009156d20 RDI: ffff880073112f38
g[1375589.788705] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
g[1375589.788705] R10: ffffffffffffffe0 R11: 0000000000000000 R12: ffff88006d2dcd00
g[1375589.788705] R13: ffff88006fffc110 R14: 0000000000000000 R15: 0000000000000000
g[1375589.788705] FS: 000000000323a880(0063) GS:ffff880073100000(0000) knlGS:0000000000000000
g[1375589.788705] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
g[1375589.788705] CR2: 0000000000000078 CR3: 0000000034bff000 CR4: 00000000000406e0
g[1375589.788705] Stack:
g[1375589.788705] 0000000000000000 ffffffff00000000 ffff88000000006e ffff880073112ec0
g[1375589.788705] ffff8800091573a4 0000000000000286 0000000000012ec0 ffff880073112ec0
g[1375589.788705] 0000000000000002 ffffffff8109cef4 ffff880009156d20 ffffffff810a01a4
g[1375589.788705] Call Trace:
g[1375589.788705] [<ffffffff8109cef4>] ? check_preempt_curr+0x84/0xa0
g[1375589.788705] [<ffffffff810a01a4>] ? wake_up_new_task+0xf4/0x1b0
g[1375589.788705] [<ffffffff8118516d>] ? mprotect_fixup+0x15d/0x250
g[1375589.788705] [<ffffffff8106d10f>] ? do_fork+0xcf/0x340
g[1375589.788705] [<ffffffff8154b779>] ? stub_clone+0x69/0x90
g[1375589.788705] [<ffffffff8154b40d>] ? system_call_fast_compare_end+0x10/0x15
g[1375589.788705] Code: 00 00 83 e8 01 4d 8b 64 24 70 39 d0 7f f4 48 8b 7d 78 49 3b 7c 24 78 74 1d 66 0f 1f 84 00 00 00 00 00 48 8b 6d 70 4d 8b 64 24 70 <48> 8b 7d 78 49 3b 7c 24 78 75 ec 48 85 ff 74 e7 e8 f2 f9 ff ff
g[1375589.788705] RIP [<ffffffff810a67d9>] check_preempt_wakeup+0xd9/0x1d0
g[1375589.788705] RSP <ffff880003ac7e30>
g[1375589.788705] CR2: 0000000000000078
g[1375589.788705] ---[ end trace 5fab7713cb2d171f ]---
The only way I've been able to restore them is to log in to the web interface and reset them manually. Needless to say, that doesn't scale.
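To spot affected instances without reading each console by hand, the serial console text could be scanned for the oops signature shown above. This is a minimal sketch, assuming the console output has already been fetched as a string by some other means; the function name `find_oops` and the patterns are illustrative and only cover the message format seen in this log.

```python
import re

# Patterns taken from the oops output above; other kernels may word these differently.
OOPS_MARKERS = (
    re.compile(r"BUG: unable to handle kernel NULL pointer dereference"),
    re.compile(r"Oops: [0-9a-f]{4} \[#\d+\]"),
)

# Matches e.g. "IP: [<ffffffff810a67d9>] check_preempt_wakeup+0xd9/0x1d0"
FAULT_SITE = re.compile(r"\bIP: \[<[0-9a-f]+>\] (\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def find_oops(console_text):
    """Return the faulting symbol if the text contains an oops, else None.

    Returns "unknown" when an oops marker is present but no "IP:" line is found.
    """
    if not any(pattern.search(console_text) for pattern in OOPS_MARKERS):
        return None
    match = FAULT_SITE.search(console_text)
    return match.group(1) if match else "unknown"
```

Whatever calls this would still need to trigger the reset itself; that part is deliberately left out here.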
I've already tried setting up a watchdog device and setting kernel.panic = 10, which in theory should reboot the VM.
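One thing worth checking: kernel.panic only sets the reboot delay after a panic, and the log above shows an Oops rather than a panic. An oops is escalated to a panic only when kernel.panic_on_oops is non-zero. The sketch below reads both values from /proc/sys/kernel and reports whether an oops would lead to an automatic reboot. The function names are made up for illustration, and whether this setting actually rescues a VM that oopses inside the scheduler is an assumption I have not verified.

```python
from pathlib import Path


def read_int(path):
    """Read a single integer sysctl value from a procfs file."""
    return int(Path(path).read_text().strip())


def oops_reboot_status(sysctl_dir="/proc/sys/kernel"):
    """Report whether a kernel oops would end in an automatic reboot.

    kernel.panic: 0 means never reboot after a panic; any other value reboots
    (positive = delay in seconds, negative = immediately).
    kernel.panic_on_oops: non-zero turns an oops into a panic.
    """
    base = Path(sysctl_dir)
    panic_timeout = read_int(base / "panic")
    panic_on_oops = read_int(base / "panic_on_oops")
    return {
        "panic_timeout": panic_timeout,
        "panic_on_oops": panic_on_oops,
        "oops_triggers_reboot": panic_on_oops != 0 and panic_timeout != 0,
    }
```

Calling `oops_reboot_status()` on the VM returns a small dict; if `oops_triggers_reboot` is false while `panic_timeout` is 10, the missing piece is probably kernel.panic_on_oops = 1.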
For these VMs, I'm using 'container-vm' as the OS flavor (i.e. more or less Debian with Docker preinstalled).
Has anyone else seen this?
I don't have enough reputation to comment, so I'm posting this here instead. I'm having the same issue. I searched for similar bug reports and found that almost every kernel trace contains the do_fork() function. After that I found this: http://www.serverphorums.com/read.php?12,1053418
And an updated version here:
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/kernel/sched/core.c?id=ea86cb4b7621e1298a37197005bf0abcc86348d4
I hope it helps someone.
I'd like to have this fixed in my distro, but I don't know how to get the distro maintainers to include this patch in the default kernel.