Hopefully someone can help interpret what is going on here:
[ 2081.280253] BUG: unable to handle kernel paging request at ffff8801ad287000
[ 2081.280262] IP: [<ffffffff8000f549>] __sanitize_i387_state+0x29/0x120
[ 2081.280272] PGD 1e30067 PUD 39ab067 PMD 3b15067 PTE 0
[ 2081.280277] Oops: 0000 [#4] SMP
[ 2081.280281] last sysfs file: /sys/devices/xen-backend/vbd-5-51715/uevent
[ 2081.280285] CPU 1
[ 2081.280286] Modules linked in: tun md5 ip6table_filter ip6_tables iptable_filter ip_tables x_tables usbbk gntdev netbk blkbk blkback_pagemap blktap xenbus_be evtchn nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs bridge stp llc edd sbs sbshc max6650 lm75 coretemp domctl snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device adm1021 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx dm_mod snd_hda_codec_hdmi 8250_pci snd_hda_codec_realtek snd_hda_intel snd_hda_codec ir_lirc_codec lirc_dev ir_sony_decoder ir_jvc_decoder snd_hwdep ir_rc6_decoder ir_rc5_decoder rc_rc6_mce sg ir_nec_decoder nouveau ttm tpm_tis tpm mceusb ir_core i2c_i801 e1000e snd_pcm pcspkr tpm_bios iTCO_wdt iTCO_vendor_support snd_timer 8250 serial_core snd soundcore snd_page_alloc ext4 jbd2 crc16 drm_kms_helper drm i2c_algo_bit i2c_core video output ehci_hcd usbcore button xenblk cdrom xennet fan processor thermal thermal_sys hwmon ata_generic
[ 2081.280350]
[ 2081.280354] Pid: 6623, comm: block Tainted: G D 2.6.37.6-0.5-xen #1 /DQ67OW
[ 2081.280359] RIP: e030:[<ffffffff8000f549>] [<ffffffff8000f549>] __sanitize_i387_state+0x29/0x120
[ 2081.280365] RSP: e02b:ffff88006bb0dd98 EFLAGS: 00010246
[ 2081.280368] RAX: 0000000000000000 RBX: ffff8801ad286e00 RCX: ffff88006bb0dfd8
[ 2081.280371] RDX: ffff88006bae4440 RSI: 0000000000000200 RDI: ffff88006bae4440
[ 2081.280375] RBP: ffff88006bae4440 R08: ffff88006bb0df58 R09: 0000000000000000
[ 2081.280378] R10: 0000000000000000 R11: 00000000ffffffff R12: 0000000000000011
[ 2081.280381] R13: ffff88006bb0df58 R14: 00007fffc379b800 R15: 00007fffc379b638
[ 2081.280388] FS: 00007f89c8b00700(0000) GS:ffff8801e651d000(0000) knlGS:0000000000000000
[ 2081.280391] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 2081.280394] CR2: ffff8801ad287000 CR3: 000000006bb10000 CR4: 0000000000002660
[ 2081.280398] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2081.280408] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 2081.280412] Process block (pid: 6623, threadinfo ffff88006bb0c000, task ffff88006bae4440)
[ 2081.280415] Stack:
[ 2081.280417] 00007fffc379b800 ffff88006bae4440 0000000000000011 ffffffff8000f90a
[ 2081.280422] ffff88006bb0dee8 ffff88006bae4998 0000000000000011 ffffffff80006a22
[ 2081.280426] ffff8801d88d65c0 ffff88006bae4440 ffff88006bb0de68 000000116bae4440
[ 2081.280431] Call Trace:
[ 2081.280438] [<ffffffff8000f90a>] save_i387_xstate+0x1aa/0x210
[ 2081.280444] [<ffffffff80006a22>] __setup_rt_frame+0x2f2/0x370
[ 2081.280449] [<ffffffff80006dd1>] handle_signal+0x201/0x2b0
[ 2081.280454] [<ffffffff80006f09>] do_signal+0x89/0x1b0
[ 2081.280459] [<ffffffff800070b5>] do_notify_resume+0x65/0x90
[ 2081.280464] [<ffffffff8000770e>] int_signal+0x12/0x17
[ 2081.280471] [<00007f89c7fb1090>] 0x7f89c7fb1090
[ 2081.280474] Code: 00 00 41 54 55 53 48 8b 9f 10 05 00 00 48 85 db 0f 84 9c 00 00 00 48 8b 47 08 f6 40 14 01 0f 85 ef 00 00 00 48 8b 05 37 55 89 00 <48> 8b ab 00 02 00 00 48 89 c2 48 21 ea 48 39 d0 74 75 48 89 e8
[ 2081.280499] RIP [<ffffffff8000f549>] __sanitize_i387_state+0x29/0x120
[ 2081.280504] RSP <ffff88006bb0dd98>
[ 2081.280506] CR2: ffff8801ad287000
[ 2081.284005] ---[ end trace 56e37f97ef72fda4 ]---
This is a new server build running openSUSE 11.4, kernel 2.6.37.6-0.5-xen, on an i2500 with 8GB RAM.
I have tried a couple of different kernels (as updates happened to arrive via zypper). I have also tried both sticks of RAM (4GB each) individually and swapped their positions. The motherboard, a DQ67OW, has integrated graphics, and I have tried a discrete card in case the integrated graphics was consuming memory the kernel was unaware of. The error can occur on any of the CPU cores.
It doesn't seem to be triggered by any specific activity. I am running mdadm RAID5, and the 'block' process is often the one triggering the oops, but bash and udevd have triggered it too.
It seems that if the oops happens to a critical enough process, the entire server hangs with the flashing caps lock and scroll lock lights.
The processor, motherboard and RAM are all new. I suspect this is caused by a hardware fault, or perhaps a driver bug. Perhaps the NIC driver...?
Any suggestions as to how I can narrow down the culprit would be great.
Cheers,
Paul
Follow-up trace:
[17836.273843] BUG: unable to handle kernel paging request at ffff8801ad287000
[17836.273853] IP: [<ffffffff8000f549>] __sanitize_i387_state+0x29/0x120
[17836.273863] PGD 1e30067 PUD 39ab067 PMD 3b15067 PTE 0
[17836.273868] Oops: 0000 [#6] SMP
[17836.273871] last sysfs file: /sys/devices/xen-backend/vbd-6-51715/statistics/wr_sect
[17836.273875] CPU 1
[17836.273876] Modules linked in: usb_storage uas tun md5 ip6table_filter ip6_tables iptable_filter ip_tables x_tables usbbk gntdev netbk blkbk blkback_pagemap blktap xenbus_be evtchn nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs bridge stp llc edd sbs sbshc max6650 lm75 coretemp domctl snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device adm1021 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx dm_mod snd_hda_codec_hdmi 8250_pci snd_hda_codec_realtek snd_hda_intel snd_hda_codec ir_lirc_codec lirc_dev ir_sony_decoder ir_jvc_decoder snd_hwdep ir_rc6_decoder ir_rc5_decoder rc_rc6_mce sg ir_nec_decoder nouveau ttm tpm_tis tpm mceusb ir_core i2c_i801 e1000e snd_pcm pcspkr tpm_bios iTCO_wdt iTCO_vendor_support snd_timer 8250 serial_core snd soundcore snd_page_alloc ext4 jbd2 crc16 drm_kms_helper drm i2c_algo_bit i2c_core video output ehci_hcd usbcore button xenblk cdrom xennet fan processor thermal thermal_sys hwmon ata_generic
[17836.273940]
[17836.273943] Pid: 9479, comm: bash Tainted: G D 2.6.37.6-0.5-xen #1 /DQ67OW
[17836.273949] RIP: e030:[<ffffffff8000f549>] [<ffffffff8000f549>] __sanitize_i387_state+0x29/0x120
[17836.273954] RSP: e02b:ffff88002afebd98 EFLAGS: 00010246
[17836.273957] RAX: 0000000000000000 RBX: ffff8801ad286e00 RCX: ffff88002afebfd8
[17836.273960] RDX: ffff88002ad62800 RSI: 0000000000000200 RDI: ffff88002ad62800
[17836.273964] RBP: ffff88002ad62800 R08: ffff88002afebf58 R09: 0000000000000000
[17836.273967] R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000011
[17836.273970] R13: ffff88002afebf58 R14: 00007fff522ce400 R15: 00007fff522ce238
[17836.273976] FS: 00007f5908ab2700(0000) GS:ffff8801e651d000(0000) knlGS:0000000000000000
[17836.273979] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[17836.273982] CR2: ffff8801ad287000 CR3: 00000000fa6a2000 CR4: 0000000000002660
[17836.273986] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[17836.273989] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[17836.273993] Process bash (pid: 9479, threadinfo ffff88002afea000, task ffff88002ad62800)
[17836.273996] Stack:
[17836.273998] 00007fff522ce400 ffff88002ad62800 0000000000000011 ffffffff8000f90a
[17836.274003] ffff88002afebee8 ffff88002ad62d58 0000000000000011 ffffffff80006a22
[17836.274007] ffff8801d91c4e80 ffff88002ad62800 ffff88002afebe68 000000112ad62800
[17836.274011] Call Trace:
[17836.274019] [<ffffffff8000f90a>] save_i387_xstate+0x1aa/0x210
[17836.274025] [<ffffffff80006a22>] __setup_rt_frame+0x2f2/0x370
[17836.274030] [<ffffffff80006dd1>] handle_signal+0x201/0x2b0
[17836.274035] [<ffffffff80006f09>] do_signal+0x89/0x1b0
[17836.274040] [<ffffffff800070b5>] do_notify_resume+0x65/0x90
[17836.274046] [<ffffffff8000770e>] int_signal+0x12/0x17
[17836.274052] [<00007f5907ecfd80>] 0x7f5907ecfd80
[17836.274055] Code: 00 00 41 54 55 53 48 8b 9f 10 05 00 00 48 85 db 0f 84 9c 00 00 00 48 8b 47 08 f6 40 14 01 0f 85 ef 00 00 00 48 8b 05 37 55 89 00 <48> 8b ab 00 02 00 00 48 89 c2 48 21 ea 48 39 d0 74 75 48 89 e8
[17836.274081] RIP [<ffffffff8000f549>] __sanitize_i387_state+0x29/0x120
[17836.274085] RSP <ffff88002afebd98>
[17836.274088] CR2: ffff8801ad287000
[17836.274091] ---[ end trace 56e37f97ef72fda6 ]---
This is typically due to bad memory, but as you say it can also be due to a software error. (It's the equivalent of a segfault in kernel space.)
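To make the "segfault in kernel space" reading concrete, here is a small arithmetic check on the register dump. It rests on one assumption that is mine and not stated in the thread: that the bytes starting at the marked `<48>` in the Code line (`48 8b ab 00 02 00 00`) decode to a 64-bit load from RBX + 0x200. The 4 KiB page size is also an assumption. Both traces report the same RBX and CR2 values, so the check applies to either.

```python
# Values copied from the register dump in the traces above.
RBX = 0xFFFF8801AD286E00  # base register used by the faulting load
CR2 = 0xFFFF8801AD287000  # address the CPU could not translate

# Assumption: "48 8b ab 00 02 00 00" decodes to mov 0x200(%rbx),%rbp.
DISPLACEMENT = 0x200

# Assumption: 4 KiB pages.
PAGE_SIZE = 4096


def page_of(addr: int) -> int:
    """Return the base address of the page containing addr."""
    return addr & ~(PAGE_SIZE - 1)


target = RBX + DISPLACEMENT

# Does the load target match the reported fault address?
print(hex(target), target == CR2)

# Does the load cross from the page holding RBX into the next page?
print(page_of(RBX) != page_of(target))
```

If the decoding is right, the load lands exactly on the first byte of the page after the one RBX points into, and the `PTE 0` in the oops says that next page is not mapped.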
Run memtest overnight. It should show up as a boot option after you install the package.
If that doesn't reveal anything, it's probably software. Compare different crash logs to see if there's any commonality in the address reported in the first line, or the call trace given partway down. If they're all very similar, it's probably a software bug. Report this as a kernel bug for the distro and see what help you get.
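The comparison of crash logs suggested above could be scripted roughly as follows. This is only a sketch: the function names `summarize_oops` and `compare_oopses` are made up for illustration, the regular expressions assume the oops layout seen in this thread, and the two sample strings are abbreviated excerpts of the traces posted above.

```python
import re


def summarize_oops(text: str) -> dict:
    """Extract the fault address, faulting function and call-trace symbols."""
    addr = re.search(r"paging request at ([0-9a-f]+)", text)
    ip = re.search(r"IP: \[<[0-9a-f]+>\] (\S+)", text)
    # Only look between "Call Trace:" and "Code:" so the trailing RIP line
    # is not counted as part of the trace.
    trace_part = text.partition("Call Trace:")[2].partition("Code:")[0]
    trace = re.findall(r"\[<[0-9a-f]+>\] ([A-Za-z_]\w*)\+0x", trace_part)
    return {
        "address": addr.group(1) if addr else None,
        "ip": ip.group(1) if ip else None,
        "trace": trace,
    }


def compare_oopses(first: dict, second: dict) -> dict:
    """Report, field by field, whether two oops summaries agree."""
    return {key: first[key] == second[key] for key in first}


# Abbreviated excerpts of the two traces posted in this thread.
FIRST = """\
[ 2081.280253] BUG: unable to handle kernel paging request at ffff8801ad287000
[ 2081.280262] IP: [<ffffffff8000f549>] __sanitize_i387_state+0x29/0x120
[ 2081.280431] Call Trace:
[ 2081.280438] [<ffffffff8000f90a>] save_i387_xstate+0x1aa/0x210
[ 2081.280444] [<ffffffff80006a22>] __setup_rt_frame+0x2f2/0x370
[ 2081.280449] [<ffffffff80006dd1>] handle_signal+0x201/0x2b0
[ 2081.280471] [<00007f89c7fb1090>] 0x7f89c7fb1090
"""

SECOND = """\
[17836.273843] BUG: unable to handle kernel paging request at ffff8801ad287000
[17836.273853] IP: [<ffffffff8000f549>] __sanitize_i387_state+0x29/0x120
[17836.274011] Call Trace:
[17836.274019] [<ffffffff8000f90a>] save_i387_xstate+0x1aa/0x210
[17836.274025] [<ffffffff80006a22>] __setup_rt_frame+0x2f2/0x370
[17836.274030] [<ffffffff80006dd1>] handle_signal+0x201/0x2b0
[17836.274052] [<00007f5907ecfd80>] 0x7f5907ecfd80
"""

if __name__ == "__main__":
    a, b = summarize_oops(FIRST), summarize_oops(SECOND)
    print(a)
    print(compare_oopses(a, b))
```

For the two traces in this thread, the fault address, the faulting function and the kernel part of the call trace all match, even though the processes ('block' and 'bash') differ.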
While I didn't run memtest for very long, I was becoming suspicious of the openSUSE install. It was a clean install, but my hunch was a kernel issue or something along those lines.
So I installed Debian on a different partition, spun up my VMs and everything else, and haven't had a glitch since.
I think the most likely contributor is that the Debian Xen kernel is 2.6.32 while openSUSE is at 2.6.37. It may be a bug in the kernel, or just an incompatibility in the configuration.
I'll compare the .configs when I get time. The Debian install has been running for a couple of days now; I was getting an oops every hour on average before, and now I don't get any.
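A rough sketch of how that .config comparison could be done, assuming the usual kernel config line shapes (`CONFIG_FOO=value` and `# CONFIG_FOO is not set`). The function names and the `CONFIG_EXAMPLE_*` options are hypothetical sample data, not taken from either kernel.

```python
import re


def parse_config(text: str) -> dict:
    """Map each option name to its value; 'is not set' lines become 'n'."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        unset = re.fullmatch(r"# (CONFIG_\w+) is not set", line)
        if unset:
            options[unset.group(1)] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, _, value = line.partition("=")
            options[name] = value
    return options


def diff_configs(old: dict, new: dict) -> dict:
    """Return {option: (old_value, new_value)} for options that differ.

    None means the option does not appear in that config at all.
    """
    changed = {}
    for name in sorted(set(old) | set(new)):
        before, after = old.get(name), new.get(name)
        if before != after:
            changed[name] = (before, after)
    return changed


# Hypothetical sample data standing in for the two real .config files.
OLD_CONFIG = """\
CONFIG_EXAMPLE_A=y
# CONFIG_EXAMPLE_B is not set
CONFIG_EXAMPLE_C=m
"""

NEW_CONFIG = """\
CONFIG_EXAMPLE_A=y
CONFIG_EXAMPLE_B=y
CONFIG_EXAMPLE_D="text"
"""

if __name__ == "__main__":
    differences = diff_configs(parse_config(OLD_CONFIG), parse_config(NEW_CONFIG))
    for name, (before, after) in differences.items():
        print(f"{name}: {before} -> {after}")
```

With real files, the two strings would be replaced by the contents of each kernel's config. Since the kernels are several versions apart, many differences will simply be options that exist in only one of them, which is why absent options are reported as None and not as 'n'.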