I'm having an issue with a Linux server. Once a week the running MySQL instance hangs and there is no way to fully stop it. If I kill it, it remains in zombie state and init does not reap its PID.
The server is used for staging deployments and some internal tools, so it's not under heavy load. The only process in constant use is MySQL, which is why I think it's the only process that suffers from this issue.
I've searched the system logs for errors, and the only thing I found is this message (repeated a couple of times) in the dmesg output:
[706560.640085] INFO: task mysqld:31965 blocked for more than 120 seconds.
[706560.640198] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[706560.640312] mysqld D ffff88032fd93f40 0 31965 1 0x00000000
[706560.640317] ffff880242a27d18 0000000000000086 ffff88031a50dd00 ffff880242a27fd8
[706560.640321] ffff880242a27fd8 ffff880242a27fd8 ffff88031e549740 ffff88031a50dd00
[706560.640325] ffff88031a50dd00 ffff88032fd947f8 0000000000000002 ffffffff8112f250
[706560.640328] Call Trace:
[706560.640338] [<ffffffff8112f250>] ? __lock_page+0x70/0x70
[706560.640344] [<ffffffff816cb1b9>] schedule+0x29/0x70
[706560.640347] [<ffffffff816cb28f>] io_schedule+0x8f/0xd0
[706560.640350] [<ffffffff8112f25e>] sleep_on_page+0xe/0x20
[706560.640353] [<ffffffff816c9900>] __wait_on_bit+0x60/0x90
[706560.640356] [<ffffffff8112f390>] wait_on_page_bit+0x80/0x90
[706560.640360] [<ffffffff8107dce0>] ? autoremove_wake_function+0x40/0x40
[706560.640363] [<ffffffff8112f891>] filemap_fdatawait_range+0x101/0x190
[706560.640366] [<ffffffff81130975>] filemap_write_and_wait_range+0x65/0x70
[706560.640371] [<ffffffff8122e441>] ext4_sync_file+0x71/0x320
[706560.640376] [<ffffffff811c3e6d>] do_fsync+0x5d/0x90
[706560.640379] [<ffffffff811c40d0>] sys_fsync+0x10/0x20
[706560.640383] [<ffffffff816d495d>] system_call_fastpath+0x1a/0x1f
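To see how often these reports show up and which tasks they name, the kernel log can be filtered for the hung-task header line. The following is only a minimal sketch: the function name `hung_tasks` is made up for illustration, and it assumes the message format is exactly the one shown in the trace above.

```shell
#!/bin/sh
# hung_tasks (hypothetical helper): read kernel log text on stdin and print
# "task pid seconds" for every hung-task report it finds.
# Assumes the "INFO: task NAME:PID blocked for more than N seconds." format
# seen in the dmesg output above.
hung_tasks() {
    sed -n -E 's/.*INFO: task (.*):([0-9]+) blocked for more than ([0-9]+) seconds.*/\1 \2 \3/p'
}
```

It could be fed with something like `dmesg | hung_tasks`, or with a saved copy of the kernel log, to check whether anything other than mysqld is ever reported.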
When the hang occurs, the only way to get everything working again is a full reboot, but to do that I'm forced to use this command after manually stopping all running processes:
echo b > /proc/sysrq-trigger
otherwise the normal reboot process hangs forever. I've traced the reboot scripts and found that the reboot process also hangs on a sync call, this one in /etc/init.d/sendsigs (I'm on Ubuntu):
# Flush the kernel I/O buffer before we start to kill
# processes, to make sure the IO of already stopped services to
# not slow down the remaining processes to a point where they
# are accidentily killed with SIGKILL because they did not
# manage to shut down in time.
sync
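Both the fsync in the mysqld trace and this sync call wait for dirty pages to be written out, so one thing that might be worth watching is the Dirty and Writeback counters in /proc/meminfo. This is a rough sketch, not a diagnosis: the helper name `writeback_state` is hypothetical, and it only assumes the usual "Name: value kB" layout of /proc/meminfo.

```shell
#!/bin/sh
# writeback_state (hypothetical helper): read /proc/meminfo-style text on
# stdin and print only the Dirty and Writeback counters.
# Matching on the exact first field avoids picking up WritebackTmp.
writeback_state() {
    awk '$1 == "Dirty:" || $1 == "Writeback:" { print $1, $2, $3 }'
}
```

Running `writeback_state < /proc/meminfo` every few seconds while the machine is hung would show whether the Writeback value ever drains; if it stays stuck at a nonzero value, that would be consistent with writes never completing at the block layer.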
I'm almost sure that the cause of these hangs is a hardware issue (the RAID controller?), partly because I have two other machines with the same hardware and software configuration and they don't suffer from this, but I can't find any hint in syslog or dmesg. I've also installed the smartmontools and mcelog packages, but neither of them reported any issue.
What can I do to track down the cause of this issue?
Today it happened again. Here is the status of the system after triggering a reboot:
init─┬─console-kit-dae───64*[{console-kit-dae}]
     ├─dbus-daemon
     ├─mcelog
     ├─mysqld───{mysqld}
     ├─newrelic-daemon───newrelic-daemon───11*[{newrelic-daemon}]
     ├─ntpd
     ├─polkitd───{polkitd}
     ├─python3
     ├─rpc.idmapd
     ├─rpc.statd
     ├─rpcbind
     ├─sh───rc───S20sendsigs───sync
     ├─smartd
     ├─snmpd
     ├─sshd───sshd───zsh───sudo───zsh───pstree
     └─sshd───sshd───zsh───sudo───zsh
And here is the status of the sync process:
# ps aux | grep sync
root 3637 0.1 0.0 4352 372 ? D 05:53 0:00 sync
i.e. Uninterruptible sleep...
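To list every process stuck in that state, along with the kernel function it is sleeping in, the ps output can be filtered on the STAT column. This is a small sketch under some assumptions: the helper name `d_state` is invented here, and the input is expected to have the PID in the first column and the state in the second, as produced by the procps `ps -eo pid,stat,wchan:32,comm` invocation.

```shell
#!/bin/sh
# d_state (hypothetical helper): filter ps output read on stdin, keeping the
# header line and every row whose second column (STAT) starts with "D",
# i.e. processes in uninterruptible sleep.
d_state() {
    awk 'NR == 1 || $2 ~ /^D/'
}
```

For example, `ps -eo pid,stat,wchan:32,comm | d_state` would show whether mysqld and sync are the only blocked processes and whether they are waiting in the same place.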
I think the RAID controller is a fake RAID. I don't usually deal with hardware (and, for the record, I don't have physical access to the machine).
Hardware specs as reported by lshw:
description: Computer
product: X7DBP ()
vendor: Supermicro
version: 0123456789
serial: 0123456789
width: 64 bits
capabilities: smbios-2.4 dmi-2.4 vsyscall32
configuration: administrator_password=disabled boot=normal frontpanel_password=unknown keyboard_password=unknown power-on_password=disabled uuid=53D19F64-D663-A017-8922-0030487C1FEE
*-core
description: Motherboard
product: X7DBP
vendor: Supermicro
physical id: 0
version: PCB Version
serial: 0123456789
*-firmware
description: BIOS
vendor: Phoenix Technologies LTD
physical id: 0
version: 6.00
date: 05/29/2007
size: 106KiB
capacity: 960KiB
capabilities: pci pnp upgrade shadowing escd cdboot bootselect edd int13floppy2880 acpi usb ls120boot zipboot biosbootspecification
*-storage
description: RAID bus controller
product: 631xESB/632xESB SATA RAID Controller
vendor: Intel Corporation
physical id: 1f.2
bus info: pci@0000:00:1f.2
version: 09
width: 32 bits
clock: 66MHz
capabilities: storage pm bus_master cap_list
configuration: driver=ahci latency=0
resources: irq:19 ioport:18a0(size=8) ioport:1874(size=4) ioport:1878(size=8) ioport:1870(size=4) ioport:1880(size=32) memory:d8500400-d85007ff