My system has started behaving strangely, intermittently locking up. I see messages like the following in syslog:
Nov 18 22:22:00 claypool kernel: [ 3428.078156] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Nov 18 22:22:00 claypool kernel: [ 3428.078163] ata3.00: irq_stat 0x40000000
Nov 18 22:22:00 claypool kernel: [ 3428.078167] sr 2:0:0:0: CDB: Test Unit Ready: 00 00 00 00 00 00
Nov 18 22:22:00 claypool kernel: [ 3428.078182] ata3.00: cmd a0/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Nov 18 22:22:00 claypool kernel: [ 3428.078184] res 50/00:03:00:00:00/00:00:00:00:00/a0 Emask 0x1 (device error)
Nov 18 22:22:00 claypool kernel: [ 3428.078188] ata3.00: status: { DRDY }
Nov 18 22:22:00 claypool kernel: [ 3428.080887] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Nov 18 22:22:00 claypool kernel: [ 3428.080890] ata3.00: irq_stat 0x40000000
Nov 18 22:22:00 claypool kernel: [ 3428.080893] sr 2:0:0:0: CDB: Test Unit Ready: 00 00 00 00 00 00
Nov 18 22:22:00 claypool kernel: [ 3428.080905] ata3.00: cmd a0/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Nov 18 22:22:00 claypool kernel: [ 3428.080906] res 50/00:03:00:00:00/00:00:00:00:00/a0 Emask 0x1 (device error)
Nov 18 22:22:00 claypool kernel: [ 3428.080910] ata3.00: status: { DRDY }
And then this:
Nov 18 23:13:56 claypool kernel: [ 6544.000798] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Nov 18 23:13:56 claypool kernel: [ 6544.000804] ata1.00: failed command: FLUSH CACHE EXT
Nov 18 23:13:56 claypool kernel: [ 6544.000814] ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Nov 18 23:13:56 claypool kernel: [ 6544.000815] res 40/00:00:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout)
Nov 18 23:13:56 claypool kernel: [ 6544.000819] ata1.00: status: { DRDY }
Nov 18 23:13:56 claypool kernel: [ 6544.000825] ata1: hard resetting link
Nov 18 23:14:01 claypool kernel: [ 6549.360324] ata1: link is slow to respond, please be patient (ready=0)
Nov 18 23:14:06 claypool kernel: [ 6554.008091] ata1: COMRESET failed (errno=-16)
Nov 18 23:14:06 claypool kernel: [ 6554.008103] ata1: hard resetting link
Nov 18 23:14:11 claypool kernel: [ 6559.372246] ata1: link is slow to respond, please be patient (ready=0)
Nov 18 23:14:16 claypool kernel: [ 6564.020228] ata1: COMRESET failed (errno=-16)
Nov 18 23:14:16 claypool kernel: [ 6564.020235] ata1: hard resetting link
Nov 18 23:14:21 claypool kernel: [ 6569.380109] ata1: link is slow to respond, please be patient (ready=0)
Nov 18 23:14:31 claypool kernel: [ 6579.460243] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Nov 18 23:14:31 claypool kernel: [ 6579.486595] ata1.00: configured for UDMA/133
Nov 18 23:14:31 claypool kernel: [ 6579.486601] ata1.00: retrying FLUSH 0xea Emask 0x4
Nov 18 23:14:31 claypool kernel: [ 6579.486939] ata1.00: device reported invalid CHS sector 0
Nov 18 23:14:31 claypool kernel: [ 6579.486952] ata1: EH complete
Nov 18 23:17:01 claypool CRON[3910]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Nov 18 23:17:01 claypool CRON[3908]: (CRON) error (grandchild #3910 failed with exit status 1)
Nov 18 23:17:01 claypool postfix/sendmail[3925]: fatal: open /etc/postfix/main.cf: No such file or directory
Nov 18 23:17:01 claypool CRON[3908]: (root) MAIL (mailed 1 byte of output; but got status 0x004b, #012)
Nov 18 23:39:01 claypool CRON[4200]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -d /var/lib/php5 ] && find /var/lib/php5/ -type f -cmin +$(/usr/lib/php5/maxlifetime) -print0 | xargs -n 200 -r -0 rm)
There are no messages logged after 23:39. When I next tried to use the machine, it would not return from the screensaver (blank screen) or switch to another terminal, and I had to hard-reboot it.
[UPDATE] The output of smartctl is here. I had trouble getting this, because / is being mounted read-only (?!), which prevents most applications from running.
Also, it may not be related, but I have the following worrying messages in dmesg:
[ 10.084596] k8temp 0000:00:18.3: Temperature readouts might be wrong - check erratum #141
[ 10.098477] i2c i2c-0: nForce2 SMBus adapter at 0x600
[ 10.098483] ACPI: resource nForce2_smbus [io 0x0700-0x073f] conflicts with ACPI region SM00 [??? 0x00000700-0x0000073f flags 0x30]
[ 10.098486] ACPI: This conflict may cause random problems and system instability
[ 10.098487] ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver
[ 10.098509] i2c i2c-1: nForce2 SMBus adapter at 0x700
[ 10.112570] Linux agpgart interface v0.103
[ 10.155329] atk: Resources not safely usable due to acpi_enforce_resources kernel parameter
[ 10.161506] it87: Found IT8712F chip at 0x290, revision 8
[ 10.161517] it87: VID is disabled (pins used for GPIO)
[ 10.161527] it87: in3 is VCC (+5V)
[ 10.161528] it87: in7 is VCCH (+5V Stand-By)
[ 10.161560] ACPI: resource it87 [io 0x0295-0x0296] conflicts with ACPI region ECRE [??? 0x00000290-0x000002af flags 0x45]
[ 10.161562] ACPI: This conflict may cause random problems and system instability
[ 10.161564] ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver
[UPDATE 2] I swapped in a new SATA cable, per Phil's suggestion. The current output of smartctl is here, if it helps.
[UPDATE 3] I don't think the cable fixed it. The system hasn't locked up yet, but my media player crashed a few minutes ago and I have the following in the syslog:
Nov 20 16:07:17 claypool kernel: [ 2294.400033] ata1: link is slow to respond, please be patient (ready=0)
Nov 20 16:07:47 claypool kernel: [ 2324.084581] ata1: COMRESET failed (errno=-16)
Nov 20 16:07:47 claypool kernel: [ 2324.084588] ata1: limiting SATA link speed to 1.5 Gbps
Nov 20 16:07:47 claypool kernel: [ 2324.084592] ata1: hard resetting link
I get the following response from smartctl:
$ sudo smartctl -a /dev/sda
[sudo] password for chris:
sudo: Can't open /var/lib/sudo/chris/0: Read-only file system
smartctl 5.40 2010-03-16 r3077 [i686-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
Device: /0:0:0:0 Version:
scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46
>> Terminate command early due to bad response to IEC mode page
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
Don't forget that the Live CD (or Live USB) system that you used to install Ubuntu is also a fully functional environment that does not depend on your hard drive. You could boot from the install disc and have access to any program that you might require to investigate the problem. If I ever suspect a hard drive problem, my first priority is to boot from a Live CD or USB drive and make a backup of all the critical data on the hard drive. Then I use the tools on the live system to diagnose the problem.
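As a rough illustration of that last step, here is a small Python sketch that could be run from the live system against a copy of the syslog to see which ATA port is generating errors. The function name `summarize_ata_errors`, the regular expression, and the list of "trouble" substrings are my own assumptions for illustration, not part of any standard tool; the sample lines are copied from the log in the question.

```python
import re
from collections import Counter

# Matches kernel lines such as "ata1.00: exception Emask ..." or
# "ata1: COMRESET failed ...". Captures the port name and the message text.
ATA_LINE = re.compile(r"\b(ata\d+)(?:\.\d+)?: (.*)$")

# Substrings treated as trouble signs. This list is illustrative and
# not exhaustive; adjust it to the messages you actually see.
PATTERNS = (
    "exception Emask",
    "failed command",
    "hard resetting link",
    "COMRESET failed",
)


def summarize_ata_errors(lines):
    """Count suspicious libata messages per (port, pattern) pair."""
    counts = Counter()
    for line in lines:
        match = ATA_LINE.search(line)
        if not match:
            continue
        port, message = match.groups()
        for pattern in PATTERNS:
            if pattern in message:
                counts[(port, pattern)] += 1
    return counts


# Sample input: a few lines taken from the syslog excerpt in the question.
SAMPLE = """\
Nov 18 23:13:56 claypool kernel: [ 6544.000798] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Nov 18 23:13:56 claypool kernel: [ 6544.000804] ata1.00: failed command: FLUSH CACHE EXT
Nov 18 23:13:56 claypool kernel: [ 6544.000825] ata1: hard resetting link
Nov 18 23:14:06 claypool kernel: [ 6554.008091] ata1: COMRESET failed (errno=-16)
Nov 18 22:22:00 claypool kernel: [ 3428.078156] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
""".splitlines()

if __name__ == "__main__":
    for (port, pattern), n in sorted(summarize_ata_errors(SAMPLE).items()):
        print(f"{port}: {pattern} x{n}")
```

To use it on a real log, replace `SAMPLE` with the lines of the file (for example `open("/path/to/syslog").read().splitlines()`, where the path is wherever you copied the log). Repeated link resets and timeouts on one port point at the drive, cable, or controller port attached to it, but the counts alone cannot tell you which of those is at fault.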
This appears to be a hard disk failure. Swapping the cable didn't fix it. The "Reallocated Sector Count" in smartctl is way above the threshold, and it's a pre-failure indicator. Luckily I have backups, and I can still read from the disk to copy over old data.
I recently had this error. It turned out to be the SATA cables connecting the HDDs to the motherboard. I bought two new ones and it's been working perfectly for the last week.
Also, in "System" -> "Administration" -> "Disk Utility", check out the SMART data for the drive. It might tell you why it's failing. My cable problem reported no