Yesterday I got a report of SSD ATA errors on one of my hosts.
The SSD is a 128GB OCZ-VERTEX4, firmware rev 1.3, about 8 months old.
OS is Ubuntu 11.04 running kernel 2.6.38-16-generic.
Motherboard is Intel DP35DP.
There have been no read errors or any other disk errors since the two incidents shown below.
Should I prepare a replacement drive?
SMART attributes:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE    UPDATED WHEN_FAILED RAW_VALUE
1   Raw_Read_Error_Rate     0x0000 ---   ---   ---    Old_age Offline -           393222
3   Spin_Up_Time            0x0000 100   100   000    Old_age Offline -           0
4   Start_Stop_Count        0x0000 100   100   000    Old_age Offline -           0
5   Reallocated_Sector_Ct   0x0000 100   100   000    Old_age Offline -           0
9   Power_On_Hours          0x0000 ---   ---   ---    Old_age Offline -           507536484
12  Power_Cycle_Count       0x0000 ---   ---   ---    Old_age Offline -           1664100
232 Available_Reservd_Space 0x0000 100   100   000    Old_age Offline -           4804710657
233 Media_Wearout_Indicator 0x0000 099   000   000    Old_age Offline -           99
Kernel log:
Jun 1 11:50:42 kernel: [424453.095411] ata4: exception Emask 0x10 SAct 0x0 SErr 0x4010000 action 0xe frozen
Jun 1 11:50:42 kernel: [424453.095415] ata4: irq_stat 0x00400040, connection status changed
Jun 1 11:50:42 kernel: [424453.095418] ata4: SError: { PHYRdyChg DevExch }
Jun 1 11:50:42 kernel: [424453.095422] ata4: hard resetting link
Jun 1 11:50:42 kernel: [424453.840022] ata4: SATA link down (SStatus 0 SControl 300)
Jun 1 11:50:44 kernel: [424455.948532] ata4: hard resetting link
Jun 1 11:50:45 kernel: [424456.490021] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jun 1 11:50:45 kernel: [424456.490288] ata4.00: configured for UDMA/133
Jun 1 11:50:45 kernel: [424456.490294] ata4: EH complete
Jun 1 19:18:23 kernel: [451311.319525] ata4.00: exception Emask 0x10 SAct 0x0 SErr 0x4010000 action 0xe frozen
Jun 1 19:18:23 kernel: [451311.319529] ata4.00: irq_stat 0x00400040, connection status changed
Jun 1 19:18:23 kernel: [451311.319532] ata4: SError: { PHYRdyChg DevExch }
Jun 1 19:18:23 kernel: [451311.319535] ata4.00: failed command: FLUSH CACHE
Jun 1 19:18:23 kernel: [451311.319541] ata4.00: cmd e7/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Jun 1 19:18:23 kernel: [451311.319542] res 40/00:0c:78:c6:c7/00:00:0a:00:00/40 Emask 0x10 (ATA bus error)
Jun 1 19:18:23 kernel: [451311.319545] ata4.00: status: { DRDY }
Jun 1 19:18:23 kernel: [451311.319549] ata4: hard resetting link
Jun 1 19:18:23 kernel: [451312.060033] ata4: SATA link down (SStatus 0 SControl 300)
Jun 1 19:18:23 kernel: [451314.082062] ata4: hard resetting link
Jun 1 19:18:23 kernel: [451314.630022] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jun 1 19:18:23 kernel: [451314.630295] ata4.00: configured for UDMA/133
Jun 1 19:18:23 kernel: [451314.630298] ata4.00: retrying FLUSH 0xe7 Emask 0x10
Jun 1 19:18:23 kernel: [451314.630320] ata4: EH complete
The cable may be bad, but the drive's firmware may also be at fault. Very rarely, this can also happen as a one-off. The error shows up when the drive fails to respond to ATA commands, or when data isn't coming across the connection properly.
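To see what the kernel is reporting, it can help to decode the SError value from the log by hand. The following is a minimal Python sketch, assuming the bit layout of the SATA SError register and the abbreviations the Linux kernel prints in its "SError: { ... }" lines; the function name decode_serror is only illustrative.

```python
# Bit positions follow the SATA SError register layout. The names are the
# abbreviations the Linux kernel prints in "SError: { ... }" log lines.
SERROR_BITS = {
    0: "RecovData",     # recovered data integrity error
    1: "RecovComm",     # recovered communications error
    8: "UnrecovData",   # transient data integrity error, not recovered
    9: "Persist",       # persistent communication or data integrity error
    10: "Proto",        # protocol error
    11: "HostInt",      # host bus adapter internal error
    16: "PHYRdyChg",    # PHY ready state changed
    17: "PHYInt",       # PHY internal error
    18: "CommWake",     # COMWAKE signal detected
    19: "10B8B",        # 10b to 8b decode error
    20: "Dispar",       # disparity error
    21: "BadCRC",       # CRC error
    22: "Handshk",      # handshake error
    23: "LinkSeq",      # link sequence error
    24: "TrStaTrns",    # transport state transition error
    25: "UnrecFIS",     # unrecognized FIS type
    26: "DevExch",      # device presence changed (device exchanged)
}


def decode_serror(value):
    """Return the names of all SError flags set in value, lowest bit first."""
    return [name for bit, name in sorted(SERROR_BITS.items())
            if value & (1 << bit)]


# SErr value taken from the kernel log in the question.
print(decode_serror(0x4010000))  # prints ['PHYRdyChg', 'DevExch']
```

Both flags mean the link itself dropped and came back, which fits the "SATA link down" and "SATA link up" lines that follow in the log. Neither one is a data integrity flag such as BadCRC or UnrecovData.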
Consider replacing the cable, and check for firmware updates (and if you're not taking backups, yesterday was a perfectly fine time to start). If you see this happen again, or more frequently, you'll need to replace the drive.
Very rarely, this can also be caused by a bad SATA controller (on your RAID card or motherboard).
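To judge whether the error is recurring or becoming more frequent, one option is to count ATA exception events per port in the kernel log. The following is a rough Python sketch that matches only the "exception Emask" line format seen in the question; the function name and the sample lines are illustrative, and other kernel versions may format these messages differently.

```python
import re
from collections import Counter

# Matches lines such as "ata4: exception Emask ..." and
# "ata4.00: exception Emask ...", capturing only the port name ("ata4").
EXCEPTION_RE = re.compile(r"\b(ata\d+)(?:\.\d+)?: exception Emask")


def count_exceptions(lines):
    """Count ATA exception events per port in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = EXCEPTION_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Two lines copied from the kernel log in the question, plus one unrelated line.
sample = [
    "Jun 1 11:50:42 kernel: [424453.095411] ata4: exception Emask 0x10 "
    "SAct 0x0 SErr 0x4010000 action 0xe frozen",
    "Jun 1 11:50:42 kernel: [424453.095422] ata4: hard resetting link",
    "Jun 1 19:18:23 kernel: [451311.319525] ata4.00: exception Emask 0x10 "
    "SAct 0x0 SErr 0x4010000 action 0xe frozen",
]

for port, count in sorted(count_exceptions(sample).items()):
    print(port, count)  # prints: ata4 2
```

To run it against real data, pass an open log file instead of the sample list (on Ubuntu the kernel log is usually /var/log/kern.log, but check your syslog configuration). A count that keeps climbing on the same port after a cable swap points toward the drive or the controller.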
I had a similar thing happen to my older Vertex: after the failed FLUSH CACHE command, the drive went dead until it was power cycled. The problem was repeatable. It seems to have been a bug in the drive firmware, and an ATA security erase reset the drive to a working state.