I'm getting these errors in dmesg about half an hour after I turn on the computer:
[ 1355.677957] EXT4-fs error (device sda2): htree_dirblock_to_tree: inode #1318420: (comm updatedb.mlocat) bad entry in directory: directory entry across blocks - block=5251700offset=0(0), inode=1802725748, rec_len=179136, name_len=32
[ 1355.677973] Aborting journal on device sda2-8.
[ 1355.678101] EXT4-fs (sda2): Remounting filesystem read-only
[ 1355.690144] EXT4-fs error (device sda2): htree_dirblock_to_tree: inode #1318416: (comm updatedb.mlocat) bad entry in directory: directory entry across blocks - block=5251699offset=0(0), inode=2194783952, rec_len=53280, name_len=152
[ 1356.864720] EXT4-fs error (device sda2): htree_dirblock_to_tree: inode #1312795: (comm updatedb.mlocat) bad entry in directory: directory entry across blocks - block=5251176offset=1460(13748), inode=1432317541, rec_len=208208, name_len=119
/dev/sda is an SSD, and it's using the noop scheduler.
/etc/fstab entry:
UUID=acb4eefa-48ff-4ee1-bb5f-2dccce7d011f / ext4 errors=remount-ro,noatime,discard,user_xattr 0 1
System information:
$ cat /proc/mounts | grep /dev/sd
/dev/sda1 /boot ext2 rw,noatime,errors=continue 0 0
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=10.04
DISTRIB_CODENAME=lucid
DISTRIB_DESCRIPTION="Ubuntu 10.04.3 LTS"
$ uname -a
Linux leetpad 2.6.35-30-generic-pae #61~lucid1-Ubuntu SMP Thu Oct 13 21:14:29 UTC 2011 i686 GNU/Linux
Output of smartctl -a:
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Device Model: STT_FTM28GX25H
Serial Number: P637510-MIBY-706A009
Firmware Version: 1916
User Capacity: 128,035,676,160 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Thu Nov 24 20:53:48 2011 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x1d) SMART execute Offline immediate.
No Auto Offline data collection support.
Abort Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
No Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x00) Error logging NOT supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 0) minutes.
Extended self-test routine
recommended polling time: ( 0) minutes.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x0000 005 000 000 Old_age Offline In_the_past 0
9 Power_On_Hours 0x0000 141 002 000 Old_age Offline - 0
12 Power_Cycle_Count 0x0000 115 002 000 Old_age Offline - 0
184 Unknown_Attribute 0x0000 084 000 000 Old_age Offline In_the_past 0
195 Hardware_ECC_Recovered 0x0000 000 000 000 Old_age Offline FAILING_NOW 0
196 Reallocated_Event_Count 0x0000 000 000 000 Old_age Offline FAILING_NOW 0
197 Current_Pending_Sector 0x0000 000 000 000 Old_age Offline FAILING_NOW 0
198 Offline_Uncorrectable 0x0000 002 107 000 Old_age Offline - 21198
199 UDMA_CRC_Error_Count 0x0000 063 003 000 Old_age Offline - 26957
200 Multi_Zone_Error_Rate 0x0000 099 124 000 Old_age Offline - 446
201 Soft_Read_Error_Rate 0x0000 024 154 000 Old_age Offline - 328
202 TA_Increase_Count 0x0000 115 254 000 Old_age Offline - 115
203 Run_Out_Cancel 0x0000 247 245 000 Old_age Offline - 83
204 Shock_Count_Write_Opern 0x0000 000 000 000 Old_age Offline FAILING_NOW 0
205 Shock_Rate_Write_Opern 0x0000 016 039 000 Old_age Offline - 0
206 Flying_Height 0x0000 005 000 000 Old_age Offline In_the_past 0
207 Spin_High_Current 0x0000 055 015 000 Old_age Offline - 0
208 Spin_Buzz 0x0000 248 001 000 Old_age Offline - 0
209 Offline_Seek_Performnce 0x0000 095 000 000 Old_age Offline In_the_past 0
211 Unknown_Attribute 0x0000 000 000 000 Old_age Offline FAILING_NOW 0
212 Unknown_Attribute 0x0000 000 000 000 Old_age Offline FAILING_NOW 0
213 Unknown_Attribute 0x0000 000 000 000 Old_age Offline FAILING_NOW 0
Warning: device does not support Error Logging
Warning! SMART ATA Error Log Structure error: invalid SMART checksum.
SMART Error Log Version: 1
No Errors Logged
Warning! SMART Self-Test Log Structure error: invalid SMART checksum.
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
Device does not support Selective Self Tests/Logging
I've run memtest for 7 hours; it didn't find any memory errors.
Any ideas about what could be going wrong here? The most reasonable thing I can imagine is that the SSD is silently dropping some write requests, which eventually leads to an EXT4 filesystem inconsistency (but no disk I/O errors). How can this happen? Is there a relevant configuration option I should make sure is set correctly?
What tools should I use to diagnose the hardware failures? Would it be possible to diagnose the SSD failure without overwriting data?
It has failed; RMA it.
You may want to run a SMART self-test on it, but with values like these it's just a formality: it's highly unlikely to pass.
To run a test, use
It will tell you when the test will end; then run
smartctl -a /dev/sda
again; it will show the test result in the self-test section.
First, you might want to do a full fsck of the root disk. I have found that the quick check sometimes misses important errors. You can do this either by touching a file in the root directory (this may depend on the Linux distribution), but you might try
and rebooting, or by starting up the rescue CD and performing the fsck of the root filesystem there. By full, I mean use the -f fsck parameter.
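For the rescue CD route, a minimal sketch might look like the following. It assumes the root filesystem is /dev/sda2 (the device named in the dmesg output) and that it is not mounted when the check runs. The full_fsck name and the DRY_RUN switch are made up for illustration. For the reboot route, the flag file is commonly /forcefsck, but that is an assumption and depends on the distribution.

```shell
# Forced full check of an ext2/3/4 filesystem.
# Run as root from a rescue CD, with the filesystem unmounted.
# Set DRY_RUN to any non-empty value to print the command instead of running it.
full_fsck() {
    dev="${1:-/dev/sda2}"        # assumed device, taken from the dmesg output
    # -f: check even if the filesystem is flagged clean
    set -- e2fsck -f "$dev"
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

With DRY_RUN set, calling full_fsck /dev/sda2 only prints the e2fsck command, which is a cheap way to confirm the device name before running it for real.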
Second, is your syslog indicating any hardware errors?
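One rough way to check is to filter the log for messages that usually accompany disk or SATA link trouble. This is only a sketch: the scan_disk_errors name is made up, the pattern list is illustrative rather than exhaustive, and /var/log/syslog is assumed to be where kernel messages end up on this system.

```shell
# Print log lines that usually point at disk or SATA link trouble.
# The patterns are examples only; adjust them to what your kernel actually logs.
scan_disk_errors() {
    log="${1:-/var/log/syslog}"   # assumed log location
    grep -iE 'ata[0-9.]+: (exception|failed command|hard resetting link)|I/O error|EXT4-fs error' "$log"
}
```

If this prints only EXT4-fs errors and no ata or I/O error lines, the kernel never saw a failed request, which would fit the theory that the drive is returning bad data without reporting an error.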
As Mr. Kario indicated, you might check the disk health using smartctl. However, I find that some disks I have used do not report this information.
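As a sketch of the self-test procedure described in the other answer: the helper names and the DRY_RUN switch below are made up for illustration, and /dev/sda is the device from the question. The -t and -l selftest options are standard smartctl options, but whether this particular drive handles them properly is an open question.

```shell
# Print the command when DRY_RUN is set, otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Start a self-test; smartctl prints the estimated completion time.
# The second argument is "short" or "long" (the extended test).
smart_start_test() {
    run smartctl -t "${2:-long}" "${1:-/dev/sda}"
}

# Once the estimated time has passed, print the self-test log.
smart_show_result() {
    run smartctl -l selftest "${1:-/dev/sda}"
}
```

Note that the smartctl output in the question reports an invalid checksum for the self-test log, so on that drive the logged result may not be trustworthy either.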