My Linux system has started throwing SMART errors in the syslog. I tracked it down and believe the problem is a single block on the disk. How do I go about easily getting the disk to reallocate that one block? I'd like to know what file got destroyed in the process. (I'm aware that if one block fails on a disk others are likely to follow; I have a good ongoing backup and just want to try to keep this disk working.)
Searching the web leads to the Bad block HOWTO, which describes a manual process on an unmounted disk. It seems complicated and error-prone. Is there a tool to automate this process in Linux? My only other option is the manufacturer's diagnostic tool, but I presume that'll clobber the bad block without any reporting on what got destroyed. Worst case, it might be filesystem metadata.
The disk in question holds the primary system partition. It uses ext3 and LVM. Here's the error log from syslog and the relevant bit from smartctl.
smartd[5226]: Device: /dev/hda, 1 Currently unreadable (pending) sectors
Error 1 occurred at disk power-on lifetime: 17449 hours (727 days + 1 hours)
... Error: UNC at LBA = 0x00d39eee = 13868782
There's a full smartctl dump on pastebin.
I used to write disk firmware for WD, and I once wrote the firmware which reassigned bad blocks.
First, most bad blocks are detected on reads, not writes. Writes are done blindly, meaning the data is written without being checked. Thus on a write if the media is bad, you won't know it until the host does a read to that sector. There is a small part of the sector (the sector header) which is read on writes to locate the correct sector, so that if there is an error in reading the sector header, the drive will reassign the sector and write it with the data received from the write command. But the vast majority of bad blocks are detected on reads, and just because a write succeeds to a sector doesn't mean the media is good or that the sector has been reassigned.
Now about bad block reassignment (also called reallocation). Yes, normally the drive will attempt to reassign a sector if the error is bad enough (i.e., the ECC failure is bad enough) but the drive still could recover the data after ECC correction. Usually this is done automatically. The only exception is that the host could have previously told the drive not to do automatic reallocations, but this is seldom done.
So what happens if the drive does a read and cannot recover the data? Nothing. The error is reported to the host, but no reassignment is done. The problem is that the drive could reassign the sector, but it doesn't have the slightest idea what data to write in the newly reassigned sector. If it just wrote a bunch of zeros, say, and then the sector was read again, it would return all the zeros without any indication that the data wasn't valid. This is essentially the same thing as data corruption. The drive can't count on the host keeping track of errors for a variety of reasons (for example, what if the drive was moved to a new host?), so the best course of action is to do nothing when the data can't be recovered.
Modern drives, however, will save the location of the bad sector when it can't be reallocated. The number of bad sectors awaiting reallocation can be found in the SMART data. If a write is then done to one of the bad sectors awaiting reallocation, the reallocation is carried out, because the drive now has valid data to write to the new location. Thus when people say writing to a bad sector will reallocate it, that's really only half the story. The drive must be read first so that it can discover all the bad sectors that can't be reallocated automatically. Thus you can write an entire drive, and the SMART data will say there are no bad sectors awaiting reallocation, but you haven't necessarily cleared the drive of all bad sectors. So if you really want to clear a drive of all bad sectors, the best thing is to read the entire drive first and then write the entire drive (of course, this will destroy all previous data on the drive).
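To make the read-then-write order concrete, here is a minimal Python sketch. It does not talk to real hardware: `FakeDisk` is a hypothetical in-memory stand-in whose behaviour (a read of a bad sector fails, a write to it clears the problem) is only an assumption modelled on the description above, and `find_unreadable_sectors` is an illustrative helper name, not part of any real tool.

```python
import io

class FakeDisk(io.BytesIO):
    """Hypothetical in-memory 'drive': reads of bad sectors raise OSError."""

    def __init__(self, total_sectors, bad_sectors, sector_size=512):
        super().__init__(bytes(total_sectors * sector_size))
        self.bad_sectors = set(bad_sectors)
        self.sector_size = sector_size

    def read(self, size=-1):
        if self.tell() // self.sector_size in self.bad_sectors:
            raise OSError("simulated unrecoverable read error")
        return super().read(size)

    def write(self, data):
        # Assumed model: a write gives the drive valid data, so it can remap.
        self.bad_sectors.discard(self.tell() // self.sector_size)
        return super().write(data)

def find_unreadable_sectors(dev, total_sectors, sector_size=512):
    """Read pass: try every sector once and collect the ones that fail."""
    bad = []
    for sector in range(total_sectors):
        try:
            dev.seek(sector * sector_size)
            dev.read(sector_size)
        except OSError:
            bad.append(sector)
    return bad

disk = FakeDisk(total_sectors=16, bad_sectors=[5, 11])
pending = find_unreadable_sectors(disk, 16)
print(pending)  # [5, 11]

# Write pass: overwrite each failed sector (this destroys its old contents).
for sector in pending:
    disk.seek(sector * 512)
    disk.write(bytes(512))

print(find_unreadable_sectors(disk, 16))  # []
```

On a real disk the read pass would also have to bypass the operating system's cache, and reading one sector at a time would be very slow, so treat this purely as an illustration of the ordering.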
There are other ways of dealing with bad blocks which can't be reallocated. If the drive is part of a redundant RAID configuration (i.e., anything but RAID 0), the RAID software should automatically recover the data for a bad sector from the other drives and write it to the reallocated sector. SCSI disks have an explicit reassign blocks command which the host can use to force the reassignment even when there is no valid data to write to the block, but its use is pretty low-level.
You could try
hdparm --write-sector <LBA> /dev/ice
I don't know of any other way of doing this; you need to manually convert the LBA into filesystem blocks (as you've already found).
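The manual conversion can be sketched roughly as follows, along the lines of the arithmetic in the Bad block HOWTO. This is only a sketch: the function name `lba_to_fs_block` is made up for illustration, the 512-byte sector size and 4096-byte filesystem block size are assumptions, and the partition start sector of 63 is a hypothetical value (the real one would come from something like `fdisk -lu`). It also ignores LVM, which adds another layer of offset mapping on top of the partition.

```python
SECTOR_SIZE = 512  # bytes per LBA sector (assumed)

def lba_to_fs_block(lba, partition_start, fs_block_size=4096):
    """Return the filesystem block that contains the given absolute LBA.

    lba             -- sector number reported in the SMART error log
    partition_start -- first sector of the partition holding the filesystem
    fs_block_size   -- filesystem block size in bytes (assumed 4096 here)
    """
    if lba < partition_start:
        raise ValueError("LBA lies before the start of the partition")
    return (lba - partition_start) * SECTOR_SIZE // fs_block_size

# LBA taken from the smartctl error quoted in the question.
bad_lba = int("0x00d39eee", 16)
print(bad_lba)  # 13868782

# partition_start=63 is a hypothetical value, for illustration only.
print(lba_to_fs_block(bad_lba, partition_start=63))  # 1733589
```

The resulting block number is what you would then feed to a filesystem tool such as debugfs to find out which file, if any, owns the block.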
I think all you have to do is:
assuming /dev/hda1 is the (unmounted) partition. Or:
to do a (slower) non-destructive read-write test. It will still have to be unmounted. I don't think this will give you details on any lost data, though.
Michael has it correct, and in most cases I would say just replace the drive; they are cheap. However, if you don't have backups and can't get important data off the drive, or you just want to attempt to repair the drive, then you may want to try SpinRite at its highest level.
I had a laptop drive that started making some noises a few years ago. Badblocks showed that the drive had 118 or so bad blocks visible to the end user. Since I already had a copy of SpinRite, I decided to give it a try before buying a new drive. After running SpinRite on the drive, badblocks showed 0 bad blocks and the noises stopped. The drive has been working for over two years since then.
If you have backups and you know this is a logical error and not a physical one, then the best way to go about this would be to zero out the disk.
I would use MHDD. It is fairly easy to use, and as long as you remember to set your HDD to IDE emulation in the BIOS, and back to AHCI when your work is done, you have nothing to worry about.
Once you boot into MHDD, pick your drive, type in the ERASE command, and confirm your choice.
Get yourself a coffee; this might take a while.
After the drive is zeroed out, run a scan (F4) with Remap set to ON (the default is off). If there are still issues with the drive (which would mean that there is physical damage on the platter and the drive is on a steady downward slope), this option will "fix" them by mapping the damaged areas to healthy parts of the drive.
If there are no UNC errors, then congratulations: you and your drive can still be friends for years to come.
If the disk is going bad, replace it. It's not worth the risk that it will deteriorate further.