I have a USB hard drive attached to a Debian GNU/Linux server. I'm trying to format it (NTFS), with this command:
# mkntfs -v /dev/sdd1
which takes a few hours, because it checks the disk too. While checking, dmesg -T shows the following:
[Wed Jan 12 15:22:53 2022] sd 9:0:0:0: [sdd] Attached SCSI disk
[Wed Jan 12 18:03:26 2022] usb 1-4: USB disconnect, device number 5
[Wed Jan 12 18:03:26 2022] blk_update_request: I/O error, dev sdd, sector 621745808 op 0x1:(WRITE) flags 0x104000 phys_seg 240 prio class 0
[Wed Jan 12 18:03:26 2022] Buffer I/O error on dev sdd1, logical block 621743760, lost async page write
[Wed Jan 12 18:03:26 2022] Buffer I/O error on dev sdd1, logical block 621743761, lost async page write
(...and so on for a few lines, then)
[Wed Jan 12 18:03:26 2022] blk_update_request: I/O error, dev sdd, sector 621746048 op 0x1:(WRITE) flags 0x104000 phys_seg 240 prio class 0
[Wed Jan 12 18:03:26 2022] blk_update_request: I/O error, dev sdd, sector 621746288 op 0x1:(WRITE) flags 0x100000 phys_seg 8 prio class 0
[Wed Jan 12 18:03:26 2022] blk_update_request: I/O error, dev sdd, sector 621746296 op 0x1:(WRITE) flags 0x800 phys_seg 16 prio class 0
(...and so on for a few lines, then)
[Wed Jan 12 18:03:31 2022] buffer_io_error: 9015384 callbacks suppressed
(...other errors...)
Looking at the sheer number of error messages, I'd say the HDD is almost dead, but when I attach it to a Windows PC it seems to work. Moreover, the first error (usb 1-4: USB disconnect, device number 5), which appears in dmesg before the other errors, makes me suspect that the problem is not the HDD itself but something else that causes the HDD to disconnect in the first place; the subsequent errors could simply be the consequence of the disconnect.
However, I'm not very experienced in reading dmesg output, so it's quite possible that I'm reading it wrong.
EDIT: as requested by NiKiZe, here is the output of smartctl -a /dev/sdd:
# smartctl -a /dev/sdd
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-5.10.0-3-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Blue Mobile
Device Model: WDC WD10SPCX-00KHST0
Serial Number: WD-WXF1A95F0J3X
LU WWN Device Id: 5 0014ee 65b7e0332
Firmware Version: 01.01A01
User Capacity: 1.000.204.886.016 bytes [1,00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 1.5 Gb/s)
Local Time is: Thu Jan 13 11:04:19 2022 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART Status not supported: Incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (16080) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 184) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x7035) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   190   184   021    Pre-fail  Always       -       1500
  4 Start_Stop_Count        0x0032   081   081   000    Old_age   Always       -       19048
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   073   073   000    Old_age   Always       -       20415
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       188
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       43
193 Load_Cycle_Count        0x0032   187   187   000    Old_age   Always       -       41054
194 Temperature_Celsius     0x0022   119   095   000    Old_age   Always       -       28
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Assuming this HDD is really failing and that those error messages in dmesg are there for real bad sectors, why does dmesg show a disconnect before the bad-sector messages rather than after?
The disconnect happens first, then you get a mix of errors caused by the disconnect, where various layers that generate disk writes notice that something is wrong.
Most I/O takes the form of a "dirty" memory page with a disk backing, which the kernel eventually gets around to flushing to disk, long after the write has been acknowledged to the application. By then, the only driver layers still interested in the request are the memory subsystem (which wants to know whether the page can be freed) and the file system (which needs to generate more writes to update metadata after the data has been written); anything above that has been dismantled already.
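As a rough illustration of this write-back behaviour, the kernel reports in /proc/meminfo how much data is currently dirty or in the middle of being written back. The sketch below is not part of the original answer: the function name pending_writeback is made up for illustration, and it assumes a Linux system with awk available.

```shell
# Print the Dirty and Writeback counters from a meminfo-style file.
# With no argument it reads /proc/meminfo, the kernel's own report.
# The function name is hypothetical, chosen only for this example.
pending_writeback() {
    awk '$1 == "Dirty:" || $1 == "Writeback:" { print $1, $2, $3 }' \
        "${1:-/proc/meminfo}"
}
```

Called as pending_writeback while a large write such as the format is in progress, it would be expected to show a non-trivial Dirty value: that is the backlog of already-acknowledged writes that can only fail later if the device goes away.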
Some requests do not go through the memory subsystem (e.g. the SMART status query), and some, like reads, are likely attached only to the memory subsystem, because the file system has already looked up the block mapping and no longer needs to be involved.
So the error messages you get as a result of the disconnection have no particular order: they appear as requests are generated, submitted and rejected because the device is offline. If there is a causal relationship between some of the requests (like a journal entry that reflects the new state after the aborted data transfer; after all, the journal could be on a different device), then that defines a bit of an order.
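One way to check this reading against a log is to find the first disconnect line and count the I/O errors on either side of it. This is only a minimal sketch, not something from the original answer: disconnect_summary is a hypothetical helper, and its patterns simply match the message texts quoted in the question.

```shell
# Summarize a kernel log read from stdin: print the first "USB disconnect"
# line, then count the I/O error lines logged before and after it.
# Intended use (assumption): dmesg -T | disconnect_summary
disconnect_summary() {
    awk '
        # Remember and print only the first disconnect message.
        /USB disconnect/ && !seen { seen = 1; print "first disconnect: " $0; next }
        # Matches both "blk_update_request: I/O error" and "Buffer I/O error".
        /I\/O error/ { if (seen) after++; else before++ }
        END {
            printf "I/O errors before disconnect: %d\n", before
            printf "I/O errors after disconnect: %d\n", after
        }
    '
}
```

If every I/O error is counted after the disconnect, as in the excerpt from the question, that is consistent with the errors being a consequence of the device going offline. Errors logged before any disconnect would be more consistent with a genuine media problem.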
I'd probably try to use the disk on a proper SATA port, and with power from a good supply -- the voltage converters in USB-to-SATA enclosures are often a bit dodgy.
Please consider the drive dead.
Normally it doesn't take that long to format a drive. However, the retries and resets on bad blocks do take time.
Sure, this could be due to a bad power supply or something else.
Not all USB devices support SMART passthrough, but do try:
smartctl -a /dev/sdd
Look especially for Reallocated_Sector_Ct, Reallocated_Event_Count, Current_Pending_Sector and Offline_Uncorrectable. If any of them has a RAW_VALUE other than 0, I would try to avoid using the disk for anything other than unimportant data.
If there is no data on the disk, you could always try wiping it, using something like this (be careful here, this overwrites the whole disk!):
dd if=/dev/zero of=/dev/sdd bs=4M status=progress
You will need to recreate the partition table afterwards.
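The attribute check described above can be sketched as a small filter over the smartctl attribute table. This is an illustration under assumptions, not part of the original answer: nonzero_smart_attrs is a made-up name, and the script assumes the ten-column table layout shown in the question, with RAW_VALUE in the tenth field.

```shell
# Read a smartctl attribute table on stdin and print the four
# sector-health attributes whose RAW_VALUE (field 10) is not 0.
# Intended use (assumption): smartctl -A /dev/sdd | nonzero_smart_attrs
nonzero_smart_attrs() {
    awk '
        $2 == "Reallocated_Sector_Ct" ||
        $2 == "Reallocated_Event_Count" ||
        $2 == "Current_Pending_Sector" ||
        $2 == "Offline_Uncorrectable" {
            # "+ 0" forces a numeric comparison of the raw value.
            if ($10 + 0 != 0) print $2 ": " $10
        }
    '
}
```

No output means all four raw values are 0, which is the case for the attribute table posted in the question.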