Ping a Specific Port

Question

SamK

Asked: 2010-02-19 03:21:44 +0800 CST2010-02-19 03:21:44 +0800 CST 2010-02-19 03:21:44 +0800 CST

SMART warns me but I don't trust it

772

I've got a server with four Samsung hard drives. All drives are the same model and have been bought together. The drives are SAMSUNG HE753LJ with firmware 1AA01113.

I'm getting SMART errors but I have the feeling that smartctl does not understand the value he gets from the hard drive.

Here's the result of a SMART test:

asgard:~# smartctl -H /dev/sdb
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
Failed Attributes:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0007   001   001   011    Pre-fail  Always   FAILING_NOW 60340

I don't trust SMART because:

It's been over one year that all disks are about to fail within less than 24 hours. Nothing blew up yet.
Wikipedia says that "Spin-Up Time is the average time of spindle spin up (from zero RPM to fully operational [millisecs])." That would mean that the drives need about one minute to wake up?!

I would like to follow smartctl's advice and change these disks but I just don't trust the results I read.

What do you think about this? What would you do?

Thanks for your help.

6 Answers

Voted

Martin Bøgelund · Answer 1 · 2010-02-19T04:12:07+08:00

Best Answer

Martin Bøgelund

2010-02-19T04:12:07+08:002010-02-19T04:12:07+08:00

All drives are the same model and have been bought together.

This is a ticking bomb.

Based on both the message from SMART and the quote above, you should change disks right away.

Since the drives have been bought together and are the same model, they will probably have the same weaknesses, and probably all fail simultaneously under the same condition...

The main concept of RAID is that disks fail at different times, giving you the opportunity to swap one disk at a time, and avoid data loss.

Others have reported simultaneous failure of an entire array of identical disks in a RAID configuration, coming from the same production batch, and thus being subject to the same weakness.

I can't stress this enough: You need to start swapping your drives!

7

Andy · Answer 2 · 2010-02-19T04:51:31+08:00

I had a spare drive that I can still boot from that fails SMART checks every boot and requires a soft reset, has for years, but it's just a dump, not a system disk! So although SMART errors can persist for a long time they should ALWAYS be heeded in production, as the risks heavily outweigh the cost, time and data integrity benefits. Google studied 100,00 disks and found:

SMART data (Self-Monitoring, Analysis and Reporting Technology) can be useful in determining whether a drive is going to fail. Up to 30 percent of drives that indicated SMART errors eventually failed, and the probability of crash gets worse and worse the longer an "erroring" drive is left in service. That said, many drives exhibit SMART errors at some point in their lives.

So it's not always a robust indicator. However SMART error significantly increase the likelihood of a disk crash in the time immediately after initial detection:

Work at Google on over 100,000 drives has shown little overall predictive value of S.M.A.R.T. status as a whole, but suggests that certain sub-categories of information which some S.M.A.R.T. implementations track do correlate with actual failure rates – specifically, in the 60 days following the first scan error on a drive, the drive is, on average, 39 times more likely to fail than it would have been had no such error occurred.

So statistically your disk is probably OK, as it's well exceeded the 60 day limit.

Despite those strong correlations, we find that failure prediction models based on SMART parameters alone are likely to be severely limited in their prediction accuracy, given that a large fraction of our failed drives have shown no SMART error signals whatsoever

But are you willing to continue taking the risk? I'd change the disk ASAP to avoid having to get up in the early hours.

David Spillett · Answer 3 · 2010-02-19T03:53:14+08:00

David Spillett

2010-02-19T03:53:14+08:002010-02-19T03:53:14+08:00

SMART overall-health self-assessment test result: FAILED!

That part is not interpreted by smartctl (assuming I understand correctly, of course) - that drive has told smartctl that is isn't happy with its current state (for whatever reason) and smartctl is just echoing that warning to you. Even if it is misinterpreting the spin-up time reading, I don't think it is doing any interpretation on the "self assessment test" reading.

I would suggest moving your data off that drive ASAP, preferably before it next power cycles in case the spin-up problem is real and might get worse.

6

François Feugeas · Answer 4 · 2010-02-19T03:36:25+08:00

François Feugeas

2010-02-19T03:36:25+08:002010-02-19T03:36:25+08:00

I would change the disks right away without thinking much about it. You'd be on the safe side, disks are dirt cheap and you'll sleep better. Your time spent diagnosing the disks is probably worth more than the disks themselves.

2

jeffatrackaid · Answer 5 · 2010-02-19T05:40:36+08:00

jeffatrackaid

2010-02-19T05:40:36+08:002010-02-19T05:40:36+08:00

Make sure you have the latest copy of the smart utils not just the ones included in your OS. smart utils are updated frequently and some of the errors reporting from specific drives to get resolved.

Google's study was very informative. 30% of disk with SMART errors eventually fail. That's not odds I would be will to deal with. That's a 9% chance that two disks will fail and your RAID at that point will be destroyed.

I had a similar issues with some Seagate drives a few years ago. We bought about 8 disks at the same time and they all were from the same lot. About at about 3 years, one drive went. 18 hours later another drive went, 24 hours later a 3rd drive went.

2

dyasny · Answer 6 · 2010-02-19T03:32:25+08:00

dyasny

2010-02-19T03:32:25+08:002010-02-19T03:32:25+08:00

Run a DST on the disks, and replace them accordingly.

1

SMART warns me but I don't trust it

Ping a Specific Port

How do I tell Git for Windows where to find my private RSA key?

How do you restart php-fpm?

What's the default superuser username/password for postgres after a new install?

What port does SFTP use?

Resolve host name from IP address

How can I sort du -h output by size

Command line to list users in a Windows Active Directory group?

What is a Pem file and how does it differ from other OpenSSL Generated Key File Formats?

How to determine if a bash variable is empty?