We have a 64GB SSD in a tower server hosted at a local colocation company. The drive and the entire system were built about six months ago from brand-new parts.
Until this weekend the SSD and the system were working perfectly. We're running CentOS 6.2.
After booting normally, the system can be used for about 20-30 minutes (the timing is not consistent) before the drive starts acting up.
Libraries start failing to load and ssh starts denying public key logins. Shutdown reports "input/output error". Some programs start reporting that the drive is read-only.
Only 25GB of the 64GB are used.
I can't find any errors that indicate what happened. I tried running fsck from a live cd on the drive and it showed no problems and most of the time boot works fine. There was one boot that said "couldn't find os" but that's not happening anymore.
Where can I look to find logs about what happens? Are there any other disk checks I should do? It seems like a repairable problem, and not that I need a new drive.
Update:
I enabled SMART after rebooting the server. After 1 hour of uptime and normal system operation (running services are httpd, mysql, but very little to no traffic), suddenly things just stop working. During the hour of uptime it responded with a PASS for the smart health check. After the hour I tried it again (through webmin) and it now says SMART is disabled.
The drive is now showing the same issues I've seen before: most commands return "input/output error".
Running a smart health check now shows:
Log Sense failed, IE page [scsi response fails sanity test]
What can I do to figure out what's causing this to fail after a random period of time? It runs perfectly for 30-60 minutes and then it starts acting odd like this.
Update 2
Some people asked me to check dmesg, and this was the result: http://www.pastie.org/private/hk7jfhxilj7ypy828irna. Someone else recommended that I not assume it's the drive, since it could be the drive controller. I don't understand how to determine whether the error is in the controller or the drive, aside from trying a different drive. If I have to buy a replacement motherboard or drive, I need to know which one is failing first.
Running fsck shows:
fsck from util-linux-ng 2.17.2
e2fsck 1.41.12 (17-May-2010)
fsck.ext4: Superblock invalid, trying backup blocks...
fsck.ext4: Bad magic number in super-block while trying to open /dev/mapper/vg_192-lv_root
The superblock could not be read or does not describe a correct ext2
filesystem. If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
e2fsck -b 8193 <device>
SSDs are notoriously fragile. Jeff Atwood outlines some failure rates here. They will fail without any warning and turn your data into a distant memory.
Looks like it's time to RMA and restore from backup. It shouldn't be a problem though, because you're not running a production server on a single, non-RAIDed disk, right? And you definitely have recent backups you can use to get back on your feet, right?
Right?
If your drive has SMART statistics (and it is almost guaranteed to have them), use a SMART utility to pull all of the available messages and statistics. The answer likely lies there, or at least some hint as to where to look next.
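As a rough sketch of what that could look like with smartctl from smartmontools: the device path /dev/sda, the helper name smart_report, and the SMARTCTL override (there only so the commands can be dry-run) are assumptions for illustration, not anything taken from the question.

```shell
#!/bin/sh
# Sketch: gather SMART data for one drive.
# /dev/sda is an assumed default; pass your own device as the first argument.
smart_report() {
    dev="${1:-/dev/sda}"
    ctl="${SMARTCTL:-smartctl}"   # hypothetical override, useful for a dry run

    "$ctl" -s on "$dev"      # enable SMART if it is currently disabled
    "$ctl" -H "$dev"         # overall health self-assessment
    "$ctl" -A "$dev"         # vendor-specific attribute table
    "$ctl" -l error "$dev"   # the drive's own error log
    "$ctl" -i "$dev"         # identity information, including firmware version
}
```

Run it as root, for example `smart_report /dev/sda`, once shortly after boot and again after the errors start, then compare the two outputs. If the second run cannot talk to the drive at all, that is itself a useful data point.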
EDIT
Consider that you might be misdirecting your suspicions. Your drive controller could be part of the problem. Look into what metrics it collects as well as what logs it creates. Keep it in the circle of suspects for now. Everything in IT is guilty until proven innocent.
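One place the controller side shows up is the kernel log. The following is a minimal sketch of a filter for it; the helper name disk_errors and the pattern list are my own assumptions, and the list is a starting point rather than an exhaustive one.

```shell
#!/bin/sh
# Sketch: keep only kernel log lines that are usually relevant to disk trouble.
# Reads log text on stdin. Matches libata port messages (ataN: / ataN.NN:),
# block-layer I/O errors, ext4 errors, and read-only remounts.
disk_errors() {
    grep -Ei 'ata[0-9]+(\.[0-9]+)?:|I/O error|EXT4-fs error|Remounting filesystem read-only'
}
```

Typical use would be `dmesg | disk_errors`, or `disk_errors < /var/log/messages` to look back past the last reboot. The filter also lets through harmless ataN lines from boot-time link setup, so read the results in order. It only narrows down what to read; it does not decide between drive and controller on its own.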
I had exactly the same fault on my home PC running an ext4 filesystem on a 64GB Crucial/Micron M4 SSD. I ran smartctl -a on the device and it passed all tests satisfactorily. I then booted the machine from a SystemRescueCd and reran smartctl, which flagged the old firmware, version 0009, as known to cause issues and pointed to the fix. My firmware is now at release 070H and the problems have vanished. So the solution in my case was to visit the Crucial website and download a small bootable ISO image to update the SSD firmware. No more input/output errors.
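To check which firmware a drive is running before going hunting for an update, something like the sketch below may help. It assumes an ATA drive, for which `smartctl -i` prints a "Firmware Version:" line; the helper name firmware_version and the device path in the usage note are illustrative.

```shell
#!/bin/sh
# Sketch: extract the firmware revision from `smartctl -i` output on stdin.
# Assumes the ATA-style "Firmware Version: <value>" line is present.
firmware_version() {
    awk -F': *' '/^Firmware Version:/ { print $2; exit }'
}
```

For example, `smartctl -i /dev/sda | firmware_version` prints just the revision string, which can then be compared against the vendor's release notes.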