One of our servers has been restarting at random times. We asked the datacenter to run a hardware test, and they said one of the SSDs has likely failed. Could that be the cause of the random restarts? Here is their report:
We have completed testing of the system, and the SSD is showing signs of failure, as shown by the following:
```
Device Model:  Samsung SSD 840 EVO 500GB
Serial Number: S1DHNSAF218733W

ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033 098   098   010    Pre-fail Always  -           135
  9 Power_On_Hours          0x0032 087   087   000    Old_age  Always  -           62573
 12 Power_Cycle_Count       0x0032 099   099   000    Old_age  Always  -           109
177 Wear_Leveling_Count     0x0013 001   001   000    Pre-fail Always  -           1806
179 Used_Rsvd_Blk_Cnt_Tot   0x0013 098   098   010    Pre-fail Always  -           135
181 Program_Fail_Cnt_Total  0x0032 100   100   010    Old_age  Always  -           0
182 Erase_Fail_Count_Total  0x0032 100   100   010    Old_age  Always  -           0
183 Runtime_Bad_Block       0x0013 098   098   010    Pre-fail Always  -           135
187 Reported_Uncorrect      0x0032 100   100   000    Old_age  Always  -           0
190 Airflow_Temperature_Cel 0x0032 065   052   000    Old_age  Always  -           35
195 Hardware_ECC_Recovered  0x001a 200   200   000    Old_age  Always  -           0
199 UDMA_CRC_Error_Count    0x003e 100   100   000    Old_age  Always  -           0
235 Unknown_Attribute       0x0012 099   099   000    Old_age  Always  -           107
241 Total_LBAs_Written      0x0032 099   099   000    Old_age  Always  -           269296231666
```

This is most likely what is causing the reboots of the system.
Yes, a hardware failure can cause reboots; a failing disk is critical and must be fixed.
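The table the datacenter sent is standard smartmontools output, so you can double-check it yourself directly on the host. A minimal sketch, assuming a Linux system and that the SSD is /dev/sda (adjust the device path to match your setup):

```
# Print the full SMART report, including the attribute table quoted above
smartctl -a /dev/sda

# Optionally run an extended self-test, then read back the result
smartctl -t long /dev/sda
smartctl -l selftest /dev/sda
```

The numbers themselves support the diagnosis: Wear_Leveling_Count has a normalized value of 001, meaning the NAND is essentially at the end of its rated program/erase cycles, and 135 reallocated sectors plus 135 runtime bad blocks after roughly seven years of power-on time (62,573 hours) point the same way.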
You didn't mention how the SSD is operated. If you have several SSDs and run them in a redundant RAID level (RAID 1 or higher), one failed drive would not cause the server to reboot: it would hurt performance, but not stability.
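A quick way to check whether the drive is part of a redundant array, assuming Linux software RAID (hardware RAID controllers and ZFS have their own tools; /dev/md0 is just an example array name):

```
# Linux software RAID (md): shows array state and any failed members
cat /proc/mdstat
mdadm --detail /dev/md0

# If the pool is on ZFS instead
zpool status
```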
In any case, once you recognize a failing device, it is a good idea to replace it. The failure can be critical on its own, and even if the drive is redundant, running with reduced redundancy moves you one step closer to total failure.
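If the disk does turn out to be a member of an md array, the replacement usually follows this pattern. This is only a sketch: the device names (/dev/md0, /dev/sda1, /dev/sdb) are assumptions, and you still need to pull the correct physical drive:

```
# Mark the failing disk as faulty and remove it from the array
mdadm --manage /dev/md0 --fail /dev/sda1
mdadm --manage /dev/md0 --remove /dev/sda1

# After physically swapping the drive, copy the partition layout from a
# healthy member (assumed here to be /dev/sdb) and re-add the new disk;
# the array then rebuilds automatically
sfdisk -d /dev/sdb | sfdisk /dev/sda
mdadm --manage /dev/md0 --add /dev/sda1

# Watch the rebuild progress
cat /proc/mdstat
```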