Intro
Recently one of my systems using ext4 on LVM on hardware RAID6 experienced a disastrous failure. To name some of the damage: several filesystems failed beyond repair, and at least one was lost beyond any hope of recovery. It really took me by surprise! I considered this setup resilient enough to withstand even worse, and for 7 years of replacing failed hard drives without a glitch it proved to be so. But now it has failed. Hard.
Having inspected all the information regarding this failure, I could not come to a conclusion on what failed and, most importantly, why. I hope that someone more experienced might identify the cause. All of the information I gathered follows.
Course of events
Approximately at 10:45 local time the power supply to our building stopped as a result of a district power grid failure. After approximately 40 minutes on battery power it was decided to shut the servers down, as there was no ETA on power restoration. It was definitely a clean shutdown, since at that moment we still had another 20 minutes before the batteries drained. These are not mission-critical servers, so it is acceptable for them to be offline while waiting for power to be restored. Later the servers came back online. This particular server had one failed hard drive upon starting. With RAID6 I did not worry, as it takes two more drives to fail, which is highly unlikely to happen before I get a replacement. As I always do, I removed the failed hard drive and let the server proceed with a degraded RAID6. Some time later I accidentally discovered that LDAP could not start because its data folder was lacking the +r permission. That was weird, but easily fixed. Just to be sure I checked the other services, and they were fine.
Later that server was rebooted once as part of maintenance, and a RAID consistency check was started just in case. At 18:00 local time a colleague did some work on one of our services on that server and assured me that it was fully operational. At around 21:00 local time he messaged me that this particular service had suddenly lost its styles and wouldn't let him in. At first I thought it was the same mysterious loss of the +r permission, but I found that the static folder was missing completely. I decided it was still a trivial fix and postponed it until the next morning, completely unaware of the events unfolding. The next morning came and I was facing an absolute disaster in place of the trivial fix I had imagined the previous evening (a hard to express feeling, I must say).
It is worth noting that the other servers survived this event completely undisturbed: one with ext4 on LVM on software RAID10, the other using bare hard drives.
Inspecting the logs, I found that at approximately 20:05 PostgreSQL started failing to write its data with "file disappeared"-type errors. At 20:00, however, our backups are scheduled, with a heavy I/O load obviously. The backups also failed with the same errors. Soon ext4_lookup errors started spamming syslog and all the other services began to fail. In the end the service in question replied with a page of random gibberish, and the style-less page my colleague saw was just a cached copy served by the browser on the client side.
Failure mode
The RAID6 reported being degraded but never reported as failed, nor did the consistency check report any errors. So from the hardware perspective it was sub-optimal but in no way failed.
As I have already pointed out above, the filesystems failed in a dire way. Interestingly, only the ext4 filesystems on LVM on the hardware RAID6 were affected; other filesystems outside the RAID6 (on SSD) were intact. The problems identified on the failed filesystems were:
- Some directories and files became special files (sockets and device files)
- Some files became directories (a number of files within one directory became directories nested within each other) and, vice versa, some directories became ordinary files
- Some directories became parents and children of each other at the same time
- Some files were marked both in-use and deleted
- Some files had blocks swapped among each other (I saw a syslog file composed of lines belonging to apache and kernel logs, interwoven with binary data apparently not belonging there)
- Some in-use inodes had references to block numbers beyond the maximum block number
- Some in-use inodes referenced the same block number (or range of block numbers) many times over
- Some in-use inodes had absolutely random data in their metadata (dates, UID/GID, etc.)
Some of these bullet points might be the result of my attempts to fix the filesystems with e2fsck, while others are obviously the corruption itself. Inspecting inodes, I found unlucky ones filled with random metadata sitting right next to perfectly valid ones. I could find no pattern in which files were corrupted: whether they were new or 10 years old, in use or untouched for a long time, every kind of file suffered, and conversely there were survivors of every kind as well (a rough triage sketch of this kind of metadata scan is shown below).
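For completeness, here is a minimal triage sketch along the lines of what I describe above: it walks a read-only mounted copy of a damaged filesystem and flags entries whose metadata looks implausible. The mount point and the thresholds are illustrative assumptions only, not values from my system, and this is no substitute for e2fsck or debugfs.

```python
#!/usr/bin/env python3
"""Rough triage sketch: walk a read-only mounted copy of a filesystem and flag
entries whose metadata looks like random garbage. The mount point and the
thresholds below are illustrative assumptions, not values from the affected server."""

import os
import stat
import sys
import time

ROOT = sys.argv[1] if len(sys.argv) > 1 else "/mnt/recovery"   # hypothetical mount point
NOW = time.time()
MAX_SANE_ID = 65534            # assumption: no UID/GID above 'nobody' is expected here
TEN_YEARS = 10 * 365 * 86400   # assumption: nothing should be dated outside this window

def suspicious(st):
    """Return a list of reasons why this entry's metadata looks implausible."""
    reasons = []
    if st.st_uid > MAX_SANE_ID or st.st_gid > MAX_SANE_ID:
        reasons.append(f"odd uid/gid {st.st_uid}/{st.st_gid}")
    for label, ts in (("mtime", st.st_mtime), ("ctime", st.st_ctime)):
        if ts > NOW + 86400 or ts < NOW - TEN_YEARS:
            reasons.append(f"{label} out of range: {time.ctime(ts)}")
    mode = stat.S_IFMT(st.st_mode)
    if mode in (stat.S_IFSOCK, stat.S_IFCHR, stat.S_IFBLK, stat.S_IFIFO):
        reasons.append("special file where a regular file or directory was expected")
    return reasons

for dirpath, dirnames, filenames in os.walk(ROOT, onerror=lambda e: None):
    for name in dirnames + filenames:
        path = os.path.join(dirpath, name)
        try:
            st = os.lstat(path)
        except OSError as exc:             # unreadable entries are themselves suspicious
            print(f"{path}: lstat failed: {exc}")
            continue
        for reason in suspicious(st):
            print(f"{path}: {reason}")
```

Running something like this against an image mounted read-only avoids touching the damaged volume any further while still giving a quick list of candidates to inspect by hand.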
Given the list of filesystem failures above, it follows that some libraries got mixed up, which is why apache was serving complete gibberish from our services. I could not recover the PostgreSQL data, as its files were a complete mess. Luckily, I could repair the MySQL data, as only its system metadata was corrupted, which was relatively easy to recreate. It was also lucky that in the LDAP data directory only the accesslog was corrupted, which was easily recreated. The dpkg list of installed packages got mixed with other binary data. And so on and so forth...
Performance counter graphs also stop at approximately 20:05, but show no unexpected activity apart from the usual I/O rise from file operations.
Additional steps taken
- Running memtest on the server's RAM found no bad memory
- I could not find any way to test the RAID controller's memory
- The failed hard drive was replaced, and after a complete rebuild the consistency check was run once again, indicating no errors from the hardware perspective
- I managed to extract a low-level log from the RAID controller, which showed no warnings or errors during those two days
- Reinstalled all packages to overwrite any corrupted data (a verification sketch follows this list)
- Inspected the system for suspicious processes, but found nothing unusual
- Several weeks have passed and I'm still trying to fix the corrupted filesystems
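On the package reinstallation point above: one way to double-check that no corrupted package files survived is to verify installed files against the checksums dpkg keeps under /var/lib/dpkg/info/*.md5sums (essentially what debsums -c or dpkg --verify do, and those tools are the more thorough option). A minimal sketch of that check, assuming a Debian-style layout, follows:

```python
#!/usr/bin/env python3
"""Sketch: verify installed package files against the md5sums dpkg records in
/var/lib/dpkg/info/*.md5sums. Shown only to illustrate the check; the paths
assume a Debian-style layout, and debsums/dpkg --verify are the real tools."""

import glob
import hashlib
import os

INFO_DIR = "/var/lib/dpkg/info"

def md5_of(path, bufsize=1 << 20):
    """Compute the md5 of a file in chunks to keep memory use low."""
    h = hashlib.md5()
    with open(path, "rb") as fh:
        while chunk := fh.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

for sums_file in glob.glob(os.path.join(INFO_DIR, "*.md5sums")):
    package = os.path.basename(sums_file).rsplit(".md5sums", 1)[0]
    with open(sums_file) as fh:
        for line in fh:
            try:
                expected, relpath = line.split(None, 1)
            except ValueError:
                continue                       # skip malformed lines
            path = "/" + relpath.strip()       # paths in md5sums are relative to /
            if not os.path.isfile(path):
                print(f"{package}: MISSING {path}")
                continue
            try:
                if md5_of(path) != expected:
                    print(f"{package}: CHECKSUM MISMATCH {path}")
            except OSError as exc:             # an unreadable file is also a red flag
                print(f"{package}: UNREADABLE {path} ({exc})")
```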
Conclusion
I did suspect a virus, of course, but the performance counters didn't catch any unusual activity at all, and no evidence of one was found while inspecting the system.
Given that there is no correlation with file age, I believe it was not an ext4 journal failure. Considering the huge amount of corrupted data and metadata, I think it is unlikely that the server RAM or the RAID controller RAM caused this, as the amount of corrupted data is orders of magnitude larger than both RAMs combined. Even the randomly filled inodes are not confined to some specific region but are spread all over the volumes. All this leads me to conclude that something must have made the RAID controller write all this random data to the hard drives, given that it insists the drives read back valid data (the RAID checksums would have detected errors otherwise), but I cannot imagine any way this could have happened.
Thus I'm asking anyone experienced to analyse the information provided and, if possible, identify the fault and its cause, or advise me on any steps I have not yet taken that could lead to identifying them.
PS: Please note that this question is not about making backups.