Intro
Recently one of my systems using ext4 on LVM on hardware RAID6 experienced a disastrous failure. To name some of the damage: several filesystems failed beyond repair, and at least one was lost beyond any hope of recovery. It really took me by surprise! I considered this setup resilient enough to withstand even worse, and for 7 years of replacing failed hard drives without a glitch it proved to be so. But now it has failed. Hard.
Having inspected all the information regarding this failure, I could not come to a conclusion on what failed and, most importantly, why. I hope that someone more experienced might identify the cause. All of the information I gathered follows.
Course of events
Approximately at 10:45 local time the power supply to our building stopped as a result of a district power grid failure. After approximately 40 minutes on battery power it was decided to shut the servers down, as there was no ETA on power restoration. It was definitely a clean shutdown, since at that moment we still had another 20 minutes before the batteries drained. These are not mission-critical servers, so it is acceptable for them to be offline while waiting for power to be restored. Later the servers came back online. This particular server had one failed hard drive upon starting. With RAID6 I did not worry, as it takes two more drives to fail, which is highly unlikely to happen before I get a replacement. As I always do, I removed the failed hard drive and let the server proceed with a degraded RAID6. Some time later I accidentally discovered that LDAP could not start because its data folder was lacking the +r permission. That was weird, but easily fixed. Just to be sure I checked the other services, and they were fine.
Later that server was rebooted once as part of maintenance, and a RAID consistency check was started just in case. At 18:00 local time a colleague did some work on one of our services on that server and assured me that it was fully operational. At around 21:00 local time he messaged me that this particular service had suddenly lost its styles and wouldn't let him in. At first I thought it was the same mysterious loss of the +r permission, but I found that the static folder was missing completely. I decided it was still a trivial fix and postponed it until the next morning, completely unaware of the events unfolding. The next morning came and I was facing an absolute disaster in place of the trivial fix I had imagined the previous evening (a hard to express feeling, I must say).
It is worth noting that the other servers survived this event completely undisturbed: one with ext4 on LVM on software RAID10, the other using bare hard drives.
Inspecting the logs, I found that at approximately 20:05 PostgreSQL started failing to write its data with "file disappeared"-type errors. At 20:00, however, our backups are scheduled, with a heavy I/O load obviously. The backups also failed with the same errors. Soon ext4_lookup errors started spamming syslog and all the other services began to fail. In the end the service in question replied with a page of random gibberish, and the style-less page my colleague saw was just a cached copy served by the browser on the client side.
Failure mode
The RAID6 reported being degraded but never reported as failed, nor did the consistency check report any errors. So from the hardware perspective it was sub-optimal but in no way failed.
As I have already pointed out above, the filesystems failed in a dire way. Interestingly, only the ext4 filesystems on LVM on the hardware RAID6 were affected; other filesystems outside the RAID6 (on SSD) were intact. The problems identified on the failed filesystems were:
- Some directories and files became special files (sockets and device files)
- Some files became directories (a number of files within one directory became directories nested within each other) and, vice versa, some directories became ordinary files
- Some directories became parents and children of each other at the same time
- Some files were marked both in-use and deleted
- Some files had blocks swapped among each other (I saw a syslog file composed of lines belonging to apache and kernel logs, interwoven with binary data apparently not belonging there)
- Some in-use inodes had references to block numbers beyond the maximum block number
- Some in-use inodes referenced the same block number (or range of block numbers) many times over
- Some in-use inodes had absolutely random data in their metadata (dates, UID/GID, etc.)
Some of these bullet points might be the result of my attempts to fix the filesystems with e2fsck, while others are obviously the corruption itself. Inspecting inodes, I found unlucky ones filled with random metadata sitting right next to perfectly valid ones. I could find no pattern in which files were corrupted: whether they were new or 10 years old, in use or untouched for a long time, every kind of file suffered, and conversely there were survivors of every kind as well (a rough triage sketch of this kind of metadata scan is shown below).
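For completeness, here is a minimal triage sketch along the lines of what I describe above: it walks a read-only mounted copy of a damaged filesystem and flags entries whose metadata looks implausible. The mount point and the thresholds are illustrative assumptions only, not values from my system, and this is no substitute for e2fsck or debugfs.

```python
#!/usr/bin/env python3
"""Rough triage sketch: walk a read-only mounted copy of a filesystem and flag
entries whose metadata looks like random garbage. The mount point and the
thresholds below are illustrative assumptions, not values from the affected server."""

import os
import stat
import sys
import time

ROOT = sys.argv[1] if len(sys.argv) > 1 else "/mnt/recovery"   # hypothetical mount point
NOW = time.time()
MAX_SANE_ID = 65534            # assumption: no UID/GID above 'nobody' is expected here
TEN_YEARS = 10 * 365 * 86400   # assumption: nothing should be dated outside this window

def suspicious(st):
    """Return a list of reasons why this entry's metadata looks implausible."""
    reasons = []
    if st.st_uid > MAX_SANE_ID or st.st_gid > MAX_SANE_ID:
        reasons.append(f"odd uid/gid {st.st_uid}/{st.st_gid}")
    for label, ts in (("mtime", st.st_mtime), ("ctime", st.st_ctime)):
        if ts > NOW + 86400 or ts < NOW - TEN_YEARS:
            reasons.append(f"{label} out of range: {time.ctime(ts)}")
    mode = stat.S_IFMT(st.st_mode)
    if mode in (stat.S_IFSOCK, stat.S_IFCHR, stat.S_IFBLK, stat.S_IFIFO):
        reasons.append("special file where a regular file or directory was expected")
    return reasons

for dirpath, dirnames, filenames in os.walk(ROOT, onerror=lambda e: None):
    for name in dirnames + filenames:
        path = os.path.join(dirpath, name)
        try:
            st = os.lstat(path)
        except OSError as exc:             # unreadable entries are themselves suspicious
            print(f"{path}: lstat failed: {exc}")
            continue
        for reason in suspicious(st):
            print(f"{path}: {reason}")
```

Running something like this against an image mounted read-only avoids touching the damaged volume any further while still giving a quick list of candidates to inspect by hand.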
Given the list of filesystem failures above, it follows that some libraries got mixed up, which is why apache was serving complete gibberish from our services. I could not recover the PostgreSQL data, as its files were a complete mess. Luckily, I could repair the MySQL data, as only its system metadata was corrupted, which was relatively easy to recreate. It was also lucky that in the LDAP data directory only the accesslog was corrupted, which was easily recreated. The dpkg list of installed packages got mixed with other binary data. And so on and so forth...
Performance counter graphs also stop at approximately 20:05, but show no unexpected activity apart from the usual I/O rise from file operations.
Additional steps taken
- Running memtest on the server's RAM found no bad memory
- I could not find any way to test the RAID controller's memory
- The failed hard drive was replaced, and after a complete rebuild the consistency check was run once again, indicating no errors from the hardware perspective
- I managed to extract a low-level log from the RAID controller, which showed no warnings or errors during those two days
- Reinstalled all packages to overwrite any corrupted data (a verification sketch follows this list)
- Inspected the system for suspicious processes, but found nothing unusual
- Several weeks have passed and I'm still trying to fix the corrupted filesystems
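On the package reinstallation point above: one way to double-check that no corrupted package files survived is to verify installed files against the checksums dpkg keeps under /var/lib/dpkg/info/*.md5sums (essentially what debsums -c or dpkg --verify do, and those tools are the more thorough option). A minimal sketch of that check, assuming a Debian-style layout, follows:

```python
#!/usr/bin/env python3
"""Sketch: verify installed package files against the md5sums dpkg records in
/var/lib/dpkg/info/*.md5sums. Shown only to illustrate the check; the paths
assume a Debian-style layout, and debsums/dpkg --verify are the real tools."""

import glob
import hashlib
import os

INFO_DIR = "/var/lib/dpkg/info"

def md5_of(path, bufsize=1 << 20):
    """Compute the md5 of a file in chunks to keep memory use low."""
    h = hashlib.md5()
    with open(path, "rb") as fh:
        while chunk := fh.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

for sums_file in glob.glob(os.path.join(INFO_DIR, "*.md5sums")):
    package = os.path.basename(sums_file).rsplit(".md5sums", 1)[0]
    with open(sums_file) as fh:
        for line in fh:
            try:
                expected, relpath = line.split(None, 1)
            except ValueError:
                continue                       # skip malformed lines
            path = "/" + relpath.strip()       # paths in md5sums are relative to /
            if not os.path.isfile(path):
                print(f"{package}: MISSING {path}")
                continue
            try:
                if md5_of(path) != expected:
                    print(f"{package}: CHECKSUM MISMATCH {path}")
            except OSError as exc:             # an unreadable file is also a red flag
                print(f"{package}: UNREADABLE {path} ({exc})")
```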
Conclusion
I did suspect a virus, of course, but the performance counters didn't catch any unusual activity at all, and no evidence of one was found while inspecting the system.
Given that there is no correlation with file age, I believe it was not an ext4 journal failure. Considering the huge amount of corrupted data and metadata, I think it is unlikely that the server RAM or the RAID controller RAM caused this, as the amount of corrupted data is orders of magnitude larger than both RAMs combined. Even the randomly filled inodes are not confined to some specific region but are spread all over the volumes. All this leads me to conclude that something must have made the RAID controller write all this random data to the hard drives, given that it insists the drives read back valid data (the RAID checksums would have detected errors otherwise), but I cannot imagine any way this could have happened.
Thus I'm asking anyone experienced to analyse the information provided and, if possible, identify the fault and its cause, or advise me on any steps I have not yet taken that could lead to identifying them.
PS: Please note that this question is not about making backups.