We had our second corruption of an ext4 partition within a relatively short time, and ext4 is supposedly very reliable. Since this is a virtual machine and the host providing the resources saw no disk errors, power loss, or the like, I want to rule out hardware errors for now.
So I am wondering whether we have an unusual setup (a CoreOS guest under a Hyper-V host), an unusual workload (Docker containers running Nginx, Gitlab, Redmine, MediaWiki, and MariaDB), or simply a bad configuration. Any input or suggestions would be welcome.
The original error message (in the second instance) was:
Jun 05 02:00:50 localhost kernel: EXT4-fs error (device sda9): ext4_lookup:1595: inode #8347255: comm git: deleted inode referenced: 106338109
Jun 05 02:00:50 localhost kernel: Aborting journal on device sda9-8.
Jun 05 02:00:50 localhost kernel: EXT4-fs (sda9): Remounting filesystem read-only
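For reference, debugfs can map the inode numbers from such an error back to pathnames before any repair is attempted (a read-only query; the device and inode number are taken from the log above):

# Print the pathname(s) referring to directory inode 8347255 (read-only query).
$ sudo debugfs -R 'ncheck 8347255' /dev/sda9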
At this point, an e2fsck run found lots of errors (I didn't think to keep the log) and placed about 357 MB in lost+found, for a 2 TB partition with about 512 GB of data on it. The OS still boots after this, so the lost parts seem to lie in user data or Docker containers.
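Next time, a minimal way to keep the repair log would be something like this (assuming the volume can be checked from a rescue environment while unmounted):

# Force a full check, answer yes to repairs, and keep a timestamped log of the output.
$ sudo e2fsck -f -y /dev/sda9 2>&1 | tee e2fsck-$(date +%F).log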
Here are a few more details about the affected system:
$ uname -srm
Linux 4.19.123-coreos x86_64
$ sudo tune2fs -l /dev/sda9
tune2fs 1.45.5 (07-Jan-2020)
Filesystem volume name: ROOT
Last mounted on: /sysroot
Filesystem UUID: 04ab23af-a14f-48c8-af59-6ca97b3263bc
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent 64bit flex_bg inline_data sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Remount read-only
Filesystem OS type: Linux
Inode count: 533138816
Block count: 536263675
Reserved block count: 21455406
Free blocks: 391577109
Free inodes: 532851311
First block: 0
Block size: 4096
Fragment size: 4096
Group descriptor size: 64
Reserved GDT blocks: 15
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 32576
Inode blocks per group: 1018
Flex block group size: 16
Filesystem created: Tue Sep 11 00:02:46 2018
Last mount time: Fri Jun 5 15:40:01 2020
Last write time: Fri Jun 5 15:40:01 2020
Mount count: 3
Maximum mount count: -1
Last checked: Fri Jun 5 08:14:10 2020
Check interval: 0 (<none>)
Lifetime writes: 79 GB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 128
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: 595db5c2-beda-4f32-836f-ee025416b0f1
Journal backup: inode blocks
Update:
And a few more details about the host setup:
- using Hyper-V Server 2016
- the disk is based on a virtual disk file (as opposed to a physical disk)
- the disk is set up to be dynamic (i.e. growing)
- there are several snapshots/restore points on the VM. I am not sure whether this switches the disk image from dynamic to differencing(?)
What data the orphaned inodes contained is a tricky enough problem. Why the storage system did such a thing is considerably more difficult.
First, do incident response. Check if any of these workloads is having unplanned downtime. Evaluate your recovery options: any DR environment on separate storage, backups, other copies of the data.
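For the Docker workloads named above, a quick first pass could be as simple as this sketch; it only surfaces containers that are down or restarting, not application-level damage:

# List every container with its state; anything Exited or Restarting deserves a look.
$ docker ps -a --format 'table {{.Names}}\t{{.Status}}'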
Consider making a backup of the VHD before changing anything. This allows you to undo your actions, and perhaps you can let support examine the broken volume.
Identify what data is affected. Run file on those lost inodes to guess their format, then open and examine their contents (see the sketch below).
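A minimal sketch of that triage, assuming e2fsck placed the recovered files in /lost+found at the root of the affected volume:

# lost+found is normally readable only by root; file(1) guesses each entry's format.
$ sudo find /lost+found -maxdepth 1 -type f -exec file {} + | less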
Run integrity checks on the application data, for example git fsck in each repository. That is particularly relevant given the syslog message indicates a git binary accessed the problem data.
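A sketch of such a pass over the Git data; the path is an assumption, adjust it to wherever the Gitlab container keeps its repositories:

# Run a full git fsck in every bare repository under the assumed data directory.
$ sudo find /srv/gitlab -type d -name '*.git' -exec git -C {} fsck --full \;

A similar pass with mysqlcheck would cover the MariaDB data.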
Check everything in the storage and compute systems: EXT4, the virtual disk, and the Hyper-V host's storage.
There may not be an obvious cause. Even so, consider moving to a different system to rule out a hardware problem. If you have a DR system on different hardware, consider cutting over to that. Or try replacing smaller components, like disks in the array. Or migrate the VM to a different compute host.