In mid-November, a VPS that I am renting from a hosting company stopped responding. When I contacted support, they explained that a power outage in the datacenter caused a forced reboot and fsck. Eventually, I asked why it was taking so long, and was told that the size of the volume is 30 TB. The last time I received an update was in February, and they have not responded to my most recent inquiry.
I understand that fsck can be very slow for some file systems, but is it possible for fsck to take 6 months on a 30 TB volume, or should I assume that this hosting company is lying to me so that I continue to pay my bill every month?
fsck speed mainly depends on the number of files and how they are spread across directories. That said, 6 months for an fsck is absolutely absurd: it should have completed in a few hours at most, especially if using xfs, which has the speedy xfs_repair utility. Here you can find some fsck runs at scale, all completed in under one hour (3600 s). So it is not possible that your fsck is still running.
Anyway, an unexpected power loss will not cause a full-blown fsck, but rather only a very fast (a few seconds) journal replay. However, if some key files were damaged, the OS can be unbootable.
But they probably just lied to you. You should stop paying immediately, ask for an explanation and apply for a full refund.
Conjecture: Their system uses a BBU/FBWC-less RAID (or even software RAID) with all possible write caches (including those in the hard drives themselves) at their most aggressive settings, in order to get maximum performance for minimal cost. A hard power outage on such a setup can leave a journaling filesystem in a condition where the journal cannot be trusted and cannot be used for recovery. The problem is that such a system aggressively reorders and postpones writes, which means that a journal entry can reach the disk while the data write it describes is lost, or a consequential data write can reach the disk while its journal entry is lost.
Recovering such a system from a worst-case outage can mean that you have to do a "slow" fsck/repair that actually examines all the filesystem structures as they are, which could indeed take a day or two for 30 TB, and it is not unlikely that you will have to run multiple repair cycles. Add to that the fact that personnel might not always be available to monitor this, and you could easily be down to one fsck per week. They probably gave up and forgot.
For most filesystems it will be much faster, even when there are errors, as normally only the metadata is checked.
In the worst case, it may read the whole disk (e.g. something like fsck.ext4 -cc /dev/sda, which does a non-destructive read-write test on every block), which could take a few days for 30 TB. If you know the speed of the drives, you can calculate size/speed. For a consumer hard drive with about 100 MB/s, copying a few TB can take more hours than most people would expect.
If it were your server, you could have the problem that it boots and then hangs when fsck asks you if you want to fix an error. But the datacenter admin won't leave an fsck hanging for 6 months while all the VPSes are offline.
So they are either lying to you, or there is a huge misunderstanding. Or they ran fsck some time ago and did not update you about the new problem after it finished.
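To make the size/speed estimate concrete, here is a rough sketch of the arithmetic. The function name and the throughput figures other than 100 MB/s are assumptions chosen for illustration, not measurements from the hosting company's hardware, and the estimate covers a single sequential pass over the volume with decimal units (1 TB = 10^6 MB).

```python
def full_scan_hours(size_tb: float, speed_mb_s: float) -> float:
    """Hours needed to read size_tb terabytes once at a sustained speed_mb_s MB/s."""
    size_mb = size_tb * 1_000_000  # decimal units: 1 TB = 10^6 MB
    seconds = size_mb / speed_mb_s
    return seconds / 3600


# Illustrative throughputs (assumed): a single consumer drive, a faster
# drive, and a striped array.
for speed in (100, 200, 1000):
    hours = full_scan_hours(30, speed)
    print(f"{speed:>5} MB/s: {hours:6.1f} h ({hours / 24:.1f} days)")
```

A read-write test touches every block more than once, so the real figure is a multiple of this, but even then the result is on the order of days or weeks, not six months.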