I have a small 3-host Ceph cluster running Ubuntu 20.04.1 and Ceph 15.2.5, deployed with cephadm using Docker containers. Yesterday one of the hosts (s65-ceph) had a power outage. The other two hosts kept working for a while, but then the monitors on s63-ceph and s64-ceph started filling their logs with "e5 handle_auth_request failed to assign global_id" until their HDDs were full. I have seen this type of error before, when the cause was just clock skew, so I synchronised the clocks, but clock skew doesn't appear to be the problem here. I have also restarted ceph.target and rebooted all machines multiple times, to no avail. I haven't made any changes to the cluster in a long time, so I'm confident this isn't some sort of configuration error either.
I have tried following the documentation's monitor troubleshooting page. The odd thing, though, is that s64-ceph lists both itself and s63-ceph in the quorum, while s63-ceph reports itself as out of quorum. Here is what the mon_status command returns for each: s63-ceph, s64-ceph. I can also telnet to ports 3300 and 6789 from the other hosts, so the network seems fine. The monitor (and its container) on s65-ceph is down, so there is no mon_status for it. If it's of use, here's the cluster config. I've uploaded the journal of s64-ceph here (with duplicate handle_auth_request lines and all grafana/prometheus/dashboard lines removed); s63-ceph has essentially the same logs anyway.
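For reference, since the mons run in containers, I queried mon_status through each monitor's admin socket and checked connectivity like this (daemon and host names as in my cluster):

    # open a shell inside the monitor's container, then query its admin socket
    sudo cephadm enter --name mon.s63-ceph
    ceph daemon mon.s63-ceph mon_status

    # from another host, verify the monitor ports are reachable
    telnet s63-ceph 3300
    telnet s63-ceph 6789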
I am aware of an extremely similar question here, but it has had no answers. This is a production system, so if there is no way to get the cluster back to normal operation, I'd at least appreciate a safe way of recovering my files (the cluster is used only for CephFS).
Thanks in advance.