I was trying to upgrade Ceph from 17 to 18.2.4, as outlined in the cephadm upgrade docs:
ceph orch upgrade start --ceph-version 18.2.4
Initiating upgrade to quay.io/ceph/ceph:v18.2.4
After this, however, the orchestrator no longer responds:
ceph orch upgrade status
Error ENOENT: Module not found
Setting the backend back to orchestrator or cephadm fails because the module appears as 'disabled', yet ceph mgr insists that the module is on.
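For reference, these are roughly the commands I tried to bring the orchestrator back (an approximation from memory, not an exact transcript); every attempt fails the same way:
# list mgr modules to check whether cephadm is reported as enabled
ceph mgr module ls
# try to re-enable the cephadm module and point the orchestrator at it
ceph mgr module enable cephadm
ceph orch set backend cephadm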
From what I can gather, I'm now stuck with one mgr daemon running Reef while the rest of the cluster is still on Quincy:
ceph versions
{
    "mon": {
        "ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)": 5
    },
    "mgr": {
        "ceph version 18.2.4 (e7ad5345525c7aa95470c26863873b581076945d) reef (stable)": 1
    },
    "osd": {
        "ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)": 31
    },
    "mds": {
        "ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)": 4
    },
    "overall": {
        "ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)": 40,
        "ceph version 18.2.4 (e7ad5345525c7aa95470c26863873b581076945d) reef (stable)": 1
    }
}
How do I reinstate the cluster to a healthy state?
EDIT 1: Ceph health:
  cluster:
    id:     16249ca6-4060-11ef-a8a1-7509512e051b
    health: HEALTH_WARN
            insufficient standby MDS daemons available
            mon gpu001 is low on available space
            1/5 mons down, quorum ***
            Degraded data redundancy: 92072087/608856489 objects degraded (15.122%), 97 pgs degraded, 97 pgs undersized
            7 pgs not deep-scrubbed in time

  services:
    mon: 5 daemons, quorum ***
    mgr: cpu01.fcxjpi(active, since 5m)
    mds: 4/4 daemons up
    osd: 34 osds: 31 up (since 45h), 31 in (since 46h); 31 remapped pgs

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 193 pgs
    objects: 121.96M objects, 32 TiB
    usage:   36 TiB used, 73 TiB / 108 TiB avail
    pgs:     92072087/608856489 objects degraded (15.122%)
             29422795/608856489 objects misplaced (4.832%)
             97 active+undersized+degraded
             65 active+clean
             31 active+clean+remapped

  io:
    client: 253 KiB/s rd, 51 KiB/s wr, 3 op/s rd, 2 op/s wr
Note: this question was originally asked on SO [https://stackoverflow.com/posts/78949269]; I was advised to move it here. I'm currently searching the mgr logs to investigate, and may eventually force a downgrade.
Credit to @eblock for pointing me in the right direction. This is indeed related to this bug: https://tracker.ceph.com/issues/67329
I confirmed this by looking at the mgr logs, which showed the same failure described in the tracker issue.
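In case it helps anyone reproduce the check, I pulled the mgr logs with something along these lines (the daemon name comes from the ceph -s output above, so adjust it to your own active mgr):
# show recent log lines for the active mgr daemon via cephadm
cephadm logs --name mgr.cpu01.fcxjpi -- -n 200
# optionally raise the mgr log level temporarily while reproducing the failure
ceph config set mgr debug_mgr 10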
I wasn't sure whether it had anything to do with something I had done. While looking at config values (ceph config), I found the problematic entry, the mgr/cephadm/osd_remove_queue key: it came from an OSD whose removal was still pending on a faulty machine. Running ceph config-key rm mgr/cephadm/osd_remove_queue and restarting the manager got the cephadm orchestrator responding again.
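For anyone hitting the same problem, the recovery boiled down to roughly this sequence (the key name comes from the tracker issue above; backing up the value before deleting it is my own precaution, not a required step):
# dump the stuck removal-queue entry before touching it
ceph config-key get mgr/cephadm/osd_remove_queue
# remove the stale entry that was breaking the cephadm module (see tracker issue)
ceph config-key rm mgr/cephadm/osd_remove_queue
# restart the active mgr so the cephadm module reloads cleanly
ceph mgr fail
# the commands that were failing earlier should respond again
ceph orch status
ceph orch upgrade status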