We have just installed a cluster of six Proxmox servers, using three nodes as Ceph storage nodes and three as compute nodes.
We are experiencing strange and critical performance and stability issues with this cluster.
VMs and the Proxmox web UI tend to hang for no obvious reason, from a few seconds to a few minutes, whether we connect via SSH, RDP or the VNC console directly. Even the Proxmox hosts themselves seem to become unreachable, as can be seen in this monitoring capture. This also creates Proxmox cluster issues, with some servers falling out of sync.
For instance, when pinging between host nodes, it works perfectly for a few pings, hangs, carries on (without any increase in round-trip time, still <1 ms), hangs again, and so on.
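To pin down exactly when the hangs occur, here is a minimal sketch of how the drops could be logged (the target address is a placeholder; -D and -O are standard iputils ping options):

# -D prefixes each reply with a Unix timestamp, -O reports a missed reply immediately
ping -D -O -i 0.5 <other-node-cluster-ip> | tee /tmp/ping-cluster-net.log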
We had some performance issues initially, but those were fixed by raising the NICs' MTU to 9000 (+1300% read/write improvement). Now we need to get all of this stable, because right now it is not production-ready.
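For reference, the MTU change boils down to the following in /etc/network/interfaces; a minimal sketch with a hypothetical interface name and a placeholder address, not our exact stanza:

# 10 Gbps cluster-network interface, jumbo frames enabled
auto enp1s0f0
iface enp1s0f0 inet static
    address 192.168.0.101/30   # placeholder address on the cluster network
    mtu 9000

# verify that 9000-byte frames actually pass end to end (8972 = 9000 - 28 bytes of IP/ICMP headers)
ping -M do -s 8972 <other-node-cluster-ip>

Since an MTU mismatch anywhere on the path (switch ports included) can cause exactly this kind of intermittent stall, the ping check above may be worth running between every pair of nodes.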
Hardware configuration
We have a network architecture similar to the one described in Ceph's official documentation, with a 1 Gbps public network and a 10 Gbps cluster network. These are connected to two physical network cards in each of the six servers.
Storage server nodes:
- CPU: Xeon E-2136 (6 cores, 12 threads), 3.3 GHz, Turbo 4.5 GHz
- RAM: 16 GB
- Storage:
  - 2x 256 GB NVMe in RAID 1, LVM:
    - system root logical volume: 15 GB (~55% free)
    - swap: 7.4 GB
    - WAL for OSD2: 80 GB
  - 4 TB SATA SSD (OSD1)
  - 12 TB SATA HDD (OSD2)
- Network interface controllers:
  - Intel Corporation I350 Gigabit: connected to the public 1 Gbps network
  - Intel Corporation 82599 10 Gigabit: connected to the 10 Gbps cluster (internal) network
Compute server nodes:
- CPU: Xeon E-2136 (6 cores, 12 threads), 3.3 GHz, Turbo 4.5 GHz
- RAM: 64 GB
- Storage:
  - 2x 256 GB SATA SSD in RAID 1:
    - system root logical volume: 15 GB (~65% free)
Software (on all 6 nodes):
- Proxmox 7.0-13, installed on top of Debian 11
- Ceph v16.2.6, installed via the Proxmox GUI
- Ceph Monitor on each storage node
- Ceph Manager on storage nodes 1 and 3
Ceph configuration
ceph.conf of the cluster:
[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 192.168.0.100/30
fsid = 97637047-5283-4ae7-96f2-7009a4cfbcb1
mon_allow_pool_delete = true
mon_host = 1.2.3.100 1.2.3.101 1.2.3.102
ms_bind_ipv4 = true
ms_bind_ipv6 = false
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 1.2.3.100/30
[client]
keyring = /etc/pve/priv/$cluster.$name.keyring
[mds]
keyring = /var/lib/ceph/mds/ceph-$id/keyring
[mds.asrv-pxdn-402]
host = asrv-pxdn-402
mds standby for name = pve
[mds.asrv-pxdn-403]
host = asrv-pxdn-403
mds_standby_for_name = pve
[mon.asrv-pxdn-401]
public_addr = 1.2.3.100
[mon.asrv-pxdn-402]
public_addr = 1.2.3.101
[mon.asrv-pxdn-403]
public_addr = 1.2.3.102
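For reference, here is roughly how we could double-check which addresses the daemons actually registered with this public/cluster split (osd.0 is just an example ID; these are standard Ceph commands):

# front_addr = public network, back_addr = cluster network for a given OSD
ceph osd metadata 0 | grep -E '"(front|back)_addr"'
# addresses the monitors listen on
ceph mon dump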
Questions:
- Is our architecture correct?
- Should the Ceph Monitors and Managers be accessed through the public network? (This is what Proxmox's default configuration gave us.)
- Does anyone know where these disturbances/instabilities come from and how to fix them?
[edit]
- Is it correct to use a default pool size of 3 when you have 3 storage nodes? I was initially tempted to use 2, but couldn't find similar examples and decided to stick with the default config.
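Here is a minimal sketch of how the current replication settings can be checked per pool (the pool name is a placeholder):

# list pools, then read the replication settings of one of them
ceph osd pool ls
ceph osd pool get <pool-name> size
ceph osd pool get <pool-name> min_size

With size=3 and min_size=2 on three storage nodes, each node holds one replica: the pool stays writable with one node down and blocks I/O with two nodes down, which is why 3/2 is the generally recommended setting even for a three-node cluster.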
Noticed issues
- We noticed that arping somehow returns replies from two MAC addresses (the public NIC and the private NIC), which doesn't make any sense since these are separate NICs, each linked to a separate switch. This may be part of the network issue (see the sysctl sketch after this list).
- During a backup task on one of the VMs (backing up to a physically remote Proxmox Backup Server), the cluster somehow seems to be affected. The VM gets stuck in backup/locked mode, even though the backup itself appears to have finished properly (it is visible and accessible on the backup server).
- Since that first backup issue, Ceph has been trying to rebuild itself but hasn't managed to do so. It is in a degraded state, indicating that it lacks an MDS daemon. However, I double-checked and there are working MDS daemons on storage nodes 2 and 3 (see the checks sketched after the status output below). It was making progress on rebuilding itself until it got stuck in this state.
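Regarding the arping observation above, here is a rough sketch of the standard Linux sysctls used against "ARP flux" (a host answering ARP on the wrong interface); the values shown are the commonly recommended ones, not something taken from our current setup:

# e.g. in /etc/sysctl.d/99-arp.conf (hypothetical file name), applied with `sysctl --system`
net.ipv4.conf.all.arp_ignore = 1     # only answer ARP for addresses configured on the receiving interface
net.ipv4.conf.all.arp_announce = 2   # always use the outgoing interface's own address as ARP source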
Here is the current Ceph status:
root@storage-node-2:~# ceph -s
  cluster:
    id:     97637047-5283-4ae7-96f2-7009a4cfbcb1
    health: HEALTH_WARN
            insufficient standby MDS daemons available
            Slow OSD heartbeats on back (longest 10055.902ms)
            Slow OSD heartbeats on front (longest 10360.184ms)
            Degraded data redundancy: 141397/1524759 objects degraded (9.273%), 156 pgs degraded, 288 pgs undersized

  services:
    mon: 3 daemons, quorum asrv-pxdn-402,asrv-pxdn-401,asrv-pxdn-403 (age 4m)
    mgr: asrv-pxdn-401(active, since 16m)
    mds: 1/1 daemons up
    osd: 6 osds: 4 up (since 22h), 4 in (since 21h)

  data:
    volumes: 1/1 healthy
    pools:   5 pools, 480 pgs
    objects: 691.68k objects, 2.6 TiB
    usage:   5.2 TiB used, 24 TiB / 29 TiB avail
    pgs:     141397/1524759 objects degraded (9.273%)
             192 active+clean
             156 active+undersized+degraded
             132 active+undersized
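To cross-check the MDS situation mentioned above, these are the commands we could run (the unit name follows the [mds.asrv-pxdn-402] section of our ceph.conf):

# CephFS-level view of active and standby MDS daemons
ceph fs status
ceph mds stat
# local daemon state on a storage node that should host a standby MDS
systemctl status ceph-mds@asrv-pxdn-402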
[edit 2]
root@storage-node-2:~# ceph osd tree
ID  CLASS  WEIGHT    TYPE NAME                STATUS  REWEIGHT  PRI-AFF
-1         43.65834  root default
-3         14.55278      host asrv-pxdn-401
 0    hdd  10.91409          osd.0                up   1.00000  1.00000
 3    ssd   3.63869          osd.3                up   1.00000  1.00000
-5         14.55278      host asrv-pxdn-402
 1    hdd  10.91409          osd.1                up   1.00000  1.00000
 4    ssd   3.63869          osd.4                up   1.00000  1.00000
-7         14.55278      host asrv-pxdn-403
 2    hdd  10.91409          osd.2              down         0  1.00000
 5    ssd   3.63869          osd.5              down         0  1.00000
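Finally, a rough sketch of what we could look at for the two OSDs shown down on asrv-pxdn-403 (the unit names follow the OSD IDs above):

# on asrv-pxdn-403: daemon state and recent logs of the down OSDs
systemctl status ceph-osd@2 ceph-osd@5
journalctl -u ceph-osd@2 --since "1 day ago" | tail -n 50
# detailed health output also lists which OSD pairs have slow heartbeats on the front/back networks
ceph health detail | head -n 40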