I've installed a completely fresh OS with Proxmox on 4 nodes. Every node has 2x NVMe and 1x HDD, one public NIC and one private NIC. On the public network there is an additional WireGuard interface running for PVE cluster communication. The private interface should be used only for the upcoming distributed storage.
# ip a s
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enp3s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
link/ether 6c:b3:11:07:f1:18 brd ff:ff:ff:ff:ff:ff
inet 10.255.255.2/24 brd 10.255.255.255 scope global enp3s0
valid_lft forever preferred_lft forever
inet6 fe80::6eb3:11ff:fe07:f118/64 scope link
valid_lft forever preferred_lft forever
3: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether b4:2e:... brd ff:ff:ff:ff:ff:ff
inet 168..../26 brd 168....127 scope global eno1
valid_lft forever preferred_lft forever
inet6 2a01:.../128 scope global
valid_lft forever preferred_lft forever
inet6 fe80::b62e:99ff:fecc:f5d0/64 scope link
valid_lft forever preferred_lft forever
4: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether a2:fd:6a:c7:f0:be brd ff:ff:ff:ff:ff:ff
inet6 2a01:....::2/64 scope global
valid_lft forever preferred_lft forever
inet6 fe80::..:f0be/64 scope link
valid_lft forever preferred_lft forever
6: wg0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN group default qlen 1000
link/none
inet 10.3.0.10/32 scope global wg0
valid_lft forever preferred_lft forever
inet6 fd01:3::a/128 scope global
valid_lft forever preferred_lft forever
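For reference, the MTU 9000 on the private NIC is set in /etc/network/interfaces, roughly like this (a sketch, not the literal file; address and interface name as on this node):

auto enp3s0
iface enp3s0 inet static
        address 10.255.255.2/24
        mtu 9000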
The nodes are fine and the PVE cluster is running as expected.
# pvecm status
Cluster information
-------------------
Name: ac-c01
Config Version: 4
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Tue Dec 15 22:36:44 2020
Quorum provider: corosync_votequorum
Nodes: 4
Node ID: 0x00000002
Ring ID: 1.11
Quorate: Yes
Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 4
Quorum: 3
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.3.0.4
0x00000002 1 10.3.0.10 (local)
0x00000003 1 10.3.0.13
0x00000004 1 10.3.0.16
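(The corosync links run over the WireGuard addresses, i.e. the nodelist in /etc/pve/corosync.conf looks roughly like this - a sketch, the node names are placeholders:)

nodelist {
  node {
    name: node1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.3.0.4
  }
  node {
    name: node2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.3.0.10
  }
  ...
}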
The PVE firewall is active in the cluster, but there is a rule that all PVE nodes can talk to each other on any protocol, any port, and any interface. This is true - I can ping, ssh, etc. between all nodes on all IPs.
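For illustration, the rule is roughly equivalent to this in /etc/pve/firewall/cluster.fw (a sketch, not the exact ruleset; the two subnets are the WireGuard and the private storage network, plus corresponding rules for the public addresses):

[RULES]
IN ACCEPT -source 10.3.0.0/24
IN ACCEPT -source 10.255.255.0/24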
Then I installed Ceph:
pveceph install
On the first node I initialized Ceph with
pveceph init -network 10.255.255.0/24
pveceph createmon
That works.
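After init and createmon, /etc/pve/ceph.conf on the first node contains the chosen network, roughly along these lines (a sketch from memory; exact keys may differ between PVE versions, fsid omitted):

[global]
        cluster_network = 10.255.255.0/24
        public_network = 10.255.255.0/24
        mon_host = 10.255.255.1
        ...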
On the second node I tried the same (I'm not sure if I need to set the -network option, so I tried with and without). That works too.
But pveceph createmon fails on every node except node1 with:
# pveceph createmon
got timeout
I can also reach port 10.255.255.1:6789 from any node. Whatever I try, I get "got timeout" on every node other than node1. Also, disabling the firewall doesn't have any effect.
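(The reachability check I mean is nothing more than something like this, from node2 towards node1's monitor address; the exact tooling doesn't matter:)

ping -c 3 10.255.255.1
nc -vz 10.255.255.1 6789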
When I remove the -network option, I can run all commands. It looks like it cannot talk via the second interface. But the interface is fine.
When I set network to 10.3.0.0/24 and cluster-network to 10.255.255.0/24, it works too, but I want all Ceph communication running via 10.255.255.0/24. What is wrong?
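(In command form, the variant that does work is roughly this - a sketch; the --cluster-network option of pveceph init may be spelled slightly differently depending on the PVE version:)

pveceph init --network 10.3.0.0/24 --cluster-network 10.255.255.0/24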
The problem is the MTU of 9000. Even when I run the complete Proxmox cluster via the private network, there are errors.
So, Ceph has a problem with jumbo frames.
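A quick way to see whether jumbo frames actually make it end-to-end on the storage network is a ping with the don't-fragment flag and a payload sized for MTU 9000 (8972 bytes = 9000 minus 20 bytes IP header minus 8 bytes ICMP header); if this times out while a normal ping works, the path does not really carry jumbo frames:

ping -M do -s 8972 -c 3 10.255.255.1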
Just for reference, the official documentation mentions jumbo frames as bringing important performance improvements:
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/4/html/configuration_guide/ceph-network-configuration#verifying-and-configuring-the-mtu-value_conf
https://ceph.io/en/news/blog/2015/ceph-loves-jumbo-frames/
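The verification step from the Red Hat doc boils down to checking that every node (and every device in between) actually reports the intended MTU, e.g.:

ip link show | grep mtu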
I, for one, have seen read/write performance improvements of around 1400% after changing the MTU on the 6 nodes we set up (3 storage, 3 compute).
And no, this is not a typo. We went from 110 MB/s read/write with dd tests in Linux VMs to 1.5-1.6 GB/s afterwards (1 Gbps public network, 10 Gbps private network, OSDs on SATA SSDs).
Nota bene: changing the MTU on all network interfaces (public AND private) seems quite important! In our case, changing it only on the private NICs made the whole system go haywire.
I hope this helps someone! Cheers