I have a k8s cluster on 4 VMs: 1 master and 3 workers. On each of the workers, I use Rook to deploy a Ceph OSD. The OSDs use the same disk as the VM operating system.
The VM disks are remote (the underlying infrastructure is again a Ceph cluster).
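For reference, the OSD layout can be confirmed from the Rook toolbox; a sketch, assuming the toolbox pod is deployed with its default app=rook-ceph-tools label in the rook-ceph namespace:
$ TOOLS_POD=$(kubectl -n rook-ceph get pod -l app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}')
$ kubectl -n rook-ceph exec -it $TOOLS_POD -- ceph -s        # overall cluster health
$ kubectl -n rook-ceph exec -it $TOOLS_POD -- ceph osd tree  # which OSD sits on which host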
This is the VM disk performance (similar for all 3 of them):
$ dd if=/dev/zero of=testfile bs=1G count=1 oflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 4.82804 s, 222 MB/s
And the latency (await) while idle is around 8ms.
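(await here is the per-request latency reported by iostat; a minimal way to watch it while a test runs, assuming the sysstat package is installed:)
$ iostat -x 1    # watch the await/w_await and %util columns for the VM disk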
If I mount an RBD volume inside a Kubernetes pod, the performance is very poor:
$ dd if=/dev/zero of=testfile bs=1G count=1 oflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 128.619 s, 8.3 MB/s
During high load (100% utilization of the RBD volume), the latency of the RBD volume is greater than 30 seconds.
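This can be confirmed on the worker node itself while the test runs; a sketch, assuming the volume is mapped through the kernel RBD client (it shows up as an rbdN device):
$ lsblk | grep rbd    # find the rbdN device backing the volume
$ iostat -x 1         # await and %util for that rbdN device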
I know that my setup is not what Ceph recommends and that dd is not the best tool to profile disk performance, but the penalty of running Ceph on top of VM disks is still huge.
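Since dd measures only a single sequential stream, a more representative benchmark might be fio; a sketch, assuming fio is installed where the test runs (file name, size, and queue depth are arbitrary choices):
$ fio --name=randwrite --filename=testfile --rw=randwrite --bs=4k \
      --size=1G --ioengine=libaio --direct=1 --iodepth=16 \
      --runtime=60 --time_based --group_reporting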
The VM operating system is CentOS 7.7.1908 with kernel 3.10.0-1062.12.1.el7.x86_64.
Network bandwidth between worker nodes:
[ ID] Interval Transfer Bandwidth
[ 5] 0.00-10.00 sec 2.35 GBytes 2.02 Gbits/sec
Network latency is less than 1 ms.
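(For completeness, these are the kind of commands that produce the numbers above; a sketch, assuming iperf3 is available on two workers and <worker-ip> is a placeholder:)
$ iperf3 -s                     # on the first worker
$ iperf3 -c <worker-ip> -t 10   # on the second worker
$ ping -c 10 <worker-ip>        # round-trip latency between workers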
I'm looking for hints on how to troubleshoot this issue further and improve performance.
There is not enough information about your Ceph cluster, but a few things will improve the performance: