I have a k8s cluster on 4 VMs: 1 master and 3 workers. On each of the workers, I use Rook to deploy a Ceph OSD. The OSDs use the same disk as the VM operating system.
The VM disks are remote (the underlying infrastructure is again a Ceph cluster).
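For reference, the OSD layout can be confirmed from the Rook toolbox; a sketch, assuming the toolbox pod is deployed with its default app=rook-ceph-tools label in the rook-ceph namespace:
$ TOOLS_POD=$(kubectl -n rook-ceph get pod -l app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}')
$ kubectl -n rook-ceph exec -it $TOOLS_POD -- ceph -s        # overall cluster health
$ kubectl -n rook-ceph exec -it $TOOLS_POD -- ceph osd tree  # which OSD sits on which host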
This is the VM disk performance (similar for all 3 of them):
$ dd if=/dev/zero of=testfile bs=1G count=1 oflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 4.82804 s, 222 MB/s
And the latency (await) while idle is around 8ms.
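(await here is the per-request latency reported by iostat; a minimal way to watch it while a test runs, assuming the sysstat package is installed:)
$ iostat -x 1    # watch the await/w_await and %util columns for the VM disk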
If I mount an RBD volume inside a Kubernetes pod, the performance is very poor:
$ dd if=/dev/zero of=testfile bs=1G count=1 oflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 128.619 s, 8.3 MB/s
During high load (100% utilization of the RBD volume), the latency of the RBD volume is greater than 30 seconds.
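This can be confirmed on the worker node itself while the test runs; a sketch, assuming the volume is mapped through the kernel RBD client (it shows up as an rbdN device):
$ lsblk | grep rbd    # find the rbdN device backing the volume
$ iostat -x 1         # await and %util for that rbdN device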
I know that my setup is not what Ceph recommends and that dd is not the best tool to profile disk performance, but the penalty of running Ceph on top of VM disks is still huge.
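Since dd measures only a single sequential stream, a more representative benchmark might be fio; a sketch, assuming fio is installed where the test runs (file name, size, and queue depth are arbitrary choices):
$ fio --name=randwrite --filename=testfile --rw=randwrite --bs=4k \
      --size=1G --ioengine=libaio --direct=1 --iodepth=16 \
      --runtime=60 --time_based --group_reporting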
The VM operating system is CentOS 7.7.1908 with kernel 3.10.0-1062.12.1.el7.x86_64.
Network bandwidth between worker nodes:
[ ID] Interval Transfer Bandwidth
[ 5] 0.00-10.00 sec 2.35 GBytes 2.02 Gbits/sec
Network latency is less than 1 ms.
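(For completeness, these are the kind of commands that produce the numbers above; a sketch, assuming iperf3 is available on two workers and <worker-ip> is a placeholder:)
$ iperf3 -s                     # on the first worker
$ iperf3 -c <worker-ip> -t 10   # on the second worker
$ ping -c 10 <worker-ip>        # round-trip latency between workers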
I'm looking for hints on how to troubleshoot this issue further and improve performance.
There is not enough information about your Ceph cluster, but a few things will improve the performance: