Solution inline (see the end of this post).
We encountered a strange issue and are basically out of ideas by now:
We set up a Galera cluster (3 nodes + MaxScale LB) for a customer, who reported slowness. We were unable to identify the issue, so we set up a test scenario to dig deeper:
- We cloned the complete cluster + the application server into a separate subnet to prevent any interference by/to current users
- We managed to reproduce the slowness: the operation took ~10s
- To reduce variables, we installed the application on one of the cluster nodes, which let us run tests over a DB connection to localhost
After extensive testing, tweaking and research we decided to give the same setup a try on VMware ESX. So we migrated the cluster + application to ESX and ran the exact same tests - with weird results...
From there we ran the following tests:
| Test | Result Hyper-V | Result ESX |
|---|---|---|
| App -> Load Balancer | 10s | 6s |
| App -> Direct DB (localhost) | 6.5s | 3.6s |
| App -> Direct DB (other node) | 9s | 5s |
| App -> localhost; no cluster | 1.5s | 1.3s |
| App (Hyper-V) -> LB (ESX) | 13s | - |
What we tried, without any real change in results:
- move all cluster nodes onto the same hardware
- switch MaxScale between round robin and read-write-split (see the config sketch after this list)
- apply various MariaDB/Galera settings
- apply various settings in Hyper-V
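For context, a minimal sketch of the two MaxScale router variants we switched between; the node names, user, password and port below are placeholders, not our actual configuration:

```ini
# maxscale.cnf (sketch) - the [node1]..[node3] server sections are omitted
[Galera-Monitor]
type=monitor
module=galeramon
servers=node1,node2,node3
user=maxscale_monitor
password=secret

# Variant A: read-write split routing
[RW-Split-Service]
type=service
router=readwritesplit
servers=node1,node2,node3
user=maxscale
password=secret

# Variant B: round robin across all running nodes
[RR-Service]
type=service
router=readconnroute
router_options=running
servers=node1,node2,node3
user=maxscale
password=secret

[RW-Split-Listener]
type=listener
service=RW-Split-Service
protocol=MariaDBClient
port=4006
```

Switching between the two variants only changed which service the listener pointed at; the timing difference between Hyper-V and ESX stayed the same.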
The setup:
- Hyper-V on Windows Server 2019
- MariaDB on Ubuntu 20.04
- All-flash storage
- 16 GBit Fibre Channel
- Intel network card
- Load on the host (and on the VM, actually) was negligible
We are completely stumped because we cannot explain why there is such a huge difference in timings between Hyper-V and ESX. We figure it must be network IO, but cannot figure out which setting is at fault.
From the numbers/tests we can conclude which parts are not at fault:
- HD/IO: performance drops drastically each time we add a "network" node, so storage is not the limiting factor
- CPU: the numbers are reproducible, and we ran our tests on a VM without any other load
- Slow DB queries: the numbers change depending on whether we connect directly to one of the cluster nodes or via localhost, so the queries themselves can be excluded
Can anyone give us pointers on what else we can try, or how to speed up Hyper-V? Or are we messing up some Galera/MaxScale settings?
Edit: We checked for bad segments with `netstat -s | grep segments` (the percentages are derived as shown in the sketch after the table):
| Segments | Hyper-V | ESX |
|---|---|---|
| Received | 2448010940 | 2551382424 |
| Sent | 5502198473 | 2576919172 |
| Retransmitted | 9054212 | 7070 |
| Bad segments | 83 | 0 |
| % retransmitted | 0.16% | 0.00027% |
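For clarity, the retransmission percentage above is simply retransmitted/sent; a quick sketch with the numbers from the table:

```sh
# Counters taken from:
netstat -s | grep -i segments

# Hyper-V node: retransmitted / sent * 100
awk 'BEGIN { printf "%.2f%%\n", 9054212 / 5502198473 * 100 }'   # -> 0.16%
# ESX node:
awk 'BEGIN { printf "%.5f%%\n", 7070 / 2576919172 * 100 }'      # -> 0.00027%
```

The retransmission rate on Hyper-V is roughly 600 times higher than on ESX, which is why we suspect network IO.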
Solution
Thanks to input from Mircea we finally got the numbers way down on Hyper-V.
The following configuration changes helped:
- remove the default Windows bond
- activate a SET team (Switch Embedded Teaming) instead
- on the SET team, activate RDMA and jumbo frames (see the sketch below)
With this, the numbers on Hyper-V are basically equivalent to ESX.
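For anyone hitting the same issue, a hedged PowerShell sketch of that kind of change; the team/switch/adapter names and the jumbo frame value (9014) are placeholders and assumptions, not our exact commands:

```powershell
# Remove the existing Windows (LBFO) team - team name is a placeholder
Remove-NetLbfoTeam -Name "Team1"

# Create a vSwitch with Switch Embedded Teaming (SET) over the physical NICs
New-VMSwitch -Name "SETswitch" -NetAdapterName "NIC1","NIC2" `
    -EnableEmbeddedTeaming $true -AllowManagementOS $true

# Enable RDMA on the host vNIC that -AllowManagementOS creates
Enable-NetAdapterRdma -Name "vEthernet (SETswitch)"

# Enable jumbo frames on the physical team members (value depends on the NIC driver)
Set-NetAdapterAdvancedProperty -Name "NIC1" -RegistryKeyword "*JumboPacket" -RegistryValue 9014
Set-NetAdapterAdvancedProperty -Name "NIC2" -RegistryKeyword "*JumboPacket" -RegistryValue 9014

# Verify
Get-NetAdapterRdma
Get-VMSwitchTeam -Name "SETswitch"
```

Keep in mind that jumbo frames only help if the MTU matches on the physical switch ports and inside the guest VMs as well.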