Solution inline (see the end of this post).
We encountered a strange issue and are basically out of ideas by now:
We set up a Galera cluster (3 nodes + MaxScale LB) for a customer, who reported slowness. We were unable to identify the issue, so we set up a test scenario to dig deeper:
- We cloned the complete cluster + the application server into a separate subnet to prevent any interference by/to current users
- We managed to reproduce the slowness: the operation took ~10s
- To reduce variables, we installed the application on one of the cluster nodes, which let us run tests over a DB connection to localhost
After extensive testing, tweaking and research we decided to give the same setup a try on VMware ESX. So we migrated the cluster + application to ESX and ran the exact same tests - with weird results...
From there we ran the following tests:
| Test | Result Hyper-V | Result ESX |
|---|---|---|
| App -> Load Balancer | 10s | 6s |
| App -> Direct DB (localhost) | 6.5s | 3.6s |
| App -> Direct DB (other node) | 9s | 5s |
| App -> localhost; no cluster | 1.5s | 1.3s |
| App (Hyper-V) -> LB (ESX) | 13s | - |
What we tried, without any real change in results:
- move all cluster nodes onto the same hardware
- switch MaxScale between round robin and read-write-split (see the config sketch after this list)
- apply various MariaDB/Galera settings
- apply various settings in Hyper-V
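For context, a minimal sketch of the two MaxScale router variants we switched between; the node names, user, password and port below are placeholders, not our actual configuration:

```ini
# maxscale.cnf (sketch) - the [node1]..[node3] server sections are omitted
[Galera-Monitor]
type=monitor
module=galeramon
servers=node1,node2,node3
user=maxscale_monitor
password=secret

# Variant A: read-write split routing
[RW-Split-Service]
type=service
router=readwritesplit
servers=node1,node2,node3
user=maxscale
password=secret

# Variant B: round robin across all running nodes
[RR-Service]
type=service
router=readconnroute
router_options=running
servers=node1,node2,node3
user=maxscale
password=secret

[RW-Split-Listener]
type=listener
service=RW-Split-Service
protocol=MariaDBClient
port=4006
```

Switching between the two variants only changed which service the listener pointed at; the timing difference between Hyper-V and ESX stayed the same.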
The setup:
- Hyper-V on Windows Server 2019
- MariaDB on Ubuntu 20.04
- All-flash storage
- 16 GBit Fibre Channel
- Intel network card
- Load on the host (and on the VM, actually) was negligible
We are completely stumped because we cannot explain why there is such a huge difference in timings between Hyper-V and ESX. We figure it must be network IO, but cannot figure out which setting is at fault.
From the numbers/tests we can conclude which parts are not at fault:
- HD/IO: performance drops drastically each time we add a "network" node, so storage is not the limiting factor
- CPU: the numbers are reproducible, and we ran our tests on a VM without any other load
- Slow DB queries: the numbers change depending on whether we connect directly to one of the cluster nodes or via localhost, so the queries themselves can be excluded
Can anyone give us pointers on what else we can try, or how to speed up Hyper-V? Or are we messing up some Galera/MaxScale settings?
Edit: We checked for bad segments with `netstat -s | grep segments` (the percentages are derived as shown in the sketch after the table):
| Segments | Hyper-V | ESX |
|---|---|---|
| Received | 2448010940 | 2551382424 |
| Sent | 5502198473 | 2576919172 |
| Retransmitted | 9054212 | 7070 |
| Bad segments | 83 | 0 |
| % retransmitted | 0.16% | 0.00027% |
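For clarity, the retransmission percentage above is simply retransmitted/sent; a quick sketch with the numbers from the table:

```sh
# Counters taken from:
netstat -s | grep -i segments

# Hyper-V node: retransmitted / sent * 100
awk 'BEGIN { printf "%.2f%%\n", 9054212 / 5502198473 * 100 }'   # -> 0.16%
# ESX node:
awk 'BEGIN { printf "%.5f%%\n", 7070 / 2576919172 * 100 }'      # -> 0.00027%
```

The retransmission rate on Hyper-V is roughly 600 times higher than on ESX, which is why we suspect network IO.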
Solution
Thanks to input from Mircea we finally got the numbers way down on Hyper-V.
The following configuration changes helped:
- remove the default Windows bond
- activate a SET team (Switch Embedded Teaming) instead
- on the SET team, activate RDMA and jumbo frames (see the sketch below)
With this, the numbers on Hyper-V are basically equivalent to ESX.
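For anyone hitting the same issue, a hedged PowerShell sketch of that kind of change; the team/switch/adapter names and the jumbo frame value (9014) are placeholders and assumptions, not our exact commands:

```powershell
# Remove the existing Windows (LBFO) team - team name is a placeholder
Remove-NetLbfoTeam -Name "Team1"

# Create a vSwitch with Switch Embedded Teaming (SET) over the physical NICs
New-VMSwitch -Name "SETswitch" -NetAdapterName "NIC1","NIC2" `
    -EnableEmbeddedTeaming $true -AllowManagementOS $true

# Enable RDMA on the host vNIC that -AllowManagementOS creates
Enable-NetAdapterRdma -Name "vEthernet (SETswitch)"

# Enable jumbo frames on the physical team members (value depends on the NIC driver)
Set-NetAdapterAdvancedProperty -Name "NIC1" -RegistryKeyword "*JumboPacket" -RegistryValue 9014
Set-NetAdapterAdvancedProperty -Name "NIC2" -RegistryKeyword "*JumboPacket" -RegistryValue 9014

# Verify
Get-NetAdapterRdma
Get-VMSwitchTeam -Name "SETswitch"
```

Keep in mind that jumbo frames only help if the MTU matches on the physical switch ports and inside the guest VMs as well.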