Solution inline
We encountered a strange issue and are basically out of ideas by now:
We set up a Galera cluster (3 nodes + MaxScale LB) for a customer and he reported slowness. We were unable to identify the issue, so we set up a test scenario to dig deeper:
- We cloned the complete cluster + the application server into a separate subnet to prevent any interference by/to current users
- We managed to reproduce the slowness: the operation took ~10s
- In order to reduce variables, we installed the application on one of the cluster nodes so we could run tests over a DB connection to localhost
After extensive testing, tweaking and researching, we decided to give the same setup a try on VMware ESX. So we migrated the cluster + application to ESX and ran the exact same tests - with weird results...
From there we ran the following tests:
Test | Result Hyper-V | Result ESX
---|---|---
App -> Load Balancer | 10s | 6s
App -> Direct DB (localhost) | 6.5s | 3.6s
App -> Direct DB (other node) | 9s | 5s
App -> localhost; no cluster | 1.5s | 1.3s
App (Hyper-V) -> LB (ESX) | 13s | -
What we tried without any real change in results:
- move all cluster nodes onto the same hardware
- switch MaxScale between round robin and read-write split
- apply various MariaDB/Galera settings
- apply various Hyper-V settings
Our setup:
- Hyper-V on Windows Server 2019
- MariaDB on Ubuntu 20.04
- All-flash storage
- 16 Gbit Fibre Channel
- Intel network card
- Load on the host (and on the VM itself) was negligible
We are completely stumped because we cannot explain why there is such a huge difference in timings between Hyper-V and ESX. We suspect it must be network I/O, but cannot figure out which setting is at fault.
From the numbers/tests we can conclude which parts are not at fault:
- HD/IO: the performance drops drastically each time we add a "network" node
- CPU: the numbers are reproducible, and we ran our tests on a VM without any other load
- Slow DB queries: the numbers change depending on whether we connect directly to one of the cluster nodes or via localhost, so this can be excluded
Can anyone give us pointers on what else we could try or how to speed up Hyper-V? Or are we messing up some Galera/MaxScale settings?
Edit: We checked for bad segments and found (`netstat -s | grep segments`):
Segments | Hyper-V | ESX
---|---|---
Received | 2448010940 | 2551382424
Sent | 5502198473 | 2576919172
Retransmitted | 9054212 | 7070
Bad segments | 83 | 0
% Retransmitted | 0.16% | 0.00027%
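For anyone wanting to reproduce this check, here is a minimal shell sketch that computes the retransmission rate from the `netstat -s` counters (the exact counter labels vary slightly between net-tools versions, so the patterns are kept loose):

```bash
# Pull the TCP segment counters and compute the retransmission percentage.
# Label patterns are intentionally loose because net-tools wording differs
# between versions ("segments send out" vs "segments sent out", etc.).
netstat -s | awk '
  /segments (send|sent) out/ { sent = $1 }
  /segments retransmit/      { retrans = $1 }
  END {
    if (sent > 0)
      printf "sent=%d retransmitted=%d rate=%.5f%%\n", sent, retrans, 100 * retrans / sent
  }'
```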
Solution
Thanks to input from Mircea, we finally got the numbers way down on Hyper-V.
The following configuration changes helped:
- release (remove) the default Windows bond
- activate a SET (Switch Embedded Teaming) team instead
- on the SET team, activate RDMA and jumbo frames
With this, the numbers on Hyper-V are basically equivalent to ESX (see the PowerShell sketch below for the kind of commands involved).
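A rough PowerShell sketch of those changes follows. Team, switch and adapter names ("Team1", "SETswitch", "NIC1"/"NIC2") are placeholders for our environment, and jumbo frames also need to be enabled end to end on the physical switches:

```powershell
# Remove the existing LBFO bond/team (name is a placeholder)
Remove-NetLbfoTeam -Name "Team1" -Confirm:$false

# Create a vSwitch with Switch Embedded Teaming (SET) over the physical NICs
New-VMSwitch -Name "SETswitch" -NetAdapterName "NIC1","NIC2" `
    -EnableEmbeddedTeaming $true -AllowManagementOS $true

# Enable jumbo frames on the physical NICs and on the host vNIC
Set-NetAdapterAdvancedProperty -Name "NIC1","NIC2" -RegistryKeyword "*JumboPacket" -RegistryValue 9014
Set-NetAdapterAdvancedProperty -Name "vEthernet (SETswitch)" -RegistryKeyword "*JumboPacket" -RegistryValue 9014

# Enable RDMA on the host vNIC backed by the SET team
Enable-NetAdapterRdma -Name "vEthernet (SETswitch)"
```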
For VMs, make sure you install the paravirtualization drivers (Hyper-V guest integration services and VMware Tools). Run synthetic benchmarks for networking. Monitor all equipment on the path (switches, routers, hypervisors, VMs) for CPU, network counters, interrupts, context switches...
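As a concrete example of a synthetic network benchmark between two VMs, something like iperf3 can be used (the hostname is a placeholder; iperf3 has to be installed on both ends):

```bash
# On the receiving VM (e.g. a Galera node): start an iperf3 server
iperf3 -s

# On the sending VM (e.g. the app server): multi-stream test, reports
# throughput and TCP retransmits per interval
iperf3 -c db-node1.example.lan -P 4 -t 30

# Same test in the reverse direction to check the return path
iperf3 -c db-node1.example.lan -P 4 -t 30 -R
```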
Capture traffic during application benchmarks. Check the frame size, the TCP window size, dropped packets in TCP streams, and the latency of the SYN, SYN/ACK, ACK TCP handshake, and compare it with application-level latency such as an SQL "ping" query: `SELECT 1 FROM DUAL;`. Monitor CPU, network and disk I/O during the application benchmark.
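A minimal sketch of such a capture plus an application-level "ping", assuming MariaDB/Galera on their default ports and a mysql client on the app server (interface name, hostname and credentials are placeholders):

```bash
# Capture SQL and Galera replication traffic while the benchmark runs
sudo tcpdump -i eth0 -w galera-bench.pcap 'port 3306 or port 4567'

# In another shell: time a trivial query repeatedly to measure application-level latency
for i in $(seq 1 20); do
  ( time mysql -h db-node1.example.lan -u bench -p'secret' -e 'SELECT 1 FROM DUAL;' ) 2>&1 | grep real
done

# Afterwards, count the retransmissions seen in the capture
tshark -r galera-bench.pcap -Y 'tcp.analysis.retransmission' | wc -l
```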
Run benchmarks inside the VMs and also on bare metal.
Some other literature: The USE Method (Utilization, Saturation and Errors) and The TSA Method (Thread State Analysis).
Monitoring tools can affect performance. Check their usage (CPU, network and disk I/O). Load-testing utilities use resources too. Make sure that the workstation doing the load testing is not saturated.
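For checking that the load-generating machine itself is not the bottleneck, a few standard tools are enough (sar and mpstat come from the sysstat package):

```bash
# CPU usage, run queue and context switches, one sample per second
vmstat 1

# Per-interface network throughput, to spot a saturated NIC
sar -n DEV 1

# Per-core CPU usage, to spot a single saturated core (e.g. interrupt handling)
mpstat -P ALL 1
```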