I have a Dell server running VMware ESX, with 12 TB of local SSD storage, 1 TB of memory, a Xeon Gold processor, and a single Debian VM.
On that VM, when I perform simultaneous disk writes, or even just run the following command:
dd if=/dev/urandom of=/local/ssd/drive/path/largefile bs=1M count=1024
I get a critical disk latency alert in vSphere for that VM.
The dd command finished successfully after 10 minutes.
Why does vSphere trigger a critical alert for something that does not seem critical?
How is it possible to overwhelm high-end SSD drives with a single dd command?
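For reference, here is a variant of the test I could run to rule out the guest page cache and the CPU cost of reading /dev/urandom (just a sketch, not run yet; same placeholder path as above):

# Bypass the guest page cache so the measured time reflects device latency,
# and read zeros so the random generator is not the bottleneck
dd if=/dev/zero of=/local/ssd/drive/path/largefile bs=1M count=1024 oflag=direct

# Alternative: keep the cache, but include the final flush in the timing
dd if=/dev/zero of=/local/ssd/drive/path/largefile bs=1M count=1024 conv=fdatasync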
EDIT 1:
The critical alert is triggered if the latency exceeds 75 ms over a five-minute period.
In practice, the disk latency seems to be around 200-250 ms for that VM.
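To cross-check the value vSphere reports from inside the guest, I could watch the per-device write latency with iostat (a sketch; it assumes the sysstat package is installed and that the SSD-backed virtual disk shows up as sdb, which may differ):

# Extended device stats every 5 seconds; the w_await column is the
# average write latency in ms as seen by the guest
iostat -x sdb 5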
EDIT 2:
- Provisioning: thick, lazy zeroed (unfortunately not eager zeroed)
EDIT 3:
I tried setting an IOPS limit on that disk at the VM level (as you can see in the graph below).
I tried 1000 IOPS, then 800, 600, 400, 200, and 100. The critical disk latency alert is triggered even at 100 IOPS.
What is strange (as you can see in the graph) is that decreasing the limit (from 1000 IOPS to 100 IOPS) tends to increase the disk latency reported by vSphere. With a 100 IOPS limit, the latency is 16,000 ms.
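To see whether the same queuing effect shows up inside the guest, I could throttle a synthetic workload with fio (a sketch; it assumes fio is available, and the path, sizes and rate are only examples):

# 4 writers capped at 25 IOPS each (~100 total); fio's completion latency (clat)
# should grow as the cap forces requests to wait in the queue
fio --name=throttled --directory=/local/ssd/drive/path \
    --rw=write --bs=1M --size=256M --direct=1 \
    --ioengine=libaio --iodepth=4 --numjobs=4 \
    --rate_iops=25 --group_reporting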
EDIT 4:
On the software side, I tried reducing the maximum number of simultaneous file writes from 24 to 4. The latency went from 200 ms to 100 ms, but the write bandwidth dropped from 100 MB/s to 50 MB/s.
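The same trade-off should be reproducible with a synthetic test (again a sketch assuming fio; --numjobs mirrors the 24 vs 4 writers in the application):

# 24 parallel 1 MiB sequential writers, similar to the application's behaviour;
# rerun with --numjobs=4 to compare completion latency against total bandwidth
fio --name=parallel-writers --directory=/local/ssd/drive/path \
    --rw=write --bs=1M --size=256M --direct=1 \
    --ioengine=libaio --iodepth=1 --numjobs=24 --group_reporting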
EDIT 5:
Switching from thick lazy zeroed provisioning to thick eager zeroed has not changed anything regarding the latency, which is still around 200 ms.