I have a Dell server running VMware ESX with 12 TB of local SSD storage, 1 TB of memory, a Xeon Gold processor, and a single Debian VM.
On that VM, when I perform simultaneous disk writes, or even just run the following command:
dd if=/dev/urandom of=/local/ssd/drive/path/largefile bs=1M count=1024
I get a critical disk latency alert in vSphere for that VM.
The dd command finished successfully after 10 minutes.
Why does vSphere trigger a critical alert for something that is not critical?
How is it possible to overwhelm high-end SSD drives with a single dd command?
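For reference, here is a variant of the command that takes the random-number generator and the guest page cache out of the picture, so only the datastore write latency is exercised (same path as above; oflag=direct and conv=fsync assume GNU dd):
dd if=/dev/zero of=/local/ssd/drive/path/largefile bs=1M count=1024 oflag=direct conv=fsync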
EDIT:
The critical alert is triggered if the latency exceeds 75 ms over a period of five minutes.
In practice, the disk latency for that VM seems to be around 200-250 ms.
EDIT 2:
- Provisioning: thick, lazy zeroed (unfortunately not eager zeroed)
EDIT 3:
I tried to define an IOPS limit on that disk, at the VM level (as you can see on the graph below).
I tried 1,000 IOPS, then 800, 600, 400, 200, and 100. The critical disk latency alert is triggered even at 100 IOPS.
What is strange (as you can see on the graph) is that decreasing the limit (from 1,000 IOPS to 100 IOPS) tends to increase the disk latency reported by vSphere. With the 100 IOPS limit, the latency reaches 16,000 ms.
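If the problem is queueing, this would at least be consistent with Little's Law: the guest keeps roughly the same number of writes outstanding, so a lower completion rate simply makes each write wait longer. A rough, purely illustrative calculation (the outstanding count is assumed, not measured):
average latency ≈ outstanding I/Os / completed IOPS
32 outstanding requests at 1,000 IOPS ≈ 0.032 s (32 ms)
32 outstanding requests at 100 IOPS ≈ 0.32 s (320 ms)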
EDIT 4:
On the software side, I tried reducing the maximum number of simultaneous file writes from 24 to 4. The latency went from 200 ms to 100 ms, but the write bandwidth dropped from 100 MB/s to 50 MB/s.
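One way to double-check these numbers from inside the guest is to watch the per-device wait times during the writes (iostat is part of the sysstat package; the await columns, in milliseconds, are the interesting ones for the data disk):
iostat -x 1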
EDIT 5:
Switching from thick lazy-zeroed provisioning to thick eager-zeroed has not changed anything regarding the latency, which stays at around 200 ms.
Notice that it's a latency alert. It means the system issues a whole lot of I/O commands, which get queued.
The latency grows because any newly issued command is appended to the end of the queue and has to wait for its turn, which comes only after all the commands ahead of it have been executed, so the new command waits much longer than usual. This waiting time is what we call the latency of the system. Imagine a large store or supermarket where customers queue at the checkout: when there are many customers, all the cashiers are busy, so the latency (the time any customer spends in the queue) grows.
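Inside the guest you can get a rough idea of how large this backlog gets during the dd run; for example (the device name sda is only illustrative):
grep -E 'Dirty|Writeback' /proc/meminfo
cat /sys/block/sda/queue/nr_requests
The first shows how much written data is still buffered in RAM waiting to be flushed; the second shows how many requests the block layer is allowed to keep queued for that device.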
This may or may not matter; it depends on what the system does. For an online RDBMS handling a high load in real time, it is unacceptable: all processing slows down considerably, reducing the performance of the database to a crawl, because databases are very sensitive to storage latency. For an interactive system, like a desktop, it is of mild importance; starting a new program or loading a new document will take noticeably longer, but other load patterns are typically not affected. For a file server it can be safely tolerated, because the network and other protocol delays are much longer, so the additional delay incurred by the overloaded storage is not as noticeable.
VMware doesn't know what kind of workload it runs, so it stays on the safe side and alerts you that the storage is overloaded. Whether to take action in response is your decision. It also lets you set resource limits on the VM so that it cannot overload the storage that much.
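If you decide the alert is not actionable for your workload but you still want to keep the datastore from being flooded, you can also throttle the writer inside the guest instead of (or in addition to) a vSphere IOPS limit. A minimal sketch, assuming pv is installed and reusing the path from the question, limits the write rate to about 50 MB/s:
dd if=/dev/urandom bs=1M count=1024 | pv -L 50m > /local/ssd/drive/path/largefile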